6.3. dbt with Airflow

Running dbt Core with Airflow, using source freshness and Cosmos DbtDag, so that you avoid redefining DAG dependencies in Airflow and only run when data is fresh.

Introduction

A common pain in data orchestration is redefining DAG dependencies that already exist in your dbt project. Duplicating the dependency graph in Airflow is error-prone and drifts when models change. A related issue is running jobs when there is no fresh data—scheduled runs that do unnecessary work and burn cost.

Ideally you would trigger dbt often (e.g. every 30 minutes) and have the run know which parts of the DAG to execute based on data freshness—so-called data-driven or state-aware orchestration. This article describes how to orchestrate dbt Core with Airflow using source freshness and Astronomer Cosmos DbtDag: you keep a single dependency graph (in dbt) and only run models that have fresh inputs.

The Problem

Manual dependency duplication

When you hand-write Airflow DAGs that mirror dbt’s DAG, you must keep two sources of truth in sync. Adding or renaming a model, or changing refs, forces DAG updates. That’s extra complexity and a steady source of mistakes.

Unnecessary runs

If runs are purely time-based (e.g. “run every hour”), you often execute dbt when upstream data hasn’t changed. That wastes compute and can make it harder to reason about what actually ran for a given data interval.

What we want

  • Rerun dbt on a regular cadence (e.g. every 30 minutes).

  • Only run the parts of the DAG that have fresh data (and, in advanced setups, only models affected by code changes).

  • Not maintain a separate dependency graph in Airflow—reuse dbt’s.

Project structure and manifest

Layer split

Organize the dbt project into clear layers, for example:

  • Source layer (raw on Databricks): managed by Airflow (or your orchestrator).

  • Stage/core layer: owned by data engineers.

  • Reporting layer: owned by analysts.

This supports multi-project setups and clear ownership. See Multi-Project dbt Setups and dbt’s guidance on dbt Mesh.

Manifest and “modified+” runs

Use dbt’s manifest so you can run “modified+” (only changed models and everything downstream of them). With dbt Core you can store the manifest from the previous run on S3 (or similar) and have Airflow pass it to dbt via --state. That addresses code freshness and avoids redundant model runs when only a subset of the project changed.
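As a sketch, a manifest-driven run could look like the following; the state directory and paths are illustrative (not from this article), and in practice the previous run’s manifest.json would be downloaded from S3 into the state directory first:

```python
# Sketch: run only models whose code changed since the last stored manifest.
# Assumes the previous run's manifest.json has already been downloaded
# (e.g. from S3) into STATE_DIR. Paths are illustrative.
import subprocess

STATE_DIR = "/tmp/dbt_state"  # holds the previous run's manifest.json


def build_state_command(state_dir: str) -> list[str]:
    """dbt invocation that runs modified models plus everything downstream."""
    return [
        "dbt", "run",
        "--select", "state:modified+",  # changed models + their descendants
        "--state", state_dir,           # dir containing the previous manifest.json
    ]


def run_modified_plus(state_dir: str = STATE_DIR) -> None:
    # Could be wrapped in an Airflow PythonOperator or BashOperator.
    subprocess.run(build_state_command(state_dir), check=True)
```

This keeps the selection logic in dbt itself (state-based selection), so Airflow only needs to provide the previous manifest.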

dbt Core with Airflow: source freshness and DbtDag

Idea

  • Use source freshness (and optionally manifest-based selection) to determine which parts of the dbt graph have fresh inputs.

  • Run DbtDag from Astronomer Cosmos on the downstream models that depend on those fresh sources, and respect your domain split (e.g. separate DAGs or task groups per domain).

So: “freshness on sources” drives what is eligible to run; Cosmos DbtDag builds the actual Airflow DAG from the dbt project so you don’t maintain dependencies by hand.

Benefits

  • No manual dependency graph in Airflow — Cosmos builds the DAG from your dbt project (dbt ls, manifest, or project parsing).

  • Stay on dbt Core and your current execution environment (e.g. Databricks).

  • Full control of where and how dbt runs.

Tradeoffs

  • Possible duplicate or redundant runs — You may rerun models that didn’t need to run: dbt Core offers freshness checks only for sources, not models (there is no model-level “freshness” or “build_after”). You can mitigate this with manifest-based selection and sensible scheduling.

  • Configuration effort — Configuring Cosmos, source freshness, and domain boundaries takes some setup; analysts may rely on data engineers for this.

How it fits together

  1. Source freshness: Use dbt’s source freshness (e.g. dbt source freshness) to know which sources are up to date. You can run this as an Airflow task and use the result (or a derived artifact) to decide which “entry points” in the dbt DAG have fresh data.

  2. DbtDag: Use Cosmos’s DbtDag to generate the Airflow DAG from your dbt project. You can scope it to a subset of models (e.g. by tag, path, or the set of models downstream of fresh sources) so each run only executes the relevant part of the graph.

  3. Domain split: Structure DAGs or task groups by domain (e.g. finance, marketing) so that each Cosmos DbtDag runs a coherent subset of models and respects ownership and blast radius.
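Step 1 can be sketched as a small helper that parses the artifact dbt source freshness writes (target/sources.json). Treating “warn” as fresh is a policy choice of this sketch, not something dbt prescribes:

```python
# Sketch: decide which sources count as "fresh" from dbt's freshness artifact.
# `dbt source freshness` writes target/sources.json; each entry in `results`
# carries a status such as "pass", "warn", or "error".
import json
from pathlib import Path

FRESH_STATUSES = {"pass", "warn"}  # policy choice: a warning still counts as fresh


def fresh_source_ids(sources_artifact: dict) -> list[str]:
    """Return the unique_ids of sources whose freshness check succeeded."""
    return [
        r["unique_id"]
        for r in sources_artifact.get("results", [])
        if r.get("status") in FRESH_STATUSES
    ]


def load_fresh_sources(path: str = "target/sources.json") -> list[str]:
    """Read the freshness artifact from disk and extract fresh source ids."""
    return fresh_source_ids(json.loads(Path(path).read_text()))
```

An Airflow task could run this after dbt source freshness and publish the resulting list (e.g. via XCom) for downstream selection.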

Example: Cosmos DbtDag (conceptual)

Cosmos can build a DAG from your dbt project. You typically point it at your project path and optionally at a manifest or selection criteria:

Cosmos DbtDag scoped to staging and core models
"""
Cosmos DbtDag example: build an Airflow DAG from a dbt project.
Combine with a preceding task or sensor that checks source freshness
to run only the subset of models with fresh data.
"""
from datetime import datetime

from cosmos import DbtDag, ProfileConfig, ProjectConfig, RenderConfig

dag = DbtDag(
    dag_id="my_dbt_project",
    project_config=ProjectConfig("/path/to/dbt/project"),
    profile_config=ProfileConfig(
        profile_name="my_dbt_project",
        target_name="prod",
        profiles_yml_filepath="/path/to/profiles.yml",
    ),
    # Scope the generated DAG to the staging and core layers.
    render_config=RenderConfig(select=["path:models/staging", "path:models/core"]),
    schedule_interval="@hourly",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)

You would combine this with a preceding task (or sensor) that checks source freshness and passes which models (or entry points) to run, so the DbtDag only runs the subset with fresh data. Exact patterns depend on Cosmos version and how you expose “fresh” models (e.g. XCom, a small manifest, or tagged models).
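One way to sketch that preceding step, without committing to a specific Cosmos version: dbt’s own source_status:fresher+ selector (given --state pointing at the previous run’s artifacts) expands to models downstream of sources that became fresher, and a small gate callable could back e.g. an Airflow ShortCircuitOperator so that runs with nothing fresh are skipped. Function names here are illustrative:

```python
# Sketch: gate the dbt run on freshness and select only models downstream of
# sources that got fresher since the last run. dbt's `source_status:fresher+`
# selector does the downstream expansion itself, given the previous run's
# sources.json via --state.


def freshness_gate(fresh_ids: list[str]) -> bool:
    """Return False to skip downstream tasks when no source is fresh.

    Suitable as the python_callable of an Airflow ShortCircuitOperator.
    """
    return len(fresh_ids) > 0


def build_fresher_command(state_dir: str) -> list[str]:
    """dbt invocation that runs only models fed by fresher sources."""
    return [
        "dbt", "build",
        "--select", "source_status:fresher+",  # fresher sources + descendants
        "--state", state_dir,                  # dir with the previous sources.json
    ]
```

This keeps the “which models are eligible” decision inside dbt’s selection syntax rather than in Airflow code.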

Implementation considerations

  • Where dbt runs: Cosmos can run dbt in-process or in a container/Databricks. Align with your existing choice (e.g. from 6.1. Schedule with Airflow) for consistency and cost.

  • Freshness and “modified+”: Combine source freshness with manifest-based selection when possible: store the manifest (e.g. from dbt compile or dbt run) in S3 and pass it to Cosmos so you can run “modified+” and avoid re-running unchanged models when only code changed.

  • Analyst experience: If analysts own reporting models, document how to add models and tags so they land in the right Cosmos DAG/selection without editing Airflow code. Use tags and folder structure to keep the domain boundaries clear.
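For example, a small helper could group models by tag straight from the parsed manifest.json, so that tagging a model is enough to land it in the right domain selection. The manifest shape shown (nodes with config.tags) matches dbt’s artifact; the tag names themselves are illustrative:

```python
# Sketch: group model unique_ids by tag from a parsed manifest.json, so that
# analyst-added tags map onto domain DAGs/selections without Airflow edits.
from collections import defaultdict


def models_by_tag(manifest: dict) -> dict[str, list[str]]:
    """Map each tag to the model unique_ids that carry it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue  # skip seeds, tests, snapshots, etc.
        for tag in node.get("config", {}).get("tags", []):
            groups[tag].append(unique_id)
    return dict(groups)
```

In practice you may not need this at all: Cosmos’s RenderConfig can select by tag directly (e.g. select=["tag:finance"]), which is usually the simpler route.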

When this approach fits

  • You want to keep dbt Core and your current Airflow (or MWAA) setup.

  • You’re willing to invest in Cosmos, source freshness, and optional manifest-based selection to get data-driven runs.

  • You want to avoid maintaining a separate dependency graph in Airflow and can accept some configuration and possible redundant runs.

Resources