5.4. Data Backfilling Strategies

Practical approaches for backfilling historical data in dbt incremental models, including migration strategies and event_date dependencies.

Introduction

Data backfilling is the process of populating historical data in your data warehouse, either during initial migrations or when adding new models to existing pipelines. Unlike regular incremental runs that process recent data, backfills process large historical date ranges efficiently.

This article covers three common backfilling scenarios:

  1. dbt Incremental Backfills: Using is_incremental() logic for date-based processing

  2. Event Date Dependencies: Managing pipelines where all models depend on a single event_date variable

  3. Data Migration Backfills: Moving historical data from legacy systems during platform migrations

Understanding these patterns enables reliable historical data processing without overwhelming compute resources or creating data quality issues.

Scenario 1: dbt Incremental Backfills

When backfilling dbt incremental models, leverage the is_incremental() macro to process historical dates efficiently.

The Pattern

Incremental models use is_incremental() to determine whether to process all data (full refresh) or just new data:

{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=["event_date = '{{ var('event_date') }}'"],
        partition_by=['event_date']
    )
}}

SELECT
    event_date,
    user_id,
    revenue
FROM {{ ref('raw_transactions') }}
{% if is_incremental() %}
    WHERE event_date = '{{ var('event_date') }}'
{% else %}
    -- Full refresh: process all historical data
    WHERE event_date >= '2020-01-01'
{% endif %}

Backfill Execution

For historical date ranges, run dbt with specific dates:

# Backfill single date
dbt run --select my_model --vars '{"event_date": "2025-01-15"}'

# Backfill date range (requires loop script)
current_date="2025-01-01"
end_date="2025-01-31"
while [ "$current_date" != "$(date -I -d "$end_date + 1 day")" ]; do
    dbt run --select my_model --vars "{\"event_date\": \"$current_date\"}"
    current_date=$(date -I -d "$current_date + 1 day")
done

Key Benefits

  • is_incremental() returns True during backfills (the target table already exists and --full-refresh is not passed), so only the date-specific branch runs

  • replace_where strategy only replaces data for the specific partition (see the sketch after this list)

  • Idempotent: can safely rerun any date without affecting others

  • Resource-efficient: processes one date at a time instead of full table scans
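
On Databricks, replace_where compiles to a partition-scoped overwrite. As a rough conceptual sketch (not the exact SQL dbt generates), it behaves like Delta Lake's INSERT ... REPLACE WHERE, using the table and columns from the model above:

-- Conceptual sketch only: rewrite a single date partition atomically
INSERT INTO my_model
REPLACE WHERE event_date = '2025-01-15'
SELECT
    event_date,
    user_id,
    revenue
FROM raw_transactions
WHERE event_date = '2025-01-15'

Rows outside the predicate are left untouched, which is what makes per-date reruns idempotent.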

Best Practices

  1. Partition Predicates: Always include incremental_predicates matching your WHERE clause

  2. Start Date Logic: Set a reasonable historical start date in the else block

  3. Testing: Validate backfill logic with --full-refresh before running production backfills (example commands after this list)

  4. Monitoring: Track progress and failures when backfilling large date ranges
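
For the testing step (item 3 above), one approach is to exercise both branches against a non-production target before touching production; the dev target name here is an assumption:

# Hypothetical validation against a dev target before a production backfill
dbt run --select my_model --full-refresh --target dev
dbt run --select my_model --target dev --vars '{"event_date": "2025-01-15"}'
dbt test --select my_model --target dev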

Scenario 2: Event Date Dependencies

A common pattern in well-architected daily pipelines is to drive every model from a single event_date variable. This enables atomic processing: an entire day's data moves through the pipeline together.

The Architecture

# dbt_project.yml
vars:
  event_date: "{{ (modules.datetime.datetime.now() - modules.datetime.timedelta(days=1)).strftime('%Y-%m-%d') }}"

All models reference this variable:

-- models/core/daily_metrics.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=["event_date = '{{ var('event_date') }}'"],
        partition_by=['event_date']
    )
}}

SELECT
    '{{ var("event_date") }}' AS event_date,
    user_id,
    COUNT(*) as event_count
FROM {{ ref('raw_events') }}
WHERE event_date = '{{ var("event_date") }}'
GROUP BY user_id

Airflow Integration

DAGs pass the execution date to dbt:

from datetime import datetime
import subprocess

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=True)
def daily_processing():

    @task
    def run_dbt_models(ds=None):
        # ds is Airflow's logical date (YYYY-MM-DD)
        subprocess.run(
            ["dbt", "run", "--vars", f'{{"event_date": "{ds}"}}'],
            check=True,
        )

    run_dbt_models()

daily_processing()

Backfill Execution

Airflow’s backfill command handles date iteration automatically:

# Backfill January 2025
airflow dags backfill daily_processing \
    --start-date 2025-01-01 \
    --end-date 2025-01-31

Each DAG run processes one date atomically. If a date fails, it can be rerun independently without affecting other dates.
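
To rerun a single failed date (2025-01-17 below is a placeholder), backfill just that day; the --reset-dagruns flag clears the earlier failed run first:

# Rerun one date independently of the rest of the range
airflow dags backfill daily_processing \
    --start-date 2025-01-17 \
    --end-date 2025-01-17 \
    --reset-dagruns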

Advantages

  1. Atomic Processing: Entire day succeeds or fails together

  2. Clear Dependencies: All models process the same logical date

  3. Easy Monitoring: Track progress by date, not by model

  4. Simplified Backfills: Airflow handles date iteration

  5. Data Consistency: All models for a date are synchronized

Handling Late-Arriving Data

Sometimes data arrives late. Handle this with lookback windows:

SELECT
    event_date,
    user_id,
    event_type
FROM {{ ref('raw_events') }}
WHERE event_date BETWEEN
    DATE('{{ var("event_date") }}') - INTERVAL 3 DAYS
    AND DATE('{{ var("event_date") }}')

This reprocesses the last 3 days on each run, catching late arrivals while maintaining the event_date partition structure.
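
If the model keeps the replace_where strategy, widen incremental_predicates to the same window; otherwise three days of rows are selected but only one partition is replaced. A config sketch under that assumption (date_sub is Databricks/Spark SQL):

{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=[
            "event_date BETWEEN date_sub('{{ var('event_date') }}', 3) AND '{{ var('event_date') }}'"
        ],
        partition_by=['event_date']
    )
}}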

Scenario 3: Data Migration Backfills

When migrating from a legacy system, you need to backfill historical data while the new system is being built.

Migration Strategy

Phase 1: Dual-Write Setup

While the application dual-writes to both systems, stage new-system data from the cutover date forward:

-- models/staging/stg_orders.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='append'
    )
}}

{% set cutover_date = '2025-06-01' %}

SELECT
    order_id,
    customer_id,
    order_date,
    'new_system' as source_system
FROM {{ source('new_system', 'orders') }}
WHERE order_date >= '{{ cutover_date }}'

{% if is_incremental() %}
    AND order_date >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
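
Both this model and the historical model in Phase 2 assume the two systems are declared as dbt sources. A minimal sources.yml sketch (schema and table names are placeholders):

# models/staging/sources.yml -- illustrative; schemas and tables assumed
version: 2

sources:
  - name: new_system
    schema: new_system_raw
    tables:
      - name: orders
  - name: legacy_system
    schema: legacy_raw
    tables:
      - name: orders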

Phase 2: Historical Backfill

Backfill data from the legacy system:

-- models/staging/stg_orders_historical.sql
{{
    config(
        materialized='table',
        tags=['backfill', 'one_time']
    )
}}

{% set cutover_date = '2025-06-01' %}

SELECT
    order_id,
    customer_id,
    order_date,
    'legacy_system' as source_system
FROM {{ source('legacy_system', 'orders') }}
WHERE order_date < '{{ cutover_date }}'

Phase 3: Merge Historical and Current

Union historical and current data:

-- models/core/orders.sql
{{
    config(materialized='view')
}}

SELECT * FROM {{ ref('stg_orders_historical') }}

UNION ALL

SELECT * FROM {{ ref('stg_orders') }}
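
Because both staging models contribute orders around the cutover, it is worth testing the merged view for duplicate keys. A minimal schema.yml sketch using dbt's built-in tests (file name assumed):

# models/core/orders.yml -- illustrative tests for the merged view
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: source_system
        tests:
          - accepted_values:
              values: ['legacy_system', 'new_system']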

Migration Execution Plan

  1. T-minus 30 days: Set up dual-write, test data quality

  2. T-minus 7 days: Run historical backfill in staging environment

  3. Cutover date: Switch analytics to new system

  4. T-plus 30 days: Validate data consistency, decommission old system

Data Validation During Migration

Compare record counts and metrics between systems:

-- models/validation/migration_reconciliation.sql
WITH legacy_counts AS (
    SELECT
        DATE(order_date) as date,
        COUNT(*) as order_count,
        SUM(order_amount) as total_amount
    FROM {{ source('legacy_system', 'orders') }}
    WHERE order_date BETWEEN '2024-01-01' AND '2025-05-31'
    GROUP BY DATE(order_date)
),
new_counts AS (
    SELECT
        DATE(order_date) as date,
        COUNT(*) as order_count,
        SUM(order_amount) as total_amount
    FROM {{ ref('stg_orders_historical') }}
    GROUP BY DATE(order_date)
)
SELECT
    l.date,
    l.order_count as legacy_count,
    n.order_count as new_count,
    l.order_count - COALESCE(n.order_count, 0) as count_diff,
    l.total_amount as legacy_amount,
    n.total_amount as new_amount,
    l.total_amount - COALESCE(n.total_amount, 0) as amount_diff
FROM legacy_counts l
LEFT JOIN new_counts n ON l.date = n.date
WHERE ABS(l.order_count - COALESCE(n.order_count, 0)) > 0

Run this validation daily during the migration window.
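
To surface mismatches through dbt's test framework instead of ad-hoc queries, a singular test can simply select from the reconciliation model; since that model only returns mismatched dates, any row fails the test (file name assumed):

-- tests/assert_migration_reconciled.sql
-- Fails whenever the reconciliation model returns a mismatched date
SELECT *
FROM {{ ref('migration_reconciliation') }}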

Best Practices for Backfills

Resource Management

Control compute resources during large backfills:

# profiles.yml
my_project:
  target: prod
  outputs:
    prod:          # host, http_path, token, etc. omitted for brevity
      type: databricks
      threads: 8
    backfill:      # same connection details, reduced concurrency
      type: databricks
      threads: 2   # Fewer threads for backfills

Run backfills with the backfill target:

dbt run --target backfill --vars '{"event_date": "2025-01-15"}'

Incremental Backfill Pattern

Process large date ranges one day at a time with a driver script:

#!/bin/bash
# backfill_script.sh

START_DATE="2023-01-01"
END_DATE="2025-12-31"

current_date="$START_DATE"

while [ "$current_date" != "$END_DATE" ]; do
    echo "Processing $current_date"

    dbt run \
        --select tag:backfill \
        --target backfill \
        --vars "{\"event_date\": \"$current_date\"}"

    if [ $? -eq 0 ]; then
        echo "✓ Success: $current_date"
    else
        echo "✗ Failed: $current_date"
        exit 1
    fi

    # Move to next date
    current_date=$(date -I -d "$current_date + 1 day")
done

Backfill Monitoring

Track backfill progress in a metadata table:

-- models/admin/backfill_log.sql
{{
    config(
        materialized='incremental',
        unique_key='backfill_id'
    )
}}

SELECT *
FROM (
    SELECT
        '{{ var("event_date") }}' as backfill_date,
        '{{ invocation_id }}' as backfill_id,
        CURRENT_TIMESTAMP() as completed_at,
        '{{ target.name }}' as target_env
) AS new_entry
{% if is_incremental() %}
    WHERE backfill_date NOT IN (SELECT backfill_date FROM {{ this }})
{% endif %}

Query this table to see which dates have been processed.
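
One way to spot gaps is to compare the logged dates against the full expected range, for example from a dbt analysis (date bounds match the script above; sequence/explode assume Databricks/Spark SQL):

-- analyses/backfill_gaps.sql -- hypothetical gap check
WITH expected AS (
    SELECT explode(sequence(DATE'2023-01-01', DATE'2025-12-31', INTERVAL 1 DAY)) AS expected_date
)
SELECT e.expected_date
FROM expected e
LEFT JOIN {{ ref('backfill_log') }} l
    ON e.expected_date = DATE(l.backfill_date)
WHERE l.backfill_date IS NULL
ORDER BY e.expected_date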

Common Pitfalls

Overwriting Future Data

When backfilling, ensure you don't accidentally write partitions for future dates whose source data is still incomplete or missing:

-- Bad: nothing stops a future date from being processed
WHERE event_date = '{{ var("event_date") }}'

-- Good: Only process if date is in the past or today
WHERE event_date = '{{ var("event_date") }}'
{% if var("event_date") > modules.datetime.date.today().isoformat() %}
    AND 1=0  -- Skip future dates
{% endif %}

Memory Issues with Full Refresh

Avoid full refresh for large historical tables:

# Bad: Tries to process years of data at once
dbt run --full-refresh --select huge_table

# Good: Backfill date by date
dbt run --select huge_table --vars '{"event_date": "2025-01-01"}'

Partition Misalignment

Ensure your WHERE clause matches incremental_predicates:

{{
    config(
        incremental_predicates=["event_date = '{{ var('event_date') }}'"]
    )
}}

-- Must match the predicate exactly
WHERE event_date = '{{ var("event_date") }}'
-- Not: WHERE event_date >= '{{ var("event_date") }}'