5.4. Data Backfilling Strategies
Practical approaches for backfilling historical data in dbt incremental models, including migration strategies and event_date dependencies.
Introduction
Data backfilling is the process of populating historical data in your data warehouse, either during initial migrations or when adding new models to existing pipelines. Unlike regular incremental runs, which process only recent data, backfills must work through large historical date ranges, and they need to do so efficiently.
This article covers three common backfilling scenarios:
dbt Incremental Backfills: Using is_incremental() logic for date-based processing
Event Date Dependencies: Managing pipelines where all models depend on a single event_date variable
Data Migration Backfills: Moving historical data from legacy systems during platform migrations
Understanding these patterns enables reliable historical data processing without overwhelming compute resources or creating data quality issues.
Scenario 1: dbt Incremental Backfills
When backfilling dbt incremental models, leverage the is_incremental() macro to process historical dates efficiently.
The Pattern
Incremental models use is_incremental() to determine whether to process all data (full refresh) or just new data:
{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=["event_date = '{{ var('event_date') }}'"],
        partition_by=['event_date']
    )
}}

SELECT
    event_date,
    user_id,
    revenue
FROM {{ ref('raw_transactions') }}

{% if is_incremental() %}
    WHERE event_date = '{{ var('event_date') }}'
{% else %}
    -- Full refresh: process all historical data
    WHERE event_date >= '2020-01-01'
{% endif %}
Backfill Execution
For historical date ranges, run dbt with specific dates:
# Backfill single date
dbt run --select my_model --vars '{"event_date": "2025-01-15"}'
# Backfill date range (simple loop; assumes GNU date)
current_date="2025-01-01"
end_date="2025-01-31"
while [ "$(date -d "$current_date" +%s)" -le "$(date -d "$end_date" +%s)" ]; do
    dbt run --select my_model --vars "{\"event_date\": \"$current_date\"}"
    current_date=$(date -I -d "$current_date + 1 day")
done
Key Benefits
is_incremental() returns True during backfills, triggering efficient date-specific processing
replace_where strategy only replaces data for the specific partition
Idempotent: can safely rerun any date without affecting others
Resource-efficient: processes one date at a time instead of full table scans
Best Practices
Partition Predicates: Always include incremental_predicates matching your WHERE clause
Start Date Logic: Set a reasonable historical start date in the else block
Testing: Validate backfill logic with --full-refresh before running production backfills (see the sketch below)
Monitoring: Track progress and failures when backfilling large date ranges
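For example, the Testing practice can be exercised against a non-production environment before touching production data. A minimal sketch, assuming a dev target exists in profiles.yml and using the my_model example from above:
# Validate the full-refresh path and the model's tests in dev first
dbt run --select my_model --full-refresh --target dev
dbt test --select my_model --target dev
# Then run a single-date backfill in dev before scheduling the production range
dbt run --select my_model --target dev --vars '{"event_date": "2025-01-15"}'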
Scenario 2: Event Date Dependencies
In well-architected data pipelines, all models should depend on a single event_date variable. This pattern enables atomic processing where an entire day’s data is processed together.
The Architecture
# dbt_project.yml
vars:
  event_date: "{{ (modules.datetime.datetime.now() - modules.datetime.timedelta(days=1)).strftime('%Y-%m-%d') }}"
All models reference this variable:
-- models/core/daily_metrics.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=["event_date = '{{ var('event_date') }}'"],
        partition_by=['event_date']
    )
}}

SELECT
    '{{ var("event_date") }}' AS event_date,
    user_id,
    COUNT(*) as event_count
FROM {{ ref('raw_events') }}
WHERE event_date = '{{ var("event_date") }}'
GROUP BY user_id
Airflow Integration
DAGs pass the execution date to dbt:
from datetime import datetime
import subprocess

from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=True)
def daily_processing():
    @task
    def run_dbt_models(ds=None):
        # ds is Airflow's logical (execution) date for the run, formatted YYYY-MM-DD
        subprocess.run(["dbt", "run", "--vars", f'{{"event_date": "{ds}"}}'], check=True)

    run_dbt_models()

daily_processing()
Backfill Execution
Airflow’s backfill command handles date iteration automatically:
# Backfill January 2025
airflow dags backfill daily_processing \
--start-date 2025-01-01 \
--end-date 2025-01-31
Each DAG run processes one date atomically. If a date fails, it can be rerun independently without affecting other dates.
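For example, if only 2025-01-15 failed, a one-day backfill re-triggers just that run (the date here is illustrative):
# Rerun a single failed date without touching the rest of the range
airflow dags backfill daily_processing \
    --start-date 2025-01-15 \
    --end-date 2025-01-15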
Advantages
Atomic Processing: Entire day succeeds or fails together
Clear Dependencies: All models process the same logical date
Easy Monitoring: Track progress by date, not by model
Simplified Backfills: Airflow handles date iteration
Data Consistency: All models for a date are synchronized
Handling Late-Arriving Data
Sometimes data arrives late. Handle this with lookback windows:
SELECT
    event_date,
    user_id,
    event_type
FROM {{ ref('raw_events') }}
WHERE event_date BETWEEN DATE('{{ var("event_date") }}') - INTERVAL 3 DAYS
    AND DATE('{{ var("event_date") }}')
This reprocesses the last 3 days on each run, catching late arrivals while maintaining the event_date partition structure.
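Note that with replace_where, the incremental_predicates should cover the same window, so the overlapping days are actually replaced rather than duplicated (see Partition Misalignment below). A sketch of a matching config, using the same Databricks-style interval syntax as the query above:
{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=["event_date BETWEEN DATE('{{ var('event_date') }}') - INTERVAL 3 DAYS AND DATE('{{ var('event_date') }}')"],
        partition_by=['event_date']
    )
}}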
Scenario 3: Data Migration Backfills
When migrating from a legacy system, you need to backfill historical data while the new system is being built.
Migration Strategy
Phase 1: Dual-Write Setup
Start writing to both old and new systems:
-- models/staging/stg_orders.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='append'
    )
}}

{% set cutover_date = '2025-06-01' %}

SELECT
    order_id,
    customer_id,
    order_date,
    'new_system' as source_system
FROM {{ source('new_system', 'orders') }}
WHERE order_date >= '{{ cutover_date }}'

{% if is_incremental() %}
    AND order_date >= (SELECT MAX(order_date) FROM {{ this }})
{% endif %}
Phase 2: Historical Backfill
Backfill data from the legacy system:
-- models/staging/stg_orders_historical.sql
{{
    config(
        materialized='table',
        tags=['backfill', 'one_time']
    )
}}

{% set cutover_date = '2025-06-01' %}

SELECT
    order_id,
    customer_id,
    order_date,
    'legacy_system' as source_system
FROM {{ source('legacy_system', 'orders') }}
WHERE order_date < '{{ cutover_date }}'
Phase 3: Merge Historical and Current
Union historical and current data:
-- models/core/orders.sql
{{
    config(materialized='view')
}}
SELECT * FROM {{ ref('stg_orders_historical') }}
UNION ALL
SELECT * FROM {{ ref('stg_orders') }}
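Because the two sources are separated only by cutover_date, it is worth guarding the merged model against overlap or gaps at the boundary. One option, sketched here with an illustrative file name, is a uniqueness check on order_id in the merged model:
# models/core/orders.yml (file name illustrative)
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null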
Migration Execution Plan
T-minus 30 days: Set up dual-write, test data quality
T-minus 7 days: Run historical backfill in staging environment
Cutover date: Switch analytics to new system
T-plus 30 days: Validate data consistency, decommission old system
Data Validation During Migration
Compare record counts and metrics between systems:
-- models/validation/migration_reconciliation.sql
WITH legacy_counts AS (
    SELECT
        DATE(order_date) as date,
        COUNT(*) as order_count,
        SUM(order_amount) as total_amount
    FROM {{ source('legacy_system', 'orders') }}
    WHERE order_date BETWEEN '2024-01-01' AND '2025-05-31'
    GROUP BY DATE(order_date)
),

new_counts AS (
    SELECT
        DATE(order_date) as date,
        COUNT(*) as order_count,
        SUM(order_amount) as total_amount
    FROM {{ ref('stg_orders_historical') }}
    GROUP BY DATE(order_date)
)

SELECT
    l.date,
    l.order_count as legacy_count,
    n.order_count as new_count,
    l.order_count - n.order_count as count_diff,
    l.total_amount as legacy_amount,
    n.total_amount as new_amount,
    l.total_amount - n.total_amount as amount_diff
FROM legacy_counts l
LEFT JOIN new_counts n ON l.date = n.date
WHERE ABS(l.order_count - COALESCE(n.order_count, 0)) > 0
Run this validation daily during the migration window.
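One way to automate this is to wrap the reconciliation model in a singular dbt test, which fails whenever it returns rows; since the model above only emits rows for mismatched dates, any discrepancy fails dbt test. A sketch (the file name is illustrative):
-- tests/assert_migration_counts_match.sql
-- Singular tests fail when they return rows, so any mismatched date fails `dbt test`
SELECT *
FROM {{ ref('migration_reconciliation') }}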
Best Practices for Backfills
Resource Management
Control compute resources during large backfills:
# profiles.yml
my_project:
  target: prod
  outputs:
    prod:
      type: databricks
      threads: 8
    backfill:
      type: databricks
      threads: 2  # Fewer threads for backfills
Run backfills with the backfill profile:
dbt run --target backfill --vars '{"event_date": "2025-01-15"}'
Incremental Backfill Pattern
Process large date ranges in chunks:
#!/bin/bash
# backfill_script.sh
START_DATE="2023-01-01"
END_DATE="2025-12-31"
current_date="$START_DATE"
while [ "$current_date" != "$END_DATE" ]; do
echo "Processing $current_date"
dbt run \
--select tag:backfill \
--target backfill \
--vars "{\"event_date\": \"$current_date\"}"
if [ $? -eq 0 ]; then
echo "✓ Success: $current_date"
else
echo "✗ Failed: $current_date"
exit 1
fi
# Move to next date
current_date=$(date -I -d "$current_date + 1 day")
done
Backfill Monitoring
Track backfill progress in a metadata table:
-- models/admin/backfill_log.sql
{{
    config(
        materialized='incremental',
        unique_key='backfill_id'
    )
}}

SELECT
    '{{ var("event_date") }}' as backfill_date,
    '{{ invocation_id }}' as backfill_id,
    CURRENT_TIMESTAMP() as completed_at,
    '{{ target.name }}' as target_env

{% if is_incremental() %}
    WHERE '{{ var("event_date") }}' NOT IN (SELECT backfill_date FROM {{ this }})
{% endif %}
Query this table to see which dates have been processed.
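For example, a query like the following lists dates in a backfill range that have no log entry yet. The date range, alias names, and the Spark-style sequence/explode functions are assumptions for a Databricks warehouse, and the Jinja ref only resolves inside a dbt context (e.g. an analysis file compiled with dbt compile):
-- analyses/missing_backfill_dates.sql (illustrative)
WITH expected AS (
    SELECT explode(sequence(DATE'2025-01-01', DATE'2025-01-31', INTERVAL 1 DAY)) AS expected_date
)

SELECT e.expected_date
FROM expected e
LEFT JOIN {{ ref('backfill_log') }} bl
    ON bl.backfill_date = CAST(e.expected_date AS STRING)
WHERE bl.backfill_date IS NULL
ORDER BY e.expected_date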
Common Pitfalls
Overwriting Future Data
When backfilling, ensure you don’t overwrite data that’s already been processed for future dates:
-- Bad: Could overwrite future data
WHERE event_date = '{{ var("event_date") }}'
-- Good: Only process if date is in the past or today
WHERE event_date = '{{ var("event_date") }}'
{% if var("event_date") > modules.datetime.date.today().isoformat() %}
AND 1=0 -- Skip future dates
{% endif %}
Memory Issues with Full Refresh
Avoid full refresh for large historical tables:
# Bad: Tries to process years of data at once
dbt run --full-refresh --select huge_table
# Good: Backfill date by date
dbt run --select huge_table --vars '{"event_date": "2025-01-01"}'
Partition Misalignment
Ensure your WHERE clause matches incremental_predicates:
{{
    config(
        incremental_predicates=["event_date = '{{ var('event_date') }}'"]
    )
}}
-- Must match the predicate exactly
WHERE event_date = '{{ var("event_date") }}'
-- Not: WHERE event_date >= '{{ var("event_date") }}'