4.1. dbt Incremental Models

Advanced strategies for implementing efficient incremental data models in dbt with Databricks.

Introduction

Incremental models are one of dbt’s most powerful features for handling large datasets efficiently. Instead of rebuilding entire tables on every run, incremental models only process new or changed data, dramatically reducing compute costs and execution time.

This guide covers advanced strategies for implementing robust incremental models on Databricks, including:

  • Core incremental strategies and their use cases

  • Understanding incremental predicates and partition optimization

  • Establishing event_date as a standard for all incremental models

  • Managing schema changes with append_new_columns

  • Implementing automated data retention policies

  • Solving column ordering issues in incremental models

Key Benefits of Incremental Models

  • Performance: Process only new/changed records instead of full table rebuilds

  • Cost Efficiency: Significantly reduced compute costs for large datasets

  • Scalability: Handle tables with billions of records efficiently

  • Flexibility: Support various update strategies based on business needs

Core Incremental Strategies

dbt supports several incremental strategies, each optimized for different use cases:

append (Default)

Simply adds new records to the target table without checking for duplicates.

{{
    config(
        materialized='incremental',
        incremental_strategy='append'
    )
}}

SELECT * FROM {{ ref('raw_events') }}
{% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

merge (Recommended)

Performs MERGE operations, ideal for handling updates and inserts (upserts).

{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='user_id'
    )
}}

SELECT user_id, email, updated_at
FROM {{ ref('raw_users') }}
{% if is_incremental() %}
    WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

replace_where (Databricks Optimized)

Replaces data based on a predicate condition, excellent for time-partitioned data.

{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=["event_date = " ~ event_date()]
    )
}}
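
As with the other strategies, the config is followed by the model's SELECT. A minimal body might look like the following (column names are illustrative, and it assumes the project-level event_date() macro sketched below under Dynamic Predicate with Macros returns a quoted date literal):

SELECT
    {{ event_date() }} AS event_date,
    user_id,
    event_type,
    created_at
FROM {{ ref('raw_events') }}
{% if is_incremental() %}
    WHERE DATE(created_at) = {{ event_date() }}
{% endif %}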

Understanding Incremental Predicates

Incremental predicates define which partitions of data to replace or merge. They’re crucial for performance optimization and data consistency.

Basic Predicate Usage

{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates="event_date = '{{ var('event_date') }}'"
    )
}}

Dynamic Predicate with Macros

{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=["event_date = " ~ event_date()]
    )
}}
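
Note that event_date() is not a dbt built-in but a small project macro. A minimal sketch of such a macro, assuming the run date is passed as an event_date variable and otherwise defaults to the warehouse's current date, might look like this:

{% macro event_date() %}
    {#- Returns a date expression for the run being processed.
        Pass the date explicitly with: --vars '{event_date: 2024-06-01}' -#}
    {%- if var('event_date', none) -%}
        DATE'{{ var("event_date") }}'
    {%- else -%}
        CURRENT_DATE()
    {%- endif -%}
{% endmacro %}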

Multiple Predicates

{{
    config(
        materialized='incremental',
        incremental_strategy='replace_where',
        incremental_predicates=[
            "event_date = " ~ event_date(),
            "region = '" ~ var('region') ~ "'"
        ]
    )
}}
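
For reference, with the replace_where strategy the adapter compiles these predicates into an INSERT ... REPLACE WHERE statement; the default generated SQL looks roughly like this (table names and values are illustrative):

INSERT INTO analytics.user_events
REPLACE WHERE event_date = DATE'2024-06-01' AND region = 'eu-west'
TABLE analytics.user_events__dbt_tmp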

Partition Strategy with partition_by

Proper partitioning is essential for query performance and cost optimization in Databricks.

Date-based Partitioning (Recommended)

{{
    config(
        materialized='incremental',
        partition_by=['event_date'],
        incremental_strategy='replace_where',
        incremental_predicates=["event_date = " ~ event_date()]
    )
}}

Multi-column Partitioning

{{
    config(
        materialized='incremental',
        partition_by=['event_date', 'region'],
        incremental_strategy='replace_where'
    )
}}
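
For context, partition_by is applied when dbt first creates the Delta table; the resulting DDL is roughly equivalent to the following (names are illustrative):

CREATE TABLE analytics.user_events
USING DELTA
PARTITIONED BY (event_date, region)
AS
SELECT event_date, region, user_id, event_type, created_at
FROM analytics.raw_events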

Event Date as the Standard

Establishing event_date as a consistent column across all incremental models provides numerous benefits:

Benefits of event_date Standardization

  • Consistent Partitioning: All tables partition on the same column

  • Simplified Joins: Easy to join tables across different time periods

  • Uniform Retention: Apply the same retention policies across all models

  • Query Optimization: Databricks can optimize queries knowing the partition structure

Implementation Pattern

{{
    config(
        materialized='incremental',
        on_schema_change='append_new_columns',
        incremental_strategy='replace_where',
        incremental_predicates=["event_date = " ~ event_date()],
        post_hook="{{ data_retention(90) }}",
        partition_by=['event_date']
    )
}}

SELECT
    {{ event_date() }} AS event_date,
    user_id,
    action_type,
    created_at
FROM {{ ref('raw_events') }}
{% if is_incremental() %}
    WHERE DATE(created_at) = {{ event_date() }}
{% endif %}

Schema Evolution with on_schema_change

The on_schema_change='append_new_columns' setting allows tables to evolve gracefully when new columns are added to source data.

{{
    config(
        materialized='incremental',
        on_schema_change='append_new_columns'
    )
}}

Benefits:

  • Automatically adds new columns from source data

  • Prevents model failures when schema changes occur

  • Maintains backward compatibility with existing data

Considerations:

  • New columns will be NULL for existing records

  • Column type changes may still cause failures

  • Always test schema changes in development first
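
One way to reduce the risk of type-change failures is to pin column types with explicit casts in the model, so drift in the source cannot silently alter the target schema. A minimal sketch (column names are illustrative):

SELECT
    CAST(user_id AS STRING) AS user_id,
    CAST(amount AS DECIMAL(10, 2)) AS amount,
    CAST(created_at AS TIMESTAMP) AS created_at
FROM {{ ref('raw_events') }}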

Data Retention Management

Large incremental models require automated data retention to manage storage costs and comply with data governance policies.

Custom Data Retention Macro

{% macro data_retention(days_to_keep=365, date_col='event_date') %}
    {#- Deletes rows more than `days_to_keep` days older than the run's event_date,
        but only when that event_date is not in the future. -#}
    DELETE FROM {{ this }}
    WHERE {{ date_col }} < DATE_ADD({{ event_date() }}, -{{ days_to_keep }})
    AND {{ event_date() }} <= CURRENT_DATE()
{%- endmacro %}

Usage in Incremental Models

{{
    config(
        materialized='incremental',
        post_hook="{{ data_retention(90) }}"
    )
}}

Retention Configuration Examples

-- Retain 30 days of data
post_hook="{{ data_retention(30) }}"

-- Retain 1 year with custom date column
post_hook="{{ data_retention(365, 'transaction_date') }}"

-- Use the macro defaults (365 days, event_date column)
post_hook="{{ data_retention() }}"

Column Ordering Problem and Solution

One of the most challenging issues with incremental models is maintaining consistent column ordering between the source data and target table. When Databricks creates a table, the column order matters for incremental operations. If your dbt model’s SELECT statement has columns in a different order than the target table’s schema, incremental runs will fail with type errors.

You can see the table's actual column order in Databricks with the SHOW CREATE TABLE command; if your SQL SELECT produces the columns in a different order, incremental operations fail. This commonly happens when running incremental models multiple times or when pre-commit hooks (like sqlfluff) reorder your SQL columns automatically.

Solution: Override the Strategy Macro

The most robust solution is to override the adapter's replace_where strategy macro so that column ordering is handled automatically:

{% macro databricks__get_incremental_replace_where_sql(arg_dict) %}
    {#
      This is the definitive override for the `replace_where` strategy.
      It fixes the column-order problem by using the `SELECT` form of the
      `INSERT ... REPLACE WHERE` command, which allows us to explicitly
      order the columns to match the target table.
    #}
    {%- set target_relation = arg_dict['target_relation'] -%}
    {%- set temp_relation = arg_dict['temp_relation'] -%}
    {%- set predicates = arg_dict['incremental_predicates'] -%}

    {# Get the ordered list of columns from the target table #}
    {%- set dest_columns = adapter.get_columns_in_relation(target_relation) -%}
    {%- set dest_cols_csv = dest_columns | map(attribute='name') | join(', ') -%}

    INSERT INTO {{ target_relation }}
    REPLACE WHERE
        {# This robust logic handles both string and list predicates #}
        {% if predicates is sequence and predicates is not string %}
            {{ predicates | join(' AND ') }}
        {% else %}
            {{ predicates }}
        {% endif %}
    -- Use a SELECT instead of the original 'TABLE {{ temp_relation.render() }}'
    SELECT {{ dest_cols_csv }} FROM {{ temp_relation.render() }}

{% endmacro %}

Best Practice: Match YAML Schema Order

The safest approach is to ensure your model’s YAML schema matches Databricks’ table creation order. This prevents the column ordering issue from occurring in the first place and makes your models more predictable and maintainable.
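
To check the actual order before editing a model, you can read it straight from the adapter. The helper below is a hypothetical run-operation sketch (not part of dbt or dbt-databricks) that prints the target table's columns in the order Databricks reports them, for comparison against your YAML and SELECT:

{% macro print_column_order(schema_name, table_name) %}
    {#- Usage: dbt run-operation print_column_order --args '{schema_name: analytics, table_name: user_events}'
        Prints each column of the target table in the order the adapter reports it. -#}
    {%- set relation = api.Relation.create(schema=schema_name, identifier=table_name) -%}
    {%- set columns = adapter.get_columns_in_relation(relation) -%}
    {%- for col in columns -%}
        {% do log(loop.index ~ '. ' ~ col.name ~ ' (' ~ col.dtype ~ ')', info=True) %}
    {%- endfor -%}
{% endmacro %}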

Complete Example Implementation

Production-ready Incremental Model

Here’s a complete example that incorporates all the best practices discussed:

{{
    config(
        materialized='incremental',
        on_schema_change='append_new_columns',
        incremental_strategy='replace_where',
        incremental_predicates=["event_date = " ~ event_date()],
        post_hook="{{ data_retention(90) }}",
        partition_by=['event_date']
    )
}}

SELECT
    {{ event_date() }} AS event_date,
    user_id,
    session_id,
    event_type,
    properties,
    created_at
FROM {{ ref('raw_user_events') }}
{% if is_incremental() %}
    WHERE DATE(created_at) = {{ event_date() }}
{% endif %}

Corresponding Schema Definition

version: 2

models:
  - name: user_events_incremental
    description: "User events processed incrementally with automated retention"
    columns:
      - name: event_date
        description: "Partition date for the event"
        data_type: date
        tests:
          - not_null
      - name: user_id
        description: "Unique identifier for the user"
        data_type: string
        tests:
          - not_null
      - name: session_id
        description: "Session identifier"
        data_type: string
      - name: event_type
        description: "Type of user event"
        data_type: string
      - name: properties
        description: "Event properties as JSON"
        data_type: string
      - name: created_at
        description: "Original event timestamp"
        data_type: timestamp

Best Practices and Recommendations

Performance Optimization

  • Always use event_date partitioning for time-series data

  • Choose appropriate incremental strategies based on data patterns

  • Use replace_where for event data, merge for dimension tables

  • Co-locate data on unique_key columns with Z-ORDER or liquid clustering (Delta tables have no traditional indexes); see the sketch after this list
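
Since Delta tables have no traditional indexes, one common approach is an OPTIMIZE ... ZORDER BY post-hook on the merge key so lookups during MERGE touch fewer files. A minimal sketch, assuming user_id is the unique_key:

{{
    config(
        materialized='incremental',
        incremental_strategy='merge',
        unique_key='user_id',
        post_hook="OPTIMIZE {{ this }} ZORDER BY (user_id)"
    )
}}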

Data Quality

  • Always test incremental logic with dbt build --full-refresh

  • Implement data quality tests on incremental models

  • Monitor for duplicate records when using the append strategy (see the test sketch after this list)

  • Validate that incremental predicates match partition keys
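
For example, a singular test can catch duplicates introduced by append-only loads; a minimal sketch against the example model from this guide (grain assumed to be event_date + user_id + session_id):

-- tests/assert_no_duplicate_user_events.sql
SELECT event_date, user_id, session_id, COUNT(*) AS row_count
FROM {{ ref('user_events_incremental') }}
GROUP BY event_date, user_id, session_id
HAVING COUNT(*) > 1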

Operational Excellence

  • Standardize on event_date across all incremental models

  • Implement automated data retention policies

  • Use consistent naming conventions for incremental models

  • Document incremental strategies in model descriptions

Cost Management

  • Right-size partitions to avoid small file problems

  • Use appropriate retention periods to manage storage costs

  • Monitor incremental model performance and adjust strategies

  • Consider using insert_overwrite for full partition refreshes

Common Pitfalls and Solutions

Small Files Problem

Incremental models can create many small files, impacting performance. Use optimize and vacuum operations regularly:

post_hook=[
    "OPTIMIZE {{ this }}",
    "{{ data_retention(90) }}",
    "VACUUM {{ this }} RETAIN 168 HOURS"
]

Inconsistent Data Types

Schema changes can cause type mismatches in incremental runs. Use strict schema definitions and test changes thoroughly:

columns:
  - name: event_date
    data_type: date
    tests:
      - not_null
  - name: amount
    data_type: decimal(10,2)

Partition Pruning Issues

Queries scan too many partitions due to improper predicate pushdown. Ensure partition columns are used in WHERE clauses:

{% if is_incremental() %}
    WHERE event_date = {{ event_date() }}  -- Ensures partition pruning
{% endif %}

Conclusion

Incremental models are essential for building scalable data pipelines in dbt. By following these patterns and best practices:

  • Use consistent event_date partitioning across all models

  • Implement automated data retention policies

  • Choose appropriate incremental strategies for your use cases

  • Solve column ordering issues by overriding the incremental strategy macro

  • Monitor and optimize performance regularly

These strategies will help you build robust, efficient, and maintainable incremental models that scale with your data growth while controlling costs and maintaining high performance.
