7.4. Data Retention

Overview

This article outlines a practical way to track and enforce data retention in a dbt project. It uses model meta to declare intended retention (for example, 90d, 2y, permanent), a post-hook macro to stamp warehouse-level tags, and an optional housekeeping macro to delete old rows by date.

What you implement

  • A macro to set a table tag retention=<value> based on model meta or schema defaults.

  • A simple housekeeping macro data_retention(days_to_keep, date_col) that deletes rows older than N days.

  • YAML examples showing how to declare retention in model meta.

Setup

  1. Add the macros to your project:

    Macro: set_data_retention_tag
    -- Generic data retention tagging macro for dbt
    -- Applies a table tag `retention` with value determined by model meta or default policy.
    -- Adjust ALTER syntax to your target warehouse if needed.
    
    {% macro set_data_retention_tag() %}
        {% if execute %}
            {# Read optional model-level override, e.g. meta: { retention: "90d" | "2y" | "permanent" } #}
            {% set retention_meta = model.meta.get('retention') %}
    
            {# Simple defaults by schema name; adapt to your environment/policies #}
            {% set schema_lower = this.schema | lower %}
            {% if schema_lower == 'stage' %}
                {% set default_retention = '2y' %}
            {% elif schema_lower == 'core' %}
                {% set default_retention = '3y' %}
            {% elif schema_lower == 'report' %}
                {% set default_retention = '5y' %}
            {% else %}
                {% set default_retention = 'permanent' %}
            {% endif %}
    
            {% if retention_meta is string and retention_meta|length > 0 %}
                {% set tag_value = retention_meta %}
            {% else %}
                {% set tag_value = default_retention %}
            {% endif %}
    
            {% set tag_sql %}
                ALTER TABLE {{ this }} SET TAGS ('retention' = '{{ tag_value }}');
            {% endset %}
            {% do run_query(tag_sql) %}
        {% endif %}
    {% endmacro %}
    
    Macro: data_retention
    -- Housekeeping macro to delete rows older than N days
    -- Usage: post_hook = "{{ data_retention(90, 'event_date') }}"
    
    {% macro data_retention(days_to_keep=365, date_col='event_date') %}
        DELETE FROM {{ this }}
        WHERE {{ date_col }} < DATE_ADD({{ event_date() }}, -{{ days_to_keep }})
          AND {{ event_date() }} <= {{ current_date() }}
    {% endmacro %}
    
    Helper macros for dates
    -- Small helpers to make examples portable across warehouses
    
    {% macro event_date() %}
        {{ var('event_date', 'CURRENT_DATE()') }}
    {% endmacro %}
    
    {% macro current_date() %}
        CURRENT_DATE()
    {% endmacro %}
    
  2. Declare retention in your model YAML where needed:

    Example YAML with model-level retention
    version: 2
    
    models:
      - name: user_events_monthly
        description: Monthly user engagement events used for activity analytics.
        meta:
          is_gdpr_model: true
          retention: 400d
        columns:
          - name: user_id
            description: Anonymized user identifier
            meta:
              gdpr_column_type: user_identifier
          - name: session_id
            description: Analytics session identifier
            meta:
              is_gdpr_column: true
          - name: event_date
            description: Logical partition/event date
          - name: event_type
            description: Type of event
    
      - name: advertising_campaigns
        description: Campaigns table retained permanently for financial audit.
        meta:
          retention: permanent
        columns:
          - name: campaign_id
            description: Campaign identifier
          - name: seller_id
            description: Seller identifier
            meta:
              gdpr_column_type: business_identifier
    
  3. Add post-hooks so the tag macro runs for the relevant models. You can set this at the dbt_project.yml folder level or per model via config(post_hook=...).

Example: incremental model with retention The snippet below shows an incremental model using replace_where and a post_hook to delete rows older than 90 days:

Incremental model using data_retention
{{
    config(
        materialized='incremental',
        on_schema_change='append_new_columns',
        incremental_strategy='replace_where',
        incremental_predicates = "event_date = " ~ event_date(),
        post_hook = "{{ data_retention(90, 'event_date') }}"
    )
}}

select
  user_id,
  session_id,
  event_date,
  event_type
from {{ ref('user_events_monthly_source') }}
{% if is_incremental() %}
where event_date = {{ event_date() }}
{% endif %}

Behavior and policy notes

  • The tag macro does not enforce deletion by itself; it documents intent at the table level for governance and discovery.

  • The housekeeping macro issues a DELETE based on a date column and the current date helpers. Adjust for your warehouse functions if needed.

  • Choose safe execution patterns: for large tables, prefer partition filters and limit delete frequency to off-peak hours.

  • Defaults by schema are just examples; adapt them to your organization’s policies.