7.4. Data Retention
Overview
This article outlines a practical way to track and enforce data retention in a dbt project. It uses model meta
to declare intended retention (for example, 90d
, 2y
, permanent
), a post-hook macro to stamp warehouse-level tags, and an optional housekeeping macro to delete old rows by date.
What you implement
A macro to set a table tag
retention=<value>
based on modelmeta
or schema defaults.A simple housekeeping macro
data_retention(days_to_keep, date_col)
that deletes rows older than N days.YAML examples showing how to declare retention in model
meta
.
Setup
Add the macros to your project:
Macro: set_data_retention_tag-- Generic data retention tagging macro for dbt -- Applies a table tag `retention` with value determined by model meta or default policy. -- Adjust ALTER syntax to your target warehouse if needed. {% macro set_data_retention_tag() %} {% if execute %} {# Read optional model-level override, e.g. meta: { retention: "90d" | "2y" | "permanent" } #} {% set retention_meta = model.meta.get('retention') %} {# Simple defaults by schema name; adapt to your environment/policies #} {% set schema_lower = this.schema | lower %} {% if schema_lower == 'stage' %} {% set default_retention = '2y' %} {% elif schema_lower == 'core' %} {% set default_retention = '3y' %} {% elif schema_lower == 'report' %} {% set default_retention = '5y' %} {% else %} {% set default_retention = 'permanent' %} {% endif %} {% if retention_meta is string and retention_meta|length > 0 %} {% set tag_value = retention_meta %} {% else %} {% set tag_value = default_retention %} {% endif %} {% set tag_sql %} ALTER TABLE {{ this }} SET TAGS ('retention' = '{{ tag_value }}'); {% endset %} {% do run_query(tag_sql) %} {% endif %} {% endmacro %}
Macro: data_retention-- Housekeeping macro to delete rows older than N days -- Usage: post_hook = "{{ data_retention(90, 'event_date') }}" {% macro data_retention(days_to_keep=365, date_col='event_date') %} DELETE FROM {{ this }} WHERE {{ date_col }} < DATE_ADD({{ event_date() }}, -{{ days_to_keep }}) AND {{ event_date() }} <= {{ current_date() }} {% endmacro %}
Helper macros for dates-- Small helpers to make examples portable across warehouses {% macro event_date() %} {{ var('event_date', 'CURRENT_DATE()') }} {% endmacro %} {% macro current_date() %} CURRENT_DATE() {% endmacro %}
Declare retention in your model YAML where needed:
Example YAML with model-level retentionversion: 2 models: - name: user_events_monthly description: Monthly user engagement events used for activity analytics. meta: is_gdpr_model: true retention: 400d columns: - name: user_id description: Anonymized user identifier meta: gdpr_column_type: user_identifier - name: session_id description: Analytics session identifier meta: is_gdpr_column: true - name: event_date description: Logical partition/event date - name: event_type description: Type of event - name: advertising_campaigns description: Campaigns table retained permanently for financial audit. meta: retention: permanent columns: - name: campaign_id description: Campaign identifier - name: seller_id description: Seller identifier meta: gdpr_column_type: business_identifier
Add post-hooks so the tag macro runs for the relevant models. You can set this at the
dbt_project.yml
folder level or per model viaconfig(post_hook=...)
.
Example: incremental model with retention
The snippet below shows an incremental model using replace_where
and a post_hook
to delete rows older than 90 days:
{{
config(
materialized='incremental',
on_schema_change='append_new_columns',
incremental_strategy='replace_where',
incremental_predicates = "event_date = " ~ event_date(),
post_hook = "{{ data_retention(90, 'event_date') }}"
)
}}
select
user_id,
session_id,
event_date,
event_type
from {{ ref('user_events_monthly_source') }}
{% if is_incremental() %}
where event_date = {{ event_date() }}
{% endif %}
Behavior and policy notes
The tag macro does not enforce deletion by itself; it documents intent at the table level for governance and discovery.
The housekeeping macro issues a
DELETE
based on a date column and the current date helpers. Adjust for your warehouse functions if needed.Choose safe execution patterns: for large tables, prefer partition filters and limit delete frequency to off-peak hours.
Defaults by schema are just examples; adapt them to your organization’s policies.