5.3. Databricks Environment Strategy
A practical approach to managing multiple Databricks environments with dbt, enabling safe development workflows while maintaining production data access for testing.
Introduction
Managing data transformations across multiple environments is critical for maintaining data quality and development velocity. This guide describes a two-environment strategy using Databricks and dbt that balances developer flexibility with production safety.
The approach uses two separate Databricks workspaces (stage and prod) with cross-environment data access capabilities, allowing analysts to develop and test models locally or in stage while reading from production data when needed.
Environment Architecture
Stage Environment
The stage environment serves as the development and testing workspace where analysts have full permissions:
Write Access: Analysts can create, modify, and delete any objects (tables, views, notebooks)
Experimentation: Free to test queries, prototypes, and data transformations
Catalog Access: Access to both stage_* and prod_* catalogs for cross-environment testing
dbt Development: Primary environment for developing and testing dbt models
Prod Environment
The production environment is strictly controlled and managed through automation:
Read-Only Access: Analysts can query data but cannot modify production objects
CI/CD Managed: All changes deployed exclusively through automated pipelines
Data Integrity: Prevents accidental modifications or deletions of production data
Catalog Access: Only prod_* catalogs available
Cross-Environment Data Access
Both environments can access data from either catalog, enabling powerful testing workflows:
-- In stage environment, you can query production data
SELECT * FROM prod_monetization.core.users;

-- Or stage data
SELECT * FROM stage_monetization.core.users;

-- Join across environments for validation
SELECT
  s.user_id,
  s.total_revenue AS stage_revenue,
  p.total_revenue AS prod_revenue,
  s.total_revenue - p.total_revenue AS diff
FROM stage_monetization.core.user_revenue s
JOIN prod_monetization.core.user_revenue p
  ON s.user_id = p.user_id;
This capability is particularly valuable for:
Data Validation: Compare stage outputs against production for regression testing
Production Data Testing: Test new models using real production data without risk
Debugging: Investigate discrepancies between environments
Migration Validation: Verify data migrations by comparing source and target
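A compact summary of the comparison above (using the same illustrative user_revenue table) can quantify the drift in a single pass:
-- Count and size the discrepancies between stage and prod outputs
SELECT
  COUNT(*) AS mismatched_users,
  SUM(ABS(s.total_revenue - p.total_revenue)) AS total_abs_diff
FROM stage_monetization.core.user_revenue s
JOIN prod_monetization.core.user_revenue p
  ON s.user_id = p.user_id
WHERE s.total_revenue <> p.total_revenue;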
Local Development with Production Data
A common development challenge: testing dbt models locally using production data without hardcoding table paths. The traditional workflow involves:
Temporarily hardcode the production table path: FROM prod_monetization.core.users
Run and test the model
Change it back to FROM {{ ref('users') }}
Commit the changes
This is tedious and error-prone. A better solution is to override the ref() macro to dynamically switch environments.
The Problem
When developing locally, you want to:
Test models using production data (which is more complete and realistic)
Keep using {{ ref('model_name') }} for proper dependency tracking
Avoid hardcoding environment-specific table paths
Switch between stage and prod data easily
The Solution
Override the built-in ref() macro to allow cross-environment reads while preserving dbt’s dependency tracking:
-- macros/cross_environment_ref.sql
{#
Override ref() to allow cross-environment reads between stage and prod.
Usage:
- Default: {{ ref('some_model') }}
- Override to prod: dbt run --vars '{"source_data_env": "prod"}'
- Override to stage: dbt run --vars '{"source_data_env": "stage"}'
Behavior:
1) Calls the original built-in ref() to preserve dependency tracking
2) If source_data_env is set, swaps the catalog prefix (stage_* <-> prod_*)
3) Only uses the swapped relation if it actually exists
4) Falls back to original relation if target doesn't exist
Example:
-- Model reads from stage_monetization.core.users by default
SELECT * FROM {{ ref('users') }}
-- With --vars '{"source_data_env": "prod"}', reads from prod_monetization.core.users
-- Dependency tracking still works correctly
#}
{% macro ref() %}
    {# --- 1) Extract user-provided args (supports version/v and 1- or 2-arg forms) --- #}
    {% set version = kwargs.get('version') or kwargs.get('v') %}
    {% set packagename = none %}
    {% if (varargs | length) == 1 %}
        {% set modelname = varargs[0] %}
    {% else %}
        {% set packagename = varargs[0] %}
        {% set modelname = varargs[1] %}
    {% endif %}

    {# --- 2) Call the original built-in ref() so dbt records DAG dependencies --- #}
    {% if packagename is not none %}
        {% set rel = builtins.ref(packagename, modelname, version=version) %}
    {% else %}
        {% set rel = builtins.ref(modelname, version=version) %}
    {% endif %}

    {# --- 3) Read the override environment from the source_data_env project/CLI var --- #}
    {% set override_env = var('source_data_env', none) %}
    {% if override_env %}
        {% set override_env = override_env | trim %}
    {% endif %}
    {% if not override_env %}
        {% do return(rel) %}
    {% endif %}

    {# If the relation has no database (ephemeral models), return it as-is #}
    {% if rel.database is none %}
        {% do return(rel) %}
    {% endif %}

    {# --- 4) Compute the new catalog by swapping the prefix before the first '_' --- #}
    {% set parts = rel.database.split('_', 1) %}
    {% if parts | length == 2 %}
        {# Catalog format: stage_monetization or prod_monetization #}
        {% set new_catalog = override_env ~ '_' ~ parts[1] %}
    {% else %}
        {# Fallback: no underscore found, just prefix the existing catalog name #}
        {% set new_catalog = override_env ~ '_' ~ rel.database %}
    {% endif %}

    {# --- 5) Only use the swapped relation if it actually exists --- #}
    {% set candidate = adapter.get_relation(
        database=new_catalog,
        schema=rel.schema,
        identifier=rel.identifier
    ) %}

    {% if candidate is not none %}
        {# Target exists in the override environment, use it #}
        {% do return(candidate) %}
    {% else %}
        {# Target doesn't exist, fall back to the original #}
        {% do log("Warning: " ~ new_catalog ~ "." ~ rel.schema ~ "." ~ rel.identifier ~
                  " not found, falling back to " ~ rel.database ~ "." ~ rel.schema ~ "." ~ rel.identifier,
                  info=True) %}
        {% do return(rel) %}
    {% endif %}
{% endmacro %}
How It Works
Preserve Dependencies: Calls the original ref() macro first to maintain dbt's DAG tracking
Check Override Variable: Reads the source_data_env variable (set in dbt_project.yml or passed with --vars)
Swap Catalog Prefix: Changes the stage_ prefix to prod_ (or vice versa) in the database name
Verify Existence: Only uses the swapped relation if it actually exists; otherwise falls back to the original
Return Relation: Returns the appropriate relation object for dbt to use
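As a quick sketch of the effect (assuming a model that selects from users and the catalog names used above), the model source never changes; only the compiled target does:
-- models/user_revenue.sql (source stays the same in both cases)
SELECT * FROM {{ ref('users') }}

-- Compiled on a normal stage run:
--   SELECT * FROM stage_monetization.core.users

-- Compiled with --vars '{"source_data_env": "prod"}' (when the prod table exists):
--   SELECT * FROM prod_monetization.core.users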
Usage Examples
Default behavior (reads from stage):
# Normal dbt run uses stage environment
dbt run --select my_model
Override to read from production:
# Override to read prod data while developing in stage
dbt run --select my_model --vars '{"source_data_env": "prod"}'
# Useful for testing with realistic data
dbt build --select my_model+ --vars '{"source_data_env": "prod"}'
Set as project default in dbt_project.yml:
vars:
  source_data_env: "prod"  # Always read from prod by default
Benefits of This Approach
For Development Workflow
No Hardcoding: Keep using {{ ref() }} without manual path changes
Fast Iteration: Toggle between environments with a single flag
Realistic Testing: Test models with production data volumes and distributions
Safe Experimentation: Read prod data without risk of modifying it
For Data Quality
Production Validation: Verify model logic against real production data
Regression Testing: Compare stage outputs to prod for consistency checks
Debugging: Investigate data issues using actual production records
Migration Testing: Validate transformations before promoting to production
For Team Productivity
Simplified Onboarding: New team members can test with prod data immediately
Reduced Context Switching: No need to manually swap table references
Better Code Reviews: Reviewers can test PRs with production data locally
Consistent Patterns: Standardized approach across the team
Best Practices
Environment Naming Convention
Use consistent prefixes for catalogs across environments:
Stage: stage_<domain> (e.g., stage_monetization, stage_analytics)
Prod: prod_<domain> (e.g., prod_monetization, prod_analytics)
This naming convention enables the macro to automatically swap prefixes without additional configuration.
Security Considerations
Read-Only Prod Access: Analysts should have SELECT-only permissions on production catalogs
Audit Logging: Enable Databricks audit logs to track cross-environment queries
Sensitive Data: Restrict PII/sensitive data access to prod environment only
Cost Monitoring: Track compute costs for cross-environment queries
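As a minimal sketch of read-only prod access, assuming Unity Catalog and a group named analysts (adjust principals and catalog names to your setup):
-- Privileges granted on a catalog are inherited by its schemas and tables
GRANT USE CATALOG ON CATALOG prod_monetization TO `analysts`;
GRANT USE SCHEMA ON CATALOG prod_monetization TO `analysts`;
GRANT SELECT ON CATALOG prod_monetization TO `analysts`;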
Development Guidelines
Default to Stage: Run models against stage data by default during development
Test with Prod: Use source_data_env=prod for final validation before deployment
Document Dependencies: Clearly indicate when models require production data
CI/CD Validation: Automated tests should validate both stage and prod reads work
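For the last point, a rough sketch of the two CI invocations (selectors and job wiring depend on your pipeline):
# Build the selected models against stage data, then again reading prod data
dbt build --select my_model+
dbt build --select my_model+ --vars '{"source_data_env": "prod"}'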
Limitations and Edge Cases
The cross-environment ref macro handles several edge cases:
Missing Objects: Falls back to original environment if target doesn’t exist
Ephemeral Models: Returns as-is since they have no database
External Packages: Works with both project models and dbt packages
Version Support: Compatible with dbt model versioning (ref('model', v=2))
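For reference, each of these call forms routes through the override, since the macro handles one-argument, two-argument (package), and versioned refs (the package name here is hypothetical):
-- One-argument form
SELECT * FROM {{ ref('users') }}
-- Two-argument (package) form
SELECT * FROM {{ ref('some_package', 'users') }}
-- Versioned form
SELECT * FROM {{ ref('users', v=2) }}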
Common Pitfalls
Forgetting to Reset Override
If you set source_data_env: "prod" in dbt_project.yml for testing, remember to remove it before committing. Otherwise, all local runs will read from prod.
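One way to reduce that risk is to drive the default from an environment variable instead of hardcoding it, so nothing environment-specific needs to be committed (a sketch; the DBT_SOURCE_DATA_ENV name is just a suggestion):
vars:
  # Empty by default; export DBT_SOURCE_DATA_ENV=prod locally when you want prod reads
  source_data_env: "{{ env_var('DBT_SOURCE_DATA_ENV', '') }}"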
Performance Considerations
Cross-environment queries may have higher latency due to network traversal between workspaces. For large datasets, consider:
Filtering data early in transformations
Using stage data for iterative development
Switching to prod only for final validation
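A common way to filter early during iterative runs (a generic dbt pattern, not specific to this macro; the created_date column and seven-day window are placeholders) is to restrict the scan on non-prod targets:
SELECT *
FROM {{ ref('users') }}
{% if target.name != 'prod' %}
  -- Keep dev/stage runs small and fast
  WHERE created_date >= date_sub(current_date(), 7)
{% endif %}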
Catalog Synchronization
The macro assumes stage and prod have similar catalog structures. If catalogs diverge (e.g., prod has a table that stage doesn’t), the fallback logic ensures models still run but may produce unexpected results.
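To spot divergence before it surprises you, one option (a sketch assuming Unity Catalog's information_schema and the catalog names used above) is to list tables present in prod but missing from stage:
-- Tables that exist in prod_monetization but not in stage_monetization
SELECT p.table_schema, p.table_name
FROM prod_monetization.information_schema.tables p
LEFT JOIN stage_monetization.information_schema.tables s
  ON p.table_schema = s.table_schema
  AND p.table_name = s.table_name
WHERE s.table_name IS NULL;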
Example Workflow
Here’s a typical development workflow using this environment strategy:
1. Local Development
# Develop model using stage data (fast iteration)
dbt run --select my_new_model
# Test with production data for validation
dbt run --select my_new_model --vars '{"source_data_env": "prod"}'
2. Create Pull Request
# Run tests against both environments
dbt test --select my_new_model
dbt test --select my_new_model --vars '{"source_data_env": "prod"}'
# Compare outputs
dbt run --select my_validation_model --vars '{"source_data_env": "prod"}'
3. CI/CD Deployment
The pipeline automatically:
Runs models in stage environment
Executes full test suite
On merge to main, deploys to prod via automated workflow
No manual intervention required
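The exact pipeline definition depends on your CI system; as a rough sketch, the jobs boil down to the following commands (the target names are assumptions about profiles.yml):
# Stage CI job: runs on every pull request
dbt deps
dbt build --target stage
# Prod deployment job: runs only on merge to main
dbt build --target prod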
Further Reading
For related topics on dbt development and testing strategies:
5.1. dbt Incremental Models - Incremental model strategies and patterns
7.2. Data Migration Guide - Data migration and validation best practices