5.3. Databricks Environment Strategy

A practical approach to managing multiple Databricks environments with dbt, enabling safe development workflows while maintaining production data access for testing.

Introduction

Managing data transformations across multiple environments is critical for maintaining data quality and development velocity. This guide describes a two-environment strategy using Databricks and dbt that balances developer flexibility with production safety.

The approach uses two separate Databricks workspaces (stage and prod) with cross-environment data access, allowing analysts to develop and test models locally or in stage while reading production data when needed.

Environment Architecture

Stage Environment

The stage environment serves as the development and testing workspace where analysts have full permissions:

  • Write Access: Analysts can create, modify, and delete any objects (tables, views, notebooks)

  • Experimentation: Free to test queries, prototypes, and data transformations

  • Catalog Access: Access to both stage_* and prod_* catalogs for cross-environment testing

  • dbt Development: Primary environment for developing and testing dbt models

Prod Environment

The production environment is strictly controlled and managed through automation:

  • Read-Only Access: Analysts can query data but cannot modify production objects

  • CI/CD Managed: All changes deployed exclusively through automated pipelines

  • Data Integrity: Prevents accidental modifications or deletions of production data

  • Catalog Access: Only prod_* catalogs available

Cross-Environment Data Access

From the stage environment, you can read data from either catalog, enabling powerful testing workflows:

-- In stage environment, you can query production data
SELECT * FROM prod_monetization.core.users

-- Or stage data
SELECT * FROM stage_monetization.core.users

-- Join across environments for validation
SELECT
    s.user_id,
    s.total_revenue AS stage_revenue,
    p.total_revenue AS prod_revenue,
    s.total_revenue - p.total_revenue AS diff
FROM stage_monetization.core.user_revenue s
JOIN prod_monetization.core.user_revenue p
    ON s.user_id = p.user_id

This capability is particularly valuable for:

  • Data Validation: Compare stage outputs against production for regression testing (see the query sketch after this list)

  • Production Data Testing: Test new models using real production data without risk

  • Debugging: Investigate discrepancies between environments

  • Migration Validation: Verify data migrations by comparing source and target
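
For the Data Validation point above, a regression check can summarize differences rather than inspecting individual rows; a minimal sketch, assuming the user_revenue model exists in both catalogs:

-- Summarize how stage and prod outputs differ
SELECT
    COUNT(*) AS compared_rows,
    SUM(CASE WHEN s.user_id IS NULL OR p.user_id IS NULL THEN 1 ELSE 0 END) AS rows_missing_on_one_side,
    SUM(CASE WHEN s.total_revenue <> p.total_revenue THEN 1 ELSE 0 END) AS revenue_mismatches
FROM stage_monetization.core.user_revenue s
FULL OUTER JOIN prod_monetization.core.user_revenue p
    ON s.user_id = p.user_id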

Local Development with Production Data

A common development challenge: testing dbt models locally using production data without hardcoding table paths. The traditional workflow involves:

  1. Temporarily hardcode the production table path: FROM prod_monetization.core.users

  2. Run and test the model

  3. Change back to: FROM {{ ref('users') }}

  4. Commit changes

This is tedious and error-prone. A better solution is to override the ref() macro to dynamically switch environments.

The Problem

When developing locally, you want to:

  • Test models using production data (which is more complete and realistic)

  • Keep using {{ ref('model_name') }} for proper dependency tracking

  • Avoid hardcoding environment-specific table paths

  • Switch between stage and prod data easily

The Solution

Override the built-in ref() macro to allow cross-environment reads while preserving dbt’s dependency tracking:

macros/cross_environment_ref.sql - Override ref() for cross-environment reads
-- macros/cross_environment_ref.sql
{#
  Override ref() to allow cross-environment reads between stage and prod.
  
  Usage:
    - Default:            {{ ref('some_model') }}                
    - Override to prod:   dbt run --vars '{"source_data_env": "prod"}'
    - Override to stage:  dbt run --vars '{"source_data_env": "stage"}'

  Behavior:
    1) Calls the original built-in ref() to preserve dependency tracking
    2) If source_data_env is set, swaps the catalog prefix (stage_* <-> prod_*)
    3) Only uses the swapped relation if it actually exists
    4) Falls back to original relation if target doesn't exist

  Example:
    -- Model reads from stage_monetization.core.users by default
    SELECT * FROM {{ ref('users') }}
    
    -- With --vars '{"source_data_env": "prod"}', reads from prod_monetization.core.users
    -- Dependency tracking still works correctly
#}

{% macro ref() %}
  {# --- 1) Extract user-provided args (supports version/v and 1- or 2-arg forms) --- #}
  {% set version = kwargs.get('version') or kwargs.get('v') %}
  {% set packagename = none %}
  
  {% if (varargs | length) == 1 %}
    {% set modelname = varargs[0] %}
  {% else %}
    {% set packagename = varargs[0] %}
    {% set modelname   = varargs[1] %}
  {% endif %}

  {# --- 2) Call the original built-in ref() so dbt records DAG dependencies --- #}
  {% if packagename is not none %}
    {% set rel = builtins.ref(packagename, modelname, version=version) %}
  {% else %}
    {% set rel = builtins.ref(modelname, version=version) %}
  {% endif %}

  {# --- 3) Read override env from dbt vars (--vars flag or dbt_project.yml) --- #}
  {% set override_env = var('source_data_env', none) %}
  {% if override_env %}
    {% set override_env = override_env | trim %}
  {% endif %}
  
  {% if not override_env %}
    {% do return(rel) %}
  {% endif %}

  {# If Relation has no database (ephemeral models), return as-is #}
  {% if rel.database is none %}
    {% do return(rel) %}
  {% endif %}

  {# --- 4) Compute new catalog by swapping the prefix before first '_' --- #}
  {% set parts = rel.database.split('_', 1) %}
  {% if parts | length == 2 %}
    {# Catalog format: stage_monetization or prod_monetization #}
    {% set current_env = parts[0] %}
    {% set domain = parts[1] %}
    {% set new_catalog = override_env ~ '_' ~ domain %}
  {% else %}
    {# Fallback: no underscore found, just prefix #}
    {% set new_catalog = override_env ~ '_' ~ rel.database %}
  {% endif %}

  {# --- 5) Only use the swapped relation if it actually exists --- #}
  {% set candidate = adapter.get_relation(
      database=new_catalog,
      schema=rel.schema,
      identifier=rel.identifier
  ) %}

  {% if candidate is not none %}
    {# Target exists in override environment, use it #}
    {% do return(candidate) %}
  {% else %}
    {# Target doesn't exist, fall back to original #}
    {% do log("Warning: " ~ new_catalog ~ "." ~ rel.schema ~ "." ~ rel.identifier ~ 
              " not found, falling back to " ~ rel.database ~ "." ~ rel.schema ~ "." ~ rel.identifier, 
              info=True) %}
    {% do return(rel) %}
  {% endif %}
{% endmacro %}

How It Works

  1. Preserve Dependencies: Calls the original ref() macro first to maintain dbt’s DAG tracking

  2. Check Override Variable: Reads the source_data_env variable (set via --vars or dbt_project.yml)

  3. Swap Catalog Prefix: Changes stage_ prefix to prod_ (or vice versa) in the database name

  4. Verify Existence: Only uses the swapped relation if it actually exists; otherwise falls back to original

  5. Return Relation: Returns the appropriate relation object for dbt to use
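
Putting these steps together, the same model source compiles to different relations depending on the variable. A sketch, assuming a users model that normally builds into stage_monetization.core (the model path below is illustrative):

-- models/core/user_summary.sql (source is identical in both cases)
SELECT * FROM {{ ref('users') }}

-- Compiled without an override (default behavior):
SELECT * FROM stage_monetization.core.users

-- Compiled with --vars '{"source_data_env": "prod"}' (when the prod table exists):
SELECT * FROM prod_monetization.core.users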

Usage Examples

Default behavior (reads from stage):

# Normal dbt run uses stage environment
dbt run --select my_model

Override to read from production:

# Override to read prod data while developing in stage
dbt run --select my_model --vars '{"source_data_env": "prod"}'

# Useful for testing with realistic data
dbt build --select my_model+ --vars '{"source_data_env": "prod"}'

Set as project default in dbt_project.yml:

vars:
  source_data_env: "prod"  # Always read from prod by default
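
If you prefer to drive the default from the shell instead of editing dbt_project.yml, dbt's env_var() function can supply the value. A sketch, where the SOURCE_DATA_ENV variable name is only an example:

vars:
  # Empty string means no override (default stage behavior)
  source_data_env: "{{ env_var('SOURCE_DATA_ENV', '') }}"

With this in place, SOURCE_DATA_ENV=prod dbt run --select my_model reads prod data without touching any files.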

Benefits of This Approach

For Development Workflow

  • No Hardcoding: Keep using {{ ref() }} without manual path changes

  • Fast Iteration: Toggle between environments with a single flag

  • Realistic Testing: Test models with production data volumes and distributions

  • Safe Experimentation: Read prod data without risk of modifying it

For Data Quality

  • Production Validation: Verify model logic against real production data

  • Regression Testing: Compare stage outputs to prod for consistency checks

  • Debugging: Investigate data issues using actual production records

  • Migration Testing: Validate transformations before promoting to production

For Team Productivity

  • Simplified Onboarding: New team members can test with prod data immediately

  • Reduced Context Switching: No need to manually swap table references

  • Better Code Reviews: Reviewers can test PRs with production data locally

  • Consistent Patterns: Standardized approach across the team

Best Practices

Environment Naming Convention

Use consistent prefixes for catalogs across environments:

  • Stage: stage_<domain> (e.g., stage_monetization, stage_analytics)

  • Prod: prod_<domain> (e.g., prod_monetization, prod_analytics)

This naming convention enables the macro to automatically swap prefixes without additional configuration.
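
The same convention carries into the dbt connection profiles. A minimal profiles.yml sketch for dbt-databricks, where the hostnames, warehouse IDs, and the monetization domain are placeholders:

my_project:
  target: stage
  outputs:
    stage:
      type: databricks
      host: <stage-workspace>.cloud.databricks.com
      http_path: /sql/1.0/warehouses/<warehouse-id>
      catalog: stage_monetization
      schema: core
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
    prod:
      type: databricks
      host: <prod-workspace>.cloud.databricks.com
      http_path: /sql/1.0/warehouses/<warehouse-id>
      catalog: prod_monetization
      schema: core
      token: "{{ env_var('DATABRICKS_TOKEN') }}"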

Security Considerations

  • Read-Only Prod Access: Analysts should have SELECT-only permissions on production catalogs (see the grant sketch after this list)

  • Audit Logging: Enable Databricks audit logs to track cross-environment queries

  • Sensitive Data: Restrict PII/sensitive data access to prod environment only

  • Cost Monitoring: Track compute costs for cross-environment queries
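
For the read-only access noted above, permissions are typically managed with Unity Catalog grants. A sketch, assuming an analysts group (statements run by a catalog admin):

-- Allow the analysts group to browse and query, but not modify, the prod catalog
GRANT USE CATALOG ON CATALOG prod_monetization TO `analysts`;
GRANT USE SCHEMA ON CATALOG prod_monetization TO `analysts`;
GRANT SELECT ON CATALOG prod_monetization TO `analysts`;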

Development Guidelines

  • Default to Stage: Run models against stage data by default during development

  • Test with Prod: Use source_data_env=prod for final validation before deployment

  • Document Dependencies: Clearly indicate when models require production data

  • CI/CD Validation: Automated tests should validate both stage and prod reads work

Limitations and Edge Cases

The cross-environment ref macro handles several edge cases:

  • Missing Objects: Falls back to original environment if target doesn’t exist

  • Ephemeral Models: Returns as-is since they have no database

  • External Packages: Works with both project models and dbt packages

  • Version Support: Compatible with dbt model versioning (ref('model', v=2))
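
Versioned refs flow through the same logic, since the macro forwards the version/v keyword to builtins.ref():

-- Resolves the versioned relation first, then applies the same catalog swap
SELECT * FROM {{ ref('users', v=2) }}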

Common Pitfalls

Forgetting to Reset Override

If you set source_data_env: "prod" in dbt_project.yml for testing, remember to remove it before committing. Otherwise, every run that does not explicitly override the variable will read from prod.

Performance Considerations

Cross-environment queries may have higher latency due to network traversal between workspaces. For large datasets, consider:

  • Filtering data early in transformations (see the sketch after this list)

  • Using stage data for iterative development

  • Switching to prod only for final validation
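
For the early-filtering point, one option is to gate a date filter on the dbt target so that iterative runs only scan a recent window; a sketch, assuming a prod target name and an event_date column:

SELECT *
FROM {{ ref('events') }}
{% if target.name != 'prod' %}
  -- Keep development runs cheap; production runs read the full history
  WHERE event_date >= current_date() - INTERVAL 7 DAYS
{% endif %}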

Catalog Synchronization

The macro assumes stage and prod have similar catalog structures. If catalogs diverge (e.g., prod has a table that stage doesn’t), the fallback logic ensures models still run but may produce unexpected results.

Example Workflow

Here’s a typical development workflow using this environment strategy:

1. Local Development

# Develop model using stage data (fast iteration)
dbt run --select my_new_model

# Test with production data for validation
dbt run --select my_new_model --vars '{"source_data_env": "prod"}'

2. Create Pull Request

# Run tests against both environments
dbt test --select my_new_model
dbt test --select my_new_model --vars '{"source_data_env": "prod"}'

# Compare outputs
dbt run --select my_validation_model --vars '{"source_data_env": "prod"}'
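
The my_validation_model above is only a placeholder name; one possible shape is a comparison model that surfaces rows where the two environments disagree:

-- models/validation/my_validation_model.sql (sketch)
SELECT
    s.user_id,
    s.total_revenue AS stage_revenue,
    p.total_revenue AS prod_revenue
FROM stage_monetization.core.user_revenue s
JOIN prod_monetization.core.user_revenue p
    ON s.user_id = p.user_id
WHERE s.total_revenue <> p.total_revenue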

3. CI/CD Deployment

The pipeline automatically:

  • Runs models in stage environment

  • Executes full test suite

  • On merge to main, deploys to prod via automated workflow

  • No manual intervention required
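
The exact pipeline definition depends on your CI system; as a rough sketch, the steps above boil down to two invocations against the stage and prod targets from the profiles sketch earlier:

# Pull request checks build and test in the stage workspace
dbt build --target stage

# After merge to main, the deployment job builds in the prod workspace
dbt build --target prod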

Further Reading

For related topics on dbt development and testing strategies: