5.3. Databricks Environment Strategy

A practical approach to managing multiple Databricks environments with dbt, enabling safe development workflows while maintaining production data access for testing.

Introduction

Managing data transformations across multiple environments is critical for maintaining data quality and development velocity. This guide describes a two-environment strategy using Databricks and dbt that balances developer flexibility with production safety.

The approach uses two separate Databricks workspaces (stage and prod) with cross-environment data access, allowing analysts to develop and test models locally or in stage while reading production data when needed.

Environment Architecture

Stage Environment

The stage environment serves as the development and testing workspace where analysts have full permissions:

  • Write Access: Analysts can create, modify, and delete any objects (tables, views, notebooks)

  • Experimentation: Free to test queries, prototypes, and data transformations

  • Catalog Access: Access to both stage_* and prod_* catalogs for cross-environment testing

  • dbt Development: Primary environment for developing and testing dbt models

Prod Environment

The production environment is strictly controlled and managed through automation:

  • Read-Only Access: Analysts can query data but cannot modify production objects

  • CI/CD Managed: All changes deployed exclusively through automated pipelines

  • Data Integrity: Prevents accidental modifications or deletions of production data

  • Catalog Access: Only prod_* catalogs available

Cross-Environment Data Access

From the stage environment, you can read data from either catalog, enabling powerful testing workflows:

-- In stage environment, you can query production data
SELECT * FROM prod_monetization.core.users

-- Or stage data
SELECT * FROM stage_monetization.core.users

-- Join across environments for validation
SELECT
    s.user_id,
    s.total_revenue AS stage_revenue,
    p.total_revenue AS prod_revenue,
    s.total_revenue - p.total_revenue AS diff
FROM stage_monetization.core.user_revenue s
JOIN prod_monetization.core.user_revenue p
    ON s.user_id = p.user_id

This capability is particularly valuable for:

  • Data Validation: Compare stage outputs against production for regression testing (see the query sketch after this list)

  • Production Data Testing: Test new models using real production data without risk

  • Debugging: Investigate discrepancies between environments

  • Migration Validation: Verify data migrations by comparing source and target
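
For the Data Validation point above, a regression check can summarize differences rather than inspecting individual rows; a minimal sketch, assuming the user_revenue model exists in both catalogs:

-- Summarize how stage and prod outputs differ
SELECT
    COUNT(*) AS compared_rows,
    SUM(CASE WHEN s.user_id IS NULL OR p.user_id IS NULL THEN 1 ELSE 0 END) AS rows_missing_on_one_side,
    SUM(CASE WHEN s.total_revenue <> p.total_revenue THEN 1 ELSE 0 END) AS revenue_mismatches
FROM stage_monetization.core.user_revenue s
FULL OUTER JOIN prod_monetization.core.user_revenue p
    ON s.user_id = p.user_id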

Local Development with Production Data

A common development challenge: testing dbt models locally using production data without hardcoding table paths. The traditional workflow involves:

  1. Temporarily hardcode the production table path: FROM prod_monetization.core.users

  2. Run and test the model

  3. Change back to: FROM {{ ref('users') }}

  4. Commit changes

This is tedious and error-prone. A better solution is to override the ref() macro to dynamically switch environments.

The Problem

When developing locally, you want to:

  • Test models using production data (which is more complete and realistic)

  • Keep using {{ ref('model_name') }} for proper dependency tracking

  • Avoid hardcoding environment-specific table paths

  • Switch between stage and prod data easily

The Solution

Override the built-in ref() macro to allow cross-environment reads while preserving dbt’s dependency tracking:

macros/cross_environment_ref.sql - Override ref() for cross-environment reads
-- macros/cross_environment_ref.sql
{#
  Override ref() to allow cross-environment reads between stage and prod.
  
  Usage:
    - Default:            {{ ref('some_model') }}                
    - Override to prod:   dbt run --vars '{"source_data_env": "prod"}'
    - Override to stage:  dbt run --vars '{"source_data_env": "stage"}'

  Behavior:
    1) Calls the original built-in ref() to preserve dependency tracking
    2) If source_data_env is set, swaps the catalog prefix (stage_* <-> prod_*)
    3) Only uses the swapped relation if it actually exists
    4) Falls back to original relation if target doesn't exist

  Example:
    -- Model reads from stage_monetization.core.users by default
    SELECT * FROM {{ ref('users') }}
    
    -- With --vars '{"source_data_env": "prod"}', reads from prod_monetization.core.users
    -- Dependency tracking still works correctly
#}

{% macro ref() %}
  {# --- 1) Extract user-provided args (supports version/v and 1- or 2-arg forms) --- #}
  {% set version = kwargs.get('version') or kwargs.get('v') %}
  {% set packagename = none %}
  
  {% if (varargs | length) == 1 %}
    {% set modelname = varargs[0] %}
  {% else %}
    {% set packagename = varargs[0] %}
    {% set modelname   = varargs[1] %}
  {% endif %}

  {# --- 2) Call the original built-in ref() so dbt records DAG dependencies --- #}
  {% if packagename is not none %}
    {% set rel = builtins.ref(packagename, modelname, version=version) %}
  {% else %}
    {% set rel = builtins.ref(modelname, version=version) %}
  {% endif %}

  {# --- 3) Read override env from dbt vars (--vars flag or dbt_project.yml) --- #}
  {% set override_env = var('source_data_env', none) %}
  {% if override_env %}
    {% set override_env = override_env | trim %}
  {% endif %}
  
  {% if not override_env %}
    {% do return(rel) %}
  {% endif %}

  {# If Relation has no database (ephemeral models), return as-is #}
  {% if rel.database is none %}
    {% do return(rel) %}
  {% endif %}

  {# --- 4) Compute new catalog by swapping the prefix before first '_' --- #}
  {% set parts = rel.database.split('_', 1) %}
  {% if parts | length == 2 %}
    {# Catalog format: stage_monetization or prod_monetization #}
    {% set current_env = parts[0] %}
    {% set domain = parts[1] %}
    {% set new_catalog = override_env ~ '_' ~ domain %}
  {% else %}
    {# Fallback: no underscore found, just prefix #}
    {% set new_catalog = override_env ~ '_' ~ rel.database %}
  {% endif %}

  {# --- 5) Only use the swapped relation if it actually exists --- #}
  {% set candidate = adapter.get_relation(
      database=new_catalog,
      schema=rel.schema,
      identifier=rel.identifier
  ) %}

  {% if candidate is not none %}
    {# Target exists in override environment, use it #}
    {% do return(candidate) %}
  {% else %}
    {# Target doesn't exist, fall back to original #}
    {% do log("Warning: " ~ new_catalog ~ "." ~ rel.schema ~ "." ~ rel.identifier ~ 
              " not found, falling back to " ~ rel.database ~ "." ~ rel.schema ~ "." ~ rel.identifier, 
              info=True) %}
    {% do return(rel) %}
  {% endif %}
{% endmacro %}

How It Works

  1. Preserve Dependencies: Calls the original ref() macro first to maintain dbt’s DAG tracking

  2. Check Override Variable: Reads the source_data_env variable (set via --vars or dbt_project.yml)

  3. Swap Catalog Prefix: Changes stage_ prefix to prod_ (or vice versa) in the database name

  4. Verify Existence: Only uses the swapped relation if it actually exists; otherwise falls back to original

  5. Return Relation: Returns the appropriate relation object for dbt to use
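
Putting these steps together, the same model source compiles to different relations depending on the variable. A sketch, assuming a users model that normally builds into stage_monetization.core (the model path below is illustrative):

-- models/core/user_summary.sql (source is identical in both cases)
SELECT * FROM {{ ref('users') }}

-- Compiled without an override (default behavior):
SELECT * FROM stage_monetization.core.users

-- Compiled with --vars '{"source_data_env": "prod"}' (when the prod table exists):
SELECT * FROM prod_monetization.core.users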

Usage Examples

Default behavior (reads from stage):

# Normal dbt run uses stage environment
dbt run --select my_model

Override to read from production:

# Override to read prod data while developing in stage
dbt run --select my_model --vars '{"source_data_env": "prod"}'

# Useful for testing with realistic data
dbt build --select my_model+ --vars '{"source_data_env": "prod"}'

Set as project default in dbt_project.yml:

vars:
  source_data_env: "prod"  # Always read from prod by default
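
If you prefer to drive the default from the shell instead of editing dbt_project.yml, dbt's env_var() function can supply the value. A sketch, where the SOURCE_DATA_ENV variable name is only an example:

vars:
  # Empty string means no override (default stage behavior)
  source_data_env: "{{ env_var('SOURCE_DATA_ENV', '') }}"

With this in place, SOURCE_DATA_ENV=prod dbt run --select my_model reads prod data without touching any files.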

Benefits of This Approach

For Development Workflow

  • No Hardcoding: Keep using {{ ref() }} without manual path changes

  • Fast Iteration: Toggle between environments with a single flag

  • Realistic Testing: Test models with production data volumes and distributions

  • Safe Experimentation: Read prod data without risk of modifying it

For Data Quality

  • Production Validation: Verify model logic against real production data

  • Regression Testing: Compare stage outputs to prod for consistency checks

  • Debugging: Investigate data issues using actual production records

  • Migration Testing: Validate transformations before promoting to production

For Team Productivity

  • Simplified Onboarding: New team members can test with prod data immediately

  • Reduced Context Switching: No need to manually swap table references

  • Better Code Reviews: Reviewers can test PRs with production data locally

  • Consistent Patterns: Standardized approach across the team

Best Practices

Environment Naming Convention

Use consistent prefixes for catalogs across environments:

  • Stage: stage_<domain> (e.g., stage_monetization, stage_analytics)

  • Prod: prod_<domain> (e.g., prod_monetization, prod_analytics)

This naming convention enables the macro to automatically swap prefixes without additional configuration.
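
The same convention carries into the dbt connection profiles. A minimal profiles.yml sketch for dbt-databricks, where the hostnames, warehouse IDs, and the monetization domain are placeholders:

my_project:
  target: stage
  outputs:
    stage:
      type: databricks
      host: <stage-workspace>.cloud.databricks.com
      http_path: /sql/1.0/warehouses/<warehouse-id>
      catalog: stage_monetization
      schema: core
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
    prod:
      type: databricks
      host: <prod-workspace>.cloud.databricks.com
      http_path: /sql/1.0/warehouses/<warehouse-id>
      catalog: prod_monetization
      schema: core
      token: "{{ env_var('DATABRICKS_TOKEN') }}"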

Security Considerations

  • Read-Only Prod Access: Analysts should have SELECT-only permissions on production catalogs (see the grant sketch after this list)

  • Audit Logging: Enable Databricks audit logs to track cross-environment queries

  • Sensitive Data: Restrict PII/sensitive data access to prod environment only

  • Cost Monitoring: Track compute costs for cross-environment queries
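
For the read-only access noted above, permissions are typically managed with Unity Catalog grants. A sketch, assuming an analysts group (statements run by a catalog admin):

-- Allow the analysts group to browse and query, but not modify, the prod catalog
GRANT USE CATALOG ON CATALOG prod_monetization TO `analysts`;
GRANT USE SCHEMA ON CATALOG prod_monetization TO `analysts`;
GRANT SELECT ON CATALOG prod_monetization TO `analysts`;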

Development Guidelines

  • Default to Stage: Run models against stage data by default during development

  • Test with Prod: Use source_data_env=prod for final validation before deployment

  • Document Dependencies: Clearly indicate when models require production data

  • CI/CD Validation: Automated tests should validate both stage and prod reads work

Limitations and Edge Cases

The cross-environment ref macro handles several edge cases:

  • Missing Objects: Falls back to original environment if target doesn’t exist

  • Ephemeral Models: Returns as-is since they have no database

  • External Packages: Works with both project models and dbt packages

  • Version Support: Compatible with dbt model versioning (ref('model', v=2))
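
Versioned refs flow through the same logic, since the macro forwards the version/v keyword to builtins.ref():

-- Resolves the versioned relation first, then applies the same catalog swap
SELECT * FROM {{ ref('users', v=2) }}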

Common Pitfalls

Forgetting to Reset Override

If you set source_data_env: "prod" in dbt_project.yml for testing, remember to remove it before committing. Otherwise, every run that does not explicitly override the variable will read from prod.

Performance Considerations

Cross-environment queries may have higher latency due to network traversal between workspaces. For large datasets, consider:

  • Filtering data early in transformations (see the sketch after this list)

  • Using stage data for iterative development

  • Switching to prod only for final validation
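
For the early-filtering point, one option is to gate a date filter on the dbt target so that iterative runs only scan a recent window; a sketch, assuming a prod target name and an event_date column:

SELECT *
FROM {{ ref('events') }}
{% if target.name != 'prod' %}
  -- Keep development runs cheap; production runs read the full history
  WHERE event_date >= current_date() - INTERVAL 7 DAYS
{% endif %}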

Catalog Synchronization

The macro assumes stage and prod have similar catalog structures. If catalogs diverge (e.g., prod has a table that stage doesn’t), the fallback logic ensures models still run but may produce unexpected results.

Example Workflow

Here’s a typical development workflow using this environment strategy:

1. Local Development

# Develop model using stage data (fast iteration)
dbt run --select my_new_model

# Test with production data for validation
dbt run --select my_new_model --vars '{"source_data_env": "prod"}'

2. Create Pull Request

# Run tests against both environments
dbt test --select my_new_model
dbt test --select my_new_model --vars '{"source_data_env": "prod"}'

# Compare outputs
dbt run --select my_validation_model --vars '{"source_data_env": "prod"}'
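
The my_validation_model above is only a placeholder name; one possible shape is a comparison model that surfaces rows where the two environments disagree:

-- models/validation/my_validation_model.sql (sketch)
SELECT
    s.user_id,
    s.total_revenue AS stage_revenue,
    p.total_revenue AS prod_revenue
FROM stage_monetization.core.user_revenue s
JOIN prod_monetization.core.user_revenue p
    ON s.user_id = p.user_id
WHERE s.total_revenue <> p.total_revenue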

3. CI/CD Deployment

The pipeline automatically:

  • Runs models in stage environment

  • Executes full test suite

  • On merge to main, deploys to prod via automated workflow

  • No manual intervention required
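
The exact pipeline definition depends on your CI system; as a rough sketch, the steps above boil down to two invocations against the stage and prod targets from the profiles sketch earlier:

# Pull request checks build and test in the stage workspace
dbt build --target stage

# After merge to main, the deployment job builds in the prod workspace
dbt build --target prod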

Further Reading

For related topics on dbt development and testing strategies: