Ingestion Recording and Replay

Beta Feature: Recording and replay is currently in beta. The feature is stable for debugging purposes but the archive format may change in future releases.

Debug ingestion issues by capturing all external I/O (HTTP requests and database queries) during ingestion runs, then replaying them locally in an air-gapped environment with full debugger support.

Overview

The recording system captures:

  • HTTP Traffic: All requests to external APIs (Looker, PowerBI, Snowflake REST, etc.) and DataHub GMS
  • Database Queries: SQL queries and results from native database connectors (Snowflake, Redshift, BigQuery, Databricks, etc.)

Recordings are stored in encrypted, compressed archives that can be replayed offline to reproduce issues exactly as they occurred in production.

Comparing Recording and Replay Output

The recorded and replayed MCPs are semantically identical - they contain the same source data. However, certain metadata fields will differ because they reflect when MCPs are emitted, not the source data itself:

  • systemMetadata.lastObserved - timestamp of MCP emission
  • systemMetadata.runId - unique run identifier
  • auditStamp.time - audit timestamp

Use datahub check metadata-diff to compare recordings semantically:

# Compare MCPs ignoring system metadata
datahub check metadata-diff \
--ignore-path "root['*']['systemMetadata']['lastObserved']" \
--ignore-path "root['*']['systemMetadata']['runId']" \
recording_output.json replay_output.json

A successful replay will show "PERFECT SEMANTIC MATCH" when these fields are ignored.
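
If you prefer to inspect the diff programmatically, the same idea can be sketched in Python with the deepdiff library. This is illustrative only, not the implementation behind datahub check metadata-diff, and it assumes both outputs were saved as JSON lists of MCPs:

# Illustrative only: semantic comparison of two MCP dumps with deepdiff,
# excluding the emission-time fields listed above.
import json
from deepdiff import DeepDiff

with open("recording_output.json") as f:
    recorded = json.load(f)
with open("replay_output.json") as f:
    replayed = json.load(f)

diff = DeepDiff(
    recorded,
    replayed,
    ignore_order=True,
    exclude_regex_paths=[
        r"root\[\d+\]\['systemMetadata'\]\['lastObserved'\]",
        r"root\[\d+\]\['systemMetadata'\]\['runId'\]",
        r"root\[\d+\]\['auditStamp'\]\['time'\]",
    ],
)
print("PERFECT SEMANTIC MATCH" if not diff else diff)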

Installation

Install the optional debug-recording plugin:

pip install 'acryl-datahub[debug-recording]'

# Or with your source connectors
pip install 'acryl-datahub[looker,debug-recording]'

Dependencies:

  • vcrpy>=8.0.0 (Python 3.10+) or vcrpy>=7.0.0,<8.0.0 (Python 3.9) - HTTP recording/replay
  • pyzipper>=0.3.6 - AES-256 encrypted archives

Note: The recording module uses lazy imports to avoid requiring optional dependencies (like sqlalchemy) when recording is not used. This means you can install the recording plugin without pulling in database connector dependencies unless you actually use them.

Quick Start

Recording an Ingestion Run

# Record with password protection
datahub ingest run -c recipe.yaml --record --record-password mysecret

# Record without S3 upload (for local testing)
datahub ingest run -c recipe.yaml --record --record-password mysecret --no-s3-upload

The recording creates an encrypted ZIP archive containing:

  • HTTP cassette with all request/response pairs
  • Database query recordings (if applicable)
  • Redacted recipe (secrets replaced with safe markers)
  • Manifest with metadata and checksums

Replaying a Recording

# Replay in air-gapped mode (default) - no network required
datahub ingest replay recording.zip --password mysecret

# Replay with live sink - replay source data, emit to real DataHub
datahub ingest replay recording.zip --password mysecret \
--live-sink --server http://localhost:8080

Inspecting Recordings

# View archive metadata
datahub recording info recording.zip --password mysecret

# Extract archive contents
datahub recording extract recording.zip --password mysecret --output-dir ./extracted

# List contents of a recording archive
datahub recording list recording.zip --password mysecret

Configuration

Recipe Configuration

source:
  type: looker
  config:
    # ... source config ...

    # Optional recording configuration
    recording:
      enabled: true
      password: ${DATAHUB_RECORDING_PASSWORD}   # Or use --record-password CLI flag
      s3_upload: true                           # Upload directly to S3 (default: false)
      output_path: s3://my-bucket/recordings/   # Required when s3_upload=true

When s3_upload is disabled (the default), the recording is saved locally, using the following fallback order (see the sketch after this list):

  • To output_path if specified
  • To INGESTION_ARTIFACT_DIR directory if set
  • To a temp directory otherwise
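
A minimal sketch of this fallback order; the helper name is hypothetical, only the precedence comes from this page:

import os
import tempfile
from pathlib import Path

def resolve_recording_dir(output_path=None):
    """Hypothetical helper mirroring the fallback order described above."""
    if output_path:                                    # 1. explicit output_path
        return Path(output_path)
    artifact_dir = os.environ.get("INGESTION_ARTIFACT_DIR")
    if artifact_dir:                                   # 2. INGESTION_ARTIFACT_DIR
        return Path(artifact_dir)
    return Path(tempfile.gettempdir())                 # 3. temp directory otherwise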

Environment Variables

| Variable | Description |
| --- | --- |
| DATAHUB_RECORDING_PASSWORD | Default password for recording encryption |
| ADMIN_PASSWORD | Fallback password (used in managed environments) |
| INGESTION_ARTIFACT_DIR | Directory to save recordings when S3 upload is disabled. If not set, recordings are saved to a temp directory. |

CLI Options

Recording:

datahub ingest run -c recipe.yaml \
--record # Enable recording
--record-password <pwd> # Encryption password
--record-output-path <path> # Override output path (for debugging)
--no-s3-upload # Disable S3 upload
--no-secret-redaction # Keep real credentials (for local debugging)

# Or save to specific directory
export INGESTION_ARTIFACT_DIR=/path/to/recordings
datahub ingest run -c recipe.yaml --record --record-password <pwd> --no-s3-upload
# Recording saved as: /path/to/recordings/recording-{run_id}.zip

Replay:

datahub ingest replay <archive> \
--password <pwd> # Decryption password
--live-sink # Enable real GMS sink
--server <url> # GMS server for live sink
--token <token> # Auth token for live sink

Archive Format

recording-{run_id}.zip (AES-256 encrypted, LZMA compressed)
├── manifest.json        # Metadata, versions, checksums
├── recipe.yaml          # Recipe with redacted secrets
├── http/
│   └── cassette.yaml    # VCR HTTP recordings (YAML for binary data support)
└── db/
    └── queries.jsonl    # Database query recordings

Manifest Contents

{
  "format_version": "1.0.0",
  "run_id": "looker-2024-12-03-10_30_00-abc123",
  "source_type": "looker",
  "sink_type": "datahub-rest",
  "datahub_cli_version": "0.14.0",
  "python_version": "3.10.15",
  "created_at": "2024-12-03T10:35:00Z",
  "recording_start_time": "2024-12-03T10:30:00Z",
  "files": ["http/cassette.yaml", "db/queries.jsonl"],
  "checksums": { "http/cassette.yaml": "sha256:..." },
  "has_exception": false,
  "exception_info": null
}

  • source_type: The type of source connector (e.g., snowflake, looker, bigquery)
  • sink_type: The type of sink (e.g., datahub-rest, file)
  • datahub_cli_version: The DataHub CLI version used for recording
  • python_version: The Python version used for recording (e.g., "3.10.15")
  • recording_start_time: When recording began (informational)
  • has_exception: Whether the recording captured an exception
  • exception_info: Stack trace and details if an exception occurred
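
Because the archive is a standard AES-encrypted ZIP, the manifest can also be read directly with pyzipper (the dependency listed under Installation). The archive name and password below are placeholders:

# Read manifest.json straight out of an encrypted recording archive.
import json
import pyzipper

with pyzipper.AESZipFile("recording-looker-2024-12-03.zip") as zf:
    zf.setpassword(b"mysecret")
    manifest = json.loads(zf.read("manifest.json"))

print(manifest["source_type"], manifest["datahub_cli_version"])
print("captured exception:", manifest["has_exception"])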

Best Practices

1. Use Consistent Passwords

Store the recording password in a secure location (secrets manager, environment variable) and use the same password across your team:

export DATAHUB_RECORDING_PASSWORD=$(vault read -field=password secret/datahub/recording)
datahub ingest run -c recipe.yaml --record

2. Record in Production-Like Environments

For best debugging results, record in an environment that matches production:

  • Same credentials and permissions
  • Same network access
  • Same data volume (or representative sample)

3. Use Descriptive Run IDs

The archive filename includes the run_id. Use meaningful recipe names for easy identification:

# Recipe: snowflake-prod-daily.yaml
# Archive: snowflake-prod-daily-2024-12-03-10_30_00-abc123.zip

4. Test Replay Immediately

After recording, test the replay to ensure the recording is complete:

# Record (save MCP output for comparison)
datahub ingest run -c recipe.yaml --record --record-password test --no-s3-upload \
| tee recording_output.json

# Immediately test replay (save output)
datahub ingest replay /tmp/recording.zip --password test \
| tee replay_output.json

# Verify semantic equivalence
datahub check metadata-diff \
--ignore-path "root['*']['systemMetadata']['lastObserved']" \
--ignore-path "root['*']['systemMetadata']['runId']" \
recording_output.json replay_output.json

5. Include Exception Context

If recording captures an exception, the archive includes exception details:

datahub recording info recording.zip --password mysecret
# Output includes: has_exception: true, exception_info: {...}

6. Secure Archive Handling

  • Never commit recordings to source control
  • Use strong passwords (16+ characters)
  • Delete recordings after debugging is complete
  • Use S3 lifecycle policies for automatic cleanup

7. Minimize Recording Scope

For faster recordings and smaller archives, limit the scope:

source:
  type: looker
  config:
    dashboard_pattern:
      allow:
        - "^specific-dashboard-id$"

Limitations

1. Thread-Safe Recording Impact

To capture all HTTP requests reliably, recording serializes HTTP calls. This has performance implications:

| Scenario | Without Recording | With Recording |
| --- | --- | --- |
| Parallel API calls | ~10s | ~90s |
| Single-threaded | ~90s | ~90s |

Mitigation: Recording is intended for debugging, not production. Use --no-s3-upload for faster local testing.
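
The exact mechanism is not shown here, but the trade-off can be pictured as a process-wide lock around requests.Session.send; this is an illustration of the idea, not the recorder's actual implementation:

# Illustration only: forcing all HTTP calls through one lock makes them easy to
# record deterministically, at the cost of losing parallelism.
import threading
import requests

_http_lock = threading.Lock()
_original_send = requests.Session.send

def _serialized_send(self, request, **kwargs):
    with _http_lock:            # one in-flight request at a time
        return _original_send(self, request, **kwargs)

requests.Session.send = _serialized_send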

2. Timestamps Differ Between Runs

MCP metadata timestamps will always differ between recording and replay:

  • systemMetadata.lastObserved - set when MCP is emitted
  • systemMetadata.runId - unique per run
  • auditStamp.time - set during processing

Mitigation: The actual source data is identical. Use datahub check metadata-diff with --ignore-path to verify semantic equivalence (see "Comparing Recording and Replay Output" above).

3. Non-Deterministic Source Behavior

Some sources have non-deterministic behavior:

  • Random sampling or ordering of results
  • Rate limiting/retry timing variations
  • Parallel processing order

Mitigation: The replay serves recorded API responses, so data is identical. The system includes custom VCR matchers that handle non-deterministic request ordering (e.g., Looker usage queries with varying filter orders).
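
A custom matcher of this kind can be sketched with vcrpy's register_matcher hook; the matcher name and cassette path below are placeholders, not the recorder's actual matchers:

# Sketch: match requests whose query parameters are equal regardless of order.
from urllib.parse import parse_qs, urlsplit
import vcr

def unordered_query(r1, r2):
    a, b = urlsplit(r1.uri), urlsplit(r2.uri)
    return a.path == b.path and parse_qs(a.query) == parse_qs(b.query)

my_vcr = vcr.VCR()
my_vcr.register_matcher("unordered_query", unordered_query)

with my_vcr.use_cassette("http/cassette.yaml", match_on=["method", "unordered_query"]):
    ...  # requests made here match recorded ones even if filter order differs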

4. Database Connection Mocking

Database replay mocks the connection entirely - authentication is bypassed. This means:

  • Connection pooling behavior may differ
  • Transaction semantics are simplified
  • Cursor state is simulated

Mitigation: For complex database debugging, use database-specific profiling tools alongside recording.
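
To make the simplification concrete, a replay-side cursor can be pictured as a DB-API-shaped object that serves rows recorded in queries.jsonl instead of talking to a database. The class and the JSONL field names here are assumptions for illustration:

# Simplified picture of a simulated cursor: no connection, no transactions,
# just recorded rows keyed by SQL text. Field names ("query", "rows",
# "description") are assumed for illustration.
import json

class ReplayCursor:
    def __init__(self, recorded_queries):
        self._recorded = recorded_queries
        self._rows = []
        self.description = None
        self.rowcount = -1

    def execute(self, sql, params=None):
        entry = self._recorded[sql.strip()]
        self._rows = list(entry["rows"])
        self.description = entry.get("description")
        self.rowcount = len(self._rows)

    def fetchall(self):
        rows, self._rows = self._rows, []
        return rows

    def close(self):
        pass

recorded = {}
with open("extracted/db/queries.jsonl") as f:   # path from `datahub recording extract`
    for line in f:
        entry = json.loads(line)
        recorded[entry["query"].strip()] = entry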

5. Large Recordings

Recordings can be large for high-volume sources:

  • Looker with 1000+ dashboards: ~50MB
  • PowerBI with many workspaces: ~100MB
  • Snowflake with full schema extraction: ~200MB

Mitigation:

  • Use patterns to limit scope
  • Enable LZMA compression (default)
  • Use S3 for storage instead of local disk

6. Secret Handling

Secrets are redacted in the stored recipe using __REPLAY_DUMMY__ markers. During replay:

  • Pydantic validation receives valid dummy values
  • Actual API/DB calls use recorded responses (no real auth needed)
  • Some sources may have validation that fails with dummy values

Mitigation: The replay system auto-injects valid dummy values that pass common validators.
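
The redaction step can be pictured as a recursive walk over the recipe that swaps likely-secret values for the marker; which keys count as secrets is an assumption in this sketch:

# Sketch of recipe redaction with the __REPLAY_DUMMY__ marker described above.
SECRET_KEYS = {"password", "client_secret", "token", "private_key", "api_key"}

def redact(obj):
    if isinstance(obj, dict):
        return {
            k: "__REPLAY_DUMMY__" if k.lower() in SECRET_KEYS else redact(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact(item) for item in obj]
    return obj

recipe = {"source": {"type": "looker", "config": {"client_secret": "s3cr3t"}}}
print(redact(recipe))
# {'source': {'type': 'looker', 'config': {'client_secret': '__REPLAY_DUMMY__'}}}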

7. HTTP-Only for Some Sources

Sources using non-HTTP protocols cannot be fully recorded:

  • Direct TCP/binary database protocols (partially supported via db_proxy)
  • gRPC (not currently supported)
  • WebSocket (not currently supported)

Mitigation: Most sources use HTTP REST APIs which are fully supported.

8. Vendored HTTP Libraries (Snowflake, Databricks)

Some database connectors use non-standard HTTP implementations:

  • Snowflake: Uses snowflake.connector.vendored.urllib3 and vendored.requests
  • Databricks: Uses internal Thrift HTTP client

Impact: HTTP authentication calls are NOT recorded during connection setup.

Why recording still works:

  • Authentication happens once during connect()
  • SQL queries use standard DB-API cursors (no HTTP involved)
  • During replay, authentication is bypassed entirely (mock connection)
  • All SQL queries and results are perfectly recorded/replayed

What IS recorded:

  • ✅ All SQL queries via cursor.execute()
  • ✅ All query results
  • ✅ Cursor metadata (description, rowcount)

What is NOT recorded:

  • ❌ HTTP authentication calls (not needed for replay)
  • ❌ PUT/GET file operations (not used in metadata ingestion)

Automatic error handling: The recording system detects when VCR interferes with the connection and automatically retries with VCR bypassed. You'll see warnings in the logs, but the recording will succeed; SQL queries are captured normally regardless of HTTP recording status.

For debugging: SQL query recordings are sufficient for all metadata extraction scenarios.

9. Stateful Ingestion

Stateful ingestion checkpoints may behave differently during replay:

  • Recorded state may reference timestamps that don't match replay time
  • State backend calls are mocked

Mitigation: For stateful debugging, record a fresh run without existing state.

10. Memory Usage

Large recordings are loaded into memory during replay:

  • HTTP cassette is fully loaded
  • DB queries are streamed from JSONL

Mitigation: For very large recordings, extract and inspect specific parts:

datahub recording extract recording.zip --password mysecret --output-dir ./extracted
# Manually inspect http/cassette.yaml

11. Lazy Imports

The recording module uses lazy imports to avoid requiring optional dependencies when recording is not used:

  • sqlalchemy is only imported when actually recording/replaying SQLAlchemy-based sources
  • RecordingConfig, IngestionRecorder, and IngestionReplayer are imported on-demand via __getattr__
  • This allows installing the recording plugin without pulling in database connector dependencies
  • HTTP-based sources that don't depend on SQLAlchemy (e.g., Looker, PowerBI) can likewise be installed safely when recording is never used

Impact: This is transparent to users - the recording system works exactly the same, but with better dependency isolation. The debug-recording plugin is designed to be installed alongside source connectors, not as a standalone package. Dependencies like sqlalchemy are expected to be provided by the source connector itself when needed.
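
The pattern is the standard PEP 562 module-level __getattr__. A sketch of how the package's __init__.py might look; the submodule names are assumptions, only the exported names come from this page:

# Names are resolved on first access, so heavy optional dependencies
# (e.g. sqlalchemy) are imported only when actually used.
import importlib

_LAZY = {
    "RecordingConfig": ".config",
    "IngestionRecorder": ".recorder",
    "IngestionReplayer": ".replayer",
}

def __getattr__(name):
    if name in _LAZY:
        module = importlib.import_module(_LAZY[name], __package__)
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")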

Supported Sources

Fully Supported (HTTP-based)

| Source | HTTP Recording | Notes |
| --- | --- | --- |
| Looker | ✅ | Full support including SDK calls |
| PowerBI | ✅ | Full support |
| Tableau | ✅ | Full support |
| Superset | ✅ | Full support |
| Mode | ✅ | Full support |
| Sigma | ✅ | Full support |
| dbt Cloud | ✅ | Full support |
| Fivetran | ✅ | Full support |

Database Sources

| Source | HTTP Recording | DB Recording | Strategy | Notes |
| --- | --- | --- | --- | --- |
| Snowflake | ❌ Not needed | ✅ Full | Connection wrapper | Native connector wrapped at connect() |
| Redshift | N/A | ✅ Full | Connection wrapper | Native connector wrapped at connect() |
| Databricks | ❌ Not needed | ✅ Full | Connection wrapper | Native connector wrapped at connect() |
| BigQuery | ✅ (REST API) | ✅ Full | Client wrapper | Client class wrapped |
| PostgreSQL | N/A | ✅ Full | Connection.execute() wrapper | SQLAlchemy connection.execute() wrapped |
| MySQL | N/A | ✅ Full | Connection.execute() wrapper | SQLAlchemy connection.execute() wrapped |
| SQLite | N/A | ✅ Full | Connection.execute() wrapper | SQLAlchemy connection.execute() wrapped |
| MSSQL | N/A | ✅ Full | Connection.execute() wrapper | SQLAlchemy connection.execute() wrapped |

Note: File staging operations (PUT/GET) are not used in metadata extraction and are therefore not a concern for recording/replay.

Hybrid Recording Strategy

The recording system uses a hybrid approach that selects the best interception method for each database connector type:

1. Wrapper Strategy (Native Connectors)

  • Used for: Snowflake, Redshift, Databricks, BigQuery
  • How it works: Wraps the connector's connect() function or Client class
  • Why: These connectors have direct connect() functions that return connections we can wrap
  • Implementation: ConnectionProxy wraps the real connection, CursorProxy intercepts queries (sketched below)
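
A minimal sketch of that proxy pair; the class names follow this page, but the internals (eagerly fetching rows, the recorded fields) are illustrative simplifications:

class CursorProxy:
    """Pass queries through to the real cursor while recording query + results."""

    def __init__(self, real_cursor, sink):
        self._cursor = real_cursor
        self._sink = sink            # e.g. a list, later written to queries.jsonl
        self._rows = []

    def execute(self, sql, params=None):
        if params is None:
            self._cursor.execute(sql)
        else:
            self._cursor.execute(sql, params)
        # Simplified: eagerly materialize rows so they can be recorded.
        self._rows = self._cursor.fetchall()
        self._sink.append({"query": sql, "rows": self._rows,
                           "description": self._cursor.description})
        return self

    def fetchall(self):
        return self._rows

    def __getattr__(self, name):     # delegate everything else to the real cursor
        return getattr(self._cursor, name)

class ConnectionProxy:
    """Wrap a real DB-API connection so cursor() hands back CursorProxy objects."""

    def __init__(self, real_connection, sink):
        self._conn = real_connection
        self._sink = sink

    def cursor(self, *args, **kwargs):
        return CursorProxy(self._conn.cursor(*args, **kwargs), self._sink)

    def __getattr__(self, name):
        return getattr(self._conn, name)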

2. Connection.execute() Wrapper Strategy (SQLAlchemy-based)

  • Used for: PostgreSQL, MySQL, SQLite, MSSQL, and other SQLAlchemy-based sources
  • How it works: Wraps engine.connect() to intercept connections, then wraps connection.execute() to capture queries and results
  • Why: SQLAlchemy 2.x uses Result objects that are best captured at the execute() level, avoiding import reference issues with modules that import create_engine directly
  • Implementation: Wraps engine.connect() to return connections with wrapped execute() methods that materialize and record results (sketched below)
  • Benefits: Works even when modules import create_engine directly (e.g., from sqlalchemy import create_engine), avoiding stale reference issues
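
A sketch of the execute()-level wrapping; SQLite is used only as a stand-in engine, and a real implementation would return a Result-compatible wrapper rather than a plain row list:

import sqlalchemy

class RecordingConnection:
    """Delegate to a real SQLAlchemy connection, recording each execute()."""

    def __init__(self, conn, sink):
        self._conn = conn
        self._sink = sink

    def execute(self, statement, *args, **kwargs):
        result = self._conn.execute(statement, *args, **kwargs)
        rows = result.fetchall()                      # materialize the Result
        self._sink.append({"query": str(statement),
                           "rows": [tuple(r) for r in rows]})
        return rows   # simplification: real code would wrap this as a Result

    def __getattr__(self, name):
        return getattr(self._conn, name)

def recording_connect(engine, sink):
    """Stand-in for a wrapped engine.connect()."""
    return RecordingConnection(engine.connect(), sink)

# Usage
engine = sqlalchemy.create_engine("sqlite://")
sink = []
conn = recording_connect(engine, sink)
conn.execute(sqlalchemy.text("SELECT 1 AS answer"))
print(sink)   # [{'query': 'SELECT 1 AS answer', 'rows': [(1,)]}]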

Why Different Strategies?

  • Native connectors (Snowflake, Redshift) expose direct connect() functions that are easy to wrap
  • SQLAlchemy-based sources use connection pooling and engines. Wrapping connection.execute() captures Result objects directly, avoiding issues with modules that import create_engine directly
  • BigQuery uses a Client class pattern, requiring class-level wrapping

Both strategies achieve the same goal of intercepting SQL queries and results for recording/replay, but each uses the method best suited to the connector's architecture.

Database Connection Architecture

Database sources have a two-phase execution model:

Phase 1: Authentication (During connect())

  • Uses source-specific HTTP clients (may be vendored/custom)
  • NOT recorded (but also not needed during replay)
  • During replay: Bypassed entirely with mock connection
  • Automatic retry if VCR interferes with connection

Phase 2: SQL Execution (After connect())

  • Uses standard Python DB-API 2.0 cursor interface
  • Fully recorded via CursorProxy (works for both wrapper strategies)
  • Protocol-agnostic (works for any DB-API connector)
  • During replay: Served from recorded queries.jsonl

This architecture makes recording resilient to HTTP library changes while maintaining perfect SQL replay fidelity. For Snowflake and Databricks, all metadata extraction happens via SQL queries in Phase 2, making HTTP recording unnecessary.

DataHub Backend

| Component | Recording | Notes |
| --- | --- | --- |
| GMS REST API | ✅ | Sink emissions captured |
| GraphQL API | ✅ | If used by source |
| Stateful Backend | ✅ | Checkpoint calls captured |

Troubleshooting

"Module not found: vcrpy"

Install the debug-recording plugin:

pip install 'acryl-datahub[debug-recording]'

"Checksum verification failed"

The archive may be corrupted. Re-download or re-record:

datahub recording info recording.zip --password mysecret
# Check for checksum errors in output
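
You can also recompute the checksums yourself against an extracted archive; this assumes the sha256:<hexdigest> format shown in the manifest example above:

# Compare recomputed SHA-256 digests with the manifest's checksums field.
import hashlib
import json
from pathlib import Path

extracted = Path("./extracted")   # from `datahub recording extract`
manifest = json.loads((extracted / "manifest.json").read_text())

for rel_path, expected in manifest["checksums"].items():
    digest = hashlib.sha256((extracted / rel_path).read_bytes()).hexdigest()
    status = "OK" if f"sha256:{digest}" == expected else "MISMATCH"
    print(f"{status}  {rel_path}")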

"No match for request" during replay

The recorded cassette doesn't have a matching request. This can happen if:

  1. Recording was incomplete (check has_exception in manifest)
  2. Source behavior changed between recording and replay
  3. Different credentials caused different API paths

Solution: Re-record with the exact same configuration.

Replay produces different event count

A small difference in event count (e.g., 3259 vs 3251) is normal due to:

  • Duplicate MCP emissions during recording
  • Timing-dependent code paths
  • Non-deterministic processing order

Verification: Use datahub check metadata-diff to confirm semantic equivalence:

datahub check metadata-diff \
--ignore-path "root['*']['systemMetadata']['lastObserved']" \
--ignore-path "root['*']['systemMetadata']['runId']" \
recording_output.json replay_output.json

A "PERFECT SEMANTIC MATCH" confirms the replay is correct despite count differences.

Recording takes too long

HTTP requests are serialized during recording for reliability. To speed up:

  1. Reduce source scope with patterns
  2. Use --no-s3-upload for local testing
  3. Accept that recording is slower than normal ingestion

Archive too large for S3 upload

Large archives may timeout during upload:

# Record locally first
datahub ingest run -c recipe.yaml --record --record-password mysecret --no-s3-upload

# Upload manually with multipart
# Note: stat -f%z is the macOS/BSD form; on Linux use stat -c%s
aws s3 cp recording.zip s3://bucket/recordings/ --expected-size $(stat -f%z recording.zip)

Architecture

┌──────────────────────────────────────────────────────────┐
│                    IngestionRecorder                      │
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────┐ │
│ │  HTTPRecorder   │ │  ModulePatcher  │ │ QueryRecorder│ │
│ │    (VCR.py)     │ │  (DB proxies)   │ │   (JSONL)    │ │
│ └────────┬────────┘ └────────┬────────┘ └──────┬───────┘ │
│          │                   │                 │         │
│          ▼                   ▼                 ▼         │
│ ┌───────────────────────────────────────────────────────┐│
│ │                   Encrypted Archive                  ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────┐││
│ │ │ manifest │ │  recipe  │ │ cassette │ │queries.jsonl│││
│ │ │  .json   │ │  .yaml   │ │  .yaml   │ │             │││
│ │ └──────────┘ └──────────┘ └──────────┘ └─────────────┘││
│ └───────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────┘


┌──────────────────────────────────────────────────────────┐
│                    IngestionReplayer                     │
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────┐ │
│ │  HTTPReplayer   │ │  ReplayPatcher  │ │ QueryReplayer│ │
│ │  (VCR replay)   │ │  (Mock conns)   │ │ (Mock cursor)│ │
│ └─────────────────┘ └─────────────────┘ └──────────────┘ │
│                             │                            │
│                             ▼                            │
│             ┌──────────────────────────────┐             │
│             │      Air-Gapped Replay       │             │
│             │ - No network required        │             │
│             │ - Full debugger support      │             │
│             │ - Exact reproduction         │             │
│             └──────────────────────────────┘             │
└──────────────────────────────────────────────────────────┘

Contributing

When adding new source connectors:

  1. HTTP-based sources work automatically via VCR
  2. Database sources may need additions to patcher.py for their specific connector
  3. Test recording and replay with the new source before releasing

See Also