Ingestion Recording and Replay

Beta Feature: Recording and replay is currently in beta. The feature is stable for debugging purposes but the archive format may change in future releases.

Debug ingestion issues by capturing all external I/O (HTTP requests and database queries) during ingestion runs, then replaying them locally in an air-gapped environment with full debugger support.

Overview

The recording system captures:

  • HTTP Traffic: All requests to external APIs (Looker, PowerBI, Snowflake REST, etc.) and DataHub GMS
  • Database Queries: SQL queries and results from native database connectors (Snowflake, Redshift, BigQuery, Databricks, etc.)

Recordings are stored in encrypted, compressed archives that can be replayed offline to reproduce issues exactly as they occurred in production.

Comparing Recording and Replay Output

The recorded and replayed MCPs are semantically identical - they contain the same source data. However, certain metadata fields will differ because they reflect when MCPs are emitted, not the source data itself:

  • systemMetadata.lastObserved - timestamp of MCP emission
  • systemMetadata.runId - unique run identifier
  • auditStamp.time - audit timestamp

Use datahub check metadata-diff to compare recordings semantically:

# Compare MCPs ignoring system metadata
datahub check metadata-diff \
--ignore-path "root['*']['systemMetadata']['lastObserved']" \
--ignore-path "root['*']['systemMetadata']['runId']" \
recording_output.json replay_output.json

A successful replay will show "PERFECT SEMANTIC MATCH" when these fields are ignored.
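
If you prefer to inspect the diff programmatically, the same idea can be sketched in Python with the deepdiff library. This is illustrative only, not the implementation behind datahub check metadata-diff, and it assumes both outputs were saved as JSON lists of MCPs:

# Illustrative only: semantic comparison of two MCP dumps with deepdiff,
# excluding the emission-time fields listed above.
import json
from deepdiff import DeepDiff

with open("recording_output.json") as f:
    recorded = json.load(f)
with open("replay_output.json") as f:
    replayed = json.load(f)

diff = DeepDiff(
    recorded,
    replayed,
    ignore_order=True,
    exclude_regex_paths=[
        r"root\[\d+\]\['systemMetadata'\]\['lastObserved'\]",
        r"root\[\d+\]\['systemMetadata'\]\['runId'\]",
        r"root\[\d+\]\['auditStamp'\]\['time'\]",
    ],
)
print("PERFECT SEMANTIC MATCH" if not diff else diff)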

Installation

Install the optional debug-recording plugin:

pip install 'acryl-datahub[debug-recording]'

# Or with your source connectors
pip install 'acryl-datahub[looker,debug-recording]'

Dependencies:

  • vcrpy>=8.0.0 (Python 3.10+) or vcrpy>=7.0.0,<8.0.0 (Python 3.9) - HTTP recording/replay
  • pyzipper>=0.3.6 - AES-256 encrypted archives

Note: The recording module uses lazy imports to avoid requiring optional dependencies (like sqlalchemy) when recording is not used. This means you can install the recording plugin without pulling in database connector dependencies unless you actually use them.

Quick Start

Recording an Ingestion Run

# Record with password protection
datahub ingest run -c recipe.yaml --record --record-password mysecret

# Record without S3 upload (for local testing)
datahub ingest run -c recipe.yaml --record --record-password mysecret --no-s3-upload

The recording creates an encrypted ZIP archive containing:

  • HTTP cassette with all request/response pairs
  • Database query recordings (if applicable)
  • Redacted recipe (secrets replaced with safe markers)
  • Manifest with metadata and checksums

Replaying a Recording

# Replay in air-gapped mode (default) - no network required
datahub ingest replay recording.zip --password mysecret

# Replay with live sink - replay source data, emit to real DataHub
datahub ingest replay recording.zip --password mysecret \
--live-sink --server http://localhost:8080

Inspecting Recordings

# View archive metadata
datahub recording info recording.zip --password mysecret

# Extract archive contents
datahub recording extract recording.zip --password mysecret --output-dir ./extracted

# List contents of a recording archive
datahub recording list recording.zip --password mysecret

Configuration

Recipe Configuration

source:
  type: looker
  config:
    # ... source config ...

    # Optional recording configuration
    recording:
      enabled: true
      password: ${DATAHUB_RECORDING_PASSWORD}   # Or use --record-password CLI flag
      s3_upload: true                           # Upload directly to S3 (default: false)
      output_path: s3://my-bucket/recordings/   # Required when s3_upload=true

When s3_upload is disabled (the default), the recording is saved locally, using the following fallback order (see the sketch after this list):

  • To output_path if specified
  • To INGESTION_ARTIFACT_DIR directory if set
  • To a temp directory otherwise
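
A minimal sketch of this fallback order; the helper name is hypothetical, only the precedence comes from this page:

import os
import tempfile
from pathlib import Path

def resolve_recording_dir(output_path=None):
    """Hypothetical helper mirroring the fallback order described above."""
    if output_path:                                    # 1. explicit output_path
        return Path(output_path)
    artifact_dir = os.environ.get("INGESTION_ARTIFACT_DIR")
    if artifact_dir:                                   # 2. INGESTION_ARTIFACT_DIR
        return Path(artifact_dir)
    return Path(tempfile.gettempdir())                 # 3. temp directory otherwise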

Environment Variables

| Variable | Description |
| --- | --- |
| DATAHUB_RECORDING_PASSWORD | Default password for recording encryption |
| ADMIN_PASSWORD | Fallback password (used in managed environments) |
| INGESTION_ARTIFACT_DIR | Directory to save recordings when S3 upload is disabled. If not set, recordings are saved to a temp directory. |

CLI Options

Recording:

datahub ingest run -c recipe.yaml \
--record # Enable recording
--record-password <pwd> # Encryption password
--record-output-path <path> # Override output path (for debugging)
--no-s3-upload # Disable S3 upload
--no-secret-redaction # Keep real credentials (for local debugging)

# Or save to specific directory
export INGESTION_ARTIFACT_DIR=/path/to/recordings
datahub ingest run -c recipe.yaml --record --record-password <pwd> --no-s3-upload
# Recording saved as: /path/to/recordings/recording-{run_id}.zip

Replay:

datahub ingest replay <archive> \
--password <pwd> # Decryption password
--live-sink # Enable real GMS sink
--server <url> # GMS server for live sink
--token <token> # Auth token for live sink

Archive Format

recording-{run_id}.zip (AES-256 encrypted, LZMA compressed)
├── manifest.json        # Metadata, versions, checksums
├── recipe.yaml          # Recipe with redacted secrets
├── http/
│   └── cassette.yaml    # VCR HTTP recordings (YAML for binary data support)
└── db/
    └── queries.jsonl    # Database query recordings

Manifest Contents

{
  "format_version": "1.0.0",
  "run_id": "looker-2024-12-03-10_30_00-abc123",
  "source_type": "looker",
  "sink_type": "datahub-rest",
  "datahub_cli_version": "0.14.0",
  "python_version": "3.10.15",
  "created_at": "2024-12-03T10:35:00Z",
  "recording_start_time": "2024-12-03T10:30:00Z",
  "files": ["http/cassette.yaml", "db/queries.jsonl"],
  "checksums": { "http/cassette.yaml": "sha256:..." },
  "has_exception": false,
  "exception_info": null
}

  • source_type: The type of source connector (e.g., snowflake, looker, bigquery)
  • sink_type: The type of sink (e.g., datahub-rest, file)
  • datahub_cli_version: The DataHub CLI version used for recording
  • python_version: The Python version used for recording (e.g., "3.10.15")
  • recording_start_time: When recording began (informational)
  • has_exception: Whether the recording captured an exception
  • exception_info: Stack trace and details if an exception occurred
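
Because the archive is a standard AES-encrypted ZIP, the manifest can also be read directly with pyzipper (the dependency listed under Installation). The archive name and password below are placeholders:

# Read manifest.json straight out of an encrypted recording archive.
import json
import pyzipper

with pyzipper.AESZipFile("recording-looker-2024-12-03.zip") as zf:
    zf.setpassword(b"mysecret")
    manifest = json.loads(zf.read("manifest.json"))

print(manifest["source_type"], manifest["datahub_cli_version"])
print("captured exception:", manifest["has_exception"])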

Best Practices

1. Use Consistent Passwords

Store the recording password in a secure location (secrets manager, environment variable) and use the same password across your team:

export DATAHUB_RECORDING_PASSWORD=$(vault read -field=password secret/datahub/recording)
datahub ingest run -c recipe.yaml --record

2. Record in Production-Like Environments

For best debugging results, record in an environment that matches production:

  • Same credentials and permissions
  • Same network access
  • Same data volume (or representative sample)

3. Use Descriptive Run IDs

The archive filename includes the run_id. Use meaningful recipe names for easy identification:

# Recipe: snowflake-prod-daily.yaml
# Archive: snowflake-prod-daily-2024-12-03-10_30_00-abc123.zip

4. Test Replay Immediately

After recording, test the replay to ensure the recording is complete:

# Record (save MCP output for comparison)
datahub ingest run -c recipe.yaml --record --record-password test --no-s3-upload \
| tee recording_output.json

# Immediately test replay (save output)
datahub ingest replay /tmp/recording.zip --password test \
| tee replay_output.json

# Verify semantic equivalence
datahub check metadata-diff \
--ignore-path "root['*']['systemMetadata']['lastObserved']" \
--ignore-path "root['*']['systemMetadata']['runId']" \
recording_output.json replay_output.json

5. Include Exception Context

If recording captures an exception, the archive includes exception details:

datahub recording info recording.zip --password mysecret
# Output includes: has_exception: true, exception_info: {...}

6. Secure Archive Handling

  • Never commit recordings to source control
  • Use strong passwords (16+ characters)
  • Delete recordings after debugging is complete
  • Use S3 lifecycle policies for automatic cleanup

7. Minimize Recording Scope

For faster recordings and smaller archives, limit the scope:

source:
  type: looker
  config:
    dashboard_pattern:
      allow:
        - "^specific-dashboard-id$"

Limitations

1. Thread-Safe Recording Impact

To capture all HTTP requests reliably, recording serializes HTTP calls. This has performance implications:

| Scenario | Without Recording | With Recording |
| --- | --- | --- |
| Parallel API calls | ~10s | ~90s |
| Single-threaded | ~90s | ~90s |

Mitigation: Recording is intended for debugging, not production. Use --no-s3-upload for faster local testing.
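
The exact mechanism is not shown here, but the trade-off can be pictured as a process-wide lock around requests.Session.send; this is an illustration of the idea, not the recorder's actual implementation:

# Illustration only: forcing all HTTP calls through one lock makes them easy to
# record deterministically, at the cost of losing parallelism.
import threading
import requests

_http_lock = threading.Lock()
_original_send = requests.Session.send

def _serialized_send(self, request, **kwargs):
    with _http_lock:            # one in-flight request at a time
        return _original_send(self, request, **kwargs)

requests.Session.send = _serialized_send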

2. Timestamps Differ Between Runs

MCP metadata timestamps will always differ between recording and replay:

  • systemMetadata.lastObserved - set when MCP is emitted
  • systemMetadata.runId - unique per run
  • auditStamp.time - set during processing

Mitigation: The actual source data is identical. Use datahub check metadata-diff with --ignore-path to verify semantic equivalence (see "Comparing Recording and Replay Output" above).

3. Non-Deterministic Source Behavior

Some sources have non-deterministic behavior:

  • Random sampling or ordering of results
  • Rate limiting/retry timing variations
  • Parallel processing order

Mitigation: The replay serves recorded API responses, so data is identical. The system includes custom VCR matchers that handle non-deterministic request ordering (e.g., Looker usage queries with varying filter orders).
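
A custom matcher of this kind can be sketched with vcrpy's register_matcher hook; the matcher name and cassette path below are placeholders, not the recorder's actual matchers:

# Sketch: match requests whose query parameters are equal regardless of order.
from urllib.parse import parse_qs, urlsplit
import vcr

def unordered_query(r1, r2):
    a, b = urlsplit(r1.uri), urlsplit(r2.uri)
    return a.path == b.path and parse_qs(a.query) == parse_qs(b.query)

my_vcr = vcr.VCR()
my_vcr.register_matcher("unordered_query", unordered_query)

with my_vcr.use_cassette("http/cassette.yaml", match_on=["method", "unordered_query"]):
    ...  # requests made here match recorded ones even if filter order differs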

4. Database Connection Mocking

Database replay mocks the connection entirely - authentication is bypassed. This means:

  • Connection pooling behavior may differ
  • Transaction semantics are simplified
  • Cursor state is simulated

Mitigation: For complex database debugging, use database-specific profiling tools alongside recording.
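
To make the simplification concrete, a replay-side cursor can be pictured as a DB-API-shaped object that serves rows recorded in queries.jsonl instead of talking to a database. The class and the JSONL field names here are assumptions for illustration:

# Simplified picture of a simulated cursor: no connection, no transactions,
# just recorded rows keyed by SQL text. Field names ("query", "rows",
# "description") are assumed for illustration.
import json

class ReplayCursor:
    def __init__(self, recorded_queries):
        self._recorded = recorded_queries
        self._rows = []
        self.description = None
        self.rowcount = -1

    def execute(self, sql, params=None):
        entry = self._recorded[sql.strip()]
        self._rows = list(entry["rows"])
        self.description = entry.get("description")
        self.rowcount = len(self._rows)

    def fetchall(self):
        rows, self._rows = self._rows, []
        return rows

    def close(self):
        pass

recorded = {}
with open("extracted/db/queries.jsonl") as f:   # path from `datahub recording extract`
    for line in f:
        entry = json.loads(line)
        recorded[entry["query"].strip()] = entry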

5. Large Recordings

Recordings can be large for high-volume sources:

  • Looker with 1000+ dashboards: ~50MB
  • PowerBI with many workspaces: ~100MB
  • Snowflake with full schema extraction: ~200MB

Mitigation:

  • Use patterns to limit scope
  • Enable LZMA compression (default)
  • Use S3 for storage instead of local disk

6. Secret Handling

Secrets are redacted in the stored recipe using __REPLAY_DUMMY__ markers. During replay:

  • Pydantic validation receives valid dummy values
  • Actual API/DB calls use recorded responses (no real auth needed)
  • Some sources may have validation that fails with dummy values

Mitigation: The replay system auto-injects valid dummy values that pass common validators.
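
The redaction step can be pictured as a recursive walk over the recipe that swaps likely-secret values for the marker; which keys count as secrets is an assumption in this sketch:

# Sketch of recipe redaction with the __REPLAY_DUMMY__ marker described above.
SECRET_KEYS = {"password", "client_secret", "token", "private_key", "api_key"}

def redact(obj):
    if isinstance(obj, dict):
        return {
            k: "__REPLAY_DUMMY__" if k.lower() in SECRET_KEYS else redact(v)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact(item) for item in obj]
    return obj

recipe = {"source": {"type": "looker", "config": {"client_secret": "s3cr3t"}}}
print(redact(recipe))
# {'source': {'type': 'looker', 'config': {'client_secret': '__REPLAY_DUMMY__'}}}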

7. HTTP-Only for Some Sources

Sources using non-HTTP protocols cannot be fully recorded:

  • Direct TCP/binary database protocols (partially supported via db_proxy)
  • gRPC (not currently supported)
  • WebSocket (not currently supported)

Mitigation: Most sources use HTTP REST APIs which are fully supported.

8. Vendored HTTP Libraries (Snowflake, Databricks)

Some database connectors use non-standard HTTP implementations:

  • Snowflake: Uses snowflake.connector.vendored.urllib3 and vendored.requests
  • Databricks: Uses internal Thrift HTTP client

Impact: HTTP authentication calls are NOT recorded during connection setup.

Why recording still works:

  • Authentication happens once during connect()
  • SQL queries use standard DB-API cursors (no HTTP involved)
  • During replay, authentication is bypassed entirely (mock connection)
  • All SQL queries and results are perfectly recorded/replayed

What IS recorded:

  • ✅ All SQL queries via cursor.execute()
  • ✅ All query results
  • ✅ Cursor metadata (description, rowcount)

What is NOT recorded:

  • ❌ HTTP authentication calls (not needed for replay)
  • ❌ PUT/GET file operations (not used in metadata ingestion)

Automatic error handling: The recording system detects when VCR interferes with the connection and automatically retries with VCR bypassed. You'll see warnings in the logs, but the recording will succeed; SQL queries are captured normally regardless of HTTP recording status.

For debugging: SQL query recordings are sufficient for all metadata extraction scenarios.

9. Stateful Ingestion

Stateful ingestion checkpoints may behave differently during replay:

  • Recorded state may reference timestamps that don't match replay time
  • State backend calls are mocked

Mitigation: For stateful debugging, record a fresh run without existing state.

10. Memory Usage

Large recordings are loaded into memory during replay:

  • HTTP cassette is fully loaded
  • DB queries are streamed from JSONL

Mitigation: For very large recordings, extract and inspect specific parts:

datahub recording extract recording.zip --password mysecret --output-dir ./extracted
# Manually inspect http/cassette.yaml

11. Lazy Imports

The recording module uses lazy imports to avoid requiring optional dependencies when recording is not used:

  • sqlalchemy is only imported when actually recording/replaying SQLAlchemy-based sources
  • RecordingConfig, IngestionRecorder, and IngestionReplayer are imported on-demand via __getattr__
  • This allows installing the recording plugin without pulling in database connector dependencies
  • HTTP-based sources that don't depend on SQLAlchemy (e.g., Looker, PowerBI) can likewise be installed safely when recording is never used

Impact: This is transparent to users - the recording system works exactly the same, but with better dependency isolation. The debug-recording plugin is designed to be installed alongside source connectors, not as a standalone package. Dependencies like sqlalchemy are expected to be provided by the source connector itself when needed.
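
The pattern is the standard PEP 562 module-level __getattr__. A sketch of how the package's __init__.py might look; the submodule names are assumptions, only the exported names come from this page:

# Names are resolved on first access, so heavy optional dependencies
# (e.g. sqlalchemy) are imported only when actually used.
import importlib

_LAZY = {
    "RecordingConfig": ".config",
    "IngestionRecorder": ".recorder",
    "IngestionReplayer": ".replayer",
}

def __getattr__(name):
    if name in _LAZY:
        module = importlib.import_module(_LAZY[name], __package__)
        return getattr(module, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")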

Supported Sources

Fully Supported (HTTP-based)

| Source | HTTP Recording | Notes |
| --- | --- | --- |
| Looker | ✅ | Full support including SDK calls |
| PowerBI | ✅ | Full support |
| Tableau | ✅ | Full support |
| Superset | ✅ | Full support |
| Mode | ✅ | Full support |
| Sigma | ✅ | Full support |
| dbt Cloud | ✅ | Full support |
| Fivetran | ✅ | Full support |

Database Sources

| Source | HTTP Recording | DB Recording | Strategy | Notes |
| --- | --- | --- | --- | --- |
| Snowflake | ❌ Not needed | ✅ Full | Connection wrapper | Native connector wrapped at connect() |
| Redshift | N/A | ✅ Full | Connection wrapper | Native connector wrapped at connect() |
| Databricks | ❌ Not needed | ✅ Full | Connection wrapper | Native connector wrapped at connect() |
| BigQuery | ✅ (REST API) | ✅ Full | Client wrapper | Client class wrapped |
| PostgreSQL | N/A | ✅ Full | Connection.execute() wrapper | SQLAlchemy connection.execute() wrapped |
| MySQL | N/A | ✅ Full | Connection.execute() wrapper | SQLAlchemy connection.execute() wrapped |
| SQLite | N/A | ✅ Full | Connection.execute() wrapper | SQLAlchemy connection.execute() wrapped |
| MSSQL | N/A | ✅ Full | Connection.execute() wrapper | SQLAlchemy connection.execute() wrapped |

Note: File staging operations (PUT/GET) are not used in metadata extraction and are therefore not a concern for recording/replay.

Hybrid Recording Strategy

The recording system uses a hybrid approach that selects the best interception method for each database connector type:

1. Wrapper Strategy (Native Connectors)

  • Used for: Snowflake, Redshift, Databricks, BigQuery
  • How it works: Wraps the connector's connect() function or Client class
  • Why: These connectors have direct connect() functions that return connections we can wrap
  • Implementation: ConnectionProxy wraps the real connection, CursorProxy intercepts queries (sketched below)
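
A minimal sketch of that proxy pair; the class names follow this page, but the internals (eagerly fetching rows, the recorded fields) are illustrative simplifications:

class CursorProxy:
    """Pass queries through to the real cursor while recording query + results."""

    def __init__(self, real_cursor, sink):
        self._cursor = real_cursor
        self._sink = sink            # e.g. a list, later written to queries.jsonl
        self._rows = []

    def execute(self, sql, params=None):
        if params is None:
            self._cursor.execute(sql)
        else:
            self._cursor.execute(sql, params)
        # Simplified: eagerly materialize rows so they can be recorded.
        self._rows = self._cursor.fetchall()
        self._sink.append({"query": sql, "rows": self._rows,
                           "description": self._cursor.description})
        return self

    def fetchall(self):
        return self._rows

    def __getattr__(self, name):     # delegate everything else to the real cursor
        return getattr(self._cursor, name)

class ConnectionProxy:
    """Wrap a real DB-API connection so cursor() hands back CursorProxy objects."""

    def __init__(self, real_connection, sink):
        self._conn = real_connection
        self._sink = sink

    def cursor(self, *args, **kwargs):
        return CursorProxy(self._conn.cursor(*args, **kwargs), self._sink)

    def __getattr__(self, name):
        return getattr(self._conn, name)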

2. Connection.execute() Wrapper Strategy (SQLAlchemy-based)

  • Used for: PostgreSQL, MySQL, SQLite, MSSQL, and other SQLAlchemy-based sources
  • How it works: Wraps engine.connect() to intercept connections, then wraps connection.execute() to capture queries and results
  • Why: SQLAlchemy 2.x uses Result objects that are best captured at the execute() level, avoiding import reference issues with modules that import create_engine directly
  • Implementation: Wraps engine.connect() to return connections with wrapped execute() methods that materialize and record results (sketched below)
  • Benefits: Works even when modules import create_engine directly (e.g., from sqlalchemy import create_engine), avoiding stale reference issues
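
A sketch of the execute()-level wrapping; SQLite is used only as a stand-in engine, and a real implementation would return a Result-compatible wrapper rather than a plain row list:

import sqlalchemy

class RecordingConnection:
    """Delegate to a real SQLAlchemy connection, recording each execute()."""

    def __init__(self, conn, sink):
        self._conn = conn
        self._sink = sink

    def execute(self, statement, *args, **kwargs):
        result = self._conn.execute(statement, *args, **kwargs)
        rows = result.fetchall()                      # materialize the Result
        self._sink.append({"query": str(statement),
                           "rows": [tuple(r) for r in rows]})
        return rows   # simplification: real code would wrap this as a Result

    def __getattr__(self, name):
        return getattr(self._conn, name)

def recording_connect(engine, sink):
    """Stand-in for a wrapped engine.connect()."""
    return RecordingConnection(engine.connect(), sink)

# Usage
engine = sqlalchemy.create_engine("sqlite://")
sink = []
conn = recording_connect(engine, sink)
conn.execute(sqlalchemy.text("SELECT 1 AS answer"))
print(sink)   # [{'query': 'SELECT 1 AS answer', 'rows': [(1,)]}]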

Why Different Strategies?

  • Native connectors (Snowflake, Redshift) expose direct connect() functions that are easy to wrap
  • SQLAlchemy-based sources use connection pooling and engines. Wrapping connection.execute() captures Result objects directly, avoiding issues with modules that import create_engine directly
  • BigQuery uses a Client class pattern, requiring class-level wrapping

Both strategies achieve the same goal of intercepting SQL queries and results for recording/replay, but each uses the method best suited to the connector's architecture.

Database Connection Architecture

Database sources have a two-phase execution model:

Phase 1: Authentication (During connect())

  • Uses source-specific HTTP clients (may be vendored/custom)
  • NOT recorded (but also not needed during replay)
  • During replay: Bypassed entirely with mock connection
  • Automatic retry if VCR interferes with connection

Phase 2: SQL Execution (After connect())

  • Uses standard Python DB-API 2.0 cursor interface
  • Fully recorded via CursorProxy (works for both wrapper strategies)
  • Protocol-agnostic (works for any DB-API connector)
  • During replay: Served from recorded queries.jsonl

This architecture makes recording resilient to HTTP library changes while maintaining perfect SQL replay fidelity. For Snowflake and Databricks, all metadata extraction happens via SQL queries in Phase 2, making HTTP recording unnecessary.

DataHub Backend

| Component | Recording | Notes |
| --- | --- | --- |
| GMS REST API | ✅ | Sink emissions captured |
| GraphQL API | ✅ | If used by source |
| Stateful Backend | ✅ | Checkpoint calls captured |

Troubleshooting

"Module not found: vcrpy"

Install the debug-recording plugin:

pip install 'acryl-datahub[debug-recording]'

"Checksum verification failed"

The archive may be corrupted. Re-download or re-record:

datahub recording info recording.zip --password mysecret
# Check for checksum errors in output
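
You can also recompute the checksums yourself against an extracted archive; this assumes the sha256:<hexdigest> format shown in the manifest example above:

# Compare recomputed SHA-256 digests with the manifest's checksums field.
import hashlib
import json
from pathlib import Path

extracted = Path("./extracted")   # from `datahub recording extract`
manifest = json.loads((extracted / "manifest.json").read_text())

for rel_path, expected in manifest["checksums"].items():
    digest = hashlib.sha256((extracted / rel_path).read_bytes()).hexdigest()
    status = "OK" if f"sha256:{digest}" == expected else "MISMATCH"
    print(f"{status}  {rel_path}")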

"No match for request" during replay

The recorded cassette doesn't have a matching request. This can happen if:

  1. Recording was incomplete (check has_exception in manifest)
  2. Source behavior changed between recording and replay
  3. Different credentials caused different API paths

Solution: Re-record with the exact same configuration.

Replay produces different event count

A small difference in event count (e.g., 3259 vs 3251) is normal due to:

  • Duplicate MCP emissions during recording
  • Timing-dependent code paths
  • Non-deterministic processing order

Verification: Use datahub check metadata-diff to confirm semantic equivalence:

datahub check metadata-diff \
--ignore-path "root['*']['systemMetadata']['lastObserved']" \
--ignore-path "root['*']['systemMetadata']['runId']" \
recording_output.json replay_output.json

A "PERFECT SEMANTIC MATCH" confirms the replay is correct despite count differences.

Recording takes too long

HTTP requests are serialized during recording for reliability. To speed up:

  1. Reduce source scope with patterns
  2. Use --no-s3-upload for local testing
  3. Accept that recording is slower than normal ingestion

Archive too large for S3 upload

Large archives may timeout during upload:

# Record locally first
datahub ingest run -c recipe.yaml --record --record-password mysecret --no-s3-upload

# Upload manually with multipart
# Note: stat -f%z is the macOS/BSD form; on Linux use stat -c%s
aws s3 cp recording.zip s3://bucket/recordings/ --expected-size $(stat -f%z recording.zip)

Architecture

┌──────────────────────────────────────────────────────────┐
│                    IngestionRecorder                      │
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────┐ │
│ │  HTTPRecorder   │ │  ModulePatcher  │ │ QueryRecorder│ │
│ │    (VCR.py)     │ │  (DB proxies)   │ │   (JSONL)    │ │
│ └────────┬────────┘ └────────┬────────┘ └──────┬───────┘ │
│          │                   │                 │         │
│          ▼                   ▼                 ▼         │
│ ┌───────────────────────────────────────────────────────┐│
│ │                   Encrypted Archive                  ││
│ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌─────────────┐││
│ │ │ manifest │ │  recipe  │ │ cassette │ │queries.jsonl│││
│ │ │  .json   │ │  .yaml   │ │  .yaml   │ │             │││
│ │ └──────────┘ └──────────┘ └──────────┘ └─────────────┘││
│ └───────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────┘


┌──────────────────────────────────────────────────────────┐
│                    IngestionReplayer                     │
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────┐ │
│ │  HTTPReplayer   │ │  ReplayPatcher  │ │ QueryReplayer│ │
│ │  (VCR replay)   │ │  (Mock conns)   │ │ (Mock cursor)│ │
│ └─────────────────┘ └─────────────────┘ └──────────────┘ │
│                             │                            │
│                             ▼                            │
│             ┌──────────────────────────────┐             │
│             │      Air-Gapped Replay       │             │
│             │ - No network required        │             │
│             │ - Full debugger support      │             │
│             │ - Exact reproduction         │             │
│             └──────────────────────────────┘             │
└──────────────────────────────────────────────────────────┘

Contributing

When adding new source connectors:

  1. HTTP-based sources work automatically via VCR
  2. Database sources may need additions to patcher.py for their specific connector
  3. Test recording and replay with the new source before releasing

See Also