Teradata

Overview

Teradata is an enterprise data warehouse and analytics platform. Learn more in the official Teradata documentation.

The DataHub integration for Teradata ingests metadata for databases, tables, views, and their columns. It can also capture table- and column-level lineage, usage statistics, data profiling, ownership, and stateful deletion detection.

Concept Mapping

A Teradata-specific concept mapping is still pending; the table below shows the generic concept mapping in DataHub.

Source Concept	DataHub Concept	Notes
Platform/account/project scope	Platform Instance, Container	Organizes assets within the platform context.
Core technical asset (for example table/view/topic/file)	Dataset	Primary ingested technical asset.
Schema fields / columns	SchemaField	Included when schema extraction is supported.
Ownership and collaboration principals	CorpUser, CorpGroup	Emitted by modules that support ownership and identity metadata.
Dependencies and processing relationships	Lineage edges	Available when lineage extraction is supported and enabled.
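
As a concrete illustration of the Dataset row, the sketch below assembles a DataHub dataset URN by hand for a Teradata table. It assumes the dataset name is `<database>.<table>`, as in the `MyDb.MyTable` example given for `convert_urns_to_lowercase` in the configuration table. The helper name `teradata_dataset_urn` is hypothetical and exists only for this example; the connector itself relies on DataHub's own URN builders.

```python
def teradata_dataset_urn(
    database: str,
    table: str,
    env: str = "PROD",
    lowercase: bool = False,
) -> str:
    """Build a dataset URN string for a Teradata table (illustrative only)."""
    # Assumption: the dataset name is "<database>.<table>".
    name = f"{database}.{table}"
    if lowercase:
        # Mirrors the effect described for convert_urns_to_lowercase.
        name = name.lower()
    return f"urn:li:dataset:(urn:li:dataPlatform:teradata,{name},{env})"


if __name__ == "__main__":
    print(teradata_dataset_urn("MyDb", "MyTable"))
    print(teradata_dataset_urn("MyDb", "MyTable", lowercase=True))
```

The two calls produce different URNs for the same table, which is why the lowercase setting should be chosen once and left unchanged.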

Module `teradata`

Important Capabilities

Capability	Status	Notes
Asset Containers	✅	Enabled by default. Supported for types - Database.
Column-level Lineage	✅	Optionally enabled via configuration.
Data Profiling	✅	Optionally enabled via configuration.
Dataset Usage	✅	Optionally enabled via configuration.
Descriptions	✅	Enabled by default.
Detect Deleted Entities	✅	Enabled by default when stateful ingestion is turned on.
Domains	✅	Enabled by default.
Extract Ownership	✅	Optionally enabled via configuration (extract_ownership).
Operation Capture	✅	Optionally enabled via `include_usage_statistics`; controlled by `usage.include_operational_stats`.
Platform Instance	✅	Enabled by default.
Schema Metadata	✅	Enabled by default.
Table-Level Lineage	✅	Optionally enabled via configuration.
Test Connection	✅	Enabled by default.

Overview

The `teradata` module ingests metadata from Teradata into DataHub. It is intended for production ingestion workflows; module-specific capabilities are documented below.

This plugin extracts the following:

Metadata for databases, schemas, views, and tables
Column types associated with each table
Table, row, and column statistics via optional SQL profiling

Prerequisites

Create a user that has access to the database you want to ingest.

CREATE USER datahub FROM <database> AS PASSWORD = <password> PERM = 20000000;

Grant that user the following privileges:

GRANT SELECT ON dbc.columns TO datahub;
GRANT SELECT ON dbc.databases TO datahub;
GRANT SELECT ON dbc.tables TO datahub;
GRANT SELECT ON DBC.All_RI_ChildrenV TO datahub;
GRANT SELECT ON DBC.ColumnsV TO datahub;
GRANT SELECT ON DBC.IndicesV TO datahub;
GRANT SELECT ON dbc.TableTextV TO datahub;
GRANT SELECT ON dbc.TablesV TO datahub;
GRANT SELECT ON dbc.dbqlogtbl TO datahub; -- if lineage or usage extraction is enabled

If you want to run profiling, grant SELECT permission on every table you want to profile.
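
For profiling across many databases, the grant statements can be generated rather than written by hand. The sketch below is one possible approach, not part of the connector: it assumes database-level grants of the form `GRANT SELECT ON <database> TO <user>` are acceptable in your environment, and the helper name `profiling_grants` and the database names `sales` and `finance` are placeholders. Review the output before running it against Teradata.

```python
import re
from typing import Iterable, List

# Conservative identifier check so that names are safe to interpolate into SQL.
# Assumption: plain, unquoted identifiers are sufficient for this example.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$#]*$")


def profiling_grants(databases: Iterable[str], user: str = "datahub") -> List[str]:
    """Return one GRANT SELECT statement per database for the given user."""
    if not _IDENTIFIER.match(user):
        raise ValueError(f"unsupported user name: {user!r}")
    statements = []
    for database in databases:
        if not _IDENTIFIER.match(database):
            raise ValueError(f"unsupported database name: {database!r}")
        statements.append(f"GRANT SELECT ON {database} TO {user};")
    return statements


if __name__ == "__main__":
    for statement in profiling_grants(["sales", "finance"]):
        print(statement)
```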

For lineage or usage extraction, enable query logging and set an adequate query text size limit. The default of 200 characters may be too short to capture full statements.
To set the limit for all users:
```
REPLACE QUERY LOGGING LIMIT SQLTEXT=2000 ON ALL;
```
For more on query logging, see: https://docs.teradata.com/r/Lake-Database-Reference/Database-Administration/Tracking-Query-Behavior-with-Database-Query-Logging-Operational-DBAs/SQL-Statements-to-Control-Logging/LIMIT-Logging-Options

Install the Plugin

pip install 'acryl-datahub[teradata]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

pipeline_name: my-teradata-ingestion-pipeline
source:
  type: teradata
  config:
    host_port: "myteradatainstance.teradata.com:1025"
    username: myuser
    password: mypassword
    #database_pattern:
    #  allow:
    #    - "my_database"
    #  ignoreCase: true
    include_table_lineage: true
    include_usage_statistics: true
    stateful_ingestion:
      enabled: true

    # --- Performance options for large installations ---

    # Skip column extraction for tables not altered in the last N days (recommended for
    # scheduled pipelines — set once and never update the recipe).
    #column_extraction_days_back: 3

    # Alternative: skip column extraction for tables unchanged since an absolute timestamp.
    # Set to the start time of the last successful run. Mutually exclusive with
    # column_extraction_days_back.
    #column_extraction_watermark: "2024-06-01T00:00:00Z"

    # Use dbc.ColumnsV for view columns first (faster); fall back to HELP only
    # when a column has an unknown type (e.g. derived expressions).
    #use_dbc_columns_for_views: true

    # Cap profiling to high-priority tables (uses standard GEProfilingConfig).
    #profiling:
    #  enabled: true
    #  limit: 500
    #profile_pattern:
    #  allow:
    #    - "important_db\\..*"

    # Increase if lineage queries against DBC.QryLogV time out (default: 120000 ms).
    #request_timeout_ms: 300000
sink:
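
The recipe above can also be built as a Python dictionary, which is convenient when generating recipes per environment. The sketch below is a minimal example under stated assumptions: `build_recipe` and `run_recipe` are hypothetical helper names, and the `datahub-rest` sink with server `http://localhost:8080` is an assumed placeholder, since the recipe above leaves the sink unspecified. The helper also rejects recipes that set both column-extraction options, which the recipe comments describe as mutually exclusive.

```python
from typing import Any, Dict, Optional


def build_recipe(
    host_port: str,
    username: str,
    password: str,
    days_back: Optional[int] = None,
    watermark: Optional[str] = None,
    sink_server: str = "http://localhost:8080",  # assumed placeholder
) -> Dict[str, Any]:
    """Assemble a Teradata recipe dict mirroring the starter recipe."""
    if days_back is not None and watermark is not None:
        raise ValueError(
            "column_extraction_days_back and column_extraction_watermark "
            "are mutually exclusive"
        )
    config: Dict[str, Any] = {
        "host_port": host_port,
        "username": username,
        "password": password,
        "include_table_lineage": True,
        "include_usage_statistics": True,
        "stateful_ingestion": {"enabled": True},
    }
    if days_back is not None:
        config["column_extraction_days_back"] = days_back
    if watermark is not None:
        config["column_extraction_watermark"] = watermark
    return {
        "pipeline_name": "my-teradata-ingestion-pipeline",
        "source": {"type": "teradata", "config": config},
        # Assumption: a datahub-rest sink; replace with your own sink settings.
        "sink": {"type": "datahub-rest", "config": {"server": sink_server}},
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe; requires acryl-datahub[teradata] to be installed."""
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


if __name__ == "__main__":
    recipe = build_recipe(
        "myteradatainstance.teradata.com:1025", "myuser", "mypassword", days_back=3
    )
    print(recipe["source"]["config"])
```

`run_recipe` imports DataHub lazily, so the recipe can be built and inspected on a machine where the plugin is not installed.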

Config Details

Options
Schema

Note that a `.` is used to denote nested fields in the YAML recipe.

Field	Description
host_port ✅ string	host URL
bucket_duration Enum	One of: "DAY", "HOUR"
column_extraction_days_back One of integer, null	Skip column extraction for tables/views not altered within the last N days. Computed at runtime as now() - N days, so the recipe never needs updating. A value of 3 for a daily schedule covers up to two missed runs with no gap risk. Mutually exclusive with column_extraction_watermark. Default: None
column_extraction_watermark One of string(date-time), null	Skip column extraction for tables/views whose LastAlterTimeStamp is older than this timestamp. Set to the start time of the last successful ingestion run to enable incremental column extraction. Mutually exclusive with column_extraction_days_back. At 13k tables where ~200 change per day this can reduce ingestion from hours to minutes. Default: None
connect_timeout_ms integer	Connection timeout in milliseconds when establishing Teradata connections. Default is 30000 (30 seconds). Default: 30000
connection_pool_timeout_ms integer	How long, in milliseconds, a worker thread will wait for a free connection from the pool before raising a PoolTimeoutError. PoolTimeoutError is a retryable condition: the connector will sleep with full-jitter exponential backoff and try again up to retry_max_attempts times. Increase this when parallel view processing saturates the pool on large schemas (watch num_pool_timeout_retries in the ingestion report). Decrease it to surface pool-exhaustion failures faster on small installations. Default is 60000 (60 seconds). Default: 60000
convert_urns_to_lowercase boolean	Whether to convert dataset urns to lowercase. This value is part of each dataset's URN identity, so it must stay fixed for the life of a deployment. Changing it after data has been ingested re-keys every dataset (e.g. `MyDb.MyTable` becomes `mydb.mytable`); with stateful ingestion enabled the old-cased URNs are then soft-deleted as stale while the new-cased ones are created, producing duplicate or orphaned entities. Pick one value before the first run and leave it unchanged. Default: False
database One of string, null	database (catalog) Default: None
default_db One of string, null	The default database to use for unqualified table names Default: None
end_time string(date-time)	Latest date of lineage/usage to consider. Default: Current time in UTC
extract_ownership boolean	Whether to extract ownership information for tables and views based on their creator. When enabled, the table/view creator from Teradata's system tables will be added as an owner with DATAOWNER type. Ownership is applied using OVERWRITE mode, meaning any existing ownership information (including manually added or modified owners from the UI) will be replaced. Use with caution. Default: False
include_historical_lineage boolean	Whether to include historical lineage data from PDCRINFO.DBQLSqlTbl_Hst in addition to current DBC.QryLogV data. This provides access to historical query logs that may have been archived. The historical table existence is checked automatically and gracefully falls back to current data only if not available. Default: False
include_queries boolean	Whether to generate query entities for SQL queries. Query entities provide metadata about individual SQL queries including execution timestamps, user information, and query text. Default: True
include_table_lineage boolean	Whether to include table lineage in the ingestion. This requires the table lineage feature to be enabled. Default: False
include_table_location_lineage boolean	If the source supports it, include table lineage to the underlying storage location. Default: True
include_tables boolean	Whether tables should be ingested. Default: True
include_usage_statistics boolean	Whether to generate usage statistics. Default: False
include_view_column_lineage boolean	Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled. Default: True
include_view_lineage boolean	Whether to include view lineage in the ingestion. This requires the view lineage feature to be enabled. Default: True
include_views boolean	Whether views should be ingested. Default: True
incremental_lineage boolean	When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False
lineage_fetch_batch_size integer	Number of rows fetched per batch when streaming results from DBC.QryLogV during lineage extraction. Each row can carry several KB of query_text, so larger values increase peak memory usage while smaller values increase the number of round-trips to the database. Lower this (e.g. to a few hundred, or lower still) if the ingestion process runs out of memory during lineage extraction; raise it to reduce round-trips when rows are small and network latency is high. Must be a positive integer (a batch size of 0 would fetch no rows and stall the stream). NOTE: this only reduces memory when `use_server_side_cursors` is true (the default). With client-side cursors the driver buffers the entire result set in memory before this batching applies, so lowering the batch size will not prevent out-of-memory errors in that mode — it only changes the Python iteration chunk size. Default is 5000. Default: 5000
lineage_fetch_stall_warning_seconds integer	If no lineage row batch arrives from DBC.QryLogV within this many seconds, emit a warning identifying the stalled phase. Set to 0 to disable. Default is 300 (5 minutes). Default: 300
lineage_slow_query_log_seconds number	When the total database time for a single lineage query (execute call plus all fetchmany calls, excluding downstream processing time) exceeds this many seconds, emit a warning with the query label, elapsed DB time, and the first 500 characters of the SQL text so slow queries can be identified and tuned. Note: when the driver retries a failed fetchmany call, the retry backoff sleep time is included in the measurement, so the threshold should be set well above the expected base query time. Set to 0 to disable. Default is 60 seconds. Default: 60.0
max_pool_size integer	Ceiling on the number of concurrent Teradata connections used during parallel view processing. The actual pool size is min(max_workers, max_pool_size), so this value only takes effect when max_workers exceeds it. For example, max_workers=10 with max_pool_size=13 creates a pool of 10, not 13. The upper bound of 50 is a conservative ingestion-time safety ceiling, not a Teradata system limit. Teradata's per-user MAXSESSIONS parameter is typically 64–200+ depending on the platform and user profile. Default: 13
max_workers integer	Maximum number of worker threads to use for parallel processing. Controls the level of concurrency for operations like view processing. Default: 10
options object	Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under `connect_args`.
password One of string(password), null	password Default: None
platform_instance One of string, null	The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None
request_timeout_ms integer	Request timeout in milliseconds for Teradata query execution. Increase this when queries against large system tables (e.g., DBC.QryLogV) time out silently and fall back. Default is 120000 (2 minutes). Default: 120000
retry_initial_backoff_seconds number	Seed value, in seconds, for the full-jitter exponential backoff between retry attempts. Each retry sleeps for a duration drawn uniformly from [0, min(initial * 2^attempt, 30.0)] seconds. The 30-second cap prevents runaway sleep times even when retry_max_attempts is set high (e.g. initial=1.0, attempt=10 would be 1024s without the cap). Increase this to spread retries further apart on a heavily loaded cluster; decrease it for faster recovery on transient blips. Default is 1.0. Default: 1.0
retry_max_attempts integer	Maximum total attempts (initial + retries) for retryable database operations (connect, execute, fetchmany). Retryable conditions: pool exhaustion, transaction-aborted messages, dead-socket signals at connect time, and Teradata error codes 2631/3111/3120/3597/3598/3897. Permanent errors (auth failures, permission denied, object does not exist) are never retried regardless of this setting. Worst-case added latency per operation is approximately retry_max_attempts × connection_pool_timeout_ms plus backoff sleeps (each capped at 30.0s). Increase when ingesting from a busy or flaky cluster; decrease to surface persistent errors faster. Default is 3. Default: 3
scheme string	database scheme Default: teradatasql
sqlalchemy_uri One of string, null	URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. Default: None
start_time string(date-time)	Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`). You can also specify a time relative to end_time, such as '-7 days' or '-7d'. Default: None
use_dbc_columns_for_views boolean	When True, attempt to use dbc.ColumnsV for view column metadata (faster bulk fetch) and fall back to HELP statements only for views where any column has a null/unknown ColumnType (e.g., derived expression columns). Can cut HELP calls by 80-90% for installations where most view columns have explicit types. Set to False (default) to always use HELP for views, which is the conservative but slower approach. Default: False
use_file_backed_cache boolean	Whether to use a file backed cache for the view definitions. Default: True
use_qvci boolean	Whether to use QVCI to get column information. This is faster but requires QVCI to be enabled. Default: False
use_server_side_cursors boolean	Enable server-side cursors for large result sets using SQLAlchemy's stream_results. This reduces memory usage by streaming results from the database server. Automatically falls back to client-side batching if server-side cursors are not supported. Default: True
username One of string, null	username Default: None
view_processing_heartbeat_seconds integer	How often, in seconds, to emit a 'view processing heartbeat' log line during parallel view processing. The heartbeat reports completed/in-progress counts and the longest-running view, making it possible to diagnose silent halts in the executor. Set to 0 to disable. Default is 30 seconds. Default: 30
view_processing_timeout_seconds integer	Maximum wall-clock time, in seconds, that a single view may spend in the parallel view-processing pool before the connector abandons it and moves on. Set to 0 to disable. Stalled views are reported as warnings and counted in `num_view_processing_timeouts`. This protects bulk ingestion from silent hangs when a Teradata query blocks indefinitely (e.g., on a dropped TCP connection). Default is 1800 (30 minutes). Default: 1800
env string	The environment that all assets produced by this connector belong to Default: PROD
database_pattern AllowDenyPattern	A class to store allow deny regexes
database_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
databases One of array, null	List of databases to ingest. If not specified, all databases will be ingested. Even if this is specified, databases will still be filtered by `database_pattern`. Default: None
databases.string string
domain map(str,AllowDenyPattern)	A class to store allow deny regexes
domain.`key`.allow array	List of regex patterns to include in ingestion Default: ['.*']
domain.`key`.allow.string string
domain.`key`.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
domain.`key`.deny array	List of regex patterns to exclude from ingestion. Default: []
domain.`key`.deny.string string
profile_pattern AllowDenyPattern	A class to store allow deny regexes
profile_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
table_pattern AllowDenyPattern	A class to store allow deny regexes
table_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
usage BaseUsageConfig
usage.bucket_duration Enum	One of: "DAY", "HOUR"
usage.end_time string(date-time)	Latest date of lineage/usage to consider. Default: Current time in UTC
usage.format_sql_queries boolean	Whether to format sql queries Default: False
usage.include_operational_stats boolean	Whether to display operational stats. Default: True
usage.include_read_operational_stats boolean	Whether to report read operational stats. Experimental. Default: False
usage.include_top_n_queries boolean	Whether to ingest the top_n_queries. Default: True
usage.start_time string(date-time)	Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`). You can also specify a time relative to end_time, such as '-7 days' or '-7d'. Default: None
usage.top_n_queries integer	Number of top queries to save to each table. Default: 10
usage.user_email_pattern AllowDenyPattern	A class to store allow deny regexes
usage.user_email_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
view_pattern AllowDenyPattern	A class to store allow deny regexes
view_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
classification ClassificationConfig
classification.enabled boolean	Whether classification should be used to auto-detect glossary terms Default: False
classification.info_type_to_term map(str,string)
classification.max_workers integer	Number of worker processes to use for classification. Set to 1 to disable. Default: 4
classification.sample_size integer	Number of sample values used for classification. Default: 100
classification.classifiers array	Classifiers to use to auto-detect glossary terms. If more than one classifier is configured, infotype predictions from the classifier defined later in the sequence take precedence. Default: [{'type': 'datahub', 'config': None}]
classification.classifiers.DynamicTypedClassifierConfig DynamicTypedClassifierConfig
classification.classifiers.DynamicTypedClassifierConfig.type ❓ string	The type of the classifier to use. The built-in `datahub` classifier has been removed; register a custom classifier and reference its type here.
classification.classifiers.DynamicTypedClassifierConfig.config One of object, null	The configuration required for initializing the classifier. If not specified, uses the defaults for the classifier type. Default: None
classification.column_pattern AllowDenyPattern	A class to store allow deny regexes
classification.column_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
classification.table_pattern AllowDenyPattern	A class to store allow deny regexes
classification.table_pattern.ignoreCase One of boolean, null	Whether to ignore case sensitivity during pattern matching. Default: True
profiling GEProfilingConfig
profiling.catch_exceptions boolean	Default: True
profiling.enabled boolean	Whether profiling should be done. Default: False
profiling.field_sample_values_limit integer	Upper limit for number of sample values to collect for all columns. Default: 20
profiling.include_field_distinct_count boolean	Whether to profile for the number of distinct values for each column. Default: True
profiling.include_field_distinct_value_frequencies boolean	Whether to profile for distinct value frequencies. Default: False
profiling.include_field_histogram boolean	Whether to profile for the histogram for numeric fields. Default: False
profiling.include_field_max_value boolean	Whether to profile for the max value of numeric columns. Default: True
profiling.include_field_mean_value boolean	Whether to profile for the mean value of numeric columns. Default: True
profiling.include_field_median_value boolean	Whether to profile for the median value of numeric columns. Default: True
profiling.include_field_min_value boolean	Whether to profile for the min value of numeric columns. Default: True
profiling.include_field_null_count boolean	Whether to profile for the number of nulls for each column. Default: True
profiling.include_field_quantiles boolean	Whether to profile for the quantiles of numeric columns. Default: False
profiling.include_field_sample_values boolean	Whether to profile for the sample values for all columns. Default: True
profiling.include_field_stddev_value boolean	Whether to profile for the standard deviation of numeric columns. Default: True
profiling.limit One of integer, null	Max number of documents to profile. By default, profiles all documents. Default: None
profiling.max_number_of_fields_to_profile One of integer, null	A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. Default: None
profiling.max_workers integer	Number of worker threads to use for profiling. Set to 1 to disable. Default: 20
profiling.method Enum	One of: "ge", "sqlalchemy" Default: sqlalchemy
profiling.nested_field_max_depth integer	Maximum recursion depth when flattening nested JSON structures during profiling. Lower values prevent recursion errors but may truncate deeply nested data. Applies to connectors that process dynamic JSON content (e.g., Kafka, MongoDB, Elasticsearch). Default: 10
profiling.offset One of integer, null	Offset in documents to profile. By default, uses no offset. Default: None
profiling.partition_datetime One of string(date-time), null	If specified, profile only the partition that matches this datetime. If not specified, profile the latest partition. Only BigQuery supports this. Default: None
profiling.partition_profiling_enabled boolean	Whether to profile partitioned tables. Only BigQuery and AWS Athena support this. If enabled, the latest partition's data is used for profiling. Default: True
profiling.profile_external_tables boolean	Whether to profile external tables. Only Snowflake and Redshift support this. Default: False
profiling.profile_if_updated_since_days One of number, null	Profile a table only if it has been updated within this many days. If set to `null`, no last-modified-time constraint is applied to the tables to profile. Supported in `Snowflake`, `BigQuery`, and `Dremio`. Note: for Dremio this compares against DataHub's last-profiled timestamp (Dremio exposes no table modification time), so it controls profile frequency rather than reacting to upstream change. Default: None
profiling.profile_nested_fields boolean	Whether to profile complex types like structs, arrays and maps. Default: False
profiling.profile_table_level_only boolean	Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False
profiling.profile_table_row_count_estimate_only boolean	Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. Default: False
profiling.profile_table_row_limit One of integer, null	Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats. Default: 5000000
profiling.profile_table_size_limit One of integer, null	Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported in `Snowflake`, `BigQuery`, `Databricks`, `Oracle`, and `Teradata`. `Oracle` uses calculated size from gathered stats. `Teradata` uses DBC space accounting. Default: 5
profiling.query_combiner_enabled boolean	This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True
profiling.report_dropped_profiles boolean	Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes. Default: False
profiling.sample_size integer	Number of rows to be sampled from the table for column-level profiling. Applicable only if `use_sampling` is set to True. Default: 10000
profiling.turn_off_expensive_profiling_metrics boolean	Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False
profiling.use_sampling boolean	Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Default: True
profiling.operation_config OperationConfig
profiling.operation_config.lower_freq_profile_enabled boolean	Whether to profile at a lower frequency. This does not do any scheduling; it only adds checks for when not to run profiling. Default: False
profiling.operation_config.profile_date_of_month One of integer, null	Day of the month, from 1 to 31 (inclusive). If not specified, this field has no effect. Default: None
profiling.operation_config.profile_day_of_week One of integer, null	Day of the week, from 0 to 6 (inclusive); 0 is Monday and 6 is Sunday. If not specified, this field has no effect. Default: None
profiling.tags_to_ignore_sampling One of array, null	Fixed list of tags to ignore sampling. Each entry may be a full tag URN (e.g. `urn:li:tag:my_tag`) or just the tag name (e.g. `my_tag`). If not specified, tables will be sampled based on `use_sampling`. Default: None
profiling.tags_to_ignore_sampling.string string
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null	Default: None
stateful_ingestion.enabled boolean	Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False Default: False
stateful_ingestion.fail_safe_threshold number	Guards against accidental changes to the source configuration: if the relative change in entities compared to the previous state exceeds `fail_safe_threshold` percent, the soft deletes and the state are not committed. Default: 75.0
stateful_ingestion.remove_stale_metadata boolean	Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True
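
Two of the numeric rules in the table above can be checked with a short Python sketch: the pool-size rule described for `max_pool_size`, and the capped full-jitter backoff described for `retry_initial_backoff_seconds`. This mirrors the documented formulas only; the function names are hypothetical, and whether the connector counts attempts from 0 or 1 is an assumption (0 is used here).

```python
import random

# Cap stated in the retry_initial_backoff_seconds description.
BACKOFF_CAP_SECONDS = 30.0


def effective_pool_size(max_workers: int, max_pool_size: int) -> int:
    """Pool size as documented: min(max_workers, max_pool_size)."""
    return min(max_workers, max_pool_size)


def backoff_ceiling(initial: float, attempt: int) -> float:
    """Upper bound of the sleep window: min(initial * 2^attempt, 30.0)."""
    return min(initial * (2 ** attempt), BACKOFF_CAP_SECONDS)


def full_jitter_sleep(initial: float, attempt: int, rng: random.Random) -> float:
    """Draw a sleep duration uniformly from [0, backoff_ceiling]."""
    return rng.uniform(0.0, backoff_ceiling(initial, attempt))


if __name__ == "__main__":
    # The documented example: 10 workers with a ceiling of 13 gives a pool of 10.
    print(effective_pool_size(max_workers=10, max_pool_size=13))
    rng = random.Random(0)  # seeded only to make the demo repeatable
    for attempt in range(4):
        ceiling = backoff_ceiling(1.0, attempt)
        print(attempt, ceiling, full_jitter_sleep(1.0, attempt, rng))
```

With `initial=1.0`, the ceiling doubles on each attempt until it reaches the 30-second cap, so raising `retry_max_attempts` adds at most 30 seconds of sleep per extra retry.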

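The two incremental column-extraction options, `column_extraction_days_back` and `column_extraction_watermark`, both reduce to a single cutoff timestamp. The sketch below shows that reading of the descriptions; it is not the connector's code. The function names are hypothetical, and treating a table altered exactly at the cutoff as "still extracted" is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def column_extraction_cutoff(
    days_back: Optional[int] = None,
    watermark: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> Optional[datetime]:
    """Return the cutoff timestamp, or None when neither option is set."""
    if days_back is not None and watermark is not None:
        raise ValueError("days_back and watermark are mutually exclusive")
    if days_back is not None:
        # Computed at runtime as now() - N days, as the description states.
        current = now if now is not None else datetime.now(timezone.utc)
        return current - timedelta(days=days_back)
    return watermark


def should_extract_columns(last_alter: datetime, cutoff: Optional[datetime]) -> bool:
    """Extract columns unless the table was last altered before the cutoff."""
    return cutoff is None or last_alter >= cutoff


if __name__ == "__main__":
    cutoff = column_extraction_cutoff(days_back=3)
    recently_altered = datetime.now(timezone.utc) - timedelta(days=1)
    print(should_extract_columns(recently_altered, cutoff))
```

Because the `days_back` cutoff is recomputed on every run, a scheduled pipeline never needs its recipe edited, whereas a watermark must be updated after each successful run.
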
The JSONSchema for this configuration is inlined below.

{
  "$defs": {
    "AllowDenyPattern": {
      "additionalProperties": false,
      "description": "A class to store allow deny regexes",
      "properties": {
        "allow": {
          "default": [
            ".*"
          ],
          "description": "List of regex patterns to include in ingestion",
          "items": {
            "type": "string"
          },
          "title": "Allow",
          "type": "array"
        },
        "deny": {
          "default": [],
          "description": "List of regex patterns to exclude from ingestion.",
          "items": {
            "type": "string"
          },
          "title": "Deny",
          "type": "array"
        },
        "ignoreCase": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": true,
          "description": "Whether to ignore case sensitivity during pattern matching.",
          "title": "Ignorecase"
        }
      },
      "title": "AllowDenyPattern",
      "type": "object"
    },
    "BaseUsageConfig": {
      "additionalProperties": false,
      "properties": {
        "bucket_duration": {
          "$ref": "#/$defs/BucketDuration",
          "default": "DAY",
          "description": "Size of the time window to aggregate usage stats."
        },
        "end_time": {
          "description": "Latest date of lineage/usage to consider. Default: Current time in UTC",
          "format": "date-time",
          "title": "End Time",
          "type": "string"
        },
        "start_time": {
          "default": null,
          "description": "Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'.",
          "format": "date-time",
          "title": "Start Time",
          "type": "string"
        },
        "top_n_queries": {
          "default": 10,
          "description": "Number of top queries to save to each table.",
          "exclusiveMinimum": 0,
          "title": "Top N Queries",
          "type": "integer"
        },
        "user_email_pattern": {
          "$ref": "#/$defs/AllowDenyPattern",
          "default": {
            "allow": [
              ".*"
            ],
            "deny": [],
            "ignoreCase": true
          },
          "description": "regex patterns for user emails to filter in usage."
        },
        "include_operational_stats": {
          "default": true,
          "description": "Whether to display operational stats.",
          "title": "Include Operational Stats",
          "type": "boolean"
        },
        "include_read_operational_stats": {
          "default": false,
          "description": "Whether to report read operational stats. Experimental.",
          "title": "Include Read Operational Stats",
          "type": "boolean"
        },
        "format_sql_queries": {
          "default": false,
          "description": "Whether to format sql queries",
          "title": "Format Sql Queries",
          "type": "boolean"
        },
        "include_top_n_queries": {
          "default": true,
          "description": "Whether to ingest the top_n_queries.",
          "title": "Include Top N Queries",
          "type": "boolean"
        }
      },
      "title": "BaseUsageConfig",
      "type": "object"
    },
    "BucketDuration": {
      "enum": [
        "DAY",
        "HOUR"
      ],
      "title": "BucketDuration",
      "type": "string"
    },
    "ClassificationConfig": {
      "additionalProperties": false,
      "properties": {
        "enabled": {
          "default": false,
          "description": "Whether classification should be used to auto-detect glossary terms",
          "title": "Enabled",
          "type": "boolean"
        },
        "sample_size": {
          "default": 100,
          "description": "Number of sample values used for classification.",
          "title": "Sample Size",
          "type": "integer"
        },
        "max_workers": {
          "default": 4,
          "description": "Number of worker processes to use for classification. Set to 1 to disable.",
          "title": "Max Workers",
          "type": "integer"
        },
        "table_pattern": {
          "$ref": "#/$defs/AllowDenyPattern",
          "default": {
            "allow": [
              ".*"
            ],
            "deny": [],
            "ignoreCase": true
          },
          "description": "Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
        },
        "column_pattern": {
          "$ref": "#/$defs/AllowDenyPattern",
          "default": {
            "allow": [
              ".*"
            ],
            "deny": [],
            "ignoreCase": true
          },
          "description": "Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format."
        },
        "info_type_to_term": {
          "additionalProperties": {
            "type": "string"
          },
          "default": {},
          "description": "Optional mapping to provide glossary term identifier for info type",
          "title": "Info Type To Term",
          "type": "object"
        },
        "classifiers": {
          "default": [
            {
              "type": "datahub",
              "config": null
            }
          ],
          "description": "Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.",
          "items": {
            "$ref": "#/$defs/DynamicTypedClassifierConfig"
          },
          "title": "Classifiers",
          "type": "array"
        }
      },
      "title": "ClassificationConfig",
      "type": "object"
    },
    "DynamicTypedClassifierConfig": {
      "additionalProperties": false,
      "properties": {
        "type": {
          "description": "The type of the classifier to use. The built-in `datahub` classifier has been removed; register a custom classifier and reference its type here.",
          "title": "Type",
          "type": "string"
        },
        "config": {
          "anyOf": [
            {},
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.",
          "title": "Config"
        }
      },
      "required": [
        "type"
      ],
      "title": "DynamicTypedClassifierConfig",
      "type": "object"
    },
    "GEProfilingConfig": {
      "additionalProperties": false,
      "properties": {
        "method": {
          "default": "sqlalchemy",
          "description": "Profiling method to use. `sqlalchemy` (default) runs profiling queries directly against your source's existing SQLAlchemy connection. `ge` selects the legacy Great Expectations profiler, which is deprecated and requires `pip install 'acryl-datahub[profiling-ge]'`.",
          "enum": [
            "ge",
            "sqlalchemy"
          ],
          "title": "Method",
          "type": "string"
        },
        "enabled": {
          "default": false,
          "description": "Whether profiling should be done.",
          "title": "Enabled",
          "type": "boolean"
        },
        "operation_config": {
          "$ref": "#/$defs/OperationConfig",
          "description": "Experimental feature. To specify operation configs."
        },
        "limit": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Max number of documents to profile. By default, profiles all documents.",
          "title": "Limit"
        },
        "offset": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Offset in documents to profile. By default, uses no offset.",
          "title": "Offset"
        },
        "profile_table_level_only": {
          "default": false,
          "description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
          "title": "Profile Table Level Only",
          "type": "boolean"
        },
        "include_field_null_count": {
          "default": true,
          "description": "Whether to profile for the number of nulls for each column.",
          "title": "Include Field Null Count",
          "type": "boolean"
        },
        "include_field_distinct_count": {
          "default": true,
          "description": "Whether to profile for the number of distinct values for each column.",
          "title": "Include Field Distinct Count",
          "type": "boolean"
        },
        "include_field_min_value": {
          "default": true,
          "description": "Whether to profile for the min value of numeric columns.",
          "title": "Include Field Min Value",
          "type": "boolean"
        },
        "include_field_max_value": {
          "default": true,
          "description": "Whether to profile for the max value of numeric columns.",
          "title": "Include Field Max Value",
          "type": "boolean"
        },
        "include_field_mean_value": {
          "default": true,
          "description": "Whether to profile for the mean value of numeric columns.",
          "title": "Include Field Mean Value",
          "type": "boolean"
        },
        "include_field_median_value": {
          "default": true,
          "description": "Whether to profile for the median value of numeric columns.",
          "title": "Include Field Median Value",
          "type": "boolean"
        },
        "include_field_stddev_value": {
          "default": true,
          "description": "Whether to profile for the standard deviation of numeric columns.",
          "title": "Include Field Stddev Value",
          "type": "boolean"
        },
        "include_field_quantiles": {
          "default": false,
          "description": "Whether to profile for the quantiles of numeric columns.",
          "title": "Include Field Quantiles",
          "type": "boolean"
        },
        "include_field_distinct_value_frequencies": {
          "default": false,
          "description": "Whether to profile for distinct value frequencies.",
          "title": "Include Field Distinct Value Frequencies",
          "type": "boolean"
        },
        "include_field_histogram": {
          "default": false,
          "description": "Whether to profile for the histogram for numeric fields.",
          "title": "Include Field Histogram",
          "type": "boolean"
        },
        "include_field_sample_values": {
          "default": true,
          "description": "Whether to profile for the sample values for all columns.",
          "title": "Include Field Sample Values",
          "type": "boolean"
        },
        "max_workers": {
          "default": 20,
          "description": "Number of worker threads to use for profiling. Set to 1 to disable.",
          "title": "Max Workers",
          "type": "integer"
        },
        "report_dropped_profiles": {
          "default": false,
          "description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
          "title": "Report Dropped Profiles",
          "type": "boolean"
        },
        "turn_off_expensive_profiling_metrics": {
          "default": false,
          "description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.",
          "title": "Turn Off Expensive Profiling Metrics",
          "type": "boolean"
        },
        "field_sample_values_limit": {
          "default": 20,
          "description": "Upper limit for number of sample values to collect for all columns.",
          "title": "Field Sample Values Limit",
          "type": "integer"
        },
        "max_number_of_fields_to_profile": {
          "anyOf": [
            {
              "exclusiveMinimum": 0,
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
          "title": "Max Number Of Fields To Profile"
        },
        "profile_if_updated_since_days": {
          "anyOf": [
            {
              "exclusiveMinimum": 0,
              "type": "number"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported in `Snowflake`, `BigQuery`, and `Dremio`. Note: for Dremio this compares against DataHub's last-profiled timestamp (Dremio exposes no table modification time), so it controls profile frequency rather than reacting to upstream change.",
          "schema_extra": {
            "supported_sources": [
              "snowflake",
              "bigquery",
              "dremio"
            ]
          },
          "title": "Profile If Updated Since Days"
        },
        "profile_table_size_limit": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": 5,
          "description": "Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported in `Snowflake`, `BigQuery`, `Databricks`, `Oracle`, and `Teradata`. `Oracle` uses calculated size from gathered stats. `Teradata` uses DBC space accounting.",
          "schema_extra": {
            "supported_sources": [
              "snowflake",
              "bigquery",
              "unity-catalog",
              "oracle",
              "teradata"
            ]
          },
          "title": "Profile Table Size Limit"
        },
        "profile_table_row_limit": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": 5000000,
          "description": "Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats.",
          "schema_extra": {
            "supported_sources": [
              "snowflake",
              "bigquery",
              "oracle"
            ]
          },
          "title": "Profile Table Row Limit"
        },
        "profile_table_row_count_estimate_only": {
          "default": false,
          "description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ",
          "schema_extra": {
            "supported_sources": [
              "postgres",
              "mysql"
            ]
          },
          "title": "Profile Table Row Count Estimate Only",
          "type": "boolean"
        },
        "query_combiner_enabled": {
          "default": true,
          "description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
          "title": "Query Combiner Enabled",
          "type": "boolean"
        },
        "catch_exceptions": {
          "default": true,
          "description": "",
          "title": "Catch Exceptions",
          "type": "boolean"
        },
        "partition_profiling_enabled": {
          "default": true,
          "description": "Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.",
          "schema_extra": {
            "supported_sources": [
              "athena",
              "bigquery"
            ]
          },
          "title": "Partition Profiling Enabled",
          "type": "boolean"
        },
        "partition_datetime": {
          "anyOf": [
            {
              "format": "date-time",
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.",
          "schema_extra": {
            "supported_sources": [
              "bigquery"
            ]
          },
          "title": "Partition Datetime"
        },
        "use_sampling": {
          "default": true,
          "description": "Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
          "schema_extra": {
            "supported_sources": [
              "bigquery",
              "snowflake"
            ]
          },
          "title": "Use Sampling",
          "type": "boolean"
        },
        "sample_size": {
          "default": 10000,
          "description": "Number of rows to be sampled from table for column level profiling.Applicable only if `use_sampling` is set to True.",
          "schema_extra": {
            "supported_sources": [
              "bigquery",
              "snowflake"
            ]
          },
          "title": "Sample Size",
          "type": "integer"
        },
        "profile_external_tables": {
          "default": false,
          "description": "Whether to profile external tables. Only Snowflake and Redshift supports this.",
          "schema_extra": {
            "supported_sources": [
              "redshift",
              "snowflake"
            ]
          },
          "title": "Profile External Tables",
          "type": "boolean"
        },
        "tags_to_ignore_sampling": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Fixed list of tags to ignore sampling. Each entry may be a full tag URN (e.g. `urn:li:tag:my_tag`) or just the tag name (e.g. `my_tag`). If not specified, tables will be sampled based on `use_sampling`.",
          "title": "Tags To Ignore Sampling"
        },
        "profile_nested_fields": {
          "default": false,
          "description": "Whether to profile complex types like structs, arrays and maps. ",
          "title": "Profile Nested Fields",
          "type": "boolean"
        },
        "nested_field_max_depth": {
          "default": 10,
          "description": "Maximum recursion depth when flattening nested JSON structures during profiling. Lower values prevent recursion errors but may truncate deeply nested data. Applies to connectors that process dynamic JSON content (e.g., Kafka, MongoDB, Elasticsearch).",
          "exclusiveMinimum": 0,
          "title": "Nested Field Max Depth",
          "type": "integer"
        }
      },
      "title": "GEProfilingConfig",
      "type": "object"
    },
    "OperationConfig": {
      "additionalProperties": false,
      "properties": {
        "lower_freq_profile_enabled": {
          "default": false,
          "description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
          "title": "Lower Freq Profile Enabled",
          "type": "boolean"
        },
        "profile_day_of_week": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
          "title": "Profile Day Of Week"
        },
        "profile_date_of_month": {
          "anyOf": [
            {
              "type": "integer"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
          "title": "Profile Date Of Month"
        }
      },
      "title": "OperationConfig",
      "type": "object"
    },
    "StatefulStaleMetadataRemovalConfig": {
      "additionalProperties": false,
      "description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
      "properties": {
        "enabled": {
          "default": false,
          "description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
          "title": "Enabled",
          "type": "boolean"
        },
        "remove_stale_metadata": {
          "default": true,
          "description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
          "title": "Remove Stale Metadata",
          "type": "boolean"
        },
        "fail_safe_threshold": {
          "default": 75.0,
          "description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
          "maximum": 100.0,
          "minimum": 0.0,
          "title": "Fail Safe Threshold",
          "type": "number"
        }
      },
      "title": "StatefulStaleMetadataRemovalConfig",
      "type": "object"
    }
  },
  "additionalProperties": false,
  "properties": {
    "bucket_duration": {
      "$ref": "#/$defs/BucketDuration",
      "default": "DAY",
      "description": "Size of the time window to aggregate usage stats."
    },
    "end_time": {
      "description": "Latest date of lineage/usage to consider. Default: Current time in UTC",
      "format": "date-time",
      "title": "End Time",
      "type": "string"
    },
    "start_time": {
      "default": null,
      "description": "Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on `bucket_duration`). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'.",
      "format": "date-time",
      "title": "Start Time",
      "type": "string"
    },
    "table_pattern": {
      "$ref": "#/$defs/AllowDenyPattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
    },
    "view_pattern": {
      "$ref": "#/$defs/AllowDenyPattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
    },
    "classification": {
      "$ref": "#/$defs/ClassificationConfig",
      "default": {
        "enabled": false,
        "sample_size": 100,
        "max_workers": 4,
        "table_pattern": {
          "allow": [
            ".*"
          ],
          "deny": [],
          "ignoreCase": true
        },
        "column_pattern": {
          "allow": [
            ".*"
          ],
          "deny": [],
          "ignoreCase": true
        },
        "info_type_to_term": {},
        "classifiers": [
          {
            "config": null,
            "type": "datahub"
          }
        ]
      },
      "description": "For details, refer to [Classification](../../../../metadata-ingestion/docs/dev_guides/classification.md)."
    },
    "incremental_lineage": {
      "default": false,
      "description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
      "title": "Incremental Lineage",
      "type": "boolean"
    },
    "convert_urns_to_lowercase": {
      "default": false,
      "description": "Whether to convert dataset urns to lowercase. This value is part of each dataset's URN identity, so it must stay fixed for the life of a deployment. Changing it after data has been ingested re-keys every dataset (e.g. `MyDb.MyTable` becomes `mydb.mytable`); with stateful ingestion enabled the old-cased URNs are then soft-deleted as stale while the new-cased ones are created, producing duplicate or orphaned entities. Pick one value before the first run and leave it unchanged.",
      "title": "Convert Urns To Lowercase",
      "type": "boolean"
    },
    "env": {
      "default": "PROD",
      "description": "The environment that all assets produced by this connector belong to",
      "title": "Env",
      "type": "string"
    },
    "platform_instance": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
      "title": "Platform Instance"
    },
    "stateful_ingestion": {
      "anyOf": [
        {
          "$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
        },
        {
          "type": "null"
        }
      ],
      "default": null
    },
    "options": {
      "additionalProperties": true,
      "description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
      "title": "Options",
      "type": "object"
    },
    "profile_pattern": {
      "$ref": "#/$defs/AllowDenyPattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [],
        "ignoreCase": true
      },
      "description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered."
    },
    "domain": {
      "additionalProperties": {
        "$ref": "#/$defs/AllowDenyPattern"
      },
      "default": {},
      "description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.",
      "title": "Domain",
      "type": "object"
    },
    "include_views": {
      "default": true,
      "description": "Whether views should be ingested.",
      "title": "Include Views",
      "type": "boolean"
    },
    "include_tables": {
      "default": true,
      "description": "Whether tables should be ingested.",
      "title": "Include Tables",
      "type": "boolean"
    },
    "include_table_location_lineage": {
      "default": true,
      "description": "If the source supports it, include table lineage to the underlying storage location.",
      "title": "Include Table Location Lineage",
      "type": "boolean"
    },
    "include_view_lineage": {
      "default": true,
      "description": "Whether to include view lineage in the ingestion. This requires to have the view lineage feature enabled.",
      "title": "Include View Lineage",
      "type": "boolean"
    },
    "include_view_column_lineage": {
      "default": true,
      "description": "Populates column-level lineage for  view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled.",
      "title": "Include View Column Lineage",
      "type": "boolean"
    },
    "use_file_backed_cache": {
      "default": true,
      "description": "Whether to use a file backed cache for the view definitions.",
      "title": "Use File Backed Cache",
      "type": "boolean"
    },
    "profiling": {
      "$ref": "#/$defs/GEProfilingConfig",
      "default": {
        "method": "sqlalchemy",
        "enabled": false,
        "operation_config": {
          "lower_freq_profile_enabled": false,
          "profile_date_of_month": null,
          "profile_day_of_week": null
        },
        "limit": null,
        "offset": null,
        "profile_table_level_only": false,
        "include_field_null_count": true,
        "include_field_distinct_count": true,
        "include_field_min_value": true,
        "include_field_max_value": true,
        "include_field_mean_value": true,
        "include_field_median_value": true,
        "include_field_stddev_value": true,
        "include_field_quantiles": false,
        "include_field_distinct_value_frequencies": false,
        "include_field_histogram": false,
        "include_field_sample_values": true,
        "max_workers": 20,
        "report_dropped_profiles": false,
        "turn_off_expensive_profiling_metrics": false,
        "field_sample_values_limit": 20,
        "max_number_of_fields_to_profile": null,
        "profile_if_updated_since_days": null,
        "profile_table_size_limit": 5,
        "profile_table_row_limit": 5000000,
        "profile_table_row_count_estimate_only": false,
        "query_combiner_enabled": true,
        "catch_exceptions": true,
        "partition_profiling_enabled": true,
        "partition_datetime": null,
        "use_sampling": true,
        "sample_size": 10000,
        "profile_external_tables": false,
        "tags_to_ignore_sampling": null,
        "profile_nested_fields": false,
        "nested_field_max_depth": 10
      }
    },
    "username": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "username",
      "title": "Username"
    },
    "password": {
      "anyOf": [
        {
          "format": "password",
          "type": "string",
          "writeOnly": true
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "password",
      "title": "Password"
    },
    "host_port": {
      "description": "host URL",
      "title": "Host Port",
      "type": "string"
    },
    "database": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "database (catalog)",
      "title": "Database"
    },
    "scheme": {
      "default": "teradatasql",
      "description": "database scheme",
      "title": "Scheme",
      "type": "string"
    },
    "sqlalchemy_uri": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
      "title": "Sqlalchemy Uri"
    },
    "database_pattern": {
      "$ref": "#/$defs/AllowDenyPattern",
      "default": {
        "allow": [
          ".*"
        ],
        "deny": [
          "All",
          "Crashdumps",
          "Default",
          "DemoNow_Monitor",
          "EXTUSER",
          "External_AP",
          "GLOBAL_FUNCTIONS",
          "LockLogShredder",
          "PUBLIC",
          "SQLJ",
          "SYSBAR",
          "SYSJDBC",
          "SYSLIB",
          "SYSSPATIAL",
          "SYSUDTLIB",
          "SYSUIF",
          "SysAdmin",
          "Sys_Calendar",
          "SystemFe",
          "TDBCMgmt",
          "TDMaps",
          "TDPUSER",
          "TDQCD",
          "TDStats",
          "TD_ANALYTICS_DB",
          "TD_SERVER_DB",
          "TD_SYSFNLIB",
          "TD_SYSGPL",
          "TD_SYSXML",
          "TDaaS_BAR",
          "TDaaS_DB",
          "TDaaS_Maint",
          "TDaaS_Monitor",
          "TDaaS_Support",
          "TDaaS_TDBCMgmt1",
          "TDaaS_TDBCMgmt2",
          "dbcmngr",
          "mldb",
          "system",
          "tapidb",
          "tdwm",
          "val",
          "dbc"
        ],
        "ignoreCase": true
      },
      "description": "Regex patterns for databases to filter in ingestion."
    },
    "databases": {
      "anyOf": [
        {
          "items": {
            "type": "string"
          },
          "type": "array"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "List of databases to ingest. If not specified, all databases will be ingested. Even if this is specified, databases will still be filtered by `database_pattern`.",
      "title": "Databases"
    },
    "include_table_lineage": {
      "default": false,
      "description": "Whether to include table lineage in the ingestion. This requires to have the table lineage feature enabled.",
      "title": "Include Table Lineage",
      "type": "boolean"
    },
    "include_queries": {
      "default": true,
      "description": "Whether to generate query entities for SQL queries. Query entities provide metadata about individual SQL queries including execution timestamps, user information, and query text.",
      "title": "Include Queries",
      "type": "boolean"
    },
    "usage": {
      "$ref": "#/$defs/BaseUsageConfig",
      "default": {
        "bucket_duration": "DAY",
        "end_time": "2026-07-25T04:47:58.196026Z",
        "start_time": "2026-07-24T00:00:00Z",
        "queries_character_limit": 24000,
        "top_n_queries": 10,
        "user_email_pattern": {
          "allow": [
            ".*"
          ],
          "deny": [],
          "ignoreCase": true
        },
        "include_operational_stats": true,
        "include_read_operational_stats": false,
        "format_sql_queries": false,
        "include_top_n_queries": true
      },
      "description": "The usage config to use when generating usage statistics"
    },
    "default_db": {
      "anyOf": [
        {
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "The default database to use for unqualified table names",
      "title": "Default Db"
    },
    "include_usage_statistics": {
      "default": false,
      "description": "Whether to generate usage statistics.",
      "title": "Include Usage Statistics",
      "type": "boolean"
    },
    "use_qvci": {
      "default": false,
      "description": "Whether to use QVCI to get column information. This is faster but requires QVCI to be enabled.",
      "title": "Use Qvci",
      "type": "boolean"
    },
    "include_historical_lineage": {
      "default": false,
      "description": "Whether to include historical lineage data from PDCRINFO.DBQLSqlTbl_Hst in addition to current DBC.QryLogV data. This provides access to historical query logs that may have been archived. The historical table existence is checked automatically and gracefully falls back to current data only if not available.",
      "title": "Include Historical Lineage",
      "type": "boolean"
    },
    "use_server_side_cursors": {
      "default": true,
      "description": "Enable server-side cursors for large result sets using SQLAlchemy's stream_results. This reduces memory usage by streaming results from the database server. Automatically falls back to client-side batching if server-side cursors are not supported.",
      "title": "Use Server Side Cursors",
      "type": "boolean"
    },
    "max_workers": {
      "default": 10,
      "description": "Maximum number of worker threads to use for parallel processing. Controls the level of concurrency for operations like view processing.",
      "title": "Max Workers",
      "type": "integer"
    },
    "max_pool_size": {
      "default": 13,
      "description": "Ceiling on the number of concurrent Teradata connections used during parallel view processing. The actual pool size is min(max_workers, max_pool_size), so this value only takes effect when max_workers exceeds it. For example, max_workers=10 with max_pool_size=13 creates a pool of 10, not 13. The upper bound of 50 is a conservative ingestion-time safety ceiling, not a Teradata system limit. Teradata's per-user MAXSESSIONS parameter is typically 64\u2013200+ depending on the platform and user profile. ",
      "maximum": 50,
      "minimum": 1,
      "title": "Max Pool Size",
      "type": "integer"
    },
    "extract_ownership": {
      "default": false,
      "description": "Whether to extract ownership information for tables and views based on their creator. When enabled, the table/view creator from Teradata's system tables will be added as an owner with DATAOWNER type. Ownership is applied using OVERWRITE mode, meaning any existing ownership information (including manually added or modified owners from the UI) will be replaced. Use with caution.",
      "title": "Extract Ownership",
      "type": "boolean"
    },
    "column_extraction_watermark": {
      "anyOf": [
        {
          "format": "date-time",
          "type": "string"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Skip column extraction for tables/views whose LastAlterTimeStamp is older than this timestamp. Set to the start time of the last successful ingestion run to enable incremental column extraction. Mutually exclusive with column_extraction_days_back. At 13k tables where ~200 change per day this can reduce ingestion from hours to minutes.",
      "title": "Column Extraction Watermark"
    },
    "column_extraction_days_back": {
      "anyOf": [
        {
          "type": "integer"
        },
        {
          "type": "null"
        }
      ],
      "default": null,
      "description": "Skip column extraction for tables/views not altered within the last N days. Computed at runtime as now() - N days, so the recipe never needs updating. A value of 3 for a daily schedule covers up to two missed runs with no gap risk. Mutually exclusive with column_extraction_watermark.",
      "title": "Column Extraction Days Back"
    },
    "use_dbc_columns_for_views": {
      "default": false,
      "description": "When True, attempt to use dbc.ColumnsV for view column metadata (faster bulk fetch) and fall back to HELP statements only for views where any column has a null/unknown ColumnType (e.g., derived expression columns). Can cut HELP calls by 80-90% for installations where most view columns have explicit types. Set to False (default) to always use HELP for views, which is the conservative but slower approach.",
      "title": "Use Dbc Columns For Views",
      "type": "boolean"
    },
    "request_timeout_ms": {
      "default": 120000,
      "description": "Request timeout in milliseconds for Teradata query execution. Increase this when queries against large system tables (e.g., DBC.QryLogV) time out silently and fall back. Default is 120000 (2 minutes).",
      "title": "Request Timeout Ms",
      "type": "integer"
    },
    "connect_timeout_ms": {
      "default": 30000,
      "description": "Connection timeout in milliseconds when establishing Teradata connections. Default is 30000 (30 seconds).",
      "title": "Connect Timeout Ms",
      "type": "integer"
    },
    "connection_pool_timeout_ms": {
      "default": 60000,
      "description": "How long, in milliseconds, a worker thread will wait for a free connection from the pool before raising a PoolTimeoutError. PoolTimeoutError is a retryable condition: the connector will sleep with full-jitter exponential backoff and try again up to retry_max_attempts times. Increase this when parallel view processing saturates the pool on large schemas (watch num_pool_timeout_retries in the ingestion report). Decrease it to surface pool-exhaustion failures faster on small installations. Default is 60000 (60 seconds).",
      "maximum": 600000,
      "minimum": 1,
      "title": "Connection Pool Timeout Ms",
      "type": "integer"
    },
    "retry_max_attempts": {
      "default": 3,
      "description": "Maximum total attempts (initial + retries) for retryable database operations (connect, execute, fetchmany). Retryable conditions: pool exhaustion, transaction-aborted messages, dead-socket signals at connect time, and Teradata error codes 2631/3111/3120/3597/3598/3897. Permanent errors (auth failures, permission denied, object does not exist) are never retried regardless of this setting. Worst-case added latency per operation is approximately retry_max_attempts \u00d7 connection_pool_timeout_ms plus backoff sleeps (each capped at 30.0s). Increase when ingesting from a busy or flaky cluster; decrease to surface persistent errors faster. Default is 3.",
      "maximum": 10,
      "minimum": 1,
      "title": "Retry Max Attempts",
      "type": "integer"
    },
    "retry_initial_backoff_seconds": {
      "default": 1.0,
      "description": "Seed value, in seconds, for the full-jitter exponential backoff between retry attempts. Each retry sleeps for a duration drawn uniformly from [0, min(initial * 2^attempt, 30.0)] seconds. The 30-second cap prevents runaway sleep times even when retry_max_attempts is set high (e.g. initial=1.0, attempt=10 would be 1024s without the cap). Increase this to spread retries further apart on a heavily loaded cluster; decrease it for faster recovery on transient blips. Default is 1.0.",
      "exclusiveMinimum": 0,
      "title": "Retry Initial Backoff Seconds",
      "type": "number"
    },
    "view_processing_timeout_seconds": {
      "default": 1800,
      "description": "Maximum wall-clock time, in seconds, that a single view may spend in the parallel view-processing pool before the connector abandons it and moves on. Set to 0 to disable. Stalled views are reported as warnings and counted in `num_view_processing_timeouts`. This protects bulk ingestion from silent hangs when a Teradata query blocks indefinitely (e.g., on a dropped TCP connection). Default is 1800 (30 minutes).",
      "title": "View Processing Timeout Seconds",
      "type": "integer"
    },
    "view_processing_heartbeat_seconds": {
      "default": 30,
      "description": "How often, in seconds, to emit a 'view processing heartbeat' log line during parallel view processing. The heartbeat reports completed/in-progress counts and the longest-running view, making it possible to diagnose silent halts in the executor. Set to 0 to disable. Default is 30 seconds.",
      "title": "View Processing Heartbeat Seconds",
      "type": "integer"
    },
    "lineage_fetch_stall_warning_seconds": {
      "default": 300,
      "description": "If no lineage row batch arrives from DBC.QryLogV within this many seconds, emit a warning identifying the stalled phase. Set to 0 to disable. Default is 300 (5 minutes).",
      "title": "Lineage Fetch Stall Warning Seconds",
      "type": "integer"
    },
    "lineage_fetch_batch_size": {
      "default": 5000,
      "description": "Number of rows fetched per batch when streaming results from DBC.QryLogV during lineage extraction. Each row can carry several KB of query_text, so larger values increase peak memory usage while smaller values increase the number of round-trips to the database. Lower this (e.g. to a few hundred, or lower still) if the ingestion process runs out of memory during lineage extraction; raise it to reduce round-trips when rows are small and network latency is high. Must be a positive integer (a batch size of 0 would fetch no rows and stall the stream). NOTE: this only reduces memory when `use_server_side_cursors` is true (the default). With client-side cursors the driver buffers the entire result set in memory before this batching applies, so lowering the batch size will not prevent out-of-memory errors in that mode \u2014 it only changes the Python iteration chunk size. Default is 5000.",
      "exclusiveMinimum": 0,
      "title": "Lineage Fetch Batch Size",
      "type": "integer"
    },
    "lineage_slow_query_log_seconds": {
      "default": 60.0,
      "description": "When the total database time for a single lineage query (execute call plus all fetchmany calls, excluding downstream processing time) exceeds this many seconds, emit a warning with the query label, elapsed DB time, and the first 500 characters of the SQL text so slow queries can be identified and tuned. Note: when the driver retries a failed fetchmany call, the retry backoff sleep time is included in the measurement, so the threshold should be set well above the expected base query time. Set to 0 to disable. Default is 60 seconds.",
      "minimum": 0.0,
      "title": "Lineage Slow Query Log Seconds",
      "type": "number"
    }
  },
  "required": [
    "host_port"
  ],
  "title": "TeradataConfig",
  "type": "object"
}
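
Two descriptions in the schema above state small formulas: the effective connection pool size, min(max_workers, max_pool_size), and the full-jitter backoff window, [0, min(initial * 2^attempt, 30.0)]. The Python sketch below only restates that arithmetic. The function names are hypothetical and this is not the connector's code; counting attempts from 0 is also an assumption.

```python
import random

# Cap stated in the retry_initial_backoff_seconds description.
BACKOFF_CAP_SECONDS = 30.0


def effective_pool_size(max_workers: int, max_pool_size: int) -> int:
    """Pool size as described for max_pool_size: min(max_workers, max_pool_size)."""
    return min(max_workers, max_pool_size)


def backoff_upper_bound(initial: float, attempt: int) -> float:
    """Upper end of the sleep window: min(initial * 2^attempt, cap)."""
    return min(initial * 2 ** attempt, BACKOFF_CAP_SECONDS)


def full_jitter_sleep(initial: float, attempt: int, rng: random.Random) -> float:
    """Draw a sleep duration uniformly from [0, upper bound] (full jitter)."""
    return rng.uniform(0.0, backoff_upper_bound(initial, attempt))
```

With the defaults (max_workers=10, max_pool_size=13), effective_pool_size returns 10, which matches the example given in the max_pool_size description.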

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Large-scale Deployment Tuning

For Teradata installations with thousands of tables, the following options can significantly reduce ingestion time.

Incremental column extraction

The connector compares each table's LastAlterTimeStamp against a watermark and skips column extraction for tables that have not changed. Only altered tables and tables with no recorded alter timestamp are re-extracted. At 13 000 tables where ~200 change per day this typically reduces a multi-hour run to minutes.

Two mutually exclusive options control the watermark (setting both raises a validation error at startup):

column_extraction_days_back — recommended for scheduled pipelines. Set once and never update the recipe. A value of 3 covers up to two missed daily runs with no gap risk.
```
column_extraction_days_back: 3
```
column_extraction_watermark — for stateful pipelines that track the exact timestamp of the last successful run programmatically.
```
column_extraction_watermark: "2024-06-01T00:00:00Z"
```
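
The skip rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the connector's implementation: the function names are hypothetical, and treating a table altered exactly at the cutoff as "changed" is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cutoff_from_days_back(days_back: int, now: datetime) -> datetime:
    """Watermark for column_extraction_days_back: now minus N days."""
    return now - timedelta(days=days_back)


def needs_column_extraction(last_alter: Optional[datetime], cutoff: datetime) -> bool:
    """Decide whether a table's columns should be re-extracted."""
    if last_alter is None:
        # Tables with no recorded alter timestamp are always re-extracted.
        return True
    # Tables older than the watermark are skipped (boundary handling assumed).
    return last_alter >= cutoff


now = datetime(2024, 6, 4, tzinfo=timezone.utc)
cutoff = cutoff_from_days_back(3, now)
```

Here cutoff evaluates to 2024-06-01T00:00:00Z, the same instant as the column_extraction_watermark example, so both options would select the same tables for that run.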

Faster view column fetching

By default the connector uses Teradata HELP statements for every view to ensure derived expression columns (e.g. col1 + col2) have correct types. Set use_dbc_columns_for_views: true to attempt a bulk dbc.ColumnsV fetch first and fall back to HELP only for views where any column has an unknown type. This can reduce HELP calls by 80–90 % on installations where most view columns have explicit types.
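
As a rough illustration of the fallback rule, the Python sketch below splits views into those that can use the bulk dbc.ColumnsV result and those that still need HELP. The function name and the input shape (view name mapped to a list of column types) are hypothetical, and modelling an unknown type as None or a blank string is an assumption.

```python
from typing import Dict, List, Optional, Tuple


def split_views_by_fetch_path(
    view_column_types: Dict[str, List[Optional[str]]],
) -> Tuple[List[str], List[str]]:
    """Return (views served from the bulk fetch, views that fall back to HELP)."""
    bulk: List[str] = []
    needs_help: List[str] = []
    for view, column_types in view_column_types.items():
        # A single unknown column type sends the whole view to the HELP path.
        if any(t is None or not t.strip() for t in column_types):
            needs_help.append(view)
        else:
            bulk.append(view)
    return bulk, needs_help
```

The HELP savings depend on how many views land in the first list, which is why the option helps most where view columns have explicit types.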

Profiling at scale

Profiling all tables in a large installation is impractical. Use profiling.limit (part of the standard GEProfilingConfig) to cap how many tables are profiled per run. You can also combine it with profile_pattern to restrict profiling to specific schemas or tables.

profiling:
  enabled: true
  limit: 500
profile_pattern:
  allow:
    - "high_priority_db\\..*"
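
To see what the recipe above selects, the following Python sketch applies an allow pattern and a limit to a list of table names. It is a simplified model: the function name is hypothetical, matching names in "database.table" form is an assumption, and so is applying the limit after the pattern filter. Note that the YAML string "high_priority_db\\..*" denotes the regular expression high_priority_db\..*.

```python
import re
from typing import Iterable, List


def select_tables_to_profile(
    tables: Iterable[str], allow: List[str], limit: int
) -> List[str]:
    """Keep tables matching any allow pattern, in input order, up to limit."""
    compiled = [re.compile(pattern, re.IGNORECASE) for pattern in allow]
    selected: List[str] = []
    for name in tables:
        if len(selected) >= limit:
            break
        # re.match anchors at the start of the name only.
        if any(regex.match(name) for regex in compiled):
            selected.append(name)
    return selected
```

Because selection follows input order, the cap keeps whichever matching tables come first rather than the most important ones.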

Lineage query scope

When databases is not set, the connector automatically scopes DBC.QryLogV queries to the databases discovered during metadata extraction, filtered by database_pattern. This avoids scanning the entire audit log. You can further restrict the scope with an explicit databases list.

Slow lineage query detection

Large DBC.QryLogV tables can cause individual lineage queries to run for several minutes without producing an obvious error. Set lineage_slow_query_log_seconds to emit a WARNING-level log line whenever the total database time for a single lineage query exceeds the threshold. Database time covers the execute call plus all fetchmany calls; downstream sqlglot processing time is excluded. The warning includes the query label, the elapsed DB time, and the first 500 characters of the SQL text.

lineage_slow_query_log_seconds: 120 # warn if any lineage query takes longer than 2 minutes (DB time)

The default is 60 seconds. Set to 0 to disable slow-query warnings entirely. Each slow query is also counted in report.lineage_slow_queries_detected, and per-query DB timings are available in report.lineage_query_timings for post-run analysis.

Note: if the driver retries a failed fetchmany call, the retry backoff sleep time is included in the DB time measurement — set the threshold well above the expected base query time.
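
The measurement described above (execute plus fetchmany time, with row processing excluded) can be sketched in Python like this. The function name and signature are hypothetical; the cursor only needs DB-API style execute and fetchmany methods, and the injectable clock exists purely to make the sketch testable.

```python
import time
from typing import Any, Callable, Sequence, Tuple


def timed_lineage_query(
    cursor: Any,
    sql: str,
    batch_size: int,
    process: Callable[[Sequence[Any]], None],
    threshold_seconds: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, bool]:
    """Return (database seconds, whether the query counts as slow)."""
    db_seconds = 0.0

    start = clock()
    cursor.execute(sql)
    db_seconds += clock() - start

    while True:
        start = clock()
        rows = cursor.fetchmany(batch_size)
        db_seconds += clock() - start  # only the fetch itself is timed
        if not rows:
            break
        process(rows)  # downstream processing is deliberately not timed

    # A threshold of 0 disables the slow-query check.
    is_slow = threshold_seconds > 0 and db_seconds > threshold_seconds
    return db_seconds, is_slow
```

Timing each call separately, rather than the whole loop, is what keeps parsing time out of the reported figure.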

SQL parse cache size

When usage statistics or lineage are enabled, every query row from DBC.QryLogV is parsed with sqlglot to extract table references. Identical query text in a session (e.g. a BI dashboard query that runs thousands of times per day) hits an LRU cache and avoids re-parsing. The default cache holds 1 000 entries, which is too small for production Teradata installations where hundreds of distinct queries each execute thousands of times.

Set the DATAHUB_SQL_PARSE_CACHE_SIZE environment variable before running the pipeline to increase the cache:

export DATAHUB_SQL_PARSE_CACHE_SIZE=50000
datahub ingest -c teradata_recipe.yml

Each cache entry holds a parsed query result in memory. 50 000 entries typically use 200–500 MB of additional heap depending on query complexity. Start with 10 000 if memory is constrained and increase until cache hit rates stabilise (visible in the ingestion report under sql_parsing_cache_stats).
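
The effect of the cache on repeated query text can be illustrated with Python's functools.lru_cache. This is a sketch only: parse_tables is a toy stand-in for sqlglot parsing, both function names are hypothetical, and reading the environment variable this way is an assumption about how a size could be picked up, not a description of DataHub internals.

```python
import os
from functools import lru_cache


def cache_size_from_env(default: int = 1000) -> int:
    """Read the cache size, falling back to the documented default of 1 000."""
    return int(os.environ.get("DATAHUB_SQL_PARSE_CACHE_SIZE", default))


def make_cached_parser(maxsize: int):
    """Build a table-reference extractor backed by an LRU cache."""

    @lru_cache(maxsize=maxsize)
    def parse_tables(query_text: str) -> tuple:
        # Toy parser: take the word following each FROM or JOIN keyword.
        words = query_text.split()
        return tuple(
            words[i + 1].lower()
            for i, word in enumerate(words[:-1])
            if word.upper() in ("FROM", "JOIN")
        )

    return parse_tables
```

Calling the returned function repeatedly with identical text parses once and serves the rest from the cache; parse_tables.cache_info() exposes hit and miss counts comparable in spirit to the statistics in the ingestion report.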

Connection timeouts

Use request_timeout_ms and connect_timeout_ms to tune the Teradata driver timeouts. Increase request_timeout_ms (default: 120 000 ms) if lineage queries against large DBC.QryLogV tables time out silently.

Hang protection for bulk parallel runs

Parallel view processing and audit-log fetching can stall indefinitely if a single Teradata call blocks (for example, when a firewall silently drops an idle TCP connection mid-query). The connector ships with three knobs that prevent this from manifesting as a fully silent halt:

view_processing_timeout_seconds (default 1800) — wall-clock cap per view in the parallel pool. A stalled view is abandoned and the run continues. Abandoned views are counted in report.num_view_processing_timeouts and listed in report.stalled_views. Set to 0 to disable.
view_processing_heartbeat_seconds (default 30) — interval between View processing heartbeat: ... log lines that report completed/in-progress counts and the longest-running view. Use this to identify which view is stuck if a run is making no progress. Set to 0 to disable.
lineage_fetch_stall_warning_seconds (default 300) — if no DBC.QryLogV batch arrives within this window, a Lineage fetch stall warning is logged with the current phase (executing_query, awaiting_first_batch, or fetching_batches). Pure observability — does not interrupt the fetch. Set to 0 to disable.

The defaults are conservative and safe to leave alone. Tighten view_processing_timeout_seconds (for example to 300) on installations where individual views are known to complete quickly and you want stalls to surface sooner.
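
The idea behind view_processing_timeout_seconds, a wall-clock cap after which a stalled task is abandoned while the run continues, can be sketched with Python's concurrent.futures. This is not the connector's code: the function name is hypothetical, and measuring the cap from submission time rather than from the moment a worker picks the task up is a simplification.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError
from typing import Callable, Dict, List, Tuple


def process_views_with_timeout(
    views: Dict[str, Callable[[], str]],
    timeout_seconds: float,
    max_workers: int = 4,
) -> Tuple[Dict[str, str], List[str]]:
    """Run each view task in a pool; return (results, abandoned view names)."""
    completed: Dict[str, str] = {}
    stalled: List[str] = []
    pool = ThreadPoolExecutor(max_workers=max_workers)
    try:
        submitted = {
            name: (pool.submit(task), time.monotonic())
            for name, task in views.items()
        }
        for name, (future, started_at) in submitted.items():
            if timeout_seconds > 0:
                remaining = max(0.0, started_at + timeout_seconds - time.monotonic())
            else:
                remaining = None  # 0 disables the cap
            try:
                completed[name] = future.result(timeout=remaining)
            except FutureTimeoutError:
                # Abandoned: the worker thread keeps running in the background.
                stalled.append(name)
    finally:
        pool.shutdown(wait=False)  # do not block on abandoned tasks
    return completed, stalled
```

Python cannot forcibly stop a thread, so in this sketch an abandoned task still occupies its worker until the underlying call returns.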

Limitations

use_dbc_columns_for_views falls back to HELP for any view that contains derived expression columns. Views with only explicit-type columns benefit most from this option.
column_extraction_watermark must be managed manually — set it to the start time of the previous successful run. Use column_extraction_days_back instead if you want a self-maintaining schedule-relative window.
column_extraction_watermark and column_extraction_days_back are mutually exclusive. Setting both raises a validation error at startup.
Profiling capped by profiling.limit does not prioritise tables — they are profiled in the order they are returned by dbc.TablesV. Use profile_pattern to target specific schemas if order matters.

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

If lineage queries fail silently and return no results, increase request_timeout_ms. The default 2-minute timeout can be insufficient for DBC.QryLogV on busy systems with large audit logs.

Ingestion appears to stop without an error

Bulk parallel runs on large Teradata installations can appear to halt with no error or status update when a single underlying call blocks (hung DB query, dropped TCP connection, exhausted resources). To diagnose:

Enable debug logging (datahub --debug ingest run -c recipe.yml) and re-run one failing recipe. The last log line before the halt identifies the phase: a View processing heartbeat line points to the parallel view pool, a Lineage fetch stall warning points to DBC.QryLogV streaming, and silence in both points to network or pod-level termination.
Confirm the stalled-view path by checking report.num_view_processing_timeouts and report.stalled_views in the ingestion report after the run finishes. A non-zero count means the hang-protection logic abandoned one or more views; the listed views are the candidates for further investigation.
To rule out the parallel view pool entirely, re-run with max_workers: 1. If the run completes, the issue is confined to the parallel path.
For Kubernetes-hosted runs, check the executor pod for OOMKilled / CrashLoopBackOff events. Pod-level termination produces identical symptoms but cannot be addressed in the connector — provision more memory or reduce max_workers.

The defaults of view_processing_timeout_seconds: 1800, view_processing_heartbeat_seconds: 30, and lineage_fetch_stall_warning_seconds: 300 ensure that even an unattended run will surface progress information and recover from stalls on its own. See the Hang protection for bulk parallel runs section above for details on tuning these.

Code Coordinates

Class Name: datahub.ingestion.source.sql.teradata.TeradataSource
Browse on GitHub

Questions?

If you've got any questions on configuring ingestion for Teradata, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

