Hive Metastore
There are 2 sources that provide integration with Hive Metastore
| Source Module | Documentation |
|---|---|
| hive-metastore | This plugin extracts metadata for Presto views and Hive tables (external / managed), column types for each table / view, and detailed table / view property info. |
| presto-on-hive | This plugin extracts metadata for Presto views and Hive tables (external / managed), column types for each table / view, and detailed table / view property info. |
Module hive-metastore
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Asset Containers | ✅ | Enabled by default. Supported for types - Catalog, Schema. |
| Classification | ❌ | Not Supported. |
| Column-level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types - Table, View. |
| Data Profiling | ❌ | Not Supported. |
| Descriptions | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Domains | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types - Table, View. |
| Test Connection | ✅ | Enabled by default. |
This plugin extracts the following:
- Metadata for Presto views and Hive tables (external / managed)
- Column types associated with each table / view
- Detailed table / view property info
Prerequisites
The Hive Metastore source connects directly to the Hive metastore database (MySQL, PostgreSQL, etc.) to extract metadata. This approach is faster and more comprehensive than connecting via HiveServer2, especially for large deployments.
Before configuring the DataHub connector, ensure you have:
Database Access: Direct read access to the Hive metastore database (typically MySQL or PostgreSQL).
Network Access: The machine running DataHub ingestion must be able to reach your metastore database on the configured port.
Database Driver: Install the appropriate Python database driver:
# For PostgreSQL metastore
pip install 'acryl-datahub[hive]' psycopg2-binary
# For MySQL metastore
pip install 'acryl-datahub[hive]' PyMySQL
Metastore Schema Knowledge: Familiarity with your metastore database schema (typically public for PostgreSQL, or the database name for MySQL).
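To tie these prerequisites together, here is a minimal connection sketch. The hostname, credentials, and environment variable are placeholders; the scheme must match the driver you installed:
source:
  type: hive-metastore
  config:
    host_port: metastore-db.example.com:5432   # placeholder metastore DB host
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"   # use "mysql+pymysql" if you installed PyMySQL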
Required Database Permissions
The database user account used by DataHub needs read-only access to the Hive metastore tables.
PostgreSQL Metastore
-- Create a dedicated read-only user for DataHub
CREATE USER datahub_user WITH PASSWORD 'secure_password';
-- Grant connection privileges
GRANT CONNECT ON DATABASE metastore TO datahub_user;
-- Grant schema usage
GRANT USAGE ON SCHEMA public TO datahub_user;
-- Grant SELECT on metastore tables
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datahub_user;
-- Grant SELECT on future tables (for metastore upgrades)
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datahub_user;
MySQL Metastore
-- Create a dedicated read-only user for DataHub
CREATE USER 'datahub_user'@'%' IDENTIFIED BY 'secure_password';
-- Grant SELECT privileges on metastore database
GRANT SELECT ON metastore.* TO 'datahub_user'@'%';
-- Apply changes
FLUSH PRIVILEGES;
Required Metastore Tables
DataHub queries the following metastore tables:
| Table | Purpose |
|---|---|
DBS | Database/schema information |
TBLS | Table metadata |
TABLE_PARAMS | Table properties (including view definitions) |
SDS | Storage descriptor (location, format) |
COLUMNS_V2 | Column metadata |
PARTITION_KEYS | Partition information |
SERDES | Serialization/deserialization information |
Recommendation: Grant SELECT on all metastore tables to ensure compatibility with different Hive versions and for future DataHub enhancements.
Authentication
PostgreSQL
Standard Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
SSL Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
options:
connect_args:
sslmode: require
sslrootcert: /path/to/ca-cert.pem
MySQL
Standard Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:3306
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "mysql+pymysql" # Default if not specified
SSL Connection:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:3306
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "mysql+pymysql"
options:
connect_args:
ssl:
ca: /path/to/ca-cert.pem
cert: /path/to/client-cert.pem
key: /path/to/client-key.pem
Amazon RDS (PostgreSQL or MySQL)
For AWS RDS-hosted metastore databases:
source:
type: hive-metastore
config:
host_port: metastore.abc123.us-east-1.rds.amazonaws.com:5432
database: metastore
username: datahub_user
password: ${RDS_PASSWORD}
scheme: "postgresql+psycopg2" # or 'mysql+pymysql'
options:
connect_args:
sslmode: require # RDS requires SSL
Azure Database for PostgreSQL/MySQL
source:
type: hive-metastore
config:
host_port: metastore-server.postgres.database.azure.com:5432
database: metastore
username: datahub_user@metastore-server # Note: Azure requires @server-name suffix
password: ${AZURE_DB_PASSWORD}
scheme: "postgresql+psycopg2"
options:
connect_args:
sslmode: require
Storage Lineage
The Hive Metastore connector supports the same storage lineage features as the Hive connector, with enhanced performance due to direct database access.
Quick Start
Enable storage lineage with minimal configuration:
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
# Enable storage lineage
emit_storage_lineage: true
Configuration Options
Storage lineage is controlled by the same parameters as the Hive connector:
| Parameter | Type | Default | Description |
|---|---|---|---|
emit_storage_lineage | boolean | false | Master toggle to enable/disable storage lineage |
hive_storage_lineage_direction | string | "upstream" | Direction: "upstream" (storage → Hive) or "downstream" (Hive → storage) |
include_column_lineage | boolean | true | Enable column-level lineage from storage paths to Hive columns |
storage_platform_instance | string | None | Platform instance for storage URNs (e.g., "prod-s3", "dev-hdfs") |
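As a combined illustration of these options, the sketch below enables storage lineage in the downstream direction and pins storage URNs to a named instance. The host, credentials, and instance name are placeholders:
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"
    # Storage lineage options from the table above
    emit_storage_lineage: true
    hive_storage_lineage_direction: downstream   # Hive -> storage
    include_column_lineage: true                 # column-level lineage to storage
    storage_platform_instance: "prod-s3"         # illustrative platform instance name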
Supported Storage Platforms
All storage platforms supported by the Hive connector are also supported here:
- Amazon S3 (s3://, s3a://, s3n://)
- HDFS (hdfs://)
- Google Cloud Storage (gs://)
- Azure Blob Storage (wasb://, wasbs://)
- Azure Data Lake (adl://, abfs://, abfss://)
- Databricks File System (dbfs://)
- Local File System (file://)
See the sections above for complete configuration details.
Presto and Trino View Support
A key advantage of the Hive Metastore connector is its ability to extract metadata from Presto and Trino views that are stored in the metastore.
How It Works
- View Detection: The connector identifies views by checking the TABLE_PARAMS table for Presto/Trino view definitions.
- View Parsing: Presto/Trino view JSON is parsed to extract:
  - Original SQL text
  - Referenced tables
  - Column metadata and types
- Lineage Extraction: SQL is parsed using sqlglot to create table-to-view lineage.
- Storage Lineage Integration: If storage lineage is enabled, the connector also creates lineage from storage → tables → views.
Configuration
Presto/Trino view support is automatically enabled when ingesting from a metastore that contains Presto/Trino views. No additional configuration is required.
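If you want to be explicit, the view-related options below (shown at their defaults, per the configuration reference later in this page) control how view definitions are read and parsed. This is a sketch, not required configuration:
source:
  type: hive-metastore
  config:
    # ... connection config ...
    include_views: true                  # ingest view entities
    include_view_lineage: true           # parse view SQL for table -> view lineage
    include_view_column_lineage: true    # column-level lineage for views
    use_file_backed_cache: true          # file-backed cache for view definitions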
Example
source:
type: hive-metastore
config:
host_port: metastore-db.company.com:5432
database: metastore
username: datahub_user
password: ${METASTORE_PASSWORD}
scheme: "postgresql+psycopg2"
# Enable storage lineage for complete lineage chain
emit_storage_lineage: true
This configuration will create complete lineage:
S3 Bucket → Hive Table → Presto View
Limitations
- Presto/Trino Version: The connector supports Presto 0.200+ and Trino view formats
- Complex SQL: Very complex SQL with non-standard syntax may have incomplete lineage
- Cross-Database References: Lineage is extracted for references within the same Hive metastore
Schema Filtering
For large metastore deployments with many databases, use filtering to limit ingestion scope:
Database Filtering
source:
type: hive-metastore
config:
# ... connection config ...
# Only ingest from specific databases
schema_pattern:
allow:
- "^production_.*" # All databases starting with production_
- "analytics" # Specific database
deny:
- ".*_test$" # Exclude test databases
Table Filtering with SQL
For more complex filtering, use SQL WHERE clauses:
source:
type: hive-metastore
config:
# ... connection config ...
# Custom SQL filter for tables
tables_where_clause_suffix: 'AND d."name" in (''production_db'', ''analytics_db'')'
# Custom SQL filter for views
views_where_clause_suffix: 'AND d."name" in (''production_db'', ''analytics_db'')'
Note: The WHERE clause suffix is appended to the internal query. Use proper SQL syntax for your database type (e.g., quoted identifiers for PostgreSQL).
Performance Considerations
Advantages Over HiveServer2 Connector
The Hive Metastore connector is significantly faster than the Hive connector because:
- Direct Database Access: No HiveServer2 overhead
- Batch Queries: Fetches all metadata in optimized SQL queries
- No Query Execution: Doesn't run Hive queries to extract metadata
- Parallel Processing: Can process multiple databases concurrently
Performance Comparison (approximate):
- 10 databases, 1000 tables: ~2 minutes (Metastore) vs ~15 minutes (HiveServer2)
- 100 databases, 10,000 tables: ~15 minutes (Metastore) vs ~2 hours (HiveServer2)
Optimization Tips
Database Connection Pooling: The connector uses SQLAlchemy's default connection pooling. For very large deployments, consider tuning pool size:
options:
  pool_size: 10
  max_overflow: 20
Schema Filtering: Use schema_pattern to limit scope and reduce query time.
Stateful Ingestion: Enable to only process changes:
stateful_ingestion:
  enabled: true
  remove_stale_metadata: true
Disable Column Lineage: If not needed:
emit_storage_lineage: true
include_column_lineage: false # Faster
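A combined sketch that applies several of these tips in one recipe. Pool sizes and patterns are illustrative starting points, and a stable pipeline_name is assumed for stateful ingestion:
pipeline_name: hive_metastore_prod   # stateful ingestion tracks state per pipeline name
source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"
    options:
      pool_size: 10        # tune SQLAlchemy pooling for large deployments
      max_overflow: 20
    schema_pattern:
      allow:
        - "^production_.*" # limit scope to reduce query time
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    emit_storage_lineage: true
    include_column_lineage: false   # skip column-level storage lineage for speed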
Network Considerations
- Latency: Low latency to the metastore database is important
- Bandwidth: Minimal bandwidth required (only metadata, no data transfer)
- Connection Limits: Ensure metastore database can handle additional read connections
Platform Instances
When ingesting from multiple metastores (e.g., different clusters or environments), use platform_instance:
source:
type: hive-metastore
config:
host_port: prod-metastore-db.company.com:5432
database: metastore
platform_instance: "prod-hive"
Best Practice: Combine with storage_platform_instance:
source:
type: hive-metastore
config:
platform_instance: "prod-hive" # Hive tables
storage_platform_instance: "prod-hdfs" # Storage locations
emit_storage_lineage: true
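Filling in connection details, a fuller sketch for a production metastore with both instance names set (the names are examples):
source:
  type: hive-metastore
  config:
    host_port: prod-metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"
    platform_instance: "prod-hive"           # Hive tables
    storage_platform_instance: "prod-hdfs"   # storage locations
    emit_storage_lineage: true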
Caveats and Limitations
Metastore Schema Compatibility
- Hive Versions: Tested with Hive 1.x, 2.x, and 3.x metastore schemas
- Schema Variations: Different Hive versions may have slightly different metastore schemas
- Custom Tables: If your organization has added custom metastore tables, they won't be processed
Database Support
- Supported: PostgreSQL, MySQL, MariaDB
- Not Supported: Oracle, MSSQL (may work but untested)
- Derby: Not recommended (embedded metastore, typically single-user)
View Lineage Parsing
- Simple SQL: Fully supported with accurate lineage
- Complex SQL: Best-effort parsing; some edge cases may have incomplete lineage
- Non-standard SQL: Presto/Trino-specific functions may not be fully parsed
Permissions Limitations
- Read-Only: The connector only needs SELECT permissions
- No Write Operations: Never requires INSERT, UPDATE, or DELETE
- Metastore Locks: Read operations don't acquire metastore locks
Storage Lineage Limitations
Same as the Hive connector:
- Only tables with defined storage locations have lineage
- Temporary tables are not supported
- Partition-level lineage is aggregated at table level
Known Issues
Large Column Lists: Tables with 500+ columns may be slow to process due to metastore query complexity.
View Definition Encoding: Some older Hive versions store view definitions in non-UTF-8 encoding, which may cause parsing issues.
Case Sensitivity:
- PostgreSQL metastore: Case-sensitive identifiers (use "quoted" names in WHERE clauses)
- MySQL metastore: Case-insensitive by default
- DataHub automatically lowercases URNs for consistency
Concurrent Metastore Writes: If the metastore is being actively modified during ingestion, some metadata may be inconsistent.
Troubleshooting
Connection Issues
Problem: Could not connect to metastore database
Solutions:
- Verify host_port, database, and scheme are correct
- Check network connectivity: telnet <host> <port>
- Verify firewall rules allow connections
- For PostgreSQL: Check pg_hba.conf allows connections from your IP
- For MySQL: Check bind-address in my.cnf
Authentication Failures
Problem: Authentication failed or Access denied
Solutions:
- Verify username and password are correct
- Check user has CONNECT/LOGIN privileges
- For Azure: Ensure username includes the @server-name suffix
- Review database logs for detailed error messages
Missing Tables
Problem: Not all tables appear in DataHub
Solutions:
- Verify database user has SELECT on all metastore tables
- Check if tables are filtered by schema_pattern or WHERE clauses
- Query metastore directly to verify tables exist:
SELECT d.name as db_name, t.tbl_name as table_name, t.tbl_type
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
WHERE d.name = 'your_database';
Presto/Trino Views Not Appearing
Problem: Views defined in Presto/Trino don't show up
Solutions:
- Check view definitions exist in metastore:
SELECT d.name as db_name, t.tbl_name as view_name, tp.param_value
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
JOIN TABLE_PARAMS tp ON t.tbl_id = tp.tbl_id
WHERE t.tbl_type = 'VIRTUAL_VIEW'
AND tp.param_key = 'presto_view'
LIMIT 10;
- Review ingestion logs for parsing errors
- Verify view JSON is valid
Storage Lineage Not Appearing
Problem: No storage lineage relationships visible
Solutions:
- Verify emit_storage_lineage: true is set
- Check tables have storage locations in metastore:
SELECT d.name as db_name, t.tbl_name as table_name, s.location
FROM TBLS t
JOIN DBS d ON t.db_id = d.db_id
JOIN SDS s ON t.sd_id = s.sd_id
WHERE s.location IS NOT NULL
LIMIT 10;
- Review logs for "Failed to parse storage location" warnings
- See the "Storage Lineage" section above for troubleshooting tips
Slow Ingestion
Problem: Ingestion takes too long
Solutions:
- Use schema filtering to reduce scope
- Enable stateful ingestion to only process changes
- Check database query performance (may need indexes on metastore tables)
- Ensure low latency network connection to metastore database
- Consider disabling column lineage if not needed
Related Documentation
- Hive Metastore Configuration - Configuration examples
- Hive Connector - Alternative connector via HiveServer2
- SQLAlchemy Documentation - Underlying database connection library
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
type: hive-metastore
config:
# Hive metastore DB connection
host_port: localhost:5432
database: metastore
# specify the schema where metastore tables reside
schema_pattern:
allow:
- "^public"
# credentials
username: user # optional
password: pass # optional
#scheme: 'postgresql+psycopg2' # set this if metastore db is using postgres
#scheme: 'mysql+pymysql' # set this if metastore db is using mysql, default if unset
# set this to have advanced filters on what to ingest
#views_where_clause_suffix: AND d."name" in ('db1')
#tables_where_clause_suffix: AND d."name" in ('db1')
# Storage Lineage Configuration (Optional)
# Enables lineage between Hive tables and their underlying storage locations
#emit_storage_lineage: false # Set to true to enable storage lineage
#hive_storage_lineage_direction: upstream # Direction: 'upstream' (storage -> Hive) or 'downstream' (Hive -> storage)
#include_column_lineage: true # Set to false to disable column-level lineage
#storage_platform_instance: "prod-hdfs" # Optional: platform instance for storage URNs
sink:
# sink configs
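For completeness, a common sink block to pair with the recipe above is the REST sink, assuming a DataHub GMS endpoint at http://localhost:8080 (adjust for your deployment):
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"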
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
convert_urns_to_lowercase boolean | Whether to convert dataset urns to lowercase. Default: False |
database One of string, null | database (catalog) Default: None |
emit_storage_lineage boolean | Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.). Default: False |
enable_properties_merge boolean | By default, the connector enables merging of properties with what exists on the server. Set this to False to enable the default connector behavior of overwriting properties on each ingestion. Default: True |
hive_storage_lineage_direction Enum | One of: "upstream", "downstream" |
host_port string | Host URL and port to connect to. Example: localhost:3306 Default: localhost:3306 |
include_catalog_name_in_ids boolean | Add the Presto catalog name (e.g. hive) to the generated dataset urns. urn:li:dataset:(urn:li:dataPlatform:hive,hive.user.logging_events,PROD) versus urn:li:dataset:(urn:li:dataPlatform:hive,user.logging_events,PROD) Default: False |
include_column_lineage boolean | When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields. Default: True |
include_table_location_lineage boolean | If the source supports it, include table lineage to the underlying storage location. Default: True |
include_tables boolean | Whether tables should be ingested. Default: True |
include_view_column_lineage boolean | Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled. Default: True |
include_view_lineage boolean | Whether to extract lineage from Hive views. Requires parsing the view definition SQL. Default: True |
include_views boolean | Whether views should be ingested. Default: True |
incremental_lineage boolean | When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False |
ingestion_job_id string | Default: |
metastore_db_name One of string, null | Name of the Hive metastore's database (usually: metastore). For backward compatibility, if this field is not provided, the database field will be used. If both the 'database' and 'metastore_db_name' fields are set then the 'database' field will be used to filter the hive/presto/trino database Default: None |
mode Enum | One of: "hive", "presto", "presto-on-hive", "trino" |
options object | Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args. |
password One of string(password), null | password Default: None |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
schemas_where_clause_suffix string | Where clause to specify what Hive schemas should be ingested. Default: |
simplify_nested_field_paths boolean | Simplify v2 field paths to v1 by default. If the schema has Union or Array types, still falls back to v2 Default: False |
sqlalchemy_uri One of string, null | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. Default: None |
storage_platform_instance One of string, null | Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets. Default: None |
tables_where_clause_suffix string | Where clause to specify what Hive tables should be ingested. Default: |
use_catalog_subtype boolean | Container Subtype name to be 'Database' or 'Catalog' Valid options: ['True', 'False'] Default: True |
use_dataset_pascalcase_subtype boolean | Dataset Subtype name to be 'Table' or 'View' Valid options: ['True', 'False'] Default: False |
use_file_backed_cache boolean | Whether to use a file backed cache for the view definitions. Default: True |
username One of string, null | username Default: None |
views_where_clause_suffix string | Where clause to specify what Presto views should be ingested. Default: |
env string | The environment that all assets produced by this connector belong to Default: PROD |
database_pattern AllowDenyPattern | A class to store allow deny regexes |
database_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain map(str,AllowDenyPattern) | A class to store allow deny regexes |
domain.key.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
domain.key.allow.string string | |
domain.key.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain.key.deny array | List of regex patterns to exclude from ingestion. Default: [] |
domain.key.deny.string string | |
profile_pattern AllowDenyPattern | A class to store allow deny regexes |
profile_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
schema_pattern AllowDenyPattern | A class to store allow deny regexes |
schema_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
table_pattern AllowDenyPattern | A class to store allow deny regexes |
table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
view_pattern AllowDenyPattern | A class to store allow deny regexes |
view_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification ClassificationConfig | |
classification.enabled boolean | Whether classification should be used to auto-detect glossary terms Default: False |
classification.info_type_to_term map(str,string) | |
classification.max_workers integer | Number of worker processes to use for classification. Set to 1 to disable. Default: 4 |
classification.sample_size integer | Number of sample values used for classification. Default: 100 |
classification.classifiers array | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance. Default: [{'type': 'datahub', 'config': None}] |
classification.classifiers.DynamicTypedClassifierConfig DynamicTypedClassifierConfig | |
classification.classifiers.DynamicTypedClassifierConfig.type ❓ string | The type of the classifier to use. For DataHub, use datahub |
classification.classifiers.DynamicTypedClassifierConfig.config One of object, null | The configuration required for initializing the classifier. If not specified, uses defaults for classifer type. Default: None |
classification.column_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.column_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification.table_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
profiling GEProfilingConfig | |
profiling.catch_exceptions boolean | Default: True |
profiling.enabled boolean | Whether profiling should be done. Default: False |
profiling.field_sample_values_limit integer | Upper limit for number of sample values to collect for all columns. Default: 20 |
profiling.include_field_distinct_count boolean | Whether to profile for the number of distinct values for each column. Default: True |
profiling.include_field_distinct_value_frequencies boolean | Whether to profile for distinct value frequencies. Default: False |
profiling.include_field_histogram boolean | Whether to profile for the histogram for numeric fields. Default: False |
profiling.include_field_max_value boolean | Whether to profile for the max value of numeric columns. Default: True |
profiling.include_field_mean_value boolean | Whether to profile for the mean value of numeric columns. Default: True |
profiling.include_field_median_value boolean | Whether to profile for the median value of numeric columns. Default: True |
profiling.include_field_min_value boolean | Whether to profile for the min value of numeric columns. Default: True |
profiling.include_field_null_count boolean | Whether to profile for the number of nulls for each column. Default: True |
profiling.include_field_quantiles boolean | Whether to profile for the quantiles of numeric columns. Default: False |
profiling.include_field_sample_values boolean | Whether to profile for the sample values for all columns. Default: True |
profiling.include_field_stddev_value boolean | Whether to profile for the standard deviation of numeric columns. Default: True |
profiling.limit One of integer, null | Max number of documents to profile. By default, profiles all documents. Default: None |
profiling.max_number_of_fields_to_profile One of integer, null | A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. Default: None |
profiling.max_workers integer | Number of worker threads to use for profiling. Set to 1 to disable. Default: 20 |
profiling.offset One of integer, null | Offset in documents to profile. By default, uses no offset. Default: None |
profiling.partition_datetime One of string(date-time), null | If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this. Default: None |
profiling.partition_profiling_enabled boolean | Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling. Default: True |
profiling.profile_external_tables boolean | Whether to profile external tables. Only Snowflake and Redshift supports this. Default: False |
profiling.profile_if_updated_since_days One of number, null | Profile table only if it has been updated since these many number of days. If set to null, no constraint of last modified time for tables to profile. Supported only in snowflake and BigQuery. Default: None |
profiling.profile_nested_fields boolean | Whether to profile complex types like structs, arrays and maps. Default: False |
profiling.profile_table_level_only boolean | Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False |
profiling.profile_table_row_count_estimate_only boolean | Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. Default: False |
profiling.profile_table_row_limit One of integer, null | Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats. Default: 5000000 |
profiling.profile_table_size_limit One of integer, null | Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats. Default: 5 |
profiling.query_combiner_enabled boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True |
profiling.report_dropped_profiles boolean | Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes. Default: False |
profiling.sample_size integer | Number of rows to be sampled from table for column level profiling.Applicable only if use_sampling is set to True. Default: 10000 |
profiling.turn_off_expensive_profiling_metrics boolean | Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False |
profiling.use_sampling boolean | Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Default: True |
profiling.operation_config OperationConfig | |
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling. Default: False |
profiling.operation_config.profile_date_of_month One of integer, null | Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect. Default: None |
profiling.operation_config.profile_day_of_week One of integer, null | Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect. Default: None |
profiling.tags_to_ignore_sampling One of array, null | Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling. Default: None |
profiling.tags_to_ignore_sampling.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"ClassificationConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether classification should be used to auto-detect glossary terms",
"title": "Enabled",
"type": "boolean"
},
"sample_size": {
"default": 100,
"description": "Number of sample values used for classification.",
"title": "Sample Size",
"type": "integer"
},
"max_workers": {
"default": 4,
"description": "Number of worker processes to use for classification. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"column_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format."
},
"info_type_to_term": {
"additionalProperties": {
"type": "string"
},
"default": {},
"description": "Optional mapping to provide glossary term identifier for info type",
"title": "Info Type To Term",
"type": "object"
},
"classifiers": {
"default": [
{
"type": "datahub",
"config": null
}
],
"description": "Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.",
"items": {
"$ref": "#/$defs/DynamicTypedClassifierConfig"
},
"title": "Classifiers",
"type": "array"
}
},
"title": "ClassificationConfig",
"type": "object"
},
"DynamicTypedClassifierConfig": {
"additionalProperties": false,
"properties": {
"type": {
"description": "The type of the classifier to use. For DataHub, use `datahub`",
"title": "Type",
"type": "string"
},
"config": {
"anyOf": [
{},
{
"type": "null"
}
],
"default": null,
"description": "The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.",
"title": "Config"
}
},
"required": [
"type"
],
"title": "DynamicTypedClassifierConfig",
"type": "object"
},
"GEProfilingConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether profiling should be done.",
"title": "Enabled",
"type": "boolean"
},
"operation_config": {
"$ref": "#/$defs/OperationConfig",
"description": "Experimental feature. To specify operation configs."
},
"limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Max number of documents to profile. By default, profiles all documents.",
"title": "Limit"
},
"offset": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Offset in documents to profile. By default, uses no offset.",
"title": "Offset"
},
"profile_table_level_only": {
"default": false,
"description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
"title": "Profile Table Level Only",
"type": "boolean"
},
"include_field_null_count": {
"default": true,
"description": "Whether to profile for the number of nulls for each column.",
"title": "Include Field Null Count",
"type": "boolean"
},
"include_field_distinct_count": {
"default": true,
"description": "Whether to profile for the number of distinct values for each column.",
"title": "Include Field Distinct Count",
"type": "boolean"
},
"include_field_min_value": {
"default": true,
"description": "Whether to profile for the min value of numeric columns.",
"title": "Include Field Min Value",
"type": "boolean"
},
"include_field_max_value": {
"default": true,
"description": "Whether to profile for the max value of numeric columns.",
"title": "Include Field Max Value",
"type": "boolean"
},
"include_field_mean_value": {
"default": true,
"description": "Whether to profile for the mean value of numeric columns.",
"title": "Include Field Mean Value",
"type": "boolean"
},
"include_field_median_value": {
"default": true,
"description": "Whether to profile for the median value of numeric columns.",
"title": "Include Field Median Value",
"type": "boolean"
},
"include_field_stddev_value": {
"default": true,
"description": "Whether to profile for the standard deviation of numeric columns.",
"title": "Include Field Stddev Value",
"type": "boolean"
},
"include_field_quantiles": {
"default": false,
"description": "Whether to profile for the quantiles of numeric columns.",
"title": "Include Field Quantiles",
"type": "boolean"
},
"include_field_distinct_value_frequencies": {
"default": false,
"description": "Whether to profile for distinct value frequencies.",
"title": "Include Field Distinct Value Frequencies",
"type": "boolean"
},
"include_field_histogram": {
"default": false,
"description": "Whether to profile for the histogram for numeric fields.",
"title": "Include Field Histogram",
"type": "boolean"
},
"include_field_sample_values": {
"default": true,
"description": "Whether to profile for the sample values for all columns.",
"title": "Include Field Sample Values",
"type": "boolean"
},
"max_workers": {
"default": 20,
"description": "Number of worker threads to use for profiling. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"report_dropped_profiles": {
"default": false,
"description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
"title": "Report Dropped Profiles",
"type": "boolean"
},
"turn_off_expensive_profiling_metrics": {
"default": false,
"description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.",
"title": "Turn Off Expensive Profiling Metrics",
"type": "boolean"
},
"field_sample_values_limit": {
"default": 20,
"description": "Upper limit for number of sample values to collect for all columns.",
"title": "Field Sample Values Limit",
"type": "integer"
},
"max_number_of_fields_to_profile": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
"title": "Max Number Of Fields To Profile"
},
"profile_if_updated_since_days": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery"
]
},
"title": "Profile If Updated Since Days"
},
"profile_table_size_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5,
"description": "Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `Snowflake`, `BigQuery` and `Databricks`. Supported for `Oracle` based on calculated size from gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"unity-catalog",
"oracle"
]
},
"title": "Profile Table Size Limit"
},
"profile_table_row_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5000000,
"description": "Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"oracle"
]
},
"title": "Profile Table Row Limit"
},
"profile_table_row_count_estimate_only": {
"default": false,
"description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ",
"schema_extra": {
"supported_sources": [
"postgres",
"mysql"
]
},
"title": "Profile Table Row Count Estimate Only",
"type": "boolean"
},
"query_combiner_enabled": {
"default": true,
"description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
"title": "Query Combiner Enabled",
"type": "boolean"
},
"catch_exceptions": {
"default": true,
"description": "",
"title": "Catch Exceptions",
"type": "boolean"
},
"partition_profiling_enabled": {
"default": true,
"description": "Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.",
"schema_extra": {
"supported_sources": [
"athena",
"bigquery"
]
},
"title": "Partition Profiling Enabled",
"type": "boolean"
},
"partition_datetime": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.",
"schema_extra": {
"supported_sources": [
"bigquery"
]
},
"title": "Partition Datetime"
},
"use_sampling": {
"default": true,
"description": "Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Use Sampling",
"type": "boolean"
},
"sample_size": {
"default": 10000,
"description": "Number of rows to be sampled from table for column level profiling.Applicable only if `use_sampling` is set to True.",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Sample Size",
"type": "integer"
},
"profile_external_tables": {
"default": false,
"description": "Whether to profile external tables. Only Snowflake and Redshift supports this.",
"schema_extra": {
"supported_sources": [
"redshift",
"snowflake"
]
},
"title": "Profile External Tables",
"type": "boolean"
},
"tags_to_ignore_sampling": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on `use_sampling`.",
"title": "Tags To Ignore Sampling"
},
"profile_nested_fields": {
"default": false,
"description": "Whether to profile complex types like structs, arrays and maps. ",
"title": "Profile Nested Fields",
"type": "boolean"
}
},
"title": "GEProfilingConfig",
"type": "object"
},
"HiveMetastoreConfigMode": {
"enum": [
"hive",
"presto",
"presto-on-hive",
"trino"
],
"title": "HiveMetastoreConfigMode",
"type": "string"
},
"LineageDirection": {
"description": "Direction of lineage relationship between storage and Hive",
"enum": [
"upstream",
"downstream"
],
"title": "LineageDirection",
"type": "string"
},
"OperationConfig": {
"additionalProperties": false,
"properties": {
"lower_freq_profile_enabled": {
"default": false,
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"title": "Lower Freq Profile Enabled",
"type": "boolean"
},
"profile_day_of_week": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Day Of Week"
},
"profile_date_of_month": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Date Of Month"
}
},
"title": "OperationConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"properties": {
"emit_storage_lineage": {
"default": false,
"description": "Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).",
"title": "Emit Storage Lineage",
"type": "boolean"
},
"hive_storage_lineage_direction": {
"$ref": "#/$defs/LineageDirection",
"default": "upstream",
"description": "Direction of storage lineage. If 'upstream', storage is treated as upstream to Hive (data flows from storage to Hive). If 'downstream', storage is downstream to Hive (data flows from Hive to storage)."
},
"include_column_lineage": {
"default": true,
"description": "When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.",
"title": "Include Column Lineage",
"type": "boolean"
},
"storage_platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.",
"title": "Storage Platform Instance"
},
"schema_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"view_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"classification": {
"$ref": "#/$defs/ClassificationConfig",
"default": {
"enabled": false,
"sample_size": 100,
"max_workers": 4,
"table_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"column_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"info_type_to_term": {},
"classifiers": [
{
"config": null,
"type": "datahub"
}
]
},
"description": "For details, refer to [Classification](../../../../metadata-ingestion/docs/dev_guides/classification.md)."
},
"incremental_lineage": {
"default": false,
"description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
"title": "Incremental Lineage",
"type": "boolean"
},
"convert_urns_to_lowercase": {
"default": false,
"description": "Whether to convert dataset urns to lowercase.",
"title": "Convert Urns To Lowercase",
"type": "boolean"
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null
},
"options": {
"additionalProperties": true,
"description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
"title": "Options",
"type": "object"
},
"profile_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered."
},
"domain": {
"additionalProperties": {
"$ref": "#/$defs/AllowDenyPattern"
},
"default": {},
"description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.",
"title": "Domain",
"type": "object"
},
"include_views": {
"default": true,
"description": "Whether views should be ingested.",
"title": "Include Views",
"type": "boolean"
},
"include_tables": {
"default": true,
"description": "Whether tables should be ingested.",
"title": "Include Tables",
"type": "boolean"
},
"include_table_location_lineage": {
"default": true,
"description": "If the source supports it, include table lineage to the underlying storage location.",
"title": "Include Table Location Lineage",
"type": "boolean"
},
"include_view_lineage": {
"default": true,
"description": "Whether to extract lineage from Hive views. Requires parsing the view definition SQL.",
"title": "Include View Lineage",
"type": "boolean"
},
"include_view_column_lineage": {
"default": true,
"description": "Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled.",
"title": "Include View Column Lineage",
"type": "boolean"
},
"use_file_backed_cache": {
"default": true,
"description": "Whether to use a file backed cache for the view definitions.",
"title": "Use File Backed Cache",
"type": "boolean"
},
"profiling": {
"$ref": "#/$defs/GEProfilingConfig",
"default": {
"enabled": false,
"operation_config": {
"lower_freq_profile_enabled": false,
"profile_date_of_month": null,
"profile_day_of_week": null
},
"limit": null,
"offset": null,
"profile_table_level_only": false,
"include_field_null_count": true,
"include_field_distinct_count": true,
"include_field_min_value": true,
"include_field_max_value": true,
"include_field_mean_value": true,
"include_field_median_value": true,
"include_field_stddev_value": true,
"include_field_quantiles": false,
"include_field_distinct_value_frequencies": false,
"include_field_histogram": false,
"include_field_sample_values": true,
"max_workers": 20,
"report_dropped_profiles": false,
"turn_off_expensive_profiling_metrics": false,
"field_sample_values_limit": 20,
"max_number_of_fields_to_profile": null,
"profile_if_updated_since_days": null,
"profile_table_size_limit": 5,
"profile_table_row_limit": 5000000,
"profile_table_row_count_estimate_only": false,
"query_combiner_enabled": true,
"catch_exceptions": true,
"partition_profiling_enabled": true,
"partition_datetime": null,
"use_sampling": true,
"sample_size": 10000,
"profile_external_tables": false,
"tags_to_ignore_sampling": null,
"profile_nested_fields": false
}
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "username",
"title": "Username"
},
"password": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "password",
"title": "Password"
},
"host_port": {
"default": "localhost:3306",
"description": "Host URL and port to connect to. Example: localhost:3306",
"title": "Host Port",
"type": "string"
},
"database": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "database (catalog)",
"title": "Database"
},
"sqlalchemy_uri": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
"title": "Sqlalchemy Uri"
},
"views_where_clause_suffix": {
"default": "",
"description": "Where clause to specify what Presto views should be ingested.",
"title": "Views Where Clause Suffix",
"type": "string"
},
"tables_where_clause_suffix": {
"default": "",
"description": "Where clause to specify what Hive tables should be ingested.",
"title": "Tables Where Clause Suffix",
"type": "string"
},
"schemas_where_clause_suffix": {
"default": "",
"description": "Where clause to specify what Hive schemas should be ingested.",
"title": "Schemas Where Clause Suffix",
"type": "string"
},
"ingestion_job_id": {
"default": "",
"title": "Ingestion Job Id",
"type": "string"
},
"database_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for hive/presto database to filter in ingestion. Specify regex to only match the database name. e.g. to match all tables in database analytics, use the regex 'analytics'"
},
"metastore_db_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the Hive metastore's database (usually: metastore). For backward compatibility, if this field is not provided, the database field will be used. If both the 'database' and 'metastore_db_name' fields are set then the 'database' field will be used to filter the hive/presto/trino database",
"title": "Metastore Db Name"
},
"mode": {
"$ref": "#/$defs/HiveMetastoreConfigMode",
"default": "hive",
"description": "The ingested data will be stored under this platform. Valid options: ['hive', 'presto', 'presto-on-hive', 'trino']"
},
"use_catalog_subtype": {
"default": true,
"description": "Container Subtype name to be 'Database' or 'Catalog' Valid options: ['True', 'False']",
"title": "Use Catalog Subtype",
"type": "boolean"
},
"use_dataset_pascalcase_subtype": {
"default": false,
"description": "Dataset Subtype name to be 'Table' or 'View' Valid options: ['True', 'False']",
"title": "Use Dataset Pascalcase Subtype",
"type": "boolean"
},
"include_catalog_name_in_ids": {
"default": false,
"description": "Add the Presto catalog name (e.g. hive) to the generated dataset urns. `urn:li:dataset:(urn:li:dataPlatform:hive,hive.user.logging_events,PROD)` versus `urn:li:dataset:(urn:li:dataPlatform:hive,user.logging_events,PROD)`",
"title": "Include Catalog Name In Ids",
"type": "boolean"
},
"enable_properties_merge": {
"default": true,
"description": "By default, the connector enables merging of properties with what exists on the server. Set this to False to enable the default connector behavior of overwriting properties on each ingestion.",
"title": "Enable Properties Merge",
"type": "boolean"
},
"simplify_nested_field_paths": {
"default": false,
"description": "Simplify v2 field paths to v1 by default. If the schema has Union or Array types, still falls back to v2",
"title": "Simplify Nested Field Paths",
"type": "boolean"
}
},
"title": "HiveMetastore",
"type": "object"
}
Code Coordinates
- Class Name:
datahub.ingestion.source.sql.hive.hive_metastore_source.HiveMetastoreSource - Browse on GitHub
Module presto-on-hive
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Asset Containers | ✅ | Enabled by default. Supported for types - Catalog, Schema. |
| Classification | ❌ | Not Supported. |
| Column-level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types - Table, View. |
| Data Profiling | ❌ | Not Supported. |
| Descriptions | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Domains | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types - Table, View. |
| Test Connection | ✅ | Enabled by default. |
This plugin extracts the following:
- Metadata for Presto views and Hive tables (external / managed)
- Column types associated with each table / view
- Detailed table / view property info
CLI based Ingestion
Config Details
- Options
- Schema
Note that a . is used to denote nested fields in the YAML recipe.
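For example (an illustrative sketch, not an official recipe), a dotted entry such as stateful_ingestion.remove_stale_metadata in the table below is written as nested keys in the recipe:

```yaml
# Dotted field names in the table map to nested YAML keys in the recipe.
source:
  config:
    stateful_ingestion:
      enabled: true                # stateful_ingestion.enabled
      remove_stale_metadata: true  # stateful_ingestion.remove_stale_metadata
```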
| Field | Description |
|---|---|
convert_urns_to_lowercase boolean | Whether to convert dataset urns to lowercase. Default: False |
database One of string, null | database (catalog) Default: None |
emit_storage_lineage boolean | Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.). Default: False |
enable_properties_merge boolean | By default, the connector enables merging of properties with what exists on the server. Set this to False to enable the default connector behavior of overwriting properties on each ingestion. Default: True |
hive_storage_lineage_direction Enum | One of: "upstream", "downstream" |
host_port string | Host URL and port to connect to. Example: localhost:3306 Default: localhost:3306 |
include_catalog_name_in_ids boolean | Add the Presto catalog name (e.g. hive) to the generated dataset urns. urn:li:dataset:(urn:li:dataPlatform:hive,hive.user.logging_events,PROD) versus urn:li:dataset:(urn:li:dataPlatform:hive,user.logging_events,PROD) Default: False |
include_column_lineage boolean | When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields. Default: True |
include_table_location_lineage boolean | If the source supports it, include table lineage to the underlying storage location. Default: True |
include_tables boolean | Whether tables should be ingested. Default: True |
include_view_column_lineage boolean | Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled. Default: True |
include_view_lineage boolean | Whether to extract lineage from Hive views. Requires parsing the view definition SQL. Default: True |
include_views boolean | Whether views should be ingested. Default: True |
incremental_lineage boolean | When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False |
ingestion_job_id string | Default: |
metastore_db_name One of string, null | Name of the Hive metastore's database (usually: metastore). For backward compatibility, if this field is not provided, the database field will be used. If both the 'database' and 'metastore_db_name' fields are set then the 'database' field will be used to filter the hive/presto/trino database Default: None |
mode Enum | One of: "hive", "presto", "presto-on-hive", "trino" |
options object | Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args. |
password One of string(password), null | password Default: None |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
schemas_where_clause_suffix string | Where clause to specify what Hive schemas should be ingested. Default: |
simplify_nested_field_paths boolean | Simplify v2 field paths to v1 by default. If the schema has Union or Array types, still falls back to v2 Default: False |
sqlalchemy_uri One of string, null | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. Default: None |
storage_platform_instance One of string, null | Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets. Default: None |
tables_where_clause_suffix string | Where clause to specify what Hive tables should be ingested. Default: |
use_catalog_subtype boolean | Container Subtype name to be 'Database' or 'Catalog' Valid options: ['True', 'False'] Default: True |
use_dataset_pascalcase_subtype boolean | Dataset Subtype name to be 'Table' or 'View' Valid options: ['True', 'False'] Default: False |
use_file_backed_cache boolean | Whether to use a file backed cache for the view definitions. Default: True |
username One of string, null | username Default: None |
views_where_clause_suffix string | Where clause to specify what Presto views should be ingested. Default: |
env string | The environment that all assets produced by this connector belong to Default: PROD |
database_pattern AllowDenyPattern | A class to store allow deny regexes |
database_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain map(str,AllowDenyPattern) | A class to store allow deny regexes |
domain.key.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
domain.key.allow.string string | |
domain.key.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain.key.deny array | List of regex patterns to exclude from ingestion. Default: [] |
domain.key.deny.string string | |
profile_pattern AllowDenyPattern | A class to store allow deny regexes |
profile_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
schema_pattern AllowDenyPattern | A class to store allow deny regexes |
schema_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
table_pattern AllowDenyPattern | A class to store allow deny regexes |
table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
view_pattern AllowDenyPattern | A class to store allow deny regexes |
view_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification ClassificationConfig | |
classification.enabled boolean | Whether classification should be used to auto-detect glossary terms Default: False |
classification.info_type_to_term map(str,string) | Optional mapping to provide glossary term identifier for info type Default: {} |
classification.max_workers integer | Number of worker processes to use for classification. Set to 1 to disable. Default: 4 |
classification.sample_size integer | Number of sample values used for classification. Default: 100 |
classification.classifiers array | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedence. Default: [{'type': 'datahub', 'config': None}] |
classification.classifiers.DynamicTypedClassifierConfig DynamicTypedClassifierConfig | |
classification.classifiers.DynamicTypedClassifierConfig.type ❓ string | The type of the classifier to use. For DataHub, use datahub |
classification.classifiers.DynamicTypedClassifierConfig.config One of object, null | The configuration required for initializing the classifier. If not specified, uses defaults for classifier type. Default: None |
classification.column_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.column_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification.table_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
profiling GEProfilingConfig | |
profiling.catch_exceptions boolean | Default: True |
profiling.enabled boolean | Whether profiling should be done. Default: False |
profiling.field_sample_values_limit integer | Upper limit for number of sample values to collect for all columns. Default: 20 |
profiling.include_field_distinct_count boolean | Whether to profile for the number of distinct values for each column. Default: True |
profiling.include_field_distinct_value_frequencies boolean | Whether to profile for distinct value frequencies. Default: False |
profiling.include_field_histogram boolean | Whether to profile for the histogram for numeric fields. Default: False |
profiling.include_field_max_value boolean | Whether to profile for the max value of numeric columns. Default: True |
profiling.include_field_mean_value boolean | Whether to profile for the mean value of numeric columns. Default: True |
profiling.include_field_median_value boolean | Whether to profile for the median value of numeric columns. Default: True |
profiling.include_field_min_value boolean | Whether to profile for the min value of numeric columns. Default: True |
profiling.include_field_null_count boolean | Whether to profile for the number of nulls for each column. Default: True |
profiling.include_field_quantiles boolean | Whether to profile for the quantiles of numeric columns. Default: False |
profiling.include_field_sample_values boolean | Whether to profile for the sample values for all columns. Default: True |
profiling.include_field_stddev_value boolean | Whether to profile for the standard deviation of numeric columns. Default: True |
profiling.limit One of integer, null | Max number of documents to profile. By default, profiles all documents. Default: None |
profiling.max_number_of_fields_to_profile One of integer, null | A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. Default: None |
profiling.max_workers integer | Number of worker threads to use for profiling. Set to 1 to disable. Default: 20 |
profiling.offset One of integer, null | Offset in documents to profile. By default, uses no offset. Default: None |
profiling.partition_datetime One of string(date-time), null | If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this. Default: None |
profiling.partition_profiling_enabled boolean | Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling. Default: True |
profiling.profile_external_tables boolean | Whether to profile external tables. Only Snowflake and Redshift supports this. Default: False |
profiling.profile_if_updated_since_days One of number, null | Profile table only if it has been updated since these many number of days. If set to null, no constraint of last modified time for tables to profile. Supported only in snowflake and BigQuery. Default: None |
profiling.profile_nested_fields boolean | Whether to profile complex types like structs, arrays and maps. Default: False |
profiling.profile_table_level_only boolean | Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False |
profiling.profile_table_row_count_estimate_only boolean | Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. Default: False |
profiling.profile_table_row_limit One of integer, null | Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats. Default: 5000000 |
profiling.profile_table_size_limit One of integer, null | Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats. Default: 5 |
profiling.query_combiner_enabled boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True |
profiling.report_dropped_profiles boolean | Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes. Default: False |
profiling.sample_size integer | Number of rows to be sampled from table for column level profiling. Applicable only if use_sampling is set to True. Default: 10000 |
profiling.turn_off_expensive_profiling_metrics boolean | Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False |
profiling.use_sampling boolean | Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Default: True |
profiling.operation_config OperationConfig | |
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling. Default: False |
profiling.operation_config.profile_date_of_month One of integer, null | Number between 1 and 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take effect. Default: None |
profiling.operation_config.profile_day_of_week One of integer, null | Number between 0 and 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take effect. Default: None |
profiling.tags_to_ignore_sampling One of array, null | Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling. Default: None |
profiling.tags_to_ignore_sampling.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
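Putting several of the fields above together, the following is a hedged example recipe for this module. The source type is assumed to match the module name, and the connection details, credentials, and platform instance values are placeholders.

```yaml
# Illustrative sketch only; adjust the type, connection details, and lineage options to your deployment.
source:
  type: presto-on-hive              # assumed to match this module's name
  config:
    host_port: metastore-host:3306  # metastore database host:port
    username: datahub_user
    password: secure_password
    metastore_db_name: metastore
    mode: hive                      # one of: hive, presto, presto-on-hive, trino

    # Storage lineage options described in the table above
    emit_storage_lineage: true
    hive_storage_lineage_direction: upstream   # storage locations are treated as upstream of Hive tables
    include_column_lineage: true
    storage_platform_instance: my-s3-instance  # hypothetical storage platform instance
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```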
The JSONSchema for this configuration is inlined below.
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"ClassificationConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether classification should be used to auto-detect glossary terms",
"title": "Enabled",
"type": "boolean"
},
"sample_size": {
"default": 100,
"description": "Number of sample values used for classification.",
"title": "Sample Size",
"type": "integer"
},
"max_workers": {
"default": 4,
"description": "Number of worker processes to use for classification. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"column_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format."
},
"info_type_to_term": {
"additionalProperties": {
"type": "string"
},
"default": {},
"description": "Optional mapping to provide glossary term identifier for info type",
"title": "Info Type To Term",
"type": "object"
},
"classifiers": {
"default": [
{
"type": "datahub",
"config": null
}
],
"description": "Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.",
"items": {
"$ref": "#/$defs/DynamicTypedClassifierConfig"
},
"title": "Classifiers",
"type": "array"
}
},
"title": "ClassificationConfig",
"type": "object"
},
"DynamicTypedClassifierConfig": {
"additionalProperties": false,
"properties": {
"type": {
"description": "The type of the classifier to use. For DataHub, use `datahub`",
"title": "Type",
"type": "string"
},
"config": {
"anyOf": [
{},
{
"type": "null"
}
],
"default": null,
"description": "The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.",
"title": "Config"
}
},
"required": [
"type"
],
"title": "DynamicTypedClassifierConfig",
"type": "object"
},
"GEProfilingConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether profiling should be done.",
"title": "Enabled",
"type": "boolean"
},
"operation_config": {
"$ref": "#/$defs/OperationConfig",
"description": "Experimental feature. To specify operation configs."
},
"limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Max number of documents to profile. By default, profiles all documents.",
"title": "Limit"
},
"offset": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Offset in documents to profile. By default, uses no offset.",
"title": "Offset"
},
"profile_table_level_only": {
"default": false,
"description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
"title": "Profile Table Level Only",
"type": "boolean"
},
"include_field_null_count": {
"default": true,
"description": "Whether to profile for the number of nulls for each column.",
"title": "Include Field Null Count",
"type": "boolean"
},
"include_field_distinct_count": {
"default": true,
"description": "Whether to profile for the number of distinct values for each column.",
"title": "Include Field Distinct Count",
"type": "boolean"
},
"include_field_min_value": {
"default": true,
"description": "Whether to profile for the min value of numeric columns.",
"title": "Include Field Min Value",
"type": "boolean"
},
"include_field_max_value": {
"default": true,
"description": "Whether to profile for the max value of numeric columns.",
"title": "Include Field Max Value",
"type": "boolean"
},
"include_field_mean_value": {
"default": true,
"description": "Whether to profile for the mean value of numeric columns.",
"title": "Include Field Mean Value",
"type": "boolean"
},
"include_field_median_value": {
"default": true,
"description": "Whether to profile for the median value of numeric columns.",
"title": "Include Field Median Value",
"type": "boolean"
},
"include_field_stddev_value": {
"default": true,
"description": "Whether to profile for the standard deviation of numeric columns.",
"title": "Include Field Stddev Value",
"type": "boolean"
},
"include_field_quantiles": {
"default": false,
"description": "Whether to profile for the quantiles of numeric columns.",
"title": "Include Field Quantiles",
"type": "boolean"
},
"include_field_distinct_value_frequencies": {
"default": false,
"description": "Whether to profile for distinct value frequencies.",
"title": "Include Field Distinct Value Frequencies",
"type": "boolean"
},
"include_field_histogram": {
"default": false,
"description": "Whether to profile for the histogram for numeric fields.",
"title": "Include Field Histogram",
"type": "boolean"
},
"include_field_sample_values": {
"default": true,
"description": "Whether to profile for the sample values for all columns.",
"title": "Include Field Sample Values",
"type": "boolean"
},
"max_workers": {
"default": 20,
"description": "Number of worker threads to use for profiling. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"report_dropped_profiles": {
"default": false,
"description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
"title": "Report Dropped Profiles",
"type": "boolean"
},
"turn_off_expensive_profiling_metrics": {
"default": false,
"description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.",
"title": "Turn Off Expensive Profiling Metrics",
"type": "boolean"
},
"field_sample_values_limit": {
"default": 20,
"description": "Upper limit for number of sample values to collect for all columns.",
"title": "Field Sample Values Limit",
"type": "integer"
},
"max_number_of_fields_to_profile": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
"title": "Max Number Of Fields To Profile"
},
"profile_if_updated_since_days": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery"
]
},
"title": "Profile If Updated Since Days"
},
"profile_table_size_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5,
"description": "Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `Snowflake`, `BigQuery` and `Databricks`. Supported for `Oracle` based on calculated size from gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"unity-catalog",
"oracle"
]
},
"title": "Profile Table Size Limit"
},
"profile_table_row_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5000000,
"description": "Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"oracle"
]
},
"title": "Profile Table Row Limit"
},
"profile_table_row_count_estimate_only": {
"default": false,
"description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ",
"schema_extra": {
"supported_sources": [
"postgres",
"mysql"
]
},
"title": "Profile Table Row Count Estimate Only",
"type": "boolean"
},
"query_combiner_enabled": {
"default": true,
"description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
"title": "Query Combiner Enabled",
"type": "boolean"
},
"catch_exceptions": {
"default": true,
"description": "",
"title": "Catch Exceptions",
"type": "boolean"
},
"partition_profiling_enabled": {
"default": true,
"description": "Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.",
"schema_extra": {
"supported_sources": [
"athena",
"bigquery"
]
},
"title": "Partition Profiling Enabled",
"type": "boolean"
},
"partition_datetime": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.",
"schema_extra": {
"supported_sources": [
"bigquery"
]
},
"title": "Partition Datetime"
},
"use_sampling": {
"default": true,
"description": "Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Use Sampling",
"type": "boolean"
},
"sample_size": {
"default": 10000,
"description": "Number of rows to be sampled from table for column level profiling.Applicable only if `use_sampling` is set to True.",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Sample Size",
"type": "integer"
},
"profile_external_tables": {
"default": false,
"description": "Whether to profile external tables. Only Snowflake and Redshift supports this.",
"schema_extra": {
"supported_sources": [
"redshift",
"snowflake"
]
},
"title": "Profile External Tables",
"type": "boolean"
},
"tags_to_ignore_sampling": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on `use_sampling`.",
"title": "Tags To Ignore Sampling"
},
"profile_nested_fields": {
"default": false,
"description": "Whether to profile complex types like structs, arrays and maps. ",
"title": "Profile Nested Fields",
"type": "boolean"
}
},
"title": "GEProfilingConfig",
"type": "object"
},
"HiveMetastoreConfigMode": {
"enum": [
"hive",
"presto",
"presto-on-hive",
"trino"
],
"title": "HiveMetastoreConfigMode",
"type": "string"
},
"LineageDirection": {
"description": "Direction of lineage relationship between storage and Hive",
"enum": [
"upstream",
"downstream"
],
"title": "LineageDirection",
"type": "string"
},
"OperationConfig": {
"additionalProperties": false,
"properties": {
"lower_freq_profile_enabled": {
"default": false,
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"title": "Lower Freq Profile Enabled",
"type": "boolean"
},
"profile_day_of_week": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Day Of Week"
},
"profile_date_of_month": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Date Of Month"
}
},
"title": "OperationConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"properties": {
"emit_storage_lineage": {
"default": false,
"description": "Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).",
"title": "Emit Storage Lineage",
"type": "boolean"
},
"hive_storage_lineage_direction": {
"$ref": "#/$defs/LineageDirection",
"default": "upstream",
"description": "Direction of storage lineage. If 'upstream', storage is treated as upstream to Hive (data flows from storage to Hive). If 'downstream', storage is downstream to Hive (data flows from Hive to storage)."
},
"include_column_lineage": {
"default": true,
"description": "When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.",
"title": "Include Column Lineage",
"type": "boolean"
},
"storage_platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.",
"title": "Storage Platform Instance"
},
"schema_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for schemas to filter in ingestion. Specify regex to only match the schema name. e.g. to match all tables in schema analytics, use the regex 'analytics'"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"view_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"classification": {
"$ref": "#/$defs/ClassificationConfig",
"default": {
"enabled": false,
"sample_size": 100,
"max_workers": 4,
"table_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"column_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"info_type_to_term": {},
"classifiers": [
{
"config": null,
"type": "datahub"
}
]
},
"description": "For details, refer to [Classification](../../../../metadata-ingestion/docs/dev_guides/classification.md)."
},
"incremental_lineage": {
"default": false,
"description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
"title": "Incremental Lineage",
"type": "boolean"
},
"convert_urns_to_lowercase": {
"default": false,
"description": "Whether to convert dataset urns to lowercase.",
"title": "Convert Urns To Lowercase",
"type": "boolean"
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null
},
"options": {
"additionalProperties": true,
"description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
"title": "Options",
"type": "object"
},
"profile_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered."
},
"domain": {
"additionalProperties": {
"$ref": "#/$defs/AllowDenyPattern"
},
"default": {},
"description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.",
"title": "Domain",
"type": "object"
},
"include_views": {
"default": true,
"description": "Whether views should be ingested.",
"title": "Include Views",
"type": "boolean"
},
"include_tables": {
"default": true,
"description": "Whether tables should be ingested.",
"title": "Include Tables",
"type": "boolean"
},
"include_table_location_lineage": {
"default": true,
"description": "If the source supports it, include table lineage to the underlying storage location.",
"title": "Include Table Location Lineage",
"type": "boolean"
},
"include_view_lineage": {
"default": true,
"description": "Whether to extract lineage from Hive views. Requires parsing the view definition SQL.",
"title": "Include View Lineage",
"type": "boolean"
},
"include_view_column_lineage": {
"default": true,
"description": "Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled.",
"title": "Include View Column Lineage",
"type": "boolean"
},
"use_file_backed_cache": {
"default": true,
"description": "Whether to use a file backed cache for the view definitions.",
"title": "Use File Backed Cache",
"type": "boolean"
},
"profiling": {
"$ref": "#/$defs/GEProfilingConfig",
"default": {
"enabled": false,
"operation_config": {
"lower_freq_profile_enabled": false,
"profile_date_of_month": null,
"profile_day_of_week": null
},
"limit": null,
"offset": null,
"profile_table_level_only": false,
"include_field_null_count": true,
"include_field_distinct_count": true,
"include_field_min_value": true,
"include_field_max_value": true,
"include_field_mean_value": true,
"include_field_median_value": true,
"include_field_stddev_value": true,
"include_field_quantiles": false,
"include_field_distinct_value_frequencies": false,
"include_field_histogram": false,
"include_field_sample_values": true,
"max_workers": 20,
"report_dropped_profiles": false,
"turn_off_expensive_profiling_metrics": false,
"field_sample_values_limit": 20,
"max_number_of_fields_to_profile": null,
"profile_if_updated_since_days": null,
"profile_table_size_limit": 5,
"profile_table_row_limit": 5000000,
"profile_table_row_count_estimate_only": false,
"query_combiner_enabled": true,
"catch_exceptions": true,
"partition_profiling_enabled": true,
"partition_datetime": null,
"use_sampling": true,
"sample_size": 10000,
"profile_external_tables": false,
"tags_to_ignore_sampling": null,
"profile_nested_fields": false
}
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "username",
"title": "Username"
},
"password": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "password",
"title": "Password"
},
"host_port": {
"default": "localhost:3306",
"description": "Host URL and port to connect to. Example: localhost:3306",
"title": "Host Port",
"type": "string"
},
"database": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "database (catalog)",
"title": "Database"
},
"sqlalchemy_uri": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
"title": "Sqlalchemy Uri"
},
"views_where_clause_suffix": {
"default": "",
"description": "Where clause to specify what Presto views should be ingested.",
"title": "Views Where Clause Suffix",
"type": "string"
},
"tables_where_clause_suffix": {
"default": "",
"description": "Where clause to specify what Hive tables should be ingested.",
"title": "Tables Where Clause Suffix",
"type": "string"
},
"schemas_where_clause_suffix": {
"default": "",
"description": "Where clause to specify what Hive schemas should be ingested.",
"title": "Schemas Where Clause Suffix",
"type": "string"
},
"ingestion_job_id": {
"default": "",
"title": "Ingestion Job Id",
"type": "string"
},
"database_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for hive/presto database to filter in ingestion. Specify regex to only match the database name. e.g. to match all tables in database analytics, use the regex 'analytics'"
},
"metastore_db_name": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Name of the Hive metastore's database (usually: metastore). For backward compatibility, if this field is not provided, the database field will be used. If both the 'database' and 'metastore_db_name' fields are set then the 'database' field will be used to filter the hive/presto/trino database",
"title": "Metastore Db Name"
},
"mode": {
"$ref": "#/$defs/HiveMetastoreConfigMode",
"default": "hive",
"description": "The ingested data will be stored under this platform. Valid options: ['hive', 'presto', 'presto-on-hive', 'trino']"
},
"use_catalog_subtype": {
"default": true,
"description": "Container Subtype name to be 'Database' or 'Catalog' Valid options: ['True', 'False']",
"title": "Use Catalog Subtype",
"type": "boolean"
},
"use_dataset_pascalcase_subtype": {
"default": false,
"description": "Dataset Subtype name to be 'Table' or 'View' Valid options: ['True', 'False']",
"title": "Use Dataset Pascalcase Subtype",
"type": "boolean"
},
"include_catalog_name_in_ids": {
"default": false,
"description": "Add the Presto catalog name (e.g. hive) to the generated dataset urns. `urn:li:dataset:(urn:li:dataPlatform:hive,hive.user.logging_events,PROD)` versus `urn:li:dataset:(urn:li:dataPlatform:hive,user.logging_events,PROD)`",
"title": "Include Catalog Name In Ids",
"type": "boolean"
},
"enable_properties_merge": {
"default": true,
"description": "By default, the connector enables merging of properties with what exists on the server. Set this to False to enable the default connector behavior of overwriting properties on each ingestion.",
"title": "Enable Properties Merge",
"type": "boolean"
},
"simplify_nested_field_paths": {
"default": false,
"description": "Simplify v2 field paths to v1 by default. If the schema has Union or Array types, still falls back to v2",
"title": "Simplify Nested Field Paths",
"type": "boolean"
}
},
"title": "HiveMetastore",
"type": "object"
}
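The AllowDenyPattern definition in the schema above (allow, deny, ignoreCase) backs the database_pattern, schema_pattern, table_pattern, and view_pattern fields. A short hedged sketch of filtering with those regexes follows; the pattern values are examples only.

```yaml
# Example filter patterns only; table_pattern matches the full database.schema.table name.
source:
  type: presto-on-hive   # assumed to match this module's name
  config:
    database_pattern:
      allow:
        - "analytics"            # ingest only the analytics database
    table_pattern:
      deny:
        - '.*\.tmp_.*'           # skip names containing a ".tmp_" segment (schema or table starting with tmp_)
      ignoreCase: true
```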
Code Coordinates
- Class Name:
datahub.ingestion.source.sql.hive.hive_metastore_source.HiveMetastoreSource - Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Hive Metastore, feel free to ping us on our Slack.