Hive Metastore

There are two sources that provide integration with Hive Metastore.

Source Module | Documentation

hive-metastore

This plugin extracts the following:

  • Metadata for Presto views and Hive tables (external / managed)
  • Column types associated with each table / view
  • Detailed table / view property info

presto-on-hive

This plugin extracts the following:

  • Metadata for Presto views and Hive tables (external / managed)
  • Column types associated with each table / view
  • Detailed table / view property info

Module hive-metastore

Certified

Important Capabilities

Capability | Status / Notes
Asset Containers | Enabled by default. Supported for types - Catalog, Schema.
Classification | Not Supported.
Column-level Lineage | Enabled by default for views via include_view_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types - Table, View.
Data Profiling | Not Supported.
Descriptions | Enabled by default.
Detect Deleted Entities | Enabled by default via stateful ingestion.
Domains | Enabled by default.
Schema Metadata | Enabled by default.
Table-Level Lineage | Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types - Table, View.
Test Connection | Enabled by default.

This plugin extracts the following:

  • Metadata for Presto views and Hive tables (external / managed)
  • Column types associated with each table / view
  • Detailed table / view property info

Prerequisites

The Hive Metastore source connects directly to the Hive metastore database (MySQL, PostgreSQL, etc.) to extract metadata. This approach is faster and more comprehensive than connecting via HiveServer2, especially for large deployments.

Before configuring the DataHub connector, ensure you have:

  1. Database Access: Direct read access to the Hive metastore database (typically MySQL or PostgreSQL).

  2. Network Access: The machine running DataHub ingestion must be able to reach your metastore database on the configured port.

  3. Database Driver: Install the appropriate Python database driver:

    # For PostgreSQL metastore
    pip install 'acryl-datahub[hive]' psycopg2-binary

    # For MySQL metastore
    pip install 'acryl-datahub[hive]' PyMySQL
  4. Metastore Schema Knowledge: Familiarity with your metastore database schema (typically public for PostgreSQL, or the database name for MySQL).
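
Before running a full ingestion, it can help to verify that the driver, network path, and credentials all work. The snippet below is a minimal sketch, not part of the connector: it opens a SQLAlchemy connection using the same scheme strings shown later on this page (host, credentials, and database are placeholders).

# connectivity_check.py - adjust scheme, host, and credentials to your environment
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://datahub_user:secure_password@metastore-db.company.com:5432/metastore"
    # For a MySQL metastore, use:
    # "mysql+pymysql://datahub_user:secure_password@metastore-db.company.com:3306/metastore"
)

with engine.connect() as conn:
    # A trivial query is enough to prove the driver, network path, and credentials work end to end.
    print(conn.execute(text("SELECT 1")).scalar())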

Required Database Permissions

The database user account used by DataHub needs read-only access to the Hive metastore tables.

PostgreSQL Metastore

-- Create a dedicated read-only user for DataHub
CREATE USER datahub_user WITH PASSWORD 'secure_password';

-- Grant connection privileges
GRANT CONNECT ON DATABASE metastore TO datahub_user;

-- Grant schema usage
GRANT USAGE ON SCHEMA public TO datahub_user;

-- Grant SELECT on metastore tables
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datahub_user;

-- Grant SELECT on future tables (for metastore upgrades)
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datahub_user;
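
To confirm the grants took effect, connect as datahub_user and run a quick read check. This is an optional sanity check; quote the table name if your metastore schema was created with case-sensitive identifiers, as most Hive-generated PostgreSQL schemas are.

-- List the tables visible in the metastore schema
SELECT tablename FROM pg_tables WHERE schemaname = 'public';

-- Verify read access to a core metastore table (quoted, case-sensitive name)
SELECT count(*) FROM "DBS";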

MySQL Metastore

-- Create a dedicated read-only user for DataHub
CREATE USER 'datahub_user'@'%' IDENTIFIED BY 'secure_password';

-- Grant SELECT privileges on metastore database
GRANT SELECT ON metastore.* TO 'datahub_user'@'%';

-- Apply changes
FLUSH PRIVILEGES;
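
An equivalent sanity check for a MySQL metastore, run as the new user (the metastore database name follows the grants above):

-- Confirm the privileges that were granted
SHOW GRANTS FOR 'datahub_user'@'%';

-- Verify read access to a core metastore table
SELECT count(*) FROM metastore.DBS;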

Required Metastore Tables

DataHub queries the following metastore tables:

Table | Purpose
DBS | Database/schema information
TBLS | Table metadata
TABLE_PARAMS | Table properties (including view definitions)
SDS | Storage descriptor (location, format)
COLUMNS_V2 | Column metadata
PARTITION_KEYS | Partition information
SERDES | Serialization/deserialization information

Recommendation: Grant SELECT on all metastore tables to ensure compatibility with different Hive versions and for future DataHub enhancements.

Authentication

PostgreSQL

Standard Connection:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"

SSL Connection:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"
    options:
      connect_args:
        sslmode: require
        sslrootcert: /path/to/ca-cert.pem

MySQL

Standard Connection:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:3306
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "mysql+pymysql" # Default if not specified

SSL Connection:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:3306
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "mysql+pymysql"
    options:
      connect_args:
        ssl:
          ca: /path/to/ca-cert.pem
          cert: /path/to/client-cert.pem
          key: /path/to/client-key.pem

Amazon RDS (PostgreSQL or MySQL)

For AWS RDS-hosted metastore databases:

source:
  type: hive-metastore
  config:
    host_port: metastore.abc123.us-east-1.rds.amazonaws.com:5432
    database: metastore
    username: datahub_user
    password: ${RDS_PASSWORD}
    scheme: "postgresql+psycopg2" # or 'mysql+pymysql'
    options:
      connect_args:
        sslmode: require # RDS requires SSL

Azure Database for PostgreSQL/MySQL

source:
  type: hive-metastore
  config:
    host_port: metastore-server.postgres.database.azure.com:5432
    database: metastore
    username: datahub_user@metastore-server # Note: Azure requires @server-name suffix
    password: ${AZURE_DB_PASSWORD}
    scheme: "postgresql+psycopg2"
    options:
      connect_args:
        sslmode: require

Storage Lineage

The Hive Metastore connector supports the same storage lineage features as the Hive connector, with enhanced performance due to direct database access.

Quick Start

Enable storage lineage with minimal configuration:

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"

    # Enable storage lineage
    emit_storage_lineage: true

Configuration Options

Storage lineage is controlled by the same parameters as the Hive connector:

Parameter | Type | Default | Description
emit_storage_lineage | boolean | false | Master toggle to enable/disable storage lineage
hive_storage_lineage_direction | string | "upstream" | Direction: "upstream" (storage → Hive) or "downstream" (Hive → storage)
include_column_lineage | boolean | true | Enable column-level lineage from storage paths to Hive columns
storage_platform_instance | string | None | Platform instance for storage URNs (e.g., "prod-s3", "dev-hdfs")
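
For reference, a recipe fragment that sets all four options might look like the sketch below (connection details are placeholders and the values shown are illustrative):

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"

    # Storage lineage settings from the table above
    emit_storage_lineage: true
    hive_storage_lineage_direction: upstream # storage -> Hive
    include_column_lineage: true # column-level lineage from storage paths
    storage_platform_instance: "prod-s3" # example platform instance for storage URNs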

Supported Storage Platforms

All storage platforms supported by the Hive connector are also supported here:

  • Amazon S3 (s3://, s3a://, s3n://)
  • HDFS (hdfs://)
  • Google Cloud Storage (gs://)
  • Azure Blob Storage (wasb://, wasbs://)
  • Azure Data Lake (adl://, abfs://, abfss://)
  • Databricks File System (dbfs://)
  • Local File System (file://)

See the sections above for complete configuration details.

Presto and Trino View Support

A key advantage of the Hive Metastore connector is its ability to extract metadata from Presto and Trino views that are stored in the metastore.

How It Works

  1. View Detection: The connector identifies views by checking the TABLE_PARAMS table for Presto/Trino view definitions.

  2. View Parsing: Presto/Trino view JSON is parsed to extract:

    • Original SQL text
    • Referenced tables
    • Column metadata and types
  3. Lineage Extraction: SQL is parsed using sqlglot to create table-to-view lineage.

  4. Storage Lineage Integration: If storage lineage is enabled, the connector also creates lineage from storage → tables → views.

Configuration

Presto/Trino view support is automatically enabled when ingesting from a metastore that contains Presto/Trino views. No additional configuration is required.

Example

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"

    # Enable storage lineage for complete lineage chain
    emit_storage_lineage: true

This configuration will create complete lineage:

S3 Bucket → Hive Table → Presto View

Limitations

  • Presto/Trino Version: The connector supports Presto 0.200+ and Trino view formats
  • Complex SQL: Very complex SQL with non-standard syntax may have incomplete lineage
  • Cross-Database References: Lineage is extracted for references within the same Hive metastore

Schema Filtering

For large metastore deployments with many databases, use filtering to limit ingestion scope:

Database Filtering

source:
  type: hive-metastore
  config:
    # ... connection config ...

    # Only ingest from specific databases
    schema_pattern:
      allow:
        - "^production_.*" # All databases starting with production_
        - "analytics" # Specific database
      deny:
        - ".*_test$" # Exclude test databases

Table Filtering with SQL

For more complex filtering, use SQL WHERE clauses:

source:
  type: hive-metastore
  config:
    # ... connection config ...

    # Custom SQL filter for tables
    tables_where_clause_suffix: 'AND d."name" in (''production_db'', ''analytics_db'')'

    # Custom SQL filter for views
    views_where_clause_suffix: 'AND d."name" in (''production_db'', ''analytics_db'')'

Note: The WHERE clause suffix is appended to the internal query. Use proper SQL syntax for your database type (e.g., quoted identifiers for PostgreSQL).
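
For a MySQL-backed metastore, identifier quoting differs: the double-quoted "name" above is PostgreSQL syntax. A sketch of the equivalent filter, assuming the same d alias and column name as in the example above, might be:

source:
  type: hive-metastore
  config:
    # ... connection config ...

    # MySQL metastore: unquoted (or backtick-quoted) identifiers
    tables_where_clause_suffix: "AND d.name in ('production_db', 'analytics_db')"
    views_where_clause_suffix: "AND d.name in ('production_db', 'analytics_db')"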

Performance Considerations

Advantages Over HiveServer2 Connector

The Hive Metastore connector is significantly faster than the Hive connector because:

  1. Direct Database Access: No HiveServer2 overhead
  2. Batch Queries: Fetches all metadata in optimized SQL queries
  3. No Query Execution: Doesn't run Hive queries to extract metadata
  4. Parallel Processing: Can process multiple databases concurrently

Performance Comparison (approximate):

  • 10 databases, 1000 tables: ~2 minutes (Metastore) vs ~15 minutes (HiveServer2)
  • 100 databases, 10,000 tables: ~15 minutes (Metastore) vs ~2 hours (HiveServer2)

Optimization Tips

  1. Database Connection Pooling: The connector uses SQLAlchemy's default connection pooling. For very large deployments, consider tuning pool size:

    options:
      pool_size: 10
      max_overflow: 20
  2. Schema Filtering: Use schema_pattern to limit scope and reduce query time.

  3. Stateful Ingestion: Enable to only process changes:

    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
  4. Disable Column Lineage: If not needed (a combined sketch follows this list):

    emit_storage_lineage: true
    include_column_lineage: false # Faster
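
Putting these tips together, a tuned recipe fragment might look like the following sketch (connection details are placeholders; include only the options you need):

source:
  type: hive-metastore
  config:
    host_port: metastore-db.company.com:5432
    database: metastore
    username: datahub_user
    password: ${METASTORE_PASSWORD}
    scheme: "postgresql+psycopg2"

    # 1. Connection pool tuning (passed through to SQLAlchemy create_engine)
    options:
      pool_size: 10
      max_overflow: 20

    # 2. Limit scope to the schemas you care about
    schema_pattern:
      allow:
        - "^production_.*"

    # 3. Stateful ingestion (requires a pipeline_name in the recipe)
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true

    # 4. Storage lineage without column-level detail
    emit_storage_lineage: true
    include_column_lineage: false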

Network Considerations

  • Latency: Low latency to the metastore database is important
  • Bandwidth: Minimal bandwidth required (only metadata, no data transfer)
  • Connection Limits: Ensure metastore database can handle additional read connections

Platform Instances

When ingesting from multiple metastores (e.g., different clusters or environments), use platform_instance:

source:
  type: hive-metastore
  config:
    host_port: prod-metastore-db.company.com:5432
    database: metastore
    platform_instance: "prod-hive"

Best Practice: Combine with storage_platform_instance:

source:
  type: hive-metastore
  config:
    platform_instance: "prod-hive" # Hive tables
    storage_platform_instance: "prod-hdfs" # Storage locations
    emit_storage_lineage: true

Caveats and Limitations

Metastore Schema Compatibility

  • Hive Versions: Tested with Hive 1.x, 2.x, and 3.x metastore schemas
  • Schema Variations: Different Hive versions may have slightly different metastore schemas
  • Custom Tables: If your organization has added custom metastore tables, they won't be processed

Database Support

  • Supported: PostgreSQL, MySQL, MariaDB
  • Untested: Oracle, MSSQL (may work, but not officially supported)
  • Derby: Not recommended (embedded metastore, typically single-user)

View Lineage Parsing

  • Simple SQL: Fully supported with accurate lineage
  • Complex SQL: Best-effort parsing; some edge cases may have incomplete lineage
  • Non-standard SQL: Presto/Trino-specific functions may not be fully parsed

Permissions Limitations

  • Read-Only: The connector only needs SELECT permissions
  • No Write Operations: Never requires INSERT, UPDATE, or DELETE
  • Metastore Locks: Read operations don't acquire metastore locks

Storage Lineage Limitations

Same as the Hive connector:

  • Only tables with defined storage locations have lineage
  • Temporary tables are not supported
  • Partition-level lineage is aggregated at table level

Known Issues

  1. Large Column Lists: Tables with 500+ columns may be slow to process due to metastore query complexity.

  2. View Definition Encoding: Some older Hive versions store view definitions in non-UTF-8 encoding, which may cause parsing issues.

  3. Case Sensitivity:

    • PostgreSQL metastore: Case-sensitive identifiers (use "quoted" names in WHERE clauses)
    • MySQL metastore: Case-insensitive by default
    • DataHub automatically lowercases URNs for consistency
  4. Concurrent Metastore Writes: If the metastore is being actively modified during ingestion, some metadata may be inconsistent.

Troubleshooting

Connection Issues

Problem: Could not connect to metastore database

Solutions:

  • Verify host_port, database, and scheme are correct
  • Check network connectivity: telnet <host> <port>
  • Verify firewall rules allow connections
  • For PostgreSQL: Check pg_hba.conf allows connections from your IP
  • For MySQL: Check bind-address in my.cnf
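
If basic connectivity looks fine but ingestion still fails, trying the database client directly can separate network issues from credential or driver problems (hosts and usernames below are placeholders):

# PostgreSQL metastore
psql -h metastore-db.company.com -p 5432 -U datahub_user -d metastore -c 'SELECT 1;'

# MySQL metastore (-p prompts for the password)
mysql -h metastore-db.company.com -P 3306 -u datahub_user -p metastore -e 'SELECT 1;'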

Authentication Failures

Problem: Authentication failed or Access denied

Solutions:

  • Verify username and password are correct
  • Check user has CONNECT/LOGIN privileges
  • For Azure: Ensure username includes @server-name suffix
  • Review database logs for detailed error messages

Missing Tables

Problem: Not all tables appear in DataHub

Solutions:

  • Verify database user has SELECT on all metastore tables
  • Check if tables are filtered by schema_pattern or WHERE clauses
  • Query metastore directly to verify tables exist:
    SELECT d.name as db_name, t.tbl_name as table_name, t.tbl_type
    FROM TBLS t
    JOIN DBS d ON t.db_id = d.db_id
    WHERE d.name = 'your_database';

Presto/Trino Views Not Appearing

Problem: Views defined in Presto/Trino don't show up

Solutions:

  • Check view definitions exist in metastore:
    SELECT d.name as db_name, t.tbl_name as view_name, tp.param_value
    FROM TBLS t
    JOIN DBS d ON t.db_id = d.db_id
    JOIN TABLE_PARAMS tp ON t.tbl_id = tp.tbl_id
    WHERE t.tbl_type = 'VIRTUAL_VIEW'
    AND tp.param_key = 'presto_view'
    LIMIT 10;
  • Review ingestion logs for parsing errors
  • Verify view JSON is valid

Storage Lineage Not Appearing

Problem: No storage lineage relationships visible

Solutions:

  • Verify emit_storage_lineage: true is set
  • Check tables have storage locations in metastore:
    SELECT d.name as db_name, t.tbl_name as table_name, s.location
    FROM TBLS t
    JOIN DBS d ON t.db_id = d.db_id
    JOIN SDS s ON t.sd_id = s.sd_id
    WHERE s.location IS NOT NULL
    LIMIT 10;
  • Review logs for "Failed to parse storage location" warnings
  • See the "Storage Lineage" section above for troubleshooting tips

Slow Ingestion

Problem: Ingestion takes too long

Solutions:

  • Use schema filtering to reduce scope
  • Enable stateful ingestion to only process changes
  • Check database query performance (may need indexes on metastore tables)
  • Ensure low latency network connection to metastore database
  • Consider disabling column lineage if not needed

CLI based Ingestion

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: hive-metastore
  config:
    # Hive metastore DB connection
    host_port: localhost:5432
    database: metastore

    # specify the schema where metastore tables reside
    schema_pattern:
      allow:
        - "^public"

    # credentials
    username: user # optional
    password: pass # optional

    #scheme: 'postgresql+psycopg2' # set this if metastore db is using postgres
    #scheme: 'mysql+pymysql' # set this if metastore db is using mysql, default if unset

    # set this to have advanced filters on what to ingest
    #views_where_clause_suffix: AND d."name" in ('db1')
    #tables_where_clause_suffix: AND d."name" in ('db1')

    # Storage Lineage Configuration (Optional)
    # Enables lineage between Hive tables and their underlying storage locations
    #emit_storage_lineage: false # Set to true to enable storage lineage
    #hive_storage_lineage_direction: upstream # Direction: 'upstream' (storage -> Hive) or 'downstream' (Hive -> storage)
    #include_column_lineage: true # Set to false to disable column-level lineage
    #storage_platform_instance: "prod-hdfs" # Optional: platform instance for storage URNs

sink:
  # sink configs
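
The sink block above is intentionally left open. As one possible sketch, a recipe writing to a local DataHub instance over REST could end with (the server URL is a placeholder for your GMS endpoint):

sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"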

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Field | Description
convert_urns_to_lowercase
boolean
Whether to convert dataset urns to lowercase.
Default: False
database
One of string, null
database (catalog)
Default: None
emit_storage_lineage
boolean
Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).
Default: False
enable_properties_merge
boolean
By default, the connector merges properties with what already exists on the server. Set this to False to instead overwrite properties on each ingestion (the default behavior of most connectors).
Default: True
hive_storage_lineage_direction
Enum
One of: "upstream", "downstream"
host_port
string
Host URL and port to connect to. Example: localhost:3306
Default: localhost:3306
include_catalog_name_in_ids
boolean
Add the Presto catalog name (e.g. hive) to the generated dataset urns. urn:li:dataset:(urn:li:dataPlatform:hive,hive.user.logging_events,PROD) versus urn:li:dataset:(urn:li:dataPlatform:hive,user.logging_events,PROD)
Default: False
include_column_lineage
boolean
When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.
Default: True
include_table_location_lineage
boolean
If the source supports it, include table lineage to the underlying storage location.
Default: True
include_tables
boolean
Whether tables should be ingested.
Default: True
include_view_column_lineage
boolean
Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled.
Default: True
include_view_lineage
boolean
Whether to extract lineage from Hive views. Requires parsing the view definition SQL.
Default: True
include_views
boolean
Whether views should be ingested.
Default: True
incremental_lineage
boolean
When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: False
ingestion_job_id
string
Default:
metastore_db_name
One of string, null
Name of the Hive metastore's database (usually: metastore). For backward compatibility, if this field is not provided, the database field will be used. If both the 'database' and 'metastore_db_name' fields are set then the 'database' field will be used to filter the hive/presto/trino database
Default: None
mode
Enum
One of: "hive", "presto", "presto-on-hive", "trino"
options
object
Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args.
password
One of string(password), null
password
Default: None
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
schemas_where_clause_suffix
string
Where clause to specify what Hive schemas should be ingested.
Default:
simplify_nested_field_paths
boolean
Simplify v2 field paths to v1 by default. If the schema has Union or Array types, still falls back to v2
Default: False
sqlalchemy_uri
One of string, null
URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.
Default: None
storage_platform_instance
One of string, null
Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.
Default: None
tables_where_clause_suffix
string
Where clause to specify what Hive tables should be ingested.
Default:
use_catalog_subtype
boolean
Container Subtype name to be 'Database' or 'Catalog' Valid options: ['True', 'False']
Default: True
use_dataset_pascalcase_subtype
boolean
Dataset Subtype name to be 'Table' or 'View' Valid options: ['True', 'False']
Default: False
use_file_backed_cache
boolean
Whether to use a file backed cache for the view definitions.
Default: True
username
One of string, null
username
Default: None
views_where_clause_suffix
string
Where clause to specify what Presto views should be ingested.
Default:
env
string
The environment that all assets produced by this connector belong to
Default: PROD
database_pattern
AllowDenyPattern
A class to store allow deny regexes
database_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain
map(str,AllowDenyPattern)
A class to store allow deny regexes
domain.key.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
domain.key.allow.string
string
domain.key.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain.key.deny
array
List of regex patterns to exclude from ingestion.
Default: []
domain.key.deny.string
string
profile_pattern
AllowDenyPattern
A class to store allow deny regexes
profile_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
schema_pattern
AllowDenyPattern
A class to store allow deny regexes
schema_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
table_pattern
AllowDenyPattern
A class to store allow deny regexes
table_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
view_pattern
AllowDenyPattern
A class to store allow deny regexes
view_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
classification
ClassificationConfig
classification.enabled
boolean
Whether classification should be used to auto-detect glossary terms
Default: False
classification.info_type_to_term
map(str,string)
classification.max_workers
integer
Number of worker processes to use for classification. Set to 1 to disable.
Default: 4
classification.sample_size
integer
Number of sample values used for classification.
Default: 100
classification.classifiers
array
Classifiers to use to auto-detect glossary terms. If more than one classifier is configured, infotype predictions from the classifier defined later in the sequence take precedence.
Default: [{'type': 'datahub', 'config': None}]
classification.classifiers.DynamicTypedClassifierConfig
DynamicTypedClassifierConfig
classification.classifiers.DynamicTypedClassifierConfig.type 
string
The type of the classifier to use. For DataHub, use datahub
classification.classifiers.DynamicTypedClassifierConfig.config
One of object, null
The configuration required for initializing the classifier. If not specified, uses defaults for the classifier type.
Default: None
classification.column_pattern
AllowDenyPattern
A class to store allow deny regexes
classification.column_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
classification.table_pattern
AllowDenyPattern
A class to store allow deny regexes
classification.table_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
profiling
GEProfilingConfig
profiling.catch_exceptions
boolean
Default: True
profiling.enabled
boolean
Whether profiling should be done.
Default: False
profiling.field_sample_values_limit
integer
Upper limit for number of sample values to collect for all columns.
Default: 20
profiling.include_field_distinct_count
boolean
Whether to profile for the number of distinct values for each column.
Default: True
profiling.include_field_distinct_value_frequencies
boolean
Whether to profile for distinct value frequencies.
Default: False
profiling.include_field_histogram
boolean
Whether to profile for the histogram for numeric fields.
Default: False
profiling.include_field_max_value
boolean
Whether to profile for the max value of numeric columns.
Default: True
profiling.include_field_mean_value
boolean
Whether to profile for the mean value of numeric columns.
Default: True
profiling.include_field_median_value
boolean
Whether to profile for the median value of numeric columns.
Default: True
profiling.include_field_min_value
boolean
Whether to profile for the min value of numeric columns.
Default: True
profiling.include_field_null_count
boolean
Whether to profile for the number of nulls for each column.
Default: True
profiling.include_field_quantiles
boolean
Whether to profile for the quantiles of numeric columns.
Default: False
profiling.include_field_sample_values
boolean
Whether to profile for the sample values for all columns.
Default: True
profiling.include_field_stddev_value
boolean
Whether to profile for the standard deviation of numeric columns.
Default: True
profiling.limit
One of integer, null
Max number of documents to profile. By default, profiles all documents.
Default: None
profiling.max_number_of_fields_to_profile
One of integer, null
A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.
Default: None
profiling.max_workers
integer
Number of worker threads to use for profiling. Set to 1 to disable.
Default: 20
profiling.offset
One of integer, null
Offset in documents to profile. By default, uses no offset.
Default: None
profiling.partition_datetime
One of string(date-time), null
If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.
Default: None
profiling.partition_profiling_enabled
boolean
Whether to profile partitioned tables. Only BigQuery and AWS Athena support this. If enabled, the latest partition's data is used for profiling.
Default: True
profiling.profile_external_tables
boolean
Whether to profile external tables. Only Snowflake and Redshift support this.
Default: False
profiling.profile_if_updated_since_days
One of number, null
Profile a table only if it has been updated within this many days. If set to null, no last-modified-time constraint is applied. Supported only in Snowflake and BigQuery.
Default: None
profiling.profile_nested_fields
boolean
Whether to profile complex types like structs, arrays and maps.
Default: False
profiling.profile_table_level_only
boolean
Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
profiling.profile_table_row_count_estimate_only
boolean
Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
profiling.profile_table_row_limit
One of integer, null
Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats.
Default: 5000000
profiling.profile_table_size_limit
One of integer, null
Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats.
Default: 5
profiling.query_combiner_enabled
boolean
This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
profiling.report_dropped_profiles
boolean
Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes.
Default: False
profiling.sample_size
integer
Number of rows to be sampled from the table for column-level profiling. Applicable only if use_sampling is set to True.
Default: 10000
profiling.turn_off_expensive_profiling_metrics
boolean
Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
profiling.use_sampling
boolean
Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables.
Default: True
profiling.operation_config
OperationConfig
profiling.operation_config.lower_freq_profile_enabled
boolean
Whether to do profiling at a lower frequency. This does not do any scheduling; it just adds additional checks for when not to run profiling.
Default: False
profiling.operation_config.profile_date_of_month
One of integer, null
Number between 1 and 31 for the date of the month (both inclusive). If not specified, defaults to None and this field does not take effect.
Default: None
profiling.operation_config.profile_day_of_week
One of integer, null
Number between 0 and 6 for the day of the week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to None and this field does not take effect.
Default: None
profiling.tags_to_ignore_sampling
One of array, null
Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling.
Default: None
profiling.tags_to_ignore_sampling.string
string
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents a large number of accidental soft deletes by blocking the state from committing when the relative change in entities, compared to the previous state, exceeds the 'fail_safe_threshold'.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Code Coordinates

  • Class Name: datahub.ingestion.source.sql.hive.hive_metastore_source.HiveMetastoreSource
  • Browse on GitHub

Module presto-on-hive

Certified

Important Capabilities

Capability | Status / Notes
Asset Containers | Enabled by default. Supported for types - Catalog, Schema.
Classification | Not Supported.
Column-level Lineage | Enabled by default for views via include_view_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types - Table, View.
Data Profiling | Not Supported.
Descriptions | Enabled by default.
Detect Deleted Entities | Enabled by default via stateful ingestion.
Domains | Enabled by default.
Schema Metadata | Enabled by default.
Table-Level Lineage | Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types - Table, View.
Test Connection | Enabled by default.

This plugin extracts the following:

  • Metadata for Presto views and Hive tables (external / managed)
  • Column types associated with each table / view
  • Detailed table / view property info

CLI based Ingestion

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Field | Description
convert_urns_to_lowercase
boolean
Whether to convert dataset urns to lowercase.
Default: False
database
One of string, null
database (catalog)
Default: None
emit_storage_lineage
boolean
Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).
Default: False
enable_properties_merge
boolean
By default, the connector merges properties with what already exists on the server. Set this to False to instead overwrite properties on each ingestion (the default behavior of most connectors).
Default: True
hive_storage_lineage_direction
Enum
One of: "upstream", "downstream"
host_port
string
Host URL and port to connect to. Example: localhost:3306
Default: localhost:3306
include_catalog_name_in_ids
boolean
Add the Presto catalog name (e.g. hive) to the generated dataset urns. urn:li:dataset:(urn:li:dataPlatform:hive,hive.user.logging_events,PROD) versus urn:li:dataset:(urn:li:dataPlatform:hive,user.logging_events,PROD)
Default: False
include_column_lineage
boolean
When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.
Default: True
include_table_location_lineage
boolean
If the source supports it, include table lineage to the underlying storage location.
Default: True
include_tables
boolean
Whether tables should be ingested.
Default: True
include_view_column_lineage
boolean
Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled.
Default: True
include_view_lineage
boolean
Whether to extract lineage from Hive views. Requires parsing the view definition SQL.
Default: True
include_views
boolean
Whether views should be ingested.
Default: True
incremental_lineage
boolean
When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: False
ingestion_job_id
string
Default:
metastore_db_name
One of string, null
Name of the Hive metastore's database (usually: metastore). For backward compatibility, if this field is not provided, the database field will be used. If both the 'database' and 'metastore_db_name' fields are set then the 'database' field will be used to filter the hive/presto/trino database
Default: None
mode
Enum
One of: "hive", "presto", "presto-on-hive", "trino"
options
object
Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args.
password
One of string(password), null
password
Default: None
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
schemas_where_clause_suffix
string
Where clause to specify what Hive schemas should be ingested.
Default:
simplify_nested_field_paths
boolean
Simplify v2 field paths to v1 by default. If the schema has Union or Array types, still falls back to v2
Default: False
sqlalchemy_uri
One of string, null
URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.
Default: None
storage_platform_instance
One of string, null
Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.
Default: None
tables_where_clause_suffix
string
Where clause to specify what Hive tables should be ingested.
Default:
use_catalog_subtype
boolean
Container Subtype name to be 'Database' or 'Catalog' Valid options: ['True', 'False']
Default: True
use_dataset_pascalcase_subtype
boolean
Dataset Subtype name to be 'Table' or 'View' Valid options: ['True', 'False']
Default: False
use_file_backed_cache
boolean
Whether to use a file backed cache for the view definitions.
Default: True
username
One of string, null
username
Default: None
views_where_clause_suffix
string
Where clause to specify what Presto views should be ingested.
Default:
env
string
The environment that all assets produced by this connector belong to
Default: PROD
database_pattern
AllowDenyPattern
A class to store allow deny regexes
database_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain
map(str,AllowDenyPattern)
A class to store allow deny regexes
domain.key.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
domain.key.allow.string
string
domain.key.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain.key.deny
array
List of regex patterns to exclude from ingestion.
Default: []
domain.key.deny.string
string
profile_pattern
AllowDenyPattern
A class to store allow deny regexes
profile_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
schema_pattern
AllowDenyPattern
A class to store allow deny regexes
schema_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
table_pattern
AllowDenyPattern
A class to store allow deny regexes
table_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
view_pattern
AllowDenyPattern
A class to store allow deny regexes
view_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
classification
ClassificationConfig
classification.enabled
boolean
Whether classification should be used to auto-detect glossary terms
Default: False
classification.info_type_to_term
map(str,string)
classification.max_workers
integer
Number of worker processes to use for classification. Set to 1 to disable.
Default: 4
classification.sample_size
integer
Number of sample values used for classification.
Default: 100
classification.classifiers
array
Classifiers to use to auto-detect glossary terms. If more than one classifier is configured, infotype predictions from the classifier defined later in the sequence take precedence.
Default: [{'type': 'datahub', 'config': None}]
classification.classifiers.DynamicTypedClassifierConfig
DynamicTypedClassifierConfig
classification.classifiers.DynamicTypedClassifierConfig.type 
string
The type of the classifier to use. For DataHub, use datahub
classification.classifiers.DynamicTypedClassifierConfig.config
One of object, null
The configuration required for initializing the classifier. If not specified, uses defaults for the classifier type.
Default: None
classification.column_pattern
AllowDenyPattern
A class to store allow deny regexes
classification.column_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
classification.table_pattern
AllowDenyPattern
A class to store allow deny regexes
classification.table_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
profiling
GEProfilingConfig
profiling.catch_exceptions
boolean
Default: True
profiling.enabled
boolean
Whether profiling should be done.
Default: False
profiling.field_sample_values_limit
integer
Upper limit for number of sample values to collect for all columns.
Default: 20
profiling.include_field_distinct_count
boolean
Whether to profile for the number of distinct values for each column.
Default: True
profiling.include_field_distinct_value_frequencies
boolean
Whether to profile for distinct value frequencies.
Default: False
profiling.include_field_histogram
boolean
Whether to profile for the histogram for numeric fields.
Default: False
profiling.include_field_max_value
boolean
Whether to profile for the max value of numeric columns.
Default: True
profiling.include_field_mean_value
boolean
Whether to profile for the mean value of numeric columns.
Default: True
profiling.include_field_median_value
boolean
Whether to profile for the median value of numeric columns.
Default: True
profiling.include_field_min_value
boolean
Whether to profile for the min value of numeric columns.
Default: True
profiling.include_field_null_count
boolean
Whether to profile for the number of nulls for each column.
Default: True
profiling.include_field_quantiles
boolean
Whether to profile for the quantiles of numeric columns.
Default: False
profiling.include_field_sample_values
boolean
Whether to profile for the sample values for all columns.
Default: True
profiling.include_field_stddev_value
boolean
Whether to profile for the standard deviation of numeric columns.
Default: True
profiling.limit
One of integer, null
Max number of documents to profile. By default, profiles all documents.
Default: None
profiling.max_number_of_fields_to_profile
One of integer, null
A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.
Default: None
profiling.max_workers
integer
Number of worker threads to use for profiling. Set to 1 to disable.
Default: 20
profiling.offset
One of integer, null
Offset in documents to profile. By default, uses no offset.
Default: None
profiling.partition_datetime
One of string(date-time), null
If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.
Default: None
profiling.partition_profiling_enabled
boolean
Whether to profile partitioned tables. Only BigQuery and AWS Athena support this. If enabled, the latest partition's data is used for profiling.
Default: True
profiling.profile_external_tables
boolean
Whether to profile external tables. Only Snowflake and Redshift support this.
Default: False
profiling.profile_if_updated_since_days
One of number, null
Profile a table only if it has been updated within this many days. If set to null, no last-modified-time constraint is applied. Supported only in Snowflake and BigQuery.
Default: None
profiling.profile_nested_fields
boolean
Whether to profile complex types like structs, arrays and maps.
Default: False
profiling.profile_table_level_only
boolean
Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
profiling.profile_table_row_count_estimate_only
boolean
Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
profiling.profile_table_row_limit
One of integer, null
Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats.
Default: 5000000
profiling.profile_table_size_limit
One of integer, null
Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats.
Default: 5
profiling.query_combiner_enabled
boolean
This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
profiling.report_dropped_profiles
boolean
Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes.
Default: False
profiling.sample_size
integer
Number of rows to be sampled from the table for column-level profiling. Applicable only if use_sampling is set to True.
Default: 10000
profiling.turn_off_expensive_profiling_metrics
boolean
Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
profiling.use_sampling
boolean
Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables.
Default: True
profiling.operation_config
OperationConfig
profiling.operation_config.lower_freq_profile_enabled
boolean
Whether to do profiling at a lower frequency. This does not do any scheduling; it just adds additional checks for when not to run profiling.
Default: False
profiling.operation_config.profile_date_of_month
One of integer, null
Number between 1 and 31 for the date of the month (both inclusive). If not specified, defaults to None and this field does not take effect.
Default: None
profiling.operation_config.profile_day_of_week
One of integer, null
Number between 0 and 6 for the day of the week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to None and this field does not take effect.
Default: None
profiling.tags_to_ignore_sampling
One of array, null
Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling.
Default: None
profiling.tags_to_ignore_sampling.string
string
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents a large number of accidental soft deletes by blocking the state from committing when the relative change in entities, compared to the previous state, exceeds the 'fail_safe_threshold'.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Code Coordinates

  • Class Name: datahub.ingestion.source.sql.hive.hive_metastore_source.HiveMetastoreSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Hive Metastore, feel free to ping us on our Slack.