Hive
Important Capabilities
| Capability | Status | Notes |
|---|---|---|
| Asset Containers | ✅ | Enabled by default. Supported for types - Database, Schema. |
| Classification | ✅ | Optionally enabled via classification.enabled. |
| Column-level Lineage | ✅ | Enabled by default for views via include_view_column_lineage, and to storage via include_column_lineage when storage lineage is enabled. Supported for types - Table, View. |
| Descriptions | ✅ | Enabled by default. |
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Domains | ✅ | Supported via the domain config field. |
| Platform Instance | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Enabled by default for views via include_view_lineage, and to upstream/downstream storage via emit_storage_lineage. Supported for types - Table, View. |
| Test Connection | ✅ | Enabled by default. |
This plugin extracts the following:
- Metadata for databases, schemas, and tables
- Column types associated with each table
- Detailed table and storage information
- Table, row, and column statistics via optional SQL profiling.
Prerequisites
The Hive source connects directly to the HiveServer2 service to extract metadata. Before configuring the DataHub connector, ensure you have:
- Network Access: The machine running DataHub ingestion must be able to reach your HiveServer2 instance on the configured port (typically 10000, or 10001 for TLS).
- Hive User Account: A Hive user account with appropriate permissions to read metadata from the databases and tables you want to ingest.
- PyHive Dependencies: The connector uses PyHive for connectivity. Install the appropriate dependencies:

```bash
pip install 'acryl-datahub[hive]'
```
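Before writing a recipe, it can help to confirm that PyHive can reach HiveServer2 at all. Below is a minimal connectivity check, assuming a hypothetical host and user and the default (non-Kerberos) authentication; adjust the connection arguments to match your cluster.

```python
# Minimal PyHive connectivity check (hypothetical host/user -- adjust for your cluster).
from pyhive import hive

conn = hive.Connection(
    host="hive.company.com",  # placeholder HiveServer2 host
    port=10000,
    username="datahub_user",
)
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
for (database_name,) in cursor.fetchall():
    print(database_name)
cursor.close()
conn.close()
```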
Required Permissions
The Hive user account used by DataHub needs the following permissions:
Minimum Permissions (Metadata Only)
```sql
-- Grant SELECT on all databases you want to ingest
GRANT SELECT ON DATABASE <database_name> TO USER <datahub_user>;

-- Grant SELECT on tables/views for schema extraction
GRANT SELECT ON TABLE <database_name>.* TO USER <datahub_user>;
```
Additional Permissions for Storage Lineage
If you plan to enable storage lineage, the connector needs to read table location information:
```sql
-- Grant DESCRIBE on tables to read storage locations
GRANT SELECT ON <database_name>.* TO USER <datahub_user>;
```
Recommendations
- Read-only Access: DataHub only needs read permissions. Never grant `INSERT`, `UPDATE`, `DELETE`, or `DROP` privileges.
- Database Filtering: If you only need to ingest specific databases, use the `database` config parameter to limit scope and reduce the permissions required.
Authentication
The Hive connector supports multiple authentication methods through PyHive. Configure authentication using the recipe parameters described below.
Basic Authentication (Username/Password)
The simplest authentication method using a username and password:
```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10000
    username: datahub_user
    password: ${HIVE_PASSWORD} # Use environment variables for sensitive data
```
LDAP Authentication
For LDAP-based authentication:
```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10000
    username: datahub_user
    password: ${LDAP_PASSWORD}
    options:
      connect_args:
        auth: LDAP
```
Kerberos Authentication
For Kerberos-secured Hive clusters:
```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10000
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive
```
Requirements:
- Valid Kerberos ticket (run `kinit` before running ingestion)
- Kerberos configuration file (`/etc/krb5.conf`, or specified via the `KRB5_CONFIG` environment variable)
- PyKerberos or requests-kerberos package installed
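To confirm the ticket and service principal are usable before a full ingestion run, a small PyHive check can help. This is a sketch with a placeholder host; it assumes `kinit` has already been run.

```python
# Sanity-check a Kerberos-secured HiveServer2 connection (placeholder host).
# Run `kinit` first so a valid ticket is in the credential cache.
from pyhive import hive

conn = hive.Connection(
    host="hive.company.com",
    port=10000,
    auth="KERBEROS",
    kerberos_service_name="hive",
)
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
print(f"Visible databases: {len(cursor.fetchall())}")
conn.close()
```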
TLS/SSL Connection
For secure connections over HTTPS:
```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10001
    scheme: "hive+https"
    username: datahub_user
    password: ${HIVE_PASSWORD}
    options:
      connect_args:
        auth: BASIC
```
Azure HDInsight
For Microsoft Azure HDInsight clusters:
```yaml
source:
  type: hive
  config:
    host_port: <cluster_name>.azurehdinsight.net:443
    scheme: "hive+https"
    username: admin
    password: ${HDINSIGHT_PASSWORD}
    options:
      connect_args:
        http_path: "/hive2"
        auth: BASIC
```
Databricks (via PyHive)
For Databricks clusters using the Hive connector:
```yaml
source:
  type: hive
  config:
    host_port: <workspace-url>:443
    scheme: "databricks+pyhive"
    username: token # or your Databricks username
    password: ${DATABRICKS_TOKEN} # Personal access token or password
    options:
      connect_args:
        http_path: "sql/protocolv1/o/xxxyyyzzzaaasa/1234-567890-hello123"
```
Note: For comprehensive Databricks support, consider using the dedicated Databricks Unity Catalog connector instead, which provides enhanced features.
Storage Lineage
DataHub can extract lineage between Hive tables and their underlying storage locations (S3, Azure Blob, HDFS, GCS, etc.). This feature creates relationships showing data flow from raw storage to Hive tables.
Quick Start
Enable storage lineage with minimal configuration:
```yaml
source:
  type: hive
  config:
    host_port: hive.company.com:10000
    username: datahub_user
    password: ${HIVE_PASSWORD}

    # Enable storage lineage
    emit_storage_lineage: true
```
This will:
- Extract storage locations from Hive table metadata
- Create lineage from storage (S3, HDFS, etc.) to Hive tables
- Include column-level lineage by default
Configuration Options
Storage lineage behavior is controlled by four parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `emit_storage_lineage` | boolean | `false` | Master toggle to enable/disable storage lineage |
| `hive_storage_lineage_direction` | string | `"upstream"` | Direction: `"upstream"` (storage → Hive) or `"downstream"` (Hive → storage) |
| `include_column_lineage` | boolean | `true` | Enable column-level lineage from storage paths to Hive columns |
| `storage_platform_instance` | string | `None` | Platform instance for storage URNs (e.g., `"prod-s3"`, `"dev-hdfs"`) |
Supported Storage Platforms
The connector automatically detects and creates lineage for:
- Amazon S3: `s3://`, `s3a://`, `s3n://`
- HDFS: `hdfs://`
- Google Cloud Storage: `gs://`
- Azure Blob Storage: `wasb://`, `wasbs://`
- Azure Data Lake (Gen1): `adl://`
- Azure Data Lake (Gen2): `abfs://`, `abfss://`
- Databricks File System: `dbfs://`
- Local File System: `file://` or absolute paths
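For intuition, resolving a table's `LOCATION` URI to a storage platform comes down to inspecting the URI scheme. The snippet below is an illustrative sketch only: the platform names and mapping are simplified assumptions for a few common schemes, not the connector's internal code.

```python
# Illustrative scheme-to-platform resolution for a Hive table LOCATION.
# Platform names here are simplified examples, not authoritative DataHub identifiers.
from urllib.parse import urlparse

SCHEME_TO_PLATFORM = {
    "s3": "s3", "s3a": "s3", "s3n": "s3",
    "hdfs": "hdfs",
    "gs": "gcs",
    "dbfs": "dbfs",
    "file": "file",
}

def resolve_storage_platform(location: str) -> str:
    scheme = urlparse(location).scheme or "file"  # bare absolute paths are treated as local files
    return SCHEME_TO_PLATFORM.get(scheme, scheme)

print(resolve_storage_platform("s3a://datalake/raw/orders/"))            # -> s3
print(resolve_storage_platform("hdfs://namenode:8020/warehouse/orders"))  # -> hdfs
```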
Complete Documentation
See the configuration examples above, and the Performance Considerations and Troubleshooting sections below, for detailed guidance and best practices.
Platform Instances
When ingesting from multiple Hive environments (e.g., production, staging, development), use platform_instance to distinguish them:
```yaml
source:
  type: hive
  config:
    host_port: prod-hive.company.com:10000
    platform_instance: "prod-hive"
```
This creates URNs in which the platform instance is encoded in the dataset name, for example:

`urn:li:dataset:(urn:li:dataPlatform:hive,prod-hive.database.table,PROD)`
Best Practice: Combine with storage_platform_instance for complete environment isolation:
```yaml
source:
  type: hive
  config:
    platform_instance: "prod-hive" # Hive environment
    storage_platform_instance: "prod-s3" # Storage environment
    emit_storage_lineage: true
```
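If you want to preview how `platform_instance` affects the generated dataset URNs, DataHub's Python SDK exposes a URN builder. A minimal sketch, using placeholder names:

```python
# Preview the dataset URN shape for a given platform instance (placeholder values).
from datahub.emitter.mce_builder import make_dataset_urn_with_platform_instance

urn = make_dataset_urn_with_platform_instance(
    platform="hive",
    name="database.table",
    platform_instance="prod-hive",
    env="PROD",
)
print(urn)
# e.g. urn:li:dataset:(urn:li:dataPlatform:hive,prod-hive.database.table,PROD)
```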
Performance Considerations
Large Hive Deployments
For Hive clusters with thousands of tables, consider:
1. Database Filtering: Limit ingestion to specific databases:

   ```yaml
   database: "production_db" # Only ingest one database
   ```

2. Incremental Ingestion: Use DataHub's stateful ingestion to only process changes:

   ```yaml
   stateful_ingestion:
     enabled: true
     remove_stale_metadata: true
   ```

3. Disable Column Lineage: If not needed, disable to improve performance:

   ```yaml
   emit_storage_lineage: true
   include_column_lineage: false # Faster ingestion
   ```

4. Connection Pooling: The connector uses a single connection by default. For better performance with large schemas, ensure your HiveServer2 has sufficient resources.
Network Latency
- If DataHub is running far from your Hive cluster, expect slower metadata extraction
- Consider running ingestion from a machine closer to your Hive infrastructure
- Use scheduled ingestion during off-peak hours for large deployments
Caveats and Limitations
Hive Version Support
- Supported Versions: Hive 1.x, 2.x, and 3.x
- HiveServer2 Required: The connector connects to HiveServer2, not the metastore database directly
- For direct metastore access, use the Hive Metastore connector instead
View Definitions
- Simple Views: Fully supported with SQL lineage extraction
- Complex Views: Views with complex SQL (CTEs, subqueries) are supported via SQL parsing
- Presto/Trino Views: Not directly accessible via this connector. Use the Hive Metastore connector for Presto/Trino view support
Storage Lineage Limitations
- Location Required: Only tables with defined storage locations (`LOCATION` clause) will have storage lineage
- External Tables: Best supported (always have explicit locations)
- Managed Tables: Lineage created for default warehouse locations
- Temporary Tables: Not supported (no persistent storage location)
Partitioned Tables
- Partition information is extracted and included in schema metadata
- Partition-level storage lineage is aggregated at the table level
- Individual partition lineage is not currently supported
Authentication Limitations
- No Proxy Support: Direct connection to HiveServer2 required
- Token-Based Auth: Not natively supported (use Kerberos or basic auth)
- Multi-Factor Authentication: Not supported
Known Issues
- Session Timeout: Long-running ingestion may hit HiveServer2 session timeouts. Configure `hive.server2.session.timeout` appropriately on the Hive side.
- Large Schemas: Tables with 1000+ columns may be slow to ingest due to schema extraction overhead.
- Case Sensitivity:
  - Hive is case-insensitive by default
  - URN casing can be normalized with the `convert_urns_to_lowercase` config option
- View Lineage Parsing: Complex views using non-standard SQL may not have complete lineage extracted.
Troubleshooting
Connection Issues
Problem: Could not connect to HiveServer2
Solutions:
- Verify `host_port` is correct and accessible
- Check firewall rules allow traffic on the Hive port
- Confirm HiveServer2 service is running: `beeline -u jdbc:hive2://<host>:<port>`
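A quick way to rule out basic network problems is a plain TCP check from the ingestion host. This sketch uses only the Python standard library; the host and port are placeholders.

```python
# Basic TCP reachability check for HiveServer2 (placeholder host/port).
import socket

try:
    with socket.create_connection(("hive.company.com", 10000), timeout=5):
        print("TCP connection to HiveServer2 succeeded")
except OSError as exc:
    print(f"Cannot reach HiveServer2: {exc}")
```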
Authentication Failures
Problem: Authentication failed
Solutions:
- Verify username and password are correct
- Check authentication method matches your Hive configuration
- For Kerberos: Ensure valid ticket exists (`klist`)
- Review HiveServer2 logs for detailed error messages
Missing Tables
Problem: Not all tables appear in DataHub
Solutions:
- Verify user has SELECT permissions on missing tables
- Check if tables are in filtered databases
- Review warnings in ingestion logs
- Ensure tables are not temporary or views with complex definitions
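To confirm what the DataHub service account can actually see, it can help to list tables with the same credentials the recipe uses. A hedged sketch with placeholder connection details and database name:

```python
# List the tables visible to the ingestion account (placeholder host/user/database).
from pyhive import hive

conn = hive.Connection(host="hive.company.com", port=10000, username="datahub_user")
cursor = conn.cursor()
cursor.execute("SHOW TABLES IN production_db")
for (table_name,) in cursor.fetchall():
    print(table_name)
conn.close()
```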
Storage Lineage Not Appearing
Problem: No storage lineage relationships visible
Solutions:
- Verify `emit_storage_lineage: true` is set
- Check tables have defined storage locations:
  `DESCRIBE FORMATTED <table>`
- Review logs for "Failed to parse storage location" warnings
- See the "Storage Lineage" section above for more troubleshooting tips
Related Documentation
- Hive Source Configuration - Configuration examples
- Hive Metastore Connector - Alternative connector for direct metastore access
- PyHive Documentation - Underlying connection library
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
```yaml
source:
  type: hive
  config:
    # Coordinates
    host_port: localhost:10000
    database: DemoDatabase # optional, if not specified, ingests from all databases

    # Credentials
    username: user # optional
    password: pass # optional

    # For more details on authentication, see the PyHive docs:
    # https://github.com/dropbox/PyHive#passing-session-configuration.
    # LDAP, Kerberos, etc. are supported using connect_args, which can be
    # added under the `options` config parameter.
    #options:
    #  connect_args:
    #    auth: KERBEROS
    #    kerberos_service_name: hive

    #scheme: 'hive+http' # set this if Thrift should use the HTTP transport
    #scheme: 'hive+https' # set this if Thrift should use the HTTP with SSL transport
    #scheme: 'sparksql' # set this for Spark Thrift Server

    # Storage Lineage Configuration (Optional)
    # Enables lineage between Hive tables and their underlying storage locations
    #emit_storage_lineage: false # Set to true to enable storage lineage
    #hive_storage_lineage_direction: upstream # Direction: 'upstream' (storage -> Hive) or 'downstream' (Hive -> storage)
    #include_column_lineage: true # Set to false to disable column-level lineage
    #storage_platform_instance: "prod-s3" # Optional: platform instance for storage URNs

sink:
  # sink configs
```
```yaml
# ---------------------------------------------------------
# Recipe (Azure HDInsight)
# Connecting to Microsoft Azure HDInsight using TLS.
# ---------------------------------------------------------

source:
  type: hive
  config:
    # Coordinates
    host_port: <cluster_name>.azurehdinsight.net:443

    # Credentials
    username: admin
    password: password

    # Options
    options:
      connect_args:
        http_path: "/hive2"
        auth: BASIC

sink:
  # sink configs
```
```yaml
# ---------------------------------------------------------
# Recipe (Databricks)
# Ensure that databricks-dbapi is installed. If not, use `pip install databricks-dbapi` to install.
# Use the `http_path` from your Databricks cluster in the following recipe.
# See https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#get-server-hostname-port-http-path-and-jdbc-url
# for instructions to find `http_path`.
# ---------------------------------------------------------

source:
  type: hive
  config:
    host_port: <databricks workspace URL>:443
    username: token / username
    password: <api token> / password
    scheme: 'databricks+pyhive'

    options:
      connect_args:
        http_path: 'sql/protocolv1/o/xxxyyyzzzaaasa/1234-567890-hello123'

sink:
  # sink configs
```
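Recipes like the ones above can also be run from Python instead of the CLI, which is convenient when ingestion is triggered from an orchestrator. A minimal sketch using DataHub's programmatic Pipeline API, with placeholder connection details and a local `datahub-rest` sink:

```python
# Run a Hive ingestion recipe programmatically (all connection details are placeholders).
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "hive",
            "config": {
                "host_port": "hive.company.com:10000",
                "username": "datahub_user",
                "password": "example-password",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```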
Config Details
Note that a . is used to denote nested fields in the YAML recipe.
| Field | Description |
|---|---|
host_port ✅ string | host URL |
convert_urns_to_lowercase boolean | Whether to convert dataset urns to lowercase. Default: False |
database One of string, null | database (catalog) Default: None |
emit_storage_lineage boolean | Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.). Default: False |
hive_storage_lineage_direction Enum | Direction of storage lineage. One of: "upstream", "downstream" Default: upstream |
include_column_lineage boolean | When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields. Default: True |
include_tables boolean | Whether tables should be ingested. Default: True |
include_view_column_lineage boolean | Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled. Default: True |
include_view_lineage boolean | Populates view->view and table->view lineage using DataHub's sql parser. Default: True |
include_views boolean | Whether views should be ingested. Default: True |
incremental_lineage boolean | When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run. Default: False |
options object | Any options specified here will be passed to SQLAlchemy.create_engine as kwargs. To set connection arguments in the URL, specify them under connect_args. |
password One of string(password), null | password Default: None |
platform_instance One of string, null | The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details. Default: None |
sqlalchemy_uri One of string, null | URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters. Default: None |
storage_platform_instance One of string, null | Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets. Default: None |
use_file_backed_cache boolean | Whether to use a file backed cache for the view definitions. Default: True |
username One of string, null | username Default: None |
env string | The environment that all assets produced by this connector belong to Default: PROD |
database_pattern AllowDenyPattern | A class to store allow deny regexes |
database_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain map(str,AllowDenyPattern) | A class to store allow deny regexes |
domain.`key`.allow array | List of regex patterns to include in ingestion Default: ['.*'] |
domain.`key`.allow.string string | |
domain.`key`.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
domain.`key`.deny array | List of regex patterns to exclude from ingestion. Default: [] |
domain.`key`.deny.string string | |
profile_pattern AllowDenyPattern | A class to store allow deny regexes |
profile_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
table_pattern AllowDenyPattern | A class to store allow deny regexes |
table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
view_pattern AllowDenyPattern | A class to store allow deny regexes |
view_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification ClassificationConfig | |
classification.enabled boolean | Whether classification should be used to auto-detect glossary terms Default: False |
classification.info_type_to_term map(str,string) | |
classification.max_workers integer | Number of worker processes to use for classification. Set to 1 to disable. Default: 4 |
classification.sample_size integer | Number of sample values used for classification. Default: 100 |
classification.classifiers array | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance. Default: [{'type': 'datahub', 'config': None}] |
classification.classifiers.DynamicTypedClassifierConfig DynamicTypedClassifierConfig | |
classification.classifiers.DynamicTypedClassifierConfig.type ❓ string | The type of the classifier to use. For DataHub, use datahub |
classification.classifiers.DynamicTypedClassifierConfig.config One of object, null | The configuration required for initializing the classifier. If not specified, uses defaults for classifer type. Default: None |
classification.column_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.column_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
classification.table_pattern AllowDenyPattern | A class to store allow deny regexes |
classification.table_pattern.ignoreCase One of boolean, null | Whether to ignore case sensitivity during pattern matching. Default: True |
profiling GEProfilingConfig | |
profiling.catch_exceptions boolean | Default: True |
profiling.enabled boolean | Whether profiling should be done. Default: False |
profiling.field_sample_values_limit integer | Upper limit for number of sample values to collect for all columns. Default: 20 |
profiling.include_field_distinct_count boolean | Whether to profile for the number of distinct values for each column. Default: True |
profiling.include_field_distinct_value_frequencies boolean | Whether to profile for distinct value frequencies. Default: False |
profiling.include_field_histogram boolean | Whether to profile for the histogram for numeric fields. Default: False |
profiling.include_field_max_value boolean | Whether to profile for the max value of numeric columns. Default: True |
profiling.include_field_mean_value boolean | Whether to profile for the mean value of numeric columns. Default: True |
profiling.include_field_median_value boolean | Whether to profile for the median value of numeric columns. Default: True |
profiling.include_field_min_value boolean | Whether to profile for the min value of numeric columns. Default: True |
profiling.include_field_null_count boolean | Whether to profile for the number of nulls for each column. Default: True |
profiling.include_field_quantiles boolean | Whether to profile for the quantiles of numeric columns. Default: False |
profiling.include_field_sample_values boolean | Whether to profile for the sample values for all columns. Default: True |
profiling.include_field_stddev_value boolean | Whether to profile for the standard deviation of numeric columns. Default: True |
profiling.limit One of integer, null | Max number of documents to profile. By default, profiles all documents. Default: None |
profiling.max_number_of_fields_to_profile One of integer, null | A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up. Default: None |
profiling.max_workers integer | Number of worker threads to use for profiling. Set to 1 to disable. Default: 20 |
profiling.offset One of integer, null | Offset in documents to profile. By default, uses no offset. Default: None |
profiling.partition_datetime One of string(date-time), null | If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this. Default: None |
profiling.partition_profiling_enabled boolean | Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling. Default: True |
profiling.profile_external_tables boolean | Whether to profile external tables. Only Snowflake and Redshift supports this. Default: False |
profiling.profile_if_updated_since_days One of number, null | Profile table only if it has been updated since these many number of days. If set to null, no constraint of last modified time for tables to profile. Supported only in snowflake and BigQuery. Default: None |
profiling.profile_nested_fields boolean | Whether to profile complex types like structs, arrays and maps. Default: False |
profiling.profile_table_level_only boolean | Whether to perform profiling at table-level only, or include column-level profiling as well. Default: False |
profiling.profile_table_row_count_estimate_only boolean | Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. Default: False |
profiling.profile_table_row_limit One of integer, null | Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats. Default: 5000000 |
profiling.profile_table_size_limit One of integer, null | Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats. Default: 5 |
profiling.query_combiner_enabled boolean | This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible. Default: True |
profiling.report_dropped_profiles boolean | Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes. Default: False |
profiling.sample_size integer | Number of rows to be sampled from table for column level profiling.Applicable only if use_sampling is set to True. Default: 10000 |
profiling.turn_off_expensive_profiling_metrics boolean | Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10. Default: False |
profiling.use_sampling boolean | Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. Default: True |
profiling.operation_config OperationConfig | |
profiling.operation_config.lower_freq_profile_enabled boolean | Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling. Default: False |
profiling.operation_config.profile_date_of_month One of integer, null | Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect. Default: None |
profiling.operation_config.profile_day_of_week One of integer, null | Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect. Default: None |
profiling.tags_to_ignore_sampling One of array, null | Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling. Default: None |
profiling.tags_to_ignore_sampling.string string | |
stateful_ingestion One of StatefulStaleMetadataRemovalConfig, null | Default: None |
stateful_ingestion.enabled boolean | Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False Default: False |
stateful_ingestion.fail_safe_threshold number | Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'. Default: 75.0 |
stateful_ingestion.remove_stale_metadata boolean | Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled. Default: True |
The JSONSchema for this configuration is inlined below.
```json
{
"$defs": {
"AllowDenyPattern": {
"additionalProperties": false,
"description": "A class to store allow deny regexes",
"properties": {
"allow": {
"default": [
".*"
],
"description": "List of regex patterns to include in ingestion",
"items": {
"type": "string"
},
"title": "Allow",
"type": "array"
},
"deny": {
"default": [],
"description": "List of regex patterns to exclude from ingestion.",
"items": {
"type": "string"
},
"title": "Deny",
"type": "array"
},
"ignoreCase": {
"anyOf": [
{
"type": "boolean"
},
{
"type": "null"
}
],
"default": true,
"description": "Whether to ignore case sensitivity during pattern matching.",
"title": "Ignorecase"
}
},
"title": "AllowDenyPattern",
"type": "object"
},
"ClassificationConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether classification should be used to auto-detect glossary terms",
"title": "Enabled",
"type": "boolean"
},
"sample_size": {
"default": 100,
"description": "Number of sample values used for classification.",
"title": "Sample Size",
"type": "integer"
},
"max_workers": {
"default": 4,
"description": "Number of worker processes to use for classification. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"column_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter columns for classification. This is used in combination with other patterns in parent config. Specify regex to match the column name in `database.schema.table.column` format."
},
"info_type_to_term": {
"additionalProperties": {
"type": "string"
},
"default": {},
"description": "Optional mapping to provide glossary term identifier for info type",
"title": "Info Type To Term",
"type": "object"
},
"classifiers": {
"default": [
{
"type": "datahub",
"config": null
}
],
"description": "Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.",
"items": {
"$ref": "#/$defs/DynamicTypedClassifierConfig"
},
"title": "Classifiers",
"type": "array"
}
},
"title": "ClassificationConfig",
"type": "object"
},
"DynamicTypedClassifierConfig": {
"additionalProperties": false,
"properties": {
"type": {
"description": "The type of the classifier to use. For DataHub, use `datahub`",
"title": "Type",
"type": "string"
},
"config": {
"anyOf": [
{},
{
"type": "null"
}
],
"default": null,
"description": "The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.",
"title": "Config"
}
},
"required": [
"type"
],
"title": "DynamicTypedClassifierConfig",
"type": "object"
},
"GEProfilingConfig": {
"additionalProperties": false,
"properties": {
"enabled": {
"default": false,
"description": "Whether profiling should be done.",
"title": "Enabled",
"type": "boolean"
},
"operation_config": {
"$ref": "#/$defs/OperationConfig",
"description": "Experimental feature. To specify operation configs."
},
"limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Max number of documents to profile. By default, profiles all documents.",
"title": "Limit"
},
"offset": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Offset in documents to profile. By default, uses no offset.",
"title": "Offset"
},
"profile_table_level_only": {
"default": false,
"description": "Whether to perform profiling at table-level only, or include column-level profiling as well.",
"title": "Profile Table Level Only",
"type": "boolean"
},
"include_field_null_count": {
"default": true,
"description": "Whether to profile for the number of nulls for each column.",
"title": "Include Field Null Count",
"type": "boolean"
},
"include_field_distinct_count": {
"default": true,
"description": "Whether to profile for the number of distinct values for each column.",
"title": "Include Field Distinct Count",
"type": "boolean"
},
"include_field_min_value": {
"default": true,
"description": "Whether to profile for the min value of numeric columns.",
"title": "Include Field Min Value",
"type": "boolean"
},
"include_field_max_value": {
"default": true,
"description": "Whether to profile for the max value of numeric columns.",
"title": "Include Field Max Value",
"type": "boolean"
},
"include_field_mean_value": {
"default": true,
"description": "Whether to profile for the mean value of numeric columns.",
"title": "Include Field Mean Value",
"type": "boolean"
},
"include_field_median_value": {
"default": true,
"description": "Whether to profile for the median value of numeric columns.",
"title": "Include Field Median Value",
"type": "boolean"
},
"include_field_stddev_value": {
"default": true,
"description": "Whether to profile for the standard deviation of numeric columns.",
"title": "Include Field Stddev Value",
"type": "boolean"
},
"include_field_quantiles": {
"default": false,
"description": "Whether to profile for the quantiles of numeric columns.",
"title": "Include Field Quantiles",
"type": "boolean"
},
"include_field_distinct_value_frequencies": {
"default": false,
"description": "Whether to profile for distinct value frequencies.",
"title": "Include Field Distinct Value Frequencies",
"type": "boolean"
},
"include_field_histogram": {
"default": false,
"description": "Whether to profile for the histogram for numeric fields.",
"title": "Include Field Histogram",
"type": "boolean"
},
"include_field_sample_values": {
"default": true,
"description": "Whether to profile for the sample values for all columns.",
"title": "Include Field Sample Values",
"type": "boolean"
},
"max_workers": {
"default": 20,
"description": "Number of worker threads to use for profiling. Set to 1 to disable.",
"title": "Max Workers",
"type": "integer"
},
"report_dropped_profiles": {
"default": false,
"description": "Whether to report datasets or dataset columns which were not profiled. Set to `True` for debugging purposes.",
"title": "Report Dropped Profiles",
"type": "boolean"
},
"turn_off_expensive_profiling_metrics": {
"default": false,
"description": "Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.",
"title": "Turn Off Expensive Profiling Metrics",
"type": "boolean"
},
"field_sample_values_limit": {
"default": 20,
"description": "Upper limit for number of sample values to collect for all columns.",
"title": "Field Sample Values Limit",
"type": "integer"
},
"max_number_of_fields_to_profile": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "A positive integer that specifies the maximum number of columns to profile for any table. `None` implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.",
"title": "Max Number Of Fields To Profile"
},
"profile_if_updated_since_days": {
"anyOf": [
{
"exclusiveMinimum": 0,
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"description": "Profile table only if it has been updated since these many number of days. If set to `null`, no constraint of last modified time for tables to profile. Supported only in `snowflake` and `BigQuery`.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery"
]
},
"title": "Profile If Updated Since Days"
},
"profile_table_size_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5,
"description": "Profile tables only if their size is less than specified GBs. If set to `null`, no limit on the size of tables to profile. Supported only in `Snowflake`, `BigQuery` and `Databricks`. Supported for `Oracle` based on calculated size from gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"unity-catalog",
"oracle"
]
},
"title": "Profile Table Size Limit"
},
"profile_table_row_limit": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": 5000000,
"description": "Profile tables only if their row count is less than specified count. If set to `null`, no limit on the row count of tables to profile. Supported only in `Snowflake`, `BigQuery`. Supported for `Oracle` based on gathered stats.",
"schema_extra": {
"supported_sources": [
"snowflake",
"bigquery",
"oracle"
]
},
"title": "Profile Table Row Limit"
},
"profile_table_row_count_estimate_only": {
"default": false,
"description": "Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL. ",
"schema_extra": {
"supported_sources": [
"postgres",
"mysql"
]
},
"title": "Profile Table Row Count Estimate Only",
"type": "boolean"
},
"query_combiner_enabled": {
"default": true,
"description": "*This feature is still experimental and can be disabled if it causes issues.* Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.",
"title": "Query Combiner Enabled",
"type": "boolean"
},
"catch_exceptions": {
"default": true,
"description": "",
"title": "Catch Exceptions",
"type": "boolean"
},
"partition_profiling_enabled": {
"default": true,
"description": "Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.",
"schema_extra": {
"supported_sources": [
"athena",
"bigquery"
]
},
"title": "Partition Profiling Enabled",
"type": "boolean"
},
"partition_datetime": {
"anyOf": [
{
"format": "date-time",
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.",
"schema_extra": {
"supported_sources": [
"bigquery"
]
},
"title": "Partition Datetime"
},
"use_sampling": {
"default": true,
"description": "Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables. ",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Use Sampling",
"type": "boolean"
},
"sample_size": {
"default": 10000,
"description": "Number of rows to be sampled from table for column level profiling.Applicable only if `use_sampling` is set to True.",
"schema_extra": {
"supported_sources": [
"bigquery",
"snowflake"
]
},
"title": "Sample Size",
"type": "integer"
},
"profile_external_tables": {
"default": false,
"description": "Whether to profile external tables. Only Snowflake and Redshift supports this.",
"schema_extra": {
"supported_sources": [
"redshift",
"snowflake"
]
},
"title": "Profile External Tables",
"type": "boolean"
},
"tags_to_ignore_sampling": {
"anyOf": [
{
"items": {
"type": "string"
},
"type": "array"
},
{
"type": "null"
}
],
"default": null,
"description": "Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on `use_sampling`.",
"title": "Tags To Ignore Sampling"
},
"profile_nested_fields": {
"default": false,
"description": "Whether to profile complex types like structs, arrays and maps. ",
"title": "Profile Nested Fields",
"type": "boolean"
}
},
"title": "GEProfilingConfig",
"type": "object"
},
"LineageDirection": {
"description": "Direction of lineage relationship between storage and Hive",
"enum": [
"upstream",
"downstream"
],
"title": "LineageDirection",
"type": "string"
},
"OperationConfig": {
"additionalProperties": false,
"properties": {
"lower_freq_profile_enabled": {
"default": false,
"description": "Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.",
"title": "Lower Freq Profile Enabled",
"type": "boolean"
},
"profile_day_of_week": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Day Of Week"
},
"profile_date_of_month": {
"anyOf": [
{
"type": "integer"
},
{
"type": "null"
}
],
"default": null,
"description": "Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.",
"title": "Profile Date Of Month"
}
},
"title": "OperationConfig",
"type": "object"
},
"StatefulStaleMetadataRemovalConfig": {
"additionalProperties": false,
"description": "Base specialized config for Stateful Ingestion with stale metadata removal capability.",
"properties": {
"enabled": {
"default": false,
"description": "Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or `datahub_api` is specified, otherwise False",
"title": "Enabled",
"type": "boolean"
},
"remove_stale_metadata": {
"default": true,
"description": "Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.",
"title": "Remove Stale Metadata",
"type": "boolean"
},
"fail_safe_threshold": {
"default": 75.0,
"description": "Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.",
"maximum": 100.0,
"minimum": 0.0,
"title": "Fail Safe Threshold",
"type": "number"
}
},
"title": "StatefulStaleMetadataRemovalConfig",
"type": "object"
}
},
"additionalProperties": false,
"properties": {
"emit_storage_lineage": {
"default": false,
"description": "Whether to emit storage-to-Hive lineage. When enabled, creates lineage relationships between Hive tables and their underlying storage locations (S3, Azure, GCS, HDFS, etc.).",
"title": "Emit Storage Lineage",
"type": "boolean"
},
"hive_storage_lineage_direction": {
"$ref": "#/$defs/LineageDirection",
"default": "upstream",
"description": "Direction of storage lineage. If 'upstream', storage is treated as upstream to Hive (data flows from storage to Hive). If 'downstream', storage is downstream to Hive (data flows from Hive to storage)."
},
"include_column_lineage": {
"default": true,
"description": "When enabled along with emit_storage_lineage, column-level lineage will be extracted between Hive table columns and storage location fields.",
"title": "Include Column Lineage",
"type": "boolean"
},
"storage_platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "Platform instance for the storage system (e.g., 'my-s3-instance'). Used when generating URNs for storage datasets.",
"title": "Storage Platform Instance"
},
"table_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for tables to filter in ingestion. Specify regex to match the entire table name in database.schema.table format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"view_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for views to filter in ingestion. Note: Defaults to table_pattern if not specified. Specify regex to match the entire view name in database.schema.view format. e.g. to match all views starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*'"
},
"classification": {
"$ref": "#/$defs/ClassificationConfig",
"default": {
"enabled": false,
"sample_size": 100,
"max_workers": 4,
"table_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"column_pattern": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"info_type_to_term": {},
"classifiers": [
{
"config": null,
"type": "datahub"
}
]
},
"description": "For details, refer to [Classification](../../../../metadata-ingestion/docs/dev_guides/classification.md)."
},
"incremental_lineage": {
"default": false,
"description": "When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.",
"title": "Incremental Lineage",
"type": "boolean"
},
"convert_urns_to_lowercase": {
"default": false,
"description": "Whether to convert dataset urns to lowercase.",
"title": "Convert Urns To Lowercase",
"type": "boolean"
},
"env": {
"default": "PROD",
"description": "The environment that all assets produced by this connector belong to",
"title": "Env",
"type": "string"
},
"platform_instance": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.",
"title": "Platform Instance"
},
"stateful_ingestion": {
"anyOf": [
{
"$ref": "#/$defs/StatefulStaleMetadataRemovalConfig"
},
{
"type": "null"
}
],
"default": null
},
"options": {
"additionalProperties": true,
"description": "Any options specified here will be passed to [SQLAlchemy.create_engine](https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine) as kwargs. To set connection arguments in the URL, specify them under `connect_args`.",
"title": "Options",
"type": "object"
},
"profile_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns to filter tables (or specific columns) for profiling during ingestion. Note that only tables allowed by the `table_pattern` will be considered."
},
"domain": {
"additionalProperties": {
"$ref": "#/$defs/AllowDenyPattern"
},
"default": {},
"description": "Attach domains to databases, schemas or tables during ingestion using regex patterns. Domain key can be a guid like *urn:li:domain:ec428203-ce86-4db3-985d-5a8ee6df32ba* or a string like \"Marketing\".) If you provide strings, then datahub will attempt to resolve this name to a guid, and will error out if this fails. There can be multiple domain keys specified.",
"title": "Domain",
"type": "object"
},
"include_views": {
"default": true,
"description": "Whether views should be ingested.",
"title": "Include Views",
"type": "boolean"
},
"include_tables": {
"default": true,
"description": "Whether tables should be ingested.",
"title": "Include Tables",
"type": "boolean"
},
"include_view_lineage": {
"default": true,
"description": "Populates view->view and table->view lineage using DataHub's sql parser.",
"title": "Include View Lineage",
"type": "boolean"
},
"include_view_column_lineage": {
"default": true,
"description": "Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires `include_view_lineage` to be enabled.",
"title": "Include View Column Lineage",
"type": "boolean"
},
"use_file_backed_cache": {
"default": true,
"description": "Whether to use a file backed cache for the view definitions.",
"title": "Use File Backed Cache",
"type": "boolean"
},
"profiling": {
"$ref": "#/$defs/GEProfilingConfig",
"default": {
"enabled": false,
"operation_config": {
"lower_freq_profile_enabled": false,
"profile_date_of_month": null,
"profile_day_of_week": null
},
"limit": null,
"offset": null,
"profile_table_level_only": false,
"include_field_null_count": true,
"include_field_distinct_count": true,
"include_field_min_value": true,
"include_field_max_value": true,
"include_field_mean_value": true,
"include_field_median_value": true,
"include_field_stddev_value": true,
"include_field_quantiles": false,
"include_field_distinct_value_frequencies": false,
"include_field_histogram": false,
"include_field_sample_values": true,
"max_workers": 20,
"report_dropped_profiles": false,
"turn_off_expensive_profiling_metrics": false,
"field_sample_values_limit": 20,
"max_number_of_fields_to_profile": null,
"profile_if_updated_since_days": null,
"profile_table_size_limit": 5,
"profile_table_row_limit": 5000000,
"profile_table_row_count_estimate_only": false,
"query_combiner_enabled": true,
"catch_exceptions": true,
"partition_profiling_enabled": true,
"partition_datetime": null,
"use_sampling": true,
"sample_size": 10000,
"profile_external_tables": false,
"tags_to_ignore_sampling": null,
"profile_nested_fields": false
}
},
"username": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "username",
"title": "Username"
},
"password": {
"anyOf": [
{
"format": "password",
"type": "string",
"writeOnly": true
},
{
"type": "null"
}
],
"default": null,
"description": "password",
"title": "Password"
},
"host_port": {
"description": "host URL",
"title": "Host Port",
"type": "string"
},
"database": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "database (catalog)",
"title": "Database"
},
"sqlalchemy_uri": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"description": "URI of database to connect to. See https://docs.sqlalchemy.org/en/14/core/engines.html#database-urls. Takes precedence over other connection parameters.",
"title": "Sqlalchemy Uri"
},
"database_pattern": {
"$ref": "#/$defs/AllowDenyPattern",
"default": {
"allow": [
".*"
],
"deny": [],
"ignoreCase": true
},
"description": "Regex patterns for databases to filter in ingestion."
}
},
"required": [
"host_port"
],
"title": "HiveConfig",
"type": "object"
}
```
Code Coordinates
- Class Name: `datahub.ingestion.source.sql.hive.hive_source.HiveSource`
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Hive, feel free to ping us on our Slack.