Skip to main content

Snowflake

Overview

Snowflake is a data platform used to store and query analytical or operational data. Learn more in the official Snowflake documentation.

The DataHub integration for Snowflake covers core metadata entities such as datasets/tables/views, schema fields, and containers. It also captures table- and column-level lineage, usage statistics, data profiling, tags, and stateful deletion detection.

Snowflake Ingestion through the UI

The following video shows you how to ingest Snowflake metadata through the UI.

Read on if you are interested in ingesting Snowflake metadata using the datahub cli, or want to learn about all the configuration parameters that are supported by the connectors.

Concept Mapping

Snowflake ConceptDataHub Entity (Subtype)Notes
AccountPlatform InstanceTop-level scope; all URNs include the configured platform instance.
DatabaseContainer (DATABASE)Top-level namespace. Ingested with description, tags, and Snowsight URL.
SchemaContainer (SCHEMA)Nested under its Database container.
TableDataset (TABLE)Includes regular, Iceberg, and hybrid tables. Schema, PKs/FKs, tags, and descriptions are extracted.
Dynamic TableDataset (DYNAMIC TABLE)Includes target lag, SQL definition, and lineage to source tables.
ViewDataset (VIEW)Standard, materialized, and secure views. View definition is captured.
Semantic ViewDataset (SEMANTIC VIEW)Columns classified as DIMENSION, FACT, or METRIC. Column-level lineage to physical tables is extracted.
StreamDataset (SNOWFLAKE STREAM)Change-data-capture stream. Adds METADATA$ACTION, METADATA$ISUPDATE, METADATA$ROW_ID columns and lineage to the source table.
External TableDataset (TABLE)Lineage to the backing cloud storage location is emitted when available.
Internal StageContainer (SNOWFLAKE STAGE) + Dataset (SNOWFLAKE STAGE DATA)Emits both a Container (organizational) and a Dataset (for the resident data).
External StageContainer (SNOWFLAKE STAGE)Container only; the backing cloud storage asset (S3/GCS/Azure) is referenced via lineage.
TaskDataFlow (SNOWFLAKE TASK GROUP) + DataJob (SNOWFLAKE TASK)One DataFlow per schema; each task is a DataJob. Predecessor relationships appear as DataJob inputs.
PipeDataFlow (SNOWFLAKE PIPE GROUP) + DataJob (SNOWFLAKE PIPE)One DataFlow per schema; each pipe is a DataJob linking a stage to a target table via lineage.
Streamlit AppDashboard (STREAMLIT)App name, owner, and Snowsight URL captured as custom properties.
Column / fieldSchemaFieldColumn type, nullability, descriptions, and tags are extracted where available.
RoleCorpGroupOwnership roles are mapped to urn:li:corpGroup:{role_name}.
TagTag or Structured PropertyControlled by extract_tags config. Tags support database/schema/table/column inheritance.
Table- and column-level lineageLineage edgesExtracted from view definitions, dynamic table definitions, and SQL query history.
Query operations and usageDatasetUsageStatistics, OperationPer-dataset query counts, user access patterns, and DML operation metrics.

Module snowflake

Certified

Important Capabilities

CapabilityStatusNotes
Asset ContainersEnabled by default. Supported for types - Database, Schema.
ClassificationOptionally enabled via classification.enabled.
Column-level LineageEnabled by default, can be disabled via configuration include_column_lineage.
Data ProfilingOptionally enabled via configuration profiling.enabled.
Dataset UsageEnabled by default, can be disabled via configuration include_usage_stats.
DescriptionsEnabled by default.
Detect Deleted EntitiesEnabled by default via stateful ingestion.
DomainsSupported via the domain config field.
Extract TagsOptionally enabled via extract_tags.
Operation CaptureEnabled by default, can be disabled via configuration include_operational_stats.
Platform InstanceEnabled by default.
Schema MetadataEnabled by default.
Table-Level LineageEnabled by default, can be disabled via configuration include_table_lineage.
Test ConnectionEnabled by default.

Overview

The snowflake module ingests metadata from Snowflake into DataHub. It is intended for production ingestion workflows and module-specific capabilities are documented below.

Prerequisites

Requires specific privileges to read metadata from your Snowflake warehouse.

Execute the following commands as ACCOUNTADMIN or a user with MANAGE GRANTS privilege to create a DataHub-specific role:

create or replace role datahub_role;

// Grant access to a warehouse to run queries to view metadata
grant operate, usage on warehouse "<your-warehouse>" to role datahub_role;

// Grant access to view database and schema in which your tables/views/dynamic tables exist
grant usage on DATABASE "<your-database>" to role datahub_role;
grant usage on all schemas in database "<your-database>" to role datahub_role;
grant usage on future schemas in database "<your-database>" to role datahub_role;
grant select on all streams in database "<your-database>" to role datahub_role;
grant select on future streams in database "<your-database>" to role datahub_role;

// If you are NOT using Snowflake Profiling or Classification feature: Grant references privileges to your tables and views
grant references on all tables in database "<your-database>" to role datahub_role;
grant references on future tables in database "<your-database>" to role datahub_role;
grant references on all external tables in database "<your-database>" to role datahub_role;
grant references on future external tables in database "<your-database>" to role datahub_role;
grant references on all views in database "<your-database>" to role datahub_role;
grant references on future views in database "<your-database>" to role datahub_role;
-- Note: Semantic views are covered by the above view grants

-- Grant monitor privileges for dynamic tables
grant monitor on all dynamic tables in database "<your-database>" to role datahub_role;
grant monitor on future dynamic tables in database "<your-database>" to role datahub_role;

// If you ARE using Snowflake Profiling or Classification feature: Grant select privileges to your tables
grant select on all tables in database "<your-database>" to role datahub_role;
grant select on future tables in database "<your-database>" to role datahub_role;
grant select on all external tables in database "<your-database>" to role datahub_role;
grant select on future external tables in database "<your-database>" to role datahub_role;
grant select on all dynamic tables in database "<your-database>" to role datahub_role;
grant select on future dynamic tables in database "<your-database>" to role datahub_role;

// Create a new DataHub user and assign the DataHub role to it
create user datahub_user display_name = 'DataHub' password='' default_role = datahub_role default_warehouse = '<your-warehouse>';

// Grant the datahub_role to the new DataHub user.
grant role datahub_role to user datahub_user;

// Optional - required if extracting lineage, usage or tags (without lineage)
grant imported privileges on database snowflake to role datahub_role;

// Optional - required for INTERNAL marketplace ingestion (private data sharing)
// This grants access to:
// - SHOW AVAILABLE LISTINGS (IS_ORGANIZATION = TRUE) for internal marketplace listings
// - SNOWFLAKE.ACCOUNT_USAGE.DATABASES for identifying imported databases
// - SNOWFLAKE.DATA_SHARING_USAGE.LISTING_ACCESS_HISTORY for usage statistics
// - DESCRIBE AVAILABLE LISTING for enriched metadata (if fetch_internal_marketplace_listing_details=true)
grant imported privileges on database snowflake to role datahub_role;

// For marketplace provider mode (marketplace_mode: provider or both):
// SHOW SHARES lists shares the role owns or is inheriting from a parent role.
// DESC SHARE requires OWNERSHIP of the share — Snowflake grants no finer-grained privilege.
// Marketplace-created shares are typically owned by SYSADMIN. To allow DESC SHARE,
// grant SYSADMIN to the DataHub role (use SECURITYADMIN to grant roles):
use role securityadmin;
grant role sysadmin to role datahub_role;

// Alternatively, set `role: SYSADMIN` directly in your recipe.

// Optional - required if extracting Streamlit Apps
grant usage on all streamlits in database "<your-database>" to role datahub_role;
grant usage on future streamlits in database "<your-database>" to role datahub_role;

// Optional - required if extracting Stages, Tasks, or Pipes
grant usage on all stages in database "<your-database>" to role datahub_role;
grant usage on future stages in database "<your-database>" to role datahub_role;
grant monitor on all tasks in database "<your-database>" to role datahub_role;
grant monitor on future tasks in database "<your-database>" to role datahub_role;
grant monitor on all pipes in database "<your-database>" to role datahub_role;
grant monitor on future pipes in database "<your-database>" to role datahub_role;

The details of each granted privilege can be viewed in the Snowflake docs. A summary of each privilege and why it is required for this connector:

  • operate is required only to start the warehouse. If the warehouse is already running during ingestion or has auto-resume enabled, this permission is not required.
  • usage is required to run queries using the warehouse
  • usage on database and schema are required because without them, tables, views, and streams inside them are not accessible. If an admin does the required grants on table but misses the grants on schema or the database in which the table/view/stream exists, then we will not be able to get metadata for the table/view/stream.
  • If metadata is required only on some schemas, then you can grant the usage privileges only on a particular schema like:
grant usage on schema "<your-database>"."<your-schema>" to role datahub_role;
  • select on streams is required for stream definitions to be available. This does not allow selecting the data (not required) unless the underlying dataset has select access as well.
  • usage on streamlit is required to show streamlits in a database. See the schema-level usage example above.
  • usage on stages is required to list stages via SHOW STAGES. Only needed if include_stages: true or include_pipes: true.
  • monitor on tasks is required to list tasks via SHOW TASKS. Only needed if include_tasks: true.
  • monitor on pipes is required to list pipes via SHOW PIPES. Only needed if include_pipes: true.

This represents the bare minimum privileges required to extract databases, schemas, views, and tables from Snowflake.

If you plan to enable extraction of table lineage via the include_table_lineage config flag, extraction of usage statistics via the include_usage_stats config, or extraction of tags (without lineage) via the extract_tags config, you'll also need to grant access to the Account Usage system tables from which the DataHub source extracts information. This can be done by granting access to the snowflake database.

grant imported privileges on database snowflake to role datahub_role;

Note that imported privileges grants access to all schemas and views in the shared SNOWFLAKE database, primarily:

  • SNOWFLAKE.ACCOUNT_USAGE.* (all views: QUERY_HISTORY, ACCESS_HISTORY, USERS, etc.)
  • SNOWFLAKE.ORGANIZATION_USAGE.* (requires separate enablement by Snowflake support at the organization level)

The SNOWFLAKE database is a shared database owned by Snowflake. Unlike regular databases where you can grant granular SELECT privileges on individual tables, shared databases require granting IMPORTED PRIVILEGES which provides all-or-nothing access to all objects in the database.

Which ACCOUNT_USAGE Tables Does DataHub Access?

When you grant IMPORTED PRIVILEGES, DataHub will specifically access the following ACCOUNT_USAGE tables:

TablePurposeRequired For
QUERY_HISTORYQuery logs for lineage, usage stats, and semantic view usageinclude_table_lineage, include_usage_stats, include_queries
ACCESS_HISTORYTable/view lineage and access patternsinclude_table_lineage, include_usage_stats
USERSUser email mapping for corp user entitiesinclude_usage_stats (for user attribution)
TAG_REFERENCESTag metadata extractionextract_tags
VIEWSView metadata (DDL, ownership, etc.) for all viewsAlways (when views exist)
COPY_HISTORYLineage from COPY INTO operations (all stages/sources)include_table_lineage

If you cannot grant IMPORTED PRIVILEGES due to security policies, the related features (lineage, usage, tags) will not work, and you'll see permission errors in the ingestion logs.

Authentication

Authentication is most simply done via a Snowflake user and password.

Alternatively, other authentication methods are supported via the authentication_type config option.

Key Pair Authentication

To set up Key Pair authentication, follow the three steps in this guide:

  • Generate the private key
  • Generate the public key
  • Assign the public key to the DataHub user to be configured in the recipe.

Pass in the following values in the recipe config instead of a password, ensuring the private key maintains proper PEM format with line breaks at the beginning, end, and approximately every 64 characters within the key:

authentication_type: KEY_PAIR_AUTHENTICATOR
private_key: <Private key in a form of '-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----'>

# Optional - if using encrypted private key
private_key_password: <Password for your private key>
Okta OAuth

To set up Okta OAuth authentication, roughly follow the four steps in this guide.

Pass in the following values, as described in the article, for your recipe's oauth_config:

  • provider: okta
  • client_id: <OAUTH_CLIENT_ID>
  • client_secret: <OAUTH_CLIENT_SECRET>
  • authority_url: <OKTA_OAUTH_TOKEN_ENDPOINT>
  • scopes: The list of your Okta scopes, i.e. with the session:role: prefix

DataHub only supports two OAuth grant types: client_credentials and password. The steps slightly differ based on which you decide to use.

Client Credentials Grant Type (Simpler)
  • When creating an Okta App Integration, choose type API Services
    • Ensure client authentication method is Client secret
    • Note your Client ID
  • Create a Snowflake user to correspond to your newly created Okta client credentials
    • Ensure the user's Login Name matches your Okta application's Client ID
    • Ensure the user has been granted your DataHub role
Password Grant Type
  • When creating an Okta App Integration, choose type OIDC -> Native Application
    • Add Grant Type Resource Owner Password
    • Ensure client authentication method is Client secret
  • Create an Okta user to sign into, noting the Username and Password
  • Create a Snowflake user to correspond to your newly created Okta client credentials
    • Ensure the user's Login Name matches your Okta user's Username (likely an email)
    • Ensure the user has been granted your DataHub role
  • When running ingestion, provide the required oauth_config fields, including client_id and client_secret, plus your Okta user's Username and Password
    • Note: the username and password config options are not nested under oauth_config

Install the Plugin

pip install 'acryl-datahub[snowflake]'

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
type: snowflake
config:
# This option is recommended to be used to ingest all lineage on the first run.
ignore_start_time_lineage: true

# Coordinates
account_id: "abc48144"
warehouse: "COMPUTE_WH"

# Credentials
username: "${SNOWFLAKE_USER}"
password: "${SNOWFLAKE_PASS}"
role: "datahub_role"

# (Optional) Uncomment and update this section to filter ingested datasets
# database_pattern:
# allow:
# - "^ACCOUNTING_DB$"
# - "^MARKETING_DB$"

profiling:
# Change to false to disable profiling
enabled: true
# This option is recommended to reduce profiling time and costs.
turn_off_expensive_profiling_metrics: true

# (Optional) Uncomment and update this section to filter profiled tables
# profile_pattern:
# allow:
# - "ACCOUNTING_DB.*.*"
# - "MARKETING_DB.*.*"

# (Optional) Ingest INTERNAL marketplace (private data sharing) listings as Data Products
# NOTE: This is for INTERNAL marketplace only, not the public Snowflake Data Marketplace
# marketplace:
# enabled: true
# marketplace_mode: "consumer" # Options: "consumer" (default), "provider", "both"
# internal_marketplace_listing_pattern:
# allow: [".*"]
# internal_marketplace_owner_patterns:
# "^Customer.*": ["data-team"]
# "^Finance.*": ["finance-team"]
# fetch_internal_marketplace_listing_details: false
# marketplace_properties_as_structured_properties: false
# # Time window for marketplace usage statistics (optional)
# # start_time: "-7 days" # Default: -1 day
# # end_time: "now"
# # bucket_duration: "DAY" # Options: "DAY", "HOUR"

# (Optional) Map data products to domains using purchased database names
# domain:
# "urn:li:domain:finance":
# allow: ["^FINANCE_.*", "^FIN_.*"]
# "Marketing":
# allow: ["^MARKETING_.*", "^MKT_.*"]

# (REQUIRED for marketplace) Link imported databases to their source shares/listings
# Without this, Data Products will be created but won't have any associated datasets
# Run: SHOW SHARES; and SELECT DATABASE_NAME FROM SNOWFLAKE.ACCOUNT_USAGE.DATABASES WHERE TYPE='IMPORTED DATABASE';
# shares:
# <SHARE_NAME>:
# database: "<SOURCE_DATABASE>" # Database in the share
# platform_instance: null
# consumers:
# - database: "<IMPORTED_DATABASE>" # Your purchased/imported database
# platform_instance: null
# # (Optional) Explicit marketplace listing mapping for precise linking
# # listing_global_name: "ACME.DATA.LISTING" # From: SHOW AVAILABLE LISTINGS

# Default sink is datahub-rest and doesn't need to be configured
# See https://docs.datahub.com/docs/metadata-ingestion/sink_docs/datahub for customization options

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

FieldDescription
account_id 
string
Snowflake account identifier. e.g. xy12345, xy12345.us-east-2.aws, xy12345.us-central1.gcp, xy12345.central-us.azure, xy12345.us-west-2.privatelink. Refer Account Identifiers for more details.
apply_view_usage_to_tables
boolean
Whether to apply view's usage to its base tables. If set to True, usage is applied to base tables only.
Default: False
authentication_type
string
The type of authenticator to use when connecting to Snowflake. Supports "DEFAULT_AUTHENTICATOR", "OAUTH_AUTHENTICATOR", "EXTERNAL_BROWSER_AUTHENTICATOR" and "KEY_PAIR_AUTHENTICATOR".
Default: DEFAULT_AUTHENTICATOR
bucket_duration
Enum
One of: "DAY", "HOUR"
connect_args
One of object, null
Connect args to pass to Snowflake SqlAlchemy driver
Default: None
convert_urns_to_lowercase
boolean
Whether to convert dataset urns to lowercase.
Default: True
email_domain
One of string, null
Email domain of your organization so users can be displayed on UI appropriately. This is used only if we cannot infer email ID.
Default: None
enable_stateful_lineage_ingestion
boolean
Enable stateful lineage ingestion. This will store lineage window timestamps after successful lineage ingestion. and will not run lineage ingestion for same timestamps in subsequent run. NOTE: This only works with use_queries_v2=False (legacy extraction path). For queries v2, use enable_stateful_time_window instead.
Default: True
enable_stateful_profiling
boolean
Enable stateful profiling. This will store profiling timestamps per dataset after successful profiling. and will not run profiling again in subsequent run if table has not been updated.
Default: True
enable_stateful_time_window
boolean
Enable stateful time window tracking. This will store the time window after successful extraction and adjust the time window in subsequent runs to avoid reprocessing. NOTE: This is ONLY applicable when using queries v2 (use_queries_v2=True). This replaces enable_stateful_lineage_ingestion and enable_stateful_usage_ingestion for the queries v2 extraction path, since queries v2 extracts lineage, usage, operations, and queries together from a single audit log and uses a unified time window.
Default: False
enable_stateful_usage_ingestion
boolean
Enable stateful lineage ingestion. This will store usage window timestamps after successful usage ingestion. and will not run usage ingestion for same timestamps in subsequent run. NOTE: This only works with use_queries_v2=False (legacy extraction path). For queries v2, use enable_stateful_time_window instead.
Default: True
end_time
string(date-time)
Latest date of lineage/usage to consider. Default: Current time in UTC
exclude_dynamic_tables
boolean
If enabled, dynamic tables will be excluded from ingestion. Use this to speed up ingestion if you don't need dynamic tables in DataHub.
Default: False
extract_tags
Enum
One of: "with_lineage", "without_lineage", "skip"
extract_tags_as_structured_properties
boolean
If enabled along with extract_tags, extracts snowflake's key-value tags as DataHub structured properties instead of DataHub tags.
Default: False
fetch_views_from_information_schema
boolean
If enabled, uses information_schema.views to fetch view definitions instead of SHOW VIEWS command. This alternative method can be more reliable for databases with large numbers of views (> 10K views), as the SHOW VIEWS approach has proven unreliable and can lead to missing views in such scenarios. However, this method requires OWNERSHIP privileges on views to retrieve their definitions. For views without ownership permissions (where VIEW_DEFINITION is null/empty), the system will automatically fall back to using batched SHOW VIEWS queries to populate the missing definitions.
Default: False
format_sql_queries
boolean
Whether to format sql queries
Default: False
ignore_start_time_lineage
boolean
Default: False
include_assertion_results
boolean
Whether to ingest assertion run results for assertions created using DataHub assertions CLI in Snowflake. Also required for external DMF ingestion.
Default: False
include_column_lineage
boolean
Populates table->table and view->table column lineage. Requires appropriate grants given to the role and the Snowflake Enterprise Edition or above.
Default: True
include_external_url
boolean
Whether to populate Snowsight url for Snowflake Objects
Default: True
include_externally_managed_dmfs
boolean
Ingest user-created Snowflake DMFs (not created via DataHub) as external assertions. Requires include_assertion_results: true. When enabled, all DMFs (not just datahub__* prefixed) will be ingested with their execution results. IMPORTANT: External DMFs must return 1 for SUCCESS and 0 for FAILURE. DataHub interprets VALUE=1 as passed, VALUE=0 as failed. See Snowflake DMF Assertions for details.
Default: False
include_foreign_keys
boolean
If enabled, populates the snowflake foreign keys.
Default: True
include_operational_stats
boolean
Whether to display operational stats.
Default: True
include_pipes
boolean
If enabled, Snowflake Snowpipe objects will be ingested as DataJobs with COPY INTO lineage.
Default: False
include_primary_keys
boolean
If enabled, populates the snowflake primary keys.
Default: True
include_procedures
boolean
If enabled, procedures will be ingested as pipelines/tasks.
Default: True
include_queries
boolean
If enabled, generate query entities associated with lineage edges. Only applicable if use_queries_v2 is enabled.
Default: True
include_query_usage_statistics
boolean
If enabled, generate query popularity statistics. Only applicable if use_queries_v2 is enabled.
Default: True
include_read_operational_stats
boolean
Whether to report read operational stats. Experimental.
Default: False
include_stages
boolean
If enabled, Snowflake Stages will be ingested as containers with associated metadata.
Default: False
include_streamlits
boolean
If enabled, Streamlit apps will be ingested as dashboards.
Default: False
include_streams
boolean
If enabled, streams will be ingested as separate entities from tables/views.
Default: True
include_table_lineage
boolean
If enabled, populates the snowflake table-to-table and s3-to-snowflake table lineage. Requires appropriate grants given to the role and Snowflake Enterprise Edition or above.
Default: True
include_table_location_lineage
boolean
If the source supports it, include table lineage to the underlying storage location.
Default: True
include_tables
boolean
Whether tables should be ingested.
Default: True
include_tasks
boolean
If enabled, Snowflake Tasks will be ingested as DataJobs with DAG dependencies and SQL lineage.
Default: False
include_technical_schema
boolean
If enabled, populates the snowflake technical schema and descriptions.
Default: True
include_top_n_queries
boolean
Whether to ingest the top_n_queries.
Default: True
include_usage_stats
boolean
If enabled, populates the snowflake usage statistics. Requires appropriate grants given to the role.
Default: True
include_view_column_lineage
boolean
Populates column-level lineage for view->view and table->view lineage using DataHub's sql parser. Requires include_view_lineage to be enabled.
Default: True
include_view_definitions
boolean
If enabled, populates the ingested views' definitions.
Default: True
include_view_lineage
boolean
Populates view->view and table->view lineage using DataHub's sql parser.
Default: True
include_views
boolean
Whether views should be ingested.
Default: True
incremental_lineage
boolean
When enabled, emits lineage as incremental to existing lineage already in DataHub. When disabled, re-states lineage on each run.
Default: False
incremental_properties
boolean
When enabled, emits dataset properties as incremental to existing dataset properties in DataHub. When disabled, re-states dataset properties on each run.
Default: False
known_snowflake_edition
One of Enum, null
Explicitly specify the Snowflake edition (STANDARD or ENTERPRISE). If unset, the edition will be inferred automatically using 'SHOW TAGS'.
Default: None
lazy_schema_resolver
boolean
If enabled, uses lazy schema resolver to resolve schemas for tables and views. This is useful if you have a large number of schemas and want to avoid bulk fetching the schema for each table/view.
Default: True
match_fully_qualified_names
boolean
Whether schema_pattern is matched against fully qualified schema name <catalog>.<schema>.
Default: False
options
object
Any options specified here will be passed to SQLAlchemy.create_engine as kwargs.
password
One of string(password), null
Snowflake password.
Default: None
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
private_key
One of string(password), null
Private key in a form of '-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n' if using key pair authentication. Encrypted version of private key will be in a form of '-----BEGIN ENCRYPTED PRIVATE KEY-----\nencrypted-private-key\n-----END ENCRYPTED PRIVATE KEY-----\n' See: https://docs.snowflake.com/en/user-guide/key-pair-auth.html
Default: None
private_key_password
One of string(password), null
Password for your private key. Required if using key pair authentication with encrypted private key.
Default: None
private_key_path
One of string, null
The path to the private key if using key pair authentication. Ignored if private_key is set. See: https://docs.snowflake.com/en/user-guide/key-pair-auth.html
Default: None
push_down_database_pattern_access_history
boolean
If enabled, pushes down database pattern filtering to the access_history table for improved performance. This filters on the accessed objects in access_history.
Default: False
push_down_metadata_patterns
boolean
If enabled, pushes down database_pattern, schema_pattern, table_pattern, and view_pattern filtering to Snowflake information_schema metadata queries using the RLIKE operator for improved performance. This applies only to metadata extraction queries (information_schema.databases, schemata, tables, views) — NOT to lineage/usage queries (for those, see push_down_database_pattern_access_history). NOTE: view_pattern pushdown only works when fetch_views_from_information_schema is also enabled. With the default SHOW VIEWS, view_pattern filtering falls back to Python re.match(). IMPORTANT: Snowflake RLIKE requires FULL STRING match, unlike Python re.match() which matches prefixes. For prefix matching use 'PATTERN.', for suffix use '.PATTERN$', for contains use '.PATTERN.'. See the Metadata Pattern Pushdown section for detailed usage and examples, and the Snowflake RLIKE documentation for regex syntax details.
Default: False
query_dedup_strategy
Enum
One of: "STANDARD", "NONE"
role
One of string, null
Snowflake role.
Default: None
snowflake_domain
string
Snowflake domain. Use 'snowflakecomputing.com' for most regions or 'snowflakecomputing.cn' for China (cn-northwest-1) region.
Default: snowflakecomputing.com
snowsight_base_url
One of string, null
Override for the Snowsight base URL used when generating external URLs for Snowflake assets. Set this when Snowsight is only reachable via private link (for example https://app.<region>.privatelink.snowflakecomputing.com/ or https://app-<org>-<account>.privatelink.snowflakecomputing.com/). If unset, defaults to the public app.snowflake.com URL. The value can be obtained by running SELECT SYSTEM$GET_PRIVATELINK_CONFIG() in Snowflake as ACCOUNTADMIN.
Default: None
start_time
string(date-time)
Earliest date of lineage/usage to consider. Default: Last full day in UTC (or hour, depending on bucket_duration). You can also specify relative time with respect to end_time such as '-7 days' Or '-7d'.
Default: None
structured_properties_write_mode
Enum
One of: "upsert", "patch"
token
One of string(password), null
OAuth token from external identity provider. Not recommended for most use cases because it will not be able to refresh once expired.
Default: None
top_n_queries
integer
Number of top queries to save to each table.
Default: 10
upstream_lineage_in_report
boolean
Default: False
use_file_backed_cache
boolean
Whether to use a file backed cache for the view definitions.
Default: True
use_queries_v2
boolean
If enabled, uses the new queries extractor to extract queries from snowflake.
Default: True
username
One of string, null
Snowflake username.
Default: None
validate_upstreams_against_patterns
boolean
Whether to validate upstream snowflake tables against allow-deny patterns
Default: True
warehouse
One of string, null
Snowflake warehouse.
Default: None
env
string
The environment that all assets produced by this connector belong to
Default: PROD
additional_database_names_allowlist
array
Additional database names (no pattern matching) to be included in the accesshistory filter. Only applies if push_down_database_pattern_access_history=True. These databases will be included in the filter being pushed down regardless of database_pattern settings.This may be required in the case of _eg temporary tables being created in a different database than the ones in the database_name patterns.
Default: []
additional_database_names_allowlist.string
string
database_pattern
AllowDenyPattern
A class to store allow deny regexes
database_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain
map(str,AllowDenyPattern)
A class to store allow deny regexes
domain.key.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
domain.key.allow.string
string
domain.key.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
domain.key.deny
array
List of regex patterns to exclude from ingestion.
Default: []
domain.key.deny.string
string
marketplace
SnowflakeMarketplaceConfig
Configuration for Snowflake Internal Marketplace (Private Data Sharing).

IMPORTANT: This is for the INTERNAL Snowflake Marketplace where organizations privately share
data within their account using Data Exchange. This is NOT for the public Snowflake Marketplace
(Snowflake Data Marketplace) where external providers publicly list datasets.

Use this when you want to track:
- Internal marketplace listings (from SHOW AVAILABLE LISTINGS IS_ORGANIZATION = TRUE)
- Databases purchased/imported from internal listings (IMPORTED DATABASE type - consumer mode)
- Databases you're sharing via OUTBOUND shares (provider mode)
- Usage of internal marketplace data products

The usage time window and bucket duration come from the parent connector's
start_time / end_time / bucket_duration (and from the same
RedundantUsageRunSkipHandler as the main usage extractor when stateful
usage ingestion is enabled), so marketplace usage follows the connector's
overall schedule.
marketplace.enabled
boolean
Whether to ingest Snowflake INTERNAL marketplace (private data exchange) listings as Data Products. When enabled, also ingests databases and usage statistics based on the marketplace_mode setting. NOTE: This is for INTERNAL marketplace only (IS_ORGANIZATION = TRUE), not the public Snowflake Data Marketplace.
Default: False
marketplace.fetch_internal_marketplace_listing_details
boolean
If enabled, fetches additional details for each INTERNAL marketplace listing via DESCRIBE AVAILABLE LISTING. WARNING: This executes one additional query per listing and may impact performance for many listings.
Default: False
marketplace.listing_to_share_overrides
map(str,string)
marketplace.marketplace_mode
Enum
One of: "consumer", "provider", "both"
marketplace.marketplace_properties_as_structured_properties
boolean
If enabled, ingests INTERNAL marketplace custom properties (provider, category, listing_created_on, etc.) as DataHub structured properties instead of simple custom properties. This makes marketplace metadata searchable and filterable in the DataHub UI.
Default: False
marketplace.organization_to_domain
map(str,string)
marketplace.internal_marketplace_listing_pattern
AllowDenyPattern
A class to store allow deny regexes
marketplace.internal_marketplace_listing_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
marketplace.internal_marketplace_owner_patterns
map(str,array)
marketplace.internal_marketplace_owner_patterns.key.string
string
marketplace.listing_to_schemas_overrides
map(str,array)
marketplace.listing_to_schemas_overrides.key.string
string
oauth_config
One of OAuthConfiguration, null
oauth configuration - https://docs.snowflake.com/en/user-guide/python-connector-example.html#connecting-with-oauth
Default: None
oauth_config.authority_url 
string
Authority url of your identity provider
oauth_config.client_id 
string
client id of your registered application
oauth_config.provider 
Enum
One of: "microsoft", "okta"
oauth_config.scopes 
array
scopes required to connect to snowflake
oauth_config.scopes.string
string
oauth_config.client_secret
One of string(password), null
client secret of the application if use_certificate = false
Default: None
oauth_config.encoded_oauth_private_key
One of string(password), null
base64 encoded private key content if use_certificate = true
Default: None
oauth_config.encoded_oauth_public_key
One of string, null
base64 encoded certificate content if use_certificate = true
Default: None
oauth_config.use_certificate
boolean
Do you want to use certificate and private key to authenticate using oauth
Default: False
pipe_pattern
AllowDenyPattern
A class to store allow deny regexes
pipe_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
procedure_pattern
AllowDenyPattern
A class to store allow deny regexes
procedure_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
profile_pattern
AllowDenyPattern
A class to store allow deny regexes
profile_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
pushdown_allow_usernames
array
List of snowflake usernames (SQL LIKE patterns, e.g., 'ANALYST_%', '%_USER', 'MAIN_ACCOUNT') which WILL be considered for lineage/usage/queries extraction. This is primarily useful for improving performance by filtering in only specific users. Only applicable if use_queries_v2 is enabled. If not specified, all users not in deny list are included.
Default: []
pushdown_allow_usernames.string
string
pushdown_deny_usernames
array
List of snowflake usernames (SQL LIKE patterns, e.g., 'SERVICE_%', '%_PROD', 'TEST_USER') which will NOT be considered for lineage/usage/queries extraction. This is primarily useful for improving performance by filtering out users with extremely high query volumes. Only applicable if use_queries_v2 is enabled.
Default: []
pushdown_deny_usernames.string
string
schema_pattern
AllowDenyPattern
A class to store allow deny regexes
schema_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
semantic_view_pattern
AllowDenyPattern
A class to store allow deny regexes
semantic_view_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
semantic_views
SemanticViewsConfig
semantic_views.column_lineage
boolean
If enabled, column-level lineage will be generated for semantic views, mapping dimensions, facts, and metrics to their source columns in base tables. Only applicable when enabled is True.
Default: False
semantic_views.enabled
boolean
If enabled, semantic views will be ingested as datasets. Note: Semantic views require Snowflake Enterprise Edition or above, as they are part of the Cortex Analyst feature set. Set this to True only if you have Enterprise Edition or above.
Default: False
semantic_views.include_queries
boolean
If enabled, generate query entities for queries against semantic views.
Default: False
semantic_views.include_usage
boolean
If enabled, usage statistics will be extracted for semantic views. This scans QUERY_HISTORY which can be slow on accounts with high query volume.
Default: False
semantic_views.max_queries_per_view
integer
Maximum number of query entities to emit per semantic view. Only applicable when include_queries is True.
Default: 100
shares
One of SnowflakeShareConfig, null
Required if current account owns or consumes snowflake share.If specified, connector creates lineage and siblings relationship between current account's database tables and consumer/producer account's database tables. Map of share name -> details of share.
Default: None
shares.key.database 
string
Database from which share is created.
shares.key.consumers 
array
List of databases created in consumer accounts.
shares.key.consumers.DatabaseId
DatabaseId
shares.key.consumers.DatabaseId.database 
string
shares.key.consumers.DatabaseId.platform_instance
One of string, null
Default: None
shares.key.platform_instance
One of string, null
Platform instance for snowflake account in which share is created.
Default: None
stage_pattern
AllowDenyPattern
A class to store allow deny regexes
stage_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
stream_pattern
AllowDenyPattern
A class to store allow deny regexes
stream_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
streamlit_pattern
AllowDenyPattern
A class to store allow deny regexes
streamlit_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
structured_property_pattern
AllowDenyPattern
A class to store allow deny regexes
structured_property_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
table_pattern
AllowDenyPattern
A class to store allow deny regexes
table_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
table_types
array
Set of Snowflake TABLE_TYPE values to include in ingestion. Currently Supported values: 'BASE TABLE', 'EXTERNAL TABLE'. Remove 'EXTERNAL TABLE' to exclude external tables from ingestion.
Default: ['BASE TABLE', 'EXTERNAL TABLE']
table_types.string
string
tag_pattern
AllowDenyPattern
A class to store allow deny regexes
tag_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
task_pattern
AllowDenyPattern
A class to store allow deny regexes
task_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
temporary_tables_pattern
array
[Advanced] Regex patterns for temporary tables to filter in lineage ingestion. Specify regex to match the entire table name in database.schema.table format. Defaults are to set in such a way to ignore the temporary staging tables created by known ETL tools.
Default: ['.*\.FIVETRAN_.*_STAGING\..*', '.*__DBT_TMP$', ...
temporary_tables_pattern.string
string
user_email_pattern
AllowDenyPattern
A class to store allow deny regexes
user_email_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
view_pattern
AllowDenyPattern
A class to store allow deny regexes
view_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
classification
ClassificationConfig
classification.enabled
boolean
Whether classification should be used to auto-detect glossary terms
Default: False
classification.info_type_to_term
map(str,string)
classification.max_workers
integer
Number of worker processes to use for classification. Set to 1 to disable.
Default: 4
classification.sample_size
integer
Number of sample values used for classification.
Default: 100
classification.classifiers
array
Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance.
Default: [{'type': 'datahub', 'config': None}]
classification.classifiers.DynamicTypedClassifierConfig
DynamicTypedClassifierConfig
classification.classifiers.DynamicTypedClassifierConfig.type 
string
The type of the classifier to use. For DataHub, use datahub
classification.classifiers.DynamicTypedClassifierConfig.config
One of object, null
The configuration required for initializing the classifier. If not specified, uses defaults for classifer type.
Default: None
classification.column_pattern
AllowDenyPattern
A class to store allow deny regexes
classification.column_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
classification.table_pattern
AllowDenyPattern
A class to store allow deny regexes
classification.table_pattern.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
profiling
GEProfilingConfig
profiling.catch_exceptions
boolean
Default: True
profiling.enabled
boolean
Whether profiling should be done.
Default: False
profiling.field_sample_values_limit
integer
Upper limit for number of sample values to collect for all columns.
Default: 20
profiling.include_field_distinct_count
boolean
Whether to profile for the number of distinct values for each column.
Default: True
profiling.include_field_distinct_value_frequencies
boolean
Whether to profile for distinct value frequencies.
Default: False
profiling.include_field_histogram
boolean
Whether to profile for the histogram for numeric fields.
Default: False
profiling.include_field_max_value
boolean
Whether to profile for the max value of numeric columns.
Default: True
profiling.include_field_mean_value
boolean
Whether to profile for the mean value of numeric columns.
Default: True
profiling.include_field_median_value
boolean
Whether to profile for the median value of numeric columns.
Default: True
profiling.include_field_min_value
boolean
Whether to profile for the min value of numeric columns.
Default: True
profiling.include_field_null_count
boolean
Whether to profile for the number of nulls for each column.
Default: True
profiling.include_field_quantiles
boolean
Whether to profile for the quantiles of numeric columns.
Default: False
profiling.include_field_sample_values
boolean
Whether to profile for the sample values for all columns.
Default: True
profiling.include_field_stddev_value
boolean
Whether to profile for the standard deviation of numeric columns.
Default: True
profiling.limit
One of integer, null
Max number of documents to profile. By default, profiles all documents.
Default: None
profiling.max_number_of_fields_to_profile
One of integer, null
A positive integer that specifies the maximum number of columns to profile for any table. None implies all columns. The cost of profiling goes up significantly as the number of columns to profile goes up.
Default: None
profiling.max_workers
integer
Number of worker threads to use for profiling. Set to 1 to disable.
Default: 20
profiling.method
Enum
One of: "ge", "sqlalchemy"
Default: sqlalchemy
profiling.offset
One of integer, null
Offset in documents to profile. By default, uses no offset.
Default: None
profiling.partition_datetime
One of string(date-time), null
If specified, profile only the partition which matches this datetime. If not specified, profile the latest partition. Only Bigquery supports this.
Default: None
profiling.partition_profiling_enabled
boolean
Whether to profile partitioned tables. Only BigQuery and Aws Athena supports this. If enabled, latest partition data is used for profiling.
Default: True
profiling.profile_external_tables
boolean
Whether to profile external tables. Only Snowflake and Redshift supports this.
Default: False
profiling.profile_if_updated_since_days
One of number, null
Profile table only if it has been updated since these many number of days. If set to null, no constraint of last modified time for tables to profile. Supported only in snowflake and BigQuery.
Default: None
profiling.profile_nested_fields
boolean
Whether to profile complex types like structs, arrays and maps.
Default: False
profiling.profile_table_level_only
boolean
Whether to perform profiling at table-level only, or include column-level profiling as well.
Default: False
profiling.profile_table_row_count_estimate_only
boolean
Use an approximate query for row count. This will be much faster but slightly less accurate. Only supported for Postgres and MySQL.
Default: False
profiling.profile_table_row_limit
One of integer, null
Profile tables only if their row count is less than specified count. If set to null, no limit on the row count of tables to profile. Supported only in Snowflake, BigQuery. Supported for Oracle based on gathered stats.
Default: 5000000
profiling.profile_table_size_limit
One of integer, null
Profile tables only if their size is less than specified GBs. If set to null, no limit on the size of tables to profile. Supported only in Snowflake, BigQuery and Databricks. Supported for Oracle based on calculated size from gathered stats.
Default: 5
profiling.query_combiner_enabled
boolean
This feature is still experimental and can be disabled if it causes issues. Reduces the total number of queries issued and speeds up profiling by dynamically combining SQL queries where possible.
Default: True
profiling.report_dropped_profiles
boolean
Whether to report datasets or dataset columns which were not profiled. Set to True for debugging purposes.
Default: False
profiling.sample_size
integer
Number of rows to be sampled from table for column level profiling.Applicable only if use_sampling is set to True.
Default: 10000
profiling.turn_off_expensive_profiling_metrics
boolean
Whether to turn off expensive profiling or not. This turns off profiling for quantiles, distinct_value_frequencies, histogram & sample_values. This also limits maximum number of fields being profiled to 10.
Default: False
profiling.use_sampling
boolean
Whether to profile column level stats on sample of table. Only BigQuery and Snowflake support this. If enabled, profiling is done on rows sampled from table. Sampling is not done for smaller tables.
Default: True
profiling.operation_config
OperationConfig
profiling.operation_config.lower_freq_profile_enabled
boolean
Whether to do profiling at lower freq or not. This does not do any scheduling just adds additional checks to when not to run profiling.
Default: False
profiling.operation_config.profile_date_of_month
One of integer, null
Number between 1 to 31 for date of month (both inclusive). If not specified, defaults to Nothing and this field does not take affect.
Default: None
profiling.operation_config.profile_day_of_week
One of integer, null
Number between 0 to 6 for day of week (both inclusive). 0 is Monday and 6 is Sunday. If not specified, defaults to Nothing and this field does not take affect.
Default: None
profiling.tags_to_ignore_sampling
One of array, null
Fixed list of tags to ignore sampling. If not specified, tables will be sampled based on use_sampling.
Default: None
profiling.tags_to_ignore_sampling.string
string
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents large amount of soft deletes & the state from committing from accidental changes to the source configuration if the relative change percent in entities compared to the previous state is above the 'fail_safe_threshold'.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Capabilities

Use the Important Capabilities table above as the source of truth for supported features and whether additional configuration is required.

Snowflake Shares

If you are using Snowflake Shares to share data across different Snowflake accounts, and you have set up DataHub recipes for ingesting metadata from all these accounts, you may end up having multiple similar dataset entities corresponding to virtual versions of the same table in different Snowflake accounts. The DataHub Snowflake connector can automatically link such tables together through Siblings and Lineage relationships if the user provides information necessary to establish the relationship using the shares configuration in the recipe.

Note: The shares configuration is also strongly recommended when using the marketplace.enabled feature to ingest Snowflake internal marketplace listings as Data Products. Without it, ingestion still produces the Data Products, but _find_listing_for_purchase cannot map purchased databases back to their listings, so the Data Products end up with no associated datasets (assets) and a structured_reporter.warning is emitted. See the Internal Marketplace section below for details.

Example
  • Snowflake account account1 (ingested as platform_instance instance1) owns a database db1. A share X is created in account1 that includes database db1 along with schemas and tables inside it.
  • Now, X is shared with Snowflake account account2 (ingested as platform_instance instance2). A database db1_from_X is created from inbound share X in account2. In this case, all tables and views included in share X will also be present in instance2.db1_from_X.
  • This can be represented in shares configuration section as
    shares:
    X: # name of the share
    database: db1
    platform_instance: instance1
    consumers: # list of all databases created from share X
    - database: db1_from_X
    platform_instance: instance2
  • If share X is shared with more Snowflake accounts and a database is created from share X in those accounts, then additional entries need to be added to the consumers list for share X, one per Snowflake account. The same shares config can then be copied across recipes for all accounts.

Internal Marketplace

If you are using the Snowflake internal marketplace (private data sharing within your organization via Data Exchange) and want to ingest marketplace listings as DataHub Data Products, you can operate in two modes:

Mode 1: Consumer Mode (Default) - Track Purchased Listings

For organizations that purchase/install internal marketplace listings:

  1. Grant the required privileges (already covered in the Prerequisites section above):

    -- Basic marketplace access
    grant imported privileges on database snowflake to role datahub_role;

    -- For SHOW SHARES to discover INBOUND shares
    use role accountadmin;
    grant import share on account to role datahub_role;
  2. Enable marketplace ingestion in your recipe:

    marketplace:
    enabled: true
    marketplace_mode: "consumer" # Default
    # Optional: Configure time window for usage statistics
    start_time: "-7 days" # Default: -1 day
    end_time: "now"
    bucket_duration: "DAY" # Options: "DAY", "HOUR"

    The marketplace configuration inherits from BaseTimeWindowConfig, allowing you to control the time window for extracting marketplace usage statistics. This follows the same pattern as other DataHub connectors.

  3. Configure the shares mapping (strongly recommended for linking Data Products to their purchased databases):

    Snowflake does not expose a direct link between imported databases and the marketplace listings they came from. Without shares, ingestion still succeeds but each affected Data Product is emitted without assets and a structured_reporter.warning is logged.

    shares:
    <SHARE_NAME>: # From: SHOW SHARES
    database: "<SOURCE_DATABASE>" # Source database in the share
    platform_instance: null
    consumers:
    - database: "<IMPORTED_DATABASE>" # Your purchased/imported database
    platform_instance: null
    # Optional but recommended: Explicit listing mapping
    listing_global_name: "PROVIDER.REGION.LISTING_NAME" # From: SHOW AVAILABLE LISTINGS

    Explicit Listing Mapping (Recommended): Add listing_global_name to explicitly link your purchased database to its marketplace listing. This ensures accurate Data Product associations, especially when you have multiple listings with similar names. Without it, the connector will attempt to match listings by finding the source database name within the listing's global name or title (case-insensitive substring match).

    To discover the correct values, run these SQL commands in Snowflake:

    -- Find marketplace listings
    SHOW AVAILABLE LISTINGS IS_ORGANIZATION = TRUE;

    -- Find imported databases
    SELECT DATABASE_NAME, TYPE FROM SNOWFLAKE.ACCOUNT_USAGE.DATABASES
    WHERE TYPE = 'IMPORTED DATABASE' AND DELETED IS NULL;

    -- Find shares and their mappings
    SHOW SHARES;

    Without the shares configuration:

    • Data Products will be created from marketplace listings
    • Owners and custom properties will be populated
    • Data Products will NOT have any associated datasets (assets)
    • Warning messages will be logged

    With the shares configuration:

    • Data Products will be created with all metadata
    • Purchased databases will be linked as Data Product assets
    • Tables from imported databases will show as part of the Data Product
Mode 2: Provider Mode - Track Published Listings

For organizations that publish/share internal marketplace listings:

  1. Grant the required privileges (already covered in the Prerequisites section above):

    -- Basic marketplace access
    grant imported privileges on database snowflake to role datahub_role;

    -- For SHOW SHARES to discover OUTBOUND shares (provider mode)
    -- The role can ONLY see shares it owns or inherits
    use role securityadmin;
    grant role sysadmin to role datahub_role;

    -- OR use SYSADMIN/ACCOUNTADMIN directly in your recipe:
    -- role: SYSADMIN

    Important Note: In Snowflake, SHOW SHARES returns OUTBOUND shares based on ownership. A role can see:

    • Shares owned by the current role (e.g., if datahub_role created the share)
    • Shares owned by roles it inherits from (e.g., datahub_role inherits from SYSADMIN)
    • But NOT shares owned by other roles (e.g., shares owned by ACCOUNTADMIN that datahub_role doesn't inherit from)

    Solutions:

    • If your shares are owned by ACCOUNTADMIN or SYSADMIN: Grant SYSADMIN role to datahub_role (recommended)
    • If datahub_role creates the shares: No additional grant needed
    • Alternative: Use SYSADMIN or ACCOUNTADMIN directly in your ingestion recipe

    Without proper role hierarchy, provider mode cannot discover outbound shares owned by other roles, and tables won't be added as assets to Data Products.

  2. Enable marketplace ingestion in provider mode in your recipe:

    marketplace:
    enabled: true
    marketplace_mode: "provider"
    # Assign owners to your Data Products
    internal_marketplace_owner_patterns:
    "^Your Listing.*": ["data-team"]
    # Optional: restrict the database fallback to specific schemas per listing.
    # Required when DESC SHARE is not permitted (see troubleshooting below).
    listing_to_schemas_overrides:
    YOUR_LISTING_GLOBAL_NAME: [YOUR_SCHEMA]

    # Include your source databases being shared
    database_pattern:
    allow:
    - "YOUR_SOURCE_DATABASE"
  3. No shares configuration needed!

    Provider mode automatically:

    • Discovers your OUTBOUND shares with marketplace listings
    • Links them to your source databases
    • Creates Data Products with your source databases as assets

What you'll get in provider mode:

  • Data Products for your published marketplace listings
  • Source databases automatically linked as assets
  • Owner assignment from config patterns
  • Works without any imported databases

Example use case: You publish a "Customer 360" listing from your CUSTOMER_DATA database. Provider mode will:

  1. Find the listing via SHOW AVAILABLE LISTINGS
  2. Find the OUTBOUND share via SHOW SHARES
  3. Link the CUSTOMER_DATA database to the Data Product
  4. No manual configuration needed!
Mode 3: Both - Track Both Perspectives

Set marketplace_mode: "both" to track both purchased listings (consumer) AND published listings (provider) in the same ingestion.

For more details, see the marketplace configuration guide in the connector documentation.

Mapping Marketplace Organizations to Existing Domains

Marketplace ingestion does not create domains. Map each Snowflake organization to an existing DataHub domain (URN, GUID, or name) via marketplace.organization_to_domain:

marketplace:
enabled: true
organization_to_domain:
"ACME Corp": "urn:li:domain:finance"
"Weather Co": "data-products" # resolved via DomainRegistry

Unmapped organizations produce Data Products with no domain. The domains aspect is written as CREATE only, so UI-assigned domains survive subsequent runs.

Caveats:

  • Editing organization_to_domain for an already-ingested organization has no effect — soft-delete the Data Product's domains aspect to re-map.
  • Provider renames leave the old Marketplace:Provider:<old> tag on the purchased database container; remove it manually.
Troubleshooting Marketplace Ingestion

Common Permission Errors:

If marketplace ingestion fails or produces incomplete data, check for these common permission issues:

  1. "Failed to get marketplace listings - insufficient permissions"

    • Cause: Role lacks access to run SHOW AVAILABLE LISTINGS
    • Solution: Grant imported privileges:
      grant imported privileges on database snowflake to role datahub_role;
  2. "Failed to describe provider share - insufficient permissions"

    • Cause: DESC SHARE requires ownership of the share. Snowflake Marketplace creates shares under ACCOUNTADMIN, and Snowflake does not allow transferring share ownership.
    • Automatic fallback: The connector falls back to enumerating tables from the share's source database (visible in SHOW SHARES). By default this includes all schemas in that database; use listing_to_schemas_overrides to restrict to the specific schemas the share exposes:
      marketplace:
      listing_to_schemas_overrides:
      YOUR_LISTING_GLOBAL_NAME: [YOUR_SCHEMA]
    • For precise DESC SHARE output: Set role: ACCOUNTADMIN in your recipe (broader permissions, not recommended for production).
  3. "Failed to query imported database tables - insufficient permissions"

    • Cause: Role lacks access to query INFORMATION_SCHEMA in imported databases
    • Solution: Grant usage and references on the imported database:
      grant usage on database "<imported-database>" to role datahub_role;
      grant usage on all schemas in database "<imported-database>" to role datahub_role;
      grant references on all tables in database "<imported-database>" to role datahub_role;
  4. Data Products created but no assets appear

    • Cause: Missing shares configuration (consumer mode)
    • Solution: Add explicit listing_global_name mapping in your shares config (see consumer mode documentation above)
  5. Incomplete listing metadata

    • Cause: fetch_internal_marketplace_listing_details: true requires additional time per listing
    • Note: This is expected behavior. The connector runs DESCRIBE AVAILABLE LISTING for each listing to fetch enriched metadata (descriptions, owners, documentation links). For large catalogs (100+ listings), consider setting this to false to improve performance.

Debugging Tips:

  • Check the DataHub logs for structured warnings about marketplace ingestion failures
  • All marketplace errors are logged with clear titles and context (e.g., "Optional listing enrichment failed")
  • Verify your role has imported privileges on the snowflake database
  • Test your SQL grants manually using the DataHub role before running ingestion

Lineage and Usage

DataHub supports two strategies for extracting lineage and usage information from Snowflake:

New Strategy (Default - use_queries_v2: true)

The default and recommended approach uses an optimized query extraction method that:

  • Better Performance: Fetches query logs in a single optimized query instead of multiple separate queries
  • Enhanced Features:
    • Query entities generation (include_queries)
    • Query popularity statistics (include_query_usage_statistics)
    • User filtering with patterns (pushdown_deny_usernames, pushdown_allow_usernames)
    • Database pattern pushdown for performance (push_down_database_pattern_access_history)
    • Query deduplication strategies (query_dedup_strategy)
Legacy Strategy (use_queries_v2: false)

The older approach that will be deprecated in future versions:

  • Uses separate extractors for lineage and usage
  • Less performant due to multiple query executions
  • Limited feature support compared to the new strategy

Both strategies access the same Snowflake system tables (account_usage.query_history, account_usage.access_history), but the new strategy provides significant performance improvements and additional functionality.

Snowflake Streams as Upstream Lineage Sources

DataHub extracts lineage when a query reads from a Snowflake Stream. Coverage details:

  • Multi-target INSERT ALL from a Stream — emits one lineage entry per downstream table, including column-level lineage. Requires use_queries_v2: true (the default).
  • Single-target queries reading from a Stream — fall back to SQL parsing of the query text rather than direct extraction from the audit log.
  • Audit log placeholder names — Snowflake occasionally emits placeholder object names ($SYS_VIEW_<id> or other $-prefixed names) for stream-driven queries. When this happens for a given row, that row falls back to SQL parsing so DataHub never builds lineage from unusable URNs.

The Stream entity itself is also extracted as a top-level dataset; the lineage above is in addition to that.

Metadata Pattern Pushdown

When ingesting metadata from large Snowflake environments, you can improve performance by pushing down pattern filters directly to Snowflake SQL queries using the push_down_metadata_patterns configuration option.

Note: This option applies only to metadata extraction queries (information_schema.databases, schemata, tables, views). For pushing down filters on lineage/usage queries (account_usage.access_history), use push_down_database_pattern_access_history instead. These two options are independent and target completely separate query paths.

Configuration
source:
type: snowflake
config:
# Enable pattern pushdown for improved performance
push_down_metadata_patterns: true

# Your existing patterns - MUST follow Snowflake RLIKE syntax
database_pattern:
allow:
- "PROD_.*" # Matches databases starting with PROD_
- "ANALYTICS.*" # Matches databases starting with ANALYTICS
deny:
- ".*_TEMP$" # Excludes databases ending with _TEMP

table_pattern:
allow:
- ".*" # Allow all tables
deny:
- ".*_BACKUP$" # Exclude tables ending with _BACKUP
View Pattern Limitation

By default, Snowflake views are fetched using SHOW VIEWS, which does not support SQL-level filtering. When push_down_metadata_patterns: true, the view_pattern is pushed down only if fetch_views_from_information_schema: true is also set. Otherwise, view filtering falls back to Python's re.match(), even with pushdown enabled.

This means if you write patterns in Snowflake RLIKE syntax (e.g., PROD.* for prefix matching), they will still work correctly with Python filtering since PROD.* is valid in both. However, patterns that rely on RLIKE's full-string matching semantics (e.g., exact match PROD_DB without .*) will behave differently — Python re.match() treats it as a prefix match.

Recommendation: If you enable push_down_metadata_patterns, also enable fetch_views_from_information_schema: true to ensure consistent behavior for view patterns.

Important: Snowflake RLIKE Syntax Differences

When push_down_metadata_patterns: true, patterns are executed in Snowflake using the RLIKE operator instead of Python's re.match(). These have different matching behaviors:

BehaviorPython re.match()Snowflake RLIKEFix for Snowflake
Prefix match'PROD' matches 'PROD_DB''PROD' does NOT match 'PROD_DB'Use PROD.*
Suffix match'.*_BACKUP' matches 'TABLE_BACKUP_V2'Does NOT matchUse .*_BACKUP$
Start anchor'^PROD' matches 'PROD_DB'Does NOT matchUse PROD.*
Alternation'PROD\|DEV' matches 'PROD_DB'Does NOT matchUse (PROD\|DEV).*
Case sensitivityCase-insensitive by defaultCase-sensitive by defaultHandled automatically

Key difference: Python re.match() anchors at the START only (prefix matching), while Snowflake RLIKE requires a FULL STRING match.

Pattern Conversion Examples
IntentWithout Pushdown (Python)With Pushdown (Snowflake RLIKE)
Starts with PRODPRODPROD.*
Ends with _BACKUP.*_BACKUP.*_BACKUP$
Contains TEMP.*TEMP.*.*TEMP.* (same)
Exact matchPROD_DBPROD_DB (same)
Match PROD or DEV prefixPROD\|DEV(PROD\|DEV).*
Testing Your Patterns

Before enabling pushdown in production, test your patterns in Snowflake:

-- Test prefix matching
SELECT 'PROD_DB' RLIKE 'PROD'; -- FALSE (needs .*)
SELECT 'PROD_DB' RLIKE 'PROD.*'; -- TRUE

-- Test suffix matching
SELECT 'TABLE_BACKUP_V2' RLIKE '.*_BACKUP'; -- FALSE (needs $)
SELECT 'TABLE_BACKUP' RLIKE '.*_BACKUP$'; -- TRUE

-- Test FQN with escaped dots
SELECT 'PROD_DB.PUBLIC.TABLE' RLIKE 'PROD_DB\\.PUBLIC\\..*'; -- TRUE

Semantic Views

DataHub supports ingestion of Snowflake Semantic Views, which are business-defined views that define metrics, dimensions, and relationships for consistent data modeling and AI-powered analytics.

Configuration

Semantic view ingestion is disabled by default (requires Snowflake Enterprise Edition or above). You can enable it using the following configuration options:

# Enable semantic view ingestion (requires Enterprise Edition)
semantic_views:
enabled: true # Default: false
column_lineage: true # Default: false - enable column-level lineage

# Filter semantic views using regex patterns
semantic_view_pattern:
allow:
- "ANALYTICS_DB.PUBLIC.*"
- "SALES_DB.*"
deny:
- ".*_INTERNAL"
Features
  • Metadata Extraction: Extracts semantic view definitions (YAML), columns, comments, and timestamps
  • Lineage Support: Semantic views participate in lineage extraction like regular views
  • Tags Support: Tags applied to semantic views are extracted if extract_tags is enabled
  • External URLs: Direct links to Snowflake Snowsight UI for semantic views
Requirements
  • Semantic views require appropriate Snowflake edition and privileges
  • Requires REFERENCES or SELECT privileges on semantic views (they are treated as views in Snowflake's permission model)
  • The semantic view definition (SQL DDL) is extracted when available through the GET_DDL function

Stages, Tasks, and Pipes

DataHub supports ingestion of Snowflake Stages, Tasks, and Snowpipe objects. All three features are disabled by default and can be enabled independently.

Stages (include_stages: true)

Stages are ingested as containers nested under their parent schema. Internal stages additionally emit a placeholder dataset representing the staged data, which is used for pipe lineage resolution. External stages (S3, GCS, Azure) resolve their URLs to the corresponding cloud platform dataset URN.

include_stages: true
stage_pattern:
allow:
- "MY_DB.MY_SCHEMA.*"
Tasks (include_tasks: true)

Tasks are ingested as DataJob entities grouped under a per-schema DataFlow. Predecessor dependencies between tasks are captured as inputDatajobs on the DataJobInputOutput aspect, preserving the DAG structure.

include_tasks: true
task_pattern:
allow:
- "MY_DB.MY_SCHEMA.*"
Pipes (include_pipes: true)

Snowpipe objects are ingested as DataJob entities with lineage derived from parsing the COPY INTO statement. The pipe's source stage resolves to an upstream dataset (internal placeholder or external cloud URN) and the target table resolves to a downstream dataset. Enabling pipes automatically scans stages for lineage resolution, even if include_stages is false.

The parser handles the following COPY INTO patterns:

  • Column-list targets — COPY INTO t(a, b) FROM @stage resolves the target table correctly
  • Subquery sources — COPY INTO t FROM (SELECT ... FROM @s1 UNION ALL SELECT ... FROM @s2) captures all referenced stages as upstream datasets
  • Non-COPY pipe bodies are silently skipped; stage refs that cannot be normalized to a three-part FQN emit a warning in the ingestion report
include_pipes: true
pipe_pattern:
allow:
- "MY_DB.MY_SCHEMA.*"

Limitations

Module behavior is constrained by source APIs, permissions, and metadata exposed by the platform. Refer to capability notes for unsupported or conditional features.

  • Some features require specific Snowflake editions or additional privileges. This includes dynamic tables, semantic views, advanced lineage features, and tags.
  • Dynamic tables require the monitor privilege for full metadata extraction. Without this privilege, dynamic table entities will still appear in DataHub but their DDL is not accessible, so lineage will not be extracted.
  • Semantic views require REFERENCES or SELECT privileges for metadata extraction. Without these privileges, semantic views will not be visible to DataHub.
  • The underlying Snowflake views that we use to get metadata have a latency of 45 minutes to 3 hours. So we would not be able to get very recent metadata in some cases like queries you ran within that time period etc. This is applicable particularly for lineage, usage and tags (without lineage) extraction.
  • If there is any ongoing Snowflake incident, we will not be able to get the metadata until that incident is resolved.
  • Lineage extraction, when got directly from Snowflake access history, has some limitations, as documented here, here, and here.

Troubleshooting

If ingestion fails, validate credentials, permissions, connectivity, and scope filters first. Then review ingestion logs for source-specific errors and adjust configuration accordingly.

By default, DataHub generates Snowsight links of the form https://app.snowflake.com/<region>/<account>/.... If Snowsight is only reachable through private link in your environment, these links will not work for your users.

To fix this, set snowsight_base_url in the recipe to your private-link Snowsight URL. Retrieve the value from Snowflake as ACCOUNTADMIN:

SELECT SYSTEM$GET_PRIVATELINK_CONFIG();
-- Use either snowsight-privatelink-url or regionless-snowsight-privatelink-url.
source:
type: snowflake
config:
snowsight_base_url: "https://app.us-east-1.privatelink.snowflakecomputing.com/"

Code Coordinates

  • Class Name: datahub.ingestion.source.snowflake.snowflake_v2.SnowflakeV2Source
  • Browse on GitHub
Questions?

If you've got any questions on configuring ingestion for Snowflake, feel free to ping us on our Slack.

💡 Contributing to this documentation

This page is auto-generated from the underlying source code. To make changes, please edit the relevant source files in the metadata-ingestion directory.

Tip: For quick typo fixes or documentation updates, you can click the ✏️ Edit icon directly in the GitHub UI to open a Pull Request. For larger changes and PR naming conventions, please refer to our Contributing Guide.