
Kafka Connect

Integration Details

This plugin extracts the following:

  • Source and Sink Connectors in Kafka Connect as Data Pipelines
  • For Source connectors - Data Jobs to represent lineage information from the source dataset to the Kafka topic, one per {connector_name}:{source_dataset} combination
  • For Sink connectors - Data Jobs to represent lineage information from the Kafka topic to the destination dataset, one per {connector_name}:{topic} combination (illustrated in the sketch below)
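
The combinations above can be illustrated with a short Python sketch. It only mirrors the {connector_name}:{source_dataset} / {connector_name}:{topic} naming described here; the helper functions and example names are made up, not DataHub's internal code.

def source_job_id(connector_name: str, source_dataset: str) -> str:
    # Source connectors: one Data Job per {connector_name}:{source_dataset} combination.
    return f"{connector_name}:{source_dataset}"


def sink_job_id(connector_name: str, topic: str) -> str:
    # Sink connectors: one Data Job per {connector_name}:{topic} combination.
    return f"{connector_name}:{topic}"


print(source_job_id("mysql_source1", "librarydb.member"))  # mysql_source1:librarydb.member
print(sink_job_id("s3_sink1", "analytics.users"))          # s3_sink1:analytics.users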

Requirements

Java Runtime Dependency:

This source requires Java to be installed and available on the system for transform pipeline support (RegexRouter, etc.). The Java runtime is accessed via JPype to enable Java regex pattern matching that's compatible with Kafka Connect transforms.

  • Python installations: Install Java separately (e.g., apt-get install openjdk-11-jre-headless on Debian/Ubuntu)
  • Docker deployments: Ensure your DataHub ingestion Docker image includes a Java runtime. The official DataHub images include Java by default.
  • Impact: Without Java, transform pipeline features will be disabled and lineage accuracy may be reduced for connectors using transforms

Note for Docker users: If you're building custom Docker images for DataHub ingestion, ensure a Java Runtime Environment (JRE) is included in your image to support full transform pipeline functionality.
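
As an illustration of why the JRE matters, the following Python sketch shows how a Java regex can be reached through JPype. It mirrors the mechanism described above (Java regex with $1-style replacements, as used by RegexRouter) but is illustrative only, not DataHub's internal code.

# Minimal sketch: calling Java's regex engine from Python via JPype (requires a JRE).
# Illustrative only - this is not DataHub's internal code.
import jpype
import jpype.imports

if not jpype.isJVMStarted():
    jpype.startJVM()  # locate and start the default JVM

from java.util.regex import Pattern  # Java class, importable once the JVM is running

# RegexRouter-style rename: Java regex with a "$1" backreference in the replacement.
pattern = Pattern.compile("outbox\\.event\\.(.*)")
matcher = pattern.matcher("outbox.event.orders")
if matcher.matches():
    print(matcher.replaceFirst("events.$1"))  # -> events.orders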

Environment Support

DataHub's Kafka Connect source supports both self-hosted and Confluent Cloud environments with automatic detection and environment-specific topic retrieval strategies:

Self-hosted Kafka Connect

  • Topic Discovery: Uses runtime /connectors/{name}/topics API endpoint
  • Accuracy: Returns actual topics that connectors are currently reading from/writing to
  • Benefits: Most accurate topic information as it reflects actual runtime state
  • Requirements: Standard Kafka Connect REST API access

Confluent Cloud

  • Topic Discovery: Uses comprehensive Kafka REST API v3 for optimal transform pipeline support with config-based fallback
  • Method: Gets all topics from Kafka cluster via REST API, applies reverse transform pipeline for accurate mappings
  • Transform Support: Full support for complex transform pipelines via reverse pipeline strategy using actual cluster topics
  • Fallback: Falls back to config-based derivation if Kafka API is unavailable

Environment Detection: The environment is detected automatically based on whether the connect_uri contains confluent.cloud.
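
As a rough illustration, the detection amounts to a substring check on the configured URI. The helper below is hypothetical, not part of DataHub's API.

def is_confluent_cloud(connect_uri: str) -> bool:
    # Hypothetical helper mirroring the documented rule:
    # Confluent Cloud is assumed when the URI contains "confluent.cloud".
    return "confluent.cloud" in connect_uri.lower()


print(is_confluent_cloud("http://localhost:8083"))  # False -> self-hosted
print(is_confluent_cloud(
    "https://api.confluent.cloud/connect/v1/environments/env-123/clusters/lkc-abc456"
))  # True -> Confluent Cloud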

Concept Mapping

This ingestion source maps the following Source System Concepts to DataHub Concepts:

| Source Concept | DataHub Concept | Notes |
|---|---|---|
| "kafka-connect" | Data Platform | |
| Connector | DataFlow | |
| Kafka Topic | Dataset | |

Supported Connectors and Lineage Extraction

DataHub supports different connector types with varying levels of lineage extraction capabilities depending on the environment (self-hosted vs Confluent Cloud):

Source Connectors

| Connector Type | Self-hosted Support | Confluent Cloud Support | Topic Discovery Method | Lineage Extraction |
|---|---|---|---|---|
| Platform JDBC Source (io.confluent.connect.jdbc.JdbcSourceConnector) | ✅ Full | ✅ Full | Runtime API / Config-based | Table → Topic mapping |
| Cloud PostgreSQL CDC (PostgresCdcSource) | ✅ Full | ✅ Full | Runtime API / Config-based | Table → Topic mapping |
| Cloud PostgreSQL CDC V2 (PostgresCdcSourceV2) | ✅ Full | ✅ Full | Runtime API / Config-based | Table → Topic mapping |
| Cloud MySQL Source (MySqlSource) | ✅ Full | ✅ Full | Runtime API / Config-based | Table → Topic mapping |
| Cloud MySQL CDC (MySqlCdcSource) | ✅ Full | ✅ Full | Runtime API / Config-based | Table → Topic mapping |
| Debezium MySQL (io.debezium.connector.mysql.MySqlConnector) | ✅ Full | ✅ Partial | Runtime API / Config-based | Database → Topic CDC mapping |
| Debezium PostgreSQL (io.debezium.connector.postgresql.PostgresConnector) | ✅ Full | ✅ Partial | Runtime API / Config-based | Database → Topic CDC mapping |
| Debezium SQL Server (io.debezium.connector.sqlserver.SqlServerConnector) | ✅ Full | ✅ Partial | Runtime API / Config-based | Database → Topic CDC mapping |
| Debezium Oracle (io.debezium.connector.oracle.OracleConnector) | ✅ Full | ✅ Partial | Runtime API / Config-based | Database → Topic CDC mapping |
| Debezium DB2 (io.debezium.connector.db2.Db2Connector) | ✅ Full | ✅ Partial | Runtime API / Config-based | Database → Topic CDC mapping |
| Debezium MongoDB (io.debezium.connector.mongodb.MongoDbConnector) | ✅ Full | ✅ Partial | Runtime API / Config-based | Collection → Topic CDC mapping |
| Debezium Vitess (io.debezium.connector.vitess.VitessConnector) | ✅ Full | ✅ Partial | Runtime API / Config-based | Table → Topic CDC mapping |
| MongoDB Source (com.mongodb.kafka.connect.MongoSourceConnector) | ✅ Full | 🔧 Config Required | Runtime API / Manual config | Collection → Topic mapping |
| Generic Connectors | 🔧 Config Required | 🔧 Config Required | User-defined mapping | Custom lineage mapping |

Sink Connectors

| Connector Type | Self-hosted Support | Confluent Cloud Support | Topic Discovery Method | Lineage Extraction |
|---|---|---|---|---|
| BigQuery Sink (com.wepay.kafka.connect.bigquery.BigQuerySinkConnector) | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → Table mapping |
| S3 Sink (io.confluent.connect.s3.S3SinkConnector) | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → S3 object mapping |
| Snowflake Sink (com.snowflake.kafka.connector.SnowflakeSinkConnector) | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → Table mapping |
| Cloud PostgreSQL Sink (PostgresSink) | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → Table mapping |
| Cloud MySQL Sink (MySqlSink) | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → Table mapping |
| Cloud Snowflake Sink (SnowflakeSink) | ✅ Full | ✅ Full | Runtime API / Config-based | Topic → Table mapping |

Legend:

  • ✅ Full: Complete lineage extraction with accurate topic discovery
  • ✅ Partial: Lineage extraction supported but topic discovery may be limited (config-based only)
  • 🔧 Config Required: Requires generic_connectors configuration for lineage mapping

Supported Transforms

DataHub uses an advanced transform pipeline strategy that automatically handles complex transform chains by applying the complete pipeline to all topics and checking if results exist. This provides robust support for any combination of transforms.

Topic Routing Transforms

  • RegexRouter: org.apache.kafka.connect.transforms.RegexRouter
  • Cloud RegexRouter: io.confluent.connect.cloud.transforms.TopicRegexRouter
  • Debezium EventRouter: io.debezium.transforms.outbox.EventRouter (Outbox pattern)

Non-Topic Routing Transforms

DataHub recognizes but passes through these transforms (they don't affect lineage):

  • InsertField, ReplaceField, MaskField, ValueToKey, HoistField, ExtractField
  • SetSchemaMetadata, Flatten, Cast, HeadersFrom, TimestampConverter
  • Filter, InsertHeader, DropHeaders, Drop, TombstoneHandler

Transform Pipeline Strategy

DataHub uses an improved reverse transform pipeline approach (see the sketch below) that:

  1. Takes all actual topics from the connector manifest/Kafka cluster
  2. Applies the complete transform pipeline to each topic
  3. Checks if transformed results exist in the actual topic list
  4. Creates lineage mappings only for successful matches

Benefits:

  • Works with any transform combination (single or chained transforms)
  • Handles complex scenarios like EventRouter + RegexRouter chains
  • Uses actual topics as source of truth (no prediction needed)
  • Future-proof for new transform types
  • Works identically for both self-hosted and Confluent Cloud environments
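
A minimal Python sketch of this strategy is shown below. It uses Python's re module in place of the Java regex engine DataHub actually relies on, and a single RegexRouter-style step stands in for the pipeline; the helper name apply_pipeline is illustrative only.

import re

# Actual topics reported by the connector manifest / Kafka cluster (example data).
actual_topics = {"outbox.event.orders", "events.orders", "server1.public.users"}

# One RegexRouter-style step: (pattern, replacement). Real pipelines may chain several steps.
pipeline = [(re.compile(r"outbox\.event\.(.*)"), r"events.\1")]


def apply_pipeline(topic: str) -> str:
    # Illustrative only: run each routing transform in order.
    for pattern, replacement in pipeline:
        if pattern.fullmatch(topic):
            topic = pattern.sub(replacement, topic)
    return topic


# Steps 1-4 above: apply the pipeline to every actual topic and keep only
# mappings whose transformed result also exists in the actual topic list.
lineage = {}
for topic in sorted(actual_topics):
    routed = apply_pipeline(topic)
    if routed != topic and routed in actual_topics:
        lineage[topic] = routed

print(lineage)  # {'outbox.event.orders': 'events.orders'}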

Capabilities and Limitations

Transform Pipeline Support

✅ Fully Supported:

  • Any combination of transforms: RegexRouter, EventRouter, and non-routing transforms
  • Complex transform chains: Multiple chained transforms automatically handled
  • Both environments: Self-hosted and Confluent Cloud work identically
  • Future-proof: New transform types automatically supported

⚠️ Considerations:

  • For connectors not listed in the supported connector table above, use the generic_connectors configuration to provide explicit lineage mappings
  • Some advanced connector-specific features may not be fully supported

Environment-Specific Behavior

Self-hosted Kafka Connect

  • Topic Discovery: Uses runtime /connectors/{name}/topics API endpoint for maximum accuracy
  • Requirements: Standard Kafka Connect REST API access
  • Fallback: If runtime API fails, falls back to config-based derivation

Confluent Cloud

  • Topic Discovery: Uses comprehensive Kafka REST API v3 to get all topics, with automatic credential reuse
  • Transform Support: Full support for all transform combinations via reverse pipeline strategy using actual cluster topics
  • Auto-derivation: Automatically derives Kafka REST endpoint from connector configurations

Configuration Control

The use_connect_topics_api flag controls topic retrieval behavior:

  • When true (default): Uses environment-specific topic discovery with full transform support
  • When false: Disables all topic discovery for air-gapped environments or performance optimization

Advanced Scenarios

Complex Transform Chains: The new reverse transform pipeline strategy handles complex scenarios automatically:

# Example: EventRouter + RegexRouter chain
transforms: EventRouter,RegexRouter
transforms.EventRouter.type: io.debezium.transforms.outbox.EventRouter
transforms.RegexRouter.type: org.apache.kafka.connect.transforms.RegexRouter
transforms.RegexRouter.regex: "outbox\\.event\\.(.*)"
transforms.RegexRouter.replacement: "events.$1"
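
To make the chain above concrete, the sketch below applies the RegexRouter step with Python's re module (Python writes the backreference as \1 where the Java-based RegexRouter uses $1). The starting topic name assumes the EventRouter's usual outbox.event.<aggregate> routing.

import re

# Topic assumed to be produced by the EventRouter step for an "orders" aggregate.
topic_after_event_router = "outbox.event.orders"

# The RegexRouter step from the config above, with "$1" rewritten as Python's "\1".
final_topic = re.sub(r"outbox\.event\.(.*)", r"events.\1", topic_after_event_router)
print(final_topic)  # events.orders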

Fallback Options:

  • If transform pipeline cannot determine mappings, DataHub falls back to simple topic-based lineage
  • For unsupported connector types or complex custom scenarios, use generic_connectors configuration

Performance Optimization:

  • Set use_connect_topics_api: false to disable topic discovery in air-gapped environments
  • Transform pipeline processing adds minimal overhead and improves lineage accuracy

Important Capabilities

| Capability | Status | Notes |
|---|---|---|
| Detect Deleted Entities | ✅ | Enabled by default via stateful ingestion. |
| Platform Instance | ✅ | Enabled by default. |
| Schema Metadata | ✅ | Enabled by default. |
| Table-Level Lineage | ✅ | Enabled by default. |

CLI based Ingestion

Starter Recipe

Check out the following recipe to get started with ingestion! See below for full configuration options.

For general pointers on writing and running a recipe, see our main recipe guide.

source:
  type: "kafka-connect"
  config:
    # Coordinates
    connect_uri: "http://localhost:8083"

    # Credentials
    username: admin
    password: password

    # Optional
    # Platform instance mapping to use when constructing URNs.
    # Use if single instance of platform is referred across connectors.
    platform_instance_map:
      mysql: mysql_platform_instance

sink:
  # sink configs

Config Details

Note that a . is used to denote nested fields in the YAML recipe.

Each field is listed below with its type, description, and default value.
cluster_name
One of string, null
Cluster to ingest from.
Default: connect-cluster
confluent_cloud_cluster_id
One of string, null
Confluent Cloud Kafka Connect cluster ID (e.g., 'lkc-abc123'). When specified along with confluent_cloud_environment_id, the connect_uri will be automatically constructed. This is the recommended approach for Confluent Cloud instead of manually constructing the full URI.
Default: None
confluent_cloud_environment_id
One of string, null
Confluent Cloud environment ID (e.g., 'env-xyz123'). When specified along with confluent_cloud_cluster_id, the connect_uri will be automatically constructed. This is the recommended approach for Confluent Cloud instead of manually constructing the full URI.
Default: None
connect_to_platform_map
One of string, null
Platform instance mapping to use when multiple instances of a platform are available. An entry for a platform should be in either platform_instance_map or connect_to_platform_map, e.g. connect_to_platform_map: { "postgres-connector-finance-db": { "postgres": "core_finance_instance" } }
Default: None
connect_uri
string
URI to connect to.
Default: http://localhost:8083/
convert_lineage_urns_to_lowercase
boolean
Whether to convert the urns of ingested lineage dataset to lowercase
Default: False
kafka_api_key
One of string, null
Optional: Confluent Cloud Kafka API key for authenticating with Kafka REST API v3. If not specified, DataHub will reuse the Connect credentials (username/password) for Kafka API authentication. Only needed if you want to use separate credentials for the Kafka API.
Default: None
kafka_api_secret
One of string, null
Optional: Confluent Cloud Kafka API secret for authenticating with Kafka REST API v3. If not specified, DataHub will reuse the Connect credentials (username/password) for Kafka API authentication. Only needed if you want to use separate credentials for the Kafka API.
Default: None
kafka_rest_endpoint
One of string, null
Optional: Confluent Cloud Kafka REST API endpoint for comprehensive topic retrieval. Format: https://pkc-xxxxx.region.provider.confluent.cloud If not specified, DataHub automatically derives the endpoint from connector configurations (kafka.endpoint). When available, enables getting all topics from Kafka cluster for improved transform pipeline accuracy.
Default: None
password
One of string, null
Kafka Connect password.
Default: None
platform_instance
One of string, null
The instance of the platform that all assets produced by this recipe belong to. This should be unique within the platform. See https://docs.datahub.com/docs/platform-instances/ for more details.
Default: None
platform_instance_map
One of string, null
Platform instance mapping to use when constructing URNs, e.g. platform_instance_map: { "hive": "warehouse" }
Default: None
schema_resolver_expand_patterns
One of boolean, null
Enable table pattern expansion using DataHub schema metadata. When use_schema_resolver=True, this controls whether to expand patterns like 'database.*' to actual table names by querying DataHub. Only applies when use_schema_resolver is enabled. Defaults to True when use_schema_resolver is enabled.
Default: None
schema_resolver_finegrained_lineage
One of boolean, null
Enable fine-grained (column-level) lineage extraction using DataHub schema metadata. When use_schema_resolver=True, this controls whether to generate column-level lineage by matching schemas between source tables and Kafka topics. Only applies when use_schema_resolver is enabled. Defaults to True when use_schema_resolver is enabled.
Default: None
use_connect_topics_api
boolean
Whether to use Kafka Connect API for topic retrieval and validation. This flag controls the environment-specific topic retrieval strategy:
When True (default): Self-hosted environments use the runtime /connectors/{name}/topics API for accurate topic information; Confluent Cloud uses the comprehensive Kafka REST API v3 to get all topics for the transform pipeline, with config-based fallback.
When False: Disables all API-based topic retrieval for both environments. Returns empty topic lists. Useful for air-gapped environments or when topic validation isn't needed for performance optimization.
Default: True
use_schema_resolver
boolean
Use DataHub's schema metadata to enhance Kafka Connect connector lineage. When enabled (requires DataHub graph connection): 1) Expands table patterns (e.g., 'database.*') to actual tables using DataHub metadata 2) Generates fine-grained column-level lineage for Kafka Connect sources/sinks.

Auto-enabled for Confluent Cloud: This feature is automatically enabled for Confluent Cloud environments where DataHub graph connection is required. Set use_schema_resolver: false to disable.

Prerequisite: Source database tables must be ingested into DataHub before Kafka Connect ingestion for this feature to work. Without prior database ingestion, schema resolver will not find table metadata.
Default: False
username
One of string, null
Kafka Connect username.
Default: None
env
string
The environment that all assets produced by this connector belong to
Default: PROD
connector_patterns
AllowDenyPattern
A class to store allow deny regexes
connector_patterns.ignoreCase
One of boolean, null
Whether to ignore case sensitivity during pattern matching.
Default: True
connector_patterns.allow
array
List of regex patterns to include in ingestion
Default: ['.*']
connector_patterns.allow.string
string
connector_patterns.deny
array
List of regex patterns to exclude from ingestion.
Default: []
connector_patterns.deny.string
string
generic_connectors
array
Provide lineage graph for source connectors other than the Confluent JDBC Source Connector, Debezium Source Connector, and Mongo Source Connector.
Default: []
generic_connectors.GenericConnectorConfig
GenericConnectorConfig
generic_connectors.GenericConnectorConfig.connector_name 
string
generic_connectors.GenericConnectorConfig.source_dataset 
string
generic_connectors.GenericConnectorConfig.source_platform 
string
provided_configs
One of array, null
Provided Configurations
Default: None
provided_configs.ProvidedConfig
ProvidedConfig
provided_configs.ProvidedConfig.path_key 
string
provided_configs.ProvidedConfig.provider 
string
provided_configs.ProvidedConfig.value 
string
stateful_ingestion
One of StatefulStaleMetadataRemovalConfig, null
Default: None
stateful_ingestion.enabled
boolean
Whether or not to enable stateful ingest. Default: True if a pipeline_name is set and either a datahub-rest sink or datahub_api is specified, otherwise False
Default: False
stateful_ingestion.fail_safe_threshold
number
Prevents a large number of soft deletes and stops the state from committing if the relative change in entities compared to the previous state exceeds the fail_safe_threshold, guarding against accidental changes to the source configuration.
Default: 75.0
stateful_ingestion.remove_stale_metadata
boolean
Soft-deletes the entities present in the last successful run but missing in the current run with stateful_ingestion enabled.
Default: True

Advanced Configurations

Environment-Specific Topic Discovery

DataHub's Kafka Connect source automatically detects your environment (self-hosted vs Confluent Cloud) and uses the appropriate topic discovery strategy:

Self-hosted Kafka Connect

Uses the runtime /connectors/{name}/topics API endpoint for accurate, real-time topic information:

source:
  type: kafka-connect
  config:
    # Self-hosted Kafka Connect cluster
    connect_uri: "http://localhost:8083"
    # use_connect_topics_api: true # Default - enables runtime topic discovery

Confluent Cloud

Uses comprehensive transform pipeline support with Kafka REST API v3 topic validation and config-based fallback:

Recommended approach using environment and cluster IDs:

source:
  type: kafka-connect
  config:
    # Auto-construct URI from environment and cluster IDs (recommended)
    confluent_cloud_environment_id: "env-xyz123" # Your Confluent Cloud environment ID
    confluent_cloud_cluster_id: "lkc-abc456" # Your Kafka Connect cluster ID

    # Standard credentials for Kafka Connect API
    username: "your-connect-api-key" # API key for Kafka Connect access
    password: "your-connect-api-secret" # API secret for Kafka Connect access

    # Optional: Separate credentials for Kafka REST API (if different from Connect API)
    kafka_api_key: "your-kafka-api-key" # API key for Kafka REST API access
    kafka_api_secret: "your-kafka-api-secret" # API secret for Kafka REST API access

    # Optional: Dedicated Kafka REST endpoint for comprehensive topic retrieval
    kafka_rest_endpoint: "https://pkc-xxxxx.region.provider.confluent.cloud"

    # use_connect_topics_api: true # Default - enables comprehensive topic retrieval

Alternative approach using full URI (legacy):

source:
  type: kafka-connect
  config:
    # Confluent Cloud Connect URI - automatically detected
    connect_uri: "https://api.confluent.cloud/connect/v1/environments/env-123/clusters/lkc-abc456"
    username: "your-connect-api-key" # API key for Kafka Connect
    password: "your-connect-api-secret" # API secret for Kafka Connect
    kafka_api_key: "your-kafka-api-key" # API key for Kafka REST API (if different)
    kafka_api_secret: "your-kafka-api-secret" # API secret for Kafka REST API (if different)

    # Optional: Dedicated Kafka REST endpoint for comprehensive topic retrieval
    kafka_rest_endpoint: "https://pkc-xxxxx.region.provider.confluent.cloud"

How Lineage Inference Works with Transform Pipelines:

Kafka Connect connectors can apply transforms (like RegexRouter) that modify topic names before data reaches Kafka. DataHub's lineage inference analyzes these transform configurations to determine how topics are produced:

  1. Configuration Analysis - Extracts source tables from connector configuration (table.include.list, database.include.list)
  2. Transform Application - Applies configured transforms (RegexRouter, EventRouter, etc.) to predict final topic names
  3. Topic Validation - Validates predicted topics against actual cluster topics using Kafka REST API v3
  4. Lineage Construction - Maps source tables to validated topics, preserving schema information (see the sketch below)
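
A minimal sketch of these four steps for a Debezium-style source is shown below. The helper names and example values are illustrative, not DataHub's internal API; the server.schema.table topic convention follows the description later in this document.

import re

# 1. Configuration analysis: tables from table.include.list plus the logical server name.
server = "server1"
tables = ["public.users", "public.orders"]

# 2. Transform application: one RegexRouter-style step predicting the final topic name.
transforms = [(re.compile(r"server1\.public\.(.*)"), r"cdc.\1")]


def predict_topic(table: str) -> str:
    topic = f"{server}.{table}"  # server.schema.table convention
    for pattern, replacement in transforms:
        if pattern.fullmatch(topic):
            topic = pattern.sub(replacement, topic)
    return topic


# 3. Topic validation against actual cluster topics (e.g. from Kafka REST API v3).
cluster_topics = {"cdc.users", "cdc.orders", "unrelated.topic"}

# 4. Lineage construction: source table -> validated topic.
lineage = {t: predict_topic(t) for t in tables if predict_topic(t) in cluster_topics}
print(lineage)  # {'public.users': 'cdc.users', 'public.orders': 'cdc.orders'}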

This approach works for both self-hosted and Confluent Cloud environments:

  • Self-hosted: Uses runtime /connectors/{name}/topics API for actual topics produced by each connector
  • Confluent Cloud: Uses Kafka REST API v3 to get all cluster topics, then applies transform pipeline to match with connector config

Key Benefits:

  • 90-95% accuracy for Cloud connectors with transforms (significant improvement over previous config-only approach)
  • Full RegexRouter support with Java regex compatibility
  • Complex transform chains handled correctly
  • Schema preservation maintains full table names with schema information

Configuration Options:

  • Environment/Cluster IDs (recommended): Use confluent_cloud_environment_id and confluent_cloud_cluster_id for automatic URI construction
  • Auto-derivation: DataHub finds Kafka REST endpoint automatically from connector configs
  • Manual endpoint: Specify kafka_rest_endpoint if auto-derivation doesn't work
  • Separate credentials (typical): Use connect_api_key/connect_api_secret for Connect API and kafka_api_key/kafka_api_secret for Kafka REST API
  • Legacy credentials: Use username/password for Connect API (falls back for Kafka API if separate credentials not provided)

Air-gapped or Performance-Optimized Environments

Disable topic discovery entirely for environments where API access is not available or not needed:

source:
  type: kafka-connect
  config:
    connect_uri: "http://localhost:8083"
    use_connect_topics_api: false # Disables all topic discovery API calls

Note: When use_connect_topics_api is false, topic information will not be extracted, which may impact lineage accuracy but improves performance and works in air-gapped environments.

Enhanced Topic Resolution for Source and Sink Connectors

DataHub now provides intelligent topic resolution that works reliably across all environments, including Confluent Cloud where the Kafka Connect topics API is unavailable.

How It Works

Source Connectors (Debezium, Snowflake CDC, JDBC):

  • Always derive expected topics from connector configuration (table.include.list, database.include.list)
  • Apply configured transforms (RegexRouter, EventRouter, etc.) to predict final topic names
  • When Kafka API is available: Filter to only topics that exist in Kafka
  • When Kafka API is unavailable (Confluent Cloud): Create lineages for all configured tables without filtering

Sink Connectors (S3, Snowflake, BigQuery, JDBC):

  • Support both explicit topic lists (topics field) and regex patterns (topics.regex field)
  • When topics.regex is used, topics are resolved in the following priority order (see the sketch after this list):
    • Priority 1: Match against manifest.topic_names from Kafka API (if available)
    • Priority 2: Query DataHub for Kafka topics and match pattern (if use_schema_resolver enabled)
    • Priority 3: Warn user that pattern cannot be expanded
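
The priority order above can be sketched roughly as follows. The arguments manifest_topic_names and datahub_kafka_topics stand in for data DataHub would already hold, and the function name is illustrative rather than a real API.

import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)


def resolve_sink_topics(
    topics_regex: str,
    manifest_topic_names: Optional[list],
    datahub_kafka_topics: Optional[list],
) -> list:
    # Illustrative priority order for expanding a sink connector's topics.regex.
    pattern = re.compile(topics_regex)

    # Priority 1: topics reported by the Kafka API (connector manifest), if available.
    if manifest_topic_names:
        return [t for t in manifest_topic_names if pattern.fullmatch(t)]

    # Priority 2: Kafka topics already known to DataHub (requires use_schema_resolver: true).
    if datahub_kafka_topics:
        return [t for t in datahub_kafka_topics if pattern.fullmatch(t)]

    # Priority 3: nothing to match against -- warn and skip expansion.
    logger.warning("Cannot expand topics.regex %r: no topic source available", topics_regex)
    return []


print(resolve_sink_topics(r"analytics\..*", None, ["analytics.users", "orders.raw"]))
# ['analytics.users']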

Configuration Examples

Source Connector with Pattern Expansion:

# Debezium PostgreSQL source with wildcard tables
connector.config:
  table.include.list: "public.analytics_.*"

# When Kafka API unavailable, DataHub will:
# 1. Query DataHub for all PostgreSQL tables matching pattern
# 2. Derive expected topic names (server.schema.table format)
# 3. Apply transforms if configured
# 4. Create lineages without Kafka validation

Sink Connector with topics.regex (Confluent Cloud):

# S3 sink connector consuming from pattern-matched topics
connector.config:
  topics.regex: "analytics\\..*" # Match topics like analytics.users, analytics.orders

# When Kafka API unavailable, DataHub will:
# 1. Query DataHub for all Kafka topics (requires use_schema_resolver: true)
# 2. Match topics against the regex pattern
# 3. Create lineages for matched topics

Enable DataHub Topic Querying for Sink Connectors:

source:
  type: kafka-connect
  config:
    connect_uri: "https://api.confluent.cloud/connect/v1/environments/env-123/clusters/lkc-abc456"
    username: "your-connect-api-key"
    password: "your-connect-api-secret"

    # Enable DataHub schema resolver for topic pattern expansion
    use_schema_resolver: true # Required for topics.regex fallback

    # Configure graph connection for DataHub queries
    datahub_gms_url: "http://localhost:8080" # Your DataHub GMS endpoint

Key Benefits

  1. Confluent Cloud Support: Both source and sink connectors work correctly with pattern-based configurations
  2. Config as Source of Truth: Source connectors always derive topics from configuration, not from querying all tables in DataHub
  3. Smart Fallback: Sink connectors can query DataHub for Kafka topics when Kafka API is unavailable
  4. Pattern Expansion: Wildcards in table.include.list and topics.regex are properly expanded
  5. Transform Support: All transforms (RegexRouter, EventRouter, etc.) are applied correctly

When DataHub Topic Querying is Used

DataHub will query for topics in these scenarios:

Source Connectors:

  • When expanding wildcard patterns in table.include.list (e.g., ANALYTICS.PUBLIC.*)
  • Queries source platform (PostgreSQL, MySQL, etc.) for tables matching the pattern

Sink Connectors:

  • When topics.regex is used AND Kafka API is unavailable (Confluent Cloud)
  • Queries DataHub's Kafka platform for topics matching the regex pattern
  • Requires use_schema_resolver: true in configuration

Important Notes:

  • DataHub never queries "all tables" to create lineages - config is always the source of truth
  • Source connectors query source platforms (databases) to expand table patterns
  • Sink connectors query Kafka platform to expand topic regex patterns
  • Both require appropriate DataHub credentials and connectivity

Using DataHub Schema Resolver for Pattern Expansion and Column-Level Lineage

The Kafka Connect source can query DataHub for schema information to provide two capabilities:

  1. Pattern Expansion - Converts wildcard patterns like database.* into actual table names by querying DataHub
  2. Column-Level Lineage - Generates field-level lineage by matching schemas between source tables and Kafka topics

Both features require existing metadata in DataHub from your database and Kafka schema registry ingestion.

Auto-Enabled for Confluent Cloud

Starting with the latest version, use_schema_resolver is automatically enabled for Confluent Cloud environments to provide better defaults for enhanced lineage extraction. This gives you column-level lineage and pattern expansion out of the box!

Confluent Cloud (Auto-Enabled):

source:
  type: kafka-connect
  config:
    # Confluent Cloud environment
    confluent_cloud_environment_id: "env-xyz123"
    confluent_cloud_cluster_id: "lkc-abc456"
    username: "your-connect-api-key"
    password: "your-connect-api-secret"

    # Schema resolver automatically enabled! ✓
    # use_schema_resolver: true (auto-enabled)
    # schema_resolver_expand_patterns: true (auto-enabled)
    # schema_resolver_finegrained_lineage: true (auto-enabled)

To disable (if you don't need these features):

source:
  type: kafka-connect
  config:
    confluent_cloud_environment_id: "env-xyz123"
    confluent_cloud_cluster_id: "lkc-abc456"
    use_schema_resolver: false # Explicitly disable auto-enable

Self-hosted (Manual Enable Required):

source:
  type: kafka-connect
  config:
    connect_uri: "http://localhost:8083"

    # Must explicitly enable for self-hosted
    use_schema_resolver: true

    # DataHub connection
    datahub_api:
      server: "http://localhost:8080"

Important Prerequisites:

⚠️ Source database tables must be ingested into DataHub BEFORE running Kafka Connect ingestion

The schema resolver queries DataHub for existing table metadata. If your source databases haven't been ingested yet, the feature will have no effect. Run database ingestion first!

Recommended Ingestion Order:

  1. Ingest source databases (PostgreSQL, MySQL, Snowflake, etc.) → DataHub
  2. Ingest Kafka schema registry (optional, for topic schemas) → DataHub
  3. Run Kafka Connect ingestion → Enjoy enhanced lineage!

Configuration Overview

source:
  type: kafka-connect
  config:
    connect_uri: "http://localhost:8083"

    # Enable DataHub schema querying (auto-enabled for Confluent Cloud)
    use_schema_resolver: true

    # Control which features to use (both default to true when schema resolver enabled)
    schema_resolver_expand_patterns: true # Expand wildcard patterns
    schema_resolver_finegrained_lineage: true # Generate column-level lineage

    # DataHub connection (required when use_schema_resolver=true)
    datahub_api:
      server: "http://localhost:8080"
      token: "your-datahub-token" # Optional

Pattern Expansion

Converts wildcard patterns in connector configurations into actual table names by querying DataHub.
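
A minimal sketch of the idea, assuming the pattern is a regex (as in table.include.list) and that the candidate table names have already been fetched from DataHub; the helper name is illustrative.

import re

# Table names assumed to already exist in DataHub for the source platform.
datahub_tables = [
    "analytics.user_events",
    "analytics.user_profiles",
    "analytics.order_items",
]


def expand_pattern(pattern: str, candidates: list) -> list:
    # Illustrative: expand a pattern such as "analytics.user_.*" into the
    # concrete table names known to DataHub.
    compiled = re.compile(pattern)
    return [name for name in candidates if compiled.fullmatch(name)]


print(expand_pattern(r"analytics\.user_.*", datahub_tables))
# ['analytics.user_events', 'analytics.user_profiles']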

Example: MySQL Source with Wildcards

# Connector config contains pattern
connector.config:
  table.include.list: "analytics.user_*" # Pattern: matches user_events, user_profiles, etc.

# DataHub config
source:
  type: kafka-connect
  config:
    use_schema_resolver: true
    schema_resolver_expand_patterns: true

# Result: DataHub queries for MySQL tables matching "analytics.user_*"
# Finds: user_events, user_profiles, user_sessions
# Creates lineage:
#   mysql.analytics.user_events -> kafka.server.analytics.user_events
#   mysql.analytics.user_profiles -> kafka.server.analytics.user_profiles
#   mysql.analytics.user_sessions -> kafka.server.analytics.user_sessions

When to use:

  • Connector configs have wildcard patterns (database.*, schema.table_*)
  • You want accurate lineage without manually listing every table
  • Source metadata exists in DataHub from database ingestion

When to skip:

  • Connector configs use explicit table lists (no patterns)
  • Source metadata not yet in DataHub
  • Want faster ingestion without DataHub API calls

Configuration:

source:
  type: kafka-connect
  config:
    use_schema_resolver: true
    schema_resolver_expand_patterns: true # Enable pattern expansion

    # If you only want column-level lineage but NOT pattern expansion:
    # schema_resolver_expand_patterns: false

Behavior without schema resolver: Patterns are treated as literal table names, resulting in potentially incorrect lineage.

Column-Level Lineage

Generates field-level lineage by matching column names between source tables and Kafka topics.
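
A minimal sketch of the matching described here: column names from the source table and the Kafka topic are compared case-insensitively, and a field-level edge is emitted for each match. The dataset and field names below are illustrative.

# Schemas as they might already exist in DataHub (illustrative names).
source_columns = ["USER_ID", "EMAIL", "CREATED_AT"]
topic_columns = ["user_id", "email", "created_at", "extra_metadata_field"]


def column_lineage(source_cols: list, target_cols: list) -> list:
    # Match columns case-insensitively and return (source_field, target_field) pairs.
    targets_by_lower = {c.lower(): c for c in target_cols}
    return [
        (src, targets_by_lower[src.lower()])
        for src in source_cols
        if src.lower() in targets_by_lower
    ]


for src, dst in column_lineage(source_columns, topic_columns):
    print(f"postgres.public.users.{src} -> kafka.server.public.users.{dst}")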

Example: PostgreSQL to Kafka CDC

source:
  type: kafka-connect
  config:
    use_schema_resolver: true
    schema_resolver_finegrained_lineage: true

# Source table schema in DataHub:
# postgres.public.users: [user_id, email, created_at, updated_at]

# Kafka topic schema in DataHub:
# kafka.server.public.users: [user_id, email, created_at, updated_at]

# Result: Column-level lineage created:
# postgres.public.users.user_id -> kafka.server.public.users.user_id
# postgres.public.users.email -> kafka.server.public.users.email
# postgres.public.users.created_at -> kafka.server.public.users.created_at
# postgres.public.users.updated_at -> kafka.server.public.users.updated_at

Requirements:

  • Source table schema exists in DataHub (from database ingestion)
  • Kafka topic schema exists in DataHub (from schema registry or Kafka ingestion)
  • Column names match between source and target (case-insensitive matching)

Benefits:

  • Impact Analysis: See which fields are affected by schema changes
  • Data Tracing: Track specific data elements through pipelines
  • Schema Understanding: Visualize how data flows at the field level

ReplaceField Transform Support:

Column-level lineage respects ReplaceField transforms that filter or rename columns:

# Connector excludes specific fields
connector.config:
  transforms: "removeFields"
  transforms.removeFields.type: "org.apache.kafka.connect.transforms.ReplaceField$Value"
  transforms.removeFields.exclude: "internal_id,temp_column"

# DataHub behavior:
# Source schema: [user_id, email, internal_id, temp_column]
# After transform: [user_id, email]
# Column lineage created only for: user_id, email
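
As a rough sketch of how this affects column-level lineage (illustrative Python, not DataHub's code), the excluded fields are dropped from the source schema before columns are matched:

# Source table schema and the ReplaceField exclude list from the config above.
source_columns = ["user_id", "email", "internal_id", "temp_column"]
excluded = {"internal_id", "temp_column"}

# Columns remaining after the ReplaceField transform; only these receive column lineage.
columns_after_transform = [c for c in source_columns if c not in excluded]
print(columns_after_transform)  # ['user_id', 'email']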

Configuration:

source:
  type: kafka-connect
  config:
    use_schema_resolver: true
    schema_resolver_finegrained_lineage: true # Enable column-level lineage

    # If you only want pattern expansion but NOT column-level lineage:
    # schema_resolver_finegrained_lineage: false

Behavior without schema resolver: Only dataset-level lineage is created (e.g., postgres.users -> kafka.users), without field-level detail.

Complete Configuration Example

source:
  type: kafka-connect
  config:
    # Kafka Connect cluster
    connect_uri: "http://localhost:8083"
    cluster_name: "production-connect"

    # Enable schema resolver features
    use_schema_resolver: true
    schema_resolver_expand_patterns: true # Expand wildcard patterns
    schema_resolver_finegrained_lineage: true # Generate column-level lineage

    # DataHub connection
    datahub_api:
      server: "http://datahub.company.com"
      token: "${DATAHUB_TOKEN}"

    # Platform instances (if using multiple)
    platform_instance_map:
      postgres: "prod-postgres"
      kafka: "prod-kafka"

Performance Impact

API Calls per Connector:

  • Pattern expansion: 1 GraphQL query per unique wildcard pattern
  • Column-level lineage: 2 GraphQL queries (source schema + target schema)
  • Results cached for ingestion run duration

Optimization:

# Minimal configuration - no schema resolver
source:
  type: kafka-connect
  config:
    connect_uri: "http://localhost:8083"
    # use_schema_resolver: false # Default - no DataHub queries

# Pattern expansion only
source:
  type: kafka-connect
  config:
    use_schema_resolver: true
    schema_resolver_expand_patterns: true
    schema_resolver_finegrained_lineage: false # Skip column lineage for faster ingestion

# Column lineage only
source:
  type: kafka-connect
  config:
    use_schema_resolver: true
    schema_resolver_expand_patterns: false # Skip pattern expansion
    schema_resolver_finegrained_lineage: true

Best Practice: Run database and Kafka schema ingestion before Kafka Connect ingestion to pre-populate DataHub with schema metadata.

Troubleshooting

"Pattern expansion found no matches for: analytics.*"

Causes:

  • Source database metadata not in DataHub
  • Pattern syntax doesn't match DataHub dataset names
  • Platform instance mismatch

Solutions:

  1. Run database ingestion first to populate DataHub
  2. Verify pattern matches table naming in source system
  3. Check platform_instance_map matches database ingestion config
  4. Use explicit table list to bypass pattern expansion temporarily

"SchemaResolver not available: DataHub graph connection is not available"

Causes:

  • Missing datahub_api configuration
  • DataHub GMS not accessible

Solutions:

source:
  type: kafka-connect
  config:
    use_schema_resolver: true
    datahub_api:
      server: "http://localhost:8080" # Add DataHub GMS URL
      token: "your-token" # Add if authentication enabled

Column-level lineage not appearing

Check:

  1. Source table schema exists: Search for table in DataHub UI
  2. Kafka topic schema exists: Search for topic in DataHub UI
  3. Column names match (case differences are handled automatically)
  4. Check ingestion logs for warnings about missing schemas

Slow ingestion with schema resolver enabled

Profile:

  • Check logs for "Schema resolver cache hits: X, misses: Y"
  • High misses indicate missing metadata in DataHub

Temporarily disable to compare:

use_schema_resolver: false

Working with Platform Instances

If you have multiple instances of Kafka or of the source/sink systems referenced in your Kafka Connect setup, you need to configure platform instances for these systems in the kafka-connect recipe to generate correct lineage edges. You must already have set platform_instance in the recipes of the original source/sink systems. Refer to the Working with Platform Instances documentation to learn more.

There are two options for declaring a source/sink system's platform_instance in the kafka-connect recipe. If a single instance of a platform is used across all Kafka Connect connectors, you can use platform_instance_map to specify the platform_instance to use for that platform when constructing URNs for lineage.

Example:

# Map of platform name to platform instance
platform_instance_map:
  snowflake: snowflake_platform_instance
  mysql: mysql_platform_instance

If multiple instances of a platform are used across Kafka Connect connectors, you need to specify the platform_instance to use for that platform for every connector.

Example - Multiple MySQL source connectors, each reading from a different MySQL instance

# Map of platform name to platform instance per connector
connect_to_platform_map:
  mysql_connector1:
    mysql: mysql_instance1

  mysql_connector2:
    mysql: mysql_instance2

Here mysql_connector1 and mysql_connector2 are names of MySQL source connectors as defined in kafka-connect connector config.

Example - Multiple MySQL source connectors, each reading from a different MySQL instance and writing to a different Kafka cluster

connect_to_platform_map:
  mysql_connector1:
    mysql: mysql_instance1
    kafka: kafka_instance1

  mysql_connector2:
    mysql: mysql_instance2
    kafka: kafka_instance2

You can also use a combination of platform_instance_map and connect_to_platform_map in your recipe. Note that the platform_instance specified for a connector in connect_to_platform_map always takes precedence, even if a platform_instance for the same platform is set in platform_instance_map.

If you do not use platform_instance in the original source/sink recipes, you do not need to specify it in the above configurations.

Note that you do not need to specify platform_instance for BigQuery.

Example - Multiple BigQuery sink connectors, each writing to a different Kafka cluster

connect_to_platform_map:
  bigquery_connector1:
    kafka: kafka_instance1

  bigquery_connector2:
    kafka: kafka_instance2

Provided Configurations from External Sources

Kafka Connect supports pluggable configuration providers, which can load configuration data from external sources at runtime. These values are not available to the DataHub ingestion source through the Kafka Connect APIs. If you use such provided configurations to specify connection URLs (database, etc.) in your Kafka Connect connector configuration, you will also need to add them to the provided_configs section of the recipe for DataHub to generate correct lineage.

# Optional mapping of provider configurations if using
provided_configs:
  - provider: env
    path_key: MYSQL_CONNECTION_URL
    value: jdbc:mysql://test_mysql:3306/librarydb

Troubleshooting

Topic Discovery Issues

Problem: Missing or incomplete topic information in lineage

Solutions:

  1. Verify Environment Detection:

    # Check logs for environment detection messages
    # Self-hosted: "Detected self-hosted Kafka Connect - using runtime topics API"
    # Confluent Cloud: "Detected Confluent Cloud - using comprehensive Kafka REST API topic retrieval"
  2. Test API Connectivity:

    # For self-hosted - test topics API
    curl -X GET "http://localhost:8083/connectors/{connector-name}/topics"

    # For Confluent Cloud - test Kafka REST API v3
    curl -X GET "https://pkc-xxxxx.region.provider.confluent.cloud/kafka/v3/clusters/{cluster-id}/topics"
  3. Configuration Troubleshooting:

    # Enable debug logging
    source:
      type: kafka-connect
      config:
        # ... other config ...
        use_connect_topics_api: true # Ensure this is enabled (default)

Environment-Specific Issues

Self-hosted Issues:

  • 403/401 errors: Check authentication credentials (username, password)
  • 404 errors: Verify Kafka Connect cluster is running and REST API is accessible
  • Empty topic lists: Check if connectors are actually running and processing data

Confluent Cloud Issues:

  • Missing topics: Verify connector configuration has proper source table fields (table.include.list, query)
  • Transform accuracy: Check that RegexRouter patterns in connector config are valid Java regex
  • Complex transforms: Now fully supported via forward transform pipeline with topic validation
  • Schema preservation: Full schema information (e.g., public.users) is maintained through transform pipeline

Performance Optimization

If topic discovery is impacting performance:

source:
  type: kafka-connect
  config:
    connect_uri: "http://localhost:8083"
    use_connect_topics_api: false # Disable for better performance (no topic info)

Code Coordinates

  • Class Name: datahub.ingestion.source.kafka_connect.kafka_connect.KafkaConnectSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for Kafka Connect, feel free to ping us on our Slack.