DataHub APIs and SDKs Overview
DataHub offers several APIs for manipulating metadata on the platform. The table below lists each API with its pros and cons to help you choose the right one for your use case.
| API | Definition | Pros | Cons |
|---|---|---|---|
| Python SDK | SDK | Highly flexible; good for bulk execution | Requires an understanding of the metadata change event |
| Java SDK | SDK | Highly flexible; good for bulk execution | Requires an understanding of the metadata change event |
| GraphQL API | GraphQL interface | Intuitive; mirrors UI capabilities | Less flexible than SDKs; requires knowledge of GraphQL syntax |
| OpenAPI | Lower-level API for advanced users | Most powerful and flexible | Can be hard to use for straightforward use cases; no corresponding SDKs, but OpenAPI spec is generated within the product |
In general, the Python and Java SDKs are the tools we recommend most for extending and customizing the behavior of your DataHub instance, especially for programmatic use cases.
About async MCP ingest: When you submit MCPs (MetadataChangeProposals) through GMS APIs with async=true (Rest.li ingestProposal, OpenAPI entity writes, and SDK clients that target those endpoints), GMS runs the full proposal validation pipeline before accepting the request. That includes schema checks, entity-level authorization (isAPIAuthorized), and registered aspect payload validators (for example tag privilege constraints and aspect-specific authorization such as logicalParent). Unauthorized or invalid proposals are rejected synchronously with 403/422 and are not published to Kafka.
Async only means GMS does not commit the write to primary storage at accept time. Instead it publishes the MCP to the MetadataChangeProposal topic; the MCE consumer applies it later with async=false. Failures that occur during that later processing (for example pre-commit validation or storage errors) are captured on the Failed MCP topic and are not surfaced to the original API caller.
Do not confuse async GMS ingest with direct Kafka produce. Writing MCPs directly to the MetadataChangeProposal topic bypasses GMS accept-time authorization and validation. The MCE consumer processes those messages under system context and is not a second user-authorization gate. Restrict Kafka access accordingly.
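The accept-time behavior described above can be summarized from the caller's point of view. The following is a minimal sketch, not part of any DataHub SDK: `AcceptOutcome` and `interpret_async_accept` are hypothetical names, and treating any 2xx status as "accepted" is an assumption. It only encodes what a client can conclude from the HTTP status of an async=true write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptOutcome:
    """What a caller can conclude from the response to an async=true write (hypothetical type)."""

    accepted: bool   # GMS accepted the proposal and published it to Kafka
    committed: bool  # the write is known to be in primary storage
    note: str


def interpret_async_accept(status_code: int) -> AcceptOutcome:
    """Map the HTTP status of an async=true write to a caller-side conclusion.

    Assumption: any 2xx status means the proposal passed accept-time
    validation and was queued. It does NOT mean the write was committed.
    """
    if status_code == 403:
        return AcceptOutcome(False, False, "rejected at accept time: not authorized; not published to Kafka")
    if status_code == 422:
        return AcceptOutcome(False, False, "rejected at accept time: invalid proposal; not published to Kafka")
    if 200 <= status_code < 300:
        # Later processing failures are captured on the Failed MCP topic,
        # not returned to the original caller.
        return AcceptOutcome(True, False, "queued; commit happens later in the MCE consumer")
    return AcceptOutcome(False, False, f"unexpected status {status_code}; outcome unknown")


if __name__ == "__main__":
    for code in (200, 403, 422):
        print(code, interpret_async_accept(code))
```

Note that `committed` is never `True` here: an accepted async write tells the caller only that the proposal was queued, so confirming the final state requires a later read or monitoring of the Failed MCP topic.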
Python and Java SDK
We offer SDKs for both Python and Java that provide full functionality for CRUD operations and for any more complex behavior you may want to build into DataHub. We recommend the SDKs for most use cases. Here are some examples of what you can do with them:
- Defining lineage between data entities
- Executing bulk operations, e.g. adding tags to multiple datasets
- Creating custom metadata entities
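To illustrate the bulk-operation case, here is a minimal sketch that builds one tag proposal per dataset using only the Python standard library. It deliberately does not call the DataHub SDK: the helper names (`dataset_urn`, `tag_proposal`, `bulk_tag_proposals`) are hypothetical, and the dictionaries are a simplified picture of a metadata change proposal rather than the exact wire format. In a real script you would typically build and emit these with the SDK's own classes instead.

```python
from typing import Dict, List


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string (hypothetical helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def tag_proposal(entity_urn: str, tag_names: List[str]) -> Dict:
    """Simplified, illustrative proposal that sets the tags aspect on one dataset."""
    return {
        "entityType": "dataset",
        "entityUrn": entity_urn,
        "changeType": "UPSERT",
        "aspectName": "globalTags",
        "aspect": {"tags": [{"tag": f"urn:li:tag:{name}"} for name in tag_names]},
    }


def bulk_tag_proposals(entity_urns: List[str], tag_names: List[str]) -> List[Dict]:
    """One proposal per dataset, all carrying the same tags."""
    return [tag_proposal(urn, tag_names) for urn in entity_urns]


if __name__ == "__main__":
    urns = [
        dataset_urn("hive", "sales.orders"),
        dataset_urn("hive", "sales.customers"),
    ]
    for proposal in bulk_tag_proposals(urns, ["pii"]):
        print(proposal["entityUrn"], proposal["aspect"])
```

One design point to keep in mind: an upsert of a whole aspect replaces its previous value, so a script like this would overwrite existing tags unless it first reads and merges them.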
Learn more about the SDKs:
GraphQL API
The GraphQL API is the primary API used by the DataHub frontend. Because requests are generally assumed to come from the frontend, it often carries UI-oriented defaults such as caching and synchronous operations. Operations are intentionally limited in scope, so take care when using it programmatically to fetch and update metadata. It is intended as a higher-level API that simplifies the most common operations.
The GraphQL API can be useful if you're getting started with DataHub since it's more user-friendly and straightforward, especially when using GraphiQL. Here are some examples of how to use the GraphQL API:
- Search for datasets with conditions
- Query for relationships between entities
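As an illustration of the search example, the sketch below builds the JSON body for a GraphQL dataset search without sending it. The query shape (a `search` field taking a `SearchInput` with `type`, `query`, `start`, and `count`) is an assumption that should be checked against your instance's schema in GraphiQL; `build_search_request` is a hypothetical helper, not an SDK function.

```python
import json

# Assumed query shape; verify field and type names in GraphiQL for your version.
SEARCH_QUERY = """
query searchDatasets($input: SearchInput!) {
  search(input: $input) {
    start
    count
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_request(text: str, start: int = 0, count: int = 10) -> str:
    """Return the JSON request body for a dataset search (hypothetical helper)."""
    variables = {
        "input": {"type": "DATASET", "query": text, "start": start, "count": count}
    }
    return json.dumps({"query": SEARCH_QUERY, "variables": variables})


if __name__ == "__main__":
    print(build_search_request("orders"))
```

The resulting string would be POSTed to the instance's GraphQL endpoint with an access token; both the endpoint path and the authentication method depend on your deployment and are left out here.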
Learn more about the GraphQL API:
DataHub API Comparison
DataHub supports several APIs, each with its own unique usage and format. Here's an overview of what each API can do.
Last updated: Feb 16, 2024
| Feature | GraphQL | Python SDK | OpenAPI |
|---|---|---|---|
| Create a Dataset | 🚫 | ✅ [Guide] | ✅ |
| Delete a Dataset (Soft Delete) | ✅ [Guide] | ✅ [Guide] | ✅ |
| Delete a Dataset (Hard Delete) | 🚫 | ✅ [Guide] | ✅ |
| Search a Dataset | ✅ [Guide] | ✅ | ✅ |
| Read a Dataset Deprecation | ✅ | ✅ | ✅ |
| Read Dataset Entities (V2) | ✅ | ✅ | ✅ |
| Create a Tag | ✅ [Guide] | ✅ [Guide] | ✅ |
| Read a Tag | ✅ [Guide] | ✅ [Guide] | ✅ |
| Add Tags to a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Add Tags to a Column of a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Remove Tags from a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Create Glossary Terms | ✅ [Guide] | ✅ [Guide] | ✅ |
| Read Terms from a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Add Terms to a Column of a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Add Terms to a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Create Domains | ✅ [Guide] | ✅ [Guide] | ✅ |
| Read Domains | ✅ [Guide] | ✅ [Guide] | ✅ |
| Add Domains to a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Remove Domains from a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Create / Upsert Users | ✅ [Guide] | ✅ [Guide] | ✅ |
| Create / Upsert Group | ✅ [Guide] | ✅ [Guide] | ✅ |
| Read Owners of a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Add Owner to a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Remove Owner from a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Add Lineage | ✅ [Guide] | ✅ [Guide] | ✅ |
| Add Column Level (Fine Grained) Lineage | 🚫 | ✅ [Guide] | ✅ |
| Add Documentation (Description) to a Column of a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Add Documentation (Description) to a Dataset | ✅ [Guide] | ✅ [Guide] | ✅ |
| Add / Remove / Replace Custom Properties on a Dataset | 🚫 | ✅ [Guide] | ✅ |
| Add ML Feature to ML Feature Table | 🚫 | ✅ [Guide] | ✅ |
| Add ML Feature to MLModel | 🚫 | ✅ [Guide] | ✅ |
| Add ML Group to MLFeatureTable | 🚫 | ✅ [Guide] | ✅ |
| Create MLFeature | 🚫 | ✅ [Guide] | ✅ |
| Create MLFeatureTable | 🚫 | ✅ [Guide] | ✅ |
| Create MLModel | 🚫 | ✅ [Guide] | ✅ |
| Create MLModelGroup | 🚫 | ✅ [Guide] | ✅ |
| Create MLPrimaryKey | 🚫 | ✅ [Guide] | ✅ |
| Read MLFeature | ✅ [Guide] | ✅ [Guide] | ✅ |
| Read MLFeatureTable | ✅ [Guide] | ✅ [Guide] | ✅ |
| Read MLModel | ✅ [Guide] | ✅ [Guide] | ✅ |
| Read MLModelGroup | ✅ [Guide] | ✅ [Guide] | ✅ |
| Read MLPrimaryKey | ✅ [Guide] | ✅ [Guide] | ✅ |
| Create Data Product | 🚫 | ✅ [Code] | ✅ |
| Create Lineage Between Chart and Dashboard | 🚫 | ✅ [Code] | ✅ |
| Create Lineage Between Dataset and Chart | 🚫 | ✅ [Code] | ✅ |
| Create Lineage Between Dataset and DataJob | 🚫 | ✅ [Code] | ✅ |
| Create Fine-Grained Lineage as DataJob for Dataset | 🚫 | ✅ [Code] | ✅ |
| Create Fine-Grained Lineage for Dataset | 🚫 | ✅ [Code] | ✅ |
| Create DataJob with Dataflow | 🚫 | ✅ [Code] | ✅ |
| Create Programmatic Pipeline | 🚫 | ✅ [Code] | ✅ |