Document

Why Would You Use Documents?

Documents in DataHub are content-indexed resources that can store knowledge, documentation, FAQs, tutorials, and other textual content. They provide a centralized place to manage and search through organizational knowledge, making them accessible to both humans and AI systems.

Documents support rich metadata including:

  • Searchable content with full-text search capabilities
  • Categorization via types, domains, and owners
  • Visibility control to show/hide documents in global search and navigation
  • Relationships to data assets (datasets, dashboards, charts, etc.)
  • Hierarchical organization through parent-child relationships

Types of Documents

DataHub supports two types of documents:

  1. Native Documents: Created and stored directly in DataHub. Full content is indexed and searchable. Use Document.create_document() to create these.

  2. External Documents: References to documents stored in external systems (Notion, Confluence, Google Docs, etc.). These link to the original content via URL. Use Document.create_external_document() to create these.

Document Visibility

Documents can be configured to:

  • Show in global context (default): Appear in global search results and the knowledge base sidebar
  • Hide from global context: Only accessible through related assets. This is useful for:
    • Documentation specific to a single dataset
    • Context documents for AI agents
    • Private notes attached to assets
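
The visibility rules above can be modeled in a few lines. This is a minimal sketch of the behavior as this guide describes it, not DataHub's implementation; `DocVisibility`, `appears_in_global_search`, and `reachable_from_asset` are hypothetical names introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DocVisibility:
    """Hypothetical stand-in for the two fields that drive visibility."""

    show_in_global_context: bool = True  # shown by default, per this guide
    related_assets: list[str] = field(default_factory=list)


def appears_in_global_search(doc: DocVisibility) -> bool:
    # Global search and the sidebar only list documents shown in global context.
    return doc.show_in_global_context


def reachable_from_asset(doc: DocVisibility, asset_urn: str) -> bool:
    # A document stays reachable from any asset it is related to,
    # whether or not it is shown in global context.
    return asset_urn in doc.related_assets
```

A hidden document with one related dataset would fail the first check and pass the second for that dataset's URN.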

Goal Of This Guide

This guide will show you how to:

  • Create native and external documents
  • Control document visibility
  • Link documents to data assets
  • Update document contents and metadata
  • Publish and unpublish documents
  • Delete documents

Prerequisites

For this tutorial, you need to deploy DataHub Quickstart and ingest sample data. For detailed steps, refer to the DataHub Quickstart Guide.

Create Document

Native Document

Native documents are stored directly in DataHub with full content indexing.

from datahub.sdk import DataHubClient, Document

client = DataHubClient.from_env()

# Create a native document
doc = Document.create_document(
    id="getting-started-tutorial",
    title="Getting Started with DataHub",
    text="# Getting Started with DataHub\n\nThis tutorial will help you get started...",
    subtype="Tutorial",
)

client.entities.upsert(doc)
print(f"Created document: {doc.urn}")
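
Document ids in this guide are short, hyphenated strings. As a rough sketch, an id could be derived from a title and turned into the `urn:li:document:<id>` form that appears elsewhere in this guide. Both `slugify_title` and `document_urn` are hypothetical helpers, not part of the DataHub SDK.

```python
import re


def slugify_title(title: str) -> str:
    """Hypothetical helper: turn a title into a lowercase, hyphenated id."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")


def document_urn(doc_id: str) -> str:
    # URN shape assumed from the "urn:li:document:..." strings in this guide.
    return f"urn:li:document:{doc_id}"
```

In real code, the `DocumentUrn` class used later in this guide is likely the better way to build URNs; the string helper only illustrates the shape.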

External Document

External documents reference content stored in other platforms like Notion or Confluence.

from datahub.sdk import DataHubClient, Document

client = DataHubClient.from_env()

# Create an external document (from Notion)
doc = Document.create_external_document(
    id="notion-engineering-handbook",
    title="Engineering Handbook",
    platform="urn:li:dataPlatform:notion",
    external_url="https://notion.so/team/engineering-handbook",
    external_id="notion-page-abc123",
    text="Summary of the handbook for search...",  # Optional
    owners=["urn:li:corpuser:engineering-lead"],
)

client.entities.upsert(doc)
print(f"Created external document: {doc.urn}")
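
Because an external document is only as useful as its link, it may be worth checking the URL before creating the reference. The following is a minimal sketch using only the standard library; `is_valid_external_url` is a hypothetical helper, and whether DataHub performs its own validation is not covered here.

```python
from urllib.parse import urlparse


def is_valid_external_url(url: str) -> bool:
    """Return True if the URL has an http(s) scheme and a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A caller could skip or log any entry for which this returns `False` before passing `external_url` to `Document.create_external_document()`.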

Document Hidden from Global Context

Documents can be hidden from global search and sidebar navigation. They remain accessible through related assets, which is useful for AI agent context or asset-specific documentation.

from datahub.sdk import DataHubClient, Document

client = DataHubClient.from_env()

# Create a document hidden from global context
# Only accessible via the related asset - useful for AI agents
doc = Document.create_document(
    id="orders-dataset-context",
    title="Orders Dataset Context",
    text="# Context for AI Agents\n\nThe orders dataset contains daily summaries...",
    show_in_global_context=False,  # Hidden from global search/sidebar
    related_assets=["urn:li:dataset:(urn:li:dataPlatform:snowflake,orders,PROD)"],
)

client.entities.upsert(doc)
print(f"Created AI-only context document: {doc.urn}")
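
The `related_assets` values are plain URN strings, so typos are easy to make. A small sketch of a builder for the dataset URN shape used in this guide follows; `dataset_urn` is a hypothetical helper, and the default `env="PROD"` is an assumption taken from the examples.

```python
def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string in the shape used by this guide's examples."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"
```

The result can be passed in the `related_assets` list when creating the document.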

Document with Full Metadata

from datahub.sdk import DataHubClient, Document

client = DataHubClient.from_env()

doc = Document.create_document(
    id="faq-data-quality",
    title="Data Quality FAQ",
    text="# Data Quality FAQ\n\n## Q: How do we measure data quality?\n\nA: We use...",
    subtype="FAQ",
    related_assets=["urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)"],
    owners=["urn:li:corpuser:john"],
    domain="urn:li:domain:engineering",
    tags=["urn:li:tag:important"],
    custom_properties={"team": "data-platform", "version": "1.0"},
)

client.entities.upsert(doc)
print(f"Created document with metadata: {doc.urn}")
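
The example above passes owners and tags as full URNs. If inputs arrive as bare names (for instance from a config file), they could be normalized first. This is a sketch under that assumption; `ensure_urn` and `normalize_metadata` are hypothetical helpers, and the prefixes are copied from the URNs shown in this guide.

```python
from __future__ import annotations


def ensure_urn(value: str, prefix: str) -> str:
    """Prepend the URN prefix unless the value already looks like a URN."""
    return value if value.startswith("urn:li:") else f"{prefix}{value}"


def normalize_metadata(owners: list[str], tags: list[str]) -> dict[str, list[str]]:
    # Prefixes taken from the examples in this guide.
    return {
        "owners": [ensure_urn(o, "urn:li:corpuser:") for o in owners],
        "tags": [ensure_urn(t, "urn:li:tag:") for t in tags],
    }
```

The returned lists can then be supplied as the `owners` and `tags` arguments.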

Update Document

Update the contents, title, or visibility of an existing document.

# Inlined from /metadata-ingestion/examples/library/update_document.py
"""Example: Updating documents using the DataHub SDK.

This example demonstrates how to retrieve, modify, and update documents.
"""

from datahub.metadata.urns import DocumentUrn
from datahub.sdk import DataHubClient

# Initialize the client
client = DataHubClient.from_env()

# ============================================================================
# Example 1: Retrieve and update a document's content
# ============================================================================
# First, get the existing document from DataHub
doc = client.entities.get(DocumentUrn("my-tutorial-doc"))

if doc:
    # Update the text content
    doc.set_text("# Updated Getting Started Guide\n\nThis is the updated content...")

    # Save changes
    client.entities.upsert(doc)
    print("Document contents updated!")

# ============================================================================
# Example 2: Update document title
# ============================================================================
doc = client.entities.get(DocumentUrn("my-tutorial-doc"))

if doc:
    doc.set_title("Updated Tutorial Title")
    client.entities.upsert(doc)
    print("Document title updated!")

# ============================================================================
# Example 3: Update both contents and title with method chaining
# ============================================================================
doc = client.entities.get(DocumentUrn("my-tutorial-doc"))

if doc:
    # Method chaining for multiple updates
    doc.set_text("# Comprehensive Guide\n\nFully updated content...").set_title(
        "Comprehensive DataHub Guide"
    )
    client.entities.upsert(doc)
    print("Document fully updated!")

# ============================================================================
# Example 4: Update document visibility (global context)
# ============================================================================
doc = client.entities.get(DocumentUrn("my-tutorial-doc"))

if doc:
    # Hide document from global search and sidebar
    # Useful for making documents only accessible via related assets (e.g., for AI agents)
    doc.hide_from_global_context()
    client.entities.upsert(doc)
    print("Document hidden from global context!")

    # Later, show it again in global search/sidebar
    doc.show_in_global_search()
    client.entities.upsert(doc)
    print("Document visible in global context again!")

# ============================================================================
# Example 5: Update related assets and documents
# ============================================================================
doc = client.entities.get(DocumentUrn("my-tutorial-doc"))

if doc:
    # Add related assets - the document becomes accessible from these assets
    doc.add_related_asset("urn:li:dataset:(urn:li:dataPlatform:snowflake,users,PROD)")
    doc.add_related_asset("urn:li:dataset:(urn:li:dataPlatform:snowflake,orders,PROD)")

    # Add a related document
    doc.add_related_document("urn:li:document:related-guide")

    client.entities.upsert(doc)
    print("Related entities updated!")

# ============================================================================
# Example 6: Move a document to a different parent
# ============================================================================
# Documents can be organized hierarchically. Moving a document changes its parent.
doc = client.entities.get(DocumentUrn("child-section"))

if doc:
    # Check current parent
    print(f"Current parent: {doc.parent_document}")

    # Move to a new parent document
    doc.set_parent_document("urn:li:document:new-parent-guide")
    client.entities.upsert(doc)
    print("Document moved to new parent!")

    # Remove from hierarchy (make it a top-level document)
    doc.set_parent_document(None)
    client.entities.upsert(doc)
    print("Document is now a top-level document!")

# ============================================================================
# Example 7: Update document status
# ============================================================================
doc = client.entities.get(DocumentUrn("my-tutorial-doc"))

if doc:
    # Publish the document
    doc.publish()
    client.entities.upsert(doc)
    print("Document published!")

    # Later, unpublish it
    doc.unpublish()
    client.entities.upsert(doc)
    print("Document unpublished!")
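
Every example above follows the same get, modify, upsert pattern. One way to factor that out is sketched below. `update_document` is a hypothetical helper, not an SDK function, and it assumes (as the examples above do) that `client.entities.get` returns a falsy value when the document is missing.

```python
from typing import Any, Callable


def update_document(client: Any, urn: Any, mutate: Callable[[Any], None]) -> bool:
    """Fetch a document, apply `mutate` to it, and upsert the result.

    Returns False without writing anything if the document was not found.
    """
    doc = client.entities.get(urn)
    if not doc:  # assumption: a missing document comes back as None
        return False
    mutate(doc)
    client.entities.upsert(doc)
    return True
```

With a real client, a call might look like `update_document(client, DocumentUrn("my-tutorial-doc"), lambda d: d.set_title("Updated Tutorial Title"))`.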

Search Documents

Search through documents with various filters.

# Inlined from /metadata-ingestion/examples/library/search_documents.py
"""Example: Searching documents using the DataHub SDK.

This example demonstrates how to search for documents using the DataHub SDK.
"""

from datahub.sdk import DataHubClient, FilterDsl

# Initialize the client
client = DataHubClient.from_env()

# ============================================================================
# Example 1: Search for all documents
# ============================================================================
# Use get_urns with entity type filter to find documents
document_urns = client.search.get_urns(
    filter=FilterDsl.entity_type("document"),
)

print("All documents:")
for urn in document_urns:
    print(f" - {urn}")

# ============================================================================
# Example 2: Search with a text query
# ============================================================================
# Search for documents matching "data quality"
document_urns = client.search.get_urns(
    query="data quality",
    filter=FilterDsl.entity_type("document"),
)

print("\nDocuments matching 'data quality':")
for urn in document_urns:
    print(f" - {urn}")

# ============================================================================
# Example 3: Search within a specific domain
# ============================================================================
document_urns = client.search.get_urns(
    filter=FilterDsl.and_(
        FilterDsl.entity_type("document"),
        FilterDsl.domain("urn:li:domain:engineering"),
    ),
)

print("\nDocuments in engineering domain:")
for urn in document_urns:
    print(f" - {urn}")

# ============================================================================
# Example 4: Search with tags
# ============================================================================
document_urns = client.search.get_urns(
    filter=FilterDsl.and_(
        FilterDsl.entity_type("document"),
        FilterDsl.tag("urn:li:tag:important"),
    ),
)

print("\nDocuments with 'important' tag:")
for urn in document_urns:
    print(f" - {urn}")
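
The searches above iterate over every match. When only the first few results are needed, the iteration can be capped. This sketch assumes the search result is an ordinary iterable, as the `for urn in document_urns` loops suggest; `first_n` is a hypothetical helper.

```python
from __future__ import annotations

from itertools import islice
from typing import Iterable


def first_n(urns: Iterable[object], limit: int) -> list[str]:
    """Return at most `limit` results as strings without draining the iterable."""
    return [str(urn) for urn in islice(urns, limit)]
```

If the results are produced lazily, stopping early this way avoids fetching matches that will not be used.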

Get Document

Retrieve the full contents and metadata of a specific document.

# Inlined from /metadata-ingestion/examples/library/get_document.py
"""Example: Retrieving documents using the DataHub SDK.

This example demonstrates how to get documents and access their properties.
"""

from datahub.metadata.urns import DocumentUrn
from datahub.sdk import DataHubClient

# Initialize the client
client = DataHubClient.from_env()

# ============================================================================
# Example 1: Get a document by URN
# ============================================================================
doc = client.entities.get(DocumentUrn("my-tutorial-doc"))

if doc:
    print(f"Document: {doc.title}")
    print(f"URN: {doc.urn}")
    print(f"Status: {doc.status}")
    print(f"Subtype: {doc.subtype}")
    print(f"\nContents:\n{doc.text}")

    # Check document type (native vs external)
    if doc.is_native:
        print("\nThis is a native document (stored in DataHub)")
    elif doc.is_external:
        print("\nThis is an external document")
        print(f" External URL: {doc.external_url}")
        print(f" External ID: {doc.external_id}")

    # Check visibility
    if doc.show_in_global_context:
        print("Visible in global search and sidebar")
    else:
        print("Hidden from global context (accessible only via related assets)")

    # Check related entities
    if doc.related_assets:
        print(f"\nRelated assets: {len(doc.related_assets)}")
        for asset in doc.related_assets:
            print(f" - {asset}")

    if doc.related_documents:
        print(f"\nRelated documents: {len(doc.related_documents)}")
        for related_doc in doc.related_documents:
            print(f" - {related_doc}")

    # Check parent document
    if doc.parent_document:
        print(f"\nParent document: {doc.parent_document}")

    # Get custom properties
    if doc.custom_properties:
        print("\nCustom properties:")
        for key, value in doc.custom_properties.items():
            print(f" {key}: {value}")
else:
    print("Document not found")

# ============================================================================
# Example 2: Check if a document exists
# ============================================================================
doc = client.entities.get(DocumentUrn("might-not-exist"))

if doc is not None:
    print(f"\nDocument exists: {doc.urn}")
else:
    print("\nDocument does not exist")
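
For logging or export, the properties read above can be collected into a plain dictionary. A minimal sketch follows; `summarize_document` is a hypothetical helper that only reads attribute names already used in this guide and tolerates any that are missing.

```python
from __future__ import annotations

from typing import Any


def summarize_document(doc: Any) -> dict[str, Any]:
    """Collect a few document fields into a plain dict."""
    related = getattr(doc, "related_assets", None) or []
    return {
        "urn": str(getattr(doc, "urn", "")),
        "title": getattr(doc, "title", None),
        "status": getattr(doc, "status", None),
        "subtype": getattr(doc, "subtype", None),
        "related_asset_count": len(related),
    }
```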

Publish/Unpublish Document

Control whether a document is visible to users.

# Inlined from /metadata-ingestion/examples/library/publish_document.py
"""Example: Publishing and unpublishing documents using the DataHub SDK.

This example demonstrates how to control document visibility by
publishing or unpublishing documents.
"""

from datahub.metadata.urns import DocumentUrn
from datahub.sdk import DataHubClient, Document

# Initialize the client
client = DataHubClient.from_env()

# ============================================================================
# Example 1: Publish a document
# ============================================================================
# Get the document
doc = client.entities.get(DocumentUrn("my-tutorial-doc"))

if doc:
    # Publish makes the document visible to users
    doc.publish()
    client.entities.upsert(doc)
    print(f"Document published! Status: {doc.status}")

# ============================================================================
# Example 2: Unpublish a document
# ============================================================================
doc = client.entities.get(DocumentUrn("my-tutorial-doc"))

if doc:
    # Unpublish hides the document from general users
    doc.unpublish()
    client.entities.upsert(doc)
    print(f"Document unpublished! Status: {doc.status}")

# ============================================================================
# Example 3: Create a document with specific status
# ============================================================================
# Create as published (default)
published_doc = Document.create_document(
    id="new-published-doc",
    title="New Published Document",
    text="This document is published from the start.",
)
client.entities.upsert(published_doc)
print(f"Created published document: {published_doc.urn}")

# Create as unpublished (work in progress)
unpublished_doc = Document.create_document(
    id="new-unpublished-doc",
    title="Work in Progress Document",
    text="This document is not yet published.",
    status="UNPUBLISHED",
)
client.entities.upsert(unpublished_doc)
print(f"Created unpublished document: {unpublished_doc.urn}")
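
Since `status` is passed as a string, a small guard can catch typos before the document is created. This sketch assumes the only valid values are the two named in this guide, PUBLISHED and UNPUBLISHED; `normalize_status` is a hypothetical helper.

```python
VALID_STATUSES = ("PUBLISHED", "UNPUBLISHED")  # the two values named in this guide


def normalize_status(value: str) -> str:
    """Upper-case and validate a status string, raising ValueError if unknown."""
    status = value.strip().upper()
    if status not in VALID_STATUSES:
        raise ValueError(f"Unknown document status: {value!r}")
    return status
```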

Delete Document

Remove a document from DataHub.

# Inlined from /metadata-ingestion/examples/library/delete_document.py
"""Example: Deleting documents using the DataHub SDK.

This example demonstrates how to delete documents from DataHub.
"""

from datahub.metadata.urns import DocumentUrn
from datahub.sdk import DataHubClient

# Initialize the client
client = DataHubClient.from_env()

# ============================================================================
# Example 1: Delete a document by URN
# ============================================================================
doc_urn = DocumentUrn("my-tutorial-doc")

# First check if it exists
doc = client.entities.get(doc_urn)

if doc:
    # Delete the document
    client.entities.delete(str(doc_urn))
    print(f"Document deleted: {doc_urn}")
else:
    print(f"Document not found: {doc_urn}")

# ============================================================================
# Example 2: Delete multiple documents
# ============================================================================
doc_ids_to_delete = [
    "doc-1",
    "doc-2",
    "doc-3",
]

for doc_id in doc_ids_to_delete:
    doc_urn = DocumentUrn(doc_id)
    doc = client.entities.get(doc_urn)
    if doc:
        client.entities.delete(str(doc_urn))
        print(f"Deleted: {doc_urn}")
    else:
        print(f"Not found (skipping): {doc_urn}")

print("Cleanup complete!")
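
The batch loop above prints its progress. For scripted cleanup it can be handier to return what happened. The sketch below wraps the same check-then-delete steps; `delete_documents` is a hypothetical helper, and it relies on the same assumption as the example, namely that `client.entities.get` returns a falsy value for a missing document.

```python
from __future__ import annotations

from typing import Any, Iterable


def delete_documents(client: Any, urns: Iterable[Any]) -> tuple[list[str], list[str]]:
    """Delete each existing document; return (deleted, missing) URN strings."""
    deleted: list[str] = []
    missing: list[str] = []
    for urn in urns:
        if client.entities.get(urn):
            client.entities.delete(str(urn))
            deleted.append(str(urn))
        else:
            missing.append(str(urn))
    return deleted, missing
```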

Advanced Operations

Associate a document with data assets. Documents linked to assets can be accessed from those assets even when hidden from global context.

from datahub.sdk import DataHubClient, Document

client = DataHubClient.from_env()

doc = client.entities.get("urn:li:document:my-doc", Document)
if doc:
    # Add related assets
    doc.add_related_asset("urn:li:dataset:(urn:li:dataPlatform:snowflake,my_table,PROD)")
    doc.add_related_asset("urn:li:dashboard:(looker,dashboard1)")

    # Add related documents
    doc.add_related_document("urn:li:document:related-doc")

    client.entities.upsert(doc)
    print("Related entities updated!")

Update Document Sub-Type

Change the sub-type (e.g., "FAQ", "Tutorial", "Runbook") of a document:

from datahub.sdk import DataHubClient, Document

client = DataHubClient.from_env()

doc = client.entities.get("urn:li:document:my-doc", Document)
if doc:
    doc.set_subtype("Reference")
    client.entities.upsert(doc)
    print(f"Sub-type updated: {doc.subtype}")

Move Document

Move a document to a different parent (for hierarchical organization):

from datahub.sdk import DataHubClient, Document

client = DataHubClient.from_env()

doc = client.entities.get("urn:li:document:child-doc", Document)
if doc:
    # Move to a new parent
    doc.set_parent_document("urn:li:document:new-parent")
    client.entities.upsert(doc)
    print(f"Document moved! New parent: {doc.parent_document}")

    # Or make it a top-level document (no parent)
    doc.set_parent_document(None)
    client.entities.upsert(doc)
    print("Document is now a top-level document!")
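
When reorganizing a hierarchy in bulk, it can help to check that a move would not make a document its own ancestor. This guide does not say whether DataHub rejects such moves, so the following is a client-side sketch only. `would_create_cycle` is a hypothetical helper that works on a plain mapping of child URN to parent URN, which the caller would have to build.

```python
from __future__ import annotations

from typing import Mapping, Optional


def would_create_cycle(
    parents: Mapping[str, Optional[str]], child: str, new_parent: str
) -> bool:
    """Return True if making `new_parent` the parent of `child` forms a loop."""
    seen: set[str] = set()
    current: Optional[str] = new_parent
    # Walk up from the proposed parent; reaching `child` means a loop.
    while current is not None and current not in seen:
        if current == child:
            return True
        seen.add(current)
        current = parents.get(current)
    return False
```

A caller could run this check before `doc.set_parent_document(...)` and skip the move when it returns `True`.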

Python SDK Reference

The Document SDK provides the following methods:

Creation Methods

| Method | Description |
| --- | --- |
| Document.create_document(...) | Create a native document stored in DataHub |
| Document.create_external_document(...) | Create a reference to an external document |

Content & Metadata

| Method | Description |
| --- | --- |
| doc.title / doc.set_title(...) | Get/set the document title |
| doc.text / doc.set_text(...) | Get/set the document text content |
| doc.subtype / doc.set_subtype(...) | Get/set the sub-type (FAQ, Tutorial, etc.) |
| doc.custom_properties | Get the custom properties dictionary |
| doc.set_custom_property(key, value) | Set a single custom property |

Visibility & Lifecycle

| Method | Description |
| --- | --- |
| doc.status / doc.set_status(...) | Get/set PUBLISHED or UNPUBLISHED status |
| doc.publish() / doc.unpublish() | Publish or unpublish the document |
| doc.show_in_global_context | Check if visible in global search/sidebar |
| doc.hide_from_global_context() | Hide from global context (accessible only via related assets) |
| doc.show_in_global_search() | Show in global context |

Relationships

| Method | Description |
| --- | --- |
| doc.related_assets | Get list of related asset URNs |
| doc.add_related_asset(...) / doc.remove_related_asset(...) | Add/remove a related asset |
| doc.related_documents | Get list of related document URNs |
| doc.add_related_document(...) / doc.remove_related_document(...) | Add/remove a related document |
| doc.parent_document / doc.set_parent_document(...) | Get/set parent for hierarchy |

Source Information

| Method | Description |
| --- | --- |
| doc.is_native | Check if this is a native DataHub document |
| doc.is_external | Check if this is an external reference |
| doc.external_url | Get the external URL (external docs only) |
| doc.external_id | Get the external system ID |

Metadata (via mixins)

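| Method | Description |
| --- | --- |
| doc.add_tag(...) / doc.set_tags(...) | Add a tag / set all tags |
| doc.add_owner(...) / doc.set_owners(...) | Add an owner / set all owners |
| doc.set_domain(...) | Set the domain |
| doc.add_term(...) / doc.set_terms(...) | Add a glossary term / set all glossary terms |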

For more examples, see: