Semantic Search Configuration

Semantic search lets you find DataHub entities using natural language queries like "customer churn analysis" — even when exact keywords differ.

Prerequisites

  1. OpenSearch 2.17.0+ with k-NN plugin (DataHub ships with opensearchproject/opensearch:2.19.3). Elasticsearch is not supported.
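  2. An API key for your chosen embedding provider (see the Supported Models table below). AWS Bedrock does not need one; it authenticates through the AWS SDK default credential chain.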
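The first prerequisite can be checked before enabling anything. The following is a minimal sketch, not part of DataHub: it assumes the cluster's root endpoint and `_cat/plugins` endpoint are reachable without authentication, and the function names and the `base_url` placeholder are illustrative.

```python
import json
from urllib.request import urlopen

MIN_VERSION = (2, 17, 0)  # minimum OpenSearch version stated in the prerequisites


def parse_version(number: str) -> tuple:
    """Turn '2.19.3' (or '2.19.3-SNAPSHOT') into (2, 19, 3)."""
    core = number.split("-")[0]
    return tuple(int(part) for part in core.split(".")[:3])


def check_cluster(root_info: dict, plugin_names: list) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    version = root_info.get("version", {})
    # OpenSearch reports distribution == "opensearch" in its root response.
    if version.get("distribution") != "opensearch":
        problems.append("cluster is not OpenSearch")
    elif parse_version(version.get("number", "0.0.0")) < MIN_VERSION:
        problems.append(f"OpenSearch {version.get('number')} is older than 2.17.0")
    if "opensearch-knn" not in plugin_names:
        problems.append("k-NN plugin (opensearch-knn) is not installed")
    return problems


def fetch_and_check(base_url: str) -> list:
    """Query a live cluster; base_url is a placeholder such as http://localhost:9200."""
    with urlopen(base_url + "/") as resp:
        root_info = json.load(resp)
    with urlopen(base_url + "/_cat/plugins?format=json") as resp:
        plugin_names = [row["component"] for row in json.load(resp)]
    return check_cluster(root_info, plugin_names)
```

Calling `fetch_and_check("http://localhost:9200")` against your own cluster address returns an empty list when both conditions hold. Secured clusters would need credentials added to the requests.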

If you deploy DataHub using the DataHub Helm chart, add the following to your values.yaml and run helm upgrade.

OpenAI (Default)

Create a secret, then configure:

kubectl create secret generic openai-secret --from-literal=api-key=sk-your-api-key-here

global:
  datahub:
    semantic_search:
      enabled: true
      vectorDimension: 3072
      provider:
        type: "openai"
        openai:
          apiKey:
            secretRef: "openai-secret"
            secretKey: "api-key"
          model: "text-embedding-3-large"

AWS Bedrock

No API key needed — Bedrock authenticates via the AWS SDK default credential chain (IRSA, EC2/ECS instance credentials, etc).

global:
  datahub:
    semantic_search:
      enabled: true
      vectorDimension: 1024
      provider:
        type: "aws-bedrock"
        bedrock:
          modelId: "cohere.embed-english-v3"
          awsRegion: "us-west-2"

Cohere

Create a secret, then configure:

kubectl create secret generic cohere-secret --from-literal=api-key=your-cohere-api-key

global:
  datahub:
    semantic_search:
      enabled: true
      vectorDimension: 1024
      provider:
        type: "cohere"
        cohere:
          apiKey:
            secretRef: "cohere-secret"
            secretKey: "api-key"
          model: "embed-english-v3.0"

Apply Changes

helm upgrade datahub datahub/datahub -f values.yaml

Environment Variables

For Docker Compose or non-Helm deployments, set these on the datahub-gms service and restart it.

OpenAI (Default)

ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
SEARCH_SERVICE_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document
OPENAI_API_KEY=sk-your-api-key-here

That's it — OpenAI is the default provider, so no other variables are needed.

AWS Bedrock

ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
SEARCH_SERVICE_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document
EMBEDDING_PROVIDER_TYPE=aws-bedrock
BEDROCK_EMBEDDING_AWS_REGION=us-west-2
ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION=1024

Authentication uses the AWS SDK default credential chain (EC2/ECS instance credentials, AWS_PROFILE, or AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY).

Cohere

ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true
SEARCH_SERVICE_SEMANTIC_SEARCH_ENABLED=true
ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES=document
EMBEDDING_PROVIDER_TYPE=cohere
COHERE_API_KEY=your-cohere-api-key
ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION=1024
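Before restarting GMS, it can help to confirm that every variable for the chosen provider is present. The sketch below is a hypothetical helper, not a DataHub tool. It assumes that each variable listed in the provider sections above is required, which the lists suggest but do not state outright.

```python
import os

# Variables that every provider section above sets.
COMMON = [
    "ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED",
    "SEARCH_SERVICE_SEMANTIC_SEARCH_ENABLED",
    "ELASTICSEARCH_SEMANTIC_SEARCH_ENTITIES",
]

# Additional variables per provider, copied from the lists above.
PER_PROVIDER = {
    "openai": ["OPENAI_API_KEY"],
    "aws-bedrock": ["BEDROCK_EMBEDDING_AWS_REGION", "ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION"],
    "cohere": ["COHERE_API_KEY", "ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION"],
}


def missing_variables(env: dict) -> list:
    """Return the names of expected variables that are unset or empty."""
    provider = env.get("EMBEDDING_PROVIDER_TYPE", "openai")  # OpenAI is the default
    if provider not in PER_PROVIDER:
        return [f"EMBEDDING_PROVIDER_TYPE (unrecognized value: {provider})"]
    required = COMMON + PER_PROVIDER[provider]
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Run inside the datahub-gms container or with its environment loaded.
    for name in missing_variables(dict(os.environ)):
        print("missing:", name)
```

The script prints nothing when all expected variables are set. For Bedrock it checks only the region and dimension, since credentials come from the AWS SDK chain rather than a DataHub-specific variable.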

Verify It's Working

After restarting, check the GMS logs:

# Docker Compose
docker-compose logs datahub-gms | grep -i "embedding"

# Kubernetes
kubectl logs deployment/datahub-gms | grep -i "embedding"

You should see:

Creating embedding provider with type: openai
Initialized OpenAiEmbeddingProvider with model=text-embedding-3-large
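If you capture the log output to a file, the check can be scripted. This is a small sketch that looks for the first log line shown above. It assumes other providers log the same "Creating embedding provider with type:" prefix, which is only confirmed here for OpenAI.

```python
import re

# Matches the provider announcement line shown in the sample output above.
PROVIDER_LINE = re.compile(r"Creating embedding provider with type: (\S+)")


def provider_from_logs(log_text: str):
    """Return the provider type announced in the logs, or None if absent."""
    match = PROVIDER_LINE.search(log_text)
    return match.group(1) if match else None
```

A result of `None` means the line was not found, in which case the Troubleshooting section is the next place to look.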

Generating Embeddings

Once semantic search is enabled, you need to run an ingestion source to generate embeddings for your documents.

Minimal Recipe

source:
  type: datahub-documents
  config: {}

sink:
  type: datahub-rest
  config: {}

This recipe connects to DataHub automatically, fetches the embedding configuration from the server, and processes documents in real time. Run it with:

datahub ingest -c recipe.yml
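The same recipe can also be run from Python through the DataHub ingestion framework's `Pipeline` class, for example from a scheduler. This is a sketch under assumptions: it requires the `acryl-datahub` package with the `datahub-documents` source available, and it assumes connection settings resolve the same way they do for the CLI.

```python
# The minimal recipe from above, expressed as a Python dict.
RECIPE = {
    "source": {"type": "datahub-documents", "config": {}},
    "sink": {"type": "datahub-rest", "config": {}},
}


def run_recipe(recipe: dict) -> None:
    """Run an ingestion recipe and raise if the run reports failures."""
    # Imported lazily so this module loads even where the datahub
    # package (pip: acryl-datahub) is not installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Calling `run_recipe(RECIPE)` is then roughly equivalent to the `datahub ingest` command above.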

For external document sources (Notion, Confluence, etc.), see the Notion Source and DataHub Documents Source documentation.

Supported Models

| Provider | Model | Dimensions | Notes |
| --- | --- | --- | --- |
| OpenAI | text-embedding-3-large | 3072 | Default, higher quality |
| OpenAI | text-embedding-3-small | 1536 | Fast, cost-effective |
| AWS Bedrock | cohere.embed-english-v3 | 1024 | AWS-managed |
| Cohere | embed-english-v3.0 | 1024 | English optimized |
| Cohere | embed-multilingual-v3.0 | 1024 | 100+ languages |

To use a non-default model, set the model name in your Helm values or environment variables, then update vectorDimension (Helm) or ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION (environment variable) to match that model's dimensions.
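A mismatch between model and dimension is easy to catch before deploying. The sketch below hard-codes the dimensions from the Supported Models table; the function name is illustrative, and models outside the table are reported as unknown rather than guessed.

```python
# Dimensions copied from the Supported Models table.
MODEL_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "cohere.embed-english-v3": 1024,
    "embed-english-v3.0": 1024,
    "embed-multilingual-v3.0": 1024,
}


def check_dimension(model: str, configured: int) -> str:
    """Compare a configured vector dimension with the model's known dimension."""
    expected = MODEL_DIMENSIONS.get(model)
    if expected is None:
        return f"unknown model {model!r}: check the provider's documentation"
    if configured != expected:
        return f"mismatch: {model} produces {expected} dimensions, but {configured} is configured"
    return "ok"
```

For example, switching OpenAI to text-embedding-3-small while leaving the dimension at 3072 yields a mismatch message, which corresponds to the "Dimension mismatch" symptom listed under Troubleshooting.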

Troubleshooting

| Symptom | Fix |
| --- | --- |
| "Semantic search is disabled or not configured" | Verify ELASTICSEARCH_SEMANTIC_SEARCH_ENABLED=true and restart GMS |
| "Invalid API key provided" | Check your API key is set correctly in the GMS environment |
| "Dimension mismatch: expected 3072, got 1024" | Update ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION to match your model |

Further Reading