Switching Embedding Providers
This guide explains how to migrate from one embedding provider to another. Switching providers requires deleting the semantic index and re-ingesting all documents, because embeddings from different models are not comparable and often have different dimensions.
For initial setup of semantic search (including all provider configurations), see Semantic Search Configuration.
When You Need This Guide
- Switching from OpenAI to AWS Bedrock (or vice versa)
- Switching from one model to another with different vector dimensions
- Changing from Cohere direct API to AWS Bedrock-managed Cohere
Provider and Model Reference
| Provider | Model | Model Key | Dimensions |
|---|---|---|---|
| OpenAI | text-embedding-3-large | text_embedding_3_large | 3072 |
| OpenAI | text-embedding-3-small | text_embedding_3_small | 1536 |
| AWS Bedrock | cohere.embed-english-v3 | cohere_embed_v3 | 1024 |
| Cohere | embed-english-v3.0 | embed_english_v3_0 | 1024 |
Important: The model key is derived from the model name by replacing - and . with _. Both the ingestion connector and GMS must use the same model to ensure query embeddings match document embeddings.
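The substitution rule can be sketched as a one-line shell function. `model_key` is a hypothetical helper name, not part of DataHub. Note that the AWS Bedrock row in the table (cohere_embed_v3) does not follow from a purely mechanical substitution on cohere.embed-english-v3, so treat the table as the reference and this sketch only as an illustration of the rule.

```shell
# Hypothetical helper: derive a model key from a model name
# by replacing every "-" and "." with "_".
model_key() {
  printf '%s\n' "$1" | tr '.-' '__'
}

model_key "text-embedding-3-large"   # text_embedding_3_large
model_key "embed-english-v3.0"       # embed_english_v3_0
```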
Migration Steps
Step 1: Stop DataHub Services
Stop GMS and any ingestion jobs to prevent writes during migration:
# Docker Compose
docker stop datahub-gms
# Kubernetes
kubectl scale deployment datahub-gms --replicas=0
Step 2: Delete the Semantic Index
Delete the existing semantic index from OpenSearch:
# Check existing semantic indices
curl -s "http://localhost:9200/_cat/indices/*semantic*?v"
# Delete the semantic index (adjust index name as needed)
curl -X DELETE "http://localhost:9200/documentindex_v2_semantic"
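Because a DELETE against the wrong index name cannot be undone, it may be worth wrapping the call in a small guard. The following is a minimal sketch, not a DataHub tool: the function name `delete_semantic_index` and the `OPENSEARCH_URL` and `DRY_RUN` variables are assumptions introduced here for illustration.

```shell
# Hypothetical helper: refuse to delete any index whose name lacks "semantic".
# Set DRY_RUN=1 to print the request instead of sending it.
delete_semantic_index() {
  index="$1"
  host="${OPENSEARCH_URL:-http://localhost:9200}"   # assumed variable name
  case "$index" in
    *semantic*) ;;   # name looks like a semantic index, continue
    *) echo "refusing to delete non-semantic index: $index" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "curl -X DELETE $host/$index"
  else
    curl -X DELETE "$host/$index"
  fi
}
```

Running it once with DRY_RUN=1 shows the exact request before anything is deleted.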
Step 3: Update Provider Configuration
Update your configuration with the new provider settings. See Semantic Search Configuration for the full configuration options for each provider (Helm charts and environment variables).
Make sure to update:
- Provider type (EMBEDDING_PROVIDER_TYPE)
- API credentials (API key or IAM role)
- Vector dimension (ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION) to match the new model
Step 4: Update Index Configuration
If using application.yaml, update the model entry to match the new provider:
elasticsearch:
  entityIndex:
    semanticSearch:
      models:
        # Use the model key that matches your new provider
        text_embedding_3_large:
          vectorDimension: 3072 # Must match model output
          knnEngine: faiss
          spaceType: cosinesimil
          efConstruction: 128
          m: 16
Or via environment variable:
ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION=3072
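A mismatched dimension only surfaces later as an indexing error, so a quick check before restarting can help. The sketch below hard-codes the dimensions from the Provider and Model Reference table above; `expected_dimension` and `check_dimension` are hypothetical helper names, not DataHub commands.

```shell
# Dimensions copied from the Provider and Model Reference table.
expected_dimension() {
  case "$1" in
    text_embedding_3_large) echo 3072 ;;
    text_embedding_3_small) echo 1536 ;;
    cohere_embed_v3|embed_english_v3_0) echo 1024 ;;
    *) return 1 ;;   # model key not in the table
  esac
}

# Returns 0 only when the configured dimension matches the model key.
check_dimension() {
  expected="$(expected_dimension "$1")" || { echo "unknown model key: $1" >&2; return 2; }
  if [ "$expected" != "$2" ]; then
    echo "mismatch: $1 produces $expected dimensions, config has $2" >&2
    return 1
  fi
}
```

For example, `check_dimension text_embedding_3_large "$ELASTICSEARCH_SEMANTIC_VECTOR_DIMENSION"` exits non-zero if the environment variable does not equal 3072.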
Step 5: Start DataHub
Start GMS — the system update job will automatically recreate the semantic index:
# Docker Compose
docker start datahub-gms
# Kubernetes
kubectl scale deployment datahub-gms --replicas=1
The system update job runs automatically on startup and will:
- Detect the missing semantic index
- Create it with the correct mapping for your new embedding model
- Log progress to the GMS logs
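Since the index is recreated asynchronously at startup, a script that continues straight to re-ingestion may run too early. One way to wait is to poll the index with a HEAD request, which OpenSearch answers with 200 when the index exists and 404 when it does not. The helper names `wait_until` and `index_exists` are assumptions for this sketch, as are the attempt count and delay in the usage line.

```shell
# Run a check command repeatedly until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay-seconds> <command> [args...]
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then return 0; fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Succeeds when a HEAD request to the index URL returns HTTP 200.
index_exists() {
  [ "$(curl -s -o /dev/null -I -w '%{http_code}' "$1")" = "200" ]
}

# Example (not run here): wait up to 30 x 5 seconds for the index.
# wait_until 30 5 index_exists "http://localhost:9200/documentindex_v2_semantic"
```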
Step 6: Re-ingest Documents
After the index is recreated, re-ingest your documents to generate new embeddings:
datahub ingest -c your-recipe.yaml
Important: Make sure your ingestion recipe also uses the same embedding model. The ingestion connector generates document embeddings, while GMS generates query embeddings — both must use the same model.
Step 7: Verify
# Check the index exists with correct mapping
curl -s "http://localhost:9200/documentindex_v2_semantic/_mapping?pretty" | head -50
# Check documents have embeddings
curl -s "http://localhost:9200/documentindex_v2_semantic/_search" \
  -H "Content-Type: application/json" \
  -d '{"size": 1, "_source": ["urn", "embeddings"]}' | head -30
# Test semantic search via GraphQL or the UI
Troubleshooting
"No embeddings found" after switching
Cause: Documents were ingested before the provider switch and have embeddings from the old model.
Solution: Re-run ingestion to generate new embeddings with the new provider.
"Dimension mismatch" errors
Cause: The index was created with a different vector dimension than the new model produces.
Solution: Delete the semantic index and let it be recreated (Steps 2-5 above), then re-ingest your documents (Step 6).
"Invalid API key" errors
Cause: API key not set or incorrect.
Solution: Verify your API key is correctly set in the environment:
# Check the environment variable is set (in the container)
docker exec datahub-gms env | grep -E 'OPENAI_API_KEY|COHERE_API_KEY'
Query returns no results but documents exist
Cause: Model mismatch between ingestion and query time.
Solution: Ensure both the ingestion connector AND GMS use the same embedding model. Check:
- The provider-specific model env var (BEDROCK_EMBEDDING_MODEL, OPENAI_EMBEDDING_MODEL, or COHERE_EMBEDDING_MODEL) in GMS config
- Embedding model in your ingestion recipe
Best Practices
- Use the same model everywhere: Ensure ingestion connectors and GMS use identical embedding models.
- Test in development first: Switch providers in a dev environment before production.
- Plan for re-ingestion: Switching providers requires re-generating all embeddings, which can take time for large datasets.
- Monitor costs: Different providers have different pricing. OpenAI and Cohere charge per token/request.
- Keep backups: Before deleting indices, consider backing them up in case you need to roll back.