System Architecture Overview¶
This document provides a comprehensive overview of the Embedding Studio architecture, explaining how different components work together to create, fine-tune, and serve embedding models.
High-Level Architecture¶
Core API Service¶
The central API service (embedding_studio container) provides:
- REST API endpoints for application integration
- Plugin management and discovery
- Session and clickstream data collection
- Task scheduling and coordination
This service acts as the entry point for applications using Embedding Studio and orchestrates the workflow between components.
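For example, a client application might register a search session and report click events through the REST API. A minimal sketch, assuming hypothetical route names and payload fields (check the API reference for the exact surface):

```python
import requests

# Base URL and routes below are illustrative assumptions, not the exact API.
BASE_URL = "http://localhost:5000/api/v1"

# Register a search session so later click events can be attributed to it.
resp = requests.post(
    f"{BASE_URL}/clickstream/session",
    json={"session_id": "session-123", "search_query": "red sneakers"},
)
resp.raise_for_status()

# Report a click as an implicit relevance signal for fine-tuning.
requests.post(
    f"{BASE_URL}/clickstream/session/events",
    json={"session_id": "session-123", "events": [{"object_id": "item-42"}]},
)
```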
Worker Services¶
Fine-Tuning Worker¶
The fine_tuning_worker container:
- Handles model fine-tuning tasks
- Runs training jobs for embedding models
- Integrates with MLflow for experiment tracking
- Requires GPU acceleration for efficient training
- Uses the selected plugin's fine-tuning method
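Conceptually, the worker dequeues a task, runs the selected plugin's training loop, and records everything in MLflow. A simplified sketch, where `training_batches`, `training_step`, and `experiment_name` are placeholder names rather than the actual interface:

```python
import mlflow

def run_fine_tuning_task(task, plugin):
    # Placeholder field/method names; the real task and plugin objects differ.
    mlflow.set_experiment(task.experiment_name)
    with mlflow.start_run():
        mlflow.log_param("fine_tuning_method", plugin.name)
        for step, batch in enumerate(plugin.training_batches(task)):
            loss = plugin.training_step(batch)  # one optimization step
            mlflow.log_metric("train_loss", loss, step=step)
```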
Inference Worker¶
The inference_worker container:
- Serves embedding models via Triton Inference Server
- Handles real-time embedding generation
- Supports model versioning and A/B testing
- Provides gRPC and HTTP endpoints
- Manages model deployment lifecycle
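Downstream services can also query Triton directly with the standard Triton client library. A sketch using the gRPC client, where the model name, tensor names, and shapes are assumptions that depend on the deployed model configuration:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# 8001 is Triton's default gRPC port; adjust for your deployment.
client = grpcclient.InferenceServerClient(url="localhost:8001")

# "input_ids", "query_encoder", and "embedding" are illustrative names.
text_input = grpcclient.InferInput("input_ids", [1, 128], "INT64")
text_input.set_data_from_numpy(np.zeros((1, 128), dtype=np.int64))

result = client.infer(model_name="query_encoder", inputs=[text_input])
embedding = result.as_numpy("embedding")
```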
Improvement Worker¶
The improvement_worker container:
- Processes incremental vector adjustments
- Applies post-training optimizations to embeddings
- Handles small improvements without full fine-tuning
- Works on embedding quality enhancement
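A minimal sketch of what an incremental adjustment can look like, assuming a simple click-based update rule (the real logic is supplied by the active plugin):

```python
import numpy as np

def adjust_vectors(query_vec, item_vecs, clicked, lr=0.05):
    """Nudge clicked items toward the query and skipped items away from it."""
    adjusted = []
    for vec, was_clicked in zip(item_vecs, clicked):
        direction = 1.0 if was_clicked else -1.0
        vec = vec + direction * lr * (query_vec - vec)
        adjusted.append(vec / np.linalg.norm(vec))  # keep vectors unit-length
    return adjusted
```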
Upsertion Worker¶
The upsertion_worker container:
- Manages embedding generation for new content
- Handles batch processing of items
- Updates the vector database with new embeddings
- Processes deletion and reindexing tasks
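For example, new content might be queued for upsertion through the API, after which the worker embeds the items in batches and writes the vectors to the database. The route and item schema below are assumptions for illustration:

```python
import requests

# Hypothetical payload; the actual item schema comes from your plugin.
items = [
    {"object_id": "doc-1", "payload": {"title": "Red sneakers"}},
    {"object_id": "doc-2", "payload": {"title": "Blue boots"}},
]

requests.post(
    "http://localhost:5000/api/v1/embeddings/upsertion-tasks/run",  # assumed route
    json={"items": items},
)
```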
Data Storage¶
Vector Database¶
Embedding Studio uses PostgreSQL with the pgvector extension as its primary vector store:
- Stores embedding vectors with metadata
- Provides fast approximate nearest neighbor search
- Supports various distance metrics (cosine, dot product, Euclidean)
- Handles index optimization for performance
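A minimal sketch of a similarity query against a pgvector-backed table, where the table and column names are illustrative:

```python
import psycopg2

conn = psycopg2.connect("dbname=embeddings user=postgres")
with conn.cursor() as cur:
    query_vec = [0.1, 0.2, 0.3]  # in practice, the model's output vector
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    # "<=>" is pgvector's cosine-distance operator; "<#>" gives negative
    # inner product and "<->" gives Euclidean distance.
    cur.execute(
        "SELECT object_id FROM embeddings "
        "ORDER BY vector <=> %s::vector LIMIT 10",
        (vec_literal,),
    )
    top_ids = [row[0] for row in cur.fetchall()]
```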
Document Storage¶
MongoDB is used for storing:
- Fine-tuning task metadata
- Session and clickstream data
- Improvement and upsertion task tracking
- Reindexing task management
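Task state can be inspected directly if needed; the database, collection, and field names below are assumptions rather than the actual schema:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
tasks = client["embedding_studio"]["fine_tuning_tasks"]  # assumed names
for task in tasks.find({"status": "processing"}).limit(5):
    print(task["_id"], task.get("status"))
```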
Model Storage¶
MLflow, backed by MinIO and MySQL, manages:
- Model versioning and artifacts
- Training metrics and parameters
- Experiment tracking
- Model registry for deployment
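Fine-tuning runs can be browsed with the standard MLflow client; the tracking URI here is an example and depends on your deployment:

```python
import mlflow

mlflow.set_tracking_uri("http://localhost:5500")  # example URI

# List experiments and count the runs tracked in each.
for exp in mlflow.search_experiments():
    runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])
    print(exp.name, len(runs), "runs")
```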
Queue System¶
Redis serves as the task queue and provides:
- Distributed task scheduling
- Worker coordination
- Job priority management
- Failure handling and retries
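The sketch below illustrates the producer/consumer pattern with raw Redis lists. It is a simplification: the workers sit behind a task framework layered on Redis, which adds the priority and retry handling listed above:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Producer: the API service enqueues a task for a worker.
r.lpush("fine_tuning_tasks", json.dumps({"task_id": "abc", "retries": 0}))

# Consumer: a worker blocks until a task arrives, then processes it.
_, raw = r.brpop("fine_tuning_tasks")
task = json.loads(raw)
```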
Data Flow¶
The typical data flow in Embedding Studio follows these stages:
1. Content Ingestion:
   - Content is loaded via data loaders from S3, GCP, or databases
   - Documents are preprocessed and split into appropriate chunks
   - Initial embeddings are generated using base models
2. User Interaction:
   - Users search or interact with content
   - Clickstream data is collected via API endpoints
   - Sessions are processed and converted to training signals
3. Fine-Tuning Process:
   - Training data is prepared from user interactions
   - Models are fine-tuned using the specified method
   - Experiments are tracked in MLflow
   - The best model version is selected for deployment
4. Model Deployment:
   - The fine-tuned model is packaged for Triton
   - The inference service is updated with the new model
   - Content is reindexed with the improved model
   - A/B testing may be performed to validate improvements
5. Search and Retrieval:
   - Queries are embedded using the fine-tuned model
   - Vector similarity search is performed
   - Results are ranked and returned to users
   - The cycle continues with new interactions
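Tying the retrieval stage together, here is a minimal in-memory sketch of cosine-similarity search (in production the query embedding comes from Triton and the search runs inside pgvector):

```python
import numpy as np

def search(query_vec, item_vecs, item_ids, k=5):
    """Return the k items most similar to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ q
    top = np.argsort(-scores)[:k]
    return [(item_ids[i], float(scores[i])) for i in top]
```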
Plugin Integration Points¶
Embedding Studio's architecture is highly extensible through plugins that can customize:
- Data Ingestion: Custom data loaders for specific sources
- Text Processing: Specialized text processors and tokenizers
- Image Processing: Custom image transformations and models
- Fine-Tuning Methods: Application-specific training approaches
- Vector Adjustments: Custom embedding improvement techniques
- Query Processing: Specialized query understanding and expansion
- Search Optimization: Custom ranking and filtering logic
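A plugin is typically a Python class that implements a set of hooks covering these integration points. The skeleton below is hypothetical; the actual base class and required method names are defined by the Embedding Studio plugin API:

```python
class MyCustomFineTuningMethod:
    """Hypothetical plugin skeleton; method names are placeholders."""

    name = "My Custom Fine Tuning Method"

    def get_data_loader(self):
        """Return a loader for the data source this plugin ingests from."""
        ...

    def get_fine_tuning_builder(self, clickstream):
        """Turn collected sessions into a configured training job."""
        ...
```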
Resource Requirements¶
The system has different resource needs for different components:
- Fine-Tuning Worker: Requires GPU acceleration (NVIDIA CUDA)
- Inference Worker: Benefits from GPU for high throughput
- Vector Database: Needs sufficient memory for index performance
- API and Other Workers: CPU-bound, moderate memory requirements
In the next section, we'll explore the environment variables and configuration options that control this architecture.