Tuning and Using Embedding Studio's Improvement System

Current Implementation Details

Embedding Studio's improvement system is implemented as a combination of several interacting components, working together to create a feedback loop between user behavior and search relevance. Let's examine how the current implementation works in practice.

The TorchBasedAdjuster Implementation

The core algorithm in torch_based_adjuster.py uses PyTorch to perform gradient descent optimization on embedding vectors. Here's how the current implementation works:

class TorchBasedAdjuster(VectorsAdjuster):
    def __init__(
        self,
        search_index_info: SearchIndexInfo,
        adjustment_rate: float = 0.1,
        num_iterations: int = 10,
        softmin_temperature: float = 1.0,
    ):
        # Initialize parameters
        self.search_index_info = search_index_info
        self.adjustment_rate = adjustment_rate
        self.num_iterations = num_iterations
        self.softmin_temperature = softmin_temperature

This implementation is based on solid machine learning principles (a minimal sketch of one adjustment round follows the list):

  1. It uses the AdamW optimizer, an adaptive variant of gradient descent that adjusts learning rates per parameter
  2. It computes similarity based on your chosen metric (cosine, dot product, or Euclidean)
  3. It runs multiple optimization iterations to refine vectors gradually
  4. It applies a cubic function to the similarity scores to emphasize significant differences
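
The sketch assumes cosine similarity and simplified tensor shapes; the tensor names and the standalone adjustment_round function are illustrative, not the adjuster's actual internals.

import torch
import torch.nn.functional as F

def adjustment_round(query_vectors, clicked_vectors, non_clicked_vectors,
                     adjustment_rate=0.1, num_iterations=10):
    # Work on copies so the original vectors stay untouched
    clicked = clicked_vectors.detach().clone().requires_grad_(True)
    non_clicked = non_clicked_vectors.detach().clone().requires_grad_(True)
    optimizer = torch.optim.AdamW([clicked, non_clicked], lr=adjustment_rate)

    for _ in range(num_iterations):
        optimizer.zero_grad()
        clicked_sim = F.cosine_similarity(query_vectors, clicked, dim=-1)
        non_clicked_sim = F.cosine_similarity(query_vectors, non_clicked, dim=-1)
        # Cubic emphasis: pull clicked items toward the query, push non-clicked items away
        loss = -torch.mean(clicked_sim**3) + torch.mean(non_clicked_sim**3)
        loss.backward()
        optimizer.step()

    return clicked.detach(), non_clicked.detach()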

The Improvement Pipeline

The worker in worker.py coordinates the improvement process (a simplified sketch of its loop follows the list):

  1. It periodically checks for pending sessions marked for improvement
  2. It processes them in batches to efficiently utilize resources
  3. It handles errors gracefully, ensuring the system remains stable
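
In the sketch, fetch_pending_sessions and process_batch are hypothetical stand-ins for the worker's actual task-queue plumbing.

import logging

logger = logging.getLogger(__name__)

def improvement_worker_cycle(context, batch_size=50):
    # One scheduler tick: fetch pending sessions, process them in batches,
    # and contain failures so one bad batch does not stop the worker.
    pending = fetch_pending_sessions(context)  # hypothetical helper
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        try:
            process_batch(context, batch)  # hypothetical helper that runs the vectors adjuster
        except Exception:
            logger.exception("Failed to process improvement batch")
            # keep going with the next batch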

Vector Personalization

The current implementation creates personalized vectors for each user:

# Create personalized ID by appending user_id
new_object_id = (
    object_id
    if object_id in not_originals
    else f"{object_id}_{user_id}"
)

This means:

  • Each user gets their own optimized vectors
  • New users see the original vectors
  • User preferences don't interfere with each other

A minimal sketch of the resulting vector lookup follows.
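
It assumes a dict-like vector_store; resolve_vector is a hypothetical helper, not the actual Embedding Studio API.

def resolve_vector(vector_store, object_id, user_id):
    # Prefer the user's personalized copy; fall back to the shared original
    personalized_id = f"{object_id}_{user_id}"
    if personalized_id in vector_store:
        return vector_store[personalized_id]  # this user has adjusted vectors for this object
    return vector_store[object_id]  # new users (and untouched objects) see the original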

How to Configure and Tune the System

The improvement system can be tuned through several key parameters, each affecting different aspects of the adjustment process:

1. Adjustment Rate

The adjustment rate (learning rate) controls how aggressively vectors are modified:

adjuster = TorchBasedAdjuster(
    search_index_info=search_index_info,
    adjustment_rate=0.1,  # Configurable
    # ...
)

Recommendation ranges:

  • Conservative: 0.01-0.05
  • Balanced: 0.05-0.2
  • Aggressive: 0.2-0.5

Effects:

  • Higher values create more dramatic changes but may cause overshooting
  • Lower values make more conservative adjustments requiring more iterations
  • For new deployments, start conservative and increase gradually

2. Number of Iterations

This parameter determines how many optimization steps are performed for each batch:

adjuster = TorchBasedAdjuster(
    search_index_info=search_index_info,
    num_iterations=10,  # Configurable
    # ...
)

Recommendation ranges:

  • Quick adjustment: 5-10
  • Standard processing: 10-25
  • Deep optimization: 25-50

Effects:

  • More iterations allow for finer adjustments but increase processing time
  • Fewer iterations are faster but may not fully optimize the vectors
  • Complex embedding spaces typically benefit from more iterations

3. Softmin Temperature

This parameter affects how the system handles multi-part vectors:

adjuster = TorchBasedAdjuster(
    search_index_info=search_index_info,
    softmin_temperature=1.0,  # Configurable
    # ...
)

Recommendation ranges:

  • Sharp focus: 0.1-0.5
  • Balanced: 0.5-2.0
  • Smooth distribution: 2.0-5.0

Effects:

  • Lower values make the minimum distance more influential
  • Higher values smooth out the influence across multiple parts
  • Especially important for document chunking or multi-modal embeddings

A small numeric illustration of the temperature effect follows.
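
The per-part distances here are made up; the weighting mirrors the log-sum-exp code shown later in the multi-part handling section.

import torch

def softmin_weights(distances, temperature):
    # Lower temperature concentrates weight on the smallest distance;
    # higher temperature spreads it across all parts.
    weights = torch.exp(-distances / temperature)
    return weights / weights.sum()

chunk_distances = torch.tensor([0.2, 0.6, 0.9])
print(softmin_weights(chunk_distances, 0.1))  # sharp: nearly all weight on the 0.2 part
print(softmin_weights(chunk_distances, 5.0))  # smooth: weight spread across all parts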

4. Scheduler Timing

Controls how frequently the improvement worker runs:

# In settings.py or environment variables
IMPROVEMENT_SECONDS_INTERVAL=3600  # Run hourly

Recommendation ranges:

  • Real-time systems: 300-900 seconds (5-15 minutes)
  • Balanced systems: 1800-3600 seconds (30-60 minutes)
  • Batch-oriented: 7200-86400 seconds (2-24 hours)

Effects:

  • More frequent runs provide faster feedback but use more resources
  • Less frequent runs are more efficient but delay improvement effects
  • Match the interval to your typical user engagement patterns

A minimal sketch of this periodic schedule follows.
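
It uses only the environment variable named above; the real worker is driven by its own scheduler rather than this loop, and run_improvement_cycle is a hypothetical placeholder.

import os
import time

interval = int(os.getenv("IMPROVEMENT_SECONDS_INTERVAL", "3600"))

while True:
    run_improvement_cycle()  # hypothetical: process sessions marked for improvement
    time.sleep(interval)     # wait until the next scheduled run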

Implementation Example

Here's a complete example of how to configure a custom vectors adjuster in your plugin:

from embedding_studio.embeddings.improvement.torch_based_adjuster import TorchBasedAdjuster
from embedding_studio.models.embeddings.models import SearchIndexInfo

class MyCustomPlugin:
    # ...

    def get_vectors_adjuster(self):
        """
        Returns a configured vectors adjuster instance.
        """
        # Get search index information with distance metrics
        search_index_info = self.get_search_index_info()

        # For e-commerce product search
        if self.is_product_search:
            return TorchBasedAdjuster(
                search_index_info=search_index_info,
                adjustment_rate=0.15,     # Moderately aggressive
                num_iterations=15,        # Standard processing
                softmin_temperature=0.8,  # Slightly focused
            )

        # For document search
        elif self.is_document_search:
            return TorchBasedAdjuster(
                search_index_info=search_index_info,
                adjustment_rate=0.05,      # Conservative
                num_iterations=25,         # More iterations for complex docs
                softmin_temperature=2.0,   # Smoother for chunked documents
            )

        # Default configuration
        else:
            return TorchBasedAdjuster(
                search_index_info=search_index_info,
                adjustment_rate=0.1,
                num_iterations=10,
                softmin_temperature=1.0,
            )

Improvement Foundations

The Science Behind Vector Adjustment: How and Why It Works

To understand why the current implementation works so well, let's explore the science behind it:

Vector Space Models: The Foundation

Embedding Studio is built on the concept of vector space models, where:

  • Words, phrases, and documents are represented as points in a high-dimensional space
  • Similarity between items is measured by their proximity in this space
  • The dimensions of this space capture semantic meaning

This mathematical representation allows us to:

  1. Convert language into numbers (embeddings)
  2. Measure how similar concepts are
  3. Find related content efficiently

A toy similarity computation follows.
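
The three-dimensional vectors are invented purely for the example.

import torch
import torch.nn.functional as F

cat = torch.tensor([0.9, 0.1, 0.0])
kitten = torch.tensor([0.8, 0.2, 0.1])
car = torch.tensor([0.1, 0.0, 0.9])

print(F.cosine_similarity(cat, kitten, dim=0))  # high: nearby points, related meaning
print(F.cosine_similarity(cat, car, dim=0))     # low: distant points, unrelated meaning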

The Learning Process: Gradient Descent

The adjustment algorithm uses gradient descent, a fundamental machine learning technique:

  1. We define a goal: make clicked items more similar to queries and non-clicked items less similar
  2. We measure how far we are from that goal using a loss function
  3. We compute the gradient (direction of steepest improvement)
  4. We take small steps in that direction
  5. We repeat until we're satisfied with the results

This is similar to finding your way down a mountain in fog: you feel which way is steepest and take small steps downward until you reach the valley. The toy example below shows the same idea on a one-dimensional loss.
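
The target value of 3 is made up; any differentiable loss would behave the same way.

# Toy gradient descent: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x = 0.0
learning_rate = 0.1
for _ in range(50):
    gradient = 2 * (x - 3)
    x -= learning_rate * gradient  # step in the direction of steepest descent
print(x)  # approaches 3, the bottom of the "valley"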

Why the Cubic Function Matters

The current implementation uses a cubic function in the loss calculation:

loss = -torch.mean(clicked_similarity**3) + torch.mean(
    non_clicked_similarity**3
)

This is a critical design choice because:

  1. It emphasizes large differences more than small ones
  2. It creates stronger gradients for items that are very wrong
  3. It helps the system focus on fixing the most egregious ranking errors first

In simpler terms, it's like telling the system: "Don't worry too much about items that are almost right, but really fix the ones that are way off."
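
A quick numeric check, with illustrative similarity scores, shows the effect:

# A non-clicked item that is still very similar to the query (0.9) is "way off"
# and contributes far more to the loss than one that is already dissimilar (0.1).
nearly_right = 0.1
way_off = 0.9
print(nearly_right**3)  # 0.001 -> negligible pressure on the optimizer
print(way_off**3)       # 0.729 -> dominates the loss and its gradient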

Multi-Part Document Handling

The code handles multi-part documents (like chunked text) through sophisticated aggregation:

# Differentiable soft minimum using log-sum-exp
softmin_weights = torch.exp(
    -similarities / softmin_temperature
)
softmin_weights = softmin_weights / softmin_weights.sum(
    dim=2, keepdim=True
)
similarities = torch.sum(
    softmin_weights * similarities, dim=2
)

This creates a differentiable version of the minimum function, allowing the system to focus on the most similar part of a document while still being able to compute gradients for learning.

Why Personalization Works

The personalization approach in the current implementation is particularly effective because:

  1. It creates separate vectors for each user's preferences
  2. It preserves the original vectors for new users
  3. It allows contradictory preferences among different users

This is like having a library where the books stay in the same places for new visitors, but regular visitors get their own customized maps showing shortcuts to their favorite sections.

Scientific Benefits of the Current Approach

The Embedding Studio implementation has several scientific advantages:

  1. It's computationally efficient: By using PyTorch and batched processing, it can handle large datasets
  2. It's mathematically sound: Using gradient descent with proper loss functions ensures convergence
  3. It respects different similarity metrics: Working with cosine, dot product, or Euclidean distance
  4. It handles uncertainty gracefully: Using softmin for multi-part documents
  5. It avoids catastrophic forgetting: By creating personalized copies rather than modifying originals

Advanced Tuning Considerations

1. Session Selection Criteria

Not all sessions provide equally valuable signal. Consider implementing custom filters:

def should_use_session_for_improvement(session):
    # Skip sessions with no clicks
    if len(session.events) == 0:
        return False

    # Skip very short sessions (possibly bounces)
    if session.duration_seconds < 10:
        return False

    # Skip ambiguous sessions (too many clicks without clear preference)
    if len(session.events) > 10:
        return False

    # Keep sessions with clicks on lower-ranked results
    # (these indicate potential ranking issues and carry especially valuable signal)
    for event in session.events:
        result_position = get_result_position(session, event.object_id)
        if result_position > 3:  # Clicked something not in top 3
            return True

    return True  # Default: use the session
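
One hypothetical way to wire this filter in before batching, where pending_sessions stands for whatever collection your worker pulls from:

sessions_to_process = [
    session for session in pending_sessions
    if should_use_session_for_improvement(session)
]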

2. Vector Normalization

Different embedding models may require different normalization:

import torch
import torch.nn.functional as F

# Normalize vectors before adjustment (written as a method so it can read
# self.search_index_info; MetricType comes from your Embedding Studio models module)
def prepare_vectors(self, vectors):
    # L2 normalization for models using cosine similarity
    if self.search_index_info.metric_type == MetricType.COSINE:
        return F.normalize(vectors, p=2, dim=-1)

    # Min-max scaling for dot product models
    elif self.search_index_info.metric_type == MetricType.DOT:
        min_val = torch.min(vectors)
        max_val = torch.max(vectors)
        return (vectors - min_val) / (max_val - min_val)

    # No normalization for Euclidean distance
    else:
        return vectors

3. Multi-part Vector Handling

For models that produce multiple vectors per item (like chunked documents), the aggregation strategy matters:

# In your plugin configuration
search_index_info = SearchIndexInfo(
    metric_type=MetricType.COSINE,
    # Choose aggregation strategy based on content type
    metric_aggregation_type=MetricAggregationType.MIN  # or AVG
)

Guidance for choosing aggregation:

  • MIN: Better for finding exact matches within documents
  • AVG: Better for overall document relevance
  • Document search typically benefits from MIN
  • Product/image search typically benefits from AVG

A small made-up example of the difference follows.
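
The per-chunk distances are illustrative values only.

# Per-chunk distances between a query and one document's chunks (illustrative values)
chunk_distances = [0.1, 0.7, 0.8]

min_score = min(chunk_distances)                         # 0.1  -> ranked by its best-matching chunk
avg_score = sum(chunk_distances) / len(chunk_distances)  # ~0.53 -> ranked by overall match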

4. Feedback Loop Management

Be cautious of reinforcement bias where the system becomes too focused on existing patterns:

import random

# Example: Mix in exploration results
def get_search_results(query, user_id, exploration_ratio=0.2):
    # Get personalized results
    personalized_results = get_personalized_results(query, user_id)

    # Get non-personalized results
    standard_results = get_standard_results(query)

    # Mix results to maintain exploration
    final_results = []
    for i in range(len(personalized_results)):
        if random.random() < exploration_ratio:
            # Include some non-personalized results
            final_results.append(standard_results[i])
        else:
            # Use personalized results
            final_results.append(personalized_results[i])

    return final_results

Practical Implementation Steps

To implement the improvement system effectively:

  1. Enable clickstream collection:

    # Register search session
    session = Session(
        session_id=str(uuid.uuid4()),
        search_query=query,
        user_id=user_id,
        created_at=datetime_utils.utc_timestamp()
    )
    context.clickstream_dao.register_session(session)
    
    # Record click events
    event = SessionEvent(
        session_id=session_id,
        object_id=clicked_item_id,
        created_at=datetime_utils.utc_timestamp()
    )
    context.clickstream_dao.push_events([event])
    

  2. Schedule improvement processing:

    # Mark session for improvement
    context.sessions_for_improvement.create(
        schema=SessionForImprovementCreateSchema(
            session_id=session_id,
        )
    )
    

  3. Configure improvement parameters:

    # In your settings.py or environment variables
    ADJUSTMENT_RATE=0.1
    NUM_ITERATIONS=15
    SOFTMIN_TEMPERATURE=1.0
    IMPROVEMENT_SECONDS_INTERVAL=1800
    
By carefully tuning these parameters and monitoring the results, you can optimize the improvement system for your specific use case, embedding models, and user behavior patterns.