Understanding Query Parsing in Embedding Studio¶
Query parsing is a powerful feature in Embedding Studio that understands and categorizes user search queries by mapping them to relevant categories. This tutorial explains how the query parsing system works, its architecture, and the underlying algorithms that make it effective.
Core Concepts¶
Query parsing in Embedding Studio is built around a few fundamental concepts:
1. Vector-Based Category Matching¶
Rather than using traditional keyword matching, Embedding Studio uses vector embeddings to semantically match search queries to categories:
- Embedding-Based Matching: Converts the user's search query into a vector representation
- Semantic Understanding: Captures the meaning of the query instead of just matching keywords
- Category Vectors: Categories are also represented as vectors for accurate matching
2. Category Selection¶
The system uses sophisticated selection strategies to determine which categories best match a query:
- Similarity Scoring: Calculates how similar a query is to each potential category
- Threshold-Based Selection: Uses distance/similarity thresholds to identify relevant matches
- Multiple Selection Strategies: Different selector implementations for various use cases
3. Distance-Based Selectors¶
The system includes various selector implementations that filter results based on distance metrics:
- DistBasedSelector: Abstract base class for selectors that work with distances
- ProbsDistBasedSelector: Uses probability calculations for nuanced selection
- VectorsBasedSelector: Works directly with vector representations for advanced matching
How Query Parsing Works: The Algorithm¶
When a user submits a search query, here's how Embedding Studio processes it:
Step 1: Query Vectorization¶
# Retrieve the query retriever and inference client
query_retriever = plugin.get_query_retriever()
inference_client = plugin.get_inference_client_factory().get_client(
    collection_info.embedding_model.id
)
# Convert the search query to vector format
search_query = query_retriever(search_query)
query_vector = inference_client.forward_query(search_query)[0]
- The raw text query is processed by a query retriever specific to the model
- The processed query is then converted to a vector using the inference client
- This vector representation captures the semantic meaning of the search query
Step 2: Similar Category Search¶
# Search for similar categories in the vector database
found_objects, _ = collection.find_similar_objects(
    query_vector=query_vector.tolist(),
    offset=0,
    limit=plugin.get_max_similar_categories(),
    max_distance=plugin.get_max_margin(),
    with_vectors=categories_selector.vectors_are_needed
)
- The query vector is compared against category vectors in the database
- A similarity search returns categories that are semantically similar
- Results are limited by the maximum number of categories and distance threshold
Step 3: Category Selection¶
# Apply the category selector to filter the results
final_indexes = categories_selector.select(found_objects, query_vector)
results = []
for index in final_indexes:
    results.append(found_objects[index])
- A category selector is applied to the candidate categories
- The selector implements a strategy to filter for the most relevant matches
- Only categories meeting the selection criteria are returned to the user
Selector Types and Algorithms¶
Distance-Based Selector¶
The foundational DistBasedSelector works with pre-calculated distance values:
def select(self, categories, query_vector=None):
    # Convert distance values to a normalized tensor
    values = self._convert_values(categories)
    # Apply margin threshold
    positive_threshold_min = (1 - self._margin) if self._is_similarity else self._margin
    corrected_values = values - positive_threshold_min
    # Calculate binary selection labels (implemented by subclasses)
    bin_labels = self._calculate_binary_labels(corrected_values)
    # Return indices of selected objects
    return torch.nonzero(bin_labels).T[0].tolist()
This selector:

1. Normalizes distance values based on the metric type
2. Applies the configured margin threshold
3. Delegates the final decision logic to subclasses
Probability-Based Selector¶
The ProbsDistBasedSelector extends the base selector with probability calculations:
def _calculate_binary_labels(self, corrected_values):
    return (
        torch.sigmoid(corrected_values * self._scale)
        > self._prob_threshold
    )
This selector:

1. Applies a sigmoid function to convert distances to probabilities (0-1 range)
2. Uses a probability threshold to determine which categories to select
3. Allows for more nuanced selection with configurable scaling and thresholds
Vector-Based Selector¶
For advanced matching scenarios, the VectorsBasedSelector works directly with embedding vectors:
def select(self, categories, query_vector):
    # Get tensor representation of categories
    category_vectors = self._get_categories_tensor(categories)
    # Calculate distances between query and category vectors
    values = self._calculate_distance(
        query_vector,
        category_vectors,
        self._softmin_temperature,
        self._is_similarity
    )
    # Apply threshold and selection
    positive_threshold_min = (1 - self._margin) if self._is_similarity else self._margin
    corrected_values = values - positive_threshold_min
    bin_labels = self._calculate_binary_labels(corrected_values)
    return torch.nonzero(bin_labels).T[1].tolist()
This selector:

1. Works with the raw vectors rather than pre-calculated distances
2. Can implement more complex distance calculations between query and categories
3. Supports various metrics (cosine, dot product, Euclidean) and aggregation methods
Distance Metrics and Selection Strategies¶
The system supports multiple distance metrics for comparing vectors:
Metric Types¶
- Cosine Similarity: Measures the cosine of the angle between vectors (value between -1 and 1)
- Euclidean Distance: Measures the straight-line distance between vectors
- Dot Product: Measures vector similarity through their dot product
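To see how these metrics behave, here is a minimal sketch that compares a toy query vector against a single category vector with PyTorch; the vectors and values are made up purely for illustration and are not tied to any Embedding Studio model:

import torch
import torch.nn.functional as F

query = torch.tensor([0.2, 0.8, 0.1])
category = torch.tensor([0.25, 0.7, 0.05])

# Cosine similarity: 1.0 means identical direction, -1.0 means opposite.
cosine = F.cosine_similarity(query, category, dim=0)

# Euclidean distance: 0.0 means identical vectors; larger means less similar.
euclidean = torch.dist(query, category, p=2)

# Dot product: grows with both alignment and magnitude.
dot = torch.dot(query, category)

print(f"cosine={cosine.item():.3f}  euclidean={euclidean.item():.3f}  dot={dot.item():.3f}")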
Selection Strategies¶
Different selection strategies can be employed depending on your needs:
- Threshold-Based: Select categories with distances below (or similarities above) a threshold
- Probability-Based: Convert distances to probabilities and select based on probability threshold
- Top-K: Select the top K most similar categories regardless of absolute distance
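The sketch below illustrates all three strategies against a handful of hypothetical distances; the margin, scale, probability threshold, and K values are illustrative, not the defaults used by Embedding Studio:

import torch

# Hypothetical distances returned by the similarity search (lower = closer).
distances = torch.tensor([0.15, 0.22, 0.48, 0.61])

# Threshold-based: keep categories whose distance is below a margin.
margin = 0.3
threshold_selected = torch.nonzero(distances < margin).flatten().tolist()

# Probability-based: map margin-corrected distances through a sigmoid
# and keep categories whose probability clears a threshold.
scale, prob_threshold = 10.0, 0.5
probs = torch.sigmoid((margin - distances) * scale)
prob_selected = torch.nonzero(probs > prob_threshold).flatten().tolist()

# Top-K: keep the K closest categories regardless of absolute distance.
k = 2
topk_selected = torch.topk(-distances, k).indices.tolist()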
Using the Query Parsing API¶
The query parsing functionality is exposed through a REST API endpoint:
POST /parse-query/categories
{
"search_query": "wireless headphones with noise cancellation"
}
Response:
{
    "categories": [
        {
            "object_id": "headphones",
            "distance": 0.15,
            "payload": {
                "name": "Headphones",
                "parent_category": "Audio Equipment"
            }
        },
        {
            "object_id": "noise_cancellation",
            "distance": 0.22,
            "payload": {
                "name": "Noise Cancellation",
                "parent_category": "Audio Features"
            }
        }
    ]
}
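If you want to call this endpoint from application code, a minimal client sketch could look like the following; the base URL and port are assumptions and should be adapted to your deployment:

import requests

# Assumed base URL; point this at your Embedding Studio instance.
BASE_URL = "http://localhost:5000"

response = requests.post(
    f"{BASE_URL}/parse-query/categories",
    json={"search_query": "wireless headphones with noise cancellation"},
)
response.raise_for_status()

# Print each matched category with its distance to the query.
for category in response.json()["categories"]:
    print(category["object_id"], category["distance"])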
Behind the Scenes: pgvector Integration¶
Embedding Studio's query parsing leverages pgvector, a PostgreSQL extension for storing and searching vector embeddings:
Collection and Vector Database¶
The system uses:

- PgvectorDb: Handles high-level database operations
- PgvectorCollection: Represents a collection of vectors
- Collection: Interface for vector collections
The search process relies on SQL functions that implement efficient vector search algorithms:
SELECT ... FROM vector_search_function(
    '{query_vector}'::vector,
    limit,
    offset,
    max_distance,
    '{metadata}'
);
Performance Considerations¶
Several optimizations enhance query parsing performance:
- HNSW Indexes: Uses Hierarchical Navigable Small World (HNSW) graphs for efficient approximate nearest neighbor search
- Batch Processing: Processes categories in batches for memory efficiency
- Parallel Search: Implements concurrent search strategies where appropriate
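As an illustration of the first point, a pgvector HNSW index is created with a plain SQL statement. The sketch below issues it through psycopg2; the connection string, table name, column name, and index parameters are assumptions to adapt to your own schema:

import psycopg2

# Hypothetical connection parameters; adjust for your deployment.
conn = psycopg2.connect("dbname=embedding_studio user=postgres")

with conn, conn.cursor() as cur:
    # Build an HNSW index over the category embeddings so that approximate
    # nearest-neighbor search stays fast as the number of categories grows.
    cur.execute(
        """
        CREATE INDEX IF NOT EXISTS categories_embedding_hnsw_idx
        ON categories USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);
        """
    )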
Customization and Extension¶
The query parsing system is designed to be customizable:
- Custom Selectors: Create new selector implementations by extending the base classes
- Embedding Models: Change the embedding model to match your specific domain
- Distance Metrics: Configure the distance metric and threshold to control match sensitivity
- Category Structure: Define your own category hierarchy to match your business needs
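As an example of the first point, a custom selector only needs to expose the same select(found_objects, query_vector) interface used in Step 3. The sketch below implements a simple top-K strategy; it is not part of Embedding Studio, and the .distance attribute on the found objects is an assumption that may differ in your installation:

import torch


class TopKSelector:
    """Minimal custom selector sketch: keeps the K closest categories."""

    # Raw vectors are not required for this strategy (see Step 2).
    vectors_are_needed = False

    def __init__(self, k: int = 3):
        self._k = k

    def select(self, categories, query_vector=None):
        # Assumes each found object exposes a pre-calculated `.distance`.
        distances = torch.tensor([c.distance for c in categories])
        k = min(self._k, len(categories))
        # Smallest distances first; return indices like the built-in selectors.
        return torch.topk(-distances, k).indices.tolist()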