How Payload Filters Convert to SQL for pgvector¶

Introduction¶

Embedding Studio uses pgvector as its vector database backend, which means payload filters need to be translated into efficient SQL queries. This document explains how various payload filter queries are converted to SQL for execution against the PostgreSQL database with pgvector extension.

The Translation Process¶

When you submit a payload filter to Embedding Studio's API, it undergoes a two-step translation process:

The payload filter structure (JSON/Python objects) is parsed and validated
The filter is translated into SQL expressions through one of two mechanisms:
ORM-based filters (using SQLAlchemy)
Raw SQL string generation

This translation happens primarily in the translate_query_to_orm_filters and translate_query_to_sql_filters functions in the query_to_sql.py module.

Basic Query Translation Examples¶

Let's examine how each type of filter gets translated to SQL:

MatchQuery (Text Search)¶

Payload Filter:

MatchQuery(match=SingleTextValueQuery(field="description", value="smartphone"))

Generated SQL:

to_tsvector('simple', jsonb_extract_path_text(payload, 'description')) @@ to_tsquery('simple', 'smartphone')

This uses PostgreSQL's full-text search capabilities with the to_tsvector and to_tsquery functions.

TermQuery (Exact Match)¶

Payload Filter:

TermQuery(term=SingleValueQuery(field="category", value="electronics"))

Generated SQL:

(payload ->> 'category') = 'electronics'

For numeric values, the SQL includes casting:

((payload ->> 'price')::numeric) = 299.99

TermsQuery (Multiple Values)¶

Payload Filter:

TermsQuery(terms=MultipleValuesQuery(field="category", values=["electronics", "gadgets"]))

Generated SQL:

(payload ->> 'category') IN ('electronics', 'gadgets')

ListHasAllQuery (Array Contains All)¶

Payload Filter:

ListHasAllQuery(all=MultipleValuesQuery(field="tags", values=["waterproof", "bluetooth"]))

Generated SQL:

payload -> 'tags' ?& array['waterproof', 'bluetooth']

This uses PostgreSQL's ?& operator, which checks if the JSON array contains all specified elements.

ListHasAnyQuery (Array Contains Any)¶

Payload Filter:

ListHasAnyQuery(any=MultipleValuesQuery(field="tags", values=["discount", "sale"]))

Generated SQL:

payload -> 'tags' ?| array['discount', 'sale']

This uses the ?| operator to check if the JSON array contains any of the specified elements.

ExistsQuery (Field Exists)¶

Payload Filter:

ExistsQuery(field="discount_percentage")

Generated SQL:

payload ? 'discount_percentage'

The ? operator checks if the JSON object contains the specified key.

WildcardQuery (Pattern Matching)¶

Payload Filter:

WildcardQuery(wildcard=SingleTextValueQuery(field="model", value="iphone*"))

Generated SQL:

to_tsvector('simple', jsonb_extract_path_text(payload, 'model')) @@ to_tsquery('simple', 'iphone:*')

Notice how the asterisk is converted to PostgreSQL's :* syntax for prefix matching.

RangeQuery (Numeric Range)¶

Payload Filter:

RangeQuery(field="price", range=RangeCondition(gte=100, lte=500))

Generated SQL:

((payload ->> 'price')::numeric) >= 100 AND ((payload ->> 'price')::numeric) <= 500

This casts the JSON field to a numeric type and applies the range conditions.

Boolean Query Translation¶

The BoolQuery type allows combining multiple conditions with boolean logic. Here's how it translates:

Payload Filter:

BoolQuery(
    must=[
        TermQuery(term=SingleValueQuery(field="category", value="electronics")),
        MatchQuery(match=SingleTextValueQuery(field="description", value="smartphone"))
    ],
    filter=[
        RangeQuery(field="price", range=RangeCondition(gte=200, lte=800))
    ],
    should=[
        TermQuery(term=SingleValueQuery(field="brand", value="Apple")),
        TermQuery(term=SingleValueQuery(field="brand", value="Samsung"))
    ],
    must_not=[
        TermQuery(term=SingleValueQuery(field="condition", value="refurbished"))
    ]
)

Generated SQL:

(
  (
    ((payload ->> 'category') = 'electronics')
    AND
    (to_tsvector('simple', jsonb_extract_path_text(payload, 'description')) @@ to_tsquery('simple', 'smartphone'))
  )
  AND
  (
    ((payload ->> 'price')::numeric) >= 200 AND ((payload ->> 'price')::numeric) <= 800
  )
  AND
  (
    ((payload ->> 'brand') = 'Apple') OR ((payload ->> 'brand') = 'Samsung')
  )
  AND
  NOT (
    ((payload ->> 'condition') = 'refurbished')
  )
)

The boolean logic is translated as: - must clauses are combined with AND - should clauses are combined with OR - filter clauses are functionally similar to must and use AND - must_not clauses are wrapped in NOT ()

Direct Database Column vs. Payload Field¶

Each query type supports a force_not_payload flag that determines whether the field is in the JSON payload or a direct database column:

With force_not_payload=True:

TermQuery(term=SingleValueQuery(field="user_id", value="123", force_not_payload=True))

Generated SQL:

user_id = '123'

Instead of accessing the payload JSON, this targets the actual user_id column in the database.

Combining with Vector Search¶

The real power comes when combining payload filtering with vector similarity search. This happens through specialized SQL functions in pgvector.

For example, a combined query might generate a SQL statement like:

WITH filtered_objects AS (
  SELECT 
    object_id, payload, storage_meta, user_id, original_id
  FROM dbo_model_name
  WHERE (payload ->> 'category') = 'electronics'
), 
prefiltered_vectors AS (
  SELECT
    op.object_id, op.part_id, o.payload, o.storage_meta, o.user_id, o.original_id,
    (op.vector <=> '[0.1, 0.2, ...]'::vector) AS distance
  FROM dbop_model_name op
  INNER JOIN filtered_objects o ON op.object_id = o.object_id 
  WHERE op.vector <=> '[0.1, 0.2, ...]'::vector <= 0.5
)
SELECT 
  object_id, payload, storage_meta, user_id, original_id,
  ARRAY_AGG(part_id) AS part_ids,
  MIN(distance) AS distance
FROM prefiltered_vectors
GROUP BY object_id, payload, storage_meta, user_id, original_id
ORDER BY distance ASC
LIMIT 10 OFFSET 0;

This SQL: 1. First filters objects by payload metadata 2. Then computes vector distances only for objects that passed the filter 3. Finally aggregates and returns the results

Advanced SQL Functions¶

Embedding Studio uses a set of specialized SQL functions for efficient vector search:

Simple search functions (simple_*): For basic vector similarity search without complex filters
Advanced search functions (advanced_*): For combining vector search with payload filtering
Similarity-ordered functions (*_so_*): Prioritize similarity in result ordering
With-vectors functions (*_v_*): Include vector values in results

These functions are created when a collection is initialized and are optimized for different search scenarios.

The Translation Pipeline¶

The complete translation pipeline works as follows:

API receives a payload filter
The filter is parsed into Python objects
Based on the search type and parameters, an appropriate search strategy is selected:
Simple search (vector only)
Advanced search (vector + filter)
Similarity-ordered search
With or without returning vectors
The appropriate SQL function is called with parameters based on the filter
Results are processed and returned

SQL Optimization Techniques¶

Several optimization techniques are used when translating payload filters to SQL:

1. Pre-filtering¶

Where possible, Embedding Studio first filters objects by metadata before computing vector distances, which is more computationally expensive.

2. Pagination Management¶

The system uses SQL's LIMIT and OFFSET for efficient pagination.

3. Specialized Indexes¶

PostgreSQL's JSONB indexes are used for payload fields that are frequently queried.

4. Custom SQL Functions¶

Pre-compiled SQL functions optimize common query patterns.

5. Materialized CTEs¶

Common Table Expressions with the materialized hint force PostgreSQL to materialize intermediate results, improving performance for complex queries.

Example: Full Translation Process¶

Let's trace a complete translation example:

Original API request:

{
  "search_query": "smartphone",
  "filter": {
    "query": {
      "bool": {
        "must": [
          {"term": {"field": "category", "value": "electronics"}}
        ],
        "filter": [
          {"field": "price", "range": {"gte": 300, "lte": 1000}}
        ]
      }
    }
  },
  "limit": 10
}

Parsed into Python objects:

SimilaritySearchRequest(
    search_query="smartphone",
    filter=PayloadFilter(
        query=BoolQuery(
            must=[TermQuery(term=SingleValueQuery(field="category", value="electronics"))],
            filter=[RangeQuery(field="price", range=RangeCondition(gte=300, lte=1000))]
        )
    ),
    limit=10
)

Payload filter translated to SQL:

(
  ((payload ->> 'category') = 'electronics')
  AND
  (((payload ->> 'price')::numeric) >= 300 AND ((payload ->> 'price')::numeric) <= 1000)
)

Embedded in a vector search SQL function call:

SELECT
  result_object_id, result_payload, result_storage_meta, result_user_id, 
  result_original_id, result_part_ids, result_distance, subset_count
FROM advanced_so_model_name_cosine(
  '[0.1, 0.2, ...]'::vector,   -- Vectorized search query
  NULL,                        -- user_id parameter
  $FILTER$(                    -- The translated filter
    ((payload ->> 'category') = 'electronics')
    AND
    (((payload ->> 'price')::numeric) >= 300 AND ((payload ->> 'price')::numeric) <= 1000)
  )$FILTER$,
  10,                          -- limit parameter
  0,                           -- offset parameter
  NULL,                        -- max_distance parameter
  50,                          -- enlarged_limit parameter
  0,                           -- enlarged_offset parameter
  FALSE                        -- average_only parameter
);

Results returned: The query executes and returns the appropriate records, which are then processed into the API response format.

Conclusion¶

The translation of payload filters to SQL in Embedding Studio is a sophisticated process that bridges the gap between user-friendly filter definitions and efficient database queries. By leveraging PostgreSQL's JSONB capabilities and pgvector's similarity search functions, Embedding Studio provides a powerful yet performant way to filter vector search results by metadata.

Understanding this translation process can help you:

Design more efficient payload filters
Troubleshoot unexpected search results
Optimize database indexes for your specific query patterns
Balance vector similarity with metadata filtering for optimal search experiences

Whether you're building a semantic product search, document retrieval system, or recommendation engine, the payload filter to SQL translation is a key part of making your vector search both powerful and precise.