Merged Documentation¶
Documentation for profiled
¶
Functionality¶
Encapsulates the code profiling process. It launches the profiler, captures runtime statistics, and prints them upon completion.
Parameters¶
- None
Usage¶
- Purpose - Measures the performance of a code block using Python's cProfile.
Example¶
with profiled():
# Your code to benchmark
Documentation for PgvectorCollection
¶
General Description¶
PgvectorCollection is a class that manages vector embeddings in a PostgreSQL database using the pgvector extension. It extends the base Collection class and provides methods for CRUD operations, vector similarity search, and payload filtering.
Main Purposes¶
- Manage and store vector embeddings efficiently in PostgreSQL.
- Enable fast similarity search with the pgvector extension.
- Provide mechanisms for CRUD operations and filtering data.
Motivation¶
This class was created to leverage PostgreSQL's capabilities for handling high-dimensional vector data. By using the pgvector extension, it supports scalable and robust vector operations, making it suitable for applications like semantic search.
Inheritance¶
PgvectorCollection inherits from the Collection class, ensuring that it follows a standardized API for vector database operations while introducing specialized functionality enabling vector searches.
Usage Example¶
from sqlalchemy import create_engine
from embedding_studio.vectordb.pgvector.collection import PgvectorCollection
from embedding_studio.vectordb.collection_info_cache import CollectionInfoCache
engine = create_engine("postgresql://user:password@localhost/db")
collection_cache = CollectionInfoCache()
collection = PgvectorCollection(
pg_database=engine,
collection_id="my_collection",
collection_info_cache=collection_cache
)
# Use 'collection' to perform similarity search and other operations.
Documentation for PgvectorCollection.get_info
¶
Functionality¶
This method retrieves metadata for the collection. It uses the collection info cache to return a CollectionInfo object based on the collection ID stored in the instance.
Parameters¶
This method does not require any parameters.
Usage¶
- Purpose: Retrieve collection metadata in a PostgreSQL vector database using the pgvector extension.
Example¶
collection = PgvectorCollection(pg_db, coll_id, cache)
info = collection.get_info()
print(info)
Documentation for PgvectorCollection.get_state_info
¶
Functionality¶
Retrieves state information for a collection stored in a PostgreSQL vector database using the pgvector extension. It returns a CollectionStateInfo
object containing metadata and state details about the collection.
Parameters¶
This method accepts no parameters.
Usage¶
- Purpose: Retrieves the current state information for a given collection.
Example¶
state_info = collection.get_state_info()
print(state_info)
Documentation for PgvectorCollection.lock_objects
¶
Functionality¶
This context manager locks specified objects by their IDs within a transaction. It attempts to acquire the lock up to a maximum number of tries, waiting a fixed time between attempts. Once the lock is acquired, the method yields a session allowing safe operations on the locked objects and commits or rolls back the transaction accordingly.
Parameters¶
object_ids
: List of object IDs to lock.max_attempts
: Maximum attempts to acquire the lock. Default is 5.wait_time
: Time in seconds to wait between attempts. Default is 1.0 sec.
Usage¶
- Purpose: To safely perform operations on a subset of objects by obtaining an exclusive lock in a transactional context.
Example¶
with collection.lock_objects(['id1', 'id2']) as session:
# Perform operations with locked objects
result = session.execute(query)
# Commit is automatic if no exceptions occur
Documentation for PgvectorCollection.insert
¶
Functionality¶
Inserts objects and their associated vector parts into the collection's database. The method converts each provided Object into its corresponding database record and inserts both the object and its parts in a single transaction to ensure consistency.
Parameters¶
objects
: A list of Object instances to be added. Each Object contains payload data, storage metadata, and vector parts.
Usage¶
- Purpose: To add multiple vector objects into the collection in a batch operation.
Example¶
objects = [Object(...), Object(...)]
collection.insert(objects)
Documentation for PgvectorCollection.create_index
¶
Functionality¶
This method creates an HNSW vector index on the object parts table's vector column. It then updates the collection's index state in the cache, ensuring that the collection recognizes the index as created. The index is built only if it does not already exist.
Parameters¶
- None.
Usage¶
Purpose - Establish a vector index for similarity searches within the collection.
Example¶
# Assuming pg_collection is an instance of PgvectorCollection
pg_collection.create_index()
Documentation for PgvectorCollection.upsert
¶
Functionality¶
Update or insert objects with their vector parts. This method either updates an existing record or inserts new records for objects and their corresponding vector parts. Depending on the 'shrink_parts' parameter, it either deletes the old parts before inserting new ones or performs a database upsert of the parts.
Parameters¶
objects
: List of Object instances to upsert. Each instance includes attributes like object_id, payload, storage_meta, user_id, original_id, and a list of parts (with part_id, vector, is_average).shrink_parts
: Boolean flag. If True, deletes existing parts before inserting new ones; if False, it upserts the parts.
Usage¶
- Purpose: Manage the upsertion of objects and vector parts in the Pgvector collection to ensure database consistency during data updates or insertions.
Example¶
objects = [Object(...), Object(...)]
collection.upsert(objects, shrink_parts=True)
Documentation for PgvectorCollection.delete
¶
Functionality¶
Deletes objects and their parts from the collection. This method first removes parts from the parts table to avoid potential deadlocks, then deletes the corresponding objects from the main table. If any error occurs, the entire transaction is rolled back to ensure data integrity.
Parameters¶
object_ids
: List[str] - A list of object IDs that should be deleted.
Usage¶
- Purpose: Remove one or more objects and their associated parts from the collection in a safe, transactional manner.
Example¶
object_ids = ["id1", "id2", "id3"]
collection.delete(object_ids)
Documentation for PgvectorCollection._reset_read_session
¶
Functionality¶
Reconnects to the PostgreSQL database to create and return a new SQLAlchemy session for persistent read-only operations. This method ensures that read operations use a valid and active connection.
Parameters¶
This method does not take any parameters.
Returns¶
- A new SQLAlchemy session used for persistent read operations.
Usage¶
- Purpose: Reset and obtain a fresh read session when the current connection is closed or outdated.
Example¶
collection = PgvectorCollection(pg_database, collection_id, cache)
session = collection._reset_read_session()
result = session.execute(query)
Documentation for PgvectorCollection._with_read_session
¶
Functionality¶
This method executes the provided query function using a persistent read session. It first checks if the session's connection is open, and if not, resets the session. If an error occurs during the query, it falls back to a traditional session.
Parameters¶
query_func
: A function that accepts a session parameter and performs database queries.
Usage¶
- Purpose: Centralizes read query execution with a fallback mechanism.
Example¶
def example_query(session):
return session.execute("SELECT * FROM my_table").fetchall()
result = collection._with_read_session(example_query)
Documentation for PgvectorCollection.find_by_ids
¶
Functionality¶
This method retrieves objects from the collection using a list of IDs. It runs a database query via a persistent read session. If the connection fails, it falls back to a normal session.
Parameters¶
object_ids
: List[str] - A list of object IDs to be found.
Return Value¶
Returns a list of objects that match the provided IDs.
Usage¶
- Purpose: Retrieve specific objects by their IDs from a PgvectorCollection instance.
Example¶
objects = collection.find_by_ids(["id1", "id2", "id3"])
for obj in objects:
print(obj)
Documentation for PgvectorCollection.query
¶
Functionality¶
Executes a database query using a persistent read session. If the persistent session is closed or encounters an error, it falls back to a traditional session. This design ensures robust query execution without interruption.
Parameters¶
query_func
: A function that accepts a session object and performs database operations. The session provided is either the persistent read session or a fallback traditional session.
Usage¶
- Purpose: To execute database queries reliably by first using a persistent read connection, and if that fails, switching to a regular session.
Example¶
# Define a query function that takes a session parameter
def my_query(session):
result = session.execute("SELECT * FROM my_table")
return result.fetchall()
# Execute the query through PgvectorCollection
results = collection.query(my_query)
Documentation for PgvectorCollection.find_by_original_ids
¶
Functionality¶
This method retrieves objects from the database using their original IDs. It uses a read-only session to execute a query that returns rows matching the given original object IDs and converts the results into Object instances.
Parameters¶
object_ids
: A list of strings representing the original object IDs used to identify objects in the collection.
Usage¶
- Purpose: Fetch object instances by providing their original identifiers.
Example¶
collection = PgvectorCollection(pg_database, "my_collection", collection_info_cache)
objects = collection.find_by_original_ids(["id1", "id2"])
for obj in objects:
print(obj)
Documentation for PgvectorCollection.find_similarities
¶
Functionality¶
Find objects similar to a provided query vector using similarity search on stored vectors in a PostgreSQL database. It performs a similarity comparison with optional filtering and sorting.
Parameters¶
query_vector
: Vector (list of floats) used for computing similarity.limit
: Maximum number of similar objects to return.offset
: Number of objects to skip (for pagination).max_distance
: Distance threshold to determine similarity.payload_filter
: Filter to constrain objects based on payload.sort_by
: Options to sort the returned objects.user_id
: Filter objects by a specific user identifier.similarity_first
: Flag to prioritize similarity in sorting.meta_info
: Additional metadata for query customization.
Usage¶
- Purpose: Retrieves objects similar to a given query vector while applying optional filtering, sorting, and pagination.
Example¶
results = collection.find_similarities(
query_vector=[0.2, 0.5, 0.3],
limit=10,
offset=0,
max_distance=0.5,
payload_filter=None,
sort_by=None,
user_id='user123',
similarity_first=True,
meta_info={'additional': 'data'}
)
print(results)
Documentation for PgvectorCollection.get_total
¶
Functionality¶
Returns the total count of objects stored in the collection. When the flag is set, it counts only original objects, ignoring any derivatives.
Parameters¶
originals_only
: Boolean flag indicating whether to count only original objects (default is True).
Usage¶
- Counting Objects - Use this method to get a tally of objects in the collection for status or pagination.
Example¶
total = collection.get_total()
print("Total objects:", total)
Documentation for PgvectorCollection.get_objects_common_data_batch
¶
Functionality¶
This method retrieves a batch of common object data from the collection. It queries the total number of objects and fetches object details based on the provided limit and offset. It also calculates the next offset for pagination.
Parameters¶
limit
: Maximum number of objects to retrieve.offset
: Number of objects to skip; optional parameter.originals_only
: If True, only original objects are retrieved.
Usage¶
Purpose - Use this method to obtain a set of object data along with pagination info from a PostgreSQL vector database using the pgvector extension.
Example¶
batch = collection.get_objects_common_data_batch(
limit=50,
offset=0,
originals_only=True
)
print(batch.objects_info)
Documentation for PgvectorCollection.count_by_payload_filter
¶
Functionality¶
Counts the number of objects that match a specific payload filter in the PostgreSQL vector database using the pgvector extension. It applies a filter condition based on the object's payload.
Parameters¶
payload_filter
: A PayloadFilter object that defines the criteria for matching objects.
Return¶
An integer representing the number of objects that satisfy the payload filter criteria.
Usage¶
- Purpose: Quickly obtain the count of objects matching a given payload filter in your vector database.
Example¶
count = collection.count_by_payload_filter(payload_filter)
Documentation for PgvectorQueryCollection
¶
Functionality¶
PgvectorQueryCollection is a query-specific extension of the PgvectorCollection. It adds functionality to handle user queries and operations associated with vector searches in a PostgreSQL setup with pgvector. The class focuses on retrieving objects using a session identifier and other query parameters.
Motivation¶
The class was designed to separate query logic from basic CRUD operations. This separation simplifies the codebase and allows for specialized handling of queries, including vector validations and payload filtering.
Inheritance¶
PgvectorQueryCollection inherits from the following classes: - PgvectorCollection: Provides core functionality for managing vector embeddings in a PostgreSQL database. - QueryCollection: Defines common query operations for vector databases.
Example¶
Below is a simple example demonstrating how to use the class:
# Create an instance of PgvectorQueryCollection
pg_query_collection = PgvectorQueryCollection(
pg_database=engine,
collection_id='example_collection',
collection_info_cache=cache
)
# Retrieve objects by session ID
objects = pg_query_collection.get_objects_by_session_id('session123')
Documentation for PgvectorQueryCollection.get_objects_by_session_id
¶
Functionality¶
This method retrieves objects and their parts based on a given session ID. It performs vector validation to ensure that only valid objects are returned from the database.
Parameters¶
session_id
: A unique identifier for the session to query.
Usage¶
- Purpose: To fetch objects from the database using a session ID, ensuring only the applicable and validated objects are retrieved.
Example¶
# Retrieve objects by session ID
objects = collection.get_objects_by_session_id("session_123")