Clickstream Management in Embedding Studio¶
Introduction¶
Clickstream data is a vital resource for improving search quality in Embedding Studio. It records how users interact with search results, providing insights that can be used to fine-tune embedding models and improve relevance. This guide explains how clickstream data flows through the system and how to leverage it effectively.
Core Concepts¶
- Session: A continuous user interaction with the search system
- Search Events: Individual user actions within a session (clicks, views, etc.)
- Relevance Feedback: Converting user interactions into training signals
- Model Improvement: Using session data to fine-tune embedding models
Architecture Overview¶
The clickstream system consists of several components:
- Client API endpoints: For registering sessions and events
- Internal API endpoints: For processing and utilizing session data
- Data Access Objects (DAOs): For storing and retrieving session data
- Converters: For transforming session data into fine-tuning inputs
- Workers: For processing sessions and improving models
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Client Events  │────▶│  Session Store  │────▶│  Model Training │
└─────────────────┘     └─────────────────┘     └─────────────────┘
API Endpoints¶
Client-Facing Endpoints¶
The clickstream_client.py module provides these endpoints for client applications:
1. Create Session¶
@router.post("/session", status_code=status.HTTP_200_OK)
def create_session(body: SessionCreateRequest) -> None:
    """
    Creates a new user interaction session with search data.
    """
Use this endpoint when a user starts a new search session. It initializes tracking for that session.
2. Get Session¶
@router.get("/session", status_code=status.HTTP_200_OK, response_model=SessionGetResponse)
def get_session(session_id: str) -> SessionWithEvents:
    """
    Retrieves a complete session by ID with all related interaction events.
    """
Use this to retrieve a complete session with all its events, useful for debugging or analytics.
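As a rough illustration, a client could build the request URL as sketched below. The base URL and path prefix are borrowed from the request examples elsewhere in this guide, and `session_id` is passed as a query parameter because the handler takes it as a plain `str` argument on a GET route. The helper name `build_get_session_url` is hypothetical and not part of Embedding Studio.

```python
from urllib.parse import urlencode

# Assumption: same base URL and path prefix as the other examples in this guide.
BASE_URL = "https://api.embeddingstudio.com/api/v1/clickstream"


def build_get_session_url(session_id: str, base_url: str = BASE_URL) -> str:
    """Build the GET /session URL, passing session_id as a query parameter."""
    return f"{base_url}/session?{urlencode({'session_id': session_id})}"


url = build_get_session_url("user_123_search_456")
# The URL can then be fetched with any HTTP client, e.g. requests.get(url).
```

Encoding the ID with `urlencode` keeps the request valid even when session IDs contain spaces or other reserved characters.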
3. Push Events¶
@router.post("/session/events", status_code=status.HTTP_200_OK)
def push_events(body: SessionAddEventsRequest) -> None:
    """
    Adds user interaction events to an existing session.
    """
Use this endpoint to record user interactions such as clicks, views, or purchases.
4. Mark Session Irrelevant¶
@router.post("/session/irrelevant", status_code=status.HTTP_200_OK)
def mark_session_irrelevant(body: SessionMarkIrrelevantRequest) -> None:
    """
    Flags a session as irrelevant for analytics and model improvement.
    """
Use this to exclude certain sessions from training data, such as bot sessions or test searches.
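One way a client might decide which sessions to flag is a simple rate heuristic, sketched below. The threshold, both helper names, and the assumption that the request body carries a single `session_id` field are illustrative only; the actual fields of SessionMarkIrrelevantRequest should be checked against the schema.

```python
from typing import Dict, List

# Hypothetical threshold: more than this many events per second looks automated.
MAX_EVENTS_PER_SECOND = 5.0


def looks_automated(event_timestamps: List[int]) -> bool:
    """Return True when events arrive faster than a human plausibly could click."""
    if len(event_timestamps) < 2:
        return False
    duration = max(event_timestamps) - min(event_timestamps)
    # Many events within the same second: treat the duration as one second.
    rate = len(event_timestamps) / max(duration, 1)
    return rate > MAX_EVENTS_PER_SECOND


def irrelevant_payload(session_id: str) -> Dict[str, str]:
    """Body for POST /session/irrelevant (assumes a single session_id field)."""
    return {"session_id": session_id}


timestamps = [1700000000] * 20  # 20 events recorded within the same second
payload = None
if looks_automated(timestamps):
    payload = irrelevant_payload("user_123_search_456")
```

A production filter would normally combine several signals (user agent, IP reputation, known test accounts) rather than rely on event rate alone.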
Internal Endpoints¶
The clickstream_internal.py module provides these endpoints for system processes:
1. Use Session for Improvement¶
@router.post("/session/use-for-improvement", status_code=status.HTTP_200_OK)
def use_session_for_improvement(body: UseSessionForImprovementRequest) -> None:
    """
    Submits a session for use in search quality improvement processes.
    """
This internal endpoint adds a session to the model improvement queue.
2. Get Batch Sessions¶
@router.get("/batch/sessions", status_code=status.HTTP_200_OK, response_model=BatchSessionsGetResponse)
def get_batch_sessions(batch_id: str, after_number: int = 0, limit: int = 10, events_limit: int = 100):
    """
    Retrieves a paginated batch of sessions for processing.
    """
This internal endpoint returns the sessions of a batch page by page so that workers can process them.
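The pagination parameters suggest a cursor-style loop, which might look like the sketch below. It assumes that `after_number` is an exclusive cursor and that each returned session exposes a monotonically increasing `session_number`; both are assumptions to verify against the response schema. The `fetch_page` callable and the fake fetcher stand in for the real HTTP call, so the loop can be read and run on its own.

```python
from typing import Callable, Dict, Iterator, List

# A page fetcher takes (after_number, limit) and returns a list of session dicts.
FetchPage = Callable[[int, int], List[Dict]]


def iter_batch_sessions(fetch_page: FetchPage, limit: int = 10) -> Iterator[Dict]:
    """Yield every session in a batch, one page at a time."""
    after_number = 0
    while True:
        page = fetch_page(after_number, limit)
        if not page:
            return
        yield from page
        # Assumption: "session_number" of the last session is the next cursor.
        after_number = page[-1]["session_number"]
        if len(page) < limit:
            return  # a short page means there is nothing left to fetch


# Stand-in for the real endpoint: 25 fake sessions numbered 1..25.
_FAKE = [{"session_id": f"s{n}", "session_number": n} for n in range(1, 26)]


def fake_fetch(after_number: int, limit: int) -> List[Dict]:
    remaining = [s for s in _FAKE if s["session_number"] > after_number]
    return remaining[:limit]


sessions = list(iter_batch_sessions(fake_fetch, limit=10))
```

Keeping the fetcher injectable makes the loop easy to test and lets the same code drive either the HTTP endpoint or a direct DAO call.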
3. Release Batch¶
@router.post("/batch/release", status_code=status.HTTP_200_OK, response_model=BatchReleaseResponse)
def release_batch(body: BatchReleaseRequest):
    """
    Marks a batch of sessions as processed and ready for deployment.
    """
This internal endpoint finalizes processing of a batch.
Data Flow¶
1. Session Creation¶
When a search is performed:
- Client creates a session with search query and metadata
- System assigns a unique session ID and batch ID
- Session is stored in the database
# Example session creation
import time

import requests

response = requests.post(
    "https://api.embeddingstudio.com/api/v1/clickstream/session",
    json={
        "session_id": "user_123_search_456",
        "search_query": "red summer dress",
        "search_meta": {"filters": {"category": "clothing"}},
        "search_results": [
            {"object_id": "prod_789", "meta": {"title": "Red sundress"}}
        ],
        "created_at": int(time.time()),  # Unix timestamp in seconds
    },
)
2. Event Recording¶
As users interact with search results:
- Client records events (clicks, purchases, etc.)
- Events include object IDs and timestamps
- Events are associated with the session ID
# Example event recording
import time

import requests

response = requests.post(
    "https://api.embeddingstudio.com/api/v1/clickstream/session/events",
    json={
        "session_id": "user_123_search_456",  # must match an existing session
        "events": [
            {
                "event_id": "click_001",
                "object_id": "prod_789",
                "event_type": "click",
                "meta": {"position": 3},
                "created_at": int(time.time()),  # Unix timestamp in seconds
            }
        ],
    },
)
3. Session Processing¶
Later, system processes use the data:
- Sessions are retrieved in batches
- Events are analyzed to infer relevance
- Sessions are converted to training inputs
- Model improvement is scheduled
Converting to Training Data¶
The ClickstreamSessionConverter transforms sessions into fine-tuning inputs:
# Create a converter instance
converter = ClickstreamSessionConverter(
    item_type=ProductItemMeta,
    query_item_type=TextQueryItem,
    fine_tuning_type=FineTuningInput,
    event_type=ClickStreamSessionEvent,
)

# Convert a session to training input
training_input = converter.convert(session_data)
The conversion process:
- Extracts the search query and metadata
- Maps event types to importance scores
- Identifies which results were interacted with
- Creates structured training inputs
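The scoring step in the list above can be pictured with a small standalone sketch. This is not the ClickstreamSessionConverter implementation: the weights, the `score_results` helper, and the "keep the strongest signal per item" rule are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical weights; real values depend on the product and on how the
# converter is configured.
EVENT_IMPORTANCE: Dict[str, float] = {
    "view": 0.1,
    "click": 0.5,
    "add_to_cart": 0.8,
    "purchase": 1.0,
}


def score_results(result_ids: List[str], events: List[Dict]) -> Dict[str, float]:
    """Give each search result the highest importance among its events.

    Results with no events keep a score of 0.0 and can be treated as
    not interacted with.
    """
    scores = {object_id: 0.0 for object_id in result_ids}
    for event in events:
        object_id = event["object_id"]
        if object_id not in scores:
            continue  # event refers to an item outside this result list
        weight = EVENT_IMPORTANCE.get(event["event_type"], 0.0)
        scores[object_id] = max(scores[object_id], weight)
    return scores


scores = score_results(
    ["prod_789", "prod_790"],
    [
        {"object_id": "prod_789", "event_type": "click"},
        {"object_id": "prod_789", "event_type": "purchase"},
    ],
)
```

Taking the maximum rather than the sum keeps a single purchase from being outweighed by many repeated views of another item; other aggregation rules are equally plausible.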
MongoDB Storage¶
Session data is stored in MongoDB across several collections:
- sessions: Stores session data, search queries, and metadata
- session_events: Stores individual user interaction events
- session_batches: Groups sessions for processing
- sessions_for_improvement: Tracks sessions used for model training
Best Practices¶
Tracking Events¶
- Track diverse events: clicks, views, add-to-cart, purchases
- Include positions: Record where in results the item appeared
- Add timestamps: Time-based analysis can reveal patterns
- Record search contexts: Include filters, sorting, and user segments
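A small helper can make these practices hard to forget by requiring a position and always stamping a timestamp. The sketch below follows the event shape used in the event-recording example; the `build_event` helper and the idea of merging extra context keys into `meta` are assumptions, not part of the Embedding Studio API.

```python
import time
from typing import Any, Dict, Optional


def build_event(
    event_id: str,
    object_id: str,
    event_type: str,
    position: int,
    context: Optional[Dict[str, Any]] = None,
    created_at: Optional[int] = None,
) -> Dict[str, Any]:
    """Build one event dict in the shape used by the event-recording example."""
    meta: Dict[str, Any] = {"position": position}
    if context:
        # Hypothetical extra keys (sort order, user segment, ...) stored in meta.
        meta.update(context)
    return {
        "event_id": event_id,
        "object_id": object_id,
        "event_type": event_type,
        "meta": meta,
        # Default to "now" as a Unix timestamp in seconds.
        "created_at": int(time.time()) if created_at is None else created_at,
    }


event = build_event(
    "click_001", "prod_789", "click", position=3,
    context={"sort": "price_asc"}, created_at=1700000000,
)
```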
Session Analysis¶
- Analyze session volume: Low session count may indicate poor coverage
- Check event distribution: Ensure balanced representation of event types
- Monitor conversions: Track click-through and purchase rates
- Compare segments: Look for variations across user groups
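Two of these checks, event distribution and click-through rate, can be computed with a few lines of standard-library Python. The sketch assumes sessions have already been loaded as dictionaries with an `events` list whose items carry an `event_type`; the function names and the session-level definition of click-through rate are illustrative.

```python
from collections import Counter
from typing import Dict, List


def event_distribution(sessions: List[Dict]) -> Dict[str, float]:
    """Share of each event type across all sessions (values sum to 1.0)."""
    counts = Counter(
        event["event_type"] for s in sessions for event in s["events"]
    )
    total = sum(counts.values())
    return {etype: n / total for etype, n in counts.items()} if total else {}


def click_through_rate(sessions: List[Dict]) -> float:
    """Fraction of sessions that contain at least one click event."""
    if not sessions:
        return 0.0
    clicked = sum(
        any(e["event_type"] == "click" for e in s["events"]) for s in sessions
    )
    return clicked / len(sessions)


sample_sessions = [
    {"events": [{"event_type": "click"}, {"event_type": "purchase"}]},
    {"events": [{"event_type": "view"}]},
    {"events": []},
    {"events": [{"event_type": "click"}]},
]
distribution = event_distribution(sample_sessions)
ctr = click_through_rate(sample_sessions)
```

Running the same two functions on sessions filtered by user segment gives a quick way to compare groups.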
Data Quality¶
- Filter bot traffic: Use mark_session_irrelevant for non-human sessions
- Validate event sequencing: Ensure logical order of events
- Check payload content: Ensure search metadata is properly structured
- Monitor batch processing: Track batch completion and error rates
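Event sequencing can be checked before a session is submitted for improvement. The sketch below assumes events carry the `event_id` and `created_at` fields shown in the event-recording example; the `sequencing_errors` helper and its three rules (no duplicate IDs, no event before the session starts, no timestamp going backwards) are one possible definition of "logical order", not a rule enforced by Embedding Studio.

```python
from typing import Dict, List


def sequencing_errors(session_created_at: int, events: List[Dict]) -> List[str]:
    """Return human-readable problems with event order; empty list means OK."""
    errors: List[str] = []
    seen_ids = set()
    previous = session_created_at
    for event in events:
        if event["event_id"] in seen_ids:
            errors.append(f"duplicate event_id {event['event_id']}")
        seen_ids.add(event["event_id"])
        if event["created_at"] < session_created_at:
            errors.append(f"{event['event_id']} predates the session")
        elif event["created_at"] < previous:
            errors.append(f"{event['event_id']} is out of order")
        # Track the latest timestamp seen so far.
        previous = max(previous, event["created_at"])
    return errors


problems = sequencing_errors(
    100,
    [
        {"event_id": "e1", "created_at": 105},
        {"event_id": "e2", "created_at": 103},
        {"event_id": "e3", "created_at": 90},
        {"event_id": "e1", "created_at": 110},
    ],
)
```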
Using ClickstreamDao¶
The ClickstreamDao interface provides methods for working with session data:
# Get a session with its events
session = context.clickstream_dao.get_session(session_id)

# Mark a session for model improvement
task = context.sessions_for_improvement.create(
    schema=SessionForImprovementCreateSchema(
        session_id=session_id,
    ),
    return_obj=True,
)
context.sessions_for_improvement.update(obj=task)

# Get batches of sessions for processing
sessions = context.clickstream_dao.get_batch_sessions(
    batch_id=batch_id,
    after_number=last_processed_number,
    limit=100,
)