manager

Documentation for `ItemSetManager`¶

Functionality¶

The ItemSetManager class manages item set operations such as data preprocessing, splitting, and augmentation. It integrates data transformations and augmentation processes to prepare datasets for embedding models.

Main Purposes and Motivation¶

This class centralizes dataset preparation for embedding tasks in one place. It uses a preprocessor for transformations and optionally applies augmentation to the train and test splits, so callers do not have to wire these steps together themselves.

Inheritance¶

ItemSetManager does not inherit from any other class. It is a standalone component in the data management workflow.

Usage¶

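Initialize with a preprocessor instance.
Optionally provide an ID field name (id_field_name), a splitter for dividing the dataset (items_set_splitter), and an augmenter for custom data modifications (augmenter).
Configure augmentation with do_augment_test (whether the test split is augmented) and do_augmentation_before_preprocess (whether augmentation runs before preprocessing).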

Example¶

Assuming you have a preprocessor:

pre_processor = Preprocessor(...)
manager = ItemSetManager(
    pre_processor,
    id_field_name='id',
    items_set_splitter=Splitter(...),
    augmenter=Augmenter(...),
    do_augment_test=True,
    do_augmentation_before_preprocess=True
)

Remember to adjust parameters based on your dataset needs.

Documentation for `ItemSetManager._augment_items_set`¶

Functionality¶

Applies augmentation to an ItemsSet if an augmenter is set. If no augmenter exists, it returns the original ItemsSet.

Parameters¶

items_set: The ItemsSet instance to be augmented.

Usage¶

Purpose: Process and augment an ItemsSet when needed.

Example¶

Assume a valid ItemsSet instance:

augmented = manager._augment_items_set(items_set)
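
The pass-through behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `SketchManager`, `DuplicatingAugmenter`, and the `apply` method name are hypothetical stand-ins, and a plain list stands in for an `ItemsSet`.

```python
from typing import List, Optional


class DuplicatingAugmenter:
    """Hypothetical augmenter: emits every item twice."""

    def apply(self, items: List[str]) -> List[str]:
        return [item for item in items for _ in range(2)]


class SketchManager:
    """Hypothetical stand-in for ItemSetManager; only the augmenter check is modeled."""

    def __init__(self, augmenter: Optional[DuplicatingAugmenter] = None):
        self.augmenter = augmenter

    def augment_items_set(self, items: List[str]) -> List[str]:
        # No augmenter configured: hand back the very same object.
        if self.augmenter is None:
            return items
        return self.augmenter.apply(items)


items = ["a", "b"]
unchanged = SketchManager().augment_items_set(items)
augmented = SketchManager(DuplicatingAugmenter()).augment_items_set(items)
```

Without an augmenter, `unchanged` is the same list object that was passed in; with one, `augmented` holds the augmenter's output.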

Documentation for `ItemSetManager._augment_test_items_set`¶

Functionality¶

This method augments a test items set only when two conditions hold: test augmentation is enabled on the manager, and the before_preprocess flag passed by the caller matches the manager's configured augmentation stage (before or after preprocessing).

Parameters¶

items_set: ItemsSet to be augmented.
before_preprocess: Boolean flag indicating if the augmentation should be applied before the preprocessing step. Default is True.

Usage¶

Purpose: Conditionally augment a test items set during the data augmentation process for evaluation.

Example¶

Assuming an instance of ItemSetManager was created with do_augment_test=True and do_augmentation_before_preprocess=True, the test items set is augmented as shown below:

augmented_set = manager._augment_test_items_set(test_items_set, before_preprocess=True)
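
The gating rule described in this section can be written out as a small predicate. This is a sketch of the documented conditions only: `should_augment_test` is a hypothetical helper, and its parameters mirror the constructor flags from the class example rather than any confirmed internal attribute names.

```python
def should_augment_test(
    do_augment_test: bool,
    do_augmentation_before_preprocess: bool,
    before_preprocess: bool = True,
) -> bool:
    """Return True only when test augmentation is on and the stage matches."""
    return do_augment_test and (
        before_preprocess == do_augmentation_before_preprocess
    )


# Enumerate every flag combination to make the rule explicit.
# Key order: (do_augment_test, do_augmentation_before_preprocess, before_preprocess)
decisions = {
    (enabled, configured, requested): should_augment_test(
        enabled, configured, requested
    )
    for enabled in (False, True)
    for configured in (False, True)
    for requested in (False, True)
}
```

Of the eight combinations, only the two with test augmentation enabled and a matching stage evaluate to `True`.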

Documentation for `ItemSetManager._augment_train_items_set`¶

Functionality¶

This method augments the train items set when the before_preprocess flag matches the manager's configured augmentation stage. If the flag does not match, the original items set is returned unchanged.

Parameters¶

items_set: The train items set to be augmented.
before_preprocess: A boolean that indicates if augmentation should be applied before preprocessing.

Usage¶

Purpose: Conditionally apply augmentation on the train items set based on configuration.

Example¶

Assuming augmentation is enabled, use the method as follows:

augmented_items = manager._augment_train_items_set(items_set, True)
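
One way to read the before_preprocess flag is that a caller may invoke the method at both stages and let the configuration decide which call takes effect. The sketch below illustrates that reading. It is an assumption, not confirmed by this page: `SketchManager`, `augment_train`, and `run` are hypothetical, and the augmentation and preprocessing steps are placeholders chosen so that their order changes the result.

```python
from typing import List


class SketchManager:
    """Hypothetical stand-in; models only the stage-matching rule."""

    def __init__(self, do_augmentation_before_preprocess: bool = True):
        self.do_augmentation_before_preprocess = do_augmentation_before_preprocess

    def augment_train(self, items: List[str], before_preprocess: bool) -> List[str]:
        # Stage mismatch: return the input untouched.
        if before_preprocess != self.do_augmentation_before_preprocess:
            return items
        # Placeholder augmentation: append an upper-cased copy of every item.
        return items + [item.upper() for item in items]


def run(manager: SketchManager, items: List[str]) -> List[str]:
    # The hook is called at both stages; only the matching one does any work.
    items = manager.augment_train(items, before_preprocess=True)
    items = [item.strip().lower() for item in items]  # placeholder preprocessing
    return manager.augment_train(items, before_preprocess=False)


before = run(SketchManager(True), [" A ", "b"])
after = run(SketchManager(False), [" A ", "b"])
```

Here `before` contains preprocessed copies of the augmented items, while `after` contains augmented copies of the preprocessed items, which is why the stage setting matters whenever the two steps do not commute.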

Documentation for `ItemSetManager._preprocess`¶

Functionality¶

This method takes an input DatasetDict and applies a preprocessing conversion using the preprocessor's convert method. A debug log is generated before processing.

Parameters¶

dataset: A DatasetDict containing the data to preprocess.

Usage¶

Purpose: Preprocess the dataset to transform raw data into a format suitable for further analysis or processing.

Example¶

Assuming you have an instance of ItemSetManager and a valid DatasetDict:

preprocessed_data = manager._preprocess(dataset)
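
The log-then-convert flow described above can be sketched with the standard `logging` module. Everything here other than the `convert` method name is an assumption made for illustration: `LowercasePreprocessor` and `preprocess` are hypothetical, and a plain dict of lists stands in for a `DatasetDict`.

```python
import logging
from typing import Dict, List

logger = logging.getLogger("item_set_manager_sketch")


class LowercasePreprocessor:
    """Hypothetical preprocessor exposing a convert method."""

    def convert(self, dataset: Dict[str, List[str]]) -> Dict[str, List[str]]:
        # Placeholder transformation: lower-case every row in every split.
        return {
            split: [text.lower() for text in rows]
            for split, rows in dataset.items()
        }


def preprocess(
    preprocessor: LowercasePreprocessor, dataset: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    # Emit a debug message first, then delegate to the preprocessor.
    logger.debug("Preprocessing dataset with splits: %s", list(dataset))
    return preprocessor.convert(dataset)


raw = {"train": ["Red Shoes"], "test": ["Blue HAT"]}
processed = preprocess(LowercasePreprocessor(), raw)
```

The debug message only appears if the logger is configured at `DEBUG` level, for example with `logging.basicConfig(level=logging.DEBUG)`.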

Documentation for `ItemSetManager._split_dataset`¶

Functionality¶

This method splits a dataset into training and testing subsets by filtering examples based on a specified ID field. It takes a dataset and two sets (train_ids and test_ids) to create the subsets. It also applies augmentation for train and test sets if configured.

Parameters¶

dataset: The original dataset (Hugging Face Dataset) to split.
train_ids: Set of strings representing IDs for the training set.
test_ids: Set of strings representing IDs for the testing set.

Usage¶

Purpose: Split a dataset by filtering examples using an ID field.

Example¶

from datasets import load_dataset
from embedding_studio.embeddings.data.items.manager import ItemSetManager
from embedding_studio.embeddings.data.preprocessors.preprocessor import ItemsDatasetDictPreprocessor

preprocessor = ItemsDatasetDictPreprocessor(...)
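# id_field_name names the column holding the IDs used for filtering.
manager = ItemSetManager(preprocessor=preprocessor, id_field_name="id")
# Request a single split so load_dataset returns a Dataset, not a DatasetDict.
dataset = load_dataset("my_dataset", split="train")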
train_ids = {"id1", "id2"}
test_ids = {"id3", "id4"}
split_dataset = manager._split_dataset(dataset, train_ids, test_ids)
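
The ID-based filtering itself can be illustrated without the `datasets` library. This is a simplified sketch, not the method's actual code: `split_by_ids` is a hypothetical function, a list of dicts stands in for a Hugging Face Dataset, and the augmentation step is left out.

```python
from typing import Dict, List, Set

Row = Dict[str, str]


def split_by_ids(
    rows: List[Row],
    id_field_name: str,
    train_ids: Set[str],
    test_ids: Set[str],
) -> Dict[str, List[Row]]:
    """Keep a row in a split only if its ID is a member of that split's ID set."""
    train = [row for row in rows if row[id_field_name] in train_ids]
    test = [row for row in rows if row[id_field_name] in test_ids]
    return {"train": train, "test": test}


rows = [
    {"id": "id1", "text": "first"},
    {"id": "id2", "text": "second"},
    {"id": "id3", "text": "third"},
    {"id": "id5", "text": "unlisted"},
]
split = split_by_ids(rows, "id", {"id1", "id2"}, {"id3", "id4"})
```

In this sketch, a row whose ID is in neither set (`id5`) is dropped, and an ID with no matching row (`id4`) is silently ignored. Whether the real method treats these cases the same way is not stated on this page.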

Documentation for `ItemSetManager._check_clickstream_dataset`¶

Functionality¶

This method verifies that the provided clickstream dataset is valid. It checks that both the 'train' and 'test' portions are instances of the expected dataset type.

Parameters¶

clickstream_dataset: A DatasetDict containing clickstream data. Both 'train' and 'test' values must be instances of PairedFineTuningInputsDataset.

Usage¶

Purpose: Ensure that clickstream data meets necessary criteria before further processing. If validation fails, a ValueError is raised.

Example¶

clickstream_dataset = {
    "train": PairedFineTuningInputsDataset(...),
    "test": PairedFineTuningInputsDataset(...)
}
manager = ItemSetManager(...)
manager._check_clickstream_dataset(clickstream_dataset)
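
The validation described above amounts to a type check on both splits. The following sketch shows one way to express it. The names are assumptions: `PairedInputsStandIn` is a hypothetical placeholder for `PairedFineTuningInputsDataset`, `check_clickstream_dataset` is not the library's code, and the error message wording is invented.

```python
from typing import Any, Mapping


class PairedInputsStandIn:
    """Hypothetical placeholder for PairedFineTuningInputsDataset."""


def check_clickstream_dataset(clickstream_dataset: Mapping[str, Any]) -> None:
    """Raise ValueError unless both splits hold the expected dataset type."""
    for split in ("train", "test"):
        value = clickstream_dataset.get(split)
        if not isinstance(value, PairedInputsStandIn):
            raise ValueError(
                f"clickstream_dataset[{split!r}] must be a PairedInputsStandIn, "
                f"got {type(value).__name__}"
            )


# A valid mapping passes silently.
check_clickstream_dataset(
    {"train": PairedInputsStandIn(), "test": PairedInputsStandIn()}
)

# A wrong type in either split raises ValueError.
try:
    check_clickstream_dataset({"train": PairedInputsStandIn(), "test": []})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Failing fast here keeps a malformed clickstream dataset from surfacing as a harder-to-trace error later in the pipeline.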
