Skip to content

manager

Documentation for ItemSetManager

Functionality

The ItemSetManager class manages item set operations such as data preprocessing, splitting, and augmentation. It integrates data transformations and augmentation processes to prepare datasets for embedding models.

Main Purposes and Motivation

This class centralizes the logic for preparing datasets efficiently. It uses a preprocessor for transformations and optionally applies augmentation for both train and test splits. Its motivation is to streamline dataset handling for embedding tasks.

Inheritance

ItemSetManager does not inherit from any other class. It is a standalone component in the data management workflow.

Usage

  • Initialize with a preprocessor instance.
  • Optionally provide an ID field name, a splitter for dividing the dataset, and an augmenter for custom data modifications.
  • Configure options for test and train set augmentation.

Example

Assuming you have a preprocessor:

pre_processor = Preprocessor(...)
manager = ItemSetManager(
    pre_processor,
    id_field_name='id',
    items_set_splitter=Splitter(...),
    augmenter=Augmenter(...),
    do_augment_test=True,
    do_augmentation_before_preprocess=True
)
Remember to adjust parameters based on your dataset needs.


Documentation for ItemSetManager._augment_items_set

Functionality

Applies augmentation to an ItemsSet if an augmenter is set. If no augmenter exists, it returns the original ItemsSet.

Parameters

  • items_set: The ItemsSet instance to be augmented.

Usage

  • Purpose: Process and augment an ItemsSet when needed.

Example

Assume a valid ItemsSet instance:

augmented = manager._augment_items_set(items_set)


Documentation for ItemSetManager._augment_test_items_set

Functionality

This method applies augmentation to a test items set if augmentation is enabled and the augmentation step is conducted before preprocessing. When the provided flag matches the object's configuration and augmentation is active, the test items set undergoes augmentation.

Parameters

  • items_set: ItemsSet to be augmented.
  • before_preprocess: Boolean flag indicating if the augmentation should be applied before the preprocessing step. Default is True.

Usage

  • Purpose: Conditionally augment a test items set during the data augmentation process for evaluation.

Example

Assuming an instance of ItemSetManager is created with the test augmentation flag set to True, the test items set is augmented as shown below:

augmented_set = manager._augment_test_items_set(test_items_set, before_preprocess=True)


Documentation for ItemSetManager._augment_train_items_set

Functionality

This method applies augmentation to the train items set if the before_preprocess flag matches the configuration. When enabled, the items set is augmented before preprocessing; otherwise, the original items set is returned.

Parameters

  • items_set: The train items set to be augmented.
  • before_preprocess: A boolean that indicates if augmentation should be applied before preprocessing.

Usage

  • Purpose: Conditionally apply augmentation on the train items set based on configuration.

Example

Assuming augmentation is enabled, use the method as follows:

augmented_items = manager._augment_train_items_set(items_set, True)


Documentation for ItemSetManager._preprocess

Functionality

This method takes an input DatasetDict and applies a preprocessing conversion using the preprocessor's convert method. A debug log is generated before processing.

Parameters

  • dataset: A DatasetDict containing the data to preprocess.

Usage

  • Purpose: Preprocess the dataset to transform raw data into a format suitable for further analysis or processing.

Example

Assuming you have an instance of ItemSetManager and a valid DatasetDict:

preprocessed_data = item_set_manager._preprocess(dataset)


Documentation for ItemSetManager._split_dataset

Functionality

This method splits a dataset into training and testing subsets by filtering examples based on a specified ID field. It takes a dataset and two sets (train_ids and test_ids) to create the subsets. It also applies augmentation for train and test sets if configured.

Parameters

  • dataset: The original dataset (Hugging Face Dataset) to split.
  • train_ids: Set of strings representing IDs for the training set.
  • test_ids: Set of strings representing IDs for the testing set.

Usage

  • Purpose: Split a dataset by filtering examples using an ID field.

Example

from datasets import load_dataset
from embedding_studio.embeddings.data.items.manager import ItemSetManager
from embedding_studio.embeddings.data.preprocessors.preprocessor import ItemsDatasetDictPreprocessor

preprocessor = ItemsDatasetDictPreprocessor(...)
manager = ItemSetManager(preprocessor=preprocessor)
dataset = load_dataset("my_dataset")
train_ids = {"id1", "id2"}
test_ids = {"id3", "id4"}
split_dataset = manager._split_dataset(dataset, train_ids, test_ids)

Documentation for ItemSetManager._check_clickstream_dataset

Functionality

This method verifies that the provided clickstream dataset is valid. It checks that both the 'train' and 'test' portions are instances of the expected dataset type.

Parameters

  • clickstream_dataset: A DatasetDict containing clickstream data. Both 'train' and 'test' values must be instances of PairedFineTuningInputsDataset.

Usage

  • Purpose: Ensure that clickstream data meets necessary criteria before further processing. If validation fails, a ValueError is raised.

Example

clickstream_dataset = {
    "train": PairedFineTuningInputsDataset(...),
    "test": PairedFineTuningInputsDataset(...)
}
manager = ItemSetManager(...)
manager._check_clickstream_dataset(clickstream_dataset)