Documentation for `TextItemsDatasetDictPreprocessor`

Overview

The TextItemsDatasetDictPreprocessor class preprocesses dataset dictionaries in three steps: it normalizes field names, applies a text transformation, and converts each dataset split into a set of text items. The result is raw text data in a unified format that can later be used for embeddings or further text processing.

Constructor Parameters

field_normalizer: An instance of DatasetFieldsNormalizer used to normalize field names and dataset structure.
transform: A callable applied to each text item. Defaults to an identity function that returns the text unchanged.
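The default-transform behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: SketchPreprocessor and identity are hypothetical stand-ins, and the assumption that the default is expressed through a None sentinel is ours.

```python
from typing import Callable, Optional


def identity(text: str) -> str:
    """Return the text unchanged (the 'no change' default)."""
    return text


class SketchPreprocessor:
    """Hypothetical stand-in for the real class, for illustration only."""

    def __init__(
        self,
        field_normalizer: object,
        transform: Optional[Callable[[str], str]] = None,
    ) -> None:
        self.field_normalizer = field_normalizer
        # Fall back to the identity function when no transform is supplied.
        self.transform = transform if transform is not None else identity


# No transform given: text passes through untouched.
default_pre = SketchPreprocessor(field_normalizer=None)

# Explicit transform: every text item would be lowercased.
lower_pre = SketchPreprocessor(field_normalizer=None, transform=str.lower)
```

With this shape, callers who do not need any text cleanup can omit the second argument entirely.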

Inheritance

The class inherits from ItemsDatasetDictPreprocessor and extends its base functionality for text data.

Functionality

Purpose: To prepare a dataset for text-based embedding processes by standardizing fields and applying optional text transformations.

Method: `get_id_field_name`

Functionality

Returns the identifier field name stored in the field normalizer's id_field_name attribute.
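The delegation described above might look roughly like the sketch below. StubNormalizer and SketchPreprocessor are hypothetical stand-ins written for this illustration; only the id_field_name attribute name comes from the description, and the value "item_id" is an arbitrary example.

```python
from dataclasses import dataclass


@dataclass
class StubNormalizer:
    """Hypothetical stand-in for DatasetFieldsNormalizer."""

    id_field_name: str


class SketchPreprocessor:
    """Hypothetical stand-in for the real preprocessor."""

    def __init__(self, field_normalizer: StubNormalizer) -> None:
        self.field_normalizer = field_normalizer

    def get_id_field_name(self) -> str:
        # Delegate to the normalizer instead of storing a second copy.
        return self.field_normalizer.id_field_name


sketch = SketchPreprocessor(StubNormalizer(id_field_name="item_id"))
id_field = sketch.get_id_field_name()
```

Reading the name through the normalizer keeps a single source of truth: if the normalizer is configured with a different identifier field, the preprocessor reports it without extra bookkeeping.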

Parameters

self: Instance of TextItemsDatasetDictPreprocessor.

Usage

Purpose: To retrieve the identifier field name used in dataset items.

Example

preprocessor = TextItemsDatasetDictPreprocessor(field_normalizer)
field_name = preprocessor.get_id_field_name()

Method: `convert`

Functionality

Receives a DatasetDict and applies field normalization and text transformations. It iterates over the keys (splits) of the dataset, creates an ItemsSet object for each key, and applies the provided transform function to the text items. The result is a new DatasetDict with preprocessed items for further use.
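The per-split loop described above can be illustrated with plain Python dictionaries. This is a sketch under stated assumptions, not the library's code: convert_sketch, the rename mapping, and the target field names "item_id" and "item" are hypothetical, and lists of dicts stand in for DatasetDict and ItemsSet.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def convert_sketch(
    dataset: Dict[str, List[Row]],
    rename: Dict[str, str],
    transform: Callable[[str], str],
    text_field: str = "item",
) -> Dict[str, List[Row]]:
    """Normalize field names and transform the text field in every split."""
    processed: Dict[str, List[Row]] = {}
    for split_name in dataset.keys():
        rows: List[Row] = []
        for row in dataset[split_name]:
            # Step 1: field normalization (rename raw keys to unified ones).
            normalized = {rename.get(key, key): value for key, value in row.items()}
            # Step 2: apply the text transform to the unified text field.
            normalized[text_field] = transform(normalized[text_field])
            rows.append(normalized)
        # Step 3: one processed collection per split.
        processed[split_name] = rows
    return processed


raw = {
    "train": [{"id": "1", "name": "Red Shoes"}],
    "test": [{"id": "2", "name": "Blue HAT"}],
}
result = convert_sketch(
    raw,
    rename={"id": "item_id", "name": "item"},
    transform=str.lower,
)
```

The input is left untouched and a new mapping is returned, which mirrors the description of convert producing a new DatasetDict.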

Parameters

dataset: The DatasetDict to preprocess. It holds the data for each split, such as 'train' and 'test'.

Return

A DatasetDict where each value is an ItemsSet with text transformations applied.

Usage

Purpose: To convert raw dataset dictionaries into a normalized form with text transformations, ready for further processing.

Example

Suppose you have a dataset called data_ds:

preprocessor = TextItemsDatasetDictPreprocessor(
    field_normalizer, transform
)
processed_ds = preprocessor.convert(data_ds)

Example of Instantiation and Usage

from embedding_studio.embeddings.data.preprocessors.text_items_preprocessor import TextItemsDatasetDictPreprocessor
from embedding_studio.embeddings.data.utils.fields_normalizer import DatasetFieldsNormalizer

# Create a normalizer instance
normalizer = DatasetFieldsNormalizer(
    id_field_name='id',
    field_mapping={'name': 'text'}
)

# Instantiate the preprocessor with the normalizer and an optional transform
preprocessor = TextItemsDatasetDictPreprocessor(
    field_normalizer=normalizer,
    transform=lambda x: x.lower()
)

# Process a DatasetDict using the preprocessor.
# `dataset` is assumed to be a DatasetDict loaded earlier; it is not defined in this snippet.
processed_dataset = preprocessor.convert(dataset)
