dict_items_preprocessor

Documentation for `DictItemsDatasetDictPreprocessor`¶

Functionality¶

The DictItemsDatasetDictPreprocessor class processes datasets containing dictionary items. It normalizes field names, applies a transformation to extract text lines from dict items, and builds an ItemsSet for each dataset split. It inherits from ItemsDatasetDictPreprocessor, ensuring consistency in data handling.

Parameters¶

field_normalizer: An instance of DatasetFieldsNormalizer that standardizes field names and specifies the ID field.
transform: Optional callable that extracts a text line from a dictionary. If not provided, defaults to get_text_line_from_dict.

Usage¶

Purpose: To convert dictionary-based datasets into a text-based format for embedding applications.

Example¶

Below is an example of how to use the preprocessor:

from embedding_studio.embeddings.data.preprocessors.dict_items_preprocessor \
    import DictItemsDatasetDictPreprocessor
from embedding_studio.embeddings.data.utils.fields_normalizer \
    import DatasetFieldsNormalizer

normalizer = DatasetFieldsNormalizer(...)
preprocessor = DictItemsDatasetDictPreprocessor(
    field_normalizer=normalizer,
    transform=None  # uses the default transform
)

processed_dataset = preprocessor.convert(original_dataset)

The example shows how to normalize and transform a dataset for embedding operations.

Documentation for `DictItemsDatasetDictPreprocessor.get_id_field_name`¶

Functionality¶

Returns the name of the unique identifier field from the normalized dataset. This method accesses the id_field_name attribute of the field normalizer to obtain the ID field.

Parameters¶

None.

Usage¶

Use this method to retrieve the key that uniquely identifies a dataset item after normalization.

Example¶

If the field normalizer is configured with id_field_name set to "id", then calling get_id_field_name() returns "id".

Documentation for `DictItemsDatasetDictPreprocessor.convert`¶

Functionality¶

Normalizes dataset fields and applies dict transforms. Returns a DatasetDict with each key holding an ItemsSet constructed from the normalized dataset.

Parameters¶

dataset: The DatasetDict to be preprocessed; typically contains train/test splits.

Returns¶

DatasetDict: A preprocessed dataset where values are ItemsSet objects built from the input data.

Usage¶

Purpose: Prepare the dataset for embedding computations by normalizing fields and transforming dictionary items.

Example¶

Example usage:

preprocessor = DictItemsDatasetDictPreprocessor(normalizer, transform)
updated_dataset = preprocessor.convert(dataset)

dict_items_preprocessor

Documentation for DictItemsDatasetDictPreprocessor¶

Functionality¶

Parameters¶

Usage¶

Example¶

Documentation for DictItemsDatasetDictPreprocessor.get_id_field_name¶

Functionality¶

Parameters¶

Usage¶

Example¶

Documentation for DictItemsDatasetDictPreprocessor.convert¶

Functionality¶

Parameters¶

Returns¶

Usage¶

Example¶

Documentation for `DictItemsDatasetDictPreprocessor`¶

Documentation for `DictItemsDatasetDictPreprocessor.get_id_field_name`¶

Documentation for `DictItemsDatasetDictPreprocessor.convert`¶