Skip to content

ItemsDatasetDictPreprocessor

Functionality

The ItemsDatasetDictPreprocessor is an abstract base class that defines an interface for processing and transforming DatasetDict objects. It standardizes data formats and applies necessary transformations for embedding models.

Parameters

This class does not take constructor parameters. Its abstract methods require implementations to manage dataset conversions and item-level transformations.

Usage

  • Purpose: Provide a common interface for preprocessor classes that ensure consistency in converting and transforming datasets.
  • Motivation: Enforce a standard method of normalizing fields and applying custom transforms for embedding workflows.
  • Inheritance: Inherits from Python's ABC, ensuring subclasses implement all abstract methods.

Example

An implementation of this interface may define the convert method to normalize dataset fields and create appropriate item sets. For example:

class MyPreprocessor(ItemsDatasetDictPreprocessor):
    def convert(self, dataset: DatasetDict) -> DatasetDict:
        # Custom normalization and transformation
        return processed_dataset

    def __call__(self, item: Any) -> Any:
        return transformed_item

    def get_id_field_name(self) -> str:
        return "custom_id"

ItemsDatasetDictPreprocessor.convert

Functionality

This method processes a DatasetDict by normalizing fields, creating item sets, and applying transformation functions. It converts the original dataset into a structured format ready for embedding models.

Parameters

  • dataset: The original DatasetDict that needs to be preprocessed.

Returns

  • A processed DatasetDict with normalized fields and transformed items.

Usage

  • Purpose: To standardize datasets for embedding models by applying normalization and transformation steps.

Example

processed_dataset = preprocessor.convert(dataset)

ItemsDatasetDictPreprocessor.get_id_field_name

Functionality

Returns the field name used for item identification. Typically returns the identifier field set by the normalizer component of the preprocessor.

Parameters

None.

Returns

  • str: The item identifier field name.

Usage

Used to obtain the identifier for dataset items. A typical implementation looks like:

def get_id_field_name(self) -> str:
    return self._field_normalizer.id_field_name