ItemsDatasetDictPreprocessor¶
Functionality¶
The ItemsDatasetDictPreprocessor
is an abstract base class that defines an interface for processing and transforming DatasetDict
objects. It standardizes data formats and applies necessary transformations for embedding models.
Parameters¶
This class does not take constructor parameters. Its abstract methods require implementations to manage dataset conversions and item-level transformations.
Usage¶
- Purpose: Provide a common interface for preprocessor classes that ensure consistency in converting and transforming datasets.
- Motivation: Enforce a standard method of normalizing fields and applying custom transforms for embedding workflows.
- Inheritance: Inherits from Python's ABC, ensuring subclasses implement all abstract methods.
Example¶
An implementation of this interface may define the convert
method to normalize dataset fields and create appropriate item sets. For example:
class MyPreprocessor(ItemsDatasetDictPreprocessor):
def convert(self, dataset: DatasetDict) -> DatasetDict:
# Custom normalization and transformation
return processed_dataset
def __call__(self, item: Any) -> Any:
return transformed_item
def get_id_field_name(self) -> str:
return "custom_id"
ItemsDatasetDictPreprocessor.convert
¶
Functionality¶
This method processes a DatasetDict
by normalizing fields, creating item sets, and applying transformation functions. It converts the original dataset into a structured format ready for embedding models.
Parameters¶
dataset
: The originalDatasetDict
that needs to be preprocessed.
Returns¶
- A processed
DatasetDict
with normalized fields and transformed items.
Usage¶
- Purpose: To standardize datasets for embedding models by applying normalization and transformation steps.
Example¶
processed_dataset = preprocessor.convert(dataset)
ItemsDatasetDictPreprocessor.get_id_field_name
¶
Functionality¶
Returns the field name used for item identification. Typically returns the identifier field set by the normalizer component of the preprocessor.
Parameters¶
None.
Returns¶
str
: The item identifier field name.
Usage¶
Used to obtain the identifier for dataset items. A typical implementation looks like:
def get_id_field_name(self) -> str:
return self._field_normalizer.id_field_name