Documentation for ItemsSet Class¶
Functionality¶
The ItemsSet class is a wrapper around a Huggingface Dataset. It is designed to represent and provide utilities for a set of search result items. It extends the Dataset to allow retrieval by item IDs and indices.
Parameters¶
- dataset: A Huggingface Dataset containing items.
- item_field_name: The field representing the item used for embedding.
- id_field_name: The field representing the unique item identifier.
Motivation¶
ItemsSet was created to facilitate search, indexing, and retrieval within a dataset. It integrates with the Huggingface Dataset, providing extra functionality specific to item management.
Inheritance¶
ItemsSet inherits from the Huggingface Dataset class, augmenting its capabilities to support item-specific operations.
Example¶
dataset = load_dataset(...)
items_set = ItemsSet(dataset, "text", "id")
results = items_set.items_by_ids(["id1", "id2"])
Documentation for ItemsSet.item_field_name¶
Functionality¶
This property returns the name of the dataset field that contains the items for processing by the embedding model. It retrieves the underlying _item_field_name
attribute from an ItemsSet instance.
Parameters¶
- Getter: No parameters are required when retrieving the item field name.
- Setter:
- value: A non-empty string representing the new name for the item field.
Usage¶
- Purpose: Retrieve or update the data field name representing items in the dataset.
Example (Getter)¶
Assuming an ItemsSet instance is defined as items_set
:
item_name = items_set.item_field_name
Example (Setter)¶
To change the field name, ensuring the value is a non-empty string, you can do:
items_set.item_field_name = "new_field_name"
Documentation for ItemsSet.id_field_name¶
Functionality¶
This property returns the name of the field used for item IDs in the dataset. It retrieves the current string value that specifies the field holding the unique identifier of an item.
Parameters¶
- None for retrieving the value.
- When setting, use:
- value: A non-empty string representing the ID field name.
Usage¶
- Purpose: Retrieve or update the identifier field name for the ItemsSet. This property is used to access and manipulate the unique ID of each data item.
Example¶
To get the identifier field:
id_field = items_set.id_field_name
To set a new identifier field:
items_set.id_field_name = "new_id"
Documentation for ItemsSet.id_to_index
¶
Functionality¶
This property returns a dictionary that maps each ID from the dataset to a list of indices where the ID occurs. If the mapping is uninitialized, it is rebuilt by iterating through the dataset and appending each occurrence's index.
Parameters¶
None.
Usage¶
- Purpose: Quickly retrieve the positions of items sharing the same ID in the dataset.
Example¶
items_set = ItemsSet(dataset, "text", "id")
mapping = items_set.id_to_index
print(mapping["example_id"])
Documentation for ItemsSet.rows_by_ids¶
This method retrieves rows from the original dataset based on a provided list of IDs. It uses an internal mapping from ID values to dataset indices. If ignore_missed
is False, the method raises an IndexError
for any missing ID.
Parameters¶
ids
: A list of identifiers used to look up rows in the dataset.ignore_missed
: Boolean flag. If False, missing IDs trigger an error; if True, missing IDs are skipped.
Usage¶
- Purpose: Fetch rows from the dataset corresponding to given IDs.
Example¶
Suppose you have an ItemsSet instance named set_obj
:
rows = set_obj.rows_by_ids(["id1", "id2"], ignore_missed=True)
Documentation for ItemsSet.items_by_indices¶
Functionality¶
This method returns a slice of items from the dataset based on a given list of indices. It retrieves items corresponding to the specified index positions.
Parameters¶
indices
(List[int]): A list of indices to retrieve items from the dataset.
Usage¶
Use this method to access specific items in the dataset by passing a list of indices that indicate the desired item positions.
Example¶
items_set = ItemsSet(dataset, 'id', 'item')
selected_items = items_set.items_by_indices([0, 2, 5])
Documentation for ItemsSet.items_by_ids¶
Functionality¶
Retrieves a slice of items from the dataset using a list of IDs. It uses the rows_by_ids
method internally to get the rows and then extracts the item and ID fields.
Parameters¶
ids
: List[Any] - A list of IDs for which items will be retrieved.
Return Value¶
Returns a tuple containing two lists: - A list of items corresponding to the given IDs. - A list of IDs related to the retrieved items.
Usage¶
Use this method to quickly obtain items and their IDs by passing a list of ID values.
Example¶
items, ids = items_set.items_by_ids(['id1', 'id2'])
Documentation for ItemsSet.items_slice
¶
Functionality¶
This method returns a subset of dataset items by slicing the dataset using the provided start and end indices.
Parameters¶
start_idx
: An integer indicating the start index of the slice.end_idx
: An optional integer specifying the end index (exclusive).
Usage¶
- Purpose: To extract a list of items within a specified range from the dataset.
Example¶
items = dataset.items_slice(10, 20)