aggregated_data_loader

Documentation for AggregatedDataLoader

Functionality

The AggregatedDataLoader class is a DataLoader implementation that aggregates multiple data loaders, allowing data to be loaded from different sources through a single interface. It routes load requests to the appropriate loader based on the source name in the item metadata.

Parameters

  • loaders: Dictionary mapping source names to their respective DataLoader instances.
  • item_meta_cls: The ItemMetaWithSourceInfo class type to use for metadata.

Usage

  • Purpose - To load data from multiple sources through one unified interface instead of calling each source's loader separately.

Example

# Example usage of AggregatedDataLoader

from embedding_studio.data_storage.loaders.aggregated_data_loader import AggregatedDataLoader
# Assumed import path for ItemMetaWithSourceInfo; adjust it to your installation
from embedding_studio.data_storage.loaders.item_meta import ItemMetaWithSourceInfo
# SomeDataLoader stands in for a concrete DataLoader implementation
from embedding_studio.data_storage.loaders.some_data_loader import SomeDataLoader

# Create individual data loaders
loader1 = SomeDataLoader()
loader2 = SomeDataLoader()

# Create an aggregated data loader keyed by source name
aggregated_loader = AggregatedDataLoader(
    loaders={'source1': loader1, 'source2': loader2},
    item_meta_cls=ItemMetaWithSourceInfo,
)

# Load data
items_data = [...]  # List of ItemMetaWithSourceInfo objects
combined_dataset = aggregated_loader.load(items_data)

Documentation for item_meta_cls

Functionality

Returns the ItemMetaWithSourceInfo class type that this loader uses for item metadata.

Parameters

  • None: item_meta_cls is accessed as an attribute and takes no parameters.

Usage

  • Purpose - To provide access to the ItemMetaWithSourceInfo class type used for metadata in this loader.

Example

# Assuming `aggregated_loader` is an instance of AggregatedDataLoader
item_meta_class = aggregated_loader.item_meta_cls

Documentation for load

Functionality

The load method loads data items from multiple sources and combines them into a single dataset. It groups the items by source and delegates each group to the data loader registered for that source.

Parameters

  • items_data: List of ItemMetaWithSourceInfo objects identifying the items to load.

Usage

  • Purpose - This method allows for the aggregation of data from various sources, providing a unified dataset for further processing.

Example

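# Example usage of the load method
# `loaders` maps source names to DataLoader instances;
# `items_data` is a list of ItemMetaWithSourceInfo objects
aggregated_loader = AggregatedDataLoader(loaders, ItemMetaWithSourceInfo)
dataset = aggregated_loader.load(items_data)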
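
The grouping and delegation described under Functionality can be sketched as follows. This is a minimal illustration, not the library's implementation: Meta, EchoLoader, and load_grouped are hypothetical stand-ins, and the source_name attribute is an assumption about how an item's metadata identifies its source.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Meta:
    """Hypothetical stand-in for ItemMetaWithSourceInfo."""
    source_name: str  # assumed attribute naming the item's source
    object_id: str


class EchoLoader:
    """Hypothetical loader that returns one record per requested item."""

    def __init__(self, label: str) -> None:
        self.label = label

    def load(self, items: List[Meta]) -> List[dict]:
        return [{"id": m.object_id, "loaded_by": self.label} for m in items]


def load_grouped(loaders: Dict[str, EchoLoader], items: List[Meta]) -> List[dict]:
    # Group item metadata by the source each item belongs to.
    by_source: Dict[str, List[Meta]] = defaultdict(list)
    for meta in items:
        by_source[meta.source_name].append(meta)

    # Delegate each group to its loader and concatenate the results.
    combined: List[dict] = []
    for source, group in by_source.items():
        if source not in loaders:
            raise KeyError(f"No loader registered for source {source!r}")
        combined.extend(loaders[source].load(group))
    return combined


loaders = {"source1": EchoLoader("A"), "source2": EchoLoader("B")}
items = [Meta("source1", "x"), Meta("source2", "y"), Meta("source1", "z")]
result = load_grouped(loaders, items)
```

In this sketch the combined result is ordered by source group rather than by the original item order, and an unknown source name raises an error. Whether the real class behaves the same way in either respect is not stated here.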

Documentation for _load_batch_with_offset

Functionality

Loads a batch of data items from all sources, starting at the given offset. The method retrieves a batch from each loader and combines them into a single batch.

Parameters

  • offset: The offset at which to start loading items.
  • batch_size: The number of items to load in a single batch.
  • kwargs: Additional parameters for customizing the batch loading process.

Usage

  • Purpose - To load a specified number of items from various data sources, starting from a defined position in the dataset.

Example

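# Example usage of the _load_batch_with_offset method
# The leading underscore marks this as an internal helper; it is called directly here for illustration only
# `loaders` maps source names to DataLoader instances
loader = AggregatedDataLoader(loaders, ItemMetaWithSourceInfo)
items = loader._load_batch_with_offset(offset=0, batch_size=10)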
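
One way to picture "retrieves batches from all loaders and combines them" is the sketch below. It assumes every loader receives the same offset and batch_size; ListLoader, load_batch, and load_batch_all are hypothetical names used only for illustration.

```python
from typing import Dict, List


class ListLoader:
    """Hypothetical loader backed by an in-memory list."""

    def __init__(self, rows: List[str]) -> None:
        self.rows = rows

    def load_batch(self, offset: int, batch_size: int) -> List[str]:
        # Slicing past the end yields an empty list, so an exhausted
        # loader simply contributes nothing.
        return self.rows[offset : offset + batch_size]


def load_batch_all(
    loaders: Dict[str, ListLoader], offset: int, batch_size: int
) -> List[str]:
    # Ask every loader for its slice and concatenate the slices.
    batch: List[str] = []
    for loader in loaders.values():
        batch.extend(loader.load_batch(offset, batch_size))
    return batch


loaders = {"a": ListLoader(["a0", "a1", "a2"]), "b": ListLoader(["b0"])}
first = load_batch_all(loaders, offset=0, batch_size=2)   # ['a0', 'a1', 'b0']
second = load_batch_all(loaders, offset=2, batch_size=2)  # ['a2']
```

Under this assumption a combined batch can hold up to batch_size items per loader, so it may be larger than batch_size.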

Documentation for total_count

Functionality

Calculates the total count of items across all loaders.

Parameters

  • kwargs: Additional parameters passed to each loader's total_count method.

Usage

  • Purpose - To retrieve the total number of items available from all data loaders combined.

Example

# Assuming `aggregated_loader` is an instance of AggregatedDataLoader
count = aggregated_loader.total_count()
print(count)  # Outputs the total count of items across all loaders
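
As a rough sketch of the aggregation, the total can be viewed as a sum over the per-loader counts, with the same keyword arguments forwarded to each loader. CountingLoader, total_count_all, and the limit keyword are hypothetical and exist only to make the forwarding visible.

```python
from typing import Any, Dict


class CountingLoader:
    """Hypothetical loader that knows how many items it holds."""

    def __init__(self, n: int) -> None:
        self.n = n

    def total_count(self, **kwargs: Any) -> int:
        # `limit` is a made-up keyword used to show kwargs being forwarded.
        limit = kwargs.get("limit")
        return self.n if limit is None else min(self.n, limit)


def total_count_all(loaders: Dict[str, CountingLoader], **kwargs: Any) -> int:
    # Forward the same kwargs to every loader and add up the counts.
    return sum(loader.total_count(**kwargs) for loader in loaders.values())


loaders = {"source1": CountingLoader(5), "source2": CountingLoader(7)}
total = total_count_all(loaders)
limited = total_count_all(loaders, limit=6)
```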

Documentation for load_all

Functionality

A generator that iteratively loads all data in batches from all sources.

Parameters

  • batch_size: The size of each batch to load.
  • kwargs: Additional parameters for customizing the batch loading process.

Usage

  • Purpose - To iterate over the full contents of all sources in batches. This method overrides the base implementation so that, for each offset, it gets batches from all loaders.

Example

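# Assuming `aggregated_loader` is an instance of AggregatedDataLoader
# and `process_batch` is a function you provide
for batch in aggregated_loader.load_all(batch_size=10):
    process_batch(batch)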
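
The batch-by-batch iteration can be sketched as a generator that advances a shared offset and stops once no loader returns anything. This is an illustration under assumptions, not the library's code: ListLoader and load_batch are hypothetical, and the stopping rule (an empty combined batch) is a guess at reasonable behavior.

```python
from typing import Dict, Iterator, List


class ListLoader:
    """Hypothetical loader backed by an in-memory list."""

    def __init__(self, rows: List[str]) -> None:
        self.rows = rows

    def load_batch(self, offset: int, batch_size: int) -> List[str]:
        return self.rows[offset : offset + batch_size]


def load_all(loaders: Dict[str, ListLoader], batch_size: int) -> Iterator[List[str]]:
    offset = 0
    while True:
        # Collect one slice per loader at the current offset.
        batch: List[str] = []
        for loader in loaders.values():
            batch.extend(loader.load_batch(offset, batch_size))
        if not batch:
            return  # every loader is exhausted
        yield batch
        offset += batch_size


loaders = {"a": ListLoader(["a0", "a1", "a2"]), "b": ListLoader(["b0"])}
batches = list(load_all(loaders, batch_size=2))
```

Because the generator yields one combined batch at a time, a caller can process large sources without holding everything in memory at once.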