Documentation for `TrainTestSplitter`¶

Functionality¶

The TrainTestSplitter class is designed to split fine-tuning clickstream data into training and test sets using a specified ratio. It ensures that related fine-tuning inputs (sharing similar result IDs) remain together during the splitting process. The class also has the option to apply augmentation to the inputs.

Parameters¶

test_size_ratio: A float in (0.0, 1.0) that denotes the proportion of data reserved for testing (default: 0.2).
shuffle: A boolean indicating if the inputs should be shuffled before splitting (default: True).
random_state: An optional integer to ensure reproducibility.
augmenter: An optional augmenter which may be a ClickstreamQueryAugmentationApplier or a callable that applies augmentation to the inputs.
do_augment_test: A boolean that decides if the test split should also be augmented (default: False).

Methods¶

`TrainTestSplitter._augment_clickstream`¶

Functionality¶

This method applies augmentation to clickstream inputs if an augmenter is provided. If no augmenter is set, it returns the original inputs unchanged.

Parameters¶

inputs: List of FineTuningInput objects to be augmented.

Usage¶

Purpose - Enhance fine-tuning inputs by applying query augmentations. If an augmenter is set, it uses its apply_augmentation method to modify the inputs; otherwise, it returns the original list.

Example¶

Assume splitter is an instance of TrainTestSplitter with an augmenter:

augmented_inputs = splitter._augment_clickstream(inputs)

`TrainTestSplitter.shuffle`¶

Functionality¶

This property returns the shuffle setting used by the splitter. It indicates whether the fine-tuning inputs will be randomly rearranged before being split into train and test sets.

Return Value¶

Boolean flag indicating if shuffling is enabled.

`TrainTestSplitter.split`¶

Functionality¶

Splits a list of fine-tuning inputs into training and testing sets. It divides inputs based on result IDs to keep related inputs together. If overlapping result IDs occur, the majority determines the split.

Parameters¶

inputs: List of FineTuningInput instances that represent the fine-tuning input data to be split.

Return¶

Returns a DatasetDict with two keys: - train: A PairedFineTuningInputsDataset instance for training data, potentially augmented. - test: A PairedFineTuningInputsDataset instance for testing data, potentially augmented.

Usage¶

Use this method to maintain consistency among inputs sharing result IDs. It ensures that related inputs are not split across training and testing sets. This approach is critical for tasks with overlapping result contexts.

Example¶

splitter = TrainTestSplitter(test_size_ratio=0.2, random_state=42)
result = splitter.split(inputs)

Documentation for TrainTestSplitter¶

Functionality¶

Parameters¶

Methods¶

TrainTestSplitter._augment_clickstream¶

Functionality¶

Parameters¶

Usage¶

Example¶

TrainTestSplitter.shuffle¶

Functionality¶

Return Value¶

TrainTestSplitter.split¶

Functionality¶

Parameters¶

Return¶

Usage¶

Example¶

Documentation for `TrainTestSplitter`¶

`TrainTestSplitter._augment_clickstream`¶

`TrainTestSplitter.shuffle`¶

`TrainTestSplitter.split`¶