Documentation for TrainTestSplitter
¶
Functionality¶
The TrainTestSplitter
class is designed to split fine-tuning clickstream data into training and test sets using a specified ratio. It ensures that related fine-tuning inputs (sharing similar result IDs) remain together during the splitting process. The class also has the option to apply augmentation to the inputs.
Parameters¶
test_size_ratio
: A float in (0.0, 1.0) that denotes the proportion of data reserved for testing (default: 0.2).shuffle
: A boolean indicating if the inputs should be shuffled before splitting (default: True).random_state
: An optional integer to ensure reproducibility.augmenter
: An optional augmenter which may be aClickstreamQueryAugmentationApplier
or a callable that applies augmentation to the inputs.do_augment_test
: A boolean that decides if the test split should also be augmented (default: False).
Methods¶
TrainTestSplitter._augment_clickstream
¶
Functionality¶
This method applies augmentation to clickstream inputs if an augmenter is provided. If no augmenter is set, it returns the original inputs unchanged.
Parameters¶
inputs
: List ofFineTuningInput
objects to be augmented.
Usage¶
Purpose - Enhance fine-tuning inputs by applying query augmentations. If an augmenter is set, it uses its apply_augmentation
method to modify the inputs; otherwise, it returns the original list.
Example¶
Assume splitter
is an instance of TrainTestSplitter
with an augmenter:
augmented_inputs = splitter._augment_clickstream(inputs)
TrainTestSplitter.shuffle
¶
Functionality¶
This property returns the shuffle setting used by the splitter. It indicates whether the fine-tuning inputs will be randomly rearranged before being split into train and test sets.
Return Value¶
- Boolean flag indicating if shuffling is enabled.
TrainTestSplitter.split
¶
Functionality¶
Splits a list of fine-tuning inputs into training and testing sets. It divides inputs based on result IDs to keep related inputs together. If overlapping result IDs occur, the majority determines the split.
Parameters¶
inputs
: List ofFineTuningInput
instances that represent the fine-tuning input data to be split.
Return¶
Returns a DatasetDict
with two keys:
- train
: A PairedFineTuningInputsDataset
instance for training data, potentially augmented.
- test
: A PairedFineTuningInputsDataset
instance for testing data, potentially augmented.
Usage¶
Use this method to maintain consistency among inputs sharing result IDs. It ensures that related inputs are not split across training and testing sets. This approach is critical for tasks with overlapping result contexts.
Example¶
splitter = TrainTestSplitter(test_size_ratio=0.2, random_state=42)
result = splitter.split(inputs)