Merged Documentation¶
Documentation for Text Mutation Functions¶
This documentation covers a set of functions designed to simulate typographical errors in words. These functions can be utilized for generating misspellings, useful in text augmentation tasks and testing spell checkers.
Function: adjacent_key_error
¶
Functionality¶
Replaces some characters in a word with one of its adjacent keys based on a keyboard layout. Randomly introduces an error according to a given probability.
Parameters¶
word
: The original word to be mutated.
Returns¶
- A new string with some characters replaced, simulating a typing mistake by using adjacent keyboard keys.
Usage¶
- Purpose: Mimics typographical errors typically caused by hitting a key adjacent to the intended one.
Example¶
original = "example"
error_word = adjacent_key_error(original)
print(error_word) # Might print: exzmple
Function: delete_random_character
¶
Functionality¶
Deletes a random character from a provided word. This function simulates human typographical errors by randomly removing one character from the input word.
Parameters¶
word
: The input word as a string. Must have more than one character for the deletion to occur.
Usage¶
Used to generate misspellings for text augmentation tasks.
Example¶
original = "example"
result = delete_random_character(original)
print(result) # might output "exmple" or "exaple"
Function: swap_characters
¶
Functionality¶
Swaps two characters in a string at specified positions.
Parameters¶
string
: The original string.i
: The index of the first character.j
: The index of the second character.
Usage¶
Purpose: Swap two characters in a string.
Example¶
swap_characters("hello", 0, 1) # Result: "ehllo"
Function: swap_random_adjacent_characters
¶
Functionality¶
Swaps two adjacent characters in the given string at random. If the string has fewer than two characters, it returns the original string.
Parameters¶
string
: The original string.
Returns¶
- A new string with one pair of adjacent characters swapped.
Usage¶
Useful for generating common misspellings to test spell checkers. It can be integrated into text processing pipelines.
Example¶
swap_random_adjacent_characters('hello') # might return 'hlelo' depending on the selected index.
Function: insert_random_character
¶
Functionality¶
Inserts a random lowercase character into a given word. The character is inserted at a random position, while the rest of the word remains unchanged.
Parameters¶
word
: The original word into which the character will be inserted.
Usage¶
Used for generating misspelled words by randomly adding extra letters.
Example¶
insert_random_character("example") # 'exxample' (output may vary due to randomness)
Function: random_split
¶
Functionality¶
Randomly splits a word into two parts by inserting a space at a random index.
Parameters¶
word
: The original word as a string.
Usage¶
- Purpose: To divide the input word into two parts with a space in between.
Example¶
Input: "example"
Output: "exam ple"
Function: introduce_misspellings_with_keyboard_map
¶
Functionality¶
This function introduces misspellings to the input text by making random editing operations based on a keyboard layout. It tokenizes the text and applies one of several error functions, simulating common typing mistakes.
Parameters¶
text
: The original text to modify.error_rate
: The probability (0-1) that an error will be introduced for each token. Defaults to 0.1.tokenizer
: An optional tokenizer (following nltk TokenizerI) used to tokenize the text. If None, a default TreebankWordTokenizer is used.
Usage¶
- Purpose: To simulate typical typing errors by introducing random misspellings. Useful for testing and generating synthetic typo data.
Example¶
Input: "Hello world"
Possible output: "Heklo wormd"