API Reference
textnoisr.noise
Add noise into text.
CharNoiseAugmenter
Add noise into text according to a noise level measured between 0 and 1.
It will add noise to a string by modifying each character
according to a probability and a list of actions.
Possible actions are insert, swap, substitute and delete.
For the insert and substitute actions, new characters are drawn from character_set,
which is the set of ASCII letters (lower and upper case) by default.
The swap action swaps 2 consecutive characters,
but a character cannot be swapped twice.
So if a pair of characters has been swapped,
we move to the next pair of characters.
With enough samples, the CER of the output tends to the noise level (terms and conditions may apply, see details in docs/how_this_works.md).
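To make the statement about the CER concrete, here is a minimal sketch of a Character Error Rate computed as the Levenshtein distance divided by the length of the reference. The helper names `levenshtein_sketch` and `cer_sketch` are hypothetical and not part of textnoisr; the library may compute or normalize the metric differently.

```python
def levenshtein_sketch(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(
                min(
                    previous[j] + 1,         # delete r from the reference
                    current[j - 1] + 1,      # insert h into the reference
                    previous[j - 1] + cost,  # substitute r by h, or keep a match
                )
            )
        previous = current
    return previous[-1]


def cer_sketch(reference: str, noised: str) -> float:
    """Edit distance between the two strings, normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein_sketch(reference, noised) / len(reference)
```

Averaging `cer_sketch(original, noised)` over many noised samples is one way to check empirically how close the observed error rate is to the requested noise level.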
Parameters:
Name | Type | Description | Default |
---|---|---|---|
noise_level | float | between 0 and 1, it corresponds to the level of noise to add to the text. In most cases (see details above for caveats), the Character Error Rate of the output will converge to this value. For | required |
actions | tuple[str, ...] | list of actions to use to add noise. Available actions are insert, swap, substitute and delete. Defaults to | _AVAILABLE_ACTIONS |
character_set | tuple[str, ...] | set of characters from which character will be drawn for insert or substitute actions. Defaults to string.ascii_letters. | tuple(ascii_letters) |
seed | int \| None | A seed to ensure reproducibility. Defaults to | None |
natural_language_swap_correction | float | A correction factor to take into account the fact that natural language is not random. Defaults to 1.052, which is the correction factor for English. | 1.052 |
Raises:
Type | Description |
---|---|
ValueError | If the action is not one of the available actions. |
Source code in textnoisr/noise.py
add_noise(text)
Add noise to a text. The text can already be split into words (a list of strings).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str \| list[str] | The text on which to add noise. | required |
Returns:
Type | Description |
---|---|
str \| list[str] | The text with noise. |
Source code in textnoisr/noise.py
consecutive_swap_random_chars(text, p)
Swap random consecutive characters of a string.
Each character of the input string is swapped with the next one with
a probability linked to p (i.e. p after unbiasing).
Note that a character can only be swapped once.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The input string to be modified. | required |
p | float | probability for a character to be swapped. It is modified for the CER of the result to converge to this value. | required |
Returns:
Type | Description |
---|---|
str | A string derived from |
Source code in textnoisr/noise.py
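The "swapped only once" rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the name `swap_consecutive_sketch` and the explicit `rng` argument are assumptions, and `p` is used directly, without the unbiasing step that the real method applies.

```python
import random


def swap_consecutive_sketch(text: str, p: float, rng: random.Random) -> str:
    """Swap adjacent characters with probability p; no character is swapped twice."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # both characters have been used: move to the next pair
        else:
            i += 1  # no swap: the next character can still be swapped
    return "".join(chars)
```

With `p=1.0` every pair is swapped exactly once, so `"abcdef"` becomes `"badcfe"`. Because roughly half of the positions are skipped after a swap, the raw `p` does not equal the resulting error rate, which is why an unbiasing step is needed.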
delete_random_chars(text, p)
Delete random characters of a string.
Each character of the input string is deleted with probability p.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The input string to be modified. | required |
p | float | probability to delete random character | required |
Returns:
Type | Description |
---|---|
str | A string derived from |
Source code in textnoisr/noise.py
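A minimal sketch of per-character deletion, assuming each character is dropped independently with probability `p`. The name `delete_chars_sketch` and the explicit `rng` argument are hypothetical and not part of textnoisr.

```python
import random


def delete_chars_sketch(text: str, p: float, rng: random.Random) -> str:
    """Keep each character with probability 1 - p, drop it otherwise."""
    return "".join(c for c in text if rng.random() >= p)
```

The output is always a subsequence of the input, so its length never exceeds `len(text)`.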
insert_random_chars(text, p)
Insert random characters into a string.
For each character in the input string, a random character is inserted after it
with probability p. The random characters are chosen from self.character_set.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The input string to be modified. | required |
p | float | probability to insert random char | required |
Returns:
Type | Description |
---|---|
str | A string derived from |
Source code in textnoisr/noise.py
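The insertion rule can be sketched in a few lines. The name `insert_chars_sketch`, the explicit `rng` argument and the `character_set` keyword are assumptions made for this illustration; in the library the character set is an attribute of the augmenter.

```python
import random
import string


def insert_chars_sketch(
    text: str,
    p: float,
    rng: random.Random,
    character_set: tuple[str, ...] = tuple(string.ascii_letters),
) -> str:
    """After each character, insert a random one with probability p."""
    out = []
    for c in text:
        out.append(c)
        if rng.random() < p:
            out.append(rng.choice(character_set))
    return "".join(out)
```

With `p=1.0` the output is twice as long as the input, and the original characters sit at the even positions.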
substitute_random_chars(text, p)
Substitute random characters of a string.
Each character of the input string is substituted with another random one with
probability p. The random characters are chosen from self.character_set.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The input string to be modified. | required |
p | float | probability to substitute a character | required |
Returns:
Type | Description |
---|---|
str | A string derived from |
Source code in textnoisr/noise.py
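A minimal sketch of per-character substitution follows. The name `substitute_chars_sketch`, the explicit `rng` argument and the `character_set` keyword are hypothetical. Note that this sketch may draw the same character it replaces, which produces no visible change; how the library handles that case is not stated here.

```python
import random
import string


def substitute_chars_sketch(
    text: str,
    p: float,
    rng: random.Random,
    character_set: tuple[str, ...] = tuple(string.ascii_letters),
) -> str:
    """Replace each character by a random one with probability p."""
    return "".join(
        rng.choice(character_set) if rng.random() < p else c for c in text
    )
```

Substitution never changes the length of the string, unlike insertion and deletion.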
textnoisr.noise_dataset
Noise an NLP dataset.
add_noise(dataset, noise_augmenter, feature_name='tokens', **kwargs)
Add random noise to dataset items.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | dataset containing texts | required |
noise_augmenter | CharNoiseAugmenter | noise augmenter from module | required |
feature_name | str | The name of the dataset feature (column name) on which to add noise (usually "tokens" or "text") | 'tokens' |
**kwargs | Any | refers to huggingface dataset.map() argument, see github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py | {} |
Returns:
Type | Description |
---|---|
Dataset | noised dataset |
Source code in textnoisr/noise_dataset.py
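To illustrate the role of feature_name, here is a simplified stand-in that works on plain dictionaries instead of a Hugging Face Dataset. The names `noise_rows_sketch` and `noise_fn` are hypothetical; the real function takes a CharNoiseAugmenter and forwards **kwargs to the dataset's map() method, neither of which is reproduced here.

```python
from typing import Any, Callable


def noise_rows_sketch(
    rows: list[dict[str, Any]],
    noise_fn: Callable[[Any], Any],
    feature_name: str = "tokens",
) -> list[dict[str, Any]]:
    """Apply noise_fn to one feature (column) of every row, leaving the others intact."""
    return [{**row, feature_name: noise_fn(row[feature_name])} for row in rows]
```

Only the selected column is transformed; every other feature of a row is copied unchanged, and the input rows are not mutated.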
textnoisr.noise_unbiasing
Unbias noise.
See this document.
__compute_expected_cer_from_noise_level(p, N)
Compute the expected CER from an uncorrected (biased) p for a string of length N.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
p | float | Uncorrected (biased) noise level, that is probability to swap an unswapped character | required |
N | int | The length of the string. | required |
Returns:
Type | Description |
---|---|
float | The expected CER (in the sense of the expected value) |
Source code in textnoisr/noise_unbiasing.py
__compute_noise_level_from_expected_cer(cer, N)
cached
Compute the noise level we have to pass as input in order to get the target CER.
This is the real "unbias_swap" function. The other one is just a wrapper.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cer | float | The Character Error Rate we want to have. | required |
N | int | The length of the string. | required |
Returns:
Type | Description |
---|---|
float | Unbiased probability |
Source code in textnoisr/noise_unbiasing.py
unbias_several_action(p, n_actions)
Unbias probability to remove the bias due to the successive actions.
If noise is applied N times, the probability of changing something becomes
p_effective = 1 - (1 - noise_level) ** N,
so we have to invert this relation to have p_effective = p.
Note that to first order of the Taylor expansion, this becomes noise_level = p / len(self.actions).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
p | float | Input probability. The user wants the expectation of the Character Error Rate to tend to this value. | required |
n_actions | int | Number of actions. | required |
Returns:
Type | Description |
---|---|
float | Unbiased probability |
Source code in textnoisr/noise_unbiasing.py
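Inverting p_effective = 1 - (1 - noise_level) ** N for noise_level gives noise_level = 1 - (1 - p) ** (1 / N). The sketch below implements that inversion directly; the name `unbias_several_actions_sketch` is hypothetical, and the library function may differ in details such as input validation.

```python
def unbias_several_actions_sketch(p: float, n_actions: int) -> float:
    """Per-action probability such that n_actions successive passes give p overall."""
    if n_actions < 1:
        raise ValueError("n_actions must be at least 1")
    return 1 - (1 - p) ** (1 / n_actions)
```

For small `p` the result is close to `p / n_actions`, which is the first-order approximation mentioned above.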
unbias_split_into_words(p, text)
Unbias probability to take into account the absence of spaces when splitting.
We need to apply noise word by word in order to have an output list with the same length as the input. If we apply noise word by word, we decrease the effective error rate, since no noise is added to the spaces between words. So we increase the probability in order to compensate for this loss.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
p | float | Input probability. The user wants the expectation of the Character Error Rate to tend to this value. | required |
text | list[str] | Text on which we unbias. | required |
Returns:
Type | Description |
---|---|
float | Unbiased probability |
Source code in textnoisr/noise_unbiasing.py
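One way to reason about this compensation: if the words were joined by single spaces, the full string would contain n_chars + n_spaces characters, but only the n_chars characters inside words can receive noise. Matching the expected number of errors gives p_word * n_chars = p * (n_chars + n_spaces). The sketch below applies that reasoning; the name `unbias_split_into_words_sketch` is hypothetical and the formula is an assumption derived from the description, not a copy of the library code.

```python
def unbias_split_into_words_sketch(p: float, words: list[str]) -> float:
    """Raise p so that noising only the words matches a rate p over words plus spaces."""
    n_chars = sum(len(word) for word in words)
    n_spaces = max(len(words) - 1, 0)  # one space between consecutive words
    if n_chars == 0:
        return p  # nothing to noise: leave the probability unchanged
    return p * (n_chars + n_spaces) / n_chars
```

For example, with `["ab", "cd"]` there are 4 characters and 1 space, so a target of 0.1 becomes 0.125 per character.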
unbias_swap(p, N, natural_language_swap_correction)
cached
Re-compute p to take unbiasing into account.
See doc for more details.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
p | float | Input probability. The user wants the expectation of the Character Error Rate to tend to this value. | required |
N | int | The length of the string. | required |
natural_language_swap_correction | float | A correction factor to take into account the fact that natural language is not random. | required |
Returns:
Type | Description |
---|---|
float | Unbiased probability, using an approximation formula for strings that are too long. |