# How this works

## High level design
As can be seen in the source code, there are three main parts:
- The module `textnoisr/noise_dataset.py` consists of wrapper functions to make this library work seamlessly on the `Dataset` class.
- The module `textnoisr/noise.py` contains a class `CharNoiseAugmenter` that works at the level of a single document. This class is basically a wrapper around four methods (one for each action). In pseudo-python, the first three ("delete", "insert", "substitute") are one-liners like:

  ```python
  def ACTION_random_chars(text: str, p: float) -> str:
      return "".join(DO_STUFF(char) if SOME_CONDITION else char for char in text)
  ```

  It is a little bit more complicated for the "swap" action, since we do not want two non-consecutive characters to be swapped. This detail has two effects:
- The code is a little bit more convoluted for "swap" (a dozen lines instead of two), in order to avoid swapping again a character that has already been swapped.
- It introduces a bias: the Character Error Rate for "swap" no longer tends to the noise level, so the noise level must be unbiased beforehand in the "swap" case. This is taken into account in the last module.
- The module `textnoisr/noise_unbiasing.py` hides the details needed for the Character Error Rate to tend to the noise level.
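To make the one-liner pattern above concrete, here is a minimal, self-contained sketch of a "delete" action. The function name and the `seed` parameter are illustrative, not the library's actual API:

```python
import random

def delete_random_chars(text: str, p: float, seed=None) -> str:
    """Delete each character independently with probability p (illustrative sketch)."""
    rng = random.Random(seed)  # local RNG so results are reproducible with a seed
    return "".join(char for char in text if rng.random() >= p)
```

At the extremes, `p=0` leaves the text untouched and `p=1` deletes every character; in between, the expected fraction of deleted characters is `p`.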
## Advanced: understanding the unbiasing
As previously said, one important feature of this module is that we want the noise level to be compatible with the notion of Character Error Rate. More precisely, we want (as far as possible) the expected value of the Character Error Rate of the output to be the `noise_level` given as input.
Several aspects have been taken into account to enforce this behavior, and several biases have been removed.
### Actions applied successively
When actions are applied successively to the whole text, the text will be processed several times. If an action is applied \(N\) times, each time with probability \(p\), the total probability \(P\) to change a given character becomes

\[P = 1 - (1 - p)^N\]
In our case, \(N = \mathtt{len(actions)}\) and \(p = \mathtt{noise\_level}\), so that

\[\mathtt{effective\_noise\_level} = 1 - (1 - \mathtt{noise\_level})^{\mathtt{len(actions)}}\]
We have to modify the input \(\mathtt{noise\_level}\) in order to get the expected \(\mathtt{effective\_noise\_level}\):

\[\mathtt{noise\_level} \leftarrow 1 - (1 - \mathtt{noise\_level})^{1 / \mathtt{len(actions)}}\]
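This inversion can be sketched in a few lines. The helper names below are hypothetical, chosen only to illustrate the round trip between the per-action probability and the effective noise level:

```python
def unbias_for_successive_actions(noise_level: float, n_actions: int) -> float:
    """Per-action probability p such that n_actions independent passes,
    each changing a character with probability p, change it overall
    with probability noise_level."""
    return 1 - (1 - noise_level) ** (1 / n_actions)

def effective_noise_level(p: float, n_actions: int) -> float:
    """Overall probability that a character is changed by at least one pass."""
    return 1 - (1 - p) ** n_actions

# Round trip: unbiasing then re-applying recovers the requested level.
p = unbias_for_successive_actions(0.10, 3)
assert abs(effective_noise_level(p, 3) - 0.10) < 1e-12
```

With a single action the correction is the identity (up to floating point), since there is nothing to compensate.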
### List of words
Note that the `CharNoiseAugmenter` can add noise to text given as a list of words instead of a single string. In that case, we need to apply noise word by word in order to produce an output list with the same length as the input. Applying noise word by word decreases the effective Character Error Rate of the whole string, since no noise is added to the spaces between words. So we increase the probability in order to compensate for this loss:

\[p \leftarrow p \times \frac{n_{chars} + n_{spaces}}{n_{chars}}\]
where \(n_{chars}\) is the sum of the number of characters in each word, and \(n_{spaces}\) the number of spaces between words in the string.
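A minimal sketch of this correction, using a hypothetical helper name rather than the library's actual API:

```python
def unbias_for_word_list(p: float, words: list) -> float:
    """Scale p to compensate for the noise-free spaces between words."""
    n_chars = sum(len(word) for word in words)  # characters inside the words
    n_spaces = len(words) - 1                   # spaces between the words
    return p * (n_chars + n_spaces) / n_chars

unbias_for_word_list(0.10, ["hello", "world"])  # 10 chars, 1 space: roughly 0.11
```

The more words (hence spaces) there are relative to characters, the larger the correction factor.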
### Action delete

No bias to be corrected here.

### Action insert

No bias to be corrected here.
### Action substitute
No bias to be corrected here, since we ensure that a character is not substituted by itself.
### Action swap
Huge bias to be corrected here.
TL;DR: A correction using Markov chains has been implemented for the Character Error Rate to converge to `noise_level`.
An extra adjustment factor is then applied
to take into account the structured pattern of natural language.
If you want to know the gory details, you may want to check this dedicated document.
## Conclusion

### What works
The implementation of the `CharNoiseAugmenter` takes all six of these aspects into account, so the user just has to pass the `noise_level` they expect the Character Error Rate to reach at the end.
Warning
Be aware that some effects may still remain:
- When using `["delete", "insert"]`, it is possible to delete a character and then insert the very same one, resulting in no error where two are naively expected.
- For very high noise levels, the computation of the Character Error Rate may not reflect the actual actions performed.
- A lot of these results are based on the assumption that the character set is very large. This does not really hold for real words: for example, swapping the last two characters of "all" results in no change.
Overall, the unit tests show that on a real-world dataset for a noise level of 10%, the absolute error between the Character Error Rate and the noise level is less than one percentage point.