Understanding swap unbiasing
Understanding why we need to unbias the swap action
Naive approach
A naive approach to swapping consecutive characters would be the following:
```python
import random


def naive_swap_random_chars(text: str, p: float) -> str:
    char_list = list(text)
    for i in range(len(char_list) - 1):
        # Swap the current character with the next one with probability p
        if random.random() < p:
            char_list[i], char_list[i + 1] = char_list[i + 1], char_list[i]
    return "".join(char_list)
```
But the problem here is that one character can be swapped several times in a row, like the character 0 in the following example, where p = 1 makes every swap happen (a deterministic illustration):
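```python
>>> naive_swap_random_chars("0123456789", p=1)  # "0" is swapped 9 times in a row
'1234567890'
```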
This is not what is expected when "swapping consecutive characters".
Less naive approach
To correct for this problem, we implement an approach that prevents swapping the current character if it has already been swapped. The function naive_swap_random_chars could be modified as follows[1]:
```python
import random


def less_naive_swap_random_chars(text: str, p: float) -> str:
    char_list = list(text)
    swapped_indices = set()
    for i in range(len(char_list) - 1):
        if random.random() < p and i not in swapped_indices:
            char_list[i], char_list[i + 1] = char_list[i + 1], char_list[i]
            swapped_indices.add(i + 1)
    return "".join(char_list)
```
Unfortunately (and contrary to the case of the actions delete, substitute and add), there is no easy link between the noise level given as input and the Character Error Rate of the output. This is easily seen on the "calibration curve" of the less_naive_swap_random_chars(text, p) function, giving the CER with respect to the input noise level \(p\):
This curve was obtained by running less_naive_swap_random_chars on the string "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" 10,000 times for every percent of noise level.
One can observe that it is not even monotonic, and that it does not reach 100%.
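Such a curve can be reproduced with a short script. The following sketch uses less_naive_swap_random_chars defined above, together with a textbook Levenshtein implementation and fewer trials than the 10,000 above, for a quicker run:

```python
import string


def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def empirical_cer(p: float, n_trials: int = 1000) -> float:
    """Average CER of less_naive_swap_random_chars at noise level p."""
    text = string.ascii_lowercase + string.ascii_uppercase
    total = sum(
        levenshtein(text, less_naive_swap_random_chars(text, p))
        for _ in range(n_trials)
    )
    return total / (n_trials * len(text))


# One point per percent of noise level, as in the curve above
calibration = [(k / 100, empirical_cer(k / 100)) for k in range(101)]
```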
One "argument by hand-waving" explains why this function is not the identity, as for delete
, substitute
and add
: preventing swapping the current character if it has already been swapped
introduces a correlation between two consecutive characters.
This makes the Central Limit Theorem irrelevant to compute Character Error Rate from the probability to swap a single character.
The actual approach we are using consists in deriving the theoretical formula of this relationship between Character Error Rate and noise level, and plugging its inverse into the function that prevents swapping a character twice, less_naive_swap_random_chars:
```python
def consecutive_swap_random_chars(text: str, p: float) -> str:
    p = unbias.unbias_swap(p, len(text))
    return less_naive_swap_random_chars(text, p)
```
Let's study the impact of not swapping a character twice on the Character Error Rate!
Modeling the problem
Character Error Rate as "normalized Levenshtein distance"
The Character Error Rate is the edit distance between two strings, divided by the length of the reference string. In this case, the "edit distance" is the Levenshtein distance, i.e. the minimum number of the following operations needed to transform one string into the other:
- inserting a character, e.g. STEAM → STREAM,
- deleting a character, e.g. STEAM → TEAM,
- substituting a character, e.g. STEAM → STEAL.
For example, the Character Error Rate between STEAM and TEAM is 1/5 = 0.2 (one deletion, reference length five).
Another "argument by hand-waving" explains the presence of a bias: swapping two (consecutive) characters is not stricto sensu an "atomic action" for Levenshtein distance2.
We could claim that swapping two characters is equivalent to one deletion followed by one insertion, and that it should count as a distance of two. For example with 12345678 ↔ 12354678:

```
12345678
123 5678   <-- 1 deletion of "4"
12354678   <-- 1 insertion of "4"
```
But this claim does not always hold. For example, the two swaps of 12345678 ↔ 12354768 can be treated as only three "atomic" operations, instead of four as naively expected:

```
12354768
1234768    <-- 1 deletion of "5"
1234568    <-- 1 substitution of "7" by "5"
12345678   <-- 1 insertion of "7"
```
However, this is only possible because the two swap operations are contiguous.
This observation[3] leads us to the following result:
Contribution of individual swaps to total Levenshtein distance
For each character of the resulting string, there are three possibilities:
- The current character was not swapped: its contribution to the total Levenshtein distance is zero.
- The current character was swapped, and the next-to-last[4] one was not: its contribution to the total Levenshtein distance is two. This is the naive case of 12354678, as well as the first swap of 12354768, or the two swaps of 12345678 ↔ 21345768.
- The current character was swapped, and the next-to-last[4] one was also swapped: its contribution to the total Levenshtein distance is only one. This is the case of the second swap of 12354768.
Modelization of "swap" action using Markov Chain
As seen in the previous section, only the last two characters are needed to compute the Character Error Rate. So let's represent the process of iterating through a string and swapping characters by a Markov chain with states denoted \(XY\), where
- \(X\) represents the status of the previous swap transition,
- \(Y\) represents the status of the current swap transition.
Those transitions are
- with probability \(p\), "the swap was allowed, and effectively happened",
- with probability \(q = 1-p\), "the swap was allowed, but did not happen",
- with probability \(1\), "the swap was not allowed".
We represent the statuses as uppercase versions of the corresponding transitions, so that \(X\) and \(Y\) can be \(P\), \(Q\) or \(1\). For the sake of completeness, we add a state \(0\) (and the states \(0P\) and \(0Q\)) to represent the very beginning of the string, when there is no "previous swap".
The whole Markov chain can be constructed by iteratively applying legal transitions from the state \(0\), and is given below:
```mermaid
graph LR
    1P((1P)) --1--> P1((P1))
    P1((P1)) --p--> 1P((1P))
    P1((P1)) --q--> 1Q((1Q))
    QP((QP)) --1--> P1((P1))
    1Q((1Q)) --q--> QQ((QQ))
    1Q((1Q)) --p--> QP((QP))
    QQ((QQ)) --p--> QP((QP))
    QQ((QQ)) --q--> QQ((QQ))
    0P((0P)) --1--> P1((P1))
    0((0)) --q--> 0Q((0Q))
    0Q((0Q)) --p--> QP((QP))
    0Q((0Q)) --q--> QQ((QQ))
    0((0)) --p--> 0P((0P))
```
Some facts are worth checking:
- Every \(1\) transition follows an \(XP\) state (i.e. "the only possibility not to have a choice is to directly follow a swap").
- Some states do not exist, like \(11\) (i.e. "if the swap was not allowed, it is allowed again at the next iteration") or \(PQ\) and \(PP\) (i.e. "no randomness is at play when the previous transition was a swap; the only possibility is not to swap").
- By construction, every arrow has the form

```mermaid
graph LR
    XY((XY)) --z--> YZ((YZ))
```
Putting it together: computing the expected Levenshtein distance
Let's order the states of our Markov chain to get a basis \(B = (1P, 1Q, P1, QP, QQ, 0P, 0Q, 0)\).
In this basis,
- the vector that represents the contribution of each state to the total Levenshtein distance is \(c = (1, 0, 0, 2, 0, 2, 0, 0)\),
- the probability distribution at start is the vector \(\pi^{(0)} = (0, 0, 0, 0, 0, 0, 0, 1)\), since we start from the starting state \(0\) with a probability of one,
- and the transition matrix is

\[
P = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & p & q & 0 & 0 & 0 \\
p & q & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & p & q & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & p & q & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & p & q & 0
\end{pmatrix}
\]
After the \(i\)-th iteration, the probability distribution over the Markov chain is given by \(\pi^{(i)} = \pi^{(0)} P^i\), and its expected contribution to the Levenshtein distance is \(\pi^{(i)} c^\top = \pi^{(0)} P^i c^\top\).
Finally, the total expected Levenshtein distance for a string of length \(N\) is[5]

\[
\mathbb{E}[D_N] = \sum_{i=1}^{N-1} \pi^{(0)} P^i c^\top
\]
This quantity (up to the normalization factor \(N\)) is what is computed in the function compute_expected_cer_from_noise_level, and it is then inverted using scipy in the function compute_noise_level_from_expected_cer.
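Sketched below is what this computation can look like. The function names mirror those mentioned above, but the bodies are an illustration of the method, not textnoisr's actual code:

```python
import numpy as np
from scipy import optimize


def compute_expected_cer_from_noise_level(p: float, n: int) -> float:
    """Expected CER for a string of length n (sketch of the method above)."""
    q = 1.0 - p
    # Transition matrix in the basis B = (1P, 1Q, P1, QP, QQ, 0P, 0Q, 0)
    P = np.array([
        [0, 0, 1, 0, 0, 0, 0, 0],  # 1P --1--> P1
        [0, 0, 0, p, q, 0, 0, 0],  # 1Q --p--> QP, --q--> QQ
        [p, q, 0, 0, 0, 0, 0, 0],  # P1 --p--> 1P, --q--> 1Q
        [0, 0, 1, 0, 0, 0, 0, 0],  # QP --1--> P1
        [0, 0, 0, p, q, 0, 0, 0],  # QQ --p--> QP, --q--> QQ
        [0, 0, 1, 0, 0, 0, 0, 0],  # 0P --1--> P1
        [0, 0, 0, p, q, 0, 0, 0],  # 0Q --p--> QP, --q--> QQ
        [0, 0, 0, 0, 0, p, q, 0],  # 0  --p--> 0P, --q--> 0Q
    ])
    c = np.array([1.0, 0, 0, 2, 0, 2, 0, 0])   # contributions to the distance
    pi = np.array([0.0, 0, 0, 0, 0, 0, 0, 1])  # start in state 0
    expected_distance = 0.0
    for _ in range(n - 1):  # one transition per swap decision, cf. footnote 5
        pi = pi @ P
        expected_distance += pi @ c
    return expected_distance / n


def compute_noise_level_from_expected_cer(cer: float, n: int) -> float:
    """Invert the previous function with a scipy root finder (sketch)."""
    # Assumes cer is reachable; since the curve is not monotonic, a real
    # implementation would need more care near the maximum.
    return optimize.brentq(
        lambda p: compute_expected_cer_from_noise_level(p, n) - cer, 0.0, 1.0
    )
```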
Result in the general case
The compute_noise_level_from_expected_cer function is used to unbias the swap action.
Notice that all of this is heavily cached to minimize computation.
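For instance, memoization with functools could look like this sketch (the library's actual caching strategy may differ):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_noise_level_from_expected_cer(cer: float, length: int) -> float:
    # Hashable arguments, so each (cer, length) pair is computed only once
    return compute_noise_level_from_expected_cer(cer, length)
```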
The special case of a very long string
For a very long string, the expected Character Error Rate does not depend on the length of the string, but only on the input probability \(p\).
In other words, the stationary distribution hypothesis holds, and the transient states disappear:
```mermaid
graph LR
    1P((1P)) --1--> P1((P1))
    P1((P1)) --p--> 1P((1P))
    P1((P1)) --q--> 1Q((1Q))
    QP((QP)) --1--> P1((P1))
    1Q((1Q)) --q--> QQ((QQ))
    1Q((1Q)) --p--> QP((QP))
    QQ((QQ)) --p--> QP((QP))
    QQ((QQ)) --q--> QQ((QQ))
```
With a truncated basis \(B' = (1P, 1Q, P1, QP, QQ)\),
the "individual contribution to Levenshtein" vector becomes \(c' = (1, 0, 0, 2, 0)\),
and the transition matrix is

\[
P' = \begin{pmatrix}
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & p & q \\
p & q & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & p & q
\end{pmatrix}
\]
Let \((x, y, z, t, u)\) be the components of \(\pi\) in the basis \(B'\). The stationary distribution hypothesis allows us to write, for the stationary distribution \(\pi\):

\[
\pi P' = \pi
\]

which corresponds to the linear system

\[
\begin{cases}
x = p\,z \\
y = q\,z \\
z = x + t \\
t = p\,(y + u) \\
u = q\,(y + u) \\
x + y + z + t + u = 1
\end{cases}
\]

Solving gives \(z = \frac{p}{1+p}\), \(x = \frac{p^2}{1+p}\), \(y = t = \frac{pq}{1+p}\) and \(u = \frac{q^2}{1+p}\). The system is solved to find the value of \(c' \pi^\top = x + 2t\):

\[
c' \pi^\top = \frac{p^2 + 2pq}{1+p} = \frac{p\,(2 - p)}{1 + p}
\]
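This closed form is easy to double-check symbolically. Here is a sketch using sympy, where the redundant equation \(z = x + t\) is omitted, since it follows from the others:

```python
import sympy as sp

p = sp.symbols("p", positive=True)
q = 1 - p
x, y, z, t, u = sp.symbols("x y z t u")

# Stationary equations pi P' = pi (one redundant equation omitted),
# plus the normalization of the distribution.
solution = sp.solve(
    [
        sp.Eq(x, p * z),         # inflow into 1P
        sp.Eq(y, q * z),         # inflow into 1Q
        sp.Eq(t, p * (y + u)),   # inflow into QP
        sp.Eq(u, q * (y + u)),   # inflow into QQ
        sp.Eq(x + y + z + t + u, 1),
    ],
    [x, y, z, t, u],
    dict=True,
)[0]

cer = sp.simplify(solution[x] + 2 * solution[t])
print(cer)  # p*(2 - p)/(p + 1), up to sympy's formatting
```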
This is exactly the curve we got experimentally with the "calibration curve" in the first section!
We then invert this formula to feed the function unbias_swap, in the case where the number of characters is greater than 50:

\[
p = \frac{(2 - \mathrm{CER}) - \sqrt{\mathrm{CER}^2 - 8\,\mathrm{CER} + 4}}{2}
\]
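In code, this long-string branch could look like the following sketch (unbias_swap_long_string is a hypothetical name, not textnoisr's API):

```python
import math


def unbias_swap_long_string(cer: float) -> float:
    """Closed-form inverse of cer = p(2 - p) / (1 + p).

    Only valid up to the maximum reachable CER, 4 - 2*sqrt(3) ~= 0.536.
    """
    if cer > 4 - 2 * math.sqrt(3):
        raise ValueError("CER not reachable with swap operations only")
    return ((2 - cer) - math.sqrt(cer**2 - 8 * cer + 4)) / 2
```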
Result in the special case of a very long string
As mentioned earlier, there is a maximum reachable Character Error Rate when applying only swap operations.
This maximum is directly due to the constraint of preventing swapping the current character if it has already been swapped. From the formula above, it equals \((\sqrt{3} - 1)^2 = 4 - 2\sqrt{3} \approx 0.54\), reached at \(p = \sqrt{3} - 1\).
Taking into account repetitions of characters in natural language
In natural language, our assumption that two consecutive characters are always different does not always hold: swapping two identical characters leaves the string unchanged, so the observed Character Error Rate is slightly lower than predicted. To take this into account, we have to introduce a small multiplicative coefficient to the probability \(p\).
For the English language, we computed this correction factor to be approximately 1.052, which is the default in textnoisr.
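One plausible way to estimate such a factor, shown here as an assumption rather than the actual procedure behind the 1.052 value: if a fraction \(r\) of adjacent character pairs in a corpus are identical, swapping such a pair is a no-op, so \(p\) can be scaled by roughly \(1/(1 - r)\):

```python
def repetition_correction_factor(corpus: str) -> float:
    """Hypothetical estimate: swaps of identical pairs are no-ops,
    so compensate by the share r of identical adjacent pairs."""
    pairs = list(zip(corpus, corpus[1:]))
    r = sum(a == b for a, b in pairs) / len(pairs)
    return 1 / (1 - r)
```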
1. Our actual implementation is slightly different from this one.
2. Notice that the Damerau-Levenshtein distance does include swap as an atomic operation. Using this distance instead of the Levenshtein distance makes the bias simpler to compute. But the first "argument by hand-waving" still holds, so the result is not completely trivial. This is beyond the scope of this library.
3. A formal proof can be given by induction with the Wagner-Fischer algorithm.
4. In this case, the last character was not swapped, due to the aforementioned constraint preventing swapping the current character if it has already been swapped. That is why we have to look at the next-to-last character.
5. In the actual implementation, a for loop is preferred over exponentiation of matrices for performance reasons.