API Reference

textnoisr.noise

Add noise into text.

CharNoiseAugmenter

Add noise into text according to a noise level measured between 0 and 1.

It will add noise to a string by modifying each character according to a probability and a list of actions. Possible actions are insert, swap, substitute and delete.

For actions insert and substitute, new characters are drawn from character_set which is the set of ascii letters (lower and upper) by default. The swap action swaps 2 consecutive characters, but one character can not be swapped twice. So if a pair of characters has been swapped, we move to the next pair of characters.

With enough samples, the CER of the output tends to the noise level (terms and conditions may apply, see details in docs/how_this_works.md).

Parameters:

Name Type Description Default
noise_level float

between 0 and 1, it corresponds to the level of noise to add to the text. In most cases (see details above for caveats), the Character Error Rate of the output will converge to this value. For swap actions, it is impossible to have a CER greater than 0.54214, so an exception is raised in this case.

required
actions tuple[str, ...]

list of actions to use to add noise. Available actions are insert, swap, substitute and delete. Defaults to [insert, swap, substitute, delete].

_AVAILABLE_ACTIONS
character_set tuple[str, ...]

set of characters from which characters will be drawn for insert or substitute actions. Defaults to string.ascii_letters.

tuple(ascii_letters)
seed int | None

A seed to ensure reproducibility. Defaults to None.

None
natural_language_swap_correction float

A correction factor to take into account the fact that natural language is not random. Defaults to 1.052, which is the correction factor for English.

1.052

Raises:

Type Description
ValueError

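If an action is not one of the available actions, if noise_level is not between 0 and 1, or if noise_level exceeds the maximum reachable CER while swap is among the actions.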

Source code in textnoisr/noise.py
class CharNoiseAugmenter:
    r"""Add noise into text according to a noise level measured between 0 and 1.

    It will add noise to a string by modifying each character
        according to a probability and a list of actions.
        Possible actions are `insert`, `swap`, `substitute` and `delete`.

    For actions `insert` and `substitute`, new characters are drawn from `character_set`
        which is the set of ascii letters (lower and upper) by default.
    The `swap` action swaps 2 consecutive characters,
        but **one character can not be swapped twice**.
        So if a pair of characters has been swapped,
        we move to the next pair of characters.

    With enough samples, the CER of the output tends to the noise level
        (terms and conditions may apply,
        see details in [docs/how_this_works.md](how_this_works.md)).

    Args:
        noise_level: between 0 and 1, it corresponds to the level of noise to add to the
            text. In most cases (see details above for caveats),
            the Character Error Rate of the output will converge to this value.
            For `swap` actions, it is impossible to have a CER greater than 0.54214,
            so an exception is raised in this case.
        actions: list of actions to use to add noise. Available actions are *insert*,
            *swap*, *substitute* and *delete*.
            Defaults to `[insert, swap, substitute, delete]`.
        character_set: set of characters from which characters will be drawn for
            *insert* or *substitute* actions. Defaults to string.ascii_letters.
        seed: A seed to ensure reproducibility.
            Defaults to `None`.
        natural_language_swap_correction: A correction factor to take into account the
            fact that natural language is not random.
            Defaults to 1.052, which is the correction factor for English.

    Raises:
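        ValueError: If an action is not one of the available actions, if
            `noise_level` is not between 0 and 1, or if `noise_level` exceeds the
            maximum reachable CER while `swap` is among the actions.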
    """

    _AVAILABLE_ACTIONS = ("insert", "swap", "substitute", "delete")

    def __init__(
        self,
        noise_level: float,
        actions: tuple[str, ...] = _AVAILABLE_ACTIONS,
        character_set: tuple[str, ...] = tuple(string.ascii_letters),
        seed: int | None = None,
        natural_language_swap_correction: float = 1.052,
    ) -> None:
        self.actions = [
            x for i, x in enumerate(actions) if x not in actions[:i]
        ]  # To avoid using list(set(actions))
        self.character_set = character_set
        self.noise_level = noise_level
        self.random = random.Random(seed)  # nosec
        self.natural_language_swap_correction = natural_language_swap_correction

        # checks
        unsupported_actions = [
            a for a in self.actions if a not in CharNoiseAugmenter._AVAILABLE_ACTIONS
        ]
        if unsupported_actions:
            raise ValueError(
                f"You provide unsupported actions: {unsupported_actions}. Available"
                f" actions are {CharNoiseAugmenter._AVAILABLE_ACTIONS}"
            )
        if not 0 <= self.noise_level <= 1:
            raise ValueError(
                "Noise level must be between 0 and 1 (included), you provide"
                f" {self.noise_level}"
            )
        if (
            self.noise_level
            > unbias.MAX_SWAP_LEVEL / self.natural_language_swap_correction
        ) and "swap" in self.actions:
            raise ValueError(
                "You cannot have a CER greater than"
                f" {unbias.MAX_SWAP_LEVEL / self.natural_language_swap_correction} when"
                " using action `swap`"
            )

    def _random_success(self, p: float) -> bool:
        """Determine whether a random event is successful based on a probability value.

        Args:
            p: The probability value for the random event (must be between 0 and 1).

        Returns:
            True with probability `p`, False otherwise.
        """
        return self.random.random() < p  # nosec

    def _random_char(self, p: float, character_set: tuple[str, ...]) -> str:
        """Return a random character with probability `p`, or an empty string.

        Args:
            p: A value between 0 and 1 representing the probability to return a random
                character
            character_set: A character set, for `random.choice()` to choose from.

        Returns:
            A random character with probability `p`, or an empty string.
        """
        return self._random_success(p) * self.random.choice(character_set)  # nosec

    def insert_random_chars(self, text: str, p: float) -> str:
        """Insert random characters into a string.

        For each character in the input string, a random character is inserted after it
        with probability `p`. The random characters are chosen from self.character_set.

        Args:
            text: The input string to be modified.
            p: probability to insert random char

        Returns:
            A string derived from `text` with random characters potentially inserted
                after each character.
        """
        return "".join(char + self._random_char(p, self.character_set) for char in text)

    def _choose_another_character(self, char):
        other_char = self.random.choice(self.character_set)
        while other_char == char:
            other_char = self.random.choice(self.character_set)
        return other_char

    def substitute_random_chars(self, text: str, p: float) -> str:
        """Substitute random characters of a string.

        Each character of the input string is substituted with another random one with
        probability `p`. The random characters are chosen from self.character_set.

        Args:
            text: The input string to be modified.
            p: probability to substitute a character

        Returns:
            A string derived from `text` with potentially substituted characters.
        """
        return "".join(
            self._choose_another_character(char) if self._random_success(p) else char
            for char in text
        )

    def delete_random_chars(self, text: str, p: float) -> str:
        """Delete random characters of a string.

        Each character of the input string is deleted with probability `p`.

        Args:
            text: The input string to be modified.
            p: probability to delete random character

        Returns:
            A string derived from `text` with potentially deleted characters.
        """
        return "".join(["" if self._random_success(p) else char for char in text])

    def consecutive_swap_random_chars(self, text: str, p: float) -> str:
        """Swap random consecutive characters of a string.

        Each character of the input string is swapped with the next one with
        a probability linked to `p` (i.e. after unbiasing).
        Notice that a character can only be swapped once.

        Args:
            text: The input string to be modified.
            p: probability for a character to be swapped.
                It is modified for the CER of the result to converge to this value.

        Returns:
            A string derived from `text` with potentially swapped characters.
        """
        p = unbias.unbias_swap(p, len(text), self.natural_language_swap_correction)

        result = []
        was_swapped = False
        for current_char, next_char in zip_longest(text, text[1:], fillvalue=""):
            if not was_swapped:
                if self._random_success(p):
                    result.extend([next_char, current_char])
                    was_swapped = True
                    continue
                result.append(current_char)
            was_swapped = False
        return "".join(result)

    def add_noise(self, text: str | list[str]) -> str | list[str]:
        """Add noise to a text. The text can be a string or a list of words.

        Args:
            text: The text on which to add noise.

        Returns:
            The text with noise.
        """
        if isinstance(text, list):
            p = unbias.unbias_split_into_words(self.noise_level, text)
        else:
            p = self.noise_level

        p_effective = unbias.unbias_several_action(p, len(self.actions))

        for action in self.actions:
            match action:
                case "insert":
                    action_function = self.insert_random_chars
                case "swap":
                    action_function = self.consecutive_swap_random_chars
                case "substitute":
                    action_function = self.substitute_random_chars
                case "delete":
                    action_function = self.delete_random_chars
                case _:
                    raise ValueError(
                        "Action should be one of"
                        f" {CharNoiseAugmenter._AVAILABLE_ACTIONS!r}"
                    )

            if isinstance(text, list):
                text = [action_function(word, p_effective) for word in text]
            else:
                text = action_function(text, p_effective)
        return text

add_noise(text)

Add noise to a text. The text can be a string or a list of words.
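The way the two input types are handled can be illustrated with a minimal, self-contained sketch. It does not import textnoisr: `apply_actions`, `delete` and `insert_x` are hypothetical stand-ins, and the unbiasing applied by the real method before the loop is left out. It only shows that each action is applied in turn, and that a list input is processed word by word so the output list keeps the same length.

```python
import random
from collections.abc import Callable
from typing import Union

rng = random.Random(0)  # seeded for reproducibility


def delete(text: str, p: float) -> str:
    """Toy action (hypothetical): drop each character with probability `p`."""
    return "".join(char for char in text if rng.random() >= p)


def insert_x(text: str, p: float) -> str:
    """Toy action (hypothetical): add an "x" after each character with probability `p`."""
    return "".join(char + ("x" if rng.random() < p else "") for char in text)


def apply_actions(
    text: Union[str, list[str]],
    actions: list[Callable[[str, float], str]],
    p: float,
) -> Union[str, list[str]]:
    """Apply every action in turn, word by word when `text` is a list."""
    for action in actions:
        if isinstance(text, list):
            text = [action(word, p) for word in text]
        else:
            text = action(text, p)
    return text


words = ["the", "quick", "fox"]
noised_words = apply_actions(words, [delete, insert_x], 0.2)
noised_sentence = apply_actions("the quick fox", [delete, insert_x], 0.2)
print(noised_words, noised_sentence)
```

A word may end up empty after deletions, but it still occupies its slot in the list, which is what keeps token-level labels aligned with the noised tokens.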

Parameters:

Name Type Description Default
text str | list[str]

The text on which to add noise.

required

Returns:

Type Description
str | list[str]

The text with noise.

Source code in textnoisr/noise.py
def add_noise(self, text: str | list[str]) -> str | list[str]:
    """Add noise to a text. The text can be a string or a list of words.

    Args:
        text: The text on which to add noise.

    Returns:
        The text with noise.
    """
    if isinstance(text, list):
        p = unbias.unbias_split_into_words(self.noise_level, text)
    else:
        p = self.noise_level

    p_effective = unbias.unbias_several_action(p, len(self.actions))

    for action in self.actions:
        match action:
            case "insert":
                action_function = self.insert_random_chars
            case "swap":
                action_function = self.consecutive_swap_random_chars
            case "substitute":
                action_function = self.substitute_random_chars
            case "delete":
                action_function = self.delete_random_chars
            case _:
                raise ValueError(
                    "Action should be one of"
                    f" {CharNoiseAugmenter._AVAILABLE_ACTIONS!r}"
                )

        if isinstance(text, list):
            text = [action_function(word, p_effective) for word in text]
        else:
            text = action_function(text, p_effective)
    return text

consecutive_swap_random_chars(text, p)

Swap random consecutive characters of a string.

Each character of the input string is swapped with the next one with a probability linked to p (i.e. after unbiasing). Notice that a character can only be swapped once.
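The "swapped only once" rule is easiest to see with the probability pushed to 1. The sketch below is a standalone illustration, not the library method: `swap_chars` is a hypothetical name, and it uses the raw probability `p` directly, skipping the unbiasing step.

```python
import random
from itertools import zip_longest


def swap_chars(text: str, p: float, rng: random.Random) -> str:
    """Swap consecutive characters with raw probability `p` (no unbiasing applied)."""
    result = []
    was_swapped = False
    for current_char, next_char in zip_longest(text, text[1:], fillvalue=""):
        if was_swapped:
            # `current_char` was already emitted as part of the previous swap.
            was_swapped = False
            continue
        if rng.random() < p:
            result.extend([next_char, current_char])
            was_swapped = True
        else:
            result.append(current_char)
    return "".join(result)


rng = random.Random(0)
print(swap_chars("abcdef", 1.0, rng))  # every pair is swapped once: "badcfe"
print(swap_chars("abc", 1.0, rng))     # the last character has no partner: "bac"
```

Even with `p = 1`, the string is never fully reversed or shifted: once a pair has been swapped, both of its characters are left alone.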

Parameters:

Name Type Description Default
text str

The input string to be modified.

required
p float

probability for a character to be swapped. It is modified for the CER of the result to converge to this value.

required

Returns:

Type Description
str

A string derived from text with potentially swapped characters.

Source code in textnoisr/noise.py
def consecutive_swap_random_chars(self, text: str, p: float) -> str:
    """Swap random consecutive characters of a string.

    Each character of the input string is swapped with the next one with
    a probability linked to `p` (i.e. after unbiasing).
    Notice that a character can only be swapped once.

    Args:
        text: The input string to be modified.
        p: probability for a character to be swapped.
            It is modified for the CER of the result to converge to this value.

    Returns:
        A string derived from `text` with potentially swapped characters.
    """
    p = unbias.unbias_swap(p, len(text), self.natural_language_swap_correction)

    result = []
    was_swapped = False
    for current_char, next_char in zip_longest(text, text[1:], fillvalue=""):
        if not was_swapped:
            if self._random_success(p):
                result.extend([next_char, current_char])
                was_swapped = True
                continue
            result.append(current_char)
        was_swapped = False
    return "".join(result)

delete_random_chars(text, p)

Delete random characters of a string.

Each character of the input string is deleted with probability p.
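As a minimal, self-contained sketch of this behaviour (not the library's code path; `delete_chars` and `rng` are illustrative names), the snippet below deletes each character independently with probability `p`. For deletions alone, the edit distance to the original equals the number of removed characters, so the CER can be read directly from the lengths. On a long string it should land close to `p`.

```python
import random


def delete_chars(text: str, p: float, rng: random.Random) -> str:
    """Drop each character of `text` independently with probability `p`."""
    return "".join(char for char in text if rng.random() >= p)


rng = random.Random(0)  # seeded for reproducibility
text = "a" * 10_000
noised = delete_chars(text, 0.1, rng)

# With deletions only, CER = (number of deleted characters) / (original length).
cer = (len(text) - len(noised)) / len(text)
print(f"CER: {cer:.3f}")
```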

Parameters:

Name Type Description Default
text str

The input string to be modified.

required
p float

probability to delete random character

required

Returns:

Type Description
str

A string derived from text with potentially deleted characters.

Source code in textnoisr/noise.py
def delete_random_chars(self, text: str, p: float) -> str:
    """Delete random characters of a string.

    Each character of the input string is deleted with probability `p`.

    Args:
        text: The input string to be modified.
        p: probability to delete random character

    Returns:
        A string derived from `text` with potentially deleted characters.
    """
    return "".join(["" if self._random_success(p) else char for char in text])

insert_random_chars(text, p)

Insert random characters into a string.

For each character in the input string, a random character is inserted after it with probability p. The random characters are chosen from self.character_set.
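The two extreme probabilities make the behaviour concrete. This is a standalone sketch under stated assumptions: `insert_chars` is a hypothetical name, and the random draws are not consumed in the same order as in the library, so outputs will not match it character for character.

```python
import random
import string


def insert_chars(
    text: str, p: float, rng: random.Random, charset: str = string.ascii_letters
) -> str:
    """After each character of `text`, insert a random one with probability `p`."""
    pieces = []
    for char in text:
        pieces.append(char)
        if rng.random() < p:
            pieces.append(rng.choice(charset))
    return "".join(pieces)


rng = random.Random(0)
unchanged = insert_chars("hello", 0.0, rng)  # p = 0: nothing is inserted
doubled = insert_chars("hello", 1.0, rng)    # p = 1: one insertion after every character
print(unchanged, doubled)
```

Because insertions only happen after an existing character, nothing is ever inserted before the first character, and an empty input stays empty.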

Parameters:

Name Type Description Default
text str

The input string to be modified.

required
p float

probability to insert random char

required

Returns:

Type Description
str

A string derived from text with random characters potentially inserted after each character.

Source code in textnoisr/noise.py
def insert_random_chars(self, text: str, p: float) -> str:
    """Insert random characters into a string.

    For each character in the input string, a random character is inserted after it
    with probability `p`. The random characters are chosen from self.character_set.

    Args:
        text: The input string to be modified.
        p: probability to insert random char

    Returns:
        A string derived from `text` with random characters potentially inserted
            after each character.
    """
    return "".join(char + self._random_char(p, self.character_set) for char in text)

substitute_random_chars(text, p)

Substitute random characters of a string.

Each character of the input string is substituted with another random one with probability p. The random characters are chosen from self.character_set.
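"Another random one" means the replacement is never the original character, so every substitution is a real edit. The sketch below illustrates this with a hypothetical `substitute_chars` helper. It filters the candidates instead of redrawing until a different character comes up, and it assumes the character set contains at least two distinct characters.

```python
import random
import string


def substitute_chars(
    text: str, p: float, rng: random.Random, charset: str = string.ascii_letters
) -> str:
    """Replace each character with a different random one with probability `p`."""
    result = []
    for char in text:
        if rng.random() < p:
            # Exclude the current character so a substitution is always a real change.
            candidates = [c for c in charset if c != char]
            result.append(rng.choice(candidates))
        else:
            result.append(char)
    return "".join(result)


rng = random.Random(0)
original = "hello"
replaced = substitute_chars(original, 1.0, rng)  # p = 1: every character changes
print(original, replaced)
```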

Parameters:

Name Type Description Default
text str

The input string to be modified.

required
p float

probability to substitute a character

required

Returns:

Type Description
str

A string derived from text with potentially substituted characters.

Source code in textnoisr/noise.py
def substitute_random_chars(self, text: str, p: float) -> str:
    """Substitute random characters of a string.

    Each character of the input string is substituted with another random one with
    probability `p`. The random characters are chosen from self.character_set.

    Args:
        text: The input string to be modified.
        p: probability to substitute a character

    Returns:
        A string derived from `text` with potentially substituted characters.
    """
    return "".join(
        self._choose_another_character(char) if self._random_success(p) else char
        for char in text
    )

textnoisr.noise_dataset

Noise a NLP dataset.

add_noise(dataset, noise_augmenter, feature_name='tokens', **kwargs)

Add random noise to dataset items.

Parameters:

Name Type Description Default
dataset Dataset

dataset containing texts

required
noise_augmenter CharNoiseAugmenter

noise augmenter from module textnoisr.noise to use to perform noise data augmentation

required
feature_name str

The name of the dataset feature (column name) on which to add noise (usually "tokens" or "text")

'tokens'
**kwargs Any

additional keyword arguments forwarded to the Hugging Face Dataset.map() method, see github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py

{}

Returns:

Type Description
Dataset

noised dataset

Source code in textnoisr/noise_dataset.py
def add_noise(
    dataset: Dataset,
    noise_augmenter: noise.CharNoiseAugmenter,
    feature_name: str = "tokens",
    **kwargs: Any,
) -> Dataset:
    """Add random noise to dataset items.

    Args:
        dataset: dataset containing texts
        noise_augmenter: noise augmenter from module `textnoisr.noise` to use to
            perform noise data augmentation
        feature_name: The name of the dataset feature (column name) on which to add
            noise (usually "tokens" or "text")
        **kwargs: additional keyword arguments forwarded to the Hugging Face
            `Dataset.map()` method, see
            github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py

    Returns:
        noised dataset
    """
    return dataset.map(
        lambda x: _add_noise_to_example(x, noise_augmenter, feature_name), **kwargs
    )

textnoisr.noise_unbiasing

Unbias noise.

See this document.

__compute_expected_cer_from_noise_level(p, N)

Compute the expected CER from an uncorrected (biased) p for a string of length N.
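The function below this heading computes the expectation exactly with a Markov chain. As an independent cross-check, and explicitly a swapped-in technique rather than the library's method, the same quantity can be estimated by simulation: swap characters with the raw probability, measure the Levenshtein distance to the original, and average. All names here (`swap_once`, `levenshtein`, `estimate_cer`) are hypothetical, and the random strings are drawn from ASCII letters as an assumption.

```python
import random
import string


def swap_once(text: str, p: float, rng: random.Random) -> str:
    """Swap consecutive characters with raw probability `p`, each at most once."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # both characters are consumed by the swap
        else:
            i += 1
    return "".join(chars)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (char_a != char_b),  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def estimate_cer(p: float, n: int, trials: int = 50, seed: int = 0) -> float:
    """Monte Carlo estimate of the CER for random strings of length `n`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        text = "".join(rng.choice(string.ascii_letters) for _ in range(n))
        total += levenshtein(text, swap_once(text, p, rng)) / n
    return total / trials


for p in (0.0, 0.1, 0.4):
    print(f"p={p:.1f}  estimated CER={estimate_cer(p, 30):.3f}")
```

The estimate is noisy and depends on the alphabet, so it is only useful as a sanity check on the trend, not as a replacement for the exact computation.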

Parameters:

Name Type Description Default
p float

Uncorrected (biased) noise level, that is, the probability of swapping an unswapped character

required
N int

The length of the string.

required

Returns:

Type Description
float

The expected CER (in the sense of the expected value)

Source code in textnoisr/noise_unbiasing.py
def __compute_expected_cer_from_noise_level(p: float, N: int) -> float:
    """Compute the expected CER from an uncorrected (biased) p for a string of length N.

    Args:
        p: Uncorrected (biased) noise level,
            that is, the probability of swapping an unswapped character
        N: The length of the string.

    Returns:
        The expected CER (in the sense of the
            [expected value](https://en.wikipedia.org/wiki/Expected_value))
    """
    p = float(p)
    q = 1 - p
    # The transition matrix of the Markov Chain:
    P = np.array(
        [
            [0, 0, 1, 0, 0, 0, 0, 0],
            [0, 0, 0, p, q, 0, 0, 0],
            [p, q, 0, 0, 0, 0, 0, 0],
            [0, 0, 1, 0, 0, 0, 0, 0],
            [0, 0, 0, p, q, 0, 0, 0],
            [0, 0, 1, 0, 0, 0, 0, 0],
            [0, 0, 0, p, q, 0, 0, 0],
            [0, 0, 0, 0, 0, p, q, 0],
        ]
    )
    # Initialize the state:
    state = SWAP_START_STATE
    # Initialize the value of Levenshtein (this value should be zero):
    levenshtein = state @ SWAP_LEVENSHTEIN_VALUE

    for _ in range(N - 1):
        # We compute the probability distribution of the Markov Chain
        # for the next iteration:
        state = state @ P
        levenshtein += state @ SWAP_LEVENSHTEIN_VALUE
    return levenshtein / N

__compute_noise_level_from_expected_cer(cer, N) cached

Compute the noise level to pass as input in order to get a given expected CER.

This is the real "unbias_swap" function. The other one is just a wrapper.

Parameters:

Name Type Description Default
cer float

The Character Error Rate we want to have.

required
N int

The length of the string.

required

Returns:

Type Description
float

Unbiased probability

Source code in textnoisr/noise_unbiasing.py
@functools.cache
def __compute_noise_level_from_expected_cer(cer: float, N: int) -> float:
    """Compute the noise level to pass as input to get a given expected CER.

    This is the real "unbias_swap" function. The other one is just a wrapper.

    Args:
        cer: The Character Error Rate we want to have.
        N: The length of the string.

    Returns:
        Unbiased probability
    """
    return float(
        scipy.optimize.fsolve(
            lambda x: __compute_expected_cer_from_noise_level(float(x[0]), N) - cer,
            [0],
        )[0]
    )

unbias_several_action(p, n_actions)

Unbias probability to remove the bias due to the successive actions.

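If an action is applied N times with probability noise_level, the probability to change something becomes:

p_effective = 1 - (1 - noise_level) ** N

so we have to invert this relation to find the noise_level for which p_effective = p (this is the value returned):

noise_level = 1 - (1 - p) ** (1 / N)

Notice that at first order of the Taylor expansion, this becomes noise_level ≈ p / n_actions (with N = n_actions).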
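A quick numerical check makes the inversion and its first-order approximation concrete. The sketch is standalone: `per_action_probability` is a hypothetical name, `q` denotes the per-action probability, and the values 0.1 and 4 are arbitrary.

```python
def per_action_probability(p: float, n_actions: int) -> float:
    """Per-action probability q such that 1 - (1 - q) ** n_actions == p."""
    return 1.0 - (1.0 - p) ** (1 / n_actions)


p, n_actions = 0.1, 4
q = per_action_probability(p, n_actions)

# Applying the action n_actions times recovers the target probability.
recovered = 1.0 - (1.0 - q) ** n_actions

# First-order Taylor approximation: q is close to p / n_actions for small p.
approx = p / n_actions
print(f"q={q:.5f}  recovered={recovered:.5f}  approx={approx:.5f}")
```

The exact value is slightly larger than `p / n_actions`, and the gap widens as `p` grows, which is why the exact inverse is used instead of the simple division.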

Parameters:

Name Type Description Default
p float

Input probability. The user wants the expectation of the Character Error Rate to tend to this value.

required
n_actions int

Number of actions.

required

Returns:

Type Description
float

Unbiased probability

Source code in textnoisr/noise_unbiasing.py
def unbias_several_action(p: float, n_actions: int) -> float:
    """Unbias probability to remove the bias due to the successive actions.

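    If an action is applied N times with probability `noise_level`, the probability
    to change something becomes:

    p_effective = 1 - (1 - noise_level) ** N

    so we have to invert this relation to find the `noise_level` for which
    p_effective = p (this is the value returned):

    noise_level = 1 - (1 - p) ** (1 / N)

    Notice that at first order of the Taylor expansion, this becomes
    noise_level ≈ p / n_actions (with N = n_actions).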

    Args:
        p: Input probability. The user wants the expectation of the Character Error Rate
            to tend to this value.
        n_actions: Number of actions.

    Returns:
        Unbiased probability
    """
    p_effective: float = 1.0 - (1.0 - p) ** (1 / n_actions)
    return p_effective

unbias_split_into_words(p, text)

Unbias probability to take into account the absence of spaces when splitting.

We need to apply noise word by word in order to have an output list with the same length as the input. Applying noise word by word decreases the effective error rate, since no noise is added to the spaces between words. So we increase the probability to compensate for this loss.
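A small worked example shows the size of the compensation. The sketch mirrors the arithmetic of the function in a standalone form, with `compensate_for_spaces` as a hypothetical name and a two-word input chosen for easy counting.

```python
def compensate_for_spaces(p: float, words: list[str]) -> float:
    """Scale `p` by (n_chars + n_spaces) / n_chars."""
    n_chars = sum(map(len, words))
    n_spaces = len(words) - 1
    return p * (1 + n_spaces / n_chars)


words = ["hello", "world"]  # 10 characters, plus 1 space in the joined sentence
p_words = compensate_for_spaces(0.1, words)
print(f"{p_words:.3f}")  # roughly 0.110
```

The joined sentence has 11 characters but only 10 of them can receive noise, so the probability is scaled by 11 / 10. The correction shrinks as words get longer, since spaces then make up a smaller share of the text.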

Parameters:

Name Type Description Default
p float

Input probability. The user wants the expectation of the Character Error Rate to tend to this value.

required
text list[str]

Text on which we unbias.

required

Returns:

Type Description
float

Unbiased probability

Source code in textnoisr/noise_unbiasing.py
def unbias_split_into_words(p: float, text: list[str]) -> float:
    """Unbias probability to take into account the absence of spaces when splitting.

    We need to apply noise word by word in order to have an output list with the same
    length as the input.
    Applying noise word by word decreases the effective error rate,
    since no noise is added to the spaces between words.
    So we increase the probability to compensate for this loss.


    Args:
        p: Input probability. The user wants the expectation of the Character Error Rate
            to tend to this value.
        text: Text on which we unbias.

    Returns:
        Unbiased probability
    """
    n_chars = sum(map(len, text))
    n_spaces = len(text) - 1
    return p * (1 + n_spaces / n_chars)

unbias_swap(p, N, natural_language_swap_correction) cached

Re-compute p to take unbiasing into account.

See doc for more details.

Parameters:

Name Type Description Default
p float

Input probability. The user wants the expectation of the Character Error Rate to tend to this value.

required
N int

The length of the string.

required
natural_language_swap_correction float

A correction factor to take into account the fact that natural language is not random.

required

Returns:

Type Description
float

Unbiased probability, using an approximation formula for long strings (N > 50).
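The long-string formula in the source listing can be checked algebraically: it is the smaller root of x² + (p - 2)x + p = 0, which is the same as inverting p = x(2 - x) / (1 + x). The sketch below verifies this round trip numerically. `closed_form` copies the expression from the source, while `forward` is derived from it here; reading `forward` as the limiting CER for a raw swap probability x is an interpretation, not something this page states.

```python
import math


def closed_form(p: float) -> float:
    """Long-string formula, copied from the source of `unbias_swap`."""
    return (2 - p) / 2 - math.sqrt(p**2 - 8 * p + 4) / 2


def forward(x: float) -> float:
    """Relation inverted by `closed_form` (derived algebraically from it)."""
    return x * (2 - x) / (1 + x)


for target in (0.05, 0.1, 0.3, 0.5):
    x = closed_form(target)
    print(f"target={target:.2f}  swap probability={x:.4f}  round trip={forward(x):.4f}")
```

The correction factor for natural language is left out of this sketch. In the function itself, it is applied to `p` before the formula is evaluated.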

Source code in textnoisr/noise_unbiasing.py
@functools.lru_cache
def unbias_swap(p: float, N: int, natural_language_swap_correction: float) -> float:
    """Re-compute p to take unbiasing into account.

    See doc for [more details](swap_unbiasing.md).

    Args:
        p: Input probability. The user wants the expectation of the Character Error Rate
            to tend to this value.
        N: The length of the string.
        natural_language_swap_correction: A correction factor to take into account the
            fact that natural language is not random.

    Returns:
        Unbiased probability, using an approximation formula for long strings
            (N > 50).
    """
    # To avoid some "math domain error" later, in the case when a former unbiasing
    # made p > MAX_SWAP_LEVEL, we need to force it at the max level:
    p = min(p, MAX_SWAP_LEVEL) * natural_language_swap_correction
    # If N == 0 the string is empty, so the returned value does not matter:
    if N == 0:
        return p
    if p == 0:
        return 0.0
    # For long strings (N > 50), a closed-form approximation is used:
    if N > 50:
        return (2 - p) / 2 - math.sqrt((p**2) - (8 * p) + 4) / 2
    return __compute_noise_level_from_expected_cer(p, N)