Normalisation
Normalisation is an activity closely related to segmentation: it makes things that are superficially different the same.
Consider, for example, that Unicode has multiple ways of representing the same symbol. There is a canonical normalisation called NFC that maps all of these representations to a single form.
It's often performed at the same time as word segmentation for efficiency, but it is logically a separate step (especially if you keep the context in your tokens, as SpaCy does).
import unicodedata
a = u'\u0061\u0301'  # 'a' followed by a combining acute accent
a0 = unicodedata.normalize('NFC', u'\u0061\u0301')  # NFC composes them into a single code point
a, a0
list(a), list(a0)
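Both strings render as 'á', but list(a) shows two code points while list(a0) shows just one: NFC has composed the letter and its accent into U+00E1.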
But for some applications you may just want to make everything ASCII; this can be useful for, e.g., transliterated names. The unidecode library helps with that.
!pip install unidecode
from unidecode import unidecode
x = "\u5317\u4EB0"  # 北亰, Chinese characters for Beijing
x, unidecode(x)
x = "Бизнес Ланч" # Business Lunch
x, unidecode(x)
A common example is unifying different kinds of punctuation:
x = '–—-' # en dash, em dash, hyphen
x, unidecode(x)
x = '¿¡«…» „“'
x, unidecode(x)
Beyond the character level you may want to treat certain words the same:
Spellings:
- aluminum vs aluminium
- criticize vs criticise
Common misspellings:
- acceptable vs acceptible
Contractions and Abbreviations:
- don't for do not
- Mr. and Mr for Mister
Ideally we would keep the original forms for reference and layer the normalisation on top.
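A minimal sketch of how that layering might look (the mapping table and helper function below are invented for illustration, not taken from any library):
# A toy normalisation table; a real one would be far larger (hypothetical example)
NORMALISE = {
    "aluminum": "aluminium",
    "criticize": "criticise",
    "acceptible": "acceptable",
    "don't": "do not",
    "mr.": "mister",
    "mr": "mister",
}
def normalise_token(token):
    """Return (original, normalised) so the raw form is kept for reference."""
    return (token, NORMALISE.get(token.lower(), token.lower()))
[normalise_token(t) for t in ["Aluminum", "don't", "Mr.", "costs"]]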
Tokenization (Word Segmentation)
Notable tools:
- SpaCy (rule based, a few languages)
- Stanza (neural based, many languages)
- Stanford NLP (via Stanza, rule based, a few languages)
- NLTK (rule based, a few languages)
- Moses Tokenizer and Normalizer (Perl)
!pip install spacy
!pip install spacy-lookups-data
!python -m spacy download en_core_web_sm
import stanza
text = "That U.S.A. poster-print/photgraph costs $12.40..."
tokenize_space(text)
tokenize_ascii(text)
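tokenize_space and tokenize_ascii are presumably simple baseline tokenizers defined earlier; as a rough sketch (these definitions are guesses at a whitespace split and an ASCII-alphanumeric split, not the originals), followed by the same text run through SpaCy for comparison:
import re
import spacy

def tokenize_space(text):
    # guessed baseline: split on whitespace only
    return text.split()

def tokenize_ascii(text):
    # guessed baseline: runs of ASCII letters and digits become tokens
    return re.findall(r"[A-Za-z0-9]+", text)

nlp = spacy.load("en_core_web_sm")
[token.text for token in nlp(text)]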