Normalisation

Normalisation is an activity related to segmentation: making things that are different the same.

Consider, for example, that in Unicode there are multiple ways of representing the same symbol. There is a canonical normalisation called NFC that represents all of these sequences the same way.

It's often performed at the same time as word segmentation for efficiency, but it is logically a separate step (especially if you keep the context in your tokens, as SpaCy does).

import unicodedata
# 'a' followed by a combining acute accent: two code points
a = u'\u0061\u0301'
# NFC composes them into the single precomposed character U+00E1
a0 = unicodedata.normalize('NFC', a)
a, a0
('á', 'á')
list(a), list(a0)
(['a', '́'], ['á'])

But for some applications you may just want to make everything ASCII; this can be useful for e.g. transliterated names. The unidecode library helps with that:

!pip install unidecode
Collecting unidecode
  Downloading Unidecode-1.1.2-py2.py3-none-any.whl (239 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.1.2
from unidecode import unidecode
x = "\u5317\u4EB0"
x, unidecode(x)
('北亰', 'Bei Jing ')
x = "Бизнес Ланч" # Business Lunch
x, unidecode(x)
('Бизнес Ланч', 'Biznes Lanch')

A common example is that you may want to unify some kinds of punctuation:

x = '–—-' # endash, emdash, hyphen
x, unidecode(x)
('–—-', '----')
x = '¿¡«…» „“'
x, unidecode(x)
('¿¡«…» „“', '?!<<...>> ,,"')


Beyond the character level you may want to treat certain words the same:

Spellings:

  • aluminum vs aluminium
  • criticize vs criticise

Common misspellings:

  • acceptable vs acceptible

Contractions and Abbreviations:

  • don't for do not
  • Mr. and Mr for Mister

Ideally we would keep the original forms for reference and layer the normalisation on top.
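A minimal sketch of this layered approach, pairing each original token with its normalised form (the mapping table below is illustrative, not exhaustive):

# Map known variants to a canonical form; anything unknown is just lowercased.
CANONICAL = {
    "aluminum": "aluminium",
    "criticize": "criticise",
    "acceptible": "acceptable",
    "don't": "do not",
    "mr.": "mister",
    "mr": "mister",
}

def normalise_token(token):
    "Return (original, normalised) so the raw form stays available."
    return token, CANONICAL.get(token.lower(), token.lower())

[normalise_token(t) for t in ["Aluminum", "don't", "Mr."]]
[('Aluminum', 'aluminium'), ("don't", 'do not'), ('Mr.', 'mister')]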

Tokenization (Word Segmentation)

Notable tools:

  • SpaCy (rule-based, a few languages)
  • Stanza (neural, many languages)
  • Stanford CoreNLP (via Stanza, rule-based, a few languages)
  • NLTK (rule-based, a few languages)
  • Moses Tokenizer and Normalizer (Perl)

!pip install spacy
!pip install spacy-lookups-data
!python -m spacy download en_core_web_sm

import stanza
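A minimal sketch of tokenizing with Stanza (the first call downloads the English models; 'en' and the tokenize processor are standard Stanza arguments):

# Fetch the English models once, then build a tokenize-only pipeline.
stanza.download('en')
nlp = stanza.Pipeline('en', processors='tokenize')
doc = nlp("That U.S.A. poster-print/photgraph costs $12.40...")
[token.text for sentence in doc.sentences for token in sentence.tokens]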

class RegexTokenizer(regexp:Pattern)

tokenize_space(text:str)
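The exported definitions themselves aren't shown here. A minimal sketch consistent with the usage below, including the tokenize_ascii used later, might look like this (the exact tokenize_ascii pattern is an assumption, modelled on the regexp tokenizer from the NLTK book):

import re
from typing import Pattern

class RegexTokenizer:
    "Tokenize text by returning every non-overlapping match of a regex."
    def __init__(self, regexp: Pattern):
        self.regexp = re.compile(regexp)

    def __call__(self, text: str):
        return self.regexp.findall(text)

# A token is any maximal run of non-whitespace characters.
tokenize_space = RegexTokenizer(r'\S+')

# ASCII-oriented tokenization; an assumed pattern chosen to reproduce
# the output below.
tokenize_ascii = RegexTokenizer(r'''(?x)
      (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
    | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40
    | \w+(?:-\w+)*          # words with optional internal hyphens
    | \.\.\.                # ellipsis
    | [.,;"'?():_`-]        # punctuation as separate tokens
''')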

text = "That U.S.A. poster-print/photgraph costs $12.40..."
tokenize_space(text)
['That', 'U.S.A.', 'poster-print/photgraph', 'costs', '$12.40...']
tokenize_ascii(text)
['That', 'U.S.A.', 'poster-print', 'photgraph', 'costs', '$12.40', '...']

Subword Segmentation

SentencePiece
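SentencePiece learns a subword vocabulary (e.g. BPE or unigram) directly from raw text, so it needs no prior word segmentation. A minimal sketch of training and applying a model, where corpus.txt and the vocabulary size are placeholders:

import sentencepiece as spm

# Train a subword model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input='corpus.txt', model_prefix='m', vocab_size=8000)

# Load the trained model and segment new text into subword pieces.
sp = spm.SentencePieceProcessor(model_file='m.model')
sp.encode('That poster-print costs $12.40...', out_type=str)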
