Normalisation
Normalisation is an activity closely related to segmentation: it makes things that are superficially different the same.
Consider, for example, that Unicode has multiple ways of representing the same symbol. There is a canonical normalisation called NFC that maps all of these representations to a single form.
It's often performed at the same time as word segmentation for efficiency, but it is logically a separate step (especially if you keep the context in your tokens, as SpaCy does).
import unicodedata
a = u'\u0061\u0301'  # 'a' followed by a combining acute accent
a0 = unicodedata.normalize('NFC', u'\u0061\u0301')  # NFC composes them into a single code point
a, a0
list(a), list(a0)
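Both strings render as 'á', but list(a) shows two code points while list(a0) shows just one: NFC has composed the letter and its accent into U+00E1.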
But for some applications you may just want to make everything ASCII; this can be useful for, e.g., transliterated names. The unidecode library helps with that.
!pip install unidecode
from unidecode import unidecode
x = "\u5317\u4EB0"  # 北亰, Chinese characters for Beijing
x, unidecode(x)
x = "Бизнес Ланч" # Business Lunch
x, unidecode(x)
A common example is unifying different kinds of punctuation:
x = '–—-' # en dash, em dash, hyphen
x, unidecode(x)
x = '¿¡«…» „“'
x, unidecode(x)
Beyond the character level you may want to treat certain words the same:
Spellings:
- aluminum vs aluminium
- criticize vs criticise
Common misspellings:
- acceptable vs acceptible
Contractions and Abbreviations:
- don't for do not
- Mr. and Mr for Mister
Ideally we would keep the original forms for reference and layer the normalisation on top.
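A minimal sketch of how that layering might look (the mapping table and helper function below are invented for illustration, not taken from any library):
# A toy normalisation table; a real one would be far larger (hypothetical example)
NORMALISE = {
    "aluminum": "aluminium",
    "criticize": "criticise",
    "acceptible": "acceptable",
    "don't": "do not",
    "mr.": "mister",
    "mr": "mister",
}
def normalise_token(token):
    """Return (original, normalised) so the raw form is kept for reference."""
    return (token, NORMALISE.get(token.lower(), token.lower()))
[normalise_token(t) for t in ["Aluminum", "don't", "Mr.", "costs"]]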
Tokenization (Word Segmentation)
Notable tools:
- SpaCy (rule based, a few languages)
- Stanza (neural based, many languages)
- Stanford NLP (via Stanza, rule based, a few languages)
- NLTK (rule based, a few languages)
- Moses Tokenizer and Normalizer (Perl)
!pip install spacy
!pip install spacy-lookups-data
!python -m spacy download en_core_web_sm
import stanza
text = "That U.S.A. poster-print/photgraph costs $12.40..."
tokenize_space(text)
tokenize_ascii(text)
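tokenize_space and tokenize_ascii are presumably simple baseline tokenizers defined earlier; as a rough sketch (these definitions are guesses at a whitespace split and an ASCII-alphanumeric split, not the originals), followed by the same text run through SpaCy for comparison:
import re
import spacy

def tokenize_space(text):
    # guessed baseline: split on whitespace only
    return text.split()

def tokenize_ascii(text):
    # guessed baseline: runs of ASCII letters and digits become tokens
    return re.findall(r"[A-Za-z0-9]+", text)

nlp = spacy.load("en_core_web_sm")
[token.text for token in nlp(text)]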