text = "Sam I Am"
tokens = tokenize_ascii(text)
tokens
assert ngrams(tokens, 1) == [
('Sam',),
('I',),
('Am',),
(PAD,)]
assert ngrams(tokens, 2) == [
(PAD, 'Sam'),
('Sam', 'I',),
('I', 'Am',),
('Am', PAD,)]
Passing in a custom padding token
pad = '_custom_pad_'
assert ngrams(tokens, 3, pad) == [
(pad, pad, 'Sam'),
(pad, 'Sam', 'I'),
('Sam', 'I', 'Am'),
('I', 'Am', pad)
]
Passing pad=None removes the padding
assert ngrams(tokens, 2, pad=None) == [
('Sam', 'I',),
('I', 'Am',),
]
Todo: Non-destructive tokenisation
from collections.abc import Generator
isinstance(range(0, 10), Generator)
from mlzero.data import *
corpus_wine = data_wine_reviews()['description']
vocab_wine = vocab_threshold(corpus_wine, tokenize_ascii, 20)
vocab_wine
vocab_wine.i2v[0]
vocab_wine.i2v[1]
vocab_wine.encode(PAD)
next(iter(vocab_wine))
vocab_wine.encode('afesgsf')
corpus_wine[0]
encoded = vocab_wine.encode(corpus_wine[0])
encoded
' '.join(vocab_wine.decode(encoded))
The basis of an n-gram language model is count and divide. The model needs to store a count for every sequence of n tokens (n-gram) that occurs in the text. These counts are then normalised per row over the last token:
$$ P\left(w_k \vert w_{k-n+1:k-1}\right) = \frac{C\left(w_{k-n+1:k-1}\,w_k\right)}{C\left(w_{k-n+1:k-1}\right)} $$
Note that the denominator is precisely the sum of the numerator over all last tokens $ w $ in the vocabulary $ V $:
$$ C\left(w_{k-n+1:k-1}\right) = \sum_{w \in V} C\left(w_{k-n+1:k-1}\,w\right) $$
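As a concrete sketch of count and divide on a toy bigram table (a plain dict here, not the library's own representation; the names are illustrative):
from collections import Counter, defaultdict
# Hypothetical toy bigram counts: (context token, next token) -> count
toy_counts = Counter({('a', 'b'): 3, ('a', 'c'): 1, ('b', 'a'): 2})
# Sum over the last token to get the context totals C(w_{k-1})
context_totals = defaultdict(int)
for gram, count in toy_counts.items():
    context_totals[gram[:-1]] += count
# Divide each count by its context total to get P(w_k | w_{k-1})
toy_probs = {gram: count / context_totals[gram[:-1]]
             for gram, count in toy_counts.items()}
assert toy_probs[('a', 'b')] == 0.75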
- For calculating a probability or perplexity (recalled just after this list) we need a way to fetch (the log of) $ P\left(w_k \vert w_{k-n+1:k-1}\right) $
- For generating a random sentence we need a way of fetching the full distribution over the next token for a given context
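For reference, the perplexity used below is assumed to be the standard per-token quantity (the exact normalisation over the padded sequence is the library's choice):
$$ \mathrm{PP}\left(w_{1:N}\right) = \exp\left(-\frac{1}{N} \sum_{k=1}^{N} \log P\left(w_k \vert w_{k-n+1:k-1}\right)\right) $$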
There are a number of ways we could represent the counts (a rough size comparison follows the list):
- A mapping from n tokens to a count (size is the number of distinct n-grams)
- A dense array of size |V|**n
- A sparse array
- A hybrid of the two: sparse over contexts with dense rows, like the SparseRowCubeTensor used below
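A rough size comparison of the first two options (a sketch only; the numbers depend on the vocabulary and corpus):
import numpy as np
V = len(vocab_wine)   # vocabulary size
n = 3
# A dense count array needs |V|**n cells whether or not an n-gram ever occurs
dense_bytes = V ** n * np.dtype(np.int64).itemsize
print(f"dense {n}-gram array: ~{dense_bytes / 1e9:.1f} GB")
# A mapping only needs one entry per distinct n-gram observed in the corpus,
# which for natural text is a tiny fraction of |V|**n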
counts = count_ngrams(2, vocab_wine, corpus_wine[:10000])
probs = ngram_counts_to_conditional_probability(counts)
probs[0, 230]
doc = corpus_wine[40000]
doc
tokens = vocab_wine.encode(doc)
tokens[:10]
bigrams = ngrams(tokens, 2, pad=PAD_IDX)
bigrams[:10]
probs[bigrams[0]]
[probs[gram] for gram in bigrams][:10]
product([probs[gram] for gram in bigrams])
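Multiplying many small probabilities underflows for longer documents; summing log-probabilities is the usual workaround (a sketch, not part of the library's API):
import numpy as np
log_prob = np.sum(np.log([probs[gram] for gram in bigrams]))
log_prob, np.exp(log_prob)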
import numpy as np
start = (PAD_IDX,)
n = 2
tokens = []
context = (PAD_IDX,) * (n - 1)   # start from a context of n-1 padding tokens
while True:
    # Conditional distribution over the next token for the current context
    weights = [probs[context + (x,)] for x in range(len(vocab_wine))]
    next_token = np.random.choice(len(weights), p=weights)
    # Sampling the padding token marks the end of the document
    if next_token == PAD_IDX:
        break
    tokens.append(next_token)
    # Slide the context window forward by one token
    context = context[1:] + (next_token,)
tokens[:10]
print(' '.join(vocab_wine.decode(tokens)))
wine_unilm = NaiveNgramLanguageModel(vocab_wine, 1, corpus_wine[:1000])
wine_unilm.top_k(10)
wine_unilm.probability("This is a rich wine.", pad=True)
for _ in range(5):
    print(' '.join(wine_unilm.generate()) + '\n')
wine_bilm = NaiveNgramLanguageModel(vocab_wine, 2, corpus_wine[:1000])
wine_bilm.top_k(10)
wine_bilm.top_k(10, [PAD])
wine_bilm.top_k(10, ['fresh'])
wine_bilm.top_k(10, ['This'])
wine_bilm.probability("This is a rich wine.", pad=False)
wine_bilm.probability("This is a rich wine.", pad=True)
wine_bilm.perplexity("This is a rich wine.")
for _ in range(5):
    print(' '.join(wine_bilm.generate()) + '\n')
wine_trilm = NaiveNgramLanguageModel(vocab_wine, 3, corpus_wine[:1000])
wine_trilm.top_k(10)
wine_trilm.top_k(10, ['This'])
for _ in range(5):
    print(' '.join(wine_trilm.generate()) + '\n')
wine_trilm.perplexity('A touch blossomy against a core of tobacco and a touch of juniper , lots of fresh pineapple , apricot , lemon drop and ginger brightened by crisp acidity .')
Check some examples against hand-calculated results
assert flatten_index([0,0,0], 7) == 0
assert unflatten_index(0, 7, 3) == [0,0,0]
We want blocks that share the same leading indices to be contiguous
assert flatten_index([0,0,1], 7) == 1
assert unflatten_index(1, 7, 3) == [0,0,1]
assert flatten_index([5,3,1], 7) == 267
assert unflatten_index(267, 7, 3) == [5,3,1]
assert flatten_index([1], 7) == 1
assert flatten_index([], 7) == 0
assert unflatten_index(0, 7, 0) == []
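A minimal reference implementation consistent with these asserts (the real flatten_index/unflatten_index may differ in details) treats the indices as digits of a base-size number, most significant first:
def flatten_index_ref(indices, size):
    # The first index is the most significant digit, so indices sharing a
    # prefix end up contiguous in the flattened range
    flat = 0
    for idx in indices:
        flat = flat * size + idx
    return flat

def unflatten_index_ref(flat, size, n):
    # Invert flatten_index_ref by repeatedly peeling off the last digit
    indices = []
    for _ in range(n):
        flat, idx = divmod(flat, size)
        indices.append(idx)
    return indices[::-1]

assert flatten_index_ref([5, 3, 1], 7) == 267
assert unflatten_index_ref(267, 7, 3) == [5, 3, 1]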
size = 7
n = 3
assert [unflatten_index(a, size, n) for a in range(size**n)[flatten_index_range([1,3], size, n)]] == [[1,3,x] for x in range(size)]
size = 7
n = 3
assert [unflatten_index(a, size, n) for a in range(size**n)[flatten_index_range([1], size, n)]] == [[1,x,y] for x in range(size) for y in range(size)]
size = 7
n = 3
assert [unflatten_index(a, size, n) for a in range(size**n)[flatten_index_range([6,2,5], size, n)]] == [[6,2,5]]
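Under that ordering, every prefix of indices maps to a single contiguous slice of the flat range, which is presumably what flatten_index_range returns (a sketch under that assumption):
def flatten_index_range_ref(prefix, size, n):
    # All full index tuples sharing `prefix` occupy one contiguous block of
    # size**(n - len(prefix)) flat positions
    start = 0
    for idx in prefix:
        start = start * size + idx
    block = size ** (n - len(prefix))
    start *= block
    return slice(start, start + block)

assert flatten_index_range_ref([1, 3], 7, 3) == slice(70, 77)
assert flatten_index_range_ref([6, 2, 5], 7, 3) == slice(313, 314)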
ts = SparseRowCubeTensor(wine_trilm.counts, size=len(wine_trilm.vocab), n_dimension=wine_trilm.n, dtype=int)
Getting the top 10 items gives the same result as brute-force sorting
assert ts.top_k(10) == dict(list(sorted(wine_trilm.counts.items(), key=lambda x: x[1], reverse=True))[:10])
Try normalising
t_norm = ts.normalize()
This should be the same as sum and divide
t_norm
assert abs(t_norm[0,0] - ts[0,0] / ts[0,0].sum()).max() < 1e-8
t_log = t_norm.transform(np.log)
assert abs(t_log[0,0,1] - np.log(t_norm[0,0,1])) < 1e-8
%time tri_naive = NaiveNgramLanguageModel(vocab_wine, 3, corpus_wine)
%time tri = NgramLanguageModel(vocab_wine, 3, corpus_wine)
assert tri_naive.top_k(10) == tri.top_k(10)
%time _ = tri_naive.top_k(100)
%time _ = tri.top_k(100)
Generation is much faster
%%time
for _ in range(5):
    print(' '.join(tri.generate()) + '\n')
%%time
for _ in range(5): tri_naive.generate()
sample_sentence = ' '.join(tri_naive.generate())
sample_sentence
assert abs(tri.perplexity(sample_sentence) - tri_naive.perplexity(sample_sentence)) < 1e-8
Computing perplexity this way is actually ~10x slower than the naive model!
%timeit -n 100 tri.perplexity(sample_sentence)
%timeit -n 100 tri_naive.perplexity(sample_sentence)