SHA-1 Digest

Calculate the Base32-encoded SHA-1 digest that is commonly used in WARC files and CDX indexes.

sha1_digest

sha1_digest(content:bytes)

sha1_digest(b'12345')
'RSZCG7IGPHFIRW3EMTVMMDNJMNCVCOLE'
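
A minimal sketch of how such a digest could be computed with the standard library, assuming the function hashes the raw bytes with SHA-1 and Base32-encodes the 20-byte result. The name sha1_digest_sketch is hypothetical, and the actual implementation may differ.

```python
import base64
import hashlib


def sha1_digest_sketch(content: bytes) -> str:
    """Base32-encoded SHA-1 digest of content (hypothetical stand-in for sha1_digest)."""
    raw = hashlib.sha1(content).digest()  # 20 raw bytes
    # 20 bytes encode to exactly 32 Base32 characters, so no '=' padding appears
    return base64.b32encode(raw).decode('ascii')


digest = sha1_digest_sketch(b'12345')
print(digest)
```

For b'12345' this reproduces the digest shown in the example output.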

Making URLs Pretty

Sometimes I want to return something that displays like a URL in Jupyter but still works as a plain string in other environments. Adapted from here.

class URL

URL(url:str)

Wrapper around a URL string to provide nice display in IPython environments.

It displays nicely

url = URL('https://commoncrawl.org/')
url

The repr is usable

repr(url)
"URL(url='https://commoncrawl.org/')"

The string form is what we need

str(url)
'https://commoncrawl.org/'

Or we can extract it

url.url
'https://commoncrawl.org/'
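
One way such a wrapper could be built is sketched below, assuming the "nice display" is a clickable link produced through IPython's _repr_html_ hook. The class name URLSketch is hypothetical and this is not necessarily how URL itself is written.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class URLSketch:
    """Hypothetical stand-in for the URL wrapper described above."""
    url: str

    def __str__(self) -> str:
        # The plain string form is just the URL
        return self.url

    def _repr_html_(self) -> str:
        # IPython/Jupyter looks for this method to render rich HTML output
        safe = escape(self.url, quote=True)
        return f'<a href="{safe}">{safe}</a>'


url = URLSketch('https://commoncrawl.org/')
print(repr(url))
print(str(url))
```

Using a dataclass gives a repr of the same shape as the one shown above for free, while environments without rich display fall back to that repr.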

Session Helpers

Make a session that can run multiple concurrent requests and retries on intermittent failures.

make_session

make_session(pool_maxsize)
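
A sketch of how such a session might be assembled, assuming it is a requests.Session with a pooled HTTPAdapter and a urllib3 Retry policy. The name make_session_sketch, the retries parameter, the backoff factor, and the status codes are all assumptions chosen for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session_sketch(pool_maxsize: int, retries: int = 3) -> requests.Session:
    """Hypothetical stand-in for make_session: pooled connections plus retries."""
    retry = Retry(
        total=retries,                          # assumed retry budget
        backoff_factor=0.5,                     # assumed delay growth between attempts
        status_forcelist=(500, 502, 503, 504),  # assumed transient server errors
    )
    adapter = HTTPAdapter(
        pool_connections=pool_maxsize,
        pool_maxsize=pool_maxsize,  # upper bound on connections kept per host
        max_retries=retry,
    )
    session = requests.Session()
    # Use the same adapter for both schemes
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


session = make_session_sketch(pool_maxsize=8)
```

Setting pool_maxsize to at least the number of worker threads lets concurrent requests reuse pooled connections instead of discarding them.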

Joblib Helpers

Force a function memoized with joblib.Memory to recompute instead of returning its cached result.

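def _forced(f, force):
    """Return f unchanged, or a wrapper that bypasses the Memory cache when force is true."""
    # f must be memoized with joblib.Memory, which exposes a .call method
    assert hasattr(f, 'call')
    if not force:
        return f
    def result(*args, **kwargs):
        # f.call always recomputes and returns a tuple of (result, metadata);
        # keep only the result
        return f.call(*args, **kwargs)[0]
    return result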