Read the full documentation
pip install webrefine
We'll go through an example of getting some titles from my blog at skeptric.com.
The process consists of:
- Defining Queries
- Defining Extraction and Filters
- Running the process
from webrefine.query import WaybackQuery
We can query the Wayback Machine for HTML pages captured in 2020:
skeptric_wb = WaybackQuery('skeptric.com/*', start='2020', end='2020', mime='text/html')
sample = list(skeptric_wb.query(limit=20))
We can look at some of the sample records:
sample[0]
sample[1]
And view them on the Wayback Machine to work out how to get the information we want:
sample[1].preview()
We could also query CommonCrawl similarly with a CommonCrawlQuery. This has more captures but takes a bit longer to run.
from webrefine.query import CommonCrawlQuery
skeptric_cc = CommonCrawlQuery('skeptric.com/*')
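Assuming CommonCrawlQuery supports the same query interface as WaybackQuery (including the limit argument), we could sample a few captures in the same way:

```python
# Sketch: assumes CommonCrawlQuery.query accepts the same limit argument as WaybackQuery
cc_sample = list(skeptric_cc.query(limit=5))
[r.url for r in cc_sample]
```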
Another option is to add local WARC files (e.g. produced using warcio, or wget with WARC parameters).
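For instance, here is a minimal sketch of capturing a page into a WARC file with warcio's capture_http helper (this assumes the requests library is installed and uses a hypothetical output path):

```python
from warcio.capture_http import capture_http
import requests  # imported after capture_http so its traffic gets recorded

# Hypothetical example: record a single page into my_capture.warc.gz
with capture_http('my_capture.warc.gz'):
    requests.get('https://skeptric.com/')
```

A WARC file like this (here the test file shipped with the repository) can then be used as a query source: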
from webrefine.query import WarcFileQuery
test_data = '../resources/test/skeptric.warc.gz'
skeptric_file_query = WarcFileQuery(test_data)
[r.url for r in skeptric_file_query.query()]
From inspecting some of the results we can see that the titles are written like:
<h1 class="post-full-title">{TITLE}</h1>
In a real example we'd parse the HTML, but for simplicity we'll extract the title with a regular expression:
import re

def skeptric_extract(content, record):
    # Decode the capture and pull the title out of the post header
    html = content.decode('utf-8')
    title = next(re.finditer('<h1 class="post-full-title">([^<]+)</h1>', html)).group(1)
    return {
        'title': title,
        'url': record.url,
        'timestamp': record.timestamp
    }
We can then test it on some content we fetch from the Wayback Machine:
skeptric_extract(sample[1].content, sample[1])
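If we did want to parse the HTML properly, a minimal sketch of the same extraction with BeautifulSoup (an assumption here, it is not part of webrefine) could look like this:

```python
from bs4 import BeautifulSoup

def skeptric_extract_bs(content, record):
    # Parse the HTML and read the title from the post header
    soup = BeautifulSoup(content.decode('utf-8'), 'html.parser')
    title = soup.find('h1', class_='post-full-title').get_text()
    return {'title': title, 'url': record.url, 'timestamp': record.timestamp}
```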
Some pages don't have a title, so we filter them out, and we also remove duplicates:
def skeptric_filter(records):
    last_url = None
    for record in records:
        # Only use ok HTML captures
        if record.mime != 'text/html' or record.status != 200:
            continue
        # Pages that are not articles (and so do not have a title)
        if record.url == 'https://skeptric.com/' or '/tags/' in record.url:
            continue
        # Duplicates (using the fact that here the posts come in order)
        if record.url == last_url:
            continue
        last_url = record.url
        yield record
[r.url for r in skeptric_filter(sample)]
from webrefine.runners import Process

skeptric_process = Process(
    queries=[
        skeptric_file_query,
        # commented out to make faster
        # skeptric_wb,
        # skeptric_cc,
    ],
    filter=skeptric_filter,
    steps=[skeptric_extract])
We can wrap it in a runner and run it all with .run.
%%time
from webrefine.runners import RunnerMemory
data = list(RunnerMemory(skeptric_process).run())
data
For larger jobs RunnerCached is better, which caches intermediate results to a file:
%%time
from webrefine.runners import RunnerCached
cache_path = './test_cache.sqlite'
data = list(RunnerCached(skeptric_process, path=cache_path).run())
data
import os
os.unlink(cache_path)
Note that if errors occur in the steps, the process keeps going and logs the errors:
skeptric_error_process = Process(
    queries=[
        skeptric_file_query,
        # commented out to make faster
        # skeptric_wb,
        # skeptric_cc,
    ],
    filter=lambda x: x,
    steps=[skeptric_extract])
data = list(RunnerMemory(skeptric_error_process).run())
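To actually see those errors while the runner is going, one option is to turn up Python's standard logging; this is a sketch that assumes webrefine reports step errors through the standard logging module:

```python
import logging

# Assumption: webrefine logs step errors via the standard logging module
logging.basicConfig(level=logging.INFO)
```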
We could then investigate them to see what happened:
import datetime
from pathlib import PosixPath
from webrefine.query import WarcFileRecord

record = WarcFileRecord(
    url='https://skeptric.com/tags/data/',
    timestamp=datetime.datetime(2021, 11, 26, 11, 28, 38),
    mime='text/html',
    status=200,
    path=PosixPath('../resources/test/skeptric.warc.gz'),
    offset=130269,
    digest='R7CLAACFU5L7T5LKI5G53RZSMCNUNV6F')
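We can then re-run the extraction step on that record to reproduce the failure. A sketch, assuming WarcFileRecord exposes the same content property as the query records above; the tag page has no post-full-title header, so the next(...) call in skeptric_extract raises StopIteration:

```python
# Reproduce the failure on the single failing record
# (assumes record.content reads the capture back from the WARC file)
skeptric_extract(record.content, record)
# Raises StopIteration: no <h1 class="post-full-title"> on a /tags/ page
```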