
webrefine

A workflow for refining web pages into useful datasets.

Read the full documentation

Install

pip install webrefine

How to use

We'll go through an example of getting some titles from my blog at skeptric.com.

The process consists of:

  • Defining Queries
  • Defining Extraction and Filters
  • Running the process

Querying data

To start we'll need some captures of my blog, which we'll get from the Internet Archive's Wayback Machine.

from webrefine.query import WaybackQuery

We could get some HTML pages:

skeptric_wb = WaybackQuery('skeptric.com/*', start='2020', end='2020', mime='text/html')
sample = list(skeptric_wb.query(limit=20))

We can look at some of the sample records:

sample[0]
WaybackRecord(url='https://skeptric.com/', timestamp=datetime.datetime(2020, 11, 26, 6, 41, 2), mime='text/html', status=200, digest='WDYU3RU7ZMFFSZPAPE56PC4L3EK4FE3D')
sample[1]
WaybackRecord(url='https://skeptric.com/casper-2-to-3/', timestamp=datetime.datetime(2020, 11, 26, 7, 52, 8), mime='text/html', status=200, digest='3XDBGHY77ZEA2Z7IVBARXEQT6UDLYAL7')

And view them on the Wayback Machine to work out how to get the information we want:

sample[1].preview()

http://web.archive.org/web/20201126075208/https://skeptric.com/casper-2-to-3/

We could also query Common Crawl similarly with a CommonCrawlQuery. It has more captures, but takes a bit longer to run.

from webrefine.query import CommonCrawlQuery
skeptric_cc = CommonCrawlQuery('skeptric.com/*')
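
Assuming CommonCrawlQuery shares WaybackQuery's query interface (a sketch, not run here to keep the example fast), we could pull a small sample the same way:

# assumes the same query(limit=...) interface shown for WaybackQuery above
cc_sample = list(skeptric_cc.query(limit=5))
[r.url for r in cc_sample]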

Another option is to add local WARC files (e.g. produced using warcio, or wget with its WARC options).
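
For instance, warcio can record live HTTP traffic into a WARC file (a minimal sketch using warcio's capture_http; the output filename is just an example):

from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http

# records the request and response into my_capture.warc.gz
with capture_http('my_capture.warc.gz'):
    requests.get('https://skeptric.com/')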

from webrefine.query import WarcFileQuery
test_data = '../resources/test/skeptric.warc.gz'

skeptric_file_query = WarcFileQuery(test_data)
[r.url for r in skeptric_file_query.query()]
['https://skeptric.com/pagination-wayback-cdx/',
 'https://skeptric.com/robots.txt',
 'https://skeptric.com/style.main.min.5ea2f07be7e07e221a7112a3095b89d049b96c48b831f16f1015bf2d95d914e5.css',
 'https://skeptric.com/',
 'https://skeptric.com/about/',
 'https://skeptric.com/tags/data',
 'https://skeptric.com/tags/data/',
 'https://skeptric.com/images/wayback_empty_returns.png',
 'https://skeptric.com/searching-100b-pages-cdx',
 'https://skeptric.com/searching-100b-pages-cdx/',
 'https://skeptric.com/fast-web-data-workflow/',
 'https://skeptric.com/key-web-captures/',
 'https://skeptric.com/emacs-tempfile-hugo/']

Filtering and Extracting the Data

From inspecting some of the captured pages we can see that the titles are written like:

<h1 class="post-full-title">{TITLE}</h1>

In a real example we'd parse the HTML (see the sketch after the test below), but for simplicity we'll extract the title with a regular expression:

import re

def skeptric_extract(content, record):
    html = content.decode('utf-8')
    # Take the first matching heading; this raises StopIteration when
    # there is no title, which the runner logs as an error (see below)
    title = next(re.finditer('<h1 class="post-full-title">([^<]+)</h1>', html)).group(1)
    return {
        'title': title,
        'url': record.url,
        'timestamp': record.timestamp
    }

We can then test it on some content fetched from the Wayback Machine:

skeptric_extract(sample[1].content, sample[1])
{'title': 'Hugo Casper 2 to 3',
 'url': 'https://skeptric.com/casper-2-to-3/',
 'timestamp': datetime.datetime(2020, 11, 26, 7, 52, 8)}
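
As noted above, a real pipeline would parse the HTML rather than rely on a regex. A minimal sketch of the same extraction using BeautifulSoup (an illustration, not part of webrefine):

from bs4 import BeautifulSoup

def skeptric_extract_bs(content, record):
    soup = BeautifulSoup(content, 'html.parser')
    heading = soup.find('h1', class_='post-full-title')
    # Return None instead of raising when the page has no title
    return {
        'title': heading.get_text(strip=True) if heading else None,
        'url': record.url,
        'timestamp': record.timestamp
    }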

Some pages don't have a title, so we filter them out; we also remove duplicate captures:

def skeptric_filter(records):
    last_url = None
    for record in records:
        # Only use ok HTML captures
        if record.mime != 'text/html' or record.status != 200:
            continue
        # Pages that are not articles (and so do not have a title)
        if record.url == 'https://skeptric.com/' or '/tags/' in record.url:
            continue
        # Duplicates (relying on the posts coming in URL order here; see the sketch below)
        if record.url == last_url:
            continue
        last_url = record.url
        yield record
[r.url for r in skeptric_filter(sample)]
['https://skeptric.com/casper-2-to-3/',
 'https://skeptric.com/common-crawl-index-athena/',
 'https://skeptric.com/common-crawl-job-ads/',
 'https://skeptric.com/considering-vscode/',
 'https://skeptric.com/decorating-pandas-tables/',
 'https://skeptric.com/drive-metrics/',
 'https://skeptric.com/emacs-buffering/',
 'https://skeptric.com/ngram-python/',
 'https://skeptric.com/portable-custom-config/',
 'https://skeptric.com/searching-100b-pages-cdx/',
 'https://skeptric.com/text-meta-data-commoncrawl/']
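
The duplicate check above relies on captures of the same URL arriving consecutively. A more general approach (a sketch using the digest field shown on the records above, a hash of the capture's content) tracks everything seen so far:

def dedupe_by_digest(records):
    seen = set()
    for record in records:
        # Skip any capture whose content hash we've already emitted
        if record.digest in seen:
            continue
        seen.add(record.digest)
        yield record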

Running the process

Now that we've written all the logic we need, we can collect it in a Process to run:

from webrefine.runners import Process

skeptric_process = Process(
    queries=[skeptric_file_query,
             # commented out to make faster
             #skeptric_wb,
             #skeptric_cc,
             ],
    filter=skeptric_filter,
    steps=[skeptric_extract])

We can wrap it in a runner and run it all with .run():

%%time
from webrefine.runners import RunnerMemory
data = list(RunnerMemory(skeptric_process).run())
data
CPU times: user 290 ms, sys: 14.8 ms, total: 305 ms
Wall time: 304 ms

[{'title': 'Pagination in Internet Archive&#39;s Wayback Machine with CDX',
  'url': 'https://skeptric.com/pagination-wayback-cdx/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 34)},
 {'title': 'About Skeptric',
  'url': 'https://skeptric.com/about/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 37)},
 {'title': 'Searching 100 Billion Webpages Pages With Capture Index',
  'url': 'https://skeptric.com/searching-100b-pages-cdx/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 39)},
 {'title': 'Fast Web Dataset Extraction Worfklow',
  'url': 'https://skeptric.com/fast-web-data-workflow/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 39)},
 {'title': 'Unique Key for Web Captures',
  'url': 'https://skeptric.com/key-web-captures/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 40)},
 {'title': 'Hugo Readdir Error with Emacs',
  'url': 'https://skeptric.com/emacs-tempfile-hugo/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 40)}]

For larger jobs RunnerCached is better; it caches intermediate results to a file:

%%time
from webrefine.runners import RunnerCached

cache_path = './test_cache.sqlite'

data = list(RunnerCached(skeptric_process, path=cache_path).run())
data
CPU times: user 252 ms, sys: 10.7 ms, total: 263 ms
Wall time: 286 ms

[{'title': 'Pagination in Internet Archive&#39;s Wayback Machine with CDX',
  'url': 'https://skeptric.com/pagination-wayback-cdx/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 34)},
 {'title': 'About Skeptric',
  'url': 'https://skeptric.com/about/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 37)},
 {'title': 'Searching 100 Billion Webpages Pages With Capture Index',
  'url': 'https://skeptric.com/searching-100b-pages-cdx/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 39)},
 {'title': 'Fast Web Dataset Extraction Worfklow',
  'url': 'https://skeptric.com/fast-web-data-workflow/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 39)},
 {'title': 'Unique Key for Web Captures',
  'url': 'https://skeptric.com/key-web-captures/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 40)},
 {'title': 'Hugo Readdir Error with Emacs',
  'url': 'https://skeptric.com/emacs-tempfile-hugo/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 40)}]

import os
os.unlink(cache_path)  # clean up the example's cache file

Note that when a step raises an error, the process keeps going and logs the error:

skeptric_error_process = Process(
    queries=[skeptric_file_query,
             # commented out to make faster
             #skeptric_wb,
             #skeptric_cc,
             ],
    # identity filter: non-article pages reach the extraction step and fail
    filter=lambda x: x,
    steps=[skeptric_extract])
data = list(RunnerMemory(skeptric_error_process).run())
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/robots.txt', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 34), mime='text/html', status=404, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=5804, digest='QRNGXIUXE4LAI3XR5RVATIUX5GTB33HX') at step skeptric_extract: 
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/style.main.min.5ea2f07be7e07e221a7112a3095b89d049b96c48b831f16f1015bf2d95d914e5.css', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 35), mime='text/css', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=7197, digest='LINCDTSPQGAQGZZ6LY2XFXZHG2X476H6') at step skeptric_extract: 
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 36), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=17122, digest='JJVB3MQERHRZJCHOJNKS5VDOODXPZAV2') at step skeptric_extract: 
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/tags/data', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 37), mime='text/html', status=302, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=129093, digest='ZZZXDZTTV2KTABRO64ESHVWFPNKB4I5H') at step skeptric_extract: 
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/tags/data/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 38), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=130269, digest='R7CLAACFU5L7T5LKI5G53RZSMCNUNV6F') at step skeptric_extract: 
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/images/wayback_empty_returns.png', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 38), mime='image/png', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=160971, digest='SU7JRTHNW6KFCJQFL5PMMKV33U2VLV7T') at step skeptric_extract: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/searching-100b-pages-cdx', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 39), mime='text/html', status=302, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=173368, digest='AYVHQLVFIVGZGUYPEHX46CHMZ5NUDDBF') at step skeptric_extract: 

We could then reconstruct a failing record from the logs to investigate what happened:

import datetime
from pathlib import PosixPath
from webrefine.query import WarcFileRecord

record = WarcFileRecord(url='https://skeptric.com/tags/data/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 38), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=130269, digest='R7CLAACFU5L7T5LKI5G53RZSMCNUNV6F')
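
From here we can fetch the capture's raw content and check it against the extraction logic (assuming WarcFileRecord exposes the same content property we used on the Wayback records above):

html = record.content.decode('utf-8')
# The tag listing page has no post-full-title heading, so the regex in
# skeptric_extract finds no match and next() raises StopIteration
'post-full-title' in html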
