Workflow for refining datasets from World Wide Web data

webrefine

A workflow for refining web pages into useful datasets.

Read the full documentation

Install

pip install webrefine

How to use

We'll go through an example of getting some titles from my blog at skeptric.com.

The process consists of:

  • Defining Queries
  • Defining Extraction and Filters
  • Running the process

Querying data

To start we'll need some captures of my blog, and so we'll get them from the Internet Archive's Wayback Machine.

from webrefine.query import WaybackQuery

We could get some HTML pages:

skeptric_wb = WaybackQuery('skeptric.com/*', start='2020', end='2020', mime='text/html')
sample = list(skeptric_wb.query(limit=20))

We can take a look at some of the sample records:

sample[0]
WaybackRecord(url='https://skeptric.com/', timestamp=datetime.datetime(2020, 11, 26, 6, 41, 2), mime='text/html', status=200, digest='WDYU3RU7ZMFFSZPAPE56PC4L3EK4FE3D')
sample[1]
WaybackRecord(url='https://skeptric.com/casper-2-to-3/', timestamp=datetime.datetime(2020, 11, 26, 7, 52, 8), mime='text/html', status=200, digest='3XDBGHY77ZEA2Z7IVBARXEQT6UDLYAL7')
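
Since the records are simple data objects, we can summarise a sample with ordinary Python; for example, counting the status codes (nothing webrefine-specific here):

from collections import Counter
Counter(r.status for r in sample)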

We can also view them on the Wayback Machine to work out how to get the information we want:

sample[1].preview()

http://web.archive.org/web/20201126075208/https://skeptric.com/casper-2-to-3/

We could also query Common Crawl in a similar way with a CommonCrawlQuery. This has more captures, but takes a bit longer to run.

from webrefine.query import CommonCrawlQuery
skeptric_cc = CommonCrawlQuery('skeptric.com/*')

Another option is to use local WARC files (e.g. produced using warcio, or wget with its WARC options):

from webrefine.query import WarcFileQuery
test_data = '../resources/test/skeptric.warc.gz'

skeptric_file_query = WarcFileQuery(test_data)
[r.url for r in skeptric_file_query.query()]
['https://skeptric.com/pagination-wayback-cdx/',
 'https://skeptric.com/robots.txt',
 'https://skeptric.com/style.main.min.5ea2f07be7e07e221a7112a3095b89d049b96c48b831f16f1015bf2d95d914e5.css',
 'https://skeptric.com/',
 'https://skeptric.com/about/',
 'https://skeptric.com/tags/data',
 'https://skeptric.com/tags/data/',
 'https://skeptric.com/images/wayback_empty_returns.png',
 'https://skeptric.com/searching-100b-pages-cdx',
 'https://skeptric.com/searching-100b-pages-cdx/',
 'https://skeptric.com/fast-web-data-workflow/',
 'https://skeptric.com/key-web-captures/',
 'https://skeptric.com/emacs-tempfile-hugo/']
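
If you don't already have a WARC file, one way to capture pages yourself is with warcio's capture_http helper. This is a minimal sketch (the output filename and the use of the requests library are my own choices, not part of webrefine):

from warcio.capture_http import capture_http
import requests  # requests must be imported after capture_http so responses are recorded

# every HTTP response fetched inside the block is written to the WARC file
with capture_http('skeptric_capture.warc.gz'):
    requests.get('https://skeptric.com/')

The resulting file can then be passed to WarcFileQuery just like the test data above.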

Filtering and Extracting the Data

From inspecting some of the pages we can see that the titles are written like:

<h1 class="post-full-title">{TITLE}</h1>

In a real example we'd parse the HTML (see the sketch after the test below), but for simplicity we'll extract it with a regular expression:

import re
def skeptric_extract(content, record):
    html = content.decode('utf-8')
    title = next(re.finditer('<h1 class="post-full-title">([^<]+)</h1>', html)).group(1)
    return {
        'title': title,
        'url': record.url,
        'timestamp': record.timestamp
    }

We can then test it on some content fetched from the Wayback Machine:

skeptric_extract(sample[1].content, sample[1])
{'title': 'Hugo Casper 2 to 3',
 'url': 'https://skeptric.com/casper-2-to-3/',
 'timestamp': datetime.datetime(2020, 11, 26, 7, 52, 8)}
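
If we wanted to parse the HTML properly rather than rely on a regular expression, a sketch of an equivalent extractor using BeautifulSoup (an extra dependency, not something webrefine requires) might look like:

from bs4 import BeautifulSoup

def skeptric_extract_bs4(content, record):
    # hypothetical variant of skeptric_extract that parses the HTML instead of using a regex
    soup = BeautifulSoup(content.decode('utf-8'), 'html.parser')
    title = soup.find('h1', class_='post-full-title').get_text()
    return {
        'title': title,
        'url': record.url,
        'timestamp': record.timestamp
    }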

Some pages don't have a title, so we filter them out, and we also remove duplicate captures:

def skeptric_filter(records):
    last_url = None
    for record in records:
        # Only use ok HTML captures
        if record.mime != 'text/html' or record.status != 200:
            continue
        # Pages that are not articles (and so do not have a title)
        if record.url == 'https://skeptric.com/' or '/tags/' in record.url:
            continue
        # Duplicates (using the fact that here the posts come in order)
        if record.url == last_url:
            continue
        last_url = record.url
        yield record
[r.url for r in skeptric_filter(sample)]
['https://skeptric.com/casper-2-to-3/',
 'https://skeptric.com/common-crawl-index-athena/',
 'https://skeptric.com/common-crawl-job-ads/',
 'https://skeptric.com/considering-vscode/',
 'https://skeptric.com/decorating-pandas-tables/',
 'https://skeptric.com/drive-metrics/',
 'https://skeptric.com/emacs-buffering/',
 'https://skeptric.com/ngram-python/',
 'https://skeptric.com/portable-custom-config/',
 'https://skeptric.com/searching-100b-pages-cdx/',
 'https://skeptric.com/text-meta-data-commoncrawl/']
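
The deduplication above relies on the captures arriving grouped by URL; if they didn't, one option would be to deduplicate on the digest field that each record carries (a sketch, assuming identical content always gets the same digest):

def dedupe_by_digest(records):
    # keep only the first capture of each unique content digest
    seen = set()
    for record in records:
        if record.digest in seen:
            continue
        seen.add(record.digest)
        yield record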

Running the process

Now that we've written all the logic we need, we can collect it in a Process to run:

from webrefine.runners import Process
skeptric_process = Process(
    queries=[skeptric_file_query,
             # commented out to make faster
             #skeptric_wb,
             #skeptric_cc,
          ],
    filter=skeptric_filter,
    steps = [skeptric_extract])

We can wrap it in a runner and run it all with .run.

%%time
from webrefine.runners import RunnerMemory
data = list(RunnerMemory(skeptric_process).run())
data
CPU times: user 290 ms, sys: 14.8 ms, total: 305 ms
Wall time: 304 ms

[{'title': 'Pagination in Internet Archive&#39;s Wayback Machine with CDX',
  'url': 'https://skeptric.com/pagination-wayback-cdx/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 34)},
 {'title': 'About Skeptric',
  'url': 'https://skeptric.com/about/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 37)},
 {'title': 'Searching 100 Billion Webpages Pages With Capture Index',
  'url': 'https://skeptric.com/searching-100b-pages-cdx/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 39)},
 {'title': 'Fast Web Dataset Extraction Worfklow',
  'url': 'https://skeptric.com/fast-web-data-workflow/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 39)},
 {'title': 'Unique Key for Web Captures',
  'url': 'https://skeptric.com/key-web-captures/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 40)},
 {'title': 'Hugo Readdir Error with Emacs',
  'url': 'https://skeptric.com/emacs-tempfile-hugo/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 40)}]
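
The result is just a list of dictionaries, so it is easy to load into something like pandas for further analysis (pandas is my own choice here, not a webrefine dependency):

import pandas as pd

# one row per extracted title, with url and timestamp columns
df = pd.DataFrame(data)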

For larger jobs RunnerCached is better, as it caches intermediate results to a file:

%%time
from webrefine.runners import RunnerCached

cache_path = './test_cache.sqlite'

data = list(RunnerCached(skeptric_process, path=cache_path).run())
data
CPU times: user 252 ms, sys: 10.7 ms, total: 263 ms
Wall time: 286 ms

[{'title': 'Pagination in Internet Archive&#39;s Wayback Machine with CDX',
  'url': 'https://skeptric.com/pagination-wayback-cdx/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 34)},
 {'title': 'About Skeptric',
  'url': 'https://skeptric.com/about/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 37)},
 {'title': 'Searching 100 Billion Webpages Pages With Capture Index',
  'url': 'https://skeptric.com/searching-100b-pages-cdx/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 39)},
 {'title': 'Fast Web Dataset Extraction Worfklow',
  'url': 'https://skeptric.com/fast-web-data-workflow/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 39)},
 {'title': 'Unique Key for Web Captures',
  'url': 'https://skeptric.com/key-web-captures/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 40)},
 {'title': 'Hugo Readdir Error with Emacs',
  'url': 'https://skeptric.com/emacs-tempfile-hugo/',
  'timestamp': datetime.datetime(2021, 11, 26, 11, 28, 40)}]

# clean up the temporary cache file
import os
os.unlink(cache_path)

Note that when errors occur in the steps, the process keeps going and the errors are logged:

skeptric_error_process = Process(
    queries=[skeptric_file_query,
             # commented out to make faster
             #skeptric_wb,
             #skeptric_cc,
          ],
    filter=lambda x: x,
    steps = [skeptric_extract])
data = list(RunnerMemory(skeptric_error_process).run())
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/robots.txt', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 34), mime='text/html', status=404, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=5804, digest='QRNGXIUXE4LAI3XR5RVATIUX5GTB33HX') at step skeptric_extract: 
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/style.main.min.5ea2f07be7e07e221a7112a3095b89d049b96c48b831f16f1015bf2d95d914e5.css', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 35), mime='text/css', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=7197, digest='LINCDTSPQGAQGZZ6LY2XFXZHG2X476H6') at step skeptric_extract: 
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 36), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=17122, digest='JJVB3MQERHRZJCHOJNKS5VDOODXPZAV2') at step skeptric_extract: 
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/tags/data', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 37), mime='text/html', status=302, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=129093, digest='ZZZXDZTTV2KTABRO64ESHVWFPNKB4I5H') at step skeptric_extract: 
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/tags/data/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 38), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=130269, digest='R7CLAACFU5L7T5LKI5G53RZSMCNUNV6F') at step skeptric_extract: 
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/images/wayback_empty_returns.png', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 38), mime='image/png', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=160971, digest='SU7JRTHNW6KFCJQFL5PMMKV33U2VLV7T') at step skeptric_extract: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
ERROR:root:Error processing WarcFileRecord(url='https://skeptric.com/searching-100b-pages-cdx', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 39), mime='text/html', status=302, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=173368, digest='AYVHQLVFIVGZGUYPEHX46CHMZ5NUDDBF') at step skeptric_extract: 

We could then investigate one of the failing records to see what happened:

import datetime
from pathlib import PosixPath
from webrefine.query import WarcFileRecord

record = WarcFileRecord(url='https://skeptric.com/tags/data/', timestamp=datetime.datetime(2021, 11, 26, 11, 28, 38), mime='text/html', status=200, path=PosixPath('../resources/test/skeptric.warc.gz'), offset=130269, digest='R7CLAACFU5L7T5LKI5G53RZSMCNUNV6F')
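
The blank messages above most likely come from next() raising a StopIteration when the regular expression finds no title on a non-article page. We can check that directly, assuming WarcFileRecord exposes the same content property we used on the Wayback records:

html = record.content.decode('utf-8')
# the tags page is not an article, so the post title heading should be absent
'<h1 class="post-full-title">' in html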

