Python library to work with ARC, WARC and WET files

knot-warc - Python library to work with Web ARChive files

Note: This is one of several forks of the original warc repository. It was created primarily for projects of the Knowledge Technology Research Group, but anyone is free to use it.

Fork history
  1. https://github.com/internetarchive/warc (original Python 2 library)
  2. https://github.com/recrm/warc3 (Python 3 port)
  3. https://github.com/jpbruinsslot/warc3 (Python 3 port)
  4. https://github.com/Willian-Zhang/warc3 (WET support)

WARC (Web ARChive) is a file format for storing web crawls (see http://bibnum.bnf.fr/WARC/).
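
To give a rough idea of what the format looks like on disk, here is a minimal standard-library sketch that reads one hand-written record: a version line, `Name: value` header fields, a blank line, and then a body of `Content-Length` bytes. The `read_record` helper and the sample bytes are illustrative assumptions, not part of this library, and the sample omits some fields a real record carries (such as WARC-Record-ID and WARC-Date).

```python
import io


def read_record(stream):
    """Read one WARC record from a binary stream.

    Hypothetical helper for illustration only; it is not this library's parser.
    Returns a (version, headers, body) tuple.
    """
    # First line is the version marker, e.g. "WARC/1.0".
    version = stream.readline().strip().decode("utf-8")

    headers = {}
    for line in iter(stream.readline, b""):
        line = line.strip()
        if not line:
            # A blank line ends the header block.
            break
        # Split on the first colon only, so URLs in values stay intact.
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip()] = value.strip()

    # The body is exactly Content-Length bytes long.
    body = stream.read(int(headers["Content-Length"]))
    return version, headers, body


# Simplified, hand-made sample record (not taken from a real crawl).
raw = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

version, headers, body = read_record(io.BytesIO(raw))
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], body)
```

In practice the library handles this parsing, so a helper like this is only useful for understanding the record layout.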

Examples

This warc library makes it very easy to work with WARC files:

import warc
with warc.open("test.warc") as f:
    for record in f:
        print(record['WARC-Target-URI'], record['Content-Length'])

And WET files:

import warc
with warc.open("test.warc.wet") as f:
    for record in f:
        print(record['WARC-Target-URI'], record['Content-Length'])
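
Building on the loops above, a small helper can summarize an archive by record type. This is only a sketch: `count_record_types` is a hypothetical name, and it assumes nothing about records beyond the `record['Header-Name']` access used in the examples. Plain dictionaries stand in for real records so the snippet runs without an archive on disk.

```python
from collections import Counter


def count_record_types(records):
    """Count records by their WARC-Type header.

    `records` may be any iterable of objects that support
    record['WARC-Type'] style access (hypothetical helper, not a library API).
    """
    return Counter(record['WARC-Type'] for record in records)


# Stand-in records; real ones would come from iterating an opened archive.
sample = [
    {'WARC-Type': 'warcinfo'},
    {'WARC-Type': 'response', 'WARC-Target-URI': 'http://example.com/'},
    {'WARC-Type': 'response', 'WARC-Target-URI': 'http://example.org/'},
]

print(count_record_types(sample))
```

With the library installed, the same helper should accept the file object directly, for example by calling count_record_types(f) inside the with block shown above.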

The examples folder contains some sample scripts. They are provided for inspiration only, without warranty or support, and they are not kept up to date.

Documentation

The documentation for this fork of the warc library is hosted on GitHub Pages (alternatively, see the original documentation).

Installation

You can install this fork of the warc library using pip:

pip install warc-knot

License

This software is licensed under GPL v2. See the LICENSE file for details.

Authors

Original Python 2 version:

  • Anand Chitipothu
  • Noufal Ibrahim

Python 3 port:

  • Ryan Chartier
  • Jan Pieter Bruins Slot
  • Almer S. Tigelaar

Modifications:

  • Willian Zhang
  • Michal Šmahel

Change Log

0.2.5:

  • Update sphinx docs

0.2.4:

  • Fix for Python 3.10+
  • Upgrade HTTP --> HTTPS in tests

0.2.3:

  • Support seeking in WARC/WET

0.2.2:

  • Allow parsing of WET files

Older...
