FastWARC

FastWARC is a high-performance WARC parsing library for Python written in C++/Cython. The API is largely inspired by WARCIO, but it does not aim to be a drop-in replacement. FastWARC supports compressed and uncompressed WARC/1.0 and WARC/1.1 streams. Supported compression algorithms are GZip and LZ4.

FastWARC belongs to the ChatNoir Resiliparse toolkit for fast and robust web data processing.

Installing FastWARC

Pre-built FastWARC binaries for most Linux platforms can be installed from PyPI:

pip install fastwarc

Note, however, that the Linux binaries are provided solely for your convenience. Because they are built on the very old manylinux base system for better compatibility, their performance is not optimal (though still better than WARCIO). For best performance, build FastWARC yourself as described in the next section.

Building FastWARC From Source

You can compile FastWARC either from the PyPI source package or directly from this repository. In either case, you need to install all required build-time dependencies first. On Ubuntu, this is done as follows:

sudo apt install build-essential python3-dev zlib1g-dev liblz4-dev

To build and install FastWARC from PyPI, run:

pip install --no-binary fastwarc fastwarc

That's it. If you prefer to build and install directly from this repository instead, run:

pip install -e fastwarc

To build the wheels without installing them, run:

pip wheel -e fastwarc

# Or:
pip install build && python -m build --wheel fastwarc
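
Whichever installation route you choose, it can be useful to confirm which FastWARC version ended up in the active environment. The following is a minimal sketch using only the Python standard library (`importlib.metadata`, Python 3.8+). The helper name `installed_version` is hypothetical and not part of FastWARC.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "fastwarc") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("FastWARC is not installed in this environment")
else:
    print(f"FastWARC {version} is installed")
```

This only reports the package version. It does not tell you whether the package came from a pre-built wheel or from a local source build.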

Usage Instructions

For detailed usage instructions, please consult the FastWARC User Manual.
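
As a quick orientation before turning to the manual, the following is a minimal sketch of iterating over the records of a GZip-compressed WARC stream. It assumes that `fastwarc.warc.ArchiveIterator` accepts a binary file-like object, detects the compression automatically, and exposes record headers through `record.headers[...]`. The helpers `build_warc_record` and `list_target_uris` and the sample record are hypothetical and exist only to make the example self-contained. If FastWARC is not installed, `list_target_uris` returns `None` instead of failing.

```python
import gzip
import io
from typing import List, Optional


def build_warc_record(uri: str, payload: bytes) -> bytes:
    """Assemble a minimal WARC/1.1 'resource' record (illustrative sample data)."""
    header_lines = [
        b"WARC/1.1",
        b"WARC-Type: resource",
        b"WARC-Record-ID: <urn:uuid:12345678-1234-5678-1234-567812345678>",
        b"WARC-Date: 2021-10-11T00:00:00Z",
        b"WARC-Target-URI: " + uri.encode("utf-8"),
        b"Content-Type: text/plain",
        b"Content-Length: " + str(len(payload)).encode("ascii"),
    ]
    # Header block, blank line, payload, then the two CRLFs that end a record.
    return b"\r\n".join(header_lines) + b"\r\n\r\n" + payload + b"\r\n\r\n"


def list_target_uris(data: bytes) -> Optional[List[str]]:
    """Return the WARC-Target-URI of every record, or None without FastWARC."""
    try:
        from fastwarc.warc import ArchiveIterator
    except ImportError:
        return None
    uris = []
    # Assumption: ArchiveIterator wraps file-like objects and auto-detects GZip.
    for record in ArchiveIterator(io.BytesIO(data)):
        uris.append(record.headers["WARC-Target-URI"])
    return uris


sample = gzip.compress(build_warc_record("https://example.com/", b"hello"))
print(list_target_uris(sample))
```

For real archives, you would pass an opened file (for example `open("archive.warc.gz", "rb")`, where the file name is a placeholder) instead of the in-memory sample.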

Cite Us

If you use FastWARC, please consider citing our OSSYM 2021 abstract paper:

@InProceedings{bevendorff:2021,
  author =                {Janek Bevendorff and Martin Potthast and Benno Stein},
  booktitle =             {3rd International Symposium on Open Search Technology (OSSYM 2021)},
  editor =                {Andreas Wagner and Christian Guetl and Michael Granitzer and Stefan Voigt},
  month =                 oct,
  publisher =             {International Open Search Symposium},
  site =                  {CERN, Geneva, Switzerland},
  title =                 {{FastWARC: Optimizing Large-Scale Web Archive Analytics}},
  year =                  2021
}

Download files

Source Distribution

FastWARC-0.12.2.tar.gz (355.4 kB)

Source

Built Distributions

FastWARC-0.12.2-cp310-cp310-win_amd64.whl (674.7 kB)

CPython 3.10 Windows x86-64

FastWARC-0.12.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)

CPython 3.10 manylinux: glibc 2.17+ x86-64

FastWARC-0.12.2-cp310-cp310-macosx_10_14_x86_64.whl (395.1 kB)

CPython 3.10 macOS 10.14+ x86-64

FastWARC-0.12.2-cp39-cp39-win_amd64.whl (679.0 kB)

CPython 3.9 Windows x86-64

FastWARC-0.12.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)

CPython 3.9 manylinux: glibc 2.17+ x86-64

FastWARC-0.12.2-cp39-cp39-macosx_10_14_x86_64.whl (397.4 kB)

CPython 3.9 macOS 10.14+ x86-64

FastWARC-0.12.2-cp38-cp38-win_amd64.whl (679.1 kB)

CPython 3.8 Windows x86-64

FastWARC-0.12.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)

CPython 3.8 manylinux: glibc 2.17+ x86-64

FastWARC-0.12.2-cp38-cp38-macosx_10_14_x86_64.whl (396.4 kB)

CPython 3.8 macOS 10.14+ x86-64

FastWARC-0.12.2-cp37-cp37m-win_amd64.whl (693.3 kB)

CPython 3.7m Windows x86-64

FastWARC-0.12.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.5 MB)

CPython 3.7m manylinux: glibc 2.17+ x86-64

FastWARC-0.12.2-cp37-cp37m-macosx_10_14_x86_64.whl (395.4 kB)

CPython 3.7m macOS 10.14+ x86-64
