FastWARC

FastWARC is a high-performance WARC parsing library for Python written in C++/Cython. The API is largely inspired by WARCIO, but it is not intended to be a drop-in replacement. FastWARC supports compressed and uncompressed WARC/1.0 and WARC/1.1 streams. Supported compression algorithms are GZip and LZ4.
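To illustrate the three stream flavours mentioned above (GZip, LZ4, uncompressed), here is a minimal standard-library sketch that guesses the flavour of a file from its first bytes. It is not part of FastWARC and not how FastWARC itself detects compression; the function name `sniff_warc_stream` is hypothetical, and the LZ4 check assumes the standard LZ4 frame format.

```python
import gzip

# Well-known magic numbers at the start of each stream type.
GZIP_MAGIC = b"\x1f\x8b"               # RFC 1952 gzip member header
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"  # LZ4 frame format (0x184D2204, little-endian)
WARC_PREFIX = b"WARC/"                 # version line of an uncompressed WARC record


def sniff_warc_stream(head: bytes) -> str:
    """Guess the stream flavour from the first few bytes (hypothetical helper)."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(LZ4_FRAME_MAGIC):
        return "lz4"
    if head.startswith(WARC_PREFIX):
        return "uncompressed"
    return "unknown"


if __name__ == "__main__":
    # Demo on in-memory data; for a real file, pass the result of f.read(8).
    print(sniff_warc_stream(gzip.compress(b"WARC/1.0\r\n")))
    print(sniff_warc_stream(b"WARC/1.1\r\n"))
```

A check like this can be handy for sorting mixed input files before processing, but it only inspects the container, not whether the content is a valid WARC record.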

FastWARC belongs to the ChatNoir Resiliparse toolkit for fast and robust web data processing.

Installing FastWARC

Pre-built FastWARC binaries for most Linux platforms can be installed from PyPI:

pip install fastwarc

Note, however, that these binaries are provided solely for your convenience. Because they are built on the very old manylinux base system for better compatibility, their performance is not optimal (though still better than WARCIO). For best performance, build FastWARC yourself as described in the next section.

Building FastWARC

You can compile FastWARC either from the PyPI source package or directly from this repository. In either case, you need to install all build-time dependencies first. On Debian / Ubuntu, this is done with:

sudo apt install build-essential python3-dev zlib1g-dev liblz4-dev

Then, to build FastWARC from PyPI, run:

pip install --no-binary fastwarc fastwarc

That's it. If you prefer to build directly from this repository instead, run:

# Create venv (recommended, but not required)
python3 -m venv venv && source venv/bin/activate

# Install additional build dependencies
pip install cython setuptools

# Build and install:
BUILD_PACKAGES=fastwarc python setup.py install
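
Whichever install route you choose, one way to confirm that the package is visible to your interpreter is to query the installed distribution metadata. The following is a small sketch using only the standard library (Python 3.8+); the helper name `installed_version` is hypothetical and not part of FastWARC.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "fastwarc") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("FastWARC is not installed in this environment")
    else:
        print(f"FastWARC {version} is installed")
```

Note that this only reports whether a distribution is installed, not whether it came from a pre-built wheel or from a local build.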

Usage Instructions

For detailed usage instructions, please consult the FastWARC User Manual.
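
As a quick orientation before turning to the manual, the sketch below builds a tiny GZip-compressed WARC record in memory and iterates over it. It assumes that `fastwarc.warc.ArchiveIterator` accepts a Python file-like object and that records expose `headers`, `content_length`, and `reader`, as described in the user manual; check the manual for the exact API of your version. The helper names `make_sample_warc_gz` and `summarize_records` are hypothetical.

```python
import gzip
import io
import uuid


def make_sample_warc_gz(payload: bytes = b"hello FastWARC") -> bytes:
    """Build one minimal WARC/1.0 'resource' record and GZip-compress it."""
    headers = [
        b"WARC/1.0",
        b"WARC-Type: resource",
        b"WARC-Record-ID: <urn:uuid:" + str(uuid.uuid4()).encode() + b">",
        b"WARC-Date: 2021-01-01T00:00:00Z",
        b"WARC-Target-URI: https://example.com/",
        b"Content-Type: text/plain",
        b"Content-Length: " + str(len(payload)).encode(),
    ]
    # Header block, blank line, payload, then the two CRLFs that end a record.
    record = b"\r\n".join(headers) + b"\r\n\r\n" + payload + b"\r\n\r\n"
    return gzip.compress(record)


def summarize_records(data: bytes):
    """Iterate over WARC records with FastWARC (API assumed, see lead-in)."""
    from fastwarc.warc import ArchiveIterator  # requires FastWARC to be installed

    summary = []
    for record in ArchiveIterator(io.BytesIO(data)):
        body = record.reader.read()  # read the record payload
        summary.append((record.headers["WARC-Type"], record.content_length, body))
    return summary


if __name__ == "__main__":
    sample = make_sample_warc_gz()
    try:
        for warc_type, length, body in summarize_records(sample):
            print(warc_type, length, body)
    except ImportError:
        print("FastWARC is not installed; see the installation section above")
```

For real archives, you would pass an open binary file instead of an in-memory buffer.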
