A high-performance WARC parsing library for Python written in C++/Cython.
FastWARC
FastWARC is a high-performance WARC parsing library for Python written in C++/Cython. The API is largely inspired by WARCIO, but it is not intended as a drop-in replacement. FastWARC supports compressed and uncompressed WARC/1.0 and WARC/1.1 streams. Supported compression algorithms are GZip and LZ4.
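To make concrete what a compressed WARC stream looks like, here is a minimal standard-library sketch. It is not FastWARC's implementation and does not use its API. The helper names `iter_warc_records` and `make_record` are hypothetical, and the sketch assumes well-formed records with a `Content-Length` header and one GZip member per record.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (version, headers, body) from an uncompressed binary WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield version, headers, body


def make_record(version, record_type, payload):
    """Build a bare-bones record for illustration (hypothetical helper)."""
    head = (
        f"{version}\r\nWARC-Type: {record_type}\r\n"
        f"Content-Length: {len(payload)}\r\n\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


parts = [
    make_record("WARC/1.0", "warcinfo", b"software: demo"),
    make_record("WARC/1.1", "resource", b"hello"),
]
# Concatenate one GZip member per record into a single compressed stream.
compressed = b"".join(gzip.compress(p) for p in parts)

with gzip.open(io.BytesIO(compressed), "rb") as stream:
    records = list(iter_warc_records(stream))

# The same parser also handles the uncompressed bytes directly.
plain_records = list(iter_warc_records(io.BytesIO(b"".join(parts))))
```

A dedicated parser such as FastWARC is still the right tool for real archives. This sketch skips validation, HTTP parsing, and LZ4 support.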
FastWARC belongs to the ChatNoir Resiliparse toolkit for fast and robust web data processing.
Installing FastWARC
Pre-built FastWARC binaries for most Linux platforms can be installed from PyPI:
pip install fastwarc
However, these binaries are provided purely for your convenience. Because they are built on the very old manylinux base system for better compatibility, their performance isn't optimal (though still better than WARCIO). For best performance, see the next section on how to build FastWARC yourself.
Building FastWARC
You can compile FastWARC either from the PyPI source package or directly from this repository. In either case, you need to install all build-time dependencies first. On Debian / Ubuntu, this is done with:
sudo apt install build-essential python3-dev zlib1g-dev liblz4-dev
Then, to build FastWARC from the PyPI source package, run:
pip install --no-binary fastwarc fastwarc
That's it. If you prefer to build directly from this repository instead, run:
# Create venv (recommended, but not required)
python3 -m venv venv && source venv/bin/activate
# Install additional build dependencies
pip install cython setuptools
# Build and install:
BUILD_PACKAGES=fastwarc python setup.py install
Usage Instructions
For detailed usage instructions, please consult the FastWARC User Manual.
Built Distributions
Hashes for FastWARC-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 42af4ae01d174dbd30b39dcdd7c827e472d2a999afb9d9c6535cb6b5f25af1c0
MD5 | eab3298bf67d98b28ba93e7d23246a7a
BLAKE2b-256 | 1b6f2fd011eccd2153819590fd1978817b942f1b07151ee2445d11b85a29891d
Hashes for FastWARC-0.3.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 1f26bc2f4915a94028076206c282d6af279d5fc4c57bf549f3505387b3794bf0
MD5 | 4bfc9d6ad6fdcaa7278f7923a6354f91
BLAKE2b-256 | 67bcb439c10e5e8b11c62520e06ba148a089ea352df9eb5a3967835bac100ca6
Hashes for FastWARC-0.3.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 922cd16c69bcba96a64b716fa00db7d9ade4218912fe7203c65d3de187a0091c
MD5 | 62067cbb7dfbdcfe4ac035349db2b257
BLAKE2b-256 | d56adab777d1fbdecaaff7274b82fb32bdfabad3db1f8bee04a475483a57f9ce
Hashes for FastWARC-0.3.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 0b273c4da468a2e5359333adff7c12b91d324be44f253552db5adf9a2513db33
MD5 | 0c4971c56664301929fa255beb6f78e4
BLAKE2b-256 | e4c0f459d1a82eacc6458403da6f3bc64459a487c8ca09075d4ea5eab0b1d41e
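If you download a wheel manually, you can compare its digest against the SHA256 values listed above. The following is a minimal sketch using only the standard library. The function name `sha256_matches` is hypothetical, and the demo hashes a throwaway temporary file rather than a real wheel.

```python
import hashlib
import os
import tempfile


def sha256_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True if the SHA256 digest of the file equals expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


# Demo on a temporary file standing in for a downloaded wheel.
payload = b"stand-in for wheel contents"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    demo_ok = sha256_matches(tmp.name, hashlib.sha256(payload).hexdigest())
    demo_bad = sha256_matches(tmp.name, "0" * 64)
finally:
    os.remove(tmp.name)
```

For a real download, pass the path of the `.whl` file and the SHA256 value from the table that matches its file name.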