FastWARC
FastWARC is a high-performance WARC parsing library for Python written in C++/Cython. The API is largely inspired by WARCIO, but does not aim to be a drop-in replacement. FastWARC supports compressed and uncompressed WARC/1.0 and WARC/1.1 streams. Supported compression algorithms are GZip and LZ4.
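To illustrate the three stream flavours mentioned above, the following standard-library sketch guesses whether a binary stream is GZip-compressed, LZ4-compressed, or an uncompressed WARC by looking at its first bytes. This is not FastWARC's API or its detection logic; the function name `sniff_warc_stream` is hypothetical, and the check assumes a seekable stream and the standard LZ4 frame format.

```python
import gzip
import io

# Well-known leading bytes of each format.
GZIP_MAGIC = b"\x1f\x8b"                 # GZip member header
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"    # LZ4 frame magic 0x184D2204, little-endian
WARC_PREFIX = b"WARC/"                   # start of "WARC/1.0" or "WARC/1.1"


def sniff_warc_stream(stream):
    """Return 'gzip', 'lz4', 'uncompressed', or 'unknown' for a seekable binary stream.

    Hypothetical helper for illustration only; it peeks at the first
    bytes and restores the stream position afterwards.
    """
    start = stream.tell()
    head = stream.read(5)
    stream.seek(start)
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(LZ4_FRAME_MAGIC):
        return "lz4"
    if head.startswith(WARC_PREFIX):
        return "uncompressed"
    return "unknown"


# A minimal WARC/1.1 record with an empty body, built in memory.
record = b"WARC/1.1\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n"

print(sniff_warc_stream(io.BytesIO(record)))                  # uncompressed
print(sniff_warc_stream(io.BytesIO(gzip.compress(record))))   # gzip
```

A check like this can be useful for routing files before parsing, but in practice the parsing library should be left to handle decompression itself.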
FastWARC belongs to the ChatNoir Resiliparse toolkit for fast and robust web data processing.
Installing FastWARC
Pre-built FastWARC binaries for most Linux platforms can be installed from PyPI:
pip install fastwarc
Note, however, that these binaries are provided solely for convenience. Because they are built on the very old manylinux base system for better compatibility, their performance is not optimal (though still better than WARCIO). For best performance, build FastWARC yourself as described in the next section.
Building FastWARC
You can compile FastWARC either from the PyPI source package or directly from this repository. In either case, you need to install all build-time dependencies first. On Debian / Ubuntu, this is done with:
sudo apt install build-essential python3-dev zlib1g-dev liblz4-dev
Then, to build FastWARC from the PyPI source package, run:
pip install --no-binary fastwarc fastwarc
That's it. If you prefer to build directly from this repository instead, run:
# Create venv (recommended, but not required)
python3 -m venv venv && source venv/bin/activate
# Install additional build dependencies
pip install cython setuptools
# Build and install:
BUILD_PACKAGES=fastwarc python setup.py install
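After either installation route, a quick check from Python can confirm that the package is visible to the interpreter you intend to use. The sketch below relies only on the standard library; the helper name `installed_version` is hypothetical, and it assumes that both the importable module and the installed distribution are named `fastwarc`, which may not hold for every build setup.

```python
from importlib import metadata, util


def installed_version(dist_name="fastwarc", module_name="fastwarc"):
    """Return the installed distribution version, or None if it is not found.

    Hypothetical helper: the default names are assumptions based on the
    install commands above.
    """
    # find_spec returns None when the top-level module cannot be imported.
    if util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print("FastWARC not installed" if version is None else f"FastWARC {version}")
```

If you created a virtual environment as suggested above, run the check with that environment activated, since a system-wide interpreter will not see packages installed into the venv.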
Usage Instructions
For detailed usage instructions, please consult the FastWARC User Manual.
File details
Details for the file FastWARC-0.3.6.tar.gz.
File metadata
- Download URL: FastWARC-0.3.6.tar.gz
- Upload date:
- Size: 323.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 77fbde1ec08077c261abc93529ce05339e409d9dcee7d238ec0c2ff018f0722f |
| MD5 | 3ec7f815d28d37649930b38ab0b46422 |
| BLAKE2b-256 | 5d2902d27f3c486d986d249f1d481745517bb1256ea709d770b3186cb335b6bd |
File details
Details for the file FastWARC-0.3.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: FastWARC-0.3.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.10, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cc1dd4bad186ad6b202e70c80ec17188589b1af2485900d21e8cdd188dc752c9 |
| MD5 | b80bccba2fd58f5ab06c350edceb76bf |
| BLAKE2b-256 | 37d7dcc0b8d716a32df55dcb1d8d8e532c70338d6ed4e00f16f71d1aa7ff34b8 |
File details
Details for the file FastWARC-0.3.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: FastWARC-0.3.6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f5df2c615cde136c3425a7bab693ef675c205b59421b843e4660fb285e65515 |
| MD5 | 6b71911df95481308f2ccb880ff8c5d3 |
| BLAKE2b-256 | 524e7c4f9e58b084dcbae0ddda943580c8b8a5ea83a4e7683185de776f793c7f |
File details
Details for the file FastWARC-0.3.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: FastWARC-0.3.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.7 MB
- Tags: CPython 3.8, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a007ab5d6731cb570a4c3b211b09afaf7459b0a662f10a974c1b969ea75b2cc |
| MD5 | 2f86823c29ba5e33704cd7df80f74de4 |
| BLAKE2b-256 | fa91f769732c2e780d5effa0bdd918dd834d647cdd9628f473668fe665eb1f20 |
File details
Details for the file FastWARC-0.3.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: FastWARC-0.3.6-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.6 MB
- Tags: CPython 3.7m, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.4.1 importlib_metadata/4.6.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.61.2 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cd4afa3fde037b4a9d16f43c0353458a4c6888c0a6875667d3853c59f2e6bfb5 |
| MD5 | dfa96b3f9a95dfdf6a2440f8ca74ac5c |
| BLAKE2b-256 | 0315b12c3f2d071896132ddcecce58eaef96d5133fe83ad4c382864279f49b6d |