A high-performance WARC parsing library for Python written in C++/Cython.
FastWARC
FastWARC is a high-performance WARC parsing library for Python written in C++/Cython. The API is largely inspired by WARCIO, but it is not intended as a drop-in replacement. FastWARC supports compressed and uncompressed WARC/1.0 and WARC/1.1 streams. Supported compression algorithms are GZip and LZ4.
FastWARC belongs to the ChatNoir Resiliparse toolkit for fast and robust web data processing.
Installing FastWARC
Pre-built FastWARC binaries for most Linux platforms can be installed from PyPI:
pip install fastwarc
However, the Linux binaries are provided solely for your convenience. Because they are built on the very old manylinux base system for better compatibility, their performance isn't optimal (though still better than WARCIO's). For best performance, see the next section on how to build FastWARC yourself.
Building FastWARC From Source
You can compile FastWARC either from the PyPI source package or directly from this repository. In either case, you first need to install all required build-time dependencies. On Ubuntu, this is done as follows:
sudo apt install build-essential python3-dev zlib1g-dev liblz4-dev
To build and install FastWARC from PyPI, run:
pip install --no-binary fastwarc fastwarc
That's it. If you prefer to build and install directly from this repository instead, run:
pip install -e fastwarc
To build the wheels without installing them, run:
pip wheel -e fastwarc
# Or:
pip install build && python -m build --wheel fastwarc
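After either installation route, it can be useful to confirm which FastWARC version Python actually sees. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of FastWARC.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "fastwarc") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing.

    `installed_version` is an illustrative helper, not a FastWARC API.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("FastWARC is not installed in this environment")
    else:
        print(f"FastWARC {found} is installed")
```

Note that this reports only the version, not whether the package came from a pre-built wheel or a local build.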
Usage Instructions
For detailed usage instructions, please consult the FastWARC User Manual.
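As a quick orientation before reading the manual, here is a minimal sketch of iterating over the response records of a WARC file. It assumes that `ArchiveIterator` and `WarcRecordType` are importable from `fastwarc.warc`, that the iterator accepts a binary file object and a `record_types` filter, and that compression is detected automatically; check the User Manual for the API of your installed version. The function name `iter_responses` and the file name in the usage part are hypothetical.

```python
from typing import Iterator, Tuple


def iter_responses(path: str) -> Iterator[Tuple[str, int]]:
    """Yield (target URI, body length in bytes) for each response record.

    Illustrative helper, not part of FastWARC itself.
    """
    # Imported lazily so this module can be loaded without FastWARC installed.
    from fastwarc.warc import ArchiveIterator, WarcRecordType

    with open(path, "rb") as stream:
        # Assumption: record_types restricts iteration to response records.
        for record in ArchiveIterator(stream, record_types=WarcRecordType.response):
            uri = record.headers["WARC-Target-URI"]
            # Read the body before the iterator advances to the next record.
            body = record.reader.read()
            yield uri, len(body)


if __name__ == "__main__":
    import os

    sample = "example.warc.gz"  # hypothetical file name
    if os.path.exists(sample):
        for uri, size in iter_responses(sample):
            print(f"{size:>10}  {uri}")
```

Filtering by record type at the iterator level, rather than inside the loop, lets the parser skip records that are not needed.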
Cite Us
If you use FastWARC, please consider citing our OSSYM 2021 abstract paper:
@InProceedings{bevendorff:2021,
author = {Janek Bevendorff and Martin Potthast and Benno Stein},
booktitle = {3rd International Symposium on Open Search Technology (OSSYM 2021)},
editor = {Andreas Wagner and Christian Guetl and Michael Granitzer and Stefan Voigt},
month = oct,
publisher = {International Open Search Symposium},
site = {CERN, Geneva, Switzerland},
title = {{FastWARC: Optimizing Large-Scale Web Archive Analytics}},
year = 2021
}
Built Distributions
Hashes for FastWARC-0.12.0-cp310-cp310-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 3cda9bf79cf658d5b3ecd1ee8bb9d8ebdcbc272c0ece80ef68072328cfefa3b4
MD5 | 17832d93f4440dde3acd92ecb653f871
BLAKE2b-256 | 8e2f316fd2fb0ca972c28fc1fa803a84d5638d01057ada443499b20aaf8bec28
Hashes for FastWARC-0.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 52b880e979d1aed8481b4be95aca0edd152a16cb44cb92e90f4a64d687b20979
MD5 | 7cfbdccd59464d5b17d5185e57ce54b6
BLAKE2b-256 | d431168c32c0ffe8b24d90dab560deadd61a68a1f57758e7314fbf021fac5124
Hashes for FastWARC-0.12.0-cp310-cp310-macosx_10_14_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 23b4bf84d80a689a328148da4966b87cc1f6f0c771e6f4aa335f4f0ade3f993f
MD5 | 143e7343b85e9d5a8d9a3ee23b573af9
BLAKE2b-256 | 024ca042f6f40d04cbfeea0d2a4bd6ebc1982d80ba8f159614f1a05c40929e78
Hashes for FastWARC-0.12.0-cp39-cp39-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | 9732dcf257ac5490290e72f524f995d6afb62e28c8c0e72221396efe9f33c963
MD5 | 1b0afb3480f1f2cafc80a208c981f65e
BLAKE2b-256 | 9784641d5e0b449f636d15f4d5d70c0794c002ccafcdff815e8ea3aa4431e128
Hashes for FastWARC-0.12.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 8483a4dd1208cc2c398743ceeaf75089d77abfd7c58c1de8b05a59600a2a50a9
MD5 | a14575c207d8c079c9b1942cc08bd36c
BLAKE2b-256 | 0103018a755daf6f53361a97cecf8fee01ade8a53150580413fd0a7be638746b
Hashes for FastWARC-0.12.0-cp39-cp39-macosx_10_14_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 3aa42e4d1f893bffe38a3691bdb70e943918201dc102db3777fda49cf8512902
MD5 | 8fcdef460295495b89d4df32cdcd4525
BLAKE2b-256 | bf477d76205e433e8dfcff4c3596f17b23808fbf27cbf75b80a0afbf44c68510
Hashes for FastWARC-0.12.0-cp38-cp38-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | c427ff866aee60527fac9d0d3307a1bdfd5e776f2c10fff66c501fd51aef7368
MD5 | 073c248da2c61904d00ebd9cd4005162
BLAKE2b-256 | 31119f29bae1d34a629e7b6020622d3092642e59a29fdc9d07653ece1a012dde
Hashes for FastWARC-0.12.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 4d21376b2f88cf810de6292cfdecddef305f1e20f7c3fdf74402858888e55dd8
MD5 | e03a6c9009a1f4f732256a863ed5d68d
BLAKE2b-256 | 392c8138ea0fe2f3e985609f4c7d79351e0f83866fa8a4a8fd166980c1ed8005
Hashes for FastWARC-0.12.0-cp38-cp38-macosx_10_14_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 80454cec9416636e56f5f9f40d54372b7712dc93ed25141679d3eca3e62b4b9d
MD5 | 6945145275cde35b11ee477d45c3693b
BLAKE2b-256 | e2997838eb9ecafbc430f1ef772f86fb0913d9a67e1295d91194f291496beee4
Hashes for FastWARC-0.12.0-cp37-cp37m-win_amd64.whl
Algorithm | Hash digest
---|---
SHA256 | f9ab85b985f8d9f4634f4fc9e894c12443d6d270908bce9bc7520aa9f9ea94ed
MD5 | 2b194c5440098df689dbaaa95b8ea86b
BLAKE2b-256 | ca564b8d79b10802c51c625134e1389405c41b5a2198e0d9607d4c3f376c5be1
Hashes for FastWARC-0.12.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | da4c7a2601d5f2d65530998fd7db6649105d16b35e1c7633aeadc57dedf40232
MD5 | 7a1235c08713ce964b77c88b07722fab
BLAKE2b-256 | ba7a2570aef319924d7b50625d0fcbc5263e21958a46e6dd9b5fa4e1a962f60a
Hashes for FastWARC-0.12.0-cp37-cp37m-macosx_10_14_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | 58fe6461afad88130818ca56b6ef6fcfc9bd781e25e90157d92bb03c6b4823b4
MD5 | 33e0625b39daf9681e4203ffa9544de4
BLAKE2b-256 | 6463d72aeced8128de656930a45aa1601fb5e463fed91b7a1f3e53e315f0e853
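The digests listed above can be used to check a downloaded wheel before installing it. The following is a minimal sketch using only Python's standard library; the helper names `sha256_of` and `matches_sha256` are hypothetical, and the example assumes the wheel was saved to the current directory under its original file name.

```python
import hashlib
import hmac
import os


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string (case-insensitive)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())


if __name__ == "__main__":
    # File name and digest taken from the cp310 manylinux entry above.
    wheel = "FastWARC-0.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
    expected = "52b880e979d1aed8481b4be95aca0edd152a16cb44cb92e90f4a64d687b20979"
    if os.path.exists(wheel):
        print("digest OK" if matches_sha256(wheel, expected) else "digest MISMATCH")
```

SHA256 is used here rather than MD5 because MD5 is no longer considered collision-resistant.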