Python library to work with ARC, WARC and WET files
knot-warc - Python library to work with Web ARChive files
Note: This is one of several forks of the original WARC repository. It was created primarily for projects of the Knowledge Technology Research Group, but anyone is free to use it.
Fork history
- https://github.com/internetarchive/warc (original Python 2 library)
- https://github.com/recrm/warc3 (Python 3 port)
- https://github.com/jpbruinsslot/warc3 (Python 3 port)
- https://github.com/Willian-Zhang/warc3 (WET support)
WARC (Web ARChive) is a file format for storing web crawls (see http://bibnum.bnf.fr/WARC/).
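For readers unfamiliar with the format, here is a minimal sketch of how an uncompressed WARC file is laid out: each record is a version line, a block of `Name: value` headers, a blank line, and then `Content-Length` bytes of payload followed by two CRLFs. The function name `iter_warc_records` and the sample bytes are made up for illustration. This is not how the library is implemented, and it ignores gzip compression, case-insensitive header names and other details found in real files.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream.

    Illustrative only: a hypothetical helper, not part of the warc library.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Hand-written sample record (made-up data), repeated twice.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
records = list(iter_warc_records(io.BytesIO(sample * 2)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```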
Examples
This warc library makes it very easy to work with WARC files:
import warc
with warc.open("test.warc") as f:
    for record in f:
        print(record['WARC-Target-URI'], record['Content-Length'])
And WET files:
import warc
with warc.open("test.warc.wet") as f:
    for record in f:
        print(record['WARC-Target-URI'], record['Content-Length'])
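Building on the loops above, the sketch below totals the declared `Content-Length` per host. It relies only on the `record[...]` header lookups shown in the examples. The helper name `bytes_per_host` and the stand-in records are hypothetical, and it assumes that looking up a missing header raises `KeyError`, as it does for a plain dict.

```python
from collections import Counter
from urllib.parse import urlsplit


def bytes_per_host(records):
    """Sum the declared Content-Length for each host in WARC-Target-URI.

    `records` is any iterable of objects supporting record[header_name].
    """
    totals = Counter()
    for record in records:
        try:
            uri = record["WARC-Target-URI"]
            length = int(record["Content-Length"])
        except KeyError:
            # Assumption: records without a target URI (for example
            # warcinfo records) raise KeyError and are skipped.
            continue
        totals[urlsplit(uri).hostname] += length
    return totals


# Stand-in records (made-up data) so the sketch runs without a WARC file.
sample_records = [
    {"WARC-Target-URI": "http://example.com/a", "Content-Length": "120"},
    {"WARC-Target-URI": "http://example.com/b", "Content-Length": "80"},
    {"WARC-Target-URI": "https://example.org/", "Content-Length": "50"},
    {"WARC-Type": "warcinfo", "Content-Length": "10"},
]
totals = bytes_per_host(sample_records)
for host, size in totals.most_common():
    print(host, size)
```

With a real archive, the file object returned by `warc.open(...)` could be passed in place of `sample_records`, since it is iterated the same way in the examples above.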
Some examples are provided in the examples folder, without warranty or support, just for inspiration. They are not kept up to date either.
Documentation
The documentation for this fork of the warc library is on GitHub Pages (alternatively, see the original documentation).
Installation
You can install this fork of the warc library using pip:
pip install warc-knot
License
This software is licensed under GPL v2. See LICENSE file for details.
Authors
Original Python 2 versions:
- Anand Chitipothu
- Noufal Ibrahim
Python 3 port:
- Ryan Chartier
- Jan Pieter Bruins Slot
- Almer S. Tigelaar
Modifications:
- Willian Zhang
- Michal Šmahel
Change Log
0.2.5:
- Update sphinx docs
0.2.4:
- Fix for Python 3.10+
- Upgrade HTTP --> HTTPS in tests
0.2.3:
- Support seeking in WARC/WET files
0.2.2:
- Allow parsing of WET files
Older...
Download files
File details
Details for the file warc_knot-0.2.5.tar.gz.
File metadata
- Download URL: warc_knot-0.2.5.tar.gz
- Upload date:
- Size: 33.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c1d23e2504317637ec28abac7813d4553ab6d451d2b6efaf2098a7ec01a01f7d |
| MD5 | a5d6cc89ba7e92a2d77d12f0700bc213 |
| BLAKE2b-256 | f49f3a283c47eedb58b4125be799b98dd48a01f980e553cdd876dc1c3a50409b |
File details
Details for the file warc_knot-0.2.5-py3-none-any.whl.
File metadata
- Download URL: warc_knot-0.2.5-py3-none-any.whl
- Upload date:
- Size: 24.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.1 CPython/3.11.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a9bbf46c55d27cce875d431d85c084f5b3f14bb7f28438e7b782dd2958d6764 |
| MD5 | 7c9b8c186f5043bb88334794542edd8c |
| BLAKE2b-256 | 4d05c4372c1f8d086cdaa7f0130e7660479109bf15d343be4d8beadc108e5e4a |
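The SHA256 digests listed for the two files can be checked locally after downloading them. The following is a small sketch using only the standard library `hashlib` module. The helper names `sha256_of_file` and `matches` are made up for illustration, and the paths are assumed to point at files already downloaded into the current directory.

```python
import hashlib
import os

# SHA256 digests copied from the file details listed above.
EXPECTED_SHA256 = {
    "warc_knot-0.2.5.tar.gz":
        "c1d23e2504317637ec28abac7813d4553ab6d451d2b6efaf2098a7ec01a01f7d",
    "warc_knot-0.2.5-py3-none-any.whl":
        "3a9bbf46c55d27cce875d431d85c084f5b3f14bb7f28438e7b782dd2958d6764",
}


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's SHA256 digest equals the expected hex string."""
    return sha256_of_file(path) == expected_hex.lower()


# Check whichever of the listed files are present in the current directory.
for name, expected in EXPECTED_SHA256.items():
    if os.path.exists(name):
        print(name, "OK" if matches(name, expected) else "MISMATCH")
```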