A utility library for working with CDXJ files
cdxj_util
cdxj_util is a Python library for efficiently processing CDXJ files, the line-oriented web archive index format in which each record pairs a sortable URL key and a capture timestamp with a JSON block of metadata. It provides functions for loading, searching, and analyzing large CDXJ files.
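To make the file format concrete, here is a minimal standard-library sketch of how one CDXJ line can be split into its parts. It assumes the common layout of a URL key, a 14-digit timestamp, and a JSON block separated by single spaces. The function name `parse_cdxj_line` and the sample line are hypothetical and are not part of cdxj_util.

```python
import json


def parse_cdxj_line(line: str) -> dict:
    """Split one CDXJ line into its URL key, timestamp, and JSON fields.

    Hypothetical helper for illustration; not the cdxj_util implementation.
    """
    # The JSON block can contain spaces, so split at most twice.
    urlkey, timestamp, json_block = line.strip().split(" ", 2)
    record = json.loads(json_block)
    record["urlkey"] = urlkey
    record["timestamp"] = timestamp
    return record


# Made-up sample line in the assumed "urlkey timestamp {json}" layout.
sample = (
    'com,example)/ 20240101120000 '
    '{"url": "http://example.com/", "mime": "text/html", "status": "200"}'
)
record = parse_cdxj_line(sample)
print(record["url"], record["timestamp"], record["mime"])
```

Records returned by the library may be shaped differently; check the examples/ directory for the actual structure.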
Features
- Asynchronous and synchronous loading of CDXJ files
- URL-based searching (exact and partial matching)
- Filtering by timestamp range
- Bulk searching of multiple URLs
- Generation of CDXJ file statistics (total records, unique URLs, subdomain distribution, MIME type distribution, etc.)
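As an illustration of the timestamp-range feature, the sketch below filters plain dictionaries by a 14-digit `YYYYMMDDhhmmss` timestamp. The record shape and the `filter_by_timestamp` name are assumptions made for this example; they do not show the library's own API.

```python
def filter_by_timestamp(records, start, end):
    """Keep records whose 14-digit timestamp lies in [start, end].

    Hypothetical helper; assumes each record is a dict with a
    "timestamp" string in YYYYMMDDhhmmss form.
    """
    # Fixed-width digit strings sort the same way lexicographically
    # and chronologically, so plain string comparison is enough.
    return [r for r in records if start <= r["timestamp"] <= end]


# Made-up records for demonstration.
records = [
    {"url": "http://example.com/", "timestamp": "20230615000000"},
    {"url": "http://example.com/a", "timestamp": "20240101120000"},
    {"url": "http://example.com/b", "timestamp": "20250301093000"},
]
in_2024 = filter_by_timestamp(records, "20240101000000", "20241231235959")
print([r["url"] for r in in_2024])
```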
Installation
pip install cdxj_util
Usage
Loading a CDXJ file
from cdxj_util.core import CDXJCore
core = CDXJCore("path/to/your.cdxj")
records = core.load_all_records()
Searching URLs
from cdxj_util.search import CDXJSearch
search = CDXJSearch(records)
results = search.search_by_url("http://example.com/", exact_match=True)
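The difference between exact and partial matching can be sketched with plain dictionaries. This is only an illustration under the assumption that each record carries a "url" field; `match_url` is a hypothetical helper, and the library's own matching rules may differ.

```python
def match_url(records, query, exact=True):
    """Return records whose URL equals (exact) or contains (partial) the query.

    Hypothetical helper for illustration; not the cdxj_util implementation.
    """
    if exact:
        return [r for r in records if r["url"] == query]
    return [r for r in records if query in r["url"]]


# Made-up records for demonstration.
records = [
    {"url": "http://example.com/"},
    {"url": "http://example.com/about"},
    {"url": "http://other.org/"},
]
exact_hits = match_url(records, "http://example.com/", exact=True)
partial_hits = match_url(records, "example.com", exact=False)
print(len(exact_hits), len(partial_hits))
```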
Generating statistics
from cdxj_util.stats import CDXJStats
stats = CDXJStats(records)
total_records = stats.total_records()
unique_urls = stats.unique_urls()
mime_distribution = stats.mime_type_distribution()
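For readers curious what such statistics amount to, the same kinds of counts can be computed with the standard library alone. The sketch below assumes records are dictionaries with "url" and "mime" fields, which may not match the structure cdxj_util uses internally.

```python
from collections import Counter
from urllib.parse import urlsplit

# Made-up records for demonstration; the real record shape may differ.
records = [
    {"url": "http://example.com/", "mime": "text/html"},
    {"url": "http://example.com/", "mime": "text/html"},
    {"url": "http://img.example.com/logo.png", "mime": "image/png"},
]

total = len(records)                                 # every capture counts
unique = len({r["url"] for r in records})            # distinct URLs only
mime_counts = Counter(r["mime"] for r in records)    # captures per MIME type
host_counts = Counter(urlsplit(r["url"]).hostname for r in records)

print(total, unique)
print(mime_counts.most_common())
print(host_counts.most_common())
```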
Asynchronous Support
cdxj_util also supports asynchronous processing, which is particularly useful for handling large CDXJ files:
import asyncio
from cdxj_util.async_core import AsyncCDXJCore
async def process_cdxj():
    async_core = AsyncCDXJCore("path/to/your.cdxj")
    records = await async_core.load_all_records()
    # Further processing...
asyncio.run(process_cdxj())
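One reason asynchronous loading helps is that several files can be read concurrently without blocking the event loop. The sketch below shows that general pattern using only the standard library (`asyncio.to_thread`, available from Python 3.9). The helpers `read_cdxj` and `load_many` are hypothetical, and this is not how AsyncCDXJCore is necessarily implemented.

```python
import asyncio
import json
import os
import tempfile


def read_cdxj(path):
    """Blocking reader: parse every non-empty line of a CDXJ file.

    Hypothetical helper; assumes "urlkey timestamp {json}" lines.
    """
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                urlkey, timestamp, block = line.split(" ", 2)
                records.append(
                    {"urlkey": urlkey, "timestamp": timestamp, **json.loads(block)}
                )
    return records


async def load_many(paths):
    # Each blocking read runs in a worker thread, so the event loop stays free.
    return await asyncio.gather(*(asyncio.to_thread(read_cdxj, p) for p in paths))


# Demonstration with a small temporary file containing one made-up record.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.cdxj")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write('com,example)/ 20240101120000 {"url": "http://example.com/"}\n')
    loaded = asyncio.run(load_many([path, path]))

print([len(file_records) for file_records in loaded])
```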
Examples
For more detailed usage examples, see the demo scripts in the examples/ directory.
License
This project is licensed under the MIT License. See the LICENSE file for details.