A utility library for working with CDXJ files

cdxj_util

cdxj_util is a Python library for efficiently processing CDXJ files, the JSON-based variant of the CDX index format used for web archives. It provides functionality for loading, searching, and analyzing large CDXJ files.
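For orientation, a CDXJ line typically consists of a sortable URL key, a 14-digit timestamp, and a JSON object, separated by single spaces. The sketch below parses one such line with the standard library only. It does not use cdxj_util; the function name `parse_cdxj_line` and the sample line are illustrative assumptions.

```python
import json


def parse_cdxj_line(line: str) -> dict:
    """Split one CDXJ line into its URL key, timestamp, and JSON fields."""
    # maxsplit=2 keeps the JSON block intact even though it contains spaces.
    urlkey, timestamp, json_block = line.strip().split(" ", 2)
    record = json.loads(json_block)
    record["urlkey"] = urlkey
    record["timestamp"] = timestamp
    return record


# Hypothetical sample line, for illustration only.
sample = (
    'com,example)/ 20240101120000 '
    '{"url": "http://example.com/", "mime": "text/html", "status": "200"}'
)
record = parse_cdxj_line(sample)
print(record["url"], record["timestamp"])
```

The record layout that cdxj_util itself returns may differ from this dictionary.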

Features

  • Asynchronous and synchronous loading of CDXJ files
  • URL-based searching (exact and partial matching)
  • Filtering by timestamp range
  • Bulk searching of multiple URLs
  • Generation of CDXJ file statistics (total records, unique URLs, subdomain distribution, MIME type distribution, etc.)
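To illustrate the timestamp-range feature listed above, here is a minimal standard-library sketch. It assumes records are plain dictionaries with a 14-digit `timestamp` string; the helper name `filter_by_timestamp` is hypothetical and is not part of cdxj_util.

```python
def filter_by_timestamp(records, start, end):
    """Return records whose 14-digit timestamp lies in [start, end]."""
    # Fixed-width YYYYMMDDhhmmss strings sort chronologically,
    # so plain string comparison is sufficient here.
    return [r for r in records if start <= r["timestamp"] <= end]


# Hypothetical records, for illustration only.
records = [
    {"url": "http://example.com/", "timestamp": "20230615080000"},
    {"url": "http://example.com/a", "timestamp": "20240101120000"},
    {"url": "http://example.com/b", "timestamp": "20250301000000"},
]
in_2024 = filter_by_timestamp(records, "20240101000000", "20241231235959")
print([r["url"] for r in in_2024])
```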

Installation

pip install cdxj_util

Usage

Loading a CDXJ file

from cdxj_util.core import CDXJCore

core = CDXJCore("path/to/your.cdxj")
records = core.load_all_records()
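To try the loading snippet without a real archive index, a small sample file can be written first. The sketch below uses only the standard library; the sample lines and the helper name `write_sample_cdxj` are assumptions, and whether CDXJCore accepts exactly this layout has not been verified here.

```python
import tempfile
from pathlib import Path

# Hypothetical CDXJ lines: URL key, 14-digit timestamp, JSON block.
SAMPLE_LINES = [
    'com,example)/ 20240101120000 {"url": "http://example.com/", "mime": "text/html"}',
    'com,example)/about 20240102090000 {"url": "http://example.com/about", "mime": "text/html"}',
    'com,example)/logo.png 20240103000000 {"url": "http://example.com/logo.png", "mime": "image/png"}',
]


def write_sample_cdxj(directory) -> Path:
    """Write the sample lines to <directory>/sample.cdxj and return the path."""
    path = Path(directory) / "sample.cdxj"
    path.write_text("\n".join(SAMPLE_LINES) + "\n", encoding="utf-8")
    return path


sample_path = write_sample_cdxj(tempfile.mkdtemp())
print(sample_path)
```

The resulting path can then be passed to the loader in place of "path/to/your.cdxj".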

Searching URLs

from cdxj_util.search import CDXJSearch

# `records` is the result of core.load_all_records() from the previous step
search = CDXJSearch(records)
results = search.search_by_url("http://example.com/", exact_match=True)
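The difference between exact and partial matching, and the idea behind bulk searching, can be sketched in plain Python. This is not the library's implementation: it assumes records are dictionaries with a `url` key and that partial matching means a substring match. The names `match_url` and `bulk_match` are hypothetical.

```python
def match_url(records, url, exact_match=True):
    """Return records whose URL equals (or, if not exact, contains) `url`."""
    if exact_match:
        return [r for r in records if r["url"] == url]
    return [r for r in records if url in r["url"]]


def bulk_match(records, urls):
    """Look up several URLs in one pass over the records (exact matches only)."""
    hits = {u: [] for u in urls}
    for r in records:
        if r["url"] in hits:
            hits[r["url"]].append(r)
    return hits


# Hypothetical records, for illustration only.
records = [
    {"url": "http://example.com/"},
    {"url": "http://example.com/about"},
    {"url": "http://example.org/"},
]
exact = match_url(records, "http://example.com/")
partial = match_url(records, "example.com", exact_match=False)
bulk = bulk_match(records, ["http://example.com/", "http://example.org/"])
print(len(exact), len(partial), {u: len(v) for u, v in bulk.items()})
```

Scanning the records once for all URLs, as `bulk_match` does, avoids repeating a full pass per query.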

Generating statistics

from cdxj_util.stats import CDXJStats

stats = CDXJStats(records)
total_records = stats.total_records()
unique_urls = stats.unique_urls()
mime_distribution = stats.mime_type_distribution()
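As a rough picture of what such statistics involve, the sketch below computes comparable figures with the standard library. It assumes records are dictionaries with `url` and `mime` keys; the function name `summarize` and the output layout are assumptions, not the return types of CDXJStats.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize(records):
    """Compute record count, unique URL count, and MIME/host distributions."""
    urls = [r["url"] for r in records]
    return {
        "total_records": len(records),
        "unique_urls": len(set(urls)),
        # Records without a MIME type are counted under "unknown".
        "mime_types": Counter(r.get("mime", "unknown") for r in records),
        # The hostname distinguishes subdomains such as img.example.com.
        "hosts": Counter(urlsplit(u).hostname for u in urls),
    }


# Hypothetical records, for illustration only.
records = [
    {"url": "http://example.com/", "mime": "text/html"},
    {"url": "http://example.com/", "mime": "text/html"},
    {"url": "http://img.example.com/logo.png", "mime": "image/png"},
]
summary = summarize(records)
print(summary["total_records"], summary["unique_urls"])
```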

Asynchronous Support

cdxj_util also supports asynchronous processing, which is particularly useful for handling large CDXJ files:

import asyncio
from cdxj_util.async_core import AsyncCDXJCore

async def process_cdxj():
    async_core = AsyncCDXJCore("path/to/your.cdxj")
    records = await async_core.load_all_records()
    # Further processing...

asyncio.run(process_cdxj())
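One reason asynchronous loading helps is that several files can be read concurrently without blocking the event loop. The sketch below shows that pattern with the standard library only, using `asyncio.to_thread` (Python 3.9+). It is an illustration under stated assumptions, not how AsyncCDXJCore works internally; `read_cdxj` and `load_many` are hypothetical helpers.

```python
import asyncio
import json
import tempfile
from pathlib import Path


def read_cdxj(path):
    """Blocking reader: parse every non-empty line of a CDXJ file."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            urlkey, timestamp, json_block = line.strip().split(" ", 2)
            records.append(
                {"urlkey": urlkey, "timestamp": timestamp, **json.loads(json_block)}
            )
    return records


async def load_many(paths):
    """Read several files concurrently, each in a worker thread."""
    return await asyncio.gather(*(asyncio.to_thread(read_cdxj, p) for p in paths))


async def main():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sample.cdxj"
        # Hypothetical one-line sample file, for illustration only.
        path.write_text(
            'com,example)/ 20240101120000 {"url": "http://example.com/"}\n',
            encoding="utf-8",
        )
        first, second = await load_many([path, path])
        print(len(first), len(second))
        return first, second


results = asyncio.run(main())
```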

Examples

For more detailed usage examples, please refer to the demo scripts in the examples/ directory.

License

This project is licensed under the MIT License. See the LICENSE file for details.
