Skip to main content

warc2summary

Project description

warc2summary

PyPI Status Python Version License

Read the documentation at https://warc2summary.readthedocs.io/ Tests Codecov

pre-commit Black

Features

Implementation of Heuristics to process WARC Files

Requirements

  • Python 3.9>

Installation

You can install warc2summary via [pip] from [PyPI]:

pip install warc2summary

Usage

There exists 3 parts to this library: warc_processor, heuristics, pipeline

WARC Processor

This module converts WARC Files to a Pandas DataFrame. It uses WARCIO as the processing engine.

from warc2summary import warc_processor
#process WARC Files Directly
warc_processor.process_warc_files(folder_path, max_workers=4)

Heuristics

This module applies the 3 heuristics developed

Heuristic 1: Takes the about page of the website, else take shortest url(likely to be main page)

Heuristic 2: Takes the shortest url

Heuristic 3: Takes shortest url and applies a Regex Filter

from warc2summary import heuristics
#process dataframe
heuristics.heuristics_1(dataframe)
from warc2summary import heuristics
#process dataframe
heuristics.heuristics_2(dataframe)
from warc2summary import heuristics
#process dataframe
heuristics.heuristics_3(dataframe)

The dataframe must contain the url and the web text content

This module transforms the dataframe for processing using LLMs reducing costs by reducing number of tokens needed

Feel free to contribute new heuristics

pipeline

This module merges the previous 2 modules and joins it with a ground truth dataset for llm evaluation. Only OpenAI api supported for now. This code requires a human labelled dataset.

To combine all 3 parts and replicate our findings

from warc2summary import pipeline
#process WARC Files Directly
pipeline.execute_pipeline(warc_df,human_df,prompt,heuristic,max_tokens=1000,temperature=0.5,top_p=0.95,frequency_penalty=0.0,presence_penalty=0.0,model="gpt-4o",debug=False)

To perform batch inference

from warc2summary import pipeline
pipeline.batch_prompt(df, prompt ,max_tokens=150,temperature=0.5,top_p=0.95,frequency_penalty=0.0,presence_penalty=0.0,model="gpt-4o",debug=False)

Issues

If some module is not found, please try pip installing the package and refreshing Please post a issue on github if something goes wrong

Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

License

Distributed under the terms of the MIT license, warc2summary is free and open source software. This package is brought to you by the National Library Board. By using any part of this package, you agree to not hold NLB or the developers liable for any damages, physical or otherwise in perpetuity throughout the universe

Issues

If you encounter any problems, please [file an issue] along with a detailed description.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

warc2summary-0.0.12.tar.gz (14.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

warc2summary-0.0.12-py3-none-any.whl (14.9 kB view details)

Uploaded Python 3

File details

Details for the file warc2summary-0.0.12.tar.gz.

File metadata

  • Download URL: warc2summary-0.0.12.tar.gz
  • Upload date:
  • Size: 14.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.7 Windows/10

File hashes

Hashes for warc2summary-0.0.12.tar.gz
Algorithm Hash digest
SHA256 603c5259de82a9bb9b42797ce45d0684f3a20a4b555ae1f0edd6e8b35aad0b79
MD5 6f06ece1003de110cd31eb6be17f6d3e
BLAKE2b-256 92c3ff944d0f6928a66aec0b00f9a35a7180660a90af19a13a67cae47a0cd478

See more details on using hashes here.

File details

Details for the file warc2summary-0.0.12-py3-none-any.whl.

File metadata

  • Download URL: warc2summary-0.0.12-py3-none-any.whl
  • Upload date:
  • Size: 14.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.8.3 CPython/3.11.7 Windows/10

File hashes

Hashes for warc2summary-0.0.12-py3-none-any.whl
Algorithm Hash digest
SHA256 39f4fd833b64a780a4e194f57e1a4455614a043b42f1f203c8150ebd8c00f123
MD5 71467996002c5cb09f29b10f248cc70b
BLAKE2b-256 ae4edf7f8eaa9eea2d1558ab6a097fd7362428dd947ea7f58554540492640cda

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page