Skip to main content

warc2summary

Project description

warc2summary

PyPI Status Python Version License

Read the documentation at https://warc2summary.readthedocs.io/ Tests Codecov

pre-commit Black

Features

Implementation of Heuristics to process WARC Files

Requirements

  • Python 3.9>

Installation

You can install warc2summary via [pip] from [PyPI]:

pip install warc2summary

Usage

There exists 3 parts to this library: warc_processor, heuristics, pipeline

WARC Processor

This module converts WARC Files to a Pandas DataFrame. It uses WARCIO as the processing engine.

from warc2summary import warc_processor
#process WARC Files Directly
warc_processor.process_warc_files(folder_path, max_workers=4)

Heuristics

This module applies the 3 heuristics developed

Heuristic 1: Takes the about page of the website, else take shortest url(likely to be main page)

Heuristic 2: Takes the shortest url

Heuristic 3: Takes shortest url and applies a Regex Filter

from warc2summary import heuristics
#process dataframe
heuristics.heuristics_1(dataframe)
from warc2summary import heuristics
#process dataframe
heuristics.heuristics_2(dataframe)
from warc2summary import heuristics
#process dataframe
heuristics.heuristics_3(dataframe)

The dataframe must contain the url and the web text content

This module transforms the dataframe for processing using LLMs reducing costs by reducing number of tokens needed

Feel free to contribute new heuristics

pipeline

This module merges the previous 2 modules and joins it with a ground truth dataset for llm evaluation. Only OpenAI api supported for now. This code requires a human labelled dataset.

To combine all 3 parts and replicate our findings

from warc2summary import pipeline
#process WARC Files Directly
pipeline.execute_pipeline(warc_df,human_df,prompt,heuristic,max_tokens=1000,temperature=0.5,top_p=0.95,frequency_penalty=0.0,presence_penalty=0.0,model="gpt-4o",debug=False)

To perform batch inference

from warc2summary import pipeline
pipeline.batch_prompt(df, prompt ,max_tokens=150,temperature=0.5,top_p=0.95,frequency_penalty=0.0,presence_penalty=0.0,model="gpt-4o",debug=False)

Issues

If some module is not found, please try pip installing the package and refreshing Please post a issue on github if something goes wrong

Contributing

Contributions are very welcome. To learn more, see the Contributor Guide.

License

Distributed under the terms of the MIT license, warc2summary is free and open source software. This package is brought to you by the National Library Board. By using any part of this package, you agree to not hold NLB or the developers liable for any damages, physical or otherwise in perpetuity throughout the universe

Issues

If you encounter any problems, please [file an issue] along with a detailed description.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

warc2summary-0.0.12.tar.gz (14.4 kB view hashes)

Uploaded Source

Built Distribution

warc2summary-0.0.12-py3-none-any.whl (14.9 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page