warc2summary
Project description
warc2summary
Features
Implementation of Heuristics to process WARC Files
Requirements
- Python 3.7>
Installation
You can install warc2summary via [pip] from [PyPI]:
pip install warc2summary
Usage
There exists 3 parts to this library: warc_processor, heuristics, pipeline
WARC Processor
This module converts WARC Files to a Pandas DataFrame. It uses WARCIO as the processing engine.
from warc2summary import warc_processor
#process WARC Files Directly
warc_processor.process_warc_files(folder_path, max_workers=4)
Heuristics
This module applies the 3 heuristics developed
Heuristic 1: Takes the about page of the website, else take shortest url(likely to be main page)
Heuristic 2: Takes the shortest url
Heuristic 3: Takes shortest url and applies a Regex Filter
from warc2summary import heuristics
#process dataframe
heuristics.heuristics_1(dataframe)
from warc2summary import heuristics
#process dataframe
heuristics.heuristics_2(dataframe)
from warc2summary import heuristics
#process dataframe
heuristics.heuristics_3(dataframe)
The dataframe must contain the url and the web text content
This module transforms the dataframe for processing using LLMs reducing costs by reducing number of tokens needed
Feel free to contribute new heuristics
pipeline
This module merges the previous 2 modules and joins it with a ground truth dataset for llm evaluation. Only OpenAI api supported for now. This code requires a human labelled dataset.
To combine all 3 parts and replicate our findings
from warc2summary import pipeline
#process WARC Files Directly
pipeline.execute_pipeline(warc_df,human_df,prompt,heuristic,max_tokens=1000,temperature=0.5,top_p=0.95,frequency_penalty=0.0,presence_penalty=0.0,model="gpt-4o",debug=False)
To perform batch inference
from warc2summary import pipeline
pipeline.batch_prompt(df, prompt ,max_tokens=150,temperature=0.5,top_p=0.95,frequency_penalty=0.0,presence_penalty=0.0,model="gpt-4o",debug=False)
Contributing
Contributions are very welcome. To learn more, see the Contributor Guide.
License
Distributed under the terms of the MIT license, warc2summary is free and open source software.
Issues
If you encounter any problems, please [file an issue] along with a detailed description.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for warc2summary-0.0.4-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1a06ca28301d4ac9c0e34b6437c026b5a2ced26b966ecfd462cb6240951959cc |
|
MD5 | 6e945e0d1d37b9bc1e0f35bf6ab27ba3 |
|
BLAKE2b-256 | 6ad54279e300cb4e1ddcd845febade827bc06cd7413691a14bcfa9b5971c6d60 |