A comprehensive library for text processing, keyword extraction, and classification from PDF and HTML documents
txt2phrases
txt2phrases is a Python library and CLI tool designed for processing and analyzing text data.
It provides a streamlined pipeline for converting documents (HTML, PDF) into plain text, extracting keywords using AI models, and classifying keywords into specific and general categories using TF-IDF.
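The specific/general split can be illustrated with a small, dependency-free sketch. This is not the library's implementation: the function name `classify_keywords`, the normalization step, and the way `threshold` is applied are all assumptions made for illustration. A keyword that occurs in every document gets an inverse document frequency of zero and therefore lands in the general set.

```python
import math


def classify_keywords(doc_keywords, threshold=0.7):
    """Split keywords into (specific, general) sets.

    doc_keywords maps a document name to a {keyword: count} dict.
    Hypothetical helper; how txt2phrases applies its threshold may differ.
    """
    n_docs = len(doc_keywords)

    # Document frequency: in how many documents each keyword occurs.
    df = {}
    for counts in doc_keywords.values():
        for kw in counts:
            df[kw] = df.get(kw, 0) + 1

    # Best TF-IDF score each keyword reaches in any single document.
    best = {kw: 0.0 for kw in df}
    for counts in doc_keywords.values():
        total = sum(counts.values())
        if total == 0:
            continue
        for kw, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / df[kw])
            best[kw] = max(best[kw], tf * idf)

    # Normalize by the highest score so the threshold lives in [0, 1].
    top = max(best.values(), default=0.0)
    specific, general = set(), set()
    for kw, score in best.items():
        normalized = score / top if top > 0 else 0.0
        (specific if normalized >= threshold else general).add(kw)
    return specific, general
```

With two documents that share the keyword "model" but each have one keyword of their own, the shared keyword is classified as general and the other two as specific.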
✨ Features
1. PDF to Text Conversion
- Extract plain text from PDF files for further processing.
2. HTML to Text Conversion
- Convert HTML documents into clean, plain text.
3. AI-Powered Keyword Extraction
- Use advanced NLP models (e.g., Hugging Face Transformers) to extract and rank the most important keywords from text files.
4. Automated Pipeline
- Run the entire pipeline (PDF/HTML → TXT → Keywords) with a single command.
5. Batch Processing
- Process single files or entire directories efficiently.
6. Configurable Parameters
- Customize thresholds, batch sizes, and output formats to suit your needs.
🧩 Installation
Install txt2phrases directly from PyPI:
pip install txt2phrases
🚀 Quick Start
# Convert PDF to text
txt2phrases pdf2txt -i document.pdf -o output_folder
# Convert HTML to text
txt2phrases html2txt -i webpage.html -o output_folder
# Extract keywords from text files
txt2phrases keyphrases -i text_files/ -o keywords/ -n 500
# Run complete pipeline
txt2phrases auto -i pygetpapers_output/ -o results/ -n 100
🐍 Python API
from txt2phrases import (
    convert_pdf_to_text,
    convert_html_to_text,
    KeywordExtraction,
    classify_keywords_split_files,
)
# Convert PDF to text
txt_path = convert_pdf_to_text("document.pdf", "output_folder")
# Extract keywords
extractor = KeywordExtraction(
    input_path="text_files/",
    output_folder="keywords/",
    top_n=1000,
)
extractor.extract()
🧠 CLI Commands
📄 pdf2txt
Convert PDF files to text format.
txt2phrases pdf2txt -i input.pdf -o output_folder
txt2phrases pdf2txt -i pdfs_directory/ -o text_output/
🌐 html2txt
Convert HTML files to clean text format.
txt2phrases html2txt -i webpage.html -o output_folder
txt2phrases html2txt -i html_directory/ -o text_output/
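To make concrete what "clean text" means here, the sketch below strips tags and drops the contents of `<script>` and `<style>` elements using only Python's standard library. It is an illustration of the general technique, not how txt2phrases does it (the project lists beautifulsoup4 for HTML parsing); the names `_TextExtractor` and `html_to_plain_text` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_plain_text(html: str) -> str:
    """Return the visible text of an HTML string, one chunk per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Each run of text between tags becomes its own line, so inline markup such as `<b>` splits a sentence; a production converter would usually merge inline elements back together.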
🔑 keyphrases
Extract keyphrases from text files using advanced NLP models.
txt2phrases keyphrases -i text.txt -o keywords/ -n 500
txt2phrases keyphrases -i text_directory/ -o keywords/ -n 1000
⚙️ auto
Complete processing pipeline for PyGetPapers output or PDF directories.
txt2phrases auto -i pygetpapers_output/ -o results/ -n 200
txt2phrases auto -i pdf_collection/ -o results/ -n 100
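Commands that accept either a single file or a directory need an input-discovery step. A minimal sketch of that step is shown below, assuming the supported extensions are `.pdf`, `.html`, and `.htm` and that subdirectories are searched; whether txt2phrases recurses, and which extensions it accepts, is not stated here, so treat `collect_inputs` and `SUPPORTED_SUFFIXES` as hypothetical.

```python
from pathlib import Path

# Assumed set of extensions; the formats named in this README are PDF and HTML.
SUPPORTED_SUFFIXES = {".pdf", ".html", ".htm"}


def collect_inputs(input_path, recursive=True):
    """Return a sorted list of supported files under input_path.

    Accepts either a single file or a directory.
    """
    path = Path(input_path)
    if path.is_file():
        return [path] if path.suffix.lower() in SUPPORTED_SUFFIXES else []
    if not path.is_dir():
        raise FileNotFoundError(f"No such file or directory: {path}")
    pattern = "**/*" if recursive else "*"
    return sorted(
        p for p in path.glob(pattern)
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Sorting the result keeps batch runs reproducible, since directory listing order differs between file systems.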
🔍 Advanced Features
2. Complete Research Pipeline
# Download papers with PyGetPapers
pygetpapers -q "machine learning" -o papers/ -k 100
# Process and analyze
txt2phrases auto -i papers/ -o analysis/ -n 200
# Classify results
python -c "
from txt2phrases import classify_keywords_split_files
classify_keywords_split_files('analysis/', 'classified/', threshold=0.7)
"
📦 Output Formats
- Text Conversion: .txt files with extracted text
- Keyword Extraction: .csv files containing keyword and count columns
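The keyword files can be read back with Python's standard `csv` module. The sketch below assumes the header names are exactly `keyword` and `count`, which may differ in the actual output; `load_keywords` and its `min_count` parameter are hypothetical.

```python
import csv


def load_keywords(csv_path, min_count=1):
    """Read (keyword, count) pairs from a CSV, most frequent first.

    Assumes header columns named 'keyword' and 'count'.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = [(r["keyword"], int(r["count"])) for r in csv.DictReader(f)]
    # Drop rare keywords, then sort by descending count.
    rows = [r for r in rows if r[1] >= min_count]
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

Raising `min_count` is a quick way to drop one-off phrases before further analysis.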
🧱 Requirements
To use txt2phrases, ensure you have the following installed:
- Python 3.8+
- Dependencies:
  - argparse: For CLI argument parsing
  - beautifulsoup4: For HTML parsing
  - pandas: For data manipulation and CSV export
  - tqdm: For progress bars during batch processing
  - transformers: For AI-powered keyword extraction
  - scikit-learn: For TF-IDF-based keyword classification
  - torch: For running NLP models
Install dependencies with:
pip install -r requirements.txt
📚 Documentation
For full documentation and examples, visit the GitHub repository.
📄 License
This project is licensed under the MIT License; see the LICENSE file for details.
File details
Details for the file txt2phrases-1.0.3.tar.gz.
File metadata
- Download URL: txt2phrases-1.0.3.tar.gz
- Upload date:
- Size: 27.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c53705a484057479afb6be0e5995dd91e231e56861673c338893a32b6246fbe6 |
| MD5 | 9f87a14ef17675ee2938250dc8d08c28 |
| BLAKE2b-256 | 93a92a6bd775a81833a558a7c7b3d232da133f667f03f0c68478c09b13545457 |
File details
Details for the file txt2phrases-1.0.3-py3-none-any.whl.
File metadata
- Download URL: txt2phrases-1.0.3-py3-none-any.whl
- Upload date:
- Size: 13.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 722a7e7fe59ab56a47bfdc1a515798dc6084450c4fba8bed339e8d1e6a3a766b |
| MD5 | 9dafa6262cee11da8330e4811f0d540a |
| BLAKE2b-256 | 626a9626126c52e5528d4dc0a428e2b685d9c4f8d745972b7191cdc1e5fe0b79 |
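A downloaded file can be checked against the SHA256 digests listed for it with Python's standard `hashlib` module. This is a generic sketch rather than anything shipped with txt2phrases; the helper names `sha256_of` and `matches_published_hash` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("txt2phrases-1.0.3.tar.gz", "c53705a4...")` (with the full digest from the table for that file) should return `True` for an intact download.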