
A comprehensive library for text processing, keyword extraction, and classification from PDF and HTML documents


txt2phrases

txt2phrases is a Python library and CLI tool designed for processing and analyzing text data.
It provides a streamlined pipeline for converting documents (HTML, PDF) into plain text, extracting keywords using AI models, and classifying keywords into specific and general categories using TF-IDF.


✨ Features

1. PDF to Text Conversion

  • Extract plain text from PDF files for further processing.

2. HTML to Text Conversion

  • Convert HTML documents into clean, plain text.

3. AI-Powered Keyword Extraction

  • Use advanced NLP models (e.g., Hugging Face Transformers) to extract and rank the most important keywords from text files.

4. Automated Pipeline

  • Run the entire pipeline (PDF/HTML → TXT → Keywords) with a single command.

5. Batch Processing

  • Process single files or entire directories efficiently.

6. Configurable Parameters

  • Customize thresholds, batch sizes, and output formats to suit your needs.
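As an illustration of the batch-processing idea above, the sketch below shows one way a tool could resolve an input argument that may be either a single file or a directory. The helper name `collect_inputs` and the default suffix list are assumptions made for this example; they are not part of the txt2phrases API.

```python
from pathlib import Path
from typing import List, Sequence


def collect_inputs(
    input_path: str,
    suffixes: Sequence[str] = (".pdf", ".html", ".txt"),
) -> List[Path]:
    """Return the files to process for a file-or-directory input path.

    Hypothetical helper: txt2phrases may resolve its inputs differently.
    """
    path = Path(input_path)
    wanted = {s.lower() for s in suffixes}
    if path.is_file():
        # A single file is returned only if its extension is supported.
        return [path] if path.suffix.lower() in wanted else []
    if path.is_dir():
        # Walk the directory tree and keep supported files, in stable order.
        return sorted(
            p for p in path.rglob("*")
            if p.is_file() and p.suffix.lower() in wanted
        )
    raise FileNotFoundError(f"No such file or directory: {input_path}")
```

Sorting the result keeps runs reproducible, which helps when comparing outputs across batches.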

🧩 Installation

Install txt2phrases directly from PyPI:

pip install txt2phrases

🚀 Quick Start

# Convert PDF to text
txt2phrases pdf2txt -i document.pdf -o output_folder

# Convert HTML to text
txt2phrases html2txt -i webpage.html -o output_folder

# Extract keywords from text files
txt2phrases keyphrases -i text_files/ -o keywords/ -n 500

# Run complete pipeline
txt2phrases auto -i pygetpapers_output/ -o results/ -n 100

🐍 Python API

from txt2phrases import (
    convert_pdf_to_text,
    convert_html_to_text,
    KeywordExtraction,
    classify_keywords_split_files,
)

# Convert PDF to text
txt_path = convert_pdf_to_text("document.pdf", "output_folder")

# Extract keywords
extractor = KeywordExtraction(
    input_path="text_files/",
    output_folder="keywords/",
    top_n=1000
)
extractor.extract()

🧠 CLI Commands

📄 pdf2txt

Convert PDF files to text format.

txt2phrases pdf2txt -i input.pdf -o output_folder
txt2phrases pdf2txt -i pdfs_directory/ -o text_output/

🌐 html2txt

Convert HTML files to clean text format.

txt2phrases html2txt -i webpage.html -o output_folder
txt2phrases html2txt -i html_directory/ -o text_output/
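
To give a feel for what an HTML-to-text step involves, here is a minimal sketch that uses only the standard library's `html.parser`. It is not the txt2phrases implementation (the requirements list suggests the library relies on beautifulsoup4); the names `html_to_text` and `_TextExtractor` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse all runs of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the whitespace-normalized text content of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because chunks are joined with spaces, text split by inline tags (for example `<b>`) is separated by a space; a production converter would likely handle block and inline elements differently.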

🔑 keyphrases

Extract keyphrases from text files using advanced NLP models.

txt2phrases keyphrases -i text.txt -o keywords/ -n 500
txt2phrases keyphrases -i text_directory/ -o keywords/ -n 1000

⚙️ auto

Complete processing pipeline for PyGetPapers output or PDF directories.

txt2phrases auto -i pygetpapers_output/ -o results/ -n 200
txt2phrases auto -i pdf_collection/ -o results/ -n 100
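
For cases where the individual steps need to be run separately (for example, to inspect the intermediate text files), the sketch below chains the `pdf2txt` and `keyphrases` subcommands from Python. It uses only the subcommands and flags shown in this README. The `txt` and `keywords` subfolder layout and the function names are assumptions, and the sketch assumes `pdf2txt` writes its `.txt` files directly into the folder passed to `-o`.

```python
import subprocess
from pathlib import Path
from typing import List


def build_pipeline_commands(
    pdf_dir: str, out_dir: str, top_n: int = 100
) -> List[List[str]]:
    """Build the argv lists for a two-step PDF -> TXT -> keywords run."""
    out = Path(out_dir)
    txt_dir = out / "txt"        # assumed layout, not prescribed by the tool
    kw_dir = out / "keywords"    # assumed layout, not prescribed by the tool
    return [
        ["txt2phrases", "pdf2txt", "-i", str(pdf_dir), "-o", str(txt_dir)],
        ["txt2phrases", "keyphrases", "-i", str(txt_dir),
         "-o", str(kw_dir), "-n", str(top_n)],
    ]


def run_pipeline(pdf_dir: str, out_dir: str, top_n: int = 100) -> None:
    """Run each step in order, stopping at the first failure."""
    for cmd in build_pipeline_commands(pdf_dir, out_dir, top_n):
        subprocess.run(cmd, check=True)
```

Passing each command as a list rather than a shell string avoids quoting problems with paths that contain spaces.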

🔍 Advanced Features


2. Complete Research Pipeline

# Download papers with PyGetPapers
pygetpapers -q "machine learning" -o papers/ -k 100

# Process and analyze  
txt2phrases auto -i papers/ -o analysis/ -n 200

# Classify results
python -c "
from txt2phrases import classify_keywords_split_files
classify_keywords_split_files('analysis/', 'classified/', threshold=0.7)
"

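The classification step separates specific keywords from general ones using TF-IDF. How `classify_keywords_split_files` applies its `threshold` is not documented here, so the following is only one plausible interpretation, written in plain Python rather than scikit-learn: each keyword gets its best TF-IDF score across documents, scores are normalized by the maximum, and keywords at or above the threshold are treated as specific. The function name `classify_keywords` and the input shape are assumptions.

```python
import math
from typing import Dict, List, Tuple


def classify_keywords(
    docs: Dict[str, Dict[str, int]], threshold: float = 0.7
) -> Tuple[List[str], List[str]]:
    """Split keywords into (specific, general) by normalized TF-IDF.

    `docs` maps a document name to its {keyword: count} table.
    Illustrative only; not the txt2phrases algorithm.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each keyword appears.
    df: Dict[str, int] = {}
    for counts in docs.values():
        for kw in counts:
            df[kw] = df.get(kw, 0) + 1

    # Best TF-IDF score of each keyword over all documents.
    best: Dict[str, float] = {}
    for counts in docs.values():
        total = sum(counts.values()) or 1
        for kw, count in counts.items():
            tf = count / total
            # Smoothed IDF, the same form scikit-learn uses by default.
            idf = math.log((1 + n_docs) / (1 + df[kw])) + 1.0
            best[kw] = max(best.get(kw, 0.0), tf * idf)

    top = max(best.values(), default=0.0) or 1.0
    specific = sorted(kw for kw, s in best.items() if s / top >= threshold)
    general = sorted(kw for kw, s in best.items() if s / top < threshold)
    return specific, general
```

With this scheme, a keyword that appears in every document receives a low IDF and tends to fall below the threshold, which matches the intuition that widely shared terms are general.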
📦 Output Formats

  • Text Conversion: .txt files with extracted text
  • Keyword Extraction: .csv files containing keyword and count columns
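
A keyword CSV of this shape can be loaded with the standard library alone. The sketch below assumes the header names are exactly `keyword` and `count` and that `count` holds integers; check a generated file before relying on either assumption. The function name `read_keywords` is hypothetical.

```python
import csv
from typing import List, TextIO, Tuple


def read_keywords(handle: TextIO, min_count: int = 1) -> List[Tuple[str, int]]:
    """Read (keyword, count) rows, most frequent first.

    Assumes a header row with 'keyword' and 'count' columns.
    """
    reader = csv.DictReader(handle)
    rows = [(row["keyword"], int(row["count"])) for row in reader]
    # Drop rare keywords, then sort by count in descending order.
    rows = [r for r in rows if r[1] >= min_count]
    return sorted(rows, key=lambda r: r[1], reverse=True)
```

To use it on a file, open the file with `newline=""` as the `csv` module recommends, for example `with open(path, newline="", encoding="utf-8") as fh: read_keywords(fh)`.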

🧱 Requirements

To use txt2phrases, ensure you have the following installed:

  • Python 3.8+
  • Dependencies:
    • argparse: For CLI argument parsing (part of the Python standard library, so no separate install is needed)
    • beautifulsoup4: For HTML parsing
    • pandas: For data manipulation and CSV export
    • tqdm: For progress bars during batch processing
    • transformers: For AI-powered keyword extraction
    • scikit-learn: For TF-IDF-based keyword classification
    • torch: For running NLP models

If you are working from a source checkout that includes requirements.txt, install the dependencies with:

pip install -r requirements.txt
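
To confirm that the third-party dependencies listed above are importable in the current environment, a small check like the following can help. The mapping reflects the usual import names (`bs4` for beautifulsoup4, `sklearn` for scikit-learn); the function name `missing_packages` is hypothetical and the check does not verify versions.

```python
from importlib.util import find_spec
from typing import Dict, List

# pip distribution name -> top-level module name used in import statements
REQUIRED: Dict[str, str] = {
    "beautifulsoup4": "bs4",
    "pandas": "pandas",
    "tqdm": "tqdm",
    "transformers": "transformers",
    "scikit-learn": "sklearn",
    "torch": "torch",
}


def missing_packages(required: Dict[str, str] = REQUIRED) -> List[str]:
    """Return the distribution names whose modules cannot be found."""
    return [
        dist for dist, module in required.items()
        if find_spec(module) is None
    ]
```

Calling `missing_packages()` returns an empty list when everything is available; otherwise the returned names can be passed to `pip install`.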

📚 Documentation

For full documentation and examples, visit the GitHub repository.


📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

