A comprehensive library for text processing, keyword extraction, and classification from PDF and HTML documents

# txt2phrases

txt2phrases is a Python library and CLI tool designed for processing and analyzing text data. It provides a streamlined pipeline for converting documents (HTML, PDF) into plain text, extracting keywords using AI models, and classifying keywords into specific and general categories using TF-IDF.


## Features

1. **PDF to Text Conversion**: Extract plain text from PDF files for further processing.
2. **HTML to Text Conversion**: Convert HTML documents into clean, plain text.
3. **AI-Powered Keyword Extraction**: Use transformer-based NLP models (e.g., Hugging Face Transformers) to extract and rank the most important keywords from text files.
4. **TF-IDF Keyword Classification**: Classify extracted keywords into specific and general categories using TF-IDF.
5. **Automated Pipeline**: Run the entire pipeline (PDF/HTML → TXT → Keywords) with a single command.
6. **Batch Processing**: Process single files or entire directories efficiently.
7. **Configurable Parameters**: Customize thresholds, batch sizes, and output formats to suit your needs.

## Installation

```bash
pip install txt2phrases
```

## Quick Start

```bash
# Convert PDF to text
txt2phrases pdf2txt -i document.pdf -o output_folder

# Convert HTML to text
txt2phrases html2txt -i webpage.html -o output_folder

# Extract keywords from text files
txt2phrases keyphrases -i text_files/ -o keywords/ -n 500

# Run complete pipeline
txt2phrases auto -i pygetpapers_output/ -o results/ -n 100
```
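When several of these commands need to run in sequence, they can be driven from a Python script. The sketch below only assembles argument lists from the flags used in the Quick Start (`-i`, `-o`, `-n`). `build_command` and `run_steps` are hypothetical helpers, not part of txt2phrases, and the sketch assumes the `txt2phrases` executable is on `PATH`. It also assumes that `pdf2txt` writes its `.txt` files directly into the output folder, which the source does not confirm.

```python
import shutil
import subprocess


def build_command(subcommand, input_path, output_dir, top_n=None):
    """Build the argv list for one txt2phrases subcommand (hypothetical helper)."""
    cmd = ["txt2phrases", subcommand, "-i", str(input_path), "-o", str(output_dir)]
    if top_n is not None:
        # -n is only added when a value is given, as in the keyphrases/auto examples
        cmd += ["-n", str(top_n)]
    return cmd


def run_steps(steps):
    """Run each command in order, raising on the first failure.

    Returns False without running anything if the CLI is not installed.
    """
    if shutil.which("txt2phrases") is None:
        return False
    for cmd in steps:
        subprocess.run(cmd, check=True)
    return True


steps = [
    build_command("pdf2txt", "document.pdf", "output_folder"),
    # Assumption: the text files produced above land in output_folder
    build_command("keyphrases", "output_folder", "keywords", top_n=500),
]

for step in steps:
    print(" ".join(step))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.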
## Python API

```python
from txt2phrases import (
    convert_pdf_to_text,
    convert_html_to_text,
    KeywordExtraction,
    classify_keywords_split_files,
)

# Convert PDF to text
txt_path = convert_pdf_to_text("document.pdf", "output_folder")

# Extract keywords
extractor = KeywordExtraction(
    input_path="text_files/",
    output_folder="keywords/",
    top_n=1000
)
extractor.extract()

# Classify keywords
classify_keywords_split_files(
    input_dir="keyword_csvs/",
    output_dir="classified/",
    threshold=0.6,
    min_freq=5
)
```
## CLI Commands

### pdf2txt

Convert PDF files to text format.

```bash
txt2phrases pdf2txt -i input.pdf -o output_folder
txt2phrases pdf2txt -i pdfs_directory/ -o text_output/
```
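The command accepts either a single file or a directory. As a rough illustration of what that batch behaviour involves, the sketch below pairs each input file with a target `.txt` path. `plan_conversions` is a hypothetical helper, and the naming rule (input file stem plus `.txt`) is an assumption; the source does not state how txt2phrases names its output files.

```python
from pathlib import Path


def plan_conversions(input_path, output_dir, suffix=".pdf"):
    """Pair each input file with the .txt path it would be written to.

    A directory is scanned (non-recursively) for files with the given suffix;
    any other path is treated as a single input file.
    """
    src = Path(input_path)
    out = Path(output_dir)
    if src.is_dir():
        files = sorted(
            p for p in src.iterdir()
            if p.is_file() and p.suffix.lower() == suffix
        )
    else:
        files = [src]
    # Assumed naming rule: same stem as the input, with a .txt extension
    return [(p, out / (p.stem + ".txt")) for p in files]


for source, target in plan_conversions("input.pdf", "output_folder"):
    print(source, "->", target)
```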
### html2txt

Convert HTML files to clean text format.

```bash
txt2phrases html2txt -i webpage.html -o output_folder
txt2phrases html2txt -i html_directory/ -o text_output/
```
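To make "clean text" concrete, here is a minimal sketch of HTML-to-text conversion that drops markup along with `<script>` and `<style>` contents. It uses the standard library's `html.parser` as a stand-in so that it runs without extra packages. txt2phrases lists `beautifulsoup4` as its HTML parsing dependency, so its actual implementation will differ. `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = "<body><h1>Hello</h1><script>var x = 1;</script><p>World &amp; more</p></body>"
print(html_to_text(page))
```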
### keyphrases

Extract keyphrases from text files using advanced NLP models.

```bash
txt2phrases keyphrases -i text.txt -o keywords/ -n 500
txt2phrases keyphrases -i text_directory/ -o keywords/ -n 1000
```
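The `-n` option appears to correspond to `top_n` in the Python API. Assuming it caps the number of phrases kept, the selection step can be pictured as below. This is only a sketch of the top-n cut: the phrase counts are made up, `top_phrases` is a hypothetical helper, and the transformer-based extraction itself is not reproduced here.

```python
from collections import Counter


def top_phrases(phrase_counts, n):
    """Return the n most frequent phrases as (phrase, count) pairs."""
    return Counter(phrase_counts).most_common(n)


# Illustrative counts only, not real txt2phrases output
counts = {"deep learning": 12, "neural network": 9, "dataset": 4}
print(top_phrases(counts, 2))
```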
### auto

Run the complete processing pipeline on PyGetPapers output or a directory of PDFs.

```bash
txt2phrases auto -i pygetpapers_output/ -o results/ -n 200
txt2phrases auto -i pdf_collection/ -o results/ -n 100
```
## Advanced Features

### TF-IDF Classification

```python
from txt2phrases import classify_keywords_split_files

classify_keywords_split_files(
    input_dir="keyword_csvs/",
    output_dir="classified/",
    threshold=0.6,
    min_freq=5
)
```
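The source does not document how `threshold` and `min_freq` are applied, so the following is a sketch under stated assumptions rather than the library's method. It uses only the inverse-document-frequency half of TF-IDF: keywords that occur in few documents score close to 1 and are labelled specific, while keywords found in most documents score close to 0 and are labelled general. It assumes `min_freq` is a minimum total count and `threshold` is a cut-off on the normalised score. `classify_by_idf` is a hypothetical function.

```python
import math
from collections import Counter


def classify_by_idf(doc_counts, threshold=0.6, min_freq=5):
    """Split keywords into (specific, general) dicts of keyword -> score.

    doc_counts maps a document name to its {keyword: count} dict.
    """
    n_docs = len(doc_counts)
    total = Counter()     # total occurrences across all documents
    doc_freq = Counter()  # number of documents containing the keyword
    for counts in doc_counts.values():
        for keyword, count in counts.items():
            total[keyword] += count
            doc_freq[keyword] += 1

    # Largest possible IDF (keyword in exactly one document); guard n_docs <= 1
    max_idf = math.log(n_docs) if n_docs > 1 else 1.0

    specific, general = {}, {}
    for keyword, freq in total.items():
        if freq < min_freq:
            continue  # assumed meaning of min_freq: drop rarely seen keywords
        score = math.log(n_docs / doc_freq[keyword]) / max_idf
        (specific if score >= threshold else general)[keyword] = score
    return specific, general


# Illustrative counts only
docs = {
    "paper_a": {"neural network": 6, "data": 4},
    "paper_b": {"data": 5},
    "paper_c": {"data": 3, "rare term": 1},
}
specific, general = classify_by_idf(docs)
print("specific:", specific)
print("general:", general)
```

In this toy input, "data" appears in every document and lands in the general group, "neural network" appears in one document and lands in the specific group, and "rare term" is dropped by the frequency filter.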
### Complete Research Pipeline

```bash
# Download papers with PyGetPapers
pygetpapers -q "machine learning" -o papers/ -k 100

# Process and analyze
txt2phrases auto -i papers/ -o analysis/ -n 200

# Classify results
python -c "
from txt2phrases import classify_keywords_split_files
classify_keywords_split_files('analysis/', 'classified/', threshold=0.7)
"
```
## Output Formats

- **Text Conversion**: `.txt` files with extracted text
- **Keyword Extraction**: CSV files with keyword and count columns
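A keyword CSV of this shape can be read back with the standard library for further filtering. The sketch below assumes the header names are exactly `keyword` and `count`; the source names the two columns but does not show a file, so check a real output file before relying on this. `load_keywords` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io


def load_keywords(csv_file, min_count=1):
    """Read (keyword, count) rows and return them sorted by count, descending."""
    reader = csv.DictReader(csv_file)
    # Assumed column names: "keyword" and "count"
    rows = [(row["keyword"], int(row["count"])) for row in reader]
    kept = [r for r in rows if r[1] >= min_count]
    return sorted(kept, key=lambda r: r[1], reverse=True)


# In-memory stand-in for a keyword CSV file
sample = io.StringIO(
    "keyword,count\n"
    "neural network,9\n"
    "dataset,4\n"
    "deep learning,12\n"
)
print(load_keywords(sample, min_count=5))
# prints [('deep learning', 12), ('neural network', 9)]
```

For a file on disk, pass `open(path, newline="", encoding="utf-8")` in place of the `StringIO` object.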

## Requirements

To use `txt2phrases`, ensure you have the following installed:

- **Python 3.8+**
- **Dependencies**:
  - `argparse`: For CLI argument parsing (part of the Python standard library, so no separate install is needed).
  - `beautifulsoup4`: For HTML parsing.
  - `pandas`: For data manipulation and CSV export.
  - `tqdm`: For progress bars during batch processing.
  - `transformers`: For AI-powered keyword extraction.
  - `scikit-learn`: For TF-IDF-based keyword classification.
  - `torch`: For running NLP models.

You can install all dependencies using the following command:

```bash
pip install -r requirements.txt
```
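To confirm that the third-party dependencies listed above are importable in the current environment, a quick check along these lines may help. The mapping from PyPI package name to import name is standard for these packages (`beautifulsoup4` imports as `bs4`, `scikit-learn` as `sklearn`); `missing_dependencies` is a hypothetical helper and not part of txt2phrases.

```python
from importlib.util import find_spec

# PyPI package name -> top-level import name
DEPENDENCIES = {
    "beautifulsoup4": "bs4",
    "pandas": "pandas",
    "tqdm": "tqdm",
    "transformers": "transformers",
    "scikit-learn": "sklearn",
    "torch": "torch",
}


def missing_dependencies(deps=DEPENDENCIES):
    """Return the package names whose import module cannot be found."""
    return sorted(pkg for pkg, module in deps.items() if find_spec(module) is None)


missing = missing_dependencies()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed dependencies are importable.")
```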

## Documentation

For full documentation and examples, visit the GitHub repository.

## License

This project is licensed under the MIT License. See the LICENSE file for details.
