A Python library for HTML to TXT conversion, keyword extraction, and TF-IDF-based per-chapter classification.
txt2phrases
txt2phrases is a Python library designed for processing and analyzing text data. It provides tools for:
- HTML to TXT conversion: Extract plain text from HTML files.
- Keyword extraction: Use Hugging Face Transformers to identify and rank the most important keywords in text files.
- Per-chapter TF-IDF-based keyword classification: Classify keywords as specific (unique to a chapter) or general (common across chapters).
Features
- HTML Parsing: Convert HTML documents into plain text for further processing.
- AI-Powered Keyword Extraction: Leverage pre-trained NLP models for accurate keyword identification.
- TF-IDF Classification: Classify keywords into specific and general categories based on their relevance.
- Batch Processing: Process multiple files in a single command.
- Configurable Parameters: Customize thresholds, batch sizes, and output formats.
- Output Formats: Save results as CSV files for easy analysis.
Installation
Install txt2phrases directly from PyPI:
pip install txt2phrases
CLI Usage
Convert HTML → TXT
Convert all HTML files in a folder to plain text:
html2txt -i path/to/html_folder -o path/to/output_folder
html2txt -h
- -h / --help : Show the help message
- -i / --input : Path to the folder containing HTML files
- -o / --output : Path to the folder where TXT files will be saved
Extract keywords from TXT files
Extract top keywords from all TXT files in a folder:
extract_keywords -i path/to/txt_folder -o path/to/output_folder -n 3500
extract_keywords -h
- -h : Show the help message
- -i / --input_folder : Folder containing TXT files
- -o / --output_folder : Folder to save keyword CSVs
- -n / --top_n : Number of top keywords to extract (default: 3500)
Classify Keywords into Specific and General (TF-IDF)
This command takes per-chapter keyword CSVs and divides the keywords into:
- Specific keywords: unique to a chapter
- General keywords: common across multiple chapters
specific_keywords -i path/to/csv_folder -o path/to/output_folder -t 0.6 -f 5
specific_keywords -h
- -h : Show the help message
- -i / --input_dir : Folder with per-chapter CSV files containing keyword,count
- -o / --output_dir : Folder to save per-chapter specific keyword CSVs
- -t / --threshold : TF-IDF threshold for a keyword to be considered specific (default: 0.6)
- -f / --min_freq : Minimum frequency of a keyword to consider (default: 5)
Python Usage
Convert HTML → TXT
from txt2phrases.html2txt import html_to_txt_folder
html_to_txt_folder("path/to/html_folder", "path/to/output_folder")
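To illustrate what the HTML to TXT step does conceptually, here is a minimal sketch that uses only the standard library's html.parser. It is a stand-in, not the library's implementation: txt2phrases lists beautifulsoup4 as a requirement, and the names `_TextCollector` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Hypothetical helper: return the text fragments joined by newlines."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Each text fragment ends up on its own line, which is a simplification; a real converter would likely treat inline and block elements differently.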
Extract Keywords
from txt2phrases.keyword import KeywordExtraction
extractor = KeywordExtraction(
    textfile="path/to/file.txt",
    saving_path="path/to/output_folder",
    output_filename="keywords.csv",
    top_n=1000,
)
top_keywords = extractor.extract_keywords()
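The saved CSV can be inspected without any extra dependencies. The sketch below assumes the file has a header row with `keyword` and `count` columns, matching the keyword,count layout that the classification step expects as input; the exact output layout of the extractor is an assumption here, and `load_keyword_counts` is a hypothetical helper.

```python
import csv
from collections import Counter


def load_keyword_counts(csv_file) -> Counter:
    """Read keyword,count rows from an open CSV file into a Counter.

    Assumes a header row naming the columns 'keyword' and 'count'.
    """
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[row["keyword"].strip()] += int(row["count"])
    return counts
```

Usage would look like `with open("path/to/output_folder/keywords.csv", newline="") as f: counts = load_keyword_counts(f)`, after which `counts.most_common(10)` lists the ten most frequent keywords.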
Per-Chapter Specific Keywords and General Keywords
from txt2phrases.classify_specific import classify_keywords_split_files
classify_keywords_split_files(
    input_dir="path/to/chapter_csv_folder",
    output_dir="path/to/output_folder",
    threshold=0.6,
    min_freq=5,
)
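How threshold and min_freq might interact can be sketched in plain Python. The library's actual scoring formula is not documented here, so everything below is an assumption: the sketch uses a smoothed IDF of the same form as scikit-learn's default, scales each chapter's scores by that chapter's top score so they fall in (0, 1], and treats anything below the threshold as general. `split_keywords` is a hypothetical name.

```python
import math
from typing import Dict, Set, Tuple


def split_keywords(
    chapters: Dict[str, Dict[str, int]],
    threshold: float = 0.6,
    min_freq: int = 5,
) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Return {chapter: (specific, general)} keyword sets."""
    # Drop keywords below the minimum frequency in each chapter.
    kept = {
        name: {kw: c for kw, c in counts.items() if c >= min_freq}
        for name, counts in chapters.items()
    }
    n_chapters = len(kept)

    # Document frequency: in how many chapters each keyword survives.
    doc_freq: Dict[str, int] = {}
    for counts in kept.values():
        for kw in counts:
            doc_freq[kw] = doc_freq.get(kw, 0) + 1

    result = {}
    for name, counts in kept.items():
        # Raw count times smoothed IDF: ln((1 + N) / (1 + df)) + 1.
        scores = {
            kw: c * (math.log((1 + n_chapters) / (1 + doc_freq[kw])) + 1)
            for kw, c in counts.items()
        }
        # Scale by the chapter's top score (assumed normalization).
        top = max(scores.values(), default=1.0)
        specific = {kw for kw, s in scores.items() if s / top >= threshold}
        result[name] = (specific, set(scores) - specific)
    return result
```

One limitation of this sketch: a low-scoring keyword that appears in only one chapter also lands in the general set, which is looser than the "common across multiple chapters" description above.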
Requirements
- Python 3.8+
- beautifulsoup4
- pandas
- tqdm
- transformers
- scikit-learn
- torch
Download files
File details
Details for the file txt2phrases-0.2.1.tar.gz.
File metadata
- Download URL: txt2phrases-0.2.1.tar.gz
- Upload date:
- Size: 22.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | abb9a4418123762efffec3cc598d02198f2588ee00a34f44199c02a62c67b8d4 |
| MD5 | 113c15a6441eb6f7f603abea7d987bd1 |
| BLAKE2b-256 | f784fb1bd07fbffdb605e278d0c8d34bc5e8138fe0b1a1bdb34d70ee42679aab |
File details
Details for the file txt2phrases-0.2.1-py3-none-any.whl.
File metadata
- Download URL: txt2phrases-0.2.1-py3-none-any.whl
- Upload date:
- Size: 7.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | dde1288c5f4d0bd93a9f64413bbbf7627d0d2c57ebed1cdd51054c872c7fc509 |
| MD5 | be829e6eaeaf32e8fcdeb5d74019d15d |
| BLAKE2b-256 | 0162dfe3fb3534d0de12c2690fbcbc4f9fea91eec37a29c17fdef51401272149 |