A Python library for HTML to TXT conversion, keyword extraction, and TF-IDF-based per-chapter classification.
## txt2phrases
A Python library for:
1. **HTML to TXT conversion**
2. **Keyword extraction using Hugging Face Transformers**
3. **Per-chapter TF-IDF-based specific keyword classification**
---
## Installation
You can install `txt2phrases` directly from PyPI:
```bash
pip install txt2phrases
```

## CLI Usage
### Convert HTML → TXT

Convert all HTML files in a folder to plain text:

```bash
html2text -i path/to/html_folder -o path/to/output_folder
```

- `-i` / `--input`: Path to the folder containing HTML files
- `-o` / `--output`: Path to the folder where TXT files will be saved
### Extract keywords from TXT files

Extract top keywords from all TXT files in a folder:

```bash
extract_keywords -i path/to/txt_folder -o path/to/output_folder -n 3500
```

- `-i` / `--input_folder`: Folder containing TXT files
- `-o` / `--output_folder`: Folder to save keyword CSVs
- `-n` / `--top_n`: Number of top keywords to extract (default: 3500)
### Generate per-chapter specific keywords (TF-IDF)

Create per-chapter CSVs listing keywords specific to each chapter:

```bash
specific_keywords -i path/to/csv_folder -o path/to/output_folder -t 0.6 -f 5
```

- `-i` / `--input_dir`: Folder with per-chapter CSV files containing `keyword,count`
- `-o` / `--output_dir`: Folder to save per-chapter specific keyword CSVs
- `-t` / `--threshold`: TF-IDF threshold for a keyword to be considered specific (default: 0.6)
- `-f` / `--min_freq`: Minimum frequency of a keyword to consider (default: 5)
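
The three commands above read like a pipeline, so here is a minimal sketch of chaining them from Python with `subprocess`. It uses only the flags listed above. The helper names (`pipeline_commands`, `run_pipeline`) and the `txt`, `keywords`, and `specific` subfolder names are hypothetical. It also assumes the CSVs written by `extract_keywords` are in the `keyword,count` format that `specific_keywords` expects, which this README does not state.

```python
import subprocess
from pathlib import Path


def pipeline_commands(html_dir, work_dir, top_n=3500, threshold=0.6, min_freq=5):
    """Build the argv lists for the three CLI steps (hypothetical helper)."""
    work = Path(work_dir)
    txt_dir = work / "txt"            # hypothetical folder for converted text
    csv_dir = work / "keywords"       # hypothetical folder for keyword CSVs
    specific_dir = work / "specific"  # hypothetical folder for the final CSVs
    return [
        ["html2text", "-i", str(html_dir), "-o", str(txt_dir)],
        ["extract_keywords", "-i", str(txt_dir), "-o", str(csv_dir),
         "-n", str(top_n)],
        # Assumption: step 2 output is valid input for step 3.
        ["specific_keywords", "-i", str(csv_dir), "-o", str(specific_dir),
         "-t", str(threshold), "-f", str(min_freq)],
    ]


def run_pipeline(html_dir, work_dir):
    """Run each step in order, stopping at the first failing command."""
    for cmd in pipeline_commands(html_dir, work_dir):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("path/to/html_folder", "path/to/work_folder")` would then run the steps in order. Keeping command construction separate from execution makes it possible to inspect the commands before running them.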
## Python Usage

### Convert HTML → TXT

```python
from txt2phrases.html2text import html_to_txt_folder

html_to_txt_folder("path/to/html_folder", "path/to/output_folder")
```
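
To give a rough idea of what a folder-level HTML to TXT conversion involves, here is a minimal sketch. It is not the implementation of `html_to_txt_folder`. It uses the standard library's `html.parser` rather than BeautifulSoup, and the names `html_to_text` and `convert_folder` are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html):
    """Return the text content of an HTML string, one fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def convert_folder(input_dir, output_dir):
    """Write a .txt file for every .html file in input_dir (hypothetical helper)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(input_dir).glob("*.html")):
        dst = out / (src.stem + ".txt")
        dst.write_text(html_to_text(src.read_text(encoding="utf-8")), encoding="utf-8")
        written.append(dst)
    return written
```

The sketch assumes UTF-8 input and only matches files ending in `.html`. The library's own handling of encodings and file extensions may differ.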
### Extract Keywords

```python
from txt2phrases.keyword import KeywordExtraction

extractor = KeywordExtraction(
    textfile="path/to/file.txt",
    saving_path="path/to/output_folder",
    output_filename="keywords.csv",
    top_n=1000,
)
top_keywords = extractor.extract_keywords()
```
### Per-Chapter Specific Keywords

```python
from txt2phrases.classify_specific import classify_keywords_split_files

classify_keywords_split_files(
    input_dir="path/to/chapter_csv_folder",
    output_dir="path/to/output_folder",
    threshold=0.6,
    min_freq=5,
)
```
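
This README does not spell out how `threshold` and `min_freq` are applied, so the following is only one plausible reading, sketched in plain Python. It treats each chapter as a document, weights each keyword count by a smoothed IDF, and L2-normalizes per chapter, which matches the defaults of scikit-learn's `TfidfTransformer`. It then keeps keywords whose normalized score reaches `threshold` and whose raw count reaches `min_freq`. The function name `specific_keywords_sketch` and the in-memory input format are hypothetical, and the library's actual scoring may differ.

```python
import math


def specific_keywords_sketch(chapters, threshold=0.6, min_freq=5):
    """Return {chapter: [keywords]} judged specific to each chapter.

    `chapters` maps a chapter name to a {keyword: count} dict.
    Assumption: this scoring is illustrative, not txt2phrases' own method.
    """
    n_chapters = len(chapters)

    # Document frequency: in how many chapters does each keyword appear?
    doc_freq = {}
    for counts in chapters.values():
        for keyword in counts:
            doc_freq[keyword] = doc_freq.get(keyword, 0) + 1

    result = {}
    for name, counts in chapters.items():
        # Raw count times smoothed IDF: ln((1 + N) / (1 + df)) + 1.
        scores = {
            kw: count * (math.log((1 + n_chapters) / (1 + doc_freq[kw])) + 1.0)
            for kw, count in counts.items()
        }
        # L2 norm of the chapter's score vector (fall back to 1.0 if empty).
        norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
        result[name] = sorted(
            kw for kw, s in scores.items()
            if counts[kw] >= min_freq and s / norm >= threshold
        )
    return result
```

For example, with `{"ch1": {"neuron": 10, "cell": 6, "the": 3}, "ch2": {"cell": 5, "market": 12}}`, the keyword `cell` appears in both chapters and so gets a lower IDF, while `neuron` and `market` dominate their own chapters' normalized scores.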
## Requirements

- Python 3.8+
- beautifulsoup4
- pandas
- tqdm
- transformers
- scikit-learn