A Python library for HTML to TXT conversion, keyword extraction, and TF-IDF-based per-chapter classification.

txt2phrases

txt2phrases is a Python library designed for processing and analyzing text data. It provides tools for:

  1. HTML to TXT conversion: Extract plain text from HTML files.
  2. Keyword extraction: Use Hugging Face Transformers to identify and rank the most important keywords in text files.
  3. Per-chapter TF-IDF-based keyword classification: Classify keywords as specific (unique to a chapter) or general (common across chapters).

Features

  • HTML Parsing: Convert HTML documents into plain text for further processing.
  • AI-Powered Keyword Extraction: Use pre-trained Hugging Face Transformers models to identify and rank keywords.
  • TF-IDF Classification: Classify keywords as specific or general based on their TF-IDF scores across chapters.
  • Batch Processing: Process multiple files in a single command.
  • Configurable Parameters: Customize thresholds, batch sizes, and output formats.
  • Output Formats: Save results as CSV files for easy analysis.

Installation

Install txt2phrases directly from PyPI:

pip install txt2phrases

CLI Usage

Convert HTML → TXT

Convert all HTML files in a folder to plain text:

html2txt -i path/to/html_folder -o path/to/output_folder
html2txt -h
  • -h / --help : Show the help message
  • -i / --input : Path to the folder containing HTML files
  • -o / --output : Path to the folder where TXT files will be saved
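To illustrate what the conversion step involves, here is a minimal sketch of folder-level HTML to TXT conversion. It is not the library's implementation: txt2phrases lists beautifulsoup4 as a requirement, while this sketch uses the standard library's html.parser so that it runs without extra installs. The names html_to_text and convert_folder are hypothetical, and the choice to skip script and style content is an assumption.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collects visible text, ignoring <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: return the visible text, one fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def convert_folder(input_dir, output_dir):
    """Hypothetical helper: write one .txt file per .html file in input_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for html_file in sorted(Path(input_dir).glob("*.html")):
        text = html_to_text(html_file.read_text(encoding="utf-8"))
        target = out / (html_file.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

Each output file keeps the stem of its source file, so chapter1.html becomes chapter1.txt in the output folder.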

Extract keywords from TXT files

Extract top keywords from all TXT files in a folder:

extract_keywords -i path/to/txt_folder -o path/to/output_folder -n 3500
extract_keywords -h
  • -h : Show the help message
  • -i / --input_folder : Folder containing TXT files
  • -o / --output_folder : Folder to save keyword CSVs
  • -n / --top_n : Number of top keywords to extract (default: 3500)

Classify Keywords into Specific and General (TF-IDF)

This command takes per-chapter keyword CSVs and divides the keywords into:

  • Specific keywords: unique to a chapter
  • General keywords: common across multiple chapters

specific_keywords -i path/to/csv_folder -o path/to/output_folder -t 0.6 -f 5
specific_keywords -h
  • -h : Show the help message
  • -i / --input_dir : Folder with per-chapter CSV files with keyword and count columns
  • -o / --output_dir : Folder to save per-chapter specific keyword CSVs
  • -t / --threshold : TF-IDF threshold for a keyword to be considered specific (default: 0.6)
  • -f / --min_freq : Minimum frequency of a keyword to consider (default: 5)
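The way the threshold and minimum frequency interact can be sketched in plain Python. This is an illustration under stated assumptions, not the library's method: the README does not document the exact TF-IDF formula, so the sketch assumes term frequency normalised by the chapter's largest count and an inverse document frequency scaled to the range 0 to 1. The function name classify_keywords and the input layout (chapter name mapped to keyword counts) are hypothetical.

```python
import math
from collections import Counter


def classify_keywords(chapters, threshold=0.6, min_freq=5):
    """Split each chapter's keywords into 'specific' and 'general'.

    chapters: dict mapping chapter name -> dict of keyword -> count.
    Assumed scoring (not taken from txt2phrases):
      tf  = count / largest count in the chapter
      idf = log(N / df) / log(N), which is 1 for a keyword found in one
            chapter and 0 for a keyword found in every chapter
    """
    n_chapters = len(chapters)

    # Document frequency: in how many chapters a keyword passes min_freq.
    df = Counter()
    for counts in chapters.values():
        for keyword, count in counts.items():
            if count >= min_freq:
                df[keyword] += 1

    result = {}
    for name, counts in chapters.items():
        kept = {k: c for k, c in counts.items() if c >= min_freq}
        max_count = max(kept.values(), default=1)
        specific, general = [], []
        for keyword, count in kept.items():
            tf = count / max_count
            if n_chapters > 1:
                idf = math.log(n_chapters / df[keyword]) / math.log(n_chapters)
            else:
                idf = 1.0  # a single chapter has nothing to compare against
            score = tf * idf
            (specific if score >= threshold else general).append(keyword)
        result[name] = {"specific": sorted(specific), "general": sorted(general)}
    return result


chapters = {
    "ch1": {"enzyme": 10, "cell": 10, "rare": 2},
    "ch2": {"cell": 8, "orbit": 8},
}
classified = classify_keywords(chapters, threshold=0.6, min_freq=5)
```

In this toy input, "cell" appears in both chapters and lands in the general list, "enzyme" and "orbit" each appear in one chapter and are specific, and "rare" is dropped because its count is below min_freq.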

Python Usage

Convert HTML → TXT

from txt2phrases.html2txt import html_to_txt_folder

html_to_txt_folder("path/to/html_folder", "path/to/output_folder")

Extract Keywords

from txt2phrases.keyword import KeywordExtraction

extractor = KeywordExtraction(
    textfile="path/to/file.txt",
    saving_path="path/to/output_folder",
    output_filename="keywords.csv",
    top_n=1000
)

top_keywords = extractor.extract_keywords()
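Once a keyword CSV has been written, it can be inspected with the standard library. The sketch below assumes the file has keyword and count columns, which is the layout the classification step expects as input. The helper name top_rows and the sample data are hypothetical, and an in-memory string stands in for the file so the example runs on its own.

```python
import csv
import io


def top_rows(csv_text, n=3):
    """Return the n (keyword, count) pairs with the highest counts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row["keyword"], int(row["count"])) for row in reader]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


# Made-up sample standing in for the contents of keywords.csv.
sample = "keyword,count\nenzyme,12\ncell,30\nmembrane,7\nprotein,18\n"
best = top_rows(sample, n=2)
```

To read a real file, pass its contents instead of the sample, for example the text returned by Path("path/to/output_folder/keywords.csv").read_text().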

Per-Chapter Specific Keywords and General Keywords

from txt2phrases.classify_specific import classify_keywords_split_files

classify_keywords_split_files(
    input_dir="path/to/chapter_csv_folder",
    output_dir="path/to/output_folder",
    threshold=0.6,
    min_freq=5
)

Requirements

  • Python 3.8+
  • beautifulsoup4
  • pandas
  • tqdm
  • transformers
  • scikit-learn
  • torch
