A powerful text extraction utility for multiple file formats, including PDFs, Word documents, spreadsheets, and code files.

TFQ0tool

A command-line utility that extracts text from PDFs, Office documents, data files, plain-text files, and images, with optional OCR, parallel processing, and text preprocessing.

Features

  • Format Support:
    • PDF (including password-protected files and OCR for scanned pages)
    • Microsoft Office (DOCX, DOC, XLSX, XLS)
    • Data files (CSV, JSON, XML)
    • Text files (TXT, LOG, MD)
    • Image files (via OCR)
  • Processing Features:
    • Parallel processing with configurable threads
    • Memory-efficient streaming extraction
    • Advanced text preprocessing options
    • OCR support for images and scanned documents
    • Multiple output formats (TXT, JSON, CSV, MD)
    • Progress tracking and detailed logging
    • Automatic encoding detection
    • Language-specific processing

Installation

pip install tfq0tool

Usage

Basic Commands

# Extract text from a file
tfq0tool extract document.pdf

# Show supported formats with details
tfq0tool formats --details

# Extract with OCR
tfq0tool extract scanned.pdf --ocr --ocr-lang eng

# Process multiple files recursively
tfq0tool extract ./docs/ -r --exclude "*.tmp"

# Show help
tfq0tool --help
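
When these commands are driven from a script, it can help to assemble the argument list programmatically instead of concatenating strings. The following is a minimal sketch, not part of tfq0tool itself: `build_extract_command` and `run_extract` are hypothetical helper names, and only flags shown in the commands above are used.

```python
import shlex
import subprocess
from typing import List, Optional, Sequence


def build_extract_command(
    paths: Sequence[str],
    recursive: bool = False,
    exclude: Optional[str] = None,
    ocr: bool = False,
    ocr_lang: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `tfq0tool extract` (hypothetical helper)."""
    cmd = ["tfq0tool", "extract", *paths]
    if recursive:
        cmd.append("-r")
    if exclude is not None:
        cmd += ["--exclude", exclude]
    if ocr:
        cmd.append("--ocr")
    if ocr_lang is not None:
        cmd += ["--ocr-lang", ocr_lang]
    return cmd


def run_extract(cmd: List[str]) -> int:
    """Run the command and return its exit code (needs tfq0tool on PATH)."""
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    cmd = build_extract_command(["./docs/"], recursive=True, exclude="*.tmp")
    # Print a shell-quoted form for inspection; nothing is executed here.
    print(shlex.join(cmd))
```

Passing a list to `subprocess.run` avoids shell quoting problems with patterns such as `*.tmp`, which would otherwise be expanded by the shell before tfq0tool sees them.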

Extract Command Options

tfq0tool extract [OPTIONS] FILE_PATHS...

Input Options:
  FILE_PATHS          Files to process (supports glob patterns)
  -r, --recursive     Process directories recursively
  --exclude PATTERN   Exclude files matching pattern

Output Options:
  -o, --output DIR    Output directory
  --format FORMAT     Output format (txt|json|csv|md)
  --encoding ENC      Output encoding (default: utf-8)

Processing Options:
  -t, --threads N     Thread count (default: auto)
  -f, --force         Overwrite existing files
  -p, --password PWD  Password for encrypted PDFs

Text Processing Options:
  --preprocess OPT    Comma-separated preprocessing steps:
                      lowercase,strip_whitespace,
                      remove_numbers,remove_punctuation
  --language LANG     Language for processing (e.g., 'en')
  --ocr               Enable OCR for images/scanned docs
  --ocr-lang LANG     OCR language (default: eng)

Display Options:
  --verbose           Enable detailed output
  --progress          Show progress bar
  --silent            Suppress non-error output
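
To make the `--preprocess` step names concrete, here is a rough Python sketch of what each one plausibly does. This is an assumption based on the names alone, not tfq0tool's actual implementation: the `preprocess` function, the `STEPS` table, and the exact behavior of each step (for example, collapsing whitespace runs to a single space) are illustrative.

```python
import re
import string
from typing import Callable, Dict


def _strip_whitespace(text: str) -> str:
    # Collapse runs of whitespace to one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


# Assumed meaning of each step name; the real tool may differ.
STEPS: Dict[str, Callable[[str], str]] = {
    "lowercase": str.lower,
    "strip_whitespace": _strip_whitespace,
    "remove_numbers": lambda t: re.sub(r"\d+", "", t),
    "remove_punctuation": lambda t: t.translate(
        str.maketrans("", "", string.punctuation)
    ),
}


def preprocess(text: str, options: str) -> str:
    """Apply the comma-separated steps in `options` in the order given."""
    for name in options.split(","):
        name = name.strip()
        if not name:
            continue
        if name not in STEPS:
            raise ValueError(f"unknown preprocessing step: {name}")
        text = STEPS[name](text)
    return text
```

In this sketch the order of steps matters: removing numbers or punctuation can leave extra spaces behind, so `strip_whitespace` is most useful last.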

Configuration

View or modify settings:

# Show current config
tfq0tool config --show

# Reset to defaults
tfq0tool config --reset

# Change settings
tfq0tool config --set processing.chunk_size 2097152
tfq0tool config --set threading.max_threads 8
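
The value 2097152 in the example is 2 × 1024 × 1024. Assuming `processing.chunk_size` is measured in bytes (the unit is not stated here), that is 2 MiB per chunk. The sketch below illustrates the general idea of chunked, memory-bounded reading; `iter_chunks` is a hypothetical name and not a tfq0tool API.

```python
import io
from typing import BinaryIO, Iterator

# 2 * 1024 * 1024 = 2097152, the value used in the config example.
CHUNK_SIZE = 2 * 1024 * 1024


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive blocks of at most `chunk_size` bytes until EOF."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        yield block


if __name__ == "__main__":
    # Tiny in-memory demonstration with a 2-byte chunk size.
    data = io.BytesIO(b"x" * 5)
    print([len(block) for block in iter_chunks(data, chunk_size=2)])
```

Only one chunk is held in memory at a time, so peak memory use depends on the chunk size and not on the file size.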

Examples

# Basic text extraction
tfq0tool extract document.pdf -o ./output --format txt

# Process directory recursively with exclusions
tfq0tool extract ./docs -r --exclude "*.tmp" --progress

# Extract from scanned PDF with OCR
tfq0tool extract scan.pdf --ocr --ocr-lang eng

# Multiple files with advanced preprocessing
tfq0tool extract *.txt --preprocess lowercase,strip_whitespace,remove_numbers

# Parallel processing with custom output format
tfq0tool extract *.pdf -t 4 --format json --progress

# Extract with specific language and encoding
tfq0tool extract *.docx --language fr --encoding utf-8

# Password-protected PDF with OCR
tfq0tool extract secure.pdf -p mypassword --ocr
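
The recursive, exclusion, and thread options combine naturally. As a conceptual illustration only, the Python sketch below walks a directory tree, skips files matching an exclude pattern, and processes the rest in a thread pool. All names (`collect_files`, `process_all`, `read_text`) are hypothetical, and matching patterns against the file name (not the full path) is an assumption; tfq0tool may behave differently.

```python
import fnmatch
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def collect_files(root: Path, exclude: Iterable[str] = ()) -> List[Path]:
    """Walk `root` recursively, skipping files whose name matches a pattern."""
    patterns = list(exclude)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and not any(fnmatch.fnmatch(p.name, pat) for pat in patterns)
    )


def process_all(
    files: List[Path], worker: Callable[[Path], str], threads: int = 4
) -> Dict[Path, str]:
    """Run `worker` on every file using a pool of `threads` threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with `files`.
        return dict(zip(files, pool.map(worker, files)))


def read_text(path: Path) -> str:
    # Stand-in worker: a real extractor would dispatch on the file format.
    return path.read_text(encoding="utf-8")
```

A thread pool suits this kind of work because most of the time is spent waiting on file I/O, not on computation.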

Format Details

Use tfq0tool formats --details to see detailed information about supported formats, including:

  • Supported features for each format
  • Format-specific limitations
  • Processing capabilities
  • Best practices for extraction

Download files

Source Distribution

tfq0tool-2.1.8.tar.gz (16.2 kB)

Uploaded Source

Built Distribution

tfq0tool-2.1.8-py3-none-any.whl (17.2 kB)

Uploaded Python 3

File details

Details for the file tfq0tool-2.1.8.tar.gz.

File metadata

  • Download URL: tfq0tool-2.1.8.tar.gz
  • Upload date:
  • Size: 16.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for tfq0tool-2.1.8.tar.gz

Algorithm    Hash digest
SHA256       fa32ef08b05367ccdb7a8f34680c983d4c305d0e3267d330ff767e3d8d473807
MD5          81cd456ba4b364e4e2ba1a6f4c89da26
BLAKE2b-256  e530daf5d8de92e1bef95801e40ef9c8b4117b73f0c1446d3c65bfbcba92d45f
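
A downloaded file can be checked against the SHA256 digest listed above using Python's standard `hashlib` module. This is a minimal sketch; `sha256_of` and `matches` are illustrative helper names, and the local file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with an expected hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches(Path("tfq0tool-2.1.8.tar.gz"), "<SHA256 digest from the list above>")` should return `True` for an intact download.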

Provenance

The following attestation bundles were made for tfq0tool-2.1.8.tar.gz:

Publisher: tfq0tool-publish.yml on TFQ0/tfq0tool

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tfq0tool-2.1.8-py3-none-any.whl.

File metadata

  • Download URL: tfq0tool-2.1.8-py3-none-any.whl
  • Upload date:
  • Size: 17.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for tfq0tool-2.1.8-py3-none-any.whl

Algorithm    Hash digest
SHA256       af87642aa04537e8c40e7809690d065b2d2e88e0826ad8dd0b2662c7ee31f6e3
MD5          6205d5464e4353ee326ad407b21831e0
BLAKE2b-256  3b2de4cd57435d7232e83b7ad5466054b6633f72f660910f732a431ad4064fd6

Provenance

The following attestation bundles were made for tfq0tool-2.1.8-py3-none-any.whl:

Publisher: tfq0tool-publish.yml on TFQ0/tfq0tool
