A powerful text extraction utility for multiple file formats, including PDFs, Word documents, spreadsheets, and code files.
TFQ0tool
A powerful command-line utility for extracting text from various file formats with advanced processing capabilities.
Features
- Format Support:
  - PDF (including password-protected files, with optional OCR)
  - Microsoft Office (DOCX, DOC, XLSX, XLS)
  - Data files (CSV, JSON, XML)
  - Text files (TXT, LOG, MD)
  - Image files (via OCR)
- Processing Features:
  - Parallel processing with a configurable thread count
  - Memory-efficient streaming extraction
  - Text preprocessing (lowercase, strip_whitespace, remove_numbers, remove_punctuation)
  - OCR support for images and scanned documents
  - Multiple output formats (TXT, JSON, CSV, MD)
  - Progress tracking and detailed logging
  - Automatic encoding detection
  - Language-specific processing
Installation
pip install tfq0tool
Usage
Basic Commands
# Extract text from a file
tfq0tool extract document.pdf
# Show supported formats with details
tfq0tool formats --details
# Extract with OCR
tfq0tool extract scanned.pdf --ocr --ocr-lang eng
# Process multiple files recursively
tfq0tool extract ./docs/ -r --exclude "*.tmp"
# Show help
tfq0tool --help
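When calling the tool from a Python script, it can help to assemble the command as an argument list instead of a shell string. The following is a minimal sketch, not part of tfq0tool itself: the helper name build_extract_command is hypothetical, and it uses only the flags shown in the commands above.

```python
import shlex
from typing import Optional, Sequence


def build_extract_command(
    paths: Sequence[str],
    *,
    recursive: bool = False,
    exclude: Optional[str] = None,
    ocr: bool = False,
    ocr_lang: Optional[str] = None,
) -> list[str]:
    """Build an argv list for `tfq0tool extract` (hypothetical helper)."""
    cmd = ["tfq0tool", "extract", *paths]
    if recursive:
        cmd.append("-r")
    if exclude is not None:
        cmd += ["--exclude", exclude]
    if ocr:
        cmd.append("--ocr")
        if ocr_lang is not None:
            cmd += ["--ocr-lang", ocr_lang]
    return cmd


if __name__ == "__main__":
    cmd = build_extract_command(["./docs/"], recursive=True, exclude="*.tmp")
    # Print a shell-safe rendering of the command for inspection.
    print(shlex.join(cmd))
    # To run it (requires tfq0tool on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell expansion of patterns such as "*.tmp", so the pattern reaches the tool unchanged.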
Extract Command Options
tfq0tool extract [OPTIONS] FILE_PATHS...
Input Options:
  FILE_PATHS            Files to process (supports glob patterns)
  -r, --recursive       Process directories recursively
  --exclude PATTERN     Exclude files matching pattern
Output Options:
  -o, --output DIR      Output directory
  --format FORMAT       Output format (txt|json|csv|md)
  --encoding ENC        Output encoding (default: utf-8)
Processing Options:
  -t, --threads N       Thread count (default: auto)
  -f, --force           Overwrite existing files
  -p, --password PWD    Password for encrypted PDFs
Text Processing Options:
  --preprocess OPT      Preprocessing options:
                        lowercase,strip_whitespace,
                        remove_numbers,remove_punctuation
  --language LANG       Language for processing (e.g., 'en')
  --ocr                 Enable OCR for images/scanned docs
  --ocr-lang LANG       OCR language (default: eng)
Display Options:
  --verbose             Enable detailed output
  --progress            Show progress bar
  --silent              Suppress non-error output
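To make the four --preprocess option names concrete, here is a rough Python sketch of what each step could do to a string. It is an illustration under assumptions, not tfq0tool's implementation: the function bodies, the STEPS table, and the choice to apply steps in the order given are all hypothetical, and the tool may define these steps differently (for example, how whitespace is collapsed).

```python
import re
import string


def _lowercase(text: str) -> str:
    return text.lower()


def _strip_whitespace(text: str) -> str:
    # Assumption: collapse whitespace runs to one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


def _remove_numbers(text: str) -> str:
    return re.sub(r"\d+", "", text)


def _remove_punctuation(text: str) -> str:
    # Removes ASCII punctuation only.
    return text.translate(str.maketrans("", "", string.punctuation))


# Hypothetical mapping from option name to step.
STEPS = {
    "lowercase": _lowercase,
    "strip_whitespace": _strip_whitespace,
    "remove_numbers": _remove_numbers,
    "remove_punctuation": _remove_punctuation,
}


def preprocess(text: str, options: str) -> str:
    """Apply a comma-separated list of steps in the order given."""
    for name in options.split(","):
        name = name.strip()
        if not name:
            continue
        if name not in STEPS:
            raise ValueError(f"unknown preprocessing option: {name}")
        text = STEPS[name](text)
    return text


if __name__ == "__main__":
    print(preprocess("  Hello, World 42!  ",
                     "lowercase,remove_numbers,remove_punctuation,strip_whitespace"))
```

In this sketch the order matters: running strip_whitespace last cleans up the gaps left behind by removed digits and punctuation.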
Configuration
View or modify settings:
# Show current config
tfq0tool config --show
# Reset to defaults
tfq0tool config --reset
# Change settings
tfq0tool config --set processing.chunk_size 2097152
tfq0tool config --set threading.max_threads 8
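The value 2097152 equals 2 * 1024 * 1024, so if processing.chunk_size is measured in bytes (an assumption; the unit is not stated here), the example above sets a 2 MiB chunk. The sketch below shows the general idea of chunked reading that such a setting typically controls. The names CHUNK_SIZE and iter_chunks are hypothetical and not taken from tfq0tool.

```python
import io
from typing import BinaryIO, Iterator

# 2 * 1024 * 1024; assumed to be a byte count (2 MiB).
CHUNK_SIZE = 2_097_152


def iter_chunks(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks so the whole input is never held in memory."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


if __name__ == "__main__":
    data = io.BytesIO(b"x" * (CHUNK_SIZE * 2 + 5))
    for i, chunk in enumerate(iter_chunks(data)):
        print(i, len(chunk))
```

A larger chunk size means fewer reads but more memory per worker, which matters when several threads process files at once.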
Examples
# Basic text extraction
tfq0tool extract document.pdf -o ./output --format txt
# Process directory recursively with exclusions
tfq0tool extract ./docs -r --exclude "*.tmp" --progress
# Extract from scanned PDF with OCR
tfq0tool extract scan.pdf --ocr --ocr-lang eng
# Multiple files with advanced preprocessing
tfq0tool extract *.txt --preprocess lowercase,strip_whitespace,remove_numbers
# Parallel processing with custom output format
tfq0tool extract *.pdf -t 4 --format json --progress
# Extract with specific language and encoding
tfq0tool extract *.docx --language fr --encoding utf-8
# Password-protected PDF with OCR
tfq0tool extract secure.pdf -p mypassword --ocr
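The recursive example with --exclude "*.tmp" can be pictured as a directory walk that skips files matching a glob pattern. The sketch below uses Python's fnmatch and pathlib for that. It is only an approximation: whether tfq0tool matches the pattern against the file name or the full path is not documented here, and iter_input_files is a hypothetical name.

```python
from fnmatch import fnmatch
from pathlib import Path
from typing import Iterator, Optional


def iter_input_files(root: Path, exclude: Optional[str] = None) -> Iterator[Path]:
    """Yield files under root recursively, skipping names that match exclude."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Assumption: the pattern is compared with the base name only.
        if exclude is not None and fnmatch(path.name, exclude):
            continue
        yield path


if __name__ == "__main__":
    for path in iter_input_files(Path("./docs"), exclude="*.tmp"):
        print(path)
```

Quoting the pattern on the command line, as the examples do, keeps the shell from expanding it before the tool sees it.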
Format Details
Run tfq0tool formats --details to see detailed information about supported formats, including:
- Supported features for each format
- Format-specific limitations
- Processing capabilities
- Best practices for extraction
File details
Details for the file tfq0tool-2.1.8.tar.gz.
File metadata
- Download URL: tfq0tool-2.1.8.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | fa32ef08b05367ccdb7a8f34680c983d4c305d0e3267d330ff767e3d8d473807
MD5 | 81cd456ba4b364e4e2ba1a6f4c89da26
BLAKE2b-256 | e530daf5d8de92e1bef95801e40ef9c8b4117b73f0c1446d3c65bfbcba92d45f
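A downloaded archive can be checked against the SHA256 digest listed above before installing it. The following sketch uses only Python's standard hashlib. The helper names sha256_of and verify are illustrative, and the local file path in the usage lines is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path

# SHA256 digest listed above for tfq0tool-2.1.8.tar.gz.
EXPECTED_SHA256 = "fa32ef08b05367ccdb7a8f34680c983d4c305d0e3267d330ff767e3d8d473807"


def sha256_of(path: Path, block_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: Path, expected: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    archive = Path("tfq0tool-2.1.8.tar.gz")  # assumed download location
    if archive.exists():
        print("match" if verify(archive, EXPECTED_SHA256) else "MISMATCH")
```

pip can perform the same check during installation through its hash-checking mode with a requirements file, which is usually preferable to a manual comparison.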
Provenance
The following attestation bundles were made for tfq0tool-2.1.8.tar.gz:
Publisher: tfq0tool-publish.yml on TFQ0/tfq0tool
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: tfq0tool-2.1.8.tar.gz
  - Subject digest: fa32ef08b05367ccdb7a8f34680c983d4c305d0e3267d330ff767e3d8d473807
  - Sigstore transparency entry: 226960291
  - Sigstore integration time:
  - Permalink: TFQ0/tfq0tool@9f5ce36cd1b688d13352708a3905e0641b7377fb
  - Branch / Tag: refs/tags/v2.1.8
  - Owner: https://github.com/TFQ0
  - Access: public
  - Token Issuer: https://token.actions.githubusercontent.com
  - Runner Environment: github-hosted
  - Publication workflow: tfq0tool-publish.yml@9f5ce36cd1b688d13352708a3905e0641b7377fb
  - Trigger Event: release
File details
Details for the file tfq0tool-2.1.8-py3-none-any.whl.
File metadata
- Download URL: tfq0tool-2.1.8-py3-none-any.whl
- Upload date:
- Size: 17.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | af87642aa04537e8c40e7809690d065b2d2e88e0826ad8dd0b2662c7ee31f6e3
MD5 | 6205d5464e4353ee326ad407b21831e0
BLAKE2b-256 | 3b2de4cd57435d7232e83b7ad5466054b6633f72f660910f732a431ad4064fd6
Provenance
The following attestation bundles were made for tfq0tool-2.1.8-py3-none-any.whl:
Publisher: tfq0tool-publish.yml on TFQ0/tfq0tool
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: tfq0tool-2.1.8-py3-none-any.whl
  - Subject digest: af87642aa04537e8c40e7809690d065b2d2e88e0826ad8dd0b2662c7ee31f6e3
  - Sigstore transparency entry: 226960292
  - Sigstore integration time:
  - Permalink: TFQ0/tfq0tool@9f5ce36cd1b688d13352708a3905e0641b7377fb
  - Branch / Tag: refs/tags/v2.1.8
  - Owner: https://github.com/TFQ0
  - Access: public
  - Token Issuer: https://token.actions.githubusercontent.com
  - Runner Environment: github-hosted
  - Publication workflow: tfq0tool-publish.yml@9f5ce36cd1b688d13352708a3905e0641b7377fb
  - Trigger Event: release