Toolkit for archive extraction, OCR parsing, and file text extraction
GoblinTools
GoblinTools is a Python library designed for text extraction, archive handling, OCR integration, and text cleaning. It supports a wide range of file formats and offers both local and cloud-based OCR options.
Installation
pip install goblintools
Requirements
- Python: 3.7 or newer
- Tesseract OCR: Required for local OCR support (Installation Guide)
- Portuguese Language Pack: Install tesseract-ocr-por for Portuguese text recognition
- AWS Credentials: Required for AWS Textract cloud OCR
System Dependencies
Archive Support
For complete archive format support, install these system tools (required by patoolib):
| OS | Command |
|---|---|
| Debian/Ubuntu | sudo apt install unrar p7zip-full p7zip-rar |
| Arch Linux | sudo pacman -S unrar p7zip |
| macOS | brew install unrar p7zip |
Tesseract OCR with Portuguese Support
| OS | Command |
|---|---|
| Debian/Ubuntu | sudo apt install tesseract-ocr tesseract-ocr-por |
| Arch Linux | sudo pacman -S tesseract tesseract-data-por |
| macOS | brew install tesseract tesseract-lang |
| Windows | Download from UB Mannheim and select Portuguese during installation |
Key Features
- 📄 Broad File Support: Extract text from 20+ document, spreadsheet, and presentation formats
- 📦 Archive Handling: Supports .zip, .rar, .7z, .tar, .gz, and 30+ more formats
- 🔍 OCR Integration: Local Tesseract or cloud AWS Textract support
- 🧹 Text Cleaning: Accent removal, case normalization, stopword filtering (Brazilian Portuguese support)
- 🇧🇷 Portuguese OCR: Optimized for Brazilian Portuguese documents with Tesseract
- ⚡ Batch Processing: Parallel archive extraction
- 📁 File Management: Comprehensive file/directory operations
- 📋 Metadata Extraction: Extract text with structured metadata including file names and page information
Quick Start
Basic Text Extraction
from goblintools import TextExtractor
extractor = TextExtractor()
text = extractor.extract_from_file("document.pdf")
print(text[:200] + "..." if text else "No text extracted")
OCR-Enabled Extraction
# Local OCR with Tesseract
extractor = TextExtractor(ocr_handler=True)
text = extractor.extract_from_file("scanned_document.pdf")
# AWS Textract OCR
extractor = TextExtractor(
    ocr_handler=True,
    use_aws=True,
    aws_access_key="your-key",
    aws_secret_key="your-secret",
    aws_region="us-east-1"
)
text = extractor.extract_from_file("document.pdf")
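Hard-coded keys like the ones above are fine for a demo but easy to leak in shared code. The following is a minimal sketch, assuming the credentials live in the conventional AWS environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION). The helper name aws_ocr_kwargs is hypothetical and not part of GoblinTools; it only assembles the keyword arguments already shown in this section.

```python
import os
from typing import Mapping, Optional


def aws_ocr_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build TextExtractor keyword arguments from environment variables.

    Hypothetical helper, not part of GoblinTools. The keyword names mirror
    the constructor call shown in this section.
    """
    env = os.environ if env is None else env
    access_key = env.get("AWS_ACCESS_KEY_ID")
    secret_key = env.get("AWS_SECRET_ACCESS_KEY")
    if not access_key or not secret_key:
        # No credentials configured: fall back to local Tesseract OCR.
        return {"ocr_handler": True, "use_aws": False}
    return {
        "ocr_handler": True,
        "use_aws": True,
        "aws_access_key": access_key,
        "aws_secret_key": secret_key,
        "aws_region": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }
```

With GoblinTools installed, the result could then be passed along as TextExtractor(**aws_ocr_kwargs()).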
Configuration Management
from goblintools import GoblinConfig, OCRConfig, TextExtractor
# Create config programmatically
config = GoblinConfig(
    max_file_size=50 * 1024 * 1024,  # 50MB limit
    ocr=OCRConfig(
        use_aws=True,
        aws_access_key="your-key",
        aws_secret_key="your-secret",
        aws_region="us-west-2",
        tesseract_lang="por"  # Portuguese OCR (default)
    )
)
# Use config with extractor
extractor = TextExtractor(ocr_handler=True, config=config)
# Save config to file
config.to_file("goblin_config.json")
# Load config from file
config = GoblinConfig.from_file("goblin_config.json")
extractor = TextExtractor(ocr_handler=True, config=config)
Example config file (goblin_config.json):
{
  "max_file_size": 52428800,
  "ocr": {
    "use_aws": false,
    "aws_access_key": null,
    "aws_secret_key": null,
    "aws_region": "us-east-1",
    "tesseract_lang": "por"
  }
}
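To make the file layout concrete, here is a stdlib-only sketch of how a JSON document with this shape could be parsed into typed objects. The class names Settings and OCRSettings and the function parse_settings are hypothetical stand-ins, not the library's GoblinConfig or OCRConfig; the field names and defaults are taken from this README.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OCRSettings:
    # Field names and defaults follow the example config file.
    use_aws: bool = False
    aws_access_key: Optional[str] = None
    aws_secret_key: Optional[str] = None
    aws_region: str = "us-east-1"
    tesseract_lang: str = "por"

    def languages(self) -> List[str]:
        # Tesseract joins multiple languages with "+", e.g. "por+eng".
        return [code for code in self.tesseract_lang.split("+") if code]


@dataclass
class Settings:
    max_file_size: int = 104857600  # 100 MB
    ocr: OCRSettings = field(default_factory=OCRSettings)


def parse_settings(raw_json: str) -> Settings:
    """Parse a JSON document shaped like goblin_config.json."""
    data = json.loads(raw_json)
    ocr = OCRSettings(**data.get("ocr", {}))
    return Settings(max_file_size=data.get("max_file_size", 104857600), ocr=ocr)


example = '{"max_file_size": 52428800, "ocr": {"tesseract_lang": "por+eng"}}'
settings = parse_settings(example)
print(settings.max_file_size, settings.ocr.languages())
```

Keys missing from the file fall back to the dataclass defaults, so a partial config stays valid in this sketch.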
Supported Tesseract Languages:
- "por" - Portuguese (default)
- "eng" - English
- "spa" - Spanish
- "por+eng" - Portuguese + English (multi-language)
- See Tesseract documentation for more languages
Advanced Features
# Extract from entire folder (respects max_file_size limit)
text = extractor.extract_from_folder("/path/to/documents")
# Check if PDF needs OCR
if extractor.pdf_needs_ocr("document.pdf"):
    print("This PDF requires OCR processing")
# Validate installation
status = extractor.validate_installation()
print(f"Tesseract available: {status['tesseract']}")
if 'aws_textract' in status:
    print(f"AWS Textract available: {status['aws_textract']}")
# Add custom file parser
def custom_parser(file_path):
    # Your custom extraction logic
    return "extracted text"
extractor.add_parser('.custom', custom_parser)
text = extractor.extract_from_file("file.custom")
# Direct OCR processing with config
from goblintools.ocr_parser import OCRProcessor
from goblintools import OCRConfig
ocr_config = OCRConfig(use_aws=True, aws_access_key="key", aws_secret_key="secret")
ocr = OCRProcessor(ocr_config)
text = ocr.extract_text_from_pdf("scanned.pdf")
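The add_parser call above implies some form of extension-based dispatch. The sketch below shows one way such a registry could work; ParserRegistry is a hypothetical stand-in written for illustration and says nothing about how TextExtractor is implemented internally.

```python
from pathlib import Path
from typing import Callable, Dict

Parser = Callable[[str], str]


class ParserRegistry:
    """Hypothetical stand-in showing extension-based parser dispatch."""

    def __init__(self) -> None:
        self._parsers: Dict[str, Parser] = {}

    def add_parser(self, extension: str, parser: Parser) -> None:
        # Normalize so ".TXT", "txt" and ".txt" all map to the same key.
        key = "." + extension.lower().lstrip(".")
        self._parsers[key] = parser

    def extract(self, file_path: str) -> str:
        suffix = Path(file_path).suffix.lower()
        parser = self._parsers.get(suffix)
        if parser is None:
            raise ValueError(f"No parser registered for {suffix!r}")
        return parser(file_path)


registry = ParserRegistry()
registry.add_parser(".custom", lambda path: f"parsed {Path(path).name}")
print(registry.extract("report.CUSTOM"))
```

Normalizing the extension at registration time keeps lookups case-insensitive, which is a design choice of this sketch.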
Metadata Extraction
# Extract text with metadata from single file
result = extractor.extract_from_file("document.pdf", include_metadata=True)
print(f"Text: {result['text'][:100]}...")
print(f"Metadata:\n{result['metadata_markdown']}")
# Extract text with metadata from folder
result = extractor.extract_from_folder("/path/to/documents", include_metadata=True)
print(f"Combined text: {len(result['text'])} characters")
print(f"Structured metadata:\n{result['metadata_markdown']}")
# Example output structure:
# {
# "text": "Complete extracted text from all files...",
# "metadata_markdown": """
# # Extração da Pasta: documents
#
# ## document1.pdf
# ### Página 1
# Content from page 1...
# ### Página 2
# Content from page 2...
#
# ## document2.docx
# ### Página 1
# Content from docx file...
# """
# }
Metadata Features:
- File-level organization: Each document is clearly identified
- Page-by-page breakdown: PDFs show individual page content
- Markdown format: Structured, readable output with headers
- Combined text: Full text extraction alongside metadata
- Hierarchical structure: Folder → File → Page organization
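The Folder → File → Page hierarchy can be pictured as nested markdown headers. Below is a small sketch that builds such a string from already-extracted page text. The function build_metadata_markdown is hypothetical; the heading labels are copied from the example output above, and the real formatting produced by GoblinTools may differ in detail.

```python
from typing import Dict, List


def build_metadata_markdown(folder: str, files: Dict[str, List[str]]) -> str:
    """Render Folder -> File -> Page as nested markdown headers.

    Hypothetical helper; heading labels copy the README's example output.
    """
    lines = [f"# Extração da Pasta: {folder}", ""]
    for file_name, pages in files.items():
        lines.append(f"## {file_name}")
        for number, page_text in enumerate(pages, start=1):
            lines.append(f"### Página {number}")
            lines.append(page_text)
        lines.append("")  # blank line between files
    return "\n".join(lines).rstrip() + "\n"


pages_by_file = {
    "document1.pdf": ["Content from page 1...", "Content from page 2..."],
    "document2.docx": ["Content from docx file..."],
}
markdown = build_metadata_markdown("documents", pages_by_file)
# The plain combined text is simply every page joined in order.
combined_text = "\n".join(p for pages in pages_by_file.values() for p in pages)
print(markdown)
```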
Archive Extraction
from goblintools import FileManager, FileValidator, ArchiveHandler
# Single archive extraction (handles nested archives)
FileManager.extract_files_recursive("archive.zip", "output_folder")
# Parallel batch extraction
results = FileManager.batch_extract(["file1.zip", "file2.rar"], "output_folder")
print(f"Extraction results: {results}") # [True, False, ...]
# Batch extraction with progress tracking
def progress_callback(current, total):
    print(f"Progress: {current}/{total} ({current/total*100:.1f}%)")
results = FileManager.batch_extract(
    ["file1.zip", "file2.rar", "file3.7z"],
    "output_folder",
    progress_callback=progress_callback
)
# File validation
if FileValidator.is_archive("file.zip"):
    print("File is a supported archive")
if FileValidator.is_empty("file.txt"):
    print("File is empty")
# Direct archive handling
ArchiveHandler.extract("archive.7z", "output")
# Add custom archive format (custom_extract is your own function taking a file path and a destination)
ArchiveHandler.add_format('.custom', lambda f, d: custom_extract(f, d))
# File operations with conflict resolution
FileManager.move_file("source.txt", "destination.txt") # Auto-renames if exists
FileManager.delete_folder("temp_folder")
FileManager.move_files("folder_path") # Flatten + normalize names
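For readers curious how parallel extraction with a progress callback and a list of booleans might fit together, here is a self-contained sketch using only the standard library. It handles .zip files only and uses a thread pool; batch_extract_sketch and extract_one are hypothetical names, and nothing here is taken from the FileManager implementation.

```python
import tempfile
import zipfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, List, Optional


def extract_one(archive: str, output_dir: str) -> bool:
    """Extract a single .zip; report failure as False instead of raising."""
    try:
        target = Path(output_dir) / Path(archive).stem
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        return True
    except (OSError, zipfile.BadZipFile):
        return False


def batch_extract_sketch(
    archives: List[str],
    output_dir: str,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> List[bool]:
    results = [False] * len(archives)
    with ThreadPoolExecutor() as pool:
        # Map each future back to its input position so results keep input order.
        futures = {pool.submit(extract_one, a, output_dir): i
                   for i, a in enumerate(archives)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if progress_callback:
                progress_callback(done, len(archives))
    return results


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.zip"
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("hello.txt", "hi")
    bad = Path(tmp) / "bad.zip"
    bad.write_bytes(b"not a zip")  # deliberately corrupt archive
    seen = []
    results = batch_extract_sketch(
        [str(good), str(bad)], str(Path(tmp) / "out"),
        lambda current, total: seen.append((current, total)),
    )
    extracted = (Path(tmp) / "out" / "good" / "hello.txt").read_text()
    print(results, seen)
```

The callback runs in the calling thread as each job finishes, so it can safely print or update a progress bar.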
Text Cleaning
from goblintools import TextCleaner
# Default Portuguese stopwords
cleaner = TextCleaner()
raw_text = "Isso é um Teste com Acentos!"
# Basic cleaning (remove accents)
clean = cleaner.clean_text(raw_text)
# Output: "Isso e um Teste com Acentos!"
# Full cleaning (lowercase + remove stopwords)
clean = cleaner.clean_text(raw_text, lowercase=True, remove_stopwords=True)
# Output: "teste acentos"
# Custom stopwords
custom_cleaner = TextCleaner(custom_stopwords=['custom', 'words'])
clean = custom_cleaner.remove_stopwords("custom text with words")
# Output: "text with"
# Portuguese text processing example
portuguese_text = "Este é um documento em português com acentuação!"
clean_pt = cleaner.clean_text(portuguese_text, lowercase=True, remove_stopwords=True)
# Output: "documento portugues acentuacao"
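The cleaning steps above (accent removal, lowercasing, stopword filtering) can be approximated with the standard library alone. This sketch swaps in unicodedata normalization in place of unidecode and uses a tiny demo stopword list, so its results will not match TextCleaner exactly; the function names are hypothetical.

```python
import unicodedata
from typing import Iterable


def strip_accents(text: str) -> str:
    # NFKD splits "é" into "e" plus a combining accent; drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_text_sketch(text: str, stopwords: Iterable[str] = (),
                      lowercase: bool = False) -> str:
    text = strip_accents(text)
    if lowercase:
        text = text.lower()
    # Compare stopwords in accent-free, lowercase form.
    blocked = {strip_accents(word).lower() for word in stopwords}
    kept = [w for w in text.split() if w.lower() not in blocked]
    return " ".join(kept)


# Tiny illustrative stopword list, not the library's built-in one.
demo_stopwords = ["isso", "é", "um", "com"]
print(strip_accents("Isso é um Teste com Acentos!"))
print(clean_text_sketch("Isso é um Teste com Acentos!", demo_stopwords, lowercase=True))
```

Unlike the README outputs, this sketch splits on whitespace only, so punctuation stays attached to the neighboring word.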
Brazilian Portuguese Support
GoblinTools is optimized for Brazilian Portuguese users:
from goblintools import TextExtractor, TextCleaner, OCRConfig
# Portuguese OCR configuration
config = OCRConfig(
    tesseract_lang="por",  # Portuguese language
    use_aws=False  # Use local Tesseract
)
# Extract Portuguese documents
extractor = TextExtractor(ocr_handler=True, config=config)
text = extractor.extract_from_file("documento_brasileiro.pdf")
# Clean Portuguese text (removes Portuguese stopwords)
cleaner = TextCleaner() # Uses Portuguese stopwords by default
clean_text = cleaner.clean_text(
    "Este é um texto em português com acentos!",
    lowercase=True,
    remove_stopwords=True
)
print(clean_text) # Output: "texto portugues acentos"
# Multi-language OCR (Portuguese + English)
multi_config = OCRConfig(tesseract_lang="por+eng")
extractor_multi = TextExtractor(ocr_handler=True, config=multi_config)
Portuguese Features:
- Default Portuguese stopwords (400+ words)
- Portuguese Tesseract OCR support
- Accent removal with unidecode
- Brazilian document format support
Supported Formats
Documents
.pdf, .doc, .docx, .odt, .rtf, .txt, .csv, .xml, .html
Spreadsheets
.xlsx, .xls, .ods, .dbf
Presentations
.pptx
Archives
.zip, .rar, .7z, .tar, .gz, .bz2, .iso, .deb, .rpm, .jar, .war, .ear, .cbz, .cbr, .cb7, .tgz, .txz, .cbt, .udf, .ace, .cba, .arj, .cab, .chm, .cpio, .dms, .lha, .lzh, .lzma, .lzo, .xz, .zst, .zoo, .adf, .alz, .arc, .shn, .rz, .lrz, .a, .Z
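When triaging a mixed folder, it can help to sort files into these groups before deciding what to do with them. The sketch below is a hypothetical helper built from the extension lists above; the archive set is a shortened sample rather than the full list, and lowercasing the suffix would need extra care for the case-sensitive .Z entry.

```python
from pathlib import Path

# Extension groups copied from the lists above; the archive set is a
# shortened sample, not the full list.
FORMAT_GROUPS = {
    "document": {".pdf", ".doc", ".docx", ".odt", ".rtf", ".txt", ".csv", ".xml", ".html"},
    "spreadsheet": {".xlsx", ".xls", ".ods", ".dbf"},
    "presentation": {".pptx"},
    "archive": {".zip", ".rar", ".7z", ".tar", ".gz", ".bz2", ".xz", ".tgz"},
}


def classify(file_name: str) -> str:
    """Return the format group for a file name, or "unsupported"."""
    suffix = Path(file_name).suffix.lower()
    for group, extensions in FORMAT_GROUPS.items():
        if suffix in extensions:
            return group
    return "unsupported"


for name in ["Report.PDF", "backup.tar.gz", "slides.pptx", "notes.xyz"]:
    print(name, "->", classify(name))
```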
API Reference
TextExtractor
- __init__(ocr_handler=False, use_aws=False, aws_access_key=None, aws_secret_key=None, aws_region='us-east-1', config=None) - Initialize extractor with OCR options or config
- extract_from_file(file_path, include_metadata=False) - Extract text from single file, optionally with metadata
- extract_from_folder(folder_path, include_metadata=False) - Extract text from all files in folder, optionally with metadata
- pdf_needs_ocr(pdf_path) - Check if PDF requires OCR processing
- add_parser(extension, parser_func) - Add custom parser for file extension
- validate_installation() - Check if dependencies are properly installed
Metadata Extraction:
- When include_metadata=False (default): Returns str with extracted text
- When include_metadata=True: Returns Dict with:
  - "text": Complete extracted text from all files
  - "metadata_markdown": Structured markdown with file names and page information
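Because the return type depends on include_metadata, calling code that accepts either form may want to normalize it first. A minimal sketch, assuming only the two shapes described above; split_result is a hypothetical helper, not part of the library.

```python
from typing import Dict, Tuple, Union

ExtractionResult = Union[str, Dict[str, str]]


def split_result(result: ExtractionResult) -> Tuple[str, str]:
    """Return (text, metadata_markdown) for either return shape.

    Hypothetical helper; the dict keys follow the description above.
    """
    if isinstance(result, dict):
        return result.get("text", ""), result.get("metadata_markdown", "")
    # Plain-string result: there is no metadata to report.
    return result or "", ""


print(split_result("plain text"))
print(split_result({"text": "body", "metadata_markdown": "# header"}))
```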
FileManager
- extract_files_recursive(archive_path, output_path) - Extract archive recursively
- batch_extract(archive_list, output_path, progress_callback=None) - Extract multiple archives with optional progress tracking
- move_file(source, destination) - Move/rename file with conflict resolution and type safety
- delete_folder(folder_path) - Delete folder and contents
- delete_if_empty(file_path) - Delete file if empty
- move_files(folder_path) - Flatten directory structure and normalize filenames
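"Conflict resolution" for move_file is described elsewhere in this README as auto-renaming when the destination exists. The sketch below illustrates one common approach, appending a numeric suffix. The _1, _2 naming scheme and the function move_with_rename are assumptions made for illustration; the README does not state which scheme GoblinTools uses.

```python
import shutil
import tempfile
from pathlib import Path


def move_with_rename(source: str, destination: str) -> Path:
    """Move source to destination, appending _1, _2, ... if the name is taken."""
    base = Path(destination)
    target = base
    counter = 1
    while target.exists():
        # Assumed naming scheme: destination_1.txt, destination_2.txt, ...
        target = base.with_name(f"{base.stem}_{counter}{base.suffix}")
        counter += 1
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(source, str(target))
    return target


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.txt"
    src.write_text("new")
    dst = Path(tmp) / "destination.txt"
    dst.write_text("old")  # pre-existing file that must not be overwritten
    moved_to = move_with_rename(str(src), str(dst))
    final_name = moved_to.name
    old_content = dst.read_text()
    new_content = moved_to.read_text()
    print(final_name)
```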
FileValidator
- is_empty(file_path) - Check if file is empty
- is_archive(file_path) - Check if file is a supported archive format
ArchiveHandler
- extract(file_path, destination) - Extract archive with collision avoidance
- add_format(extension, handler) - Add support for new archive formats
TextCleaner
- __init__(custom_stopwords=None) - Initialize with custom stopwords (defaults to Portuguese)
- clean_text(text, lowercase=False, remove_stopwords=False) - Clean and normalize text
- remove_stopwords(text) - Remove stopwords from text
OCRProcessor
- __init__(config) - Initialize OCR processor with OCRConfig
- extract_text_from_pdf(pdf_path) - Extract text from PDF using OCR
GoblinConfig
- __init__(max_file_size=104857600, ocr=None) - Initialize configuration
- from_file(config_path) - Load configuration from JSON file
- to_file(config_path) - Save configuration to JSON file
- default() - Create default configuration
OCRConfig
- __init__(use_aws=False, aws_access_key=None, aws_secret_key=None, aws_region='us-east-1', tesseract_lang='por') - Initialize OCR configuration
- tesseract_lang: Language for Tesseract OCR ('por' for Portuguese, 'eng' for English, 'por+eng' for both)
License
MIT License