Toolkit for archive extraction, OCR parsing, and file text extraction

GoblinTools

GoblinTools is a Python library designed for text extraction, archive handling, OCR integration, and text cleaning. It supports a wide range of file formats and offers both local and cloud-based OCR options.

Installation

pip install goblintools

Requirements

Python: 3.7 or newer
Tesseract OCR: Required for local OCR support (Installation Guide)
- Portuguese Language Pack: Install tesseract-ocr-por for Portuguese text recognition
AWS Credentials: Required for AWS Textract cloud OCR

System Dependencies

Archive Support

For complete archive format support, install these system tools (required by patoolib):

OS	Command
Debian/Ubuntu	`sudo apt install unrar p7zip-full p7zip-rar`
Arch Linux	`sudo pacman -S unrar p7zip`
macOS	`brew install unrar p7zip`

Tesseract OCR with Portuguese Support

OS	Command
Debian/Ubuntu	`sudo apt install tesseract-ocr tesseract-ocr-por`
Arch Linux	`sudo pacman -S tesseract tesseract-data-por`
macOS	`brew install tesseract tesseract-lang`
Windows	Download from UB Mannheim and select Portuguese during installation
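
Before enabling OCR or archive extraction, it can help to confirm that the external tools are actually on PATH. The following is a minimal sketch using only the standard library. The executable names (`tesseract`, `unrar`, `7z`) are assumptions based on the packages listed above and may differ on your system (for example `7za` or `7zz`); the helper names are hypothetical and not part of GoblinTools.

```python
import shutil

# Executable names are assumptions; adjust them to match your platform.
REQUIRED_TOOLS = ("tesseract", "unrar", "7z")


def find_tools(names=REQUIRED_TOOLS):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names=REQUIRED_TOOLS):
    """Return the tool names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

This only checks that the binaries exist; it does not verify that a language pack such as Portuguese is installed.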

Key Features

📄 Broad File Support: Extract text from 20+ document, spreadsheet, and presentation formats
📦 Archive Handling: Supports .zip, .rar, .7z, .tar, .gz, and 30+ more formats
🔍 OCR Integration: Local Tesseract or cloud AWS Textract support
🧹 Text Cleaning: Accent removal, case normalization, stopword filtering (Brazilian Portuguese support)
🇧🇷 Portuguese OCR: Optimized for Brazilian Portuguese documents with Tesseract
⚡ Batch Processing: Parallel archive extraction
📁 File Management: Comprehensive file/directory operations
📋 Metadata Extraction: Extract text with structured metadata including file names and page information

Quick Start

Basic Text Extraction

from goblintools import TextExtractor

extractor = TextExtractor()
text = extractor.extract_from_file("document.pdf")
print((text[:200] + "...") if text else "No text extracted")

OCR-Enabled Extraction

# Local OCR with Tesseract
extractor = TextExtractor(ocr_handler=True)
text = extractor.extract_from_file("scanned_document.pdf")

# AWS Textract OCR
extractor = TextExtractor(
    ocr_handler=True,
    use_aws=True,
    aws_access_key="your-key",
    aws_secret_key="your-secret",
    aws_region="us-east-1"
)
text = extractor.extract_from_file("document.pdf")
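
Hard-coding credentials as in the snippet above is fine for a quick test, but in practice they are usually read from the environment. The sketch below builds the keyword arguments shown above from the conventional AWS variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`). The helper `aws_ocr_kwargs` is hypothetical and not part of GoblinTools.

```python
import os


def aws_ocr_kwargs(env=None):
    """Build keyword arguments for TextExtractor from environment variables.

    Raises KeyError if one of the two credential variables is missing.
    """
    env = os.environ if env is None else env
    return {
        "ocr_handler": True,
        "use_aws": True,
        "aws_access_key": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_key": env["AWS_SECRET_ACCESS_KEY"],
        # Fall back to the default region documented for TextExtractor.
        "aws_region": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }


# Usage (requires goblintools and real credentials in the environment):
#   from goblintools import TextExtractor
#   extractor = TextExtractor(**aws_ocr_kwargs())
```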

Configuration Management

from goblintools import GoblinConfig, OCRConfig, TextExtractor

# Create config programmatically
config = GoblinConfig(
    max_file_size=50 * 1024 * 1024,  # 50MB limit
    ocr=OCRConfig(
        use_aws=True,
        aws_access_key="your-key",
        aws_secret_key="your-secret",
        aws_region="us-west-2",
        tesseract_lang="por"  # Portuguese OCR (default)
    )
)

# Use config with extractor
extractor = TextExtractor(ocr_handler=True, config=config)

# Save config to file
config.to_file("goblin_config.json")

# Load config from file
config = GoblinConfig.from_file("goblin_config.json")
extractor = TextExtractor(ocr_handler=True, config=config)

Example config file (goblin_config.json):

{
  "max_file_size": 52428800,
  "ocr": {
    "use_aws": false,
    "aws_access_key": null,
    "aws_secret_key": null,
    "aws_region": "us-east-1",
    "tesseract_lang": "por"
  }
}
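
A hand-edited config file is easy to get wrong, so a quick sanity check before loading it can save a confusing failure later. The sketch below inspects a parsed JSON dictionary with the same shape as the example file. The specific rules (positive size, credentials present when `use_aws` is true) are assumptions made for illustration, and `check_config` is a hypothetical helper, not library behaviour.

```python
import json


def check_config(data):
    """Return a list of problems found in a parsed config dictionary."""
    problems = []
    size = data.get("max_file_size")
    # bool is a subclass of int, so it is excluded explicitly.
    if not isinstance(size, int) or isinstance(size, bool) or size <= 0:
        problems.append("max_file_size must be a positive integer (bytes)")
    ocr = data.get("ocr", {})
    if not isinstance(ocr, dict):
        return problems + ["ocr must be an object"]
    if ocr.get("use_aws") and not (ocr.get("aws_access_key") and ocr.get("aws_secret_key")):
        problems.append("use_aws is true but AWS credentials are missing")
    if not isinstance(ocr.get("tesseract_lang", "por"), str):
        problems.append("tesseract_lang must be a string such as 'por' or 'por+eng'")
    return problems


example = json.loads("""
{
  "max_file_size": 52428800,
  "ocr": {"use_aws": false, "aws_access_key": null, "aws_secret_key": null,
          "aws_region": "us-east-1", "tesseract_lang": "por"}
}
""")
print(check_config(example))  # prints [] for the example file
```

Note that 52428800 is 50 * 1024 * 1024, the same 50 MB limit used in the programmatic example.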

Supported Tesseract Languages:

"por" - Portuguese (default)
"eng" - English
"spa" - Spanish
"por+eng" - Portuguese + English (multi-language)
See Tesseract documentation for more languages
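
A language code only works if the matching Tesseract data file is installed. The sketch below checks a `tesseract_lang` value such as "por+eng" against the output of `tesseract --list-langs`. All three helper names are hypothetical, and the header-line handling assumes the usual "List of available languages ..." first line.

```python
import subprocess


def parse_list_langs(output):
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as 'List of available languages ... (3):'.
    return {line for line in lines if not line.lower().startswith("list of")}


def missing_languages(tesseract_lang, installed):
    """Return the requested codes (split on '+') that are not installed."""
    requested = [code for code in tesseract_lang.split("+") if code]
    return [code for code in requested if code not in installed]


def installed_languages():
    """Ask the local tesseract binary for its languages (needs Tesseract on PATH)."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some versions may print the list to stderr, so both streams are parsed.
    return parse_list_langs(result.stdout + "\n" + result.stderr)
```

For example, `missing_languages("por+eng", installed_languages())` returns an empty list when both language packs are present.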

Advanced Features

# Extract from entire folder (respects max_file_size limit)
text = extractor.extract_from_folder("/path/to/documents")

# Check if PDF needs OCR
if extractor.pdf_needs_ocr("document.pdf"):
    print("This PDF requires OCR processing")

# Validate installation
status = extractor.validate_installation()
print(f"Tesseract available: {status['tesseract']}")
if 'aws_textract' in status:
    print(f"AWS Textract available: {status['aws_textract']}")

# Add custom file parser
def custom_parser(file_path):
    # Your custom extraction logic
    return "extracted text"

extractor.add_parser('.custom', custom_parser)
text = extractor.extract_from_file("file.custom")

# Direct OCR processing with config
from goblintools.ocr_parser import OCRProcessor
from goblintools import OCRConfig

ocr_config = OCRConfig(use_aws=True, aws_access_key="key", aws_secret_key="secret")
ocr = OCRProcessor(ocr_config)
text = ocr.extract_text_from_pdf("scanned.pdf")
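
The custom parser above returns a fixed string. As a more concrete illustration, here is a sketch of a parser for `.eml` email files built on the standard library. It follows the same shape as `custom_parser` (takes a file path, returns text); the choice of format and the name `eml_parser` are assumptions for the example, not something GoblinTools ships.

```python
from email import policy
from email.parser import BytesParser


def eml_parser(file_path):
    """Return the subject and plain-text body of an .eml file as one string."""
    with open(file_path, "rb") as fh:
        message = BytesParser(policy=policy.default).parse(fh)
    # Prefer the text/plain part; get_body returns None if there is none.
    body = message.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    subject = message["subject"] or ""
    return f"{subject}\n\n{text}".strip()


# Registration would then look like the example above:
#   extractor.add_parser('.eml', eml_parser)
```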

Metadata Extraction

# Extract text with metadata from single file
result = extractor.extract_from_file("document.pdf", include_metadata=True)
print(f"Text: {result['text'][:100]}...")
print(f"Metadata:\n{result['metadata_markdown']}")

# Extract text with metadata from folder
result = extractor.extract_from_folder("/path/to/documents", include_metadata=True)
print(f"Combined text: {len(result['text'])} characters")
print(f"Structured metadata:\n{result['metadata_markdown']}")

# Example output structure:
# {
#   "text": "Complete extracted text from all files...",
#   "metadata_markdown": """
# # Extração da Pasta: documents
# 
# ## document1.pdf
# ### Página 1
# Content from page 1...
# ### Página 2  
# Content from page 2...
# 
# ## document2.docx
# ### Página 1
# Content from docx file...
#   """
# }

Metadata Features:

File-level organization: Each document is clearly identified
Page-by-page breakdown: PDFs show individual page content
Markdown format: Structured, readable output with headers
Combined text: Full text extraction alongside metadata
Hierarchical structure: Folder → File → Page organization
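
Because the metadata is plain markdown, it can be turned back into a nested structure when per-page access is needed. The sketch below assumes the layout shown in the example output ("## " for file names, "### Página N" for pages); the real output may contain other headers, and `parse_metadata_markdown` is a hypothetical helper.

```python
import re

# Matches page headers such as "### Página 2", tolerating trailing spaces.
PAGE_RE = re.compile(r"^###\s+Página\s+(\d+)\s*$")


def parse_metadata_markdown(markdown):
    """Return {file_name: {page_number: page_text}} from metadata markdown."""
    files = {}
    current_file = None
    current_page = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current_file = line[3:].strip()
            files[current_file] = {}
            current_page = None
            continue
        match = PAGE_RE.match(line)
        if match and current_file is not None:
            current_page = int(match.group(1))
            files[current_file][current_page] = []
            continue
        # Content lines are kept only once a file and a page are known.
        if current_file is not None and current_page is not None:
            files[current_file][current_page].append(line)
    return {
        name: {page: "\n".join(lines).strip() for page, lines in pages.items()}
        for name, pages in files.items()
    }
```

With a result from `extract_from_folder(..., include_metadata=True)`, calling `parse_metadata_markdown(result["metadata_markdown"])` would give the Folder, File, Page hierarchy as dictionaries.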

Archive Extraction

from goblintools import FileManager, FileValidator, ArchiveHandler

# Single archive extraction (handles nested archives)
FileManager.extract_files_recursive("archive.zip", "output_folder")

# Parallel batch extraction
results = FileManager.batch_extract(["file1.zip", "file2.rar"], "output_folder")
print(f"Extraction results: {results}")  # [True, False, ...]

# Batch extraction with progress tracking
def progress_callback(current, total):
    print(f"Progress: {current}/{total} ({current/total*100:.1f}%)")

results = FileManager.batch_extract(
    ["file1.zip", "file2.rar", "file3.7z"], 
    "output_folder",
    progress_callback=progress_callback
)

# File validation
if FileValidator.is_archive("file.zip"):
    print("File is a supported archive")

if FileValidator.is_empty("file.txt"):
    print("File is empty")

# Direct archive handling
ArchiveHandler.extract("archive.7z", "output")

# Add custom archive format
# custom_extract is your own function taking (file_path, destination)
ArchiveHandler.add_format('.custom', lambda f, d: custom_extract(f, d))

# File operations with conflict resolution
FileManager.move_file("source.txt", "destination.txt")  # Auto-renames if exists
FileManager.delete_folder("temp_folder")
FileManager.move_files("folder_path")  # Flatten + normalize names
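
The list returned by batch_extract is just booleans, so it is often useful to pair it back with the input paths. The sketch below assumes the results come back in the same order as the archive list, which the source does not state explicitly; `summarize_batch` is a hypothetical helper and the `results` list is a stand-in for a real call.

```python
def summarize_batch(archives, results):
    """Pair each archive with its success flag and split into two lists."""
    if len(archives) != len(results):
        raise ValueError("archives and results must have the same length")
    succeeded = [path for path, ok in zip(archives, results) if ok]
    failed = [path for path, ok in zip(archives, results) if not ok]
    return succeeded, failed


archives = ["file1.zip", "file2.rar", "file3.7z"]
results = [True, False, True]  # stand-in for FileManager.batch_extract(...)
ok, failed = summarize_batch(archives, results)
print(f"{len(ok)} extracted, {len(failed)} failed: {failed}")
# prints: 2 extracted, 1 failed: ['file2.rar']
```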

Text Cleaning

from goblintools import TextCleaner

# Default Portuguese stopwords
cleaner = TextCleaner()
raw_text = "Isso é um Teste com Acentos!"

# Basic cleaning (remove accents)
clean = cleaner.clean_text(raw_text)
# Output: "Isso e um Teste com Acentos!"

# Full cleaning (lowercase + remove stopwords)
clean = cleaner.clean_text(raw_text, lowercase=True, remove_stopwords=True)
# Output: "teste acentos"

# Custom stopwords
custom_cleaner = TextCleaner(custom_stopwords=['custom', 'words'])
clean = custom_cleaner.remove_stopwords("custom text with words")
# Output: "text with"

# Portuguese text processing example
portuguese_text = "Este é um documento em português com acentuação!"
clean_pt = cleaner.clean_text(portuguese_text, lowercase=True, remove_stopwords=True)
# Output: "documento portugues acentuacao"
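
To make the cleaning steps concrete, here is a standalone sketch of the same pipeline: accent removal, optional lowercasing, optional stopword filtering. It is not the library's implementation. It uses `unicodedata` from the standard library instead of unidecode, a six-word stopword set instead of the full default list, and it drops punctuation by tokenising on word characters, which is an assumption chosen to match the outputs shown above.

```python
import re
import unicodedata

# A tiny illustrative subset; the library's default list is much larger.
STOPWORDS = {"isso", "e", "um", "com", "este", "em"}


def strip_accents(text):
    """Decompose characters and drop the combining marks (é -> e, ç -> c)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean(text, lowercase=False, remove_stopwords=False):
    text = strip_accents(text)
    if lowercase:
        text = text.lower()
    if remove_stopwords:
        # Tokenising on \w+ also discards punctuation such as "!".
        words = re.findall(r"\w+", text)
        text = " ".join(w for w in words if w.lower() not in STOPWORDS)
    return text


print(clean("Isso é um Teste com Acentos!"))
# prints: Isso e um Teste com Acentos!
print(clean("Isso é um Teste com Acentos!", lowercase=True, remove_stopwords=True))
# prints: teste acentos
```

Accents are stripped before the stopword check so that "é" is compared as "e".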

Brazilian Portuguese Support

GoblinTools is optimized for Brazilian Portuguese users:

from goblintools import GoblinConfig, OCRConfig, TextCleaner, TextExtractor

# Portuguese OCR configuration, wrapped in GoblinConfig as in Configuration Management
config = GoblinConfig(
    ocr=OCRConfig(
        tesseract_lang="por",  # Portuguese language
        use_aws=False  # Use local Tesseract
    )
)

# Extract Portuguese documents
extractor = TextExtractor(ocr_handler=True, config=config)
text = extractor.extract_from_file("documento_brasileiro.pdf")

# Clean Portuguese text (removes Portuguese stopwords)
cleaner = TextCleaner()  # Uses Portuguese stopwords by default
clean_text = cleaner.clean_text(
    "Este é um texto em português com acentos!",
    lowercase=True,
    remove_stopwords=True
)
print(clean_text)  # Output: "texto portugues acentos"

# Multi-language OCR (Portuguese + English)
multi_config = GoblinConfig(ocr=OCRConfig(tesseract_lang="por+eng"))
extractor_multi = TextExtractor(ocr_handler=True, config=multi_config)

Portuguese Features:

Default Portuguese stopwords (400+ words)
Portuguese Tesseract OCR support
Accent removal with unidecode
Brazilian document format support

Supported Formats

Documents

.pdf, .doc, .docx, .odt, .rtf, .txt, .csv, .xml, .html

Spreadsheets

.xlsx, .xls, .ods, .dbf

Presentations

.pptx
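
When pre-sorting a folder, it can be handy to know which of the groups above a file falls into. The sketch below encodes the three lists as sets and looks a path up by its lowercased extension. `FORMATS` and `category_for` are hypothetical names, and the lookup ignores any parsers registered later with add_parser.

```python
from pathlib import Path

# Extension lists copied from the Supported Formats section.
FORMATS = {
    "document": {".pdf", ".doc", ".docx", ".odt", ".rtf", ".txt", ".csv", ".xml", ".html"},
    "spreadsheet": {".xlsx", ".xls", ".ods", ".dbf"},
    "presentation": {".pptx"},
}


def category_for(path):
    """Return 'document', 'spreadsheet', 'presentation', or None if unlisted."""
    suffix = Path(path).suffix.lower()
    for category, extensions in FORMATS.items():
        if suffix in extensions:
            return category
    return None
```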

API Reference

TextExtractor

__init__(ocr_handler=False, use_aws=False, aws_access_key=None, aws_secret_key=None, aws_region='us-east-1', config=None) - Initialize extractor with OCR options or config
extract_from_file(file_path, include_metadata=False) - Extract text from single file, optionally with metadata
extract_from_folder(folder_path, include_metadata=False) - Extract text from all files in folder, optionally with metadata
pdf_needs_ocr(pdf_path) - Check if PDF requires OCR processing
add_parser(extension, parser_func) - Add custom parser for file extension
validate_installation() - Check if dependencies are properly installed

Metadata Extraction:

When include_metadata=False (default): Returns str with extracted text
When include_metadata=True: Returns Dict with:
- "text": Complete extracted text from all files
- "metadata_markdown": Structured markdown with file names and page information
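
Since the return type depends on include_metadata, code that accepts either form can normalise it first. A minimal sketch, with `text_of` as a hypothetical helper; treating a missing result as an empty string is an assumption.

```python
def text_of(result):
    """Return the extracted text whether `result` is a str or a metadata dict."""
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        return result.get("text", "")
    # Anything else (for example None) is treated as "no text".
    return ""
```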

FileManager

extract_files_recursive(archive_path, output_path) - Extract archive recursively
batch_extract(archive_list, output_path, progress_callback=None) - Extract multiple archives with optional progress tracking
move_file(source, destination) - Move/rename file with conflict resolution and type safety
delete_folder(folder_path) - Delete folder and contents
delete_if_empty(file_path) - Delete file if empty
move_files(folder_path) - Flatten directory structure and normalize filenames
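
To illustrate what conflict resolution on move can mean, the sketch below picks a free file name by appending a counter. This shows the general idea only; the `_1`, `_2` naming scheme and the helper `unique_destination` are assumptions and may not match how move_file actually renames.

```python
import os


def unique_destination(destination):
    """Return `destination`, or a numbered variant if that path already exists."""
    if not os.path.exists(destination):
        return destination
    root, ext = os.path.splitext(destination)
    counter = 1
    # Try name_1.ext, name_2.ext, ... until an unused path is found.
    while os.path.exists(f"{root}_{counter}{ext}"):
        counter += 1
    return f"{root}_{counter}{ext}"
```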

FileValidator

is_empty(file_path) - Check if file is empty
is_archive(file_path) - Check if file is a supported archive format

ArchiveHandler

extract(file_path, destination) - Extract archive with collision avoidance
add_format(extension, handler) - Add support for new archive formats

TextCleaner

__init__(custom_stopwords=None) - Initialize with custom stopwords (defaults to Portuguese)
clean_text(text, lowercase=False, remove_stopwords=False) - Clean and normalize text
remove_stopwords(text) - Remove stopwords from text

OCRProcessor

__init__(config) - Initialize OCR processor with OCRConfig
extract_text_from_pdf(pdf_path) - Extract text from PDF using OCR

GoblinConfig

__init__(max_file_size=104857600, ocr=None) - Initialize configuration
from_file(config_path) - Load configuration from JSON file
to_file(config_path) - Save configuration to JSON file
default() - Create default configuration
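
The default max_file_size of 104857600 bytes is 100 * 1024 * 1024, that is 100 MB. To preview which files in a folder would fit under a given limit, a sketch like the following can help. `files_within_limit` is a hypothetical helper, and treating the limit as inclusive is an assumption about how the library compares sizes.

```python
import os

DEFAULT_MAX_FILE_SIZE = 100 * 1024 * 1024  # 104857600 bytes, the documented default


def files_within_limit(folder, max_file_size=DEFAULT_MAX_FILE_SIZE):
    """Yield paths under `folder` whose size does not exceed `max_file_size`."""
    for dirpath, _dirnames, filenames in os.walk(folder):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) <= max_file_size:
                yield path
```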

OCRConfig

__init__(use_aws=False, aws_access_key=None, aws_secret_key=None, aws_region='us-east-1', tesseract_lang='por') - Initialize OCR configuration
- tesseract_lang: Language for Tesseract OCR ('por' for Portuguese, 'eng' for English, 'por+eng' for both)

License

MIT License

Project details

Release history

Version	Release date
0.7.7	Apr 28, 2026
0.7.6	Apr 22, 2026
0.7.3	Apr 14, 2026
0.7.2	Apr 14, 2026
0.7.1	Mar 26, 2026
0.7.0	Mar 26, 2026
0.6.4	Mar 6, 2026
0.6.3	Mar 6, 2026
0.6.2	Mar 6, 2026
0.6.1	Feb 9, 2026
0.6.0	Nov 12, 2025
0.5.0	Sep 26, 2025
0.4.0	Sep 24, 2025
0.3.0 (this version)	Sep 23, 2025
0.2.0	Jun 27, 2025
0.1.9	Jun 26, 2025
0.1.8	Jun 24, 2025
0.1.7	Jun 23, 2025
0.1.6	Jun 23, 2025
0.1.5	Jun 23, 2025
0.1.4	Jun 23, 2025
0.1.3	Jun 23, 2025
0.1.2	Jun 17, 2025
0.1.1	Jun 17, 2025
0.1.0	Jun 16, 2025

File details

Details for the file goblintools-0.3.0.tar.gz.

File metadata

Download URL: goblintools-0.3.0.tar.gz
Upload date: Sep 23, 2025
Size: 21.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for goblintools-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`c85c6f190f8612c00796fd6e9c11d2c5aeb30f3dd99dd040009f72b3a376bfe7`
MD5	`abf4c9e84c0475ec5a96c63053191432`
BLAKE2b-256	`a4a986dacf9d227f66b1e8c9535de857258fac43f9bec064073318eaab176a8c`

File details

Details for the file goblintools-0.3.0-py3-none-any.whl.

File metadata

Download URL: goblintools-0.3.0-py3-none-any.whl
Upload date: Sep 23, 2025
Size: 18.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for goblintools-0.3.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`84b850c13d7ae0aa79855350dd3b7d97c5098991e2006adee68d7f5b900d1c8b`
MD5	`bbabff0864835d79bb4b9c8755b63624`
BLAKE2b-256	`3894012d81cda198eed160b9b98398ad880fb457b54f8a3ba81cabb0507576a1`
