Toolkit for archive extraction, OCR parsing, and file text extraction

Project description

GoblinTools

GoblinTools is a Python library designed for text extraction, archive handling, OCR integration, and text cleaning. It supports a wide range of file formats and offers both local and cloud-based OCR options.

Installation

pip install goblintools

Note:

GoblinTools requires Python 3.7 or newer.
For OCR support, you must install Tesseract OCR separately [https://github.com/tesseract-ocr/tesseract].
For AWS Textract support, valid AWS credentials are required.
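
The prerequisites above can be checked before first use. The following is a minimal sketch using only the standard library. The helper name `check_prerequisites` is hypothetical and not part of GoblinTools, and it assumes the Tesseract executable is called `tesseract` and is on your PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 7), tesseract_cmd="tesseract"):
    """Report whether the interpreter and Tesseract look usable.

    Hypothetical helper, not part of GoblinTools.
    """
    return {
        "python_ok": sys.version_info >= min_python,
        # shutil.which returns None when the executable is not on PATH.
        "tesseract_found": shutil.which(tesseract_cmd) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

A missing Tesseract only matters if you plan to use local OCR; plain text extraction should not need it.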

System Requirements

Extracting some archive formats (.rar, .7z, .tar, and others) depends on external system tools. These tools are not Python packages and must be installed separately; patoolib, which GoblinTools uses, relies on them.

Debian/Ubuntu

sudo apt install unrar p7zip-full p7zip-rar

Arch Linux

sudo pacman -S unrar p7zip

macOS (Homebrew)

brew install unrar p7zip
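
After installing the packages, it can help to confirm that the executables are actually visible to Python. This is a rough sketch, not part of GoblinTools: the function name is hypothetical, and the executable names (`unrar`, `7z`, `7za`, `7zr`) are assumptions based on what the packages above usually provide.

```python
import shutil

# Executable names are assumptions based on the packages installed above;
# adjust them if your distribution names the binaries differently.
ARCHIVE_TOOLS = {
    ".rar": ["unrar"],
    ".7z": ["7z", "7za", "7zr"],
}


def missing_archive_tools(tools=ARCHIVE_TOOLS):
    """Return the extensions for which no candidate executable is on PATH."""
    missing = []
    for extension, candidates in tools.items():
        if not any(shutil.which(name) for name in candidates):
            missing.append(extension)
    return missing


if __name__ == "__main__":
    print("No extractor found for:", missing_archive_tools() or "nothing")
```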

Key Features

Broad File Support: Extract text from over 20 document, spreadsheet, and presentation formats.
Comprehensive Archive Handling: Supports .zip, .rar, .7z, .tar, .gz, and many more.
Flexible OCR Integration: Use Tesseract locally or integrate with AWS Textract.
Advanced Text Cleaning: Accent removal, lowercasing, and intelligent stopword filtering (Portuguese support).
Efficient Batch Processing: Handle multiple archives in parallel.
Robust File Management: Move, delete, and organize files/directories with ease.

Usage

Text Extraction

from goblintools import TextExtractor
import os

extractor = TextExtractor()
file_path = "example.pdf"

if os.path.exists(file_path):
    text = extractor.extract_from_file(file_path)
    if text:
        print("Successfully extracted text:")
        print(text[:200] + "...")
    else:
        print(f"Could not extract text from {file_path}.")
else:
    print(f"Error: File not found at {file_path}")

With OCR Enabled

from goblintools import TextExtractor
import os

# Enable OCR for scanned documents
extractor_with_ocr = TextExtractor(ocr_handler=True)
scanned_pdf_path = "scanned_document.pdf"

if os.path.exists(scanned_pdf_path):
    scanned_text = extractor_with_ocr.extract_from_file(scanned_pdf_path)
    if scanned_text:
        print("\nSuccessfully extracted text from scanned document (with OCR):")
        print(scanned_text[:200] + "...")
    else:
        print(f"Could not extract text from {scanned_pdf_path}, even with OCR enabled.")
else:
    print(f"\nSkipping scanned document example: File not found at {scanned_pdf_path}")

Extracting All Text from a Folder

folder_path = "/path/to/your/folder"
text_from_folder = extractor.extract_from_folder(folder_path)
print(f"\nExtracted text from folder: {text_from_folder[:500]}...")

File Management & Archive Extraction

from goblintools import FileManager
import os

output_folder = "extracted_content"
os.makedirs(output_folder, exist_ok=True)

# Recursive extraction
FileManager.extract_files_recursive("archive.zip", output_folder)

# Batch extraction
FileManager.batch_extract(["a.zip", "b.rar"], output_folder)
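
To illustrate the general idea behind processing several archives in parallel, here is a standard-library sketch. It is not how `FileManager.batch_extract` is implemented (the source does not say), the function names are hypothetical, and `shutil.unpack_archive` only covers formats the standard library understands, such as zip and tar.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_one(archive, output_root):
    """Unpack one archive into its own subfolder and return that folder."""
    archive = Path(archive)
    target = Path(output_root) / archive.stem
    target.mkdir(parents=True, exist_ok=True)
    # shutil.unpack_archive handles formats the standard library knows
    # (zip, tar and its compressed variants), not .rar or .7z.
    shutil.unpack_archive(str(archive), str(target))
    return target


def extract_many(archives, output_root, max_workers=4):
    """Unpack several archives concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda a: extract_one(a, output_root), archives))
```

Giving each archive its own subfolder keeps files with the same name in different archives from overwriting each other.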

Text Cleaning

from goblintools import TextCleaner

cleaner = TextCleaner()
raw_text = "Isso é um Teste com Acentos. E algumas palavras, como 'e', 'a', 'o', são stopwords."

print(f"Original text: {raw_text}")

clean_text_basic = cleaner.clean_text(raw_text)
print(f"Basic cleaning: {clean_text_basic}")

clean_text_full = cleaner.clean_text(raw_text, lowercase=True, remove_stopwords=True)
print(f"Full cleaning: {clean_text_full}")
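
For readers unfamiliar with these cleaning steps, the sketch below shows what accent removal, lowercasing, and stopword filtering mean in practice, using only the standard library. It is an illustration and not the `TextCleaner` implementation; the function names and the tiny stopword list are made up for the example, and punctuation is left untouched.

```python
import unicodedata

# A tiny illustrative stopword list; a real Portuguese list is much longer.
STOPWORDS = {"a", "o", "e", "um", "uma", "de", "com", "como"}


def strip_accents(text):
    """Remove combining marks after Unicode decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_clean(text, lowercase=True, remove_stopwords=True):
    """Strip accents, optionally lowercase, optionally drop stopwords."""
    text = strip_accents(text)
    if lowercase:
        text = text.lower()
    words = text.split()
    if remove_stopwords:
        words = [w for w in words if w.lower() not in STOPWORDS]
    return " ".join(words)


if __name__ == "__main__":
    print(simple_clean("Isso é um Teste com Acentos"))
```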

OCR with AWS Textract

from goblintools import TextExtractor

extractor = TextExtractor(
    ocr_handler=True,
    use_aws=True,
    aws_access_key="your-aws-access-key-here",
    aws_secret_key="your-aws-secret-key-here",
    aws_region="us-east-1"
)

# Example:
text = extractor.extract_from_file("aws_scanned_document.pdf")
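
Hard-coding credentials in source files is easy to leak, so one option is to read them from the environment. The sketch below builds the same keyword arguments used above from the conventional AWS environment variables. The helper name `textract_kwargs_from_env` is hypothetical and not part of GoblinTools, and falling back to `us-east-1` simply mirrors the example above.

```python
import os


def textract_kwargs_from_env(env=None):
    """Build the keyword arguments shown above from environment variables.

    The variable names follow the usual AWS convention; the function name
    is hypothetical and not part of GoblinTools.
    """
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "ocr_handler": True,
        "use_aws": True,
        "aws_access_key": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_key": env["AWS_SECRET_ACCESS_KEY"],
        "aws_region": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }


# Usage (requires goblintools to be installed):
# from goblintools import TextExtractor
# extractor = TextExtractor(**textract_kwargs_from_env())
```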

Supported Formats

Documents

.pdf, .doc, .docx, .odt, .rtf, .txt, .csv, .xml, .html

Spreadsheets

.xlsx, .xls, .ods, .dbf

Presentations

.pptx
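
When processing mixed folders, it can be useful to filter files by extension before calling the extractor. The sketch below copies the extensions listed above into a lookup table. The names `SUPPORTED_EXTENSIONS` and `format_category` are hypothetical and not part of GoblinTools, and the library itself may accept more formats than this list.

```python
from pathlib import Path

# Extensions copied from the lists above.
SUPPORTED_EXTENSIONS = {
    "documents": {".pdf", ".doc", ".docx", ".odt", ".rtf", ".txt", ".csv", ".xml", ".html"},
    "spreadsheets": {".xlsx", ".xls", ".ods", ".dbf"},
    "presentations": {".pptx"},
}


def format_category(path):
    """Return the category for a path's extension, or None if unlisted."""
    suffix = Path(path).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```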

File Management Utilities

Move a File

from goblintools import FileManager

source = "path/to/source.txt"
destination = "path/to/destination.txt"

FileManager.move_file(source, destination)

Delete a Folder and Its Contents

FileManager.delete_folder("temp_folder")

Delete a File if It's Empty

FileManager.delete_if_empty("empty_file.txt")

Normalize and Move All Files in a Directory

Moves all files to the root of the folder, renames to avoid conflicts, and removes empty subfolders.

FileManager.move_files("path/to/root_folder")
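
To make the behaviour described above concrete, here is a standard-library sketch of flattening a folder. It is not the `FileManager.move_files` implementation: the function name is hypothetical, and the `_1`, `_2` renaming scheme is an assumption, since the source does not say how conflicting names are resolved.

```python
import os
import shutil
from pathlib import Path


def flatten_folder(root):
    """Move every nested file into root, then remove empty subfolders."""
    root = Path(root)
    # Collect first so that moving files does not disturb the walk.
    nested = [p for p in root.rglob("*") if p.is_file() and p.parent != root]
    for src in nested:
        dest = root / src.name
        counter = 1
        # Append _1, _2, ... until the name is free.
        while dest.exists():
            dest = root / f"{src.stem}_{counter}{src.suffix}"
            counter += 1
        shutil.move(str(src), str(dest))
    # Walk bottom-up so children are removed before their parents.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        path = Path(dirpath)
        if path != root and not any(path.iterdir()):
            path.rmdir()
```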

Text Extraction Utilities

Check Whether a PDF Needs OCR

if extractor.pdf_needs_ocr("scanned_document.pdf"):
    print("Needs OCR!")
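
One way to use this check is to send only scanned PDFs through the slower OCR path. The sketch below assumes two `TextExtractor`-like objects, one plain and one created with `ocr_handler=True`, and assumes that `pdf_needs_ocr` is available on an extractor created without OCR. The helper name `extract_pdf_text` is hypothetical.

```python
def extract_pdf_text(path, plain_extractor, ocr_extractor):
    """Use the OCR-enabled extractor only when the PDF appears to need it.

    Both arguments are expected to be TextExtractor-like objects exposing
    pdf_needs_ocr() and extract_from_file(), as used elsewhere in this README.
    """
    if plain_extractor.pdf_needs_ocr(path):
        return ocr_extractor.extract_from_file(path)
    return plain_extractor.extract_from_file(path)


# Usage (requires goblintools to be installed):
# from goblintools import TextExtractor
# text = extract_pdf_text(
#     "scanned_document.pdf",
#     TextExtractor(),
#     TextExtractor(ocr_handler=True),
# )
```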

License

MIT License

Project details

Release history

0.7.7: Apr 28, 2026
0.7.6: Apr 22, 2026
0.7.3: Apr 14, 2026
0.7.2: Apr 14, 2026
0.7.1: Mar 26, 2026
0.7.0: Mar 26, 2026
0.6.4: Mar 6, 2026
0.6.3: Mar 6, 2026
0.6.2: Mar 6, 2026
0.6.1: Feb 9, 2026
0.6.0: Nov 12, 2025
0.5.0: Sep 26, 2025
0.4.0: Sep 24, 2025
0.3.0: Sep 23, 2025
0.2.0: Jun 27, 2025
0.1.9: Jun 26, 2025 (this version)
0.1.8: Jun 24, 2025
0.1.7: Jun 23, 2025
0.1.6: Jun 23, 2025
0.1.5: Jun 23, 2025
0.1.4: Jun 23, 2025
0.1.3: Jun 23, 2025
0.1.2: Jun 17, 2025
0.1.1: Jun 17, 2025
0.1.0: Jun 16, 2025

Download files

Source Distribution

goblintools-0.1.9.tar.gz (14.7 kB)

Uploaded Jun 26, 2025 Source

Built Distribution

goblintools-0.1.9-py3-none-any.whl (13.6 kB)

Uploaded Jun 26, 2025 Python 3

File details

Details for the file goblintools-0.1.9.tar.gz.

File metadata

Download URL: goblintools-0.1.9.tar.gz
Upload date: Jun 26, 2025
Size: 14.7 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for goblintools-0.1.9.tar.gz
Algorithm	Hash digest
SHA256	`ee336f81789a3709c1a5713bb66e7f57df4f22890ed6edc04cda9b7ee2e6d6be`
MD5	`d2a7cde926a36843f41eae19243dbbeb`
BLAKE2b-256	`dc6a33c68d06319fe40665da7c6f1f7d07fc1f9db71a4dcc0b15babd3e04cfa5`
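
A downloaded file can be checked against the published digest with the standard library. This is a minimal sketch: the function names are hypothetical, and the local file path in the usage comment is an assumption about where the download was saved.

```python
import hashlib

# SHA256 digest published above for goblintools-0.1.9.tar.gz.
EXPECTED_SHA256 = "ee336f81789a3709c1a5713bb66e7f57df4f22890ed6edc04cda9b7ee2e6d6be"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 equals the expected digest."""
    return sha256_of(path) == expected.lower()


# Usage, assuming the sdist was saved in the current directory:
# print(matches_published_hash("goblintools-0.1.9.tar.gz"))
```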

File details

Details for the file goblintools-0.1.9-py3-none-any.whl.

File metadata

Download URL: goblintools-0.1.9-py3-none-any.whl
Upload date: Jun 26, 2025
Size: 13.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for goblintools-0.1.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d67bc8492e8fbd305c1382219275c790b71a5e8c7f0609e4cde0b75e332411ab`
MD5	`27e7f687252f36d992cd2b364110cc6d`
BLAKE2b-256	`7d31638986f339746c1090217fbbeaadec6bc56dac0d72beedff3361ab4ed2e0`
