
Toolkit for archive extraction, OCR parsing, and file text extraction


GoblinTools

GoblinTools is a Python library for text extraction, archive handling, OCR integration, and text cleaning. It reads the document, spreadsheet, presentation, and archive formats listed under Supported Formats, and runs OCR either locally with Tesseract or in the cloud with AWS Textract.


Installation

pip install goblintools

Note:

  • GoblinTools requires Python 3.7 or newer.
  • For OCR support, you must install Tesseract OCR separately [https://github.com/tesseract-ocr/tesseract].
  • For AWS Textract support, valid AWS credentials are required.

System Requirements

Some archive formats, such as .rar and .7z, can only be extracted with external system tools. These tools are not Python packages and must be installed separately. patoolib, which GoblinTools uses for archive handling, calls them at extraction time.

Debian/Ubuntu

sudo apt install unrar p7zip-full p7zip-rar

Arch Linux

sudo pacman -S unrar p7zip

macOS (Homebrew)

brew install unrar p7zip
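
Before running GoblinTools it can help to confirm that the prerequisites from the Installation note and this section are actually present. The following is a minimal sketch that uses only the standard library. The executable names (tesseract, unrar, 7z) are assumptions about what the packages above install, and check_environment is a hypothetical helper, not part of GoblinTools.

```python
import shutil
import sys

# Assumed executable names; some platforms install them under other names.
REQUIRED_TOOLS = {
    "tesseract": "OCR with Tesseract",
    "unrar": ".rar extraction",
    "7z": ".7z extraction",
}


def check_environment(tools=REQUIRED_TOOLS, min_python=(3, 7)):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or newer is required" % min_python)
    for executable, purpose in tools.items():
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(executable) is None:
            problems.append(f"'{executable}' not found on PATH (needed for {purpose})")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

A missing tool only matters if you use the feature that depends on it, so the sketch reports problems instead of raising.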

Key Features

  • Broad File Support: Extract text from the document, spreadsheet, and presentation formats listed under Supported Formats.
  • Comprehensive Archive Handling: Supports .zip, .rar, .7z, .tar, .gz, and many more.
  • Flexible OCR Integration: Use Tesseract locally or integrate with AWS Textract.
  • Text Cleaning: Accent removal, lowercasing, and stopword filtering for Portuguese.
  • Efficient Batch Processing: Handle multiple archives in parallel.
  • Robust File Management: Move, delete, and organize files/directories with ease.

Usage

Text Extraction

from goblintools import TextExtractor
import os

extractor = TextExtractor()
file_path = "example.pdf"

if os.path.exists(file_path):
    text = extractor.extract_from_file(file_path)
    if text:
        print("Successfully extracted text:")
        print(text[:200] + "...")
    else:
        print(f"Could not extract text from {file_path}.")
else:
    print(f"Error: File not found at {file_path}")

With OCR Enabled

extractor_with_ocr = TextExtractor(ocr_handler=True)
scanned_pdf_path = "scanned_document.pdf"

if os.path.exists(scanned_pdf_path):
    scanned_text = extractor_with_ocr.extract_from_file(scanned_pdf_path)
    if scanned_text:
        print("\nSuccessfully extracted text from scanned document (with OCR):")
        print(scanned_text[:200] + "...")
    else:
        print(f"Could not extract text from {scanned_pdf_path}, even with OCR enabled.")
else:
    print(f"\nSkipping scanned document example: File not found at {scanned_pdf_path}")

Extracting All Text from a Folder

folder_path = "/path/to/your/folder"
text_from_folder = extractor.extract_from_folder(folder_path)
print(f"\nExtracted text from folder: {text_from_folder[:500]}...")

File Management & Archive Extraction

from goblintools import FileManager
import os

output_folder = "extracted_content"
os.makedirs(output_folder, exist_ok=True)

# Recursive extraction
FileManager.extract_files_recursive("archive.zip", output_folder)

# Batch extraction
FileManager.batch_extract(["a.zip", "b.rar"], output_folder)
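
To make "recursive extraction" concrete, here is a rough standard-library sketch of the idea for .zip files only: unpack an archive, then unpack any archives found inside it. This is not how GoblinTools implements FileManager.extract_files_recursive; the function name extract_zip_recursive and the choice to delete nested archives after unpacking them are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def extract_zip_recursive(archive_path, output_dir):
    """Extract a .zip file, then extract any .zip files found inside it.

    Each nested archive is unpacked into a folder named after it and then
    deleted, so only regular files remain in output_dir.
    """
    source = Path(archive_path).resolve()
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(source) as archive:
        archive.extractall(output_dir)

    # Take a snapshot first: the loop changes the tree it is walking.
    for nested in list(output_dir.rglob("*.zip")):
        if nested.resolve() == source:
            continue  # never unpack or delete the archive we started from
        target = nested.with_suffix("")  # "inner.zip" -> "inner"
        extract_zip_recursive(nested, target)
        nested.unlink()
```

Archives from untrusted sources can expand to very large sizes, so a production version would likely cap the nesting depth and the total extracted size.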

Text Cleaning

from goblintools import TextCleaner

cleaner = TextCleaner()
raw_text = "Isso é um Teste com Acentos. E algumas palavras, como 'e', 'a', 'o', são stopwords."

print(f"Original text: {raw_text}")

clean_text_basic = cleaner.clean_text(raw_text)
print(f"Basic cleaning: {clean_text_basic}")

clean_text_full = cleaner.clean_text(raw_text, lowercase=True, remove_stopwords=True)
print(f"Full cleaning: {clean_text_full}")
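
For readers curious what these cleaning steps involve, the sketch below reproduces them with the standard library alone. It is not TextCleaner's implementation: the stopword set is a tiny hypothetical sample, and the function names strip_accents and clean are made up for this example. Only the parameter names lowercase and remove_stopwords mirror the call above.

```python
import unicodedata

# A tiny, hypothetical stopword sample for illustration only.
STOPWORDS_PT = {"a", "e", "o", "um", "com", "como", "isso"}


def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean(text, lowercase=False, remove_stopwords=False):
    """Strip accents, then optionally lowercase and drop stopwords."""
    text = strip_accents(text)
    if lowercase:
        text = text.lower()
    if remove_stopwords:
        # Splitting on whitespace also collapses repeated spaces.
        words = text.split()
        text = " ".join(w for w in words if w.lower() not in STOPWORDS_PT)
    return text
```

Accents are removed before stopword matching, which is why the stopword set is written without accents.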

OCR with AWS Textract

from goblintools import TextExtractor

extractor = TextExtractor(
    ocr_handler=True,
    use_aws=True,
    aws_access_key="your-aws-access-key-here",
    aws_secret_key="your-aws-secret-key-here",
    aws_region="us-east-1"
)

# Example:
text = extractor.extract_from_file("aws_scanned_document.pdf")
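
Hardcoding credentials, as in the placeholder strings above, is easy to leak into version control. One option is to read them from the standard AWS environment variables and pass them through. The sketch below assumes only the TextExtractor keyword arguments shown in the example; textract_kwargs is a hypothetical helper, not part of GoblinTools.

```python
import os


def textract_kwargs(env=os.environ):
    """Build TextExtractor keyword arguments from AWS environment variables.

    Raises KeyError if either credential variable is missing.
    """
    return {
        "ocr_handler": True,
        "use_aws": True,
        "aws_access_key": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_key": env["AWS_SECRET_ACCESS_KEY"],
        # Fall back to the region used in the example above.
        "aws_region": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }


# Usage (requires goblintools and real credentials in the environment):
# from goblintools import TextExtractor
# extractor = TextExtractor(**textract_kwargs())
```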

Supported Formats

Documents

.pdf, .doc, .docx, .odt, .rtf, .txt, .csv, .xml, .html

Spreadsheets

.xlsx, .xls, .ods, .dbf

Presentations

.pptx

Archives

.zip, .rar, .7z, .tar, .gz, .bz2, .iso, .deb, .rpm, .jar, .war, .ear, .cbz, .cbr, .cb7, .tgz, .txz, .cbt, .udf, .ace, .cba, .arj, .cab, .chm, .cpio, .dms, .lha, .lzh, .lzma, .lzo, .xz, .zst, .zoo, .adf, .alz, .arc, .shn, .rz, .lrz, .a, .Z


File Management Utilities

Move a File

from goblintools import FileManager

source = "path/to/source.txt"
destination = "path/to/destination.txt"

FileManager.move_file(source, destination)

Delete a Folder and Its Contents

FileManager.delete_folder("temp_folder")

Delete a File if It's Empty

FileManager.delete_if_empty("empty_file.txt")

Normalize and Move All Files in a Directory

Moves every file found in subfolders up to the root folder, renames files where needed to avoid name conflicts, and removes the subfolders left empty.

FileManager.move_files("path/to/root_folder")
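
The behavior described above can be approximated with pathlib and shutil. The following sketch shows one possible approach, not the code behind FileManager.move_files. The function name flatten_folder and the _1, _2 renaming scheme are assumptions, since the source does not say how conflicts are resolved.

```python
import shutil
from pathlib import Path


def flatten_folder(root):
    """Move every file under root into root itself, then remove empty subfolders."""
    root = Path(root)
    # Snapshot the files below the top level before moving anything.
    nested_files = [p for p in root.rglob("*") if p.is_file() and p.parent != root]
    for path in nested_files:
        target = root / path.name
        counter = 1
        # On a name clash, append _1, _2, ... before the extension.
        while target.exists():
            target = root / f"{path.stem}_{counter}{path.suffix}"
            counter += 1
        shutil.move(str(path), str(target))

    # Visit the deepest folders first so parents are empty when we reach them.
    subfolders = sorted(
        (p for p in root.rglob("*") if p.is_dir()),
        key=lambda p: len(p.parts),
        reverse=True,
    )
    for folder in subfolders:
        if not any(folder.iterdir()):
            folder.rmdir()
```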

Text Extraction Utilities

Check Whether a PDF Needs OCR

if extractor.pdf_needs_ocr("scanned_document.pdf"):
    print("Needs OCR!")
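
A common use of this check is to reserve the slower OCR path for the files that need it. The helper below is a sketch under two assumptions: that pdf_needs_ocr returns a truthy value for scanned PDFs, as the snippet above suggests, and that both extractors expose extract_from_file as in the Usage section. The name extract_text is hypothetical.

```python
def extract_text(path, plain_extractor, ocr_extractor):
    """Use the OCR-enabled extractor only for PDFs that report needing OCR."""
    if path.lower().endswith(".pdf") and plain_extractor.pdf_needs_ocr(path):
        return ocr_extractor.extract_from_file(path)
    return plain_extractor.extract_from_file(path)


# Usage (requires goblintools):
# from goblintools import TextExtractor
# text = extract_text("report.pdf", TextExtractor(), TextExtractor(ocr_handler=True))
```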

License

MIT License
