Data extraction and text-extraction tools

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

Mr. Black

A comprehensive text extraction and PII detection toolkit for Python.

Overview

Mr. Black is a Python library for extracting text from a wide range of sources and detecting PII (Personally Identifiable Information) in the extracted content. It provides a unified interface for text extraction from:

- Files (PDF, DOCX, Excel, images, audio, and more)
- URLs and web pages (with JavaScript rendering)
- Screenshots
- Raw text

The library also includes robust PII detection capabilities with customizable regex patterns for various types of sensitive information.

Features

Text Extraction

- Universal Text Extraction: Extract text from almost any document format
- Web Content: Scrape and extract text from websites (with JS rendering)
- OCR Capabilities: Extract text from images and screenshots
- Audio Transcription: Convert audio to text
- Password-Protected Files: Support for extracting from encrypted documents
- Metadata Extraction: Get comprehensive file metadata
- Text Analysis: Summarization, language detection, and basic analytics

PII Detection

- Comprehensive Pattern Library: Detect a wide range of PII types
- Customizable Patterns: Extend with your own regex patterns
- Multiple Input Sources: Scan files, URLs, text, or screen content
- Batch Processing: Process entire directories efficiently
- Rich Output Options: Formatted display or JSON output
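
To make the idea of regex-based detection with custom patterns concrete, here is a minimal standard-library sketch. It does not use Mr. Black and is not its implementation: the pattern table, the `find_matches` helper, and the `employee_id` label are all hypothetical names chosen for illustration.

```python
import re

# Hypothetical patterns for illustration only; NOT Mr. Black's built-in library.
EXAMPLE_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_matches(text, patterns=EXAMPLE_PATTERNS, labels=None):
    """Return {label: [matches]} for each pattern that matches at least once."""
    results = {}
    for label, pattern in patterns.items():
        # Skip labels the caller did not ask for
        if labels is not None and label not in labels:
            continue
        matches = pattern.findall(text)
        if matches:
            results[label] = matches
    return results


sample = "Contact admin@example.com from host 192.168.0.1, badge EMP-004211"
found = find_matches(sample)
print(found)

# Extending the table with a custom (hypothetical) pattern
custom = dict(EXAMPLE_PATTERNS, employee_id=re.compile(r"\bEMP-\d{6}\b"))
found_custom = find_matches(sample, patterns=custom, labels=["employee_id"])
print(found_custom)
```

Keeping patterns in a plain dictionary keyed by label is one simple way to let callers add or filter detectors without touching the matching loop.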

Installation

pip install mrblack

Command Line Interface

Mr. Black provides two command-line utilities: textextract for text extraction and pii for PII detection.

textextract CLI

textextract --help

usage: textextract [-h] [--metadata] [--summarize] [--sentences SENTENCES] [--analyze]
                   [--translate [LANG]] [--output OUTPUT] [--password PASSWORD] [--scrape]
                   [--max-pages MAX_PAGES] [--verbose] [--screenshot] [--chunked] [--no-js]
                   [--list-languages]
                   [source ...]

Extract and analyze text from any file, URL, directory or wildcard pattern

positional arguments:
  source                Path(s) to file(s), URL, directory, or wildcard pattern

options:
  -h, --help            show this help message and exit
  --metadata            Extract metadata instead of text
  --summarize           Summarize the extracted text
  --sentences SENTENCES
                        Number of sentences in summary (default: 5)
  --analyze             Perform text analysis
  --translate [LANG]    Translate text to specified language code (e.g., 'es'), or list available
                        languages if no code provided
  --output OUTPUT       Output file path (default: stdout)
  --password PASSWORD   Password for protected documents
  --scrape              Scrape multiple pages from a website (for URLs only)
  --max-pages MAX_PAGES
                        Maximum pages to scrape when using --scrape (default: 5)
  --verbose, -v         Increase verbosity (can be used multiple times)
  --screenshot          Capture and extract text from screen
  --chunked             Process large files in chunks to reduce memory usage
  --no-js               Disable JavaScript rendering for web pages
  --list-languages      List available translation languages

textextract Examples

# Basic text extraction
textextract document.pdf

# Extract from a URL
textextract https://example.com

# Capture and extract from screen
textextract --screenshot

# Extract and summarize
textextract document.pdf --summarize --sentences 3

# Extract and translate
textextract document.pdf --translate es

# Extract metadata only
textextract document.pdf --metadata

# List available translation languages
textextract --list-languages

# Scrape multiple pages from a website
textextract https://example.com --scrape --max-pages 10

# Process files in chunks (for large files)
textextract large_document.pdf --chunked

# Process all files in a directory
textextract /path/to/documents/

# Process files matching a pattern
textextract "*.pdf"

# Save output to a file
textextract document.pdf --output results.txt
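
For scripted use, the commands above can also be assembled and run from Python. The following is a sketch only: `build_textextract_command` is a hypothetical helper, it covers just a few of the flags listed in the help text, and it assumes `textextract` is on the PATH after installation.

```python
import os
import shutil
import subprocess


def build_textextract_command(source, summarize=False, sentences=None,
                              no_js=False, output=None):
    """Assemble an argument list using only flags shown in `textextract --help`."""
    # The source goes first so that no option with an optional value can absorb it
    cmd = ["textextract", source]
    if summarize:
        cmd.append("--summarize")
        if sentences is not None:
            cmd += ["--sentences", str(sentences)]
    if no_js:
        cmd.append("--no-js")
    if output is not None:
        cmd += ["--output", output]
    return cmd


cmd = build_textextract_command("document.pdf", summarize=True, sentences=3)
print(cmd)

# Run the tool only when it is installed and the input file actually exists
if shutil.which("textextract") and os.path.exists("document.pdf"):
    completed = subprocess.run(cmd, capture_output=True, text=True, check=False)
    print(completed.stdout)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces or wildcard characters.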

pii CLI

pii --help

usage: pii [-h] [--labels [LABEL ...]] [--json] [--serial] [--save SAVE] [path]

Extract PII or patterns from files, dirs, URLs, or screenshots.

positional arguments:
  path                  File, directory, URL, or 'screenshot'

options:
  -h, --help            show this help message and exit
  --labels [LABEL ...]  Labels to extract; no args lists all labels
  --json                Output results as JSON
  --serial              Per-file results for directories
  --save SAVE           Save JSON output to specified file

pii Examples

# Detect PII in a file
pii resume.pdf

# Detect PII from a URL
pii https://example.com

# Detect PII from screen capture
pii screenshot

# List all available PII labels
pii --labels

# Filter for specific PII types
pii document.pdf --labels email phone_number credit_card

# Output results as JSON
pii document.pdf --json

# Save results to a file
pii document.pdf --json --save results.json

# Process an entire directory
pii /path/to/documents/

# Get per-file results for a directory
pii /path/to/documents/ --serial
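
Results saved with --json --save can be post-processed with the standard library. The sketch below assumes the saved file is a JSON object mapping each label to a list of matched strings, mirroring the dictionary shown for extract_pii_text; the actual file layout is not documented here, so treat that as an assumption. `count_by_label` is a hypothetical helper.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_by_label(results):
    """Count how many matches were reported under each label."""
    return Counter({label: len(values) for label, values in results.items()})


# Stand-in for a file written by `pii document.pdf --json --save results.json`.
# The label -> list-of-strings layout is an assumption, not a documented schema.
sample = {
    "email": ["john.doe@example.com"],
    "phone_number": ["(123) 456-7890", "(123) 456-7891"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

counts = count_by_label(loaded)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```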

Mr. Black Library Usage

Basic Text Extraction

from mrblack import extract_text, text_from_url, text_from_screenshot

# Extract text from a file
content = extract_text('document.pdf')
print(content)

# Extract text from a URL
web_text = text_from_url('https://example.com')
print(web_text)

# Capture and extract text from the screen
screen_text = text_from_screenshot()
print(screen_text)

Advanced Text Processing

from mrblack import (
    extract_text,
    summarize_text,
    analyze_text,
    translate_text,
    detect_language,
    extract_metadata
)

# Extract and summarize text
text = extract_text('article.pdf')
summary = summarize_text(text, sentences=5)
print(summary)

# Analyze text content
analysis = analyze_text(text)
print(f"Word count: {analysis['word_count']}")
print(f"Language detected: {analysis['language']}")
print(f"Most common words: {analysis['most_common_words'][:5]}")

# Translate text
translated = translate_text(text, target_lang='es')
print(translated)

# Extract metadata
metadata = extract_metadata('document.docx')
print(metadata)
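
For readers curious what statistics such as word_count and most_common_words involve, here is a standard-library sketch. It is not how analyze_text is implemented; `basic_stats` and its tokenization rule are assumptions made for illustration.

```python
import re
from collections import Counter


def basic_stats(text, top_n=5):
    """Compute a word count and the most frequent words (hypothetical helper)."""
    # Lowercase and keep runs of letters and apostrophes as words
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "word_count": len(words),
        "most_common_words": Counter(words).most_common(top_n),
    }


stats = basic_stats("The cat sat. The cat ran.", top_n=2)
print(stats["word_count"])
print(stats["most_common_words"])
```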

PII Detection

from mrblack import (
    extract_pii_text,
    extract_pii_file,
    extract_pii_url,
    extract_pii_screenshot
)

# Detect PII in raw text
text = "Contact John Doe at john.doe@example.com or (123) 456-7890"
pii = extract_pii_text(text)
print(pii)
# Output: {'email': ['john.doe@example.com'], 'phone_number': ['(123) 456-7890']}

# Detect PII in a file
file_pii = extract_pii_file('resume.pdf')
print(file_pii)

# Detect PII on a website
url_pii = extract_pii_url('https://example.com/contact')
print(url_pii)

# Capture screen and detect PII
screen_pii = extract_pii_screenshot()
print(screen_pii)

Filter By PII Types

from mrblack import extract_pii_text

text = "My SSN is 123-45-6789 and my credit card is 4111-1111-1111-1111"
# Extract only specific PII types
pii = extract_pii_text(text, labels=["social_security", "credit_card"])
print(pii)
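
Regex matches for card numbers can include digit strings that are not real card numbers. One common post-processing step, independent of Mr. Black, is the Luhn checksum. The sketch below assumes results shaped like the dictionary shown earlier; `luhn_valid` and the sample dictionary are hypothetical.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:
        return False  # too short to be a payment card number
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Hypothetical detection results; the second value fails the checksum
results = {"credit_card": ["4111-1111-1111-1111", "1234-5678-9012-3456"]}
verified = [n for n in results["credit_card"] if luhn_valid(n)]
print(verified)
```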

Supported PII Types

Mr. Black can detect numerous types of PII and sensitive information:

Category	PII Types
Personal Identifiers	Email, Phone, Social Security Numbers, Passport Numbers
Financial	Credit Card Numbers, Bank Account Numbers, Routing Numbers, SWIFT Codes
Geographic	Postal/ZIP Codes, Addresses
Technical	IP Addresses (v4/v6), MAC Addresses, UUIDs
Temporal	Dates, Times, Datetime formats
Files/Paths	Windows Paths, Unix Paths
Technology	Protocol names, Programming Languages, File Formats, OS names
Miscellaneous	VIN Numbers, Hex Numbers, Environment Variables

Supported File Formats

Mr. Black supports text extraction from a wide range of file formats:

Category	Supported Formats
Documents	PDF, DOC, DOCX, ODT, RTF, TXT
Spreadsheets	XLS, XLSX, CSV
Presentations	PPT, PPTX
Web	HTML, XML, JSON, YAML
Images	PNG, JPG, JPEG, GIF, TIFF, BMP, WebP
Audio	MP3, WAV, FLAC, AAC, OGG
E-Books	EPUB
Archives	ZIP (with extraction)
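
When pre-sorting a directory before extraction, the table above can be mirrored as a simple lookup. This sketch is not part of Mr. Black; `FORMAT_CATEGORIES` and `category_for` are hypothetical names, and the extensions are copied from the table.

```python
from pathlib import Path

# Extensions taken from the table above; the lookup itself is illustrative
FORMAT_CATEGORIES = {
    "Documents": {"pdf", "doc", "docx", "odt", "rtf", "txt"},
    "Spreadsheets": {"xls", "xlsx", "csv"},
    "Presentations": {"ppt", "pptx"},
    "Web": {"html", "xml", "json", "yaml"},
    "Images": {"png", "jpg", "jpeg", "gif", "tiff", "bmp", "webp"},
    "Audio": {"mp3", "wav", "flac", "aac", "ogg"},
    "E-Books": {"epub"},
    "Archives": {"zip"},
}


def category_for(path):
    """Return the table category for a file path, or None if it is not listed."""
    ext = Path(path).suffix.lower().lstrip(".")
    for category, extensions in FORMAT_CATEGORIES.items():
        if ext in extensions:
            return category
    return None


for name in ["report.PDF", "scan.webp", "notes.md"]:
    print(name, "->", category_for(name))
```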

Project details

Release history

0.2.7

Jun 12, 2025

0.2.6

May 31, 2025

0.2.5

May 18, 2025

0.2.4

May 18, 2025

0.2.3

May 18, 2025

0.2.2

May 18, 2025

0.2.1

May 16, 2025

0.2.0

May 16, 2025

0.1.9 (this version)

May 16, 2025

0.1.8

May 15, 2025

0.1.7

May 15, 2025

0.1.6

May 14, 2025

0.1.5

May 14, 2025

0.1.4

May 13, 2025

0.1.3

May 13, 2025

0.1.2

May 12, 2025

0.1.1

May 12, 2025

Download files

Download the file for your platform.

Source Distribution

mrblack-0.1.9.tar.gz (93.3 kB)

Uploaded May 16, 2025 Source

Built Distribution

mrblack-0.1.9-py3-none-any.whl (37.8 kB)

Uploaded May 16, 2025 Python 3

File details

Details for the file mrblack-0.1.9.tar.gz.

File metadata

Download URL: mrblack-0.1.9.tar.gz
Upload date: May 16, 2025
Size: 93.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for mrblack-0.1.9.tar.gz
Algorithm	Hash digest
SHA256	`dc989decd0200680ad388ae487bac80d16601606155bd383394b7853c31af06c`
MD5	`a2373134d57a92919d5b78063cf347c0`
BLAKE2b-256	`b8d32139f38901b4e54bab06e9cf132ff35d207d4cc9b1396f37d6542e0e7b0b`
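
A downloaded archive can be checked against the SHA256 digest above with Python's hashlib. This is a minimal sketch: `sha256_of` is a hypothetical helper, and the archive is assumed to sit in the current directory under its published file name.

```python
import hashlib
import os

# SHA256 digest published above for mrblack-0.1.9.tar.gz
EXPECTED_SHA256 = "dc989decd0200680ad388ae487bac80d16601606155bd383394b7853c31af06c"


def sha256_of(path, chunk_size=65536):
    """Hash a file in fixed-size chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = "mrblack-0.1.9.tar.gz"  # assumed location of the download
if os.path.exists(archive):
    actual = sha256_of(archive)
    print("OK" if actual == EXPECTED_SHA256 else f"MISMATCH: {actual}")
else:
    print(f"{archive} not found; download it first")
```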


File details

Details for the file mrblack-0.1.9-py3-none-any.whl.

File metadata

Download URL: mrblack-0.1.9-py3-none-any.whl
Upload date: May 16, 2025
Size: 37.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes

Hashes for mrblack-0.1.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6f1bb0100b3617d48a5e6504f698bed28daaeb7197c01b4107ddc7544f566b5`
MD5	`ac43d44da580310b05c04b4a888fb2b2`
BLAKE2b-256	`50f4c689c36a5ebecd1f5329b5d81a89f2ff5ee3547ba21e30e4b14140e7ef94`

