Indox Data Extraction

IndoxMiner

IndoxMiner is a Python library that uses Large Language Models (LLMs) to extract structured information from unstructured sources such as text, PDFs, and images. You describe the fields you want in a schema, and the library extracts, validates, and transforms the matching data, which makes it well suited to automating document-processing workflows.

🚀 Key Features

Multi-Format Support: Extract data from text, PDFs, images, and scanned documents
Schema-Based Extraction: Define custom schemas to specify exactly what data to extract
LLM Integration: Seamless integration with OpenAI models for intelligent extraction
Validation & Type Safety: Built-in validation rules and type-safe field definitions
Flexible Output: Export to JSON, pandas DataFrames, or custom formats
Async Support: Built for scalability with asynchronous processing capabilities
OCR Integration: Multiple OCR engine options for image-based text extraction
High-Resolution Support: Enhanced processing for high-quality PDFs
Error Handling: Comprehensive error handling and validation reporting

📦 Installation

pip install indoxminer

🎯 Quick Start

Basic Text Extraction

from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi

# Initialize OpenAI extractor
llm_extractor = OpenAi(
    api_key="your-api-key",
    model="gpt-4o-mini"
)

# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(
            name="product_name",
            description="Product name",
            field_type=FieldType.STRING,
            rules=ValidationRule(min_length=2)
        ),
        Field(
            name="price",
            description="Price in USD",
            field_type=FieldType.FLOAT,
            rules=ValidationRule(min_value=0)
        )
    ]
)

# Create extractor and process text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""

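# Extract and convert to DataFrame.
# extract() is awaited: a bare top-level `await` works in a notebook or the
# asyncio REPL; in a regular script, run it through asyncio.run() instead.
import asyncio

async def main():
    result = await extractor.extract(text)
    return extractor.to_dataframe(result)

df = asyncio.run(main())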
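
Because `extract` is awaited, several inputs can in principle be processed concurrently. The sketch below shows the general asyncio pattern only. `fake_extract` is a hypothetical stand-in that simulates request latency; it is not part of IndoxMiner, and with the real library one would presumably await `extractor.extract(...)` in its place.

```python
import asyncio

async def fake_extract(text: str) -> dict:
    """Hypothetical stand-in for an awaitable extraction call."""
    await asyncio.sleep(0.01)  # simulate the latency of an LLM request
    return {"chars": len(text), "first_line": text.strip().splitlines()[0]}

async def extract_all(texts: list[str]) -> list[dict]:
    # gather() runs the coroutines concurrently and preserves input order
    return await asyncio.gather(*(fake_extract(t) for t in texts))

if __name__ == "__main__":
    texts = ["MacBook Pro 16-inch\nPrice: $2,399.99", "USB-C cable\nPrice: $19.00"]
    for row in asyncio.run(extract_all(texts)):
        print(row)
```

Results come back in the same order as the inputs, so they can be matched to their source documents by position.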

PDF Processing

from indoxminer import (
    DocumentProcessor, ProcessingConfig,
    Extractor, ExtractorSchema, Field, FieldType,
)

# Initialize processor with custom config
processor = DocumentProcessor(
    files=["invoice.pdf"],
    config=ProcessingConfig(
        hi_res_pdf=True,
        chunk_size=1000
    )
)

# Process document
documents = processor.process()

# Extract structured data
schema = ExtractorSchema(
    fields=[
        Field(
            name="bill_to",
            description="Billing address",
            field_type=FieldType.STRING
        ),
        Field(
            name="invoice_date",
            description="Invoice date",
            field_type=FieldType.DATE
        ),
        Field(
            name="total_amount",
            description="Total amount in USD",
            field_type=FieldType.FLOAT
        )
    ]
)

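# Build an extractor for the invoice schema (reusing llm_extractor from the
# Basic Text Extraction example) and run it on the processed documents
extractor = Extractor(llm=llm_extractor, schema=schema)
results = await extractor.extract(documents)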

Image Processing with OCR

from indoxminer import DocumentProcessor, ProcessingConfig

# Configure OCR-enabled processor
config = ProcessingConfig(
    ocr_enabled=True,
    ocr_engine="easyocr",  # or "tesseract", "paddle"
    language="en"
)

processor = DocumentProcessor(
    files=["receipt.jpg"],
    config=config
)

# Process image and extract text
documents = processor.process()

🔧 Core Components

ExtractorSchema

Defines the structure of data to be extracted:

Field definitions
Validation rules
Output format specifications

schema = ExtractorSchema(
    fields=[...],
    output_format="json"
)

Field Types

Supported field types:

STRING: Text data
INTEGER: Whole numbers
FLOAT: Decimal numbers
DATE: Date values
BOOLEAN: True/False values
LIST: Arrays of values
DICT: Nested objects
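
To make the type list concrete, here is a minimal sketch of how raw extracted strings could be coerced into each type. `SketchFieldType` and `coerce` are hypothetical names written for illustration; they are not IndoxMiner's implementation. The ISO date format and the JSON encoding of lists and dicts are assumptions.

```python
import json
from datetime import date
from enum import Enum

class SketchFieldType(Enum):
    """Hypothetical mirror of the field types listed above."""
    STRING = "string"
    INTEGER = "integer"
    FLOAT = "float"
    DATE = "date"
    BOOLEAN = "boolean"
    LIST = "list"
    DICT = "dict"

def coerce(raw: str, field_type: SketchFieldType):
    """Convert a raw string to the Python type matching field_type."""
    if field_type is SketchFieldType.STRING:
        return raw.strip()
    if field_type is SketchFieldType.INTEGER:
        return int(raw.replace(",", ""))
    if field_type is SketchFieldType.FLOAT:
        # Drop the currency symbol and thousands separators
        return float(raw.replace("$", "").replace(",", ""))
    if field_type is SketchFieldType.DATE:
        return date.fromisoformat(raw.strip())  # assumes YYYY-MM-DD input
    if field_type is SketchFieldType.BOOLEAN:
        return raw.strip().lower() in {"true", "yes", "1"}
    # LIST and DICT: assume the raw value is JSON text
    value = json.loads(raw)
    expected = list if field_type is SketchFieldType.LIST else dict
    if not isinstance(value, expected):
        raise ValueError(f"expected {expected.__name__}, got {type(value).__name__}")
    return value
```

For example, `coerce("$2,399.99", SketchFieldType.FLOAT)` yields a float, and `coerce("Yes", SketchFieldType.BOOLEAN)` yields `True`.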

Validation Rules

Available validation options:

min_length/max_length: String length constraints
min_value/max_value: Numeric bounds
pattern: Regex patterns
required: Required fields
custom: Custom validation functions
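
The options above can be pictured as simple checks applied to each extracted value. The following is a minimal sketch under stated assumptions: `SketchRule` and `validate` are hypothetical names, the semantics (for instance, that `pattern` must match the whole string) are guesses, and none of this is IndoxMiner's actual `ValidationRule` code.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class SketchRule:
    """Hypothetical rule container; field names follow the list above."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None
    required: bool = False
    custom: Optional[Callable[[Any], bool]] = None

def validate(value: Any, rule: SketchRule) -> list[str]:
    """Return a list of error messages; an empty list means the value passed."""
    if value is None:
        return ["value is required"] if rule.required else []
    errors = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            errors.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            errors.append(f"longer than {rule.max_length} characters")
        # Assumption: the pattern must match the entire string
        if rule.pattern is not None and not re.fullmatch(rule.pattern, value):
            errors.append(f"does not match pattern {rule.pattern!r}")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"below minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            errors.append(f"above maximum {rule.max_value}")
    if rule.custom is not None and not rule.custom(value):
        errors.append("custom check failed")
    return errors
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem with a field at once.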

⚙️ Configuration Options

ProcessingConfig

config = ProcessingConfig(
    hi_res_pdf=True,          # High-resolution PDF processing
    ocr_enabled=True,         # Enable OCR
    ocr_engine="tesseract",   # OCR engine selection
    chunk_size=1000,          # Text chunk size
    language="en",            # Processing language
    max_threads=4             # Parallel processing threads
)
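
The README does not say what unit `chunk_size` is measured in. Assuming it counts characters, splitting a document into chunks could look like the sketch below. `chunk_text` is a hypothetical helper written for illustration, not the library's chunking routine.

```python
def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into chunks of at most chunk_size characters, breaking
    at whitespace where possible so that words stay intact."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks, current = [], ""
    for word in text.split():
        # Hard-split any single word that is longer than a whole chunk
        while len(word) > chunk_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:chunk_size])
            word = word[chunk_size:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Smaller chunks keep each LLM request short but increase the number of requests, so the value is a trade-off between cost per call and number of calls.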

🔍 Error Handling

IndoxMiner provides detailed error reporting:

results = await extractor.extract(documents)

if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Errors in chunk {chunk_idx}:")
        for error in errors:
            print(f"- {error.field}: {error.message}")

# Access valid results
valid_data = results.get_valid_results()
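
To show the shape of data that the loop above expects, here is a small self-contained stand-in. `SketchError`, `SketchResults`, and `format_report` are hypothetical; only the attribute names `is_valid`, `validation_errors`, `field`, `message`, and `get_valid_results` come from the snippet above, and the rest (such as one data dict per chunk) is an assumption.

```python
import dataclasses

@dataclasses.dataclass
class SketchError:
    """Hypothetical error record with the two attributes the loop reads."""
    field: str
    message: str

@dataclasses.dataclass
class SketchResults:
    """Hypothetical container: one data dict per chunk plus per-chunk errors."""
    data: list
    validation_errors: dict = dataclasses.field(default_factory=dict)

    @property
    def is_valid(self) -> bool:
        return not self.validation_errors

    def get_valid_results(self) -> list:
        # Keep only the chunks that have no recorded errors
        return [row for i, row in enumerate(self.data)
                if i not in self.validation_errors]

def format_report(results: SketchResults) -> list[str]:
    """Build the same report lines that the loop above prints."""
    lines = []
    for chunk_idx, errors in results.validation_errors.items():
        lines.append(f"Errors in chunk {chunk_idx}:")
        lines.extend(f"- {e.field}: {e.message}" for e in errors)
    return lines

if __name__ == "__main__":
    results = SketchResults(
        data=[{"price": 10.0}, {"price": -5.0}],
        validation_errors={1: [SketchError("price", "below minimum 0")]},
    )
    print("\n".join(format_report(results)))
    print(results.get_valid_results())
```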

🤝 Contributing

We welcome contributions! To contribute:

Fork the repository
Create a feature branch
Commit your changes
Push to the branch
Open a Pull Request

Please read our Contributing Guidelines for more details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

Documentation: Full documentation
Issues: GitHub Issues
Discussions: GitHub Discussions

Release history

Version	Release date
0.1.5	Feb 7, 2025
0.1.4	Dec 29, 2024
0.1.3	Dec 28, 2024
0.1.2	Dec 26, 2024
0.1.1 (this version)	Dec 25, 2024
0.1.0	Dec 19, 2024
0.0.12	Nov 16, 2024
0.0.11	Nov 16, 2024
0.0.10	Nov 11, 2024
0.0.9	Nov 11, 2024
0.0.8	Nov 9, 2024
0.0.7	Nov 5, 2024
0.0.6	Nov 4, 2024
0.0.5	Nov 4, 2024
0.0.4	Nov 3, 2024
0.0.3	Nov 3, 2024
0.0.2	Nov 3, 2024
0.0.1	Nov 2, 2024
0.0.0	Nov 2, 2024

Download files

Source Distribution

indoxminer-0.1.1.tar.gz (65.1 kB)

Uploaded Dec 25, 2024 Source

Built Distribution

indoxMiner-0.1.1-py3-none-any.whl (80.1 kB)

Uploaded Dec 25, 2024 Python 3

File details

Details for the file indoxminer-0.1.1.tar.gz.

File metadata

Download URL: indoxminer-0.1.1.tar.gz
Upload date: Dec 25, 2024
Size: 65.1 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for indoxminer-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`b6432b2208820f262030207a34adedfa02927538d289918efbcb60dfa0131f2e`
MD5	`0e1d968ea72ef22caecfbfbbd0eac37b`
BLAKE2b-256	`36ccade2c3b1d72255c48acefe28c1fdd372155282c9d7696c7d168fdb46f94e`

File details

Details for the file indoxMiner-0.1.1-py3-none-any.whl.

File metadata

Download URL: indoxMiner-0.1.1-py3-none-any.whl
Upload date: Dec 25, 2024
Size: 80.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for indoxMiner-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8cc719072406f55593d5f6e19911ed6b66139b32cb94eb49fc46d177a2d5ee87`
MD5	`33a1a25916ec087c173cf105c3e47b91`
BLAKE2b-256	`50f0764df6fbd162b4f091ef8ac295d382c11e16f1b056559a10d5647d84f7aa`
