Indox Data Extraction

IndoxMiner

IndoxMiner is a Python library that uses Large Language Models (LLMs) to extract structured information from unstructured sources such as text, PDFs, and images. You describe the fields you want in a schema, and the library extracts, validates, and transforms the matching data, which makes it a good fit for automating document processing workflows.

🚀 Key Features

  • Multi-Format Support: Extract data from text, PDFs, images, and scanned documents
  • Schema-Based Extraction: Define custom schemas to specify exactly what data to extract
  • LLM Integration: Seamless integration with OpenAI models for intelligent extraction
  • Validation & Type Safety: Built-in validation rules and type-safe field definitions
  • Flexible Output: Export to JSON, pandas DataFrames, or custom formats
  • Async Support: Built for scalability with asynchronous processing capabilities
  • OCR Integration: Multiple OCR engine options for image-based text extraction
  • High-Resolution Support: Enhanced processing for high-quality PDFs
  • Error Handling: Comprehensive error handling and validation reporting

📦 Installation

pip install indoxminer

🎯 Quick Start

Basic Text Extraction

from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi

# Initialize OpenAI extractor
llm_extractor = OpenAi(
    api_key="your-api-key",
    model="gpt-4o-mini"
)

# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(
            name="product_name",
            description="Product name",
            field_type=FieldType.STRING,
            rules=ValidationRule(min_length=2)
        ),
        Field(
            name="price",
            description="Price in USD",
            field_type=FieldType.FLOAT,
            rules=ValidationRule(min_value=0)
        )
    ]
)

# Create extractor and process text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""

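# Extract and convert to DataFrame.
# Top-level `await` works in a notebook or async REPL; in a plain script,
# put these calls inside an `async def` and run it with asyncio.run().
result = await extractor.extract(text)
df = extractor.to_dataframe(result)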
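
Because `extract` is awaited, several inputs can be processed concurrently from an ordinary script. The sketch below shows only the asyncio pattern: `extract_stub` is a hypothetical stand-in that returns the first line of the text, not IndoxMiner's extractor, and `extract_many` is an illustrative helper name.

```python
import asyncio


async def extract_stub(text: str) -> dict:
    """Hypothetical stand-in for an awaitable extraction call (not the library's API)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    lines = text.strip().splitlines()
    return {"product_name": lines[0] if lines else ""}


async def extract_many(texts: list[str]) -> list[dict]:
    """Run one extraction per text concurrently; results keep the input order."""
    return await asyncio.gather(*(extract_stub(t) for t in texts))


if __name__ == "__main__":
    samples = [
        "MacBook Pro 16-inch with M2 chip\nPrice: $2,399.99",
        "USB-C cable\nPrice: $19.00",
    ]
    # asyncio.run() creates the event loop that top-level `await` lacks in a script
    for row in asyncio.run(extract_many(samples)):
        print(row)
```

With a real extractor, the stub call would presumably be replaced by the awaited `extractor.extract(...)` shown above.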

PDF Processing

from indoxminer import (
    DocumentProcessor,
    ProcessingConfig,
    Extractor,
    ExtractorSchema,
    Field,
    FieldType,
)

# Initialize processor with custom config
processor = DocumentProcessor(
    files=["invoice.pdf"],
    config=ProcessingConfig(
        hi_res_pdf=True,
        chunk_size=1000
    )
)

# Process document
documents = processor.process()

# Extract structured data
schema = ExtractorSchema(
    fields=[
        Field(
            name="bill_to",
            description="Billing address",
            field_type=FieldType.STRING
        ),
        Field(
            name="invoice_date",
            description="Invoice date",
            field_type=FieldType.DATE
        ),
        Field(
            name="total_amount",
            description="Total amount in USD",
            field_type=FieldType.FLOAT
        )
    ]
)

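# Build an extractor for the invoice schema, reusing the `llm_extractor`
# client created in the Quick Start, then run it on the processed documents
extractor = Extractor(llm=llm_extractor, schema=schema)
results = await extractor.extract(documents)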

Image Processing with OCR

# Configure OCR-enabled processor
config = ProcessingConfig(
    ocr_enabled=True,
    ocr_engine="easyocr",  # or "tesseract", "paddle"
    language="en"
)

processor = DocumentProcessor(
    files=["receipt.jpg"],
    config=config
)

# Process image and extract text
documents = processor.process()

🔧 Core Components

ExtractorSchema

Defines the structure of data to be extracted:

  • Field definitions
  • Validation rules
  • Output format specifications

schema = ExtractorSchema(
    fields=[...],
    output_format="json"
)

Field Types

Supported field types:

  • STRING: Text data
  • INTEGER: Whole numbers
  • FLOAT: Decimal numbers
  • DATE: Date values
  • BOOLEAN: True/False values
  • LIST: Arrays of values
  • DICT: Nested objects
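
To make the scalar types concrete, here is a minimal sketch of how raw extracted strings might be converted to Python values. It does not show IndoxMiner's internals: the `coerce` function, the string type names, and the assumption that dates arrive in ISO `YYYY-MM-DD` form are all illustrative. LIST and DICT are left out for brevity.

```python
from datetime import datetime


def coerce(value: str, field_type: str):
    """Convert a raw extracted string to a Python value (illustrative only)."""
    text = value.strip()
    if field_type == "STRING":
        return text
    if field_type == "INTEGER":
        return int(text.replace(",", ""))
    if field_type == "FLOAT":
        # Drop a dollar sign and thousands separators before parsing
        return float(text.replace("$", "").replace(",", ""))
    if field_type == "DATE":
        # Assumption: the date is already in ISO format
        return datetime.strptime(text, "%Y-%m-%d").date()
    if field_type == "BOOLEAN":
        lowered = text.lower()
        if lowered in {"yes", "true", "1"}:
            return True
        if lowered in {"no", "false", "0"}:
            return False
        raise ValueError(f"not a boolean: {value!r}")
    raise ValueError(f"unsupported field type: {field_type}")


if __name__ == "__main__":
    print(coerce("$2,399.99", "FLOAT"))
    print(coerce("Yes", "BOOLEAN"))
    print(coerce("2024-03-01", "DATE"))
```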

Validation Rules

Available validation options:

  • min_length/max_length: String length constraints
  • min_value/max_value: Numeric bounds
  • pattern: Regex patterns
  • required: Required fields
  • custom: Custom validation functions
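
The options above can be pictured as a small set of checks applied to each extracted value. The following is a sketch under stated assumptions, not the library's `ValidationRule`: the `Rule` dataclass, the `validate` function, and the error wording are hypothetical and only mirror the option names listed here.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Rule:
    """Hypothetical mirror of the listed options, not the library's class."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None
    required: bool = False
    custom: Optional[Callable[[Any], bool]] = None


def validate(value: Any, rule: Rule) -> list[str]:
    """Return a list of problems; an empty list means the value passed."""
    if value is None:
        return ["value is required"] if rule.required else []
    errors = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            errors.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            errors.append(f"longer than {rule.max_length} characters")
        if rule.pattern is not None and not re.fullmatch(rule.pattern, value):
            errors.append(f"does not match pattern {rule.pattern}")
    # bool is a subclass of int, so exclude it from the numeric checks
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"below minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            errors.append(f"above maximum {rule.max_value}")
    if rule.custom is not None and not rule.custom(value):
        errors.append("failed custom check")
    return errors


if __name__ == "__main__":
    print(validate("A", Rule(min_length=2)))
    print(validate(2399.99, Rule(min_value=0)))
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem with a field at once.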

⚙️ Configuration Options

ProcessingConfig

config = ProcessingConfig(
    hi_res_pdf=True,          # High-resolution PDF processing
    ocr_enabled=True,         # Enable OCR
    ocr_engine="tesseract",   # OCR engine selection
    chunk_size=1000,          # Text chunk size
    language="en",            # Processing language
    max_threads=4             # Parallel processing threads
)
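
As a rough intuition for `chunk_size` and `max_threads`, the sketch below splits text into fixed-size pieces and processes them on a thread pool. It assumes `chunk_size` counts characters, which this README does not state, and `chunk_text` and `process_chunks` are hypothetical helpers with a word count standing in for the real per-chunk work.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def process_chunks(chunks: list[str], max_threads: int) -> list[int]:
    """Run a placeholder per-chunk task (a word count) on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() returns results in the same order as the input chunks
        return list(pool.map(lambda chunk: len(chunk.split()), chunks))


if __name__ == "__main__":
    pieces = chunk_text("Invoice 1001 billed to Acme Corp for 12 widgets", 16)
    print(pieces)
    print(process_chunks(pieces, max_threads=4))
```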

🔍 Error Handling

Extraction results report validation errors per chunk, so you can inspect failures before using the data:

results = await extractor.extract(documents)

if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Errors in chunk {chunk_idx}:")
        for error in errors:
            print(f"- {error.field}: {error.message}")

# Access valid results
valid_data = results.get_valid_results()
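
The same reporting shape can be reproduced with plain data structures, which may help when testing downstream code without calling an LLM. This is a sketch only: `FieldError`, `group_by_chunk`, and `valid_rows` are hypothetical names, and treating a chunk as invalid when it has any error is an assumption about the behaviour rather than something the README confirms.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FieldError:
    """Hypothetical error record carrying the attributes printed above."""
    chunk_idx: int
    field: str
    message: str


def group_by_chunk(errors: list[FieldError]) -> dict[int, list[FieldError]]:
    """Group errors by the index of the chunk they came from."""
    grouped = defaultdict(list)
    for err in errors:
        grouped[err.chunk_idx].append(err)
    return dict(grouped)


def valid_rows(rows: list[dict], errors: list[FieldError]) -> list[dict]:
    """Keep only rows whose chunk index has no recorded error."""
    bad = {err.chunk_idx for err in errors}
    return [row for idx, row in enumerate(rows) if idx not in bad]


if __name__ == "__main__":
    rows = [{"price": 2399.99}, {"price": -5.0}]
    errors = [FieldError(1, "price", "below minimum 0")]
    for chunk_idx, errs in group_by_chunk(errors).items():
        print(f"Errors in chunk {chunk_idx}:")
        for err in errs:
            print(f"- {err.field}: {err.message}")
    print(valid_rows(rows, errors))
```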

🤝 Contributing

We welcome contributions! To contribute:

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

Please read our Contributing Guidelines for more details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
