Indox Data Extraction

These details have not been verified by PyPI

Project links

Homepage

Project description

IndoxMiner

IndoxMiner is a Python library for extracting structured information from unstructured text and documents using Large Language Models (LLMs). It takes a schema-based approach: you define the fields you want to extract, and the library validates the extracted values against that schema.

Features

🔍 Extract structured data from text and PDFs
📄 Support for multiple document formats
🔧 Customizable extraction schemas
✅ Built-in validation rules
📊 Easy conversion to pandas DataFrames
🤖 Integration with OpenAI models
🎯 Type-safe field definitions
🔄 Async support for better performance

Installation

pip install indoxminer

Quick Start

Basic Text Extraction

from indoxminer import (
    ExtractorSchema,
    Field,
    FieldType,
    ValidationRule,
    OutputFormat,
    Extractor,
    OpenAi
)

# Initialize OpenAI
llm_extractor = OpenAi(api_key="your-api-key", model="gpt-4o-mini")

# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(
            name="product_name",
            description="Product name",
            field_type=FieldType.STRING,
            rules=ValidationRule(min_length=2)
        ),
        Field(
            name="price",
            description="Price in USD",
            field_type=FieldType.FLOAT,
            rules=ValidationRule(min_value=0)
        )
    ]
)

# Create extractor
extractor = Extractor(llm=llm_extractor, schema=schema)

# Extract information
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""
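# Top-level `await` works in Jupyter/IPython; in a plain script, call this
# from an async function that is run with asyncio.run().
result = await extractor.extract(text)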

# Convert to DataFrame
df = extractor.to_dataframe(result)
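
The `await` call above works as written only where an event loop is already running, such as a Jupyter notebook. In a standalone script the same call has to sit inside a coroutine that is passed to `asyncio.run()`. The sketch below shows that pattern with a stand-in coroutine, `fake_extract`, which is hypothetical and only mimics an awaitable extraction call; it is not part of IndoxMiner.

```python
import asyncio


async def fake_extract(text: str) -> dict:
    """Hypothetical stand-in for an awaitable call such as extractor.extract(text)."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    first_line = text.strip().splitlines()[0]
    return {"product_name": first_line}


async def main() -> dict:
    text = """
    MacBook Pro 16-inch with M2 chip
    Price: $2,399.99
    """
    # In real code this line would be: result = await extractor.extract(text)
    return await fake_extract(text)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Replacing `fake_extract` with the real extractor call should leave the rest of the structure unchanged, assuming `extract` is a coroutine as the example above suggests.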

PDF Document Processing

from indoxminer import (
    DocumentProcessor,
    ProcessingConfig,
    ExtractorSchema,
    Field,
    FieldType,
    ValidationRule,
    OutputFormat,
    Extractor,
)

# Initialize document processor
processor = DocumentProcessor(["invoice.pdf"])

# Process documents with configuration
documents = processor.process(
    config=ProcessingConfig(
        hi_res_pdf=True
    )
)

# Define complex schema
schema = ExtractorSchema(
    fields=[
        Field(
            name="bill_to",
            description="Bill To",
            field_type=FieldType.STRING,
            rules=ValidationRule(min_length=2)
        ),
        Field(
            name="date",
            description="date",
            field_type=FieldType.DATE,
        ),
        Field(
            name="amount",
            description="price in usd",
            field_type=FieldType.FLOAT,
        ),
    ],
    output_format=OutputFormat.JSON
)

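# Create an extractor for the new schema (reuses llm_extractor from the Quick Start)
extractor = Extractor(llm=llm_extractor, schema=schema)

# Extract information (top-level `await` needs a running event loop, e.g. a notebook)
results = await extractor.extract(documents)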

# Handle results and validation
valid_data = results.get_valid_results()

if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Chunk {chunk_idx} has errors: {errors}")

Core Components

ExtractorSchema

Defines the structure of data to be extracted:

fields: List of Field objects defining what to extract
output_format: Desired output format, for example OutputFormat.JSON

Field

Defines individual data fields to extract:

name: Field identifier
description: Field description for the LLM
field_type: Data type (STRING, FLOAT, INTEGER, DATE)
rules: Validation rules for the field
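
To make the field types concrete, here is a minimal sketch of how a raw string returned by an LLM could be converted to each type. The function `coerce` and its conversion rules are assumptions made for illustration; they are not IndoxMiner's implementation, and only the type names come from the list above.

```python
from datetime import datetime


def coerce(raw: str, field_type: str):
    """Convert a raw string to the Python type named by field_type.

    The type names mirror STRING, FLOAT, INTEGER and DATE; the conversion
    rules themselves are assumptions made for this sketch.
    """
    value = raw.strip()
    if field_type == "STRING":
        return value
    if field_type in ("FLOAT", "INTEGER"):
        # Drop a dollar sign and thousands separators, e.g. "$2,399.99".
        number = value.replace("$", "").replace(",", "")
        return float(number) if field_type == "FLOAT" else int(number)
    if field_type == "DATE":
        # Assumes ISO 8601 input (YYYY-MM-DD); other layouts need their own format.
        return datetime.strptime(value, "%Y-%m-%d").date()
    raise ValueError(f"unknown field type: {field_type}")


print(coerce("$2,399.99", "FLOAT"))
print(coerce("2024-11-05", "DATE"))
```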

ValidationRule

Sets constraints for extracted data:

min_length: Minimum string length
max_length: Maximum string length
min_value: Minimum numeric value
max_value: Maximum numeric value
pattern: Regex pattern for validation
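
The constraints above can be read as simple checks on a single extracted value. The sketch below spells them out in plain Python. `RuleSketch` and `check` are hypothetical names that reuse the parameter names from the list; how the library itself applies the rules (for example, whether `pattern` must match the whole string) is an assumption here.

```python
import re
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class RuleSketch:
    """Hypothetical stand-in with the same constraint names as ValidationRule."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None


def check(value: Any, rule: RuleSketch) -> List[str]:
    """Return a list of violations; an empty list means the value passes."""
    errors = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            errors.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            errors.append(f"longer than {rule.max_length} characters")
        # Assumption: the pattern must match the entire string.
        if rule.pattern is not None and not re.fullmatch(rule.pattern, value):
            errors.append(f"does not match pattern {rule.pattern!r}")
    elif isinstance(value, (int, float)):
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"below minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            errors.append(f"above maximum {rule.max_value}")
    return errors


print(check("A", RuleSketch(min_length=2)))
print(check(-5.0, RuleSketch(min_value=0)))
```

Returning a list of messages rather than raising on the first failure keeps every violation visible at once, which fits the per-chunk error reporting described under Error Handling.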

DocumentProcessor

Handles document processing:

Supports multiple file formats
Configurable processing options
High-resolution PDF support

Configuration

ProcessingConfig

config = ProcessingConfig(
    hi_res_pdf=True,  # Enable high-resolution PDF processing
    # Add other configuration options as needed
)

Error Handling

The library validates extracted data and reports any errors it finds:

Validation errors are collected per chunk
Easy access to valid and invalid results
Detailed error messages for debugging
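
As a rough illustration of per-chunk error collection, the sketch below models a results object with the same attribute names used in the PDF example (`is_valid`, `validation_errors`, `get_valid_results`). The class `ResultsSketch` and its internals are hypothetical; the real result object may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResultsSketch:
    """Hypothetical container; attribute names follow the PDF example."""
    data: List[dict] = field(default_factory=list)
    validation_errors: Dict[int, List[str]] = field(default_factory=dict)

    @property
    def is_valid(self) -> bool:
        # Valid only when no chunk recorded an error.
        return not self.validation_errors

    def get_valid_results(self) -> List[dict]:
        # Keep the chunks whose index has no recorded errors.
        return [row for idx, row in enumerate(self.data)
                if idx not in self.validation_errors]


results = ResultsSketch(
    data=[{"amount": 120.0}, {"amount": -3.0}],
    validation_errors={1: ["amount: below minimum 0"]},
)

if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Chunk {chunk_idx} has errors: {errors}")

print(results.get_valid_results())
```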

Output Formats

Supported output formats:

JSON
Pandas DataFrame
Raw dictionary
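
The relationship between these formats can be shown with the standard library alone. In the sketch below, `records` is a made-up example of extraction output as raw dictionaries, not output captured from the library.

```python
import json

# Hypothetical extraction output: one raw dictionary per processed chunk.
records = [
    {"product_name": "MacBook Pro 16-inch with M2 chip", "price": 2399.99},
]

as_json = json.dumps(records, indent=2)  # raw dictionaries -> JSON text
round_trip = json.loads(as_json)         # JSON text -> raw dictionaries
print(as_json)
```

A list of dictionaries like `records` is also a shape that `pandas.DataFrame(records)` accepts, producing one row per dictionary.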

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Support

For issues and feature requests, please use the GitHub issue tracker.

Release history

0.0.7  Nov 5, 2024
0.0.6  Nov 4, 2024
0.0.5  Nov 4, 2024
0.0.4  Nov 3, 2024
0.0.3  Nov 3, 2024
0.0.2  Nov 3, 2024
0.0.1  Nov 2, 2024
0.0.0  Nov 2, 2024 (this version)

Download files

Source Distribution

indoxminer-0.0.0.tar.gz (32.0 kB)

Uploaded Nov 2, 2024 Source

Built Distribution

indoxMiner-0.0.0-py3-none-any.whl (32.6 kB)

Uploaded Nov 2, 2024 Python 3

Hashes for indoxminer-0.0.0.tar.gz
Algorithm	Hash digest
SHA256	`a0e61e285370da9f2aeab2e5fb730f1a0e6e0a1fe2f0f993aaeda9f2f94536b8`
MD5	`cdd34e39bc48b32ef2ed900de8c339a4`
BLAKE2b-256	`9331efb5b5dc4cf82dde4ea3c7b3e27c92ad53d8672268f286c1d2a93b287048`

Hashes for indoxMiner-0.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`efb675e5d1a32aa19b5642fa0588555657db6507b4fb0fe2f9ac86610ccf4c62`
MD5	`a06083fcdce226f77b48ff75ccb97c04`
BLAKE2b-256	`895319249f193a076790c4f77e7702cdd26f04b9b444e02ce956ac0bd872eb78`