Indox Data Extraction
IndoxMiner
IndoxMiner is a powerful Python library that leverages Large Language Models (LLMs) to extract structured information from unstructured data sources including text, PDFs, and images. Using a flexible schema-based approach, it enables precise data extraction, validation, and transformation, making it ideal for automating document processing workflows.
🚀 Key Features
- Multi-Format Support: Extract data from text, PDFs, images, and scanned documents
- Schema-Based Extraction: Define custom schemas to specify exactly what data to extract
- LLM Integration: Seamless integration with OpenAI models for intelligent extraction
- Validation & Type Safety: Built-in validation rules and type-safe field definitions
- Flexible Output: Export to JSON, pandas DataFrames, or custom formats
- Async Support: Built for scalability with asynchronous processing capabilities
- OCR Integration: Multiple OCR engine options for image-based text extraction
- High-Resolution Support: Enhanced processing for high-quality PDFs
- Error Handling: Comprehensive error handling and validation reporting
📦 Installation
pip install indoxminer
🎯 Quick Start
Basic Text Extraction
from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi
# Initialize OpenAI extractor
llm_extractor = OpenAi(
    api_key="your-api-key",
    model="gpt-4o-mini"
)
# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(
            name="product_name",
            description="Product name",
            field_type=FieldType.STRING,
            rules=ValidationRule(min_length=2)
        ),
        Field(
            name="price",
            description="Price in USD",
            field_type=FieldType.FLOAT,
            rules=ValidationRule(min_value=0)
        )
    ]
)
# Create extractor and process text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""
# Extract and convert to DataFrame
# Note: a bare `await` works in a notebook or async REPL; in a script,
# run this line inside an `async def` function.
result = await extractor.extract(text)
df = extractor.to_dataframe(result)
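The `await extractor.extract(...)` call above needs a running event loop. A minimal sketch of how that might look in a plain script, using only the standard library: `FakeExtractor` is a hypothetical stand-in, not part of IndoxMiner, and only mimics the awaitable `extract` method so the pattern can be run without an API key.

```python
import asyncio


class FakeExtractor:
    """Hypothetical stand-in with an awaitable `extract`, like the example above."""

    async def extract(self, text: str) -> dict:
        await asyncio.sleep(0)  # placeholder for the awaited LLM request
        return {"chars": len(text)}


async def main() -> dict:
    extractor = FakeExtractor()
    # `await` is legal here because we are inside a coroutine.
    return await extractor.extract("MacBook Pro 16-inch with M2 chip")


if __name__ == "__main__":
    # asyncio.run creates the event loop, runs main() to completion, and closes the loop.
    print(asyncio.run(main()))
```

With the real library, the body of `main()` would presumably hold the `Extractor` setup and the `await extractor.extract(text)` line from the Quick Start.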
PDF Processing
from indoxminer import (
    DocumentProcessor, ProcessingConfig,
    ExtractorSchema, Field, FieldType, Extractor
)
# Initialize processor with custom config
processor = DocumentProcessor(
    files=["invoice.pdf"],
    config=ProcessingConfig(
        hi_res_pdf=True,
        chunk_size=1000
    )
)
# Process document
documents = processor.process()
# Extract structured data
schema = ExtractorSchema(
    fields=[
        Field(
            name="bill_to",
            description="Billing address",
            field_type=FieldType.STRING
        ),
        Field(
            name="invoice_date",
            description="Invoice date",
            field_type=FieldType.DATE
        ),
        Field(
            name="total_amount",
            description="Total amount in USD",
            field_type=FieldType.FLOAT
        )
    ]
)
# Build an extractor for the invoice schema (llm_extractor as in the Quick Start)
extractor = Extractor(llm=llm_extractor, schema=schema)
results = await extractor.extract(documents)
Image Processing with OCR
from indoxminer import DocumentProcessor, ProcessingConfig
# Configure OCR-enabled processor
config = ProcessingConfig(
    ocr_enabled=True,
    ocr_engine="easyocr",  # or "tesseract", "paddle"
    language="en"
)
processor = DocumentProcessor(
    files=["receipt.jpg"],
    config=config
)
# Process image and extract text
documents = processor.process()
🔧 Core Components
ExtractorSchema
Defines the structure of data to be extracted:
- Field definitions
- Validation rules
- Output format specifications
schema = ExtractorSchema(
    fields=[...],
    output_format="json"
)
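When extracted records are exported as JSON, `DATE` values need some care, because Python's `json` module does not serialize `datetime.date` objects on its own. The sketch below shows one way to handle that with the standard library only. It is not IndoxMiner's exporter; `to_json` and the record layout are hypothetical.

```python
import json
from datetime import date


def to_json(records: list) -> str:
    """Serialize extracted records, writing dates as ISO 8601 strings."""

    def encode(value):
        # json.dumps calls this only for objects it cannot serialize itself.
        if isinstance(value, date):
            return value.isoformat()
        raise TypeError(f"cannot serialize {type(value).__name__}")

    return json.dumps(records, default=encode, indent=2)


records = [{"invoice_date": date(2024, 1, 31), "total_amount": 12.5}]
print(to_json(records))
```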
Field Types
Supported field types:
- STRING: Text data
- INTEGER: Whole numbers
- FLOAT: Decimal numbers
- DATE: Date values
- BOOLEAN: True/False values
- LIST: Arrays of values
- DICT: Nested objects
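To illustrate what typed fields mean in practice, here is a minimal sketch of coercing raw extracted strings (such as `"$2,399.99"`) into Python values for a few of the types listed above. The `coerce` helper, the string type names, the accepted boolean spellings, and the `YYYY-MM-DD` date format are all assumptions made for the example; the library's own conversion logic may differ.

```python
from datetime import datetime


def coerce(value: str, field_type: str):
    """Convert a raw extracted string to the Python type named by field_type."""
    text = value.strip()
    if field_type == "STRING":
        return text
    if field_type == "INTEGER":
        return int(text.replace(",", ""))
    if field_type == "FLOAT":
        # Drop a leading dollar sign and thousands separators before parsing.
        return float(text.lstrip("$").replace(",", ""))
    if field_type == "DATE":
        # Assumes ISO-style dates; other formats raise ValueError.
        return datetime.strptime(text, "%Y-%m-%d").date()
    if field_type == "BOOLEAN":
        lowered = text.lower()
        if lowered in {"true", "yes"}:
            return True
        if lowered in {"false", "no"}:
            return False
        raise ValueError(f"not a boolean: {value!r}")
    raise ValueError(f"unsupported field type: {field_type}")


print(coerce("$2,399.99", "FLOAT"), coerce("Yes", "BOOLEAN"))
```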
Validation Rules
Available validation options:
- min_length / max_length: String length constraints
- min_value / max_value: Numeric bounds
- pattern: Regex patterns
- required: Required fields
- custom: Custom validation functions
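The options above can be pictured as a small rule object checked against each extracted value. The following is a rough standard-library sketch of that idea, not the library's `ValidationRule` implementation: the `Rule` class and `check` function are hypothetical, and matching `pattern` against the whole string (rather than searching within it) is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Rule:
    """Hypothetical stand-in that mirrors the option names listed above."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None
    required: bool = False
    custom: Optional[Callable[[Any], bool]] = None


def check(value: Any, rule: Rule) -> List[str]:
    """Return a list of problems; an empty list means the value passes."""
    if value is None:
        return ["value is required"] if rule.required else []
    problems = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            problems.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            problems.append(f"longer than {rule.max_length} characters")
        if rule.pattern is not None and re.fullmatch(rule.pattern, value) is None:
            problems.append("does not match pattern")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if rule.min_value is not None and value < rule.min_value:
            problems.append(f"below minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            problems.append(f"above maximum {rule.max_value}")
    if rule.custom is not None and not rule.custom(value):
        problems.append("custom check failed")
    return problems


print(check("A", Rule(min_length=2)))
print(check(-1.0, Rule(min_value=0)))
```

Returning every problem instead of stopping at the first one makes it possible to report all issues for a field in a single pass.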
⚙️ Configuration Options
ProcessingConfig
config = ProcessingConfig(
    hi_res_pdf=True,         # High-resolution PDF processing
    ocr_enabled=True,        # Enable OCR
    ocr_engine="tesseract",  # OCR engine selection
    chunk_size=1000,         # Text chunk size
    language="en",           # Processing language
    max_threads=4            # Parallel processing threads
)
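To give a feel for how `chunk_size` and `max_threads` could interact, the sketch below splits text into fixed-size pieces and processes them on a thread pool. It assumes `chunk_size` counts characters, which the source does not state; the library may count tokens or split on sentence boundaries instead. `chunk_text` and `process_chunks` are hypothetical helpers, not IndoxMiner functions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chunk_text(text: str, chunk_size: int = 1000) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def process_chunks(chunks: List[str], worker: Callable, max_threads: int = 4) -> list:
    """Apply worker to each chunk on a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(worker, chunks))


chunks = chunk_text("a" * 2500, chunk_size=1000)
print(process_chunks(chunks, len))
```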
🔍 Error Handling
IndoxMiner provides detailed error reporting:
results = await extractor.extract(documents)
if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Errors in chunk {chunk_idx}:")
        for error in errors:
            print(f"- {error.field}: {error.message}")
# Access valid results
valid_data = results.get_valid_results()
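The snippet above implies a result object that maps chunk indices to lists of field errors. As a rough mental model only, the sketch below shows one shape such an object could take. The names `is_valid`, `validation_errors`, `get_valid_results`, `field`, and `message` come from the snippet; the `FieldError` and `Results` classes, the `data` attribute, and the one-record-per-chunk layout are assumptions.

```python
import dataclasses
from typing import Any, Dict, List


@dataclasses.dataclass
class FieldError:
    field: str
    message: str


@dataclasses.dataclass
class Results:
    """Hypothetical container: one extracted record per chunk."""
    data: List[Dict[str, Any]]
    validation_errors: Dict[int, List[FieldError]] = dataclasses.field(default_factory=dict)

    @property
    def is_valid(self) -> bool:
        # Valid only if no chunk has any recorded error.
        return not any(self.validation_errors.values())

    def get_valid_results(self) -> List[Dict[str, Any]]:
        # Keep records whose chunk index has no errors attached.
        return [row for i, row in enumerate(self.data)
                if not self.validation_errors.get(i)]


results = Results(
    data=[{"price": 10.0}, {"price": -1.0}],
    validation_errors={1: [FieldError("price", "below minimum 0")]},
)
print(results.is_valid, results.get_valid_results())
```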
🤝 Contributing
We welcome contributions! To contribute:
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Open a Pull Request
Please read our Contributing Guidelines for more details.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🆘 Support
- Documentation: Full documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions