Indox Data Extraction

IndoxMiner

PyPI version License: MIT

IndoxMiner is a Python library that combines LLM-based data extraction with object detection. It uses Large Language Models (LLMs) and a schema-based approach to extract, validate, and transform structured information from unstructured sources such as text, PDFs, and images, and it wraps pre-trained object detection models for visual recognition tasks. Together these make it suited to automating document processing and image analysis workflows.

🚀 Key Features

  • Multi-Format Support: Extract data from text, PDFs, images, and scanned documents
  • Schema-Based Extraction: Define custom schemas to specify exactly what data to extract
  • LLM Integration: Seamless integration with OpenAI models for intelligent extraction
  • Validation & Type Safety: Built-in validation rules and type-safe field definitions
  • Flexible Output: Export to JSON, pandas DataFrames, or custom formats
  • Async Support: Built for scalability with asynchronous processing capabilities
  • OCR Integration: Multiple OCR engine options for image-based text extraction
  • High-Resolution Support: Enhanced processing for high-quality PDFs
  • Error Handling: Comprehensive error handling and validation reporting

📦 Installation

pip install indoxminer

You can also install the object detection dependencies you need, such as Detectron2 or YOLOv8 (provided by the ultralytics package):

pip install 'git+https://github.com/facebookresearch/detectron2.git'
pip install ultralytics

📝 Data Extraction

IndoxMiner allows you to extract structured data from various formats like text, PDFs, and images using schema-based extraction and integration with powerful language models (LLMs).

Basic Text Extraction

from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi

# Initialize OpenAI extractor
llm_extractor = OpenAi(
    api_key="your-api-key",
    model="gpt-4o-mini"
)

# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(
            name="product_name",
            description="Product name",
            field_type=FieldType.STRING,
            rules=ValidationRule(min_length=2)
        ),
        Field(
            name="price",
            description="Price in USD",
            field_type=FieldType.FLOAT,
            rules=ValidationRule(min_value=0)
        )
    ]
)

# Create extractor and process text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""

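# Extract and convert to DataFrame.
# extract() is awaited, so outside a notebook it must run inside an event loop.
import asyncio

async def main():
    result = await extractor.extract(text)
    return extractor.to_dataframe(result)

df = asyncio.run(main())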
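
Because extraction is awaited, several inputs can in principle be processed concurrently. The sketch below shows that pattern with `asyncio.gather` and a semaphore that caps the number of in-flight calls. `fake_extract` and `extract_many` are hypothetical names used only for illustration; `fake_extract` merely simulates latency, and swapping in `extractor.extract` assumes that method is safe to call concurrently, which this README does not state.

```python
import asyncio


async def fake_extract(text: str) -> dict:
    """Hypothetical stand-in for an awaited extraction call."""
    await asyncio.sleep(0.01)  # simulate the latency of an LLM request
    return {"chars": len(text), "first_line": text.strip().splitlines()[0]}


async def extract_many(texts: list[str], limit: int = 2) -> list[dict]:
    """Run extractions concurrently, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(text: str) -> dict:
        async with semaphore:
            return await fake_extract(text)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(t) for t in texts))


texts = ["MacBook Pro 16-inch\nPrice: $2,399.99", "USB-C cable\nPrice: $19.00"]
results = asyncio.run(extract_many(texts))
print(results)
```

Capping concurrency with a semaphore is a common way to stay under an API provider's rate limits; the right limit depends on your account and model.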

PDF Processing

from indoxminer import (
    DocumentProcessor, ProcessingConfig,
    ExtractorSchema, Field, FieldType, Extractor,
)

# Initialize processor with custom config
processor = DocumentProcessor(
    files=["invoice.pdf"],
    config=ProcessingConfig(
        hi_res_pdf=True,
        chunk_size=1000
    )
)

# Process document
documents = processor.process()

# Extract structured data
schema = ExtractorSchema(
    fields=[
        Field(
            name="bill_to",
            description="Billing address",
            field_type=FieldType.STRING
        ),
        Field(
            name="invoice_date",
            description="Invoice date",
            field_type=FieldType.DATE
        ),
        Field(
            name="total_amount",
            description="Total amount in USD",
            field_type=FieldType.FLOAT
        )
    ]
)

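# Build an extractor for the invoice schema
# (reuses llm_extractor from the Basic Text Extraction example)
extractor = Extractor(llm=llm_extractor, schema=schema)

# await requires an async function or a notebook cell
results = await extractor.extract(documents)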

Image Processing with OCR

from indoxminer import DocumentProcessor, ProcessingConfig

# Configure OCR-enabled processor
config = ProcessingConfig(
    ocr_enabled=True,
    ocr_engine="easyocr",  # or "tesseract", "paddle"
    language="en"
)

processor = DocumentProcessor(
    files=["receipt.jpg"],
    config=config
)

# Process image and extract text
documents = processor.process()

📷 Object Detection

IndoxMiner includes powerful object detection capabilities using pre-trained models. The library supports a wide range of models suitable for real-time and high-accuracy object detection tasks.

Supported Detection Models

  • Detectron2 (from Facebook AI Research)
  • DETR (DEtection TRansformers)
  • DETR-CLIP (Combining DETR with CLIP for improved performance)
  • GroundingDINO (Grounding vision-language models for better contextual understanding)
  • Kosmos2 (Cross-modal vision-language pre-training)
  • OWL-ViT (Open-Vocabulary Vision Transformer for universal object detection)
  • RT-DETR (Real-Time DEtection TRansformers)
  • SAM2 (Segment Anything Model for interactive image segmentation)
  • YOLOv5 (You Only Look Once model for real-time detection)
  • YOLOv6, YOLOv7, YOLOv8, YOLOv10, YOLOv11 (Updated versions of the popular YOLO object detection models)
  • YOLOX (A robust, scalable version of the YOLO family)

The detection models return bounding boxes, class labels, and confidence scores for the objects they find in an image. Which model to choose depends on whether speed or accuracy matters more for your task.

🚀 Quick Start - Object Detection

Here's a guide to using the IndoxMiner object detection models. This example demonstrates using the YOLOv5 model for object detection:

Using YOLOv5 for Object Detection

from indoxminer import ObjectDetection

# Initialize YOLOv5 detector
detector = ObjectDetection(model="yolov5")

# Run object detection on an image (await requires an async function or a notebook cell)
image_path = "image.jpg"
detections = await detector.detect_objects(image_path)

# Visualize results
detector.visualize_results(detections)

# Optionally, save the detected image
detector.save_results(detections, "output.jpg")

Selecting a Detection Model

The ObjectDetection class in IndoxMiner can use any of the following models:

detector = ObjectDetection(model="detectron2")  # for Detectron2
detector = ObjectDetection(model="detr")        # for DETR
detector = ObjectDetection(model="yolov8")      # for YOLOv8

You can also set additional configuration options such as confidence threshold, device (CPU/GPU), etc.

📊 Visualizing Detection Results

IndoxMiner provides simple methods to visualize detection results, such as bounding boxes, class labels, and confidence scores. The visualize_results() method displays the image with the bounding boxes drawn around the detected objects.

detector.visualize_results(detections)  # Display bounding boxes and labels

⚙️ Configuration Options

You can configure various parameters of the object detection models for improved accuracy and performance:

  • model: The model name ("yolov5", "detectron2", "detr", etc.)
  • confidence_threshold: Confidence threshold for object detection (default: 0.5)
  • device: The device to run the model on ("cpu" or "cuda" for GPU acceleration)

Example:

detector = ObjectDetection(
    model="yolov5", 
    confidence_threshold=0.6, 
    device="cuda"
)
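
The effect of `confidence_threshold` can be pictured as a filter over raw detections. The following is a minimal plain-Python sketch, assuming detections are dictionaries with a `confidence` key as described in this README. `filter_detections` is an illustrative helper, not part of IndoxMiner, and whether the library itself compares with `>=` or `>` is an assumption.

```python
def filter_detections(detections, confidence_threshold=0.5):
    """Keep detections whose confidence meets the threshold (illustrative helper)."""
    return [d for d in detections if d["confidence"] >= confidence_threshold]


raw = [
    {"bbox": [10, 20, 110, 220], "class_id": 63, "confidence": 0.87},
    {"bbox": [5, 5, 50, 50], "class_id": 0, "confidence": 0.55},
    {"bbox": [0, 0, 30, 30], "class_id": 2, "confidence": 0.31},
]

kept = filter_detections(raw, confidence_threshold=0.6)
print(len(kept))  # only the 0.87 detection passes a 0.6 threshold
```

Raising the threshold trades recall for precision: fewer false positives, but low-confidence true objects are dropped as well.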

💡 Supported Formats and Output

The detected objects are returned as a list of dictionaries with the following information:

  • bbox: The bounding box coordinates (x1, y1, x2, y2)
  • class_id: The predicted class ID (e.g., for COCO dataset)
  • confidence: The confidence score of the prediction

For example:

{
    "bbox": [x1, y1, x2, y2],
    "class_id": 63,
    "confidence": 0.87
}
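
Downstream code often needs box dimensions or a different box layout. The sketch below works on the dictionary format shown above and assumes `(x1, y1)` is the top-left corner and `(x2, y2)` the bottom-right corner in pixel coordinates; the README does not spell this out. `box_size` and `to_xywh` are illustrative helpers, not IndoxMiner functions.

```python
def box_size(bbox):
    """Return (width, height, area) for an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    width, height = x2 - x1, y2 - y1
    return width, height, width * height


def to_xywh(bbox):
    """Convert [x1, y1, x2, y2] to [x, y, width, height]."""
    x1, y1, x2, y2 = bbox
    return [x1, y1, x2 - x1, y2 - y1]


detection = {"bbox": [10, 20, 110, 220], "class_id": 63, "confidence": 0.87}
print(box_size(detection["bbox"]))  # (100, 200, 20000)
print(to_xywh(detection["bbox"]))   # [10, 20, 100, 200]
```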

🔄 Models and Pre-trained Weights

Pre-trained weights for the selected model are downloaded automatically when you initialize the detector, so no separate download step is needed.


🔧 Core Components

ExtractorSchema

Defines the structure of data to be extracted:

  • Field definitions
  • Validation rules
  • Output format specifications

schema = ExtractorSchema(
    fields=[...],
    output_format="json"
)

Field Types

Supported field types:

  • STRING: Text data
  • INTEGER: Whole numbers
  • FLOAT: Decimal numbers
  • DATE: Date values
  • BOOLEAN: True/False values
  • LIST: Arrays of values
  • DICT: Nested objects
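
To make the field types concrete, here is a minimal sketch of how raw extracted strings might be converted to Python values for the scalar types. It uses only the standard library; the `COERCERS` mapping and the `coerce` helper are hypothetical and do not describe IndoxMiner's internals. LIST and DICT are left out for brevity.

```python
from datetime import date, datetime


def to_float(raw: str) -> float:
    """Parse a number that may carry a currency symbol or thousands separators."""
    return float(raw.replace("$", "").replace(",", "").strip())


def to_bool(raw: str) -> bool:
    """Interpret common yes/no spellings."""
    return raw.strip().lower() in {"true", "yes", "y", "1"}


def to_date(raw: str) -> date:
    """Parse an ISO 8601 date such as 2024-03-15."""
    return datetime.strptime(raw.strip(), "%Y-%m-%d").date()


# Illustrative mapping from field type names to converters (not IndoxMiner internals)
COERCERS = {
    "STRING": str.strip,
    "INTEGER": int,
    "FLOAT": to_float,
    "DATE": to_date,
    "BOOLEAN": to_bool,
}


def coerce(field_type: str, raw: str):
    """Convert a raw extracted string to the Python type for `field_type`."""
    return COERCERS[field_type](raw)


print(coerce("FLOAT", "$2,399.99"))
print(coerce("BOOLEAN", "Yes"))
print(coerce("DATE", "2024-03-15"))
```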

Validation Rules

Available validation options:

  • min_length/max_length: String length constraints
  • min_value/max_value: Numeric bounds
  • pattern: Regex patterns
  • required: Required fields
  • custom: Custom validation functions
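
The options above can be read as independent checks applied to one extracted value. The sketch below illustrates that reading with a standard-library stand-in. `Rule` and `validate` are hypothetical names and are not IndoxMiner's `ValidationRule`; details such as full-string regex matching are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Rule:
    """Illustrative stand-in for a validation rule; not IndoxMiner's ValidationRule."""
    min_length: Optional[int] = None
    max_length: Optional[int] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    pattern: Optional[str] = None
    required: bool = False
    custom: Optional[Callable[[Any], bool]] = None


def validate(value: Any, rule: Rule) -> list[str]:
    """Return a list of human-readable violations (empty when the value passes)."""
    if value is None:
        return ["value is required"] if rule.required else []
    errors = []
    if isinstance(value, str):
        if rule.min_length is not None and len(value) < rule.min_length:
            errors.append(f"shorter than {rule.min_length} characters")
        if rule.max_length is not None and len(value) > rule.max_length:
            errors.append(f"longer than {rule.max_length} characters")
        if rule.pattern is not None and not re.fullmatch(rule.pattern, value):
            errors.append(f"does not match pattern {rule.pattern!r}")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if rule.min_value is not None and value < rule.min_value:
            errors.append(f"below minimum {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            errors.append(f"above maximum {rule.max_value}")
    if rule.custom is not None and not rule.custom(value):
        errors.append("failed custom check")
    return errors


print(validate("M", Rule(min_length=2)))
print(validate(-5.0, Rule(min_value=0)))
print(validate("INV-0042", Rule(pattern=r"INV-\d{4}")))
```

Returning a list of messages instead of raising on the first failure lets a caller report every problem with a field at once.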

⚙️ Configuration Options

ProcessingConfig

config = ProcessingConfig(
    hi_res_pdf=True,          # High-resolution PDF processing
    ocr_enabled=True,         # Enable OCR
    ocr_engine="tesseract",   # OCR engine selection
    chunk_size=1000,          # Text chunk size
    language="en",            # Processing language
    max_threads=4             # Parallel processing threads
)
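
To illustrate what `chunk_size` and `max_threads` control, the sketch below splits text into fixed-size pieces and processes them in a thread pool. It assumes `chunk_size` counts characters and that chunks do not overlap; the README specifies neither, and the library may chunk differently (for example by tokens or sentences). `chunk_text` and `process_chunks` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, chunk_size: int = 1000) -> list[str]:
    """Split text into consecutive pieces of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def process_chunks(chunks: list[str], max_threads: int = 4) -> list[int]:
    """Process chunks in a thread pool; here the 'work' is just counting words."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() preserves input order even though the work runs in parallel
        return list(pool.map(lambda chunk: len(chunk.split()), chunks))


text = "word " * 500  # 2500 characters
chunks = chunk_text(text, chunk_size=1000)
print([len(c) for c in chunks])               # [1000, 1000, 500]
print(process_chunks(chunks, max_threads=4))  # [200, 200, 100]
```

Smaller chunks keep each LLM request short but can split a record across two chunks, so the value is worth tuning per document type.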

🔍 Error Handling

IndoxMiner provides detailed error reporting:

results = await extractor.extract(documents)

if not results.is_valid:
    for chunk_idx, errors in results.validation_errors.items():
        print(f"Errors in chunk {chunk_idx}:")
        for error in errors:
            print(f"- {error.field}: {error.message}")

# Access valid results
valid_data = results.get_valid_results()
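
For logging or auditing, the per-chunk errors can be serialized instead of printed. The sketch below assumes the shape used in the loop above (a mapping from chunk index to a list of errors with `field` and `message` attributes). `FieldError` and `errors_to_json` are hypothetical stand-ins, not IndoxMiner classes.

```python
import json
from dataclasses import dataclass


@dataclass
class FieldError:
    """Hypothetical stand-in for one validation error (field name plus message)."""
    field: str
    message: str


def errors_to_json(validation_errors: dict) -> str:
    """Serialize {chunk_index: [FieldError, ...]} to a JSON string for logging."""
    payload = {
        str(chunk_idx): [{"field": e.field, "message": e.message} for e in errors]
        for chunk_idx, errors in validation_errors.items()
    }
    return json.dumps(payload, indent=2, sort_keys=True)


sample = {0: [FieldError("price", "below minimum 0")], 2: []}
report = errors_to_json(sample)
print(report)
```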

🤝 Contributing

We welcome contributions! To contribute:

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Open a Pull Request

Please read our Contributing Guidelines for more details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

Download files

Source Distribution

indoxminer-0.1.3.tar.gz (67.9 kB view details)

Uploaded Source

Built Distribution

indoxMiner-0.1.3-py3-none-any.whl (81.8 kB view details)

Uploaded Python 3

File details

Details for the file indoxminer-0.1.3.tar.gz.

File metadata

  • Download URL: indoxminer-0.1.3.tar.gz
  • Upload date:
  • Size: 67.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for indoxminer-0.1.3.tar.gz
Algorithm    Hash digest
SHA256       12be65c51dc97c47b2c603db820ce160f5ba737ee13378267fee76fbe8aaf79e
MD5          de13ede8ae20020b022d0043cd305424
BLAKE2b-256  d5a8b4326c558fa07555308d5f1b2d366b3a30afab1b3d40a7a3845cc3856fde

File details

Details for the file indoxMiner-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: indoxMiner-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 81.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.4

File hashes

Hashes for indoxMiner-0.1.3-py3-none-any.whl
Algorithm    Hash digest
SHA256       05716d49cea31beba54beefd8781a20cd12e1fb2b709004f7a8cada15590e4ff
MD5          041c5f7b5ac8821a75bfd9f65f57e853
BLAKE2b-256  26fe72ca438e1301d26e82190ee6c90a73750aac287bfa9f0aa35ea802fb212a
