Indox Data Extraction

IndoxMiner

License: MIT

IndoxMiner is a powerful Python library that leverages Large Language Models (LLMs) for data extraction, advanced object detection, and image classification. It combines schema-based data extraction from unstructured data sources such as text, PDFs, and images, with state-of-the-art object detection and image classification models. IndoxMiner enables seamless automation for document processing, visual recognition, and classification tasks.

🚀 Key Features

  • Data Extraction: Extract structured data from text, PDFs, and images using schema-based extraction and LLMs.
  • Object Detection: Leverage pre-trained object detection models for high-accuracy real-time image recognition.
  • Image Classification: Utilize advanced classification models for identifying objects or features in images.
  • OCR Integration: Extract text from scanned documents or images with integrated OCR engines.
  • Schema-Based Extraction: Define custom schemas for data extraction with validation and type-safety.
  • Multi-Model Support: Supports a wide range of models for detection and classification.
  • Async Support: Built for scalability with asynchronous processing capabilities.
  • Flexible Outputs: Export results to JSON, pandas DataFrames, or custom formats.

📦 Installation

Install IndoxMiner with:

pip install indoxminer

You may also install additional dependencies for object detection and classification, such as YOLOv8 or Detectron2.
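
Before running the detection examples, it can help to check which optional packages are importable. The following is a minimal sketch using only the standard library. The module names `ultralytics` (the package that distributes YOLOv8) and `detectron2` are assumptions about what a given model needs, and `available_modules` is a hypothetical helper, not part of IndoxMiner.

```python
from importlib.util import find_spec


def available_modules(names):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


# Module names below are assumptions; adjust them to the models you plan to use.
status = available_modules(["indoxminer", "ultralytics", "detectron2"])
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```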


📝 Quick Start

1. Data Extraction

IndoxMiner integrates seamlessly with OpenAI models for schema-based extraction from text, PDFs, and images. Here's how you can extract structured data from a document:

Basic Text Extraction

from indoxminer import ExtractorSchema, Field, FieldType, ValidationRule, Extractor, OpenAi

# Initialize OpenAI extractor
# Use a model name available to your OpenAI account (for example "gpt-4o-mini")
llm_extractor = OpenAi(api_key="your-api-key", model="gpt-4o-mini")

# Define extraction schema
schema = ExtractorSchema(
    fields=[
        Field(name="product_name", field_type=FieldType.STRING, rules=ValidationRule(min_length=2)),
        Field(name="price", field_type=FieldType.FLOAT, rules=ValidationRule(min_value=0))
    ]
)

# Create extractor and process text
extractor = Extractor(llm=llm_extractor, schema=schema)
text = """
MacBook Pro 16-inch with M2 chip
Price: $2,399.99
In stock: Yes
"""

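# Top-level `await` works in a notebook; in a script, call this from
# an async function and run it with asyncio.run().
result = await extractor.extract(text)
df = extractor.to_dataframe(result)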
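
To make the validation rules concrete, here is a minimal plain-Python sketch of the kind of checks that `min_length` and `min_value` suggest. It is not IndoxMiner's implementation: `parse_price` and `validate_record` are hypothetical helpers, and the record is written by hand to match the sample text rather than produced by an LLM.

```python
def parse_price(raw: str) -> float:
    """Convert a price string such as '$2,399.99' to a float."""
    return float(raw.replace("$", "").replace(",", "").strip())


def validate_record(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    name = record.get("product_name")
    if not isinstance(name, str) or len(name) < 2:  # mirrors min_length=2
        errors.append("product_name must be a string of at least 2 characters")
    price = record.get("price")
    if not isinstance(price, (int, float)) or price < 0:  # mirrors min_value=0
        errors.append("price must be a non-negative number")
    return errors


record = {
    "product_name": "MacBook Pro 16-inch with M2 chip",
    "price": parse_price("$2,399.99"),
}
print(validate_record(record))  # []
print(validate_record({"product_name": "X", "price": -1.0}))
```

Returning a list of violations, rather than raising on the first one, lets a caller report every problem with a record at once.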

2. Object Detection

IndoxMiner provides powerful object detection capabilities with support for a variety of models, such as YOLO, Detectron2, and DETR.

Supported Models for Object Detection

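| Model | Supported |
|---|---|
| Detectron2 | ✅ |
| DETR | ✅ |
| DETR-CLIP | ✅ |
| GroundingDINO | ✅ |
| Grounded-SAM2 | ✅ |
| Grounded-SAM2-Florence2 | ✅ |
| Kosmos2 | ✅ |
| OWL-ViT | ✅ |
| OWL-V2 | ✅ |
| RT-DETR | ✅ |
| SAM2 | ✅ |
| YOLOv5 | ✅ |
| YOLOv6 | ✅ |
| YOLOv7 | ✅ |
| YOLOv8 | ✅ |
| YOLOv10 | ✅ |
| YOLOv11 | ✅ |
| YOLOX | ✅ |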

Example: Object Detection with YOLOv5

from indoxminer.detection import YOLOv5

# Initialize YOLOv5 model
detector = YOLOv5()

# Detect objects in an image
image_path = "dog-cat-under-sheet.jpg"
outputs = await detector.detect_objects(image_path)
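
Because `detect_objects` is awaited, several images can in principle be processed concurrently. The sketch below shows the pattern with `asyncio.gather`. `StubDetector` is a hypothetical stand-in so the example runs without model weights; its return value is invented for illustration and is not IndoxMiner's output format.

```python
import asyncio


class StubDetector:
    """Hypothetical stand-in for a detector with an async detect_objects method."""

    async def detect_objects(self, image_path: str) -> dict:
        await asyncio.sleep(0.01)  # simulate inference latency
        return {"image": image_path, "detections": []}


async def detect_many(detector, image_paths):
    """Run detection on all images concurrently, preserving input order."""
    tasks = [detector.detect_objects(path) for path in image_paths]
    return await asyncio.gather(*tasks)


paths = ["a.jpg", "b.jpg", "c.jpg"]
results = asyncio.run(detect_many(StubDetector(), paths))
print([r["image"] for r in results])  # ['a.jpg', 'b.jpg', 'c.jpg']
```

Whether a real model gains throughput this way depends on how its `detect_objects` is implemented; a method that blocks on GPU or CPU work inside the coroutine will still run one image at a time.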


You can also switch to other models (e.g., Detectron2, DETR, YOLOv8) by instantiating the corresponding class:

```python
detector = YOLOv8()  # For YOLOv8
```

3. Image Classification

IndoxMiner now supports advanced image classification with models like SigCLIP, ViT, MetaCLIP, MobileCLIP, BioCLIP, AltCLIP, and RemoteCLIP.

Supported Models for Classification

| Model | Description |
|---|---|
| SigCLIP | Semantic image classification model. |
| ViT | Vision Transformer for image classification. |
| MetaCLIP | Meta AI’s advanced CLIP model. |
| MobileCLIP | Mobile-optimized CLIP. |
| BioCLIP | Specialized for biological images. |
| AltCLIP | Alternative CLIP from BAAI. |
| RemoteCLIP | Remote sensing-specific CLIP model. |

Example: Classification with RemoteCLIP

from indoxminer.classification import RemoteCLIPClassifier
from PIL import Image

# Initialize RemoteCLIP
classifier = RemoteCLIPClassifier(
    model_name="ViT-L-14",
    checkpoint_path="/path/to/RemoteCLIP-ViT-L-14.pt"
)

# Classify an image
image = Image.open("/path/to/airport.jpg")
labels = ["An airport", "A university", "A stadium"]
classifier.classify(image, labels, top=3)
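
CLIP-style classifiers commonly score an image against each text label and normalize the scores with a softmax; a `top` argument would then presumably keep the highest-probability labels. The following is a minimal sketch of that post-processing step in plain Python. The similarity scores are made-up numbers, and `softmax` and `top_k_labels` are hypothetical helpers, not part of IndoxMiner.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(labels, scores, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(
        zip(labels, softmax(scores)), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:k]


labels = ["An airport", "A university", "A stadium"]
scores = [24.1, 18.3, 20.7]  # made-up image-text similarity logits
for label, prob in top_k_labels(labels, scores, k=3):
    print(f"{label}: {prob:.3f}")
```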

Example: Classification with SigCLIP

from indoxminer.classification import SigCLIPClassifier
from PIL import Image

# Initialize SigCLIP
classifier = SigCLIPClassifier()

# Classify an image with default labels
image = Image.open("/path/to/image.jpg")
classifier.classify(image)

🔧 Core Components

Classification Models

  • Flexible Input: Supports single or batch image classification.
  • Custom Labels: Define your own labels for classification tasks.
  • Visualization: Generates bar plots of predicted probabilities.

🔍 Error Handling

IndoxMiner provides detailed error reporting for data extraction, object detection, and classification tasks. Wrap calls in a try/except block to handle failures:

try:
    results = await extractor.extract(documents)
except Exception as e:
    print(f"An error occurred: {e}")
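
When processing several documents, it is often useful to keep the successful results even if one document fails. Below is a minimal sketch of that pattern, assuming an extractor with an awaitable `extract` method. `StubExtractor` and its failure condition are hypothetical and exist only so the example runs offline.

```python
import asyncio


class StubExtractor:
    """Hypothetical stand-in: fails on empty input, otherwise returns a record."""

    async def extract(self, text: str) -> dict:
        if not text.strip():
            raise ValueError("empty document")
        return {"length": len(text)}


async def extract_all(extractor, documents):
    """Extract every document; collect failures instead of aborting the batch."""
    outcomes = await asyncio.gather(
        *(extractor.extract(doc) for doc in documents),
        return_exceptions=True,
    )
    successes = [o for o in outcomes if not isinstance(o, Exception)]
    failures = [(i, o) for i, o in enumerate(outcomes) if isinstance(o, Exception)]
    return successes, failures


documents = ["MacBook Pro", "   ", "Price: $10"]
successes, failures = asyncio.run(extract_all(StubExtractor(), documents))
print(successes)  # [{'length': 11}, {'length': 10}]
for index, error in failures:
    print(f"document {index} failed: {error}")
```

Passing `return_exceptions=True` makes `asyncio.gather` return raised exceptions as values, so one bad document does not cancel the rest of the batch.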

🤝 Contributing

We welcome contributions! Please follow the standard workflow:

  1. Fork the repository.
  2. Create a feature branch.
  3. Commit your changes.
  4. Push to the branch.
  5. Open a Pull Request.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

