🧠 Intelliparse

Smart File Parsing & Content Extraction Made Simple

Intelliparse is your all-in-one solution to extract text, images, tables, and metadata from various file formats - from common documents to complex CAD drawings. Powered by AI for intelligent content understanding. 🚀

import asyncio

from intelliparse.parsers import FileParser
from intelliparse.types import RawFile


async def main() -> None:
    # Parse any file with AI-powered insights
    file = RawFile.from_path("contract.pdf")
    parser = FileParser(strategy="high")
    parsed = await parser.parse_async(file)

    print(f"🔍 Found {len(parsed.sections)} sections!")
    print(f"📄 Text: {parsed.sections[0].text[:200]}...")


asyncio.run(main())

🌟 Features

  • Common File Formats - PDF, DOCX, PPT, Images, Audio, Video, CAD, and more
  • AI-Powered Insights - Automatic image descriptions, audio transcriptions, and content analysis
  • Military-Grade Extraction (WIP) - Text, tables, images, metadata, and document structure
  • Easy Extension - Add custom parsers in <10 lines of code

📦 Installation

# Install core library
pip install intelliparse

# Install system dependencies (choose your OS)
# Ubuntu/Debian
sudo apt-get install libmagic1
# macOS
brew install libmagic
# Windows (via Chocolatey)
choco install magic
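
After installing the system dependency, it can help to confirm that the libmagic shared library is discoverable before parsing anything. The sketch below uses only the Python standard library. `libmagic_available` is a hypothetical helper name, not part of intelliparse, and the check assumes the library sits on the normal system search path under the name "magic" (on Windows the DLL may be named differently, so a negative result there is not conclusive).

```python
from ctypes.util import find_library


def libmagic_available() -> bool:
    """Return True if a shared library named "magic" can be located."""
    # find_library returns a name or path string, or None if nothing is found.
    return find_library("magic") is not None


if __name__ == "__main__":
    status = "found" if libmagic_available() else "not found"
    print(f"libmagic: {status}")
```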

🚀 Basic Usage

Parse Any File

file = RawFile.from_bytes(b"file content", "secret_data.xlsx")
parsed = await FileParser().parse_async(file) # ParsedFile

for section in parsed.sections:
    print(f"Section {section.number}:")
    print(f"- Text: {section.text[:100]}...")
    print(f"- Found {len(section.images)} images!")
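
The snippet above uses `await`, which only works inside a coroutine or an async-aware shell. For a plain script, a small synchronous wrapper is one option. This is a minimal sketch: `parse_sync` is a hypothetical helper, not part of intelliparse, and it assumes only that the parser exposes the `parse_async` method shown above.

```python
import asyncio
from typing import Any


def parse_sync(parser: Any, file: Any) -> Any:
    """Run parser.parse_async(file) to completion and return its result.

    Hypothetical helper for scripts that have no running event loop.
    """
    # asyncio.run creates an event loop, runs the coroutine, then closes the loop.
    return asyncio.run(parser.parse_async(file))
```

With intelliparse installed, this would be called as `parsed = parse_sync(FileParser(), file)`. Note that `asyncio.run` raises `RuntimeError` when an event loop is already running (for example in a notebook), so use `await` directly there.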

Extract Tables

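# TablePageItem comes from intelliparse; its import path is not shown here.
table_data = parsed.sections[0].items[0]
if isinstance(table_data, TablePageItem):
    print("📊 Perfect Table Found!")
    # Show only the first three lines of the CSV export
    print("\n".join(table_data.csv.split("\n")[:3]))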
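
Splitting the CSV text on "\n" is fine for a quick preview, but it breaks on quoted cells that contain commas or newlines. As a sketch, the standard `csv` module can turn that text into rows. `csv_text_to_rows` is a hypothetical helper name, and the only assumption is that `table_data.csv` is a string of CSV text, as the snippet above suggests.

```python
import csv
import io


def csv_text_to_rows(csv_text: str) -> list[list[str]]:
    """Parse CSV text into a list of rows, each a list of cell strings."""
    # newline="" leaves line endings untouched so csv.reader can handle
    # quoted cells that span several lines.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    return [row for row in reader]
```

It would be used as `rows = csv_text_to_rows(table_data.csv)`, after which `rows[0]` is typically the header row.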

🔍 Advanced Usage

AI-Powered Parsing

from intelliparse.parsers import FileParser
from intelliparse.types import RawFile
from intellibricks.agents import Agent
from intellibricks.llms import TextTranscriptionSynapse, Synapse
from intellibricks.llms.types import (
    GenerationConfig,
    ChainOfThought,
    VisualMediaDescription,
    AudioDescription
)

# Use AI to describe images and transcribe audio
parser = FileParser(
    strategy="high",
    visual_description_agent=Agent(
        task="Detailed description of visual elements.",
        instructions=[
            "Describe the provided visual elements in a "
            "detailed manner, following the instructions. "
            "Descriptions must be in Portuguese.",
        ],
        metadata={
            "name": "Visual Elements Descriptor",
            "description": "Description of visual elements in Portuguese.",
        },
        synapse=Synapse.of("google/genai/gemini-1.5-flash"),
        response_model=ChainOfThought[VisualMediaDescription],
        output_language="en",
        generation_config=GenerationConfig(timeout=60, max_retries=1),
    ),
    audio_description_agent=Agent(
        task="Audio transcription",
        instructions=[
            "Transcribe the provided audio in a "
            "clear and precise manner, following the instructions. "
            "Transcriptions must be in Portuguese.",
        ],
        metadata={
            "name": "Audio Transcriber",
            "description": "Audio transcription in Portuguese.",
        },
        synapse=Synapse.of("google/genai/gemini-1.5-flash"),
        audio_transcriptions_synapse=TextTranscriptionSynapse.of(
            "groq/api/whisper-large-v3-turbo"
        ),
        response_model=ChainOfThought[AudioDescription],
    ),
)

parsed = await parser.parse_async(RawFile.from_path("presentation.mp4"))
print(f"📽 Video Description: {parsed.md}")

📚 Supported Formats

| Category | Formats |
|---|---|
| Documents | PDF, DOCX, PPTX, XLSX, TXT, XML |
| Images | PNG, JPG, TIFF, BMP, GIF, SVG, WEBP |
| Audio/Video | MP3, WAV, FLAC, AAC, MP4, AVI, MOV |
| CAD/Design | DWG |
| Archives | ZIP, RAR, 7Z, TAR, GZ |
| Specialized | PKT (Cisco, TODO) |
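
To check up front whether a file falls into one of the categories above, a simple extension lookup can be sketched as follows. The mapping is copied by hand from the table and is not read from intelliparse itself. `SUPPORTED_FORMATS` and `category_for` are hypothetical names, and PKT is left out because it is marked TODO.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the format table; not queried from intelliparse.
SUPPORTED_FORMATS = {
    "Documents": {"pdf", "docx", "pptx", "xlsx", "txt", "xml"},
    "Images": {"png", "jpg", "tiff", "bmp", "gif", "svg", "webp"},
    "Audio/Video": {"mp3", "wav", "flac", "aac", "mp4", "avi", "mov"},
    "CAD/Design": {"dwg"},
    "Archives": {"zip", "rar", "7z", "tar", "gz"},
}


def category_for(filename: str) -> Optional[str]:
    """Return the table category for a filename, or None if it is not listed."""
    # Only the last suffix is used, so "backup.tar.gz" is looked up as "gz".
    ext = Path(filename).suffix.lstrip(".").lower()
    for category, extensions in SUPPORTED_FORMATS.items():
        if ext in extensions:
            return category
    return None
```

This only inspects the file name. It says nothing about whether the content actually matches the extension.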

🤝 Contributing

We welcome contributors! To get started:

git clone https://github.com/arthurbrenno/intelliparse.git
cd intelliparse
uv sync

Run tests (TODO; this is how it is expected to work):

pytest tests/ --verbose

📜 License

Apache 2.0 - Made with ❤️ by Arthur Brenno

