🧠 Intelliparse

Smart File Parsing & Content Extraction Made Simple

Intelliparse is your all-in-one solution to extract text, images, tables, and metadata from various file formats - from common documents to complex CAD drawings. Powered by AI for intelligent content understanding. 🚀

import asyncio

from intelliparse.parsers import FileParser
from intelliparse.types import RawFile


async def main() -> None:
    # Parse any file with AI-powered insights
    file = RawFile.from_path("contract.pdf")
    parser = FileParser()
    parsed = await parser.parse_async(file)

    print(f"🔍 Found {len(parsed.sections)} sections!")
    print(f"📄 Text: {parsed.sections[0].text[:200]}...")


# `await` is only valid inside a coroutine, so drive main() with asyncio.run
asyncio.run(main())

🌟 Features

  • Common File Formats supported (PDF, DOCX, PPTX, Images, Audio, Video, CAD, and more)
  • AI-Powered Insights - Automatic image descriptions, audio transcriptions, and content analysis
  • Military-Grade Extraction (WIP) - Text, tables, images, metadata, and document structure
  • Easy Extension - Add custom parsers in <10 lines of code

📦 Installation

# Install core library
pip install intelliparse

# Install system dependencies (choose your OS)
# Ubuntu/Debian
sudo apt-get install libmagic1
# macOS
brew install libmagic
# Windows (via Chocolatey)
choco install magic
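
Before parsing, it can help to confirm that the system library installed above is visible to Python. The following is a minimal sketch using only the standard library; the helper name `libmagic_available` is hypothetical and not part of Intelliparse, and on Windows the lookup may return nothing even when the library is installed in a non-standard location.

```python
import ctypes.util


def libmagic_available() -> bool:
    """Return True if the dynamic loader can locate a libmagic shared library."""
    # find_library returns a name or path string, or None when nothing is found.
    return ctypes.util.find_library("magic") is not None


if __name__ == "__main__":
    if libmagic_available():
        print("libmagic found")
    else:
        print("libmagic not found; install it with your OS package manager")
```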

🚀 Basic Usage

Parse Any File

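# Run this inside a coroutine: `await` only works in an async context.
# b"file content" is a placeholder; pass the real bytes of your file.
file = RawFile.from_bytes(b"file content", "secret_data.xlsx")
parsed = await FileParser().parse_async(file)  # returns a ParsedFile

for section in parsed.sections:
    print(f"Section {section.number}:")
    print(f"- Text: {section.text[:100]}...")
    print(f"- Found {len(section.images)} images!")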
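
Because parsing is exposed as a coroutine, several files can be processed concurrently. The sketch below illustrates one common pattern, `asyncio.gather` with a semaphore to cap the number of calls in flight. It uses a stand-in coroutine, `fake_parse`, so that it runs without the library installed; `fake_parse` and `parse_many` are hypothetical names, and in real code the stand-in would presumably be replaced by a call such as `FileParser().parse_async(...)`.

```python
import asyncio


async def fake_parse(name: str) -> str:
    """Stand-in for an async parse call; sleeps briefly instead of reading a file."""
    await asyncio.sleep(0.01)
    return f"parsed:{name}"


async def parse_many(names: list[str], limit: int = 4) -> list[str]:
    """Run fake_parse over all names with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(name: str) -> str:
        # The semaphore blocks here once `limit` calls are already running.
        async with semaphore:
            return await fake_parse(name)

    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(guarded(n) for n in names))


if __name__ == "__main__":
    print(asyncio.run(parse_many(["a.pdf", "b.docx", "c.xlsx"])))
```

Capping concurrency is a design choice worth considering when AI-backed description or transcription is enabled, since each file may trigger remote calls.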

Extract Tables

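# `parsed` is the ParsedFile from the previous snippet.
# Assumption: TablePageItem lives in intelliparse.types; adjust if the import fails.
from intelliparse.types import TablePageItem

table_data = parsed.sections[0].items[0]
if isinstance(table_data, TablePageItem):
    print("📊 Perfect Table Found!")
    print("\n".join(table_data.csv.split("\n")[:3]))  # first three CSV lines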
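
Since the table is exposed as CSV text, it can be turned into structured rows with the standard `csv` module. This is a minimal sketch under the assumption that the first CSV line is a header row; `csv_to_rows` and `sample_csv` are hypothetical names, and the sample string merely stands in for the `csv` value of a table item.

```python
import csv
import io


def csv_to_rows(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text whose first row is a header into a list of dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Hypothetical sample standing in for the `csv` string of a table item.
sample_csv = "item,quantity\nwidget,3\ngadget,5\n"
rows = csv_to_rows(sample_csv)
print(rows[0]["item"], rows[0]["quantity"])  # widget 3
```

All values come back as strings; convert numeric columns explicitly if needed.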

🔍 Advanced Usage

AI-Powered Parsing

from intellibricks.agents import Agent
from intellibricks.llms import TextTranscriptionSynapse, Synapse
from intelliparse.parsers import FileParser
from intelliparse.types import RawFile
from intellibricks.llms.types import (
    GenerationConfig,
    ChainOfThought,
    VisualMediaDescription,
    AudioDescription
)

# Use AI to describe images and transcribe audio
parser = FileParser(
    strategy="high",
    visual_description_agent=Agent(
        task="Detailed description of visual elements.",
        instructions=[
            # Adjacent string literals are concatenated, so each piece ends with a space
            "Describe the provided visual elements in a "
            "detailed manner, following the instructions. "
            "Descriptions must be in Portuguese.",
        ],
        metadata={
            "name": "Visual Elements Descriptor",
            "description": "Description of visual elements in Portuguese.",
        },
        synapse=Synapse.of("google/genai/gemini-1.5-flash"),
        response_model=ChainOfThought[VisualMediaDescription],
        output_language="en",
        generation_config=GenerationConfig(timeout=60, max_retries=1),
    ),
    audio_description_agent=Agent(
        task="Audio transcription",
        instructions=[
            # Adjacent string literals are concatenated, so each piece ends with a space
            "Transcribe the provided audio in a "
            "clear and precise manner, following the instructions. "
            "Transcriptions must be in Portuguese.",
        ],
        metadata={
            "name": "Audio Transcriber",
            "description": "Audio transcription in Portuguese.",
        },
        synapse=Synapse.of("google/genai/gemini-1.5-flash"),
        audio_transcriptions_synapse=TextTranscriptionSynapse.of(
            "groq/api/whisper-large-v3-turbo"
        ),
        response_model=ChainOfThought[AudioDescription],
    ),
)

# Run this inside a coroutine: `await` only works in an async context.
parsed = await parser.parse_async(RawFile.from_path("presentation.mp4"))
print(f"📽 Video Description: {parsed.md}")

📚 Supported Formats

Category    | Formats
Documents   | PDF, DOCX, PPTX, XLSX, TXT, XML
Images      | PNG, JPG, TIFF, BMP, GIF, SVG, WEBP
Audio/Video | MP3, WAV, FLAC, AAC, MP4, AVI, MOV
CAD/Design  | DWG
Archives    | ZIP, RAR, 7Z, TAR, GZ
Specialized | PKT (Cisco, TODO)
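
To check up front whether a file falls into one of the categories listed above, a simple extension lookup is enough. This is only an illustrative sketch: `FORMATS` and `category_for` are hypothetical names, the mapping is copied from the list above (PKT is left out because it is marked TODO), and it is not a description of how Intelliparse itself detects file types.

```python
from pathlib import Path
from typing import Optional

# Categories and extensions copied from the supported formats list.
FORMATS = {
    "Documents": {"pdf", "docx", "pptx", "xlsx", "txt", "xml"},
    "Images": {"png", "jpg", "tiff", "bmp", "gif", "svg", "webp"},
    "Audio/Video": {"mp3", "wav", "flac", "aac", "mp4", "avi", "mov"},
    "CAD/Design": {"dwg"},
    "Archives": {"zip", "rar", "7z", "tar", "gz"},
}


def category_for(filename: str) -> Optional[str]:
    """Return the category for a file name based on its last extension, or None."""
    # Path.suffix yields only the final extension, e.g. ".gz" for "data.tar.gz".
    ext = Path(filename).suffix.lower().lstrip(".")
    for category, extensions in FORMATS.items():
        if ext in extensions:
            return category
    return None


print(category_for("contract.pdf"))  # Documents
```

An extension check is cheap but can be fooled by renamed files; content-based detection is more robust when it matters.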

🤝 Contributing

We welcome contributors! To get started:

git clone https://github.com/arthurbrenno/intelliparse.git
cd intelliparse
uv sync

Run tests (TODO: the test suite is not in place yet; once it is, it will run like this):

pytest tests/ --verbose

📜 License

Apache 2.0 - Made with ❤️ by Arthur Brenno

