🧠 Intelliparse

Smart File Parsing & Content Extraction Made Simple

Intelliparse is your all-in-one solution to extract text, images, tables, and metadata from various file formats - from common documents to complex CAD drawings. Powered by AI for intelligent content understanding. 🚀

import asyncio

from intelliparse.parsers import FileParser
from intelliparse.types import RawFile


async def main() -> None:
    # Parse any file with AI-powered insights
    file = RawFile.from_path("contract.pdf")
    parser = FileParser()
    parsed = await parser.parse_async(file)

    print(f"🔍 Found {len(parsed.sections)} sections!")
    print(f"📄 Text: {parsed.sections[0].text[:200]}...")


# parse_async is a coroutine, so it needs an event loop to run
asyncio.run(main())

🌟 Features

  • Common File Formats supported (PDF, DOCX, PPT, Images, Audio, Video, CAD, and more)
  • AI-Powered Insights - Automatic image descriptions, audio transcriptions, and content analysis
  • Military-Grade Extraction (WIP) - Text, tables, images, metadata, and document structure
  • Easy Extension - Add custom parsers in <10 lines of code

📦 Installation

# Install core library
pip install intelliparse

# Install system dependencies (choose your OS)
# Ubuntu/Debian
sudo apt-get install libmagic1
# macOS
brew install libmagic
# Windows (via Chocolatey)
choco install magic
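
If you are unsure whether the system dependency was picked up, a quick standard-library check can help. The sketch below only asks the OS loader whether a library named `magic` is discoverable. The helper name `libmagic_available` is made up for illustration and is not part of Intelliparse, and on Windows the lookup may miss a DLL that is installed under a different name.

```python
from ctypes.util import find_library


def libmagic_available() -> bool:
    """Return True if the OS loader can locate a library named 'magic'."""
    # find_library returns a name or path string when found, otherwise None.
    return find_library("magic") is not None


if __name__ == "__main__":
    status = "found" if libmagic_available() else "not found"
    print(f"libmagic: {status}")
```

A "not found" result does not prove the install failed; it only means the default search paths did not turn it up.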

🚀 Basic Usage

Parse Any File

# Reuses the imports above; run inside an async function (e.g. via asyncio.run)
file = RawFile.from_bytes(b"file content", "secret_data.xlsx")
parsed = await FileParser().parse_async(file)  # returns a ParsedFile

for section in parsed.sections:
    print(f"Section {section.number}:")
    print(f"- Text: {section.text[:100]}...")
    print(f"- Found {len(section.images)} images!")

Extract Tables

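# Assumption: TablePageItem is exported from intelliparse.types; adjust the path if it differs
from intelliparse.types import TablePageItem

table_data = parsed.sections[0].items[0]
if isinstance(table_data, TablePageItem):
    print("📊 Perfect Table Found!")
    # Preview the first three CSV rows
    print("\n".join(table_data.csv.split("\n")[:3]))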
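
Splitting on newlines is fine for a quick preview, but it breaks if a cell contains a quoted line break. Assuming `table_data.csv` is a plain CSV string (as the snippet above suggests), the standard `csv` module can turn it into rows. The sample string and the helper name `csv_text_to_rows` are illustrative stand-ins, not Intelliparse API.

```python
import csv
import io
from typing import List


def csv_text_to_rows(csv_text: str) -> List[List[str]]:
    """Parse CSV text into a list of rows, honoring quoted fields."""
    # newline="" leaves line endings untouched so the csv module can
    # handle line breaks that appear inside quoted cells.
    return list(csv.reader(io.StringIO(csv_text, newline="")))


# Stand-in for `table_data.csv`; a real value would come from a parsed file.
sample = 'name,notes\nAlice,"line one\nline two"\nBob,ok\n'

rows = csv_text_to_rows(sample)
header, body = rows[0], rows[1:]
print(header)
print(f"{len(body)} data rows")
```

From here the rows can be handed to whatever tabular tooling you already use.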

🔍 Advanced Usage

AI-Powered Parsing

from intellibricks.agents import Agent
from intellibricks.llms import TextTranscriptionSynapse, Synapse
from intellibricks.llms.types import (
    GenerationConfig,
    ChainOfThought,
    VisualMediaDescription,
    AudioDescription
)

# Use AI to describe images and transcribe audio
parser = FileParser(
    strategy="high",
    visual_description_agent=Agent(
        task="Detailed description of visual elements.",
        instructions=[
            "Describe the provided visual elements in a "
            "detailed manner, following the instructions. "
            "Descriptions must be in Portuguese.",
        ],
        metadata={
            "name": "Visual Elements Descriptor",
            "description": "Description of visual elements in Portuguese.",
        },
        synapse=Synapse.of("google/genai/gemini-1.5-flash"),
        response_model=ChainOfThought[VisualMediaDescription],
        output_language="en",
        generation_config=GenerationConfig(timeout=60, max_retries=1),
    ),
    audio_description_agent=Agent(
        task="Audio transcription",
        instructions=[
            "Transcribe the provided audio in a "
            "clear and precise manner, following the instructions. "
            "Transcriptions must be in Portuguese.",
        ],
        metadata={
            "name": "Audio Transcriber",
            "description": "Audio transcription in Portuguese.",
        },
        synapse=Synapse.of("google/genai/gemini-1.5-flash"),
        audio_transcriptions_synapse=TextTranscriptionSynapse.of(
            "groq/api/whisper-large-v3-turbo"
        ),
        response_model=ChainOfThought[AudioDescription],
    ),
)

parsed = await parser.parse_async(RawFile.from_path("presentation.mp4"))
print(f"📽 Video Description: {parsed.md}")
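
When several files need parsing, the async API lends itself to running them concurrently while capping how many are in flight, which matters once each parse triggers model calls. This is a minimal sketch of that pattern, not Intelliparse code: `StubParser` is a hypothetical stand-in for `FileParser` so the example runs without the library, and `parse_many` and its `limit` parameter are names invented here.

```python
import asyncio


class StubParser:
    """Hypothetical stand-in for FileParser; only mimics parse_async."""

    async def parse_async(self, name: str) -> str:
        await asyncio.sleep(0.01)  # simulate I/O-bound work
        return f"parsed:{name}"


async def parse_many(parser, names, limit: int = 2):
    """Parse all names concurrently, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def parse_one(name):
        async with semaphore:
            return await parser.parse_async(name)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(parse_one(n) for n in names))


results = asyncio.run(parse_many(StubParser(), ["a.pdf", "b.mp4", "c.docx"]))
print(results)
```

With the real library you would pass a `FileParser` and `RawFile` objects instead of the stub and plain names, assuming `parse_async` behaves as shown in the examples above.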

📚 Supported Formats

| Category    | Formats                                |
|-------------|----------------------------------------|
| Documents   | PDF, DOCX, PPTX, XLSX, TXT, XML        |
| Images      | PNG, JPG, TIFF, BMP, GIF, SVG, WEBP    |
| Audio/Video | MP3, WAV, FLAC, AAC, MP4, AVI, MOV     |
| CAD/Design  | DWG                                    |
| Archives    | ZIP, RAR, 7Z, TAR, GZ                  |
| Specialized | PKT (Cisco, TODO)                      |
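
For a cheap pre-check before handing a file to the parser, the table can be mirrored as a lookup keyed by file extension. This is only a sketch: the `SUPPORTED` mapping and `category_for` helper are invented for illustration, the extension lists are copied from the table (PKT is left out because it is marked TODO), and this is not how Intelliparse itself detects file types.

```python
from pathlib import Path
from typing import Optional

# Categories and extensions copied from the Supported Formats table.
SUPPORTED = {
    "Documents": {"pdf", "docx", "pptx", "xlsx", "txt", "xml"},
    "Images": {"png", "jpg", "tiff", "bmp", "gif", "svg", "webp"},
    "Audio/Video": {"mp3", "wav", "flac", "aac", "mp4", "avi", "mov"},
    "CAD/Design": {"dwg"},
    "Archives": {"zip", "rar", "7z", "tar", "gz"},
}


def category_for(filename: str) -> Optional[str]:
    """Return the table category for a filename, or None if unlisted."""
    # Only the final suffix is considered, compared case-insensitively.
    ext = Path(filename).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED.items():
        if ext in extensions:
            return category
    return None


print(category_for("contract.PDF"))
print(category_for("backup.tar.gz"))
print(category_for("lab.pkt"))
```

An extension check is a hint, not proof of content; a renamed file will still pass it.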

🤝 Contributing

We welcome contributors! To get started:

git clone https://github.com/arthurbrenno/intelliparse.git
cd intelliparse
uv sync

Run tests (TODO; once the test suite exists, it will run like this):

pytest tests/ --verbose

📜 License

Apache 2.0 - Made with ❤️ by Arthur Brenno

