🧠 Intelliparse

Smart File Parsing & Content Extraction Made Simple

Intelliparse is your all-in-one solution to extract text, images, tables, and metadata from various file formats - from common documents to complex CAD drawings. Powered by AI for intelligent content understanding. 🚀

from intelliparse.parsers import FileParser
from intelliparse.types import RawFile

# Parse any file with AI-powered insights
file = RawFile.from_path("contract.pdf")
parser = FileParser()
parsed_file = parser.parse(file)

print(f"🔍 Found {len(parsed_file.sections)} sections!")
print(f"📄 Text: {parsed_file.sections[0].text[:200]}...")

🌟 Features

  • Common File Formats supported (PDF, DOCX, PPT, Images, Audio, Video, CAD, and more)
  • AI-Powered Insights - Automatic image descriptions, audio transcriptions, and content analysis
  • Comprehensive Extraction (WIP) - Text, tables, images, metadata, and document structure
  • Easy Extension - Add custom parsers in <10 lines of code

📦 Installation

# Install core library
pip install intelliparse

# Install system dependencies (choose your OS)
# Ubuntu/Debian
sudo apt-get install libmagic1
# macOS
brew install libmagic
# Windows (via Chocolatey)
choco install magic
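
After installing the system dependency, it can be useful to confirm that the libmagic shared library is discoverable from Python. The following is a minimal sketch using only the standard library; the helper name `libmagic_available` is hypothetical and not part of Intelliparse, and the lookup may miss libraries installed in non-standard locations.

```python
from ctypes.util import find_library


def libmagic_available() -> bool:
    """Return True if the dynamic linker can locate a libmagic library."""
    # find_library returns a library name or path string, or None if not found.
    return find_library("magic") is not None


if __name__ == "__main__":
    status = "found" if libmagic_available() else "not found"
    print(f"libmagic: {status}")
```

If this reports "not found" even though the package is installed, the library may live outside the default search path for your platform.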

🚀 Basic Usage

Parse Any File

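from intelliparse.parsers import FileParser
from intelliparse.types import RawFile

# Build a RawFile from in-memory bytes (placeholder content shown here)
file = RawFile.from_bytes(b"file content", "secret_data.xlsx")
parsed = FileParser().parse(file)  # returns a ParsedFile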

for section in parsed.sections:
    print(f"Section {section.number}:")
    print(f"- Text: {section.text[:100]}...")
    print(f"- Found {len(section.images)} images!")
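
The loop above reads three attributes from each section: `number`, `text`, and `images`. As a rough sketch, the same traversal can collect a summary per section instead of printing. The helper `summarize_sections` and the stand-in objects are hypothetical; they only assume those three attributes exist, and the sketch runs without Intelliparse installed.

```python
from types import SimpleNamespace


def summarize_sections(sections, preview_chars=100):
    """Build one summary dict per section from its number, text, and images."""
    summaries = []
    for section in sections:
        text = section.text or ""  # tolerate sections with no text
        summaries.append(
            {
                "number": section.number,
                "preview": text[:preview_chars],
                "image_count": len(section.images),
            }
        )
    return summaries


# Stand-in objects that mimic the attributes used in the loop above.
demo_sections = [
    SimpleNamespace(number=1, text="Quarterly revenue by region", images=[]),
    SimpleNamespace(number=2, text="", images=["chart.png"]),
]

for summary in summarize_sections(demo_sections):
    print(summary)
```

With a real parse result, `summarize_sections(parsed.sections)` should work the same way, provided the section objects expose those attributes as the example suggests.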

Extract Tables

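# Assumption: TablePageItem is importable from intelliparse.types; adjust if it lives elsewhere
from intelliparse.types import TablePageItem

# `parsed` is the ParsedFile from the previous example
table_data = parsed.sections[0].items[0]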
if isinstance(table_data, TablePageItem):
    print("📊 Perfect Table Found!")
    print("\n".join(table_data.csv.split("\n")[:3]))
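
The snippet above treats `table_data.csv` as a CSV string and splits it on newlines, which would break rows that contain quoted line breaks or commas. A minimal sketch of parsing such a string with the standard `csv` module follows; `csv_to_rows` and `sample_csv` are hypothetical names, and the sample text stands in for a real `table_data.csv` value.

```python
import csv
import io


def csv_to_rows(csv_text):
    """Parse CSV text into a list of rows, each a list of cell strings."""
    # newline="" lets the csv module handle quoted line breaks itself.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    return [row for row in reader if row]  # drop blank lines


sample_csv = 'name,amount\n"Acme, Inc.",1200\nGlobex,950\n'
header, *body = csv_to_rows(sample_csv)
print(header)
for row in body:
    print(dict(zip(header, row)))
```

Assuming the attribute really is a plain CSV string, `csv_to_rows(table_data.csv)` would give rows that are safe to index by column.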

🔍 Advanced Usage

AI-Powered Parsing

from intelliparse.parsers import FileParser
from intelliparse.types import RawFile
from intellibricks.agents import Agent
from intellibricks.llms import TextTranscriptionSynapse, Synapse
from intellibricks.llms.types import (
    GenerationConfig,
    ChainOfThought,
    VisualMediaDescription,
    AudioDescription
)

# Use AI to describe images and transcribe audio
parser = FileParser(
    strategy="high",
    visual_description_agent=Agent(
        task="Detailed description of visual elements.",
        instructions=[
            "Describe the provided visual elements in a "
            "detailed manner, following the instructions. "
            "Descriptions must be in Portuguese.",
        ],
        metadata={
            "name": "Visual Elements Descriptor",
            "description": "Description of visual elements in Portuguese.",
        },
        synapse=Synapse.of("google/genai/gemini-1.5-flash"),
        response_model=ChainOfThought[VisualMediaDescription],
        output_language="en",
        generation_config=GenerationConfig(timeout=60, max_retries=1),
    ),
    audio_description_agent=Agent(
        task="Audio transcription",
        instructions=[
            "Transcribe the provided audio in a "
            "clear and precise manner, following the instructions. "
            "Transcriptions must be in Portuguese.",
        ],
        metadata={
            "name": "Audio Transcriber",
            "description": "Audio transcription in Portuguese.",
        },
        synapse=Synapse.of("google/genai/gemini-1.5-flash"),
        audio_transcriptions_synapse=TextTranscriptionSynapse.of(
            "groq/api/whisper-large-v3-turbo"
        ),
        response_model=ChainOfThought[AudioDescription],
    ),
)

parsed = parser.parse(RawFile.from_path("presentation.mp4"))
print(f"📽 Video Description: {parsed.md}")

📚 Supported Formats

| Category | Formats |
|---|---|
| Documents | PDF, DOCX, PPTX, XLSX, TXT, XML |
| Images | PNG, JPG, TIFF, BMP, GIF, SVG, WEBP |
| Audio/Video | MP3, WAV, FLAC, AAC, MP4, AVI, MOV |
| CAD/Design | DWG |
| Archives | ZIP, RAR, 7Z, TAR, GZ |
| Specialized | PKT (Cisco - TODO) |
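
As an illustration of how the table above might be used, here is a small sketch that checks a filename's extension against the listed formats before handing the file to a parser. The names `FORMATS_BY_CATEGORY` and `category_for` are hypothetical and not part of Intelliparse, and how the library itself detects file types is not described here; this is only an extension-based pre-check.

```python
from pathlib import Path

# Categories and extensions copied from the table above (PKT omitted: marked TODO).
FORMATS_BY_CATEGORY = {
    "Documents": {"pdf", "docx", "pptx", "xlsx", "txt", "xml"},
    "Images": {"png", "jpg", "tiff", "bmp", "gif", "svg", "webp"},
    "Audio/Video": {"mp3", "wav", "flac", "aac", "mp4", "avi", "mov"},
    "CAD/Design": {"dwg"},
    "Archives": {"zip", "rar", "7z", "tar", "gz"},
}


def category_for(filename):
    """Return the table category for a filename, or None if it is not listed."""
    # Only the final suffix is considered, so "backup.tar.gz" is matched as "gz".
    extension = Path(filename).suffix.lower().lstrip(".")
    for category, extensions in FORMATS_BY_CATEGORY.items():
        if extension in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ("contract.pdf", "presentation.mp4", "notes.md"):
        print(name, "->", category_for(name))
```

A `None` result only means the extension is not in the table; it says nothing about whether a parser could still handle the content.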

🤝 Contributing

We welcome contributors! To get started:

git clone https://github.com/arthurbrenno/intelliparse.git
cd intelliparse
uv sync

Run tests (TODO: the test suite is not available yet; once it is, it will run like this):

pytest tests/ --verbose

📜 License

Apache 2.0 - Made with ❤️ by Arthur Brenno

