Parse PDF files using different parsers.

ParseStudio

ParseStudio is a Python library for extracting text, tables, and images from PDF documents. It exposes a single interface over several parsing backends, so you can switch parsers without changing the rest of your code.


Key Features

  • Modular Design: Choose between multiple parser backends (DoclingParser, PyMuPDFParser, LlamaParser) to suit your needs.
  • Multimodal Parsing: Extract text, tables, and images seamlessly.
  • Extensible: Easily integrate custom parsers or adjust parsing behavior with additional parameters.

Installation

Install ParseStudio from PyPI with pip:

pip install parsestudio

Alternatively, install the library from source by cloning the repository and running:

git clone https://github.com/chatclimate-ai/ParseStudio.git
cd ParseStudio
pip install .
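
After either install route, it can be useful to confirm which version ended up in the active environment. The sketch below uses only the standard library's importlib.metadata; the helper name installed_version is hypothetical and not part of ParseStudio.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "parsestudio") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"parsestudio {found}" if found else "parsestudio is not installed")
```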

Quick Start

1. Import and Initialize the Parser

from parsestudio.parse import PDFParser

# Initialize with the desired parser backend
parser = PDFParser(parser="docling")  # Options: "docling", "pymupdf", "llama"

2. Parse a PDF File

outputs = parser.run("path/to/file.pdf", modalities=["text", "tables", "images"])

# Access text content
print(outputs[0].text)

# Access tables
for table in outputs[0].tables:
    print(table.markdown)

# Access images
for image in outputs[0].images:
    image.image.show()
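
To keep the results rather than just print them, something like the following sketch could write each modality to disk. It relies only on the attributes shown above (text, tables with markdown, images with image) and assumes image.image is a PIL image, which the .show() call suggests but the source does not state. The function name save_output and the file names are hypothetical.

```python
from pathlib import Path
from typing import Any, List


def save_output(output: Any, out_dir: str) -> List[Path]:
    """Write one parsed document's text, tables, and images under out_dir."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []

    # Text: the whole document as a single file.
    text_path = target / "text.md"
    text_path.write_text(output.text or "", encoding="utf-8")
    written.append(text_path)

    # Tables: one Markdown file per table, numbered in document order.
    for i, table in enumerate(output.tables or [], start=1):
        table_path = target / f"table_{i}.md"
        table_path.write_text(table.markdown, encoding="utf-8")
        written.append(table_path)

    # Images: assumes image.image behaves like a PIL image (has .save).
    for i, image in enumerate(output.images or [], start=1):
        image_path = target / f"image_{i}.png"
        image.image.save(image_path)
        written.append(image_path)

    return written
```

Usage would look like save_output(outputs[0], "parsed/file"), called once per element of outputs.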

3. Supported Parsers

Select a backend with the parser argument of PDFParser:

  • Docling (parser="docling"): Advanced parser for extracting rich content.
  • PyMuPDF (parser="pymupdf"): Lightweight and efficient.
  • LlamaParse (parser="llama"): AI-enhanced parser with advanced capabilities; requires an API key (see LlamaPDFParser Setup).

Each parser has its own strengths. Choose the one that best fits your use case.
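
If no single backend suits every file, one option is to try them in order and keep the first that succeeds. This is a minimal sketch, not a ParseStudio feature: parse_with_fallback and its make_parser argument are hypothetical, and the default order (lightest backend first) is an assumption. The only ParseStudio call it relies on is parser.run(path, modalities=...) as shown in the Quick Start.

```python
from typing import Any, Callable, Iterable, List, Tuple


def parse_with_fallback(
    path: str,
    make_parser: Callable[[str], Any],
    backends: Iterable[str] = ("pymupdf", "docling", "llama"),
    modalities: Tuple[str, ...] = ("text",),
) -> Tuple[str, List[Any]]:
    """Try each backend in order; return (backend_name, outputs) for the first success."""
    errors = []
    for name in backends:
        try:
            parser = make_parser(name)
            outputs = parser.run(path, modalities=list(modalities))
        except Exception as exc:  # broad on purpose: backend failure modes are not documented here
            errors.append(f"{name}: {exc}")
            continue
        return name, outputs
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

With ParseStudio installed, the factory would be lambda name: PDFParser(parser=name). Passing it in keeps the helper independent of the library and easy to test.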

LlamaPDFParser Setup

If you choose the LlamaPDFParser backend (parser="llama"), you need to set up an API key. Follow these steps:

  1. Create a .env File: In the root directory of your project, create a file named .env.
  2. Add Your API Key: Add the following line to the .env file, replacing your-api-key with your LlamaParse API key:
    LLAMA_API_KEY=your-api-key
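
Before running the Llama backend, you may want to check that the key is actually visible. The sketch below is a stdlib-only check and makes no claim about how ParseStudio itself loads the .env file; read_env_file and llama_key_available are hypothetical helpers that handle only simple KEY=VALUE lines.

```python
import os
from pathlib import Path
from typing import Dict


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def llama_key_available(env_path: str = ".env") -> bool:
    """True if LLAMA_API_KEY is set in the process environment or in the .env file."""
    if os.environ.get("LLAMA_API_KEY"):
        return True
    if not Path(env_path).is_file():
        return False
    return bool(read_env_file(env_path).get("LLAMA_API_KEY"))
```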
    
