A Python library for extracting and analyzing content from documents, with support for batch and selective extraction, custom configuration, and multiple output formats.

Caution: This project is in active development. The API is subject to change, and breaking changes may occur. The package may not work until the first stable release (1.0.0).

docviz

Overview

Extract content from documents easily with Python.

  • Extract from PDFs (other formats are coming soon)
  • Streaming extraction for large documents and real-time results
  • Process one or many files using batch extraction
  • Choose what to extract (tables, text, images, etc.)
  • Export results to JSON, CSV, Excel and others
  • Simple and flexible API with high configurability

📦 Installation

  • Using uv:

    uv add docviz-python
    

    Upgrading from a previous version:

    uv pip install docviz-python --upgrade
    
  • Using pip:

    pip install docviz-python --upgrade
    
  • Directly from source:

    git clone https://github.com/privateai-com/docviz.git
    cd docviz
    pip install -e .
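
After installing, it can help to confirm which version is present, since the API may change between releases. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution name `docviz-python` is taken from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "docviz-python"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found else "docviz-python is not installed")
```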
    

Quick Start

Basic Usage

import asyncio
import docviz

async def main():
    # Create a document instance (can be a local file or a URL)
    document = docviz.Document("path/to/your/document.pdf")
    
    # Extract all content asynchronously
    extractions = await document.extract_content()
    
    # Save results (pass the file name without an extension; it is derived from the chosen format)
    extractions.save("results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())

Synchronous Usage

import docviz

document = docviz.Document("path/to/your/document.pdf")
extractions = document.extract_content_sync()
extractions.save("results", save_format=docviz.SaveFormat.JSON)

Code Examples

Batch Processing

import docviz
from pathlib import Path

# Process all PDF files in a directory
pdf_directory = Path("data/papers/")
output_dir = Path("output/")
output_dir.mkdir(exist_ok=True)

pdfs = pdf_directory.glob("*.pdf")
documents = [docviz.Document(str(pdf)) for pdf in pdfs]
extractions = docviz.batch_extract(documents)

for ext in extractions:
    ext.save(output_dir, save_format=[docviz.SaveFormat.JSON, docviz.SaveFormat.CSV])
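
`Path.glob` returns a lazy generator with no guaranteed order, so batch runs are easier to reproduce if the input list is sorted and validated before any `Document` objects are built. The sketch below uses only `pathlib`; `collect_pdfs` is a hypothetical helper and not part of docviz.

```python
from pathlib import Path


def collect_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by name."""
    directory = Path(directory)
    if not directory.is_dir():
        raise NotADirectoryError(f"Not a directory: {directory}")
    # Match the extension case-insensitively so "REPORT.PDF" is not skipped
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

The returned paths could then be passed to `docviz.Document(str(pdf))` as in the batch example above.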

Selective Extraction

import asyncio
import docviz

async def main():
    document = docviz.Document("path/to/document.pdf")

    # Extract only specific types of content
    # (extract_content is awaited, as in Basic Usage)
    extractions = await document.extract_content(
        includes=[
            docviz.ExtractionType.TABLE,
            docviz.ExtractionType.TEXT,
            docviz.ExtractionType.FIGURE,
            docviz.ExtractionType.EQUATION,
        ]
    )

    extractions.save("selective_results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())

Custom Configuration

import asyncio
import os

import docviz

async def main():
    document = docviz.Document("path/to/document.pdf")
    extractions = await document.extract_content(
        extraction_config=docviz.ExtractionConfig(page_limit=30),
        # Assumes LLMConfig is exported at the package top level, like ExtractionConfig
        llm_config=docviz.LLMConfig(
            model="gpt-4o-mini",
            api_key=os.getenv("OPENAI_API_KEY"),
            base_url="https://api.openai.com/v1",
        ),
    )
    extractions.save("configured_results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
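
`os.getenv` returns `None` when the variable is unset, so a missing key would be passed silently into the LLM configuration. A small guard can fail early instead. This is only a sketch: `require_env` is a hypothetical helper and not part of docviz.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could be used in place of the direct lookup, for example `api_key=require_env("OPENAI_API_KEY")`.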

Streaming Processing

import docviz

document = docviz.Document("path/to/large_document.pdf")

# Process the document page by page to keep memory usage low
for page_result in document.extract_streaming():
    # Process each page
    page_result.save(f"page_{page_result.page_number}", save_format=docviz.SaveFormat.JSON)
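
The memory benefit comes from the generator pattern: each page result is produced, handled, and released before the next one is computed. The sketch below illustrates the idea with a stand-in generator, writing one JSON file per page. `fake_page_stream` and `save_pages` are hypothetical names, and the code does not use docviz.

```python
import json
import tempfile
from pathlib import Path


def fake_page_stream(page_count):
    """Stand-in for a streaming extractor: yields one page result at a time."""
    for number in range(1, page_count + 1):
        yield {"page_number": number, "text": f"contents of page {number}"}


def save_pages(stream, output_dir):
    """Write each page result to its own JSON file and return the written paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for page in stream:
        target = output_dir / f"page_{page['page_number']}.json"
        target.write_text(json.dumps(page), encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = save_pages(fake_page_stream(3), tmp)
        print([p.name for p in paths])
```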

Progress Tracking

import asyncio

import docviz
from tqdm import tqdm

async def main():
    document = docviz.Document("path/to/document.pdf")

    # Extract with a progress bar (extract_content is awaited, as in Basic Usage)
    with tqdm(total=document.page_count, desc="Extracting content") as pbar:
        extractions = await document.extract_content(progress_callback=pbar.update)

    extractions.save("progress_results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())

Docs

The project has a static documentation site with examples covering almost all of its functionality. You can find it on GitHub Pages or build it locally with Sphinx. All required dependencies are listed in pyproject.toml under the docs group.

Examples

  • Basic Usage, covering simple extraction, passing a URL to the document, streaming, and custom configuration using an OpenAI key.
  • Streaming Processing with progress tracking and the generator API.
  • OpenAI API Example with custom configuration using an OpenAI key.

Pipeline Visualization

  • Original Chart: original page with chart
  • Extracted Chart: chart region extracted by Page Parser
  • Extracted Chart: Gemma3 output

Contributing

Refer to CONTRIBUTING.md for more information.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
