docviz-python

A Python library for extracting and analyzing content from documents, supporting batch and selective extraction, custom configuration, and multiple output formats.

These details have not been verified by PyPI

Project description

> [!CAUTION]
> This project is in active development. The API is subject to change, and breaking changes may occur. The package may not work until the first stable release (1.0.0).

Overview

Extract content from documents easily with Python.

- Extract from PDFs (other formats are coming soon)
- Streaming extraction for large documents and real-time results
- Process one or many files using batch extraction
- Choose what to extract (tables, text, images, etc.)
- Export results to JSON, CSV, Excel, and other formats
- Simple, flexible, and highly configurable API

📦 Installation

Using uv:
```
uv add docviz-python
```

Upgrading from a previous version:
```
uv pip install docviz-python --upgrade
```

Using pip:
```
pip install docviz-python --upgrade
```

Directly from source:
```
git clone https://github.com/privateai-com/docviz.git
cd docviz
pip install -e .
```
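
Because the API may change between releases, it can help to confirm which version is actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of docviz.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "docviz-python") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    print(found if found is not None else "docviz-python is not installed")
```

Returning `None` rather than raising keeps the check usable in scripts that should degrade gracefully when the package is missing.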

Quick Start

Basic Usage

```python
import asyncio
import docviz

async def main():
    # Create a document instance (can be a local file or a URL)
    document = docviz.Document("path/to/your/document.pdf")

    # Extract all content asynchronously
    extractions = await document.extract_content()

    # Save results (pass the file name without an extension; it is added based on the chosen format)
    extractions.save("results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
```
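
Since extraction is exposed as a coroutine, several documents could in principle be processed concurrently with a cap on how many run at once. The sketch below shows that pattern with plain `asyncio`; `run_limited` and `fake_extract` are hypothetical names, and `fake_extract` is a stand-in so the example runs without docviz. Whether concurrent calls to `extract_content()` are safe or beneficial is an assumption to verify against the library.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_limited(
    items: Sequence[str],
    worker: Callable[[str], Awaitable[str]],
    limit: int = 2,
) -> List[str]:
    """Run `worker` on every item, with at most `limit` running at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: str) -> str:
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return list(await asyncio.gather(*(guarded(item) for item in items)))


async def fake_extract(path: str) -> str:
    # Stand-in for an extraction coroutine; not a docviz function
    await asyncio.sleep(0)
    return f"extracted:{path}"


if __name__ == "__main__":
    print(asyncio.run(run_limited(["a.pdf", "b.pdf", "c.pdf"], fake_extract)))
```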

Synchronous Usage

```python
import docviz

document = docviz.Document("path/to/your/document.pdf")
extractions = document.extract_content_sync()
extractions.save("results", save_format=docviz.SaveFormat.JSON)
```

Code Examples

Batch Processing

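```python
import docviz
from pathlib import Path

# Process all PDF files in a directory
pdf_directory = Path("data/papers/")
output_dir = Path("output/")
output_dir.mkdir(parents=True, exist_ok=True)

# Sort the matches into a list so the order is stable and can be reused below
pdfs = sorted(pdf_directory.glob("*.pdf"))
documents = [docviz.Document(str(pdf)) for pdf in pdfs]
extractions = docviz.batch_extract(documents)

# Save each result under its source file's name
# (assumes batch_extract returns results in input order)
for pdf, ext in zip(pdfs, extractions):
    ext.save(
        str(output_dir / pdf.stem),
        save_format=[docviz.SaveFormat.JSON, docviz.SaveFormat.CSV],
    )
```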
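
When a batch gathers PDFs from more than one directory, two files can share a name, and their saved results would collide. The sketch below derives a unique, extension-free output path for each input using only `pathlib`; `unique_output_paths` is a hypothetical helper, not part of docviz.

```python
from pathlib import Path
from typing import Dict, Iterable


def unique_output_paths(pdf_paths: Iterable[Path], output_dir: Path) -> Dict[Path, Path]:
    """Map each PDF to an output path (no extension) that no other PDF uses."""
    taken = set()
    result: Dict[Path, Path] = {}
    for pdf in pdf_paths:
        candidate = pdf.stem
        suffix = 1
        # Append _1, _2, ... until the name is free
        while candidate in taken:
            candidate = f"{pdf.stem}_{suffix}"
            suffix += 1
        taken.add(candidate)
        result[pdf] = output_dir / candidate
    return result


if __name__ == "__main__":
    inputs = [Path("x/report.pdf"), Path("y/report.pdf"), Path("x/notes.pdf")]
    for source, target in unique_output_paths(inputs, Path("output")).items():
        print(source, "->", target)
```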

Selective Extraction

```python
import asyncio
import docviz

async def main():
    document = docviz.Document("path/to/document.pdf")

    # Extract only specific types of content
    extractions = await document.extract_content(
        includes=[
            docviz.ExtractionType.TABLE,
            docviz.ExtractionType.TEXT,
            docviz.ExtractionType.FIGURE,
            docviz.ExtractionType.EQUATION,
        ]
    )

    extractions.save("selective_results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
```

Custom Configuration

```python
import asyncio
import os

import docviz

async def main():
    document = docviz.Document("path/to/document.pdf")
    extractions = await document.extract_content(
        extraction_config=docviz.ExtractionConfig(page_limit=30),
        # LLMConfig is assumed to be exported from the docviz package, like ExtractionConfig
        llm_config=docviz.LLMConfig(
            model="gpt-4o-mini",
            api_key=os.getenv("OPENAI_API_KEY"),
            base_url="https://api.openai.com/v1",
        ),
    )
    extractions.save("configured_results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
```
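
`os.getenv` returns `None` when the variable is unset, so a missing key may only surface later as an authentication error. A small guard can fail early with a clearer message. This is a sketch; `require_env` is a hypothetical helper and not part of docviz.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or blank."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


if __name__ == "__main__":
    try:
        api_key = require_env("OPENAI_API_KEY")
        print("API key found")
    except RuntimeError as error:
        print(error)
```

The returned value can then be passed as `api_key` in the configuration above.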

Streaming Processing

```python
import docviz

document = docviz.Document("path/to/large_document.pdf")

# Process the document page by page to save memory
for page_result in document.extract_streaming():
    # Handle each page as soon as it is ready
    page_result.save(f"page_{page_result.page_number}", save_format=docviz.SaveFormat.JSON)
```
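
One benefit of a streaming loop is that it can stop early without processing the rest of the document. The sketch below illustrates this with `itertools.islice` on a stand-in generator; `first_pages` and `fake_pages` are hypothetical names, and it assumes `extract_streaming()` yields an ordinary synchronous iterable, as the loop above suggests.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def first_pages(pages: Iterable[Dict[str, int]], limit: int) -> Iterator[Dict[str, int]]:
    """Lazily yield at most `limit` pages; later pages are never requested."""
    return islice(pages, limit)


def fake_pages(total: int) -> Iterator[Dict[str, int]]:
    # Stand-in for a streaming extraction; not a docviz function
    for number in range(1, total + 1):
        print(f"extracting page {number}")
        yield {"page_number": number}


if __name__ == "__main__":
    for page in first_pages(fake_pages(1000), limit=3):
        print("got", page["page_number"])
```

Only three "extracting" lines are printed, because the generator is never advanced past the third page.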

Progress Tracking

```python
import asyncio
import docviz
from tqdm import tqdm

async def main():
    document = docviz.Document("path/to/document.pdf")

    # Extract with a progress bar
    with tqdm(total=document.page_count, desc="Extracting content") as pbar:
        extractions = await document.extract_content(progress_callback=pbar.update)

    extractions.save("progress_results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
```
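
If tqdm is not available, any callable with a compatible shape could serve as the callback. The sketch below assumes the callback receives the number of newly completed pages, which is what passing `pbar.update` implies; that signature is an inference, not something confirmed here. `ProgressCounter` is a hypothetical class.

```python
class ProgressCounter:
    """Track completed pages; `update` mirrors the shape of tqdm's `update(n)`."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.done = 0

    @property
    def percent(self) -> int:
        # Treat an empty document as complete to avoid dividing by zero
        if self.total <= 0:
            return 100
        return 100 * self.done // self.total

    def update(self, n: int = 1) -> None:
        # Clamp so a late or duplicate callback never exceeds the total
        self.done = min(self.total, self.done + n)
        print(f"{self.done}/{self.total} pages ({self.percent}%)")


if __name__ == "__main__":
    counter = ProgressCounter(total=4)
    for _ in range(4):
        counter.update(1)
```

Under the stated assumption, `counter.update` would be passed as `progress_callback` in place of `pbar.update`.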

Docs

The project has a static documentation site with examples covering almost all of its functionality. It is hosted on GitHub Pages, or you can build it locally with Sphinx. All required dependencies are listed in pyproject.toml under the docs group.

Examples

- Basic Usage: several approaches, including simple extraction, passing a URL as the document, streaming, and custom configuration using an OpenAI key.
- Streaming Processing: progress tracking and the generator API.
- OpenAI API Example: custom configuration using an OpenAI key.

Pipeline Visualization

Original page with chart

Chart region extracted by Page Parser

Gemma3 output

Contributing

Refer to CONTRIBUTING.md for more information.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.

Project details

Release history

0.11.0 (this version): Sep 19, 2025
0.10.3: Sep 12, 2025
0.10.2: Sep 6, 2025
0.10.1: Sep 6, 2025
0.10.0: Sep 5, 2025
0.9.0: Aug 25, 2025
0.8.1: Aug 22, 2025
0.7.0: Aug 21, 2025
0.6.0: Aug 21, 2025
0.5.0: Aug 20, 2025
0.4.0: Aug 16, 2025
0.3.0: Aug 15, 2025
0.2.0: Aug 15, 2025
0.1.0: Aug 14, 2025

Download files


Source Distribution

docviz_python-0.11.0.tar.gz (54.6 kB)

Uploaded Sep 19, 2025 Source

Built Distribution


docviz_python-0.11.0-py3-none-any.whl (66.2 kB)

Uploaded Sep 19, 2025 Python 3

File details

Details for the file docviz_python-0.11.0.tar.gz.

File metadata

Download URL: docviz_python-0.11.0.tar.gz
Upload date: Sep 19, 2025
Size: 54.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.18

File hashes

Hashes for docviz_python-0.11.0.tar.gz
Algorithm	Hash digest
SHA256	`985371d2f263baf6081ea15a8d14ce73f3915027a7d4099069748e7a94905f94`
MD5	`7e4d75d912128e0a487f02975d7eb0d7`
BLAKE2b-256	`ba55f3e3cf4289577f5f882ad9a9ae4bd7e16d2e7bb94c19854e92166fcea881`


File details

Details for the file docviz_python-0.11.0-py3-none-any.whl.

File metadata

Download URL: docviz_python-0.11.0-py3-none-any.whl
Upload date: Sep 19, 2025
Size: 66.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.8.18

File hashes

Hashes for docviz_python-0.11.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`94b8701d4b7127efe3f4da8b1d0a3471a04375f95079dfeb16e7c5633daeddbc`
MD5	`e402a98cdc737256ba7be9711613362b`
BLAKE2b-256	`c1a66f9698c50fa5aa769ec3da0a9750c135782240d84bc2e0e5e7a4c5a37c18`
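
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using `hashlib`; `sha256_of` and `matches_digest` are hypothetical helper names, and the local file path in the usage section is an assumption about where the download was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Assumed local path of the downloaded wheel
    wheel = Path("docviz_python-0.11.0-py3-none-any.whl")
    expected = "94b8701d4b7127efe3f4da8b1d0a3471a04375f95079dfeb16e7c5633daeddbc"
    if wheel.exists():
        print("digest matches" if matches_digest(wheel, expected) else "digest MISMATCH")
    else:
        print(f"{wheel} not found")
```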


docviz-python 0.11.0
