A Python library for extracting and analyzing content from documents, with support for batch and selective extraction, custom configuration, and multiple output formats.
[!Caution] This project is in active development. The API is subject to change, and breaking changes may occur. The package may not work reliably until the first stable release (1.0.0).
Overview
Extract content from documents easily with Python.
- Extract from PDFs (other formats are coming soon)
- Streaming extraction for large documents and real-time results
- Process one or many files using batch extraction
- Choose what to extract (tables, text, images, etc.)
- Export results to JSON, CSV, Excel and others
- Simple and flexible API with high configurability
📦 Installation
- Using uv:
  uv add docviz-python
  Upgrading from a previous version:
  uv pip install docviz-python --upgrade
- Using pip:
  pip install docviz-python --upgrade
- Directly from source:
  git clone https://github.com/privateai-com/docviz.git
  cd docviz
  pip install -e .
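To confirm which version ended up in the environment after installing, the standard library's `importlib.metadata` can be queried by distribution name. This is a minimal sketch; the helper name `installed_version` is illustrative and not part of docviz.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "docviz-python" is the distribution name used in the install commands above.
    print(installed_version("docviz-python"))
```

A `None` result means the package is not installed in the interpreter that ran the check, which is a common symptom of installing into a different virtual environment.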
Quick Start
Basic Usage
import asyncio
import docviz

async def main():
    # Create a document instance (can be a local file or a URL)
    document = docviz.Document("path/to/your/document.pdf")
    # Extract all content asynchronously
    extractions = await document.extract_content()
    # Save results (pass the file name without an extension; it is derived from the chosen format)
    extractions.save("results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
Synchronous Usage
import docviz
document = docviz.Document("path/to/your/document.pdf")
extractions = document.extract_content_sync()
extractions.save("results", save_format=docviz.SaveFormat.JSON)
Code Examples
Batch Processing
import docviz
from pathlib import Path
# Process all PDF files in a directory
pdf_directory = Path("data/papers/")
output_dir = Path("output/")
output_dir.mkdir(exist_ok=True)
pdfs = pdf_directory.glob("*.pdf")
documents = [docviz.Document(str(pdf)) for pdf in pdfs]
extractions = docviz.batch_extract(documents)
for ext in extractions:
    ext.save(output_dir, save_format=[docviz.SaveFormat.JSON, docviz.SaveFormat.CSV])
Selective Extraction
import asyncio
import docviz

async def main():
    document = docviz.Document("path/to/document.pdf")
    # Extract only specific types of content.
    # extract_content is a coroutine (see Basic Usage), so it must be awaited.
    extractions = await document.extract_content(
        includes=[
            docviz.ExtractionType.TABLE,
            docviz.ExtractionType.TEXT,
            docviz.ExtractionType.FIGURE,
            docviz.ExtractionType.EQUATION,
        ]
    )
    extractions.save("selective_results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
Custom Configuration
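import asyncio
import os
import docviz

async def main():
    document = docviz.Document("path/to/document.pdf")
    # extract_content is a coroutine (see Basic Usage), so it must be awaited.
    # Assumes LLMConfig is exported at the package top level, like ExtractionConfig.
    extractions = await document.extract_content(
        extraction_config=docviz.ExtractionConfig(page_limit=30),
        llm_config=docviz.LLMConfig(
            model="gpt-4o-mini",
            api_key=os.getenv("OPENAI_API_KEY"),
            base_url="https://api.openai.com/v1",
        )
    )
    extractions.save("configured_results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())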
Streaming Processing
import docviz
document = docviz.Document("path/to/large_document.pdf")
# Process the document page by page to save memory
for page_result in document.extract_streaming():
    # Save each page's result as soon as it is available
    page_result.save(f"page_{page_result.page_number}", save_format=docviz.SaveFormat.JSON)
Progress Tracking
import asyncio
import docviz
from tqdm import tqdm

async def main():
    document = docviz.Document("path/to/document.pdf")
    # Extract with a progress bar.
    # extract_content is a coroutine (see Basic Usage), so it must be awaited.
    with tqdm(total=document.page_count, desc="Extracting content") as pbar:
        extractions = await document.extract_content(progress_callback=pbar.update)
    extractions.save("progress_results", save_format=docviz.SaveFormat.JSON)

asyncio.run(main())
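Where tqdm is not available, any callable can in principle stand in for `pbar.update`. The sketch below assumes the callback receives the number of newly processed pages as a single integer, mirroring tqdm's `update(n)`; that signature is an assumption, not something this README confirms. `PageProgress` is a hypothetical helper.

```python
class PageProgress:
    """Accumulate page counts and report completion as a percentage."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.done = 0

    def __call__(self, n: int = 1) -> None:
        # Assumption: n is an increment (pages finished since the last call).
        self.done = min(self.done + n, self.total)
        percent = 100 * self.done // self.total if self.total else 100
        print(f"{self.done}/{self.total} pages ({percent}%)")
```

Under that assumption it would be passed as `progress_callback=PageProgress(document.page_count)`.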
Docs
The project has a static documentation site with examples covering almost all of its functionality. You can find it on GitHub Pages or build it locally with Sphinx. All required dependencies are listed in pyproject.toml under the docs group.
Examples
- Basic Usage with several approaches: simple extraction, passing a URL as the document, streaming, and custom configuration using an OpenAI key.
- Streaming Processing with progress tracking and generator API.
- OpenAI API Example with custom configuration using OpenAI key.
Pipeline Visualization
Contributing
Refer to CONTRIBUTING.md for more information.
📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
File details
Details for the file docviz_python-0.11.0.tar.gz.
File metadata
- Download URL: docviz_python-0.11.0.tar.gz
- Upload date:
- Size: 54.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 985371d2f263baf6081ea15a8d14ce73f3915027a7d4099069748e7a94905f94 |
| MD5 | 7e4d75d912128e0a487f02975d7eb0d7 |
| BLAKE2b-256 | ba55f3e3cf4289577f5f882ad9a9ae4bd7e16d2e7bb94c19854e92166fcea881 |
File details
Details for the file docviz_python-0.11.0-py3-none-any.whl.
File metadata
- Download URL: docviz_python-0.11.0-py3-none-any.whl
- Upload date:
- Size: 66.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.8.18
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 94b8701d4b7127efe3f4da8b1d0a3471a04375f95079dfeb16e7c5633daeddbc |
| MD5 | e402a98cdc737256ba7be9711613362b |
| BLAKE2b-256 | c1a66f9698c50fa5aa769ec3da0a9750c135782240d84bc2e0e5e7a4c5a37c18 |