
Extract and infer metadata from PDF documents using AI-powered analysis


docinfer

A Python package for extracting and inferring metadata from PDF documents using AI-powered analysis.

Features

  • Extract metadata from PDF files
  • AI-powered document analysis using LLMs
  • CLI tool for easy batch processing
  • Flexible configuration and output formatting
  • Structured metadata models using Pydantic

Requirements

  • Python 3.12 or higher
  • Ollama - Required for AI-powered analysis
  • See pyproject.toml for full Python dependency list
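A quick way to check the two non-Python-package requirements above before running the tool is sketched below. It is a minimal sketch, not part of docinfer: the function name `check_requirements` is hypothetical, and it assumes the Ollama CLI is installed as an executable named `ollama` on your PATH.

```python
import shutil
import sys


def check_requirements(version_info=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    # The requirements list asks for Python 3.12 or higher.
    if tuple(version_info[:2]) < (3, 12):
        problems.append(
            f"Python 3.12+ required, found {version_info[0]}.{version_info[1]}"
        )
    # Assumption: the Ollama CLI is available as an executable named "ollama".
    if which("ollama") is None:
        problems.append("ollama executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

Finding the executable only shows that Ollama is installed. It does not confirm that the Ollama server is running or that the model you want has been pulled.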

Installation

From GitHub Repository

pip install git+https://github.com/tidyeval/docinfer.git

From Local Development

Clone the repository and install in editable mode:

git clone https://github.com/tidyeval/docinfer.git
cd docinfer
pip install -e .

Quick Start

Using uvx (Recommended)

Run directly without installation using uvx:

uvx --from git+https://github.com/tidyeval/docinfer.git docinfer <path-to-pdf>

Note: Where docinfer is available from PyPI, the shorter form uvx docinfer <path-to-pdf> can be used instead.

CLI Usage

If you've installed the package locally, run the docinfer command directly:

docinfer <path-to-pdf>

Options

  • --model MODEL - Specify the Ollama model (default: gemma3:4b)
    • Example: docinfer document.pdf --model gemma2
  • --json - Output as JSON instead of formatted text
  • --no-ai - Skip AI analysis and show embedded metadata only
  • --export FILE - Export results to JSON file
  • --quiet - Suppress progress output
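For batch processing from a script, the options above can be assembled into a command line and passed to `subprocess`. The sketch below uses only the flags listed in this section; the helper names `build_command` and `run_batch` are hypothetical, and whether every combination of flags is accepted together is an assumption that has not been checked against the CLI.

```python
import subprocess


def build_command(pdf_path, model=None, as_json=False, no_ai=False,
                  export=None, quiet=False):
    """Build the argument list for one docinfer invocation."""
    cmd = ["docinfer", str(pdf_path)]
    if model:
        cmd += ["--model", model]    # Ollama model to use
    if as_json:
        cmd.append("--json")         # JSON instead of formatted text
    if no_ai:
        cmd.append("--no-ai")        # embedded metadata only
    if export:
        cmd += ["--export", str(export)]  # write results to a JSON file
    if quiet:
        cmd.append("--quiet")        # suppress progress output
    return cmd


def run_batch(pdf_paths, **options):
    """Run docinfer once per PDF and return a mapping of path to exit code."""
    exit_codes = {}
    for path in pdf_paths:
        completed = subprocess.run(build_command(path, **options), check=False)
        exit_codes[str(path)] = completed.returncode
    return exit_codes
```

For example, `run_batch(["a.pdf", "b.pdf"], model="gemma3:4b", quiet=True)` would run the tool twice with the same options. It requires the docinfer command to be installed and on your PATH.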

Python API

from docinfer.services.pdf_extractor import PDFExtractor
from docinfer.services.ai_analyzer import AIAnalyzer

# Extract PDF content
extractor = PDFExtractor()
content = extractor.extract("document.pdf")

# Analyze with AI
analyzer = AIAnalyzer()
metadata = analyzer.analyze(content)
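To process several files with the same two-step flow, the extractor and analyzer can be passed into a small helper that collects results and failures separately. This is a sketch under assumptions: `analyze_many` is a hypothetical helper, not part of docinfer, and because the exceptions raised by `extract` and `analyze` are not documented here, it catches `Exception` broadly.

```python
def analyze_many(paths, extractor, analyzer):
    """Run extract-then-analyze on each path.

    Returns (results, errors): results maps path to metadata,
    errors maps path to the error message for files that failed.
    """
    results = {}
    errors = {}
    for path in paths:
        try:
            content = extractor.extract(path)          # step 1: PDF content
            results[path] = analyzer.analyze(content)  # step 2: AI analysis
        except Exception as exc:  # assumption: failure types are unknown
            errors[path] = str(exc)
    return results, errors


# Usage with the classes shown above (requires docinfer and Ollama):
#   from docinfer.services.pdf_extractor import PDFExtractor
#   from docinfer.services.ai_analyzer import AIAnalyzer
#   results, errors = analyze_many(["a.pdf", "b.pdf"], PDFExtractor(), AIAnalyzer())
```

Passing the two objects in as arguments, rather than constructing them inside the helper, lets you reuse one extractor and one analyzer across all files and substitute stand-ins in tests.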

Project Structure

docinfer/
├── src/
│   ├── cli.py              # Command-line interface
│   ├── models/             # Pydantic data models
│   ├── services/           # Core services (PDF extraction, AI analysis)
│   └── prompts/            # AI prompt templates
├── tests/                  # Unit and integration tests
├── specs/                  # Project specifications
├── pyproject.toml          # Project configuration
└── README.md               # This file

Development

Setting up Development Environment

  1. Clone the repository:

    git clone https://github.com/tidyeval/docinfer.git
    cd docinfer

  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate

  3. Install in development mode:

    pip install -e ".[dev]"

Running Tests

pytest

Code Quality

The project uses:

  • black for code formatting
  • ruff for linting
  • pytest for testing

Contributing

Contributions are welcome! Please ensure:

  • Code passes linting and formatting checks
  • All tests pass, and new code is covered by tests
  • Commit messages are descriptive

License

See the LICENSE file for details.

Author

Tino Kanngiesser (tinokanngiesser@gmail.com)
