docinfer

A Python package for extracting and inferring metadata from PDF documents using AI-powered analysis.

Features

  • Extract metadata from PDF files
  • AI-powered document analysis using LLMs
  • CLI tool for easy batch processing
  • Flexible configuration and output formatting
  • Structured metadata models using Pydantic

Requirements

  • Python 3.12 or higher
  • Ollama - Required for AI-powered analysis
  • See pyproject.toml for the full list of Python dependencies
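
The two runtime prerequisites above can be checked before the first run. The following is a minimal sketch, not part of docinfer: `check_requirements` is a hypothetical helper, and it assumes the Ollama CLI is installed under the executable name `ollama` and is reachable on `PATH`.

```python
import shutil
import sys


def check_requirements(version_info=None, lookup=shutil.which):
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper, not part of docinfer. The parameters exist only so
    the check can be exercised without touching the real environment.
    """
    version = tuple(version_info or sys.version_info)[:2]
    problems = []
    if version < (3, 12):
        problems.append(
            f"Python 3.12 or higher is required, found {version[0]}.{version[1]}"
        )
    # Assumption: the Ollama CLI is installed as an executable named "ollama".
    if lookup("ollama") is None:
        problems.append("Ollama executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

This only confirms that an executable is present. It does not verify that Ollama is running or that the model you intend to use has been downloaded.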

Installation

From GitHub Repository

pip install git+https://github.com/tidyeval/docinfer.git

From Local Development

Clone the repository and install in editable mode:

git clone https://github.com/tidyeval/docinfer.git
cd docinfer
pip install -e .

Quick Start

Using uvx (Recommended)

Run directly without installation using uvx:

uvx --from git+https://github.com/tidyeval/docinfer.git docinfer <path-to-pdf>

Note: Once published to PyPI, you'll be able to run uvx docinfer <path-to-pdf> directly.

CLI Usage

If you've installed the package locally, run the CLI directly:

docinfer <path-to-pdf>

Options

  • --model MODEL - Specify the Ollama model (default: gemma3:4b)
    • Example: docinfer document.pdf --model gemma2
  • --json - Output as JSON instead of formatted text
  • --no-ai - Skip AI analysis and show embedded metadata only
  • --export FILE - Export results to JSON file
  • --quiet - Suppress progress output
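
To call the CLI from a script, the options listed above can be assembled into an argument list. This is a rough sketch rather than anything docinfer ships: `build_command` and `run_docinfer` are hypothetical names, and it assumes that `docinfer` is on `PATH`, that the flags can be combined freely, and that the tool exits with a non-zero status on failure.

```python
import subprocess


def build_command(pdf_path, model=None, as_json=False, no_ai=False,
                  export=None, quiet=False):
    """Assemble a docinfer command line from the documented options."""
    cmd = ["docinfer", str(pdf_path)]
    if model:
        cmd += ["--model", model]
    if as_json:
        cmd.append("--json")
    if no_ai:
        cmd.append("--no-ai")
    if export:
        cmd += ["--export", str(export)]
    if quiet:
        cmd.append("--quiet")
    return cmd


def run_docinfer(pdf_path, **options):
    """Run docinfer and return its standard output as text.

    Raises subprocess.CalledProcessError on a non-zero exit status
    (assumption: docinfer signals failure that way).
    """
    result = subprocess.run(
        build_command(pdf_path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_docinfer("document.pdf", model="gemma2", as_json=True)` would run `docinfer document.pdf --model gemma2 --json`. The structure of the JSON output is not documented here, so the sketch returns it as raw text.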

Python API

from docinfer.services.pdf_extractor import PDFExtractor
from docinfer.services.ai_analyzer import AIAnalyzer

# Extract PDF content
extractor = PDFExtractor()
content = extractor.extract("document.pdf")

# Analyze with AI
analyzer = AIAnalyzer()
metadata = analyzer.analyze(content)
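
The same two calls can be applied to every PDF in a folder. The sketch below is an illustration under assumptions, not part of the package: `analyze_folder` is a hypothetical helper, and it takes the extract and analyze steps as plain callables so that it relies on nothing beyond the two method calls shown above. The return types of those methods are not documented here, so results are stored as-is.

```python
from pathlib import Path


def analyze_folder(folder, extract, analyze):
    """Map each PDF in `folder` to the result of analyze(extract(path)).

    Hypothetical helper, not part of docinfer. Failures are collected
    rather than raised so that one bad file does not stop the batch.
    """
    results, errors = {}, {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf.name] = analyze(extract(str(pdf)))
        except Exception as exc:  # broad on purpose: error types are not documented
            errors[pdf.name] = exc
    return results, errors
```

With docinfer installed, this could be called as `analyze_folder("papers/", PDFExtractor().extract, AIAnalyzer().analyze)`, where `papers/` is a placeholder directory name.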

Project Structure

docinfer/
├── src/
│   ├── cli.py              # Command-line interface
│   ├── models/             # Pydantic data models
│   ├── services/           # Core services (PDF extraction, AI analysis)
│   └── prompts/            # AI prompt templates
├── tests/                  # Unit and integration tests
├── specs/                  # Project specifications
├── pyproject.toml          # Project configuration
└── README.md               # This file

Development

Setting up Development Environment

  1. Clone the repository:

    git clone https://github.com/tidyeval/docinfer.git
    cd docinfer
    
  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    
  3. Install in development mode:

    pip install -e ".[dev]"
    

Running Tests

pytest

Code Quality

The project uses:

  • black for code formatting
  • ruff for linting
  • pytest for testing

Contributing

Contributions are welcome! Please ensure:

  • Code passes linting and formatting checks
  • Tests pass with good coverage
  • Commit messages are descriptive

License

See the LICENSE file for details.

Author

Tino Kanngiesser (tinokanngiesser@gmail.com)


