Extract and infer metadata from PDF documents using AI-powered analysis
docinfer
A Python package for extracting and inferring metadata from PDF documents using AI-powered analysis.
Features
- Extract metadata from PDF files
- AI-powered document analysis using LLMs
- CLI tool for easy batch processing
- Flexible configuration and output formatting
- Structured metadata models using Pydantic
Requirements
- Python 3.12 or higher
- Ollama - Required for AI-powered analysis
  - Install Ollama
  - Pull a model:
    ollama pull gemma3:4b
- See pyproject.toml for the full Python dependency list
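The requirements above can be checked with a short preflight script before installing. This is a minimal sketch and not part of docinfer: the helper name `check_prerequisites` is hypothetical, and it only verifies the interpreter version and that an `ollama` executable is on PATH. It does not confirm that a model has been pulled.

```python
import shutil
import sys

MIN_PYTHON = (3, 12)  # minimum version stated in the requirements


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration, not part of docinfer.
    """
    if version_info is None:
        version_info = sys.version_info
    problems = []
    # Compare only (major, minor) against the documented minimum.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if which("ollama") is None:
        problems.append("ollama executable not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"missing: {problem}")
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the real environment.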
Installation
From GitHub Repository
pip install git+https://github.com/tidyeval/docinfer.git
From Local Development
Clone the repository and install in editable mode:
git clone https://github.com/tidyeval/docinfer.git
cd docinfer
pip install -e .
Quick Start
Using uvx (Recommended)
Run directly without installation using uvx:
uvx --from git+https://github.com/tidyeval/docinfer.git docinfer <path-to-pdf>
Note: Once published to PyPI, you'll be able to run uvx docinfer <path-to-pdf> directly.
CLI Usage
If you've installed the package locally, run directly:
docinfer <path-to-pdf>
Options
- --model MODEL - Specify the Ollama model (default: gemma3:4b)
  - Example: docinfer document.pdf --model gemma2
- --json - Output as JSON instead of formatted text
- --no-ai - Skip AI analysis and show embedded metadata only
- --export FILE - Export results to JSON file
- --quiet - Suppress progress output
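When calling the CLI from a script, the options listed above can be assembled programmatically. The following is a rough sketch, not an official wrapper: `build_docinfer_command` and `run_docinfer_json` are hypothetical names, and the parsing step assumes that `--json` combined with `--quiet` leaves a single JSON document on stdout, which the options list does not state explicitly.

```python
import json
import subprocess


def build_docinfer_command(pdf_path, model=None, as_json=False,
                           no_ai=False, export=None, quiet=False):
    """Assemble the argument list for the docinfer CLI (hypothetical helper)."""
    cmd = ["docinfer", str(pdf_path)]
    if model is not None:
        cmd += ["--model", model]
    if as_json:
        cmd.append("--json")
    if no_ai:
        cmd.append("--no-ai")
    if export is not None:
        cmd += ["--export", str(export)]
    if quiet:
        cmd.append("--quiet")
    return cmd


def run_docinfer_json(pdf_path, **options):
    """Run docinfer with --json and parse its stdout.

    Assumption: with --json and --quiet, stdout holds one JSON document.
    """
    cmd = build_docinfer_command(pdf_path, as_json=True, quiet=True, **options)
    # check=True raises CalledProcessError on a non-zero exit status.
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

For example, `run_docinfer_json("document.pdf", no_ai=True)` would request embedded metadata only, provided `docinfer` is installed and on PATH.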
Python API
from docinfer.services.pdf_extractor import PDFExtractor
from docinfer.services.ai_analyzer import AIAnalyzer
# Extract PDF content
extractor = PDFExtractor()
content = extractor.extract("document.pdf")
# Analyze with AI
analyzer = AIAnalyzer()
metadata = analyzer.analyze(content)
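The two calls above can be wrapped in a loop to process a whole folder. This is a sketch under stated assumptions: `analyze_directory` is a hypothetical helper that is not part of docinfer, and it only relies on the `extract(path)` and `analyze(content)` methods shown in the snippet above. The extractor and analyzer are passed in as arguments so the loop makes no further assumptions about their constructors.

```python
from pathlib import Path


def analyze_directory(directory, extractor, analyzer):
    """Map each PDF file name in `directory` to its metadata.

    If a file fails, the raised exception is stored instead, so one
    bad PDF does not stop the rest of the batch.
    """
    results = {}
    # Non-recursive: only *.pdf files directly inside `directory`.
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        try:
            content = extractor.extract(str(pdf_path))
            results[pdf_path.name] = analyzer.analyze(content)
        except Exception as exc:
            results[pdf_path.name] = exc
    return results
```

With the package installed, a call might look like `analyze_directory("papers/", PDFExtractor(), AIAnalyzer())`, where `papers/` is a placeholder directory name.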
Project Structure
docinfer/
├── src/
│ ├── cli.py # Command-line interface
│ ├── models/ # Pydantic data models
│ ├── services/ # Core services (PDF extraction, AI analysis)
│ └── prompts/ # AI prompt templates
├── tests/ # Unit and integration tests
├── specs/ # Project specifications
├── pyproject.toml # Project configuration
└── README.md # This file
Development
Setting up Development Environment
- Clone the repository:
  git clone https://github.com/tidyeval/docinfer.git
  cd docinfer
- Create and activate a virtual environment:
  python -m venv .venv
  source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install in development mode:
  pip install -e ".[dev]"
Running Tests
pytest
Code Quality
The project uses:
- black for code formatting
- ruff for linting
- pytest for testing
Contributing
Contributions are welcome! Please ensure:
- Code passes linting and formatting checks
- Tests pass with good coverage
- Commit messages are descriptive
License
See LICENSE file for details.
Author
Tino Kanngiesser (tinokanngiesser@gmail.com)
Download files
File details
Details for the file docinfer-0.1.1.tar.gz.
File metadata
- Download URL: docinfer-0.1.1.tar.gz
- Upload date:
- Size: 15.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d91ecc4270fcad376eb1a2a1e93075ac48da544ba358982eed281bbb4d523aa5 |
| MD5 | d32359068903d414252b208c93a1aa52 |
| BLAKE2b-256 | d7f89b730219b8eccb33e9164d3b20f761cbd3b273bace3e82e9033fa10c6805 |
File details
Details for the file docinfer-0.1.1-py3-none-any.whl.
File metadata
- Download URL: docinfer-0.1.1-py3-none-any.whl
- Upload date:
- Size: 17.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c78ea535683cde49412e14e725659d832050a7d970db2ef4f0042bcb62de2274 |
| MD5 | 01d46455a1a336c691dd2118bd284f40 |
| BLAKE2b-256 | 4672cbccc435c599cd739b478902e0113d84ab1242f01fff8db9d740a0fa9cbf |