
PDF metadata extraction CLI using PyExifTool


fileKor

Local metadata engine that extracts, summarizes, classifies, and tags files using taxonomy-based labeling.

Quick Start

# Install uv (Windows, via winget)
winget install astral-sh.uv


git clone filekor 

cd filekor

# Setup
uv venv

# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

uv pip install -e .

# CLI Usage
filekor extract documento.pdf
filekor sidecar documento.pdf
filekor sidecar ./documentos --dir           # Process directory (generates merged.kor by default)
filekor sidecar ./documentos --dir --no-merge # Generate individual .kor files
filekor sidecar ./documentos --dir --db     # Use database to regenerate when available
filekor labels documento.pdf
filekor sync documento.kor          # Sync existing .kor to database
filekor merge ./directorio         # Merge multiple .kor files
filekor delete --path ./doc.pdf    # Delete by path
filekor delete --sha <hash>        # Delete by SHA256
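
The per-file commands above can also be driven from a script. The following is a minimal sketch, assuming the `filekor` executable is on PATH and using only the `sidecar` subcommand shown above. The helper names `sidecar_commands` and `run_all` are hypothetical and not part of filekor.

```python
import shutil
import subprocess
from pathlib import Path


def sidecar_commands(directory, pattern="*.pdf"):
    """Build one `filekor sidecar <file>` argument list per matching file."""
    files = sorted(Path(directory).glob(pattern))
    return [["filekor", "sidecar", str(path)] for path in files]


def run_all(commands):
    """Run each command in order; do nothing if filekor is not installed."""
    if shutil.which("filekor") is None:
        print("filekor not found on PATH; nothing was run")
        return []
    # check=False: collect exit codes instead of raising on the first failure.
    return [subprocess.run(cmd, check=False).returncode for cmd in commands]
```

Calling `run_all(sidecar_commands("./documentos"))` would process each PDF separately. For whole-directory processing, the built-in `--dir` flag is likely the simpler choice.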

Library Usage

filekor can be used as a Python library for database-backed queries and search:

from filekor.db import get_db, sync_file, search_files

# Get database instance (lazy singleton)
db = get_db()

# Sync a .kor file to the database
sync_file("./documento.kor")

# Search files by labels and content with scoring
results = search_files(
    labels=["finance", "2026"],
    query="budget report"
)
# Returns ranked results with relevance scores

Enable auto-sync in config.yaml to automatically update the database when using CLI commands.

Features

Core Features

  • Metadata Extraction - Extract metadata from PDF, TXT, and MD files using PyExifTool
  • Text Extraction - Extract and summarize text content from supported files
  • Sidecar Generation - Generate YAML sidecar files (.kor) with full metadata
  • Taxonomy Labels - LLM-based classification with custom taxonomy support
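
To make the sidecar idea concrete, here is a minimal sketch of collecting basic file facts and rendering them as YAML. It is not filekor's implementation: the field names (`name`, `size_bytes`, `sha256`, `labels`) and the helper functions are assumptions for illustration, and the real `.kor` schema may differ.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Hash the file in chunks so large files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_sidecar(path, labels=()):
    """Collect basic file facts into a plain dict (hypothetical field names)."""
    path = Path(path)
    return {
        "name": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": sha256_of(path),
        "labels": list(labels),
    }


def to_yaml(record):
    """Render a flat record as YAML; scalars use JSON syntax, which YAML accepts."""
    lines = []
    for key, value in record.items():
        if isinstance(value, list) and value:
            lines.append(f"{key}:")
            lines.extend(f"  - {json.dumps(item)}" for item in value)
        else:
            lines.append(f"{key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"
```

The SHA-256 digest gives each file a content-based identity, which is the kind of value a command such as `filekor delete --sha <hash>` could match against.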

LLM Providers

  • Google Gemini - Native Gemini API support
  • OpenAI - GPT-4o, GPT-4o-mini support
  • Groq - Fast inference with Llama models
  • OpenRouter - Access to 200+ free models
  • Mock Provider - Testing without API calls
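
A mock provider is useful because it lets labeling code run in tests without network access. The sketch below shows one way such a setup could look. The `LabelProvider` protocol, `MockProvider` class, and `label_document` function are hypothetical names chosen for illustration and may not match the classes in filekor's `llm.py`.

```python
from typing import Protocol


class LabelProvider(Protocol):
    """Anything that can map text to labels drawn from a taxonomy."""

    def classify(self, text: str, taxonomy: list[str]) -> list[str]: ...


class MockProvider:
    """Deterministic stand-in: a label matches when it appears in the text."""

    def classify(self, text: str, taxonomy: list[str]) -> list[str]:
        lowered = text.lower()
        return [label for label in taxonomy if label.lower() in lowered]


def label_document(text: str, taxonomy: list[str], provider: LabelProvider) -> list[str]:
    """Ask the provider for labels and discard anything outside the taxonomy."""
    allowed = set(taxonomy)
    return [label for label in provider.classify(text, taxonomy) if label in allowed]
```

Filtering against the taxonomy after the provider call guards against a real LLM returning labels that were never defined.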

Database & Search

  • SQLite Database - Index all .kor metadata
  • Full-Text Search - FTS5 for fast filename/metadata search
  • Multi-Label Search - OR logic for filtering by multiple labels
  • Relevance Scoring - Configurable weights for search ranking
  • Auto-Sync - Automatic database updates from CLI
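
The OR-based label filter and weighted ranking can be illustrated with a small in-memory sketch. This is not filekor's scoring code and does not use FTS5: the `Record` class, the weights, and the naive substring matching are all assumptions made to show the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    name: str
    labels: set = field(default_factory=set)
    summary: str = ""


def score(record, labels, query, label_weight=2.0, term_weight=1.0):
    """Weighted sum of matched labels and query terms found in name or summary."""
    label_hits = len(record.labels & set(labels))
    haystack = f"{record.name} {record.summary}".lower()
    term_hits = sum(1 for term in query.lower().split() if term in haystack)
    return label_weight * label_hits + term_weight * term_hits


def search(records, labels, query=""):
    """Keep records sharing at least one label (OR logic), best score first."""
    wanted = set(labels)
    matches = [r for r in records if r.labels & wanted]
    scored = [(score(r, labels, query), r.name) for r in matches]
    # Sort by descending score, then by name so ties are stable.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

With OR logic a record qualifies when it carries any of the requested labels, and records matching more labels or more query terms rank higher. Changing `label_weight` and `term_weight` shifts the balance between the two signals.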

Interfaces

  • CLI - Complete command-line interface
  • Library API - Python API for integration
  • 100 Tests - Comprehensive test coverage

Documentation

Guide | Description
Installation | Setup and installation
Usage | CLI commands reference
Library | Python Library API with code examples
Taxonomy | Labels and taxonomy configuration
LLM | LLM provider setup (Gemini, OpenAI, Groq, OpenRouter)
Development | Development and testing guide

Project Structure

fileKor/
├── src/filekor/      # Source code
│   ├── cli.py        # CLI interface
│   ├── db.py         # Database module (SQLite)
│   ├── models.py     # Database models
│   ├── sidecar.py    # Sidecar model
│   ├── labels.py     # Labels module
│   └── llm.py        # LLM providers
├── docs/             # Documentation
├── test-files/       # Test files
├── tests/            # Test suite
└── README.md

License

Apache License 2.0 - See LICENSE file for details.
