PDF metadata extraction CLI using PyExifTool

fileKor

A local metadata engine that extracts, summarizes, classifies, and tags files using taxonomy-based labeling.

Quick Start

# Install uv (Windows, via winget)
winget install astral-sh.uv


git clone filekor 

cd filekor

# Setup
uv venv

# Windows
.venv\Scripts\activate
# macOS/Linux
source .venv/bin/activate

uv pip install -e .

# CLI Usage
filekor extract documento.pdf
filekor sidecar documento.pdf
filekor sidecar ./documentos --dir             # Process directory (generates merged.kor by default)
filekor sidecar ./documentos --dir --no-merge  # Generate individual .kor files
filekor sidecar ./documentos --dir --db        # Use database to regenerate when available
filekor labels documento.pdf
filekor sync documento.kor                     # Sync existing .kor to database
filekor merge ./directorio                     # Merge multiple .kor files
filekor delete --path ./doc.pdf                # Delete by path
filekor delete --sha <hash>                    # Delete by SHA256
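
The `--sha` option above expects a SHA256 value. As a minimal sketch, assuming the hash is the hex digest of the file's raw bytes (the source does not confirm how filekor computes it), such a digest can be produced with the standard library. The helper name `sha256_of_file` is hypothetical and not part of filekor:

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If filekor hashes files the same way, the returned string could be passed to `filekor delete --sha`.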

Library Usage

filekor can be used as a Python library for database-backed queries and search:

from filekor.db import get_db, sync_file, search_files

# Get database instance (lazy singleton)
db = get_db()

# Sync a .kor file to the database
sync_file("./documento.kor")

# Search files by labels and content with scoring
results = search_files(
    labels=["finance", "2026"],
    query="budget report"
)
# Returns ranked results with relevance scores

Enable auto-sync in config.yaml to automatically update the database when using CLI commands.

Features

Core Features

  • Metadata Extraction - Extract metadata from PDF, TXT, and MD files using PyExifTool
  • Text Extraction - Extract and summarize text content from supported files
  • Sidecar Generation - Generate YAML sidecar files (.kor) with full metadata
  • Taxonomy Labels - LLM-based classification with custom taxonomy support

LLM Providers

  • Google Gemini - Native Gemini API support
  • OpenAI - GPT-4o, GPT-4o-mini support
  • Groq - Fast inference with Llama models
  • OpenRouter - Access to 200+ free models
  • Mock Provider - Testing without API calls
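
To illustrate why a mock provider helps with testing, here is a minimal sketch of a provider interface with a deterministic stand-in that needs no API key or network access. The names `LabelProvider`, `MockProvider`, `classify`, and `label_document` are hypothetical; they are not taken from filekor's `llm.py`, whose actual interface may differ.

```python
from typing import Protocol


class LabelProvider(Protocol):
    """Hypothetical interface: map text to labels drawn from a taxonomy."""

    def classify(self, text: str, taxonomy: list[str]) -> list[str]: ...


class MockProvider:
    """Deterministic stand-in: returns taxonomy labels that appear in the text."""

    def classify(self, text: str, taxonomy: list[str]) -> list[str]:
        lowered = text.lower()
        # Keep taxonomy order so results are reproducible across runs.
        return [label for label in taxonomy if label.lower() in lowered]


def label_document(provider: LabelProvider, text: str, taxonomy: list[str]) -> list[str]:
    # Any object with a matching classify() method satisfies the protocol,
    # so a real LLM-backed provider could be swapped in without other changes.
    return provider.classify(text, taxonomy)
```

Because the calling code depends only on the `classify` method, tests can run against `MockProvider` while production code passes a real provider.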

Database & Search

  • SQLite Database - Index all .kor metadata
  • Full-Text Search - FTS5 for fast filename/metadata search
  • Multi-Label Search - OR logic for filtering by multiple labels
  • Relevance Scoring - Configurable weights for search ranking
  • Auto-Sync - Automatic database updates from CLI
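
The OR-based label filter and weighted ranking described above can be sketched in plain Python. This is an illustration under assumptions, not filekor's scoring algorithm: the `Record` type, the `search` function, and the `label_weight` and `text_weight` parameters are all hypothetical, and a simple substring match stands in for FTS5.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    name: str
    labels: set[str] = field(default_factory=set)
    summary: str = ""


def search(records, labels, query, label_weight=2.0, text_weight=1.0):
    """Rank records matching ANY requested label (OR logic) by a weighted score."""
    wanted = {label.lower() for label in labels}
    terms = query.lower().split()
    ranked = []
    for record in records:
        matched = wanted & {label.lower() for label in record.labels}
        if wanted and not matched:
            continue  # OR logic: at least one requested label must match
        haystack = f"{record.name} {record.summary}".lower()
        term_hits = sum(1 for term in terms if term in haystack)
        score = label_weight * len(matched) + text_weight * term_hits
        ranked.append((score, record.name))
    # Highest score first; ties are broken by name for a stable order.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked
```

Raising `label_weight` relative to `text_weight` favors records that carry more of the requested labels; the reverse favors records whose name or summary matches more query terms.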

Interfaces

  • CLI - Complete command-line interface
  • Library API - Python API for integration
  • 100 Tests - Comprehensive test coverage

Documentation

Guide         Description
Installation  Setup and installation
Usage         CLI commands reference
Library       Python library API with code examples
Taxonomy      Labels and taxonomy configuration
LLM           LLM provider setup (Gemini, OpenAI, Groq, OpenRouter)
Development   Development and testing guide

Project Structure

fileKor/
├── src/filekor/      # Source code
│   ├── cli.py        # CLI interface
│   ├── db.py         # Database module (SQLite)
│   ├── models.py     # Database models
│   ├── sidecar.py    # Sidecar model
│   ├── labels.py     # Labels module
│   └── llm.py        # LLM providers
├── docs/             # Documentation
├── test-files/       # Test files
├── tests/            # Test suite
└── README.md

License

Apache License 2.0 - See LICENSE file for details.

Download files

Source Distribution

filekor-0.1.1.tar.gz (50.1 kB), source

Built Distribution

filekor-0.1.1-py3-none-any.whl (50.9 kB), Python 3

File details

Details for the file filekor-0.1.1.tar.gz.

File metadata

  • Size: 50.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for filekor-0.1.1.tar.gz
Algorithm    Hash digest
SHA256       bdbddc265963d9c9e253586f1953de87b9e88f3aa2a0050d6c862c908f8b6ff8
MD5          e92588ad61ca110183642ecc5bb650b4
BLAKE2b-256  2e569f45cf20e59e7a1b65998f065b23c35ee5348ac086d40f8856048648126f

File details

Details for the file filekor-0.1.1-py3-none-any.whl.

File metadata

  • Size: 50.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for filekor-0.1.1-py3-none-any.whl
Algorithm    Hash digest
SHA256       9151d9e77debf78c327eeaed4d3fab9a37bf3becdbd878702944a6afd4e61582
MD5          a2254397c73c24a2e4a9908f2bf48b3a
BLAKE2b-256  618c9453bd8ae73855727debb6b34a9a14262e1aae2f69842a94c0d4817c4c90
