Automatically organize local documents into semantic folders using embeddings and clustering.

semantic-organizer

Automatically organize your local documents into meaningful folders — fully offline, no LLMs, no API keys, no cloud.

pip install semantic-organizer

from semantic_organizer import organize

organize("/path/to/my/documents")

What it does

Point it at any folder. It reads your .txt, .pdf, and .docx files, embeds their text to capture what each one is about, groups them by topic, and moves them into labeled subfolders automatically.

Before

documents/
├── q3_report.pdf
├── lecture_notes.txt
├── sprint_retro.docx
├── research_paper.pdf
├── roadmap.txt
└── meeting_notes.docx

After

documents/
├── Financial Reports/
│   └── q3_report.pdf
├── Machine Learning/
│   ├── lecture_notes.txt
│   └── research_paper.pdf
└── Project Management/
    ├── sprint_retro.docx
    ├── roadmap.txt
    └── meeting_notes.docx

How it works

Documents → Extract text → Embed with all-MiniLM-L6-v2 → KMeans clustering → KeyBERT labels → Organize
  1. Extracts text from .txt, .pdf, and .docx files
  2. Embeds each document using all-MiniLM-L6-v2, an ~80 MB model that runs fully offline after the first download
  3. Clusters documents by semantic similarity using KMeans + Silhouette Analysis to auto-detect the optimal number of groups
  4. Labels each cluster with descriptive keywords using KeyBERT
  5. Organizes files into labeled subfolders
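
Step 3 can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy, uses synthetic vectors in place of real document embeddings, and the `pick_k` helper and its parameters are hypothetical names for illustration, not the package's internal API. It tries several cluster counts and keeps the one with the highest silhouette score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def pick_k(embeddings, k_max=10, seed=0):
    """Return (best_k, labels) for the k in 2..k_max with the highest silhouette score."""
    n = len(embeddings)
    best = None  # (score, k, labels)
    # silhouette_score is only defined for 2 <= k <= n - 1
    for k in range(2, min(k_max, n - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    if best is None:
        raise ValueError("need at least 3 documents to compare cluster counts")
    return best[1], best[2]


# Synthetic stand-in for document embeddings: three tight groups in 8 dimensions.
rng = np.random.default_rng(0)
centers = np.eye(8)[:3] * 5.0
fake_embeddings = np.vstack([c + 0.1 * rng.standard_normal((10, 8)) for c in centers])

best_k, labels = pick_k(fake_embeddings, k_max=6)
print(best_k)
```

With real embeddings the groups are rarely this well separated, which is one reason highly uniform folders can yield weak clusters.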

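Embeddings are cached locally after the first run, so subsequent runs on the same folder skip the embedding step and finish much faster. If files were added or removed since the last run, pass force=True to re-embed.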


Installation

pip install semantic-organizer

Requires Python 3.10+. The embedding model (~80MB) is downloaded on first use and cached locally.


Usage

Basic

from semantic_organizer import organize

organize("/path/to/my/documents")

Dry run — preview before committing

Always a good idea on first use:

clusters = organize("/path/to/my/documents", dry_run=True)

# Inspect the proposed structure
for cluster in clusters:
    print(cluster["label"])
    for doc in cluster["documents"]:
        print(f"  {doc['filename']}")

Copy mode — originals untouched

organize("/path/to/my/documents", mode="copy")

Undo last operation

from semantic_organizer import undo

undo(store_dir="/path/to/my/documents/.semantic_store")
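
The Architecture section describes undo as manifest-based. How the package implements it is not shown here, so the following is only a sketch of the general idea under stated assumptions: the manifest file name, its JSON layout, and the helpers `move_with_manifest` and `undo_from_manifest` are all hypothetical.

```python
import json
import shutil
import tempfile
from pathlib import Path


def move_with_manifest(moves, manifest_path):
    """Move each (src, dst) pair and record it so the operation can be reversed."""
    done = []
    for src, dst in moves:
        dst = Path(dst)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        done.append({"src": str(src), "dst": str(dst)})
    Path(manifest_path).write_text(json.dumps(done, indent=2), encoding="utf-8")


def undo_from_manifest(manifest_path):
    """Move every recorded file back, newest first, and remove emptied folders."""
    manifest_path = Path(manifest_path)
    records = json.loads(manifest_path.read_text(encoding="utf-8"))
    for rec in reversed(records):
        shutil.move(rec["dst"], rec["src"])
        parent = Path(rec["dst"]).parent
        if not any(parent.iterdir()):
            parent.rmdir()
    manifest_path.unlink()


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "roadmap.txt").write_text("Q3 roadmap", encoding="utf-8")
    manifest = root / "manifest.json"  # hypothetical manifest location

    target = root / "Project Management" / "roadmap.txt"
    move_with_manifest([(root / "roadmap.txt", target)], manifest)
    moved = target.exists()

    undo_from_manifest(manifest)
    restored = (root / "roadmap.txt").exists()
    folder_removed = not (root / "Project Management").exists()
```

Recording every move before relying on it is what makes the operation reversible; copy mode would need a different undo (deleting the copies) that this sketch does not cover.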

All options

organize(
    folder="/path/to/my/documents",

    # "move" (default) — moves original files into labeled subfolders
    # "copy"           — copies files, originals stay untouched
    mode="copy",

    # Where to persist embeddings between runs
    # Default: <folder>/.semantic_store
    store_dir="/custom/store/path",

    # Re-embed all documents even if a store already exists
    # Use this after adding or removing files from the folder
    force=False,

    # Preview without moving or copying anything
    dry_run=True,
)

Return value

organize() always returns the cluster list — whether or not dry_run is set:

[
    {
        "cluster_id": 0,
        "label": "Machine Learning",
        "documents": [
            {"filename": "lecture_notes.txt", "path": "/abs/path/lecture_notes.txt"},
            {"filename": "research_paper.pdf", "path": "/abs/path/research_paper.pdf"},
        ]
    },
    ...
]
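
Since the return value is a plain list of dictionaries, it can be post-processed without the library. As a small sketch, the hypothetical helper `render_tree` below (not part of the package) prints a dry-run result in the same tree layout used in the Before/After example:

```python
def render_tree(clusters, root="documents"):
    """Format a cluster list (shape as documented above) as a folder tree."""
    lines = [f"{root}/"]
    for ci, cluster in enumerate(clusters):
        last_cluster = ci == len(clusters) - 1
        lines.append(("└── " if last_cluster else "├── ") + cluster["label"] + "/")
        indent = "    " if last_cluster else "│   "
        docs = cluster["documents"]
        for di, doc in enumerate(docs):
            branch = "└── " if di == len(docs) - 1 else "├── "
            lines.append(indent + branch + doc["filename"])
    return "\n".join(lines)


# Hand-written sample in the documented shape; real data would come from organize(..., dry_run=True).
sample = [
    {
        "cluster_id": 0,
        "label": "Machine Learning",
        "documents": [
            {"filename": "lecture_notes.txt", "path": "/abs/path/lecture_notes.txt"},
            {"filename": "research_paper.pdf", "path": "/abs/path/research_paper.pdf"},
        ],
    },
    {
        "cluster_id": 1,
        "label": "Financial Reports",
        "documents": [
            {"filename": "q3_report.pdf", "path": "/abs/path/q3_report.pdf"},
        ],
    },
]

tree = render_tree(sample)
print(tree)
```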

Privacy

Everything runs locally on your machine.

  • No files, text, or embeddings are ever sent to a server
  • The embedding model (all-MiniLM-L6-v2) is downloaded once from HuggingFace and cached locally
  • All subsequent runs are completely offline

Architecture

semantic_organizer/
├── extractor.py    — text extraction from .txt, .pdf, .docx
├── embedder.py     — sentence embeddings via all-MiniLM-L6-v2
├── store.py        — persist embeddings as .npy + metadata as .json
├── clusterer.py    — KMeans + Silhouette Analysis
├── labeler.py      — KeyBERT keyword extraction
└── controller.py   — move/copy files + manifest-based undo
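
To make the store layer concrete, here is a rough sketch of persisting embeddings as .npy with metadata as .json. The file names `embeddings.npy` and `metadata.json`, the metadata layout, and the helpers `save_store` and `load_store` are assumptions for illustration; the actual store.py may differ.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def save_store(store_dir, paths, embeddings):
    """Write the embedding matrix and the matching list of file paths."""
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    np.save(store_dir / "embeddings.npy", np.asarray(embeddings, dtype=np.float32))
    meta = {"paths": [str(p) for p in paths]}
    (store_dir / "metadata.json").write_text(json.dumps(meta), encoding="utf-8")


def load_store(store_dir):
    """Return (paths, embeddings), or None if the store is missing or inconsistent."""
    store_dir = Path(store_dir)
    emb_file = store_dir / "embeddings.npy"
    meta_file = store_dir / "metadata.json"
    if not (emb_file.exists() and meta_file.exists()):
        return None
    paths = json.loads(meta_file.read_text(encoding="utf-8"))["paths"]
    embeddings = np.load(emb_file)
    if len(paths) != len(embeddings):
        return None  # row count and path count disagree: treat as a cache miss
    return paths, embeddings


# Demo: all-MiniLM-L6-v2 produces 384-dimensional vectors, so use that width.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / ".semantic_store"
    miss = load_store(store)  # nothing saved yet
    vectors = np.ones((2, 384), dtype=np.float32)
    save_store(store, ["a.txt", "b.pdf"], vectors)
    cached_paths, cached_vectors = load_store(store)
```

Keeping the vectors in one array and the paths in a parallel list means row i of the matrix always belongs to path i, which is why the loader rejects a store where the two lengths disagree.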

Known limitations

  • Short documents (under 50 chars) are skipped — too little text to embed reliably
  • Non-English documents may cluster poorly — the default model is optimized for English
  • Large folders (500+ files) will be slow on the first run — embedding is the bottleneck. Subsequent runs use the cached store
  • Highly uniform folders (all documents on the same topic) may produce less meaningful clusters
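
Some of these limitations can be checked before running the organizer. The sketch below is a hypothetical pre-flight helper, not part of the package. It assumes only top-level files are scanned and that the 50-character threshold applies to stripped text, and it can only measure .txt files because .pdf and .docx need real text extraction.

```python
import tempfile
from pathlib import Path

SUPPORTED = {".txt", ".pdf", ".docx"}
MIN_CHARS = 50       # threshold from the limitations list
LARGE_FOLDER = 500   # folder size at which the first run is described as slow


def preflight(folder):
    """Count supported files and flag .txt files likely to be skipped as too short."""
    files = [
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    ]
    too_short = [
        p.name for p in files
        if p.suffix.lower() == ".txt"
        and len(p.read_text(encoding="utf-8", errors="ignore").strip()) < MIN_CHARS
    ]
    return {
        "supported": len(files),
        "too_short": sorted(too_short),
        "large": len(files) >= LARGE_FOLDER,
    }


# Demo on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "note.txt").write_text("todo", encoding="utf-8")
    (root / "essay.txt").write_text(
        "Quarterly revenue grew eight percent, driven mostly by subscription renewals.",
        encoding="utf-8",
    )
    (root / "image.png").write_bytes(b"\x89PNG")
    report = preflight(root)

print(report)
```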

Development

git clone https://github.com/yourusername/semantic-organizer
cd semantic-organizer

python3 -m venv venv
source venv/bin/activate

pip install -e ".[dev]"

pytest tests/ -v

License

MIT
