Automatically organize local documents into semantic folders using embeddings and clustering.
semantic-organizer
Automatically organize your local documents into meaningful folders — fully offline, no LLMs, no API keys, no cloud.
pip install semantic-organizer
from semantic_organizer import organize
organize("/path/to/my/documents")
What it does
Point it at any folder. It reads your .txt, .pdf, and .docx files, embeds their text to capture what each one is about, groups them by topic, and moves them into labeled subfolders automatically.
Before
documents/
├── q3_report.pdf
├── lecture_notes.txt
├── sprint_retro.docx
├── research_paper.pdf
├── roadmap.txt
└── meeting_notes.docx
After
documents/
├── Financial Reports/
│   └── q3_report.pdf
├── Machine Learning/
│   ├── lecture_notes.txt
│   └── research_paper.pdf
└── Project Management/
    ├── sprint_retro.docx
    ├── roadmap.txt
    └── meeting_notes.docx
How it works
Documents → Extract text → Embed with all-MiniLM-L6-v2 → KMeans clustering → KeyBERT labels → Organize
- Extracts text from .txt, .pdf, and .docx files
- Embeds each document using all-MiniLM-L6-v2, an 80 MB model that runs fully offline after the first download
- Clusters documents by semantic similarity using KMeans, with silhouette analysis to auto-detect the number of groups
- Labels each cluster with descriptive keywords using KeyBERT
- Organizes files into labeled subfolders
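The clustering step above can be sketched in a few lines. This is a pure-Python stand-in written for illustration, not the package's own code: the function names (`kmeans`, `silhouette`, `best_k`), the farthest-point seeding, and the 2-D toy points are all assumptions. A real run would cluster high-dimensional sentence embeddings, but the idea is the same: try several values of k and keep the one with the highest mean silhouette score.

```python
import math


def kmeans(points, k, iters=20):
    """Tiny deterministic k-means; returns one cluster index per point."""
    # Farthest-point seeding keeps the demo reproducible without randomness.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_k(points, k_min=2, k_max=5):
    """Try each k and return (k, score) for the highest mean silhouette."""
    scored = [
        (silhouette(points, kmeans(points, k)), k) for k in range(k_min, k_max + 1)
    ]
    score, k = max(scored)
    return k, score


# Three well-separated blobs, so k = 3 scores highest.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
k, score = best_k(points)
print(k, round(score, 2))
```

The silhouette score rewards clusters that are tight internally and far from their neighbours, which is why splitting or merging the blobs lowers it.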
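Embeddings are cached locally after the first run, so subsequent runs on the same folder skip the embedding step and finish much faster. If you add or remove files, pass force=True to rebuild the cache (see All options).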
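To illustrate why a cache removes most of the cost of a second run, here is a minimal sketch of an embedding cache. It is an assumption-heavy illustration, not the package's store: `embed_with_cache`, `file_fingerprint`, the JSON cache file, and the content-hash keys are all hypothetical (the real store uses .npy and .json files, as listed under Architecture), and `embed_fn` stands in for the actual embedding model.

```python
import hashlib
import json
from pathlib import Path


def file_fingerprint(path: Path) -> str:
    """Hash the file contents so an edited file gets a new cache key."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def embed_with_cache(paths, embed_fn, cache_file: Path, force: bool = False):
    """Return {path: vector}, calling embed_fn only for uncached content.

    embed_fn is a placeholder for the expensive embedding step and must
    return a JSON-serializable list of floats.
    """
    cache = {}
    if cache_file.exists() and not force:
        cache = json.loads(cache_file.read_text())
    result = {}
    for path in paths:
        key = file_fingerprint(path)
        if key not in cache:
            cache[key] = embed_fn(path)  # only reached on a cache miss
        result[str(path)] = cache[key]
    cache_file.write_text(json.dumps(cache))
    return result
```

With `force=True` the existing cache is ignored and every file is embedded again, which mirrors the documented purpose of the `force` option.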
Installation
pip install semantic-organizer
Requires Python 3.10+. The embedding model (~80MB) is downloaded on first use and cached locally.
Usage
Basic
from semantic_organizer import organize
organize("/path/to/my/documents")
Dry run — preview before committing
Always a good idea on first use:
from semantic_organizer import organize

clusters = organize("/path/to/my/documents", dry_run=True)

# Inspect the proposed structure
for cluster in clusters:
    print(cluster["label"])
    for doc in cluster["documents"]:
        print(f"  {doc['filename']}")
Copy mode — originals untouched
organize("/path/to/my/documents", mode="copy")
Undo last operation
from semantic_organizer import undo
undo(store_dir="/path/to/my/documents/.semantic_store")
All options
organize(
    folder="/path/to/my/documents",
    # "move" (default) — moves original files into labeled subfolders
    # "copy" — copies files, originals stay untouched
    mode="copy",
    # Where to persist embeddings between runs
    # Default: <folder>/.semantic_store
    store_dir="/custom/store/path",
    # Re-embed all documents even if a store already exists
    # Use this after adding or removing files from the folder
    force=False,
    # Preview without moving or copying anything
    dry_run=True,
)
Return value
organize() always returns the cluster list — whether or not dry_run is set:
[
    {
        "cluster_id": 0,
        "label": "Machine Learning",
        "documents": [
            {"filename": "lecture_notes.txt", "path": "/abs/path/lecture_notes.txt"},
            {"filename": "research_paper.pdf", "path": "/abs/path/research_paper.pdf"},
        ]
    },
    ...
]
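Because the return value is plain lists and dicts, it is easy to post-process. The sketch below turns a cluster list into (source, destination) pairs for review after a dry run. The helper `planned_moves` is hypothetical and not part of the package, and it assumes each labeled subfolder is created next to the original file, as in the Before/After example. `PurePosixPath` is used only so the output looks the same on every OS.

```python
from pathlib import PurePosixPath


def planned_moves(clusters):
    """List (source, destination) pairs implied by a cluster list."""
    moves = []
    for cluster in clusters:
        for doc in cluster["documents"]:
            src = PurePosixPath(doc["path"])
            # Assumed layout: <original folder>/<cluster label>/<filename>
            dst = src.parent / cluster["label"] / doc["filename"]
            moves.append((src, dst))
    return moves


# Sample data in the documented return format.
sample = [
    {
        "cluster_id": 0,
        "label": "Machine Learning",
        "documents": [
            {"filename": "lecture_notes.txt", "path": "/abs/path/lecture_notes.txt"},
            {"filename": "research_paper.pdf", "path": "/abs/path/research_paper.pdf"},
        ],
    }
]

for src, dst in planned_moves(sample):
    print(f"{src} -> {dst}")
```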
Privacy
Everything runs locally on your machine.
- No files, text, or embeddings are ever sent to a server
- The embedding model (all-MiniLM-L6-v2) is downloaded once from Hugging Face and cached locally
- All subsequent runs are completely offline
Architecture
semantic_organizer/
├── extractor.py — text extraction from .txt, .pdf, .docx
├── embedder.py — sentence embeddings via all-MiniLM-L6-v2
├── store.py — persist embeddings as .npy + metadata as .json
├── clusterer.py — KMeans + Silhouette Analysis
├── labeler.py — KeyBERT keyword extraction
└── controller.py — move/copy files + manifest-based undo
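The idea behind manifest-based undo can be sketched with the standard library alone. This is an illustration under assumptions, not the code in controller.py: the function names, the JSON manifest layout, and the cleanup of empty folders are all hypothetical.

```python
import json
import shutil
from pathlib import Path


def move_with_manifest(moves, manifest_path):
    """Apply (src, dst) moves and record each one so it can be reversed."""
    done = []
    try:
        for src, dst in moves:
            dst = Path(dst)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
            done.append({"src": str(src), "dst": str(dst)})
    finally:
        # Write the manifest even if a move fails part-way through,
        # so the moves that did happen can still be undone.
        Path(manifest_path).write_text(json.dumps(done, indent=2))


def undo_from_manifest(manifest_path):
    """Reverse the recorded moves in last-in, first-out order."""
    manifest_path = Path(manifest_path)
    for entry in reversed(json.loads(manifest_path.read_text())):
        shutil.move(entry["dst"], entry["src"])
        # Remove the labeled folder once it is empty again.
        parent = Path(entry["dst"]).parent
        if not any(parent.iterdir()):
            parent.rmdir()
    manifest_path.unlink()
```

Recording the moves as they happen, rather than recomputing them later, means undo does not depend on clustering giving the same result twice.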
Known limitations
- Short documents (under 50 chars) are skipped — too little text to embed reliably
- Non-English documents may cluster poorly — the default model is optimized for English
- Large folders (500+ files) will be slow on the first run — embedding is the bottleneck. Subsequent runs use the cached store
- Highly uniform folders (all documents on the same topic) may produce less meaningful clusters
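A quick pre-flight check can show how much of a folder is likely to be affected by these limitations before you run organize(). The sketch below is a hypothetical helper, not part of the package. It assumes only the top level of the folder is scanned, and it applies the 50-character threshold to raw .txt content only, since checking .pdf and .docx files would require extracting their text first.

```python
from pathlib import Path

SUPPORTED = {".txt", ".pdf", ".docx"}  # file types listed in this README
MIN_CHARS = 50  # threshold from the limitation above


def preflight(folder):
    """Count supported files and flag .txt files likely to be skipped."""
    report = {"supported": 0, "unsupported": 0, "short_txt": []}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue  # subfolders are ignored in this sketch
        suffix = path.suffix.lower()
        if suffix not in SUPPORTED:
            report["unsupported"] += 1
            continue
        report["supported"] += 1
        if suffix == ".txt":
            text = path.read_text(encoding="utf-8", errors="ignore")
            if len(text.strip()) < MIN_CHARS:
                report["short_txt"].append(path.name)
    return report
```

A folder with more than a few hundred supported files is also a hint that the first run will take a while.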
Development
git clone https://github.com/yourusername/semantic-organizer
cd semantic-organizer
python3 -m venv venv
source venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -v
License
MIT