A lightning-fast, zero-cloud CLI utility that cleans your Windows directories by parsing file contents and visual data — automatically destroying clones and generating intelligently named folder structures.

🌀 Vortex

Overview
Features
Architecture
Project Structure
Tech Stack
Supported File Formats
Prerequisites
Installation
Usage
How It Works
Configuration
Roadmap
Contributing
License

Overview

Vortex groups your files by meaning, using embedding models that run locally: BAAI/bge-small-en-v1.5 for text and Qdrant/clip-ViT-B-32-vision for images. Text documents, code, PDFs, and images are all analyzed on your own machine, so your data never leaves it.

The Problem: Over time, directories accumulate duplicate files and an unstructured mess of documents, images, and code. Manually organizing them is tedious and error-prone.

The Solution: Vortex automates the entire process in a single command — deduplicating exact clones via SHA-256, then semantically clustering the remaining files using vector embeddings and DBSCAN, and finally reorganizing them into auto-named folders derived from their content.

Features

🔍 Intelligent Deduplication

- Recursive Directory Traversal: Walks the entire target directory tree, resolving all nested files.
- SHA-256 Chunked Hashing: Identifies exact binary duplicates efficiently, even for very large files, using 4 MB chunked reads.
- Storage Statistics: Gathers file size metadata before acting.

🧠 Semantic Clustering

- Multi-Format Content Extraction: Pulls text from plain text files, code, PDFs (via PyMuPDF), and images (via Tesseract OCR).
- Local Embedding Generation: Generates text embeddings with BAAI/bge-small-en-v1.5 and image embeddings with Qdrant/clip-ViT-B-32-vision, all via fastembed.
- Vector Database: Manages semantic indexes in a local Qdrant instance (on-disk, no server needed).
- DBSCAN Clustering: Groups files by semantic proximity using cosine-distance DBSCAN, with eps and min_samples tuned per modality.
- TF-IDF Directory Naming: Automatically generates human-readable folder names by extracting the most relevant terms from each cluster's combined content.

🛡️ Human-in-the-Loop (HITL)

Implementation in progress.

An interactive review phase that will let you verify the proposed folder structure and directory names before Vortex commits changes to your filesystem.

Architecture

flowchart TD
    A["🗂️ Target Directory"] --> B["Phase 1: Deduplication"]
    
    subgraph DEDUP ["src/dedup"]
        B --> B1["dirTraversal — Recursive file discovery"]
        B1 --> B2["fileHash — SHA-256 chunked hashing"]
        B2 --> B3["fileStats — File size metadata"]
        B3 --> B4{"Duplicates found?"}
        B4 -- Yes --> B5["Remove exact clones"]
        B4 -- No --> C
        B5 --> C
    end
    
    C["Unique Files"] --> D["Phase 2: Clustering"]
    
    subgraph CLUSTER ["src/clustering"]
        D --> D1["fileContent — Multi-format text extraction"]
        D1 --> D2["embeddings — Vector generation"]
        D2 --> D2a["Text: BAAI/bge-small-en-v1.5 → 384d"]
        D2 --> D2b["Image: Qdrant/clip-ViT-B-32-vision → 512d"]
        D2a --> D3["Qdrant Vector DB — Local on-disk storage"]
        D2b --> D3
        D3 --> D4["dbscanModel — Cosine DBSCAN clustering"]
        D4 --> D5["dirNaming — TF-IDF folder name generation"]
    end
    
    D5 --> E["Phase 3: HITL Review"]
    E --> F["📁 Organized Directory"]

Project Structure

Vortex/
├── main.py                          # CLI entry point (Typer + Rich)
├── pyproject.toml                   # Project metadata & dependencies
├── uv.lock                          # Locked dependency versions
├── .python-version                  # Python 3.12
│
├── src/
│   ├── dedup/                       # Phase 1 — Deduplication
│   │   ├── dirTraversal.py          # Recursive directory walker
│   │   ├── fileHash.py              # SHA-256 chunked file hashing
│   │   └── fileStats.py             # File size statistics
│   │
│   ├── clustering/                  # Phase 2 — Semantic Clustering
│   │   ├── fileContent.py           # Multi-format content extraction
│   │   ├── embeddings.py            # Text & image embedding generation + Qdrant storage
│   │   ├── dbscanModel.py           # DBSCAN clustering with data retrieval from Qdrant
│   │   └── dirNaming.py             # TF-IDF based directory name generation
│   │
│   └── hitl/                        # Phase 3 — Human-in-the-Loop (WIP)
│
└── docs/                            # Documentation (WIP)

Tech Stack

| Category | Technology | Purpose |
| --- | --- | --- |
| Language | Python ≥ 3.12 | Core runtime |
| CLI Framework | Typer + Rich | Command-line interface with styled output |
| Text Embeddings | FastEmbed (`BAAI/bge-small-en-v1.5`) | 384-dim text vectors |
| Image Embeddings | FastEmbed (`Qdrant/clip-ViT-B-32-vision`) | 512-dim image vectors |
| Vector Database | Qdrant (local on-disk mode) | Semantic index storage |
| Clustering | scikit-learn (DBSCAN) | Density-based grouping |
| PDF Parsing | PyMuPDF | Text extraction from PDFs |
| OCR | Tesseract + pytesseract | Image-to-text extraction |
| Image Processing | Pillow | Image loading for OCR |
| TF-IDF | scikit-learn (TfidfVectorizer) | Directory name generation |
| Package Manager | uv | Fast dependency management |

Supported File Formats

| Category | Extensions | Extraction Method |
| --- | --- | --- |
| Plain Text / Code | `.txt`, `.md`, `.csv`, `.json`, `.py`, `.js`, `.html` | Direct UTF-8 read |
| Documents | `.pdf` | PyMuPDF text extraction |
| Images | `.png`, `.jpg`, `.jpeg` | CLIP embeddings (visual) + Tesseract OCR (textual) |

Note: Images are embedded using CLIP for visual similarity clustering. OCR is used separately when text content is needed (e.g., for directory naming).
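The routing implied by the table above can be sketched with the standard library alone. This is a minimal illustration, not the code in `fileContent.py`: the names `classify` and `read_plain_text` are hypothetical, and both the case-insensitive extension match and the `errors="replace"` decoding policy are assumptions.

```python
from pathlib import Path

# Extension groups copied from the Supported File Formats table.
TEXT_EXTENSIONS = {".txt", ".md", ".csv", ".json", ".py", ".js", ".html"}
PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def classify(path):
    """Return 'text', 'pdf', 'image', or None for unsupported files."""
    suffix = Path(path).suffix.lower()  # assumption: match case-insensitively
    if suffix in TEXT_EXTENSIONS:
        return "text"
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None


def read_plain_text(path):
    """Direct UTF-8 read; undecodable bytes are replaced instead of raising."""
    return Path(path).read_text(encoding="utf-8", errors="replace")
```

PDF and image extraction are left out of the sketch because they depend on PyMuPDF and Tesseract.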

Prerequisites

- Python 3.12+
- uv (recommended)
- Tesseract OCR: Required for image text extraction.
  - Download the Windows installer from UB Mannheim.
  - Ensure tesseract.exe is accessible via your system PATH.

Installation

# Clone the repository
git clone https://github.com/44ompatil/Vortex.git
cd Vortex

# Install dependencies (uv recommended)
uv sync

# Or, using pip
pip install -e .

Usage

Vortex exposes a sort command that orchestrates the full pipeline:

# Sort a directory
uv run python main.py sort <target-directory>

# Example
uv run python main.py sort "C:\Users\you\Downloads"

Available Commands

| Command | Description |
| --- | --- |
| `sort <directory>` | Run the full dedup → cluster → organize pipeline on the target directory |
| `help` | Display available commands and usage information |

What Happens When You Run `sort`

1. Scan: Recursively discovers all files in the target directory.
2. Dedup: Identifies and removes exact binary duplicates (SHA-256).
3. Extract: Pulls text/visual content from each unique file.
4. Embed: Generates vector embeddings (text: 384d, image: 512d).
5. Cluster: Groups semantically similar files using DBSCAN.
6. Name: Generates descriptive folder names via TF-IDF.
7. Organize: Moves files into their newly created, named folders.
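The final Organize step could look roughly like the sketch below. It is an illustration under stated assumptions rather than Vortex's actual code: the function name `organize`, the shape of `plan` (folder name mapped to a list of file paths), and the `_1`, `_2` suffixing used on name collisions are all hypothetical.

```python
import shutil
from pathlib import Path


def organize(root, plan):
    """Move files into named subfolders of root.

    plan maps a folder name to the list of file paths assigned to it.
    Returns the destination paths in the order the files were moved.
    """
    moved = []
    for folder_name, files in plan.items():
        target_dir = Path(root) / folder_name
        target_dir.mkdir(parents=True, exist_ok=True)
        for src in map(Path, files):
            dest = target_dir / src.name
            counter = 1
            # Assumption: never overwrite; add a numeric suffix on collision.
            while dest.exists():
                dest = target_dir / f"{src.stem}_{counter}{src.suffix}"
                counter += 1
            shutil.move(str(src), str(dest))
            moved.append(dest)
    return moved
```

Calling `organize("C:/Users/you/Downloads", {"invoice_tax": [...]})` would create the `invoice_tax` folder if needed and move the listed files into it.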

How It Works

Phase 1: Deduplication (`src/dedup/`)

Files are discovered recursively via os.walk. Each file is hashed with SHA-256 in 4 MB chunks, so even very large files are never read into memory in one piece. Files that share a hash are exact duplicates: one copy is kept and the rest are deleted.
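The hashing and grouping described above can be sketched with the standard library. The function names `sha256_of_file` and `find_duplicates` are hypothetical and are not taken from `fileHash.py`; the sketch only reports duplicate groups and deliberately deletes nothing.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, the documented chunkSize default


def sha256_of_file(path, chunk_size=CHUNK_SIZE):
    """Hash a file in fixed-size chunks so memory use stays bounded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return {hash: [paths]} for every hash shared by two or more files."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            by_hash[sha256_of_file(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Each value in the returned dictionary is one group of byte-identical files; keeping the first path and removing the others would match the behavior described above.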

Phase 2: Semantic Clustering (`src/clustering/`)

Remaining unique files have their content extracted based on file type. Text content is embedded into 384-dimensional vectors using BAAI/bge-small-en-v1.5, while images are embedded into 512-dimensional vectors using Qdrant/clip-ViT-B-32-vision. All vectors are stored in a local Qdrant database.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) then groups files by semantic similarity using cosine distance. The algorithm parameters are tuned separately for each modality:

| Modality | `eps` | `min_samples` |
| --- | --- | --- |
| Text | 0.20 | 2 |
| Image | 0.45 | 2 |
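To make the roles of `eps` and `min_samples` concrete, here is a from-scratch sketch of DBSCAN over cosine distance in pure Python. Vortex itself uses scikit-learn's DBSCAN; this stand-in only illustrates the idea on toy 2-dimensional vectors instead of 384- or 512-dimensional embeddings. It assumes non-zero vectors and counts a point as its own neighbor, which is how scikit-learn interprets `min_samples`.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dbscan(vectors, eps, min_samples):
    """Label each vector with a cluster id, or -1 for noise."""
    n = len(vectors)
    # Each neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


vectors = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
print(dbscan(vectors, eps=0.20, min_samples=2))  # prints [0, 0, 1, 1, -1]
```

With `min_samples` set to 2, any two files within `eps` of each other already form a cluster, and a file with no close neighbor is labeled noise (`-1`).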

Each cluster is assigned a human-readable name generated by running TF-IDF on the combined text content of the cluster's files, extracting the top-2 most distinctive terms.
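The naming step can be illustrated without scikit-learn. The sketch below treats each cluster's combined text as one document and scores terms as raw term count times a smoothed inverse document frequency. It is an assumption-laden stand-in for `TfidfVectorizer`, not the code in `dirNaming.py`: the tokenizer, the `_` separator, the `cluster_<id>` fallback, and the function name `name_clusters` are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphabetic words of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def name_clusters(cluster_texts, top_n_words=2):
    """Map each cluster id to a folder name built from its top TF-IDF terms."""
    counts = {cid: Counter(tokenize(text)) for cid, text in cluster_texts.items()}
    n_docs = len(counts)
    # Document frequency: in how many clusters does each term appear?
    doc_freq = Counter(term for c in counts.values() for term in c)
    names = {}
    for cid, term_counts in counts.items():
        scores = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in term_counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        names[cid] = "_".join(ranked[:top_n_words]) or f"cluster_{cid}"
    return names


cluster_texts = {
    0: "invoice tax invoice payment report invoice tax",
    1: "recipe flour recipe oven report recipe flour",
}
print(name_clusters(cluster_texts))  # prints {0: 'invoice_tax', 1: 'recipe_flour'}
```

The word "report" appears in both clusters, so its inverse document frequency is lower and it ranks behind terms that are specific to one cluster.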

Phase 3: Human-in-the-Loop (`src/hitl/`)

Coming soon. This phase will present the proposed folder structure for interactive review before files are moved.

Configuration

Currently, model parameters are configured in-code:

| Parameter | Location | Default | Description |
| --- | --- | --- | --- |
| `chunkSize` | `fileHash.py` | `4194304` (4 MB) | Byte chunk size for SHA-256 hashing |
| `txtEps` | `dbscanModel.py` | `0.20` | DBSCAN epsilon for text clusters |
| `txtMinPts` | `dbscanModel.py` | `2` | DBSCAN minimum points for text clusters |
| `imgEps` | `dbscanModel.py` | `0.45` | DBSCAN epsilon for image clusters |
| `imgMinPts` | `dbscanModel.py` | `2` | DBSCAN minimum points for image clusters |
| `top_n_words` | `dirNaming.py` | `2` | Number of TF-IDF terms used in folder names |
| `PageSize` | `dbscanModel.py` | `30` | Qdrant scroll page size |
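For experimentation, the defaults in the table could be gathered into a single object. The dataclass below is purely hypothetical: Vortex currently keeps these values in the modules listed above, and both the class name `VortexDefaults` and the snake_case field names are stand-ins for the identifiers in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VortexDefaults:
    """Documented defaults; field names are stand-ins, not Vortex's own."""

    chunk_size: int = 4 * 1024 * 1024  # chunkSize in fileHash.py
    txt_eps: float = 0.20              # txtEps in dbscanModel.py
    txt_min_pts: int = 2               # txtMinPts in dbscanModel.py
    img_eps: float = 0.45              # imgEps in dbscanModel.py
    img_min_pts: int = 2               # imgMinPts in dbscanModel.py
    top_n_words: int = 2               # top_n_words in dirNaming.py
    page_size: int = 30                # PageSize in dbscanModel.py

    def __post_init__(self):
        # Cosine distance lies in [0, 2], so eps must fall in (0, 2].
        for name in ("txt_eps", "img_eps"):
            value = getattr(self, name)
            if not 0 < value <= 2:
                raise ValueError(f"{name} must be in (0, 2], got {value}")
```

An override such as `VortexDefaults(txt_eps=0.30)` loosens text clustering while leaving every other value at its documented default.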

Roadmap

Recursive directory traversal
SHA-256 chunked deduplication
Multi-format content extraction (text, PDF, images)
Local text & image embedding generation
Qdrant vector storage
DBSCAN semantic clustering
TF-IDF auto-naming for directories
Typer CLI with Rich output
Human-in-the-Loop interactive review
Config file support (YAML/TOML)
Cross-modal clustering (text + image in unified space)
Undo / dry-run mode
Progress bars and summary statistics
Additional file format support (.docx, .xlsx, .pptx)

Contributing

Contributions are welcome! Here's how to get started:

1. Fork the repository.
2. Create a feature branch: `git checkout -b feature/your-feature`
3. Commit your changes: `git commit -m "Add your feature"`
4. Push to the branch: `git push origin feature/your-feature`
5. Open a Pull Request.

Please ensure your code follows the existing style and includes appropriate documentation.

License

This project is licensed under the MIT License — see the LICENSE file for details.

Built with ❤️ for anyone drowning in unorganized files.

Project details

Version 0.1.0 (distributed as vortiq), released Jun 5, 2026.
