A package to automatically sort documents into folders based on content similarity
Document Sorter
Document Sorter is a Python package that automatically organizes documents into folders based on their content similarity. It supports various file types including PDF, DOCX, XLSX, CSV, TXT, MD, and TEX.
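As a rough illustration of the discovery step, the sketch below collects files whose extension matches one of the supported types listed above. The names `SUPPORTED_EXTENSIONS` and `find_documents` are hypothetical and not part of the package's API, and whether the real tool also descends into subdirectories is not stated here, so this version only looks at the top level.

```python
from pathlib import Path

# Extensions taken from the supported file types listed above.
# The constant name is an assumption made for this sketch.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".csv", ".txt", ".md", ".tex"}


def find_documents(directory="."):
    """Return supported files directly inside `directory`, sorted by path.

    Hypothetical helper: it does not recurse into subdirectories.
    """
    root = Path(directory)
    return sorted(
        path
        for path in root.iterdir()
        # Compare suffixes case-insensitively so "REPORT.PDF" is also picked up.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Matching on the lower-cased suffix keeps the check cheap; it does not confirm that a file's content really is of that type.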
Features
- Searches for documents in a specified directory
- Extracts text from various file types
- Clusters documents based on content similarity, using the elbow method to choose the number of clusters
- Allows users to specify the number of clusters
- Supports custom keywords for folder names
- Sorts documents into folders named after the most relevant keyword for each cluster or user-specified keywords
- Supports dry run mode for testing
- Provides verbose output option for detailed information during execution
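The elbow method mentioned in the feature list can be sketched as follows. This is one common heuristic, not necessarily the criterion the package uses: given a clustering cost (for example k-means inertia) for k = 1, 2, 3, ..., it picks the k whose point lies farthest from the straight line joining the first and last points of the curve. The function name `pick_elbow` and the input format are assumptions made for illustration.

```python
def pick_elbow(inertias):
    """Pick a cluster count from costs measured for k = 1, 2, ..., len(inertias).

    Hypothetical helper: returns the k whose (k, cost) point is farthest from
    the line through the first and last points of the cost curve.
    """
    n = len(inertias)
    if n == 0:
        raise ValueError("need at least one cost value")
    if n < 3:
        # Too few points to form an elbow; fall back to the largest k tried.
        return n

    x1, y1 = 1, inertias[0]
    x2, y2 = n, inertias[-1]

    best_k, best_dist = 1, -1.0
    for k, y in enumerate(inertias, start=1):
        # Distance from (k, y) to the line, without the constant denominator,
        # which does not affect which k is largest.
        dist = abs((y2 - y1) * k - (x2 - x1) * y + x2 * y1 - y2 * x1)
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k


# Example: the cost drops sharply up to k = 3 and flattens afterwards.
costs = [100, 40, 15, 12, 10, 9]
print(pick_elbow(costs))  # prints 3
```

The distance is computed on the raw axes, so the result can shift if the cost values are rescaled relative to k; normalising both axes first is a common refinement.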
Installation
You can install Document Sorter using pip:
pip install document-sorter
Usage
After installation, run Document Sorter from the command line:
document-sorter [directory_path] [OPTIONS]
Options
- directory_path: The directory to search for documents (default: current directory)
- --dry-run: Perform a dry run without moving files
- --verbose: Print detailed information during execution
- --clusters N: Specify the number of clusters to use
- --keywords KEYWORD1 KEYWORD2 ...: Specify custom keywords for folder names
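For readers who want to see how a command line with these options could be declared, here is a minimal `argparse` sketch. It mirrors the option names and defaults listed above, but it is not the package's actual source; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="document-sorter",
        description="Sort documents into folders based on content similarity.",
    )
    # Optional positional argument; defaults to the current directory.
    parser.add_argument(
        "directory_path",
        nargs="?",
        default=".",
        help="The directory to search for documents",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Perform a dry run without moving files",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="Print detailed information during execution",
    )
    # None means "not given", so the caller can choose a value automatically.
    parser.add_argument(
        "--clusters",
        type=int,
        default=None,
        metavar="N",
        help="Specify the number of clusters to use",
    )
    parser.add_argument(
        "--keywords",
        nargs="+",
        default=None,
        metavar="KEYWORD",
        help="Specify custom keywords for folder names",
    )
    return parser
```

Because `--keywords` accepts one or more values, in this sketch the directory has to come before it on the command line, as in the examples in this document.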
Examples
- Basic usage (current directory):
    document-sorter
- Specify a directory with verbose output:
    document-sorter /path/to/documents --verbose
- Perform a dry run:
    document-sorter /path/to/documents --dry-run
- Specify the number of clusters:
    document-sorter /path/to/documents --clusters 5
- Use custom keywords:
    document-sorter /path/to/documents --keywords work personal projects research
- Combine options:
    document-sorter /path/to/documents --verbose --clusters 4 --keywords work personal projects research
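The effect of the dry-run example above can be pictured with the following sketch: the same plan of moves is computed either way, but files are only touched when dry-run mode is off. The function `sort_into_folders` and the shape of its `assignments` argument are assumptions for illustration, not the package's internals.

```python
import shutil
from pathlib import Path


def sort_into_folders(assignments, base_dir, dry_run=False):
    """Move each file into `base_dir/<folder>/`, or only report the plan.

    `assignments` maps a file path to the folder name chosen for it.
    Returns the list of (source, destination) pairs. Hypothetical helper.
    """
    base = Path(base_dir)
    moves = []
    for src, folder in assignments.items():
        src = Path(src)
        dst = base / folder / src.name
        moves.append((src, dst))
        if dry_run:
            # Report what would happen without creating folders or moving files.
            print(f"[dry run] would move {src} -> {dst}")
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
    return moves
```

Running with `dry_run=True` first and reviewing the printed plan is a cheap way to check folder names before any file is relocated.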
Behavior
- If neither --clusters nor --keywords is specified, the script automatically determines the optimal number of clusters and generates keywords based on document content.
- If --clusters is specified without --keywords, the script uses the specified number of clusters and generates keywords based on document content.
- If --keywords is specified without --clusters, the script automatically determines the optimal number of clusters and uses the provided keywords for folder names.
- If both --clusters and --keywords are specified, the script uses the specified number of clusters and the provided keywords for folder names.
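The four cases above reduce to two independent decisions: where the cluster count comes from, and where the folder names come from. The sketch below encodes that with a hypothetical `resolve_plan` function; the returned labels are invented for illustration. What happens when the number of keywords differs from the number of clusters is not described here, so the sketch leaves that case out.

```python
def resolve_plan(clusters=None, keywords=None):
    """Return (cluster_count_source, folder_name_source) for the given options.

    Hypothetical helper: "user" means the value came from the command line,
    "auto" means the cluster count is determined automatically, and
    "generated" means folder names are derived from document content.
    """
    cluster_source = "user" if clusters is not None else "auto"
    name_source = "user" if keywords else "generated"
    return cluster_source, name_source


# The four combinations described in the list above.
print(resolve_plan())                                  # ('auto', 'generated')
print(resolve_plan(clusters=5))                        # ('user', 'generated')
print(resolve_plan(keywords=["work", "personal"]))     # ('auto', 'user')
print(resolve_plan(clusters=4, keywords=["work"]))     # ('user', 'user')
```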
Requirements
Document Sorter requires Python 3.6 or later. For a full list of dependencies, see the requirements.txt file.
License
This project is licensed under the MIT License.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
File details
Details for the file document-sorter-0.1.0.tar.gz.
File metadata
- Download URL: document-sorter-0.1.0.tar.gz
- Upload date:
- Size: 5.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7c2cf9d1f529bc9bd9e1311787aa53881a697ca93d3338245b60c5e95e16b48b
MD5 | c805c968dfb285ce76c75e47d1e85795
BLAKE2b-256 | b08b03358e088cf7b8febdb1d843b02f8e7742daa09ad53b029b361f1f555c94
File details
Details for the file document_sorter-0.1.0-py3-none-any.whl.
File metadata
- Download URL: document_sorter-0.1.0-py3-none-any.whl
- Upload date:
- Size: 6.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.8.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 9ef94bfa29ec56c3a5b872f2ad1e181070cb6860b780c41c2042f75b58c254b6
MD5 | 8af70a20a0da6530cb7ae2a177674fc6
BLAKE2b-256 | 58a9672e0885ab17bf786e4df2b919dd159e06e61df67622cff1b4b1d9334076
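A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The helper names `file_sha256` and `matches_sha256` are made up for this sketch, and the file path in the usage comment assumes the file was saved under its published name in the current directory.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's SHA256 digest with an expected hex string."""
    return file_sha256(path) == expected_hex.strip().lower()


# Usage (assumes the wheel has been downloaded to the current directory):
# matches_sha256(
#     "document_sorter-0.1.0-py3-none-any.whl",
#     "9ef94bfa29ec56c3a5b872f2ad1e181070cb6860b780c41c2042f75b58c254b6",
# )
```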