
A package to automatically sort documents into folders based on content similarity

Project description

Document Sorter

Document Sorter is a Python package that automatically organizes documents into folders based on their content similarity. It supports various file types including PDF, DOCX, XLSX, CSV, TXT, MD, and TEX.
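As a rough illustration of the discovery step, the sketch below collects files whose extension matches the supported types listed above. The helper name `find_documents` and the recursive search are assumptions made for this example; the package's own implementation is not shown here and may differ.

```python
from pathlib import Path

# Extensions taken from the list of supported file types above.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".xlsx", ".csv", ".txt", ".md", ".tex"}


def find_documents(directory="."):
    """Return supported files under `directory` (hypothetical helper).

    Assumption: subdirectories are searched too and extensions are
    matched case-insensitively.
    """
    root = Path(directory)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

Calling `find_documents("/path/to/documents")` would return a sorted list of `Path` objects, skipping anything with an unsupported extension.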

Features

  • Searches for documents in a specified directory
  • Extracts text from various file types
  • Clusters documents by content similarity, using the elbow method to choose the number of clusters automatically
  • Allows users to specify the number of clusters
  • Supports custom keywords for folder names
  • Sorts documents into folders named after each cluster's most relevant keyword, or after user-specified keywords
  • Supports dry run mode for testing
  • Provides verbose output option for detailed information during execution
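The elbow method mentioned in the feature list can be sketched as follows. Given the clustering error (inertia) measured for k = 1, 2, 3, ... clusters, one common heuristic picks the k whose point lies farthest from the straight line joining the first and last points of the curve. This is only an illustration of the general technique: the function name `elbow_point` is hypothetical, and the criterion Document Sorter actually applies is not documented here.

```python
def elbow_point(inertias):
    """Return the k (counting from 1) at the "elbow" of an inertia curve.

    `inertias[i]` is the clustering error obtained with i + 1 clusters.
    The elbow is taken to be the point farthest from the chord that
    joins the first and last points of the curve.
    """
    n = len(inertias)
    if n < 3:
        # Too few points for the curve to have an elbow.
        return n
    x1, y1 = 1, inertias[0]
    x2, y2 = n, inertias[-1]
    best_k, best_dist = 1, -1.0
    for i, y in enumerate(inertias):
        k = i + 1
        # Perpendicular distance to the chord, without the constant
        # denominator (it does not change which k is the maximum).
        dist = abs((y2 - y1) * k - (x2 - x1) * y + x2 * y1 - y2 * x1)
        if dist > best_dist:
            best_k, best_dist = k, dist
    return best_k
```

For example, with the made-up inertia values `[100, 40, 15, 12, 10, 9]` the curve flattens after three clusters, so `elbow_point` returns 3.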

Installation

You can install Document Sorter using pip:

pip install document-sorter

Usage

After installation, run Document Sorter from the command line:

document-sorter [directory_path] [OPTIONS]

Options

  • directory_path: The directory to search for documents (default: current directory)
  • --dry-run: Perform a dry run without moving files
  • --verbose: Print detailed information during execution
  • --clusters N: Specify the number of clusters to use
  • --keywords KEYWORD1 KEYWORD2 ...: Specify custom keywords for folder names
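For readers who want to see how such an interface might be declared, here is a minimal `argparse` sketch that accepts the options listed above. It is a reconstruction from this documentation, not the package's source; the function name `build_parser` and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        prog="document-sorter",
        description="Sort documents into folders based on content similarity.",
    )
    # Optional positional argument; falls back to the current directory.
    parser.add_argument("directory_path", nargs="?", default=".",
                        help="directory to search for documents")
    parser.add_argument("--dry-run", action="store_true",
                        help="perform a dry run without moving files")
    parser.add_argument("--verbose", action="store_true",
                        help="print detailed information during execution")
    parser.add_argument("--clusters", type=int, metavar="N",
                        help="number of clusters to use")
    parser.add_argument("--keywords", nargs="+", metavar="KEYWORD",
                        help="custom keywords for folder names")
    return parser
```

Parsing `["/path/to/documents", "--clusters", "5"]` with this parser yields `directory_path="/path/to/documents"` and `clusters=5`, with `dry_run` and `verbose` left as `False` and `keywords` as `None`.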

Examples

  1. Basic usage (current directory):

    document-sorter
    
  2. Specify a directory with verbose output:

    document-sorter /path/to/documents --verbose
    
  3. Perform a dry run:

    document-sorter /path/to/documents --dry-run
    
  4. Specify the number of clusters:

    document-sorter /path/to/documents --clusters 5
    
  5. Use custom keywords:

    document-sorter /path/to/documents --keywords work personal projects research
    
  6. Combine options:

    document-sorter /path/to/documents --verbose --clusters 4 --keywords work personal projects research
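To make the effect of --dry-run in example 3 concrete, the sketch below plans the same moves in both modes but only touches the filesystem when `dry_run` is false. The function name `sort_into_folders`, the shape of `assignments`, and the choice to create each folder next to the source file are assumptions for illustration only.

```python
import shutil
from pathlib import Path


def sort_into_folders(assignments, dry_run=False):
    """Move each file into its assigned folder (hypothetical sketch).

    `assignments` maps a file path to a folder name. Assumption: the
    folder is created inside the directory that holds the file.
    Returns the list of (source, destination) pairs that were planned.
    """
    planned = []
    for src, folder in assignments.items():
        src = Path(src)
        dest = src.parent / folder / src.name
        planned.append((src, dest))
        if not dry_run:
            # Only a real run creates folders and moves files.
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest))
    return planned
```

With `dry_run=True` the returned plan can be printed for review while every file stays where it is.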
    

Behavior

  • If neither --clusters nor --keywords is specified, the script automatically determines the optimal number of clusters and generates keywords from the document content.
  • If --clusters is specified without --keywords, the script will use the specified number of clusters and generate keywords based on document content.
  • If --keywords is specified without --clusters, the script will automatically determine the optimal number of clusters and use the provided keywords for folder names.
  • If both --clusters and --keywords are specified, the script will use the specified number of clusters and the provided keywords for folder names.
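The four cases above reduce to two independent decisions: where the cluster count comes from and where the folder names come from. The sketch below expresses that as a small function. The name `resolve_settings` and the `"auto"` / `"generated"` markers are hypothetical, and the sketch does not cover what happens when the number of keywords differs from the number of clusters, since that is not specified here.

```python
def resolve_settings(clusters=None, keywords=None):
    """Decide how clusters and folder names are chosen (hypothetical sketch).

    `clusters` is the value of --clusters (or None if omitted) and
    `keywords` is the list given to --keywords (or None if omitted).
    """
    return {
        # A user-supplied count wins; otherwise it is determined automatically.
        "n_clusters": clusters if clusters is not None else "auto",
        # User-supplied keywords win; otherwise names come from the content.
        "folder_names": list(keywords) if keywords else "generated",
    }
```

For instance, `resolve_settings(clusters=5)` gives a fixed count of 5 with generated folder names, matching the second case above.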

Requirements

Document Sorter requires Python 3.6 or later. For a full list of dependencies, see the requirements.txt file.

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files


Source Distribution

document-sorter-0.1.0.tar.gz (5.5 kB view details)

Uploaded Source

Built Distribution

document_sorter-0.1.0-py3-none-any.whl (6.0 kB view details)

Uploaded Python 3

File details

Details for the file document-sorter-0.1.0.tar.gz.

File metadata

  • Download URL: document-sorter-0.1.0.tar.gz
  • Upload date:
  • Size: 5.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.8.8

File hashes

Hashes for document-sorter-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       7c2cf9d1f529bc9bd9e1311787aa53881a697ca93d3338245b60c5e95e16b48b
MD5          c805c968dfb285ce76c75e47d1e85795
BLAKE2b-256  b08b03358e088cf7b8febdb1d843b02f8e7742daa09ad53b029b361f1f555c94


File details

Details for the file document_sorter-0.1.0-py3-none-any.whl.

File hashes

Hashes for document_sorter-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       9ef94bfa29ec56c3a5b872f2ad1e181070cb6860b780c41c2042f75b58c254b6
MD5          8af70a20a0da6530cb7ae2a177674fc6
BLAKE2b-256  58a9672e0885ab17bf786e4df2b919dd159e06e61df67622cff1b4b1d9334076

