
A simple package to hunt down file duplicates.

Project description

DupliCat

A simple utility for finding duplicated files within a specified path. It is intended to be used as a library but also works as a command-line tool. It does not delete the duplicate files it finds; instead, it returns a list of junk files so that you can choose which ones to delete.

Usage As A Library

  • Import the dupliCat class and create an object by passing the following arguments (see the sketch below):
    • path: the directory in which the search will be made; defaults to the current directory.
    • recurse: a boolean; set it to True if you want the search to recurse through all files in the path, including sub-directories; defaults to False.
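
    A minimal sketch (the exact import path is an assumption based on the package name):

      >>> from dupliCat import dupliCat  # import path assumed from the package name
      >>> hunter = dupliCat(path=".", recurse=True)  # search here, including sub-dirs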

Methods

  • the generate_secure_hash method takes a file as its first argument and generates a secure hash for it. The hashing algorithm is blake2b, keyed with the size of the file. It returns the file with its secure_hash attribute set. The file must be of type dupliFile.
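
    For example, a sketch continuing with the object above (assumes files have already been fetched from the search path):

      >>> file = hunter.fetched_files[0]            # a dupliFile instance
      >>> file = hunter.generate_secure_hash(file)  # returns the file with secure_hash set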

  • read_chunk this method reads a chunk of data from a file, 400 bytes by default. It takes the file as its first positional argument and the size as its second argument, which defaults to 400. The file must be of type dupliFile.
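
    A sketch of reading chunks (the parameter name follows the description above):

      >>> chunk = hunter.read_chunk(file)             # first 400 bytes by default
      >>> chunk = hunter.read_chunk(file, size=1024)  # read a larger chunk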

  • human_size this is a static method that converts a number of bytes into a human-readable string.

      >>> print(human_size(nbytes=123456))
      120.56 KB
    
  • hash_chunk a static method that takes two positional arguments, text: str and key: int, and hashes text with key using blake2b.
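
    A sketch (it is static, so it can be called on the class itself):

      >>> digest = dupliCat.hash_chunk("some text", 123456)  # blake2b of the text, keyed with 123456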

  • call the search_duplicate method to begin the 🔍 search; results are stored in the duplicates property of the class. It takes no additional arguments and is the main API of the class: it does everything for you, so calling the other methods directly may bypass the step of using files from size_index as the input for generating the hash index. The junk files set by this method contain all duplicates, leaving one original file for each. A minimal sketch follows below.

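    A minimal end-to-end sketch, continuing with the object created earlier:

      >>> hunter.search_duplicate()  # builds the indexes and collects duplicates
      >>> hunter.duplicates          # the search results
      >>> hunter.junk_files          # duplicates, minus one original copy each
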
  • use the analyse method to analyse the search results. It returns a named tuple of type Analysis, containing the total number of duplicate files (analysis.total_file_num), their total size on disk (analysis.total_size), and the most frequently occurring file (analysis.most_occurrence).

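    A sketch of inspecting the returned named tuple:

      >>> analysis = hunter.analyse()
      >>> analysis.total_file_num   # total number of duplicate files
      >>> analysis.total_size       # their total size on disk
      >>> analysis.most_occurrence  # the most frequently occurring file
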
  • the generate_size_index method generates the size index from files and sets the result to self.size_index. It takes the parameter

    • files: the files from which the size index should be generated.

  • the generate_hash_index method generates the hash index from the files in the size index and sets the result to self.hash_index. It takes the argument

    • files: the files from which the hash index should be generated.
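
    A sketch of this lower-level flow (search_duplicate normally drives it for you; that the values of size_index are lists of dupliFile objects is an assumption):

      >>> hunter.generate_size_index(hunter.fetched_files)  # sets hunter.size_index
      >>> candidates = [f for group in hunter.size_index.values() for f in group]
      >>> hunter.generate_hash_index(candidates)            # sets hunter.hash_index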

Properties

  • size_index access the size index via this property; it is a dictionary containing files grouped by their sizes.
  • hash_index access the hash index via this property; it is a dictionary containing files grouped by their secure hashes.
  • fetched_files access all files fetched from the search path.
  • path the directory in which the search will be made; defaults to the current directory.
  • recurse a boolean; set it to True if you want the search to recurse through all files in the path, including sub-directories; defaults to False.
  • junk_files a list containing all duplicate files, leaving one original copy of each, meaning you can safely go ahead and delete every file in this list (see the sketch below).
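
A deletion sketch built on junk_files (the path attribute on a dupliFile is an assumption; review the list before deleting anything):

      >>> import os
      >>> for junk in hunter.junk_files:
      ...     os.remove(junk.path)  # 'path' is an assumed attribute holding the file's location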

Updates - version 3.0.5

  • fixed the total size value
  • added junk_files property
  • new method set_secure_hash for setting the secure hash of a file if one is provided; otherwise it generates one for the file.
  • updated generate_secure_hash to only generate and return a secure hash for the file
  • fetch_files now uses os.scandir recursively instead of os.walk for faster file fetching.
  • increased overall speed.

Usage From Commandline

You can now use dupliCat from the command line.

$ dupliCat --help

The command above prints usage information for the command-line interface.

Contact

teddbug-S

Kwieeciol

😄 Happy Coding!

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dupliCat-3.7.2.tar.gz (10.6 kB)

Uploaded Source

Built Distribution

dupliCat-3.7.2-py3-none-any.whl (10.8 kB)

Uploaded Python 3

File details

Details for the file dupliCat-3.7.2.tar.gz.

File metadata

  • Download URL: dupliCat-3.7.2.tar.gz
  • Upload date:
  • Size: 10.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.10.4

File hashes

Hashes for dupliCat-3.7.2.tar.gz
Algorithm Hash digest
SHA256 5c76fd97123c043acf9335d45a30348b8c6a554b40d82bef447687bfeb7c5c2e
MD5 01d3bf17bf7be4bec96ab8c28e8bf820
BLAKE2b-256 930710bf11ce7b3b3090038b51aebc0cb87d17c6e88ab6a7bde30d3e788de0a3

See more details on using hashes here.

File details

Details for the file dupliCat-3.7.2-py3-none-any.whl.

File metadata

  • Download URL: dupliCat-3.7.2-py3-none-any.whl
  • Upload date:
  • Size: 10.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.0 CPython/3.10.4

File hashes

Hashes for dupliCat-3.7.2-py3-none-any.whl
Algorithm Hash digest
SHA256 8b7545438128f3cd50cf89e9b8736a015c07f4afc3838e7611296638384331e3
MD5 5a9d58a0886da389dc96e41f06e26853
BLAKE2b-256 1f2786411a768bc48ddeb33a5e3390487667664238f1c719bbeb11c616d93cc5

See more details on using hashes here.
