A simple package to hunt down file duplicates
Project description
DupliCat
A simple utility for finding duplicated files within a specified path. It is intended to be a library but can also be used as a commandline tool, it doesn't delete the duplicate files found but returns a list of junk files so that you can choose the ones to delete.
Usage As A Library:
- Import the dupliCat class and create an object by passing the following arguments,
path
where the search will be made, defaults to current directory.recurse
boolean, set to true if you want it to recurse down to all files in the path including sub-dirs defaults toFalse
Methods
-
the
generate_secure_hash
method takes a file as first argument and generates a secure-hash for it. Hashing algorithm is blake2b, key is the size of the file, it returns the file with secure_hash attribute set. File must be of typedupliFile
. -
read_chunk
this method reads a default 400 bytes of data from file. It takes the file as first positional argument and size as second argument which defaults to 400. File must be of typedupliFile
-
human_size
this is a static method that converts bytes into human-readable format.
>>> human_size(nbytes=123456)
>>> 120.56 KB
-
hash_chunk
static method, takes two positional arguments,text: str
andkey: int
hashes text with key using blake2b. -
call the
search_duplicate
method to begin the 🔍 search, search results will be stored in theduplicates
property of the class. This method is somewhat the main api of the class, it does everything for you, calling other methods instead of this might remove the functionality of using files fromsize_index
as input for generating a hash index.the
search_duplicate
method has the following optional argumentsuse_hash
find duplicates using hash table if set to True otherwise uses size_table, using the hash_table is more accurate in conditions where different files have same sizes and quite fast since table is generated using the size table, parameter defaults toTrue
. Using the size table is faster but does not guarantee accuracy.from_size
it defaults to True, this enables thesearch_duplicate
method to generate a hash index using files from the size index. Defaults toTrue
.
Note:
Both parameters are set to True for more accurate search since search will be done using the
`size_index` and also the `hash_index`.
Junk files set by this method contains all duplicates with one file left over for each.
- use the
analyse
method to analyse search result, this returns a named tuple of typeAnalysis
. It contains the total number of duplicate files accessed throughanalysis.total_file_num
, their total size on the disk accessed throughanalysis.total_size
and the most occurred file, accessed throughanalysis.most_occurrence
.
Properties
size_index
You can also access the size index using the property. it is a dictionary containing files grouped by their sizes.hash_index
You can also access the hash index using this property. It is a dictionary containing files grouped by their secure hashes.fetched_file
access all fetched files from the search pathpath
where the search will be made, defaults to current directory.recurse
boolean, set to true if you want it to recurse down to all files in the path including sub-dirs defaults toFalse
Usage From Commandline
Functionality removed for now.
Contact
😄 Happy Coding!
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.