
Find and manage duplicate files on the file system

Project description

This package is designed to find and manage duplicate files, and contains two utilities:

  • dupfind - finds duplicates

  • dupmanage - manages the found duplicates

DUPFIND UTILITY:

The dupfind utility allows you to find duplicate files and directories in your file system.

How the utility finds duplicate files:

By default, the utility identifies duplicate files by file content.
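Such a content-based scan can be sketched as hashing each file's bytes and grouping files with identical digests. This is a minimal illustration, not the package's actual code; the function name and the use of SHA-256 are assumptions:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates_by_content(root):
    """Group files under *root* by a digest of their content.

    Returns only the groups with more than one file, i.e. duplicates.
    (Illustrative sketch -- not the dupfind implementation.)
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            # Read in chunks so large files do not load into memory at once.
            with open(path, 'rb') as fh:
                for chunk in iter(lambda: fh.read(65536), b''):
                    digest.update(chunk)
            groups[digest.hexdigest()].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Hashing every file is what makes this mode thorough but comparatively slow, as the sections below discuss.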

First, create several different files in the current directory.

>>> createFile('tfile1.txt', "A"*10)
>>> createFile('tfile2.txt', "A"*1025)
>>> createFile('tfile3.txt', "A"*2048)

Then create other files in another directory, one of them identical to an already created one.

>>> mkd("dir1")
>>> createFile('tfile1.txt', "A"*20, "dir1")
>>> createFile('tfile2.txt', "A"*1025, "dir1")
>>> createFile('tfile13.txt', "A"*48, "dir1")

Look at the directories' contents:

>>> ls()
=== list directory ===
D :: dir1 :: ...
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir1")
=== list dir1 directory ===
F :: tfile1.txt :: 20
F :: tfile13.txt :: 48
F :: tfile2.txt :: 1025

We see that “tfile2.txt” is the same in both directories, while “tfile1.txt” has the same name but differs in size. So the utility should identify only “tfile2.txt” as a duplicate.

We direct the results to outputf with the “-o <output file name>” argument, and pass testdir as the directory to search for duplicates.

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Now check the results file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

Quick and slow utility modes:

As mentioned above, the utility identifies duplicate files by their content. This mode is slow and consumes a lot of system resources.

However, in most cases the file name and size are enough to identify a duplicate. In that case you can use quick mode with the --quick (-q) option.
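The quick check can be sketched the same way, grouping by the (name, size) pair instead of reading file content. Again, this is an illustrative sketch, not the package's implementation:

```python
import os
from collections import defaultdict

def find_duplicates_quick(root):
    """Group files by (basename, size) without reading their content.

    Much cheaper than hashing, but two different files that happen to
    share a name and a size will be reported as duplicates incorrectly.
    (Illustrative sketch -- not the dupfind implementation.)
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[(name, os.path.getsize(path))].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

The false-positive case in the comment is exactly the failure mode demonstrated later in this section.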

Test the previous files in quick mode:

>>> dupfind("-q -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Now check the result file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, quick mode identifies the duplicates correctly.

Let’s show that there are cases where this mode leads to mistakes. To do that, add a file with the same name and size but different content, and run the utility in both modes:

>>> createFile('tfile000.txt', "First  "*20)
>>> createFile('tfile000.txt', "Second "*20, "dir1")

Now check the duplication results using the default (not quick) mode …

>>> dupfind(" -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, the default (not quick) mode identifies duplicates correctly.

Let’s check duplications using the quick mode…

>>> dupfind(" -q -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,140,F,txt,tfile000.txt,.../tmp.../dir1,...
...,140,F,txt,tfile000.txt,.../tmp...,...
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, quick mode reports false duplicates here.

Clean up the test:

>>> cleanTestDir()

How the utility finds duplicate directories:

The utility treats a directory as a duplicate when all of its files are duplicates and all of its inner directories are themselves duplicate directories.
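This recursive definition can be sketched by giving each directory a signature built from its files and from its subdirectories' signatures, ignoring subdirectory names so that identically-filled directories match even when named differently. An illustrative sketch only; whether the real utility includes file names in the comparison is an assumption here:

```python
import hashlib
import os

def dir_signature(path):
    """Return a digest identifying a directory by its content.

    The signature combines the (name, content-digest) pairs of the
    directory's files with the signatures of its subdirectories,
    deliberately ignoring the subdirectories' own names -- so two
    directories with identical content match even when their
    subdirectories are named differently.
    (Illustrative sketch -- not the dupfind implementation.)
    """
    file_parts = []
    child_sigs = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            child_sigs.append(dir_signature(full))
        else:
            with open(full, 'rb') as fh:
                content_digest = hashlib.sha256(fh.read()).hexdigest()
            file_parts.append(name + ':' + content_digest)
    h = hashlib.sha256()
    for part in sorted(file_parts) + sorted(child_sigs):
        h.update(part.encode())
    return h.hexdigest()
```

Two directories are then duplicates exactly when their signatures are equal, which matches the dir11/dir21 example below.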

First, compare two directories containing the same files.

Create the directories with identical content.

>>> def mkDir(dpath):
...     mkd(dpath)
...     createFile('tfile1.txt', "A"*10, dpath)
...     createFile('tfile2.txt', "A"*1025, dpath)
...     createFile('tfile3.txt', "A"*2048, dpath)
...
>>> mkDir("dir1")
>>> mkDir("dir2")

Confirm that the directories’ contents are really identical:

>>> ls("dir1")
=== list dir1 directory ===
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir2")
=== list dir2 directory ===
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048

Now run the utility and check the result file:

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir1,...
...,D,,dir2,...

Now compare two directories containing the same files and subdirectories.

Create new subdirectories with identical content but different names inside the previously created directories. Directories do not need to have the same name to be treated as duplicates, only identical content.

Add two identical directories to the previous ones.

>>> def mkDir1(dpath):
...     mkd(dpath)
...     createFile('tfile11.txt', "B"*4000, dpath)
...     createFile('tfile12.txt', "B"*222, dpath)
...
>>> mkDir1("dir1/dir11")
>>> mkDir1("dir2/dir21")

Note that we added two directories with the same contents but different names. This should not prevent them from being detected as duplicates.

>>> def mkDir2(dpath):
...     mkd(dpath)
...     createFile('tfile21.txt', "C"*4096, dpath)
...     createFile('tfile22.txt', "C"*123, dpath)
...     createFile('tfile23.txt', "C"*444, dpath)
...     createFile('tfile24.txt', "C"*555, dpath)
...
>>> mkDir2("dir1/dir22")
>>> mkDir2("dir2/dir22")

Confirm that the directories’ contents are really identical:

>>> ls("dir1")
=== list dir1 directory ===
D :: dir11 :: -1
D :: dir22 :: -1
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir2")
=== list dir2 directory ===
D :: dir21 :: -1
D :: dir22 :: -1
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048

And the contents of the inner directories.

First subdirectory:

>>> ls("dir1/dir11")
=== list dir1/dir11 directory ===
F :: tfile11.txt :: 4000
F :: tfile12.txt :: 222
>>> ls("dir2/dir21")
=== list dir2/dir21 directory ===
F :: tfile11.txt :: 4000
F :: tfile12.txt :: 222

Second subdirectory:

>>> ls("dir1/dir22")
=== list dir1/dir22 directory ===
F :: tfile21.txt :: 4096
F :: tfile22.txt :: 123
F :: tfile23.txt :: 444
F :: tfile24.txt :: 555
>>> ls("dir2/dir22")
=== list dir2/dir22 directory ===
F :: tfile21.txt :: 4096
F :: tfile22.txt :: 123
F :: tfile23.txt :: 444
F :: tfile24.txt :: 555

Now test the utility.

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Check the results file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir1,...
...,D,,dir2,...

NOTE:

Inner duplication directories are excluded from the results:

>>> outputres = file(outputf).read()
>>> "dir1/dir11" in outputres
False
>>> "dir1/dir22" in outputres
False
>>> "dir2/dir21" in outputres
False
>>> "dir2/dir22" in outputres
False
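The exclusion step can be sketched as dropping any duplicate directory whose ancestor is itself in the duplicate set (an illustrative helper, not the package's code):

```python
import os

def prune_inner(dup_dirs):
    """Keep only top-level duplicates: drop any duplicate directory
    that lies inside another directory from the same duplicate set.
    (Illustrative sketch -- not the dupfind implementation.)
    """
    dup_set = set(os.path.normpath(p) for p in dup_dirs)
    kept = []
    for path in dup_set:
        parent = os.path.dirname(path)
        inside = False
        # Walk up the ancestor chain looking for another duplicate.
        while parent and parent != os.path.dirname(parent):
            if parent in dup_set:
                inside = True
                break
            parent = os.path.dirname(parent)
        if not inside:
            kept.append(path)
    return sorted(kept)
```

Applied to the example above, the set {dir1, dir2, dir1/dir11, dir2/dir21, dir2/dir22} reduces to just dir1 and dir2.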

The utility accepts more than one directory argument:

Use the previous directory structure to demonstrate this.

Pass the “dir1/dir11” and “dir2” directories to the utility:

>>> dupfind("-o %(o)s %(dir1-11)s %(dir2)s" % {
...     'o':outputf,
...     'dir1-11': os.path.join(testdir,"dir1/dir11"),
...     'dir2': os.path.join(testdir,"dir2"),})

Now check the result file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir11,.../tmp.../dir1,...
...,D,,dir21,.../tmp.../dir2,...

DUPMANAGE UTILITY:

The dupmanage utility allows you to manage duplicate files and directories on your file system using a CSV data file.

The utility uses a CSV-formatted data file to process duplicate items. The data file must contain the following columns:

  • type

  • name

  • directory

  • operation

  • operation_data

The utility supports 2 types of operations on duplicate items:

  • deleting (“D”)

  • symlinking (“L”) - for UNIX-like systems only

operation_data is used only for the symlinking operation and must contain the path to the symlink source item.
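Processing such a data file can be sketched as reading the CSV and dispatching on the operation column. This sketch mirrors the documented format, but the function itself is hypothetical, not the package's code:

```python
import csv
import os
import shutil

def manage_duplicates(csv_path):
    """Apply "D" (delete) and "L" (symlink) operations from a CSV file
    with columns: type, name, directory, operation, operation_data.
    (Illustrative sketch -- not the dupmanage implementation.)
    """
    processed = 0
    with open(csv_path, newline='') as fh:
        for row in csv.DictReader(fh):
            target = os.path.join(row['directory'], row['name'])
            if row['operation'] == 'D':
                # Delete the duplicate: directories recursively, files directly.
                if row['type'] == 'D':
                    shutil.rmtree(target)
                else:
                    os.remove(target)
            elif row['operation'] == 'L':
                # Replace the duplicate with a symlink to the source item
                # named in operation_data (UNIX-like systems only).
                os.remove(target)
                os.symlink(row['operation_data'], target)
            processed += 1
    return processed
```

The example below walks through exactly this flow: one "L" row that turns dir3/tfile03.txt into a symlink, and one "D" row that removes dir2.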

How the utility manages duplications:

To demonstrate, use the previous directory structure and add several more duplicates.

Create a file in the root directory and an identical file in another directory.

>>> createFile('tfile03.txt', "D"*100)
>>> mkd("dir3")
>>> createFile('tfile03.txt', "D"*100, "dir3")

Look at the directories' contents:

>>> ls()
=== list directory ===
D :: dir1 :: ...
D :: dir2 :: ...
D :: dir3 :: ...
F :: tfile03.txt :: 100
>>> ls("dir3")
=== list dir3 directory ===
F :: tfile03.txt :: 100

We already know the previous duplications, so now we create a CSV-formatted data file to manage them.

>>> manage_data = """type,name,directory,operation,operation_data
... F,tfile03.txt,%(testdir)s/dir3,L,%(testdir)s/tfile03.txt
... D,dir2,%(testdir)s,D,
... """ % {'testdir': testdir}
>>> createFile('manage.csv', manage_data)

Now call the utility and check the resulting directory contents:

>>> manage_path = os.path.join(testdir, 'manage.csv')
>>> dupmanage("%s -v" % manage_path)
[...
[...]: Symlink .../tfile03.txt item to .../dir3/tfile03.txt
[...]: Remove .../dir2 directory
[...]: Processed 2 items

Review the directory contents:

>>> ls()
=== list directory ===
D :: dir1 :: ...
D :: dir3 :: ...
F :: tfile03.txt :: 100
>>> ls("dir3")
=== list dir3 directory ===
L :: tfile03.txt :: ...

HISTORY:

1.4.3

  • Commented out the (currently unused) output_format option of the dupfinder utility.

1.4.2

  • Refactored content comparison to use the zlib.crc32 function to calculate the file content digest, speeding up the algorithm.

  • Fixed some bugs

1.4

  • Updated duplicate file finding: added comparison by file content and made it the default.

  • Added the -q (--quick) option for quick file comparison (by name and size)

  • Added tests for quick and non-quick duplication finding

1.2

  • Added the dupmanage utility for managing duplicates

  • Added tests for dupmanage utility

1.0

  • Added tests for the dupfinder utility

0.8

  • Refactored classes: removed DupFilter, moved filtering into the DupOut class.

  • The inner content of duplicate directories is now hidden implicitly.

0.7

  • Refactored the utility into classes

  • Fixed bugs in bad-file processing

  • Fixed a bug in size calculation

0.5

  • Refactored the inner finding algorithm

  • Added the option to exclude the inner content of duplicate directories from the result report

0.3

  • Implemented the files finder

  • Output in CSV format

  • Added filters by size

0.1

  • Initial release
