
Find and manage duplicate files on the file system

Project description

This package is designed to find and manage duplicate files, and contains two utilities:

  • dupfind - finds duplicates

  • dupmanage - manages the found duplicates

DUPFIND UTILITY:

The dupfind utility allows you to find duplicate files and directories in your file system.

How the utility finds duplicate files:

By default, the utility identifies duplicate files by file content.
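Such a content-based scan can be sketched as hashing each file's bytes and grouping files with identical digests. This is a minimal illustration, not the package's actual code; the function name and the use of SHA-256 are assumptions:

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates_by_content(root):
    """Group files under *root* by a digest of their content.

    Returns only the groups with more than one file, i.e. duplicates.
    (Illustrative sketch -- not the dupfind implementation.)
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            # Read in chunks so large files do not load into memory at once.
            with open(path, 'rb') as fh:
                for chunk in iter(lambda: fh.read(65536), b''):
                    digest.update(chunk)
            groups[digest.hexdigest()].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Hashing every file is what makes this mode thorough but comparatively slow, as the sections below discuss.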

First, create several different files in the current directory.

>>> createFile('tfile1.txt', "A"*10)
>>> createFile('tfile2.txt', "A"*1025)
>>> createFile('tfile3.txt', "A"*2048)

Then create other files in another directory, one of them identical to an already created one.

>>> mkd("dir1")
>>> createFile('tfile1.txt', "A"*20, "dir1")
>>> createFile('tfile2.txt', "A"*1025, "dir1")
>>> createFile('tfile13.txt', "A"*48, "dir1")

Look at the directories' contents:

>>> ls()
=== list directory ===
D :: dir1 :: ...
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir1")
=== list dir1 directory ===
F :: tfile1.txt :: 20
F :: tfile13.txt :: 48
F :: tfile2.txt :: 1025

We see that “tfile2.txt” is the same in both directories, while “tfile1.txt” has the same name but differs in size. So the utility should identify only “tfile2.txt” as a duplicate.

We direct the results to outputf with the “-o <output file name>” argument, and pass testdir as the directory to search for duplicates.

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Now check the results file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

Quick and slow utility modes:

As mentioned above, the utility identifies duplicate files by their content. This mode is slow and consumes a lot of system resources.

However, in most cases the file name and size are enough to identify a duplicate. In that case you can use quick mode with the --quick (-q) option.
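The quick check can be sketched the same way, grouping by the (name, size) pair instead of reading file content. Again, this is an illustrative sketch, not the package's implementation:

```python
import os
from collections import defaultdict

def find_duplicates_quick(root):
    """Group files by (basename, size) without reading their content.

    Much cheaper than hashing, but two different files that happen to
    share a name and a size will be reported as duplicates incorrectly.
    (Illustrative sketch -- not the dupfind implementation.)
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[(name, os.path.getsize(path))].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

The false-positive case in the comment is exactly the failure mode demonstrated later in this section.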

Test the previous files in quick mode:

>>> dupfind("-q -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Now check the result file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, quick mode identifies the duplicates correctly.

Let’s show that there are cases where this mode leads to mistakes. To do that, add a file with the same name and size but different content, and run the utility in both modes:

>>> createFile('tfile000.txt', "First  "*20)
>>> createFile('tfile000.txt', "Second "*20, "dir1")

Now check the duplication results using the default (not quick) mode …

>>> dupfind(" -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, the default (not quick) mode identifies duplicates correctly.

Let’s check duplications using the quick mode…

>>> dupfind(" -q -o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,140,F,txt,tfile000.txt,.../tmp.../dir1,...
...,140,F,txt,tfile000.txt,.../tmp...,...
...,1025,F,txt,tfile2.txt,.../tmp.../dir1,...
...,1025,F,txt,tfile2.txt,.../tmp...,...

As we can see, quick mode reports false duplicates here.

Clean up the test:

>>> cleanTestDir()

How the utility finds duplicate directories:

The utility treats a directory as a duplicate when all of its files are duplicates and all of its inner directories are themselves duplicate directories.
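This recursive definition can be sketched by giving each directory a signature built from its files and from its subdirectories' signatures, ignoring subdirectory names so that identically-filled directories match even when named differently. An illustrative sketch only; whether the real utility includes file names in the comparison is an assumption here:

```python
import hashlib
import os

def dir_signature(path):
    """Return a digest identifying a directory by its content.

    The signature combines the (name, content-digest) pairs of the
    directory's files with the signatures of its subdirectories,
    deliberately ignoring the subdirectories' own names -- so two
    directories with identical content match even when their
    subdirectories are named differently.
    (Illustrative sketch -- not the dupfind implementation.)
    """
    file_parts = []
    child_sigs = []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            child_sigs.append(dir_signature(full))
        else:
            with open(full, 'rb') as fh:
                content_digest = hashlib.sha256(fh.read()).hexdigest()
            file_parts.append(name + ':' + content_digest)
    h = hashlib.sha256()
    for part in sorted(file_parts) + sorted(child_sigs):
        h.update(part.encode())
    return h.hexdigest()
```

Two directories are then duplicates exactly when their signatures are equal, which matches the dir11/dir21 example below.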

First, compare two directories containing the same files.

Create the directories with identical content.

>>> def mkDir(dpath):
...     mkd(dpath)
...     createFile('tfile1.txt', "A"*10, dpath)
...     createFile('tfile2.txt', "A"*1025, dpath)
...     createFile('tfile3.txt', "A"*2048, dpath)
...
>>> mkDir("dir1")
>>> mkDir("dir2")

Confirm that the directories’ contents are really identical:

>>> ls("dir1")
=== list dir1 directory ===
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir2")
=== list dir2 directory ===
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048

Now run the utility and check the result file:

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})
>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir1,...
...,D,,dir2,...

Now compare two directories containing the same files and subdirectories.

Create new subdirectories with identical content but different names inside the previously created directories. Directories do not need to have the same name to be treated as duplicates, only identical content.

Add two identical directories to the previous ones.

>>> def mkDir1(dpath):
...     mkd(dpath)
...     createFile('tfile11.txt', "B"*4000, dpath)
...     createFile('tfile12.txt', "B"*222, dpath)
...
>>> mkDir1("dir1/dir11")
>>> mkDir1("dir2/dir21")

Note that we added two directories with the same contents but different names. This should not prevent them from being detected as duplicates.

>>> def mkDir2(dpath):
...     mkd(dpath)
...     createFile('tfile21.txt', "C"*4096, dpath)
...     createFile('tfile22.txt', "C"*123, dpath)
...     createFile('tfile23.txt', "C"*444, dpath)
...     createFile('tfile24.txt', "C"*555, dpath)
...
>>> mkDir2("dir1/dir22")
>>> mkDir2("dir2/dir22")

Confirm that the directories’ contents are really identical:

>>> ls("dir1")
=== list dir1 directory ===
D :: dir11 :: -1
D :: dir22 :: -1
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048
>>> ls("dir2")
=== list dir2 directory ===
D :: dir21 :: -1
D :: dir22 :: -1
F :: tfile1.txt :: 10
F :: tfile2.txt :: 1025
F :: tfile3.txt :: 2048

And the contents of the inner directories.

First subdirectory:

>>> ls("dir1/dir11")
=== list dir1/dir11 directory ===
F :: tfile11.txt :: 4000
F :: tfile12.txt :: 222
>>> ls("dir2/dir21")
=== list dir2/dir21 directory ===
F :: tfile11.txt :: 4000
F :: tfile12.txt :: 222

Second subdirectory:

>>> ls("dir1/dir22")
=== list dir1/dir22 directory ===
F :: tfile21.txt :: 4096
F :: tfile22.txt :: 123
F :: tfile23.txt :: 444
F :: tfile24.txt :: 555
>>> ls("dir2/dir22")
=== list dir2/dir22 directory ===
F :: tfile21.txt :: 4096
F :: tfile22.txt :: 123
F :: tfile23.txt :: 444
F :: tfile24.txt :: 555

Now test the utility.

>>> dupfind("-o %(o)s %(dir)s" % {'o':outputf, 'dir': testdir})

Check the results file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir1,...
...,D,,dir2,...

NOTE:

Inner duplication directories are excluded from the results:

>>> outputres = file(outputf).read()
>>> "dir1/dir11" in outputres
False
>>> "dir1/dir22" in outputres
False
>>> "dir2/dir21" in outputres
False
>>> "dir2/dir22" in outputres
False
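The exclusion step can be sketched as dropping any duplicate directory whose ancestor is itself in the duplicate set (an illustrative helper, not the package's code):

```python
import os

def prune_inner(dup_dirs):
    """Keep only top-level duplicates: drop any duplicate directory
    that lies inside another directory from the same duplicate set.
    (Illustrative sketch -- not the dupfind implementation.)
    """
    dup_set = set(os.path.normpath(p) for p in dup_dirs)
    kept = []
    for path in dup_set:
        parent = os.path.dirname(path)
        inside = False
        # Walk up the ancestor chain looking for another duplicate.
        while parent and parent != os.path.dirname(parent):
            if parent in dup_set:
                inside = True
                break
            parent = os.path.dirname(parent)
        if not inside:
            kept.append(path)
    return sorted(kept)
```

Applied to the example above, the set {dir1, dir2, dir1/dir11, dir2/dir21, dir2/dir22} reduces to just dir1 and dir2.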

The utility accepts more than one directory argument:

Use the previous directory structure to demonstrate this.

Pass the “dir1/dir11” and “dir2” directories to the utility:

>>> dupfind("-o %(o)s %(dir1-11)s %(dir2)s" % {
...     'o':outputf,
...     'dir1-11': os.path.join(testdir,"dir1/dir11"),
...     'dir2': os.path.join(testdir,"dir2"),})

Now check the result file for duplications.

>>> cat(outputf)
hash,size,type,ext,name,directory,modification,operation,operation_data
...,D,,dir11,.../tmp.../dir1,...
...,D,,dir21,.../tmp.../dir2,...

DUPMANAGE UTILITY:

The dupmanage utility allows you to manage duplicate files and directories on your file system using a CSV data file.

The utility uses a CSV-formatted data file to process duplicate items. The data file must contain the following columns:

  • type

  • name

  • directory

  • operation

  • operation_data

The utility supports 2 types of operations on duplicate items:

  • deleting (“D”)

  • symlinking (“L”) - for UNIX-like systems only

operation_data is used only for the symlinking operation and must contain the path to the symlink source item.
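Processing such a data file can be sketched as reading the CSV and dispatching on the operation column. This sketch mirrors the documented format, but the function itself is hypothetical, not the package's code:

```python
import csv
import os
import shutil

def manage_duplicates(csv_path):
    """Apply "D" (delete) and "L" (symlink) operations from a CSV file
    with columns: type, name, directory, operation, operation_data.
    (Illustrative sketch -- not the dupmanage implementation.)
    """
    processed = 0
    with open(csv_path, newline='') as fh:
        for row in csv.DictReader(fh):
            target = os.path.join(row['directory'], row['name'])
            if row['operation'] == 'D':
                # Delete the duplicate: directories recursively, files directly.
                if row['type'] == 'D':
                    shutil.rmtree(target)
                else:
                    os.remove(target)
            elif row['operation'] == 'L':
                # Replace the duplicate with a symlink to the source item
                # named in operation_data (UNIX-like systems only).
                os.remove(target)
                os.symlink(row['operation_data'], target)
            processed += 1
    return processed
```

The example below walks through exactly this flow: one "L" row that turns dir3/tfile03.txt into a symlink, and one "D" row that removes dir2.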

How the utility manages duplications:

To demonstrate, use the previous directory structure and add several more duplicates.

Create a file in the root directory and an identical file in another directory.

>>> createFile('tfile03.txt', "D"*100)
>>> mkd("dir3")
>>> createFile('tfile03.txt', "D"*100, "dir3")

Look at the directories' contents:

>>> ls()
=== list directory ===
D :: dir1 :: ...
D :: dir2 :: ...
D :: dir3 :: ...
F :: tfile03.txt :: 100
>>> ls("dir3")
=== list dir3 directory ===
F :: tfile03.txt :: 100

We already know the previous duplications, so now we create a CSV-formatted data file to manage them.

>>> manage_data = """type,name,directory,operation,operation_data
... F,tfile03.txt,%(testdir)s/dir3,L,%(testdir)s/tfile03.txt
... D,dir2,%(testdir)s,D,
... """ % {'testdir': testdir}
>>> createFile('manage.csv', manage_data)

Now call the utility and check the resulting directory contents:

>>> manage_path = os.path.join(testdir, 'manage.csv')
>>> dupmanage("%s -v" % manage_path)
[...
[...]: Symlink .../tfile03.txt item to .../dir3/tfile03.txt
[...]: Remove .../dir2 directory
[...]: Processed 2 items

Review the directory contents:

>>> ls()
=== list directory ===
D :: dir1 :: ...
D :: dir3 :: ...
F :: tfile03.txt :: 100
>>> ls("dir3")
=== list dir3 directory ===
L :: tfile03.txt :: ...

HISTORY:

1.4.3

  • Commented out the (currently unused) output_format option of the dupfinder utility.

1.4.2

  • Refactored content comparison to use the zlib.crc32 function to calculate the file content digest, speeding up the algorithm.

  • Fixed some bugs

1.4

  • Updated duplicate file finding: added comparison by file content and made it the default.

  • Added the -q (--quick) option for quick file comparison (by name and size)

  • Added tests for quick and non-quick duplication finding

1.2

  • Added the dupmanage utility for managing duplicates

  • Added tests for dupmanage utility

1.0

  • Added tests for the dupfinder utility

0.8

  • Refactored classes: removed DupFilter, moved filtering into the DupOut class.

  • The inner content of duplicate directories is now hidden implicitly.

0.7

  • Refactored the utility into classes

  • Fixed bugs in bad-file processing

  • Fixed a bug in size calculation

0.5

  • Refactored the inner finding algorithm

  • Added the option to exclude the inner content of duplicate directories from the result report

0.3

  • Implemented the files finder

  • Output in CSV format

  • Added filters by size

0.1

  • Initial release
