dedoop

dedupe files and send them to the cloud

In digital preservation work you may sometimes find yourself accepting a disk or a random assortment of files, and want to examine all of them for duplicates and copy them to a new location in a uniform way, while preserving the original paths as metadata to help you process the data. OK, maybe this is a bit of a niche use case, but it is what dedoop was created for.

dedoop will recursively read a source directory of files and write them out to a new target directory, or to a bucket in the cloud, using each file's SHA-256 checksum as the filename. If a given file occurs more than once in the source directory, it will only be written once to the target location. File metadata, such as the media type and the original file name, will be persisted in a JSON file that is written at the end of the process. When writing to the cloud, object metadata will be used to store this information.
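
At its core the approach is simple. Here is a minimal sketch of the idea in Python (illustrative only, not dedoop's actual implementation; the metadata.json name and layout are assumptions):

    import hashlib
    import json
    import mimetypes
    import shutil
    from pathlib import Path

    def dedupe(source, target):
        """Sketch of the dedup idea: name each file by its SHA-256 digest
        so duplicate contents collapse to a single stored object."""
        src, dst = Path(source), Path(target)
        dst.mkdir(parents=True, exist_ok=True)
        metadata = {}
        for path in sorted(src.rglob("*")):
            if not path.is_file():
                continue
            # read the whole file for simplicity; chunked reads scale better
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            name = digest + path.suffix.lower()
            if not (dst / name).exists():  # write each unique file only once
                shutil.copyfile(path, dst / name)
            entry = metadata.setdefault(name, {
                "type": mimetypes.guess_type(path.name)[0],
                "paths": [],
            })
            entry["paths"].append(str(path.relative_to(src)))
        # persist the collected metadata at the end of the run
        (dst / "metadata.json").write_text(json.dumps(metadata, indent=2))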

Install

Install Python 3 and then run:

% pip3 install dedoop

Usage

Add to Storage

To add a directory of data to a storage location, run:

% dedoop add path/to/source path/to/target

So for example if the source directory looks like this:

source
├── a.jpg
├── a.png
├── b.jpg
└── c
    ├── a.jpg
    └── b.jpg

The resulting target could look like this (assuming that files with the same name had identical contents, which hashed to these values):

target
├── 1e89b90b5973baad2e6c3294ffe648ff53ab0b9d75188e9fbb8b38deb9ba3341.png
├── 45d257c93e59ec35187c6a34c8e62e72c3e9cfbb548984d6f6e8deb84bac41f4.jpg
└── b6df8058fa818acfd91759edffa27e473f2308d5a6fca1e07a79189b95879953.jpg
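
For the example above, the JSON metadata file might then record, for each stored file, its media type and every original path that mapped to it. This is an illustrative layout only, not dedoop's exact schema:

    {
      "1e89b90b5973baad2e6c3294ffe648ff53ab0b9d75188e9fbb8b38deb9ba3341.png": {
        "type": "image/png",
        "paths": ["a.png"]
      },
      "45d257c93e59ec35187c6a34c8e62e72c3e9cfbb548984d6f6e8deb84bac41f4.jpg": {
        "type": "image/jpeg",
        "paths": ["a.jpg", "c/a.jpg"]
      },
      "b6df8058fa818acfd91759edffa27e473f2308d5a6fca1e07a79189b95879953.jpg": {
        "type": "image/jpeg",
        "paths": ["b.jpg", "c/b.jpg"]
      }
    }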

Add to the Cloud

You can also write files to any cloud storage provider that is supported by libcloud, such as Amazon S3 and Google Cloud Storage.
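
For example, you can give the target as a bucket URL, just as with the ls command below (assuming your cloud credentials are configured for libcloud; the bucket name here is made up):

% dedoop add path/to/source s3://my-storage-location/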

Limit by File Extension

If you like, you can limit the types of files that are added by using the --extensions command line option, giving it a comma-separated list of file extensions to include. All non-matching files (compared case-insensitively) will be ignored.

% dedoop add --extensions jpg,png path/to/source path/to/target

List Cloud Files

It's easy to list files on the file system, but it's more difficult to see what's in the cloud, especially the metadata dedoop has attached to each object. The ls command will do that for you.

% dedoop ls s3://my-storage-location/

Logging

If you use --verbose, you will see log messages on the console about what is happening. You can optionally send these messages to a log file of your choosing using the --log option.
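
For example (dedoop.log here is just an arbitrary file name):

% dedoop add --verbose --log dedoop.log path/to/source path/to/target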
