
Convert file systems to archiveable ISO files

Project description

odarchive

Optical disc archiving: converting file systems to sets of ISO files and back. It is meant for archiving information to optical disc for long-term data storage.

This readme is broken into sections:

  • Command line usage
  • Technical description
  • Software structure
  • Future Enhancements
  • License (MIT)

Command line usage

Summary of commands; see below for command-line options.

Command      Comment
init         Creates a new catalogue
write_iso    Write out an ISO
archive
segment      Split archive into segments

odarchive create_db drive_path

Creates a database catalogue.json in the current working directory from the files in drive_path.

odarchive plan_iso size

Enriches the database with a plan for building ISOs. The size argument is the maximum ISO size in bytes, or one of:

  • 'cd' = 700MB
  • 'dvd' = 4.7GB
  • 'bd' = 25GB
  • 'bluray' = 25GB
  • 'bd-dl' = 50GB
  • 'bd-xl' = 100GB
  • 'bd-xx' = 128GB
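
As a rough illustration of how the size argument could be interpreted, the sketch below maps the names listed above to byte counts. It is not taken from the odarchive source: the names `DISC_SIZES` and `parse_size` are hypothetical, and decimal units (1 MB = 10**6 bytes) are an assumption.

```python
# Named disc capacities from the list above, assuming decimal units
# (1 MB = 10**6 bytes, 1 GB = 10**9 bytes); the real tool may differ.
DISC_SIZES = {
    "cd": 700 * 10**6,
    "dvd": 4_700 * 10**6,
    "bd": 25 * 10**9,
    "bluray": 25 * 10**9,
    "bd-dl": 50 * 10**9,
    "bd-xl": 100 * 10**9,
    "bd-xx": 128 * 10**9,
}


def parse_size(arg: str) -> int:
    """Return a capacity in bytes from a disc name or a plain byte count."""
    key = arg.strip().lower()
    if key in DISC_SIZES:
        return DISC_SIZES[key]
    size = int(key)  # raises ValueError for anything that is not an integer
    if size <= 0:
        raise ValueError("size must be a positive number of bytes")
    return size
```

With this sketch, `parse_size("dvd")` and `parse_size("4700000000")` give the same capacity.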

odarchive num_isos

Planned: returns the number of ISOs required.

odarchive create_iso n

The default for n is 0 (numbering starts from zero).

Technical Description

Segmentation

When a catalogue is created for the first time it is not segmented. Segmenting refers to writing out a single disc which has a smaller capacity than the total archive.
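
One way to picture segmentation is as a packing problem: each unique file is assigned to a disc so that no disc exceeds its capacity. The sketch below uses a first-fit, largest-first heuristic. This is an assumed strategy for illustration only, `plan_segments` is a hypothetical name, and the file-system overhead on each disc is ignored.

```python
from typing import Dict, List


def plan_segments(sizes: Dict[str, int], capacity: int) -> List[List[str]]:
    """Assign each file hash to a segment using first-fit, largest first."""
    segments: List[List[str]] = []
    free: List[int] = []  # remaining bytes in each segment
    # Largest files first; ties are broken by hash so the plan is repeatable.
    for file_hash, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        if size > capacity:
            raise ValueError(f"{file_hash} ({size} bytes) does not fit on one disc")
        for i, room in enumerate(free):
            if size <= room:
                segments[i].append(file_hash)
                free[i] -= size
                break
        else:
            # No existing segment has room, so start a new one.
            segments.append([file_hash])
            free.append(capacity - size)
    return segments
```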

File name

This service is planned to work from supplied USB drives. The internal file name is a UDF absolute path. It is an absolute path so that if the name of the Data directory changes, this won’t make a difference when combining discs with different naming conventions.

Use case: I want to back up some critical files first, e.g. my own photographs and video, and then other files. The deeper the file system on the USB drive, the better the result will be.
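
A minimal sketch of deriving such an internal name, assuming it is simply the file's path re-rooted at the top of the supplied drive. The function name `internal_name` is hypothetical and POSIX-style paths are assumed; the real code may build the name differently.

```python
from pathlib import PurePosixPath


def internal_name(drive_root: str, file_path: str) -> str:
    """Return file_path as an absolute path rooted at the top of the drive."""
    # relative_to raises ValueError if file_path is not under drive_root.
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(drive_root))
    return str(PurePosixPath("/") / relative)
```

For example, a drive mounted at `/media/usb` holding `/media/usb/Photos/2018/a.jpg` would yield `/Photos/2018/a.jpg`, whatever the mount point is called.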

Disc directory layout

    /Data             All data files
    catalogue.json
    README.MD

Json file format

    JsonFile       = HeadingSection *FileDefinition
    HeadingSection = Version
    FileDefinition = Hash FileNameList Mtime Size

Max. filename length 255 bytes (path 1023 bytes)

Example::

    {
        "version": 1,
        "files": {
            "Hash1": {
                "filenames": {
                    "data/first.html": null,
                    "data/first_copy.html": null
                },
                "size": 21,
                "mtime": ??
            },
            "Hash2": {
                "filenames": {
                    "data/second.txt": null
                },
                "size": 22,
                "mtime": ??
            }
        }
    }
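
Because the catalogue is indexed by hash, duplicate files show up as entries with more than one filename. The sketch below reads a catalogue in the layout above and lists them. The helper name `duplicates` is hypothetical, and the mtime values are placeholder numbers because the format of that field is left open above.

```python
import json

# A small catalogue in the layout shown above; the mtime values here are
# placeholder numbers, since the README leaves the mtime format open.
CATALOGUE_TEXT = """
{
    "version": 1,
    "files": {
        "Hash1": {
            "filenames": {"data/first.html": null, "data/first_copy.html": null},
            "size": 21,
            "mtime": 0
        },
        "Hash2": {
            "filenames": {"data/second.txt": null},
            "size": 22,
            "mtime": 0
        }
    }
}
"""


def duplicates(catalogue: dict) -> dict:
    """Map each hash stored under more than one name to its sorted names."""
    return {
        file_hash: sorted(entry["filenames"])
        for file_hash, entry in catalogue["files"].items()
        if len(entry["filenames"]) > 1
    }


catalogue = json.loads(CATALOGUE_TEXT)
# Each unique file is counted once, however many names point at it.
unique_bytes = sum(entry["size"] for entry in catalogue["files"].values())
```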

Software Structure

Object Structure

The main object is an Archive object.

This has two main sub-objects:

  • a file_db
  • a hash_file_db

The file_db is a temporary structure which is derived from walking the file system. The hash_file_db is then derived from it by hashing all those files and building a database of each unique file. This then shows the multiple locations where each file is stored.
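
The two stages can be sketched as follows. This is an illustration rather than the package's actual code: the function names are hypothetical, SHA-256 is an assumed hash algorithm, and plain dictionaries stand in for the real File_db and hash_file_db objects.

```python
import hashlib
import os
from typing import Dict


def build_file_db(root: str) -> Dict[str, dict]:
    """Walk root and record size and mtime for every file found."""
    file_db = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            file_db[path] = {"size": info.st_size, "mtime": info.st_mtime}
    return file_db


def hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_hash_file_db(file_db: Dict[str, dict]) -> Dict[str, dict]:
    """Group the entries of file_db by content hash."""
    hash_db: Dict[str, dict] = {}
    for path, info in file_db.items():
        # The first path seen for a hash supplies the size and mtime.
        entry = hash_db.setdefault(
            hash_file(path),
            {"filenames": {}, "size": info["size"], "mtime": info["mtime"]},
        )
        entry["filenames"][path] = None
    return hash_db
```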

File Structure

This is one module: odarchive.

It has three top-level modules:

  • archive.py which handles the archive
  • odarchive_cli.py which puts a command line wrapper around the archive code.
  • _version.py which holds the version number constants for both PyPI and the code

The archive code is the main code and has 4 sub files:

  • consts.py: general constants
  • file_db.py: handles a file database
  • hash_db: extends the file database to include a database of hash entries
  • hash_file_entry.py: creates hashes for single files

There is a distinct order in which things must happen:

  • Scan the file system and build a file database.
  • Create the hash table.

Unique IDs

There is a job_id which is created at the start of a job. This should be unique and last the lifetime of the job. There is also a guid with the catalogue; this changes every time the catalogue is saved or changed and acts like the version of the archive.
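
The difference between the two identifiers can be sketched with the standard uuid module. The class and method names below are hypothetical, and random UUIDs are an assumption; the package may generate its IDs another way.

```python
import uuid


class CatalogueIds:
    """Hypothetical holder for the two identifiers described above."""

    def __init__(self) -> None:
        self.job_id = uuid.uuid4().hex  # fixed for the life of the job
        self.guid = uuid.uuid4().hex    # replaced on every save

    def mark_saved(self) -> str:
        """Issue a fresh guid, as would happen each time the catalogue is saved."""
        self.guid = uuid.uuid4().hex
        return self.guid
```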

Limitations

Saving an archive

When an archive is saved it is put into a catalogue.json file. This is only possible once all the hashes for the files have been created, as the format is indexed by the file hash.

Temporary work can be saved via the save_as_dill method.

Inspiration

Thanks to M Ruffalo of https://github.com/mruffalo/hash-db/blob/master/hash_db.py for a lot of inspiration.

Future enhancements

Convert a JSON catalogue into a pickle

Currently, when working on a backup plan, the data is pickled from command to command. This conversion is what you need in order to work with existing catalogues.

Make a .exe

Create a single exe that will do this job or make an exe that can be installed via pip.

Segment ISOs into a specific size

Convert catalogue from dictionary to database

At the moment the catalogue database is an in-memory dictionary. This will limit it to about 2 million files per GB of available memory.
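
The figure above works out at roughly 500 bytes of memory per file. As a sketch of what a database-backed catalogue might look like, the example below copies a catalogue dictionary into SQLite using the standard sqlite3 module. The table layout and the name `catalogue_to_sqlite` are assumptions for illustration, not a planned design.

```python
import sqlite3


def catalogue_to_sqlite(catalogue: dict, db_path: str = ":memory:") -> sqlite3.Connection:
    """Copy a catalogue dictionary into two SQLite tables."""
    conn = sqlite3.connect(db_path)
    # One row per unique file, keyed by hash.
    conn.execute("CREATE TABLE files (hash TEXT PRIMARY KEY, size INTEGER, mtime REAL)")
    # One row per file name, pointing back at its hash.
    conn.execute(
        "CREATE TABLE filenames (name TEXT PRIMARY KEY, hash TEXT REFERENCES files(hash))"
    )
    for file_hash, entry in catalogue["files"].items():
        conn.execute(
            "INSERT INTO files VALUES (?, ?, ?)",
            (file_hash, entry["size"], entry["mtime"]),
        )
        conn.executemany(
            "INSERT INTO filenames VALUES (?, ?)",
            [(name, file_hash) for name in entry["filenames"]],
        )
    conn.commit()
    return conn
```

Passing a file path instead of the default `":memory:"` would keep the catalogue on disc rather than in memory.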

Make a service with feedback on status

E.g. archiving the Z drive has taken at least an hour and I don’t know if it is working or has crashed.

Make calculating Hash multi threaded

May work faster
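
A minimal sketch of what threaded hashing could look like, using a thread pool from the standard library. The names `hash_file` and `hash_files` are hypothetical and SHA-256 is an assumed algorithm. Whether this is actually faster depends on the drive, since reading from a single USB device may be the bottleneck.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable


def hash_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_files(paths: Iterable[str], workers: int = 4) -> Dict[str, str]:
    """Hash several files concurrently and return a path-to-digest mapping."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input paths.
        return dict(zip(paths, pool.map(hash_file, paths)))
```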

Add size of catalogue to length of reserved space

Incremental

  • With an old catalogue and a drive path, create an additional catalogue. This can reflect the current point in time, e.g. so that it does not reference files that were deleted.
  • This also allows rescanning without making any changes. It should also cover the case of multiple single backups being coalesced into a bigger backup. You might in large file format decide to back up a number of individual files first.
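
Since catalogues are keyed by hash, an incremental run can be sketched as set arithmetic between the old catalogue and a fresh scan. The function name `incremental_plan` and the shape of its result are hypothetical; this is one possible reading of the enhancement, not a design.

```python
from typing import Dict, Set


def incremental_plan(old_catalogue: dict, new_hash_db: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Compare an old catalogue with a fresh scan, both keyed by file hash."""
    old = set(old_catalogue["files"])
    new = set(new_hash_db)
    return {
        "to_archive": new - old,  # content not yet on any disc
        "unchanged": new & old,   # already archived
        "deleted": old - new,     # archived, but gone from the drive
    }
```

If "to_archive" comes back empty, the rescan found nothing new and no further disc is needed.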

Split large files

Make a service as a Glacier replacement.

E.g. rerun and post changes via the web.

Make read only cached drive

A robot plus 600 discs = 15TB.

Calculate directory size when segmenting

At the moment a fixed directory size is added to each entry.

get_info

Add a JSON alternative to the returned data

Make Platinum discs

Platinum costs about GBP16 a gram and has a specific gravity of 19. A 12cm disc has an area of about 0.012 square metres, so a 1 micron coat costs about GBP4 a disc. However, if the discs had a glass substrate they might have a very long life. Potentially you could write them naked and then cover them with another glass slip. You could even write them with an electron beam, although that might be expensive.
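
The cost estimate can be checked with a few lines of arithmetic. The sketch below reads "1u" as one micron and treats the disc as a full 12cm circle with no centre hole; both are assumptions. The price and specific gravity are the figures quoted above, taken as given.

```python
import math

PRICE_GBP_PER_GRAM = 16.0  # figure quoted above
SPECIFIC_GRAVITY = 19.0    # figure quoted above, used as grams per cubic centimetre
DISC_DIAMETER_CM = 12.0
COAT_THICKNESS_CM = 1e-4   # 1 micron, an assumed reading of "1u"

area_cm2 = math.pi * (DISC_DIAMETER_CM / 2) ** 2  # centre hole ignored
volume_cm3 = area_cm2 * COAT_THICKNESS_CM
mass_g = volume_cm3 * SPECIFIC_GRAVITY
cost_gbp = mass_g * PRICE_GBP_PER_GRAM
```

Under these assumptions the coating comes to between GBP3 and GBP4 a disc, the same order as the estimate above.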

UDF Bridge format

To make the discs more compatible, add an ISO 9660 directory structure as well as the UDF one.

Both a UDF and an ISO 9660 directory structure are output. There are different levels of ISO 9660 structure, but I have chosen level 3 as the standard.

ISO 9660 leads to complications as it is restricted to shorter directory names and a maximum directory depth of 7. This means that the conversion from UDF directories to ISO 9660 directories is lossy. As a result, you have to implement a lossless scheme (store the directory mapping as a dictionary) so that you can do multiple backups on multiple discs and still have a coherent directory structure.
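
The dictionary-based scheme might look something like the sketch below, which shortens names to a restricted character set and keeps the mapping so that the original names can be recovered. The function name `map_names`, the default length of 8 and the numeric-suffix rule are all illustrative assumptions, and the directory depth limit is not handled here.

```python
import re
from typing import Dict, Iterable


def map_names(names: Iterable[str], max_len: int = 8) -> Dict[str, str]:
    """Map long names to unique short names; the dict is the lossless record."""
    mapping: Dict[str, str] = {}
    used = set()
    for name in names:
        # Keep only upper-case letters, digits and underscore, then truncate.
        base = re.sub(r"[^A-Z0-9_]", "_", name.upper())[:max_len]
        candidate, counter = base, 0
        while candidate in used:
            # On a clash, replace the tail of the name with a counter.
            counter += 1
            suffix = str(counter)
            candidate = base[: max_len - len(suffix)] + suffix
        used.add(candidate)
        mapping[name] = candidate
    return mapping
```

Storing the returned dictionary alongside the catalogue is what makes the otherwise lossy shortening reversible.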

Adding error correction

The main aim of this is to measure the degradation of the storage media and to know when the data needs restoring. In DRAM this is done all the time; it should also be done on RAID drives to scrub the errors.

Licensing

Uses an MIT license; see LICENSE.

V0.0.0 2018-10-04

First release


Files for odarchive, version 0.1.0

Filename                            Size     File type  Python version
odarchive-0.1.0-py3-none-any.whl    36.9 kB  Wheel      py3
odarchive-0.1.0.tar.gz              31.1 kB  Source     None
