
A tool for archivists to automate the generation of references for digital files

Project description

Auto Reference Generator

A small Python programme to generate hierarchical archival references for files and directories and export the results to a spreadsheet.


Quick Start

Option 1: Using pip (Recommended for Python users / long-term usage)

pip install -U auto_reference_generator
auto_ref /path/to/root -p PREFIX -o /path/to/output

Option 2: Using Portable Executable (No Python Required)

Download the latest portable executable for your platform from Releases

Extract and run:

# Windows
cd auto_ref\bin
.\auto_ref.cmd .\path\to\root -p PREFIX -o .\path\to\output

# Linux/macOS
./auto_ref /path/to/root -p PREFIX -o /path/to/output

On Windows you can also run install.cmd with admin privileges to install the command, so that it can be run without navigating to the .cmd directory (usage is then the same as in Option 1).

Output

Generates a meta folder containing the output spreadsheet root_AutoRef.xlsx, which lists the generated reference hierarchy alongside some metadata.

Version & Package info

Python Version: Python 3.10+ is recommended. Earlier versions may work but are not tested.

Additional Packages:

  • pandas (required)
  • openpyxl (required)
  • odfpy (optional - ods export)
  • lxml (optional - xml export)
  • tqdm (required)

To install using Python:

pip install pandas openpyxl odfpy lxml tqdm

If using Python, ensure it has been added to your PATH environment variable.

Why use this tool?

This tool is designed for archivists cataloguing large numbers of digital records at a time.

Automated generation of references saves time and effort compared to filling them in manually.

Additional options expand upon this and allow insertion into existing hierarchies and reference systems.

Additional Features:

  • Prefixes - allowing merging into existing hierarchies
  • Suffixes
  • Level identification and limiting
  • Keyword assignment - replacing reference numbers with specified keywords (initials, first letters, JSON map)
  • Logged removal of empty directories
  • Accession / Running Number mode
  • Fixity Generation
  • Export options include: xlsx (Default), csv, ods, json or xml.
  • Integration with Opex Manifest Generator *Shameless Self promotion*.

Basic Usage / Examples

  • Basic: auto_ref /path/to/root
  • Prefix: auto_ref /path/to/root -p PREFIX
  • Suffix: auto_ref /path/to/root -s SUFFIX
  • Delimiter: auto_ref /path/to/root -dlm "-"
  • Accession: auto_ref /path/to/root -acc file
  • Fixity: auto_ref /path/to/root -fx MD5
  • Format: auto_ref /path/to/root -fmt csv
  • Remove Empty: auto_ref /path/to/root --remove-empty
  • Output: auto_ref /path/to/root -o /path/to/output
  • Include Hidden: auto_ref /path/to/root --hidden

These options can be combined with one another.
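As a minimal sketch of combining options from a script, the snippet below assembles a single command line from flags shown in the list above and prints it. The `build_command` helper is hypothetical and only builds the argument list; it does not run the tool.

```python
import shlex


def build_command(root, prefix=None, fixity=None, fmt=None, output=None):
    """Assemble an auto_ref command using flags from the examples above.

    Hypothetical helper for illustration only; it is not part of the package.
    """
    cmd = ["auto_ref", root]
    if prefix:
        cmd += ["-p", prefix]
    if fixity:
        cmd += ["-fx", fixity]
    if fmt:
        cmd += ["-fmt", fmt]
    if output:
        cmd += ["-o", output]
    return cmd


cmd = build_command("/path/to/root", prefix="ARC", fixity="MD5", fmt="csv")
# shlex.join quotes arguments where needed so the line can be pasted into a shell
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` if the command is installed and on your PATH.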

Expected Spreadsheet

The spreadsheet includes a preset of metadata columns, including: FullName, RelativeName, BaseName, Size, Modified, Ref_Section, Level, Parent and Archive_Reference.

By default the reference is written to the Archive_Reference column.

Structure of References

# Usage with Prefix `ARC`
auto_ref /path/to/root -p ARC

Folder                 Reference
>Root                  ARC
--->Folder 1           ARC/1
------>Sub Folder 1    ARC/1/1
--------->File 1       ARC/1/1/1
--------->File 2       ARC/1/1/2
------>Sub Folder 2    ARC/1/2
--------->File 3       ARC/1/2/1
--------->File 4       ARC/1/2/2
--->Folder 2           ARC/2
------>Sub Folder 3    ARC/2/1
--------->File 5       ARC/2/1/1
--->File 6             ARC/3
...

# Files and Folders can coexist at the same level. Without a prefix the root reference defaults to 0:
auto_ref /path/to/root

>Root                  0
--->Folder             1
------>Sub Folder      1/1
--------->File         1/1/1
--------->File2        1/1/2
------>File3           1/2
...

# Prefixes can also be set to integrate the folder into the existing hierarchy at any point.
auto_ref /path/to/root -p "ARC/1/2/3"

>Root                   ARC/1/2/3
--->Folder              ARC/1/2/3/1
------>File             ARC/1/2/3/1/1
------>File2            ARC/1/2/3/1/2
...

# The Start Ref option sets the starting number for the first level below the root. Sub-folders and files further down are not affected.
auto_ref /path/to/root -p "ARC/1/2/3" -str 5

>Root                   ARC/1/2/3
--->Folder              ARC/1/2/3/5
------>File             ARC/1/2/3/5/1
...
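
The numbering pattern above can be sketched in a few lines of Python. This is an illustration of the pattern only, not the tool's own implementation: the `assign_references` function, the nested-dict input and the insertion-order numbering are all assumptions made for the example, and the root row itself is not emitted.

```python
def assign_references(tree, prefix="", delimiter="/", start=1):
    """Return (name, reference) pairs for a nested dict of folders and files.

    Folders are dicts, files are None. Children are numbered in insertion
    order. `start` only affects the first level below the root.
    """
    results = []

    def walk(children, parent_ref, first):
        for number, (name, sub) in enumerate(children.items(), start=first):
            # Join onto the parent reference, or stand alone when there is none
            ref = f"{parent_ref}{delimiter}{number}" if parent_ref else str(number)
            results.append((name, ref))
            if isinstance(sub, dict):
                walk(sub, ref, 1)  # deeper levels always restart at 1

    walk(tree, prefix, start)
    return results


tree = {
    "Folder 1": {
        "Sub Folder 1": {"File 1": None, "File 2": None},
        "Sub Folder 2": {"File 3": None, "File 4": None},
    },
}
for name, ref in assign_references(tree, prefix="ARC"):
    print(f"{name:<14}{ref}")
```

Calling `assign_references({"Folder": {"File": None}}, prefix="ARC/1/2/3", start=5)` mirrors the Start Ref example: only the first level is offset.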

Advanced Options

Important notes

  • Folders named meta are always ignored; this is hard coded.
  • A meta folder will always be generated unless the --disable-meta-dir option is used.
  • Both relative and absolute paths will work.

Clear Empty Directories

# Will remove empty directories and generate a plain text log to the 'meta folder'. This is to prevent misleading references to nothing.
auto_ref /path/to/root --remove-empty
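
Because removal changes the folder on disk, it may help to preview which directories are empty first. The snippet below is a minimal sketch using only the standard library. The `find_empty_dirs` helper is hypothetical and is not the tool's own logic; in particular, how the tool treats hidden files or folders that contain only empty folders is not stated here, so this sketch simply lists directories with no entries at all.

```python
import os
import tempfile


def find_empty_dirs(root):
    """Return directories under root that contain no files and no subdirectories."""
    empty = []
    for current, dirnames, filenames in os.walk(root):
        if current != root and not dirnames and not filenames:
            empty.append(current)
    return sorted(empty)


# Build a throwaway folder tree so the example is self-contained
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "Folder 1", "Empty Sub Folder"))
    os.makedirs(os.path.join(root, "Folder 2"))
    with open(os.path.join(root, "Folder 2", "File 1.txt"), "w") as handle:
        handle.write("example")
    found = [os.path.relpath(path, root) for path in find_empty_dirs(root)]
    print(found)
```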

Hash/Fixity Generation

# Will generate a SHA-1 fixity list alongside reference, in columns Hash and Algorithm
auto_ref /path/to/root -fx SHA-1

# MD5, SHA-1, SHA-256, SHA-512 supported.
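
To check a value in the Hash column independently, a fixity can be recomputed with Python's standard `hashlib` module. This is a minimal sketch, not the tool's own code: the `file_fixity` helper and the mapping from the algorithm names above to `hashlib` names are assumptions made for the example.

```python
import hashlib
import os
import tempfile

# Assumed mapping from the algorithm names listed above to hashlib names
ALGORITHMS = {"MD5": "md5", "SHA-1": "sha1", "SHA-256": "sha256", "SHA-512": "sha512"}


def file_fixity(path, algorithm="SHA-1", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(ALGORITHMS[algorithm])
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hash a small temporary file containing the bytes b"abc"
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.txt")
    with open(sample, "wb") as handle:
        handle.write(b"abc")
    sha1_of_abc = file_fixity(sample, "SHA-1")
    md5_of_abc = file_fixity(sample, "MD5")

print(sha1_of_abc)  # a9993e364706816aba3e25717850c26c9cd0d89d
print(md5_of_abc)   # 900150983cd24fb0d6963f7d28e17f72
```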


Level Limit

# Sets a level depth at which to stop generating references. This example stops generating references 5 levels down from the root.
auto_ref /path/to/root -l 5

Skip

# Will skip reference generation if you just want a listing of files
auto_ref /path/to/root --skip

Keywords

Keywords replace the numerical reference with a keyword when the folder name matches.

# Replaces the keywords "Department of Justice" & "Department of Finance" with the initials of their words, i.e. DOJ, DOF.
auto_ref /path/to/root -key "Department of Justice" "Department of Finance"

# The keyword replaces the reference number for every folder that matches it. How the replacement is formed is determined by `-keym / --keywords-mode`.

Keyword Modes:

# initialise
# Uses the initials of the keywords; in this example Department of Justice becomes DOJ. Single words fall back to firstletters mode. This is the default mode.
auto_ref /path/to/root -key "Department of Justice" -keym initialise

# firstletters
# Uses the first x letters of the keyword, e.g. `Department of Justice` becomes `DEP`.
auto_ref /path/to/root -key "Department of Justice" -keym firstletters

# from_json
# Uses a dictionary stored as a JSON file to set custom abbreviations.
auto_ref /path/to/root -key /path/to/keyword.json -keym from_json

# JSON formatted like:
{"keyword to replace": "value to replace with", "keyword2": "value2"}
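
JSON requires double-quoted strings, so one way to avoid quoting mistakes is to write the mapping file with Python's `json` module. The snippet below is a minimal sketch; the keyword values and the temporary file location are examples only.

```python
import json
import os
import tempfile

# Example mapping of folder names to the abbreviations that should replace them
keyword_map = {
    "Department of Justice": "DOJ",
    "Department of Finance": "DOF",
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "keyword.json")
    # json.dump always emits valid JSON with double-quoted keys and values
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(keyword_map, handle, indent=2)
    # Read the file back to confirm it parses
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    loaded = json.loads(text)

print(text)
```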

Additional Keyword Options:

--keywords-case-sensitivity # Makes keyword lookup case sensitive. Default is insensitive.
--keywords-abbreviation-number # Sets the number of letters to abbreviate to in firstletters mode. Default is 3.
--keywords-retain-order # Sets whether keyword replacements count towards reference numbering.
                        # By default replacements are not counted.
                        # If a keyword replacement is made after reference number 1, the next reference number after the replacement will be 2.
                        # If this option is used the number will instead be 3.
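
The keyword modes and the retain-order behaviour described above can be sketched as follows. This is an illustration under stated assumptions rather than the tool's own code: the function names are hypothetical, matching is assumed to be against the whole folder name, and every word (including "of") is assumed to contribute an initial, as the DOJ example suggests.

```python
def abbreviate(name, mode="initialise", letters=3):
    """Abbreviate a keyword in 'initialise' or 'firstletters' style."""
    words = name.split()
    if mode == "initialise" and len(words) > 1:
        return "".join(word[0] for word in words).upper()
    # Single words, and firstletters mode, take the first `letters` characters
    return name[:letters].upper()


def number_siblings(names, keywords, retain_order=False, case_sensitive=False):
    """Number sibling folders, swapping keyword matches for abbreviations."""
    def normalise(text):
        return text if case_sensitive else text.lower()

    wanted = {normalise(keyword) for keyword in keywords}
    refs, counter = [], 1
    for name in names:
        if normalise(name) in wanted:
            refs.append(abbreviate(name))
            if retain_order:
                counter += 1  # the replacement still uses up a number
        else:
            refs.append(str(counter))
            counter += 1
    return refs


folders = ["Accounts", "Department of Justice", "Minutes"]
print(number_siblings(folders, ["department of justice"]))
print(number_siblings(folders, ["department of justice"], retain_order=True))
```

With the default settings the folder after the replacement is numbered 2; with `retain_order=True` it is numbered 3, matching the description above.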

Options File

# Set a custom options file to customise default headers and some program defaults
auto_ref /path/to/root --options-file /path/to/options.properties

Default Options are:

[options]

INDEX_FIELD = FullName # Sets name to run indexing from

PATH_FIELD = FullName
RELATIVE_FIELD = RelativeName
PARENT_FIELD = Parent
PARENT_REF = Parent_Ref
REFERENCE_FIELD = Archive_Reference
REF_SECTION = Ref_Section
ACCESSION_FIELD = Accession_Reference
LEVEL_FIELD = Level
BASENAME_FIELD = BaseName
EXTENSION_FIELD = Extension
ATTRIBUTE_FIELD = Attributes
SIZE_FIELD = Size
CREATEDATE_FIELD = Create_Date
MODDATE_FIELD = Modified_Date
ACCESSDATE_FIELD = Access_Date

ALGORITHM_FIELD = Algorithm
HASH_FIELD = Hash

ACCDELIMTER = -
ACCFILE_KEYWORD = File
ACCDIR_KEYWORD = Dir
METAFOLDER = meta
OUTPUTSUFFIX = _AutoRef
EMPTYSUFFIX = _EmptyDirsRemoved

Accession mode

An alternative method of reference generation is based on an accession number / running number pattern. Each file or folder is given a running number regardless of depth.

Example output running Accession in "file" Mode:

>Root                 ACC-Dir
---> Folder 1          ACC-Dir
------> File 1         ACC-1
------> File 2         ACC-2
---> File 3            ACC-3
---> Folder 2          ACC-Dir
------> Sub-Folder     ACC-Dir
---------> File 4      ACC-4

Examples:

# Run Accession generation for files with Prefix "ACC" - numbers files
auto_ref /path/to/root -acc file -accp "ACC"

# Run Accession generation for directories - numbers directories
auto_ref /path/to/root -acc dir -accp "ACC"

# Run Accession generation for both - numbers both
auto_ref /path/to/root -acc both -accp "ACC"

The output will be to an additional Accession_Reference column
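
The running-number pattern can be sketched as below. This is an illustration rather than the tool's own code: the `accession_references` function is hypothetical, and labelling un-numbered files with `File` in "dir" mode is an assumption based on the ACCFILE_KEYWORD default in the options file.

```python
def accession_references(entries, prefix="ACC", mode="file",
                         delimiter="-", dir_keyword="Dir", file_keyword="File"):
    """Give each (name, is_dir) entry a running number or a keyword label.

    `mode` is "file", "dir" or "both" and selects which entries are numbered.
    """
    refs, counter = [], 1
    for name, is_dir in entries:
        numbered = (
            mode == "both"
            or (mode == "dir" and is_dir)
            or (mode == "file" and not is_dir)
        )
        if numbered:
            label = str(counter)
            counter += 1  # one shared counter, regardless of depth
        else:
            label = dir_keyword if is_dir else file_keyword
        refs.append((name, f"{prefix}{delimiter}{label}"))
    return refs


# The listing from the example output above, in order, flagged True for folders
entries = [
    ("Root", True), ("Folder 1", True), ("File 1", False), ("File 2", False),
    ("File 3", False), ("Folder 2", True), ("Sub-Folder", True), ("File 4", False),
]
for name, ref in accession_references(entries, mode="file"):
    print(f"{name:<12}{ref}")
```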


Full Options:

The following covers the full range of options. Use the -h option to show this help text:

Usage:

Auto_Reference_Generator [-h] [-v] [-p [PREFIX]] [-s [SUFFIX]]
                                    [--suffix-option [{file,dir,both}]] [-acc [{file,dir,both}]]
                                    [-accp [ACC_PREFIX]] [-l [LEVEL_LIMIT]] [-str [START_REF]]
                                    [-dlm [DELIMITER]] [--remove-empty] [--disable-empty-export]
                                    [-hid] [--sort-by [{folders_first,alphabetical}]]
                                    [-fx [{MD5,SHA-1,SHA1,SHA-256} ...]]
                                    [--max-workers [MAX_WORKERS]] [-o [OUTPUT]] [--disable-meta-dir]
                                    [-skp] [-fmt {xlsx,csv,json,ods,xml,dict}]
                                    [--options-file [OPTIONS_FILE]]
                                    [--log-level [{DEBUG,INFO,WARNING,ERROR}]]
                                    [--log-file [LOG_FILE]] [-key [KEYWORDS ...]]
                                    [--keywords-case-sensitivity]
                                    [-keym [{initialise,firstletters,from_json}]]
                                    [--keywords-retain-order]
                                    [--keywords-abbreviation-number [KEYWORDS_ABBREVIATION_NUMBER]]
                                    [--physical-mode-input [PHYSICAL_MODE_INPUT]]
                                    [--spreadsheet-to-sort [SPREADSHEET_TO_SORT]]
                                    [root]

Auto Reference Generator for Digital Cataloguing

Positional arguments:

  • root: The root directory to create references for

Optional arguments:

  • -v, --version: See version information, then exit

Reference Options: Options for reference generation

  • -p [PREFIX], --prefix [PREFIX]: Set a prefix to prepend to generated references
  • -s [SUFFIX], --suffix [SUFFIX]: Set a suffix to append onto generated references
  • --suffix-option [{file, dir, both}]: Set whether to apply the suffix to files, folders or both when generating references
  • -acc [{file, dir, both}], --accession [{file, dir, both}]: Sets the program to create an accession listing - IE a running number of the files.
  • -accp [ACC_PREFIX], --acc-prefix [ACC_PREFIX]: Sets the Prefix for Accession Mode
  • -l [LEVEL_LIMIT], --level-limit [LEVEL_LIMIT]: Set a level limit to generate references to
  • -str [START_REF], --start-ref [START_REF]: Set the starting reference number. Won't affect sub-folders/files
  • -dlm [DELIMITER], --delimiter [DELIMITER]: Set the delimiter to use between levels
  • --remove-empty: Sets the Program to remove any Empty Directory and Log removals to a text file
  • --disable-empty-export: Sets the program to not export a log of removed empty directories, by default will export, this flag disables that
  • -hid, --hidden: Set to include hidden files/folders in the listing
  • --sort-by [{folders_first, alphabetical}]: Set the sorting method, 'folders_first' sorts folders first then files alphabetically; 'alphabetical' sorts alphabetically (ignoring folder distinction)
  • -fx [{MD5, SHA-1, SHA1, SHA-256} ...], --fixity [{MD5, SHA-1, SHA1, SHA-256} ...]: Set to generate fixities, specify Algorithm to use (default SHA-1)
  • --max-workers [MAX_WORKERS]: Set the maximum number of worker threads to use for hash generation when using --fixity (default: 1)

Output Options: Options for outputting the generated references

  • -o [OUTPUT], --output [OUTPUT]: Set the output directory for the created spreadsheet
  • --disable-meta-dir: Set to disable creating a 'meta' folder for the spreadsheet; can be used in combination with output
  • -skp, --skip: Set to skip creating references, will generate a spreadsheet listing
  • -fmt {xlsx, csv, json, ods, xml, dict}, --output-format {xlsx, csv, json, ods, xml, dict}: Set the output format. Note: ods requires odfpy; xml requires lxml; dict requires pandas. Please install as needed
  • --options-file [OPTIONS_FILE]: Set the options file to use, to override output column headers and other options
  • --log-level [{DEBUG, INFO, WARNING, ERROR}]: Set the logging level (default: WARNING)
  • --log-file [LOG_FILE]: Optional path to write logs to a file (default: stdout)

Keyword Options: Options for using keywords in reference generation

  • -key [KEYWORDS ...], --keywords [KEYWORDS ...]: Set to replace reference numbers with given Keywords for folders (only Folders atm). Can be a list of keywords or a JSON file mapping folder names to keywords.
  • --keywords-case-sensitivity: Set to make keyword matching case sensitive. By default keyword matching is case insensitive
  • -keym [{initialise, firstletters, from_json}], --keywords-mode [{initialise, firstletters, from_json}]: Set to alternate keyword mode: 'initialise' will use initials of words; 'firstletters' will use the first letters of the string; 'from_json' will use a JSON file mapping names to keywords
  • --keywords-retain-order: Set when using keywords to continue reference numbering. If not used keywords don't 'count' to reference numbering, e.g. if using initials 'Project Alpha' -> 'PA' then the next folder/file will be '1' not '2'
  • --keywords-abbreviation-number [KEYWORDS_ABBREVIATION_NUMBER]: Set the number of letters to abbreviate to in 'firstletters' mode; does not impact 'initialise' mode.

Physical Mode Options: Options for using physical mode functionality

  • --physical-mode-input [PHYSICAL_MODE_INPUT]: Set to conduct an auto generation from a spreadsheet; specify a path to the spreadsheet
  • --spreadsheet-to-sort [SPREADSHEET_TO_SORT]: Set to a path to a Spreadsheet containing an 'Archive_Reference' Column to sort the spreadsheet according to hierarchy
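
Sorting references "according to hierarchy" differs from a plain text sort, where ARC/1/10 would come before ARC/1/2. A minimal sketch of such a sort key is shown below. The `reference_sort_key` function is hypothetical, and placing numeric parts before keyword parts at the same level is an assumption; the tool's own ordering may differ.

```python
def reference_sort_key(reference, delimiter="/"):
    """Split a reference into parts so numeric parts compare as numbers."""
    key = []
    for part in reference.split(delimiter):
        if part.isdigit():
            key.append((0, int(part), ""))  # numeric parts sort by value
        else:
            key.append((1, 0, part))        # keyword parts sort as text, after numbers
    return key


refs = ["ARC/1/10", "ARC/2", "ARC/1/2", "ARC", "ARC/1", "ARC/1/2/1"]
ordered = sorted(refs, key=reference_sort_key)
print(ordered)
# ['ARC', 'ARC/1', 'ARC/1/2', 'ARC/1/2/1', 'ARC/1/10', 'ARC/2']
```

The same key could be applied to a pandas column, for example with `sorted(df["Archive_Reference"], key=reference_sort_key)`, assuming the spreadsheet has been loaded into a DataFrame named `df`.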

Troubleshooting

  • On Windows, ensure that the root folder path you enter does not end in a \. Tab completion adds one by default, so remove it before running.
  • The examples above use Linux paths. If you're on Windows, don't forget to change the forward slashes to backslashes \.

Future Developments

  • Level Limitations to allow for "group references" - Added!
  • Generating references which use alphabetic characters - Added!
  • A mode for Physical Cataloguing...

Contributing

I welcome further contributions and feedback. If there are any issues please raise them here.

Developers

The program can be used as a Python module like so:

from auto_reference_generator import ReferenceGenerator

rg = ReferenceGenerator("/path/to/root", prefix="ARC", output_path="/path/to/output")
