A tool for archivists to automate the generation of references for digital files
Auto Reference Generator
A small Python program to generate hierarchical archival references for files and directories and export the results to a spreadsheet.
Table of Contents
- Quick Start
- Version & Package info
- Why use this tool?
- Additional Features:
- Basic Usage / Examples
- Expected Spreadsheet
- Structure of References
- Advanced Options
- Full Options:
- Troubleshooting
- Future Developments
- Contributing
- Developers
Quick Start
Option 1: Using pip (Recommended for Python users / long-term usage)
pip install -U auto_reference_generator
auto_ref /path/to/root -p PREFIX -o /path/to/output
Option 2: Using Portable Executable (No Python Required)
Download the latest portable executable for your platform from Releases
Extract and run:
# Windows
cd auto_ref\bin
.\auto_ref.cmd .\path\to\root -p PREFIX -o .\path\to\output
# Linux/macOS
./auto_ref /path/to/root -p PREFIX -o /path/to/output
On Windows you can also run install.cmd with admin privileges to install the command, so it can be run without navigating to the .cmd directory (see Option 1 for usage).
Output
Generates a meta folder with output root_AutoRef.xlsx and a list of the generated reference hierarchy, alongside some metadata.
Version & Package info
Python Version: Python 3.10+ is recommended. Earlier versions may work but are not tested.
Additional Packages:
- pandas (required)
- openpyxl (required)
- odfpy (optional - ods export)
- lxml (optional - xml export)
- tqdm (required)
To install using Python:
pip install pandas openpyxl odfpy lxml tqdm
If using Python, ensure it is added to your PATH environment variable.
Why use this tool?
This tool is designed for archivists cataloguing large numbers of digital records at a time.
Automated generation of references saves time and effort compared to filling them in manually.
Additional options expand upon this and allow insertion into existing hierarchies and reference systems.
Additional Features:
- Prefixes - allowing merging into existing hierarchies
- Suffixes
- Level identification and limiting
- Keyword assignment - replacing numerical references with specified keywords (initials, first letters, JSON map)
- Logged removal of empty directories
- Accession / Running Number mode
- Fixity Generation
- Export options include: xlsx (default), csv, ods, json or xml.
- Integration with Opex Manifest Generator *Shameless Self promotion*.
Basic Usage / Examples
- Basic:
auto_ref /path/to/root
- Prefix:
auto_ref /path/to/root -p PREFIX
- Suffix:
auto_ref /path/to/root -s SUFFIX
- Delimiter:
auto_ref /path/to/root -dlm "-"
- Accession:
auto_ref /path/to/root -acc file
- Fixity:
auto_ref /path/to/root -fx MD5
- Format:
auto_ref /path/to/root -fmt csv
- Remove Empty:
auto_ref /path/to/root --remove-empty
- Output:
auto_ref /path/to/root -o /path/to/output
- Include Hidden:
auto_ref /path/to/root --hidden
These options can be combined with one another.
Expected Spreadsheet
The spreadsheet should output like so:
It includes a preset of metadata columns, including: FullName, RelativeName, BaseName, Size, Modified, Ref_Section, Level, Parent and Archive_Reference.
By default, the reference is written to the Archive_Reference column.
Structure of References
# Usage with Prefix `ARC`
auto_ref /path/to/root -p ARC
Folder Reference
>Root ARC
--->Folder 1 ARC/1
------>Sub Folder 1 ARC/1/1
--------->File 1 ARC/1/1/1
--------->File 2 ARC/1/1/2
------>Sub Folder 2 ARC/1/2
--------->File 3 ARC/1/2/1
--------->File 4 ARC/1/2/2
--->Folder 2 ARC/2
------>Sub Folder 3 ARC/2/1
--------->File 5 ARC/2/1/1
--->File 6 ARC/3
...
# Files and Folders can coexist at the same level. Without a prefix the root reference defaults to 0:
auto_ref /path/to/root
>Root 0
--->Folder 1
------>Sub Folder 1/1
--------->File 1/1/1
--------->File2 1/1/2
------>File3 1/2
...
# Prefixes can also be set to integrate the folder into the existing hierarchy at any point.
auto_ref /path/to/root -p "ARC/1/2/3"
>Root ARC/1/2/3
--->Folder ARC/1/2/3/1
------>File ARC/1/2/3/1/1
------>File2 ARC/1/2/3/1/2
...
# The Start Ref option sets the starting number for the first level below the root. It does not affect sub-folders or files.
auto_ref /path/to/root -p "ARC/1/2/3" -str 5
>Root ARC/1/2/3
--->Folder ARC/1/2/3/5
------>File ARC/1/2/3/5/1
...
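The numbering shown above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name `build_references`, the nested-dict tree format (a dict is a folder, `None` is a file) and the folders-first ordering are all assumptions made for the example.

```python
def build_references(tree, prefix="", delimiter="/", start_ref=1):
    """Map each path in a nested dict to a hierarchical reference.

    Hypothetical helper: a dict value is a folder, None is a file.
    """
    # Without a prefix the root reference defaults to 0.
    refs = {"Root": prefix if prefix else "0"}

    def walk(node, path, parent_ref, start):
        # Assumed ordering: folders first, then files, each alphabetically.
        names = sorted(node, key=lambda name: (node[name] is None, name))
        for number, name in enumerate(names, start=start):
            ref = f"{parent_ref}{delimiter}{number}" if parent_ref else str(number)
            child_path = f"{path}/{name}"
            refs[child_path] = ref
            if node[name] is not None:
                # Children always restart at 1, whatever start_ref is.
                walk(node[name], child_path, ref, 1)

    walk(tree, "Root", prefix, start_ref)
    return refs


tree = {
    "Folder 1": {"Sub Folder 1": {"File 1": None, "File 2": None}},
    "Folder 2": {"Sub Folder 3": {"File 5": None}},
    "File 6": None,
}
refs = build_references(tree, prefix="ARC")
print(refs["Root/Folder 1/Sub Folder 1/File 2"])  # ARC/1/1/2
print(refs["Root/File 6"])                        # ARC/3
```

Passing `prefix="ARC/1/2/3"` and `start_ref=5` to the same sketch gives `ARC/1/2/3/5` for the first folder, mirroring the prefix and start reference examples.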
Advanced Options
Important notes
- The term meta is hard coded to always be ignored for folders.
- A meta folder will always be generated unless using the --disable-meta-dir option.
- Both relative and absolute paths will work.
Clear Empty Directories
# Will remove empty directories and generate a plain text log to the 'meta folder'. This is to prevent misleading references to nothing.
auto_ref /path/to/root --remove-empty
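To preview which directories would be affected before running with --remove-empty, something like the following could be used. It is only a sketch using the standard library: `find_empty_dirs` is a hypothetical helper, it only reports directories that contain nothing at all, and it does not reproduce the tool's own removal or logging behaviour.

```python
import os
import tempfile


def find_empty_dirs(root):
    """Return directories under root that contain no files or folders."""
    empty = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath != root and not dirnames and not filenames:
            empty.append(dirpath)
    return sorted(empty)


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "empty_folder"))
    os.makedirs(os.path.join(root, "full_folder"))
    with open(os.path.join(root, "full_folder", "file.txt"), "w") as handle:
        handle.write("content")
    found = [os.path.relpath(path, root) for path in find_empty_dirs(root)]

print(found)  # ['empty_folder']
```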
Hash/Fixity Generation
# Will generate a SHA-1 fixity list alongside reference, in columns Hash and Algorithm
auto_ref /path/to/root -fx SHA-1
# MD5, SHA-1, SHA-256, SHA-512 supported.
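For checking a value in the Hash column independently, a fixity can be recomputed with Python's standard `hashlib` module. The sketch below is an assumption about how such a check might look, not the tool's own code; `file_fixity` and its parameters are hypothetical names.

```python
import hashlib
import os
import tempfile


def file_fixity(path, algorithm="SHA-1", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    # Normalise names such as "SHA-1" or "SHA-256" to hashlib's "sha1", "sha256".
    digest = hashlib.new(algorithm.replace("-", "").lower())
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
sha1_digest = file_fixity(tmp.name, "SHA-1")
md5_digest = file_fixity(tmp.name, "MD5")
os.remove(tmp.name)
print(sha1_digest)
```

Reading in chunks keeps memory use flat for large files, which matters when hashing whole accessions.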
Level Limit
# Sets the depth at which to stop generating references. This example stops generating references 5 levels down from the root.
auto_ref /path/to/root -l 5
Skip
# Will skip reference generation if you just want a listing of files
auto_ref /path/to/root --skip
Keywords
Keywords replace the numerical reference of any folder whose name matches a given keyword.
# Replaces the references for "Department of Justice" & "Department of Finance" with the initials of their words, i.e. DOJ, DOF.
auto_ref /path/to/root -key "Department of Justice" "Department of Finance"
# The replacement applies to every folder matching a keyword. How the replacement is formed is determined by `-keym / --keywords-mode`.
Keyword Modes:
# initialise
# Uses the initials of the keyword; in this example Department of Justice becomes DOJ. Single words fall back to firstletters mode. This is the default mode.
auto_ref /path/to/root -key "Department of Justice" -keym initialise
# firstletters
# Uses the first x letters of the keyword, i.e. `Department of Justice` becomes `DEP`.
auto_ref /path/to/root -key "Department of Justice" -keym firstletters
# from_json
# Uses a dictionary stored as a JSON file to set custom abbreviations.
auto_ref /path/to/root -key /path/to/keyword.json -keym from_json
# JSON formatted like:
{"keyword to replace": "value to replace with", "keyword2": "value2"}
Additional Keyword Options:
--keywords-case-sensitivity # Makes keyword lookup case sensitive. Default is insensitive.
--keywords-abbreviation-number # Sets the number of letters to abbreviate to in firstletters mode. Default is 3.
--keywords-retain-order # Sets whether keyword replacements count towards reference numbering.
# By default replacements are not counted.
# If a keyword replacement is made after reference number 1, the next reference number after the replacement will be 2.
# If this option is used, the number will instead be 3.
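The two abbreviation modes described above can be illustrated with a short Python sketch. The function names `abbreviate` and `keyword_reference` are hypothetical, and exact whole-name matching is an assumption; the tool's real matching rules may differ.

```python
def abbreviate(keyword, mode="initialise", length=3):
    """Abbreviate a keyword using initials or its first letters."""
    words = keyword.split()
    if mode == "initialise" and len(words) > 1:
        return "".join(word[0] for word in words).upper()
    # firstletters mode, and the fallback for single words in initialise mode.
    return keyword[:length].upper()


def keyword_reference(folder_name, keywords, mode="initialise", case_sensitive=False):
    """Return the abbreviation if folder_name matches a keyword, else None."""
    for keyword in keywords:
        if case_sensitive:
            matched = folder_name == keyword
        else:
            matched = folder_name.lower() == keyword.lower()
        if matched:
            return abbreviate(keyword, mode)
    return None


print(abbreviate("Department of Justice"))                  # DOJ
print(abbreviate("Department of Justice", "firstletters"))  # DEP
```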
Options File
# Set a custom options file to customise default headers and some program defaults
auto_ref /path/to/root --options-file /path/to/options.properties
Default Options are:
[options]
INDEX_FIELD = FullName # Sets name to run indexing from
PATH_FIELD = FullName
RELATIVE_FIELD = RelativeName
PARENT_FIELD = Parent
PARENT_REF = Parent_Ref
REFERENCE_FIELD = Archive_Reference
REF_SECTION = Ref_Section
ACCESSION_FIELD = Accession_Reference
LEVEL_FIELD = Level
BASENAME_FIELD = BaseName
EXTENSION_FIELD = Extension
ATTRIBUTE_FIELD = Attributes
SIZE_FIELD = Size
CREATEDATE_FIELD = Create_Date
MODDATE_FIELD = Modified_Date
ACCESSDATE_FIELD = Access_Date
ALGORITHM_FIELD = Algorithm
HASH_FIELD = Hash
ACCDELIMTER = -
ACCFILE_KEYWORD = File
ACCDIR_KEYWORD = Dir
METAFOLDER = meta
OUTPUTSUFFIX = _AutoRef
EMPTYSUFFIX = _EmptyDirsRemoved
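An options file in this layout can be read with Python's standard `configparser`, which is handy for checking a custom file before passing it to --options-file. This is a sketch under the assumption that the file follows the INI layout shown above; whether the tool itself parses the file this way is not stated here.

```python
import configparser

# A shortened options file, using keys from the defaults listed above.
OPTIONS_TEXT = """
[options]
INDEX_FIELD = FullName # Sets name to run indexing from
REFERENCE_FIELD = Archive_Reference
METAFOLDER = meta
OUTPUTSUFFIX = _AutoRef
"""

# inline_comment_prefixes strips the trailing "# ..." comment from values.
parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
parser.read_string(OPTIONS_TEXT)
options = parser["options"]

print(options["INDEX_FIELD"])
print(options["OUTPUTSUFFIX"])
```

Note that `configparser` treats option names case-insensitively by default, so `options["index_field"]` returns the same value.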
Accession mode
An alternative method of reference generation is based on an accession number / running number pattern. Each file or folder is given a running number regardless of depth.
Example output running Accession in "file" Mode:
>Root ACC-Dir
---> Folder 1 ACC-Dir
------> File 1 ACC-1
------> File 2 ACC-2
---> File 3 ACC-3
---> Folder 2 ACC-Dir
------> Sub-Folder ACC-Dir
---------> File 4 ACC-4
Examples:
# Run Accession generation for files with Prefix "ACC" - numbers files
auto_ref /path/to/root -acc file -accp "ACC"
# Run Accession generation for directories - numbers directories
auto_ref /path/to/root -acc dir -accp "ACC"
# Run Accession generation for both - numbers both
auto_ref /path/to/root -acc both -accp "ACC"
The output will be to an additional Accession_Reference column
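The running-number pattern can be sketched as a single counter shared across the whole tree. This is an illustration only: `accession_references` and the `(name, children)` tuple format are hypothetical, the `Dir` and `File` labels follow the ACCDIR_KEYWORD and ACCFILE_KEYWORD defaults, and how the tool treats the root in `dir` or `both` mode is an assumption.

```python
import itertools


def accession_references(entries, prefix="ACC", mode="file", delimiter="-"):
    """Assign running numbers to files, directories or both, regardless of depth."""
    counter = itertools.count(1)
    results = []

    def walk(items):
        for name, children in items:
            is_dir = children is not None
            numbered = mode == "both" or mode == ("dir" if is_dir else "file")
            # Entries that are not numbered receive a fixed keyword instead.
            label = str(next(counter)) if numbered else ("Dir" if is_dir else "File")
            results.append((name, f"{prefix}{delimiter}{label}"))
            if is_dir:
                walk(children)

    walk(entries)
    return results


# children is a list for a folder and None for a file.
tree = [
    ("Root", [
        ("Folder 1", [("File 1", None), ("File 2", None)]),
        ("File 3", None),
        ("Folder 2", [("Sub-Folder", [("File 4", None)])]),
    ]),
]
for name, ref in accession_references(tree):
    print(name, ref)
```

Run in `file` mode, the sketch reproduces the example listing above, with `File 4` receiving `ACC-4` even though it sits three levels down.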
Full Options:
The below covers the full range of options. Use the -h option to show this dialog:
Usage:
Auto_Reference_Generator [-h] [-v] [-p [PREFIX]] [-s [SUFFIX]]
[--suffix-option [{file,dir,both}]] [-acc [{file,dir,both}]]
[-accp [ACC_PREFIX]] [-l [LEVEL_LIMIT]] [-str [START_REF]]
[-dlm [DELIMITER]] [--remove-empty] [--disable-empty-export]
[-hid] [--sort-by [{folders_first,alphabetical}]]
[-fx [{MD5,SHA-1,SHA1,SHA-256} ...]]
[--max-workers [MAX_WORKERS]] [-o [OUTPUT]] [--disable-meta-dir]
[-skp] [-fmt {xlsx,csv,json,ods,xml,dict}]
[--options-file [OPTIONS_FILE]]
[--log-level [{DEBUG,INFO,WARNING,ERROR}]]
[--log-file [LOG_FILE]] [-key [KEYWORDS ...]]
[--keywords-case-sensitivity]
[-keym [{initialise,firstletters,from_json}]]
[--keywords-retain-order]
[--keywords-abbreviation-number [KEYWORDS_ABBREVIATION_NUMBER]]
[--physical-mode-input [PHYSICAL_MODE_INPUT]]
[--spreadsheet-to-sort [SPREADSHEET_TO_SORT]]
[root]
Auto Reference Generator for Digital Cataloguing
Positional arguments:
root: The root directory to create references for
Optional arguments:
-v,--version: See version information, then exit
Reference Options: Options for reference generation
-p [PREFIX],--prefix [PREFIX]: Set a prefix to append onto generated references
-s [SUFFIX],--suffix [SUFFIX]: Set a suffix to append onto generated references
--suffix-option [{file,dir,both}]: Set whether to apply the suffix to files, folders or both when generating references
-acc [{file,dir,both}],--accession [{file,dir,both}]: Sets the program to create an accession listing - IE a running number of the files.
-accp [ACC_PREFIX],--acc-prefix [ACC_PREFIX]: Sets the Prefix for Accession Mode
-l [LEVEL_LIMIT],--level-limit [LEVEL_LIMIT]: Set a level limit to generate references to
-str [START_REF],--start-ref [START_REF]: Set the starting reference number. Won't affect sub-folders/files
-dlm [DELIMITER],--delimiter [DELIMITER]: Set the delimiter to use between levels
--remove-empty: Sets the Program to remove any Empty Directory and Log removals to a text file
--disable-empty-export: Sets the program to not export a log of removed empty directories, by default will export, this flag disables that
-hid,--hidden: Set to include hidden files/folders in the listing
--sort-by [{folders_first,alphabetical}]: Set the sorting method, 'folders_first' sorts folders first then files alphabetically; 'alphabetically' sorts alphabetically (ignoring folder distinction)
-fx [{MD5,SHA-1,SHA1,SHA-256} ...],--fixity [{MD5,SHA-1,SHA1,SHA-256} ...]: Set to generate fixities, specify Algorithm to use (default SHA-1)
--max-workers [MAX_WORKERS]: Set the maximum number of worker threads to use for hash generation when using --fixity (default: 1)
Output Options: Options for outputting the generated references
-o [OUTPUT],--output [OUTPUT]: Set the output directory for the created spreadsheet
--disable-meta-dir: Set to disable creating a 'meta' file for spreadsheet; can be used in combination with output
-skp,--skip: Set to skip creating references, will generate a spreadsheet listing
-fmt {xlsx,csv,json,ods,xml,dict},--output-format {xlsx,csv,json,ods,xml,dict}: Set to set output format. Note ods requires odfpy; xml requires lxml; dict requires pandas, please install as needed
--options-file [OPTIONS_FILE]: Set the options file to use, to override output column headers and other options
--log-level [{DEBUG,INFO,WARNING,ERROR}]: Set the logging level (default: WARNING)
--log-file [LOG_FILE]: Optional path to write logs to a file (default: stdout)
Keyword Options: Options for using keywords in reference generation
-key [KEYWORDS ...],--keywords [KEYWORDS ...]: Set to replace reference numbers with given Keywords for folders (only Folders atm). Can be a list of keywords or a JSON file mapping folder names to keywords.
--keywords-case-sensitivity: Set to change case keyword matching sensitivity. By default keyword matching is insensitive
-keym [{initialise,firstletters,from_json}],--keywords-mode [{initialise,firstletters,from_json}]: Set to alternate keyword mode: 'initialise' will use initials of words; 'firstletters' will use the first letters of the string; 'from_json' will use a JSON file mapping names to keywords
--keywords-retain-order: Set when using keywords to continue reference numbering. If not used keywords don't 'count' to reference numbering, e.g. if using initials 'Project Alpha' -> 'PA' then the next folder/file will be '1' not '2'
--keywords-abbreviation-number [KEYWORDS_ABBREVIATION_NUMBER]: Set to set the number of letters to abbreviate for 'firstletters' mode, does not impact 'initialise' mode.
Physical Mode Options: Options for using physical mode functionality
--physical-mode-input [PHYSICAL_MODE_INPUT]: Set to conduct an Auto Generation of a Specify a path to a Spreadsheet
--spreadsheet-to-sort [SPREADSHEET_TO_SORT]: Set to a path to a Spreadsheet containing an 'Archive_Reference' Column to sort the spreadsheet according to hierarchy
Troubleshooting
- On Windows ensure that when you enter the root folder it does not end in a \. This is slightly annoying as it adds it by default when tabbing.
- In the examples above I've used Linux paths. If you're on Windows don't forget to change these to backslashes \
Future Developments
- Level Limitations to allow for "group references" - Added!
- Generating references which use alphabetic characters - Added!
- A mode for Physical Cataloguing...
Contributing
I welcome further contributions and feedback. If there are any issues please raise them here.
Developers
The program can be used as a Python module like so:
from auto_reference_generator import ReferenceGenerator
rg = ReferenceGenerator("/path/to/root", prefix="ARC", output_path="/path/to/output")