Computes SHA-1 checksums for all specified files and writes them to a checksum.txt file.
PRIDE Checksum Generator
A command-line tool that computes SHA-1 checksums for files and generates integrity verification reports. This tool is essential for data validation, file integrity checking, and ensuring data hasn't been corrupted during transfer or storage.
Why do you need this tool?
Data Integrity Verification: When working with important files (research data, archives, backups), you need to ensure files haven't been corrupted, modified, or tampered with during:
- File transfers between systems
- Long-term storage
- Data migrations
- Backup and restore operations
Compliance and Auditing: Many organizations require checksum verification for:
- Data governance and compliance
- Scientific research reproducibility
- Archive validation
- Quality assurance processes
Batch Processing: Instead of manually computing checksums for individual files, this tool efficiently processes entire directories or file lists, making it ideal for:
- Large datasets
- Automated workflows
- Data pipelines
- Archive management
Installation
Prerequisites
- Python 3.11 or higher
- pip package manager
Install from PyPI (recommended)
pip install pride-checksum
Install from source
git clone https://github.com/PRIDE-Archive/pride-checksum.git
cd pride-checksum
pip install -e .
Install for development
pip install -e ".[dev]"
Usage
The tool provides two modes of operation:
Mode 1: Process all files in a directory
pride_checksum --files_dir /path/to/your/files/ --out_path /path/to/save/checksum/
Mode 2: Process files from a list
pride_checksum --files_list_path /path/to/files_list.txt --out_path /path/to/save/checksum/
Command-line options
- --files_dir: Directory containing files to checksum (processes all files in the directory)
- --files_list_path: Path to a text file containing the list of files to process
- --out_path: Directory where the checksum.txt file will be saved (required)
Note: You must specify either --files_dir OR --files_list_path, but not both.
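The "either/or" rule above can be illustrated with a minimal argparse sketch. This is not the tool's actual source; `build_parser` is a hypothetical helper, and only the three option names are taken from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="pride_checksum")
    # Exactly one input source is accepted: a directory or a file list.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--files_dir", help="directory whose files are checksummed")
    source.add_argument("--files_list_path", help="text file listing the files to process")
    # The output directory is always required.
    parser.add_argument("--out_path", required=True, help="directory for checksum.txt")
    return parser
```

With a parser built this way, passing both --files_dir and --files_list_path (or neither) makes argparse exit with a usage error.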
Examples
Example 1: Processing a directory
# Create some test files
mkdir my_data
echo "Sample content" > my_data/file1.txt
echo "More data" > my_data/file2.xml
# Generate checksums
mkdir checksums
pride_checksum --files_dir my_data --out_path checksums
# View the results
cat checksums/checksum.txt
Example 2: Processing files from a list
Create a file list (my_files.txt):
/home/user/documents/report.pdf
/home/user/data/experiment1.csv
/home/user/data/experiment2.csv
Then run:
pride_checksum --files_list_path my_files.txt --out_path /home/user/checksums/
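For illustration, a file list in the format shown above (one path per line) could be read as follows. `read_file_list` is a hypothetical helper, and skipping blank lines is an assumption, not documented behavior of the tool.

```python
from pathlib import Path


def read_file_list(list_path: str) -> list[Path]:
    """Return the paths named in a list file, one per line.

    Blank lines are skipped (an assumption made for this sketch).
    """
    lines = Path(list_path).read_text(encoding="utf-8").splitlines()
    return [Path(line.strip()) for line in lines if line.strip()]
```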
Example 3: Incremental Update (when a file changes)
This example demonstrates the incremental update feature, which is ideal for large datasets:
# Initial run: compute checksums for all files
pride_checksum --files_dir my_data --out_path checksums
# Later, a file is replaced (e.g., file.txt.gz is uncompressed to file.txt)
gunzip my_data/file.txt.gz
# Run again: only the changed file is recomputed, others are reused
pride_checksum --files_dir my_data --out_path checksums
Output of the second run:
[INFO] checksum.txt already exists. Will perform incremental update.
[INFO] Found 3 existing checksum entries.
[INFO] Removing 1 files that no longer exist: ['file.txt.gz']
[ 1 / 3 ] Reusing existing checksum for: file1.txt -> aaf4c61ddcc5e8a2...
[ 2 / 3 ] Processing: /path/to/file.txt
[ 2 / 3 ] Generated checksum for: file.txt -> 356a192b7913b04c54...
[ 3 / 3 ] Reusing existing checksum for: file2.xml -> da39a3ee5e6b4b0d...
[INFO] Incremental update summary: 2 reused, 1 new, 1 removed
Performance benefit: For 1000 files where only 1 file changed, you save 999 checksum computations! ⚡
Example Output
The generated checksum.txt file contains tab-separated values with filename and SHA-1 hash:
# SHA-1 Checksum
file1.txt aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
file2.xml 356a192b7913b04c54574d18c28d46e6395428ab
report.pdf da39a3ee5e6b4b0d3255bfef95601890afd80709
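A file in this format can be produced with Python's standard hashlib module. The sketch below is an illustration of the format described above, not the tool's own code; `sha1_of_file` and `write_checksum_file` are hypothetical names, and the 1 MiB chunk size is an arbitrary choice.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large files need not fit in memory."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(files: list[Path], out_dir: Path) -> Path:
    """Write a header line, then one 'filename<TAB>sha1' line per file."""
    out_file = out_dir / "checksum.txt"
    with out_file.open("w", encoding="utf-8") as handle:
        handle.write("# SHA-1 Checksum\n")
        for path in files:
            handle.write(f"{path.name}\t{sha1_of_file(path)}\n")
    return out_file
```

Only the base filename is written, which matches the example output and is why duplicate filenames in different directories cannot be told apart.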
File Requirements and Limitations
Supported Files
- ✅ Regular files only (no directories)
- ✅ Any file type and size
- ✅ Files with alphanumeric names
- ✅ Files with underscores (_) and hyphens (-)
Restrictions
- ❌ No spaces in filenames - files with spaces will be rejected
- ❌ No special characters except underscore and hyphen
- ❌ No hidden files (files starting with .)
- ❌ No directories in the file list
- ❌ No duplicate filenames (even if in different paths)
Valid filename examples:
✅ data_file.txt
✅ experiment-01.csv
✅ report_2024.pdf
✅ analysis123.xml
Invalid filename examples:
❌ file with spaces.txt
❌ file@symbol.txt
❌ .hidden_file
❌ data%file.csv
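The naming rules above can be approximated with a regular expression. The exact pattern the tool uses is not stated in this README, so the one below is an assumption: letters, digits, underscores, and hyphens, plus dots as extension separators, with no leading dot. `is_valid_filename` is a hypothetical helper.

```python
import re

# Assumed pattern: first character may not be a dot (no hidden files);
# the rest may be letters, digits, underscores, hyphens, or dots.
VALID_NAME = re.compile(r"[A-Za-z0-9_-][A-Za-z0-9_.-]*")


def is_valid_filename(name: str) -> bool:
    """Return True if the whole name matches the assumed pattern."""
    return VALID_NAME.fullmatch(name) is not None
```

Under this pattern the four valid examples above are accepted and the four invalid ones are rejected.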
Important Notes
⚡ Incremental Updates: If checksum.txt already exists in the output directory, the tool will perform an incremental update:
- Reuses existing checksums for files that haven't changed (same filename)
- Computes checksums only for new files or renamed files
- Removes entries for files that no longer exist
- Significantly faster for large datasets when only a few files have changed
This is especially useful when you have a large submission (e.g., 1000 files) and only need to update one or a few files—you don't have to recompute checksums for everything!
Example incremental update output:
[INFO] checksum.txt already exists. Will perform incremental update.
[INFO] Found 3 existing checksum entries.
[INFO] Removing 1 files that no longer exist: ['file.txt.gz']
[ 1 / 3 ] Reusing existing checksum for: file1.txt -> aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
[ 2 / 3 ] Processing: /path/to/file.txt
[ 2 / 3 ] Generated checksum for: file.txt -> 356a192b7913b04c54574d18c28d46e6395428ab
[ 3 / 3 ] Reusing existing checksum for: file2.xml -> da39a3ee5e6b4b0d3255bfef95601890afd80709
[INFO] Incremental update summary: 2 reused, 1 new, 1 removed
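The reuse/compute/remove behavior described above can be sketched as a comparison of filenames. This is a simplified model under the assumption that matching is done purely by filename, as the bullet list suggests; `incremental_update` and its `compute` callback are hypothetical names, not part of the tool.

```python
from typing import Callable


def incremental_update(
    existing: dict[str, str],
    current_names: list[str],
    compute: Callable[[str], str],
) -> tuple[dict[str, str], list[str], list[str], list[str]]:
    """Merge existing checksums with the current set of filenames.

    Returns the updated mapping plus the reused, new, and removed names.
    """
    updated: dict[str, str] = {}
    reused: list[str] = []
    new: list[str] = []
    for name in current_names:
        if name in existing:
            # Same filename: keep the stored checksum without rehashing.
            updated[name] = existing[name]
            reused.append(name)
        else:
            # Unknown filename: compute a fresh checksum.
            updated[name] = compute(name)
            new.append(name)
    # Entries whose files are no longer present are dropped.
    removed = [name for name in existing if name not in updated]
    return updated, reused, new, removed
```

Under this model, a file whose content changed but whose name did not would keep its old checksum, so deleting checksum.txt before rerunning is the conservative choice in that case.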
📝 Progress Tracking: The tool displays progress as it processes files:
[ 1 / 3 ] Processing: /path/to/file1.txt
[ 1 / 3 ] Generated checksum for: file1.txt -> aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
[ 2 / 3 ] Processing: /path/to/file2.xml
...
🔍 Validation: The tool performs extensive validation:
- Checks if files exist and are accessible
- Validates filename format
- Detects duplicate filenames
- Ensures output directory exists and is writable
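Two of these checks, duplicate detection and the output directory test, might look roughly like the sketch below. The function names and exception types are illustrative assumptions; only the error wording is borrowed from the Troubleshooting section.

```python
import os
from collections import Counter
from pathlib import Path


def find_duplicate_names(paths: list[str]) -> list[str]:
    """Return base filenames that occur more than once, regardless of directory."""
    counts = Counter(Path(p).name for p in paths)
    return sorted(name for name, n in counts.items() if n > 1)


def check_output_dir(out_path: str) -> None:
    """Raise if the output directory is missing or not writable."""
    if not os.path.isdir(out_path):
        raise NotADirectoryError(f"Directory doesn't exist: {out_path}")
    if not os.access(out_path, os.W_OK):
        raise PermissionError(f"No permissions to write: {out_path}")
```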
Troubleshooting
Common Issues
Error: "Directory doesn't exist"
- Ensure the --files_dir path exists and is accessible
- Check that the --out_path directory exists (create it if needed)
Error: "Invalid filename"
- Rename files to use only alphanumeric characters, underscores, and hyphens
- Remove spaces and special characters from filenames
Error: "Hidden files are not allowed"
- Remove hidden files (starting with .) from your directory or file list
Error: "Following files have duplicate entries"
- Ensure all files have unique names, even if they're in different directories
- Rename duplicate files before processing
Error: "No permissions to write"
- Check write permissions on the output directory
- Ensure you have sufficient disk space
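For the "Invalid filename" and "Hidden files are not allowed" errors, a small helper can propose a compliant name before you rename anything. This is a sketch only: `suggest_valid_name` is hypothetical, and the allowed character set (letters, digits, underscores, hyphens, dots) is an assumption based on the rules listed in this README.

```python
import re


def suggest_valid_name(name: str) -> str:
    """Propose a name using only the assumed allowed characters."""
    # Replace each run of disallowed characters with a single underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_.-]+", "_", name)
    # Drop leading dots so the result is not a hidden file.
    cleaned = cleaned.lstrip(".")
    return cleaned or "unnamed"
```

Two different originals can map to the same suggestion, so check for duplicate filenames again after renaming.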
Development
Running Tests
python -m pytest tests/ -v
Building Package
pip install build
python -m build
Project Structure
pride-checksum/
├── src/pride_checksum/ # Main source code
├── tests/ # Test files
├── README.md # This file
└── pyproject.toml # Project configuration
Publishing to PyPI
This package is automatically published to PyPI when a new release is created on GitHub. The publishing workflow:
- Create a new release on GitHub with a version tag (e.g., v1.2.0)
- The GitHub Actions workflow automatically builds and publishes the package to PyPI
- The package becomes available at https://pypi.org/project/pride-checksum/
The publishing process uses PyPI's trusted publishing feature for secure authentication.
Use Cases
- Research Data Management: Verify integrity of research datasets
- Archive Validation: Ensure archived files haven't been corrupted
- Data Transfer Verification: Confirm files transferred correctly
- Backup Validation: Verify backup integrity
- Compliance Auditing: Generate checksums for audit trails
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
License
This project is licensed under the MIT License - see the project repository for details.
Support
For questions, issues, or contributions, please visit the GitHub repository.