Python scripts for deduplicating folders and unarchiving files.
PyStou
Welcome to PyStou – your ultimate toolkit for keeping your filesystem tidy and organized! Whether you're a developer drowning in duplicate folders or someone who loves archiving files but hates the clutter, PyStou is here to rescue you from chaos with style and efficiency.
PyStou is proudly developed by the International Consortium of Investigative Journalists (ICIJ), aiming to empower users with tools to manage and maintain large amounts of files.
Features
- Automatically identify and manage duplicate directories, ensuring you only keep what you need.
- Effortlessly extract a wide range of archive formats, including .zip, .tar.gz, .zst, and .pst.
- Support for split ZIP archives (.z01, .z02, etc.) with automatic detection.
- Nested archive extraction for archives containing other archives.
- Parallel archive extraction for faster processing of multiple archives.
- Remove junk files (.DS_Store, Thumbs.db, __MACOSX, etc.) with a single command.
- Detect file type mismatches and encrypted archives.
- Get comprehensive directory statistics including file counts, sizes, and types.
- Find and remove empty directories safely.
- Choose to interact with each file/archive or set default actions for seamless automation.
- Keep track of all actions with detailed JSON-formatted logs for easy troubleshooting.
- Pure native Python scripts ready to run out-of-the-box (except for necessary command-line tools).
Installation
Getting started with PyStou is a breeze! Follow the steps below to install and set up the project on your machine.
Prerequisites
- Python 3.7 or higher is required.
- Command-Line Tools:
  - p7zip-full: Required for extracting split ZIP archives (.z01, .z02, etc.).
  - pst-utils: Required for extracting .pst files.
  - zstd: Required for handling .zst files.
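Before running PyStou, it can help to confirm that the external tools are on your PATH. The snippet below is a minimal sketch, not part of PyStou: the function name `missing_tools` is hypothetical, and the mapping from executable names (7z, readpst, zstd) to package names is an assumption based on how these packages are commonly shipped.

```python
import shutil

# Executable name -> package that usually provides it.
# This mapping is an assumption for illustration, not taken from PyStou.
REQUIRED_TOOLS = {
    "7z": "p7zip-full",
    "readpst": "pst-utils",
    "zstd": "zstd",
}


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return sorted(pkg for exe, pkg in tools.items() if shutil.which(exe) is None)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All external tools found.")
```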
Clone the Repository
git clone https://github.com/ICIJ/pystou.git
cd pystou
Install the Package
PyStou can be installed using pip. It is pure Python and needs no additional Python packages; only the command-line tools listed under Prerequisites must be installed separately.
pip install .
Note: You might need to use pip3 and/or sudo depending on your system configuration.
Usage
PyStou provides a unified command-line interface with several subcommands.
pystou --help
pystou dedup --help
pystou extract --help
pystou cleanup --help
pystou identify --help
pystou stats --help
pystou empty --help
Deduplicate Folders
Purpose: Identify and manage duplicate directories to keep your filesystem clean.
Command:
pystou dedup [directory] [options]
Parameters:
directory: (Optional) The root directory to start scanning from. Defaults to the current directory if not specified.
Options:
- -r, --recursive: Recursively process subdirectories.
- -l LEVEL, --level LEVEL: Maximum depth level for recursion (default: unlimited).
- -c CHOICE, --default-choice CHOICE: Default action to apply to all duplicate groups.
  - 1: Delete duplicates.
  - 2: Merge contents and delete duplicates.
  - 3: Skip (do nothing).
- -n, --dry-run: Perform a dry run without making any changes.
- --log-dir LOG_DIR: Directory to store log files (default: current directory).
- --db-dir DB_DIR: Directory to store index database (default: current directory).
Examples:
- Interactive Mode:
  pystou dedup /path/to/your/folders -r
  The script will prompt you for each duplicate group found.
- Automated Mode with Default Choice (Delete Duplicates):
  pystou dedup /path/to/your/folders -r -c 1
- Dry Run Mode:
  pystou dedup /path/to/your/folders -r -n
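To give an idea of what "duplicate directories" can mean in practice, here is a minimal sketch that fingerprints each directory by hashing the relative paths and contents of its files. This is an illustration only: the helper names `dir_fingerprint` and `find_duplicate_dirs` are hypothetical, and PyStou's actual comparison strategy and index database may work differently.

```python
import hashlib
import os
from collections import defaultdict


def dir_fingerprint(path):
    """Hash the relative paths and contents of every file under path."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            full = os.path.join(root, name)
            digest.update(os.path.relpath(full, path).encode("utf-8"))
            digest.update(b"\0")
            with open(full, "rb") as fh:
                for chunk in iter(lambda: fh.read(65536), b""):
                    digest.update(chunk)
            digest.update(b"\0")
    return digest.hexdigest()


def find_duplicate_dirs(root):
    """Group the immediate subdirectories of root that share a fingerprint."""
    with os.scandir(root) as it:
        entries = sorted(it, key=lambda e: e.name)
    groups = defaultdict(list)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            groups[dir_fingerprint(entry.path)].append(entry.path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Note that in this sketch all directories without any files share the same fingerprint, so they are reported as duplicates of each other.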
Extract Archives
Purpose: Extract various archive formats efficiently and manage them post-extraction.
Supported Formats:
- Standard: .zip, .tar, .tar.gz, .tgz, .tar.bz2, .tbz, .gz, .bz2
- Zstandard: .zst, .tar.zst, .tzst
- Outlook: .pst
- Split ZIP: .z01, .z02, ... (automatically detected with main .zip file)
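Split ZIP detection pairs the numbered parts with the main .zip file that sits next to them. A minimal sketch of that pairing, assuming the parts share the main archive's base name, could look like this. The function `find_split_archives` is hypothetical and is not PyStou's implementation.

```python
import re
from pathlib import Path

# Matches the suffix of a split part such as ".z01" or ".z02".
SPLIT_PART = re.compile(r"\.z\d{2,}$", re.IGNORECASE)


def find_split_archives(directory):
    """Map each main .zip file to the sorted list of its .zNN part files."""
    result = {}
    for part in sorted(Path(directory).iterdir()):
        if part.is_file() and SPLIT_PART.search(part.name):
            main = part.with_suffix(".zip")
            if main.is_file():  # ignore parts whose main archive is missing
                result.setdefault(main, []).append(part)
    return result
```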
Command:
pystou extract [directory] [options]
Parameters:
directory: (Optional) The root directory to start searching for archives. Defaults to the current directory if not specified.
Options:
- -r, --recursive: Recursively search subdirectories for archives.
- -c CHOICE, --default-choice CHOICE: Default action to apply to all archives.
  - 1: Extract archives.
  - 2: Skip (do nothing).
- -dc DELETE_CHOICE, --default-delete-choice DELETE_CHOICE: Default action when prompted to delete archives after extraction.
  - 1: Delete the archive after extraction.
  - 2: Keep the archive after extraction.
- -p N, --parallel N: Number of parallel extraction workers (default: 1). Requires the -c flag.
- -N, --nested: Recursively extract archives found inside extracted content.
- --max-depth N: Maximum nesting depth for --nested (default: 10).
- -n, --dry-run: Perform a dry run without making any changes.
- --log-dir LOG_DIR: Directory to store log files (default: current directory).
- --db-dir DB_DIR: Directory to store index database (default: current directory).
Examples:
- Interactive Mode:
  pystou extract /path/to/archives -r
  The script will prompt you for each archive found, asking whether to extract or skip.
- Automated Mode with Default Choices (Extract and Delete Archives):
  pystou extract /path/to/archives -r -c 1 -dc 1
- Parallel Extraction (4 workers):
  pystou extract /path/to/archives -r -c 1 -dc 2 -p 4
- Nested Extraction (archives inside archives):
  pystou extract /path/to/archives -r -c 1 -dc 1 --nested
- Dry Run Mode:
  pystou extract /path/to/archives -r -n
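The idea behind parallel extraction can be sketched with the standard library alone. The example below handles only .zip files and uses a thread pool; both choices are assumptions made for illustration, and the names `extract_one` and `extract_all` are hypothetical. PyStou's own worker model may differ.

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def extract_one(archive):
    """Extract a .zip into a sibling directory named after the archive."""
    archive = Path(archive)
    target = archive.with_suffix("")  # "docs.zip" -> "docs"
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    return target


def extract_all(archives, workers=4):
    """Extract several .zip archives concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_one, archives))
```

Running several extractions at once cannot stop to ask a question for each archive, which is presumably why -p requires a default choice via -c.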
Cleanup Junk Files
Purpose: Remove common junk files created by operating systems and applications.
Removed by default:
- macOS: .DS_Store, ._.DS_Store, ._* files, __MACOSX, .AppleDouble, .Spotlight-V100, .Trashes, .fseventsd, .TemporaryItems, .LSOverride
- Windows: Thumbs.db, ehthumbs.db, ehthumbs_vista.db, desktop.ini
Command:
pystou cleanup [directory] [options]
Options:
- -r, --recursive: Recursively process subdirectories.
- --include PATTERN: Additional file/directory names to remove (can be used multiple times).
- --list-only: Only list junk files without removing them.
- -n, --dry-run: Perform a dry run without making any changes.
Examples:
- List junk files:
  pystou cleanup /path/to/folder -r --list-only
- Remove junk files:
  pystou cleanup /path/to/folder -r
- Remove additional patterns:
  pystou cleanup /path/to/folder -r --include ".gitkeep" --include "*.bak"
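For a feel of how name-based junk detection behaves, here is a minimal sketch that matches names against shell-style patterns. It assumes --include patterns are glob-like (as the "*.bak" example suggests) and uses only a subset of the default names; `is_junk` and `list_junk` are hypothetical helpers, not PyStou functions.

```python
import fnmatch
import os

# A subset of the default junk names listed in this section.
JUNK_PATTERNS = [".DS_Store", "._*", "__MACOSX", "Thumbs.db", "desktop.ini"]


def is_junk(name, extra_patterns=()):
    """Return True if a file or directory name matches any junk pattern."""
    patterns = [*JUNK_PATTERNS, *extra_patterns]
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)


def list_junk(root, extra_patterns=()):
    """Return sorted paths of junk files and directories under root."""
    found = []
    for current, dirs, files in os.walk(root):
        for name in list(dirs):
            if is_junk(name, extra_patterns):
                found.append(os.path.join(current, name))
                dirs.remove(name)  # do not descend into a junk directory
        for name in files:
            if is_junk(name, extra_patterns):
                found.append(os.path.join(current, name))
    return sorted(found)
```

This sketch only lists candidates, similar in spirit to --list-only; it never deletes anything.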
Identify File Types
Purpose: Detect file types and find potential issues like mismatched extensions or encrypted archives.
Command:
pystou identify [directory] [options]
Options:
- -r, --recursive: Recursively process subdirectories.
- --check-mismatch: Check for files with mismatched extensions.
- --check-encrypted: Check for encrypted ZIP archives.
- --check-all: Run all checks.
- --extensions EXT: Comma-separated list of extensions to check (e.g., .zip,.pdf).
Examples:
- Find mismatched extensions:
  pystou identify /path/to/folder -r --check-mismatch
- Find encrypted archives:
  pystou identify /path/to/folder -r --check-encrypted
- Run all checks on specific extensions:
  pystou identify /path/to/folder -r --check-all --extensions ".zip,.pdf,.docx"
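Both checks can be approximated with the standard library. The sketch below compares a file's leading bytes with a small table of well-known signatures and reads the encryption bit (bit 0 of the general purpose flags) from each ZIP entry. The signature table and the function names are assumptions for illustration; PyStou's detection logic may be more thorough.

```python
import os
import zipfile

# Well-known leading bytes for a few formats (illustrative, not exhaustive).
MAGIC = {
    ".zip": b"PK\x03\x04",
    ".pdf": b"%PDF-",
    ".gz": b"\x1f\x8b",
}


def extension_matches(path):
    """Return True/False for known extensions, or None if the extension is unknown."""
    magic = MAGIC.get(os.path.splitext(path)[1].lower())
    if magic is None:
        return None
    with open(path, "rb") as fh:
        return fh.read(len(magic)) == magic


def is_encrypted_zip(path):
    """Return True if any entry in the ZIP has its encryption flag set."""
    with zipfile.ZipFile(path) as zf:
        return any(info.flag_bits & 0x1 for info in zf.infolist())
```

One caveat of this simple table: an empty ZIP archive starts with a different signature (PK\x05\x06), so it would be reported as a mismatch here.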
Directory Statistics
Purpose: Display comprehensive statistics about files and directories.
Command:
pystou stats [directory] [options]
Options:
- -r, --recursive: Recursively process subdirectories.
- --top N: Number of top items to show (default: 10).
- --by-extension: Show breakdown by file extension.
- --by-size: Show largest files.
- --json: Output statistics in JSON format.
Examples:
- Show directory statistics:
  pystou stats /path/to/folder -r
- Show largest files:
  pystou stats /path/to/folder -r --by-size --top 20
- Output as JSON:
  pystou stats /path/to/folder -r --json
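A breakdown by extension boils down to counting files and summing sizes per suffix. Here is a minimal sketch of that aggregation; the function `extension_stats` and the shape of its result are hypothetical and do not reflect PyStou's actual JSON output.

```python
import os
from collections import Counter


def extension_stats(root, top=10):
    """Return the most common file extensions under root with counts and sizes."""
    counts = Counter()
    sizes = Counter()
    for current, _dirs, files in os.walk(root):
        for name in files:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            try:
                size = os.path.getsize(os.path.join(current, name))
            except OSError:
                continue  # skip files that vanish or cannot be read
            counts[ext] += 1
            sizes[ext] += size
    return [
        {"extension": ext, "files": n, "bytes": sizes[ext]}
        for ext, n in counts.most_common(top)
    ]
```

The returned list contains only plain dictionaries, so it can be passed directly to json.dumps.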
Empty Directories
Purpose: Find and remove empty directories.
Command:
pystou empty [directory] [options]
Options:
- -r, --recursive: Recursively process subdirectories.
- --list-only: Only list empty directories without removing them.
- --include-hidden: Include hidden directories (starting with .).
- -n, --dry-run: Perform a dry run without making any changes.
Examples:
- List empty directories:
  pystou empty /path/to/folder -r --list-only
- Remove empty directories:
  pystou empty /path/to/folder -r
- Include hidden directories:
  pystou empty /path/to/folder -r --include-hidden
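Finding empty directories safely is easiest bottom-up, so that a directory holding only empty subdirectories can itself be recognised as empty. The sketch below takes that approach; treating such nested cases as empty is an assumption, `find_empty_dirs` is a hypothetical helper, and PyStou's own rules may differ.

```python
import os


def find_empty_dirs(root, include_hidden=False):
    """Return directories under root that contain no files, directly or nested."""
    empty = set()
    # topdown=False visits children before their parents.
    for current, dirs, files in os.walk(root, topdown=False):
        if current == root:
            continue  # never report the starting directory itself
        if not include_hidden and os.path.basename(current).startswith("."):
            continue  # a skipped hidden directory also keeps its parent non-empty
        if not files and all(os.path.join(current, d) in empty for d in dirs):
            empty.add(current)
    return sorted(empty)
```

This sketch only lists candidates, comparable to --list-only. Symlinked directories are not followed, so a directory containing one is never reported as empty.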
Running Tests
PyStou includes a suite of unit tests to ensure everything works smoothly. Here's how to run them:
make test
Or manually:
python3 -m unittest discover tests
Note: Ensure you have all necessary command-line tools installed (readpst, zstd, 7z) before running tests that involve archive extraction.
License
Distributed under the MIT License. See LICENSE for more information.