ds-cache-cleaner
Clean up cached data from ML/data science libraries.
Supported Caches
- HuggingFace Models - ~/.cache/huggingface/hub (models)
- HuggingFace Datasets (Hub) - ~/.cache/huggingface/hub (datasets)
- Transformers - ~/.cache/huggingface/transformers
- HF Datasets - ~/.cache/huggingface/datasets
- ir_datasets - ~/.ir_datasets
- datamaestro (cache) - ~/datamaestro/cache (partial downloads, processing)
- datamaestro (data) - ~/datamaestro/data (downloaded datasets)
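To get a rough idea of how much space these directories occupy before running the cleaner, the locations listed above can be measured with the standard library alone. This is a minimal sketch and not part of ds-cache-cleaner; the names `CACHE_DIRS`, `directory_size`, and `report` are illustrative.

```python
from pathlib import Path

# Cache locations copied from the list above; "~" is expanded at runtime.
CACHE_DIRS = {
    "HuggingFace Hub": "~/.cache/huggingface/hub",
    "Transformers": "~/.cache/huggingface/transformers",
    "HF Datasets": "~/.cache/huggingface/datasets",
    "ir_datasets": "~/.ir_datasets",
    "datamaestro (cache)": "~/datamaestro/cache",
    "datamaestro (data)": "~/datamaestro/data",
}


def directory_size(path: Path) -> int:
    """Return the total size in bytes of regular files under path (0 if missing)."""
    if not path.is_dir():
        return 0
    total = 0
    for p in path.rglob("*"):
        # Skip symlinks so that link-based layouts are not counted twice.
        if p.is_file() and not p.is_symlink():
            total += p.stat().st_size
    return total


def report(cache_dirs: dict) -> dict:
    """Map each cache name to its size on disk in bytes."""
    return {
        name: directory_size(Path(location).expanduser())
        for name, location in cache_dirs.items()
    }
```

Calling `report(CACHE_DIRS)` returns a dictionary of sizes in bytes; directories that do not exist are reported as 0.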
Installation
pip install ds-cache-cleaner
Or with uv:
uv pip install ds-cache-cleaner
Usage
List caches
ds-cache-cleaner list
Show cache entries
ds-cache-cleaner show
ds-cache-cleaner show -c "HuggingFace Hub"
Clean caches
# Interactive mode
ds-cache-cleaner clean
# Clean specific cache
ds-cache-cleaner clean -c "HuggingFace Hub"
# Clean all without prompting
ds-cache-cleaner clean --all
# Dry run
ds-cache-cleaner clean --dry-run
Interactive TUI
ds-cache-cleaner tui
Library Integration
ML libraries can integrate with ds-cache-cleaner to provide rich metadata about their cached data. This enables better descriptions, accurate last-access times, and custom library-specific metadata.
Metadata Format
The metadata is stored in a ds-cache-cleaner/ folder inside each cache directory:
~/.cache/mylib/
├── ds-cache-cleaner/
│   ├── lock               # Lock file for concurrent access
│   ├── information.json   # Cache info and parts list
│   └── part_models.json   # Entries for "models" part
└── ... (actual cache data)
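The layout above can be inspected without the library. The sketch below only relies on what the tree shows: a `ds-cache-cleaner/` folder containing JSON files, with one `part_<name>.json` file per part. The internal schema of those files is not described here, so the code parses them as generic JSON and assumes no particular keys. The function names are illustrative, and the sketch reads without taking the lock, so it is only suitable for read-only inspection.

```python
import json
from pathlib import Path


def read_cache_metadata(cache_path: str) -> dict:
    """Load every JSON file found in <cache_path>/ds-cache-cleaner/.

    Returns a mapping from file name to parsed content; an empty dict means
    no metadata folder was found.
    """
    meta_dir = Path(cache_path).expanduser() / "ds-cache-cleaner"
    if not meta_dir.is_dir():
        return {}
    contents = {}
    for json_file in sorted(meta_dir.glob("*.json")):
        with json_file.open(encoding="utf-8") as fh:
            contents[json_file.name] = json.load(fh)
    return contents


def part_names(metadata: dict) -> list:
    """Derive part names from the part_<name>.json file naming shown above."""
    prefix, suffix = "part_", ".json"
    return [
        name[len(prefix):-len(suffix)]
        for name in metadata
        if name.startswith(prefix) and name.endswith(suffix)
    ]
```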
Using the CacheRegistry API
from ds_cache_cleaner import CacheRegistry

# Initialize once for your library
registry = CacheRegistry(
    cache_path="~/.cache/mylib",
    library="mylib",
    description="My ML Library cache",
)

# Register a part (e.g., models, datasets)
registry.register_part("models", "Downloaded model weights")

# When downloading a new model
registry.register_entry(
    part="models",
    path="bert-base",  # relative path within cache
    description="BERT base model",
    size=438_000_000,
)

# When accessing an existing entry (updates last_access time)
registry.touch("models", "bert-base")

# When deleting an entry (removes from metadata)
registry.remove("models", "bert-base")
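In practice the size passed to `register_entry` is usually measured rather than hard-coded. The helper below is a sketch of how a library might wrap its download step: it measures the entry on disk and forwards the result using the keyword arguments shown above. `size_on_disk` and `record_download` are hypothetical names, and `registry` is only assumed to expose `register_entry(part=..., path=..., description=..., size=...)`.

```python
from pathlib import Path


def size_on_disk(path: Path) -> int:
    """Size in bytes of a file, or of all regular files under a directory."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def record_download(registry, cache_root: str, part: str, rel_path: str,
                    description: str = "") -> int:
    """Register a freshly downloaded entry together with its measured size.

    `registry` is expected to be a CacheRegistry, or any object with the
    same register_entry method.
    """
    size = size_on_disk(Path(cache_root).expanduser() / rel_path)
    registry.register_entry(
        part=part,
        path=rel_path,  # stays relative to the cache root
        description=description,
        size=size,
    )
    return size
```

A library would call `record_download(registry, "~/.cache/mylib", "models", "bert-base", "BERT base model")` once the files are fully written.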
Custom Metadata
You can store library-specific metadata using the metadata parameter:
# Store custom metadata when registering an entry
registry.register_entry(
    part="models",
    path="bert-base-uncased",
    description="BERT base uncased model",
    size=438_000_000,
    metadata={
        "model_type": "bert",
        "revision": "main",
        "tags": ["encoder", "uncased", "english"],
        "framework": "pytorch",
    },
)

# Retrieve entry with its metadata
entry = registry.get_entry("models", "bert-base-uncased")
if entry:
    print(entry.metadata)  # {"model_type": "bert", ...}
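Stored metadata can then be used to select entries. The sketch below filters a list of entries by a value in their `tags` list. It assumes only that each entry has a `metadata` dictionary, as in the example above; `entries_with_tag` is an illustrative helper, not a library function.

```python
from typing import Any, Iterable, List


def entries_with_tag(entries: Iterable[Any], tag: str) -> List[Any]:
    """Keep the entries whose metadata["tags"] list contains `tag`."""
    selected = []
    for entry in entries:
        # Entries without metadata or without a "tags" key are skipped.
        tags = (entry.metadata or {}).get("tags", [])
        if tag in tags:
            selected.append(entry)
    return selected
```

For example, `entries_with_tag(registry.list_entries("models"), "encoder")` would return the entries tagged as encoders.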
Full API Reference
CacheRegistry
| Method | Description |
|---|---|
| `register_part(name, description="")` | Register a new part (e.g., "models", "datasets") |
| `register_entry(part, path, description="", size=None, metadata=None)` | Register or update a cache entry |
| `touch(part, path)` | Update last access time for an entry |
| `remove(part, path) -> bool` | Remove entry from metadata (returns True if found) |
| `update_size(part, path, size)` | Update the size of an entry |
| `get_entry(part, path) -> EntryMetadata \| None` | Get metadata for a specific entry |
| `list_entries(part) -> list[EntryMetadata]` | List all entries in a part |
| `list_parts() -> list[PartInfo]` | List all parts in the cache |
| `parts.<name>` | Access a part via attribute notation (see below) |
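One way to combine these methods is to find entries that have not been accessed recently. The sketch below works on the result of `list_entries(part)` and relies only on the `last_access` attribute of each entry. Whether the library stores naive or timezone-aware datetimes is not stated here, so the reference time is built to match each entry; `stale_entries` is a hypothetical helper.

```python
from datetime import datetime, timedelta
from typing import Any, Iterable, List, Optional


def stale_entries(entries: Iterable[Any], max_age: timedelta,
                  now: Optional[datetime] = None) -> List[Any]:
    """Return entries whose last_access is older than max_age.

    Entries without a recorded last_access are skipped, since their age
    is unknown.
    """
    stale = []
    for entry in entries:
        last = entry.last_access
        if last is None:
            continue
        # Match the entry's timezone awareness to avoid mixing naive and
        # aware datetimes in the subtraction.
        reference = now if now is not None else datetime.now(last.tzinfo)
        if reference - last > max_age:
            stale.append(entry)
    return stale
```

A caller could pass each result to `registry.remove(part, entry.path)`. According to the table above, that call removes the entry from the metadata, so deleting the cached files themselves remains a separate step.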
Parts Accessor
The parts property provides a convenient attribute-style API:
# Instead of:
registry.register_entry("models", entry)
registry.get_entry("models", "bert")
registry.list_entries("models")
# You can use:
registry.parts.models.register(entry)
registry.parts.models.get("bert")
registry.parts.models.list()
PartAccessor methods:
| Method | Description |
|---|---|
| `register(path, ...)` | Register an entry (same args as register_entry) |
| `get(path)` | Get an entry by path |
| `get_entry(path)` | Alias for get() |
| `list()` | List all entries |
| `touch(path)` | Update last access time |
| `remove(path)` | Remove an entry |
| `update_size(path, size)` | Update entry size |
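Attribute-style access of this kind is commonly built on `__getattr__`. The following sketch shows the general pattern with two hypothetical classes, `PartView` and `PartsNamespace`. It is not the library's implementation and covers only three of the methods listed above.

```python
class PartView:
    """Binds a part name so that callers no longer pass it explicitly."""

    def __init__(self, registry, part: str) -> None:
        self._registry = registry
        self._part = part

    def get(self, path: str):
        return self._registry.get_entry(self._part, path)

    def list(self):
        return self._registry.list_entries(self._part)

    def touch(self, path: str) -> None:
        self._registry.touch(self._part, path)


class PartsNamespace:
    """Resolves attribute access such as parts.models into a PartView."""

    def __init__(self, registry) -> None:
        self._registry = registry

    def __getattr__(self, name: str) -> PartView:
        # __getattr__ only runs when normal lookup fails, so _registry
        # itself is found the usual way and never reaches this method.
        if name.startswith("_"):
            raise AttributeError(name)
        return PartView(self._registry, name)
```

The design choice is that the part name is captured once, so every later call on the view forwards to the registry with that name filled in.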
Data Classes
The metadata system uses standard Python dataclasses with pydantic validation:
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

from ds_cache_cleaner import EntryMetadata, PartInfo

# EntryMetadata fields
@dataclass
class EntryMetadata:
    path: str  # Relative path within cache (required)
    description: str = ""  # Human-readable description
    created: datetime | None = None  # When the entry was created
    last_access: datetime | None = None  # Last access time
    size: int | None = None  # Size in bytes
    # A mutable default must go through default_factory in a dataclass
    metadata: dict[str, Any] = field(default_factory=dict)  # Library-specific metadata

# PartInfo fields
@dataclass
class PartInfo:
    name: str  # Part name (e.g., "models")
    description: str = ""  # Human-readable description
Custom Entry Classes
For type-safe entries, subclass EntryMetadata with your own fields. Extra fields are automatically serialized into the metadata dict and reconstructed on read:
from dataclasses import dataclass, field

from ds_cache_cleaner import CacheRegistry, EntryMetadata

@dataclass
class ModelEntry(EntryMetadata):
    """Custom entry for ML models."""
    model_type: str = ""
    revision: str = ""
    tags: list[str] = field(default_factory=list)

# Create registry with custom entry type
registry = CacheRegistry(
    cache_path="~/.cache/mylib",
    library="mylib",
    entry_types={"models": ModelEntry},  # Map part name to entry type
)
registry.register_part("models", "Model weights")

# Register using custom entry instance
entry = ModelEntry(
    path="bert-base",
    description="BERT model",
    model_type="bert",
    revision="v1.0",
    tags=["encoder", "english"],
)
registry.register_entry("models", entry)

# Get entry returns the correct type
model = registry.get_entry("models", "bert-base")
assert isinstance(model, ModelEntry)
print(model.model_type)  # "bert"
print(model.tags)  # ["encoder", "english"]

# List entries also returns custom types
for entry in registry.list_entries("models"):
    print(f"{entry.path}: {entry.model_type}")
The JSON format remains backward compatible - extra fields are stored in the metadata dict:
{
  "path": "bert-base",
  "description": "BERT model",
  "metadata": {
    "model_type": "bert",
    "revision": "v1.0",
    "tags": ["encoder", "english"]
  }
}
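The mapping between subclass fields and the `metadata` dict can be illustrated with plain dataclasses. The sketch below is not the library's code: `BaseEntry` is a reduced stand-in for `EntryMetadata`, and `to_json_dict` and `from_json_dict` are hypothetical helpers. It shows one way to fold extra fields into `metadata` on write and lift them back out on read.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class BaseEntry:
    """Reduced stand-in for EntryMetadata (assumption: only three fields)."""
    path: str
    description: str = ""
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ModelEntry(BaseEntry):
    model_type: str = ""
    revision: str = ""
    tags: List[str] = field(default_factory=list)


BASE_FIELDS = {f.name for f in fields(BaseEntry)}


def to_json_dict(entry: BaseEntry) -> Dict[str, Any]:
    """Serialize an entry, moving subclass-only fields into "metadata"."""
    data = asdict(entry)
    extras = {k: data.pop(k) for k in list(data) if k not in BASE_FIELDS}
    data["metadata"] = {**data["metadata"], **extras}
    return data


def from_json_dict(cls, data: Dict[str, Any]):
    """Rebuild an entry of type cls, lifting its extra fields out of "metadata"."""
    extra_names = {f.name for f in fields(cls)} - BASE_FIELDS
    metadata = dict(data.get("metadata", {}))
    extras = {k: metadata.pop(k) for k in list(metadata) if k in extra_names}
    base = {k: v for k, v in data.items() if k in BASE_FIELDS and k != "metadata"}
    return cls(**base, metadata=metadata, **extras)
```

Because the extra fields live inside `metadata`, a reader that only knows the base fields can still load the same JSON with `from_json_dict(BaseEntry, data)` and see them as ordinary metadata keys.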
Development
# Install with dev dependencies
pip install -e ".[dev]"
# Run tests
hatch run test
# Lint
hatch run lint:check
# Format
hatch run lint:fix
License
MIT