
ds-cache-cleaner

Clean up cached data from ML/data science libraries.

Supported Caches

  • HuggingFace Models - ~/.cache/huggingface/hub (models)
  • HuggingFace Datasets (Hub) - ~/.cache/huggingface/hub (datasets)
  • Transformers - ~/.cache/huggingface/transformers
  • HF Datasets - ~/.cache/huggingface/datasets
  • ir_datasets - ~/.ir_datasets
  • datamaestro (cache) - ~/datamaestro/cache (partial downloads, processing)
  • datamaestro (data) - ~/datamaestro/data (downloaded datasets)
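To get a feel for what the tool reports, the size of any of these cache directories can be approximated with a short stdlib sketch (a rough equivalent for illustration, not the tool's actual implementation):

```python
import pathlib


def dir_size(path: str) -> int:
    """Total size in bytes of all regular files under a directory tree."""
    root = pathlib.Path(path).expanduser()
    if not root.exists():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())


# e.g. dir_size("~/.cache/huggingface/hub")
```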

Installation

pip install ds-cache-cleaner

Or with uv:

uv pip install ds-cache-cleaner

Usage

List caches

ds-cache-cleaner list

Show cache entries

ds-cache-cleaner show
ds-cache-cleaner show -c "HuggingFace Hub"

Clean caches

# Interactive mode
ds-cache-cleaner clean

# Clean specific cache
ds-cache-cleaner clean -c "HuggingFace Hub"

# Clean all without prompting
ds-cache-cleaner clean --all

# Dry run
ds-cache-cleaner clean --dry-run

Interactive TUI

ds-cache-cleaner tui

Library Integration

ML libraries can integrate with ds-cache-cleaner to provide rich metadata about their cached data. This enables better descriptions, accurate last-access times, and custom library-specific metadata.

Metadata Format

The metadata is stored in a ds-cache-cleaner/ folder inside each cache directory:

~/.cache/mylib/
├── ds-cache-cleaner/
│   ├── lock                    # Lock file for concurrent access
│   ├── information.json        # Cache info and parts list
│   └── part_models.json        # Entries for "models" part
└── ... (actual cache data)

Using the CacheRegistry API

from ds_cache_cleaner import CacheRegistry

# Initialize once for your library
registry = CacheRegistry(
    cache_path="~/.cache/mylib",
    library="mylib",
    description="My ML Library cache",
)

# Register a part (e.g., models, datasets)
registry.register_part("models", "Downloaded model weights")

# When downloading a new model
registry.register_entry(
    part="models",
    path="bert-base",  # relative path within cache
    description="BERT base model",
    size=438_000_000,
)

# When accessing an existing entry (updates last_access time)
registry.touch("models", "bert-base")

# When deleting an entry (removes from metadata)
registry.remove("models", "bert-base")
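Putting these calls together, a library's load path might look like the following sketch. The registry methods are the ones documented above; `download` is a hypothetical helper standing in for your library's own fetch logic:

```python
def ensure_model(registry, name: str, download) -> str:
    """Return the cache-relative path of a model, registering it on a
    cache miss and touching it on a hit."""
    if registry.get_entry("models", name) is not None:
        # Refresh last_access so cleaners see the entry as recently used
        registry.touch("models", name)
        return name
    # Hypothetical helper: fetches the model, returns bytes written
    size = download(name)
    registry.register_entry(part="models", path=name, size=size)
    return name
```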

Custom Metadata

You can store library-specific metadata using the metadata parameter:

# Store custom metadata when registering an entry
registry.register_entry(
    part="models",
    path="bert-base-uncased",
    description="BERT base uncased model",
    size=438_000_000,
    metadata={
        "model_type": "bert",
        "revision": "main",
        "tags": ["encoder", "uncased", "english"],
        "framework": "pytorch",
    },
)

# Retrieve entry with its metadata
entry = registry.get_entry("models", "bert-base-uncased")
if entry:
    print(entry.metadata)  # {"model_type": "bert", ...}

Full API Reference

CacheRegistry

  • register_part(name, description="") - Register a new part (e.g., "models", "datasets")
  • register_entry(part, path, description="", size=None, metadata=None) - Register or update a cache entry
  • touch(part, path) - Update the last access time for an entry
  • remove(part, path) -> bool - Remove an entry from metadata (returns True if it was found)
  • update_size(part, path, size) - Update the size of an entry
  • get_entry(part, path) -> EntryMetadata | None - Get metadata for a specific entry
  • list_entries(part) -> list[EntryMetadata] - List all entries in a part
  • list_parts() -> list[PartInfo] - List all parts in the cache
  • parts.<name> - Access a part via attribute notation (see below)
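As a usage sketch built only on the methods above, here is how a caller might total up a cache's reported size (assuming entries carry the optional size field as documented):

```python
def total_cache_size(registry) -> int:
    """Sum reported entry sizes across all parts; entries with
    size=None count as 0 bytes."""
    return sum(
        entry.size or 0
        for part in registry.list_parts()
        for entry in registry.list_entries(part.name)
    )
```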

Parts Accessor

The parts property provides a convenient attribute-style API:

# Instead of:
registry.register_entry("models", entry)
registry.get_entry("models", "bert")
registry.list_entries("models")

# You can use:
registry.parts.models.register(entry)
registry.parts.models.get("bert")
registry.parts.models.list()

PartAccessor methods:

  • register(path, ...) - Register an entry (same args as register_entry)
  • get(path) - Get an entry by path
  • get_entry(path) - Alias for get()
  • list() - List all entries
  • touch(path) - Update the last access time
  • remove(path) - Remove an entry
  • update_size(path, size) - Update an entry's size

Data Classes

The metadata system uses standard Python dataclasses with pydantic validation:

from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

from ds_cache_cleaner import EntryMetadata, PartInfo

# EntryMetadata fields
@dataclass
class EntryMetadata:
    path: str                           # Relative path within cache (required)
    description: str = ""               # Human-readable description
    created: datetime | None = None     # When the entry was created
    last_access: datetime | None = None # Last access time
    size: int | None = None             # Size in bytes
    metadata: dict[str, Any] = field(default_factory=dict)  # Library-specific metadata

# PartInfo fields
@dataclass
class PartInfo:
    name: str              # Part name (e.g., "models")
    description: str = ""  # Human-readable description

Custom Entry Classes

For type-safe entries, subclass EntryMetadata with your own fields. Extra fields are automatically serialized into the metadata dict and reconstructed on read:

from dataclasses import dataclass, field
from ds_cache_cleaner import CacheRegistry, EntryMetadata

@dataclass
class ModelEntry(EntryMetadata):
    """Custom entry for ML models."""
    model_type: str = ""
    revision: str = ""
    tags: list[str] = field(default_factory=list)

# Create registry with custom entry type
registry = CacheRegistry(
    cache_path="~/.cache/mylib",
    library="mylib",
    entry_types={"models": ModelEntry},  # Map part name to entry type
)

registry.register_part("models", "Model weights")

# Register using custom entry instance
entry = ModelEntry(
    path="bert-base",
    description="BERT model",
    model_type="bert",
    revision="v1.0",
    tags=["encoder", "english"],
)
registry.register_entry("models", entry)

# Get entry returns the correct type
model = registry.get_entry("models", "bert-base")
assert isinstance(model, ModelEntry)
print(model.model_type)  # "bert"
print(model.tags)        # ["encoder", "english"]

# List entries also returns custom types
for entry in registry.list_entries("models"):
    print(f"{entry.path}: {entry.model_type}")

The JSON format remains backward compatible, since the extra fields are stored in the metadata dict:

{
  "path": "bert-base",
  "description": "BERT model",
  "metadata": {
    "model_type": "bert",
    "revision": "v1.0",
    "tags": ["encoder", "english"]
  }
}
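One way this folding of subclass-only fields into the metadata dict could work is sketched below, using simplified local stand-ins for the library's classes (this is an illustration of the mechanism, not the library's actual serializer):

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any


@dataclass
class EntryMetadata:
    path: str
    description: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class ModelEntry(EntryMetadata):
    model_type: str = ""
    tags: list[str] = field(default_factory=list)


# Field names defined on the base class stay top-level in the JSON
BASE_FIELDS = {f.name for f in fields(EntryMetadata)}


def to_json_dict(entry: EntryMetadata) -> dict[str, Any]:
    """Fold subclass-only fields into the "metadata" dict for storage."""
    d = asdict(entry)
    extras = {k: d.pop(k) for k in list(d) if k not in BASE_FIELDS}
    d["metadata"] = {**d["metadata"], **extras}
    return d
```

Reading an entry back would do the reverse: pop known extra keys out of metadata and pass them to the subclass constructor.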

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
hatch run test

# Lint
hatch run lint:check

# Format
hatch run lint:fix

License

MIT

Project details


Download files

Download the file for your platform.

Source Distribution

ds_cache_cleaner-0.4.0.tar.gz (22.6 kB)


Built Distribution


ds_cache_cleaner-0.4.0-py3-none-any.whl (28.4 kB)


File details

Details for the file ds_cache_cleaner-0.4.0.tar.gz.

File metadata

  • Download URL: ds_cache_cleaner-0.4.0.tar.gz
  • Size: 22.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for ds_cache_cleaner-0.4.0.tar.gz:

  • SHA256: 04bea121ff7b5223ad5662053e5871953e0d54de1daa701cba1aa1d7e875d4b8
  • MD5: d981caaa7f80e745b71b2f9ce3049bc5
  • BLAKE2b-256: 77945018a247868469cae7ed721e3b051c4804e568b1f96070d363a0740e3427


Provenance

The following attestation bundles were made for ds_cache_cleaner-0.4.0.tar.gz:

Publisher: upload-to-pypi.yaml on bpiwowar/ds-cache-cleaner

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file ds_cache_cleaner-0.4.0-py3-none-any.whl.

File metadata

File hashes

Hashes for ds_cache_cleaner-0.4.0-py3-none-any.whl:

  • SHA256: 971263ee313e41e8e4c86cf0034853bcaf3f239e283e6aead02c6a7a399223de
  • MD5: 0eec0353348574837b05dbc3d1129ad3
  • BLAKE2b-256: a92795ff403f599ee1d94e2b6b3b2b1e3505ac99494f27c66be964879370e108


Provenance

The following attestation bundles were made for ds_cache_cleaner-0.4.0-py3-none-any.whl:

Publisher: upload-to-pypi.yaml on bpiwowar/ds-cache-cleaner

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
