dr-hf

HuggingFace utilities for repository management, dataset operations, and model analysis.

Installation

uv add dr-hf

For model weight analysis (requires PyTorch):

uv add "dr-hf[weights]"

For DuckDB query support:

uv add "dr-hf[duckdb]"

Quick Start

from dr_hf import (
    get_checkpoint_branches,
    parse_branch_name,
    HFLocation,
    download_dataset,
)

# Parse checkpoint branches from a repo
branches = get_checkpoint_branches("org/model-checkpoints")
for branch in branches:
    info = parse_branch_name(branch)
    print(f"Step {info.step}, Seed: {info.seed}")

# Create a location reference for HF datasets
loc = HFLocation(org="allenai", repo_name="my-dataset")
print(loc.repo_uri)  # hf://datasets/allenai/my-dataset

# Download a dataset to local parquet
from pathlib import Path
download_dataset(Path("./data/squad_train.parquet"), repo_id="squad", split="train")

Module Overview

| Module | Purpose | Key Exports |
|---|---|---|
| branches | Branch discovery & parsing | get_checkpoint_branches, parse_branch_name, create_branch_metadata |
| configs | Model config analysis | download_config_file, analyze_model_config, estimate_parameter_count |
| weights ⚡ | Model weight analysis | analyze_model_weights, calculate_weight_statistics |
| checkpoints ⚡ | Checkpoint orchestration | analyze_complete_checkpoint, process_all_checkpoints |
| datasets | Dataset loading & caching | load_or_download_dataset, download_dataset |
| io | HfApi upload/download | upload_file_to_hf, cached_download_tables_from_hf |
| location | HF resource URIs | HFLocation, HFRepoID, HFResource |
| paths | Environment paths | get_data_dir, get_repo_dir |
| models | Pydantic data models | BranchInfo, ConfigAnalysis, WeightsAnalysis, ... |

⚡ = Requires [weights] optional dependency

Documentation

Auto-generated API Docs

# Serve interactive docs locally
uv run pdoc dr_hf

# Generate static HTML
uv run pdoc dr_hf -o docs/api_html

Quick Reference

Branch Operations

from dr_hf import (
    get_all_repo_branches,    # list all branches in repo
    get_checkpoint_branches,  # filter to stepN-seed-* branches
    is_checkpoint_branch,     # check if branch matches pattern
    parse_branch_name,        # extract step/seed -> BranchInfo
    extract_step_from_branch, # get step number
    extract_seed_from_branch, # get seed string
    sort_branches_by_step,    # sort branches by step
    group_branches_by_seed,   # group branches by seed
    create_branch_metadata,   # full repo metadata -> BranchMetadata
)
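
To make the stepN-seed-* naming convention concrete, here is a minimal sketch of how such branch names could be parsed and sorted. It is not dr_hf's implementation: `ParsedBranch`, `parse_checkpoint_branch`, `sort_by_step`, and the exact regular expression are assumptions for illustration, and the real `parse_branch_name` may accept other formats or return different fields.

```python
import re
from typing import NamedTuple, Optional


class ParsedBranch(NamedTuple):
    """Hypothetical stand-in for the fields a parsed branch might carry."""
    step: int
    seed: str


# Assumed pattern: "step<digits>-seed-<anything>", e.g. "step1000-seed-default".
_BRANCH_RE = re.compile(r"^step(\d+)-seed-(.+)$")


def parse_checkpoint_branch(name: str) -> Optional[ParsedBranch]:
    """Return step and seed if the name matches the assumed pattern, else None."""
    match = _BRANCH_RE.match(name)
    if match is None:
        return None
    return ParsedBranch(step=int(match.group(1)), seed=match.group(2))


def sort_by_step(names: list[str]) -> list[str]:
    """Drop non-matching names and order the rest by numeric step."""
    matching = [n for n in names if parse_checkpoint_branch(n) is not None]
    return sorted(matching, key=lambda n: parse_checkpoint_branch(n).step)
```

Sorting on the parsed integer rather than the raw string avoids the usual pitfall where "step1000" would sort before "step200".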

Config Analysis

from dr_hf import (
    download_config_file,           # download config.json
    analyze_model_config,           # parse config -> ConfigAnalysis
    extract_model_architecture_info,# extract architecture -> ArchitectureInfo
    estimate_parameter_count,       # estimate params -> ParameterEstimate
)
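
As a rough illustration of what estimating a parameter count from a config can involve, the sketch below applies a common back-of-the-envelope approximation for a decoder-only transformer: about 12 * d^2 weights per layer (4 * d^2 for attention, 8 * d^2 for an MLP with 4x expansion) plus the token embedding table. The function `rough_param_count` and the config keys it reads are assumptions; `estimate_parameter_count` may use a different and more detailed method.

```python
def rough_param_count(config: dict) -> int:
    """Approximate parameter count, ignoring biases, norms, and any untied output head.

    Assumes Hugging Face style keys: hidden_size, num_hidden_layers, vocab_size.
    """
    d = config["hidden_size"]
    n_layers = config["num_hidden_layers"]
    vocab = config["vocab_size"]

    per_layer = 12 * d * d  # attention (4 * d^2) + MLP with 4x expansion (8 * d^2)
    embeddings = vocab * d  # token embedding table
    return n_layers * per_layer + embeddings


if __name__ == "__main__":
    example = {"hidden_size": 768, "num_hidden_layers": 12, "vocab_size": 50257}
    print(f"{rough_param_count(example):,}")
```

The estimate is only as good as its assumptions: architectures with gated MLPs, grouped-query attention, or a different expansion factor need their own per-layer term.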

Weight Analysis (requires [weights])

from dr_hf import (
    discover_model_weight_files,  # find weight files in repo
    download_model_weights,       # download specific weights
    calculate_weight_statistics,  # analyze weights -> WeightFileStatistics
    calculate_tensor_stats,       # per-tensor stats -> TensorStats
    analyze_layer_structure,      # categorize layers -> LayerAnalysis
    calculate_global_weight_stats,# global stats -> GlobalWeightStats
    analyze_model_weights,        # full workflow -> WeightsAnalysis
)
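
Per-tensor statistics usually mean summary values such as mean, standard deviation, extremes, and a norm. The sketch below computes them over a flat list of floats with only the standard library, so it runs without PyTorch. The function `flat_tensor_stats` and the keys it returns are hypothetical and do not reflect the actual `TensorStats` schema.

```python
import math
import statistics


def flat_tensor_stats(values: list[float]) -> dict:
    """Summarize a flattened tensor given as a plain list of floats."""
    if not values:
        raise ValueError("cannot compute statistics of an empty tensor")
    return {
        "numel": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
        "l2_norm": math.sqrt(sum(v * v for v in values)),
    }
```

With real weights, the same quantities would normally be computed on the tensor itself rather than on a Python list, which is far faster for large layers.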

Checkpoint Analysis (requires [weights])

from dr_hf import (
    download_optimizer_checkpoint, # download optim.pt
    analyze_optimizer_checkpoint,  # parse optimizer -> OptimizerAnalysis
    analyze_complete_checkpoint,   # full analysis -> CheckpointAnalysis
    process_single_checkpoint,     # single branch analysis
    process_all_checkpoints,       # parallel multi-branch
    create_comprehensive_summary,  # DataFrame summary
    create_learning_rate_summary,  # LR-focused summary
    save_checkpoint_analysis,      # save to JSON
    save_all_analyses_outputs,     # save CSVs + JSON
)
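
One plausible shape for a parallel multi-branch workflow is a thread pool that maps an analysis function over branch names and collects failures separately, so that one bad branch does not abort the run. This is a sketch under that assumption, not how `process_all_checkpoints` is implemented; `run_over_branches` and the `analyze` callable are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable


def run_over_branches(
    branches: list[str],
    analyze: Callable[[str], Any],
    max_workers: int = 4,
) -> tuple[dict[str, Any], dict[str, Exception]]:
    """Run `analyze` on each branch concurrently; return (results, errors) keyed by branch."""
    results: dict[str, Any] = {}
    errors: dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(analyze, branch): branch for branch in branches}
        for future in as_completed(futures):
            branch = futures[future]
            try:
                results[branch] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[branch] = exc
    return results, errors
```

Threads suit this kind of work when most of the time is spent waiting on downloads; CPU-heavy analysis may benefit from processes instead.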

Dataset Operations

from dr_hf import (
    load_or_download_dataset, # load from cache or download
    download_dataset,         # download HF dataset to parquet
    sanitize_repo_name,       # convert repo ID to safe filename
)
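
The load-from-cache-or-download idea can be sketched as follows: derive a filesystem-safe name from the repo ID, and only call the download step when the target file is missing. Everything here is illustrative. `safe_name`, `load_or_fetch`, the "__" separator, and the file naming scheme are assumptions, and the real `sanitize_repo_name` and `load_or_download_dataset` may behave differently.

```python
from pathlib import Path
from typing import Callable


def safe_name(repo_id: str) -> str:
    """Turn 'org/repo' into a single path component (assumed scheme: '/' -> '__')."""
    return repo_id.replace("/", "__")


def load_or_fetch(
    cache_dir: Path,
    repo_id: str,
    split: str,
    fetch: Callable[[Path], None],
) -> Path:
    """Return the cached parquet path, calling `fetch(target)` only on a cache miss."""
    target = cache_dir / f"{safe_name(repo_id)}_{split}.parquet"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(target)  # hypothetical downloader that writes the file at `target`
    return target
```

Passing the downloader in as a callable keeps the caching logic testable without network access.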

HfApi I/O

from dr_hf import (
    upload_file_to_hf,            # upload file to HF repo
    cached_download_tables_from_hf,# download parquet with caching
    get_tables_from_cache,        # read cached parquet files
    read_local_parquet_paths,     # list local parquet files
    query_hf_with_duckdb,         # query HF with DuckDB (requires [duckdb])
)

Location Management

from dr_hf import (
    HFLocation,   # Pydantic model for HF dataset locations
    HFRepoID,     # Type alias: "org/repo-name"
    HFResource,   # Type alias: "hf://datasets/org/repo"
)

loc = HFLocation(org="allenai", repo_name="c4")
loc.repo_id        # "allenai/c4"
loc.repo_uri       # "hf://datasets/allenai/c4"
loc.repo_link      # HttpUrl to HF page

# Parse from URI
loc = HFLocation.from_uri("hf://datasets/squad/squad")
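
For readers curious how the ID and URI forms relate, the following standalone sketch mirrors the attributes shown above using a plain dataclass. `DatasetLocation` is a hypothetical stand-in, not the Pydantic-based `HFLocation`; the validation rules in `from_uri` are assumptions.

```python
from dataclasses import dataclass

_PREFIX = "hf://datasets/"


@dataclass(frozen=True)
class DatasetLocation:
    """Illustrative stand-in for a dataset location with org and repo name."""
    org: str
    repo_name: str

    @property
    def repo_id(self) -> str:
        return f"{self.org}/{self.repo_name}"

    @property
    def repo_uri(self) -> str:
        return f"{_PREFIX}{self.repo_id}"

    @classmethod
    def from_uri(cls, uri: str) -> "DatasetLocation":
        """Parse 'hf://datasets/<org>/<repo>'; reject anything else."""
        if not uri.startswith(_PREFIX):
            raise ValueError(f"expected a URI starting with {_PREFIX!r}: {uri!r}")
        parts = uri[len(_PREFIX):].split("/")
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"expected exactly '<org>/<repo>' after the prefix: {uri!r}")
        return cls(org=parts[0], repo_name=parts[1])
```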

Environment Paths

from dr_hf import (
    get_data_dir,  # get DATA_DIR from env
    get_repo_dir,  # get REPO_DIR from env
)
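
Reading a directory from an environment variable generally looks like the sketch below. Whether dr_hf raises, falls back to a default, or creates the directory when DATA_DIR or REPO_DIR is unset is not stated here, so the raise-on-missing behavior and the helper name `dir_from_env` are assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def dir_from_env(var_name: str, environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory named by an environment variable, or raise if it is unset/empty."""
    env = os.environ if environ is None else environ
    value = env.get(var_name)
    if not value:
        raise RuntimeError(f"environment variable {var_name} is not set")
    return Path(value).expanduser()
```

Accepting an optional mapping makes the helper easy to exercise in tests without mutating the real process environment.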

Pydantic Models

from dr_hf import (
    # Branch models
    BranchInfo,           # parsed branch (step, seed, valid)
    SeedBranchInfo,       # step + branch name
    SeedConfiguration,    # seed group metadata
    BranchMetadata,       # full repo branch info

    # Config models
    ConfigAnalysis,       # config.json analysis result
    ArchitectureInfo,     # model architecture details
    ParameterEstimate,    # estimated parameter counts

    # Weight models
    WeightsAnalysis,      # full weight analysis result
    WeightsSummary,       # aggregated weight stats
    WeightFileStatistics, # per-file statistics
    TensorInfo,           # per-tensor metadata
    TensorStats,          # tensor statistics
    LayerAnalysis,        # layer categorization
    LayerCategorization,  # layers by type
    LayerCounts,          # layer count summary
    GlobalWeightStats,    # global weight statistics
    ParameterStats,       # parameter counts

    # Checkpoint models
    CheckpointAnalysis,   # full checkpoint analysis
    CheckpointComponents, # optimizer + config + weights
    CheckpointSummaryRow, # DataFrame row model
    OptimizerAnalysis,    # optimizer state analysis
    OptimizerComponentInfo,# optimizer component details
    LearningRateInfo,     # learning rate extraction
    ParamGroupInfo,       # param group details
)

License

MIT
