Filesystem-based structured storage for data with metadata - the middle ground between scattered files and databases

📚 Shelfie

Simple filesystem-based structured storage for data with metadata

Shelfie helps you organize your data files in a structured, hierarchical way while automatically managing metadata. Think of it as a filing system that creates organized directories based on your data's characteristics and keeps track of important information about each dataset.

🎯 Why Shelfie?

  • Organized: Automatically creates directory structures based on your data's fields
  • Metadata-aware: Stores attributes alongside your data files
  • Flexible: Works with any data that can be saved as CSV, JSON, or pickle
  • Simple: Intuitive API for creating and reading structured datasets
  • Discoverable: Easy to browse and understand your data organization in the filesystem

Shelfie is meant to sit between a full database and writing a custom wrapper around filesystem-based storage for each project.

🏗️ How It Works

Conceptual Model: Database Relations → Directory Structure

Shelfie translates database-style relationships into filesystem organization:

Database Thinking          →    Filesystem Result
─────────────────────────────────────────────────────
Tables: [experiments]      →    Directory Level 1
Tables: [models]           →    Directory Level 2
Tables: [dates]            →    Directory Level 3
Columns: epochs, lr        →    metadata.json
Data: results.csv          →    Attached files

Visual Concept

Root Directory
├── .shelfie.pkl                    # Shelf configuration
├── experiment_1/                   # Field 1 value
│   ├── random_forest/              # Field 2 value
│   │   └── 2025-06-12/             # Field 3 value (auto-generated date)
│   │       ├── metadata.json       # Stored attributes
│   │       ├── results.csv         # Your data files
│   │       └── model.pkl           # More data files
│   ├── gradient_boost/
│   │   └── 2025-06-12/
│   │       ├── metadata.json
│   │       └── results.csv
│   └── neural_network/
│       └── 2025-06-12/
│           ├── metadata.json
│           └── predictions.csv
└── experiment_2/
    └── ...

The Pattern

Shelfie = Filesystem-Based Relational Design

  1. Fields → Directory hierarchy (what you'd normalize into separate tables)
  2. Attributes → Stored metadata (what you'd store as columns in those tables)
  3. Data → Files attached to each record (the actual data your database would reference)
  4. File Paths → Automatically tracked in metadata as <filename>_path__ (for example, results_path__ for results.csv)
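
The mapping above can be sketched with the standard library alone. This is not Shelfie's implementation: `write_record` is a hypothetical helper, and the metadata layout is an assumption based on the examples in this README.

```python
import json
import tempfile
from pathlib import Path


def write_record(root, field_values, attributes):
    """Hypothetical helper: one directory level per field value,
    attributes stored in metadata.json inside the leaf directory."""
    record_dir = Path(root).joinpath(*field_values)
    record_dir.mkdir(parents=True, exist_ok=True)
    (record_dir / "metadata.json").write_text(json.dumps(attributes, indent=2))
    return record_dir


# A temporary directory stands in for the shelf root.
demo_root = Path(tempfile.mkdtemp())
record_dir = write_record(
    demo_root,
    field_values=["experiment_1", "random_forest", "2025-06-12"],
    attributes={"epochs": 100, "lr": 0.001},
)
print(record_dir.relative_to(demo_root).as_posix())
# experiment_1/random_forest/2025-06-12
```

Each field value becomes one path component, so adding a field deepens the tree by one level while attributes stay in a single JSON file per record.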

Traditional Database:

SELECT r.accuracy, e.name, m.type, r.epochs 
FROM results r
JOIN experiments e ON r.experiment_id = e.id  
JOIN models m ON r.model_id = m.id
WHERE e.date = '2025-06-12'

Shelfie Equivalent:

from shelfie import load_from_shelf

data = load_from_shelf("./experiments")
results_df = data['results']  # Already has experiment, model, date columns!
filtered = results_df[results_df['date'] == '2025-06-12']
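
To make the comparison concrete, here is a small pandas sketch that runs without a shelf on disk. The DataFrame is a hand-built stand-in for what `data['results']` might look like; its values are illustrative only, and the column names are assumed from the SQL query above.

```python
import pandas as pd

# Stand-in for data['results']; values are illustrative only.
results_df = pd.DataFrame({
    "experiment": ["baseline", "baseline", "advanced"],
    "model": ["mlp", "cnn", "transformer"],
    "date": ["2025-06-12", "2025-06-12", "2025-06-13"],
    "epochs": [100, 200, 50],
    "accuracy": [0.85, 0.87, 0.89],
})

# WHERE e.date = '2025-06-12'
on_date = results_df[results_df["date"] == "2025-06-12"]

# SELECT accuracy, experiment, model, epochs
selected = on_date[["accuracy", "experiment", "model", "epochs"]]
print(len(selected))  # 2
```

Because the field values are already columns, the WHERE clause becomes a boolean mask and the SELECT list becomes a column selection.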

🚀 Quick Start

Installation

pip install shelfie

Basic Example

import pandas as pd
from shelfie import Shelf, DateField

# Create a shelf for ML experiments
ml_shelf = Shelf(
    root="./experiments",
    fields=["experiment", "model", DateField("date")],  # Directory structure
    attributes=["epochs", "learning_rate"]              # Required metadata
)

# Create a new experiment record
experiment = ml_shelf.create(
    experiment="baseline",
    model="mlp", 
    epochs=100,
    learning_rate=0.001  # Typical learning rate for neural networks
)

# Attach your results
results_df = pd.DataFrame({
    "accuracy": [0.85, 0.87, 0.89], 
    "loss": [0.45, 0.32, 0.28],
    "epoch": [1, 2, 3]
})

experiment.attach(results_df, "results.csv")

This creates:

experiments/
└── baseline/
    └── mlp/
        └── 2025-06-12/
            ├── metadata.json  # {"epochs": 100, "learning_rate": 0.001, "results_path__": "/path/to/results.csv"}
            └── results.csv    # Your data
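
Because a record is just a directory of plain files, it can also be inspected without Shelfie. The sketch below uses only the standard library; `inspect_record` is a hypothetical helper, and the record it reads is built in a temporary directory shaped like the tree above.

```python
import json
import tempfile
from pathlib import Path


def inspect_record(record_dir):
    """Return (metadata dict, sorted names of the other files) for one record directory."""
    record_dir = Path(record_dir)
    metadata = json.loads((record_dir / "metadata.json").read_text())
    data_files = sorted(
        p.name for p in record_dir.iterdir()
        if p.is_file() and p.name != "metadata.json"
    )
    return metadata, data_files


# Build a throwaway directory shaped like the tree above so the sketch runs anywhere.
record_dir = Path(tempfile.mkdtemp()) / "baseline" / "mlp" / "2025-06-12"
record_dir.mkdir(parents=True)
(record_dir / "metadata.json").write_text(
    json.dumps({"epochs": 100, "learning_rate": 0.001})
)
(record_dir / "results.csv").write_text("accuracy,loss,epoch\n0.85,0.45,1\n")

metadata, data_files = inspect_record(record_dir)
print(metadata["epochs"], data_files)  # 100 ['results.csv']
```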

📖 Detailed Examples

1. ML Experiment Tracking

Think of this as three related database tables:

  • experiments table → project field
  • models table → model_type field
  • runs table → date field
  • Attributes: dataset, hyperparams, notes

from shelfie import Shelf, DateField, TimestampField
import pandas as pd

# Set up experiment tracking (defines your "table" relationships)
experiments = Shelf(
    root="./ml_experiments",
    fields=["project", "model_type", DateField("date")],  # Your table hierarchy
    attributes=["dataset", "hyperparams", "notes"]        # Your table columns
)

# Log different experiments
mlp_experiment = experiments.create(
    project="customer_churn",
    model_type="mlp",
    dataset="v2_cleaned",
    hyperparams={"hidden_layers": [128, 64, 32], "dropout": 0.3, "activation": "relu"},
    notes="Multi-layer perceptron with dropout regularization"
)

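# Attach multiple files
# (train_results, test_results, and feature_importance are placeholders
# for DataFrames produced by your own code)
mlp_experiment.attach(train_results, "training_metrics.csv")
mlp_experiment.attach(test_results, "test_results.csv")
mlp_experiment.attach(feature_importance, "feature_importance.csv")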

# Try a different model
cnn_experiment = experiments.create(
    project="customer_churn",
    model_type="cnn",
    dataset="v2_cleaned", 
    hyperparams={"filters": [32, 64, 128], "kernel_size": 3, "learning_rate": 0.0001},
    notes="Convolutional neural network approach"
)

2. Sales Data by Region and Time

Database equivalent:

  • regions table → region field
  • time_periods table → year, quarter fields
  • Attributes: analyst, report_type, data_source

# Organize sales data by geography and time (multi-table relationship)
sales_shelf = Shelf(
    root="./sales_data",
    fields=["region", "year", "quarter"],                    # Geographic + temporal tables
    attributes=["analyst", "report_type", "data_source"]     # Report metadata columns
)

# Store Q1 data for North America
na_q1 = sales_shelf.create(
    region="north_america",
    year="2025", 
    quarter="Q1",
    analyst="john_doe",
    report_type="quarterly_summary",
    data_source="salesforce"
)

sales_data = pd.DataFrame({
    "product": ["A", "B", "C"],
    "revenue": [150000, 200000, 180000],
    "units_sold": [1500, 2000, 1800]
})

na_q1.attach(sales_data, "quarterly_sales.csv")

3. Survey Data Organization

Database tables: survey_types → demographics → timestamps

# Organize survey responses by type and demographics
surveys = Shelf(
    root="./survey_data",
    fields=["survey_type", "demographic", TimestampField("timestamp")],  # Survey taxonomy
    attributes=["sample_size", "methodology", "response_rate"]            # Survey metadata
)

# Store customer satisfaction survey
survey = surveys.create(
    survey_type="customer_satisfaction",
    demographic="millennials",
    sample_size=1000,
    methodology="online_panel", 
    response_rate=0.23
)

responses = pd.DataFrame({
    "question_id": [1, 2, 3, 4, 5],
    "avg_score": [4.2, 3.8, 4.1, 3.9, 4.0],
    "response_count": [920, 915, 898, 901, 911]
})

survey.attach(responses, "responses.csv")

📊 Reading Your Data Back

The Magic: Automatic JOIN Operations

Unlike databases where you need explicit JOINs, Shelfie automatically combines your "table" relationships:

from shelfie import load_from_shelf

# Load all data from experiments shelf
data = load_from_shelf("./ml_experiments")

# Returns a dictionary of DataFrames - like running multiple JOINed queries:
# {
#   'metadata': All experiment metadata with project+model+date info,
#   'training_metrics': Training data with experiment context automatically joined,
#   'test_results': Test data with experiment context automatically joined,
#   ...
# }

# Analyze all your experiments - no JOINs needed!
print(data['metadata'])  # Overview of all experiments
print(data['training_metrics'])  # All training metrics with full context

# Note: File paths are stored as filename_path__ columns (e.g., 'training_metrics_path__')

What you get automatically:

  • Denormalized DataFrames: Each CSV gets the shelf's field columns added (project, model_type, and date in this example)
  • Full Context: Every row knows its complete "relational" context
  • No JOIN complexity: Relationships are already materialized
  • Pandas-ready: Immediate analysis without SQL knowledge

Each DataFrame automatically includes:

  • Original data columns: Your actual data
  • Attribute columns: Metadata from your "table columns" (hyperparams, notes, etc.)
  • Field columns: Directory structure as relational context (project, model_type, date)
  • File path columns: References as filename_path__ columns
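
The denormalization described above can be approximated by hand. The sketch below is not how `load_from_shelf` is implemented; `collect_rows` is a hypothetical helper that walks a shelf-like directory with the standard library and adds field values and `metadata.json` attributes to every CSV row. The layout and sample values are assumptions taken from the examples in this README.

```python
import csv
import json
import tempfile
from pathlib import Path


def collect_rows(root, field_names, csv_name):
    """Hypothetical sketch: read every <root>/<f1>/<f2>/.../<csv_name> and
    add field values and metadata.json attributes to each row."""
    root = Path(root)
    pattern = "/".join(["*"] * len(field_names) + [csv_name])
    rows = []
    for csv_path in sorted(root.glob(pattern)):
        record_dir = csv_path.parent
        # Directory names, in order, are the field values for this record.
        context = dict(zip(field_names, record_dir.relative_to(root).parts))
        metadata_path = record_dir / "metadata.json"
        if metadata_path.exists():
            context.update(json.loads(metadata_path.read_text()))
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                rows.append({**context, **row})
    return rows


# Throwaway shelf-like layout with two records.
root = Path(tempfile.mkdtemp())
for model, acc in [("mlp", "0.85"), ("cnn", "0.87")]:
    record = root / "customer_churn" / model / "2025-06-12"
    record.mkdir(parents=True)
    (record / "metadata.json").write_text(json.dumps({"dataset": "v2_cleaned"}))
    (record / "test_results.csv").write_text(f"accuracy\n{acc}\n")

rows = collect_rows(root, ["project", "model_type", "date"], "test_results.csv")
print(len(rows))  # 2
```

Passing `rows` to `pandas.DataFrame` gives one table in which every row carries its project, model type, date, and attributes, which is the shape the bullets above describe. Values read through `csv.DictReader` are strings, so numeric columns would still need converting.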

🛠️ Advanced Features

Custom Fields with Defaults

from shelfie import Shelf, Field, DateField, TimestampField

# Field with a default value
shelf = Shelf(
    root="./data",
    fields=[
        "experiment",
        Field("environment", default="production"),  # Always "production" unless specified
        DateField("date"),                          # Auto-generates today's date
        TimestampField("timestamp")                 # Auto-generates current timestamp
    ],
    attributes=["version"]
)

# Only need to specify non-default fields
record = shelf.create(
    experiment="test_1",
    version="1.0"
)
# Creates: ./data/test_1/production/2025-06-12/2025-06-12_14-30-45/
# Metadata includes: version, plus a <filename>_path__ entry for each attached file
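
The auto-generated directory names in the path above look like formatted dates and timestamps. As a sketch, the two helpers below reproduce those names with `datetime.strftime`. The format strings are inferred from the example path only, and the helper names are hypothetical; Shelfie's actual formats may differ.

```python
from datetime import datetime


def date_dirname(moment):
    """Format a datetime like the date directory in the example path."""
    return moment.strftime("%Y-%m-%d")


def timestamp_dirname(moment):
    """Format a datetime like the timestamp directory in the example path."""
    return moment.strftime("%Y-%m-%d_%H-%M-%S")


moment = datetime(2025, 6, 12, 14, 30, 45)
print(date_dirname(moment))       # 2025-06-12
print(timestamp_dirname(moment))  # 2025-06-12_14-30-45
```

Names in this style sort chronologically and avoid characters such as `:` that some filesystems reject in directory names.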

Multiple File Types

# Attach different file types
record.attach(results_df, "results.csv")           # CSV
record.attach(model_config, "config.json")         # JSON  
record.attach(trained_model, "model.pkl")          # Pickle
record.attach(report_text, "summary.txt")          # Text
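
One way to picture how a single call can handle several formats is a dispatcher keyed on the file suffix. The sketch below is an assumption about the general technique, not Shelfie's code; `save_by_extension` is a hypothetical function, and the CSV branch relies on the object exposing a pandas-style `to_csv` method.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_by_extension(obj, path):
    """Hypothetical dispatcher: choose a serializer from the file suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # Assumes a pandas-like object exposing to_csv().
        obj.to_csv(path, index=False)
    elif suffix == ".json":
        path.write_text(json.dumps(obj, indent=2))
    elif suffix == ".pkl":
        path.write_bytes(pickle.dumps(obj))
    elif suffix == ".txt":
        path.write_text(str(obj))
    else:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return path


out_dir = Path(tempfile.mkdtemp())
save_by_extension({"layers": [128, 64]}, out_dir / "config.json")
save_by_extension({"weights": [0.1, 0.2]}, out_dir / "model.pkl")
save_by_extension("All runs finished.", out_dir / "summary.txt")
print(sorted(p.name for p in out_dir.iterdir()))
# ['config.json', 'model.pkl', 'summary.txt']
```

Raising on unknown suffixes keeps a typo in a file name from silently writing data in the wrong format.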

Loading Existing Shelves

# Load a shelf that was created elsewhere
existing_shelf = Shelf.load_from_root("./experiments")

# Continue adding to it
new_experiment = existing_shelf.create(
    experiment="advanced",
    model="transformer",
    epochs=50,
    learning_rate=0.0001  # Lower learning rate for transformer models
)

🗂️ Directory Structure Examples

Before Shelfie

my_project/
├── experiment1_mlp_results.csv
├── experiment1_mlp_model.pkl
├── experiment2_cnn_results.csv
├── experiment2_cnn_model.pkl
├── baseline_test_data.csv
├── advanced_test_data.csv
└── notes.txt  # Which file belongs to what?

After Shelfie

my_project/
├── baseline/
│   ├── mlp/
│   │   └── 2025-06-12/
│   │       ├── metadata.json      # {"epochs": 100, "lr": 0.001, "results_path__": "/path/results.csv"}
│   │       ├── results.csv
│   │       └── model.pkl
│   └── cnn/
│       └── 2025-06-12/
│           ├── metadata.json      # {"epochs": 200, "lr": 0.0001, "results_path__": "/path/results.csv"}
│           ├── results.csv
│           └── model.pkl
└── advanced/
    └── transformer/
        └── 2025-06-12/
            ├── metadata.json      # {"epochs": 50, "lr": 0.0001, "results_path__": "/path/results.csv"}
            ├── results.csv
            └── model.pkl

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📄 License

This project is licensed under the MIT License.


Happy organizing! 📚✨
