Filesystem-based structured storage for data with metadata - the middle ground between scattered files and databases
Project description
๐ Shelfie
Simple filesystem-based structured storage for data with metadata
Shelfie helps you organize your data files in a structured, hierarchical way while automatically managing metadata. Think of it as a filing system that creates organized directories based on your data's characteristics and keeps track of important information about each dataset.
๐ฏ Why Shelfie?
- Organized: Automatically creates directory structures based on your data's fields
- Metadata-aware: Stores attributes alongside your data files
- Flexible: Works with any data that can be saved as CSV, JSON, or pickle
- Simple: Intuitive API for creating and reading structured datasets
- Discoverable: Easy to browse and understand your data organization in the filesystem
Shelfie is meant to be an in between a full database and having to create a wrapper for a filesystem based storage for each project.
๐๏ธ How It Works
Conceptual Model: Database Relations โ Directory Structure
Shelfie translates database-style relationships into filesystem organization:
Database Thinking โ Filesystem Result
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Tables: [experiments] โ Directory Level 1
Tables: [models] โ Directory Level 2
Tables: [dates] โ Directory Level 3
Columns: epochs, lr โ metadata.json
Data: results.csv โ Attached files
Visual Concept
Root Directory
โโโ .shelfie.pkl # Shelf configuration
โโโ experiment_1/ # Field 1 value
โ โโโ random_forest/ # Field 2 value
โ โ โโโ 2025-06-12/ # Field 3 value (auto-generated date)
โ โ โ โโโ metadata.json # Stored attributes
โ โ โ โโโ results.csv # Your data files
โ โ โ โโโ model.pkl # More data files
โ โ โโโ gradient_boost/
โ โ โโโ 2025-06-12/
โ โ โโโ metadata.json
โ โ โโโ results.csv
โ โโโ neural_network/
โ โโโ 2025-06-12/
โ โโโ metadata.json
โ โโโ predictions.csv
โโโ experiment_2/
โโโ ...
The Pattern
Shelfie = Filesystem-Based Relational Design
- Fields โ Directory hierarchy (what you'd normalize into separate tables)
- Attributes โ Stored metadata (what you'd store as columns in those tables)
- Data โ Files attached to each record (the actual data your database would reference)
- File Paths โ Automatically tracked as
filename_path__in metadata
Traditional Database:
SELECT r.accuracy, e.name, m.type, r.epochs
FROM results r
JOIN experiments e ON r.experiment_id = e.id
JOIN models m ON r.model_id = m.id
WHERE e.date = '2025-06-12'
Shelfie Equivalent:
data = load_from_shelf("./experiments")
results_df = data['results'] # Already has experiment, model, date columns!
filtered = results_df[results_df['date'] == '2025-06-12']
๐ Quick Start
Installation
pip install shelfie
Basic Example
import pandas as pd
from shelfie import Shelf, DateField
# Create a shelf for ML experiments
ml_shelf = Shelf(
root="./experiments",
fields=["experiment", "model", DateField("date")], # Directory structure
attributes=["epochs", "learning_rate"] # Required metadata
)
# Create a new experiment record
experiment = ml_shelf.create(
experiment="baseline",
model="mlp",
epochs=100,
learning_rate=0.001 # Typical learning rate for neural networks
)
# Attach your results
results_df = pd.DataFrame({
"accuracy": [0.85, 0.87, 0.89],
"loss": [0.45, 0.32, 0.28],
"epoch": [1, 2, 3]
})
experiment.attach(results_df, "results.csv")
This creates:
experiments/
โโโ baseline/
โโโ mlp/
โโโ 2025-06-12/
โโโ metadata.json # {"epochs": 100, "learning_rate": 0.001, "results_path__": "/path/to/results.csv"}
โโโ results.csv # Your data
๐ Detailed Examples
1. ML Experiment Tracking
Think of this as three related database tables:
experimentstable โprojectfieldmodelstable โmodel_typefieldrunstable โdatefield- Attributes:
dataset,hyperparams,notes
from shelfie import Shelf, DateField, TimestampField
import pandas as pd
# Set up experiment tracking (defines your "table" relationships)
experiments = Shelf(
root="./ml_experiments",
fields=["project", "model_type", DateField("date")], # Your table hierarchy
attributes=["dataset", "hyperparams", "notes"] # Your table columns
)
# Log different experiments
mlp_experiment = experiments.create(
project="customer_churn",
model_type="mlp",
dataset="v2_cleaned",
hyperparams={"hidden_layers": [128, 64, 32], "dropout": 0.3, "activation": "relu"},
notes="Multi-layer perceptron with dropout regularization"
)
# Attach multiple files
mlp_experiment.attach(train_results, "training_metrics.csv")
mlp_experiment.attach(test_results, "test_results.csv")
mlp_experiment.attach(feature_importance, "feature_importance.csv")
# Try a different model
cnn_experiment = experiments.create(
project="customer_churn",
model_type="cnn",
dataset="v2_cleaned",
hyperparams={"filters": [32, 64, 128], "kernel_size": 3, "learning_rate": 0.0001},
notes="Convolutional neural network approach"
)
2. Sales Data by Region and Time
Database equivalent:
regionstable โregionfieldtime_periodstable โyear,quarterfields- Attributes:
analyst,report_type,data_source
# Organize sales data by geography and time (multi-table relationship)
sales_shelf = Shelf(
root="./sales_data",
fields=["region", "year", "quarter"], # Geographic + temporal tables
attributes=["analyst", "report_type", "data_source"] # Report metadata columns
)
# Store Q1 data for North America
na_q1 = sales_shelf.create(
region="north_america",
year="2025",
quarter="Q1",
analyst="john_doe",
report_type="quarterly_summary",
data_source="salesforce"
)
sales_data = pd.DataFrame({
"product": ["A", "B", "C"],
"revenue": [150000, 200000, 180000],
"units_sold": [1500, 2000, 1800]
})
na_q1.attach(sales_data, "quarterly_sales.csv")
3. Survey Data Organization
Database tables: survey_types โ demographics โ timestamps
# Organize survey responses by type and demographics
surveys = Shelf(
root="./survey_data",
fields=["survey_type", "demographic", TimestampField("timestamp")], # Survey taxonomy
attributes=["sample_size", "methodology", "response_rate"] # Survey metadata
)
# Store customer satisfaction survey
survey = surveys.create(
survey_type="customer_satisfaction",
demographic="millennials",
sample_size=1000,
methodology="online_panel",
response_rate=0.23
)
responses = pd.DataFrame({
"question_id": [1, 2, 3, 4, 5],
"avg_score": [4.2, 3.8, 4.1, 3.9, 4.0],
"response_count": [920, 915, 898, 901, 911]
})
survey.attach(responses, "responses.csv")
๐ Reading Your Data Back
The Magic: Automatic JOIN Operations
Unlike databases where you need explicit JOINs, Shelfie automatically combines your "table" relationships:
from shelfie import load_from_shelf
# Load all data from experiments shelf
data = load_from_shelf("./ml_experiments")
# Returns a dictionary of DataFrames - like running multiple JOINed queries:
# {
# 'metadata': All experiment metadata with project+model+date info,
# 'training_metrics': Training data with experiment context automatically joined,
# 'test_results': Test data with experiment context automatically joined,
# ...
# }
# Analyze all your experiments - no JOINs needed!
print(data['metadata']) # Overview of all experiments
print(data['training_metrics']) # All training metrics with full context
# Note: File paths are stored as filename_path__ columns (e.g., 'training_metrics_path__')
What you get automatically:
- Denormalized DataFrames: Each CSV gets experiment+model+date columns added
- Full Context: Every row knows its complete "relational" context
- No JOIN complexity: Relationships are already materialized
- Pandas-ready: Immediate analysis without SQL knowledge
Each DataFrame automatically includes:
- Original data columns: Your actual data
- Attribute columns: Metadata from your "table columns" (hyperparams, notes, etc.)
- Field columns: Directory structure as relational context (project, model_type, date)
- File path columns: References as
filename_path__columns
๐ ๏ธ Advanced Features
Custom Fields with Defaults
from shelfie import Field, DateField, TimestampField
# Field with a default value
shelf = Shelf(
root="./data",
fields=[
"experiment",
Field("environment", default="production"), # Always "production" unless specified
DateField("date"), # Auto-generates today's date
TimestampField("timestamp") # Auto-generates current timestamp
],
attributes=["version"]
)
# Only need to specify non-default fields
record = shelf.create(
experiment="test_1",
version="1.0"
)
# Creates: ./data/test_1/production/2025-06-12/2025-06-12_14-30-45/
# Metadata includes: version_path__ for any attached files
Multiple File Types
# Attach different file types
record.attach(results_df, "results.csv") # CSV
record.attach(model_config, "config.json") # JSON
record.attach(trained_model, "model.pkl") # Pickle
record.attach(report_text, "summary.txt") # Text
Loading Existing Shelves
# Load a shelf that was created elsewhere
existing_shelf = Shelf.load_from_root("./experiments")
# Continue adding to it
new_experiment = existing_shelf.create(
experiment="advanced",
model="transformer",
epochs=50,
learning_rate=0.0001 # Lower learning rate for transformer models
)
๐๏ธ Directory Structure Examples
Before Shelfie
my_project/
โโโ experiment1_mlp_results.csv
โโโ experiment1_mlp_model.pkl
โโโ experiment2_cnn_results.csv
โโโ experiment2_cnn_model.pkl
โโโ baseline_test_data.csv
โโโ advanced_test_data.csv
โโโ notes.txt # Which file belongs to what?
After Shelfie
my_project/
โโโ baseline/
โ โโโ mlp/
โ โ โโโ 2025-06-12/
โ โ โโโ metadata.json # {"epochs": 100, "lr": 0.001, "results_path__": "/path/results.csv"}
โ โ โโโ results.csv
โ โ โโโ model.pkl
โ โโโ cnn/
โ โโโ 2025-06-12/
โ โโโ metadata.json # {"epochs": 200, "lr": 0.0001, "results_path__": "/path/results.csv"}
โ โโโ results.csv
โ โโโ model.pkl
โโโ advanced/
โโโ transformer/
โโโ 2025-06-12/
โโโ metadata.json # {"epochs": 50, "lr": 0.0001, "results_path__": "/path/results.csv"}
โโโ results.csv
โโโ model.pkl
๐ค Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
๐ License
This project is licensed under the MIT License.
Happy organizing! ๐โจ
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file shelfie-0.2.0.tar.gz.
File metadata
- Download URL: shelfie-0.2.0.tar.gz
- Upload date:
- Size: 34.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.31
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f1be82c8b42db7c3de62f1445c2cd74df4d274964b44b29229b94d1213aca649
|
|
| MD5 |
cb0aef6c2a1cf4175d2ca0a10f3ceae8
|
|
| BLAKE2b-256 |
23032c62d647292ee4b01114a3a183d1b7ace226de28ffe75a3b639a82aa3ea7
|
File details
Details for the file shelfie-0.2.0-py3-none-any.whl.
File metadata
- Download URL: shelfie-0.2.0-py3-none-any.whl
- Upload date:
- Size: 10.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.31
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
8850e11aa0226358ee3e97b66694b8ba163f448e02bc4efce626c1638c1d248c
|
|
| MD5 |
c2a7f178ca2dc506ee48690a3cf1eba0
|
|
| BLAKE2b-256 |
373bba0b4fc326302ff76563311b7df21b24a9a5234a31c817fe809735586dda
|