A collection of tools for building structured Python projects

Cocina

Cocina is a collection of tools for building structured Python projects. It provides sophisticated configuration management, job execution capabilities, and a professional CLI interface.

Core Components

  1. ConfigHandler - Unified configuration management, constants, and environment variables
  2. ConfigArgs - Job-specific configuration loading with structured argument access
  3. CLI - Command-line interface for project initialization and job execution

Getting Started


Install

FROM PYPI

pip install cocina

FROM CONDA

conda install -c conda-forge cocina

Initialize

pixi run cocina init --log_dir logs --package your_package_name

See cocina Configuration for detailed initialization options.


Overview

Cocina separates configuration (values that can change) from constants (values that never change) and job arguments (run-specific parameters).

Key Concepts

  • ConfigHandler (ch) - Manages constants and project configuration

    • Constants: your_module/constants.py (protected from modification)
    • General Config: config/config.yaml
    • Env Config: config/<environment-name>.yaml
    • Usage: ch.DATABASE_URL, ch.get('MAX_SCALE', 1000)
  • ConfigArgs (ca) - Manages job-specific run configurations

    • Job configs: config/args/job_name.yaml
    • Usage: to call a method method_name: method_name(*ca.method_name.args, **ca.method_name.kwargs)

Note: names of configuration and job directories and files can be customized in .cocina.

Before and After

Traditional approach:

SOURCE = "path/to/src.parquet"
OUTPUT_DEST = "path/to/output"

def main():
    data = load_data(SOURCE, limit=1000, debug=True)
    data = process_data(data, scale=100, validate=False)
    save_data(data, OUTPUT_DEST, format="json")

if __name__ == "__main__":
    main()

With Cocina:

def run(config_args):
    data = load_data(*config_args.load_data.args, **config_args.load_data.kwargs)
    data = process_data(data, *config_args.process_data.args, **config_args.process_data.kwargs)
    save_data(data, *config_args.save_data.args, **config_args.save_data.kwargs)

All parameters are now externalized to YAML configuration files, making scripts reusable and maintainable. Argument parsing and CLI management are handled by the cocina CLI.
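For the script above, the externalized job config might look like the following sketch. The args/kwargs layout follows the convention shown in the Example section; the file name and specific values here are illustrative:

```yaml
# config/args/my_job.yaml (hypothetical file name)
load_data:
  args: ["path/to/src.parquet"]
  kwargs:
    limit: 1000
    debug: true

process_data:
  scale: 100
  validate: false

save_data:
  args: ["path/to/output"]
  kwargs:
    format: "json"
```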

Example

Project Structure:

my_project/
├── my_package/                 # Python package
│   ├── constants.py            # Project Constants (protected from modification)
│   ├── ...                     # Modules
│   └── data_manager.py         # Named example python module
├── config/
│   ├── config.yaml             # Main configuration
│   ├── prod.yaml               # Production configuration overrides
│   └── args/
│       └── data_pipeline.yaml  # Job configuration
└── jobs/
    └── data_pipeline.py        # Job implementation

Configuration (config/args/data_pipeline.yaml):

extract_data:
  args: ["source_table"]
  kwargs:
    limit: 1000
    debug: false

transform_data:
  scale: 100
  validate: true

save_data:
  - "output_table"

Job Implementation (jobs/data_pipeline.py):

def run(config_args, printer=None):
    data = extract_data(*config_args.extract_data.args, **config_args.extract_data.kwargs)
    data = transform_data(data, *config_args.transform_data.args, **config_args.transform_data.kwargs)
    save_data(*config_args.save_data.args, **config_args.save_data.kwargs)

Running Jobs:

# Default environment
pixi run cocina job data_pipeline

# Production environment
pixi run cocina job data_pipeline --env prod

RUN AND MAIN METHODS

When running a job, the CLI requires the job module to define one of the following: a run method accepting (config_args: ConfigArgs, printer: Printer), a run method accepting only config_args: ConfigArgs, or a main method that takes no arguments.

Priority ordering is:

  1. run(config_args, printer) | passing both a ConfigArgs and Printer instance
  2. run(config_args) | passing a ConfigArgs instance
  3. main() | for jobs without configuration (legacy scripts)
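The priority ordering above can be sketched as a small dispatch helper. This is an illustration of the described behavior, not Cocina's actual code; resolve_entrypoint is a hypothetical name:

```python
import inspect


def resolve_entrypoint(module):
    """Pick a job entry point following the priority order described above
    (a sketch of the documented behavior, not Cocina's implementation)."""
    run = getattr(module, "run", None)
    if callable(run):
        params = list(inspect.signature(run).parameters)
        if len(params) >= 2:
            return "run(config_args, printer)"
        if len(params) == 1:
            return "run(config_args)"
    main = getattr(module, "main", None)
    if callable(main):
        return "main()"
    raise AttributeError("job module defines neither run() nor main()")
```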

USER CODEBASE/NOTEBOOKS

Although the main focus is on building and running configured "jobs", ConfigArgs can also be used directly in your own code (in a notebook, for example):

# Load job-specific configuration
ca = ConfigArgs('job_group_1.job_a1')
jobs.job_group_1.job_a1.step_1(*ca.step_1.args, **ca.step_1.kwargs)

cocina Configuration

The .cocina file contains project settings and must be in your project root. It defines:

  • Configuration file locations and naming conventions
  • Project root directory location
  • Environment variable names

Required: Every project must have a .cocina file at the root.

Options:

  • --log_dir: Enable automatic log file creation
  • --package: Specify main package for constants loading
  • --force: Overwrite existing .cocina file

Configuration Files

Cocina uses YAML files in the config/ directory:

config/
├── config.yaml           # Main configuration
├── dev.yaml              # Development environment overrides
├── prod.yaml             # Production environment overrides
└── args/                 # Job-specific configurations
    ├── job_name.yaml     # Individual job config
    └── group_name/       # Grouped job configs
        └── job_a.yaml

Configuration Types:

  • Main Config: config.yaml - shared across all environments
  • Environment Config: {env}.yaml - environment-specific overrides
  • Job Config: args/{job}.yaml - job-specific parameters and arguments
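Environment overrides layer on top of the main config. A recursive merge like the sketch below captures the idea; Cocina's actual merge rules may differ, and merge_config is a hypothetical name:

```python
def merge_config(base, override):
    """Recursively merge an environment override dict into the base config.
    Nested dicts are merged key by key; all other values are replaced.
    (An illustrative sketch, not Cocina's implementation.)"""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged
```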

ConfigHandler

Manages constants and main configuration with environment support.

from cocina.config_handler import ConfigHandler

ch = ConfigHandler()
print(ch.DATABASE_URL)  # From config.yaml
print(ch.MAX_SCALE)     # From constants.py (protected)

Features:

  • Loads constants from your_package/constants.py
  • Loads configuration from config/config.yaml
  • Environment-specific overrides from config/{env}.yaml
  • Dict-style and attribute access patterns
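The dual dict-style/attribute access pattern mentioned above can be illustrated with a minimal dict subclass. This is not Cocina's implementation, just a sketch of the access pattern:

```python
class AttrConfig(dict):
    """Minimal illustration of combined dict-style and attribute access
    (not Cocina's actual ConfigHandler)."""

    def __getattr__(self, name):
        # Fall back to dict lookup for attribute access: cfg.KEY == cfg["KEY"]
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


cfg = AttrConfig(DATABASE_URL="postgres://localhost/db", MAX_SCALE=1000)
```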

ConfigArgs

Loads job-specific configurations with structured argument access.

from cocina.config_handler import ConfigArgs

ca = ConfigArgs('data_pipeline')
# Access method arguments
ca.extract_data.args     # ["source_table"]
ca.extract_data.kwargs   # {"limit": 1000, "debug": False}

YAML Configuration Parsing:

  • Dict with args/kwargs keys → extracts args and kwargs
  • Dict without special keys → args=[], kwargs=dict
  • List/tuple → args=value, kwargs={}
  • Single value → args=[value], kwargs={}
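The four parsing rules above can be expressed as a single normalization function. This sketch implements the documented rules for illustration; parse_entry is a hypothetical name, not part of Cocina's API:

```python
def parse_entry(value):
    """Normalize one job-config entry to (args, kwargs) following the
    parsing rules listed above (an illustrative sketch, not Cocina's code)."""
    if isinstance(value, dict):
        if "args" in value or "kwargs" in value:
            # Dict with args/kwargs keys -> extract both
            return list(value.get("args", [])), dict(value.get("kwargs", {}))
        # Dict without special keys -> everything becomes kwargs
        return [], dict(value)
    if isinstance(value, (list, tuple)):
        # List/tuple -> positional args only
        return list(value), {}
    # Single value -> one positional arg
    return [value], {}
```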

Features:

  • Environment-specific overrides
  • Reference resolution from main config
  • Dynamic value substitution

CLI

Initialize Project

pixi run cocina init --log_dir logs --package your_package

Run Jobs

# Run a single job
pixi run cocina job data_pipeline

# Run with alternative config filename
# - the above command loads config/args/data_pipeline.yaml
# - the command below loads config/args/data_pipeline/v2.yaml
pixi run cocina job data_pipeline:v2

# Run with specific environment
pixi run cocina job data_pipeline --env prod

# Run multiple jobs
pixi run cocina job job1 job2 job3

# Dry run (validate without executing)
pixi run cocina job data_pipeline --dry_run

Options:

  • --env: Environment configuration to use (dev, prod, etc.)
  • --verbose: Enable detailed output
  • --dry_run: Validate configuration without running

Tools

Printer

Professional output with timestamps, headers, and optional file logging. Printer is a singleton class that automatically initializes when first accessed.

from cocina.printer import Printer

printer = Printer(log_dir='logs', basename='MyApp')
printer.message('Status update', count=42, status='ok')
printer.stop('Complete')

Timer

Simple timing functionality with duration tracking.

from cocina.utils import Timer

timer = Timer()
timer.start()             # Start timing
print(timer.state())      # Current elapsed time
print(timer.now())        # Current timestamp
stop_time = timer.stop()  # Stop timing
print(timer.delta())      # Total duration string

See complete documentation for all utility functions and helpers.


Development

Requirements: Managed with Pixi - no manual environment setup needed.

# All commands use pixi
pixi run jupyter lab

Style: Follows PEP8 standards. See setup.cfg for project-specific rules.

