Static analysis pipeline for generating high-quality Rust code datasets for model fine-tuning. Phase 2 dataset generation with Phase 1 format alignment.

Sigil Pipeline v2.6.0

A static analysis pipeline for generating high-quality Rust code datasets for model fine-tuning. It runs a suite of Cargo-based analysis tools over Rust crates and emits training datasets in JSONL format.

📖 Ecosystem Architecture: For a comprehensive overview of how this project integrates with SigilDERG-Finetuner and human-eval-Rust, see ARCHITECTURE.md.

Version 2.6.0 includes:

  • Checkpoint/Resume System: Automatic checkpointing allows resuming long-running pipeline executions without losing progress. Preserves temp directories and skips already-processed crates.
  • Improved Error Injection: Enhanced error-fixing task generation with fallback to simulated errors when real compilation times out, ensuring more robust task diversity.
  • Enhanced Logging: Geiger and License checks now always write logs, even when no issues are found, improving observability and debugging.
  • Tool Execution Tracking: Rejection summaries now include flags indicating which analysis tools were executed or skipped.
  • Enterprise Observability: Structured logging via structlog, Prometheus-compatible metrics, and optional OpenTelemetry tracing.
  • License pre-checking from crates.io API
  • Cargo-deny security auditing integration
  • Streaming architecture for memory-efficient processing
  • Granular filter metrics and observability
  • Enhanced quality filtering (unsafe code, outdated dependencies)
  • Platform compatibility detection
  • Shared cargo target directory for faster builds

Overview

Sigil Pipeline performs comprehensive static analysis on Rust crates to identify high-quality, idiomatic code suitable for training code generation models. It combines:

  • Curated Rust crates analyzed through static analysis tools
  • The Stack Rust Clean dataset files (from HuggingFace)
  • Format validation to ensure consistent dataset structure

The pipeline generates JSONL datasets with prompt-generation pairs that can be used directly for fine-tuning language models.

Features

Static Code Analysis

  • Clippy: Detects idiomatic code patterns and lint violations
  • Cargo Geiger: Analyzes unsafe code usage and safety metrics
  • Cargo Outdated: Assesses dependency maintenance status
  • Cargo License: Checks license compliance (with centralized verification logic)
  • Cargo Deny: Performs security and license auditing (optional, configurable)
  • License Pre-Check: Validates licenses from crates.io API before downloading
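
Most of these checks run as cargo subcommands whose output the pipeline parses. As an illustrative sketch (not the pipeline's actual analyzer code), collecting Clippy warnings from cargo's JSON output might look like this:

import json
import subprocess

def collect_clippy_warnings(crate_dir: str) -> list[dict]:
    # Illustrative sketch: run Clippy with machine-readable output and
    # keep warning-level diagnostics. The real logic lives in analyzer.py.
    proc = subprocess.run(
        ["cargo", "clippy", "--message-format=json"],
        cwd=crate_dir,
        capture_output=True,
        text=True,
    )
    warnings = []
    for line in proc.stdout.splitlines():
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON status lines
        if msg.get("reason") == "compiler-message":
            diag = msg.get("message", {})
            if diag.get("level") == "warning":
                warnings.append(diag)
    return warnings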

Quality Filtering

  • Rust Edition: Filters to 2021+ edition crates (modern Rust)
  • Clippy Warnings: Category-based max_bad_code_warnings threshold (default: 0, ignores style/doc lints but blocks unsafe or correctness issues). Legacy max_clippy_warnings is still available for total-count filtering.
  • Documentation: Requires documentation comments on public items
  • Test/Bench Exclusion: Automatically filters out test and benchmark files
  • Size/Sanity Filters: Applies Stack dataset filtering criteria (line length, alphabetic ratio); see the sketch after this list
  • License Filtering: Only includes permissively licensed code (MIT, Apache-2.0, BSD, etc.) with SPDX expression support
  • Unsafe Code Filtering: Optional threshold for maximum unsafe code items (from Geiger)
  • Outdated Dependencies: Optional threshold for maximum outdated dependency ratio
  • Platform Compatibility: Automatically skips OS-specific crates incompatible with current platform
  • Security Auditing: Optional cargo-deny integration for security advisories and license violations
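
The size/sanity checks reduce to simple per-file predicates. A minimal sketch, assuming a max-line-length reading of the Stack criteria (the pipeline's real heuristics live in filter.py):

def passes_sanity_filters(
    source: str,
    max_line_length: int = 100,
    min_alphabetic_ratio: float = 0.3,
) -> bool:
    # Reject empty files and files with overly long lines
    # (typical of generated or minified code).
    lines = source.splitlines()
    if not lines:
        return False
    if max(len(line) for line in lines) > max_line_length:
        return False
    # Reject files that are mostly non-alphabetic characters.
    alphabetic = sum(ch.isalpha() for ch in source)
    return alphabetic / len(source) >= min_alphabetic_ratio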

Dataset Generation

  • Prompt Generation: Creates instruction prompts from code patterns and doc comments
  • Semantic Chunking: Splits large files into snippet-sized chunks (functions, impl blocks, modules) for Phase-2
  • Task Type Diversity: Generates multiple task types for Phase-2 (a sampling sketch follows this list):
    • Code generation (70% default)
    • Transformations (15% default): sync→async, match→?, iterator conversions
    • Error fixing (10% default): fix compiler errors in broken code with improved fallback to simulated errors when real compilation times out
    • Explanations (5% default): explain code functionality
  • Format Validation: Ensures consistent dataset structure
  • Dataset Merging: Combines multiple datasets with shuffle and weighting options
  • Extra Shards: Append pre-generated instruct-style shards (e.g., experimental upscales) via CLI without moving files
  • Train/Val Split by Source: Splits datasets keeping whole crates/files together (tests true generalization)
  • Streaming Architecture: Generator-based pipeline for memory-efficient processing of large datasets
  • Granular Metrics: Detailed filter reason breakdown for observability
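
The task mix amounts to a weighted categorical draw per chunk. A minimal sampling sketch using the default weights above (illustrative; the pipeline's actual logic lives in task_generator.py):

import random

# Default Phase-2 task mix from the list above.
TASK_MIX = {
    "code_generation": 0.70,
    "transformations": 0.15,
    "error_fixing": 0.10,
    "explanations": 0.05,
}

def sample_task_type(rng: random.Random) -> str:
    tasks, weights = zip(*TASK_MIX.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

rng = random.Random(42)  # fixed seed for a reproducible example
print([sample_task_type(rng) for _ in range(5)])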

Checkpoint/Resume System

  • Automatic Checkpointing: Saves progress periodically (configurable interval, default: every 10 crates)
  • Resume from Interruptions: Automatically detects and loads checkpoints on startup
  • Temp Directory Preservation: Reuses existing temp directories when resuming, preserving downloaded crates (saves GBs of re-downloads)
  • Smart Crate Skipping: Automatically skips already-processed crates to avoid duplicates
  • Config Compatibility Checking: Verifies config hash to prevent incompatible resumes
  • Checkpoint Location: Defaults to output_dir/checkpoint.json, customizable via --checkpoint-path
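
Conceptually, resuming means loading the checkpoint, verifying the config hash, and skipping crates already recorded. A minimal sketch with hypothetical field names (the actual schema is whatever the pipeline writes to checkpoint.json):

import json
from pathlib import Path

def crates_to_process(all_crates: list[str], checkpoint_path: Path) -> list[str]:
    # "processed_crates" is a hypothetical field name used for illustration.
    if not checkpoint_path.exists():
        return all_crates
    checkpoint = json.loads(checkpoint_path.read_text(encoding="utf-8"))
    done = set(checkpoint.get("processed_crates", []))
    return [crate for crate in all_crates if crate not in done]

remaining = crates_to_process(["serde", "tokio"], Path("output/checkpoint.json"))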

Requirements

  • Python 3.12+
  • Rust toolchain (1.56+ for the 2021 edition; 1.85+ for the 2024 edition)
  • Cargo subcommands:
    • cargo clippy (included with rustup)
    • cargo geiger
    • cargo outdated
    • cargo license
    • cargo deny

See docs/SETUP.md for detailed setup instructions.

Installation

# Clone the repository
git clone https://github.com/Superuser666-Sigil/SigilDERG-Data_Production.git
cd SigilDERG-Data_Production

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -e ".[datasets]"  # tree-sitter for AST parsing is now included in core deps

# Install Rust analysis tools
cargo install cargo-geiger cargo-outdated cargo-license cargo-deny
rustup component add clippy

Quick Start

Command Line

# Analyze specific crates
python -m sigil_pipeline.main --crates serde tokio actix-web

# Use crate list file
python -m sigil_pipeline.main --crate-list data/crate_list.txt

# Phase-2 Instruct Mode (generates diverse task types with semantic chunking)
python -m sigil_pipeline.main \
  --prompt-mode instruct \
  --max-sft-lines 200 \
  --max-sft-chars 8000 \
  --output output/phase2_dataset.jsonl

# Custom task type distribution
python -m sigil_pipeline.main \
  --task-mix '{"code_generation": 0.7, "transformations": 0.15, "error_fixing": 0.1, "explanations": 0.05}'

# Append experimental / pre-generated shards after generation
python -m sigil_pipeline.main \
  --crate-list data/crate_list.txt \
  --extra-phase2-shard experimental/experimental_shard.jsonl \
  --output datasets/phase2_full.jsonl

# Allow longer real error injection (e.g., 3 minutes for cargo check)
python -m sigil_pipeline.main \
  --error-injection-timeout 180 \
  --output datasets/phase2_full.jsonl

# Checkpoint/Resume: Automatically saves progress and can resume from interruptions
# Checkpoint is saved to output_dir/checkpoint.json by default
python -m sigil_pipeline.main \
  --crate-list data/crate_list.txt \
  --output datasets/phase2_full.jsonl \
  --checkpoint-interval 10  # Save checkpoint every 10 crates (default)

# Resume from checkpoint (automatically detected if checkpoint.json exists)
python -m sigil_pipeline.main \
  --crate-list data/crate_list.txt \
  --output datasets/phase2_full.jsonl
# Pipeline will automatically skip already-processed crates and reuse temp directory

# Custom checkpoint path
python -m sigil_pipeline.main \
  --checkpoint-path logs/my_checkpoint.json \
  --crate-list data/crate_list.txt \
  --output datasets/phase2_full.jsonl

# Disable checkpointing
python -m sigil_pipeline.main \
  --no-checkpointing \
  --crate-list data/crate_list.txt \
  --output datasets/phase2_full.jsonl

Python API

import asyncio
from sigil_pipeline.config import PipelineConfig
from sigil_pipeline.main import run_pipeline

async def main():
    config = PipelineConfig(
        crates=["serde", "tokio"],
        output_path="output/dataset.jsonl",
    )
    
    await run_pipeline(config)

if __name__ == "__main__":
    asyncio.run(main())

Configuration

The pipeline uses a PipelineConfig dataclass for all settings. Key options:

from sigil_pipeline.config import PipelineConfig

config = PipelineConfig(
    # Crates to analyze
    crates=["serde", "tokio"],
    crate_list_path="data/crate_list.txt",  # Or specify individual crates
    
    # Quality thresholds
    allow_edition_2018=False,  # Only 2021+ edition
    max_bad_code_warnings=0,  # Strict filter for critical lints (style lints ignored)
    require_docs=True,  # Require documentation
    
    # Advanced filtering
    max_unsafe_items=None,  # Optional: max unsafe code items (None = no filter)
    max_outdated_ratio=None,  # Optional: max outdated dependency ratio
    enable_deny_scan=False,  # Optional: cargo-deny security auditing
    
    # File filtering
    max_line_length=100,
    min_alphabetic_ratio=0.3,  # Filters minified code
    
    # Error injection controls
    enable_error_injection=True,
    error_injection_method="both",
    error_injection_timeout=120,
    
    # Performance
    reuse_cargo_target=True,  # Share cargo target directory (output/cargo_target_cache by default)
    
    # Checkpoint/Resume
    enable_checkpointing=True,  # Enable automatic checkpointing (default: True)
    checkpoint_path=None,  # Custom checkpoint path (default: output_dir/checkpoint.json)
    checkpoint_interval=10,  # Save checkpoint every N crates (default: 10)
    
    # Output
    output_path="output/dataset.jsonl",
    max_threads=4,  # Parallel processing
)

Configuration can be loaded from JSON or YAML files:

python -m sigil_pipeline.main --config config.yaml

Output Format

The pipeline generates JSONL files (one JSON object per line) with the following structure:

{"prompt": "Write a Rust program that demonstrates error handling", "gen": "use anyhow::Result;\n\nfn main() -> Result<()> {\n    // ...\n}"}
{"prompt": "Write a Rust code example that uses iterators", "gen": "fn process_data(items: &[i32]) -> Vec<i32> {\n    items.iter().map(|x| x * 2).collect()\n}"}

Each line contains:

  • prompt: Instruction prompt describing what the code does
  • gen: Generated code (plain text, UTF-8 encoded)

See docs/DATASET_SCHEMA.md for detailed format specification.
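
A quick way to sanity-check a dataset against this structure (a minimal sketch; tools/verify_format_test.py is the canonical checker):

import json

def check_dataset(path: str) -> None:
    # Every line must be a JSON object with non-empty string
    # "prompt" and "gen" fields.
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)
            for key in ("prompt", "gen"):
                value = record.get(key)
                if not isinstance(value, str) or not value:
                    raise ValueError(f"line {lineno}: bad {key!r} field")

check_dataset("output/dataset.jsonl")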

Project Structure

sigil_pipeline/          # Main pipeline package
├── main.py             # Pipeline orchestration and CLI entry point
├── config.py           # Configuration management
├── crawler.py          # Crate downloading and Stack dataset integration
├── analyzer.py         # Static analysis tools execution
├── filter.py           # Quality filtering heuristics
├── chunker.py          # Semantic code chunking (Phase-2)
├── task_generator.py   # Task type generation (Phase-2)
├── dataset_builder.py  # Prompt generation and dataset assembly
├── dataset_splitter.py # Train/val splitting by source
├── exporter.py         # JSONL export and dataset merging
├── format_validator.py # Format validation
├── observability.py    # Structured logging and metrics
├── telemetry.py        # OpenTelemetry tracing (optional)
└── utils.py            # Utilities (cargo commands, file I/O, etc.)

tools/                   # Dataset utilities
├── analyze_failures.py         # Analyze pipeline rejection reasons
├── convert_jsonl_to_parquet.py # Convert JSONL to Parquet
├── convert_parquet_to_jsonl.py # Convert Parquet to JSONL
├── split_jsonl.py              # Split large JSONL into chunks
├── split_train_val.py          # Create train/val splits
├── rebalance_task_mix.py       # Adjust task type distribution
└── verify_format_test.py       # Validate format compliance

scripts/                 # Setup and release scripts
├── create_release.py           # Release automation
└── setup/
    └── setup_rust_analysis_tools.py  # Install Rust tools

tests/                   # Test suite
benches/                 # Performance benchmarks
docs/                    # Documentation

Tools

The repository includes utility scripts for dataset manipulation and analysis.

Failure Analysis

tools/analyze_failures.py

  • Parses the latest (or specified) analysis logs
  • Categorizes Clippy warnings (ignores style warnings, flags unsafe/bad code)
  • Detects license rejections from the main pipeline log
  • Automatically removes license-rejected crates from data/crate_list.txt (unless --no-cleanup)
  • Can write a full report to disk

# Auto-detect most recent analysis directory
python tools/analyze_failures.py

# Specify locations explicitly
python tools/analyze_failures.py \
  --log-dir logs/analysis_20251124_180335 \
  --log-file logs/phase2_full_run.log \
  --crate-list data/crate_list.txt \
  --output logs/failure_analysis.txt

# Skip automatic crate_list cleanup
python tools/analyze_failures.py --no-cleanup

Dataset Utilities

tools/split_train_val.py

  • Splits a dataset into train/val files while keeping whole crates/files together.

python tools/split_train_val.py \
  --input datasets/phase2_full.jsonl \
  --train output/train.jsonl \
  --val output/val.jsonl \
  --val-ratio 0.1
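
Splitting by source means grouping records before assigning splits, so no crate or file straddles the train/val boundary. A minimal sketch, assuming each record carries a source-identifying field (the _source name here is hypothetical):

import random
from collections import defaultdict

def split_by_source(records, val_ratio=0.1, seed=0):
    # Group records by originating crate/file so whole groups land
    # on one side of the split.
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get("_source", "unknown")].append(rec)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_val = max(1, int(len(keys) * val_ratio))
    val_keys = set(keys[:n_val])
    train, val = [], []
    for key, recs in groups.items():
        (val if key in val_keys else train).extend(recs)
    return train, val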

tools/split_jsonl.py

  • Splits large JSONL files into ~11MB chunks without breaking JSON objects.

python tools/split_jsonl.py \
  --input datasets/phase2_full.jsonl \
  --output-dir datasets/chunks \
  --prefix phase2_chunk
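
Since each JSON object occupies exactly one line, chunking at a byte budget without breaking objects just means cutting at line boundaries. A minimal sketch of that idea (not the tool's exact code):

def chunk_jsonl(input_path, output_prefix, max_bytes=11 * 1024 * 1024):
    # Write ~max_bytes chunks, cutting only at line boundaries so no
    # JSON object is ever split across files.
    index, size = 0, 0
    out = open(f"{output_prefix}_{index:03d}.jsonl", "w", encoding="utf-8")
    with open(input_path, encoding="utf-8") as src:
        for line in src:
            n = len(line.encode("utf-8"))
            if size + n > max_bytes and size > 0:
                out.close()
                index, size = index + 1, 0
                out = open(f"{output_prefix}_{index:03d}.jsonl", "w", encoding="utf-8")
            out.write(line)
            size += n
    out.close()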

tools/convert_jsonl_to_parquet.py

  • Converts JSONL datasets to Parquet, supporting both training-ready (metadata stripped) and provenance variants.

python tools/convert_jsonl_to_parquet.py \
  --input datasets/phase2_full.jsonl \
  --output datasets/phase2_full.parquet \
  --variant training

tools/convert_parquet_to_jsonl.py

  • Converts Parquet datasets back to JSONL (useful for inspection or smaller workflows).

python tools/convert_parquet_to_jsonl.py \
  --input datasets/phase2_full.parquet \
  --output datasets/phase2_roundtrip.jsonl

tools/verify_format_test.py

  • Quick check to ensure a dataset matches the Phase 1 format specification.

python tools/verify_format_test.py --input datasets/phase2_full.jsonl

tools/rebalance_task_mix.py

  • Downsamples (or lightly reweights) a JSONL dataset to match a desired _task_type distribution and writes a summary report.

python tools/rebalance_task_mix.py \
  --input datasets/phase2_full.jsonl \
  --output datasets/phase2_balanced.jsonl \
  --target-mix code_generation=0.5,error_fixing=0.25,transformations=0.15,explanations=0.10
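
Downsampling to a target mix caps the total at the largest size no category can exceed without upsampling: total = min over task types of count(t) / target(t). A simplified sketch (the real tool also writes a summary report):

import random
from collections import defaultdict

def rebalance(records, target, seed=0):
    # Bucket records by their _task_type field (present in pipeline output).
    by_type = defaultdict(list)
    for rec in records:
        by_type[rec.get("_task_type", "unknown")].append(rec)
    # Largest total achievable without upsampling any category.
    total = min(len(by_type[t]) / p for t, p in target.items() if p > 0)
    rng = random.Random(seed)
    kept = []
    for t, p in target.items():
        rng.shuffle(by_type[t])
        kept.extend(by_type[t][: int(p * total)])
    rng.shuffle(kept)
    return kept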

Testing

# Run all tests (672 tests)
pytest tests/

# Run with coverage report
pytest tests/ --cov=sigil_pipeline --cov-report=term-missing

# Run specific test modules
pytest tests/test_api_tracker.py -v          # API evolution tracking
pytest tests/test_ast_patterns.py -v         # AST-based extraction
pytest tests/test_task_generator.py -v       # Task type generation
pytest tests/test_telemetry.py -v            # OpenTelemetry tracing
pytest tests/test_converters.py -v           # Format conversion
pytest tests/test_dataset_splitter.py -v     # Train/val splitting

# Run tests by keyword
pytest tests/ -k "api" -v                    # API-related tests
pytest tests/ -k "ast" -v                    # AST parsing tests

# Run property-based tests
pytest tests/test_properties.py -v --hypothesis-show-statistics

# Run local CI checks
python test_ci_local.py

Test Coverage Summary

Category         Modules                          Coverage
Core Pipeline    analyzer, filter, config         81-99%
AST Processing   ast_patterns, task_generator     78-80%
API Tracking     api_tracker, usage_analyzer      79-89%
Data Processing  dataset_splitter, converters     63-98%
Infrastructure   telemetry, utils, environment    77-91%
CLI              ecosystem, main                  42-93%

Overall Coverage: 75% (4845 statements, 672 tests passing)

SigilDERG Ecosystem Integration

This package is part of the SigilDERG ecosystem for Rust code model training. It integrates seamlessly with SigilDERG-Finetuner and human-eval-Rust.

Install Full Ecosystem

pip install sigil-pipeline[ecosystem]

This installs all three packages with proper version constraints.

Complete Workflow

  1. Generate dataset (this package):

    python -m sigil_pipeline.main --output datasets/phase2_full.jsonl
    
  2. Fine-tune model (sigilderg-finetuner):

    sigilderg-train configs/llama8b-phase2.yml  # Uses local:datasets/phase2_full.jsonl
    
  3. Evaluate model (human-eval-rust):

    sigilderg-eval samples.jsonl --use-human-eval
    

Unified CLI

Use the unified orchestrator for the complete workflow:

sigil-ecosystem \
    --crate-list data/crate_list.txt \
    --dataset-path datasets/phase2_full.jsonl \
    --config-path configs/llama8b-phase2.yml

See Ecosystem Integration Guide for detailed documentation.

Documentation

See the docs/ directory for detailed guides, including docs/SETUP.md and docs/DATASET_SCHEMA.md.

Docker

The project includes Docker support for containerized execution:

# Build image
docker build -t sigil-pipeline:2.6.0 .

# Run pipeline
docker-compose up

# Interactive shell
docker run -it sigil-pipeline:2.6.0 bash

# Run with custom arguments
docker run -v $(pwd)/output:/app/output sigil-pipeline:2.6.0 \
    --crate-list /app/data/crate_list.txt \
    --output /app/output/dataset.jsonl

See docker-compose.yml and Dockerfile for configuration details.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Rust community for excellent analysis tools (Clippy, Geiger, etc.)
  • HuggingFace for the Stack dataset and datasets library
  • The Stack dataset contributors for providing high-quality Rust code
  • Ammar Nasr for producing and distributing the Stack Rust Clean Dataset (https://huggingface.co/datasets/ammarnasr/the-stack-rust-clean)

Sigil Pipeline - Generating high-quality Rust code datasets for model fine-tuning.
