Sigil Pipeline v2.6.0
A static analysis pipeline for generating high-quality Rust code datasets for model fine-tuning. The pipeline analyzes Rust crates using static analysis tools and generates training datasets in JSONL format.
Ecosystem Architecture: For a comprehensive overview of how this project integrates with SigilDERG-Finetuner and human-eval-Rust, see ARCHITECTURE.md.
Version 2.6.0 includes:
- Checkpoint/Resume System: Automatic checkpointing allows resuming long-running pipeline executions without losing progress. Preserves temp directories and skips already-processed crates.
- Improved Error Injection: Enhanced error-fixing task generation with fallback to simulated errors when real compilation times out, ensuring more robust task diversity.
- Enhanced Logging: Geiger and License checks now always write logs, even when no issues are found, improving observability and debugging.
- Tool Execution Tracking: Rejection summaries now include flags indicating which analysis tools were executed or skipped.
- Enterprise Observability: Structured logging via structlog, Prometheus-compatible metrics, and optional OpenTelemetry tracing.
- License pre-checking from crates.io API
- Cargo-deny security auditing integration
- Streaming architecture for memory-efficient processing
- Granular filter metrics and observability
- Enhanced quality filtering (unsafe code, outdated dependencies)
- Platform compatibility detection
- Shared cargo target directory for faster builds
Overview
Sigil Pipeline performs comprehensive static analysis on Rust crates to identify high-quality, idiomatic code suitable for training code generation models. It combines:
- Curated Rust crates analyzed through static analysis tools
- The Stack Rust Clean dataset files (from HuggingFace)
- Format validation to ensure consistent dataset structure
The pipeline generates JSONL datasets with prompt-generation pairs that can be used directly for fine-tuning language models.
Features
Static Code Analysis
- Clippy: Detects idiomatic code patterns and lint violations
- Cargo Geiger: Analyzes unsafe code usage and safety metrics
- Cargo Outdated: Assesses dependency maintenance status
- Cargo License: Checks license compliance (with centralized verification logic)
- Cargo Deny: Performs security and license auditing (optional, configurable)
- License Pre-Check: Validates licenses from crates.io API before downloading
Quality Filtering
- Rust Edition: Filters to 2021+ edition crates (modern Rust)
- Clippy Warnings: Category-based max_bad_code_warnings threshold (default: 0, ignores style/doc lints but blocks unsafe or correctness issues). Legacy max_clippy_warnings is still available for total-count filtering.
- Documentation: Requires documentation comments on public items
- Test/Bench Exclusion: Automatically filters out test and benchmark files
- Size/Sanity Filters: Applies Stack dataset filtering criteria such as line length and alphabetic ratio (a sketch follows this list)
- License Filtering: Only includes permissively licensed code (MIT, Apache-2.0, BSD, etc.) with SPDX expression support
- Unsafe Code Filtering: Optional threshold for maximum unsafe code items (from Geiger)
- Outdated Dependencies: Optional threshold for maximum outdated dependency ratio
- Platform Compatibility: Automatically skips OS-specific crates incompatible with current platform
- Security Auditing: Optional cargo-deny integration for security advisories and license violations
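To make the size/sanity criteria concrete, here is a minimal sketch of such a filter. The function name and exact rules are illustrative and not the pipeline's internal implementation:

def passes_sanity_filters(
    source: str,
    max_line_length: int = 100,
    min_alphabetic_ratio: float = 0.3,
) -> bool:
    """Illustrative size/sanity filter in the spirit of the Stack criteria."""
    lines = source.splitlines()
    if not lines:
        return False
    # Reject files with overly long lines (often generated or minified code).
    if max(len(line) for line in lines) > max_line_length:
        return False
    # Reject files that are mostly non-alphabetic (e.g., embedded data blobs).
    alphabetic = sum(ch.isalpha() for ch in source)
    return alphabetic / len(source) >= min_alphabetic_ratio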
Dataset Generation
- Prompt Generation: Creates instruction prompts derived from code patterns and doc comments
- Semantic Chunking: Splits large files into snippet-sized chunks (functions, impl blocks, modules) for Phase-2; a simplified sketch follows this list
- Task Type Diversity: Generates multiple task types for Phase-2:
- Code generation (70% default)
- Transformations (15% default): sync→async, match→?, iterator conversions
- Error fixing (10% default): fix compiler errors in broken code with improved fallback to simulated errors when real compilation times out
- Explanations (5% default): explain code functionality
- Format Validation: Ensures consistent dataset structure
- Dataset Merging: Combines multiple datasets with shuffle and weighting options
- Extra Shards: Append pre-generated instruct-style shards (e.g., experimental upscales) via CLI without moving files
- Train/Val Split by Source: Splits datasets keeping whole crates/files together (tests true generalization)
- Streaming Architecture: Generator-based pipeline for memory-efficient processing of large datasets
- Granular Metrics: Detailed filter reason breakdown for observability
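To illustrate semantic chunking, the sketch below splits Rust source at top-level item boundaries. It is deliberately simplified: the pipeline's actual chunker works on a real AST (via tree-sitter), whereas this version only looks at column-0 item keywords:

import re

# Top-level Rust items that can start a new chunk (column 0, unindented).
ITEM_START = re.compile(r"^(pub\s+)?(fn|impl|mod|struct|enum|trait)\b")

def chunk_rust_source(source: str) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    for line in source.splitlines():
        # Flush the current chunk whenever a new top-level item begins.
        if ITEM_START.match(line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks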
Checkpoint/Resume System
- Automatic Checkpointing: Saves progress periodically (configurable interval, default: every 10 crates)
- Resume from Interruptions: Automatically detects and loads checkpoints on startup
- Temp Directory Preservation: Reuses existing temp directories when resuming, preserving downloaded crates (saves GBs of re-downloads)
- Smart Crate Skipping: Automatically skips already-processed crates to avoid duplicates
- Config Compatibility Checking: Verifies config hash to prevent incompatible resumes
- Checkpoint Location: Defaults to output_dir/checkpoint.json, customizable via --checkpoint-path
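For reference, a checkpoint file might look like the following. The field names here are hypothetical, shown only to illustrate what the checkpoint records (processed crates, config hash for compatibility checking, temp directory):

{
  "config_hash": "sha256:...",
  "temp_dir": "output/tmp_crates",
  "processed_crates": ["serde", "tokio"],
  "records_written": 1342
}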
Requirements
- Python 3.12+
- Rust toolchain (1.56+ for 2021 edition, 1.72+ for 2024 edition)
- Cargo subcommands:
  - cargo clippy (included with rustup)
  - cargo geiger
  - cargo outdated
  - cargo license
  - cargo deny
See docs/SETUP.md for detailed setup instructions.
Installation
# Clone the repository
git clone https://github.com/Superuser666-Sigil/SigilDERG-Data_Production.git
cd SigilDERG-Data_Production
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -e ".[datasets]" # tree-sitter for AST parsing is now included in core deps
# Install Rust analysis tools
cargo install cargo-geiger cargo-outdated cargo-license cargo-deny
rustup component add clippy
Quick Start
Command Line
# Analyze specific crates
python -m sigil_pipeline.main --crates serde tokio actix-web
# Use crate list file
python -m sigil_pipeline.main --crate-list data/crate_list.txt
# Phase-2 Instruct Mode (generates diverse task types with semantic chunking)
python -m sigil_pipeline.main \
--prompt-mode instruct \
--max-sft-lines 200 \
--max-sft-chars 8000 \
--output output/phase2_dataset.jsonl
# Custom task type distribution
python -m sigil_pipeline.main \
--task-mix '{"code_generation": 0.7, "transformations": 0.15, "error_fixing": 0.1, "explanations": 0.05}'
# Append experimental / pre-generated shards after generation
python -m sigil_pipeline.main \
--crate-list data/crate_list.txt \
--extra-phase2-shard experimental/experimental_shard.jsonl \
--output datasets/phase2_full.jsonl
# Allow longer real error injection (e.g., 3 minutes for cargo check)
python -m sigil_pipeline.main \
--error-injection-timeout 180 \
--output datasets/phase2_full.jsonl
# Checkpoint/Resume: Automatically saves progress and can resume from interruptions
# Checkpoint is saved to output_dir/checkpoint.json by default
python -m sigil_pipeline.main \
--crate-list data/crate_list.txt \
--output datasets/phase2_full.jsonl \
--checkpoint-interval 10 # Save checkpoint every 10 crates (default)
# Resume from checkpoint (automatically detected if checkpoint.json exists)
python -m sigil_pipeline.main \
--crate-list data/crate_list.txt \
--output datasets/phase2_full.jsonl
# Pipeline will automatically skip already-processed crates and reuse temp directory
# Custom checkpoint path
python -m sigil_pipeline.main \
--checkpoint-path logs/my_checkpoint.json \
--crate-list data/crate_list.txt \
--output datasets/phase2_full.jsonl
# Disable checkpointing
python -m sigil_pipeline.main \
--no-checkpointing \
--crate-list data/crate_list.txt \
--output datasets/phase2_full.jsonl
Python API
import asyncio
from sigil_pipeline.config import PipelineConfig
from sigil_pipeline.main import run_pipeline
async def main():
    config = PipelineConfig(
        crates=["serde", "tokio"],
        output_path="output/dataset.jsonl",
    )
    await run_pipeline(config)

if __name__ == "__main__":
    asyncio.run(main())
Configuration
The pipeline uses a PipelineConfig dataclass for all settings. Key options:
from sigil_pipeline.config import PipelineConfig
config = PipelineConfig(
    # Crates to analyze
    crates=["serde", "tokio"],
    crate_list_path="data/crate_list.txt",  # Or load crates from a file
    # Quality thresholds
    allow_edition_2018=False,     # Only 2021+ edition
    max_bad_code_warnings=0,      # Strict filter for critical lints (style lints ignored)
    require_docs=True,            # Require documentation
    # Advanced filtering
    max_unsafe_items=None,        # Optional: max unsafe code items (None = no filter)
    max_outdated_ratio=None,      # Optional: max outdated dependency ratio
    enable_deny_scan=False,       # Optional: cargo-deny security auditing
    # File filtering
    max_line_length=100,
    min_alphabetic_ratio=0.3,     # Filters minified code
    # Error injection controls
    enable_error_injection=True,
    error_injection_method="both",
    error_injection_timeout=120,
    # Performance
    reuse_cargo_target=True,      # Share cargo target directory (output/cargo_target_cache by default)
    # Checkpoint/Resume
    enable_checkpointing=True,    # Enable automatic checkpointing (default: True)
    checkpoint_path=None,         # Custom checkpoint path (default: output_dir/checkpoint.json)
    checkpoint_interval=10,       # Save checkpoint every N crates (default: 10)
    # Output
    output_path="output/dataset.jsonl",
    max_threads=4,                # Parallel processing
)
Configuration can be loaded from JSON or YAML files:
python -m sigil_pipeline.main --config config.yaml
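For example, a minimal config.yaml, assuming keys map one-to-one onto the PipelineConfig fields shown above:

# Sketch of a YAML config; keys mirror PipelineConfig field names
crate_list_path: data/crate_list.txt
allow_edition_2018: false
max_bad_code_warnings: 0
require_docs: true
checkpoint_interval: 10
output_path: output/dataset.jsonl
max_threads: 4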
Output Format
The pipeline generates JSONL files (one JSON object per line) with the following structure:
{"prompt": "Write a Rust program that demonstrates error handling", "gen": "use anyhow::Result;\n\nfn main() -> Result<()> {\n // ...\n}"}
{"prompt": "Write a Rust code example that uses iterators", "gen": "fn process_data(items: &[i32]) -> Vec<i32> {\n items.iter().map(|x| x * 2).collect()\n}"}
Each line contains:
- prompt: Instruction prompt describing what the code does
- gen: Generated code (plain text, UTF-8 encoded)
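A minimal reader that checks this shape is sketched below; the bundled format_validator and tools/verify_format_test.py perform the authoritative checks:

import json

def check_dataset(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)  # each line must be a complete JSON object
            assert isinstance(record.get("prompt"), str), f"line {lineno}: bad prompt"
            assert isinstance(record.get("gen"), str), f"line {lineno}: bad gen"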
See docs/DATASET_SCHEMA.md for detailed format specification.
Project Structure
sigil_pipeline/                  # Main pipeline package
├── main.py                      # Pipeline orchestration and CLI entry point
├── config.py                    # Configuration management
├── crawler.py                   # Crate downloading and Stack dataset integration
├── analyzer.py                  # Static analysis tools execution
├── filter.py                    # Quality filtering heuristics
├── chunker.py                   # Semantic code chunking (Phase-2)
├── task_generator.py            # Task type generation (Phase-2)
├── dataset_builder.py           # Prompt generation and dataset assembly
├── dataset_splitter.py          # Train/val splitting by source
├── exporter.py                  # JSONL export and dataset merging
├── format_validator.py          # Format validation
├── observability.py             # Structured logging and metrics
├── telemetry.py                 # OpenTelemetry tracing (optional)
└── utils.py                     # Utilities (cargo commands, file I/O, etc.)
tools/                           # Dataset utilities
├── analyze_failures.py          # Analyze pipeline rejection reasons
├── convert_jsonl_to_parquet.py  # Convert JSONL to Parquet
├── convert_parquet_to_jsonl.py  # Convert Parquet to JSONL
├── split_jsonl.py               # Split large JSONL into chunks
├── split_train_val.py           # Create train/val splits
├── rebalance_task_mix.py        # Adjust task type distribution
└── verify_format_test.py        # Validate format compliance
scripts/                         # Setup and release scripts
├── create_release.py            # Release automation
└── setup/
    └── setup_rust_analysis_tools.py  # Install Rust tools
tests/                           # Test suite
benches/                         # Performance benchmarks
docs/                            # Documentation
Tools
The repository includes utility scripts for dataset manipulation and analysis.
Failure Analysis
tools/analyze_failures.py
- Parses the latest (or specified) analysis logs
- Categorizes Clippy warnings (ignores style warnings, flags unsafe/bad code)
- Detects license rejections from the main pipeline log
- Automatically removes license-rejected crates from data/crate_list.txt (unless --no-cleanup)
- Can write a full report to disk
# Auto-detect most recent analysis directory
python tools/analyze_failures.py
# Specify locations explicitly
python tools/analyze_failures.py \
--log-dir logs/analysis_20251124_180335 \
--log-file logs/phase2_full_run.log \
--crate-list data/crate_list.txt \
--output logs/failure_analysis.txt
# Skip automatic crate_list cleanup
python tools/analyze_failures.py --no-cleanup
Dataset Utilities
tools/split_train_val.py
- Splits a dataset into train/val files while keeping whole crates/files together.
python tools/split_train_val.py \
--input datasets/phase2_full.jsonl \
--train output/train.jsonl \
--val output/val.jsonl \
--val-ratio 0.1
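Conceptually, the tool groups rows by their origin before sampling, so no crate or file straddles the train/val boundary. A sketch of that idea, assuming each row carries a source identifier (the _source field name here is hypothetical):

import json
import random
from collections import defaultdict

def split_by_source(path: str, val_ratio: float = 0.1, seed: int = 0):
    groups: dict[str, list[dict]] = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            groups[row.get("_source", "unknown")].append(row)  # hypothetical key
    total = sum(len(rows) for rows in groups.values())
    sources = list(groups)
    random.Random(seed).shuffle(sources)
    train_rows: list[dict] = []
    val_rows: list[dict] = []
    for src in sources:
        # Assign whole sources to val until the target ratio is reached.
        bucket = val_rows if len(val_rows) < val_ratio * total else train_rows
        bucket.extend(groups[src])
    return train_rows, val_rows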
tools/split_jsonl.py
- Splits large JSONL files into ~11MB chunks without breaking JSON objects.
python tools/split_jsonl.py \
--input datasets/phase2_full.jsonl \
--output-dir datasets/chunks \
--prefix phase2_chunk
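The important property is that chunk boundaries always fall between lines, never inside a JSON object. A minimal sketch of the same idea (file naming and the exact size cap are illustrative):

def split_jsonl(path: str, out_prefix: str, max_bytes: int = 11 * 1024 * 1024) -> None:
    part, size, out = 0, 0, None
    with open(path, "rb") as f:
        for line in f:  # each line is one complete JSON object
            # Open a new chunk when the current one would exceed the size cap.
            if out is None or size + len(line) > max_bytes:
                if out is not None:
                    out.close()
                part += 1
                out = open(f"{out_prefix}_{part:03d}.jsonl", "wb")
                size = 0
            out.write(line)
            size += len(line)
    if out is not None:
        out.close()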
tools/convert_jsonl_to_parquet.py
- Converts JSONL datasets to Parquet, supporting both training-ready (metadata stripped) and provenance variants.
python tools/convert_jsonl_to_parquet.py \
--input datasets/phase2_full.jsonl \
--output datasets/phase2_full.parquet \
--variant training
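Under the hood the conversion amounts to a line-delimited JSON read and a Parquet write. A sketch using pandas (requires a Parquet engine such as pyarrow; the real tool's variant handling is richer):

import pandas as pd

def jsonl_to_parquet(src: str, dst: str, training_variant: bool = False) -> None:
    df = pd.read_json(src, lines=True)
    if training_variant:
        # Training variant: keep only the prompt/generation columns.
        df = df[["prompt", "gen"]]
    df.to_parquet(dst, index=False)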
tools/convert_parquet_to_jsonl.py
- Converts Parquet datasets back to JSONL (useful for inspection or smaller workflows).
python tools/convert_parquet_to_jsonl.py \
--input datasets/phase2_full.parquet \
--output datasets/phase2_roundtrip.jsonl
tools/verify_format_test.py
- Quick check to ensure a dataset matches the Phase 1 format specification.
python tools/verify_format_test.py --input datasets/phase2_full.jsonl
tools/rebalance_task_mix.py
- Downsamples (or lightly reweights) a JSONL dataset to match a desired _task_type distribution and writes a summary report.
python tools/rebalance_task_mix.py \
--input datasets/phase2_full.jsonl \
--output datasets/phase2_balanced.jsonl \
--target-mix code_generation=0.5,error_fixing=0.25,transformations=0.15,explanations=0.10
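Downsampling to a target mix reduces to a simple rule: the scarcest task type caps the total size. A sketch of that logic (assumes rows carry the _task_type field mentioned above):

import json
import random
from collections import defaultdict

def rebalance(path: str, target: dict[str, float], seed: int = 0) -> list[dict]:
    rows_by_type: dict[str, list[dict]] = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            rows_by_type[row.get("_task_type", "unknown")].append(row)
    # The achievable total is capped by the scarcest type relative to its share.
    total = min(
        len(rows_by_type[t]) / share
        for t, share in target.items()
        if share > 0 and t in rows_by_type
    )
    rng = random.Random(seed)
    out: list[dict] = []
    for task, share in target.items():
        pool = rows_by_type.get(task, [])
        out.extend(rng.sample(pool, min(len(pool), int(total * share))))
    rng.shuffle(out)
    return out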
Testing
# Run all tests (672 tests)
pytest tests/
# Run with coverage report
pytest tests/ --cov=sigil_pipeline --cov-report=term-missing
# Run specific test modules
pytest tests/test_api_tracker.py -v # API evolution tracking
pytest tests/test_ast_patterns.py -v # AST-based extraction
pytest tests/test_task_generator.py -v # Task type generation
pytest tests/test_telemetry.py -v # OpenTelemetry tracing
pytest tests/test_converters.py -v # Format conversion
pytest tests/test_dataset_splitter.py -v # Train/val splitting
# Run tests by keyword
pytest tests/ -k "api" -v # API-related tests
pytest tests/ -k "ast" -v # AST parsing tests
# Run property-based tests
pytest tests/test_properties.py -v --hypothesis-show-statistics
# Run local CI checks
python test_ci_local.py
Test Coverage Summary
| Category | Modules | Coverage |
|---|---|---|
| Core Pipeline | analyzer, filter, config | 81-99% |
| AST Processing | ast_patterns, task_generator | 78-80% |
| API Tracking | api_tracker, usage_analyzer | 79-89% |
| Data Processing | dataset_splitter, converters | 63-98% |
| Infrastructure | telemetry, utils, environment | 77-91% |
| CLI | ecosystem, main | 42-93% |
Overall Coverage: 75% (4845 statements, 672 tests passing)
SigilDERG Ecosystem Integration
This package is part of the SigilDERG ecosystem for Rust code model training. It integrates seamlessly with:
- sigilderg-finetuner: QLoRA fine-tuning for Rust code models
- human-eval-rust: Evaluation harness for Rust code generation
Install Full Ecosystem
pip install sigil-pipeline[ecosystem]
This installs all three packages with proper version constraints.
Complete Workflow
1. Generate dataset (this package):
   python -m sigil_pipeline.main --output datasets/phase2_full.jsonl
2. Fine-tune model (sigilderg-finetuner):
   sigilderg-train configs/llama8b-phase2.yml  # Uses local:datasets/phase2_full.jsonl
3. Evaluate model (human-eval-rust):
   sigilderg-eval samples.jsonl --use-human-eval
Unified CLI
Use the unified orchestrator for the complete workflow:
sigil-ecosystem \
--crate-list data/crate_list.txt \
--dataset-path datasets/phase2_full.jsonl \
--config-path configs/llama8b-phase2.yml
See Ecosystem Integration Guide for detailed documentation.
Documentation
- Architecture: Complete ecosystem architecture overview
- Setup Guide: Rust toolchain and cargo subcommand installation
- Dataset Schema: Detailed dataset format specification
- Ecosystem Integration: Complete workflow guide for all three packages
- Clippy Category Filtering: Quality filter documentation
- OS-Agnostic Cargo Commands: Cross-platform cargo usage
- Testing CI Locally: Local CI workflow testing
- Architecture Decision Records: Design decisions and rationale
Docker
The project includes Docker support for containerized execution:
# Build image
docker build -t sigil-pipeline:2.6.0 .
# Run pipeline
docker-compose up
# Interactive shell
docker run -it sigil-pipeline:2.6.0 bash
# Run with custom arguments
docker run -v $(pwd)/output:/app/output sigil-pipeline:2.6.0 \
--crate-list /app/data/crate_list.txt \
--output /app/output/dataset.jsonl
See docker-compose.yml and Dockerfile for configuration details.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Rust community for excellent analysis tools (Clippy, Geiger, etc.)
- HuggingFace for the Stack dataset and datasets library
- The Stack dataset contributors for providing high-quality Rust code
- Ammar Nasr for producing and distributing the Stack Rust Clean Dataset (https://huggingface.co/datasets/ammarnasr/the-stack-rust-clean)
Sigil Pipeline - Generating high-quality Rust code datasets for model fine-tuning.