Prefix-aware curation & near-dedup for NN code via MinHash/LSH and AST fingerprints.
Neural Network Deduplication Pipeline (NN Dup)
short alias ldup
A data curation and near-deduplication pipeline for neural network code from the LEMUR dataset. It implements prefix-aware exact, lexical near-duplicate, and AST-based structural deduplication with diversity top-up, followed by chat data preparation for language model training. The final outputs are train/dev/test JSONL files ready for supervised fine-tuning (SFT).
The original version of the NN Dup project was created by Waleed Khalid at the Computer Vision Laboratory, University of Würzburg, Germany.
Overview
This comprehensive pipeline processes neural network implementations from the LEMUR dataset through two main stages:
Stage 1: Deduplication Pipeline
- Exact deduplication with prefix-aware canonicalization
- Lexical near-deduplication using MinHash and LSH
- Structural deduplication using AST fingerprints
- Diversity top-up for underrepresented model families
- Family-aware train/dev/test splits
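As a rough illustration of the first Stage 1 step, the sketch below groups records by a hash of their normalized source and, within each group of exact duplicates, keeps the record whose name prefix ranks first in a preference list. The function names, the normalization rule (dropping blank and full-line comment lines), and the `(name, code)` record shape are assumptions made for this example, not the package's actual implementation.

```python
import hashlib


def normalize(code: str) -> str:
    """Drop blank lines, full-line comments, and trailing whitespace (assumed rule)."""
    lines = []
    for line in code.splitlines():
        stripped = line.rstrip()
        if not stripped or stripped.lstrip().startswith("#"):
            continue
        lines.append(stripped)
    return "\n".join(lines)


def content_key(code: str) -> str:
    """Hash of the normalized source; equal keys mean exact duplicates."""
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


def prefix_rank(name: str, prefer_order: list) -> int:
    """Position of the first matching prefix; unmatched names rank last."""
    for rank, prefix in enumerate(prefer_order):
        if name.startswith(prefix):
            return rank
    return len(prefer_order)


def exact_dedup(records: list, prefer_order: list) -> list:
    """Keep one name per duplicate group, preferring earlier prefixes."""
    best = {}
    for name, code in records:
        key = content_key(code)
        candidate = (prefix_rank(name, prefer_order), name)
        if key not in best or candidate < best[key]:
            best[key] = candidate
    return sorted(name for _, name in best.values())


records = [
    ("AlexNet-1", "x = 1\n"),
    ("ResNet-7", "# same body, extra comment\nx = 1   \n"),
    ("ResNet-9", "x = 2\n"),
]
print(exact_dedup(records, ["ResNet", "AlexNet"]))
```

Here `AlexNet-1` and `ResNet-7` normalize to the same text, so only `ResNet-7` survives because `ResNet` comes first in the preference list.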
Stage 2: ChatPrep Pipeline
- Conversational conversion of deduplicated code into chat format
- Template-based generation of user-assistant interactions
- Quality validation following SFT standards
- JSONL export for language model training
Features
Deduplication Pipeline
- Multi-level Deduplication: Exact, lexical (MinHash+LSH), and structural (AST) deduplication
- Prefix-aware Processing: Maintains representation across different model families
- Family-aware Splits: Ensures proper train/dev/test separation by model families
- Diversity Top-up: Adds extra samples for underrepresented prefixes, subject to maximum lexical and structural similarity limits
- Comprehensive Reporting: Detailed statistics and curation reports
- Code Export: Exports deduplicated code files for further use
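To make the structural (AST) level more concrete, here is a minimal sketch of one possible fingerprint: the set of parent/child node-type pairs in the parsed tree, compared with Jaccard similarity. Identifier names and constant values do not appear in the features, so renamed copies look identical. The feature choice and the function names are assumptions for illustration; the package may fingerprint ASTs differently.

```python
import ast


def ast_features(code: str) -> set:
    """Set of (parent type, child type) edges in the AST; ignores names and values."""
    tree = ast.parse(code)
    features = set()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            features.add((type(parent).__name__, type(child).__name__))
    return features


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "def f(x):\n    return x + 1\n"
renamed = "def g(y):\n    return y + 2\n"
changed = "def f(x):\n    return x * 1\n"

print(jaccard(ast_features(original), ast_features(renamed)))
print(jaccard(ast_features(original), ast_features(changed)))
```

The renamed variant scores exactly 1.0, while swapping `+` for `*` changes one edge and lowers the score, which is the kind of signal a structural threshold can act on.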
ChatPrep Pipeline
- Conversational Format: Converts code into chat-style interactions for LLM training
- Template-based Generation: Uses customizable templates for consistent formatting
- Infill Support: Generates partial code examples for completion tasks
- Validation & Filtering: Ensures high-quality, parseable examples following SFT standards
- Family-aware Splitting: Prevents data leakage across model families
- JSONL Export: Generates training-ready chat data in standard format
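Family-aware splitting can be sketched as assigning whole families, rather than individual records, to train/dev/test so that no family appears in more than one split. The example below is a simplified illustration under assumed inputs (`(record_id, family)` pairs) and an assumed allocation rule (shuffle families with a seed, then cut by ratio); it is not the package's code.

```python
import random
from collections import defaultdict


def family_split(records, ratios=(0.8, 0.1, 0.1), seed=42):
    """Assign whole families to splits so no family leaks across splits."""
    by_family = defaultdict(list)
    for record_id, family in records:
        by_family[family].append(record_id)

    families = sorted(by_family)           # stable order before shuffling
    random.Random(seed).shuffle(families)  # deterministic for a given seed

    n_train = int(len(families) * ratios[0])
    n_dev = int(len(families) * ratios[1])
    groups = {
        "train": families[:n_train],
        "dev": families[n_train:n_train + n_dev],
        "test": families[n_train + n_dev:],  # remainder goes to test
    }
    return {
        split: [rid for fam in fams for rid in by_family[fam]]
        for split, fams in groups.items()
    }


# Ten hypothetical families with two records each.
records = [(f"r{i}_{j}", f"fam{i}") for i in range(10) for j in range(2)]
splits = family_split(records)
print({name: len(ids) for name, ids in splits.items()})
```

Because ratios apply to families, the per-record proportions only approximate the requested split when families differ in size.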
Installation
Prerequisites
- Python 3.9+
- CUDA 12.6 (for PyTorch compatibility)
Setup
- Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate   # Linux/Mac
# or: .venv\Scripts\activate   # Windows
- Install dependencies:
pip install -r requirements.txt
- Install the package in development mode:
pip install -e .
Install directly from GitHub
Install the latest version directly from the GitHub repository:
pip install git+https://github.com/ABrain-One/nn-dup.git
Usage
Basic Usage
Run the deduplication pipeline with default settings:
python -m ab.dup.preprocessing --out ./curation_output
Advanced Usage
Filter for specific model families and configure deduplication:
python -m ab.dup.preprocessing \
--out ./curation_output \
--include FractalNet \
--include ResNet \
--min-per-prefix 10 \
--keep-per-family 5 \
--lex-thresh-fractal 0.97 \
--verbose
Command Line Options
- --out: Output directory (default: ./curation_output)
- --include: Prefix filters for model names (repeatable)
- --prefer-prefix-order: Priority order for canonicalization
- --min-per-prefix: Minimum records per prefix after dedup
- --keep-per-family: Maximum exemplars per family in clusters
- --lex-thresh-fractal: Jaccard threshold for FractalNet family
- --topup-prefix: Enable diversity top-up for specific prefixes
- --topup-per-prefix: Maximum top-up records per prefix
- --topup-lex-max: Maximum lexical similarity for top-up
- --topup-struct-max: Maximum structural similarity for top-up
- --dump-accepted-code-dir: Subdirectory for exported code files
- --upweight: Sampling weight rules (PREFIX:FACTOR)
- --verbose: Enable verbose logging
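The top-up options suggest a greedy selection: previously removed candidates for a prefix are re-admitted only if they are not too similar to anything already kept, up to a per-prefix cap. The sketch below illustrates that idea for the lexical limit only, using Jaccard similarity on token sets. The function `top_up`, its arguments, and the greedy order are assumptions for this example and may not match how the package applies `--topup-per-prefix` and `--topup-lex-max`.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def top_up(kept, candidates, per_prefix, lex_max):
    """Return indices of candidates accepted under the similarity cap."""
    pool = list(kept)  # everything a new candidate is compared against
    accepted = []
    for index, candidate in enumerate(candidates):
        if len(accepted) >= per_prefix:
            break
        if all(jaccard(candidate, other) <= lex_max for other in pool):
            accepted.append(index)
            pool.append(candidate)  # later candidates must also differ from this one
    return accepted


kept = [{"a", "b", "c", "d"}]
candidates = [{"a", "b", "c", "e"}, {"x", "y"}, {"x", "y", "z"}]

print(top_up(kept, candidates, per_prefix=5, lex_max=0.5))
print(top_up(kept, candidates, per_prefix=2, lex_max=0.7))
```

With `lex_max=0.5` only the second candidate passes; raising the cap to 0.7 admits more, and `per_prefix=2` then stops the selection after two records.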
ChatPrep: Converting Code to Chat Data
The ChatPrep module converts deduplicated neural network code into structured chat data suitable for training language models. It generates conversational examples with system prompts, user requests, and model responses.
ChatPrep Usage
Python API
from ab.chatprep import ChatPrepConfig

# Basic usage with defaults
config = ChatPrepConfig()
result = config.run()

# Custom configuration
config = ChatPrepConfig(
    accepted_dir="../curation_output/accepted_code",
    out_dir="../curation_output/chat_data",
    seed=123,
    fix_fences=True,
    drop_unparseable=True,
    group_by_source=True,
)
result = config.run()
Command Line Interface
python -m ab.chatprep.cli.main \
--accepted-dir ../curation_output/accepted_code \
--out ../curation_output/chat_data \
--seed 123 \
--fix-fences \
--drop-unparseable \
--group-by-source
ChatPrep Configuration Parameters
- accepted_dir: Directory with accepted code files (default: "curation_output/accepted_code")
- out_dir: Output directory for chat data (default: "curation_output/chat_data")
- no_infill: Disable infill generation (default: False)
- seed: Random seed for reproducibility (default: 42)
- fix_fences: Fix code fences in generated examples (default: True)
- drop_unparseable: Drop unparseable examples (default: True)
- require_module_subclass: Require module subclass structure (default: True)
- write_drop_report: Write report of dropped examples (default: True)
- group_by_source: Group by source for splitting (default: True)
- split_ratios: Train/dev/test ratios (default: (0.8, 0.1, 0.1))
- model_name: Optional model name for rendering (default: None)
- train_name: Training file name (default: "train.jsonl")
- dev_name: Dev file name (default: "dev.jsonl")
- test_name: Test file name (default: "test.jsonl")
ChatPrep Output Files
The ChatPrep pipeline generates:
- train.jsonl: Training examples in JSONL format
- dev.jsonl: Development/validation examples
- test.jsonl: Test examples
- manifest.json: Metadata about the generated dataset
- drop_report.json: Report of examples that were filtered out
Example Chat Data Format
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful AI assistant that generates PyTorch code for computer vision tasks."
    },
    {
      "role": "user",
      "content": "Task: Design a PyTorch CV model for image classification.\nDataset: CIFAR-10 (32x32x3).\nParameters: ≤1M.\nTricks: batch normalization, dropout, data augmentation."
    },
    {
      "role": "assistant",
      "content": "```python\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nclass CIFAR10Model(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)\n        self.bn1 = nn.BatchNorm2d(32)\n        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)\n        self.bn2 = nn.BatchNorm2d(64)\n        self.dropout = nn.Dropout(0.25)\n        self.fc = nn.Linear(64 * 8 * 8, 10)\n\n    def forward(self, x):\n        x = F.relu(self.bn1(self.conv1(x)))\n        x = F.max_pool2d(x, 2)\n        x = F.relu(self.bn2(self.conv2(x)))\n        x = F.max_pool2d(x, 2)\n        x = self.dropout(x)\n        x = x.view(x.size(0), -1)\n        x = self.fc(x)\n        return x\n```"
    }
  ]
}
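A record in this format can be checked before export by extracting the fenced code from the assistant message and confirming that it parses. The sketch below shows one way to do that with the standard library, roughly in the spirit of the `drop_unparseable` option; the helper names are hypothetical and the check is an assumption about what "parseable" means, not the package's validator.

```python
import ast
import json
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)


def assistant_code_parses(record: dict) -> bool:
    """True if the last message is from the assistant and its fenced code parses."""
    messages = record.get("messages", [])
    if not messages or messages[-1].get("role") != "assistant":
        return False
    match = CODE_RE.search(messages[-1].get("content", ""))
    if match is None:
        return False
    try:
        ast.parse(match.group(1))
    except SyntaxError:
        return False
    return True


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line (newlines stay escaped)."""
    return json.dumps(record, ensure_ascii=False)


good = {"messages": [
    {"role": "user", "content": "Write a constant."},
    {"role": "assistant", "content": FENCE + "python\nx = 1\n" + FENCE},
]}
bad = {"messages": [
    {"role": "assistant", "content": FENCE + "python\ndef f(:\n" + FENCE},
]}

print(assistant_code_parses(good), assistant_code_parses(bad))
print(to_jsonl_line(good))
```

Records that fail the check would be candidates for the drop report rather than for train.jsonl.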
Configuration
Key parameters in ab/dup/consts/settings.py:
- SHINGLE_K = 10: Token shingle length
- NUM_PERM = 256: MinHash permutations
- LSH_THRESH = 0.85: LSH retrieval threshold
- JACCARD_THRESH_LEX = 0.90: Lexical similarity threshold
- JACCARD_THRESH_STRUCT = 0.90: Structural similarity threshold
- SPLIT_RATIOS = (0.80, 0.10, 0.10): Train/dev/test ratios
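To show how the shingle length and permutation count interact, the following sketch builds token shingles and a MinHash signature in pure Python and estimates Jaccard similarity from the fraction of matching signature positions. It is a stand-in for illustration only: the package depends on datasketch for MinHash and LSH, and the tokenizer regex, the salted BLAKE2b hashing, and the function names here are assumptions. LSH banding for candidate retrieval is not shown.

```python
import hashlib
import re


def token_shingles(code: str, k: int = 10) -> set:
    """Set of k-token windows; short inputs yield a single shingle."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if not tokens:
        return set()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(shingles: set, num_perm: int = 256) -> list:
    """One minimum per salted hash function, standing in for permutations."""
    if not shingles:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# k=3 is used here only because the demo snippets are very short.
doc_a = token_shingles("x = conv(x) + bias(x)", k=3)
doc_b = token_shingles("x = conv(x) + bias(x)", k=3)
doc_c = token_shingles("return self.fc(out)", k=3)

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimate_jaccard(sig_a, sig_b))
print(estimate_jaccard(sig_a, sig_c))
```

In a pipeline of this kind, a retrieval threshold such as LSH_THRESH would typically shortlist candidate pairs, and a stricter check such as JACCARD_THRESH_LEX would then verify them.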
Output Files
The pipeline generates several output files:
- kept_records.json: Metadata for kept records
- tombstones.json: Metadata for removed records
- splits.json: Train/dev/test assignments
- dedup_report.md: Comprehensive curation report
- accepted_code/: Directory with deduplicated Python files
- sampling_weights.csv: Optional sampling weights
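The optional sampling weights relate to the `--upweight` rules, which take the form PREFIX:FACTOR. As a hedged illustration of how such rules could be turned into per-record weights, the sketch below parses the rules, applies the longest matching prefix to each record name, and renders a CSV. The column names `name` and `weight`, the longest-prefix rule, and the default weight of 1.0 are assumptions; the actual layout of sampling_weights.csv is not documented here.

```python
import csv
import io


def parse_upweight(rules: list) -> dict:
    """Parse 'PREFIX:FACTOR' strings into a prefix -> factor mapping."""
    weights = {}
    for rule in rules:
        prefix, sep, factor = rule.rpartition(":")
        if not sep or not prefix:
            raise ValueError(f"expected PREFIX:FACTOR, got {rule!r}")
        weights[prefix] = float(factor)
    return weights


def weight_for(name: str, weights: dict, default: float = 1.0) -> float:
    """Factor of the longest matching prefix, or the default if none match."""
    matches = [prefix for prefix in weights if name.startswith(prefix)]
    return weights[max(matches, key=len)] if matches else default


def weights_csv(names: list, weights: dict) -> str:
    """Render a two-column CSV (hypothetical columns: name, weight)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "weight"])
    for name in names:
        writer.writerow([name, weight_for(name, weights)])
    return buffer.getvalue()


rules = parse_upweight(["FractalNet:2.0", "ResNet:1.5"])
print(weights_csv(["ResNet-18", "VGG"], rules))
```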
Example Report
# Curation Report (LEMUR API)
## Summary
- Total rows fetched from LEMUR: **115,127**
- Exact duplicates removed: **104,804**
- Lexical near-duplicates removed: **8,939**
- Structural duplicates removed: **320**
- **Kept for training/eval:** **1,064** records
## Parameters
- Shingle length (k): `10`, MinHash permutations: `256`
- Lexical Jaccard verify (generic): `0.9`, (Fractal): `0.97`
- Keep per family (K): `5`, Min per prefix: `1`
- Train/dev/test ratios: `(0.8, 0.1, 0.1)`
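The summary figures in this example report are internally consistent: subtracting the three removal counts from the total gives the kept count. A quick check, using only the numbers shown above:

```python
total = 115_127
removed = {"exact": 104_804, "lexical": 8_939, "structural": 320}

kept = total - sum(removed.values())
print(kept)                    # 1064
print(f"{kept / total:.2%}")   # 0.92%
```

In this run, fewer than 1% of the fetched rows survive curation, with exact duplicates accounting for most of the removals.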
Development
Running Tests
python -m ab.dup.preprocessing --help
Code Quality
pip install -e ".[dev]"
black ab/
isort ab/
flake8 ab/
Dependencies
- nn-dataset>=2.1.0: LEMUR dataset access
- datasketch: MinHash and LSH implementations
- pandas>=1.3,<3.0: Data manipulation
- scipy: Scientific computing
- scikit-learn: Machine learning utilities
License
MIT License - see LICENSE file for details.
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Citation
If you use this pipeline in your research, please cite:
@software{nn_dup_2025,
  title={Neural Network Deduplication Pipeline},
  author={Waleed Khalid},
  year={2025},
  url={https://github.com/ABrain-One/nn-dup}
}
Acknowledgments
- Built for the LEMUR dataset and NNGPT projects
- Developed at the Computer Vision Laboratory, University of Würzburg
- Part of the ABrain One research initiative