
Prefix-aware curation & near-dedup for NN code via MinHash/LSH and AST fingerprints.

Neural Network Deduplication Pipeline (NN Dup)

Short alias: ldup

A data curation and near-deduplication pipeline for neural network code from the LEMUR dataset. It performs prefix-aware exact, lexical (near-duplicate), and structural (AST) deduplication, with an optional diversity top-up stage.

The original version of the NN Dup project was created by Waleed Khalid at the Computer Vision Laboratory, University of Würzburg, Germany.

Overview

This pipeline processes neural network implementations from the LEMUR dataset, performing:

  • Exact deduplication with prefix-aware canonicalization
  • Lexical near-deduplication using MinHash and LSH (see the sketch after this list)
  • Structural deduplication using AST fingerprints
  • Diversity top-up for underrepresented model families
  • Family-aware train/dev/test splits
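
The lexical stage can be pictured with datasketch, which is already a declared dependency. The sketch below is illustrative rather than the pipeline's actual code: it cuts whitespace tokens into k-token shingles, builds a MinHash signature per record, and queries an LSH index for candidate near-duplicates; the constants mirror the defaults listed under Configuration.

    from datasketch import MinHash, MinHashLSH

    SHINGLE_K = 10     # token shingle length
    NUM_PERM = 256     # MinHash permutations
    LSH_THRESH = 0.85  # LSH retrieval threshold

    def minhash_of(code: str) -> MinHash:
        tokens = code.split()  # naive tokenization, for illustration only
        shingles = {" ".join(tokens[i:i + SHINGLE_K])
                    for i in range(max(1, len(tokens) - SHINGLE_K + 1))}
        m = MinHash(num_perm=NUM_PERM)
        for s in shingles:
            m.update(s.encode("utf-8"))
        return m

    lsh = MinHashLSH(threshold=LSH_THRESH, num_perm=NUM_PERM)
    codes = {"rec-1": "def forward(self, x): return self.net(x)",
             "rec-2": "def forward(self, x):   return self.net(x)"}
    for rid, code in codes.items():
        sig = minhash_of(code)
        print(rid, "candidates:", lsh.query(sig))  # retrieved before this record is indexed
        lsh.insert(rid, sig)

The report's "Lexical Jaccard verify" line suggests LSH only retrieves candidates; pairs are then verified against JACCARD_THRESH_LEX before anything is dropped.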

Features

  • Multi-level Deduplication: Exact, lexical (MinHash+LSH), and structural (AST) deduplication (an AST-fingerprint sketch follows this list)
  • Prefix-aware Processing: Maintains representation across different model families
  • Family-aware Splits: Ensures proper train/dev/test separation by model families
  • Diversity Top-up: Intelligently adds diverse samples for underrepresented prefixes
  • Comprehensive Reporting: Detailed statistics and curation reports
  • Code Export: Exports deduplicated code files for further use
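
The structural pass is driven by AST fingerprints. The exact construction is not documented here; one plausible, minimal version hashes the sequence of AST node types, so renaming identifiers or changing literals does not change the fingerprint:

    import ast
    import hashlib

    def ast_fingerprint(code: str) -> str:
        """Hash the node-type sequence of the AST, ignoring names and literal values."""
        tree = ast.parse(code)
        node_types = [type(node).__name__ for node in ast.walk(tree)]
        return hashlib.sha256(" ".join(node_types).encode("utf-8")).hexdigest()

    a = "def f(x):\n    return x + 1\n"
    b = "def g(y):\n    return y + 2\n"  # same structure, different names and literals
    print(ast_fingerprint(a) == ast_fingerprint(b))  # True

Because the configuration exposes a structural Jaccard threshold (JACCARD_THRESH_STRUCT), the real pipeline likely compares sets of sub-tree fingerprints rather than a single whole-file hash; treat the above as a sketch of the idea only.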

Installation

Prerequisites

  • Python 3.9+
  • CUDA 12.6 (for PyTorch compatibility)

Setup

  1. Create and activate a virtual environment:

    python3 -m venv .venv
    source .venv/bin/activate  # Linux/Mac
    # or
    .venv\Scripts\activate     # Windows
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Install the package in development mode:

    pip install -e .
    

Install directly from GitHub

Install the latest version directly from the GitHub repository:

pip install git+https://github.com/ABrain-One/nn-dup.git

Usage

Basic Usage

Run the deduplication pipeline with default settings:

python -m ab.dup.preprocessing --out ./curation_output

Advanced Usage

Filter for specific model families and configure deduplication:

python -m ab.dup.preprocessing \
    --out ./curation_output \
    --include FractalNet \
    --include ResNet \
    --min-per-prefix 10 \
    --keep-per-family 5 \
    --lex-thresh-fractal 0.97 \
    --verbose

Command Line Options

  • --out: Output directory (default: ./curation_output)
  • --include: Prefix filters for model names (repeatable)
  • --prefer-prefix-order: Priority order for canonicalization
  • --min-per-prefix: Minimum records per prefix after dedup
  • --keep-per-family: Maximum exemplars per family in clusters
  • --lex-thresh-fractal: Jaccard threshold for FractalNet family
  • --topup-prefix: Enable diversity top-up for specific prefixes
  • --topup-per-prefix: Maximum top-up records per prefix
  • --topup-lex-max: Maximum lexical similarity for top-up
  • --topup-struct-max: Maximum structural similarity for top-up (see the acceptance sketch below)
  • --dump-accepted-code-dir: Subdirectory for exported code files
  • --upweight: Sampling weight rules (PREFIX:FACTOR)
  • --verbose: Enable verbose logging
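
The top-up options bound how similar an added record may be to what is already kept. Below is a minimal sketch of such an acceptance rule; the function name and the lex_sim/struct_sim helpers are hypothetical, not part of the pipeline's API.

    def accept_topup(candidate, kept_for_prefix, lex_sim, struct_sim,
                     topup_lex_max, topup_struct_max):
        """Accept a top-up candidate only if it stays below both similarity
        ceilings against every record already kept for its prefix (illustrative)."""
        for rec in kept_for_prefix:
            if lex_sim(candidate, rec) > topup_lex_max:
                return False
            if struct_sim(candidate, rec) > topup_struct_max:
                return False
        return True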

Configuration

Key parameters in ab/dup/consts/settings.py:

  • SHINGLE_K = 10: Token shingle length
  • NUM_PERM = 256: MinHash permutations
  • LSH_THRESH = 0.85: LSH retrieval threshold
  • JACCARD_THRESH_LEX = 0.90: Lexical similarity threshold
  • JACCARD_THRESH_STRUCT = 0.90: Structural similarity threshold
  • SPLIT_RATIOS = (0.80, 0.10, 0.10): Train/dev/test split ratios (see the split sketch below)
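
SPLIT_RATIOS feeds the family-aware split: each family lands in exactly one of train/dev/test, so near-identical variants never straddle splits. A minimal sketch of that idea, assuming a caller-supplied family_of grouping function and splitting by family counts rather than record counts:

    import random
    from collections import defaultdict

    SPLIT_RATIOS = (0.80, 0.10, 0.10)  # train, dev, test

    def family_aware_split(records, family_of, seed=0):
        """Shuffle families, then allocate whole families to each split."""
        families = defaultdict(list)
        for rec in records:
            families[family_of(rec)].append(rec)
        keys = sorted(families)
        random.Random(seed).shuffle(keys)
        n_train = round(SPLIT_RATIOS[0] * len(keys))
        n_dev = round(SPLIT_RATIOS[1] * len(keys))
        return {
            "train": [r for k in keys[:n_train] for r in families[k]],
            "dev": [r for k in keys[n_train:n_train + n_dev] for r in families[k]],
            "test": [r for k in keys[n_train + n_dev:] for r in families[k]],
        }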

Output Files

The pipeline generates several output files (a short inspection sketch follows the list):

  • kept_records.json: Metadata for kept records
  • tombstones.json: Metadata for removed records
  • splits.json: Train/dev/test assignments
  • dedup_report.md: Comprehensive curation report
  • accepted_code/: Directory with deduplicated Python files
  • sampling_weights.csv: Optional sampling weights
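
A quick way to inspect a finished run; the layout of splits.json is an assumption here (split name mapped to a list of record identifiers), so adapt it to the actual schema:

    import json
    from pathlib import Path

    out = Path("./curation_output")
    kept = json.loads((out / "kept_records.json").read_text())
    splits = json.loads((out / "splits.json").read_text())

    print("kept records:", len(kept))
    for name, ids in splits.items():  # assumed: {"train": [...], "dev": [...], "test": [...]}
        print(name, len(ids))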

Example Report

# Curation Report (LEMUR API)

## Summary
- Total rows fetched from LEMUR: **115,127**
- Exact duplicates removed: **104,804**
- Lexical near-duplicates removed: **8,939**
- Structural duplicates removed: **320**
- **Kept for training/eval:** **1,064** records

## Parameters
- Shingle length (k): `10`, MinHash permutations: `256`
- Lexical Jaccard verify (generic): `0.9`, (Fractal): `0.97`
- Keep per family (K): `5`, Min per prefix: `1`
- Train/dev/test ratios: `(0.8, 0.1, 0.1)`

Development

Running Tests

Confirm that the CLI entry point loads and prints its help:

python -m ab.dup.preprocessing --help

Code Quality

pip install -e ".[dev]"
black ab/
isort ab/
flake8 ab/

Dependencies

  • nn-dataset>=2.1.0: LEMUR dataset access
  • datasketch: MinHash and LSH implementations
  • pandas>=1.3,<3.0: Data manipulation
  • scipy: Scientific computing
  • scikit-learn: Machine learning utilities

License

MIT License - see LICENSE file for details.

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

Citation

If you use this pipeline in your research, please cite:

@software{nn_dup_2025,
  title={Neural Network Deduplication Pipeline},
  author={Waleed Khalid},
  year={2025},
  url={https://github.com/ABrain-One/nn-dup}
}

Acknowledgments

  • Built for the LEMUR dataset and NNGPT projects
  • Developed at the Computer Vision Laboratory, University of Würzburg
  • Part of the ABrain One research initiative

Download files

Source distribution: ldup-2.1.0.tar.gz (5.6 kB)

Built distribution: ldup-2.1.0-py3-none-any.whl (5.2 kB)

File details

ldup-2.1.0.tar.gz (source distribution, 5.6 kB, uploaded with twine/6.2.0 on CPython/3.13.5, not via Trusted Publishing)

  • SHA256: 9df8cc73c0a6e61c259bf29ebf97ce018420d7e7a01e5d713c435c48a40aad98
  • MD5: 6e18d78ed08287d1496b9397dfd4058f
  • BLAKE2b-256: b9f5cfa2a1db551ebe54658290e298ffc4a4d4e15c190bad5324c40bbc956d12

ldup-2.1.0-py3-none-any.whl (Python 3 wheel, 5.2 kB, uploaded with twine/6.2.0 on CPython/3.13.5, not via Trusted Publishing)

  • SHA256: 175f0513402b19e185cf81453e7758f06d8bc5b41a989a37577a8bd748599498
  • MD5: cd2e98918985c603de9695fc84e0618a
  • BLAKE2b-256: 6c4d146a85ba673a1e50335e7e69f8c6f26fbc2bee68cf989dca6ae9c188db49
