RAID: Rapid Automated Interpretability Datasets tool
RAID is designed to rapidly generate binary and multiclass datasets using regular expressions, AST labels, and semantic concepts. The tool enabled us to perform fine-grained analyses of the concepts a model learns, as reflected in its latent representations.
Features
- Labels of different granularities: tokens, phrases, blocks, other semantic chunks.
- Corresponding activations for tokens, phrases, blocks.
- B-I-O labelling for higher-level semantic concepts (phrase- and block-level chunks)
- Activation aggregation for higher-level semantic concepts (Phrase and Block level chunks): You can generate activations once and experiment with different granularities by aggregating the activations.
- Integration with static analysis tools to create custom labels:
  a. Tree-sitter parsers (Abstract Syntax Tree based labels) - Syntactic
  b. CK metrics (object-oriented metrics/design patterns) - Structural
  c. CFG, DFG, AST nesting/parent-child relationships, etc. - Hierarchical
  d. SE datasets, ontologies, design patterns - Semantic
  e. Regular expressions to create, filter, and edit datasets
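The B-I-O labelling and activation aggregation features above can be illustrated with a minimal sketch. This is not RAID's implementation: the helper names `bio_labels` and `aggregate`, the toy tokens, and the two-dimensional activation vectors are all assumptions made for illustration. The idea is that the first token of a phrase-level span gets a B- tag, the remaining tokens get I- tags, tokens outside any span get O, and the per-token activations inside the span are collapsed into a single vector.

```python
# Illustrative sketch only; names and data are hypothetical, not RAID's API.

def bio_labels(num_tokens, spans):
    """Tag tokens with B-/I-/O given (start, end_exclusive, label) spans."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def aggregate(vectors, method="mean"):
    """Collapse a list of equal-length token vectors into one vector."""
    if method == "concat":
        return [x for vec in vectors for x in vec]
    columns = list(zip(*vectors))  # one tuple per activation dimension
    if method == "mean":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown aggregation method: {method}")


tokens = ["public", "int", "add", "(", "int", "a", ")"]
# Hypothetical phrase-level span: tokens 3..6 form the formal parameters.
spans = [(3, 7, "formal_parameters")]
tags = bio_labels(len(tokens), spans)

# Toy per-token activations (2 dimensions each), one vector per token.
activations = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [1.0, 3.0],
               [3.0, 1.0], [5.0, 5.0], [3.0, 3.0]]
span_vectors = activations[3:7]
phrase_mean = aggregate(span_vectors, "mean")
phrase_max = aggregate(span_vectors, "max")
```

Because the token-level activations are kept, the same span can be re-aggregated with a different method without running the model again, which is the workflow the feature list describes.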
Installation Instructions
Install RAID and its dependencies:
pip install git+https://github.com/arushisharma17/NeuroX.git@fe7ab9c2d8eb1b4b3f93de73b8eaae57a6fc67b7
pip install raid-tool
Usage
Run RAID on a Java file:
raid path/to/your/file.java --model bert-base-uncased --device cpu --binary_filter "set:public,static" --output_prefix output --aggregation_method mean --label class_body
Required Arguments:
input_file: Path to the Java source file to analyze
Optional Arguments:
--model: Transformer model to use (default: 'bert-base-uncased')
--device: Computing device to use ('cpu' or 'cuda', default: 'cpu')
--binary_filter: Filter for token labeling
  - Format: "type:pattern"
  - Types:
    set: Comma-separated list (e.g., "set:public,static")
    re: Regular expression pattern
--output_prefix: Prefix for output files (default: 'output')
--aggregation_method: Method to aggregate activations
  - Options: mean, max, sum, concat (default: mean)
--label: Type of AST label to analyze
  - Options: program, class_declaration, class_body, method_declaration, etc.
--layer: Specific transformer layer to analyze (0-12, default: all layers)
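To make the "type:pattern" format of --binary_filter concrete, here is a minimal sketch of how such a string could be turned into a binary token labeller. The function `parse_binary_filter` is hypothetical and is not taken from RAID's source. Requiring the regular expression to match the whole token is also an assumption, since the documentation above does not say whether RAID uses a full or partial match.

```python
import re

# Illustrative sketch only; parse_binary_filter is a hypothetical helper.

def parse_binary_filter(spec):
    """Turn a "type:pattern" string into a token -> bool predicate."""
    # Split on the first colon only, so patterns may contain colons.
    kind, sep, pattern = spec.partition(":")
    if not sep:
        raise ValueError('expected "type:pattern"')
    if kind == "set":
        members = {item.strip() for item in pattern.split(",")}
        return lambda token: token in members
    if kind == "re":
        compiled = re.compile(pattern)
        # Assumption: the whole token must match the pattern.
        return lambda token: compiled.fullmatch(token) is not None
    raise ValueError(f"unknown filter type: {kind}")


tokens = ["public", "static", "void", "main", "x1"]

# Positive class: tokens in the given set.
in_set = parse_binary_filter("set:public,static")
set_labels = [int(in_set(t)) for t in tokens]

# Positive class: lowercase letters followed by digits.
matches_re = parse_binary_filter(r"re:[a-z]+\d+")
re_labels = [int(matches_re(t)) for t in tokens]
```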
Available Labels
The following labels are supported for the --label parameter:
- program
- class_declaration
- class_body
- method_declaration
- formal_parameters
- block
- method_invocation
- leaves
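To compare several labels on the same file, one option is to build one command line per label. The sketch below only assembles the argument lists from the flags documented above and prints them. The helper `build_raid_command` and the per-label output prefix are assumptions made for illustration, not part of RAID.

```python
import shlex

# Labels listed in this README for the --label parameter.
LABELS = [
    "program", "class_declaration", "class_body", "method_declaration",
    "formal_parameters", "block", "method_invocation", "leaves",
]


def build_raid_command(input_file, label, output_prefix="output"):
    """Assemble a raid invocation for one label (hypothetical helper)."""
    return [
        "raid", input_file,
        "--model", "bert-base-uncased",
        "--device", "cpu",
        "--aggregation_method", "mean",
        "--label", label,
        # Assumption: a distinct prefix per label keeps outputs separate.
        "--output_prefix", f"{output_prefix}_{label}",
    ]


commands = [build_raid_command("path/to/your/file.java", label)
            for label in LABELS]
for cmd in commands:
    print(shlex.join(cmd))
```

With RAID installed, each list could be passed to `subprocess.run(cmd, check=True)` to execute it.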
Link to Colab notebook for tutorial and initial instructions
https://colab.research.google.com/drive/1MfTbOMrZnQ_FkC65CCJyUE4v21u5pJ4G?usp=sharing
Note: Please keep updating readme as you add code.
Features
- Integrate static analysis tools.
- Generate AST nodes and labels
- Extract layer-wise and all-layer activations
- Split phrasal nodes into B-I-O tokens
- Aggregate activations
- Lexical Patterns or features
- Support for multiple languages
- Code Ontologies/HPC Ontologies
- Software engineering datasets
- Hierarchical Properties/Structural Properties
- Add support for NeuroX autoencoders
- Train, validation, test splits for probing tasks support.