HeaderGen: Automated cell header generator

HeaderGen

HeaderGen is a tool that improves the comprehension and navigation of undocumented Python-based Jupyter notebooks by automatically creating a narrative structure in the notebook.

Data scientists build an ML-based solution notebook by first preparing the data, then extracting key features, and then creating and training the model. HeaderGen leverages the implicit narrative structure of an ML notebook to add structural headers as annotations to the notebook.

Install HeaderGen

pip install headergen

Features

  • Automated Markdown Header Insertion: Through a taxonomy for machine-learning operations, HeaderGen annotates code cells with relevant markdown headers.

  • Function Call Taxonomy: Methodically classifies function calls based on a machine-learning operations taxonomy.

  • Advanced Call Graph Analysis: Extends the PyCG framework with flow-sensitivity and external library return-type resolution.

  • Precision in External Libraries: Accurately resolves function return types from external libraries using type stubs.

  • Syntax Pattern Matching: Employs type data for pattern matching.

CLI Usage

generate Command:

Generate the HeaderGen annotated notebook in the current directory. Note that the caches will be created the first time HeaderGen is run.

headergen generate -i /path/to/input.ipynb

To also generate a JSON metadata file that includes various analysis information, use the --json_output or -j flag.

headergen generate -i /path/to/input.ipynb -o /path/to/output/ -j

types Command:

Run type inference on the file and fetch type information.

headergen types -i /path/to/input.ipynb

To generate a JSON file with type information, use the --json_output or -j flag.

headergen types -i /path/to/input.ipynb -o /path/to/output/ -j
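
The generate and types invocations above can also be scripted, for example to process a folder of notebooks. The following is a minimal sketch, assuming the headergen executable is on the PATH; the helper names build_headergen_command and run_on_directory are hypothetical and not part of HeaderGen. Only the subcommands and the -i, -o, and -j flags shown above are used.

```python
import subprocess
from pathlib import Path
from typing import List, Optional, Union

PathLike = Union[str, Path]


def build_headergen_command(
    subcommand: str,
    notebook: PathLike,
    output_dir: Optional[PathLike] = None,
    json_output: bool = False,
) -> List[str]:
    """Assemble an argv list for `headergen generate` or `headergen types`.

    Hypothetical helper: it only arranges the flags documented above.
    """
    if subcommand not in ("generate", "types"):
        raise ValueError(f"unsupported subcommand: {subcommand}")
    cmd = ["headergen", subcommand, "-i", str(notebook)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if json_output:
        cmd.append("-j")
    return cmd


def run_on_directory(notebook_dir: PathLike, output_dir: PathLike) -> None:
    """Run `headergen generate -j` on every .ipynb file in notebook_dir."""
    for notebook in sorted(Path(notebook_dir).glob("*.ipynb")):
        cmd = build_headergen_command("generate", notebook, output_dir, json_output=True)
        # check=True raises CalledProcessError if headergen exits with a non-zero status.
        subprocess.run(cmd, check=True)
```

Keeping the command assembly separate from the subprocess call makes it possible to inspect or log the exact command line before anything is executed.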

server Command:

Start the server with:

headergen server

This will start the Uvicorn server listening on host 0.0.0.0 and port 54068.

get_analysis_notebook Endpoint:

This endpoint returns the analysis of the specified notebook or Python script as a JSON response containing analysis data such as cell_callsites and block_mapping.

Example using curl:

curl "http://0.0.0.0:54068/get_analysis_notebook?file_path=/absolute/path/to/your/file.ipynb"

get_types Endpoint:

This endpoint returns type information for the specified notebook or Python script as a JSON response.

Example using curl:

curl "http://0.0.0.0:54068/get_types?file_path=/absolute/path/to/your/file.ipynb"

generate_annotated_notebook Endpoint:

This endpoint returns the annotated notebook based on the analysis. The response will be a file download.

Example using curl:

curl "http://0.0.0.0:54068/generate_annotated_notebook?file_path=/absolute/path/to/your/file.ipynb" --output annotated_file.ipynb
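
The three endpoints above can also be called from Python. The following is a minimal sketch using only the standard library, assuming the server is running on the host and port given above; the function names endpoint_url, fetch_json, and download_annotated_notebook are hypothetical, and the shape of the JSON responses is not assumed beyond being valid JSON.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://0.0.0.0:54068"  # host and port from the server section above


def endpoint_url(endpoint: str, file_path: str, base_url: str = BASE_URL) -> str:
    """Build a request URL with a percent-encoded file_path query parameter."""
    query = urllib.parse.urlencode({"file_path": file_path})
    return f"{base_url}/{endpoint}?{query}"


def fetch_json(endpoint: str, file_path: str, base_url: str = BASE_URL):
    """Call get_analysis_notebook or get_types and decode the JSON body."""
    with urllib.request.urlopen(endpoint_url(endpoint, file_path, base_url)) as response:
        return json.load(response)


def download_annotated_notebook(
    file_path: str, destination: str, base_url: str = BASE_URL
) -> None:
    """Save the file returned by generate_annotated_notebook to destination."""
    url = endpoint_url("generate_annotated_notebook", file_path, base_url)
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        out.write(response.read())
```

For example, fetch_json("get_types", "/absolute/path/to/your/file.ipynb") corresponds to the get_types curl call above. Encoding the query string matters when the notebook path contains spaces or other reserved characters.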

Folder Structure

  • callsites-jupyternb-micro-benchmark: Micro benchmark
  • callsites-jupyternb-real-world-benchmark: Real-world benchmark
  • evaluation: Contains manual header annotation and user study results
  • framework_models: Function calls to ML Taxonomy mapping
  • typestub-database: Type stubs for ML libraries
  • headergen: Source code of HeaderGen
  • pycg_extended: Source code of extended PyCG
  • headergen-extension: Jupyter notebook plugin for HeaderGen
  • headergen_output: Folder where the generated notebooks from the docker container are stored

1. Build container

  • Get source files

    git clone --recursive
    git submodule update --init --recursive
    git pull --recurse-submodules
    
  • Linux

    docker build -t headergen .
    docker run -v ${PWD}/headergen_output:/headergen_output -it headergen bash
    
  • Windows

    docker build -t headergen .
    docker run -v "%cd%"/headergen_output:/headergen_output -it headergen bash
    

2. Run HeaderGen benchmarks from inside the container

Output from the commands below (annotated notebooks, reports, callsites, headers, and so on) is stored in the local folder headergen_output once the commands finish executing.

  • Micro Benchmark (generates a CSV file with results)

    make ROOT_PATH=/app/HeaderGen microbench
    
  • Real-world Benchmark (generates annotated notebooks and a CSV file that reproduce Table 2 of the paper)

    make ROOT_PATH=/app/HeaderGen realworldbench
    
  • Both Benchmarks

    make ROOT_PATH=/app/HeaderGen all
    
  • Clean generated output

    make clean
    

Building from Source

  • Get source files

    git clone --recursive
    git submodule update --init --recursive
    git pull --recurse-submodules
    
  • Clear the caches if they exist

    rm -f framework_models/models_cache.pickle
    rm -f pycg_extended/machinery/pytd_cache.pickle
    
  • Set up the venv and dependencies with the setup.sh script

    ./setup.sh -i
    
  • Micro Benchmark (generates a CSV file with results)

    make ROOT_PATH=<path to repo root> microbench
    
  • Real-world Benchmark (generates annotated notebooks and a CSV file that reproduce Table 2 of the paper)

    make ROOT_PATH=<path to repo root> realworldbench
    
  • Both Benchmarks

    make ROOT_PATH=<path to repo root> all
    
  • Clean generated output

    make clean
    

This repo contains code for the paper "Enhancing Comprehension and Navigation in Jupyter Notebooks with Static Analysis" published at the SANER Conference 2023.
