Automatic readme generation using language models

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

LARCH: Large Language Model-based Automatic Readme Creation with Heuristics

LARCH is an automatic readme generation system using language models.

Usage

Prerequisite

Install LARCH with pip:

pip install larch-readme

Python CLI

You can then test out generation without setting up a server.

larch --local --model openai/text-davinci-003 --openai-api-key ${YOUR_OPENAI_API_KEY}

Alternatively, you can rely on a server for generation (see the Server section below for how to set one up):

larch --endpoint https://${YOUR_SERVER_ADDRESS} --model openai/text-davinci-003

Server

Start the server.

OPENAI_API_KEY=${MY_API_KEY} larch-server

You can access http://localhost:8000/docs to see the API.

You may want to specify --host ${YOUR_HOST_NAME_OR_IP_ADDRESS} if you intend to access the server from a remote machine.

The OPENAI_API_KEY and ENTRYPOINT_EXTRACTOR environment variables are both optional. Specify OPENAI_API_KEY if you want to allow users to use OpenAI-based models. Specify ENTRYPOINT_EXTRACTOR if you want to use entrypoint-based generation (strongly recommended; the extractor is trained with script/entrypoint_extractor.py).

You can limit which models are loaded with the LOADED_MODELS environment variable (leaving it unset loads all models). You can also load pretrained encoder-decoder models by passing a JSON-serialized mapping from model names to model paths in ENCODER_DECODER_MODEL_PATHS.

# This loads gpt2, gpt2-xl and a pretrained encoder-decoder model from ./path-to-model/
LOADED_MODELS='gpt2,gpt2-xl' ENCODER_DECODER_MODEL_PATHS='{"my-encdec": "./path-to-model/"}' larch-server

# This only loads a pretrained encoder-decoder model. Notice that empty LOADED_MODELS and unset LOADED_MODELS have different behaviors.
LOADED_MODELS='' ENCODER_DECODER_MODEL_PATHS='{"my-encdec": "./path-to-model/"}' larch-server
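
If you launch the server from a Python script, the same two variables can be assembled programmatically. The following is a minimal sketch, not part of LARCH: the helper name build_server_env and its parameters are hypothetical, and it only prepares an environment dictionary following the rules described above (unset LOADED_MODELS versus empty LOADED_MODELS, JSON mapping for encoder-decoder paths).

```python
import json
import os


def build_server_env(loaded_models=None, encdec_paths=None, base_env=None):
    """Return a copy of the environment with LARCH's model variables set.

    Hypothetical helper for illustration only.
    loaded_models=None leaves LOADED_MODELS unset (the server loads all models);
    an empty list sets it to '' (no built-in models are loaded).
    """
    env = dict(os.environ if base_env is None else base_env)
    if loaded_models is None:
        # Unset and empty behave differently, so remove the key entirely.
        env.pop("LOADED_MODELS", None)
    else:
        env["LOADED_MODELS"] = ",".join(loaded_models)
    if encdec_paths:
        # The server expects a JSON object mapping model names to paths.
        env["ENCODER_DECODER_MODEL_PATHS"] = json.dumps(encdec_paths)
    return env


env = build_server_env(
    ["gpt2", "gpt2-xl"], {"my-encdec": "./path-to-model/"}, base_env={}
)
print(env["LOADED_MODELS"])
print(env["ENCODER_DECODER_MODEL_PATHS"])
```

To actually start the server, you could pass the result (built from the real environment, i.e. without base_env={}) to subprocess.run(["larch-server"], env=...).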

A VSCode plugin for interacting with the server is coming soon.

Usage with Docker

Build the Docker image (you need to set up proxy settings appropriately if you are behind a proxy server).

docker build -t larch .

You may need to pass --build-arg CURL_CA_BUNDLE="" if you are behind a proxy and getting an SSL error. WARNING: This disables SSL verification and thus makes your connection vulnerable to attacks.

Then you can start the server with the following command:

docker run \
  --rm \
  -p ${YOUR_HOST_IP}:${PORT}:80/tcp \
  larch

You need to pass -e OPENAI_API_KEY=${YOUR_OPENAI_API_KEY} if you wish to use OpenAI models.

You may need to pass -e CURL_CA_BUNDLE="" if you are behind a proxy and getting an SSL error. WARNING: This disables SSL verification and thus makes your connection vulnerable to attacks.
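
If you start the container from a script, the optional -e flag can be added conditionally. This is a minimal sketch under stated assumptions: the helper docker_run_command is hypothetical, the image name larch matches the build command above, and the sample IP, port and key are placeholders.

```python
import shlex


def docker_run_command(host_ip, port, openai_api_key=None, image="larch"):
    """Assemble the argv for `docker run` (hypothetical helper).

    Maps host_ip:port on the host to port 80 in the container.
    """
    args = ["docker", "run", "--rm", "-p", f"{host_ip}:{port}:80/tcp"]
    if openai_api_key:
        # Only needed if you wish to use OpenAI models.
        args += ["-e", f"OPENAI_API_KEY={openai_api_key}"]
    args.append(image)
    return args


cmd = docker_run_command("127.0.0.1", 8000, openai_api_key="sk-example")
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run(cmd, check=True).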

Development

Alternatively, you can run the CLI without installing via pip, which makes debugging and development easier.

pip install -r requirements.txt
export PYTHONPATH=`pwd`

# test out generation
python larch/cli.py --local --model gpt2

# start debug server
python larch/server.py --reload --log-level debug

For testing:

pip install 'pytest>=7.2.0' 'pytest-dependency>=0.5.1'
export PYTHONPATH=`pwd`
py.test -v tests

Model Training and Evaluation

Training Encoder-Decoder Models

You can train your own Encoder-Decoder Model with scripts/finetune_encdec.py.

# Make sure you have CUDA 11.6 installed
# We install a custom torch build to enable GPU support
pip install torch==1.13.0+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r <(cat requirements.txt | grep -v torch)
pip install -r requirements-dev.txt

export PYTHONPATH=`pwd`

python scripts/finetune_encdec.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --train_file ./path-to-train.jsonl \
    --validation_file ./path-to-dev.jsonl \
    --output_dir ./tmp-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir

Supported models are BART, mBART, T5, mT5 and LED. The T5 checkpoints t5-small, t5-base, t5-large, t5-3b and t5-11b additionally require the argument --source_prefix "summarize: ".
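
The prefix rule above can be expressed as a small wrapper that builds the training command. This is a sketch only: the helper finetune_args is hypothetical, it passes just a subset of the flags from the example command, and it applies the prefix strictly to the five T5 checkpoint names listed above.

```python
# T5 checkpoints that need the extra --source_prefix argument.
T5_CHECKPOINTS = {"t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"}


def finetune_args(model_name, train_file, validation_file, output_dir):
    """Build the argv for scripts/finetune_encdec.py (hypothetical helper)."""
    args = [
        "python", "scripts/finetune_encdec.py",
        "--model_name_or_path", model_name,
        "--do_train", "--do_eval",
        "--train_file", train_file,
        "--validation_file", validation_file,
        "--output_dir", output_dir,
    ]
    if model_name in T5_CHECKPOINTS:
        # Note the trailing space inside the prefix string.
        args += ["--source_prefix", "summarize: "]
    return args


print(finetune_args("t5-small", "./path-to-train.jsonl",
                    "./path-to-dev.jsonl", "./tmp-summarization"))
```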

Project details

Release history

This version

0.0.10

Jun 8, 2023

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

larch_readme-0.0.10-py3-none-any.whl (48.6 kB)

Uploaded Jun 8, 2023 Python 3

Hashes for larch_readme-0.0.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1d0e14a42af260f1177b34072dd52257beb007451f780ffcf5e8028f90ae5fa4`
MD5	`2bfe558b41cc247ad6f09180373dfd23`
BLAKE2b-256	`f746d52404a9f60d3d2ada322a7714369028b47a7afbc8b321e579d69c05ab7f`