
Ditty

A simple fine-tuning library.

What

A very simple library for fine-tuning Hugging Face pretrained AutoModelForCausalLM models such as GPT-NeoX, leveraging Hugging Face Accelerate, Transformers, Datasets, and PEFT.

Ditty supports LoRA, 8-bit quantization, and fp32 CPU offloading out of the box, and by default assumes you are running on a single GPU or distributed across multiple GPUs.

Checkpointing is supported, though there is currently a bug with pushing to the Hugging Face model hub, so checkpoints are local only.

FP16 and BFLOAT16 are now supported.

QLoRA 4-bit is supported as an experimental feature; it requires installing the development branches of accelerate, peft, and transformers, along with the latest bitsandbytes.
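
For reference, installing those development branches typically looks something like the following sketch; the exact branches or pins required may have changed since this was written:

pip install -U git+https://github.com/huggingface/accelerate.git
pip install -U git+https://github.com/huggingface/peft.git
pip install -U git+https://github.com/huggingface/transformers.git
pip install -U bitsandbytes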

What Not

  • Ditty does not support ASICs such as TPUs or Trainium.
  • Ditty does not handle SageMaker.
  • Ditty does not run on the CPU by default.
  • Ditty does not handle evaluation sets or benchmarking; this may or may not change.

Soon

  • Ditty may handle distributed cluster finetuning.
  • Ditty will support DeepSpeed.

Classes

Pipeline

Pipeline is responsible for running the entire show. Simply subclass Pipeline and implement the dataset method for your custom data; it must return a torch.utils.data.DataLoader.

Instantiate with your chosen config and then simply call run.
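
For example, a minimal subclass might look like the following. This is a sketch: the ditty import path and the toy DataLoader are assumptions, and the exact constructor arguments depend on your chosen config.

import torch
from torch.utils.data import DataLoader, TensorDataset

from ditty import Pipeline  # import path assumed


class MyPipeline(Pipeline):
    def dataset(self):
        # Toy data for illustration; a real subclass would load and
        # tokenize its own corpus here.
        input_ids = torch.randint(0, 50_000, (64, 128))
        return DataLoader(TensorDataset(input_ids), batch_size=8)


pipeline = MyPipeline()  # pass your chosen config here
pipeline.run()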

Trainer

Trainer does what its name implies, which is to train the model. You may never need to touch this if you're not interested in customizing the training loop.

Data

Data wraps an HF Dataset and can configure length-grouped sampling and random sampling, as well as handling collation, batching, seeds, removing unused columns, and a few other things.

The primary way of using this class is through the prepare method, which takes a list of operations to perform against the dataset. These are normal operations like map and filter.

Example:

data = Data(
    load_kwargs={"path": self.dataset_name, "name": self.dataset_language},
    tokenizer=self.tokenizer,
    seed=self.seed,
    batch_size=self.batch_size,
    grad_accum=self.grad_accum,
)

# ...

dataloader = data.prepare(
    [
        ("filter", filter_longer, {}),
        ("map", do_something, dict(batched=True, remove_columns=columns)),
        ("map", truncate, {}),
    ]
)

This can be used to great effect when overriding the dataset method in a subclass of Pipeline.
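
Putting the two together, a hypothetical dataset override might look like this; the load_kwargs values and the filter_longer and truncate callables are placeholders carried over from the example above:

class MyPipeline(Pipeline):
    def dataset(self):
        data = Data(
            load_kwargs={"path": self.dataset_name, "name": self.dataset_language},
            tokenizer=self.tokenizer,
            seed=self.seed,
            batch_size=self.batch_size,
            grad_accum=self.grad_accum,
        )
        return data.prepare(
            [
                ("filter", filter_longer, {}),
                ("map", truncate, {}),
            ]
        )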

Setup

pip install ditty

Tips

https://github.com/google/python-fire is a tool for autogenerating CLIs from Python functions, dicts, and objects.

It can be combined with Pipeline to make a very quick CLI for launching your process.
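
A minimal sketch, assuming a MyPipeline subclass like the one above and whatever keyword arguments your config actually takes:

import fire

from my_project import MyPipeline  # hypothetical module containing your subclass


def finetune(seed=42, batch_size=8, grad_accum=4):
    # fire exposes this function's keyword arguments as CLI flags,
    # e.g. `python finetune.py --batch_size=16 --grad_accum=2`.
    pipeline = MyPipeline(seed=seed, batch_size=batch_size, grad_accum=grad_accum)
    pipeline.run()


if __name__ == "__main__":
    fire.Fire(finetune)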

Attribution / Statement of Changes

Portions of this library look to Hugging Face's transformers Trainer class as a reference and in some cases re-implement functions from Trainer, simplified to account only for GPU-based work and an overall narrower supported scope.

This statement is both to fulfill the obligations of the Apache V2 license and because those folks do super cool work; I appreciate all they've done for the community, and it's just right to call this out.

License

Apache V2; see the LICENSE file for the full text.
