Decoding
Composable inference algorithms with LLMs and programmable logic.
Overview
decoding is a library for scaling the inference-time capabilities of LLMs, enabling users to solve difficult problems with ease. The library is built around simple sampling and reranking patterns that accept arbitrary user-defined scoring functions. At its simplest, you write a function that takes a string and returns a float, and we handle the rest. If you'd like to step things up, we provide simple patterns for efficiently constructing powerful algorithms like backtracking Monte Carlo Tree Search variants.
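To make the core pattern concrete, here is a minimal, library-agnostic sketch of the sample-and-rerank loop the library is built around. Everything here is illustrative: `generate` stands in for any LLM sampling call and is not the decoding API.

```python
from collections.abc import Callable


def best_of_n(
    generate: Callable[[str], str],  # stand-in for any LLM sampling call
    score: Callable[[str], float],  # user-defined: string in, float out
    prompt: str,
    n: int = 16,
) -> str:
    """Sample n candidate completions and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)
```

The library's generators elaborate on this same loop with batched generation and richer search structure.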
Why should I care?
Scaling inference thoughtfully is yielding breakthrough performance improvements in the world of LLMs. We are already seeing small models outperform models more than 10x their size by leveraging basic sampling and search strategies, particularly in combination with custom verifiers, scoring functions, and process reward models. Check out this excellent recent presentation by Sasha Rush and Daniel Ritter for more background. decoding makes it effortless to explore this design space and lets researchers iterate quickly on their ideas.
Getting started
Install directly from PyPI:

```bash
python -m pip install decoding
```
See Contributing for how to build the dev and testing environment.
NOTE: decoding depends on vLLM, which means this library can only be built on Linux and, by default, must be run on a GPU. To run on CPU, see the instructions from vLLM.
Documentation
All modules, classes, and functions of our public interface are documented on our website. The docs should be the first stop for questions about the API. More realistic use cases and interesting patterns can be found in the tutorial.
Tutorial
Several examples are provided to give you a taste and help you get started. Check out TUTORIAL.md for a commented walk-through of a few sample design patterns, or go straight to the code and run it yourself in the examples/ directory.
The library's philosophy
The most valuable resource is the researcher's time. The decoding library is designed from the ground up to support experimentation with new ideas easily, flexibly, and quickly, while coordinating all the engineering glue on the backend. We make a few design decisions that support this:
- The library is built with an emphasis on pure functions and immutability. All classes are frozen dataclasses. All functions express typed, composable building blocks that flexibly wrap and coordinate any user-generated code. This lets users get very creative with their design patterns without chasing down complicated bugs (see the sketch after this list).
- When there is a decision point between squeezing out a drop of performance and maintaining flexibility, we maintain flexibility. This may be controversial. When a piece of code is meant to do only one thing, a myriad of optimizations become available; but if your use case doesn't fit that one thing, those optimizations do nothing for you, because you can't use the library at all. We optimize the library to support the broadest range of ideas a researcher can come up with, such as backtracking, resampling, or otherwise modifying in-progress generations.
- We still keep things fast. We use libraries like vLLM under the hood to keep text generation, often the primary bottleneck, fast. We also expose arguments for users to specify when parts of scorers or estimators can run concurrently, and we harness CPU-based parallelism for heavier work like executing LLM-generated code. More detailed profiling over common use cases is coming shortly and will be used to drive future development.
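To illustrate the first point above, immutable building blocks compose by constructing new values rather than mutating old ones. The `Scorer` and `combine` below are a minimal sketch of this style, not the library's actual types:

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorer:
    """Immutable wrapper around a user-defined scoring function."""

    fn: Callable[[str], float]

    def __call__(self, text: str) -> float:
        return self.fn(text)


def combine(*scorers: Scorer) -> Scorer:
    """Compose scorers into a new frozen value; the inputs are never mutated."""
    return Scorer(lambda text: sum(s(text) for s in scorers))


brevity = Scorer(lambda text: -float(len(text)))
cites_source = Scorer(lambda text: 10.0 if "http" in text else 0.0)
reranker = combine(brevity, cites_source)
```

Because each piece is a frozen value, scorers can be shared, nested, and reused across experiments without any risk of hidden state leaking between runs.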
Overall, these design decisions make decoding a great library for R&D. Researchers can flexibly explore the design space of inference algorithms relevant to a task, and once they've arrived at an optimal solution they'd like to scale, they can refactor the relevant bottlenecks and exploit the specific optimizations available for their individual case.
What's next
There are a number of features coming soon to decoding.
Monte Carlo Tree Search
Currently in decoding.experimental we have initial support for a powerful RolloutTreeSearch algorithm. It wraps the TreeSearch interface, enabling rollouts within each sync phase and pushing scores back up the tree for reranking. It is currently designed to work best with process reward models and similar scoring functions, rather than offering the full flexibility available to BestOfN and TreeSearch, which can, e.g., harness grammar constraints or apply sharper scoring functions. Once this interface is finalized and documented examples come together, it will be promoted to the fully supported decoding.generators.
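For intuition, here is a toy sketch of the rollout-and-backpropagate idea, in which a completed rollout's score is pushed back up the path so ancestors can be reranked. This is illustrative only, not the decoding.experimental API:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy search-tree node tracking a running mean of rollout scores."""

    text: str
    children: list["Node"] = field(default_factory=list)
    mean_score: float = 0.0
    visits: int = 0


def backpropagate(path: list[Node], rollout_score: float) -> None:
    # Push the rollout's score back up the tree so every ancestor's
    # running mean reflects the new evidence; reranking reads mean_score.
    for node in path:
        node.visits += 1
        node.mean_score += (rollout_score - node.mean_score) / node.visits
```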
Sequential Monte Carlo / Particle Filtering
HFPPL is a beautiful library for probabilistic programming with large language models, based on the work of Lew et al. (2024) on Sequential Monte Carlo Steering of Large Language Models using Probabilistic Programs. It is on our roadmap to see how their underlying algorithms and infrastructure can be ported to our interface. In the meantime, check them out.
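For intuition about the technique itself, the core SMC step maintains a population of partial generations ("particles") and periodically resamples them in proportion to their weights. The snippet below is a generic sketch of multinomial resampling, not HFPPL's or decoding's interface:

```python
import random


def resample(particles: list[str], weights: list[float]) -> list[str]:
    # Multinomial resampling: high-weight partial generations are duplicated,
    # low-weight ones are dropped, and the population size stays fixed.
    return random.choices(particles, weights=weights, k=len(particles))
```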
Contributing
We welcome community contributions. Open an issue or a pull request if you see any ways to make decoding better.
To get started with development, this library supports an automated build of the dev and testing environment using GNU Make:
```bash
# clone repo and install lib in current env
git clone git@github.com:benlipkin/decoding.git
cd decoding/
make env
```
Before opening a PR, make sure all tests are passing:

```bash
make tests
```
Citation

```bibtex
@misc{decoding2024lipkin,
    author = {Lipkin, Benjamin},
    title = {Decoding: Composable inference algorithms with LLMs and programmable logic.},
    publisher = {GitHub},
    journal = {GitHub},
    howpublished = {\url{https://github.com/benlipkin/decoding}},
    year = 2024,
}
```