Exa - PyTorch

Exa

Boost LLM performance by 300% on everyday GPU hardware, as validated by renowned developers, with just 5 minutes of setup and no additional hardware costs.

Principles

Radical Simplicity (using super-powerful LLMs with as few lines of code as possible)
Ultra-Optimized Performance (high-performance code that extracts all the power from these LLMs)
Fluidity & Shapelessness (plug in, play, and re-architect as you please)

🤝 Schedule a 1-on-1 Session

Book a 1-on-1 Session with Kye, the Creator, to discuss any issues, provide feedback, or explore how we can improve Exa for you.

📦 Installation 📦

You can install the package using pip. Note that the PyPI package name is exxa, while the import name is exa:

pip install exxa

Usage

Inference

Generate text from pretrained models with minimal configuration, straightforward usage, and optional quantization.

Load specified pre-trained models with device flexibility (CPU/CUDA).
Set a default maximum length for the generated sequences.
Choose to quantize model weights for faster inference.
Use a custom configuration for quantization as needed.
Generate text through either a direct call or the run method.
Simple usage for quick text generation based on provided prompts.

from exa import Inference

# Load the model with weight quantization enabled
model = Inference(
    model_id="georgesung/llama2_7b_chat_uncensored",
    quantize=True
)

# Generate text for a prompt
model.run("What is your name")
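The feature list above mentions a default maximum length and generation through either a direct call or the run method. The following is a minimal sketch of that calling pattern only. ToyInference and echo_generate are hypothetical stand-ins invented for illustration; they are not exa's classes and do not load any model.

```python
from typing import Callable, Optional


class ToyInference:
    """Hypothetical stand-in (not exa's Inference) showing the call/run pattern."""

    def __init__(self, generate_fn: Callable[[str, int], str], max_length: int = 20):
        # generate_fn stands in for a real model's generation step.
        self.generate_fn = generate_fn
        self.max_length = max_length  # default cap, counted in words in this toy

    def run(self, prompt: str, max_length: Optional[int] = None) -> str:
        # Fall back to the default length when the caller does not supply one.
        limit = self.max_length if max_length is None else max_length
        return self.generate_fn(prompt, limit)

    def __call__(self, prompt: str, max_length: Optional[int] = None) -> str:
        # A direct call simply delegates to run().
        return self.run(prompt, max_length)


def echo_generate(prompt: str, limit: int) -> str:
    # Dummy "model": repeats the prompt's words, truncated to `limit` words.
    words = (prompt.split() * limit)[:limit]
    return " ".join(words)


model = ToyInference(echo_generate, max_length=6)
print(model.run("What is your name"))
print(model("What is your name"))  # same result as run()
```

Delegating the direct call to run keeps a single code path for generation, so both entry points always agree.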

GPTQ Inference

Efficiently generate text with GPTQ-quantized versions of HuggingFace's pre-trained GPT-like models, using only a few lines of code for instantiation and generation.

Load specified pre-trained models with an option for quantization.
Define custom bit depth for the quantization (default is 4 bits).
Fine-tune quantization parameters using specific datasets.
Set maximum length for generated sequences to maintain consistency.
Tokenize prompts and generate text based on them seamlessly.

# !pip install exxa
from exa import GPTQInference

model_id = "gpt2-medium"

# Quantize to 2 bits, using the C4 dataset to tune the quantization
inference = GPTQInference(
    model_id,
    quantization_config_bits=2,
    max_length=400,
    quantization_config_dataset='c4'
)
output_text = inference.run("The future of AI is")
print(output_text)
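To give a feel for what the bit depth setting trades off, here is a rough back-of-the-envelope sketch of weight storage at different bit depths. The parameter count of about 355 million (roughly the size of gpt2-medium) is an assumption, weight_memory_mb is a hypothetical helper, and the estimate ignores quantization metadata such as scales as well as any layers left unquantized.

```python
def weight_memory_mb(num_params: int, bits: int) -> float:
    """Approximate weight storage in megabytes (10**6 bytes) at a given bit depth."""
    if bits <= 0:
        raise ValueError("bits must be positive")
    return num_params * bits / 8 / 1e6


# Assumption: ~355 million parameters, roughly the size of gpt2-medium.
NUM_PARAMS = 355_000_000

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit weights: ~{weight_memory_mb(NUM_PARAMS, bits):.0f} MB")
```

Halving the bit depth halves the estimated weight storage, which is why lower bit depths are attractive on memory-constrained GPUs, at the cost of precision.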

Quantize

Achieve smaller model sizes and faster inference through a unified interface tailored to HuggingFace's framework; a single class instantiation with a few parameters is all that is needed.

Efficiently quantize HuggingFace's pretrained models with specified bits (default is 4 bits).
Set custom thresholds for quantization for precision management.
Skip specific modules during quantization to protect sensitive model parts.
Offload parts of the model to CPU in FP32 format for GPU memory management.
Specify if model weights are already in FP16 format.
Choose from multiple quantization types like "fp4", "int8", and more.
Option to enable double quantization for more compression.
Verbose logging for a detailed understanding of the quantization process.
Seamlessly push to and load models from the HuggingFace model hub.
Built-in logger initialization tailored for quantization logs.
Log metadata for state and settings introspection.

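from exa import Quantize

# Configure 8-bit quantization with FP32 CPU offload
quantize = Quantize(
    model_id="bigscience/bloom-1b7",
    bits=8,
    enable_fp32_cpu_offload=True,
)

# Load the model with these settings, then push it to and load it from the hub
# (hub repository names cannot contain spaces)
quantize.load_model()
quantize.push_to_hub("my-model")
quantize.load_from_hub("my-model")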
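For readers new to quantization, the idea behind the bits setting can be sketched with plain Python. This is a generic symmetric absmax scheme shown for illustration only; it is an assumption that it resembles what happens under the hood, and real backends typically add refinements such as block-wise scaling and outlier handling. The helper names quantize_absmax and dequantize are hypothetical.

```python
from typing import List, Tuple


def quantize_absmax(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric absmax quantization: map floats to signed integers of `bits` width."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q}, max error={worst:.4f}")
```

Fewer bits mean a coarser grid of representable values, so the reconstruction error grows as the bit depth shrinks; this is the accuracy cost that accompanies the smaller model size.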

🎉 Features 🎉

World-Class Quantization: Get the most out of your models with top-tier performance and preserved accuracy! 🏋️‍♂️
Automated PEFT: Simplify your workflow! Let our toolkit handle the optimizations. 🛠️
LoRA Configuration: Dive into the potential of flexible LoRA configurations, a game-changer for performance! 🌌
Seamless Integration: Designed to work seamlessly with popular models like LLAMA, Falcon, and more! 🤖

💌 Feedback & Contributions 💌

We're excited about the journey ahead and would love to have you with us! For feedback, suggestions, or contributions, feel free to open an issue or a pull request. Let's shape the future of fine-tuning together! 🌱

Check out our project board for our current backlog and the features we're implementing.

Benchmarks

We benchmark against the metrics tracked by the 🤗 LLM-Perf Leaderboard 🏋️ benchmarks.

Metrics

Backend 🏭
Dtype 📥
Optimizations 🛠️
Quantization 🗜️
Class 🏋️
Type 🤗
Memory (MB) ⬇️
Throughput (tokens/s) ⬆️
Energy (tokens/kWh) ⬇️
Best Score (%) ⬆️
Best Scored LLM 🏆
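As a rough illustration of how the throughput and energy metrics above could be derived from raw measurements, here is a small sketch. RunStats and timed are hypothetical helpers, not part of exa or the leaderboard tooling, and the numbers are illustrative rather than measurements of any real model. Energy is assumed to be measured by an external tool.

```python
import time
from dataclasses import dataclass


@dataclass
class RunStats:
    tokens: int        # number of generated tokens
    seconds: float     # wall-clock time spent generating
    energy_kwh: float  # energy drawn during the run (measured externally)

    @property
    def throughput(self) -> float:
        """Tokens generated per second."""
        return self.tokens / self.seconds

    @property
    def tokens_per_kwh(self) -> float:
        """Tokens generated per kilowatt-hour."""
        return self.tokens / self.energy_kwh


def timed(generate, prompt: str):
    """Run `generate(prompt)` and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = generate(prompt)
    return result, time.perf_counter() - start


# Illustrative numbers only, not measurements of any real model.
stats = RunStats(tokens=400, seconds=8.0, energy_kwh=0.0005)
print(f"throughput: {stats.throughput:.1f} tokens/s")
print(f"efficiency: {stats.tokens_per_kwh:.0f} tokens/kWh")
```

The timed helper can wrap any generation callable to obtain the seconds value, for example a function that calls a model's run method.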

License

MIT

Todo

Set up utility logger classes for metric logging with useful metadata such as tokens per second, latency, and memory consumption
Add CUDA C++ extensions for radically optimized classes for high-performance quantization and inference on the edge

Release history

0.6.4

Apr 5, 2024

0.6.3

Mar 29, 2024

0.6.2

Mar 29, 2024

0.6.0

Mar 24, 2024

0.5.9

Mar 24, 2024

0.5.8

Mar 21, 2024

0.5.7

Mar 21, 2024

0.5.6

Mar 21, 2024

0.5.5

Sep 23, 2023

0.5.3

Sep 22, 2023

0.5.2

Sep 22, 2023

0.5.1

Sep 22, 2023

0.4.9

Sep 22, 2023

0.4.8

Sep 22, 2023

This version

0.4.7

Sep 22, 2023

0.4.6

Sep 22, 2023

0.4.5

Sep 22, 2023

0.4.4

Sep 22, 2023

0.4.3

Sep 22, 2023

0.4.2

Sep 20, 2023

0.4.1

Sep 20, 2023

0.4.0

Sep 20, 2023

0.3.9

Sep 20, 2023

0.3.8

Sep 20, 2023

0.3.7

Sep 20, 2023

0.3.6

Sep 20, 2023

0.3.5

Sep 20, 2023

0.3.4

Sep 20, 2023

0.3.3

Sep 20, 2023

0.3.1

Sep 20, 2023

0.3.0

Sep 20, 2023

0.2.9

Sep 14, 2023

0.2.8

Sep 14, 2023

0.2.7

Sep 14, 2023

0.2.6

Sep 14, 2023

0.2.5

Sep 14, 2023

0.2.4

Sep 14, 2023

0.2.3

Sep 14, 2023

0.2.2

Sep 14, 2023

0.2.1

Sep 14, 2023

0.2.0

Sep 14, 2023

0.1.8

Sep 14, 2023

0.1.7

Sep 14, 2023

0.1.6

Sep 14, 2023

0.1.5

Sep 14, 2023

0.1.4

Sep 14, 2023

0.1.3

Sep 14, 2023

0.1.2

Sep 14, 2023

0.1.1

Sep 14, 2023

0.1.0

Sep 14, 2023

0.0.9

Sep 14, 2023

0.0.8

Sep 14, 2023

0.0.7

Sep 14, 2023

0.0.5

Sep 14, 2023

0.0.4

Sep 14, 2023

0.0.3

Sep 14, 2023

0.0.2

Sep 14, 2023

0.0.1

Sep 5, 2023

Download files

Source Distribution

exxa-0.4.7.tar.gz (22.2 kB)

Uploaded Sep 22, 2023 Source

Built Distribution

exxa-0.4.7-py3-none-any.whl (24.8 kB)

Uploaded Sep 22, 2023 Python 3

File details

Details for the file exxa-0.4.7.tar.gz.

File metadata

Download URL: exxa-0.4.7.tar.gz
Upload date: Sep 22, 2023
Size: 22.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.3.2 CPython/3.11.0 Darwin/22.4.0

File hashes

Hashes for exxa-0.4.7.tar.gz
Algorithm	Hash digest
SHA256	`2942e73232605b584be087d51d168c1c89ba7083d9fb6349fca13ac1611df000`
MD5	`07e37539d5c87e99624be6a4dfc2ce5c`
BLAKE2b-256	`80dec2b4829ce207ff28d58818d35f05d83468d2da499eedd0c07dec2a1a50a0`

File details

Details for the file exxa-0.4.7-py3-none-any.whl.

File metadata

Download URL: exxa-0.4.7-py3-none-any.whl
Upload date: Sep 22, 2023
Size: 24.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/1.3.2 CPython/3.11.0 Darwin/22.4.0

File hashes

Hashes for exxa-0.4.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`2cf3faf2e015d8ca24795732a1a66b9bc1f79f76383b314e1948ccf49a478d7b`
MD5	`b3a67bdc51ea18f819e952b671f8eb48`
BLAKE2b-256	`72939374ad893a72fc12efaeb890ec08553d3ace9763317db0f0f07defae9e4e`
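The published digests can be used to check a downloaded file before installing it. Below is a minimal sketch using Python's standard hashlib module. The helper names sha256_of and matches are illustrative, and the script assumes the files were downloaded into the current working directory; files that are not present are skipped.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with the published value."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 values copied from the hash tables above.
EXPECTED = {
    "exxa-0.4.7.tar.gz": "2942e73232605b584be087d51d168c1c89ba7083d9fb6349fca13ac1611df000",
    "exxa-0.4.7-py3-none-any.whl": "2cf3faf2e015d8ca24795732a1a66b9bc1f79f76383b314e1948ccf49a478d7b",
}

for name, expected in EXPECTED.items():
    path = Path(name)  # assumes the file sits in the working directory
    if path.exists():
        print(name, "OK" if matches(path, expected) else "MISMATCH")
    else:
        print(name, "not found, skipping")
```

Reading in chunks keeps memory use flat regardless of file size, which matters little for these small archives but makes the helper reusable for larger downloads.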