SciPhi [ΨΦ]: A Framework for LLM Powered Data
SciPhi is a Python-based framework designed to facilitate the generation of high-quality synthetic data tailored for both Large Language Models (LLMs) and human users. This suite offers:
- Configurable Data Generation: Craft datasets mediated by LLMs according to your specifications.
- Retrieval-Augmented Generation (RAG) Integration: Use the integrated RAG Provider API to ground your generated data in real-world datasets. SciPhi also bundles an evaluation harness to help you optimize your RAG workflow.
- Textbook Generation Module: A module to power the generation of RAG-augmented synthetic textbooks straight from a given table of contents.
Fast Setup
pip install sciphi
Optional Dependencies
Install specific optional support using extras:
- Anthropic: pip install 'sciphi[anthropic_support]'
- HF (includes Torch): pip install 'sciphi[hf_support]'
- Llama-CPP: pip install 'sciphi[llama_cpp_support]'
- Llama-Index: pip install 'sciphi[llama_index_support]'
- vLLM (includes Torch): pip install 'sciphi[vllm_support]'
- All (no vLLM): pip install 'sciphi[all]'
- All (with extras, e.g. vLLM): pip install 'sciphi[all_with_extras]'
Recommended (all optional dependencies):
pip install 'sciphi[all_with_extras]'
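To confirm the installation resolved correctly, a quick import check; LLMInterfaceManager is the entry point used in the provider example later on this page:

# Sanity check: these imports should succeed after installation.
import sciphi  # noqa: F401
from sciphi.interface import LLMInterfaceManager  # noqa: F401

print("sciphi installed OK")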
Set Up Your Environment:
Navigate to your working directory and use a text editor to create or adjust a .env file with your specific configurations, for example:
# Proprietary Providers
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
# Open Source Providers
HF_TOKEN=your_huggingface_token
# vLLM
VLLM_API_KEY=your_vllm_token
# RAG Provider Settings
RAG_API_BASE=your_rag_server_base_url
RAG_API_KEY=your_rag_server_key
After entering your settings, ensure you save and exit the file.
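SciPhi reads these keys from the environment at runtime (the provider example under Local Development falls back to os.environ). A minimal sanity check using python-dotenv, a core dependency, assuming the .env file sits in your current working directory:

import os

from dotenv import load_dotenv

# Load variables from ./.env into the process environment.
load_dotenv()

# Spot-check the keys you expect to use; adjust this list to your providers.
for key in ("OPENAI_API_KEY", "RAG_API_BASE", "RAG_API_KEY"):
    print(key, "is set" if os.environ.get(key) else "is MISSING")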
Features
Textbook Generation
This module is an effort to democratize access to top-tier textbooks, and it can readily be extended to other domains, such as internal commercial documents.
Generating Textbooks
- Dry Run:
python -m sciphi.scripts.textbook_generator dry_run --toc_dir=sciphi/data/sample/table_of_contents --rag-enabled=False
This performs a dry run over the default textbooks stored in sciphi/data/sample/textbooks. Note that this must be run from the root of the repository; otherwise, update toc_dir accordingly. Setting rag-enabled to True will enable RAG augmentation during the generation process. You may customize the RAG provider through additional arguments.
- Textbook Generation:
python -m sciphi.scripts.textbook_generator run --toc_dir=sciphi/data/sample/table_of_contents --rag-enabled=False
Replace dry_run above with run to generate one textbook for each table of contents in the target directory. See a sample textbook here.
- Example With a Custom Table of Contents:
Prepare your table of contents and save it to $PWD/toc/test.yaml. Then run the following command:
python -m sciphi.scripts.textbook_generator run --toc_dir=toc --output_dir=books --data_dir=$PWD
For help with formatting your table of contents, see here; a hypothetical sketch also follows this section.
- Custom Settings & RAG Functionality:
Simply switch rag-enabled to True. Ensure you have the right .env variables set up, or provide CLI values for rag_api_base and rag_api_key. Alternatively, you may provide your own custom settings in a YAML file. See the default settings configuration here.
Important: To make the most of grounding your data with Wikipedia, ensure your system matches our detailed specifications. An example RAG provider can be seen here. More high-quality output textbooks are available here.
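The exact table-of-contents schema is covered in the formatting guide referenced above. As a purely hypothetical sketch (the textbook/chapters/sections field names below are illustrative, not the documented format), here is how you might write test.yaml with pyyaml, a core dependency:

import os

import yaml

# Hypothetical TOC layout -- consult the formatting guide for the real schema.
toc = {
    "textbook": "Introduction to Plasma Physics",
    "chapters": [
        {
            "title": "Single-Particle Motion",
            "sections": ["Gyromotion", "The E x B Drift"],
        },
    ],
}

os.makedirs("toc", exist_ok=True)
with open("toc/test.yaml", "w") as f:
    yaml.safe_dump(toc, f, sort_keys=False)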
RAG Eval Harness
To measure the efficacy of your RAG pipeline, we provide a unique RAG evaluation harness.
Running the RAG Harness
python -m sciphi.scripts.rag_harness --n-samples=100 --rag-enabled=True --evals_to_run="science_multiple_choice"
This example runs over 100 science multiple choice questions with RAG enabled and reports the final accuracy.
Local Development
Local setup
- Clone the Repository:
Begin by cloning the repository and stepping into the project directory:
git clone https://github.com/emrgnt-cmplxty/sciphi.git
cd sciphi
- Install the Dependencies:
Start by installing the primary requirements:
pip install -r requirements.txt
If you require further functionality, consider the following:
- For the developer's toolkit and utilities:
pip install -r requirements_dev.txt
- To encompass all optional dependencies:
pip install -r requirements_all.txt
Alternatively, to manage packages using Poetry:
poetry install
And for optional dependencies with Poetry, choose one extras group:
poetry install -E all
poetry install -E all_with_extras
Example - Create your own LLM and RAG provider:
import os

from sciphi.core import LLMProviderName, RAGProviderName
from sciphi.interface import LLMInterfaceManager, RAGInterfaceManager

# The llm_*, rag_*, and kwargs values below come from your own CLI
# arguments or configuration.
llm_interface = LLMInterfaceManager.get_interface_from_args(
    provider_name=LLMProviderName(llm_provider_name),
    model_name=llm_model_name,
    # Additional args
    max_tokens_to_sample=llm_max_tokens_to_sample,
    temperature=llm_temperature,
    top_k=llm_top_k,
    # Used for re-routing requests to a remote vLLM server
    server_base=kwargs.get("llm_server_base", None),
)

rag_interface = (
    RAGInterfaceManager.get_interface_from_args(
        provider_name=RAGProviderName(rag_provider_name),
        base=rag_api_base or os.environ.get("RAG_API_BASE"),
        token=rag_api_key or os.environ.get("RAG_API_KEY"),
        max_context=rag_max_context,
        top_k=rag_top_k,
    )
    if rag_enabled
    else None
)

# ... Continue ...
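As a hedged follow-on, here is one way the interface above might be driven; the get_completion call is an assumption about the API, so verify it against sciphi.interface in your installed version:

# Hypothetical usage -- the get_completion method name is an assumption,
# not documented here; confirm against your installed sciphi version.
completion = llm_interface.get_completion(
    "Explain retrieval-augmented generation in one paragraph."
)
print(completion)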
Supported LLM providers include OpenAI, Anthropic, HuggingFace, and vLLM. For RAG database access, configure your own, or get access to the SciPhi gigaRAG API.
System Requirements
Essential Packages:
- Python Version: >=3.9,<3.12
- Required Libraries:
  - bs4: ^0.0.1
  - fire: ^0.5.0
  - openai: 0.27.8
  - pandas: ^2.1.0
  - python-dotenv: ^1.0.0
  - pyyaml: ^6.0.1
  - retrying: ^1.3.4
  - sentencepiece: ^0.1.99
  - torch: ^2.1.0
  - tiktoken: ^0.5.1
  - tqdm: ^4.66.1
Supplementary Packages:
- Anthropic Integration: anthropic ^0.3.10
- Hugging Face Tools: accelerate ^0.23.0, datasets ^2.14.5, transformers ^4.33.1
- Llama-Index: llama-index ^0.8.29.post1
- Llama-CPP: llama-cpp-python ^0.2.11
- vLLM Tools: vllm 0.2.0
Licensing and Acknowledgment
This project is licensed under the Apache-2.0 License.
Citing Our Work
If SciPhi plays a role in your research, we kindly ask you to acknowledge us with the following citation:
@software{SciPhi,
author = {Colegrove, Owen},
doi = {Pending},
month = {09},
title = {{SciPhi: A Framework for LLM Powered Data}},
url = {https://github.com/sciphi-ai/sciphi},
year = {2023}
}