Financial datasets for LLMs
Financial Datasets 🧪
Financial Datasets is an open-source Python library that lets you create question-and-answer financial datasets using Large Language Models (LLMs). With this library, you can generate realistic financial datasets from 10-K and 10-Q filings, PDFs, and other financial texts.
Usage
Example generated dataset:
[
  {
    "question": "What was Airbnb's revenue in 2023?",
    "answer": "$9.9 billion",
    "context": "In 2023, revenue increased by 18% to $9.9 billion compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked of 54.5 million combined with higher average daily rates driving a 16% increase in Gross Booking Value of $10.0 billion."
  },
  {
    "question": "By what percentage did Airbnb's net income increase in 2023 compared to the prior year?",
    "answer": "153%",
    "context": "Net income in 2023 increased by 153% to $4.8 billion, compared to the prior year, driven by our revenue growth, increased interest income, discipline in managing our cost structure, and the release of a portion of our valuation allowance on deferred tax assets of $2.9 billion."
  }
]
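Since every item carries the same three fields, a generated dataset can be sanity-checked before it is used downstream. The following is a minimal sketch, not part of the library: `find_problems` and `REQUIRED_KEYS` are hypothetical names, the contexts are shortened versions of the ones above, and the "answer appears verbatim in the context" rule is only a heuristic that happens to hold for this example.

```python
import json

# The three fields shown in the example dataset above.
REQUIRED_KEYS = {"question", "answer", "context"}


def find_problems(items):
    """Return a list of human-readable problems found in the dataset items."""
    problems = []
    for index, item in enumerate(items):
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"item {index}: missing keys {sorted(missing)}")
            continue
        # Heuristic only: a paraphrased answer would be flagged here.
        if item["answer"] not in item["context"]:
            problems.append(f"item {index}: answer not found verbatim in context")
    return problems


# Shortened copies of the two example items (contexts truncated for brevity).
raw = """
[
  {"question": "What was Airbnb's revenue in 2023?",
   "answer": "$9.9 billion",
   "context": "In 2023, revenue increased by 18% to $9.9 billion compared to 2022."},
  {"question": "By what percentage did Airbnb's net income increase in 2023 compared to the prior year?",
   "answer": "153%",
   "context": "Net income in 2023 increased by 153% to $4.8 billion, compared to the prior year."}
]
"""
items = json.loads(raw)
print(find_problems(items))
```

Items that fail the verbatim check are not necessarily wrong, so it is probably better to review them by hand than to drop them automatically.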
Example #1 - generate from any text
Most flexible option. Generates a dataset from a list of strings (texts). Colab code example here.
from financial_datasets.generator import DatasetGenerator
# Your list of texts
texts = ...
# Create dataset generator
generator = DatasetGenerator(model="gpt-4-0125-preview", api_key="your-openai-key")
# Generate dataset from texts
dataset = generator.generate_from_texts(
    texts=texts,
    max_questions=100,
)
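The example leaves `texts` open. One way to build it from a single long document is to pack paragraphs into chunks of bounded size. This is a minimal sketch under stated assumptions: `split_into_chunks` and `max_chars` are hypothetical names, not part of the library, and the library may apply its own chunking internally.

```python
def split_into_chunks(text, max_chars=2000):
    """Split text on blank lines, then pack paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Either the paragraph still fits, or it starts a new chunk.
            current = candidate
        else:
            # Close the current chunk and start a new one with this paragraph.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


document = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
texts = split_into_chunks(document, max_chars=40)
print(len(texts))
```

The resulting list can then be passed as `texts` to `generate_from_texts`.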
Example #2 - generate from PDF
Generates a dataset from a PDF URL only. Colab code example here.
from financial_datasets.generator import DatasetGenerator
# Create dataset generator
generator = DatasetGenerator(model="gpt-4-0125-preview", api_key="your-openai-key")
# Generate dataset from PDF url
dataset = generator.generate_from_pdf(
    url="https://www.berkshirehathaway.com/letters/2023ltr.pdf",
    max_questions=100,
)
Example #3 - generate from 10-K
Generates a dataset from a ticker and year. Colab code example here.
from financial_datasets.generator import DatasetGenerator
# Create dataset generator
generator = DatasetGenerator(model="gpt-4-0125-preview", api_key="your-openai-key")
# Generate dataset from 10-K
dataset = generator.generate_from_10K(
    ticker="AAPL",
    year=2023,
    max_questions=100,
    item_names=["Item 1A", "Item 7"],  # optional
)
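All three examples pass the API key as a string literal. In practice it is usually safer to read it from the environment so that it never lands in source control. The sketch below assumes the key is stored in an environment variable named `OPENAI_API_KEY`. That name is a common convention, not something this library requires, and `load_api_key` is a hypothetical helper.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is unset or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage with the generator from the examples above
# (requires the library to be installed and a real key to be set):
# from financial_datasets.generator import DatasetGenerator
# generator = DatasetGenerator(model="gpt-4-0125-preview", api_key=load_api_key())
```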
Installation
Using pip
You can install the Financial Datasets library using pip:
pip install financial-datasets
Using Poetry
If you prefer to use Poetry for dependency management, you can add Financial Datasets to your project:
poetry add financial-datasets
From the Repository
If you want to install the library directly from the repository, follow these steps:
- Clone the repository:
  git clone https://github.com/yourusername/financial-datasets.git
- Navigate to the project directory:
  cd financial-datasets
- Install the dependencies using Poetry:
  poetry install
- You can now use the library in your Python projects.
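Whichever installation route you take, a quick way to confirm that the package is visible to your interpreter is to query its installed version. This sketch uses only the standard library. `installed_version` is a hypothetical helper, and the distribution name is taken from the `pip install` command above.

```python
from importlib import metadata


def installed_version(distribution="financial-datasets"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("financial-datasets is not installed in this environment")
else:
    print(f"financial-datasets {version} is installed")
```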
Contributing
Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
License
This project is licensed under the MIT License.
Hashes for financial_datasets-0.1.14.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 11ad7b74cfad3359b2cb33259cd72568b414b8ab241994574d926b44e48d92be
MD5 | 5ef4fa3b47de1159ef5d22cc4b8cb2d7
BLAKE2b-256 | 2f212a239243ac09b382534a8e7b4ceeaa8b104615df79ba32e652548966103d
Hashes for financial_datasets-0.1.14-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 405ba67342ba70de1fc2e9faa555c40e3d4c4491d57df879616481eb63ffb88b
MD5 | bfb2f3c63fb8c0d68290e322110ab6dc
BLAKE2b-256 | 6888a94956d786534ea414987f2676875328218b8ea74a46791dacf97f465536
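If you download a distribution file manually, you can compare its SHA256 digest against the value listed above. The following is a minimal sketch using the standard library. `sha256_of_file` is a hypothetical helper, and the local file path is an assumption about where the wheel was saved.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# SHA256 listed above for the 0.1.14 wheel.
EXPECTED = "405ba67342ba70de1fc2e9faa555c40e3d4c4491d57df879616481eb63ffb88b"

# Assumed location of the downloaded wheel; adjust to your own path.
wheel = Path("financial_datasets-0.1.14-py3-none-any.whl")
if wheel.exists():
    print("match" if sha256_of_file(wheel) == EXPECTED else "MISMATCH")
```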