Financial datasets for LLMs
Financial Datasets 🧪
Financial Datasets is an open-source Python library that lets you create question-and-answer financial datasets using Large Language Models (LLMs). With this library, you can generate realistic financial datasets from 10-K and 10-Q filings, PDFs, and other financial texts.
Usage
Example generated dataset:
[
  {
    "question": "What was Airbnb's revenue in 2023?",
    "answer": "$9.9 billion",
    "context": "In 2023, revenue increased by 18% to $9.9 billion compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked of 54.5 million combined with higher average daily rates driving a 16% increase in Gross Booking Value of $10.0 billion."
  },
  {
    "question": "By what percentage did Airbnb's net income increase in 2023 compared to the prior year?",
    "answer": "153%",
    "context": "Net income in 2023 increased by 153% to $4.8 billion, compared to the prior year, driven by our revenue growth, increased interest income, discipline in managing our cost structure, and the release of a portion of our valuation allowance on deferred tax assets of $2.9 billion."
  }
]
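Each record in the example above carries three fields: question, answer, and context. The sketch below shows one way to load such a JSON array and check that every record has those fields. `QAItem` and `parse_items` are hypothetical helper names introduced for illustration; they are not part of the library, and the library's own dataset type may look different.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QAItem:
    """One question/answer/context record (hypothetical helper type)."""
    question: str
    answer: str
    context: str


def parse_items(raw_json: str) -> List[QAItem]:
    """Parse a JSON array of records and check each has the three fields."""
    records = json.loads(raw_json)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    items = []
    for index, record in enumerate(records):
        missing = {"question", "answer", "context"} - set(record)
        if missing:
            raise ValueError(f"record {index} is missing fields: {sorted(missing)}")
        items.append(QAItem(record["question"], record["answer"], record["context"]))
    return items


# The first record from the example dataset, serialized back to JSON.
sample = json.dumps([
    {
        "question": "What was Airbnb's revenue in 2023?",
        "answer": "$9.9 billion",
        "context": "In 2023, revenue increased by 18% to $9.9 billion compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked of 54.5 million combined with higher average daily rates driving a 16% increase in Gross Booking Value of $10.0 billion.",
    }
])
items = parse_items(sample)
print(items[0].answer)  # $9.9 billion
```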
Example #1 - generate from any text
Most flexible option. Generates a dataset from a list of text strings. Colab code example here.
from financial_datasets.generator import DatasetGenerator
# Your list of texts
texts = ...
# Create dataset generator
generator = DatasetGenerator(model="gpt-4o", api_key="your-openai-key")
# Generate dataset from texts
dataset = generator.generate_from_texts(
    texts=texts,
    max_questions=100,
)
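The example leaves the contents of `texts` up to you. If you are starting from one long document, one possible way to produce a list of strings is to group its paragraphs into size-limited chunks. This is only a sketch: `split_into_texts` and its `max_chars` default are hypothetical and not part of the library, and the right chunk size depends on your model and material.

```python
from typing import List


def split_into_texts(document: str, max_chars: int = 2000) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding the paragraph would overflow: close the chunk, start another.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

The returned list can then be passed as the `texts` argument of `generate_from_texts`.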
Example #2 - generate from PDF
Generates a dataset from a PDF URL only. Colab code example here.
from financial_datasets.generator import DatasetGenerator
# Create dataset generator
generator = DatasetGenerator(model="gpt-4o", api_key="your-openai-key")
# Generate dataset from PDF url
dataset = generator.generate_from_pdf(
    url="https://www.berkshirehathaway.com/letters/2023ltr.pdf",
    max_questions=100,
)
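The examples pass the API key as a string literal. To keep the key out of source control, you might read it from an environment variable instead. The sketch below assumes the variable is named `OPENAI_API_KEY`, which is a common convention but an assumption here; `require_api_key` is a hypothetical helper, not a library function.

```python
import os


def require_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in env_var, or raise if it is unset or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before generating a dataset."
        )
    return key


# Usage with the constructor shown in the examples:
# generator = DatasetGenerator(model="gpt-4o", api_key=require_api_key())
```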
Example #3 - generate from 10-K
Generates a dataset from a ticker and year. Colab code example here.
from financial_datasets.generator import DatasetGenerator
# Create dataset generator
generator = DatasetGenerator(model="gpt-4o", api_key="your-openai-key")
# Generate dataset from 10-K
dataset = generator.generate_from_10K(
    ticker="AAPL",
    year=2023,
    max_questions=100,
    item_names=["Item 1", "Item 7"],  # optional - specify Item names to use
)
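To cover several filings, the same call can be repeated per ticker and year. The following is a minimal sketch that relies only on the `generate_from_10K(ticker=..., year=..., max_questions=...)` call shown above; `generate_for_filings` is a hypothetical wrapper, and it takes the generator as a parameter so it makes no assumption about what the returned dataset object looks like.

```python
from typing import Any, Dict, Iterable, Tuple


def generate_for_filings(
    generator: Any,
    filings: Iterable[Tuple[str, int]],
    max_questions: int = 100,
) -> Dict[Tuple[str, int], Any]:
    """Call generator.generate_from_10K once per (ticker, year) pair.

    Returns a dict keyed by (ticker, year), in the order the pairs were given.
    """
    results: Dict[Tuple[str, int], Any] = {}
    for ticker, year in filings:
        results[(ticker, year)] = generator.generate_from_10K(
            ticker=ticker,
            year=year,
            max_questions=max_questions,
        )
    return results


# Usage with the generator created in the example above:
# datasets = generate_for_filings(generator, [("AAPL", 2023), ("AAPL", 2022)])
```

Each pair triggers a separate generation run, so it may be worth starting with a small `max_questions` value.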
Installation
Using pip
You can install the Financial Datasets library using pip:
pip install financial-datasets
Using Poetry
If you prefer to use Poetry for dependency management, you can add Financial Datasets to your project:
poetry add financial-datasets
From the Repository
If you want to install the library directly from the repository, follow these steps:
- Clone the repository:
  git clone https://github.com/virattt/financial-datasets.git
- Navigate to the project directory:
  cd financial-datasets
- Install the dependencies using Poetry:
  poetry install
- You can now use the library in your Python projects.
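Whichever installation method you choose, a quick way to confirm that the package is visible to your interpreter is to query its installed version with the standard library. This sketch assumes the distribution name `financial-datasets` used in the pip and Poetry commands above; `installed_version` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "financial-datasets") -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version())
```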
Contributing
Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
License
This project is licensed under the MIT License.
Hashes for financial_datasets-0.1.16.tar.gz
Algorithm | Hash digest
---|---
SHA256 | f167799a5b8f91558da369a5a1b5a9bb6bfee80607cc066bc55f83db4014119b
MD5 | 3e661acfeba55993dddb8f662cda3f76
BLAKE2b-256 | 76c3647713d7d4445e8acd0f0ad21f2d338a57a5185b619692629a70bfdeb97a

Hashes for financial_datasets-0.1.16-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b3942a741ff25a04d2095bc2dc0cb85249c05f8829858e9f001351bd40265364
MD5 | 1f3f6c756a63665985e3d223ebd8edee
BLAKE2b-256 | b34654035bd3a576fa918862a07c3a37db57c3e1787b4d572437f16f6bc48320
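If you download one of these files manually, you can compare its SHA256 digest against the value listed above. The sketch below uses only the standard library; the helper names and the local file path in the usage comment are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage, assuming the wheel was saved to the current directory:
# expected = "b3942a741ff25a04d2095bc2dc0cb85249c05f8829858e9f001351bd40265364"
# print(matches_published_hash(Path("financial_datasets-0.1.16-py3-none-any.whl"), expected))
```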