Financial datasets for LLMs
Financial Datasets 🧪
Financial Datasets is an open-source Python library for generating question-and-answer financial datasets with Large Language Models (LLMs). It generates realistic datasets from 10-Ks, 10-Qs, PDFs, and other financial texts.
Usage
Example generated dataset:
[
  {
    "question": "What was Airbnb's revenue in 2023?",
    "answer": "$9.9 billion",
    "context": "In 2023, revenue increased by 18% to $9.9 billion compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked of 54.5 million combined with higher average daily rates driving a 16% increase in Gross Booking Value of $10.0 billion."
  },
  {
    "question": "By what percentage did Airbnb's net income increase in 2023 compared to the prior year?",
    "answer": "153%",
    "context": "Net income in 2023 increased by 153% to $4.8 billion, compared to the prior year, driven by our revenue growth, increased interest income, discipline in managing our cost structure, and the release of a portion of our valuation allowance on deferred tax assets of $2.9 billion."
  }
]
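Assuming a generated dataset has been saved as JSON in the shape shown above (a list of objects with `question`, `answer`, and `context` keys), a minimal sketch for sanity-checking it could look like this. The helper `validate_items` is hypothetical and not part of the library, and the check that the answer appears verbatim in the context is only one possible quality heuristic.

```python
import json

REQUIRED_KEYS = {"question", "answer", "context"}


def validate_items(items):
    """Return a list of (index, problem) tuples for items that look malformed."""
    problems = []
    for i, item in enumerate(items):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
            continue
        # Heuristic only: a grounded answer often appears verbatim in its context.
        if item["answer"] not in item["context"]:
            problems.append((i, "answer text not found verbatim in context"))
    return problems


# First item of the example dataset above, as a JSON string.
raw = """
[
  {
    "question": "What was Airbnb's revenue in 2023?",
    "answer": "$9.9 billion",
    "context": "In 2023, revenue increased by 18% to $9.9 billion compared to 2022, primarily due to a 14% increase in Nights and Experiences Booked of 54.5 million combined with higher average daily rates driving a 16% increase in Gross Booking Value of $10.0 billion."
  }
]
"""

items = json.loads(raw)
print(validate_items(items))
```

An empty list means every item has the three expected keys and passes the heuristic.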
Example #1 - generate from any text
The most flexible option. Generates a dataset from a list of string texts. Colab code example here.
from financial_datasets.generator import DatasetGenerator
# Your list of texts
texts = ...
# Create dataset generator
generator = DatasetGenerator(model="gpt-4-0125-preview", api_key="your-openai-key")
# Generate dataset from texts
dataset = generator.generate_from_texts(
    texts=texts,
    max_questions=100,
)
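The `texts` value is left unspecified above. One way to build it, assuming your source is a single long string, is to group its paragraphs into chunks. The helper `split_into_chunks` and its `max_chars` limit are hypothetical and not part of the library, which may apply its own chunking internally.

```python
def split_into_chunks(document: str, max_chars: int = 2000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in document.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding the paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

The result can then be passed as `texts=split_into_chunks(document)`.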
Example #2 - generate from PDF
Generate a dataset from a PDF URL only. Colab code example here.
from financial_datasets.generator import DatasetGenerator
# Create dataset generator
generator = DatasetGenerator(model="gpt-4-0125-preview", api_key="your-openai-key")
# Generate dataset from PDF url
dataset = generator.generate_from_pdf(
    url="https://www.berkshirehathaway.com/letters/2023ltr.pdf",
    max_questions=100,
)
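The examples pass the API key as a string literal. To keep the key out of source code, one option is to read it from an environment variable. This is a sketch only: the helper `get_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is an assumed convention, not something the library is known to require.

```python
import os


def get_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from the environment, failing loudly if it is unset."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key
```

The generator could then be created with `DatasetGenerator(model="gpt-4-0125-preview", api_key=get_api_key())`.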
Example #3 - generate from 10-K
Generate a dataset from a ticker and year. Colab code example here.
from financial_datasets.generator import DatasetGenerator
# Create dataset generator
generator = DatasetGenerator(model="gpt-4-0125-preview", api_key="your-openai-key")
# Generate dataset from 10-K
dataset = generator.generate_from_10K(
    ticker="AAPL",
    year=2023,
    max_questions=100,
)
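To cover several filings, the single call above can be repeated for each ticker and year. The sketch below assumes only the keyword arguments shown in the example (`ticker`, `year`, `max_questions`); the wrapper `generate_for_filings` is hypothetical and takes the generation function as a parameter, so it makes no assumption about the type of the returned dataset.

```python
from itertools import product
from typing import Any, Callable, Iterable


def generate_for_filings(
    generate_fn: Callable[..., Any],
    tickers: Iterable[str],
    years: Iterable[int],
    max_questions: int = 100,
) -> dict[tuple[str, int], Any]:
    """Call generate_fn once per (ticker, year) pair and collect the results."""
    results: dict[tuple[str, int], Any] = {}
    for ticker, year in product(tickers, years):
        results[(ticker, year)] = generate_fn(
            ticker=ticker,
            year=year,
            max_questions=max_questions,
        )
    return results
```

For example, `generate_for_filings(generator.generate_from_10K, ["AAPL"], [2022, 2023])` would trigger one separate generation call per filing.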
Installation
Using pip
You can install the Financial Datasets library using pip:
pip install financial-datasets
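After installing, a quick way to confirm which version is present is to query the package metadata from the standard library. The helper name `installed_version` is hypothetical; the distribution name is taken from the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution: str = "financial-datasets"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version())
```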
Using Poetry
If you prefer to use Poetry for dependency management, you can add Financial Datasets to your project:
poetry add financial-datasets
From the Repository
If you want to install the library directly from the repository, follow these steps:
- Clone the repository:
  git clone https://github.com/yourusername/financial-datasets.git
- Navigate to the project directory:
  cd financial-datasets
- Install the dependencies using Poetry:
  poetry install
- You can now use the library in your Python projects.
Contributing
Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.
License
This project is licensed under the MIT License.
Hashes for financial_datasets-0.1.11.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 3e134d0ae2d9b80a23c996d233dbe905e7b44c0d2311fc046f2b3ae5a32de36d
MD5 | bc83015c877be372280d0e1d12364bea
BLAKE2b-256 | 3c4d0f5955d2cfcfc9a9431f3cacc2700d2943cc2d903eba8a98ca87f6eb4a51
Hashes for financial_datasets-0.1.11-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | b7a2c0f3c786a19da8ffbd217972a7cbe1a170bcdfafba163db85e98e10b50d5
MD5 | fa91afd62f0ff2f0fde998084e78a899
BLAKE2b-256 | 5e4b7788c2ee22d549fb11ddfadabb50c53a63ec1505e3d1d7559f14a7a08087
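The published digests can be used to check a downloaded file. Below is a minimal sketch using the standard library's `hashlib`; the helper names `sha256_of_file` and `matches_published_hash` are hypothetical, and the local file path is whatever location you downloaded the archive to.

```python
import hashlib


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the source distribution, the expected value is the SHA256 digest listed above for financial_datasets-0.1.11.tar.gz, for example `matches_published_hash("financial_datasets-0.1.11.tar.gz", "3e134d0ae2d9b80a23c996d233dbe905e7b44c0d2311fc046f2b3ae5a32de36d")`.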