Convert text to datasets
txt2dataset
Convert unstructured text into structured datasets using the structured-output feature of Large Language Models. Currently, only Gemini is supported.
Example
Andrew Baglino, Senior Vice President, Powertrain and Energy Engineering of Tesla, Inc. (“Tesla,”, or the “Company”), resigned from Tesla, effective as of April 14, 2024. Mr. Baglino served in this position since October 2019, prior to which he served in various engineering positions continuously since joining Tesla in March 2006. Tesla is grateful to Mr. Baglino for his leadership and contributions to our significant innovation and growth over the course of his 18-year career.
--->
name, title, date, action
Andrew Baglino, Senior Vice President, 4/14/2024, resigns
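The illustrative row above writes the date as 4/14/2024, while the Quickstart prompt asks the model for YYYYMMDD. If you need to bring dates of the first form into the second, a minimal sketch using only the standard library could look like this. The helper name `to_yyyymmdd` is hypothetical and not part of txt2dataset.

```python
from datetime import datetime


def to_yyyymmdd(date_text, input_format="%m/%d/%Y"):
    """Hypothetical helper (not part of txt2dataset): reformat a date string as YYYYMMDD."""
    return datetime.strptime(date_text, input_format).strftime("%Y%m%d")


print(to_yyyymmdd("4/14/2024"))  # 20240414
```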
Installation
pip install txt2dataset
Quickstart
Initialization
from txt2dataset import DatasetBuilder

# input_path, output_path, and api_key are placeholders for your own values
builder = DatasetBuilder(input_path, output_path)

# set the API key
builder.set_api_key(api_key)
# set the base prompt, i.e. what the model should look for
base_prompt = """Extract officer changes and movements to JSON format.
Track when officers join, leave, or change roles.
Provide the following information:
- date (YYYYMMDD)
- name (First Middle Last)
- title
- action (one of: ["HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"])
Return an empty dict if info unavailable."""
# set what the model should return
response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "date": {"type": "STRING", "description": "Date of action in YYYYMMDD format"},
            "name": {"type": "STRING", "description": "Full name (First Middle Last)"},
            "title": {"type": "STRING", "description": "Official title/position"},
            "action": {
                "type": "STRING",
                "enum": ["HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"],
                "description": "Type of personnel action"
            }
        },
        "required": ["date", "name", "title", "action"]
    }
}
# Optional configurations
builder.set_rpm(1500)
builder.set_save_frequency(100)
builder.set_model('gemini-1.5-flash-8b')
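The response schema above lists four required string fields and restricts action to a fixed set of values. If you want to double-check extracted records yourself, a minimal sketch in plain Python might look like the following. `ITEM_SCHEMA` and `check_record` are hypothetical names for illustration and are not part of txt2dataset. The schema is a trimmed copy of the `items` entry above, and the sample records are made up from the Tesla example.

```python
# Hypothetical helper, not part of txt2dataset: checks one extracted record
# against the "required", "type", and "enum" constraints of the item schema.
ITEM_SCHEMA = {
    "properties": {
        "date": {"type": "STRING"},
        "name": {"type": "STRING"},
        "title": {"type": "STRING"},
        "action": {
            "type": "STRING",
            "enum": ["HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"],
        },
    },
    "required": ["date", "name", "title", "action"],
}


def check_record(record, item_schema=ITEM_SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    # Every required field must be present.
    for field in item_schema["required"]:
        if field not in record:
            problems.append(f"missing field: {field}")
    # Present fields must be strings and, where an enum is given, one of its values.
    for field, spec in item_schema["properties"].items():
        if field not in record:
            continue
        value = record[field]
        if spec["type"] == "STRING" and not isinstance(value, str):
            problems.append(f"{field} should be a string, got {type(value).__name__}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"{field} has unexpected value: {value!r}")
    return problems


# Illustrative records only.
good = {"date": "20240414", "name": "Andrew Baglino",
        "title": "Senior Vice President", "action": "RESIGNED"}
bad = {"date": 20240414, "name": "Andrew Baglino", "action": "resigns"}

print(check_record(good))
print(check_record(bad))
```

A check like this is deliberately narrower than a full JSON Schema validator: it only covers the constraints that appear in the schema above.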
Build the dataset
builder.build(base_prompt=base_prompt,
              response_schema=response_schema,
              text_column='text',
              index_column='accession_number',
              input_path='data/msft_8k_item_5_02.csv',
              output_path='data/msft_officers.csv')
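The build call above names a text column and an index column in the input CSV. As a rough sketch of what such an input file could look like, the snippet below writes and reads back a two-row CSV with the standard `csv` module. It assumes the input is an ordinary CSV with a header row. The rows, the file location, and the helper names `write_input_csv` and `read_input_csv` are hypothetical placeholders, not real filings or txt2dataset functions.

```python
import csv
import os
import tempfile

# Hypothetical placeholder rows: not real accession numbers or filings.
rows = [
    {"accession_number": "example-0001",
     "text": "Jane Doe was appointed Chief Financial Officer."},
    {"accession_number": "example-0002",
     "text": "John Roe resigned as Chief Operating Officer."},
]


def write_input_csv(path, rows):
    """Write rows with the index and text columns used in the build call."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["accession_number", "text"])
        writer.writeheader()
        writer.writerows(rows)


def read_input_csv(path):
    """Read the CSV back as a list of dicts, one per row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


path = os.path.join(tempfile.mkdtemp(), "input.csv")
write_input_csv(path, rows)
loaded = read_input_csv(path)
print(len(loaded), loaded[0]["accession_number"])
```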
Standardize the dataset
builder.standardize(response_schema=response_schema,
                    input_path='data/msft_officers.csv',
                    output_path='data/msft_officers_standardized.csv',
                    columns=['name'])
Validate the dataset
results = builder.validate(input_path='data/msft_8k_item_5_02.csv',
                           output_path='data/msft_officers_standardized.csv',
                           text_column='text',
                           index_column='accession_number',
                           base_prompt=base_prompt,
                           response_schema=response_schema,
                           n=5,
                           quiet=False)
Example Validation Output
[{
    "input_text": "Item 5.02 Departure of Directors... Kevin Turner provided notice he was resigning his position as Chief Operating Officer of Microsoft.",
    "process_output": [{
        "date": 20160630,
        "name": "Kevin Turner",
        "title": "Chief Operating Officer",
        "action": "RESIGNED"
    }],
    "is_valid": true,
    "reason": "The generated JSON is valid..."
},...
]
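The validation output above is a list of entries, each with an is_valid flag and a reason. A short sketch for tallying such entries is shown below. It assumes `results` is a list of dicts shaped like the example. `summarize_validation` is a hypothetical helper, and `sample_results` is made-up stand-in data rather than real output.

```python
def summarize_validation(results):
    """Hypothetical helper: count valid entries and collect reasons for invalid ones."""
    total = len(results)
    valid = sum(1 for entry in results if entry.get("is_valid"))
    failure_reasons = [entry.get("reason", "") for entry in results
                       if not entry.get("is_valid")]
    return {
        "total": total,
        "valid": valid,
        "invalid": total - valid,
        "failure_reasons": failure_reasons,
    }


# Made-up stand-in data shaped like the example output, for illustration only.
sample_results = [
    {"input_text": "placeholder text A", "process_output": [],
     "is_valid": True, "reason": "placeholder reason A"},
    {"input_text": "placeholder text B", "process_output": [],
     "is_valid": False, "reason": "placeholder reason B"},
]

summary = summarize_validation(sample_results)
print(summary["valid"], "of", summary["total"], "sampled entries marked valid")
```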