Skip to main content

Convert text to datasets

Project description

txt2dataset

Convert unstructured text to structured datasets using structured output for Large Language Models. Currently supports Gemini.

Example

Andrew Baglino, Senior Vice President, Powertrain and Energy Engineering of Tesla, Inc. (“Tesla,”, or the “Company”), resigned from Tesla, effective as of April 14, 2024. Mr. Baglino served in this position since October 2019, prior to which he served in various engineering positions continuously since joining Tesla in March 2006. Tesla is grateful to Mr. Baglino for his leadership and contributions to our significant innovation and growth over the course of his 18-year career. 

--->

name, title, date, action
Andrew Baglino, Senior Vice President, 4/14/2024, resigns

Installation

pip install txt2dataset

Quickstart

Initialization

from txt2dataset import DatasetBuilder

builder = DatasetBuilder(input_path,output_path)

# set api key
builder.set_api_key(api_key)

# set base prompt, e.g. what the model looks for
base_prompt = """Extract officer changes and movements to JSON format.
    Track when officers join, leave, or change roles.
    Provide the following information:
    - date (YYYYMMDD)
    - name (First Middle Last)
    - title
    - action (one of: ["HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"])
    Return an empty dict if info unavailable."""

# set what the model should return
response_schema = {
    "type": "ARRAY",
    "items": {
        "type": "OBJECT",
        "properties": {
            "date": {"type": "STRING", "description": "Date of action in YYYYMMDD format"},
            "name": {"type": "STRING", "description": "Full name (First Middle Last)"},
            "title": {"type": "STRING", "description": "Official title/position"},
            "action": {
                "type": "STRING", 
                "enum": ["HIRED", "RESIGNED", "TERMINATED", "PROMOTED", "TITLE_CHANGE"],
                "description": "Type of personnel action"
            }
        },
        "required": ["date", "name", "title", "action"]
    }
}

# Optional configurations
builder.set_rpm(1500)
builder.set_save_frequency(100)
builder.set_model('gemini-1.5-flash-8b')

Build the dataset

builder.build(base_prompt=base_prompt,
               response_schema=response_schema,
               text_column='text',
               index_column='accession_number',
               input_path="data/msft_8k_item_5_02.csv",
               output_path='data/msft_officers.csv')

Standardize the dataset

builder.standardize(response_schema=response_schema,input_path='data/msft_officers.csv', output_path='data/msft_officers_standardized.csv',columns=['name'])

Validate the dataset

results = builder.validate(input_path='data/msft_8k_item_5_02.csv',
                 output_path= 'data/msft_officers_standardized.csv', 
                 text_column='text',
                 index_column='accession_number', 
                 base_prompt=base_prompt,
                 response_schema=response_schema,
                 n=5,
                 quiet=False)

Example Validation Output

[{
    "input_text": "Item 5.02 Departure of Directors... Kevin Turner provided notice he was resigning his position as Chief Operating Officer of Microsoft.",
    "process_output": [{
        "date": 20160630,
        "name": "Kevin Turner",
        "title": "Chief Operating Officer",
        "action": "RESIGNED"
    }],
    "is_valid": true,
    "reason": "The generated JSON is valid..."
},...
]

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

txt2dataset-0.3.0.tar.gz (8.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

txt2dataset-0.3.0-py3-none-any.whl (7.4 kB view details)

Uploaded Python 3

File details

Details for the file txt2dataset-0.3.0.tar.gz.

File metadata

  • Download URL: txt2dataset-0.3.0.tar.gz
  • Upload date:
  • Size: 8.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.11

File hashes

Hashes for txt2dataset-0.3.0.tar.gz
Algorithm Hash digest
SHA256 0c7c2c5b1d1c224497c1d268855643b5ea999c7564054537206dcbe99e7256b5
MD5 57edb1ce3fa2cdbe637470ffd08a8988
BLAKE2b-256 c805df2c941554afd0d7d309c30d9d393002cd913a7a29b35f0acc2260642dcd

See more details on using hashes here.

File details

Details for the file txt2dataset-0.3.0-py3-none-any.whl.

File metadata

  • Download URL: txt2dataset-0.3.0-py3-none-any.whl
  • Upload date:
  • Size: 7.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.11.11

File hashes

Hashes for txt2dataset-0.3.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6faff7abd85348a8028f6ebdb923d8dc18d03e697be82f0618e33ce294a25903
MD5 bad61cb9a48c9fb20c7663ddc44c546a
BLAKE2b-256 b4aa46a30ee0d62f0d41769bae4e6be73a25ebfddae96292eaa4056517982e3c

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page