Convert text to datasets

txt2dataset

A package for building, standardizing and validating datasets using language models. Supports the Structured Output project.

Models Supported

  • Gemini

Installation

pip install txt2dataset

Usage

Schema

from pydantic import BaseModel
from typing import Optional, List
from datetime import datetime

class SingleDividend(BaseModel):
    dividend_per_share: float
    payment_date: Optional[datetime] = None
    record_date: Optional[datetime] = None
    stock_type_specified: Optional[str] = None

class DividendExtraction(BaseModel):
    info_found: bool
    data: List[SingleDividend] = []
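
To make the shape of this schema concrete, the sketch below shows a JSON payload that would fit DividendExtraction and checks its key names with the standard library only. The payload values are made up for illustration, and REQUIRED, OPTIONAL and fits_single_dividend are hypothetical names, not part of txt2dataset or pydantic. It checks only key names; pydantic would also validate the value types.

```python
import json

# Illustrative payload shaped like DividendExtraction; values are hypothetical.
payload = """
{
  "info_found": true,
  "data": [
    {"dividend_per_share": 0.25,
     "payment_date": "2021-06-15T00:00:00Z",
     "record_date": "2021-06-01T00:00:00Z",
     "stock_type_specified": null}
  ]
}
"""

# Field names copied from SingleDividend: one required, three optional.
REQUIRED = {"dividend_per_share"}
OPTIONAL = {"payment_date", "record_date", "stock_type_specified"}

def fits_single_dividend(item: dict) -> bool:
    """Return True if item has every required key and no unknown keys."""
    keys = set(item)
    return REQUIRED <= keys and keys <= REQUIRED | OPTIONAL

parsed = json.loads(payload)
all_fit = all(fits_single_dividend(d) for d in parsed["data"])
print(parsed["info_found"], all_fit)
```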

Entries

Entries consist of an identifier and the text to be structured.

entries = [
    (0,
    """First Business Financial Services, Inc. (the "Company") issued a press release today 
    announcing that the Company's Board of Directors declared a quarterly dividend of $0.18 
    per share on April 30, 2021, unchanged compared to the last quarterly dividend per share. 
    The dividend is payable on May 24, 2021 to shareholders of record on May 10, 2021. 
    Also on July 12, 2020 there was a payable dividend of $0.15 per share to shareholders 
    of record on July 1st, 2020."""),

    (1,"""XYZ Corp declared a dividend of $0.25 per share, payable June 15, 2021 
    to shareholders of record as of June 1, 2021.""")
]
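
If the texts already sit in a Python list, the (identifier, text) pairs can be generated rather than typed by hand. The sketch below assumes that a text's position in the list is an acceptable identifier; make_entries is a hypothetical helper, not part of txt2dataset, and the sample texts are invented.

```python
def make_entries(texts):
    """Pair each text with its position, matching the (id, text) layout."""
    return [(i, text) for i, text in enumerate(texts)]

# Invented sample texts, used only to show the resulting structure.
entries = make_entries([
    "XYZ Corp declared a dividend of $0.25 per share.",
    "ABC Inc declared no dividend this quarter.",
])
print(entries[1][0])
```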

Prompt

Choose a prompt such as:

prompt = "Extract ALL dividend information from this text"

Dataset Builder Initialization

Set rpm (requests per minute) to a rate that your API key and model allow.

from txt2dataset import DatasetBuilder

builder = DatasetBuilder(
    prompt=prompt,
    schema=DividendExtraction,
    model="gemini-2.5-flash-lite",
    entries=entries,
    rpm=4000  # requests per minute
)
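
As a rough guide to what a given rpm setting means, the arithmetic below converts it to an average gap between requests. How txt2dataset enforces the limit internally is not described here, so this is only the arithmetic; seconds_between_requests is a hypothetical helper.

```python
def seconds_between_requests(rpm: int) -> float:
    """Average spacing, in seconds, implied by a requests-per-minute cap."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return 60.0 / rpm

# With the rpm=4000 used above, requests average 15 ms apart.
print(seconds_between_requests(4000))
```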

Build

builder.build()

Save

builder.save('test.csv')

Result:

_id  dividend_per_share  payment_date               record_date                stock_type_specified
0    0.18                2021-05-24 00:00:00+00:00  2021-05-10 00:00:00+00:00
0    0.15                2020-07-12 00:00:00+00:00  2020-07-01 00:00:00+00:00
1    0.25                2021-06-15 00:00:00+00:00  2021-06-01 00:00:00+00:00
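
The output has one row per extracted dividend, with _id repeating the identifier of the entry it came from. A minimal sketch of that flattening, using the standard csv module, is shown below. It is not txt2dataset's actual save() implementation; flatten, FIELDS and the results dict are hypothetical, and the dates are shortened for readability.

```python
import csv
import io

FIELDS = ["_id", "dividend_per_share", "payment_date",
          "record_date", "stock_type_specified"]

# Hypothetical extraction results keyed by entry identifier.
results = {
    0: [{"dividend_per_share": 0.18, "payment_date": "2021-05-24",
         "record_date": "2021-05-10"},
        {"dividend_per_share": 0.15, "payment_date": "2020-07-12",
         "record_date": "2020-07-01"}],
    1: [{"dividend_per_share": 0.25, "payment_date": "2021-06-15",
         "record_date": "2021-06-01"}],
}

def flatten(results):
    """Yield one flat row per extracted item, tagged with its entry id."""
    for entry_id, items in results.items():
        for item in items:
            yield {"_id": entry_id, **item}

# Write to an in-memory buffer; missing optional fields become empty cells.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="")
writer.writeheader()
rows = list(flatten(results))
writer.writerows(rows)
print(buffer.getvalue())
```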

Future Features

  • validate() - checks that extracted values have the expected data types. Less necessary now that pydantic schemas enforce types.
  • standardize() - standardizes data.
