Extract structured AI investment data from earnings call transcripts using OpenAI

Project description

fin-ai-stats-extract

Overview

fin-ai-stats-extract extracts structured AI and technology investment data from XML earnings-call transcripts and writes the results to CSV.

The tool is now driven by a single TOML config file, extract.toml, which defines:

the base extraction instructions sent to the model
model and endpoint settings
the structured output groups and fields

That same config drives three things at once:

the final system instructions sent to the model
the structured JSON schema used for Responses API parsing
the CSV column order used for output

CLI flags remain available for fast tuning and one-off overrides. When a CLI flag conflicts with a TOML value, the CLI value wins and the tool logs a warning.
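
The precedence rule above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `merge_settings`, the logger name, and the convention that an unset flag arrives as `None` are all assumptions.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("fin_ai_stats_extract.sketch")


def merge_settings(toml_values: dict, cli_values: dict) -> dict:
    """Return the TOML settings with CLI overrides applied on top.

    Assumption: a CLI value of None means the flag was not supplied.
    """
    merged = dict(toml_values)
    for key, cli_value in cli_values.items():
        if cli_value is None:
            continue  # flag not given, keep the TOML value
        if key in merged and merged[key] != cli_value:
            # The CLI value wins, and the conflict is logged as a warning.
            logger.warning(
                "CLI value %r for %s overrides TOML value %r",
                cli_value, key, merged[key],
            )
        merged[key] = cli_value
    return merged
```

For example, `merge_settings({"temperature": 0.7}, {"temperature": 0.2})` keeps 0.2 and logs one warning.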

Requirements

Python 3.14+
uv

Install

For local development in this repository:

uv sync

For one-off usage without creating a local environment:

uvx fin-ai-stats-extract

Configuration

The default config file is ./extract.toml.

If you run the tool without --config and extract.toml does not exist in the current working directory, the packaged default config is copied there automatically.
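
The copy-if-missing behaviour might look roughly like this. It is a sketch under assumptions: `ensure_config` and the `packaged_default` argument are hypothetical names, and how the tool actually locates its packaged default is not shown here.

```python
import shutil
from pathlib import Path


def ensure_config(config_path: Path, packaged_default: Path) -> bool:
    """Copy the packaged default to config_path when it is missing.

    Returns True if a copy was made, False if the file already existed.
    """
    if config_path.exists():
        return False  # never overwrite a config the user may have edited
    config_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(packaged_default, config_path)
    return True
```

An existing extract.toml is left untouched, so local edits survive repeated runs.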

The default config includes:

commented-out optional settings such as temperature, top_p, and reasoning_effort
the full grouped output schema for AI and non-AI investment extraction

Edit extract.toml directly to change AI instructions, AI model settings, or output fields.

Environment

Create an environment file for API-backed runs:

cp .env.example .env

Example values:

OPENAI_API_KEY=sk-your-key-here

By default, extract.toml reads the API key from OPENAI_API_KEY via api_key_env = "OPENAI_API_KEY".

Basic Usage

Run using the defaults in extract.toml:

uv run fin-ai-stats-extract --input ./data/Current

Run on a single transcript:

uv run fin-ai-stats-extract --input ./data/Current/11473715_T.xml

Write to a different CSV for one run:

uv run fin-ai-stats-extract --output ./sample.csv

Validate XML parsing only, without calling a model:

uv run fin-ai-stats-extract --dry-run

Process only a sample of files:

uv run fin-ai-stats-extract --sample 25

Enable verbose logging:

uv run fin-ai-stats-extract --verbose

Tune common model settings for one run:

uv run fin-ai-stats-extract \
  --temperature 0.2 \
  --top-p 0.9 \
  --max-output-tokens 2500 \
  --reasoning-effort medium \
  --verbosity low

Resume an interrupted run from an existing CSV:

uv run fin-ai-stats-extract --resume
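
One plausible way to implement resuming is to skip every input whose source_file already appears in the output CSV. The sketch below assumes exactly that; the helper names are hypothetical and the tool may track completed transcripts differently.

```python
import csv
from pathlib import Path


def already_processed(output_csv: Path) -> set[str]:
    """Collect the source_file values present in an existing output CSV."""
    if not output_csv.exists():
        return set()
    with output_csv.open(newline="", encoding="utf-8") as handle:
        return {
            row["source_file"]
            for row in csv.DictReader(handle)
            if row.get("source_file")
        }


def pending_files(all_files: list[str], output_csv: Path) -> list[str]:
    """Return the inputs that still need processing, in their original order."""
    done = already_processed(output_csv)
    return [name for name in all_files if name not in done]
```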

Use a different config file:

uv run fin-ai-stats-extract --config ./custom_extract.toml

Open the Tkinter review window instead of supplying all settings on the command line:

uv run fin-ai-stats-extract --gui

The GUI lets you edit the prompt instructions, model settings, input/output paths, and runtime options; view the TOML-driven output format in a dedicated tab; and review the exact run cost, which is calculated automatically, before confirming or cancelling.

Local OpenAI-Compatible Endpoints

You can point the tool at a local or self-hosted OpenAI-compatible server with either:

base_url in extract.toml
--base-url for a one-off override

Example with LM Studio:

uv run fin-ai-stats-extract \
  --base-url http://127.0.0.1:1234/v1 \
  --model google/gemma-3-4b

You can also pass an API key explicitly:

uv run fin-ai-stats-extract --api-key "$OPENAI_API_KEY"

If base_url is set and no API key is available, the tool uses lm-studio as a fallback key for local endpoints that require a non-empty value.
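
Taken together, the key lookup described in this section could be sketched as below. The order shown (explicit --api-key, then the configured environment variable, then the lm-studio fallback) follows the text; the function name and the `environ` parameter are assumptions made for illustration.

```python
import os

LOCAL_FALLBACK_KEY = "lm-studio"  # fallback value named in the text above


def resolve_api_key(cli_api_key, api_key_env, base_url, environ=None):
    """Pick an API key: CLI value, then environment variable, then fallback.

    Returns None when no key is available and no base_url is set.
    """
    environ = os.environ if environ is None else environ
    if cli_api_key:
        return cli_api_key  # --api-key overrides the environment variable
    env_value = environ.get(api_key_env)
    if env_value:
        return env_value
    if base_url:
        # Local endpoints may only need a non-empty placeholder key.
        return LOCAL_FALLBACK_KEY
    return None
```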

Command-Line Arguments

--config: Path to a TOML config file. Defaults to ./extract.toml.
--gui: Open a Tkinter settings window for editing values and confirming the run.
--input: Required unless --gui is used. Path to one XML file or a folder tree containing XML files.
--output: Output CSV path. Defaults to output.csv.
--model: Override the configured model.
--base-url: Override the configured OpenAI-compatible API base URL.
--api-key: Override the API key from the configured environment variable.
--temperature: Override the configured Responses API temperature. Accepts values from 0 to 2.
--top-p: Override the configured nucleus-sampling probability mass (top_p). Accepts values from 0 to 1.
--max-output-tokens: Override the configured maximum output tokens.
--reasoning-effort: Override the configured reasoning effort: none, minimal, low, medium, high, xhigh.
--verbosity: Override the configured output verbosity: low, medium, or high.
--max-concurrency, --max-async-jobs, --concurrency: Maximum number of concurrent extraction jobs. Defaults to CONCURRENCY_LIMIT or 100.
--dry-run: Parse XML only. No API calls are made.
--resume: Resume from an existing output CSV.
--sample: Randomly process only N files from the input set.
--verbose: Enable debug logging.
--yes, -y: Skip cost confirmation.
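
As a rough picture of what --sample and --max-concurrency control, the sketch below draws a random subset and then runs an async worker over it with a bounded number of jobs in flight. All names here are hypothetical, and `demo_worker` is a stand-in for the real extraction call.

```python
import asyncio
import random


def pick_sample(files, n, seed=None):
    """Randomly choose n files, or return all of them if n is larger."""
    if n >= len(files):
        return list(files)
    return random.Random(seed).sample(list(files), n)


async def demo_worker(name):
    """Stand-in for a real extraction job (assumption for this sketch)."""
    await asyncio.sleep(0)
    return name.upper()


async def run_bounded(items, worker, max_concurrency):
    """Run worker over items with at most max_concurrency jobs at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))
```

A call such as `asyncio.run(run_bounded(files, demo_worker, max_concurrency=100))` would process every file while capping the number of simultaneous jobs.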

The most common researcher-facing controls remain temperature, top_p, max_output_tokens, reasoning_effort, and verbosity. In most cases, change temperature or top_p, but not both at the same time.
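
The documented ranges and allowed values can be checked before a run. This is only an illustrative sketch: `validate_overrides` is a hypothetical helper, and the tool's own validation and messages may differ.

```python
REASONING_EFFORTS = {"none", "minimal", "low", "medium", "high", "xhigh"}
VERBOSITY_LEVELS = {"low", "medium", "high"}


def validate_overrides(temperature=None, top_p=None,
                       reasoning_effort=None, verbosity=None):
    """Return a list of problems found in the supplied override values."""
    problems = []
    if temperature is not None and not 0 <= temperature <= 2:
        problems.append("temperature must be between 0 and 2")
    if top_p is not None and not 0 <= top_p <= 1:
        problems.append("top_p must be between 0 and 1")
    if reasoning_effort is not None and reasoning_effort not in REASONING_EFFORTS:
        problems.append(f"unknown reasoning_effort: {reasoning_effort}")
    if verbosity is not None and verbosity not in VERBOSITY_LEVELS:
        problems.append(f"unknown verbosity: {verbosity}")
    return problems
```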

Output Schema

The [output] section of extract.toml is the extraction contract.

Each group defines:

a top-level JSON object key
a group title and description
an ordered list of fields

Each field defines:

the exact field name
the field type
whether null is allowed
the description used in the rendered model instructions

The tool uses that same schema to:

build the structured output model at runtime
append an Output Contract section to the final instructions sent to the LLM
generate CSV headers in the same order
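
To make the "one schema, three outputs" idea concrete, here is a small sketch that derives a JSON schema, an instruction block, and CSV headers from one in-memory group definition. The `GROUPS` structure, its field names, and the choice of plain field names as CSV headers are all hypothetical; the real extract.toml layout and the tool's runtime model may differ.

```python
# Hypothetical in-memory form of one output group (not the real extract.toml layout).
GROUPS = {
    "ai_investment": {
        "title": "AI investment",
        "fields": [
            {"name": "ai_mentioned", "type": "boolean", "nullable": False,
             "description": "Whether AI investment is discussed."},
            {"name": "ai_products", "type": "array", "nullable": True,
             "description": "Named AI products, if any."},
        ],
    },
}


def field_schema(field):
    """JSON schema fragment for one field, allowing null when configured."""
    base = {"type": field["type"]}
    if field["type"] == "array":
        base["items"] = {"type": "string"}
    return {"anyOf": [base, {"type": "null"}]} if field["nullable"] else base


def build_json_schema(groups):
    """One top-level object key per group, one property per field."""
    properties = {
        key: {
            "type": "object",
            "properties": {f["name"]: field_schema(f) for f in group["fields"]},
            "required": [f["name"] for f in group["fields"]],
            "additionalProperties": False,
        }
        for key, group in groups.items()
    }
    return {"type": "object", "properties": properties,
            "required": list(properties), "additionalProperties": False}


def build_csv_headers(groups):
    """Field names in configured order (assumes names are unique across groups)."""
    return [f["name"] for group in groups.values() for f in group["fields"]]


def build_output_contract(groups):
    """Plain-text Output Contract section appended to the instructions."""
    lines = ["Output Contract"]
    for key, group in groups.items():
        lines.append(f"{key}: {group['title']}")
        for f in group["fields"]:
            lines.append(f"- {f['name']} ({f['type']}): {f['description']}")
    return "\n".join(lines)
```

Because all three builders read the same structure, reordering or renaming a field changes the schema, the instructions, and the CSV header together.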

Output

The tool writes one CSV row per transcript.

The CSV always begins with transcript metadata:

event_id
company_name
quarter
date
headline
source_file

Configured extraction fields follow in the order defined in extract.toml.

List-valued fields are serialized into a single cell, with items joined by a semicolon followed by a space ("; ").

When the input is a folder, source_file preserves the relative subfolder path, for example subdir/12345_T.xml.
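
The row layout described above can be sketched as follows. The metadata column order and the "; " separator come from this section; the helper names and the choice to write missing values as empty cells are assumptions.

```python
from pathlib import Path

METADATA_COLUMNS = [
    "event_id", "company_name", "quarter", "date", "headline", "source_file",
]


def serialize_cell(value):
    """Turn one extracted value into CSV cell text."""
    if value is None:
        return ""  # assumption: nulls become empty cells
    if isinstance(value, (list, tuple)):
        return "; ".join(str(item) for item in value)
    return str(value)


def relative_source(file_path: Path, input_root: Path) -> str:
    """Path of a transcript relative to the input folder, with / separators."""
    return file_path.relative_to(input_root).as_posix()


def build_row(metadata: dict, extracted: dict, field_order: list) -> list:
    """Metadata columns first, then configured fields in configured order."""
    cells = [serialize_cell(metadata.get(col)) for col in METADATA_COLUMNS]
    cells += [serialize_cell(extracted.get(name)) for name in field_order]
    return cells
```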

Notes On Local Models

OpenAI-compatible endpoints vary in how well they support strict structured outputs.

In particular:

some local models may fail on long transcripts because of context-window limits
some local models may return non-compliant JSON even when a schema is provided

If you use a local endpoint and see parsing or validation failures, try:

a model with a larger context window
smaller inputs using --sample or a single file first
an OpenAI-hosted model for the most reliable structured-output behavior
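
When debugging a local endpoint, it can help to check a raw response by hand before blaming the schema. The sketch below is a hypothetical helper, not part of the tool: it only checks that the text parses as a JSON object and that the expected top-level group keys are present.

```python
import json


def check_payload(raw_text, required_keys):
    """Return a list of problems found in a raw model response."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value is not a JSON object"]
    return [f"missing key: {key}" for key in required_keys if key not in payload]
```

An empty list means the response at least has the expected shape at the top level; it does not validate field types.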

Project details

Release history

0.2.1

Apr 6, 2026

This version

0.2.0

Apr 5, 2026

0.1.2

Mar 30, 2026

0.1.1

Mar 30, 2026

0.1.0

Mar 30, 2026

Download files

Source Distribution

fin_ai_stats_extract-0.2.0.tar.gz (20.8 kB)

Uploaded Apr 5, 2026 Source

Built Distribution

fin_ai_stats_extract-0.2.0-py3-none-any.whl (27.3 kB)

Uploaded Apr 5, 2026 Python 3

File details

Details for the file fin_ai_stats_extract-0.2.0.tar.gz.

File metadata

Download URL: fin_ai_stats_extract-0.2.0.tar.gz
Upload date: Apr 5, 2026
Size: 20.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fin_ai_stats_extract-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`6be7a450e68e4dd16c1ded5e611090486cef5291f8655a3f00c70f884f81757d`
MD5	`0583b175ec49e448a91dea2b50733a84`
BLAKE2b-256	`e8c728075541c79329dbbfce7a7efb36b808f6acff7314d1874a7f50bce9a7be`

Provenance

The following attestation bundles were made for fin_ai_stats_extract-0.2.0.tar.gz:

Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fin_ai_stats_extract-0.2.0.tar.gz
- Subject digest: 6be7a450e68e4dd16c1ded5e611090486cef5291f8655a3f00c70f884f81757d
- Sigstore transparency entry: 1239249671
- Sigstore integration time: Apr 5, 2026
Source repository:
- Permalink: AlexDrBanana/fin-ai-stats-extract@8c1f0627ba72984f8d6109af99b8557a60584535
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/AlexDrBanana
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8c1f0627ba72984f8d6109af99b8557a60584535
- Trigger Event: push

File details

Details for the file fin_ai_stats_extract-0.2.0-py3-none-any.whl.

File metadata

Download URL: fin_ai_stats_extract-0.2.0-py3-none-any.whl
Upload date: Apr 5, 2026
Size: 27.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fin_ai_stats_extract-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`91dff4ba815b30b7578e9fc5a06ac6d7556396d279254c6902f28a0246d81b3b`
MD5	`f59a76e8595826ca8d7c2ca9b823e56d`
BLAKE2b-256	`eb06269e05c13589636d36231e6f56a9be4f39ea5f2632f2839bdbe91b7f58dd`

Provenance

The following attestation bundles were made for fin_ai_stats_extract-0.2.0-py3-none-any.whl:

Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fin_ai_stats_extract-0.2.0-py3-none-any.whl
- Subject digest: 91dff4ba815b30b7578e9fc5a06ac6d7556396d279254c6902f28a0246d81b3b
- Sigstore transparency entry: 1239249673
- Sigstore integration time: Apr 5, 2026
Source repository:
- Permalink: AlexDrBanana/fin-ai-stats-extract@8c1f0627ba72984f8d6109af99b8557a60584535
- Branch / Tag: refs/tags/v0.2.0
- Owner: https://github.com/AlexDrBanana
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@8c1f0627ba72984f8d6109af99b8557a60584535
- Trigger Event: push
