Skip to main content

Extract structured AI investment data from earnings call transcripts using OpenAI

Project description

fin-ai-stats-extract

Overview

fin-ai-stats-extract extracts structured AI and technology investment data from XML earnings-call transcripts and writes the results to CSV.

The tool is now driven by a single TOML config file, extract.toml, which defines:

  • the base extraction instructions sent to the model
  • model and endpoint settings
  • the structured output groups and fields

That same config drives three things at once:

  • the final system instructions sent to the model
  • the structured JSON schema used for Responses API parsing
  • the CSV column order used for output

CLI flags remain available for fast tuning and one-off overrides. When a CLI flag conflicts with a TOML value, the CLI value wins and the tool logs a warning.

Requirements

  • Python 3.14+
  • uv

Install

For local development in this repository:

uv sync

For one-off usage without creating a local environment:

uvx fin-ai-stats-extract

Configuration

The default config file is ./extract.toml.

If you run the tool without --config and extract.toml does not exist in the current working directory, the packaged default config is copied there automatically.

The default config includes:

  • commented-out optional settings such as temperature, top_p, and reasoning_effort
  • the full grouped output schema for AI and non-AI investment extraction

Edit extract.toml directly to change AI instructions, AI model settings, or output fields.

Environment

Create an environment file for API-backed runs:

cp .env.example .env

Example values:

OPENAI_API_KEY=sk-your-key-here

By default, extract.toml reads the API key from OPENAI_API_KEY via api_key_env = "OPENAI_API_KEY".

Basic Usage

Run using the defaults in extract.toml:

uv run fin-ai-stats-extract --input ./data/Current

Run on a single transcript:

uv run fin-ai-stats-extract --input ./data/Current/11473715_T.xml

Write to a different CSV for one run:

uv run fin-ai-stats-extract --output ./sample.csv

Validate XML parsing only, without calling a model:

uv run fin-ai-stats-extract --dry-run

Process only a sample of files:

uv run fin-ai-stats-extract --sample 25

Enable verbose logging:

uv run fin-ai-stats-extract --verbose

Tune common model settings for one run:

uv run fin-ai-stats-extract \
  --temperature 0.2 \
  --top-p 0.9 \
  --max-output-tokens 2500 \
  --reasoning-effort medium \
  --verbosity low

Resume an interrupted run from an existing CSV:

uv run fin-ai-stats-extract --resume

Use a different config file:

uv run fin-ai-stats-extract --config ./custom_extract.toml

Open the Tkinter review window instead of supplying all settings on the command line:

uv run fin-ai-stats-extract --gui

The GUI lets you edit the prompt instructions, model settings, input/output paths, runtime options, see the TOML-driven output format in a dedicated tab, and review the exact run cost automatically before confirming or cancelling.

Local OpenAI-Compatible Endpoints

You can point the tool at a local or self-hosted OpenAI-compatible server with either:

  • base_url in extract.toml
  • --base-url for a one-off override

Example with LM Studio:

uv run fin-ai-stats-extract \
  --base-url http://127.0.0.1:1234/v1 \
  --model google/gemma-3-4b

You can also pass an API key explicitly:

uv run fin-ai-stats-extract --api-key "$OPENAI_API_KEY"

If base_url is set and no API key is available, the tool uses lm-studio as a fallback key for local endpoints that require a non-empty value.

Command-Line Arguments

  • --config: Path to a TOML config file. Defaults to ./extract.toml.
  • --gui: Open a Tkinter settings window for editing values and confirming the run.
  • --input: Required unless --gui is used. Path to one XML file or a folder tree containing XML files.
  • --output: Output CSV path. Defaults to output.csv.
  • --model: Override the configured model.
  • --base-url: Override the configured OpenAI-compatible API base URL.
  • --api-key: Override the API key from the configured environment variable.
  • --temperature: Override the configured Responses API temperature from 0 to 2.
  • --top-p: Override the configured nucleus sampling mass from 0 to 1.
  • --max-output-tokens: Override the configured maximum output tokens.
  • --reasoning-effort: Override the configured reasoning effort: none, minimal, low, medium, high, xhigh.
  • --verbosity: Override the configured output verbosity: low, medium, or high.
  • --max-concurrency, --max-async-jobs, --concurrency: Maximum number of concurrent extraction jobs. Defaults to CONCURRENCY_LIMIT or 100.
  • --dry-run: Parse XML only. No API calls are made.
  • --resume: Resume from an existing output CSV.
  • --sample: Randomly process only N files from the input set.
  • --verbose: Enable debug logging.
  • --yes, -y: Skip cost confirmation.

The most common researcher-facing controls remain temperature, top_p, max_output_tokens, reasoning_effort, and verbosity. In most cases, change temperature or top_p, but not both at the same time.

Output Schema

The [output] section of extract.toml is the extraction contract.

Each group defines:

  • a top-level JSON object key
  • a group title and description
  • an ordered list of fields

Each field defines:

  • the exact field name
  • the field type
  • whether null is allowed
  • the description used in the rendered model instructions

The tool uses that same schema to:

  • build the structured output model at runtime
  • append an Output Contract section to the final instructions sent to the LLM
  • generate CSV headers in the same order

Output

The tool writes one CSV row per transcript.

The CSV always begins with transcript metadata:

  • event_id
  • company_name
  • quarter
  • date
  • headline
  • source_file

Configured extraction fields follow in the order defined in extract.toml.

List-valued fields are serialized using a semicolon-space separator.

When the input is a folder, source_file preserves the relative subfolder path, for example subdir/12345_T.xml.

Notes On Local Models

OpenAI-compatible endpoints vary in how well they support strict structured outputs.

In particular:

  • some local models may fail on long transcripts because of context-window limits
  • some local models may return non-compliant JSON even when a schema is provided

If you use a local endpoint and see parsing or validation failures, try:

  • a model with a larger context window
  • smaller inputs using --sample or a single file first
  • an OpenAI-hosted model for the most reliable structured-output behavior

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fin_ai_stats_extract-0.2.0.tar.gz (20.8 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

fin_ai_stats_extract-0.2.0-py3-none-any.whl (27.3 kB view details)

Uploaded Python 3

File details

Details for the file fin_ai_stats_extract-0.2.0.tar.gz.

File metadata

  • Download URL: fin_ai_stats_extract-0.2.0.tar.gz
  • Upload date:
  • Size: 20.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fin_ai_stats_extract-0.2.0.tar.gz
Algorithm Hash digest
SHA256 6be7a450e68e4dd16c1ded5e611090486cef5291f8655a3f00c70f884f81757d
MD5 0583b175ec49e448a91dea2b50733a84
BLAKE2b-256 e8c728075541c79329dbbfce7a7efb36b808f6acff7314d1874a7f50bce9a7be

See more details on using hashes here.

Provenance

The following attestation bundles were made for fin_ai_stats_extract-0.2.0.tar.gz:

Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file fin_ai_stats_extract-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for fin_ai_stats_extract-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 91dff4ba815b30b7578e9fc5a06ac6d7556396d279254c6902f28a0246d81b3b
MD5 f59a76e8595826ca8d7c2ca9b823e56d
BLAKE2b-256 eb06269e05c13589636d36231e6f56a9be4f39ea5f2632f2839bdbe91b7f58dd

See more details on using hashes here.

Provenance

The following attestation bundles were made for fin_ai_stats_extract-0.2.0-py3-none-any.whl:

Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page