fin-ai-stats-extract
Extract structured AI investment data from earnings-call transcripts using OpenAI.
Overview
fin-ai-stats-extract detects and classifies AI-related discussion in XML earnings-call transcripts and writes the results to CSV.
The tool implements a five-stage methodology via LLM extraction:
- AI Mention Detection — keyword-based yes/no decision and hit count
- Representative Sentence Selection — up to 3 most keyword-dense sentences
- Attitude Classification — excited / concerned / neutral tone label
- Initiator Attribution — who raised AI first (management, analyst, both, unclear)
- Confidence Classification — hopeful / confident / transformational / authoritative
The tool is driven by a single TOML config file, extract.toml, which defines:
- the base extraction instructions sent to the model
- model and endpoint settings
- the ordered output field list
That same config drives three things at once:
- the final system instructions sent to the model
- the structured JSON schema used for Responses API parsing
- the CSV column order used for output
CLI flags remain available for fast tuning and one-off overrides. When a CLI flag conflicts with a TOML value, the CLI value wins and the tool logs a warning.
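That precedence rule can be pictured with a small sketch (illustrative only; the real option names and log wording come from the tool's internals, which this README does not specify):

```python
import logging

logging.basicConfig(level=logging.WARNING, format="%(levelname)s: %(message)s")

def merge_settings(toml_cfg: dict, cli_overrides: dict) -> dict:
    """CLI values win over TOML values; conflicts are logged as warnings."""
    merged = dict(toml_cfg)
    for key, value in cli_overrides.items():
        if value is None:  # flag not supplied on the command line
            continue
        if key in merged and merged[key] != value:
            logging.warning("CLI --%s=%r overrides TOML value %r",
                            key.replace("_", "-"), value, merged[key])
        merged[key] = value
    return merged

settings = merge_settings({"temperature": 0.7, "model": "gpt-5-mini"},
                          {"temperature": 0.2, "model": None})
```

Here `temperature` comes from the CLI override while `model` falls through to the TOML value.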
Requirements
- Python 3.14+
- uv
Install
For local development in this repository:
uv sync
For one-off usage without creating a local environment:
uvx fin-ai-stats-extract
Configuration
The default config file is ./extract.toml.
If you run the tool without --config and extract.toml does not exist in the current working directory, the packaged default config is copied there automatically.
The default config includes:
- commented-out optional settings such as temperature, top_p, and reasoning_effort
- the full methodology for AI mention detection, attitude classification, initiator attribution, and confidence classification
Edit extract.toml directly to change AI instructions, AI model settings, or output fields.
The output contract now uses a flat ordered list:
[output]
format = [
  { name = "ai_mentioned", description = "Whether at least one core AI keyword appears" },
  { name = "keyword_hit_count", description = "Total count of AI keyword matches" },
]
Descriptions are passed through directly into the rendered LLM instructions. The program does not parse or validate description text beyond requiring it to be non-empty.
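The pass-through can be sketched roughly like this (a hypothetical rendering; the function name, heading text, and numbering are assumptions, not the tool's exact output):

```python
def render_output_contract(fields: list[dict]) -> str:
    """Build an Output Contract section from the [output] format list."""
    lines = ["Output Contract",
             "Return a JSON object with exactly these fields, in order:"]
    for i, field in enumerate(fields, start=1):
        desc = field["description"].strip()
        if not desc:  # the only validation: descriptions must be non-empty
            raise ValueError(f"field {field['name']!r} has an empty description")
        lines.append(f"{i}. {field['name']}: {desc}")
    return "\n".join(lines)

fields = [
    {"name": "ai_mentioned", "description": "Whether at least one core AI keyword appears"},
    {"name": "keyword_hit_count", "description": "Total count of AI keyword matches"},
]
print(render_output_contract(fields))
```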
Environment
Create an environment file for API-backed runs:
cp .env.example .env
Example values:
OPENAI_API_KEY=sk-your-key-here
By default, extract.toml reads the API key from OPENAI_API_KEY via api_key_env = "OPENAI_API_KEY".
Basic Usage
Run using the defaults in extract.toml:
uv run fin-ai-stats-extract --input ./data/Current
Run on a single transcript:
uv run fin-ai-stats-extract --input ./data/Current/11473715_T.xml
Write to a different CSV for one run:
uv run fin-ai-stats-extract --output ./sample.csv
Validate XML parsing only, without calling a model:
uv run fin-ai-stats-extract --dry-run
Process only a sample of files:
uv run fin-ai-stats-extract --sample 25
Enable verbose logging:
uv run fin-ai-stats-extract --verbose
Tune common model settings for one run:
uv run fin-ai-stats-extract \
--temperature 0.2 \
--top-p 0.9 \
--max-output-tokens 2500 \
--reasoning-effort medium \
--verbosity low
Resume an interrupted run from an existing CSV:
uv run fin-ai-stats-extract --resume
Use a different config file:
uv run fin-ai-stats-extract --config ./custom_extract.toml
Open the Tkinter review window instead of supplying all settings on the command line:
uv run fin-ai-stats-extract --gui
The GUI lets you edit the prompt instructions, model settings, input/output paths, and runtime options; inspect the TOML-driven output format in a dedicated tab; and review the automatically computed run cost before confirming or cancelling.
Local OpenAI-Compatible Endpoints
You can point the tool at a local or self-hosted OpenAI-compatible server with either:
- base_url in extract.toml
- --base-url for a one-off override
Example with LM Studio:
uv run fin-ai-stats-extract \
--base-url http://127.0.0.1:1234/v1 \
--model google/gemma-3-4b
You can also pass an API key explicitly:
uv run fin-ai-stats-extract --api-key "$OPENAI_API_KEY"
If base_url is set and no API key is available, the tool uses lm-studio as a fallback key for local endpoints that require a non-empty value.
Command-Line Arguments
- --config: Path to a TOML config file. Defaults to ./extract.toml.
- --gui: Open a Tkinter settings window for editing values and confirming the run.
- --input: Required unless --gui is used. Path to one XML file or a folder tree containing XML files.
- --output: Output CSV path. Defaults to output.csv.
- --model: Override the configured model.
- --base-url: Override the configured OpenAI-compatible API base URL.
- --api-key: Override the API key from the configured environment variable.
- --temperature: Override the configured Responses API temperature, from 0 to 2.
- --top-p: Override the configured nucleus sampling mass, from 0 to 1.
- --max-output-tokens: Override the configured maximum output tokens.
- --reasoning-effort: Override the configured reasoning effort: none, minimal, low, medium, high, xhigh.
- --verbosity: Override the configured output verbosity: low, medium, or high.
- --max-concurrency, --max-async-jobs, --concurrency: Maximum number of concurrent extraction jobs. Defaults to CONCURRENCY_LIMIT or 100.
- --dry-run: Parse XML only. No API calls are made.
- --resume: Resume from an existing output CSV.
- --sample: Randomly process only N files from the input set.
- --verbose: Enable debug logging.
- --yes, -y: Skip cost confirmation.
The most common researcher-facing controls remain temperature, top_p, max_output_tokens, reasoning_effort, and verbosity. In most cases, change temperature or top_p, but not both at the same time.
Output Schema
The [output] section of extract.toml is the extraction contract.
Each item in format defines:
- the exact field name
- the output order used in both the JSON contract and CSV columns
- a freeform description that is copied directly into the model instructions
The tool uses that same schema to:
- build the structured output model at runtime
- append an Output Contract section to the final instructions sent to the LLM
- generate CSV headers in the same order
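Deriving the structured-output schema from the same list might look like the following stdlib-only sketch (the field types here are assumed to be strings; the tool's actual runtime model construction is not shown in this README):

```python
import json

def build_json_schema(fields: list[dict]) -> dict:
    """Build a strict JSON schema whose property order mirrors the [output] list."""
    properties = {
        f["name"]: {"type": "string", "description": f["description"]}
        for f in fields
    }
    return {
        "type": "object",
        "properties": properties,
        "required": [f["name"] for f in fields],
        "additionalProperties": False,
    }

fields = [
    {"name": "ai_mentioned", "description": "Whether at least one core AI keyword appears"},
    {"name": "keyword_hit_count", "description": "Total count of AI keyword matches"},
]
schema = build_json_schema(fields)
print(json.dumps(schema, indent=2))
```

Because Python dicts preserve insertion order, `list(schema["properties"])` yields the same ordering used for the CSV headers.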
Output
The tool writes one CSV row per transcript.
The CSV always begins with transcript metadata:
- event_id
- company_name
- quarter
- date
- headline
- source_file
Configured extraction fields follow in the order defined in extract.toml.
List-valued fields are serialized using a semicolon-space separator.
When the input is a folder, source_file preserves the relative subfolder path, for example subdir/12345_T.xml.
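Row serialization can be sketched as follows (illustrative; the field name representative_sentences is a hypothetical example, not a documented column):

```python
import csv
import io
from pathlib import Path

def serialize_value(value):
    """Join list-valued fields with '; '; pass scalars through as strings."""
    if isinstance(value, list):
        return "; ".join(str(v) for v in value)
    return "" if value is None else str(value)

def relative_source_file(input_root: Path, xml_path: Path) -> str:
    """Preserve the subfolder path relative to the input folder."""
    return xml_path.relative_to(input_root).as_posix()

row = {
    "source_file": relative_source_file(Path("data/Current"),
                                        Path("data/Current/subdir/12345_T.xml")),
    "representative_sentences": ["AI demand grew.", "We invest in AI."],
}
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["source_file", "representative_sentences"])
writer.writeheader()
writer.writerow({k: serialize_value(v) for k, v in row.items()})
print(buf.getvalue())
```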
Notes On Local Models
OpenAI-compatible endpoints vary in how well they support strict structured outputs.
In particular:
- some local models may fail on long transcripts because of context-window limits
- some local models may return non-compliant JSON even when a schema is provided
If you use a local endpoint and see parsing or validation failures, try:
- a model with a larger context window
- smaller inputs, using --sample or a single file first
- an OpenAI-hosted model for the most reliable structured-output behavior
File details
Details for the file fin_ai_stats_extract-0.2.1.tar.gz.
File metadata
- Download URL: fin_ai_stats_extract-0.2.1.tar.gz
- Upload date:
- Size: 22.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5152e9c5d0f6cb964e762ce1982573cd18635e1ff8261db92a60d696c16b88ed |
| MD5 | a146f96febb131eb305d98f5f4615804 |
| BLAKE2b-256 | dbd14c6c8ce3be4d032c0dab8f13ecaa59801e5e9f93f71c896ef6fe90890e60 |
Provenance
The following attestation bundles were made for fin_ai_stats_extract-0.2.1.tar.gz:
- Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fin_ai_stats_extract-0.2.1.tar.gz
- Subject digest: 5152e9c5d0f6cb964e762ce1982573cd18635e1ff8261db92a60d696c16b88ed
- Sigstore transparency entry: 1242596559
- Sigstore integration time:
- Permalink: AlexDrBanana/fin-ai-stats-extract@6cf8be6e977e90b35a9949d9c657dda9f5e7f21a
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/AlexDrBanana
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@6cf8be6e977e90b35a9949d9c657dda9f5e7f21a
- Trigger Event: push
File details
Details for the file fin_ai_stats_extract-0.2.1-py3-none-any.whl.
File metadata
- Download URL: fin_ai_stats_extract-0.2.1-py3-none-any.whl
- Upload date:
- Size: 28.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3dd79f19865664aaf02fe2d28cdf0899841ad470ce8724bf390aca8a0273cb1f |
| MD5 | 6d3e76625bfe0403622cc1364203eceb |
| BLAKE2b-256 | 5dbf0a5c35fe97ba7fe0b8fee65bce16861da1d91ca7883f4163b61743bfb5b8 |
Provenance
The following attestation bundles were made for fin_ai_stats_extract-0.2.1-py3-none-any.whl:
- Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fin_ai_stats_extract-0.2.1-py3-none-any.whl
- Subject digest: 3dd79f19865664aaf02fe2d28cdf0899841ad470ce8724bf390aca8a0273cb1f
- Sigstore transparency entry: 1242596570
- Sigstore integration time:
- Permalink: AlexDrBanana/fin-ai-stats-extract@6cf8be6e977e90b35a9949d9c657dda9f5e7f21a
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/AlexDrBanana
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@6cf8be6e977e90b35a9949d9c657dda9f5e7f21a
- Trigger Event: push