Extract structured AI investment data from earnings call transcripts using OpenAI

Project description

fin-ai-stats-extract

Overview

fin-ai-stats-extract extracts structured AI and technology investment data from XML earnings-call transcripts and writes the results to CSV.

It supports:

a single XML file or a folder tree of XML files via --input
OpenAI-hosted models
OpenAI-compatible local endpoints via --base-url
explicit resume mode via --resume using an existing output CSV
concurrent async extraction with a live progress bar
dry-run parsing to validate XML input without making API calls
a Streamlit UI for drag-and-drop uploads, model discovery, and in-memory CSV download
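
The concurrent extraction listed above can be pictured as a set of async jobs gated by a semaphore. The following is only a sketch of that general pattern, not the package's actual code; `run_bounded` and `worker` are hypothetical names.

```python
import asyncio


async def run_bounded(items, worker, limit=100):
    """Run worker(item) for every item with at most `limit` jobs in flight.

    Hypothetical helper: illustrates the bounded-concurrency pattern only.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        # Each job waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))
```

A lower limit trades throughput for fewer simultaneous requests, which can help when an endpoint rate-limits or a local server handles only a few requests at a time.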

The output schema is based on required_output.md.

Requirements

Python 3.14+
uv

Install And Run

For one-off usage without creating a local environment:

uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv

For local development in this repository:

uv sync

Environment

Create an environment file for API-backed runs:

cp .env.example .env

Example values:

OPENAI_API_KEY=sk-your-key-here
OPENAI_BASE_URL=
OPENAI_MODEL=gpt-4o-mini
CONCURRENCY_LIMIT=100
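
As a rough illustration of how these variables could map to settings, the sketch below applies the defaults documented in this README (gpt-4o-mini and 100). The function name `load_settings` is hypothetical, and treating an empty OPENAI_BASE_URL as "use the OpenAI-hosted API" is an assumption.

```python
import os


def load_settings(env=None):
    """Resolve settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        "api_key": env.get("OPENAI_API_KEY") or None,
        # Assumption: an empty base URL means the OpenAI-hosted API is used.
        "base_url": env.get("OPENAI_BASE_URL") or None,
        # Defaults below are the ones documented in this README.
        "model": env.get("OPENAI_MODEL") or "gpt-4o-mini",
        "concurrency": int(env.get("CONCURRENCY_LIMIT") or 100),
    }
```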

Basic Usage

Run on a folder of transcripts:

uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv

Run on a single transcript:

uvx fin-ai-stats-extract --input ./data/Current/11473715_T.xml --output ./output.csv

Validate XML parsing only, without calling a model:

uvx fin-ai-stats-extract --input ./data/Current --dry-run
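
For a sense of what a parse-only check involves, here is a small standalone sketch using the standard library. It only tests XML well-formedness; the real dry run may check more than that. `find_unparseable` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_unparseable(root: Path) -> list[tuple[Path, str]]:
    """Return (path, error message) for every XML file that fails to parse."""
    # A single file is checked on its own; a folder is scanned recursively.
    paths = [root] if root.is_file() else sorted(root.rglob("*.xml"))
    failures = []
    for path in paths:
        try:
            ET.parse(path)
        except ET.ParseError as exc:
            failures.append((path, str(exc)))
    return failures
```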

Process only a sample of files:

uvx fin-ai-stats-extract --input ./data/Current --sample 25 --output ./sample.csv

Enable verbose logging:

uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv --verbose

Tune common model sampling settings:

uvx fin-ai-stats-extract \
  --input ./data/Current/11473715_T.xml \
  --output ./output.csv \
  --temperature 0.2 \
  --top-p 0.9 \
  --max-output-tokens 2500 \
  --reasoning-effort medium \
  --verbosity low

Resume an interrupted run from an existing CSV:

uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv --resume
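
Resume mode skips transcripts whose source_file already appears in the output CSV. A minimal sketch of that bookkeeping, assuming the CSV has a header row with a source_file column; the helper names are hypothetical and the real implementation may differ.

```python
import csv
from pathlib import Path


def processed_sources(output_csv: Path) -> set[str]:
    """Collect the source_file values already present in an output CSV."""
    if not output_csv.exists():
        return set()
    with output_csv.open(newline="", encoding="utf-8") as handle:
        return {
            row["source_file"]
            for row in csv.DictReader(handle)
            if row.get("source_file")
        }


def pending(all_sources: list[str], output_csv: Path) -> list[str]:
    """Keep only the inputs that have no row in the existing CSV yet."""
    done = processed_sources(output_csv)
    return [source for source in all_sources if source not in done]
```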

Prompt Selection

The CLI supports a custom prompt file:

uvx fin-ai-stats-extract \
  --input ./data/Current \
  --output ./output.csv \
  --prompt ./my_prompt.md

Prompt resolution works like this:

If --prompt PATH is provided, that file is used.
Otherwise the CLI looks for ./system_prompt.md in the current working directory.
If ./system_prompt.md does not exist, the packaged default prompt is copied there and then used.

The packaged default prompt shipped with the distribution lives at src/fin_ai_stats_extract/resources/system_prompt.md in this repository.
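
The resolution order above can be sketched as follows. This is an illustration of the three documented rules, not the package's code; `resolve_prompt` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def resolve_prompt(prompt_arg, cwd: Path, packaged_default: Path) -> Path:
    """Pick the prompt file following the three rules described above."""
    # Rule 1: an explicit --prompt path always wins.
    if prompt_arg is not None:
        return Path(prompt_arg)
    # Rule 2: otherwise use ./system_prompt.md in the working directory.
    local = cwd / "system_prompt.md"
    # Rule 3: if it is missing, copy the packaged default there first.
    if not local.exists():
        shutil.copyfile(packaged_default, local)
    return local
```

One consequence of this order is that edits to ./system_prompt.md persist across runs, because the default is copied only when the file is absent.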

Local OpenAI-Compatible Endpoints

You can point the tool at a local or self-hosted OpenAI-compatible server by passing --base-url.

Example with LM Studio:

uvx fin-ai-stats-extract \
  --input ./data/Current/11473715_T.xml \
  --output ./output.csv \
  --model google/gemma-3-4b \
  --base-url http://127.0.0.1:1234/v1

You can also pass an API key explicitly:

uvx fin-ai-stats-extract \
  --input ./data/Current \
  --output ./output.csv \
  --model gpt-4o-mini \
  --api-key "$OPENAI_API_KEY"

If --base-url is provided and no API key is set, the tool automatically uses lm-studio as a fallback key for local servers that require a non-empty value.
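
The key fallback can be summarized in a few lines. This sketch assumes that an explicit --api-key takes precedence over OPENAI_API_KEY; `resolve_api_key` is a hypothetical name.

```python
def resolve_api_key(cli_key, env_key, base_url):
    """Choose the API key to send, following the fallback described above."""
    # Assumption: a key passed on the command line beats the environment.
    key = cli_key or env_key
    if key:
        return key
    # With a custom base URL and no key, use the documented placeholder.
    if base_url:
        return "lm-studio"
    # No key and no custom endpoint: nothing to fall back to.
    return None
```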

Command-Line Arguments

--input: Required. Path to one XML file or a folder tree containing XML files. Folder scans are recursive.
--output: Output CSV path. Defaults to output.csv.
--prompt: Optional path to a custom system prompt markdown file. Without it, the CLI uses ./system_prompt.md, creating it from the packaged default if needed.
--model: Model name. Defaults to OPENAI_MODEL or gpt-4o-mini.
--base-url: Optional OpenAI-compatible API base URL. Defaults to OPENAI_BASE_URL.
--api-key: Optional API key. Defaults to OPENAI_API_KEY.
--temperature: Responses API sampling temperature from 0 to 2.
--top-p: Responses API nucleus sampling mass from 0 to 1.
--max-output-tokens: Maximum output tokens, including reasoning tokens.
--reasoning-effort: Reasoning effort for supported gpt-5 and o-series models: none, minimal, low, medium, high, xhigh.
--verbosity: Output verbosity for supported models: low, medium, or high.
--max-concurrency: Maximum number of concurrent extraction jobs. Also accepts --max-async-jobs and --concurrency. Defaults to CONCURRENCY_LIMIT or 100.
--resume: Resume from an existing output CSV. Requires an explicit --output path.csv. Already processed source_file values are skipped.
--dry-run: Parse XML only. No API calls are made.
--sample: Randomly process only N files from the input set.
--yes, -y: Skip the cost confirmation prompt when using OpenAI-hosted models.
--verbose: Enable debug logging.
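
To show how the aliases and defaults listed above might be declared, here is a sketch using argparse. Whether the project actually uses argparse is an assumption, and only a handful of the flags are included.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a subset of the documented flags (sketch only)."""
    parser = argparse.ArgumentParser(prog="fin-ai-stats-extract")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", default="output.csv")
    # Three spellings of the same option, stored under one destination.
    parser.add_argument(
        "--max-concurrency",
        "--max-async-jobs",
        "--concurrency",
        dest="max_concurrency",
        type=int,
        default=int(os.environ.get("CONCURRENCY_LIMIT") or 100),
    )
    parser.add_argument("--yes", "-y", action="store_true")
    return parser
```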

For the OpenAI Responses API path this project currently uses, the main tunable controls are temperature, top_p, max_output_tokens, reasoning_effort, and verbosity. We recommend changing either temperature or top_p, but not both at the same time.
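
A small pre-flight check can catch out-of-range sampling values before any request is sent. The sketch below uses the ranges documented above and treats setting both values as an advisory note rather than an error; `check_sampling` is a hypothetical helper.

```python
def check_sampling(temperature=None, top_p=None):
    """Return a list of problems; an empty list means the settings look fine."""
    problems = []
    if temperature is not None and not 0 <= temperature <= 2:
        problems.append("temperature must be between 0 and 2")
    if top_p is not None and not 0 <= top_p <= 1:
        problems.append("top_p must be between 0 and 1")
    # Advisory only: the README recommends tuning one of the two at a time.
    if temperature is not None and top_p is not None:
        problems.append("temperature and top_p are both set; consider changing only one")
    return problems
```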

Streamlit UI

For repository development, launch the Streamlit app with:

uv run streamlit run src/fin_ai_stats_extract/streamlit_app.py

The UI lets you:

drag and drop one or more XML files
choose OpenAI or a custom OpenAI-compatible endpoint
automatically load available models from the endpoint's /models API
edit the default system prompt before running extraction
adjust max concurrency
view the generated CSV directly in a table
download the generated CSV directly from memory without writing a CSV file to disk
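
Model discovery relies on the endpoint's /models route, which on OpenAI-compatible servers returns a JSON object whose data array holds one entry per model with an id field. The sketch below shows one way to query it with the standard library; the function names are hypothetical and the app itself may use a different client.

```python
import json
from urllib.request import Request, urlopen


def models_url(base_url: str) -> str:
    """Build the /models URL from a base URL such as http://127.0.0.1:1234/v1."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract sorted model ids from a /models response body."""
    return sorted(item["id"] for item in payload.get("data", []) if "id" in item)


def list_models(base_url: str, api_key: str) -> list[str]:
    """Fetch and parse the model list (performs a network request)."""
    request = Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urlopen(request, timeout=10) as response:
        return parse_model_ids(json.load(response))
```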

Output

The tool writes one CSV row per transcript with:

transcript metadata: event_id, company_name, quarter, date, headline, source_file
AI infrastructure fields
AI analytics fields
AI talent fields
AI risk fields
non-AI physical technology investment fields
non-AI tech talent fields

List-valued fields are serialized using a semicolon-space separator.

When the input is a folder, source_file preserves the relative subfolder path, for example subdir/12345_T.xml.
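
The two formatting rules above can be illustrated as follows. The helper names are hypothetical, and using the bare file name for single-file input is an assumption.

```python
from pathlib import Path


def serialize_list(values: list[str]) -> str:
    """Join list-valued fields with the semicolon-space separator."""
    return "; ".join(values)


def source_file_for(xml_path: Path, input_root: Path) -> str:
    """Compute the source_file value for one transcript."""
    if input_root.is_dir():
        # Folder input: keep the subfolder path relative to the input root.
        return xml_path.relative_to(input_root).as_posix()
    # Assumption: single-file input is recorded by file name only.
    return xml_path.name
```

When reading the CSV back, splitting a list-valued cell on "; " recovers the original items, provided no item itself contains that separator.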

Notes On Local Models

OpenAI-compatible endpoints vary in how well they support strict structured outputs.

In particular:

some local models may fail on long transcripts because of context-window limits
some local models may return non-compliant JSON even when a schema is provided

If you use a local endpoint and see parsing or validation failures, try:

a model with a larger context window
smaller inputs using --sample or a single file first
an OpenAI-hosted model for the most reliable structured-output behavior
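
When debugging a local endpoint, it can help to separate "the model returned invalid JSON" from "the JSON is missing fields". The sketch below does that with the standard library only. `check_model_output` is hypothetical, and the required field names must come from your own schema.

```python
import json


def check_model_output(raw: str, required_fields: set[str]):
    """Return (data, None) on success or (None, reason) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Typical for truncated output when the context window is exceeded.
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    missing = sorted(required_fields - set(data))
    if missing:
        return None, "missing fields: " + ", ".join(missing)
    return data, None
```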

Project details

Release history

0.2.1: Apr 6, 2026
0.2.0: Apr 5, 2026
0.1.2: Mar 30, 2026
0.1.1: Mar 30, 2026
0.1.0 (this version): Mar 30, 2026

Download files

Source Distribution

fin_ai_stats_extract-0.1.0.tar.gz (16.3 kB)

Uploaded Mar 30, 2026 Source

Built Distribution

fin_ai_stats_extract-0.1.0-py3-none-any.whl (21.7 kB)

Uploaded Mar 30, 2026 Python 3

File details

Details for the file fin_ai_stats_extract-0.1.0.tar.gz.

File metadata

Download URL: fin_ai_stats_extract-0.1.0.tar.gz
Upload date: Mar 30, 2026
Size: 16.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fin_ai_stats_extract-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`1cf50dacc593ea5aab50a1fe30fad70628b592bfc0015ee4d5060af1e7932dd2`
MD5	`27a680ddbe81cc80d7dc05d92a6bfa74`
BLAKE2b-256	`ab623b08851a1768bc0288722a9794892f29de8c086e3e01776c70fda1e612ca`

Provenance

The following attestation bundles were made for fin_ai_stats_extract-0.1.0.tar.gz:

Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fin_ai_stats_extract-0.1.0.tar.gz
- Subject digest: 1cf50dacc593ea5aab50a1fe30fad70628b592bfc0015ee4d5060af1e7932dd2
- Sigstore transparency entry: 1199312586
- Sigstore integration time: Mar 30, 2026
Source repository:
- Permalink: AlexDrBanana/fin-ai-stats-extract@e1dac6cab2288d62e04bab83e4d534f666229af2
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/AlexDrBanana
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e1dac6cab2288d62e04bab83e4d534f666229af2
- Trigger Event: push

File details

Details for the file fin_ai_stats_extract-0.1.0-py3-none-any.whl.

File metadata

Download URL: fin_ai_stats_extract-0.1.0-py3-none-any.whl
Upload date: Mar 30, 2026
Size: 21.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for fin_ai_stats_extract-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3452e46262f2d8b23e279c43ebc2be17fb4132978365a663baa7f3330b40c41f`
MD5	`4fa3213b8549df1ace60ddd8195fcd01`
BLAKE2b-256	`01a66ad9ee6f77311c3dbcda42b2c7bd5fed222fe5119260611e76da5cf346ac`

Provenance

The following attestation bundles were made for fin_ai_stats_extract-0.1.0-py3-none-any.whl:

Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fin_ai_stats_extract-0.1.0-py3-none-any.whl
- Subject digest: 3452e46262f2d8b23e279c43ebc2be17fb4132978365a663baa7f3330b40c41f
- Sigstore transparency entry: 1199312591
- Sigstore integration time: Mar 30, 2026
Source repository:
- Permalink: AlexDrBanana/fin-ai-stats-extract@e1dac6cab2288d62e04bab83e4d534f666229af2
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/AlexDrBanana
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@e1dac6cab2288d62e04bab83e4d534f666229af2
- Trigger Event: push
