Extract structured AI investment data from earnings call transcripts using OpenAI

Project description

fin-ai-stats-extract

Overview

fin-ai-stats-extract extracts structured AI and technology investment data from XML earnings-call transcripts and writes the results to CSV.

It supports:

  • a single XML file or a folder tree of XML files via --input
  • OpenAI-hosted models
  • OpenAI-compatible local endpoints via --base-url
  • explicit resume mode via --resume using an existing output CSV
  • concurrent async extraction with a live progress bar
  • dry-run parsing to validate XML input without making API calls
  • a Streamlit UI for drag-and-drop uploads, model discovery, and in-memory CSV download

The output schema is based on required_output.md.

Requirements

  • Python 3.14+
  • uv

Install And Run

For one-off usage without creating a local environment:

uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv

For local development in this repository:

uv sync

Environment

Create an environment file for API-backed runs:

cp .env.example .env

Example values:

OPENAI_API_KEY=sk-your-key-here
OPENAI_BASE_URL=
OPENAI_MODEL=gpt-4o-mini
CONCURRENCY_LIMIT=100
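
Empty values such as OPENAI_BASE_URL= above fall back to the CLI's built-in defaults. A minimal sketch of that lookup (the helper name env_default is illustrative, not part of the package):

```python
def env_default(env: dict[str, str], key: str, default: str) -> str:
    """Return the configured value, or the default when the key is
    missing or set to an empty string (as in OPENAI_BASE_URL= above)."""
    value = env.get(key, "")
    return value if value else default
```

For example, with the .env values shown, the model resolves to gpt-4o-mini while the base URL falls back to the hosted OpenAI endpoint.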

Basic Usage

Run on a folder of transcripts:

uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv

Run on a single transcript:

uvx fin-ai-stats-extract --input ./data/Current/11473715_T.xml --output ./output.csv

Validate XML parsing only, without calling a model:

uvx fin-ai-stats-extract --input ./data/Current --dry-run

Process only a sample of files:

uvx fin-ai-stats-extract --input ./data/Current --sample 25 --output ./sample.csv

Enable verbose logging:

uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv --verbose

Tune common model sampling settings:

uvx fin-ai-stats-extract \
  --input ./data/Current/11473715_T.xml \
  --output ./output.csv \
  --temperature 0.2 \
  --top-p 0.9 \
  --max-output-tokens 2500 \
  --reasoning-effort medium \
  --verbosity low

Resume an interrupted run from an existing CSV:

uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv --resume
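
Resume works by skipping transcripts whose source_file values already appear in the output CSV. A minimal sketch of that bookkeeping (already_processed is an illustrative helper, not the CLI's internal API):

```python
import csv
from pathlib import Path

def already_processed(output_csv: Path) -> set[str]:
    """Collect source_file values from an existing output CSV so those
    transcripts can be skipped when the run is resumed."""
    if not output_csv.exists():
        return set()
    with output_csv.open(newline="") as f:
        return {row["source_file"] for row in csv.DictReader(f)}
```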

Prompt Selection

The CLI supports a custom prompt file:

uvx fin-ai-stats-extract \
  --input ./data/Current \
  --output ./output.csv \
  --prompt ./my_prompt.md

Prompt resolution works like this:

  1. If --prompt PATH is provided, that file is used.
  2. Otherwise the CLI looks for ./system_prompt.md in the current working directory.
  3. If ./system_prompt.md does not exist, the packaged default prompt is copied there and then used.

The packaged default prompt lives at src/fin_ai_stats_extract/resources/system_prompt.md in this repository.

Local OpenAI-Compatible Endpoints

You can point the tool at a local or self-hosted OpenAI-compatible server by passing --base-url.

Example with LM Studio:

uvx fin-ai-stats-extract \
  --input ./data/Current/11473715_T.xml \
  --output ./output.csv \
  --model google/gemma-3-4b \
  --base-url http://127.0.0.1:1234/v1

You can also pass an API key explicitly:

uvx fin-ai-stats-extract \
  --input ./data/Current \
  --output ./output.csv \
  --model gpt-4o-mini \
  --api-key "$OPENAI_API_KEY"

If --base-url is provided and no API key is set, the tool automatically uses lm-studio as a fallback key for local servers that require a non-empty value.

Command-Line Arguments

  • --input: Required. Path to one XML file or a folder tree containing XML files. Folder scans are recursive.
  • --output: Output CSV path. Defaults to output.csv.
  • --prompt: Optional path to a custom system prompt markdown file. Without it, the CLI uses ./system_prompt.md, creating it from the packaged default if needed.
  • --model: Model name. Defaults to OPENAI_MODEL or gpt-4o-mini.
  • --base-url: Optional OpenAI-compatible API base URL. Defaults to OPENAI_BASE_URL.
  • --api-key: Optional API key. Defaults to OPENAI_API_KEY.
  • --temperature: Responses API sampling temperature from 0 to 2.
  • --top-p: Responses API nucleus sampling mass from 0 to 1.
  • --max-output-tokens: Maximum output tokens, including reasoning tokens.
  • --reasoning-effort: Reasoning effort for supported gpt-5 and o-series models: none, minimal, low, medium, high, xhigh.
  • --verbosity: Output verbosity for supported models: low, medium, or high.
  • --max-concurrency: Maximum number of concurrent extraction jobs. Also accepts --max-async-jobs and --concurrency. Defaults to CONCURRENCY_LIMIT or 100.
  • --resume: Resume from an existing output CSV. Requires an explicit --output path. Rows whose source_file values already appear in that CSV are skipped.
  • --dry-run: Parse XML only. No API calls are made.
  • --sample: Randomly process only N files from the input set.
  • --yes, -y: Skip the cost confirmation prompt when using OpenAI-hosted models.
  • --verbose: Enable debug logging.
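
The --max-concurrency cap above amounts to bounded async fan-out. A sketch using an asyncio semaphore, assuming each extraction job is a callable that returns an awaitable (illustrative, not the tool's internals):

```python
import asyncio

async def run_with_limit(jobs, limit: int = 100):
    """Run the given job factories concurrently, but allow at most
    `limit` of them to be in flight at once."""
    sem = asyncio.Semaphore(limit)

    async def guarded(job):
        async with sem:
            return await job()

    # gather preserves input order in its results
    return await asyncio.gather(*(guarded(j) for j in jobs))
```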

For the current OpenAI Responses API path used by this project, the commonly exposed researcher-facing controls are temperature, top_p, max_output_tokens, reasoning_effort, and verbosity. We recommend changing temperature or top_p, but not both at the same time.

Streamlit UI

For repository development, launch the Streamlit app with:

uv run streamlit run src/fin_ai_stats_extract/streamlit_app.py

The UI lets you:

  • drag and drop one or more XML files
  • choose OpenAI or a custom OpenAI-compatible endpoint
  • automatically load available models from the endpoint's /models API
  • edit the default system prompt before running extraction
  • adjust max concurrency
  • view the generated CSV directly in a table
  • download the generated CSV from memory, without writing a file to disk
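
Model discovery queries the endpoint's /models API, which for OpenAI-compatible servers returns a JSON list of model objects. A sketch of parsing that payload (parse_models_payload is an illustrative helper; the app itself may use an OpenAI client library instead):

```python
import json

def parse_models_payload(payload: str) -> list[str]:
    """An OpenAI-compatible /models endpoint returns a body like
    {"object": "list", "data": [{"id": ...}, ...]}; the UI lists the ids."""
    return [m["id"] for m in json.loads(payload)["data"]]
```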

Output

The tool writes one CSV row per transcript with:

  • transcript metadata: event_id, company_name, quarter, date, headline, source_file
  • AI infrastructure fields
  • AI analytics fields
  • AI talent fields
  • AI risk fields
  • non-AI physical technology investment fields
  • non-AI tech talent fields

List-valued fields are serialized using a semicolon-space separator.

When the input is a folder, source_file preserves the relative subfolder path, for example subdir/12345_T.xml.
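
Both conventions are easy to sketch (serialize_list and source_file_value are illustrative names, not the package's API):

```python
from pathlib import Path

def serialize_list(values: list[str]) -> str:
    # list-valued fields use a semicolon-space separator
    return "; ".join(values)

def source_file_value(xml_path: Path, input_root: Path) -> str:
    # the relative subfolder path is preserved, e.g. "subdir/12345_T.xml"
    return xml_path.relative_to(input_root).as_posix()
```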

Notes On Local Models

OpenAI-compatible endpoints vary in how well they support strict structured outputs.

In particular:

  • some local models may fail on long transcripts because of context-window limits
  • some local models may return non-compliant JSON even when a schema is provided

If you use a local endpoint and see parsing or validation failures, try:

  • a model with a larger context window
  • smaller inputs using --sample or a single file first
  • an OpenAI-hosted model for the most reliable structured-output behavior

Download files

Source distribution:

  • fin_ai_stats_extract-0.1.1.tar.gz (16.3 kB)

Built distribution:

  • fin_ai_stats_extract-0.1.1-py3-none-any.whl (21.7 kB)

File details

Details for the file fin_ai_stats_extract-0.1.1.tar.gz.

File metadata

  • Filename: fin_ai_stats_extract-0.1.1.tar.gz
  • Size: 16.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

  • SHA256: 9c41c1f394c1c41627f3587a5d47ada1e1729c9ab8a68010a5e4455aea1398c9
  • MD5: b2f3d235296eeaad047beb0d589b1112
  • BLAKE2b-256: ea3b4e06180fc2d6d3aedae28663a4709212e789e23804cee499a0d40f7b815e

Provenance

The following attestation bundle was made for fin_ai_stats_extract-0.1.1.tar.gz:

  • Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract
  • Attestations: values shown reflect the state when the release was signed and may no longer be current.

File details

Details for the file fin_ai_stats_extract-0.1.1-py3-none-any.whl.

File hashes

  • SHA256: d1cca25a45f9c733eba00aaaa0a548171ab9eb171fb388872bdd04663f2f2dbf
  • MD5: 96138ec1e744768a29f2807818785c2b
  • BLAKE2b-256: 062d1fac507e3cdfbfd90021e1910e8fab9627c7017b9362c57c27abe3bbd1a9

Provenance

The following attestation bundle was made for fin_ai_stats_extract-0.1.1-py3-none-any.whl:

  • Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract
  • Attestations: values shown reflect the state when the release was signed and may no longer be current.
