fin-ai-stats-extract
Extract structured AI investment data from earnings-call transcripts using OpenAI.
Overview
fin-ai-stats-extract extracts structured AI and technology investment data from XML earnings-call transcripts and writes the results to CSV.
It supports:
- a single XML file or a folder tree of XML files via `--input`
- OpenAI-hosted models
- OpenAI-compatible local endpoints via `--base-url`
- explicit resume mode via `--resume`, using an existing output CSV
- concurrent async extraction with a live progress bar
- dry-run parsing to validate XML input without making API calls
- a Streamlit UI for drag-and-drop uploads, model discovery, and in-memory CSV download
The output schema is based on required_output.md.
Requirements
- Python 3.14+
- uv
Install And Run
For one-off usage without creating a local environment:
uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv
For local development in this repository:
uv sync
Environment
Create an environment file for API-backed runs:
cp .env.example .env
Example values:
OPENAI_API_KEY=sk-your-key-here
OPENAI_BASE_URL=
OPENAI_MODEL=gpt-4o-mini
CONCURRENCY_LIMIT=100
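The README does not specify how these values are loaded, but the defaults documented under Command-Line Arguments suggest a resolution order like the following. This is an illustrative sketch only; `resolve_settings` is a hypothetical helper, not part of the package's API.

```python
# Hypothetical sketch of how the CLI's settings might resolve from the
# environment, using the defaults documented in this README.
def resolve_settings(env: dict[str, str]) -> dict[str, object]:
    return {
        "api_key": env.get("OPENAI_API_KEY"),
        # An empty OPENAI_BASE_URL means the hosted OpenAI endpoint.
        "base_url": env.get("OPENAI_BASE_URL") or None,
        "model": env.get("OPENAI_MODEL", "gpt-4o-mini"),
        "concurrency": int(env.get("CONCURRENCY_LIMIT", "100")),
    }

settings = resolve_settings({"OPENAI_API_KEY": "sk-your-key-here"})
```

With only the API key set, the sketch falls back to `gpt-4o-mini` and a concurrency of 100, matching the documented defaults.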
Basic Usage
Run on a folder of transcripts:
uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv
Run on a single transcript:
uvx fin-ai-stats-extract --input ./data/Current/11473715_T.xml --output ./output.csv
Validate XML parsing only, without calling a model:
uvx fin-ai-stats-extract --input ./data/Current --dry-run
Process only a sample of files:
uvx fin-ai-stats-extract --input ./data/Current --sample 25 --output ./sample.csv
Enable verbose logging:
uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv --verbose
Tune common model sampling settings:
uvx fin-ai-stats-extract \
--input ./data/Current/11473715_T.xml \
--output ./output.csv \
--temperature 0.2 \
--top-p 0.9 \
--max-output-tokens 2500 \
--reasoning-effort medium \
--verbosity low
Resume an interrupted run from an existing CSV:
uvx fin-ai-stats-extract --input ./data/Current --output ./output.csv --resume
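The resume behavior described above amounts to reading the existing output CSV, collecting its `source_file` column, and skipping those inputs. A minimal sketch (`already_processed` is a hypothetical helper, not the package's actual code):

```python
import csv
import io

def already_processed(existing_csv_text: str) -> set[str]:
    """Collect source_file values from a previous run's output CSV."""
    reader = csv.DictReader(io.StringIO(existing_csv_text))
    return {row["source_file"] for row in reader if row.get("source_file")}

previous = "source_file,company_name\na.xml,Acme\nsub/b.xml,Beta\n"
done = already_processed(previous)
todo = [f for f in ["a.xml", "sub/b.xml", "c.xml"] if f not in done]
```

Only `c.xml` would be processed in this example; the two files already present in the CSV are skipped.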
Prompt Selection
The CLI supports a custom prompt file:
uvx fin-ai-stats-extract \
--input ./data/Current \
--output ./output.csv \
--prompt ./my_prompt.md
Prompt resolution works like this:
- If `--prompt PATH` is provided, that file is used.
- Otherwise, the CLI looks for `./system_prompt.md` in the current working directory.
- If `./system_prompt.md` does not exist, the packaged default prompt is copied there and then used.
The packaged default prompt shipped with the distribution lives at src/fin_ai_stats_extract/resources/system_prompt.md in this repository.
Local OpenAI-Compatible Endpoints
You can point the tool at a local or self-hosted OpenAI-compatible server by passing --base-url.
Example with LM Studio:
uvx fin-ai-stats-extract \
--input ./data/Current/11473715_T.xml \
--output ./output.csv \
--model google/gemma-3-4b \
--base-url http://127.0.0.1:1234/v1
You can also pass an API key explicitly:
uvx fin-ai-stats-extract \
--input ./data/Current \
--output ./output.csv \
--model gpt-4o-mini \
--api-key "$OPENAI_API_KEY"
If `--base-url` is provided and no API key is set, the tool automatically uses `lm-studio` as a fallback key for local servers that require a non-empty value.
Command-Line Arguments
- `--input`: Required. Path to one XML file or a folder tree containing XML files. Folder scans are recursive.
- `--output`: Output CSV path. Defaults to `output.csv`.
- `--prompt`: Optional path to a custom system prompt markdown file. Without it, the CLI uses `./system_prompt.md`, creating it from the packaged default if needed.
- `--model`: Model name. Defaults to `OPENAI_MODEL` or `gpt-4o-mini`.
- `--base-url`: Optional OpenAI-compatible API base URL. Defaults to `OPENAI_BASE_URL`.
- `--api-key`: Optional API key. Defaults to `OPENAI_API_KEY`.
- `--temperature`: Responses API sampling temperature, from 0 to 2.
- `--top-p`: Responses API nucleus sampling mass, from 0 to 1.
- `--max-output-tokens`: Maximum output tokens, including reasoning tokens.
- `--reasoning-effort`: Reasoning effort for supported `gpt-5` and o-series models: `none`, `minimal`, `low`, `medium`, `high`, `xhigh`.
- `--verbosity`: Output verbosity for supported models: `low`, `medium`, or `high`.
- `--max-concurrency`: Maximum number of concurrent extraction jobs. Also accepts `--max-async-jobs` and `--concurrency`. Defaults to `CONCURRENCY_LIMIT` or 100.
- `--resume`: Resume from an existing output CSV. Requires an explicit `--output` path. Already processed `source_file` values are skipped.
- `--dry-run`: Parse XML only. No API calls are made.
- `--sample`: Randomly process only `N` files from the input set.
- `--yes`, `-y`: Skip the cost confirmation prompt when using OpenAI-hosted models.
- `--verbose`: Enable debug logging.
For the current OpenAI Responses API path used by this project, the commonly exposed researcher-facing controls are `temperature`, `top_p`, `max_output_tokens`, `reasoning_effort`, and `verbosity`. We recommend adjusting either `temperature` or `top_p`, but not both at the same time.
Streamlit UI
For repository development, launch the Streamlit app with:
uv run streamlit run src/fin_ai_stats_extract/streamlit_app.py
The UI lets you:
- drag and drop one or more XML files
- choose OpenAI or a custom OpenAI-compatible endpoint
- automatically load available models from the endpoint's `/models` API
- edit the default system prompt before running extraction
- adjust max concurrency
- view the generated CSV directly in a table
- download the generated CSV from memory, without writing a file to disk
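The in-memory download works by serializing the CSV into a buffer and handing the bytes to Streamlit's `st.download_button`. The buffer-building part can be sketched like this (shown without Streamlit so it runs standalone; `csv_bytes` is an illustrative helper, not the app's actual code):

```python
import csv
import io

def csv_bytes(rows: list[dict], fieldnames: list[str]) -> bytes:
    """Serialize rows to CSV entirely in memory."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")

payload = csv_bytes(
    [{"source_file": "a.xml", "company_name": "Acme"}],
    ["source_file", "company_name"],
)
```

In the app, `payload` would be passed as the `data` argument of `st.download_button`, so nothing ever touches the local filesystem.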
Output
The tool writes one CSV row per transcript with:
- transcript metadata: `event_id`, `company_name`, `quarter`, `date`, `headline`, `source_file`
- AI infrastructure fields
- AI analytics fields
- AI talent fields
- AI risk fields
- non-AI physical technology investment fields
- non-AI tech talent fields
List-valued fields are serialized using a semicolon-space separator.
When the input is a folder, source_file preserves the relative subfolder path, for example subdir/12345_T.xml.
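Both conventions above can be sketched in a few lines (illustrative helpers, not the package's code):

```python
from pathlib import Path

def serialize_list(values: list[str]) -> str:
    """Join list-valued fields with a semicolon-space separator."""
    return "; ".join(values)

def source_file_for(path: Path, input_root: Path) -> str:
    """Preserve the relative subfolder path when the input is a folder."""
    return path.relative_to(input_root).as_posix()

row_value = serialize_list(["GPU clusters", "data centers"])
rel = source_file_for(Path("data/Current/subdir/12345_T.xml"), Path("data/Current"))
```

Using `as_posix()` keeps `source_file` values stable across operating systems, which matters for `--resume` matching.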
Notes On Local Models
OpenAI-compatible endpoints vary in how well they support strict structured outputs.
In particular:
- some local models may fail on long transcripts because of context-window limits
- some local models may return non-compliant JSON even when a schema is provided
If you use a local endpoint and see parsing or validation failures, try:
- a model with a larger context window
- smaller inputs, using `--sample` or a single file first
- an OpenAI-hosted model for the most reliable structured-output behavior
File details
Details for the file fin_ai_stats_extract-0.1.2.tar.gz.
File metadata
- Download URL: fin_ai_stats_extract-0.1.2.tar.gz
- Upload date:
- Size: 18.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cebbb1dea37b887a74e188d46abced11a00b584d85c6bf57f8536dc39ee3196f |
| MD5 | ab840f74168eb8e1c4a213a3267b18f6 |
| BLAKE2b-256 | 68f618d97d41fa698feef43dbb40be6cb829779b08167c520d5188d0cf0ab908 |
Provenance
The following attestation bundles were made for fin_ai_stats_extract-0.1.2.tar.gz:
Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fin_ai_stats_extract-0.1.2.tar.gz
- Subject digest: cebbb1dea37b887a74e188d46abced11a00b584d85c6bf57f8536dc39ee3196f
- Sigstore transparency entry: 1199371621
- Sigstore integration time:
- Permalink: AlexDrBanana/fin-ai-stats-extract@b2cff7f76ca3b5a5617e99d6e03c55a490537cff
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/AlexDrBanana
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b2cff7f76ca3b5a5617e99d6e03c55a490537cff
- Trigger Event: push
File details
Details for the file fin_ai_stats_extract-0.1.2-py3-none-any.whl.
File metadata
- Download URL: fin_ai_stats_extract-0.1.2-py3-none-any.whl
- Upload date:
- Size: 24.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f29a8d5b91a083615f817dd58674b84e4b7d9dc05a050273484042b0a536e1de |
| MD5 | d796cf54d06e1bfc9c5248a84ff3b0b7 |
| BLAKE2b-256 | b42d78e16f70884c3f31b6de2637c6ab3662af9f1d89d1bbeaa45c8c0581bb74 |
Provenance
The following attestation bundles were made for fin_ai_stats_extract-0.1.2-py3-none-any.whl:
Publisher: release.yml on AlexDrBanana/fin-ai-stats-extract
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: fin_ai_stats_extract-0.1.2-py3-none-any.whl
- Subject digest: f29a8d5b91a083615f817dd58674b84e4b7d9dc05a050273484042b0a536e1de
- Sigstore transparency entry: 1199371634
- Sigstore integration time:
- Permalink: AlexDrBanana/fin-ai-stats-extract@b2cff7f76ca3b5a5617e99d6e03c55a490537cff
- Branch / Tag: refs/tags/v0.1.2
- Owner: https://github.com/AlexDrBanana
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@b2cff7f76ca3b5a5617e99d6e03c55a490537cff
- Trigger Event: push