Mining and parsing S-1 IPO filings
IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents
This work is licensed under a Creative Commons Attribution 4.0 International License.
Dataset Construction Pipelines
[Figures: Image Dataset Pipeline; Text Dataset Pipeline]
Quickstart
Install from PyPI
pip install ipo-mine
Using ipo-mine to Download an IPO Filing (Python API)
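from download import IPODownloader, Company

# Identify yourself to the SEC with a real email and organization (see Notes).
downloader = IPODownloader(
    email="example@gmail.com",
    company="Your Example Organization"
)

# Look up the company by its ticker symbol.
company = Company.from_ticker("SNOW")

# Download one filing; save the filing itself but skip its images.
company_filings = downloader.download_ipo(
    company,
    limit=1,
    save_filing=True,
    save_images=False,
    verbose=True
)

# download_ipo returns a CompanyFilings object; take the first Filing.
filing = company_filings.filings[0]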
Parsing the Table of Contents
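# `parser` must be an ipo-mine parser instance created beforehand;
# its construction is not shown in this README.
results = parser.parse_company(
    ticker="SNOW",
    validate=False
)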
CLI Usage
You can use the command-line interface to download and parse filings without writing Python code.
Download
Download the latest S-1 filing for a company:
ipo-mine download SNOW --email your@email.com --org "Your Org"
Options:
- --limit N: Download previous N filings (default: 1)
- --images: Download and extract images from the filing
- --all: Download all available IPO filings for the ticker
Parse
Parse a downloaded filing into section-specific files:
ipo-mine parse SNOW
Options:
- --validate: Enable LLM-based validation of extracted sections
- --provider: LLM provider (anthropic, openai, google, huggingface)
- --mode: Validation mode (binary, likert)
Validate
Run LLM validation on existing parsed text files to check for truncation or completeness.
ipo-mine validate SNOW --provider anthropic
Supported Providers
You can choose from the following providers; each requires an API key:
| Provider | Argument | Env Variable |
|---|---|---|
| Anthropic (Claude) | --provider anthropic | ANTHROPIC_API_KEY |
| OpenAI (GPT-4o) | --provider openai | OPENAI_API_KEY |
| Google (Gemini) | --provider google | GOOGLE_API_KEY |
| HuggingFace | --provider huggingface | HUGGINGFACE_API_KEY |
Validation Modes
- Binary (--mode binary): Returns "Yes" (Valid) or "No" (Truncated/Incomplete). Default.
- Likert (--mode likert): Returns a confidence score from 1 (Incomplete) to 5 (Complete).
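To make the two modes concrete, here is a minimal sketch of how a caller might turn a validator's raw reply into a pass/fail decision. The function name `interpret_validation`, the `likert_cutoff` parameter, and the assumed reply formats are hypothetical; this README does not document how ipo-mine handles the replies internally.

```python
import re


def interpret_validation(reply: str, mode: str = "binary", likert_cutoff: int = 4) -> bool:
    """Return True when a (hypothetical) validator reply indicates a complete section."""
    text = reply.strip().lower()
    if mode == "binary":
        # Binary mode: the reply is expected to start with "Yes" or "No".
        match = re.match(r"(yes|no)\b", text)
        if match is None:
            raise ValueError(f"Unrecognized binary reply: {reply!r}")
        return match.group(1) == "yes"
    if mode == "likert":
        # Likert mode: look for a standalone score from 1 (Incomplete) to 5 (Complete).
        match = re.search(r"\b([1-5])\b", text)
        if match is None:
            raise ValueError(f"No 1-5 score found in reply: {reply!r}")
        # The cutoff is an assumption of this sketch, not an ipo-mine setting.
        return int(match.group(1)) >= likert_cutoff
    raise ValueError(f"Unknown mode: {mode!r}")
```

With the default cutoff of 4, a Likert reply of 4 or 5 counts as complete; lowering the cutoff trades precision for fewer re-parses.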
Authentication
The CLI will look for API keys in this order:
- Command Line Argument: --api-key "sk-..."
- Environment Variable: e.g., export OPENAI_API_KEY="sk-..."
- Interactive Prompt: If neither is found, the CLI will securely prompt you to enter the key (input is hidden).
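The lookup order above can be sketched in a few lines of Python. This is an illustration of the precedence only, not ipo-mine's actual code: `resolve_api_key`, `ENV_VARS`, and the injectable `prompt` parameter are hypothetical names, while the environment variable names come from the Supported Providers table.

```python
import getpass
import os
from typing import Callable, Optional

# Provider -> environment variable, copied from the Supported Providers table.
ENV_VARS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
    "huggingface": "HUGGINGFACE_API_KEY",
}


def resolve_api_key(
    provider: str,
    cli_key: Optional[str] = None,
    prompt: Callable[[str], str] = getpass.getpass,
) -> str:
    """Resolve a key: command-line argument, then environment, then hidden prompt."""
    # 1. An explicit --api-key value always wins.
    if cli_key:
        return cli_key
    # 2. Otherwise fall back to the provider's environment variable.
    env_key = os.environ.get(ENV_VARS[provider])
    if env_key:
        return env_key
    # 3. Finally, ask interactively; getpass hides the typed input.
    return prompt(f"Enter API key for {provider}: ")
```

Passing `prompt` as a parameter keeps the function testable without a terminal, since a test can supply a stub instead of `getpass.getpass`.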
Examples
Validate using OpenAI with Likert scale:
ipo-mine validate TSLA --provider openai --mode likert
Validate using Google Gemini with explicit key:
ipo-mine validate TSLA --provider google --api-key "your-api-key"
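For validating several tickers in one go, the commands above can be assembled from a script. The sketch below only builds and prints the command lines, using the flags documented in this section; the helper name `build_validate_command` is hypothetical, and it assumes the `ipo-mine` executable is on your PATH when you actually run the result.

```python
import shlex
from typing import List, Optional


def build_validate_command(
    ticker: str,
    provider: str,
    mode: str = "binary",
    api_key: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `ipo-mine validate` from the documented flags."""
    cmd = ["ipo-mine", "validate", ticker, "--provider", provider, "--mode", mode]
    if api_key is not None:
        # Omit --api-key to let the CLI fall back to the environment or a prompt.
        cmd += ["--api-key", api_key]
    return cmd


# Print one shell-safe command line per ticker.
for ticker in ["SNOW", "TSLA"]:
    print(shlex.join(build_validate_command(ticker, "openai", mode="likert")))
```

To execute a command instead of printing it, pass the list to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems.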
Notes
- The SEC requires a descriptive User-Agent. Provide a real organization name and your email.
- download_ipo returns a CompanyFilings object; use company_filings.filings[0] to pass a Filing into the parser.
- The parser automatically chooses HTML or text parsing based on the filing URL.
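As an illustration of the first note, the sketch below builds a request whose User-Agent carries an organization name and contact email, which is the general shape the SEC asks for. How ipo-mine formats this header internally is not stated here, so `build_sec_request` and the exact string layout are assumptions; the request is constructed but never sent.

```python
from urllib.request import Request


def build_sec_request(url: str, organization: str, email: str) -> Request:
    """Build (but do not send) a request with a descriptive User-Agent."""
    organization = organization.strip()
    email = email.strip()
    if not organization or "@" not in email:
        raise ValueError("Provide a real organization name and a contact email.")
    # Assumed layout: "<organization> <email>".
    user_agent = f"{organization} {email}"
    return Request(url, headers={"User-Agent": user_agent})


req = build_sec_request(
    "https://www.sec.gov/",  # placeholder URL for illustration only
    organization="Your Example Organization",
    email="example@gmail.com",
)
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```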
File details
Details for the file ipo_mine-0.1.4.tar.gz.
File metadata
- Download URL: ipo_mine-0.1.4.tar.gz
- Upload date:
- Size: 921.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f594596f35036072d3cae73365ce007aaa01e27294e8d0abb5bbab64ec19f951 |
| MD5 | 51c3c4318c6fd74f25287c9c1187cf4e |
| BLAKE2b-256 | 1862bb83daf56033d02ad0bffc3586c3ff73c042ce0829d5fc9132373ef973ac |
File details
Details for the file ipo_mine-0.1.4-py3-none-any.whl.
File metadata
- Download URL: ipo_mine-0.1.4-py3-none-any.whl
- Upload date:
- Size: 935.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 49ea8acacd3520fb618f734fac6bbeb3caa43332525e9259b8fae44b589a9779 |
| MD5 | 4287fe33aa795177afd0edae8c0d19b8 |
| BLAKE2b-256 | aa81dc2b11b0acde64832648a4821e5eb80d08e6693fe35a309256786f2a157b |
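If you download one of these files by hand, you can check it against the listed SHA256 digest before installing. The following is a generic sketch using Python's standard `hashlib`; the helper names `sha256_of` and `verify_download` are hypothetical and not part of ipo-mine.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_hex: str) -> bool:
    """Return True when the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify_download("ipo_mine-0.1.4.tar.gz", "f594596f35036072d3cae73365ce007aaa01e27294e8d0abb5bbab64ec19f951")` should return True for an intact copy of the source distribution.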