Mining and parsing S-1 IPO filings
IPO-Mine: S-1 (IPO) Filings Toolkit
GitHub Repository: https://github.com/gtfintechlab/S1-Filings
Project Website: https://ipo-mine.web.app/
Overview
IPO-Mine is a Python package for downloading, parsing, and structuring S-1 IPO filings from the U.S. Securities and Exchange Commission (SEC) EDGAR system.
This repository implements the data processing pipeline used to construct the IPO-Mine dataset, a section-structured corpus introduced in the research paper:
IPO-Mine: A Section-Structured Dataset for Analyzing Long and Complex IPO Filings
The objective of this project is to transform raw SEC filings into clean, standardized, and section-aligned textual representations suitable for large-scale analysis in natural language processing, information retrieval, and long-document modeling.
Motivation
S-1 filings are among the most complex regulatory documents used in empirical research. They exhibit several challenges:
- Extreme document length, often 100–300 pages or more
- Substantial variation in section headers across firms and time
- Heterogeneous formats, including HTML, plain text, and scanned images
- Limited structural consistency despite regulatory guidance
These characteristics complicate tasks such as section segmentation, cross-firm comparison, longitudinal analysis, and long-context modeling.
IPO-Mine addresses these challenges by providing a unified and reproducible pipeline that converts raw EDGAR filings into structured, research-ready data.
Features
- Automated downloading of S-1 and S-1/A filings from SEC EDGAR
- Parsing of Tables of Contents (TOCs) for filings dating back to 1997
- Extraction and normalization of key IPO sections, including:
  - Risk Factors
  - Business
  - Use of Proceeds
  - Management’s Discussion and Analysis (MD&A)
  - Financial Statements
- Support for multiple filing formats:
  - HTML
  - Plain text
  - Image-based filings via OCR
- Fuzzy matching of section headers using global section mappings
- Deterministic outputs suitable for reproducible dataset construction
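To illustrate the idea behind fuzzy header matching, here is a minimal sketch using only the standard library. It is not the package's implementation: the `SECTION_VARIANTS` table, the `normalize` and `match_section` helpers, and the 0.8 cutoff are all hypothetical, and the real global section mappings may be structured quite differently.

```python
import difflib
import re

# Hypothetical mapping from canonical section names to header variants.
# The package's actual global section mappings may look different.
SECTION_VARIANTS = {
    "Risk Factors": ["risk factors", "certain risk factors", "investment considerations"],
    "Use of Proceeds": ["use of proceeds", "application of proceeds"],
    "MD&A": ["management's discussion and analysis"],
}


def normalize(header: str) -> str:
    """Lowercase, unify apostrophes, and collapse punctuation and whitespace."""
    header = header.lower().replace("\u2019", "'")
    header = re.sub(r"[^a-z0-9'& ]+", " ", header)
    return re.sub(r"\s+", " ", header).strip()


def match_section(header: str, cutoff: float = 0.8):
    """Return the canonical name of the closest known variant, or None."""
    lookup = {
        variant: canonical
        for canonical, variants in SECTION_VARIANTS.items()
        for variant in variants
    }
    hits = difflib.get_close_matches(normalize(header), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


print(match_section("RISK FACTORS"))            # Risk Factors
print(match_section("Executive Compensation"))  # None
```

Normalizing before matching keeps casing and punctuation differences from consuming the similarity budget, so the cutoff only has to absorb genuine wording variation.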
IPO-Mine Dataset
The IPO-Mine dataset is built with this toolkit as a large-scale corpus of IPO filings with:
- Section-aligned text across firms
- Standardized section nomenclature
- Clean document boundaries
- Compatibility with long-document modeling and retrieval frameworks
Additional details and examples are available at:
Installation
The package is available on PyPI under the name ipo-mine.
pip install ipo-mine
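To confirm that the install succeeded, one option is to query the installed distribution's metadata. This is a small sketch using the standard library only; `installed_version` is a hypothetical helper, not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


print(installed_version("ipo-mine"))
```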
OCR Dependency
Parsing image-based filings requires a local installation of Tesseract OCR.
Tesseract Installation
| Operating System | Installation Method |
|---|---|
| macOS | brew install tesseract |
| Ubuntu / Debian | sudo apt install tesseract-ocr |
| Windows | UB Mannheim Tesseract installer |
| Conda environments | Included automatically |
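Before parsing image-based filings, it can help to check that the `tesseract` executable is actually reachable. The sketch below assumes Tesseract is looked up on `PATH`, which may not be how the package locates it; `find_tesseract` and `tesseract_version` are hypothetical helpers.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the path to the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable: str):
    """Return the first line of `<executable> --version`, or None if the call fails."""
    try:
        result = subprocess.run(
            [executable, "--version"],
            capture_output=True,
            text=True,
            check=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # Some Tesseract builds print version information to stderr instead of stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output.strip() else None


path = find_tesseract()
if path is None:
    print("tesseract not found on PATH; install it before parsing image-based filings")
else:
    print(f"found {path}: {tesseract_version(path)}")
```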
Example Usage
from ipo_mine.download.company import Company
from ipo_mine.download import S1Downloader
from ipo_mine.parse.s1_parser import S1Parser
from ipo_mine.resources import GLOBAL_SECTIONS_JSON
from ipo_mine.utils.config import PARSED_DIR

# Create a downloader with your own contact details.
downloader = S1Downloader(
    email="your_email@domain.com",
    company="Your Institution"
)

# Download the S-1 filing for a company identified by ticker.
ticker = "SNOW"
filing = downloader.download_s1(Company.from_ticker(ticker))

# Set up a parser with the global section mappings and an output directory.
parser = S1Parser(
    filing=filing,
    mappings_path=GLOBAL_SECTIONS_JSON,
    output_base_path=PARSED_DIR
)

# Extract a single section from the filing.
risk_factors = parser.parse_section("Risk Factors", ticker)
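The `email` and `company` arguments match the contact details that SEC EDGAR asks automated clients to declare in a `User-Agent` header. How `S1Downloader` uses them internally is not documented here, so the following is only a sketch of the general convention. The `edgar_request` helper is hypothetical, and the request is built but never sent.

```python
from urllib.request import Request


def edgar_request(url: str, email: str, company: str) -> Request:
    """Build (but do not send) a request carrying a descriptive User-Agent."""
    if "@" not in email:
        raise ValueError("email should be a reachable contact address")
    # Assumed format: organization name followed by a contact email.
    headers = {"User-Agent": f"{company} {email}"}
    return Request(url, headers=headers)


req = edgar_request(
    "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&type=S-1",
    email="your_email@domain.com",
    company="Your Institution",
)
print(req.get_header("User-agent"))  # Your Institution your_email@domain.com
```

Supplying real contact details matters because EDGAR may throttle or block traffic it cannot attribute to an identifiable client.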
Research-Oriented Design
This library is designed primarily for dataset construction and reproducible empirical research rather than ad-hoc scraping.
Typical use cases include:
- Building section-aligned IPO corpora
- Comparing disclosure language across firms and time
- Training and evaluation of long-document language models
- Large-scale studies of regulatory disclosures
Citation
If you use this package or the IPO-Mine dataset in your research, please cite:
@inproceedings{ipomine2025,
  title     = {IPO-Mine: A Section-Structured Dataset for Analyzing Long and Complex IPO Filings},
  author    = {Author names},
  booktitle = {Proceedings of the ACM SIGKDD Conference},
  year      = {2025}
}
License
This project is released under the MIT License.