Mining and parsing S-1 IPO filings

IPO-Mine: S-1 (IPO) Filings Toolkit

GitHub Repository: https://github.com/gtfintechlab/S1-Filings
Project Website: https://ipo-mine.web.app/

Overview

IPO-Mine is a Python package for downloading, parsing, and structuring S-1 IPO filings from the U.S. Securities and Exchange Commission (SEC) EDGAR system.

This repository implements the data processing pipeline used to construct the IPO-Mine dataset, a section-structured corpus introduced in the research paper:

IPO-Mine: A Section-Structured Dataset for Analyzing Long and Complex IPO Filings

The objective of this project is to transform raw SEC filings into clean, standardized, and section-aligned textual representations suitable for large-scale analysis in natural language processing, information retrieval, and long-document modeling.

Motivation

S-1 filings are among the most complex regulatory documents used in empirical research. They pose several challenges:

- Extreme document length, often 100–300 pages or more
- Substantial variation in section headers across firms and over time
- Heterogeneous formats, including HTML, plain text, and scanned images
- Limited structural consistency despite regulatory guidance

These characteristics complicate tasks such as section segmentation, cross-firm comparison, longitudinal analysis, and long-context modeling.

IPO-Mine addresses these challenges by providing a unified and reproducible pipeline that converts raw EDGAR filings into structured, research-ready data.

Features

- Automated downloading of S-1 and S-1/A filings from SEC EDGAR
- Parsing of Tables of Contents (TOCs) for filings dating back to 1997
- Extraction and normalization of key IPO sections, including:
  - Risk Factors
  - Business
  - Use of Proceeds
  - Management’s Discussion and Analysis (MD&A)
  - Financial Statements
- Support for multiple filing formats:
  - HTML
  - plain text
  - image-based filings via OCR
- Fuzzy matching of section headers using global section mappings
- Deterministic outputs suitable for reproducible dataset construction
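
The fuzzy header matching listed above can be illustrated with the Python standard library. This is a minimal sketch, not IPO-Mine's actual implementation: the `SECTION_VARIANTS` table, the `normalize` and `match_section` helpers, and the 0.8 cutoff are all hypothetical, and the package's global section mappings may be structured differently.

```python
from difflib import get_close_matches
from typing import Dict, List, Optional

# Hypothetical mapping from canonical section names to known header variants.
SECTION_VARIANTS: Dict[str, List[str]] = {
    "Risk Factors": ["risk factors", "certain risk factors"],
    "Use of Proceeds": ["use of proceeds", "application of proceeds"],
    "Management's Discussion and Analysis (MD&A)": [
        "management's discussion and analysis",
        "management's discussion and analysis of financial condition "
        "and results of operations",
    ],
}


def normalize(header: str) -> str:
    """Lowercase, unify apostrophes, and collapse runs of whitespace."""
    return " ".join(header.replace("’", "'").lower().split())


def match_section(
    header: str,
    variants: Dict[str, List[str]] = SECTION_VARIANTS,
    cutoff: float = 0.8,
) -> Optional[str]:
    """Return the canonical section name closest to `header`, or None."""
    # Reverse index: each known variant points back to its canonical name.
    lookup = {v: canon for canon, vs in variants.items() for v in vs}
    hits = get_close_matches(normalize(header), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None
```

Calling `match_section("RISK  FACTORS:")` resolves to the canonical name `"Risk Factors"`, while an unrelated header such as `"Underwriting"` falls below the cutoff and returns `None`. A fixed cutoff and a sorted, static variant table keep the matching deterministic across runs.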

IPO-Mine Dataset

The IPO-Mine dataset, built with this toolkit, is a large-scale corpus of IPO filings with:

- Section-aligned text across firms
- Standardized section nomenclature
- Clean document boundaries
- Compatibility with long-document modeling and retrieval frameworks

Additional details and examples are available at:

https://ipo-mine.web.app/

Installation

The package is available on PyPI under the name ipo-mine.

```shell
pip install ipo-mine
```

OCR Dependency

Parsing image-based filings requires a local installation of Tesseract OCR.

Tesseract Installation

| Operating System | Installation Method |
| --- | --- |
| macOS | `brew install tesseract` |
| Ubuntu / Debian | `sudo apt install tesseract-ocr` |
| Windows | UB Mannheim Tesseract installer |
| Conda environments | Included automatically |
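
Before parsing image-based filings, it can help to confirm that the Tesseract executable is actually reachable. The check below is a small sketch using only the standard library; `tesseract_available` is a hypothetical helper, not part of ipo-mine, and it assumes the binary is named `tesseract` and is expected on `PATH`.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("Tesseract not found on PATH; image-based filings cannot be OCR-processed.")
```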

Example Usage

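```python
from ipo_mine.download.company import Company
from ipo_mine.download import S1Downloader
from ipo_mine.parse.s1_parser import S1Parser
from ipo_mine.resources import GLOBAL_SECTIONS_JSON
from ipo_mine.utils.config import PARSED_DIR

# Contact details for the downloader (SEC EDGAR asks automated
# clients to identify themselves).
downloader = S1Downloader(
    email="your_email@domain.com",
    company="Your Institution"
)

# Look up the company by ticker and download its S-1 filing.
ticker = "SNOW"
filing = downloader.download_s1(Company.from_ticker(ticker))

# Configure the parser with the package's global section mappings
# and the directory where parsed output is written.
parser = S1Parser(
    filing=filing,
    mappings_path=GLOBAL_SECTIONS_JSON,
    output_base_path=PARSED_DIR
)

# Extract the Risk Factors section from the filing.
risk_factors = parser.parse_section("Risk Factors", ticker)
```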
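
To extract several sections from the same filing, the `parse_section` call shown above can be wrapped in a loop. The helper below is a sketch under assumptions: `parse_sections` is hypothetical and not part of ipo-mine, and this README does not state what `parse_section` returns or whether it raises when a section is missing, so failures are caught broadly and recorded as `None`. Whether section names other than "Risk Factors" are accepted in exactly this spelling is also assumed.

```python
from typing import Any, Dict, Iterable


def parse_sections(
    parser: Any, ticker: str, section_names: Iterable[str]
) -> Dict[str, Any]:
    """Call parser.parse_section for each name; record failures as None."""
    results: Dict[str, Any] = {}
    for name in section_names:
        try:
            results[name] = parser.parse_section(name, ticker)
        except Exception as exc:  # exception types are not documented here
            print(f"{ticker}: could not parse {name!r}: {exc}")
            results[name] = None
    return results
```

With the `parser` and `ticker` from the example, `parse_sections(parser, ticker, ["Risk Factors", "Business", "Use of Proceeds"])` would return one entry per requested section.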

Research-Oriented Design

This library is designed primarily for dataset construction and reproducible empirical research rather than ad-hoc scraping.

Typical use cases include:

- Building section-aligned IPO corpora
- Comparing disclosure language across firms and over time
- Training and evaluating long-document language models
- Conducting large-scale studies of regulatory disclosures
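
As one illustration of comparing disclosure language across firms, the vocabulary overlap between two extracted sections can be measured with Jaccard similarity. This is a deliberately simple sketch: `tokenize` and `jaccard` are hypothetical helpers, not part of ipo-mine, and the input strings are made up rather than taken from real filings.

```python
import re
from typing import Set


def tokenize(text: str) -> Set[str]:
    """Return the set of lowercase alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' vocabularies (0.0 if both are empty)."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Made-up sentences standing in for two firms' Risk Factors text.
firm_a = "We may incur losses"
firm_b = "We may incur significant losses"
print(round(jaccard(firm_a, firm_b), 2))  # 4 shared tokens out of 5 distinct -> 0.8
```

Because the sections are aligned under standardized names, the same comparison can be applied to the same section across many filings.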

Citation

If you use this package or the IPO-Mine dataset in your research, please cite:

@inproceedings{ipomine2025,
  title     = {IPO-Mine: A Section-Structured Dataset for Analyzing Long and Complex IPO Filings},
  author    = {Author names},
  booktitle = {Proceedings of the ACM SIGKDD Conference},
  year      = {2025}
}

License

This project is released under the MIT License.

Release history

- 0.1.4: Apr 9, 2026
- 0.1.3: Feb 24, 2026
- 0.1.2: Feb 24, 2026
- 0.1.1: Feb 17, 2026
- 0.1.0: Feb 11, 2026
- 0.0.0: Jan 20, 2026 (this version)
