Mining and parsing S-1 IPO filings

IPO-Mine: S-1 (IPO) Filings Toolkit

GitHub Repository: https://github.com/gtfintechlab/S1-Filings
Project Website: https://ipo-mine.web.app/

Overview

IPO-Mine is a Python package for downloading, parsing, and structuring S-1 IPO filings from the U.S. Securities and Exchange Commission (SEC) EDGAR system.

This repository implements the data processing pipeline used to construct the IPO-Mine dataset, a section-structured corpus introduced in the research paper:

IPO-Mine: A Section-Structured Dataset for Analyzing Long and Complex IPO Filings

The objective of this project is to transform raw SEC filings into clean, standardized, and section-aligned textual representations suitable for large-scale analysis in natural language processing, information retrieval, and long-document modeling.

Motivation

S-1 filings are among the most complex regulatory documents used in empirical research. They pose several challenges:

  • Extreme document length, often running 100–300 pages or more
  • Substantial variation in section headers across firms and time
  • Heterogeneous formats, including HTML, plain text, and scanned images
  • Limited structural consistency despite regulatory guidance

These characteristics complicate tasks such as section segmentation, cross-firm comparison, longitudinal analysis, and long-context modeling.

IPO-Mine addresses these challenges by providing a unified and reproducible pipeline that converts raw EDGAR filings into structured, research-ready data.

Features

  • Automated downloading of S-1 and S-1/A filings from SEC EDGAR
  • Parsing of Tables of Contents (TOCs) for filings dating back to 1997
  • Extraction and normalization of key IPO sections, including:
    • Risk Factors
    • Business
    • Use of Proceeds
    • Management’s Discussion and Analysis (MD&A)
    • Financial Statements
  • Support for multiple filing formats:
    • HTML
    • plain text
    • image-based filings via OCR
  • Fuzzy matching of section headers using global section mappings
  • Deterministic outputs suitable for reproducible dataset construction
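
The fuzzy header matching listed above can be illustrated with a small standard-library sketch. This is not the package's implementation: the `SECTION_VARIANTS` table, the `normalize` and `match_section` helpers, and the 0.8 cutoff are all hypothetical, and the actual schema of the global section mappings is not described in this README.

```python
import difflib
import re
from typing import Optional

# Hypothetical mapping from canonical section names to known header variants.
# The real global section mappings used by IPO-Mine may look different.
SECTION_VARIANTS = {
    "Risk Factors": ["risk factors", "certain risk factors"],
    "Use of Proceeds": ["use of proceeds", "application of proceeds"],
    "Business": ["business", "our business"],
}


def normalize(header: str) -> str:
    """Lowercase, unify apostrophes, and collapse punctuation and whitespace."""
    header = header.lower().replace("\u2019", "'")
    header = re.sub(r"[^a-z0-9'& ]+", " ", header)
    return re.sub(r"\s+", " ", header).strip()


def match_section(header: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the canonical section whose variant best matches `header`, or None."""
    variant_to_canonical = {
        variant: canonical
        for canonical, variants in SECTION_VARIANTS.items()
        for variant in variants
    }
    best = difflib.get_close_matches(
        normalize(header), list(variant_to_canonical), n=1, cutoff=cutoff
    )
    return variant_to_canonical[best[0]] if best else None
```

Normalizing before matching keeps the similarity cutoff meaningful: differences in case and punctuation are removed first, so the cutoff only has to absorb genuine wording variation between filings.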

IPO-Mine Dataset

The IPO-Mine dataset is built with this toolkit as a large-scale corpus of IPO filings with:

  • Section-aligned text across firms
  • Standardized section nomenclature
  • Clean document boundaries
  • Compatibility with long-document modeling and retrieval frameworks

Additional details and examples are available at:

https://ipo-mine.web.app/

Installation

The package is available on PyPI under the name ipo-mine.

pip install ipo-mine

OCR Dependency

Parsing image-based filings requires a local installation of Tesseract OCR.

Tesseract Installation

Operating System      Installation Method
macOS                 brew install tesseract
Ubuntu / Debian       sudo apt install tesseract-ocr
Windows               UB Mannheim Tesseract installer
Conda environments    Included automatically
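
Before parsing image-based filings, it can help to confirm that the Tesseract binary is actually reachable. The following is a minimal sketch using only the standard library; `tesseract_version` is a hypothetical helper and not part of ipo-mine, and it assumes the executable is named `tesseract` on your PATH.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version() -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "tesseract (version unknown)"


if __name__ == "__main__":
    version = tesseract_version()
    print(version or "Tesseract not found; install it before parsing image-based filings")
```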

Example Usage

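# Import the downloader, parser, and bundled resources.
from ipo_mine.download.company import Company
from ipo_mine.download import S1Downloader
from ipo_mine.parse.s1_parser import S1Parser
from ipo_mine.resources import GLOBAL_SECTIONS_JSON
from ipo_mine.utils.config import PARSED_DIR

# Create a downloader with your own contact details.
downloader = S1Downloader(
    email="your_email@domain.com",
    company="Your Institution"
)

# Download the S-1 filing for a company identified by its ticker.
ticker = "SNOW"
filing = downloader.download_s1(Company.from_ticker(ticker))

# Configure the parser with the global section mappings and an output directory.
parser = S1Parser(
    filing=filing,
    mappings_path=GLOBAL_SECTIONS_JSON,
    output_base_path=PARSED_DIR
)

# Extract a single section by name.
risk_factors = parser.parse_section("Risk Factors", ticker)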
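
The `email` and `company` arguments resemble the contact details that SEC EDGAR asks automated clients to declare in the User-Agent request header. Whether `S1Downloader` uses them this way is an assumption, not something this README states. As a rough sketch of that convention, independent of ipo-mine, with `edgar_user_agent` as a hypothetical helper:

```python
def edgar_user_agent(company: str, email: str) -> str:
    """Build a User-Agent value in the 'Organisation contact@example.com' form."""
    company, email = company.strip(), email.strip()
    if not company or "@" not in email:
        raise ValueError("Provide a real organisation name and a contact email")
    return f"{company} {email}"


# Headers a hand-written EDGAR client might send (illustrative only).
headers = {"User-Agent": edgar_user_agent("Your Institution", "your_email@domain.com")}
```

Replacing the placeholder values with real contact details matters if the library does forward them to EDGAR, since the header exists to identify who is making the requests.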

Research-Oriented Design

This library is designed primarily for dataset construction and reproducible empirical research rather than ad-hoc scraping.

Typical use cases include:

  • Building section-aligned IPO corpora
  • Comparing disclosure language across firms and time
  • Training and evaluating long-document language models
  • Large-scale studies of regulatory disclosures

Citation

If you use this package or the IPO-Mine dataset in your research, please cite:

@inproceedings{ipomine2025,
  title     = {IPO-Mine: A Section-Structured Dataset for Analyzing Long and Complex IPO Filings},
  author    = {Author names},
  booktitle = {Proceedings of the ACM SIGKDD Conference},
  year      = {2025}
}

License

This project is released under the MIT License.
