A package to parse SEC filings

SEC Parsers

Parses non-standardized SEC 10-K filings into well-structured, detailed XML. This is a work in progress; not every filing will parse correctly.

Installation

pip install sec-parsers

Quickstart

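from sec_parsers import download_sec_filing, parse_10k, construct_xml_tree

# Download the raw HTML of a 10-K from EDGAR (here, Tesla's filing for fiscal year 2023)
html = download_sec_filing('https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm')
# Parse the filing HTML
parsed_html = parse_10k(html)
# Build the XML tree from the parsed filing
xml = construct_xml_tree(parsed_html)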

For more information look at the quickstart, or view a parsed Tesla 10-K here.

Problem:

When you look at an SEC 10-K, you can easily see the structure of the file and which headers follow each other. Under the hood, these filings are non-standardized, which makes them hard to convert into a well-structured format suitable for NLP/RAG.

How SEC Parsers works:

  1. Detects headers in filings using:
     • element tags, e.g. <b>Item 1</b>
     • element css, e.g. <p style="font-weight: bold;">Item 1.</p>
     • text style, e.g. emphasis style "Purchase of Significant Equipment"
     • relative location of above elements to each other
  2. Calculates hierarchy of headers, and converts to a tree structure
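
The two steps above can be sketched with the standard library alone. This is only an illustration of the idea, not the code SEC Parsers actually runs: the names HeaderDetector, guess_level and build_tree are hypothetical, only the tag and css signals are covered, and the "Part" / "Item" / other level rule is an assumption made for the example.

```python
import re
from html.parser import HTMLParser

BOLD_TAGS = {"b", "strong"}
# Matches inline styles such as "font-weight: bold" or "font-weight:700".
BOLD_CSS = re.compile(r"font-weight\s*:\s*(bold|[6-9]00)", re.IGNORECASE)


class HeaderDetector(HTMLParser):
    """Collect text found inside bold tags or bold-styled elements."""

    def __init__(self):
        super().__init__()
        self._stack = []   # (tag, signal) per open element; signal is "tag", "css" or None
        self.headers = []  # (signal, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        if tag in BOLD_TAGS:
            signal = "tag"
        elif BOLD_CSS.search(style):
            signal = "css"
        else:
            signal = None
        self._stack.append((tag, signal))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed children.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        # Use the innermost enclosing element that marks text as bold.
        signal = next((s for _, s in reversed(self._stack) if s), None)
        if text and signal:
            self.headers.append((signal, text))


def guess_level(text):
    """Assumed rule for this example: Part = 0, Item = 1, anything else = 2."""
    if re.match(r"part\s", text, re.IGNORECASE):
        return 0
    if re.match(r"item\s", text, re.IGNORECASE):
        return 1
    return 2


def build_tree(leveled_headers):
    """Turn (level, title) pairs into nested dicts using a stack of open ancestors."""
    root = {"title": "root", "level": -1, "children": []}
    stack = [root]
    for level, title in leveled_headers:
        node = {"title": title, "level": level, "children": []}
        # Close every header that is at the same depth or deeper.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


sample = """
<p style="font-weight: bold;">Part I</p>
<b>Item 1. Business</b>
<p>We design and sell things.</p>
<p style="font-weight:700">Overview</p>
<b>Item 1A. Risk Factors</b>
"""

detector = HeaderDetector()
detector.feed(sample)
detector.close()
tree = build_tree([(guess_level(text), text) for _, text in detector.headers])
```

With this sample, "Part I" becomes the single top-level node, both "Item" headers become its children, and "Overview" is nested under "Item 1. Business". Real filings need the other signals listed above (text style and relative position) because bold markup alone does not separate headers from emphasized body text.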

Future

  • fix titles for XML (e.g. item 1 instead of item 1. business)
  • better hierarchy calculation
  • more supported filings: 10-Q, 8-K, etc.
  • better RAG integration
  • converting HTML tables to nice XML tables
  • metadata, e.g. CIK / data from XBRL in HTML
  • hosting cleaned XML files online
  • better attributes (names / format)
  • better color scheme (color scheme for headers, ignored_elements - e.g. page numbers, text)
  • better function naming
  • better modules naming
  • better parent handling
  • better descriptions of functions
  • better GitHub and PyPI pages

Statistics

Not implemented yet.

Some Other Packages that might be useful:

  • edgartools - good interface for interacting with SEC's EDGAR system

Alternative Approaches I've seen to parse SEC Filings

  • sec-parser - oops, we have similar names. They were first, my bad. They parse 10-Qs well.
  • sec-api - paid API to search / download SEC filings. Basically SEC's EDGAR, but set up in a much nicer format. I haven't used it since it costs money.
  • Bill McDonald's 10-X Archive
  • Computer Vision using OpenCV
  • LLMs (I believe unstructured.io does something like this)
  • Transformers

Download files

Source Distribution

sec_parsers-0.504.tar.gz (11.3 kB)

Built Distribution

sec_parsers-0.504-py3-none-any.whl (12.1 kB)
