A package to parse SEC filings

SEC Parsers

Parses non-standardized SEC filings into structured XML. Use cases include LLMs, NLP, and textual analysis. The package is a work in progress.

Supported filing types are 10-K, 10-Q, 8-K, and S-1. More will be added soon.

Installation

pip install sec-parsers

Quickstart

from sec_parsers import Filing, download_sec_filing

html = download_sec_filing('https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm')
filing = Filing(html)
filing.parse()  # parse the filing
filing.visualize()  # open the filing in a web browser with section headers highlighted
filing.find_nodes_by_title(title)  # find nodes by title, e.g. 'item 1a'
filing.find_nodes_by_text(text)  # find nodes that contain the given text
filing.get_tree(node)  # with no argument, return the full XML tree; with a node, return that node's tree
filing.get_title_tree()  # return the XML tree using titles instead of tags; more descriptive than get_tree
filing.set_filing_type(type)  # e.g. 'S-1'; use when automatic detection fails
filing.save_xml(file_name)  # save the parsed filing as XML
filing.save_csv(file_name)  # save the parsed filing as CSV
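
The quickstart handles one filing at a time. A minimal sketch of looping over several URLs is shown below. It relies only on the calls listed above (`download_sec_filing`, `Filing`, `parse`, `save_xml`); the helper names `xml_name_for` and `parse_many` are hypothetical, and it assumes `save_xml` accepts a file path inside an existing directory. The downloader and the filing class are passed in as arguments so the loop can be tried without network access.

```python
import os
from pathlib import PurePosixPath
from urllib.parse import urlparse


def xml_name_for(url):
    """Derive an output file name from the last path segment of a filing URL."""
    return PurePosixPath(urlparse(url).path).stem + ".xml"


def parse_many(urls, out_dir, download, filing_cls):
    """Parse each URL and save its XML; collect failures instead of stopping.

    `download` and `filing_cls` are expected to behave like
    `download_sec_filing` and `Filing` from the quickstart.
    `out_dir` is assumed to exist already.
    """
    saved, failed = [], []
    for url in urls:
        try:
            filing = filing_cls(download(url))
            filing.parse()
            out_path = os.path.join(out_dir, xml_name_for(url))
            filing.save_xml(out_path)  # assumption: file_name may be a path
            saved.append(out_path)
        except Exception as exc:  # the package is a WIP, so keep going on errors
            failed.append((url, exc))
    return saved, failed


# Intended use (requires sec-parsers and network access):
#   from sec_parsers import Filing, download_sec_filing
#   saved, failed = parse_many(urls, "out", download_sec_filing, Filing)
```

Collecting failures rather than raising seems a reasonable choice here, since a single filing that does not parse should not abort a long batch.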

Additional Resources:

Feature Requests:

To request a feature or suggest an improvement to the package, please use the project's Google Form. Requests so far:

  • Extract title of section along with its text (sharif)
  • Extract subsections from section (sharif)
  • Export to dta (Denis)
  • Option to remove special characters from the document on export (bill)

Statistics

  • Speed: On average, 10-K filings parse in 0.25 seconds. There were 7,118 10-K annual reports filed in 2023, so to parse all 10-Ks from 2023 should take about half an hour.
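
The half-hour estimate follows directly from the two figures quoted above. A quick check, assuming filings are parsed one after another with no parallelism and that download time is not counted:

```python
SECONDS_PER_FILING = 0.25  # average 10-K parse time quoted above
FILINGS_2023 = 7_118       # 10-K annual reports filed in 2023, quoted above

total_seconds = SECONDS_PER_FILING * FILINGS_2023
total_minutes = total_seconds / 60

print(f"{total_seconds:.1f} s = {total_minutes:.1f} min")
# prints "1779.5 s = 29.7 min"
```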

Updates

Towards Version 1:

  • Most/All SEC text filings supported
  • Few errors
  • XML

Might be done along the way:

  • Faster parsing, probably by using a streaming approach and combining modules
  • Introduction section parsing
  • Signatures section parsing

Beyond Version 1:

To improve the package beyond V1 it looks like I need compute and storage. Not sure how to get that. Working on it.

Metadata

  • Clustering similar section titles using ML (e.g. seasonality headers)
  • Adding tags to individual sections using small LLMs (e.g. tags for mentions of supply chains, energy, etc.)

Other

  • Table parsing
  • Image OCR
  • Parsing non-html filings

Current Priority list:

  • Fix layering issue
  • Fix all caps and emphasis issue
  • Clean text
  • Better historical conversion: handle PART I appearing multiple times as a header, e.g. add logic for "item 1 continued"


Download files

Source Distribution

sec_parsers-0.531.tar.gz (16.4 kB)

Uploaded Source

Built Distribution

sec_parsers-0.531-py3-none-any.whl (18.8 kB)

Uploaded Python 3
