A package to parse SEC filings
SEC Parsers
Parses non-standardized SEC 10-K filings into well-structured, detailed XML. This is a work in progress (WIP); not every filing will parse correctly.
Installation
pip install sec-parsers
Quickstart
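from sec_parsers import download_sec_filing, parse_10k, construct_xml_tree

# Download the filing's HTML from EDGAR
html = download_sec_filing('https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm')
# Parse the 10-K, then build the XML tree from the parsed HTML
parsed_html = parse_10k(html)
xml = construct_xml_tree(parsed_html)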
For more information look at the quickstart, or view a parsed Tesla 10-K here.
Links:
- GitHub
- Archive of Parsed XMLs - Note: This is often out of date, as the package is updated frequently.
Problem:
When you look at an SEC 10-K, you can easily see the structure of the file and which headers follow each other. Under the hood, these filings are non-standardized, which makes them hard to convert into a well-structured format suitable for NLP/RAG.
How SEC Parsers works:
- Detects headers in filings using:
  - element tags, e.g. <b>Item 1</b>
  - element CSS, e.g. <p style="font-weight: bold;">Item 1.</p>
  - text style, e.g. emphasis style "Purchase of Significant Equipment"
  - relative location of the above elements to each other
- Calculates the hierarchy of headers and converts it to a tree structure
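To make the two steps above concrete, here is a minimal sketch using only the Python standard library. It is not the package's actual code: `HeaderDetector`, `header_level`, `build_tree`, and `Node` are hypothetical names, only the tag and inline-CSS signals are covered, and the level rule (Part above Item above everything else) is a stand-in assumption for the real hierarchy calculation.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser

BOLD_TAGS = {"b", "strong"}
BOLD_CSS = re.compile(r"font-weight\s*:\s*(bold|[6-9]00)", re.IGNORECASE)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HeaderDetector(HTMLParser):
    """Collects text that is bold via an element tag or inline CSS."""

    def __init__(self):
        super().__init__()
        self._stack = []   # (tag, signal) for every currently open element
        self.headers = []  # (text, signal) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get a closing tag
        style = dict(attrs).get("style") or ""
        if tag in BOLD_TAGS:
            signal = "tag"
        elif BOLD_CSS.search(style):
            signal = "css"
        else:
            signal = None
        self._stack.append((tag, signal))

    def handle_endtag(self, tag):
        names = [name for name, _ in self._stack]
        if tag in names:
            # Close the innermost matching element and anything left open inside it.
            index = len(names) - 1 - names[::-1].index(tag)
            del self._stack[index:]

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # The nearest enclosing bold element decides which signal is recorded.
        for _, signal in reversed(self._stack):
            if signal:
                self.headers.append((text, signal))
                break


@dataclass
class Node:
    title: str
    level: int
    children: list = field(default_factory=list)


def header_level(text):
    """Assumed rule: 'Part' headers outrank 'Item' headers, which outrank the rest."""
    if re.match(r"part\s+[ivx]+", text, re.IGNORECASE):
        return 1
    if re.match(r"item\s+\d+[a-z]?", text, re.IGNORECASE):
        return 2
    return 3


def build_tree(headers):
    """Nest each header under the closest preceding header with a smaller level."""
    root = Node("document", 0)
    stack = [root]
    for text, _ in headers:
        node = Node(text, header_level(text))
        while stack[-1].level >= node.level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node, depth=0):
    """Print the tree with two spaces of indentation per level."""
    print("  " * depth + node.title)
    for child in node.children:
        render(child, depth + 1)


sample = (
    '<p><b>Part I</b></p>'
    '<p style="font-weight: bold;">Item 1. Business</p>'
    '<p>We design and sell products.</p>'
    '<p><i><b>Purchase of Significant Equipment</b></i></p>'
    '<p style="font-weight:700">Item 1A. Risk Factors</p>'
)
detector = HeaderDetector()
detector.feed(sample)
tree = build_tree(detector.headers)
render(tree)
```

The stack in `build_tree` keeps only the current chain of ancestors, so each header is attached in a single pass. A real filing would also need the text-style and relative-location signals listed above, which this sketch leaves out.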
Priority TODO
- Test the parser and fix low-hanging fruit
- Get input on design, etc.
- Organize and clean up code
Future
- fix titles for xml (e.g. item 1 instead of item 1. business)
- better hierarchy calculation
- more supported filings: 10-Q, 8-K, etc
- better rag integration
- converting html tables to nice xml tables
- metadata, e.g. cik / data from xbrl in html
- hosting cleaned xml files online
- better attributes (names / format)
- better color scheme (color scheme for headers, ignored_elements - e.g. page numbers, text)
- better function naming
- better module naming
- better parent handling
- better descriptions of functions
- better github and pypi pages
Statistics
Not implemented yet.
Some Other Packages that might be useful:
- edgartools - good interface for interacting with SEC's EDGAR system
Alternative Approaches I've seen to parse SEC Filings
- sec-parser - oops, we have similar names. They were first, my bad. They parse 10-Qs well.
- sec-api - Paid API to search / download SEC filings. Basically SEC's EDGAR, but set up in a much nicer format. I haven't used it since it costs money.
- Bill McDonald's 10-X Archive
- Computer Vision using OpenCV
- LLMs (I believe unstructured.io does something like this)
- Transformers
Download files
Source Distribution
sec_parsers-0.510.tar.gz (11.6 kB)
Built Distribution
Hashes for sec_parsers-0.510-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0f6d6c9729b3a6b36a2987c519aa5b377b35eb18adcb00a2382bdc21938c5cf1
MD5 | a76eaea68d70dcd4fdc758995d20b814
BLAKE2b-256 | caad9cd1c669fa38d55d9e8479dabe545835c479f1535e1ed490afa3638b4bb6