A package to parse SEC filings
SEC Parsers
Parses non-standardized SEC 10-K filings into well-structured, detailed XML. This is a work in progress (WIP); not every file will parse correctly.
Installation
pip install sec-parsers
Quickstart
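from sec_parsers import *
# Download the filing's HTML from EDGAR
html = download_sec_filing('https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm')
# Parse the 10-K HTML
parsed_html = parse_10k(html)
# Convert the parsed filing into an XML tree
xml = construct_xml_tree(parsed_html)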
For more information, see the quickstart or view a parsed Tesla 10-K here.
Links:
- GitHub
- Archive of Parsed XMLs - Note: This is often out of date, as the package is updated frequently.
Problem:
When you look at an SEC 10-K, you can easily see the structure of the file and which headers follow each other. Under the hood, however, these filings are non-standardized, which makes them hard to convert into a well-structured format suitable for NLP/RAG.
How SEC Parsers works:
- Detects headers in filings using:
  - element tags, e.g. <b>Item 1</b>
  - element css, e.g. <p style="font-weight: bold;">Item 1.</p>
  - text style, e.g. emphasis style "Purchase of Significant Equipment"
  - relative location of above elements to each other
- Calculates hierarchy of headers, and converts to a tree structure
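The two steps above can be illustrated with a minimal sketch. This is not the package's actual implementation: it uses only the Python standard library, the names HeaderDetector and build_tree are hypothetical, and the rule "the first header style seen is the top level" is an assumption made purely for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BOLD_TAGS = {"b", "strong"}


class HeaderDetector(HTMLParser):
    """Hypothetical helper: collect text made bold by a tag or inline CSS."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one (tag, signal) entry per open element
        self.headers = []  # (signal, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if tag in BOLD_TAGS:
            signal = "tag"   # e.g. <b>Item 1</b>
        elif "font-weight:bold" in style:
            signal = "css"   # e.g. <p style="font-weight: bold;">
        else:
            signal = None
        self.stack.append((tag, signal))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if any(open_tag == tag for open_tag, _ in self.stack):
            while self.stack.pop()[0] != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Use the innermost enclosing element that carries a header signal.
        signal = next((s for _, s in reversed(self.stack) if s), None)
        if signal:
            self.headers.append((signal, text))


def build_tree(headers):
    """Nest headers by level; assumed rule: first style seen = top level."""
    levels = {}
    root = ET.Element("document")
    open_nodes = [(0, root)]  # stack of (level, element)
    for signal, text in headers:
        level = levels.setdefault(signal, len(levels) + 1)
        # Close sections at the same or a deeper level before nesting.
        while open_nodes[-1][0] >= level:
            open_nodes.pop()
        node = ET.SubElement(open_nodes[-1][1], "section", title=text)
        open_nodes.append((level, node))
    return root


sample = (
    "<b>Item 1</b><p>Plain text.</p>"
    '<p style="font-weight: bold;">Risk Factors</p>'
    "<p>More text.</p><b>Item 2</b>"
)
detector = HeaderDetector()
detector.feed(sample)
tree = build_tree(detector.headers)
print(ET.tostring(tree, encoding="unicode"))
```

In this sample, "Risk Factors" ends up nested under "Item 1", and "Item 2" becomes a sibling of "Item 1". Real filings need more signals than this (text style and relative position, as listed above), so treat the sketch as a starting point only.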
Future
- fix titles for xml (e.g. item 1 instead of item 1. business)
- better hierarchy calculation
- more supported filings: 10-Q, 8-K, etc.
- better RAG integration
- converting HTML tables to nice XML tables
- metadata, e.g. CIK / data from XBRL in HTML
- hosting cleaned xml files online
- better attributes (names / format)
- better color scheme (color scheme for headers, ignored_elements - e.g. page numbers, text)
- better function naming
- better module naming
- better parent handling
- better descriptions of functions
- better GitHub and PyPI pages
Statistics
Not implemented yet.
Some Other Packages that might be useful:
- edgartools - good interface for interacting with SEC's EDGAR system
Alternative Approaches I've seen to parse SEC Filings
- sec-parser - oops, we have similar names. They were first, my bad. They parse 10-Qs well.
- sec-api - paid API to search / download SEC filings. Basically SEC's EDGAR, but set up in a much nicer format. I haven't used it since it costs money.
- Bill McDonald's 10-X Archive
- Computer Vision using OpenCV
- LLMs (I believe unstructured.io does something like this)
- Transformers