A package to parse SEC filings
SEC Parsers
Parses non-standardized SEC filings into structured XML. Use cases include LLMs, NLP, and textual analysis. The package is a work in progress.
Supported filing types are 10-K, 10-Q, 8-K, S-1, and 20-F. More will be added soon.
Installation
pip install sec-parsers
Quickstart
from sec_parsers import Filing, download_sec_filing
html = download_sec_filing('https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm')
filing = Filing(html)
filing.parse() # parses filing
filing.visualize() # opens the filing in a web browser with section headers highlighted
filing.find_nodes_by_title(title) # finds nodes by title, e.g. 'item 1a'
filing.find_nodes_by_text(text) # finds nodes that contain your text
filing.get_tree(node) # with no argument, returns the full xml tree; with a node, returns that node's tree
filing.get_title_tree() # returns xml tree using titles instead of tags. More descriptive than get_tree.
filing.set_filing_type(type) # e.g. 'S-1'. Use when automatic detection fails
filing.save_xml(file_name)
filing.save_csv(file_name)
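Once a filing has been exported with save_xml, the file can be inspected with any XML tooling. The sketch below uses Python's standard xml.etree.ElementTree on a small hand-written document. The element names (filing, part, item), the title attribute, and the find_by_title helper are assumptions made for illustration only; the structure that sec-parsers actually writes may differ, so check an exported file before relying on any of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an exported file.
# Real sec-parsers output may use different tags and attributes.
SAMPLE_XML = """
<filing>
  <part title="PART I">
    <item title="Item 1. Business">We design and sell widgets.</item>
    <item title="Item 1A. Risk Factors">Demand for widgets may fall.</item>
  </part>
</filing>
"""


def find_by_title(root, needle):
    """Return elements whose (assumed) 'title' attribute contains needle.

    The comparison is case-insensitive.
    """
    needle = needle.lower()
    return [el for el in root.iter() if needle in el.get("title", "").lower()]


root = ET.fromstring(SAMPLE_XML.strip())
matches = find_by_title(root, "item 1a")
for el in matches:
    print(el.get("title"), "->", (el.text or "").strip())
```

To run the same search against a real export, replace the ET.fromstring call with ET.parse(file_name).getroot() and adjust the attribute name to whatever the exported file uses.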
Additional Resources:
- quickstart
- [In Progress] Article explaining how to write custom filing parsers.
- Archive of Parsed XMLs / CSVs - Last updated 7/24/24.
- example parsed filing
- example parsed filing exported to csv.
Feature Requests:
To request a feature or suggest an improvement to the package, please use the Google Form.
- Extract title of section along with its text (sharif)
- Extract subsections from section (sharif)
- Export to dta (Denis)
- Option to remove special characters from the document on export (bill)
- DEF 14A, DEFM14A
Statistics
- Speed: On average, a 10-K filing parses in 0.25 seconds. There were 7,118 10-K annual reports filed in 2023, so parsing every 10-K from 2023 should take about half an hour.
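The half-hour figure follows directly from the two numbers quoted above. A minimal sketch of the arithmetic, assuming filings are parsed one after another and ignoring download time (the function name is illustrative, not part of the package):

```python
# Back-of-the-envelope estimate using the figures quoted in the text.
AVG_SECONDS_PER_10K = 0.25    # average parse time per 10-K
NUM_10K_FILINGS_2023 = 7_118  # 10-K annual reports filed in 2023


def estimate_total_minutes(num_filings: int, seconds_per_filing: float) -> float:
    """Return the estimated sequential parse time in minutes."""
    return num_filings * seconds_per_filing / 60


total_minutes = estimate_total_minutes(NUM_10K_FILINGS_2023, AVG_SECONDS_PER_10K)
print(f"Estimated time: {total_minutes:.1f} minutes")
```

This works out to 1,779.5 seconds, or just under 30 minutes.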
Updates
Towards Version 1:
- Most/All SEC text filings supported
- Few errors
- xml
Might be done along the way:
- Faster parsing, probably by using a streaming approach and combining modules.
- Introduction section parsing
- Signatures section parsing
Beyond Version 1:
To improve the package beyond V1 it looks like I need compute and storage. Not sure how to get that. Working on it.
Metadata
- Clustering similar section titles using ML (e.g. seasonality headers)
- Adding tags to individual sections using small LLMs (e.g. tags for mentions of supply chains, energy, etc.)
Other
- Table parsing
- Image OCR
- Parsing non-html filings
Current Priority list:
- consider adding table of contents, forward-looking information, etc.
- fix layering issue
- make trees nicer
- add more filing types
- fix all caps and emphasis issue
- clean text
- Better historical conversion: handle cases where PART I appears multiple times as a header, e.g. add logic for "item 1 continued".