A package to parse SEC filings

Project description

SEC Parsers

Parses non-standardized SEC 10-K filings into well-structured, detailed XML. Use cases include LLMs, NLP, and textual analysis. This is a work in progress (WIP); not every file will parse correctly.

Current supported file types: 10-K, 10-Q

Installation

pip install sec-parsers

Quickstart

from sec_parsers import Filing, download_sec_filing

html = download_sec_filing('https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm')
filing = Filing(html)
filing.parse() # parses the filing
filing.visualize() # opens the filing in a web browser with section headers highlighted
filing.find_nodes_by_title(title) # finds nodes by title, e.g. 'item 1a'
filing.find_nodes_by_text(text) # finds nodes that contain your text
filing.get_tree(node) # with no argument, returns the full xml tree; with a node, returns that node's tree
filing.get_title_tree() # returns an xml tree using titles instead of tags; more descriptive than get_tree
filing.save_xml(file_name) # exports the parsed filing to xml
filing.save_csv(file_name) # exports the parsed filing to csv

For more information look at the quickstart, or view a parsed Tesla 10-K here. SEC Parsers also supports exporting to csv, see here.

Problem:

SEC filings are human-readable, but their messy HTML makes it hard for machines to detect and read information by section. Section-level access is especially important for NLP / RAG using LLMs.

How SEC Parsers works:

Detects headers in filings using:

element tags, e.g. <b>Item 1</b>
element css, e.g. <p style="font-weight: bold;">Item 1.</p>
text style, e.g. emphasis style "Purchase of Significant Equipment"
relative location of above elements to each other
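
To make the first two signals concrete, here is a minimal sketch that collects text appearing inside bold tags or bold-styled elements. It is not the package's actual implementation: it uses only Python's standard html.parser, and the names HeaderCandidateFinder and find_header_candidates are hypothetical. It ignores text style and relative location, which the real parser also uses.

```python
import re
from html.parser import HTMLParser

BOLD_TAGS = {"b", "strong"}
# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Matches inline CSS such as "font-weight: bold" or "font-weight:700".
BOLD_CSS = re.compile(r"font-weight\s*:\s*(bold|[6-9]00)", re.IGNORECASE)


class HeaderCandidateFinder(HTMLParser):
    """Hypothetical helper: collects text rendered bold by a tag or inline CSS."""

    def __init__(self):
        super().__init__()
        self._stack = []      # (tag, is_bold) for every currently open element
        self.candidates = []  # text fragments that look like headers

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        is_bold = tag in BOLD_TAGS or bool(BOLD_CSS.search(style))
        self._stack.append((tag, is_bold))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating unclosed tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and any(is_bold for _, is_bold in self._stack):
            self.candidates.append(text)


def find_header_candidates(html):
    """Return bold text fragments in document order."""
    finder = HeaderCandidateFinder()
    finder.feed(html)
    return finder.candidates


sample_html = (
    '<p style="font-weight: bold;">Item 1.</p><p>Body text</p>'
    '<b>Item 1A</b><br><span>more body text</span>'
)
print(find_header_candidates(sample_html))  # ['Item 1.', 'Item 1A']
```

Bold text is only a candidate: a real filing also bolds plenty of non-header text, which is why further signals are needed.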

Calculates hierarchy of headers, and converts to a tree structure
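
As a rough illustration of the tree-building step, the sketch below turns a flat list of headers into a nested XML tree with a stack. It assumes each header has already been assigned a numeric level (1 for parts, 2 for items, and so on); how SEC Parsers actually calculates those levels is not shown here, and build_tree is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def build_tree(headers):
    """Build a nested XML tree from (level, title) pairs in document order.

    Assumption: level is an integer >= 1, and a smaller level means a
    header that sits higher in the hierarchy.
    """
    root = ET.Element("document")
    stack = [(0, root)]  # (level, element); the root sits above every header
    for level, title in headers:
        # Close any open sections at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        node = ET.SubElement(stack[-1][1], "section", title=title)
        stack.append((level, node))
    return root


headers = [
    (1, "PART I"),
    (2, "Item 1"),
    (2, "Item 1A"),
    (1, "PART II"),
    (2, "Item 5"),
]
tree = build_tree(headers)
print(ET.tostring(tree, encoding="unicode"))
```

Here "Item 1" and "Item 1A" end up as children of "PART I", and "Item 5" as a child of "PART II".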

Roadmap:

Parser that converts >95% of filings into nicely formatted xml trees. Currently at 90%.
Apply data science to the xml to cluster equivalent headers, e.g. "seasonality" and "seasonal variation", to make the xml easier to work with.
Run LLMs on section text to create node metadata (e.g. does this section say anything substantive, or does it spend 2000 chars saying the company is not required to fill it out?)
Backwards compatibility for text files (extends historical reach back into the early 1990s).

Possible future features

better hierarchy calculation
more filings supported
better rag integration
converting html tables to nice xml tables
hosting cleaned xml files online
better color scheme (color scheme for headers, ignored_elements - e.g. page numbers, text)
better descriptions of functions
faster - not a priority, but kinda fun to program. Code cleanup + removing redundancies may help a lot.

Features

Parse 10K
Export to XML, CSV
XBRL metadata

Feature request:

save_dta - save xml to dta. similar to csv function
better selection by titles. e.g. selecting by item1, will also return item 1a,... not sure how to set this up in a nice way
More XBRL stuff
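
One possible way to approach the title-selection request is to parse the item number and letter out of each title, then let the caller choose whether sub-items count as a match. This is only a sketch on plain strings, not part of the SEC Parsers API; item_key, select_items, and the include_subitems flag are hypothetical names.

```python
import re

# "Item 1A. Risk Factors" -> number "1", letter "A"
ITEM_RE = re.compile(r"^\s*item\s*(\d+)([a-z]?)(?![a-z0-9])", re.IGNORECASE)


def item_key(title):
    """Return (number, letter) for titles like 'Item 1A. Risk Factors', else None."""
    match = ITEM_RE.match(title)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).lower()


def select_items(titles, query, include_subitems=False):
    """Select titles matching the query item, optionally with its lettered sub-items."""
    wanted = item_key(query)
    if wanted is None:
        return []
    selected = []
    for title in titles:
        key = item_key(title)
        if key is None:
            continue
        if key == wanted or (include_subitems and wanted[1] == "" and key[0] == wanted[0]):
            selected.append(title)
    return selected


titles = [
    "PART I",
    "Item 1. Business",
    "Item 1A. Risk Factors",
    "Item 1B. Unresolved Staff Comments",
    "Item 10. Directors",
]
print(select_items(titles, "item1"))
print(select_items(titles, "item 1", include_subitems=True))
```

The first call returns only "Item 1. Business"; the second also returns items 1A and 1B, but not item 10.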

Statistics

100% parsed html rate
99.3% conversion to xml rate.
On average ~1s to parse file (range .1s-3s).

Issues

Handle the case where PART I appears multiple times as a header, e.g. "item 1 continued". Logic to handle this still needs to be developed, probably in cleanup.

TODO

Currently I've only focused on parsing parts. A signature node has been added, but its tree is likely to be crazy. Focusing on reducing the craziness of the parts tree first.
We fixed one table issue; now need to account for filings with too many tables, e.g. https://www.sec.gov/Archives/edgar/data/18255/000001825518000024/cato10k2017-jrs.htm
Code cleanup. Right now I'm tweaking code to increase parse rate, eventually need to incorporate lessons learned, and rewrite.

Other people's SEC stuff

edgartools - good interface for interacting with SEC's EDGAR system
sec-parser - oops, we have similar names. They were first, my bad. They parse 10-Qs well.
sec-api - Paid API to search / download SEC filings. Basically, SEC's EDGAR but set up in a much nicer format. I haven't used it since it costs money.
Bill McDonald's 10-X Archive
Eclect - "Save time reading SEC filings with the help of machine learning.". Paid.
Textblocks.app - Paid API to extract and analyze structured data from SEC filings. The approach seems to be similar to mine.
Yu Zhu - article with an approach to parse 10K filings using regex
Wharton Research Data Services - heard they have SEC stuff, looking into it
Gist - using regex and beautifulsoup to parse 10Ks
Victor Dahan - Sentiment Analysis of 10-K Files
edgarWebR - edgarWebR provides an interface to access the SEC's EDGAR system for company financial filings.
NLP in the stock market - Leveraging sentiment analysis on 10-k fillings as an edge
Computer Vision using OpenCV
LLMs (I believe unstructured.io does something like this)
Transformers

Other people's papers related to SEC stuff

Sentiment Analysis on 10-K Financial Reports using Machine Learning Approaches

Project details

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sec_parsers-0.528.tar.gz (17.7 kB)

Uploaded Jul 13, 2024 Source

Built Distribution

sec_parsers-0.528-py3-none-any.whl (16.6 kB)

Uploaded Jul 13, 2024 Python 3

Hashes for sec_parsers-0.528.tar.gz
Algorithm	Hash digest
SHA256	`cce5c97b04a4f8dec75b03c2a5e0e352013e6710b8da5bb2a4eed9df6b02b1c6`
MD5	`3ccd72ba707ba9727313d8948d958df7`
BLAKE2b-256	`6461c02c9a44e293856a7f1c03bb95e2292ee2e6c23505837cc480891b136e0e`

Hashes for sec_parsers-0.528-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6fa62be35e48ee6b90356aca18cb2e68983ce046cebacabcf9a49039324c4de`
MD5	`df9f56c89e676c52a1d189e90cb3ad57`
BLAKE2b-256	`36823bbf30222e1a4a6c3a7e379dd98e2e7f0c4e037be611874ba6a03697d828`
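
If you download one of these files manually, the published digests can be checked locally. The sketch below uses Python's standard hashlib to compare a file's SHA256 against the value listed above; the function names and the local file path are assumptions for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 with a published hex digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Example, assuming the source distribution was saved to the current directory:
# matches_published_hash(
#     "sec_parsers-0.528.tar.gz",
#     "cce5c97b04a4f8dec75b03c2a5e0e352013e6710b8da5bb2a4eed9df6b02b1c6",
# )
```

pip can perform the same check automatically when hashes are pinned in a requirements file.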