
Parse SEC EDGAR HTML documents into a tree of elements that correspond to the visual structure of the document.

sec-parser



Overview

The sec-parser project simplifies extracting meaningful information from SEC EDGAR HTML documents by organizing them into semantic elements and a tree structure. Semantic elements might include section titles, paragraphs, and tables, each classified for easier data manipulation. This forms a semantic tree that corresponds to the visual and informational structure of the document. If you're familiar with image semantic segmentation, the idea is the same, applied to HTML documents instead of images.

This tool is especially beneficial for Artificial Intelligence (AI), Machine Learning (ML), and Large Language Models (LLM) applications by streamlining data pre-processing and feature extraction.

Key Use-Cases

sec-parser is versatile and can be applied in various scenarios, including but not limited to:

Financial and Regulatory Analysis

  • Financial Analysis: Extract financial data from 10-Q and 10-K filings for quantitative modeling.
  • Risk Assessment: Evaluate risk factors or Management's Discussion and Analysis sections for qualitative analysis.
  • Regulatory Compliance: Help legal teams automate compliance checks.
  • Flexible Filtering: Easily filter SEC documents by sections and types, giving you precisely the data you need.

Analytics and Data Science

  • Academic Research: Facilitate large-scale studies involving public financial disclosures, sentiment analysis, or corporate governance.
  • Analytics Ready: Integrate parsed data seamlessly into popular analytics tools for further analysis and visualization.

AI and Machine Learning

  • Cutting-Edge AI for SEC EDGAR: Apply advanced AI techniques like MemWalker to navigate, extract, and transform complex information from SEC documents efficiently. Learn more in our blog post: Cutting-Edge AI for SEC EDGAR: Introducing MemWalker.
  • AI Applications: Leverage parsed data for various AI tasks such as text summarization, sentiment analysis, and named entity recognition.
  • Data Augmentation: Use authentic financial text to train and test machine learning models.

Causal AI

  • Causal Analysis: Use parsed data to understand cause-effect relationships in financial data, beyond mere correlations.
  • Predictive Modeling: Enhance predictive models by incorporating causal relationships, leading to more robust and reliable predictions.
  • Decision Making: Aid decision-making processes by providing insights into the potential impact of different actions, based on causal relationships identified in the data.

Large Language Models

  • LLM Compatible: Use parsed data to facilitate complex NLU tasks with Large Language Models like ChatGPT, including question-answering, language translation, and information retrieval.

These use-cases demonstrate the flexibility and power of sec-parser in handling both traditional data extraction tasks and facilitating more advanced AI-driven analysis.

Disclaimer

[!IMPORTANT] This project, sec-parser, is an independent, open-source initiative and has no affiliation, endorsement, or verification by the United States Securities and Exchange Commission (SEC). It utilizes public APIs and data provided by the SEC solely for research, informational, and educational objectives. This tool is not intended for financial advisement or as a substitute for professional investment advice or compliance with securities regulations. The creators and maintainers make no warranties, expressed or implied, about the accuracy, completeness, or reliability of the data and analyses presented. Use this software at your own risk. For accurate and comprehensive financial analysis, consult with qualified financial professionals and comply with all relevant legal requirements. The project maintainers and contributors are not liable for any financial or legal consequences arising from the use of this tool.

Getting Started

This guide will walk you through the process of installing the sec-parser package and using it to extract the "Segment Operating Performance" section as a semantic tree from the latest Apple 10-Q filing.

[!TIP] To run the example code in a ready-to-code environment, you can use GitHub Codespaces. Click the button below to open the example code in a codespace and start experimenting with sec-parser:

Open in GitHub Codespaces

Installation

First, install the sec-parser package using pip:

pip install sec-parser

To run the example code in this README, you'll also need the sec-downloader package:

pip install sec-downloader

Usage

Once you've installed the necessary packages, you can start by downloading the filing from the SEC EDGAR website. Here's how you can do it:

from sec_downloader import Downloader

# Initialize the downloader with your company name and email
dl = Downloader("MyCompanyName", "email@example.com")

# Download the latest 10-Q filing for Apple
html = dl.get_filing_html(ticker="AAPL", form="10-Q")

[!NOTE] The company name and email address are used to form a user-agent string that adheres to the SEC EDGAR's fair access policy for programmatic downloading. Source
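As a rough illustration of the note above, the sketch below combines a company name and an email address into a single User-Agent header and attaches it to a request object using only the standard library. The helper names `build_user_agent` and `make_edgar_request` are hypothetical, and the exact string that sec-downloader sends is an assumption here; the request is prepared but never sent.

```python
from urllib.request import Request


def build_user_agent(company_name: str, email: str) -> str:
    """Combine a company name and contact email into one User-Agent value.

    Assumption: a plain "name email" string; sec-downloader may format it differently.
    """
    return f"{company_name} {email}"


def make_edgar_request(url: str, company_name: str, email: str) -> Request:
    """Prepare (but do not send) a request that identifies the caller."""
    headers = {"User-Agent": build_user_agent(company_name, email)}
    return Request(url, headers=headers)


request = make_edgar_request(
    "https://www.sec.gov/cgi-bin/browse-edgar",
    "MyCompanyName",
    "email@example.com",
)

# urllib normalizes header names, so the key is stored as "User-agent".
print(request.get_header("User-agent"))
```

Identifying yourself this way lets the server operator contact you instead of blocking anonymous automated traffic.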

[!TIP] Read the sec-downloader documentation (and examples) for more advanced usage, such as downloading the three latest Apple 10-Q filings instead of just one, or downloading based on a specific CIK or Filing ID (i.e. accession number).

Now, we can parse the filing HTML into a list of semantic elements:

import sec_parser as sp


# Utility function to make the example code a bit more compact
def print_first_n_lines(text: str, *, n: int):
    print("\n".join(text.split("\n")[:n]), "...", sep="\n")


elements: list = sp.Edgar10QParser().parse(html)

demo_output: str = sp.render(elements)
print_first_n_lines(demo_output, n=7)

TopSectionTitle: PART I  —  FINANCIAL INFORMATION
TopSectionTitle: Item 1.    Financial Statements
TitleElement: CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)
SupplementaryText: (In millions, except number of ...housands and per share amounts)
TableElement: Table with 24 rows, 80 numbers, and 1058 characters.
SupplementaryText: See accompanying Notes to Conde...solidated Financial Statements.
TitleElement: CONDENSED CONSOLIDATED STATEMEN...OMPREHENSIVE INCOME (Unaudited)
...
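Because each element in the list is an instance of a class such as TitleElement or TableElement, one way to filter the flat list is by class name. The sketch below uses small stand-in classes with placeholder text so that it runs without sec-parser installed; the stand-in classes and the helper `select_by_type_name` are hypothetical and are not part of the library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StandInElement:
    """Hypothetical stand-in for a parsed semantic element."""

    text: str


class TitleElement(StandInElement):
    pass


class TableElement(StandInElement):
    pass


class SupplementaryText(StandInElement):
    pass


def select_by_type_name(elements, type_name):
    """Keep only the elements whose class name matches type_name."""
    return [e for e in elements if type(e).__name__ == type_name]


# Placeholder content, not taken from a real filing.
elements = [
    TitleElement("Sample statement title"),
    SupplementaryText("Sample note before the table"),
    TableElement("Sample table contents"),
    SupplementaryText("Sample note after the table"),
]

tables = select_by_type_name(elements, "TableElement")
counts = Counter(type(e).__name__ for e in elements)
print(len(tables), dict(counts))
```

With real parser output, an `isinstance` check against the library's element classes would likely be more robust than comparing class names.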

[!TIP]

FAQ: How do I get the text of each element (or all of the document)? How do I get all of the text in a specific section?

Use the element.text field. Check out this notebook for a full example.
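As a minimal sketch of that answer, the text of a whole document can be assembled by joining the `text` field of every element. The example below uses `SimpleNamespace` objects with placeholder text as stand-ins for parsed elements, so it runs without sec-parser installed; the helper `document_text` is hypothetical.

```python
from types import SimpleNamespace


def document_text(elements, separator="\n\n"):
    """Join the text of all elements, skipping whitespace-only ones."""
    return separator.join(e.text for e in elements if e.text.strip())


# Stand-ins for parsed elements; only the .text attribute is assumed.
elements = [
    SimpleNamespace(text="Item 1. Financial Statements"),
    SimpleNamespace(text="   "),
    SimpleNamespace(text="Sample paragraph."),
]

full_text = document_text(elements)
print(full_text)
```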

We can also construct a semantic tree to allow for easy filtering by parent sections:

tree = sp.TreeBuilder().build(elements)

demo_output: str = sp.render(tree)
print_first_n_lines(demo_output, n=7)

TopSectionTitle: PART I  —  FINANCIAL INFORMATION
├── TopSectionTitle: Item 1.    Financial Statements
│   ├── TitleElement: CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)
│   │   ├── SupplementaryText: (In millions, except number of ...housands and per share amounts)
│   │   ├── TableElement: Table with 24 rows, 80 numbers, and 1058 characters.
│   │   ├── SupplementaryText: See accompanying Notes to Conde...solidated Financial Statements.
│   ├── TitleElement: CONDENSED CONSOLIDATED STATEMEN...OMPREHENSIVE INCOME (Unaudited)
...
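To make "filtering by parent sections" concrete, here is a small, self-contained sketch of the idea: locate a section node by its title, then walk everything nested beneath it. The `Node` class, `walk`, and `find_section` are hypothetical stand-ins written for this illustration; they do not reproduce the node API of sec-parser, and the sample titles are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Hypothetical stand-in for a tree node: a type label, text, and children."""

    kind: str
    text: str
    children: list[Node] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node itself, then all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find_section(roots: list[Node], title_fragment: str) -> Node | None:
    """Return the first node whose text contains the fragment (case-insensitive)."""
    needle = title_fragment.lower()
    for root in roots:
        for node in walk(root):
            if needle in node.text.lower():
                return node
    return None


# Placeholder tree shaped like the rendered output above.
roots = [
    Node("TopSectionTitle", "PART I", [
        Node("TopSectionTitle", "Item 1. Financial Statements", [
            Node("TitleElement", "Balance sheet (sample)", [
                Node("TableElement", "sample table 1"),
            ]),
            Node("TitleElement", "Income statement (sample)", [
                Node("TableElement", "sample table 2"),
            ]),
        ]),
    ]),
]

section = find_section(roots, "balance sheet")
# Skip the first item, which is the section node itself.
section_texts = [n.text for n in walk(section)][1:]
print(section_texts)
```

Searching from the section node rather than the document root is what keeps tables from sibling sections out of the result.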

[!TIP]

Feel free to experiment with the example code provided above. You can easily do this by launching a GitHub Codespace for the sec-parser repository, which will set up a development environment for you in the cloud:

Open in GitHub Codespaces

This is a great way to play around with the code without having to set up anything on your local machine. Give it a try!

For more examples and advanced usage, you can continue learning how to use sec-parser by referring to the User Guide, Developer Guide, and Documentation.

This was an example of 10-Q SEC Form parsing. How do we parse other SEC Form types, such as 10-K, 8-K, S-1, etc.?

Please refer to this document.

What's Next?

Your turn to explore the capabilities of sec-parser! With the tools and examples provided, you can now dive into parsing and analyzing SEC filings.

The semantic elements and tree structures created by the parser will serve as a solid foundation for your financial analysis and research tasks with the tools of your choice.

For a tailored experience, consider using our free and open-source library for AI-powered financial analysis:

pip install sec-ai

Explore sec-ai on GitHub

Best Practices

How to Import Modules In Your Code

To ensure your code remains functional even when we change the internal structure of sec-parser, it's recommended to avoid deep imports. Here is an example of a deep import (not recommended):

[!CAUTION]

from sec_parser.semantic_tree.internal_utils.core import SomeInternalClass

Instead, use the suggested ways to import modules from sec-parser:

Root Import (prefix)

  • import sec_parser as sp: This imports the main package as sp. You can then access its functionality using the sp. prefix.

Root Import (direct)

  • from sec_parser import SomeClass: This allows you to directly use SomeClass without any prefix.

Submodule Import (prefix)

  • import sec_parser.semantic_tree: This imports the semantic_tree submodule, and you can access its classes and functions using the sec_parser.semantic_tree. prefix.

Submodule Import (direct)

  • from sec_parser.semantic_tree import SomeClass: This imports a specific class SomeClass from the semantic_tree submodule.

[!NOTE] The main package sec_parser contains only the most common functionality. For specialized tasks, please use submodule imports (prefix or direct).

Contributing

For information about setting up the development environment, coding standards, and contribution workflows, please refer to our CONTRIBUTING.md guide.

License

This project is licensed under the MIT License - see the LICENSE file for details.
