The 10-K Report Item Segmentation Tool

itemseg

Itemseg is a 10-K item segmentation tool for processing 10-K filings and extracting item-specific text.

Itemseg supports the following input formats (--input_type):

  • raw: Complete submission text file. See example at SEC Website
  • html: 10-K report in HTML format. See example at SEC Website
  • native_text: 10-K report in pure text format. See example at SEC Website
  • cleaned_text: 10-K report converted to pure text format with tables removed.

The input (--input) can be either a local file or a URL pointing to the SEC website.

Itemseg supports the following item segmentation approaches (--method):

  • crf: Conditional Random Field (default method). Recommended for machines without a GPU.
  • bert: BERT4ItemSeg; BERT encoder coupled with Bi-LSTM.
  • chatgpt: GPT4ItemSeg; Uses OpenAI API and line-id-based prompting.

The bert method requires a GPU to run at a reasonable speed. You will need to set up the GPU hardware and driver before using this approach. You can still use itemseg to process 10-K reports without a GPU by selecting the crf approach.

Installation

We recommend installing itemseg in a separate environment created by virtualenv to prevent library version conflicts. The instructions below have been tested with Ubuntu 24 LTS.

Set up virtualenv

Install virtualenv first if it is not already installed.

sudo apt install python3-venv

The next step is to set up the virtual environment:

python3 -m venv env_itemseg

Activate the virtual environment:

source env_itemseg/bin/activate

Now we can install itemseg:

pip3 install itemseg

Download resource files

You need to download the resource files before you start using the tool.

python3 -m itemseg --get_resource

Download NLTK data

python3 -m nltk.downloader punkt punkt_tab

Itemseg Example Usage

Segment items in a 10-K file

Using Apple 10-K (2023) as an example (adjust --user_agent according to your affiliation):

python3 -m itemseg --input_type raw --input https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt --user_agent "Some University johndow@someuniversity.edu"

The default method is CRF. See the results in ./segout01/.

The *.csv file contains line-by-line predictions for items in Begin-Inside-Outside (BIO) style tags. Other files contain item-specific text.
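
To illustrate how BIO-style line tags map to item text, here is a minimal sketch. The tag spellings (`B-1A`, `I-1A`, `O`) and the in-memory (tag, text) pairs are assumptions made for illustration; check the generated *.csv file for the actual column names and tag format before adapting it.

```python
from collections import defaultdict


def group_lines_by_item(tagged_lines):
    """Collect text lines per item from (tag, text) pairs in BIO style.

    Assumed tag format (hypothetical): "B-<item>" opens an item,
    "I-<item>" continues it, and "O" marks lines outside any item.
    """
    items = defaultdict(list)
    for tag, text in tagged_lines:
        if tag == "O":
            continue  # line does not belong to any item
        prefix, _, item = tag.partition("-")
        if prefix in ("B", "I") and item:
            items[item].append(text)
    return dict(items)


# Made-up sample lines, not taken from a real filing.
sample = [
    ("O", "Table of contents"),
    ("B-1", "Item 1. Business"),
    ("I-1", "The company designs and sells devices."),
    ("B-1A", "Item 1A. Risk Factors"),
    ("I-1A", "Demand may fluctuate."),
    ("O", "SIGNATURES"),
]
grouped = group_lines_by_item(sample)
print(grouped["1A"])
```

The B tag marks the first line of an item and the I tag marks the lines that continue it, so concatenating both in order yields the item text.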

Other sample commands

To use BERT4ItemSeg:

python3 -m itemseg --input_type raw --input https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt --user_agent "Some University johndow@someuniversity.edu" --method bert

To use GPT4ItemSeg (replace the --apikey value with your own OpenAI API key):

python3 -m itemseg --input_type raw --input https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt --user_agent "Some University johndow@someuniversity.edu" --method chatgpt --apikey zzzzxxxxzzzz

About 10-K reports

A 10-K report is an annual report that publicly traded companies file with the U.S. Securities and Exchange Commission (SEC). It provides a comprehensive overview of the company's business and financial performance and is typically more detailed than the annual report sent to shareholders. Key items of a 10-K report include:

  • Item 1 (Business): Describes the company's main operations, products, and services.
  • Item 1A (Risk Factors): Outlines risks that could affect the company's business, financial condition, or operating results.
  • Item 3 (Legal Proceedings)
  • Item 7 (Management’s Discussion and Analysis of Financial Condition and Results of Operations; MD&A): Offers management's perspective on the financial results, including discussion of liquidity, capital resources, and results of operations.

You can search and read 10-K reports through the EDGAR web interface. For the raw input type, Itemseg takes the URL of the Complete submission text file, converts the HTML to formatted text, removes tables with numerical content, and segments the resulting text by item.

As an example, the Amazon 10-K report page for fiscal year 2022 shows the link to the HTML 10-K report and a Complete submission text file 0001018724-23-000004.txt. Pass this link to the itemseg module, and it will retrieve the file and segment items for you. Remember to adjust --user_agent according to your affiliation.

python3 -m itemseg --input_type raw --input https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/0001018724-23-000004.txt --user_agent "Some University johedoe@someuniv.edu"
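
If you already know a company's CIK and the accession number of a filing, the URL of the Complete submission text file can be assembled directly. The sketch below infers the URL layout from the two example URLs in this README; `submission_url` is a hypothetical helper and not part of itemseg.

```python
def submission_url(cik, accession):
    """Build the URL of a Complete submission text file.

    Layout inferred from the example URLs in this README:
    .../edgar/data/<CIK>/<accession without dashes>/<accession>.txt
    """
    folder = accession.replace("-", "")  # directory name drops the dashes
    cik_number = int(cik)  # removes any leading zeros from the CIK
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik_number}/{folder}/{accession}.txt"
    )


amazon_url = submission_url("1018724", "0001018724-23-000004")
print(amazon_url)
```

The resulting string can be passed to --input together with --input_type raw.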

The default setting outputs line-by-line tags (BIO style) in a CSV file, together with Item 1, Item 1A, Item 3, and Item 7 in separate files (--outfn_type "csv,item1,item1a,item3,item7"). You can change the output file type combination with --outfn_type. For example, if you only want to output Item 1A and Item 7, set --outfn_type "item1a,item7".
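
When calling itemseg from a script, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only the flags and values documented above; `build_itemseg_command` is a hypothetical helper, and the validation it performs is an assumption about what is convenient, not something itemseg requires.

```python
import sys

# Values listed in this README for --input_type and --method.
INPUT_TYPES = {"raw", "html", "native_text", "cleaned_text"}
METHODS = {"crf", "bert", "chatgpt"}


def build_itemseg_command(input_path, user_agent, input_type="raw",
                          method="crf", outfn_type=None, apikey=None):
    """Return an argv list that runs itemseg as a module."""
    if input_type not in INPUT_TYPES:
        raise ValueError(f"unknown input type: {input_type}")
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    cmd = [
        sys.executable, "-m", "itemseg",
        "--input_type", input_type,
        "--input", input_path,
        "--user_agent", user_agent,
        "--method", method,
    ]
    if outfn_type:
        # e.g. ["item1a", "item7"] becomes "item1a,item7"
        cmd += ["--outfn_type", ",".join(outfn_type)]
    if apikey:
        cmd += ["--apikey", apikey]
    return cmd


cmd = build_itemseg_command(
    "https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/0001018724-23-000004.txt",
    "Some University johedoe@someuniv.edu",
    outfn_type=["item1a", "item7"],
)
print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; passing a list avoids shell quoting problems with the spaces in the --user_agent value.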

If you need to process a large number of 10-K files, a good starting point is the master index, which lists all available files and provides a convenient way to construct a comprehensive list of target files.
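
As a rough sketch of how such a list could be built, the function below filters a master index for 10-K filings. It assumes the pipe-delimited layout of EDGAR's master.idx files (a `CIK|Company Name|Form Type|Date Filed|Filename` header followed by a dashed separator line) and that filenames are relative to https://www.sec.gov/Archives/. The sample rows, including the company names and dates in them, are illustrative and not real index content, and `list_10k_urls` is a hypothetical helper.

```python
def list_10k_urls(index_text, form_types=("10-K",)):
    """Return URLs of filings whose form type is in form_types."""
    urls = []
    in_body = False
    for line in index_text.splitlines():
        if not in_body:
            # Data rows start after the dashed separator line.
            if line.startswith("-----"):
                in_body = True
            continue
        parts = line.split("|")
        if len(parts) != 5:
            continue  # skip blank or malformed rows
        form_type, filename = parts[2], parts[4]
        if form_type in form_types:
            urls.append("https://www.sec.gov/Archives/" + filename.strip())
    return urls


# Illustrative sample in the assumed master.idx layout.
sample_index = """Description: Master Index of EDGAR Dissemination Feed

CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1111111|Example Holdings Inc|10-Q|2023-10-30|edgar/data/1111111/0001111111-23-000010.txt
2222222|Sample Corp|10-K|2023-11-03|edgar/data/2222222/0002222222-23-000020.txt
3333333|Demo Industries|8-K|2023-11-06|edgar/data/3333333/0003333333-23-000030.txt
"""
urls = list_10k_urls(sample_index)
print(urls)
```

Each URL in the result can be passed to itemseg with --input_type raw. Pace your requests and set --user_agent when processing many files.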

License

itemseg is distributed under the terms of the CC BY-NC license.

We extend our special thanks to Chia-Tai Li and I-Chen Tsai for their valuable support in managing the dataset, as well as merging and refactoring the project.
