The 10-K Report Item Segmentation Tool
itemseg
Itemseg is a 10-K item segmentation tool for processing 10-K filings and extracting item-specific text.
Itemseg supports the following input formats (--input_type):
- raw: Complete submission text file. See example at SEC Website
- html: 10-K report in HTML format. See example at SEC Website
- native_text: 10-K report in pure text format. See example at SEC Website
- cleaned_text: 10-K report converted to pure text format with tables removed.
The input (--input) can be either a local file or a URL pointing to the SEC website.
Itemseg supports the following item segmentation approaches (--method):
- crf: Conditional Random Field (default method). Recommended for machines without a GPU.
- bert: BERT4ItemSeg; BERT encoder coupled with Bi-LSTM.
- chatgpt: GPT4ItemSeg; Uses OpenAI API and line-id-based prompting.
The bert method requires a GPU to run at a reasonable speed, so set up the GPU hardware and driver before using it. Without a GPU, you can still use itemseg to process 10-K reports by selecting the crf method.
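As a rough illustration of the choice between methods, the sketch below picks a --method value depending on whether a GPU appears to be present. Checking for the nvidia-smi executable is an assumption made here for simplicity and is not something itemseg is documented to do; pick_method is a hypothetical helper name.

```python
import shutil

# Method names documented for --method.
METHODS = ("crf", "bert", "chatgpt")


def pick_method(prefer="bert", gpu_available=None):
    """Return a --method value, falling back to crf when no GPU is detected.

    gpu_available may be passed explicitly. When it is None, the presence of
    the nvidia-smi executable is used as a rough heuristic (an assumption of
    this sketch, not part of itemseg).
    """
    if prefer not in METHODS:
        raise ValueError(f"unknown method: {prefer!r}")
    if gpu_available is None:
        gpu_available = shutil.which("nvidia-smi") is not None
    # Only bert is described as needing a GPU; crf and chatgpt pass through.
    if prefer == "bert" and not gpu_available:
        return "crf"
    return prefer
```

The returned string can be passed directly as the value of --method.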
Installation
We recommend installing itemseg in a separate environment created by virtualenv to prevent library version conflicts. The instructions below have been tested with Ubuntu 24 LTS.
Set up virtualenv
Install virtualenv first if it is not already installed.
sudo apt install python3-venv
Next, create the virtual environment:
python3 -m venv env_itemseg
Activate the virtual environment:
source env_itemseg/bin/activate
Now install itemseg:
pip3 install itemseg
Download resource files
Download the resource files before you start using the tool:
python3 -m itemseg --get_resource
Download NLTK data
python3 -m nltk.downloader punkt punkt_tab
Itemseg Example Usage
Segment items in a 10-K file
Using Apple 10-K (2023) as an example (adjust --user_agent according to your affiliation):
python3 -m itemseg --input_type raw --input https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt --user_agent "Some University johndow@someuniversity.edu"
The default method is CRF.
See the results in ./segout01/.
The *.csv file contains line-by-line predictions for items in Begin-Inside-Outside (BIO) style tags. Other files contain item-specific text.
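To illustrate how BIO-style tags mark item boundaries, the sketch below groups a sequence of per-line tags into (item, first line, last line) spans. The tag spelling used here (B-item1, I-item1, O) and the function name bio_to_spans are assumptions for illustration; check the generated CSV file for the actual column names and label format.

```python
def bio_to_spans(tags):
    """Group BIO tags into (label, start, end) spans with inclusive indices.

    Assumes tags look like "B-item1", "I-item1", or "O" (hypothetical
    spelling; adapt to the labels found in the itemseg CSV output).
    """
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        # Close the open span when the current tag does not continue it.
        if label is not None and not continues:
            spans.append((label, start, i - 1))
            label, start = None, None
        # Open a new span on B-, or on a stray I- with no open span.
        if prefix == "B" or (prefix == "I" and label is None):
            label, start = name, i
    if label is not None:
        spans.append((label, start, len(tags) - 1))
    return spans
```

Each span gives the range of line indices that belong to one item, which is the same grouping the item-specific output files reflect.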
Other sample commands
To use BERT4ItemSeg:
python3 -m itemseg --input_type raw --input https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt --user_agent "Some University johndow@someuniversity.edu" --method bert
To use GPT4ItemSeg (replace the --apikey value with your own OpenAI API key):
python3 -m itemseg --input_type raw --input https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt --user_agent "Some University johndow@someuniversity.edu" --method chatgpt --apikey zzzzxxxxzzzz
About 10-K reports
A 10-K report is an annual report that publicly traded companies file with the U.S. Securities and Exchange Commission (SEC). It provides a comprehensive overview of the company's business and financial performance and is more detailed than the annual report sent to shareholders. Key items of a 10-K report include:
- Item 1 (Business): Describes the company's main operations, products, and services.
- Item 1A (Risk Factors): Outlines risks that could affect the company's business, financial condition, or operating results.
- Item 3 (Legal Proceedings): Discloses significant pending legal proceedings involving the company.
- Item 7 (Management’s Discussion and Analysis of Financial Condition and Results of Operations; MD&A): Offers management's perspective on the financial results, including discussion of liquidity, capital resources, and results of operations.
You can search and read 10-K reports through the EDGAR web interface. For the raw input type, Itemseg takes the URL of the Complete submission text file, converts the HTML to formatted text, removes tables with numerical content, and segments the resulting text by item.
As an example, the Amazon 10-K report page for fiscal year 2022 shows links to the HTML 10-K report and to the Complete submission text file, 0001018724-23-000004.txt. Pass the link to the Complete submission text file to the itemseg module, and it will retrieve the file and segment the items for you. Remember to adjust --user_agent according to your affiliation.
python3 -m itemseg --input_type raw --input https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/0001018724-23-000004.txt --user_agent "Some University johedoe@someuniv.edu"
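The two sample URLs in this document share one pattern: the CIK without leading zeros, the accession number without dashes as a folder, and the dashed accession number as the file name. A minimal sketch of that pattern follows; submission_url is a hypothetical helper, and the pattern is inferred from these examples rather than taken from an SEC specification.

```python
def submission_url(cik, accession):
    """Build the Complete submission text file URL for one filing.

    cik: Central Index Key as int or string, e.g. 1018724
    accession: accession number with dashes, e.g. "0001018724-23-000004"
    """
    folder = accession.replace("-", "")  # folder name drops the dashes
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{int(cik)}/{folder}/{accession}.txt"  # int() strips leading zeros
    )
```

The result can be passed as --input together with --input_type raw.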
The default setting outputs line-by-line tags (BIO style) in a CSV file, together with Item 1, Item 1A, Item 3, and Item 7 in separate files (--outfn_type "csv,item1,item1a,item3,item7"). You can change the output file type combination with --outfn_type. For example, if you only want to output Item 1A and Item 7, set --outfn_type "item1a,item7".
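When itemseg is driven from a script, the --outfn_type value can be assembled from a list of wanted outputs. The sketch below accepts only the five values named in this document; itemseg may accept others, so treat KNOWN_OUTPUTS as an assumption. Both function names are hypothetical.

```python
# Output types that appear in the default --outfn_type setting.
KNOWN_OUTPUTS = ("csv", "item1", "item1a", "item3", "item7")


def outfn_type_arg(wanted):
    """Return the comma-separated --outfn_type value in a stable order."""
    unknown = [w for w in wanted if w not in KNOWN_OUTPUTS]
    if unknown:
        raise ValueError(f"output type(s) not listed in this sketch: {unknown}")
    return ",".join(k for k in KNOWN_OUTPUTS if k in wanted)


def build_command(url, user_agent, wanted):
    """Return an argument list suitable for subprocess.run()."""
    return [
        "python3", "-m", "itemseg",
        "--input_type", "raw",
        "--input", url,
        "--user_agent", user_agent,
        "--outfn_type", outfn_type_arg(wanted),
    ]
```

For example, outfn_type_arg(["item7", "item1a"]) yields the string item1a,item7 used in the example above.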
If you need to process a large number of 10-K files, a good starting point is the EDGAR master index, which lists all available filings and provides a convenient way to build a comprehensive list of target files.
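A minimal sketch of turning a master index into a list of target URLs follows. It assumes the pipe-delimited layout used by EDGAR master.idx files (CIK|Company Name|Form Type|Date Filed|Filename) and that the Filename column is a path relative to https://www.sec.gov/Archives/; verify both against an index file you have downloaded. The function name ten_k_urls is hypothetical.

```python
def ten_k_urls(index_text, form_types=("10-K",)):
    """Extract submission text file URLs for the given form types.

    index_text: the full text of a master index file, already downloaded.
    Lines that do not have exactly five pipe-separated fields (the preamble
    and the dashed separator) are skipped.
    """
    urls = []
    for line in index_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 5:
            continue
        cik, company, form, date_filed, filename = parts
        # The column header row is skipped because "Form Type" never matches.
        if form in form_types and filename.endswith(".txt"):
            urls.append("https://www.sec.gov/Archives/" + filename)
    return urls
```

Each returned URL can then be passed to itemseg as --input with --input_type raw. When fetching many files, keep the request rate modest and set --user_agent as described above.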
License
itemseg is distributed under the terms of the CC BY-NC license.
We extend our special thanks to Chia-Tai Li and I-Chen Tsai for their valuable support in managing the dataset, as well as merging and refactoring the project.