
Download SEC files in bulk

Welcome to secutils

secutils is a utility package to facilitate large bulk downloads of SEC documents. It works with any SEC document type and will retrieve the entire historical database if required. Multi-threaded file downloads are enabled in the command line utility.

Key functionality includes:

  • Multi-threaded downloading
  • Caching of index files
  • Automatic directory structure buildout (multiple file types are stored as ftype --> year --> quarter --> files)
  • Resume downloading
  • Built-in logging and download success tracker
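The directory layout listed above (form type, then year, then quarter, then files) can be sketched with pathlib. This is only an illustration, not the package's actual code: the helper name `filing_dir`, the plain-number quarter folder, and the handling of slashes in form types are all assumptions.

```python
from pathlib import Path


def filing_dir(root: str, form_type: str, year: int, quarter: int) -> Path:
    """Return <root>/<form_type>/<year>/<quarter> (hypothetical helper)."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1-4, got {quarter}")
    # Assumption: amended form types such as "10-K/A" contain a slash, so
    # replace it to keep the form type as a single directory level.
    safe_form = form_type.replace("/", "_")
    return Path(root) / safe_form / str(year) / str(quarter)


target = filing_dir("sec", "S-1", 2014, 1)
# target.mkdir(parents=True, exist_ok=True) would create the folders on disk.
print(target.as_posix())  # sec/S-1/2014/1
```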

Motivation

secutils picks up where a number of other repos left off. There are a couple of SEC downloading Python packages out there; however, they are designed for retrieving a few documents. I needed a way to consistently download the latest updates from the SEC and keep a local copy of the entire SEC history for certain file types. This translates into terabytes of documents, at which point networking, directory structure, logging, and similar issues arise.

There is a nice package available to download and construct index files; however, the user is still left to download the actual files and must be comfortable with bash scripting.

With secutils, the program recognizes files you have already retrieved, gets the files missing from your local archive, and continues.
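As a rough sketch of that resume behavior, the idea is to scan the local archive for file names that already exist and download only the rest. The helper names `already_downloaded` and `missing_files` are hypothetical, and matching on file name alone is an assumption; the package may track downloads differently.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Set


def already_downloaded(output_dir: str, suffix: str = ".txt") -> Set[str]:
    """Collect the names of files already present anywhere under output_dir."""
    root = Path(output_dir)
    if not root.is_dir():
        return set()
    return {p.name for p in root.rglob(f"*{suffix}") if p.is_file()}


def missing_files(candidates: Iterable[str], seen: Set[str]) -> List[str]:
    """Keep only the candidate file names that are not in the local archive."""
    return [name for name in candidates if name not in seen]


# Demo against a throwaway directory so nothing touches a real archive.
with tempfile.TemporaryDirectory() as archive:
    quarter_dir = Path(archive) / "S-1" / "2014" / "1"
    quarter_dir.mkdir(parents=True)
    (quarter_dir / "a.txt").write_text("already here")
    print(missing_files(["a.txt", "b.txt"], already_downloaded(archive)))  # ['b.txt']
```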

Furthermore, the hope is for this package to provide parsers for the respective form types. A user could import the 10-K parser and call the Management Discussion and Analysis method to retrieve the MD&A from selected files.

There are also plans to integrate directly with popular cloud providers given the scale of these filings. Processing 10-K and 10-Q filings alone requires terabytes of storage.

Installation

There are two ways to install secutils: from the Python Package Index (PyPI) or directly from source.

To install from pypi:

pip install secutils

And to install from source:

git clone https://github.com/datawrestler/sec-utils && cd sec-utils
conda create --name sec_env python=3.7 pip
conda activate sec_env
pip install -r requirements.txt
pip install -e .

Usage

conda activate sec_env
python download_sec.py --output_dir=/mnt/sda/sec --form_types=S-1 --num_workers=-1 --start_year=2014 --end_year=2019 --quarters 1 2 3 4
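The `--start_year`, `--end_year`, and `--quarters` flags in the command above together describe a grid of quarterly periods. The sketch below expands that grid, assuming `end_year` is inclusive (the README does not say). `iter_periods` is a hypothetical helper, not part of secutils. The URL pattern is EDGAR's public quarterly full-index layout; whether secutils reads `master.idx` from exactly this location is also an assumption.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple

# EDGAR publishes one set of index files per quarter under this public path.
INDEX_URL = "https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{quarter}/master.idx"


def iter_periods(start_year: int, end_year: int,
                 quarters: Sequence[int]) -> Iterator[Tuple[int, int]]:
    """Yield (year, quarter) pairs; end_year is treated as inclusive (assumption)."""
    years = range(start_year, end_year + 1)
    yield from product(years, sorted(set(quarters)))


periods = list(iter_periods(2014, 2019, [1, 2, 3, 4]))
print(len(periods))  # 24
year, quarter = periods[0]
print(INDEX_URL.format(year=year, quarter=quarter))
# https://www.sec.gov/Archives/edgar/full-index/2014/QTR1/master.idx
```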

Even more cleanly, you can coordinate long-running jobs and keep track of your parameters by modifying this example script.

Make sure to make it executable on your system:

chmod +x run.sh
./run.sh

You can also generate a config file and use the config to control parameters of longer runs:

from secutils.utils import generate_config
path_for_config = ''
generate_config(path_for_config)

Then pass the config path when launching the longer download run:

python -m secutils.download_sec --config_path='path_for_config'

A useful trick when working with remote servers is to direct a session's output to a file. Using screen also keeps the session alive even if you disconnect from SSH:

screen -dm -L python -m secutils.download_sec --config_path='path_for_config'

Additionally, users can leverage the API directly for more hands-on work. An overview resides in an example Jupyter notebook, with additional details below:

from secutils.edgar import FormIDX
form = FormIDX(year=2017, quarter=1, seen_files=None, cache_dir=None, form_types=['10-K'])
files = form.index_to_files()
form.master_index.head()
# CIK	Company Name	Form Type	Date Filed	Filename	fname
# 1000015	META GROUP INC	10-K	1998-03-31	edgar/data/1000015/0001000015-98-000009.txt	0001000015-98-000009.txt
# 1000112	CHEVY CHASE MASTER CREDIT CARD TRUST II	10-K	1998-03-27	edgar/data/1000112/0000920628-98-000038.txt	0000920628-98-000038.txt
# 1000179	PARAMOUNT FINANCIAL CORP	10-K	1998-03-30	edgar/data/1000179/0000950120-98-000108.txt	0000950120-98-000108.txt

# let's take a peek at the attributes available on individual files:
ex = files[0]
msg = f"""
      Company Name: {ex.company_name}
      CIK Number: {ex.cik_number}
      Date Filed: {ex.date_filed}
      Form Type: {ex.form_type}
      File Name: {ex.file_name}
      Download URL: {ex.file_download_url}
      """
print(msg) 
# Company Name: OPTICAL CABLE CORP
# CIK Number: 1000230
# Date Filed: 2017-12-20 00:00:00
# Form Type: 10-K
# File Name: 0001437749-17-020936.txt
# Download URL: https://www.sec.gov/Archives/edgar/data/1000230/0001437749-17-020936.txt

# download our example file:
output_dir = '.'
ex.download_file(output_dir) # 200 is a successful download

# verify download 
import os
list(filter(lambda x: x.endswith('txt'), os.listdir(output_dir)))
# ['0001437749-17-020936.txt']
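The single-file download above extends naturally to many files. The sketch below fans the same `download_file(output_dir)` call out over a thread pool. `download_all` is a hypothetical helper, not part of secutils, and `FakeFile` is a stand-in so the example runs without network access. The sketch also assumes `download_file` returns the HTTP status code, as the "200" comment above suggests.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Iterable, List


def download_all(files: Iterable[Any], output_dir: str,
                 max_workers: int = 4) -> List[Any]:
    """Call download_file(output_dir) on every file object using a thread pool.

    Results come back in the same order as the input files.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda f: f.download_file(output_dir), files))


class FakeFile:
    """Stand-in for the objects returned by index_to_files(); no network I/O."""

    def __init__(self, file_name: str) -> None:
        self.file_name = file_name

    def download_file(self, output_dir: str) -> int:
        # Assumption: the real method returns the HTTP status code.
        return 200


print(download_all([FakeFile("a.txt"), FakeFile("b.txt")], "."))  # [200, 200]
```

With the objects from the example above, the equivalent call would be `download_all(files, output_dir)`.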

Getting hands-on is great; however, the CLI provides several advantages:

  • Automatic directory structure creation
  • Built-in logging and caching
  • Ability to resume downloading via download scanning
  • Multi-threaded file downloading

Vision

The vision for this project extends far beyond its current state of downloading index and SEC files from the EDGAR database. Currently, parsing SEC files is tremendously difficult. There are numerous reasons for these difficulties, including:

  • No systematic tagging structure for SEC filings
  • File submission formats have changed over the years
  • Many different file types, header types, and content from one filing type to another

Given the above, parsing even a 10-K takes tremendous effort. The goal of this project is to bring together like-minded individuals and take a stab at a systematic parsing effort with a consistent API. The future state of the project would allow users to download SEC filings and use convenient methods to retrieve particular sections of the filings. For instance, a user could do something like the following:

from secutils.file_types import file_10k

file_path = '/path/to/10-K'
f = file_10k.from_path(file_path)

# and retrieve the management discussion and analysis section directly:
f.management_discussion()
# Here at XYZ company, we believe the following year will bring about great prosperity due to our R&D efforts in packages like secutils...

This would open up a world of opportunity for collaboration, text analytics research, and general business information gathering.
