Skip to main content

Corpcrawl EDGAR Scraper

Project description

# About This is a python scraper for the Securities and Exchange Commission EDGAR database. It looks at the Form 10k filings that publicly held corporations are required to file with the SEC. It then attempts to extract the subsidiary relationships from the 10k exhibit 21.1.

Corpcrawl is a storage agnostic scraper, so you’ll need to implement your own storage scheme.

# Installation You can get the package from PyPi through Pip.

pip install corpcrawl

From a python console you can try

import corpcrawl

If it works, then you can get started

# How to use

## Running it First import the required pieces

from corpcrawl.crawler import CorpCrawl from corpcrawl.backend import Backend

def main()

my_backend = MyBackend() crawler = CorpCrawl(cache_path = ‘/an/absolute/path/to/some/dir’, backend = my_backend) c.crawl(years = [2011, 2012], quarters = [1, 2, 3, 4])

class MyBackend(Backend):

def get_company(self, name):

pass

def add_company(self, comp):

print “Adding %s” % str(comp)

This is about as simple of a backend as can be. This code will crawl the filings and print out the name of each company and subsidiary that it finds for all quarters of years 2011 and 2012. Obviously you will want to hook it up to a database or something.

# Notes The SEC data is very unstructured. As such, there are a lot of errors. This is version 0.0.1, so instead of implementing your own methods to massage the data that it returns, it would be better if you could contribute so make the core parser better.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

corpcrawl-0.0.37.tar.gz (16.6 kB view details)

Uploaded Source

File details

Details for the file corpcrawl-0.0.37.tar.gz.

File metadata

  • Download URL: corpcrawl-0.0.37.tar.gz
  • Upload date:
  • Size: 16.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for corpcrawl-0.0.37.tar.gz
Algorithm Hash digest
SHA256 c29bca0e249d24286f01dd7630eabe953b4498615a6b9de26c2de563d7803214
MD5 7110980ed14a0735f4e984547a14bb7f
BLAKE2b-256 1b112d93c6021fc0bd1bcc54b08fb80b304299057a12e554cfff1d9316274dff

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page