Library for Insight project

Project description

Lazy prices: An NLP study of the changes in 10-K/10-Q filings

  1. Motivation
  2. Pipeline
  3. Dataset
  4. Conversion
  5. Transformation
  6. Database
  7. Visualization
  8. Notes

Motivation

In this project, I built a platform that parses public companies' quarterly and annual filings (10-Qs and 10-Ks, which I will collectively refer to as 10-Xs) and measures the similarity between them using various NLP metrics.

Furthermore, I formulate the hypothesis that “no news is good news”: companies that do not edit their filings from quarter to quarter outperform the market, while those that do see their stock value go down, as they were legally compelled to disclose bad news.

Useful links

Web app/UI: sec-scraper.club (the security warning is due to the self-signed SSL certificate used for the HTTPS connection)

Documentation: Read The Docs!

Tests: Tests

Loosely based on: Cohen, Lauren, Christopher J. Malloy, and Quoc Nguyen. "Lazy Prices" (March 7, 2019). 2019 Academic Research Colloquium for Financial Planning and Related Disciplines. https://ssrn.com/abstract=1658471

Pipeline

The pipeline can be described by the following picture:

[Figure: pipeline overview]

Dataset

A few datasets are used in this project:

| Name of the dataset | Number of txt files | Total size (GB) |
| --- | --- | --- |
| SEC scraping | ~1,000,000 | 2,500 |
| Filtered scraping | ~1,000,000 | 125 |
| Stock prices | 1 | 1 |
| Lookup CIK/Tickers | 1 | <1 |
| Market index | 4 | <1 |

The available data spans 1997 Q1 - 2018 Q4.
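
As a rough illustration of how the SEC scraping dataset can be assembled, here is a minimal sketch that fetches one quarter of EDGAR's public full index and keeps the 10-K/10-Q entries. This is not the project's actual scraper, and the User-Agent string is a placeholder:

```python
# Minimal sketch: fetch one quarterly EDGAR form index and keep the 10-X entries.
import requests

def download_form_index(year: int, quarter: int) -> str:
    """Return the raw form.idx listing for one quarter of EDGAR filings."""
    url = f"https://www.sec.gov/Archives/edgar/full-index/{year}/QTR{quarter}/form.idx"
    # The SEC asks automated clients to identify themselves in the User-Agent.
    headers = {"User-Agent": "research-project contact@example.com"}
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    index = download_form_index(2018, 4)
    # Keep only the 10-K and 10-Q lines ("10-Xs").
    filings = [line for line in index.splitlines()
               if line.startswith(("10-K ", "10-Q "))]
    print(f"{len(filings)} 10-X filings listed in 2018 QTR4")
```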

Conversion

Once the raw data has been scraped from the SEC's website, we are left with a collection of HTML files. These need to be parsed and filtered to remove all the HTML tags, the embedded pictures and ZIP archives, and the financial data. Beautiful Soup is used for that purpose, yielding 125 GB of filtered data.
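
A minimal sketch of this filtering step with Beautiful Soup, assuming the goal is to keep prose only (this is an illustration, not the project's exact parser):

```python
# Drop images, scripts, styles and financial tables, then flatten the markup.
from bs4 import BeautifulSoup

def filter_filing(raw_html: str) -> str:
    """Strip a raw 10-X filing down to its prose."""
    soup = BeautifulSoup(raw_html, "html.parser")
    # Remove elements that carry no narrative text.
    for tag in soup(["img", "script", "style", "table"]):
        tag.decompose()
    # get_text() collapses the remaining markup into plain text.
    return soup.get_text(separator=" ", strip=True)
```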

Transformation

The filtered data is processed using Spark. This involves parsing the reports, finding the relevant sections in the unstructured text with multiple layers of regex, and comparing them using various similarity metrics.
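
As one illustrative metric (the project computes several), here is a sketch of Jaccard similarity between the word sets of two report sections:

```python
import re

def jaccard_similarity(text_a: str, text_b: str) -> float:
    tokens_a = set(re.findall(r"[a-z]+", text_a.lower()))
    tokens_b = set(re.findall(r"[a-z]+", text_b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sections are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# A score close to 1 means the section barely changed between filings.
print(jaccard_similarity("no material changes occurred",
                         "no material changes occurred this quarter"))
```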

Two main challenges arose:

  1. Even after many iterations aimed at improving the parser, only 80-85% of the reports are successfully parsed. Among them, less than 2% of the reports are false positives.
  2. Depending on the metric, the RAM requirements vary from limited to high, sometimes even leading to swapping, which slows the compute node to a crawl. The instance type was upgraded to m5.2xlarge and a function was added to check the amount of available RAM before running these resource-intensive metrics (see the sketch after this list).
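
A sketch of the RAM guard described in point 2, assuming psutil is available; the threshold and function name here are illustrative:

```python
import psutil

def enough_ram(required_gb: float) -> bool:
    """Return True if enough memory is free to run a resource-intensive
    metric without pushing the node into swap."""
    available_gb = psutil.virtual_memory().available / 1024 ** 3
    return available_gb >= required_gb

if enough_ram(8.0):
    pass  # safe to run the memory-hungry metric
else:
    pass  # fall back to a cheaper metric or skip this report
```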

Database

The results are stored in PostgreSQL. The user interacts with the system via a Dash/Flask-powered web app, available at:

sec-scraper.club (copy-paste into your browser if not redirected)

WARNING: I generated the SSL keys myself, and nothing was certified by a third party, so your browser might flag the connection as suspicious. I simply did not pay for a certificate from a commercial authority.
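
A hypothetical sketch of storing one similarity score with psycopg2; the table name, columns and credentials are illustrative, not the project's actual schema:

```python
import psycopg2

conn = psycopg2.connect(dbname="secscraper", user="postgres",
                        password="postgres", host="localhost")
with conn, conn.cursor() as cur:  # "with conn" wraps a transaction
    cur.execute("""
        CREATE TABLE IF NOT EXISTS similarity_scores (
            ticker  TEXT,
            quarter TEXT,
            metric  TEXT,
            score   DOUBLE PRECISION
        )
    """)
    cur.execute(
        "INSERT INTO similarity_scores (ticker, quarter, metric, score) "
        "VALUES (%s, %s, %s, %s)",
        ("AAPL", "2018Q4", "jaccard", 0.97),
    )
conn.close()
```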

Visualization

The user interacts with the system via a web app. Two views are possible:

  1. The company view
  2. The portfolio view

Company view

The company view shows the results of the metrics for a given ticker. Ideally, a positive correlation can be observed between quarters with high similarity scores and a rising stock price. Likewise, a negative correlation would be observed between low similarity scores and a falling stock price.

[Figure: company view]

However, the metrics seem to change a lot in the above picture. Why? Because there are two ways to compare reports:

  1. Quarterly comparisons: each filing is compared to the previous quarter's filing, so twice a year a 10-K is compared to a 10-Q (that is what happens in the picture above).
  2. Yearly comparisons: each filing is compared to the filing from the same quarter of the previous year, so a 10-K is always compared to a 10-K, and a 10-Q to a 10-Q.
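
A small sketch of the two schemes, assuming filings are ordered chronologically, one per quarter: pair each filing with the one 1 quarter back (quarterly) or 4 quarters back (yearly). The helper name and labels are illustrative:

```python
def comparison_pairs(filings, yearly=False):
    """filings: chronologically ordered, one filing per quarter."""
    lag = 4 if yearly else 1
    return [(filings[i - lag], filings[i]) for i in range(lag, len(filings))]

filings = ["10-Q 2017Q1", "10-Q 2017Q2", "10-Q 2017Q3", "10-K 2017Q4",
           "10-Q 2018Q1"]
print(comparison_pairs(filings))
# quarterly: includes ('10-Q 2017Q3', '10-K 2017Q4') -> a 10-K/10-Q comparison
print(comparison_pairs(filings, yearly=True))
# yearly: ('10-Q 2017Q1', '10-Q 2018Q1') -> always like-for-like
```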

Portfolio view

In the portfolio view, the companies have been grouped by quintiles of similarity scores: Q1 is the portfolio of companies with the lowest similarity scores for a given metric, and Q5 the portfolio of those with the highest. Stocks are held in the portfolio for one quarter; then all of them are sold, a transaction tax is applied, and the new portfolio is purchased.

Ideally, the portfolios are layered, Q5 > Q4 > Q3 > Q2 > Q1, which is the case most of the time. Even better, we would like to see Q5 beat the main market indices.
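
A sketch of the quintile grouping with pandas; the column names and data are illustrative:

```python
import pandas as pd

scores = pd.DataFrame({
    "ticker": list("ABCDEFGHIJ"),
    "score":  [0.99, 0.45, 0.80, 0.10, 0.65, 0.92, 0.30, 0.75, 0.55, 0.20],
})
# Q1 = lowest similarity scores, Q5 = highest, as described above.
scores["portfolio"] = pd.qcut(scores["score"], 5,
                              labels=["Q1", "Q2", "Q3", "Q4", "Q5"])
print(scores.sort_values("score"))
```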

[Figure: portfolio view]

Notes

What is EDGAR (from the SEC's website)?

EDGAR is the Electronic Data Gathering, Analysis, and Retrieval system used at the U.S. Securities and Exchange Commission (SEC). EDGAR is the primary system for submissions by companies and others who are required by law to file information with the SEC.

Containing millions of company and individual filings, EDGAR benefits investors, corporations, and the U.S. economy overall by increasing the efficiency, transparency, and fairness of the securities markets. The system processes about 3,000 filings per day, serves up 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average.
