Library for Insight project

License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- POSIX :: Linux
Programming Language
- Python :: 3

Project description

Lazy prices: An NLP study of the changes in 10-K/10-Q

Motivation
Pipeline
Dataset
Conversion
Transformation
Database
Visualization
Notes

Motivation

In this project, I built a platform that parses the periodic filings of public companies (10-Ks and 10-Qs, which I will refer to as 10-Xs) and measures the similarity between successive filings with various NLP metrics.

Furthermore, I formulate the “No news, good news” hypothesis: companies that do not edit their filings from quarter to quarter outperform the market, while those that do see their stock value go down, because they were legally required to disclose bad news.

Useful links

Web app/UI: sec-scraper.club (the security warning is due to the self-signed SSL certificate used for the HTTPS connection)

Documentation: Read The Docs!

Tests: Tests

Loosely based on: Cohen, Lauren and Malloy, Christopher J. and Nguyen, Quoc, Lazy Prices (March 7, 2019). 2019 Academic Research Colloquium for Financial Planning and Related Disciplines, https://ssrn.com/abstract=1658471

Pipeline

The pipeline can be described by the following picture:

pipeline

Dataset

A few datasets are used in this project:

Dataset	Number of txt files	Total size (GB)
SEC scraping	~1,000,000	2,500
Filtered scraping	~1,000,000	125
Stock prices	1	1
Lookup CIK/Tickers	1	0
Market index	4	0

The available data spans 1997 Q1 - 2018 Q4.

Conversion

Once the raw data has been scraped from the SEC's website, we are left with a collection of HTML files. These need to be parsed and filtered to remove the HTML tags, the embedded pictures and zip archives, and the financial data. Beautiful Soup is used for that purpose, yielding 125 GB of filtered data.
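
To illustrate the kind of filtering described above, here is a minimal sketch. It is not the project's code: it swaps Beautiful Soup for the standard library's html.parser so that it runs without dependencies, and the names TextExtractor, SKIPPED_TAGS and html_to_text are hypothetical. Treating every table as financial data is also an assumption made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and drop the content of selected tags."""

    # Hypothetical choice: tables stand in for the "financial data" to remove.
    SKIPPED_TAGS = {"script", "style", "table"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not nested inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(raw_html):
    """Return the text of an HTML document without tags, images or tables."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser.text()
```

Images carry no text, so they disappear with the tags; only the skipped containers need explicit handling.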

Transformation

The filtered data is processed using Spark. This involves parsing the reports, locating the sections in the unstructured text with multiple layers of regular expressions, and comparing them using various metrics.
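
The sketch below shows, under stated assumptions, what one layer of section detection and two similarity metrics could look like. It runs on plain strings rather than on Spark. The heading pattern ITEM_HEADING, the helper names, and the choice of Jaccard and cosine similarity are illustrative assumptions; the list of metrics actually used is not given here.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical heading pattern: "Item 1A.", "ITEM 7:", ... at the start of a line.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d{1,2}[a-z]?)\s*[.:\-]", re.IGNORECASE | re.MULTILINE
)


def split_sections(text):
    """Map each item label (e.g. '1A') to the text that follows its heading."""
    matches = list(ITEM_HEADING.finditer(text))
    sections = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(text)
        label = current.group(1).upper()
        body = text[current.end():end].strip()
        # Headings repeated in a table of contents yield short bodies:
        # keep the longest candidate for each label.
        if len(body) > len(sections.get(label, "")):
            sections[label] = body
    return sections


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard_similarity(a, b):
    """Shared vocabulary divided by total vocabulary."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    """Cosine of the angle between the two word-count vectors."""
    count_a, count_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(count_a[word] * count_b[word] for word in count_a)
    norm_a = sqrt(sum(v * v for v in count_a.values()))
    norm_b = sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Both metrics return a value between 0 and 1, with higher values meaning that two sections are closer to each other.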

Two main challenges arose:

Even after many iterations aimed at improving the parser, only 80-85% of the reports are successfully parsed. Among them, fewer than 2% are false positives.
Depending on the metric, the RAM requirements vary from limited to high, in some cases causing swapping, which slows the compute node to a crawl. The instance type was upgraded to m5.2xlarge, and a function was added to check the amount of available RAM before running these resource-intensive metrics.
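
A RAM check like the one mentioned in the second point could look like the sketch below. It assumes a Linux node, where the kernel exposes MemAvailable in /proc/meminfo; the function names and the threshold parameter are hypothetical and are not taken from the library.

```python
def available_ram_gb(meminfo_text):
    """Return the MemAvailable value of /proc/meminfo content, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / 1024 ** 2
    raise ValueError("MemAvailable not found in meminfo content")


def can_run_metric(required_gb, meminfo_path="/proc/meminfo"):
    """True if the node currently has at least required_gb of available RAM."""
    with open(meminfo_path) as handle:
        return available_ram_gb(handle.read()) >= required_gb
```

A caller would test can_run_metric(...) before launching a memory-hungry metric and skip or postpone it when the check fails.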

Database

The results are stored in PostgreSQL. The user interacts with the system via a Dash/Flask powered web app, available at:

sec-scraper.club (copy and paste into your browser if you are not redirected)

WARNING: I generated the SSL certificate myself and it was not certified by a third party, so your browser might flag the connection as suspicious. I simply did not pay for a professional service here.

Visualization

The user interacts with the system via a web app. Two views are possible:

The company view
The portfolio view

Company view

The company view shows the result of the metrics for a given ticker. Ideally, quarters with high similarity scores coincide with a rising stock price. Likewise, quarters with low similarity scores would coincide with a falling stock price.

company_view

However, the metrics change a lot in the picture above. Why? Because there are two ways to compare reports:

Quarterly comparisons: each report is compared with the previous quarter's report, so twice a year a 10-K is compared with a 10-Q (this is the mode shown in the picture above).
Yearly comparisons: a 10-K is always compared with a 10-K, and a 10-Q with a 10-Q.
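
The two comparison modes can be sketched as a pairing rule between reporting periods. This is only an illustration: it assumes that each company files a 10-Q for Q1 to Q3 and a 10-K for Q4, and the function names are hypothetical.

```python
def previous_period(year, quarter, mode):
    """Return the (year, quarter) that a report is compared against."""
    if mode == "yearly":
        return year - 1, quarter
    if mode == "quarterly":
        return (year - 1, 4) if quarter == 1 else (year, quarter - 1)
    raise ValueError(f"unknown mode: {mode}")


def comparison_pairs(filings, mode):
    """filings maps (year, quarter) to a form type, e.g. {(2018, 4): '10-K'}.

    Returns a list of (reference_period, current_period) tuples, skipping
    reports whose reference period is missing.
    """
    pairs = []
    for period in sorted(filings):
        reference = previous_period(*period, mode)
        if reference in filings:
            pairs.append((reference, period))
    return pairs
```

With this rule, the quarterly mode mixes form types around every 10-K (Q3 to Q4, then Q4 to Q1), while the yearly mode only ever pairs reports of the same type.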

Portfolio view

In the portfolio view, the companies are grouped into quintiles of similarity scores. Q1 is the portfolio of companies with the lowest similarity scores for a given metric, and Q5 the portfolio of companies with the highest similarity scores. Stocks are kept in the portfolio for one quarter; then all of them are sold, a tax is applied, and the new portfolio is purchased.

Ideally, the portfolios are layered: Q5>Q4>Q3>Q2>Q1, which is the case most of the time. Even better, we would like to see Q5 beat the main market indices.

portfolio_view
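
The quintile construction and the quarterly rebalancing can be sketched as follows. The equal weighting of stocks and the flat tax rate applied to the sale proceeds are assumptions made for this example, and the function names are hypothetical; the way the tax is actually computed is not specified above.

```python
def assign_quintiles(scores):
    """Group tickers into Q1 (lowest scores) to Q5 (highest scores).

    scores maps a ticker to its similarity score for one metric.
    """
    ranked = sorted(scores, key=scores.get)
    quintiles = {f"Q{i}": [] for i in range(1, 6)}
    for rank, ticker in enumerate(ranked):
        index = rank * 5 // len(ranked) + 1
        quintiles[f"Q{index}"].append(ticker)
    return quintiles


def rebalance(value, quarterly_returns, tax_rate):
    """Value of a portfolio after one quarter, once everything is sold.

    Assumes equal weights across stocks and a flat tax on the proceeds.
    """
    if not quarterly_returns:
        raise ValueError("the portfolio holds no stock")
    mean_return = sum(quarterly_returns) / len(quarterly_returns)
    return value * (1 + mean_return) * (1 - tax_rate)
```

Chaining rebalance(...) over successive quarters, with the quintiles recomputed each time, gives one value curve per portfolio that can be compared with the market indices.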

Notes

What is EDGAR (from the SEC's website)?

EDGAR is the Electronic Data Gathering, Analysis, and Retrieval system used at the U.S. Securities and Exchange Commission (SEC). EDGAR is the primary system for submissions by companies and others who are required by law to file information with the SEC.

Containing millions of company and individual filings, EDGAR benefits investors, corporations, and the U.S. economy overall by increasing the efficiency, transparency, and fairness of the securities markets. The system processes about 3,000 filings per day, serves up 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average.

Release history

Version	Release date
0.0.60	Oct 18, 2019
0.0.59	Oct 18, 2019
0.0.58	Oct 18, 2019
0.0.57	Oct 17, 2019
0.0.56	Oct 17, 2019
0.0.55	Oct 17, 2019
0.0.54	Oct 17, 2019
0.0.53	Oct 17, 2019
0.0.52	Oct 15, 2019
0.0.51	Oct 15, 2019
0.0.50	Oct 15, 2019
0.0.49	Oct 15, 2019
0.0.48	Oct 15, 2019
0.0.47	Oct 15, 2019
0.0.46	Oct 15, 2019
0.0.45	Oct 15, 2019
0.0.44	Oct 11, 2019
0.0.43	Oct 11, 2019
0.0.42	Oct 11, 2019
0.0.41	Oct 11, 2019
0.0.40	Oct 11, 2019
0.0.39	Oct 10, 2019
0.0.38	Oct 10, 2019
0.0.36	Oct 10, 2019
0.0.35	Oct 10, 2019
0.0.34	Oct 10, 2019
0.0.33	Oct 10, 2019
0.0.32	Oct 10, 2019
0.0.31	Oct 10, 2019
0.0.30	Oct 7, 2019
0.0.29	Oct 3, 2019
0.0.27	Oct 2, 2019
0.0.26	Sep 30, 2019
0.0.25	Sep 30, 2019
0.0.24	Sep 30, 2019
0.0.23	Sep 30, 2019
0.0.22	Sep 30, 2019

Download files

Source Distribution

secScraper-0.0.60.tar.gz (49.0 kB)

Uploaded Oct 18, 2019 Source

Built Distribution

secScraper-0.0.60-py3-none-any.whl (68.6 kB)

Uploaded Oct 18, 2019 Python 3

File details

Details for the file secScraper-0.0.60.tar.gz.

File metadata

Download URL: secScraper-0.0.60.tar.gz
Upload date: Oct 18, 2019
Size: 49.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/2.7.15+

File hashes

Hashes for secScraper-0.0.60.tar.gz
Algorithm	Hash digest
SHA256	`1a315b57a4aae578548c91266f9545da7bb8db8a27148e0704a214b650b2c573`
MD5	`8861c135ddfcd2f4f7dba82a48b2b750`
BLAKE2b-256	`b4e4d667b1f00352173d4a042b8fd04841a7d224396d57dc54cbf28bf73fd1ae`

File details

Details for the file secScraper-0.0.60-py3-none-any.whl.

File metadata

Download URL: secScraper-0.0.60-py3-none-any.whl
Upload date: Oct 18, 2019
Size: 68.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.9.1 tqdm/4.32.2 CPython/2.7.15+

File hashes

Hashes for secScraper-0.0.60-py3-none-any.whl
Algorithm	Hash digest
SHA256	`333b1b8827c80d116d5bd0d2f882b95cb9a28485d9687c845a3fd146bd5dc1ef`
MD5	`059c672ec76042bc3cecabff035b4738`
BLAKE2b-256	`c473e8edd60ba96c5b5d1d0ec12c2222d73f1779fc39f3a6a07f2d3b54ca3e87`
