CORD-19 tools and utilities
COVID-19 Data Tools
Tools for making COVID-19 data slightly easier for everyone!
Installation
pip install cord-19-tools
The Paperset class
This is a class for lazily loading papers from the CORD-19 dataset. Here are the instructions for use:
- Download a dataset in tar.gz form from the Download Here section, or by using the download bash script in this repository (which also completes the extraction step for you)
- Extract it into a directory of your choice (functionality for leaving the tarballs unpacked/online may be added later; this is version 0.0.1), for example:
tar -xvzf comm_use_subset.tar.gz
- Load it into Python!
import cotools
from pprint import pprint
# no `/` at the end please!
data = cotools.Paperset("data/comm_use_subset")
# indexes with ints
pprint(data[0])
# and slices!
pprint(data[:2])
print(len(data))
# takes about 5gb in memory
alldata = [x[0] for x in data]
Let's talk for a bit about how it works, and why it doesn't take a gigantic amount of memory. The files are not actually loaded into Python until the data is indexed. Upon indexing, only the files at those indexes are read into Python, resulting in a list of dictionaries.
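The lazy-loading idea described above can be sketched in a few lines. This is a minimal illustration, not the actual cotools implementation: the class name `LazyPapers`, the helper `demo`, and the assumption that each paper is a single `.json` file in one directory are all hypothetical. Unlike the behavior described above, this sketch returns a single dictionary for an integer index and a list of dictionaries for a slice.

```python
import json
import os
import tempfile


class LazyPapers:
    """Hypothetical stand-in for illustration; not cotools' actual code."""

    def __init__(self, directory):
        self.directory = directory
        # Only the file names are held in memory, not the file contents.
        self.filenames = sorted(
            name for name in os.listdir(directory) if name.endswith(".json")
        )

    def __len__(self):
        # Counting papers needs no file reads at all.
        return len(self.filenames)

    def _load(self, name):
        # A file is parsed only when it is actually requested.
        path = os.path.join(self.directory, name)
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice reads just the files that fall inside it.
            return [self._load(name) for name in self.filenames[index]]
        return self._load(self.filenames[index])


def demo():
    """Write three tiny fake papers to a temp directory and index them."""
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(3):
            path = os.path.join(tmp, f"paper_{i}.json")
            with open(path, "w", encoding="utf-8") as handle:
                json.dump({"paper_id": i}, handle)
        papers = LazyPapers(tmp)
        return len(papers), papers[0], papers[:2]
```

Because only file names are stored, memory use stays small until something like a full-range slice asks for every paper at once.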