A corpus reader for the CORD-19 dataset, compatible with NLTK
Table of Contents
- CORD-19 Corpus Reader
- About
- Dataset Format
- Examples
- Creating a Corpus Reader
- Creating a Corpus Reader to Read Only Bodies of Papers
- Creating a Corpus Reader Preferring PMC Parses of Papers
- Getting a List of Documents in the Corpus
- Getting All the Words in the Corpus
- Getting All the Words From Specific Documents in the Corpus
- Getting Metadata for Specific Documents in the Corpus
- Display Statistics About the Corpus
- Plotting 50 Most Common Words from 10000 Documents
- Plotting 50 Most Popular Days to Publish
- Tasks
- Questions
- Issues
CORD-19 Corpus Reader
About
The CORD19CorpusReader class is a corpus reader for the CORD-19 corpus
(https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge).
The idea is to be able to load the CORD-19 corpus into NLTK. This project
was originally created for Natural Language Processing (NLP) coursework
at the University of Missouri by Jason James, Alex Morehead, and others.
Documents in the CORD-19 corpus are stored as JSON files, so the parts
of each paper are stored separately. When setting up the corpus reader,
you can specify which parts of the papers to access (titles, abstracts,
and/or bodies) with the flags include_titles, include_abstracts, and
include_bodies. For example, in some situations you might only care
about the body of the paper, which contains most of the text; in others,
you might want just the abstracts. By default, titles, abstracts, and
bodies are all included.
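As a rough illustration of what these flags select, the sketch below assembles text from a single parsed CORD-19 JSON document. The function name extract_text is hypothetical and is not part of CORD19CorpusReader; it assumes the published CORD-19 JSON layout (metadata.title, abstract, body_text) and is not the reader's actual implementation.

```python
def extract_text(paper, include_titles=True, include_abstracts=True, include_bodies=True):
    """Join the selected parts of one parsed CORD-19 JSON document (hypothetical helper)."""
    parts = []
    if include_titles:
        title = paper.get("metadata", {}).get("title", "")
        if title:
            parts.append(title)
    if include_abstracts:
        # Some parses may have no "abstract" key, so default to an empty list.
        parts.extend(entry["text"] for entry in paper.get("abstract", []))
    if include_bodies:
        parts.extend(entry["text"] for entry in paper.get("body_text", []))
    return "\n".join(parts)


# A tiny stand-in document, not real corpus data.
example_paper = {
    "metadata": {"title": "An Example Title"},
    "abstract": [{"text": "An example abstract."}],
    "body_text": [{"text": "First body entry."}, {"text": "Second body entry."}],
}
body_only = extract_text(example_paper, include_titles=False, include_abstracts=False)
```

With the stand-in document, body_only holds only the two body entries, which mirrors what a reader created with include_titles and include_abstracts set to False is meant to expose.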
For some papers, there are both PDF parses and PMC parses. To indicate
which is preferred, the corpus reader has a prefer_pdf_parses parameter
and a prefer_pmc_parses parameter. When both parses are available for a
paper, setting prefer_pdf_parses to True tells the reader to choose the
PDF parse, and setting prefer_pmc_parses to True tells it to choose the
PMC parse. If both are set to True, the reader uses all the JSON files,
which is actually a bit faster since it does not have to perform
deduplication. If both are set to False (probably not something you
would want in practice), the reader excludes both parses of a paper when
there is a duplicate. By default, prefer_pdf_parses is set to True and
prefer_pmc_parses is set to False.
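The four flag combinations can be summarized for a single paper as follows. This is only a sketch of the described behavior: select_parses is a hypothetical helper, not a method of CORD19CorpusReader, and the both-False case follows one reading of the description above (drop both parses of a duplicated paper).

```python
def select_parses(pdf_fileid, pmc_fileid, prefer_pdf_parses=True, prefer_pmc_parses=False):
    """Return the file IDs to keep for one paper (hypothetical helper)."""
    available = [fileid for fileid in (pdf_fileid, pmc_fileid) if fileid is not None]
    if len(available) < 2:
        # Only one parse (or none) exists, so there is nothing to deduplicate.
        return available
    if prefer_pdf_parses and prefer_pmc_parses:
        # Keep everything; no deduplication is performed.
        return available
    if prefer_pdf_parses:
        return [pdf_fileid]
    if prefer_pmc_parses:
        return [pmc_fileid]
    # Neither parse is preferred: assume the duplicated paper is excluded entirely.
    return []
```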
Currently, raw(), words(), sents(), and paras() are implemented and work
as would be expected in NLTK. Right now, these functions just return the
text body of the papers, but the intent is to be able to specify in the
reader which parts of the paper to return (title, abstract, body, and
bibliography). Be warned that paras() currently treats each section of a
paper as a paragraph, which may or may not be an acceptable treatment of
paragraphs. If we really need to work at the paragraph level, we may
want to revisit the implementation details of paras().
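To make the paragraph question concrete, the sketch below contrasts two ways of cutting the body_text entries of a CORD-19 JSON document: one paragraph per section versus one paragraph per entry. Both function names are hypothetical, and merging consecutive entries by their "section" field is only one possible reading of "a section as a paragraph", not necessarily what paras() does internally.

```python
from itertools import groupby


def section_paragraphs(body_text):
    """Merge consecutive body_text entries that share a section name."""
    return [
        " ".join(entry["text"] for entry in entries)
        for _, entries in groupby(body_text, key=lambda entry: entry.get("section", ""))
    ]


def entry_paragraphs(body_text):
    """Treat every body_text entry as its own paragraph."""
    return [entry["text"] for entry in body_text]


# Stand-in data, not taken from the corpus.
example_body = [
    {"text": "A.", "section": "Introduction"},
    {"text": "B.", "section": "Introduction"},
    {"text": "C.", "section": "Methods"},
]
by_section = section_paragraphs(example_body)
by_entry = entry_paragraphs(example_body)
```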
There are a few options for inputs to raw(), words(), sents(), and
paras(): a file ID as a string performs the operation on just the
corresponding document, a list of file IDs performs the operation on
just the corresponding documents, and no arguments performs the
operation on all the documents in the corpus.
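The three input forms can be normalized to a single list before any reading happens. The helper below is a minimal sketch of that convention; resolve_fileids is a hypothetical name and is not taken from the reader's source.

```python
def resolve_fileids(fileids, all_fileids):
    """Normalize None, a single string, or a list into a list of file IDs (hypothetical helper)."""
    if fileids is None:
        # No argument: operate on every document in the corpus.
        return list(all_fileids)
    if isinstance(fileids, str):
        # A single file ID: wrap it so callers can always loop over a list.
        return [fileids]
    return list(fileids)
```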
The corpus is very large, so operations can take a while to complete
when performed on the entire corpus. For example, it took about 30
minutes to store the output from words() to a list and print the length
of the list. Converting the list to a set and printing the length of the
set took about another 10 minutes.
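One way to avoid holding the full word list in memory is to count tokens and collect the vocabulary in a single pass. The sketch below works on any iterable of strings; count_tokens_and_vocabulary is a hypothetical helper, and whether it is actually faster on the full corpus has not been measured here.

```python
def count_tokens_and_vocabulary(words):
    """Return (number of tokens, number of distinct tokens) in one pass."""
    total = 0
    vocabulary = set()
    for word in words:
        total += 1
        vocabulary.add(word)
    return total, len(vocabulary)


# A small stand-in iterator; with the reader this would be reader.words().
sample_words = iter(["the", "virus", "the", "host"])
total, unique = count_tokens_and_vocabulary(sample_words)
```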
Metadata for papers is stored in a file called metadata.csv in the root
directory of the corpus. Each row of the file is the metadata for a
particular paper. So far, there is an initial implementation of
metadata() that can be used to retrieve the metadata for papers. As with
the methods above, either a single file ID as a string or a list of file
IDs can be passed to metadata() to retrieve the metadata for those
documents, and passing no arguments returns the metadata for all the
documents in the corpus. There is also a parameter fileids_only that
specifies whether to return metadata only for papers actually in the
corpus (True) or all the metadata available, even for papers whose
parses aren't in the corpus (False). The default value for fileids_only
is True. When fileids_only is True, the metadata is returned as a
dictionary keyed by fileid, where each value is a list whose entries are
dictionaries representing rows from metadata.csv. When fileids_only is
False, the metadata is returned as a dictionary keyed by cord_uid, where
each value is again a list of dictionaries representing rows from
metadata.csv. The reason for the discrepancy is that you might want to
correlate metadata with its corresponding paper, in which case indexing
into the metadata with the paper's fileid seems intuitive. On the other
hand, for papers that have no corresponding fileid, there are not many
options for what to index with, so the cord_uid is used.
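The "dictionary of lists of row dictionaries" shape can be reproduced with the standard csv module. The sketch below groups rows by a key column; group_rows and the sample rows are made up for illustration and are not the reader's implementation or real metadata.

```python
import csv
import io
from collections import defaultdict


def group_rows(csv_file, key_column="cord_uid"):
    """Group CSV rows into {key: [row_dict, ...]} (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row[key_column]].append(row)
    return dict(grouped)


# Invented rows standing in for metadata.csv; note the repeated key.
sample_csv = io.StringIO(
    "cord_uid,title,publish_time\n"
    "abc123,First Title,2020-03-01\n"
    "abc123,First Title (fuller row),2020-03-01\n"
    "xyz789,Second Title,2020-04-15\n"
)
metadata_by_uid = group_rows(sample_csv)
```

Each value is a list because more than one row can share the same key, so a plain one-row-per-key dictionary would silently drop rows.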
test_coord19.py contains some rudimentary tests that use methods in the
CORD19CorpusReader class to display their output.
Dataset Format
The dataset is updated pretty regularly on Kaggle. So, the exact number of papers can change on a daily basis.
Inside the root directory of the dataset is a metadata file called
metadata.csv. Each row represents a particular paper and has a cord_uid
that serves as a sort of key for that paper. Note that cord_uid values
are not unique in metadata.csv: multiple rows might have the same
cord_uid but contain differing data (it usually looks like one of the
entries is more complete than the other). If a paper has a PMC parse,
there will be an entry for its filename in the pmc_json_files column.
Similarly, if the paper has a PDF parse, there will be an entry for its
filename in the pdf_json_files column. However, sometimes multiple files
are listed in the pdf_json_files column: apparently, one corresponds to
the main article and the others contain related matter. Such entries are
represented as a list separated by "; " (semicolon and space).
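A multi-file entry in the pdf_json_files column can be split back into individual filenames as sketched below. split_json_files is a hypothetical helper and the filenames in the usage line are invented; splitting on the bare semicolon and stripping whitespace is a defensive choice in case the space is ever missing.

```python
def split_json_files(column_value):
    """Split a '; '-separated column value into a list of filenames (hypothetical helper)."""
    return [name.strip() for name in column_value.split(";") if name.strip()]


# Invented filenames, shown only to illustrate the separator.
files = split_json_files("document_parses/pdf_json/aaa.json; document_parses/pdf_json/bbb.json")
```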
Inside the directory for the dataset, there is a directory called
document_parses. Inside document_parses, there are two folders called
pdf_json and pmc_json. The files in these folders are the actual data
from the papers in JSON format. Some papers have only a PDF parse, some
have only a PMC parse, some have both, and some have neither.
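Since the number of papers changes between dataset versions, a quick count of the two parse folders can be a useful sanity check. The sketch below uses only pathlib; count_parse_files is a hypothetical helper (the reader's own statistics() method is the documented way to get counts), and the demonstration runs on a throwaway directory that mimics the layout rather than on the real corpus.

```python
import tempfile
from pathlib import Path


def count_parse_files(root):
    """Count .json files in document_parses/pdf_json and document_parses/pmc_json."""
    parses = Path(root) / "document_parses"
    return {
        folder: sum(1 for _ in (parses / folder).glob("*.json"))
        for folder in ("pdf_json", "pmc_json")
    }


# Build a temporary directory with the same layout and invented file names.
with tempfile.TemporaryDirectory() as tmp:
    layout = {"pdf_json": ["a.json", "b.json"], "pmc_json": ["PMC1.xml.json"]}
    for folder, names in layout.items():
        target = Path(tmp) / "document_parses" / folder
        target.mkdir(parents=True)
        for name in names:
            (target / name).write_text("{}")
    counts = count_parse_files(tmp)
```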
Examples
Creating a Corpus Reader
# Import the CORD19CorpusReader class.
from cord19 import CORD19CorpusReader
# Directory of the CORD-19 corpus.
# Assumes the repository directory is a sibling of (at the same level as) the corpus directory.
# Also assumes this code is run from the same directory as cord19.py.
root = '../../../../551982-1490480-bundle-archive/'
# Make a corpus reader on everything that is a .json file in the directory.
# The raw string keeps the backslash in the regular expression from being read as a string escape.
reader = CORD19CorpusReader(root, r'.*\.json')
Creating a Corpus Reader to Read Only Bodies of Papers
# Assume CORD19CorpusReader has been imported and root has been specified.
# By default include_titles, include_abstracts, and include_bodies are all True.
reader = CORD19CorpusReader(root, r'.*\.json', include_titles=False, include_abstracts=False, include_bodies=True)
Creating a Corpus Reader Preferring PMC Parses of Papers
# Assume CORD19CorpusReader has been imported and root has been specified.
# When there's both a PDF parse and a PMC parse available for a paper, choose the PMC parse.
reader = CORD19CorpusReader(root, r'.*\.json', prefer_pmc_parses=True, prefer_pdf_parses=False)
Getting a List of Documents in the Corpus
document_list = reader.fileids()
Getting All the Words in the Corpus
# Note that this could take a while since the corpus is quite large.
word_list = reader.words()
Getting All the Words From Specific Documents in the Corpus
# Make a list of documents of interest.
document_list = ['document_parses/pdf_json/0000028b5cc154f68b8a269f6578f21e31f62977.json',
'document_parses/pmc_json/PMC7480786.xml.json']
# Retrieves words from only the specified documents.
word_list = reader.words(document_list)
Getting Metadata for Specific Documents in the Corpus
# Make a list of documents of interest.
document_list = ['document_parses/pdf_json/0000028b5cc154f68b8a269f6578f21e31f62977.json',
'document_parses/pmc_json/PMC7480786.xml.json']
# Retrieves metadata from metadata.csv for only the specified documents.
metadata_dictionary = reader.metadata(document_list)
Display Statistics About the Corpus
# Displays information about rows in metadata.csv and counts of document parse folders.
reader.statistics()
Plotting 50 Most Common Words from 10000 Documents
# Assume reader has been created as in the examples above.
import nltk

# Grab a list of stopwords (requires the NLTK stopwords data to be installed).
stopword_list = nltk.corpus.stopwords.words('english')
# Make a list of punctuation and uninteresting items that might show up in a paper.
ignore_list = ['.', ',', '!', '?', '[', ']', '(', ')', '`', '"', '\'', ';', ':', '%', '-', '+', '=', '_', '),', ').', '],', '/', '\\', '.,', 'et', 'al']
# Combine the lists into a set so that membership checks stay fast.
ignore_set = set(ignore_list + stopword_list)
# Grab 10k documents.
document_list = reader.fileids()[0:10000]
# Make a word list of words in the documents, but not in the ignore set.
word_list = [word.lower() for word in reader.words(document_list) if word.lower() not in ignore_set]
# Create a frequency distribution of the words.
frequency_distribution = nltk.FreqDist(word_list)
# Plot the distribution of the 50 most common words.
frequency_distribution.plot(50)
Plotting 50 Most Popular Days to Publish
# Assume reader has been created as in the examples above.
import nltk

# Grab the metadata.
metadata = reader.metadata()
# Make an empty list of publish times.
publish_times = []
# Go through each item in the metadata.
for (key, entry_list) in metadata.items():
    # Go through each entry in the list.
    for entry in entry_list:
        # Add the publish time to the list.
        publish_times.append(entry['publish_time'])
# Make a frequency distribution on the list of publish times.
frequency_distribution = nltk.FreqDist(publish_times)
# Plot the 50 most popular days to publish on.
frequency_distribution.plot(50)
Tasks
To Do
- Add functions to retrieve specific pieces of metadata
  - Implement journals() to pull journals metadata
  - Implement authors() to pull authors metadata
  - Implement publish_times() to pull publish times metadata
  - Implement countries() to pull countries metadata (from authors metadata)
  - Implement institutions() to pull institutions metadata (from authors metadata)
- Add ability to specify which parts of paper to read (title, abstract, body, bibliography)
  - Add include_bibliographies flag to indicate whether to include bibliographies at the end of papers
    - Is this needed and what parts of the bibliographies to include?
In Progress
Done
- Do initial implementation of CORD19CorpusReader class :jason:
  - Implement raw() :jason:
  - Implement words() :jason:
  - Implement sents() :jason:
  - Implement paras() :jason:
- Add ability to specify which parts of paper to read (title, abstract, and body) :jason:
  - Add include_titles flag to indicate whether to include paper titles :jason:
  - Add include_abstracts flag to indicate whether to include paper abstracts :jason:
  - Add include_bodies flag to indicate whether to include the actual text bodies of papers :jason:
- Implement metadata() to pull entire metadata blocks from metadata.csv :jason:
- Implement statistics() to display some simple statistics about the corpus :jason:
- Add preferences for deduplication of papers (i.e., don't include both PDF parse and PMC parse of the same paper) :jason:
Questions
- Why do some papers have both PDF and PMC parses?
- Are PDF and PMC parses of the same paper the same?
- If a JSON file has an abstract, authors, etc., is that metadata also in metadata.csv and vice versa?
- Is the overlapping metadata (e.g., abstracts) in the JSON files and metadata.csv the same?
Boy, I wish I knew. We would need to compare each pair point by point to be sure.
I suspect it really just reflects the origin of the papers: scooped from open access journals or from PubMed's free text links. I'm surprised there aren't any notes about methodology in the CORD-19 data set itself.
Issues
- Operations can take a long time because of the size of the corpus
- Metadata availability seems to vary between files
- Corpus is updated frequently, so different versions may give different results
- Some papers seem to appear twice, as both a PDF parse and a PMC parse