
ETL processes for medical and scientific papers

paperetl is an ETL library for processing medical and scientific papers.

(Architecture diagram)

paperetl supports the following sources:

  • File formats:
    • PDF
    • XML (arXiv, PubMed, TEI)
    • CSV
  • COVID-19 Research Dataset (CORD-19)

paperetl supports the following output options for storing articles:

  • SQLite
  • Elasticsearch
  • JSON files
  • YAML files

Installation

The easiest way to install is via pip and PyPI:

pip install paperetl

Python 3.7+ is supported. Using a Python virtual environment is recommended.

paperetl can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/paperetl

Additional dependencies

PDF parsing relies on a running GROBID instance. It is assumed that GROBID is running locally on the ETL server. This dependency is only necessary for PDF files.

Docker

A Dockerfile that installs paperetl along with all dependencies and scripts is available in this repository.

wget https://raw.githubusercontent.com/neuml/paperetl/master/docker/Dockerfile
docker build -t paperetl -f Dockerfile .
docker run --name paperetl --rm -it paperetl

This will bring up a paperetl command shell. Standard Docker commands can be used to copy files into the container, or commands can be run directly in the shell to retrieve input content.

Examples

Notebooks

| Notebook | Description |
| -------- | ----------- |
| Introducing paperetl | Overview of the functionality provided by paperetl |

Load Articles into SQLite

The following example shows how to use paperetl to load a set of medical/scientific articles into a SQLite database.

  1. Download the desired medical/scientific articles into a local directory. For this example, it is assumed the articles are in a directory named paperetl/data.

  2. Build the database

    python -m paperetl.file paperetl/data paperetl/models
    

Once complete, there will be an articles.sqlite file in paperetl/models
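Once loaded, the database can be queried with standard SQLite tools. The sketch below runs against an in-memory database with an assumed minimal schema (an `articles` table with `id`, `title` and `published` columns; the actual paperetl schema may differ):

```python
import sqlite3

# In-memory stand-in for articles.sqlite; the table and column names
# here are assumptions, not the exact paperetl schema
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE articles (id TEXT, title TEXT, published TEXT)")
connection.execute(
    "INSERT INTO articles VALUES (?, ?, ?)",
    ("a1", "Example medical study", "2021-01-01"),
)

# The same query pattern applies to paperetl/models/articles.sqlite
rows = connection.execute(
    "SELECT id, title FROM articles WHERE published >= '2020-01-01'"
).fetchall()
print(rows)
```

To query the real database, replace `:memory:` with the path `paperetl/models/articles.sqlite`.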

Load into Elasticsearch

Elasticsearch is also a supported datastore, as shown below. This example assumes Elasticsearch is running locally; change the URL to point to a remote server as appropriate.

python -m paperetl.file paperetl/data http://localhost:9200

Once complete, there will be an articles index in Elasticsearch with the metadata and full text stored.
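Once indexed, the articles can be searched with a standard Elasticsearch match query. The sketch below only constructs the request body; the `title` field name is an assumption about the index mapping, and actually sending the request (POST to http://localhost:9200/articles/_search) requires a running instance:

```python
import json

# Full-text match query; "title" is an assumed field in the articles index
query = {"query": {"match": {"title": "covid"}}, "size": 10}

# Serialized request body for POST http://localhost:9200/articles/_search
body = json.dumps(query)
print(body)
```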

Convert articles to JSON/YAML

paperetl can also be used to convert articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.

JSON:

python -m paperetl.file paperetl/data json://paperetl/json

YAML:

python -m paperetl.file paperetl/data yaml://paperetl/yaml

Converted files will be stored in paperetl/(json|yaml)
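A converted JSON file can then be inspected or loaded by a downstream system. The sketch below writes and reads back a sample article file; the field names used are illustrative assumptions, not the exact paperetl output schema:

```python
import json
import tempfile
from pathlib import Path

# Sample article record; these field names are illustrative assumptions,
# not the exact paperetl output schema
article = {
    "id": "a1",
    "title": "Example medical study",
    "sections": [{"name": "ABSTRACT", "text": "Sample abstract text"}],
}

with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "a1.json"
    path.write_text(json.dumps(article, indent=2))

    # Read the file back, as a downstream system would
    loaded = json.loads(path.read_text())
    print(loaded["title"])
```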

Load CORD-19

Note: The final version of CORD-19 was released on 2022-06-22, but it remains a large, valuable collection of medical documents.

The following example shows how to use paperetl to load the CORD-19 dataset into a SQLite database.

  1. Download and extract the dataset from the Allen Institute for AI CORD-19 Release Page.

    scripts/getcord19.sh cord19/data
    

    The script above retrieves and unpacks the latest copy of CORD-19 into a directory named cord19/data. An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01) which defaults to the latest date.

  2. Generate entry-dates.csv for current version of the dataset

    python -m paperetl.cord19.entry cord19/data
    

    An optional second argument sets a specific date of the dataset in the format YYYY-MM-DD (ex. 2021-01-01), which defaults to the latest date. This should match the date used in Step 1.

  3. Build database

    python -m paperetl.cord19 cord19/data cord19/models
    

    Once complete, there will be an articles.sqlite file in cord19/models. As with earlier examples, the data can also be loaded into Elasticsearch.

    python -m paperetl.cord19 cord19/data http://localhost:9200
    
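The optional date argument used in steps 1 and 2 must be a valid YYYY-MM-DD string. A quick sketch for checking it before kicking off a long-running build (`validate` is a hypothetical helper, not part of paperetl):

```python
from datetime import datetime

# Hypothetical helper: raises ValueError unless date is a valid YYYY-MM-DD string
def validate(date):
    return datetime.strptime(date, "%Y-%m-%d").date().isoformat()

print(validate("2021-01-01"))
```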

