ETL processes for medical and scientific papers

paperetl is an ETL library for processing medical and scientific papers.

paperetl supports the following sources:

paperetl supports the following datastores for parsed articles:

  • SQLite
  • JSON files
  • YAML files

Additional optional datastores are also available:

  • Elasticsearch

Installation

The easiest way to install paperetl is via pip and PyPI:

pip install paperetl

Python 3.10+ is supported. Using a Python virtual environment is recommended.

paperetl can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/paperetl

Additional dependencies

PDF parsing requires a running GROBID instance, which is assumed to be running locally on the ETL server. GROBID is only needed when processing PDF files.

Note: In some cases, the GROBID engine pool can be exhausted, resulting in a 503 error. This can be fixed by increasing concurrency and/or poolMaxWait in the GROBID configuration file.
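Before starting a PDF load, it can help to confirm that GROBID is reachable. The sketch below is not part of paperetl. It assumes GROBID listens on its default port 8070 and exposes the /api/isalive health endpoint; grobid_alive and its opener parameter are hypothetical names introduced for illustration.

```python
import urllib.error
import urllib.request


def grobid_alive(base_url="http://localhost:8070", opener=urllib.request.urlopen, timeout=5):
    """Return True if GROBID's /api/isalive endpoint answers with HTTP 200.

    base_url is an assumption (GROBID's default local address); change it
    if the instance runs elsewhere. opener is injectable so the check can
    be exercised without a live server.
    """
    try:
        with opener(f"{base_url}/api/isalive", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status such as 503
        return False
```

Calling grobid_alive() with no arguments checks the assumed local instance. A False result means either that GROBID is not running or that it answered with an error status.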

Docker

A Dockerfile with commands to install paperetl, all dependencies and scripts is available in this repository.

wget https://raw.githubusercontent.com/neuml/paperetl/master/docker/Dockerfile
docker build -t paperetl -f Dockerfile .
docker run --name paperetl --rm -it paperetl

This will bring up a paperetl command shell. Standard Docker commands can be used to copy files over or commands can be run directly in the shell to retrieve input content.

Examples

Notebooks

  • Introducing paperetl: Overview of the functionality provided by paperetl

Load Articles into SQLite

The following example shows how to use paperetl to load a set of medical/scientific articles into a SQLite database.

  1. Download the desired medical/scientific articles into a local directory. For this example, it is assumed the articles are in a directory named paperetl/data.

  2. Build the database

    python -m paperetl.file paperetl/data paperetl/models
    

Once complete, there will be an articles.sqlite file in paperetl/models
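To sanity-check the result, the database can be opened with Python's built-in sqlite3 module. The sketch below makes no assumption about the schema paperetl creates: it only lists whatever tables exist, along with their row counts. table_counts is a hypothetical helper name.

```python
import sqlite3


def table_counts(path="paperetl/models/articles.sqlite"):
    """Return a dict mapping each table name in the SQLite file to its row count."""
    connection = sqlite3.connect(path)
    try:
        # sqlite_master is SQLite's built-in catalog of schema objects
        names = [row[0] for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        return {name: connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
                for name in names}
    finally:
        connection.close()
```

Note that sqlite3.connect creates an empty file if the path does not exist, so an empty result may simply mean the path is wrong.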

Load into Elasticsearch

Elasticsearch is supported as an optional datastore. Support is installed via the elasticsearch extra.

pip install paperetl[elasticsearch]

This example assumes Elasticsearch is running locally. Change the URL to point to a remote server as appropriate.

python -m paperetl.file paperetl/data http://localhost:9200

Once complete, there will be an articles index in Elasticsearch with the metadata and full text stored.
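One way to confirm the load is to ask Elasticsearch how many documents the articles index holds, using its standard _count API. This is a minimal sketch that assumes an unsecured local instance at http://localhost:9200; a cluster with authentication or TLS enabled needs credentials that are not shown here. fetch_json and count_articles are hypothetical helper names.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """GET a URL and parse the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def count_articles(base_url="http://localhost:9200", index="articles", fetch=fetch_json):
    """Return the document count reported by Elasticsearch's _count API.

    base_url is an assumption matching the local example above; fetch is
    injectable so the function can be exercised without a live cluster.
    """
    return fetch(f"{base_url.rstrip('/')}/{index}/_count")["count"]
```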

Convert articles to JSON/YAML

paperetl can also be used to convert articles into JSON or YAML files. This is useful if the data is to be fed into another system or for manual inspection/debugging of a single file.

JSON:

python -m paperetl.file paperetl/data json://paperetl/json

YAML:

python -m paperetl.file paperetl/data yaml://paperetl/yaml

Converted files will be stored in paperetl/(json|yaml)
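For manual inspection, the converted JSON files can be read back with the standard library. The sketch below assumes the output files carry a .json extension and makes no assumption about the fields inside each document. load_converted is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_converted(directory="paperetl/json"):
    """Return a dict mapping each *.json file name in the directory to its parsed content."""
    return {path.name: json.loads(path.read_text(encoding="utf-8"))
            for path in sorted(Path(directory).glob("*.json"))}
```

Printing sorted(load_converted()) lists the converted file names. An empty dict means no .json files were found in the directory.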
