Pull all articles from PubMed and insert them into a Neo4j Graph Database

PubMedSucker (PMS)

Load the MEDLINE/PubMed bulk download package into a Neo4j database.

PMS is a Python 3 tool that downloads the MEDLINE/PubMed bulk data (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/) from the US National Library of Medicine, transforms it into a reasonable shape, and loads a subset of the PubMed article attributes into a Neo4j graph database.

License: MIT

Maintainer: Datamgmt Team of the German Center of Diabetes Research / Deutsches Zentrum für Diabetesforschung e.V. | Tim Bleimehl

Notable external Python modules used by PMS:

  • graphio - A tool to conveniently load sets of data into neo4j
  • py2neo - A high level python Neo4j driver/framework
  • xmltodict - Convert xml into Python dicts
  • neobulkmp - Load tons of data in an organized manner with multiple processes into Neo4j
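
To give a feeling for the "transform" step, here is a minimal sketch that turns PubMed-style XML into plain Python dicts. It is not PMS's actual code: PMS uses xmltodict, while this sketch swaps in the standard library's `xml.etree.ElementTree` so it runs without extra packages. The sample document, the function name `extract_articles` and the choice of fields are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written placeholder in the PubMed XML layout (not real PubMed data).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">1</PMID>
      <Article>
        <ArticleTitle>First placeholder title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">2</PMID>
      <Article>
        <ArticleTitle>Second placeholder title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_articles(xml_text):
    """Return one {'pmid', 'title'} dict per PubmedArticle element."""
    root = ET.fromstring(xml_text)
    articles = []
    for article in root.iter("PubmedArticle"):
        articles.append(
            {
                "pmid": article.findtext("MedlineCitation/PMID"),
                "title": article.findtext("MedlineCitation/Article/ArticleTitle"),
            }
        )
    return articles


if __name__ == "__main__":
    for record in extract_articles(SAMPLE_XML):
        print(record)
```

Dicts like these are the kind of intermediate structure that could then be handed to a loader such as graphio; which attributes PMS really keeps is defined by its data model, not by this sketch.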

What can I do with the PMS graph?

Whatever you can imagine :) You could simply calculate some statistics on authors, topics or keywords. A more advanced example: you could use the Neo4j Graph Data Science Library for community detection on entities in the graph.

At the DZD, we use this graph as the basis for a biomedical knowledge graph. We connect it with other data sources and process the data with NLP libraries. Later we want to use this, for example, to generate new hypotheses for our scientists.

Setup

Hardware Requirements

The requirements vary widely, depending on how much of the MEDLINE/PubMed data you want to load into the graph and how quickly.

MAX

For a full import (articles from the 1960s-1970s until today, baseline + annual update) in reasonable time, you need two fully equipped servers.

Neo4j Server:

  • 256 GB RAM
  • at least 12 cores, preferably more
  • 1 x SSD with ~128 GB
  • 1 x SSD with ~512 GB

Parser/Importer

  • 12 GB RAM
  • at least 12 cores, preferably more (should roughly match the core count of the Neo4j server)
  • about 200 GB of disk space

With this setup the full import should complete in under 24 hours. You can always cut back on the requirements; PMS will still run, but the import duration could increase drastically.

MIN

A small sample import of 2-3 XML files from the baseline should finish in under 10 minutes on your laptop.

DB / Neo4j setup

Disclaimer: I am not claiming to be an expert on setting up a performant Neo4j instance. These are just some things I read, learned or noticed along the way. Some of them could be cargo cult or just plain wrong. If you have suggestions on how to improve this document, I would be happy :) Contact me or create an issue.

You can use a plain Neo4j instance without any plugins.

There are many guides available on how to install a Neo4j instance.

We recommend using docker to reduce deployment pain :)

Demo setup via docker

For a small sample import, a tiny Neo4j instance will do:

docker run \
    --publish=7474:7474 --publish=7687:7687 \
    -e NEO4J_AUTH=none \
    --name neo4jtests \
    -v $PWD/data:/data \
    neo4j:4.3

Production(-ish) setup via docker-compose

A basic docker-compose configuration for a large import server instance could look like this:

docker-compose.yml

version: '3'
services:
  neo4j:
    image: neo4j:4.1
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_AUTH=neo4j/mysupersavepassword
      - NEO4J_dbms_memory_pagecache_size=200GB
      - NEO4J_dbms_memory_heap_max__size=30G
      - NEO4J_dbms_memory_heap_initial__size=30G
      - NEO4J_dbms_logs_query_enabled=off
      - NEO4J_dbms_default__listen__address=0.0.0.0
    volumes:
      - ./plugins:/plugins
      - ./data:/data
      - ./conf:/conf
      - ./logs:/logs

Start this with

docker-compose up -d

The dbms.memory.pagecache.size / NEO4J_dbms_memory_pagecache_size parameter is very important. Set it as high as your available memory allows.

More information on memory configuration for Neo4j: https://neo4j.com/docs/operations-manual/current/performance/memory-configuration/
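
As a rough illustration of how such a page cache value can be derived, the sketch below subtracts the JVM heap and a reserve for the operating system from the total RAM. The function name and the 26 GB reserve are assumptions picked so that the numbers line up with the example configuration (256 GB RAM, 30 GB heap, 200 GB page cache); treat it as a back-of-the-envelope helper, not as an official sizing rule.

```python
def suggest_pagecache_gb(total_ram_gb, heap_gb, os_reserve_gb):
    """Return the RAM (in GB) left for the page cache after heap and OS reserve."""
    pagecache_gb = total_ram_gb - heap_gb - os_reserve_gb
    if pagecache_gb <= 0:
        raise ValueError("heap and OS reserve leave no memory for the page cache")
    return pagecache_gb


if __name__ == "__main__":
    # 256 GB server, 30 GB heap, 26 GB kept free for the OS (assumed value).
    size = suggest_pagecache_gb(total_ram_gb=256, heap_gb=30, os_reserve_gb=26)
    print(f"NEO4J_dbms_memory_pagecache_size={size}GB")
```

On a smaller machine the same arithmetic applies; only the inputs change.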

To improve performance (and therefore import time) even further, separate disks for the Neo4j database and the transaction logs can be a reasonable way to go: https://neo4j.com/docs/operations-manual/current/performance/linux-file-system-tuning/

Some more hints on improving performance: https://neo4j.com/docs/operations-manual/current/performance/

Install PMS

Via Docker

Just pull the required images from the registries:

docker pull registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:prod

docker pull redis

Via Git

Requirements

  • Python3
  • pip
  • A running redis DB

Steps

  • Clone the repo

git clone ssh://git@git.connect.dzd-ev.de:22022/dzdtools/pubmedsucker.git

  • cd into the repo

cd pubmedsucker

  • Install the required python modules

pip3 install -r reqs.txt

Start PMS

Via docker

A small sample import.

First start the Redis database in the background:

docker run --network=host --rm --name redis -d redis

Then start PMS itself

docker run --rm \
    -v ${PWD}/data:/data \
    -v ${PWD}/log:/log \
    -e CONFIGS_NEO4J="{'host':'$HOSTNAME', 'user':'neo4j', 'password':'mysuperpw'}" \
    -e CONFIGS_PUBMED_SOURCE="https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed21n0001.xml.gz" \
    --network=host \
    registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:stable

Via docker-compose

A larger import (~ last 10 years)

version: '3'
services:
  redis:
    image: redis
    container_name: redis
    ports:
      - 6379:6379
    command:
      - redis-server
      - --save ""
      - --appendonly no
  pms_baseline:
    image: registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:prod
    environment:
      - CONFIGS_NEO4J="{'host':'myNeo4jHost','port':'7687', 'user':'neo4j','password':'supersecret','name':'MyDBInstance'}"
      - CONFIGS_REDIS="{'host':'redis'}"
      - CONFIGS_PUBMED_SOURCE=350
      - CONFIGS_BASE_LINE_MODE=True
    volumes:
      - ./data:/data
      - ./log:/log
      - ./dump:/dump

Then we need to run PMS a second time with CONFIGS_BASE_LINE_MODE set to False, to import the updates for the current year:

version: '3'
services:
  redis:
    image: redis
    container_name: redis
    ports:
      - 6379:6379
    command:
      - redis-server
      - --save ""
      - --appendonly no
  pms_updates:
    image: registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:prod
    environment:
      - CONFIGS_NEO4J="{'host':'myNeo4jHost','port':'7687', 'user':'neo4j','password':'supersecret','name':'MyDBInstance'}"
      - CONFIGS_REDIS="{'host':'redis'}"
      - CONFIGS_BASE_LINE_MODE=False
    volumes:
      - ./data:/data
      - ./log:/log
      - ./dump:/dump

Config parameters

Config for PMS is located in the file pms/config.py

All config parameters can be set or overridden via environment variables, but then the prefix CONFIGS_ is required. E.g. the parameter PUBMED_SOURCE, when set via an environment variable, must be named CONFIGS_PUBMED_SOURCE.
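
The prefix rule can be pictured with the following minimal sketch. It only illustrates the naming convention; how PMS actually reads and converts the values is not shown in this README, so the helper names `overrides_from_env` and `apply_overrides` and the "values stay raw strings" behaviour are assumptions.

```python
import os

ENV_PREFIX = "CONFIGS_"


def overrides_from_env(environ, prefix=ENV_PREFIX):
    """Map every prefixed variable to its parameter name (prefix stripped)."""
    return {
        key[len(prefix):]: value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


def apply_overrides(defaults, environ):
    """Return a copy of defaults with prefixed environment values applied."""
    merged = dict(defaults)
    merged.update(overrides_from_env(environ))
    return merged


if __name__ == "__main__":
    defaults = {"PUBMED_SOURCE": None, "BASE_LINE_MODE": True}
    # os.environ behaves like a dict of strings, so it can be passed directly.
    print(apply_overrides(defaults, os.environ))
```

Note that environment values always arrive as strings, so a real implementation still has to convert them to the intended type.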


PUBMED_SOURCE

  • Parameter to define which xmls from MEDLINE/PubMed should be parsed

default: "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline-2019-sample/pubmedsample.xml"

  • None - Download and process all pubmed xml files from the source ftp
    • example: CONFIGS_PUBMED_SOURCE=None
  • int - Download and process the most recent n xml files from the pubmed server
    • example: CONFIGS_PUBMED_SOURCE=5
  • str of remote file path - Download and process a single file
    • example: CONFIGS_PUBMED_SOURCE=https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1009.xml.gz
  • str of local dir path - Process all files in directory
    • example: CONFIGS_PUBMED_SOURCE=/home/files/
  • str of local file path - Process a single file
    • example: CONFIGS_PUBMED_SOURCE=/home/files/pubmed20n1008.xml
  • list of local files paths - Process the xml files in the list
    • example: CONFIGS_PUBMED_SOURCE=["/home/files/pubmed20n1009.xml","/home/files/pubmed20n1008.xml"]
  • list of remote files - ftp,http urls to be downloaded and processed
    • example: CONFIGS_PUBMED_SOURCE=["https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1009.xml.gz","https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1008.xml.gz"]
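
The accepted value shapes listed above can be told apart with a few type checks. The sketch below is an illustration of that dispatch, not PMS's own parser: `parse_source` and `describe_source` are hypothetical names, and using `ast.literal_eval` to decode the environment string is an assumption.

```python
import ast
import os
from urllib.parse import urlparse


def parse_source(raw):
    """Decode the raw environment string into None, int, list or plain str."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal, e.g. a bare URL or file path: keep the string.
        return raw


def describe_source(value):
    """Say which of the documented PUBMED_SOURCE variants a value represents."""
    if value is None:
        return "all files"
    if isinstance(value, int) and not isinstance(value, bool):
        return f"most recent {value} files"
    if isinstance(value, (list, tuple)):
        return f"list of {len(value)} files"
    if isinstance(value, str):
        if urlparse(value).scheme in ("http", "https", "ftp"):
            return "single remote file"
        if value.endswith("/") or os.path.isdir(value):
            return "local directory"
        return "single local file"
    raise TypeError(f"unsupported PUBMED_SOURCE value: {value!r}")


if __name__ == "__main__":
    for raw in ("None", "5", "/home/files/", '["/a.xml", "/b.xml"]'):
        print(raw, "->", describe_source(parse_source(raw)))
```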

BASE_LINE_MODE

  • Defines whether the baseline or the annual update is processed. When set to True, the baseline is downloaded, parsed and loaded into Neo4j. When set to False, the annual update XMLs are downloaded, parsed and loaded into Neo4j.

default: True
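
One pitfall when setting this flag via an environment variable: the value arrives as the string "False", which is truthy in Python unless it is parsed explicitly. The sketch below shows one way to handle that and to pick a download directory from the flag. The helper names are hypothetical, and the assumption that updates come from NLM's updatefiles directory is not confirmed by this README.

```python
BASELINE_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"
# Assumption: update files are served from NLM's "updatefiles" directory.
UPDATES_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/"


def parse_bool(raw):
    """Convert 'True'/'False'-style strings (or real bools) to a bool."""
    text = str(raw).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise ValueError(f"cannot interpret {raw!r} as a boolean")


def source_directory(base_line_mode):
    """Return the remote directory that matches the selected mode."""
    return BASELINE_URL if parse_bool(base_line_mode) else UPDATES_URL


if __name__ == "__main__":
    print(source_directory("True"))
    print(source_directory("False"))
```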

Datamodel

Changes in datamodel:

0.9.22 -> 1.2.13

  • PublicationType and PublicationTypeUI are no longer attributes of :PubMedArticle but standalone nodes related to :PubMedArticle via PUBMEDARTICLE_HAS_PUBLICATIONTYPE
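
With publication types modelled as nodes, a simple statistic such as "articles per publication type" becomes a one-hop query. The following sketch uses py2neo; the connection details are placeholders, the function names are hypothetical, and the pattern deliberately leaves the relationship direction and the publication type node's label and properties open, since they are not spelled out here.

```python
def publication_type_query():
    """Build a Cypher query counting articles per related publication type node."""
    return (
        "MATCH (a:PubMedArticle)-[:PUBMEDARTICLE_HAS_PUBLICATIONTYPE]-(t) "
        "RETURN t AS publication_type, count(a) AS articles "
        "ORDER BY articles DESC LIMIT $limit"
    )


def top_publication_types(uri, user, password, limit=10):
    """Run the query against a Neo4j instance and return a list of dicts."""
    # Imported lazily so the query builder can be used without py2neo installed.
    from py2neo import Graph

    graph = Graph(uri, auth=(user, password))
    return graph.run(publication_type_query(), limit=limit).data()


if __name__ == "__main__":
    # Placeholder credentials; adjust to your own instance.
    for row in top_publication_types("bolt://localhost:7687", "neo4j", "mysuperpw"):
        print(row)
```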
