Pull all articles from PubMed and insert them into a Neo4j Graph Database

PubMedSucker (PMS)

Load the MEDLINE/PubMed "bulk download package" into a Neo4j database.

PMS is a Python 3 program that downloads the MEDLINE/PubMed bulk data (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/) from the US National Library of Medicine, transforms it, and loads a subset of the PubMed article attributes into a Neo4j graph database.

License: MIT

Maintainer: Datamgmt Team of the German Center of Diabetes Research / Deutsches Zentrum für Diabetesforschung e.V. | Tim Bleimehl

Notable external Python modules used by PMS:

  • graphio - A tool to conveniently load sets of data into neo4j
  • py2neo - A high level python Neo4j driver/framework
  • xmltodict - Convert xml into Python dicts
  • neobulkmp - Load tons of data in an organized manner with multiple processes into Neo4j


What can I do with the PMS graph?

Whatever you can imagine :) You could simply calculate some statistics on authors, topics or keywords. A more advanced example: you could use the Neo4j Graph Data Science Library for community detection on entities in the graph, to find groups of gene names that often appear together in publications.

At the DZD, we use this graph as the base for a biomedical knowledge graph. We connect it with other data sources and process the data with NLP libraries. In this way we later want to, for example, generate new hypotheses for our scientists.

Quickstart

You need to have git and docker installed.

Clone the PMS code repository to your computer

git clone ssh://git@git.connect.dzd-ev.de:22022/dzdtools/pubmedsucker.git
cd pubmedsucker

Build the image that will run PMS

docker-compose build

Start PMS with Neo4j and Redis

docker-compose up

This will load the latest 10 XML files from the PubMed baseline and takes around 20 minutes on a decent laptop.

You can visit http://localhost:7474 to inspect the result in the Neo4j graph database.

Setup

Hardware Requirements

The requirements vary widely, depending on how much of the MEDLINE/PubMed data you want to load into the graph and how quickly.

MAX

For a full import (articles from the 1960s-70s until today, baseline + annual update) in a reasonable time, you need two full-blown servers.

Neo4j Server:

  • 256GB RAM
  • at least 12 cores, more is better
  • 1 x SSD with ~128GB
  • 1 x SSD with ~512GB

Parser/Importer

  • 12GB RAM
  • at least 12 cores, more is better (should roughly match the core count of the Neo4j server)
  • about 200GB of disk space

The full import should complete in under 24 hours. You can always cut back on these requirements; PMS will still run, but the import may take drastically longer.

MIN

A small sample import of 2-3 XML files from the baseline should be done in under 10 minutes on your laptop.

DB / Neo4j setup

Disclaimer: I am not claiming to be an expert on setting up a performant Neo4j instance. These are just some things I read, learned or noticed along the way. Some of them could be cargo cult or just plain wrong. If you have suggestions on how to improve this document, I would be happy :) Contact me or create an issue.

You can use a plain Neo4j instance without any plugins.

There are plenty of guides out there on how to install a Neo4j instance.

We recommend using docker to reduce deployment pain :)

Demo setup via docker

For a small sample import, a tiny Neo4j instance will do:

docker run \
    --publish=7474:7474 --publish=7687:7687 \
    -e NEO4J_AUTH=none \
    --name neo4jtests \
    -v $PWD/data:/data \
    neo4j:4.3

Productive(-ish) setup via docker-compose

A basic docker-compose configuration for a large import server could look like this:

docker-compose.yml

version: '3'
services:
  neo4j:
    image: neo4j:4.1
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_AUTH=neo4j/mysupersavepassword
      - NEO4J_dbms_memory_pagecache_size=200GB
      - NEO4J_dbms_memory_heap_max__size=30G
      - NEO4J_dbms_memory_heap_initial__size=30G
      - NEO4J_dbms_logs_query_enabled=off
      - NEO4J_dbms_default__listen__address=0.0.0.0
    volumes:
      - ./plugins:/plugins
      - ./data:/data
      - ./conf:/conf
      - ./logs:/logs

Start this with

docker-compose up -d

The dbms.memory.pagecache.size / NEO4J_dbms_memory_pagecache_size parameter is very important. Set it as high as your available memory allows.

More information on memory configuration for Neo4j: https://neo4j.com/docs/operations-manual/current/performance/memory-configuration/
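
As a rough illustration of the sizing arithmetic, here is a minimal Python sketch. It assumes the page cache gets whatever RAM is left after the JVM heap and a reserve for the operating system. The helper name pagecache_gb and the 26 GB reserve are assumptions made for this example, not values taken from PMS or Neo4j.

```python
def pagecache_gb(total_ram_gb: int, heap_gb: int, os_reserve_gb: int) -> int:
    """Return the RAM (in GB) left over for the Neo4j page cache."""
    remaining = total_ram_gb - heap_gb - os_reserve_gb
    if remaining <= 0:
        raise ValueError("heap and OS reserve leave no RAM for the page cache")
    return remaining


# 256 GB server and 30 GB heap as in the setup described in this README;
# the 26 GB reserve for the OS and other processes is an assumed value.
size = pagecache_gb(total_ram_gb=256, heap_gb=30, os_reserve_gb=26)
print(f"NEO4J_dbms_memory_pagecache_size={size}G")
```

On a smaller machine, plug in your own numbers and keep enough headroom for everything else running on the host.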

To improve performance (and therefore import time) even more, separate disks for the Neo4j database and the transaction logs could be a reasonable way to go. https://neo4j.com/docs/operations-manual/current/performance/linux-file-system-tuning/

Some more hints on improving performance: https://neo4j.com/docs/operations-manual/current/performance/

Install PMS

Via Docker

Just pull the needed images from the registries:

docker pull registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:prod

docker pull redis

Via Git

Requirements

  • Python3
  • pip
  • A running redis DB

Steps

  • Clone the repo

git clone ssh://git@git.connect.dzd-ev.de:22022/dzdtools/pubmedsucker.git

  • cd into the repo

cd pubmedsucker

  • Install the required python modules

pip3 install -r reqs.txt

Start PMS

Via docker

A small sample example.

First start the Redis database in the background

docker run --network=host --rm --name redis -d redis

Then start PMS itself

docker run --rm \
    -v ${PWD}/data:/data \
    -v ${PWD}/log:/log \
    -e CONFIGS_NEO4J="{'host':'$HOSTNAME', 'user':'neo4j', 'password':'mysuperpw'}" \
    -e CONFIGS_PUBMED_SOURCE="https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed21n0001.xml.gz" \
    --network=host \
    registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:stable
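
If you want to import a handful of specific baseline files instead of a single one, a small Python helper can build the URLs for you. This is only a sketch: the file name pattern (pubmed, two-digit year, n, four-digit file number) is inferred from the example URLs in this README, and the helper names are made up for illustration. Check the FTP listing to see which file numbers actually exist for a given year.

```python
import json

BASELINE_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"


def baseline_file_url(year: int, number: int) -> str:
    """Build the URL of one baseline file, e.g. pubmed21n0001.xml.gz."""
    return f"{BASELINE_URL}pubmed{year % 100:02d}n{number:04d}.xml.gz"


def baseline_file_urls(year: int, first: int, last: int) -> list:
    """Build the URLs for the file numbers first..last (inclusive)."""
    return [baseline_file_url(year, n) for n in range(first, last + 1)]


# Print a list value that can be used for CONFIGS_PUBMED_SOURCE.
urls = baseline_file_urls(2021, 1, 3)
print("CONFIGS_PUBMED_SOURCE=" + json.dumps(urls))
```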

Via docker-compose

A larger import (~ last 10 years)

version: '3'
services:
  redis:
    image: redis
    container_name: redis
    ports:
      - 6379:6379
    command:
      - redis-server
      - --save ""
      - --appendonly no
  pms_baseline:
    image: registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:prod
    environment:
      - CONFIGS_NEO4J="{'host':'myNeo4jHost','port':'7687', 'user':'neo4j','password':'supersecret','name':'MyDBInstance'}"
      - CONFIGS_REDIS="{'host':'redis'}"
      - CONFIGS_PUBMED_SOURCE=350
      - CONFIGS_BASE_LINE_MODE=True
    volumes:
      - ./data:/data
      - ./log:/log
      - ./dump:/dump

Then we need a second run with CONFIGS_BASE_LINE_MODE set to False, to import the updates for the current year

version: '3'
services:
  redis:
    image: redis
    container_name: redis
    ports:
      - 6379:6379
    command:
      - redis-server
      - --save ""
      - --appendonly no
  pms_updates:
    image: registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:prod
    environment:
      - CONFIGS_NEO4J="{'host':'myNeo4jHost','port':'7687', 'user':'neo4j','password':'supersecret','name':'MyDBInstance'}"
      - CONFIGS_REDIS="{'host':'redis'}"
      - CONFIGS_BASE_LINE_MODE=False
    volumes:
      - ./data:/data
      - ./log:/log
      - ./dump:/dump

Config parameters

Config for PMS is located in the file pms/config.py

All config parameters can be set/overwritten via environment variables, but then the prefix CONFIGS_ is needed. E.g. the parameter PUBMED_SOURCE, set via environment variable, must be named CONFIGS_PUBMED_SOURCE.
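
To make the prefix rule more concrete, here is a minimal sketch of how such an override mechanism could work. It is not the actual PMS implementation. The function names are hypothetical, and parsing the values with ast.literal_eval is an assumption that happens to fit the Python-style values (None, True, 5, dicts, lists) used in this README.

```python
import ast
import os

PREFIX = "CONFIGS_"


def parse_value(raw: str):
    """Turn an environment string into a Python value where possible."""
    try:
        # Handles None, True/False, ints, lists and dicts written as literals.
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Anything else (URLs, paths, host names) is kept as a plain string.
        return raw


def overrides_from_env(environ=os.environ) -> dict:
    """Collect all CONFIGS_* variables, keyed by the bare parameter name."""
    return {
        key[len(PREFIX):]: parse_value(value)
        for key, value in environ.items()
        if key.startswith(PREFIX)
    }


example_env = {
    "CONFIGS_PUBMED_SOURCE": "5",
    "CONFIGS_BASE_LINE_MODE": "False",
    "CONFIGS_REDIS": "{'host':'redis'}",
    "HOME": "/root",  # ignored, no CONFIGS_ prefix
}
print(overrides_from_env(example_env))
```

Variables without the prefix are ignored, which keeps unrelated environment settings from leaking into the configuration.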


PUBMED_SOURCE

  • Parameter to define which xmls from MEDLINE/PubMed should be parsed

default: "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline-2019-sample/pubmedsample.xml"

  • None - Download and process all pubmed xml files from the source ftp
    • example: CONFIGS_PUBMED_SOURCE=None
  • int - Download and process the most recent n xml files from the pubmed server
    • example: CONFIGS_PUBMED_SOURCE=5
  • str of remote file path - Download and process a single file
    • example: CONFIGS_PUBMED_SOURCE=https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1009.xml.gz
  • str of local dir path - Process all files in directory
    • example: CONFIGS_PUBMED_SOURCE=/home/files/
  • str of local file path - Process a single file
    • example: CONFIGS_PUBMED_SOURCE=/home/files/pubmed20n1008.xml
  • list of local files paths - Process the xml files in the list
    • example: CONFIGS_PUBMED_SOURCE=["/home/files/pubmed20n1009.xml","/home/files/pubmed20n1008.xml"]
  • list of remote files - ftp,http urls to be downloaded and processed
    • example: CONFIGS_PUBMED_SOURCE=["https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1009.xml.gz","https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1008.xml.gz"]
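
The accepted value types listed above can be told apart with a few simple checks. The following sketch only illustrates that distinction and is not how PMS itself resolves the parameter. The function name describe_source and the returned labels are invented for this example.

```python
import os
from urllib.parse import urlparse


def describe_source(source) -> str:
    """Return which of the documented PUBMED_SOURCE forms a value matches."""
    if source is None:
        return "all files"
    if isinstance(source, bool):
        # bool is a subclass of int, so reject it before the int check.
        raise TypeError("bool is not a valid PUBMED_SOURCE")
    if isinstance(source, int):
        return f"latest {source} files"
    if isinstance(source, (list, tuple)):
        return f"list of {len(source)} sources"
    if isinstance(source, str):
        if urlparse(source).scheme in ("http", "https", "ftp"):
            return "remote file"
        if os.path.isdir(source):
            return "local directory"
        return "local file"
    raise TypeError(f"unsupported PUBMED_SOURCE type: {type(source).__name__}")


print(describe_source(None))
print(describe_source(5))
print(describe_source("https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1009.xml.gz"))
```

Note that a string is only treated as a directory if it exists on the local disk; every other non-URL string is assumed to be a file path.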

BASE_LINE_MODE

  • Defines whether the baseline or the annual update should be processed. When set to True, the baseline will be downloaded, parsed and loaded into Neo4j. When set to False, the annual update XMLs will be downloaded, parsed and loaded into Neo4j.

default: True

Datamodel

Changes in datamodel:

0.9.22 -> 1.2.13

  • PublicationType and PublicationTypeUI are no longer attributes of :PubMedArticle but standalone nodes related to :PubMedArticle via PUBMEDARTICLE_HAS_PUBLICATIONTYPE
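
As an example of querying the changed model, the sketch below counts articles per publication type with py2neo. It only relies on the :PubMedArticle label and the PUBMEDARTICLE_HAS_PUBLICATIONTYPE relationship named above. The relationship direction is left open in the pattern because it is not stated here, and the connection URL and credentials in main() are placeholders you need to adapt.

```python
# Count articles per publication-type node; $limit is a Cypher parameter.
QUERY = """
MATCH (a:PubMedArticle)-[:PUBMEDARTICLE_HAS_PUBLICATIONTYPE]-(t)
RETURN t AS publication_type, count(a) AS articles
ORDER BY articles DESC
LIMIT $limit
"""


def top_publication_types(graph, limit: int = 10) -> list:
    """Run the query on a py2neo Graph and return the rows as dicts."""
    return graph.run(QUERY, limit=limit).data()


def main():
    # Imported here so the helper above can be used without a database.
    from py2neo import Graph

    # Placeholder connection details, adjust to your Neo4j instance.
    graph = Graph("bolt://localhost:7687", auth=("neo4j", "mysuperpw"))
    for row in top_publication_types(graph):
        print(row["articles"], row["publication_type"])
```

Call main() once the import has finished and Neo4j is reachable.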
