Pull all articles from PubMed and insert them into a Neo4j Graph Database
PubMedSucker (PMS)
Load the MEDLINE/PubMed "bulk download package" into a Neo4j database.
PMS is a tool written in Python 3 that downloads the MEDLINE/PubMed bulk data (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/) from the US National Library of Medicine, transforms it into a reasonable shape, and loads a subset of the attributes of all PubMed articles into a Neo4j graph database.
License: MIT
Maintainer: Datamgmt Team of the German Center of Diabetes Research / Deutsches Zentrum für Diabetesforschung e.V. | Tim Bleimehl
Notable external Python modules that PMS uses:
- graphio - A tool to conveniently load sets of data into neo4j
- py2neo - A high level python Neo4j driver/framework
- xmltodict - Convert xml into Python dicts
- neobulkmp - Load tons of data in an organized manner with multiple processes into Neo4j
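To give a feeling for the "transform" step, here is a minimal sketch of turning PubMed XML into plain Python dicts. It is not the PMS parser: it uses the standard library's xml.etree.ElementTree instead of xmltodict so that it runs without extra packages, the sample record is made up and heavily shortened, and the function name extract_articles is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily shortened record in the shape of a PubMed XML file.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">12345</PMID>
      <Article>
        <ArticleTitle>An example title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_articles(xml_text):
    """Return one dict with PMID and title per PubmedArticle element."""
    root = ET.fromstring(xml_text)
    articles = []
    for article in root.iter("PubmedArticle"):
        articles.append(
            {
                "pmid": article.findtext("MedlineCitation/PMID"),
                "title": article.findtext("MedlineCitation/Article/ArticleTitle"),
            }
        )
    return articles


print(extract_articles(SAMPLE_XML))
# [{'pmid': '12345', 'title': 'An example title'}]
```

Flat dicts like these are a convenient intermediate form before nodes and relationships are created from them.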
What can I do with the PMS graph?
Whatever you can imagine :) You could simply calculate some statistics on authors, topics or keywords. A more advanced example: you could use the Neo4j Graph Data Science Library for community detection on entities in the graph to find groups of gene names that often appear together in publications.
At the DZD, we use this graph as the basis for a biomedical knowledge graph. We connect it with other data sources and process the data with NLP libraries. In the long run we want to use this, for example, to generate new hypotheses for our scientists.
Quickstart
You need to have git and docker installed.
Clone the PMS code repository to your computer
git clone ssh://git@git.connect.dzd-ev.de:22022/dzdtools/pubmedsucker.git
cd pubmedsucker
Build the image that will run PMS
docker-compose build
Start PMS with Neo4j and Redis
docker-compose up
This will load the latest 10 XML files from the PubMed baseline and will take around 20 minutes on a decent laptop.
You can visit http://localhost:7474 to inspect the result in the Neo4j graph database.
Setup
Hardware Requirements
The requirements vary widely, depending on how much of the MEDLINE/PubMed data you want to load into the graph and how quickly.
MAX
For a full import (articles from the 1960s-70s until today, baseline + annual update) in reasonable time, you need two full-blown servers.
Neo4j Server:
- 256GB RAM
- at least 12 cores, more is better
- 1 x SSD with ~128GB
- 1 x SSD with ~512GB
Parser/Importer
- 12GB RAM
- at least 12 cores, more is better (should roughly match the core count of the Neo4j server)
- about 200GB of disk space
The full import should complete in under 24 hours. You can always cut back on these requirements; PMS will still run, but the import duration could increase drastically.
MIN
A small sample import of 2-3 XML files from the baseline should finish in under 10 minutes on a laptop.
DB / Neo4j setup
Disclaimer: I am not claiming to be an expert on setting up a performant Neo4j instance. These are just some things I read, learned, or noticed along the way. Some of them could be cargo cult or just plain wrong. If you have suggestions on how to improve this document, I would be happy :) Contact me or create an issue.
You can use a plain Neo4j instance without any plugins.
There are a lot of manuals out there on how to install a Neo4j instance.
We recommend using docker to reduce deployment pain :)
Demosetup via docker
For a small sample import, a tiny Neo4j instance will do:
docker run \
--publish=7474:7474 --publish=7687:7687 \
-e NEO4J_AUTH=none \
--name neo4jtests \
-v $PWD/data:/data \
neo4j:4.3
Productive(-ish) setup via docker-compose
A basic docker-compose configuration for a large import server instance could look like this:
docker-compose.yml
version: '3'
services:
  neo4j:
    image: neo4j:4.1
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_AUTH=neo4j/mysupersavepassword
      - NEO4J_dbms_memory_pagecache_size=200GB
      - NEO4J_dbms_memory_heap_max__size=30G
      - NEO4J_dbms_memory_heap_initial__size=30G
      - NEO4J_dbms_logs_query_enabled=off
      - NEO4J_dbms_default__listen__address=0.0.0.0
    volumes:
      - ./plugins:/plugins
      - ./data:/data
      - ./conf:/conf
      - ./logs:/logs
Start this with
docker-compose up -d
The dbms.memory.pagecache.size / NEO4J_dbms_memory_pagecache_size parameter is very important. Set it as high as your available memory allows.
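As a rough back-of-the-envelope check for the page cache size, the following sketch subtracts the JVM heap and a reserve for the operating system from the total RAM. The function name and the size of the OS reserve are assumptions for illustration, not an official Neo4j sizing rule.

```python
def suggest_pagecache_gb(total_ram_gb, heap_gb, os_reserve_gb):
    """Page cache = RAM that is left after the JVM heap and an OS reserve."""
    pagecache = total_ram_gb - heap_gb - os_reserve_gb
    if pagecache <= 0:
        raise ValueError("heap and OS reserve already use up all memory")
    return pagecache


# 256 GB server, 30 GB heap as in the compose example above,
# and an assumed 26 GB left for the OS and other processes.
print(suggest_pagecache_gb(256, 30, 26))  # 200
```

With these numbers the result matches the 200GB page cache used in the example configuration.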
Some more information on memory configuration for Neo4j: https://neo4j.com/docs/operations-manual/current/performance/memory-configuration/
To improve performance (and therefore import time) even further, separate disks for the Neo4j database and the transaction logs can be a reasonable way to go. https://neo4j.com/docs/operations-manual/current/performance/linux-file-system-tuning/
Some more hints on improving performance: https://neo4j.com/docs/operations-manual/current/performance/
Install PMS
Via Docker
Just pull the required images from the registries:
docker pull registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:prod
docker pull redis
Via Git
Requirements
- Python3
- pip
- A running redis DB
Steps
- Clone the repo
git clone ssh://git@git.connect.dzd-ev.de:22022/dzdtools/pubmedsucker.git
- cd into the repo
cd pubmedsucker
- Install the required python modules
pip3 install -r reqs.txt
Start PMS
Via docker
A small sample import.
First start the Redis database in the background
docker run --network=host --rm --name redis -d redis
Then start PMS itself
docker run --rm \
-v ${PWD}/data:/data \
-v ${PWD}/log:/log \
-e CONFIGS_NEO4J="{'host':'$HOSTNAME', 'user':'neo4j', 'password':'mysuperpw'}" \
-e CONFIGS_PUBMED_SOURCE="https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed21n0001.xml.gz" \
--network=host \
registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:stable
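The value of CONFIGS_NEO4J in the command above looks like a Python dict literal. How PMS parses it internally is not documented here; the following is only a minimal sketch, assuming the string is evaluated with ast.literal_eval. The helper names parse_neo4j_config and bolt_uri are hypothetical.

```python
import ast


def parse_neo4j_config(raw):
    """Parse a dict literal such as "{'host':'localhost','user':'neo4j'}"."""
    config = ast.literal_eval(raw)
    if not isinstance(config, dict):
        raise ValueError("expected a dict literal, got %s" % type(config).__name__)
    return config


def bolt_uri(config):
    """Build a bolt URI; 7687 is Neo4j's default bolt port."""
    return "bolt://{}:{}".format(config["host"], config.get("port", "7687"))


raw = "{'host':'localhost', 'user':'neo4j', 'password':'mysuperpw'}"
config = parse_neo4j_config(raw)
print(bolt_uri(config))  # bolt://localhost:7687
```

ast.literal_eval only accepts literals (strings, numbers, dicts, lists and so on), so a malformed or malicious value raises an error instead of executing code.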
Via docker-compose
A larger import (~ last 10 years)
version: '3'
services:
  redis:
    image: redis
    container_name: redis
    ports:
      - 6379:6379
    command:
      - redis-server
      - --save ""
      - --appendonly no
  pms_baseline:
    image: registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:prod
    environment:
      - CONFIGS_NEO4J="{'host':'myNeo4jHost','port':'7687', 'user':'neo4j','password':'supersecret','name':'MyDBInstance'}"
      - CONFIGS_REDIS="{'host':'redis'}"
      - CONFIGS_PUBMED_SOURCE=350
      - CONFIGS_BASE_LINE_MODE=True
    volumes:
      - ./data:/data
      - ./log:/log
      - ./dump:/dump
Then we need a second run with CONFIGS_BASE_LINE_MODE set to False to import the updates for the current year:
version: '3'
services:
  redis:
    image: redis
    container_name: redis
    ports:
      - 6379:6379
    command:
      - redis-server
      - --save ""
      - --appendonly no
  pms_updates:
    image: registry-gl.connect.dzd-ev.de:443/dzdtools/pubmedsucker:prod
    environment:
      - CONFIGS_NEO4J="{'host':'myNeo4jHost','port':'7687', 'user':'neo4j','password':'supersecret','name':'MyDBInstance'}"
      - CONFIGS_REDIS="{'host':'redis'}"
      - CONFIGS_BASE_LINE_MODE=False
    volumes:
      - ./data:/data
      - ./log:/log
      - ./dump:/dump
Config parameters
Config for PMS is located in the file pms/config.py
All config parameters can be set or overridden via environment variables; the variable name is the parameter name with the prefix CONFIGS_.
E.g. the parameter PUBMED_SOURCE set via environment variable must be CONFIGS_PUBMED_SOURCE
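The prefix rule can be illustrated with a few lines of Python. This is only a sketch of the naming convention, not the code PMS uses to read its configuration; env_name and get_setting are hypothetical helpers.

```python
import os

ENV_PREFIX = "CONFIGS_"


def env_name(parameter):
    """Name of the environment variable that overrides a parameter."""
    return ENV_PREFIX + parameter


def get_setting(parameter, default, environ=None):
    """Return the environment override if present, else the default."""
    environ = os.environ if environ is None else environ
    return environ.get(env_name(parameter), default)


fake_env = {"CONFIGS_PUBMED_SOURCE": "5"}
print(env_name("PUBMED_SOURCE"))                      # CONFIGS_PUBMED_SOURCE
print(get_setting("PUBMED_SOURCE", None, fake_env))   # 5
print(get_setting("BASE_LINE_MODE", True, fake_env))  # True
```

Note that values coming from the environment are always strings, so "5" above would still have to be converted before it can be used as a number.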
PUBMED_SOURCE
- Parameter to define which XML files from MEDLINE/PubMed should be parsed
default: "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline-2019-sample/pubmedsample.xml"
- None - Download and process all PubMed XML files from the source FTP server
  - example: CONFIGS_PUBMED_SOURCE=None
- int - Download and process the most recent n XML files from the PubMed server
  - example: CONFIGS_PUBMED_SOURCE=5
- str of remote file path - Download and process a single file
  - example: CONFIGS_PUBMED_SOURCE=https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1009.xml.gz
- str of local dir path - Process all files in the directory
  - example: CONFIGS_PUBMED_SOURCE=/home/files/
- str of local file path - Process a single file
  - example: CONFIGS_PUBMED_SOURCE=/home/files/pubmed20n1008.xml
- list of local file paths - Process the XML files in the list
  - example: CONFIGS_PUBMED_SOURCE=["/home/files/pubmed20n1009.xml","/home/files/pubmed20n1008.xml"]
- list of remote files - FTP or HTTP URLs to be downloaded and processed
  - example: CONFIGS_PUBMED_SOURCE=["https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1009.xml.gz","https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1008.xml.gz"]
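The different kinds of PUBMED_SOURCE values listed above can be told apart by their Python type. The following is a minimal sketch of such a dispatch, written for illustration only; it is not taken from the PMS code base, and describe_source is a hypothetical name.

```python
import os

REMOTE_PREFIXES = ("http://", "https://", "ftp://")


def describe_source(source):
    """Map a PUBMED_SOURCE value to the behaviour listed above."""
    if source is None:
        return "all files from the server"
    if isinstance(source, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise TypeError("bool is not a valid source")
    if isinstance(source, int):
        return "latest %d files from the server" % source
    if isinstance(source, (list, tuple)):
        return "list of %d files" % len(source)
    if isinstance(source, str):
        if source.startswith(REMOTE_PREFIXES):
            return "remote file"
        if os.path.isdir(source):
            return "local directory"
        return "local file"
    raise TypeError("unsupported source type: %s" % type(source).__name__)


print(describe_source(None))  # all files from the server
print(describe_source(5))     # latest 5 files from the server
print(describe_source("https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/pubmed20n1009.xml.gz"))  # remote file
print(describe_source(os.getcwd()))  # local directory
```

A value read from an environment variable arrives as a string, so it would first have to be converted (for example "5" to 5) before a dispatch like this makes sense.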
BASE_LINE_MODE
- Defines whether the baseline or the annual update should be processed. When set to True, the baseline will be downloaded, parsed and loaded into Neo4j. When set to False, the annual update XMLs will be downloaded, parsed and loaded into Neo4j.
default: True
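To make the two modes more concrete: NLM publishes the baseline and the update files in two separate directories on its server. The sketch below picks one of them based on the flag. That PMS resolves the mode to exactly these URLs, and that values like "1" or "yes" are accepted, are assumptions for this example; parse_bool and source_directory are hypothetical names.

```python
BASELINE_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"
UPDATES_URL = "https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/"


def parse_bool(value):
    """Interpret 'True'/'False' strings as they arrive from the environment."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise ValueError("not a boolean: %r" % (value,))


def source_directory(base_line_mode):
    """Return the server directory that belongs to the selected mode."""
    return BASELINE_URL if parse_bool(base_line_mode) else UPDATES_URL


print(source_directory("True"))   # https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
print(source_directory("False"))  # https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
```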
Datamodel
Changes in datamodel:
0.9.22 -> 1.2.13
PublicationType and PublicationTypeUI are no longer attributes of :PubMedArticle but standalone nodes related to :PubMedArticle via PUBMEDARTICLE_HAS_PUBLICATIONTYPE
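With publication types as standalone nodes, counting articles per type becomes a simple pattern match. The sketch below uses only the labels and the relationship type named above; the relationship direction is left open on purpose and whole nodes are returned, because property names are not documented here. top_publication_types is a hypothetical helper that works with any object offering a py2neo-style run(cypher, **params) method.

```python
PUBLICATION_TYPE_COUNTS = """
MATCH (a:PubMedArticle)-[:PUBMEDARTICLE_HAS_PUBLICATIONTYPE]-(t:PublicationType)
RETURN t AS publication_type, count(a) AS articles
ORDER BY articles DESC
LIMIT $limit
"""


def top_publication_types(graph, limit=10):
    """Run the query and return the result rows as a list."""
    return list(graph.run(PUBLICATION_TYPE_COUNTS, limit=limit))
```

With py2neo installed, a `Graph("bolt://localhost:7687", auth=("neo4j", "mysuperpw"))` object could be passed as graph; host and credentials here are placeholders.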