Pubmed / NCBI / eutils interaction library, handling the metadata of pubmed papers.

Project description

Metapub is a Python library that provides python objects fetched via eutils that represent papers and concepts found within the NLM.

These objects abstract some interactions with PubMed and are intended to encompass as many types of database lookups and summaries as Eutils / Entrez can provide.

PubMedArticle / PubMedFetcher

Basic usage:

from metapub import PubMedFetcher

fetch = PubMedFetcher()
article = fetch.article_by_pmid('123456')
print(article.journal, article.year, article.volume, article.issue)

# NEW (0.4.x): PMA can generate a rudimentary MLA article citation string.

PubMedFetcher uses an SQLite caching engine (provided through eutils), which by default places a file in your user directory. For example, the author’s cache file path would be /home/nthmost/.cache/eutils-cache.db

This cache file can grow quite large over time. Deleting the cache file is safe and can also be regarded as the way to “reset” the cache.

The cachedir keyword argument can be supplied to PubMedFetcher as a way to specify where the cache file will reside. For example:

fetch = PubMedFetcher(cachedir='/path/to/cachedir')

User directory expansion also works:

fetch = PubMedFetcher(cachedir='~/.othercachedir')

The cachedir will be created for you if it doesn’t already exist, assuming the user account you’re running metapub under has permissions to do so.
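Since resetting the cache just means deleting the file, a small helper can do it. This is a minimal sketch and not part of metapub: reset_cache is a hypothetical name, and the default path is an assumption based on the example location given above.

```python
import os

# Assumed default location, inferred from the example path in this README.
DEFAULT_CACHE = os.path.join("~", ".cache", "eutils-cache.db")


def reset_cache(cache_path=DEFAULT_CACHE):
    """Delete the cache file if it exists. Return True if a file was removed."""
    path = os.path.expanduser(cache_path)
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False
```

If you passed a custom cachedir to PubMedFetcher, point cache_path at the cache file inside that directory instead.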

PubMedArticle Lookup Methods

The following methods return a PubMedArticle object (or raise InvalidPMID if NCBI lookup fails).


(Attempt to) fetch an article by supplying its pubmed ID (both integer and string accepted).


(Attempt to) fetch an article by looking up the DOI first.


Fetch an article by looking up the PMCID first. Both integer and string accepted.
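Because a failed lookup raises InvalidPMID, callers that process many IDs may prefer to get None back instead. The wrapper below is a sketch under assumptions: article_or_none is a hypothetical helper, and the import path metapub.exceptions is assumed (a stand-in class is defined when metapub is not installed, so the sketch runs on its own).

```python
try:
    from metapub.exceptions import InvalidPMID  # assumed import path
except ImportError:
    class InvalidPMID(Exception):
        """Stand-in so this sketch runs without metapub installed."""


def article_or_none(fetch, pmid):
    """Return the article for pmid, or None if the lookup raises InvalidPMID."""
    try:
        return fetch.article_by_pmid(pmid)
    except InvalidPMID:
        return None
```

Here fetch is expected to be a PubMedFetcher instance, or any object exposing article_by_pmid.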

Pubmed ID List Methods

The following methods return a list of pubmed IDs (if found) or an empty list (if none are found).


Produces a list of possible PMIDs for the submitted citation, where the citation is submitted as a collection of keyword arguments. At least 3 of the 5, preferably 4 or 5 for best results, must be included:

aulast or author_last_fm1
first_page or spage
journal or jtitle

Use NLM Title Abbreviation (aka ISO Abbreviation) journal strings whenever possible.


Returns list of pmids for given freeform query string plus keyword arguments.

All Pubmed Advanced Query tokens are supported.

See NCBI Search Field Descriptions and Tags.


Composes a “Clinical Query” as on the PubMed Clinical Queries page.

Supply a “category” (required) and an optimization (“broad” or “narrow”) for this function. Available categories:

  • therapy
  • diagnosis
  • etiology
  • prognosis
  • prediction

All keyword arguments for PubMedFetcher.pmids_for_query available.
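It can help to validate the category and optimization before sending a query. The check below is a sketch, not metapub code: check_clinical_query_args is a hypothetical helper, and the allowed values are simply copied from the list above.

```python
# Allowed values as listed in this README.
CLINICAL_CATEGORIES = ("therapy", "diagnosis", "etiology", "prognosis", "prediction")
OPTIMIZATIONS = ("broad", "narrow")


def check_clinical_query_args(category, optimization):
    """Raise ValueError for unknown values; return the validated pair otherwise."""
    if category not in CLINICAL_CATEGORIES:
        raise ValueError("unknown category: %r" % (category,))
    if optimization not in OPTIMIZATIONS:
        raise ValueError("unknown optimization: %r" % (optimization,))
    return category, optimization
```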


Composes a “Medical Genetics Query”.

Supply a “category” (required) and an optimization (“broad” or “narrow”) for this function. Available categories:

  • therapy
  • diagnosis
  • etiology
  • prognosis
  • prediction

All keyword arguments for PubMedFetcher.pmids_for_query available.


The PubMedCentral functions are a loose collection of conversion methods for academic publishing IDs, allowing conversion (where possible) between the following ID types:

doi (Digital object identifier)
pmid (PubMed ID)
pmcid (PubMed Central ID, including versioned document ID)

The following methods are supplied, returning a string (if found) or None:


As implied by the function names, you can supply any valid ID type (“otherid”) to acquire the desired ID type.
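When the type of an incoming ID is not known in advance, a rough guess can be made from its shape. This is an illustrative sketch, not how metapub does it: guess_id_type is a hypothetical helper, and the patterns are simplifying assumptions (DOIs start with "10.", PMCIDs start with "PMC" and may carry a version suffix, PMIDs are plain digits).

```python
import re


def guess_id_type(value):
    """Return 'pmcid', 'doi', 'pmid', or None based on the shape of value."""
    text = str(value).strip()
    # PMCID, optionally with a version suffix such as ".1"
    if re.fullmatch(r"PMC\d+(\.\d+)?", text, flags=re.IGNORECASE):
        return "pmcid"
    # DOI: "10.", a registrant code, a slash, then a suffix
    if re.fullmatch(r"10\.\d{4,9}/\S+", text):
        return "doi"
    # PMID: digits only
    if re.fullmatch(r"\d+", text):
        return "pmid"
    return None
```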

MedGenConcept / MedGenFetcher

The MedGen (medical genetics) database is a clinical dictionary linking medical concepts across multiple medical ontologies and dictionaries such as OMIM and SNOMED.

Basic usage:

from metapub import MedGenFetcher

fetch = MedGenFetcher()
concept = fetch.concept_by_uid('336867')

ClinVarVariation / ClinVarFetcher

The ClinVar database contains information submitted by genetic researchers, labs, and testing companies around the world.

Information queryable using the ClinVarFetcher currently includes searching for the ID of a variant (“Variation”) in the database using an HGVS string and retrieving the Variant Summary using a variation ID or HGVS string.

Since Pubmed citations by Variation ID are also available by a cross-query between ClinVar and Pubmed, ClinVarFetcher allows retrieving PMIDs for a given HGVS string.

Basic usage:

from metapub import ClinVarFetcher

clinvar = ClinVarFetcher()
cv = clinvar.variation_by_hgvs('NM_000249.3:c.1958T>G')

pubmed_citations = clinvar.pmids_for_hgvs('NM_000249.3:c.1958T>G')
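To collect citations for several variants at once, the pmids_for_hgvs call can be wrapped in a loop. A minimal sketch: pmids_by_variant is a hypothetical helper, and it assumes pmids_for_hgvs returns an iterable of PMIDs (or nothing) for each HGVS string.

```python
def pmids_by_variant(clinvar, hgvs_strings):
    """Map each HGVS string to the list of PMIDs the fetcher reports for it."""
    # "or []" guards against a fetcher that returns None when nothing is found.
    return {hgvs: list(clinvar.pmids_for_hgvs(hgvs) or []) for hgvs in hgvs_strings}
```

Here clinvar is expected to be a ClinVarFetcher instance, or any object exposing pmids_for_hgvs.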


The CrossRef object provides an object layer into the CrossRef API.

CrossRef excels at resolving DOIs into article citation details.

CrossRef can also be used to resolve a DOI /from/ article citation details, with a bit of finagling. The “get_top_result” function was built to do some light interpretation of the json-based results of a CrossRef lookup.

Result scores under 2.0 are usually False matches. Result scores over 3.0 are always (?) True. Between 2.0 and 3.0 is a grey area: be wary and check results against any known info you may have.
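The rule of thumb above can be written down as a tiny function. This is only a sketch of the stated thresholds; classify_score and its labels are hypothetical and not part of metapub.

```python
def classify_score(score):
    """Label a CrossRef result score using the 2.0 / 3.0 rule of thumb."""
    if score < 2.0:
        return "probably false"
    if score > 3.0:
        return "probably true"
    # Scores from 2.0 through 3.0 need checking against known info.
    return "grey area"
```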

Current testing (as of 1/23/2015) indicates that a cleverly-formed CrossRef query can return results 99% correct about 90% of the time.

The more params submitted with the query, the more accurate the results may be.

Basic usage:

CR = CrossRef()       # starts the query cache engine
results = CR(search_string, params)
top_result = CR.get_top_result(results)

Example starting from a known pubmed ID:

pma = PubMedFetcher().article_by_pmid(known_pmid)
results = CR.query_from_PubMedArticle(pma)
top_result = CR.get_top_result(results, CR.last_params, use_best_guess=True)

NOTE: if you don’t supply “CR.last_params”, you can’t use the “use_best_guess” operator. In cases where all results have scores under 2, no results will be returned unless use_best_guess=True. That’s often desired behavior, since results with scores under 2 are usually pretty bad.
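The selection behavior described in the note can be illustrated as follows. This is a sketch of the described behavior only, not the actual get_top_result implementation: pick_top is a hypothetical function, and it assumes each result is a dict with a numeric "score" key.

```python
def pick_top(results, use_best_guess=False, min_score=2.0):
    """Return the highest-scoring result, or None if it scores too low.

    With use_best_guess=True the best result is returned even when every
    score falls under min_score.
    """
    if not results:
        return None
    best = max(results, key=lambda r: r["score"])
    if best["score"] >= min_score or use_best_guess:
        return best
    return None
```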

As with the PubMedFetcher object, you can configure where the cache file ends up on the filesystem via the cachedir keyword argument.


Looking for an article PDF? Trying to gather a large corpus of research?

The FindIt object was designed to be able to locate the direct urls of as many different articles from as many different publishers of PubMed content as possible.

Any article that is Open Access, whether it is in PubmedCentral or not, can potentially be “FindIt-able”. Usage is simple:

from metapub import FindIt
src = FindIt('18381613')

You can start FindIt from a DOI instead of a PMID by instantiating with FindIt(doi='10.1234/some.doi').

If FindIt couldn’t get a URL, you can take a look at the “reason” attribute to find out why. For example:

src = FindIt('1234567')
if src.url is None: print(src.reason)

The FindIt object is cached (keyed to PMID), so while initialization the first time around for a given PMID or DOI may take a few seconds, the second time this information is requested it will take far less time.

If you see a FindIt “reason” that starts with NOFORMAT, this is a great place to contribute some help to metapub! Feel free to dive in and submit a pull request, or contact the author for advice on how to fill in these gaps.
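When running FindIt over many articles, it can be handy to turn each outcome into a one-line summary. A minimal sketch: describe_findit is a hypothetical helper that relies only on the url and reason attributes described above.

```python
def describe_findit(src):
    """Summarize a FindIt-like object using its url and reason attributes."""
    if src.url is not None:
        return "found: " + src.url
    reason = src.reason or "unknown"
    if reason.startswith("NOFORMAT"):
        # These are the gaps where contributions are welcome.
        return "unsupported publisher format: " + reason
    return "not found: " + reason
```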


Starting with a URL pointing to the abstract, pdf, or online fulltext of an article, UrlReverse can “reverse” the DOI and/or the PubMed ID (pmid) of the article (assuming it can be found in PubMed).

The UrlReverse object provides an interface to the urlreverse logic, and its attributes hold state for all of the information gathered and the steps used to gather that information.

Usage is very similar to FindIt:

from metapub import UrlReverse
urlrev = UrlReverse('')

UrlReverse is cached (keyed to URL); by default its cache db can be found in ~/.cache/urlreverse-cache.db

As of metapub 0.4.3, there is no mechanism to have an item in cache expire. This is considered a deficiency and will be remedied in a future version.
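Until per-item expiry exists, one coarse workaround is to discard the whole cache file once it gets old. This is a sketch and not a metapub feature: expire_cache is a hypothetical helper, and it uses the file's modification time as a stand-in for the age of the cached items.

```python
import os
import time


def expire_cache(cache_path, max_age_days):
    """Delete the cache file if it was last modified more than max_age_days ago.

    Returns True if the file was removed, False otherwise.
    """
    path = os.path.expanduser(cache_path)
    if not os.path.isfile(path):
        return False
    age_seconds = time.time() - os.path.getmtime(path)
    if age_seconds > max_age_days * 86400:
        os.remove(path)
        return True
    return False
```

For example, expire_cache('~/.cache/urlreverse-cache.db', 30) would remove the default UrlReverse cache once it has gone a month without modification. Note that this throws away every cached item, not just the stale ones.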

This is the newest feature in metapub (as of 0.4.2a0) and there is still much work to be done. The world of biomedical literature URLs is fraught with inconsistencies and very weird URL formats. UrlReverse could really benefit from being able to parse supplement URLs, for example.

Collaboration and contributions heartily encouraged.

Miscellaneous Utilities

Currently underdocumented utilities that you might find useful.

In metapub.utils:

  • asciify (nuke all the unicode from orbit; it’s the only way to be sure)
  • parameterize (make strings suitable for submission to GET-based query service)
  • deparameterize (somewhat-undo parameterization in string)
  • remove_html_markup (remove HTML and XML tags from text; preserves HTML entities like &amp;)
  • hostname_of (returns the hostname part of a URL)
  • rootdomain_of (returns the root domain of the hostname of the supplied URL)
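To give a feel for what the two URL helpers do, here is a standard-library approximation. It is a sketch only and not metapub's implementation: hostname_sketch and rootdomain_sketch are hypothetical names, and the root-domain logic naively keeps the last two labels, which is wrong for suffixes such as co.uk.

```python
from urllib.parse import urlparse


def hostname_sketch(url):
    """Return the hostname part of a URL, or None if there is none."""
    return urlparse(url).hostname


def rootdomain_sketch(url):
    """Return the last two labels of the hostname (naive root domain)."""
    host = hostname_sketch(url)
    if not host:
        return None
    return ".".join(host.split(".")[-2:])
```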

In metapub.text_mining:

  • find_doi_in_string (returns the first seen DOI in the input string)
  • findall_dois_in_text (returns all seen DOIs in input string)
  • pick_pmid (return longest numerical string from text (string) as the pmid)
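The pick_pmid description ("longest numerical string") is easy to picture with a regular expression. The function below is an illustrative sketch written from that description, not the library's own code; pick_longest_number is a hypothetical name, and ties are resolved in favor of the first run found.

```python
import re


def pick_longest_number(text):
    """Return the longest run of digits in text, or None if there are none."""
    runs = re.findall(r"\d+", text)
    return max(runs, key=len) if runs else None
```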

In metapub.convert:

  • PubMedArticle2doi (uses CrossRef to find a DOI for given PubMedArticle object.)
  • pmid2doi (returns first found doi for pubmed ID “by any means necessary”.)
  • doi2pmid (uses CrossRef and eutils to return a PMID for given DOI if possible.)

More Information

Digital Identifiers of Scientific Literature: what they are, when they’re used, and what they look like.

About, and a Disclaimer

Metapub relies on the very neat eutils package created by Reece Hart.

Metapub has been in development since November 15, 2014, and has come quite a long way in a short time. Metapub has been deployed in production at many bioinformatics facilities (please tell me your story if you are among them!).

Feel free to use the library with confidence that each released version is well tested and battle-hardened from extensive use, but until (say) version 0.5, don’t expect total consistency between versions.

YMMV, at your own risk, etc. Please do report bugs and bring your comments and suggestions to the bitbucket home for metapub.

–Naomi Most (@nthmost)

About Python 2 and Python 3 Support

Alert: version 0.3.17 will be the last version to support python 2.7 only.

The upcoming metapub version 0.4 will support Python 3.3+ and deal entirely in unicode strings in both python 2 and python 3. The effects this shift may have on your code depend highly on what you’re doing with strings.

So, while other types of weirdness are highly unlikely, the high probability of string processing weirdness in the transition between metapub 0.3 and 0.4 means you should prepare to test, test, and test again when upgrading between these two versions.

Bugfixes that apply to this version will continue to appear as the latest 0.3.x release here on PyPI until Python 2.7 is finally retired.
