
PubMed article data collector

Project description

Data Collector

By Query

To get paper data for a search query, you can use the following Python script:

from datacollector import datacollector_query

# Search for the query "biofilm co2", return 10 results, and download the PDFs of the results
datacollector_query('biofilm co2', 10, True)

# Search for the query "biofilm co2", return 10 results, without downloading the PDFs
datacollector_query('biofilm co2', 10)

The function takes three arguments:

  • Search query (non-empty string)
  • Number of results (integer)
  • Download PDFs? (boolean; optional, as shown above)

NOTE: The search query must be a non-empty string, and to combine multiple terms, put a + symbol between the words (e.g. word1+word2). This script will:

  • Search for the query in the PubMed repository
  • Format the results as JSON
  • Name each JSON file with the paper's PMID (e.g. 1234.json)
  • Automatically create a directory (./datasets_query/json) to save the JSON results
  • Save the PDF files into ./pdf (see the layout sketched below)
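Assuming a single result with PMID 1234 (the same hypothetical ID used in the naming example above), the resulting layout would look like this:

./datasets_query/
    json/
        1234.json
./pdf/
    1234.pdf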

The JSON file will contain the paper's metadata:

  • pmid
  • authors
  • year
  • title
  • abstract
  • doi
  • references
  • citedby
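As a rough sketch, a saved 1234.json might look like the following. All values here are hypothetical, and the exact shape of the authors, references, and citedby fields may differ in the actual output:

{
  "pmid": "1234",
  "authors": ["A. Author", "B. Author"],
  "year": "2020",
  "title": "A hypothetical paper title",
  "abstract": "A hypothetical abstract...",
  "doi": "10.0000/example",
  "references": ["5678", "9012"],
  "citedby": ["3456"]
}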

If you pass True as the third argument, the script will also download the paper in PDF format into the pdf directory from a repository HERE.

NOTE: You might not be able to download some papers in PDF format. From what I have found, this happens either because the repository doesn't have the paper, or because the paper is too new and hasn't been published there yet.

By PMIDs

To get paper data from a list of PMIDs, you can use the following Python script:

from datacollector import datacollector_query

# Read a list of PMIDs from a text file, download the metadata, and download the PDFs of the results
datacollector_query('./ids_list.txt', True)

# Read a list of PMIDs from a text file and download the metadata without downloading the PDFs
datacollector_query('./ids_list.txt')

The function takes two arguments:

  • File path (TXT format)
  • Download PDFs? (boolean; optional, as shown above)

This script does the same things as the query version above, but with a list of PMIDs instead of a search query. It also automatically creates a directory (./datasets_id/json) to save the JSON results. NOTE: In the TXT file, the PMIDs must be separated by newline characters ("\n"), as in the example below.
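For example, an ids_list.txt containing three hypothetical PMIDs, one per line, would look like this:

1234
5678
9012

Calling datacollector_query('./ids_list.txt', True) would then fetch the metadata (and PDFs) for those three papers.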

Download files

Download the file for your platform.
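In most cases you won't need to download these files manually; assuming the package name on PyPI is datacollector, as the file names below suggest, it can be installed with pip:

pip install datacollector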

Source Distribution

datacollector-0.0.7.tar.gz (5.0 kB)

Uploaded Source

Built Distribution


datacollector-0.0.7-py3-none-any.whl (7.8 kB)

Uploaded Python 3

File details

Details for the file datacollector-0.0.7.tar.gz.

File metadata

  • Download URL: datacollector-0.0.7.tar.gz
  • Size: 5.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.5.0 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.7.3

File hashes

Hashes for datacollector-0.0.7.tar.gz

  • SHA256: 00c8d758880459721610915bb98868167496c5b5fdb73fbdf2828442df2542a1
  • MD5: cd35295d2582e1400f54ec1f3454663f
  • BLAKE2b-256: 1ada0082a8c334d99177498242b5c661cd9bb79b39885e79bcb9a71f3a56b3e7

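As a minimal sketch of how to use these hashes, the following Python snippet compares the SHA256 digest of a downloaded file against the expected value listed above (it assumes the archive sits in the current directory):

import hashlib

def sha256_of(path):
    # Read the file in chunks so large downloads don't need to fit in memory
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

expected = '00c8d758880459721610915bb98868167496c5b5fdb73fbdf2828442df2542a1'
print(sha256_of('datacollector-0.0.7.tar.gz') == expected)  # True if the download is intact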

File details

Details for the file datacollector-0.0.7-py3-none-any.whl.

File metadata

  • Download URL: datacollector-0.0.7-py3-none-any.whl
  • Size: 7.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.5.0 pkginfo/1.7.1 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.62.2 CPython/3.7.3

File hashes

Hashes for datacollector-0.0.7-py3-none-any.whl

  • SHA256: 4ebbdc3ab8ffa3cae559100015e7d2093c0c938891bf2a4efa3c0deca571094b
  • MD5: d667c637a9ba24899e1574eb3b4e7bbe
  • BLAKE2b-256: 15810fbb21679b2b244da8fed426fc900e5abebc850fd4440414bd06c3ccc635

