A simple tool for generating and analyzing bibliometric citation network data from PubMed.
PubMed Network Toolkit (pnt)
pnt is a simple Python package for extracting and analyzing bibliometric citation network data from PubMed. The package is designed to support:
- pulling citation metadata from PubMed;
- constructing co-authorship networks;
- generating edge and node lists;
- visualizing basic network structures; and
- filtering and summarizing PubMed data sets.
Author: Jacob Rohde (jarohde1@gmail.com)
Release notes: Version 0.0.6 (released 2025-05-28) added a sub-package, AI_review_tools, that includes tools for filtering and summarizing PubMed data sets using locally hosted large language models (LLMs) via Ollama. This package is released under the MIT license.
Package overview
GetPubMedData()
Extract a citation data set from PubMed using Metapub.
pnt.GetPubMedData(search_term,
pubmed_api_key=None,
size=250,
start_date=None,
end_date=None)
Arguments/attributes:
- search_term
The only required argument. Takes a single string as the search term(s). Examples: search_term='cancer', search_term='cancer and tobacco'
- pubmed_api_key (optional)
A string argument to specify a PubMed NCBI API key. If set, this key is registered as an environment variable, reducing API rate limiting.
- size (optional)
An integer that indicates how many PubMed citations to retrieve. Default is 250. Note: this class is intended for small-scale or exploratory data pulls.
- start_date/end_date (optional)
String parameter(s) to specify the date range for citation retrieval. end_date defaults to the current date. Format: 'YYYY, MM, DD' (e.g., '2023, 01, 01').
- GetPubMedData.citation_df
A pandas DataFrame containing the citation data. The DataFrame includes the following columns: 'pmid', 'first_author', 'last_author', 'author_list', 'title', 'journal', 'year', 'volume', 'issue', 'pages', 'url', 'abstract', 'citation', 'doi'
- GetPubMedData.write_data()
Saves the citation DataFrame to file. Accepts the following optional keyword arguments: file_type (format to save the file; accepts 'csv' or 'json'; default is 'csv') and file_name (name of the output file, without extension; default is the provided search_term).
pnt.GetCitationNetwork()
Generate edge and node lists (and a NetworkX graph object) from a PubMed citation data set.
pnt.GetCitationNetwork(citation_dataset,
edge_type='directed')
Arguments/attributes:
- citation_dataset
The only required argument. Takes an existing citation data set or a GetPubMedData object.
- edge_type (optional)
String argument set to either 'directed' or 'undirected' to signify network edge type; default is 'directed'.
- GetCitationNetwork.edge_list
A pandas DataFrame of the network edge list, with columns for source author, target co-author, and journal.
- GetCitationNetwork.node_list
A pandas DataFrame of the network node list, with columns for unique nodes, degree, and each node's associated journals.
- GetCitationNetwork.graph
A NetworkX graph object.
- GetCitationNetwork.write_data()
Object method that writes the edge_list and node_list data sets to file. Accepts the same optional keyword arguments as GetPubMedData.write_data() (i.e., file_type and file_name).
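As an illustration of what an edge list like the one GetCitationNetwork returns might look like, the sketch below derives directed co-authorship edges from an author_list column using plain pandas and NetworkX. This is an illustrative assumption about the data shape, not pnt's actual implementation; the sample authors and journals are made up.

```python
import itertools

import pandas as pd
import networkx as nx

# Hypothetical citation rows shaped like GetPubMedData.citation_df
# ('author_list' and 'journal' columns only, for brevity).
citations = pd.DataFrame({
    'author_list': [['Smith J', 'Lee K'], ['Lee K', 'Park H', 'Smith J']],
    'journal': ['Tob Control', 'Nicotine Tob Res'],
})

# One directed edge per ordered author pair within each article,
# carrying the journal as an edge attribute.
edges = [
    {'source': a, 'target': b, 'journal': row.journal}
    for row in citations.itertuples()
    for a, b in itertools.combinations(row.author_list, 2)
]
edge_list = pd.DataFrame(edges)

# Build a directed NetworkX graph from the edge list.
G = nx.from_pandas_edgelist(edge_list, 'source', 'target',
                            edge_attr='journal', create_using=nx.DiGraph)
print(G.number_of_edges())  # → 4
```

A node list like GetCitationNetwork.node_list could then be assembled from `G.degree()` plus the journals attached to each node's incident edges.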
single_network_plot()
A simple function for plotting networks via NetworkX and Matplotlib (additional install required). Please note this function is currently a work in progress and is meant to be a basic tool for plotting a single graph. See the NetworkX documentation for more advanced plotting needs.
pnt.single_network_plot(network,
**kwargs)
Arguments:
- network
The only required argument. Takes a GetCitationNetwork or NetworkX graph object.
- title (optional)
String argument to add a title to the plot.
- pos (optional)
String argument to set the NetworkX layout algorithm. For ease of use, the argument currently accepts one of the following layout types as a string: 'spring_layout' (default), 'kamada_kawai_layout', 'circular_layout', or 'random_layout'.
Optional keyword arguments (**kwargs):
The function also accepts several other NetworkX keyword arguments for plotting (please see NetworkX documentation for more info on these arguments). Currently accepted arguments include:
- 'arrows' (bool)
- 'arrowsize' (int)
- 'edge_color' (str or list/array)
- 'font_size' (int)
- 'node_color' (str or list/array)
- 'node_size' (str or list/array)
- 'verticalalignment' (str)
- 'width' (int/float or list/array)
- 'with_labels' (bool)
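For readers who outgrow single_network_plot(), the sketch below shows roughly the equivalent direct NetworkX/Matplotlib calls: map a layout name to a layout function, then pass the keyword arguments listed above to nx.draw. This is an assumption about the general approach, not pnt's actual implementation; the graph and file name are made up.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so no display is required
import matplotlib.pyplot as plt
import networkx as nx

# A tiny stand-in graph (hypothetical authors).
G = nx.DiGraph([('Smith J', 'Lee K'), ('Lee K', 'Park H')])

# Map the layout names single_network_plot accepts to NetworkX functions.
layouts = {
    'spring_layout': nx.spring_layout,
    'kamada_kawai_layout': nx.kamada_kawai_layout,
    'circular_layout': nx.circular_layout,
    'random_layout': nx.random_layout,
}
pos = layouts['circular_layout'](G)

# nx.draw forwards these keyword arguments to draw_networkx.
nx.draw(G, pos, arrows=True, with_labels=True,
        node_color='lightsteelblue', font_size=8, width=1.0)
plt.title('Example network')
plt.savefig('network.png')
```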
Example use case for pnt
This example demonstrates how to use pnt to:
- Extract a PubMed citation data set
- Write the citation data to file
- Construct a citation network graph from the data
- Plot the citation network using Matplotlib
- Write the resulting edge and node lists to file
import pnt # Assumes pnt is installed
# Extract citation data for the keyword topic 'tobacco control'
pubmed_data = pnt.GetPubMedData(search_term='tobacco control',
size=25,
start_date='2025, 1, 1',
end_date='2025, 1, 31')
# Access the resulting data set
df = pubmed_data.citation_df
print(df)
# Write the data to CSV
pubmed_data.write_data(file_type='csv', file_name='tob_control_citations')
# Create a citation network object from the data
network = pnt.GetCitationNetwork(pubmed_data, edge_type='directed')
# Plot the citation network
pnt.single_network_plot(network=network,
title='Example tobacco control co-citation network plot',
arrows=True,
with_labels=True)
# Access the edge and node lists and save the data to file
edge_df = network.edge_list
node_df = network.node_list
network.write_data(file_type='csv', file_name='citation_network')
Literature review tools using AI (sub-package)
The AI_review_tools sub-package includes functions to filter and summarize academic articles using a locally hosted LLM via Ollama. It leverages a DataFrame of articles, including titles and/or abstracts, to generate structured prompts for the model. The package processes articles in chunks, providing detailed summaries or filtering irrelevant papers based on custom topics. Users are also able to provide custom prompts via a .txt file.
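The chunking idea described above can be sketched in a few lines of pandas: split the article DataFrame into batches of chunk_size rows, so each batch becomes one prompt for the LLM. This is an illustrative sketch of the concept, not AI_review_tools' actual code; the article titles are made up.

```python
import pandas as pd

# A hypothetical articles DataFrame with a 'title' column.
articles = pd.DataFrame({'title': [f'Article {i}' for i in range(12)]})

chunk_size = 5
# Slice the DataFrame into consecutive batches of at most chunk_size rows.
chunks = [articles.iloc[i:i + chunk_size]
          for i in range(0, len(articles), chunk_size)]

print([len(c) for c in chunks])  # → [5, 5, 2]
```

Each chunk's titles (or abstracts) would then be interpolated into a prompt template and sent to the model, one request per chunk.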
To import the sub-package, run:
from pnt import AI_review_tools
Note: you will need to install Ollama prior to using this sub-package. In addition, you will need to download at least one local LLM to use alongside these functions. By default, this package uses gemma3:12b, but other models such as Mistral can be used. For more information about Ollama and installing local LLMs, see https://ollama.com
FilterLiterature()
Filters out irrelevant articles from a PubMed citation data set using a local LLM via Ollama.
AI_review_tools.FilterLiterature(df,
filter_topic,
**kwargs)
Arguments/attributes:
- df
A pandas.DataFrame of articles with at least a 'title' or 'abstract' column.
- filter_topic
A string representing the topic or concept to filter by (i.e., what makes an article "relevant").
- FilterLiterature.filtered_df
A pandas DataFrame containing the PubMed citation data set with the filtered articles removed.
- FilterLiterature.write_outliers_to_csv()
Writes removed article indices and justifications to a file named 'articles_filtered_out.csv'.
Optional keyword arguments (**kwargs):
- chunk_size (int)
Number of articles sent to the LLM per request. Default is 5. Lower numbers are preferred to improve local LLM performance.
- model (str)
Name of the Ollama-compatible LLM to use. Default is 'gemma3:12b'.
- filter_by (str)
Column used for filtering. Accepts 'titles' or 'abstracts'. Default is 'titles'.
- custom_prompt_path (str)
File path to a custom .txt prompt template. If not provided, a default one is used.
SummarizeLiterature()
Summarizes article content in chunks using a local LLM.
AI_review_tools.SummarizeLiterature(df,
summary_topic,
**kwargs)
Arguments/attributes:
- df
A pandas.DataFrame with article data. Must include an 'abstract' or 'title' column.
- summary_topic
A string describing the focus of the summary (e.g., "tobacco control").
- SummarizeLiterature.provide_overall_summary()
A method that returns an overall summary of the provided PubMed data set as a paragraph string, which is also stored in SummarizeLiterature.overall_summary.
Optional keyword arguments (**kwargs):
- chunk_size (int)
Number of articles sent to the LLM per request. Default is 3. Lower numbers are preferred to improve local LLM performance.
- model (str)
Name of the Ollama-compatible LLM to use. Default is 'gemma3:12b'.
- filter_by (str)
Text field to summarize. Accepts 'abstracts' or 'titles'. Default is 'abstracts'.
- num_summary_points (int)
Number of summary bullet points the LLM should return per chunk. Default is 3.
- custom_prompt_path (str)
Path to a custom .txt prompt template. If not provided, a default one is used.
- provide_overall_summary (bool)
If True, compiles an additional overall summary across all chunks using a second prompt.
CodeArticle()
Codes an academic article using a local LLM (feature currently in development).
AI_review_tools.CodeArticle(article_pdf_path,
custom_prompt_path,
**kwargs)
Arguments/attributes:
- article_pdf_path
Path to the PDF file of the academic article to be coded.
- custom_prompt_path
Path to a .txt prompt codebook. The prompt should include an {article_text} placeholder, which will be populated with the text extracted from the given PDF.
- CodeArticle.coded_article_results
The coded output of the article, returned after applying the custom prompt to the extracted text.
Optional keyword arguments (**kwargs):
- model (str)
Name of the Ollama-compatible LLM to use. Default is 'gemma3:12b'.
- clean_text (bool)
If True, attempts to clean and preprocess the extracted PDF text before coding. Default is True.
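The codebook-plus-placeholder mechanism described above amounts to simple template substitution: the {article_text} placeholder in the prompt file is replaced with the text PyMuPDF extracts from the PDF. The sketch below shows that step in isolation, with a made-up codebook and a string standing in for the extracted text.

```python
# Hypothetical codebook template, as would be loaded from a .txt file.
# The {article_text} placeholder is documented; the coding questions
# here are invented for illustration.
template = (
    "You are coding an academic article. Report:\n"
    "1. Study design (e.g., RCT, survey)\n"
    "2. Sample size\n\n"
    "Article text:\n{article_text}"
)

# In practice this would come from PyMuPDF text extraction on the PDF.
article_text = "Example extracted text from the PDF."

# Populate the placeholder to produce the final prompt for the LLM.
prompt = template.format(article_text=article_text)
print(prompt.endswith(article_text))  # → True
```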
Example use case for AI_review_tools
This example demonstrates how to use pnt and AI_review_tools to:
- Extract a PubMed citation data set
- Filter the data set based on a specific topic
- Write filtered articles to file
- Summarize remaining articles
import pnt
from pnt import AI_review_tools as rev
# Extract citation data set for the topic 'tobacco control'
pubmed_data = pnt.GetPubMedData(search_term='tobacco control',
size=25,
start_date='2025, 1, 1',
end_date='2025, 1, 31')
# Review the data set
print(pubmed_data.citation_df)
# Filter the data set for articles about 'vaping'
filtered_results = rev.FilterLiterature(df=pubmed_data.citation_df,
filter_topic='vaping and eCigarettes',
filter_by='titles', # or 'abstracts'
chunk_size=5, # Set to 3 if filtering 'abstracts'
model='gemma3:12b')
# Extract the filtered data set and write the outliers to file
print(filtered_results.filtered_df)
filtered_results.write_outliers_to_csv()
# Review the number of extracted articles removed by the filter
print(len(pubmed_data.citation_df) - len(filtered_results.filtered_df))
# Summarize chunked article batches
summarized_articles = rev.SummarizeLiterature(df=pubmed_data.citation_df,
summary_topic='vaping and eCigarettes',
chunk_size=3)
# Print chunked summaries
print(summarized_articles.processed_results)
# Extract an overall summary in paragraph form
summarized_articles.provide_overall_summary()
print(summarized_articles.overall_summary)
Requirements
- Python 3.XX
- metapub - a Python library with functions to query the PubMed API
- numpy - a Python library for handling arrays and matrices
- pandas - a Python library for data management
- NetworkX - a Python library for network analysis
- PyMuPDF - a Python package for parsing PDFs
- Matplotlib (only if using the single_network_plot() function) - a Python library for plotting
File details
Details for the file pnt-0.0.6.tar.gz.
File metadata
- Download URL: pnt-0.0.6.tar.gz
- Upload date:
- Size: 19.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4465b6e7646b240e99539f4b9587b42b42b4e7c2b6b42bb4f21da25bb2dcc780 |
| MD5 | 44cbe3247becf95ed4bd13b5f9b8d3aa |
| BLAKE2b-256 | ba1bd790f693cf4be277f66845d9ffaf1360fefd2178cc0dffe3c741b7240803 |
File details
Details for the file pnt-0.0.6-py3-none-any.whl.
File metadata
- Download URL: pnt-0.0.6-py3-none-any.whl
- Upload date:
- Size: 17.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 00d46b7fc464430a4395e8c364c5ec013f1b5d4835e3e1c1e7fca044bbfdef97 |
| MD5 | ab114d0274f9964308cdf3347566c37b |
| BLAKE2b-256 | acd5f98e27cc73099aff9b04152ce733375aa93598fdc1ecea76df12799f78fc |