A Pipeline for Obtaining Relevant Literature Based on Given Keywords

Pipeline to literature

This pipeline helps researchers speed up literature searches and information acquisition.

Let's start following the steps!

Step 1

Building query statements for databases such as PubMed from keywords

  1. Common approach

We use PubMed as the example database and the subject keywords of our current study (e.g. mycotoxin, enzyme, degrade, degradation) as the example keywords.

Website: https://pubmed.ncbi.nlm.nih.gov/advanced/

Search with the keyword query statement:

[Figure 1]
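A keyword query statement of the kind entered in the PubMed advanced search can also be assembled programmatically. The sketch below is only an illustration and not the authors' script. It assumes that synonyms of one concept are joined with OR, that concepts are joined with AND, and that every term is restricted to the [Title/Abstract] field. The function names and the grouping of the example keywords are hypothetical.

```python
from urllib.parse import urlencode


def build_pubmed_query(concept_groups, field="Title/Abstract"):
    """Join synonyms with OR inside a group and join the groups with AND."""
    clauses = []
    for group in concept_groups:
        terms = [f'"{term}"[{field}]' for term in group]
        clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


def pubmed_search_url(query):
    """Return a PubMed search link for the given query string."""
    return "https://pubmed.ncbi.nlm.nih.gov/?" + urlencode({"term": query})


# Hypothetical grouping of the example keywords (an assumption).
groups = [["mycotoxin"], ["enzyme"], ["degrade", "degradation"]]
query = build_pubmed_query(groups)
print(query)
print(pubmed_search_url(query))
```

The printed query can be pasted into the PubMed query box, and the printed link opens the same search directly.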

Note: When searching a literature database, we recommend optimizing your keywords first. For example, if your research area is a medical topic, validate your keywords against MeSH (http://www.nlm.nih.gov/mesh/) to make sure you use the standard research vocabulary. This maximizes the chance that the retrieved literature is accurate and relevant.

Download all retrieved literature information:

[Figure 2]

For Web of Science:

Website: https://www.webofscience.com/wos/woscc/advanced-search

[Figure 3]

[Figure 4]

[Figure 5]

You can also supplement the results with relevant literature from other databases such as Google Scholar and ScienceDirect.

  2. Script-based approach

To minimize manual work, we wrote a Python script that takes the keywords provided by the user and automatically generates all possible lexical variations, the corresponding PubMed and Web of Science query statements, and the matching search links.

Python script name: generate_query_statements_and_links_to_literature_database_searches_based_on_keywords.py

Required Modules:

nltk, inflect, argparse, itertools

If a module is missing on your machine, install it with pip install <module_name>. argparse and itertools are part of the Python standard library and need no installation.
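To make the idea of lexical variations concrete, here is a deliberately crude sketch that derives a plural form with a few suffix rules. It is an assumption for illustration only: the actual script relies on nltk and inflect, whose behaviour is richer than these rules, and the function names below are hypothetical.

```python
def naive_plural(word):
    """Very rough English pluralisation; irregular nouns are not handled."""
    lower = word.lower()
    if lower.endswith(("s", "x", "z", "ch", "sh")):
        return lower + "es"
    if lower.endswith("y") and len(lower) > 1 and lower[-2] not in "aeiou":
        return lower[:-1] + "ies"
    return lower + "s"


def lexical_variants(keyword):
    """Return the lower-cased keyword followed by its naive plural."""
    base = keyword.strip().lower()
    return [base, naive_plural(base)]


for kw in ["Mycotoxin", "enzyme", "assay", "study"]:
    print(kw, "->", lexical_variants(kw))
```

Each list of variants can then serve as one OR group when a query statement is built.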

Usage:

Enter the following command in the terminal to see help on using the program:

python generate_query_statements_and_links_to_literature_database_searches_based_on_keywords.py -h

[Figure: help output of the script]

All parameters and their descriptions are listed below:

Parameter Description
-m Run mode. On the first run, use -m init to download the dictionary data. After that, use -m run for all subsequent runs.
-i Path to a file containing only keywords.
-o Path to the output file.
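For readers who want to see how such a command line could be declared, the following is a minimal argparse sketch of the three options listed above. It is a guess at the interface, not the script's real source: the dest names, the help texts and the choice to leave -i and -o optional (so that -m init can run without them) are assumptions.

```python
import argparse


def build_parser():
    """Declare -m, -i and -o as described in the parameter list."""
    parser = argparse.ArgumentParser(
        description="Generate query statements and search links from keywords."
    )
    parser.add_argument("-m", dest="mode", choices=["init", "run"], required=True,
                        help="init downloads the dictionary data, run generates queries")
    parser.add_argument("-i", dest="input_path",
                        help="path to a file containing only keywords")
    parser.add_argument("-o", dest="output_path",
                        help="path to the output file")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["-m", "run", "-i", "keywords.txt", "-o", "my_result.txt"])
print(args.mode, args.input_path, args.output_path)
```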

Input file format (one keyword per line):

keyword 1

keyword 2

keyword 3

...

As shown in the figure below:

[Figure: example keyword file]
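Reading a keyword file in this one-keyword-per-line format takes only a few lines of Python. The sketch below is an assumption about reasonable handling, not the script's actual behaviour: it skips blank lines and drops case-insensitive repeats, and the function name read_keywords is hypothetical.

```python
from pathlib import Path


def read_keywords(path):
    """Return the keywords of a one-keyword-per-line file, in file order."""
    seen = set()
    keywords = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        keyword = line.strip()
        # Skip blank lines and keywords already seen (ignoring case).
        if keyword and keyword.lower() not in seen:
            seen.add(keyword.lower())
            keywords.append(keyword)
    return keywords
```

Calling read_keywords("keywords.txt") would return the keywords as a list of strings, keeping the first spelling of each keyword.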

Example:

python generate_query_statements_and_links_to_literature_database_searches_based_on_keywords.py -m run -i keywords.txt -o my_result.txt

Contents of the output file:

[Figures: contents of the output file]

Then use the query statements or links produced by the program to search PubMed or Web of Science directly and download the retrieved literature information, following the download steps described under 1. Common approach.

Step 2

Consolidation of literature information

Combine the literature collected from the different databases into one file with MS Excel. Keep only the Title and DOI columns and save the file in xlsx format. Example:

[Figure 6]

Then remove duplicates from the file with the following Python script.

Python script name:

remove_duplicates.py

Required Modules:

pandas, argparse

If a module is missing on your machine, install it with pip install <module_name>. argparse is part of the Python standard library and needs no installation.

Usage:

Enter the following command in the terminal to see help on using the program:

python remove_duplicates.py -h

[Figure: help output of the script]

All parameters and their descriptions are listed below:

Parameter Description
-i Path to the MS Excel file (with the .xlsx extension).
-o Path to the output file.

Example:

python remove_duplicates.py -i all_database_literatures_data.xlsx -o all_database_literatures_data_single.txt

Contents of the output file:

[Figure: contents of the output file]
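The deduplication idea can be sketched as follows. This is not the code of remove_duplicates.py: it uses plain Python dictionaries instead of pandas so that it runs without extra dependencies, and it assumes that two records are duplicates when their normalized DOIs match, falling back to the normalized title when the DOI is empty. The function names and the example records are made up for illustration.

```python
def normalize_doi(doi):
    """Lower-case a DOI and strip common resolver prefixes."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi.strip()


def remove_duplicates(records):
    """Keep the first record per DOI, or per title when the DOI is missing."""
    seen = set()
    unique = []
    for record in records:
        doi = normalize_doi(record.get("DOI") or "")
        title = " ".join((record.get("Title") or "").lower().split())
        key = ("doi", doi) if doi else ("title", title)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up example records; the DOIs are placeholders, not real articles.
records = [
    {"Title": "Enzymatic degradation of a mycotoxin", "DOI": "10.1000/example.1"},
    {"Title": "Enzymatic degradation of a mycotoxin", "DOI": "https://doi.org/10.1000/EXAMPLE.1"},
    {"Title": "Another study", "DOI": ""},
    {"Title": "another  study", "DOI": ""},
]
print(len(remove_duplicates(records)))
```

Matching on the DOI first avoids treating two different articles with the same title as one record.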

Step 3

Download the literature

Download a PDF of every article in the deduplicated literature list obtained in Step 2.

Note: To obtain all of the literature as quickly as possible, a one-time batch download can be done with tools such as EndNote, a crawler, or scihub2pdf. Always respect the copyrights of the authors and publishers, and acquire the target literature only through legal channels.

For reference only, we provide a crawler script that can batch download literature in PDF format.

Python script name:

batch_download_literatures_pdf_alpha_test.py

Required Modules:

pandas, selenium, time, os, random, argparse

If a module is missing on your machine, install it with pip install <module_name>. time, os, random and argparse are part of the Python standard library and need no installation.

Usage:

Enter the following command in the terminal to see help on using the program:

python batch_download_literatures_pdf_alpha_test.py -h

[Figure: help output of the script]

Note: This script is intended for testing by interested users only. To comply with the publishers' copyright, download the literature from the official link of its publisher, or purchase the target literature you need.
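Before any batch download, it can help to turn the DOI list from Step 2 into publisher landing links and local file names. The sketch below covers only that planning step and performs no download, so it is not the authors' selenium-based crawler. It assumes that DOIs are resolved through https://doi.org/; the function names, the delay bounds and the example DOI are hypothetical.

```python
import os
import random
import re


def doi_to_filename(doi):
    """Replace characters that are unsafe in file names and add .pdf."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", doi) + ".pdf"


def plan_downloads(dois, out_dir):
    """Return (landing_url, target_path) pairs, one per DOI."""
    plan = []
    for doi in dois:
        url = "https://doi.org/" + doi
        path = os.path.join(out_dir, doi_to_filename(doi))
        plan.append((url, path))
    return plan


def polite_delay(low=2.0, high=6.0):
    """Pick a random pause in seconds to wait between two requests."""
    return random.uniform(low, high)


# Placeholder DOI, not a real article.
for url, path in plan_downloads(["10.1000/example.1"], "literatures_pdf"):
    print(url, "->", path)
```

A random pause between requests keeps the load on the publisher's servers low.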

Step 4

Convert pdf documents to text files

After downloading all the documents as PDF files, use the following Python script to convert them into text files in one batch.

Python script name:

batch_pdf_file_to_text_file.py

Required Modules:

os, argparse

os and argparse are part of the Python standard library and need no installation. If the script reports any other missing module, install it with pip install <module_name>.

Usage:

Enter the following command in the terminal to see help on using the program:

python batch_pdf_file_to_text_file.py -h

[Figure: help output of the script]

All parameters and their descriptions are listed below:

Parameter Description
-m Conversion method. The script provides four methods for converting PDF files into text files, numbered 1, 2, 3 and 4, and only one method can be set per run. If some PDF documents cannot be converted, move them into a separate directory and try another method.
-i Path to a folder that contains only the literature in PDF format.
-o Path to the output folder. All successfully converted text files are stored in this directory.

Example:

python batch_pdf_file_to_text_file.py -m 4 -i literatures_pdf -o literatures_text

A text file from the literatures_text folder looks as follows:

[Figure: example of a converted text file]

Note: The file names of documents that fail to convert are logged in the terminal so that users can follow up on them.
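The batch loop with failure logging described above can be sketched as follows. This is a minimal illustration, not the code of batch_pdf_file_to_text_file.py: the PDF-to-text step is passed in as a function (the hypothetical convert argument) because the four conversion methods of the real script are not documented here.

```python
import os


def batch_convert(pdf_dir, text_dir, convert):
    """Convert every PDF in pdf_dir with convert(path) and return the failed names."""
    os.makedirs(text_dir, exist_ok=True)
    failed = []
    for name in sorted(os.listdir(pdf_dir)):
        if not name.lower().endswith(".pdf"):
            continue
        source = os.path.join(pdf_dir, name)
        target = os.path.join(text_dir, os.path.splitext(name)[0] + ".txt")
        try:
            # Convert first, so that no empty text file is left behind on failure.
            text = convert(source)
            with open(target, "w", encoding="utf-8") as handle:
                handle.write(text)
        except Exception as error:
            print(f"Conversion failed: {name} ({error})")
            failed.append(name)
    return failed
```

The returned list of failed file names can be used to move those PDFs into a separate directory and retry them with another conversion method.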

Access to large language model tools

After that, following the process described in our article, prepare the research question manually, then copy and paste it together with the content of a text file into the input box of a large language model such as ChatGPT. In this way, the large language model, rather than a human reader, extracts the information from the literature.
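The copy-and-paste step can be prepared with a small helper that combines the research question and the converted text into one prompt. This is only a sketch under assumptions: the prompt wording, the function name build_prompt and the max_chars limit (a rough guard against over-long inputs) are not taken from the article.

```python
def build_prompt(question, document_text, max_chars=12000):
    """Combine a research question and a document excerpt into one prompt."""
    excerpt = document_text[:max_chars]
    return (
        "Answer the research question using only the document below.\n\n"
        f"Research question: {question}\n\n"
        f"Document:\n{excerpt}"
    )


# Made-up question and text, for illustration only.
print(build_prompt("Which enzymes degrade the mycotoxin?", "Full text of one article ..."))
```

The printed prompt can then be pasted into the input box of the chosen large language model.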

Finally, I sincerely hope that this pipeline accelerates your research process, and I wish you the best of luck in your research.
