A Python package to generate BEL statements and CX2 networks.

textToKnowledgeGraph

A Python package to generate BEL statements and CX2 networks from scientific text.

Project Description

textToKnowledgeGraph is a Python package that converts natural-language scientific text into structured knowledge graphs using large language models (LLMs). It is designed to:

  • Generate BEL statements from scientific text.
  • Extract entities and interactions to build knowledge graphs.
  • Convert interactions into CX2 network format for visualization.
  • Optionally upload the generated networks to NDEx.

Dependencies

  • Python >= 3.11
  • langchain==0.3.13
  • langchain_core==0.3.27
  • langchain_openai==0.2.13
  • lxml==5.2.1
  • ndex2>=3.8.0,<4.0.0
  • pandas
  • pydantic==2.10.4
  • python-dotenv==1.0.1
  • Requests==2.32.3

Glossary

  • BEL (Biological Expression Language): A language for representing scientific findings in a computable format.
  • CX2 (Cytoscape Exchange Format 2): A JSON-based format for storing and exchanging network data.
  • PMCID: A unique identifier for articles in PubMed Central.
  • NDEx: An online resource for sharing, storing, and visualizing biological networks.
  • LangChain: A framework to build applications powered by language models.
  • Cytoscape: An open-source platform for network visualization and analysis.
  • Knowledge Graph: A structured graph representation of entities and relationships.
  • PubTator: A tool for extracting and annotating biomedical entities from scientific literature.
  • OpenAI: An AI research lab providing access to advanced language models.

Installation

Install the package via pip:

pip install textToKnowledgeGraph

Methodology

Features Available

  • BEL Generation: Extracts biological interactions from scientific papers and generates BEL statements.
  • CX2 Network Generation: Converts extracted interactions into CX2 network format for visualization in Cytoscape.
  • Uploading to NDEx: Uploads the generated CX2 networks to NDEx for sharing and visualization.

BEL Generation

Input:

  • The user provides one or more PMC IDs and/or local file paths (PDF or TXT) along with an OpenAI API key.
  • If a PMC ID is provided, the tool fetches the XML version of the scientific paper from PubTator’s API.
  • If a PDF/TXT path is provided, the tool extracts the text using MarkItDown.

Processing:

  • The text is broken down into paragraphs and annotated.
  • The prompt for the model is defined in a prompt file, which directs the LLM on how to extract BEL statements.

Output:

  • The extracted BEL statements are saved in JSON files.
  • The results are converted into a CX2 network format for visualization in Cytoscape.
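
The paragraph-splitting and JSON-output steps above can be sketched as follows. This is a minimal illustration and not the package's own implementation: the function names (`split_paragraphs`, `extract_bel`, `save_statements`), the output file name, and the JSON record layout are all hypothetical, and the LLM call is replaced by a stub so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def split_paragraphs(text: str) -> list[str]:
    """Split raw text into non-empty paragraphs on blank lines."""
    blocks = text.replace("\r\n", "\n").split("\n\n")
    # Collapse internal line breaks and repeated spaces within each paragraph.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def extract_bel(paragraph: str) -> list[str]:
    """Hypothetical stand-in for the LLM call; always returns no statements."""
    return []


def save_statements(paragraphs: list[str], out_path: Path) -> None:
    """Write one JSON record per paragraph with its (stubbed) BEL statements."""
    records = [
        {"index": i, "text": p, "bel_statements": extract_bel(p)}
        for i, p in enumerate(paragraphs)
    ]
    out_path.write_text(json.dumps(records, indent=2), encoding="utf-8")


sample = "TP53 activates CDKN1A.\n\n\nMDM2 inhibits\nTP53.\n"
paragraphs = split_paragraphs(sample)

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "bel_statements.json"  # hypothetical file name
    save_statements(paragraphs, out_file)
    saved = json.loads(out_file.read_text(encoding="utf-8"))
```

Keeping one record per paragraph makes it easy to trace each statement back to the text it came from, though the package's actual JSON schema may differ.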

CX2 Network Generation

Converts extracted interactions into CX2 network format for visualization in Cytoscape.
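
To illustrate the idea, the sketch below turns (source, relation, target) triples into a simplified CX2-style aspect list using plain dictionaries. It is an assumption-laden illustration only: `triples_to_cx2` is a hypothetical helper, the `name` and `interaction` attribute keys are illustrative choices, and the output omits aspects such as metaData, attributeDeclarations, and visual properties, so it is not a complete CX2 file. The package itself may build networks differently, for example through the ndex2 library.

```python
def triples_to_cx2(triples: list[tuple[str, str, str]]) -> list[dict]:
    """Build a simplified CX2-style aspect list from (source, relation, target) triples."""
    node_ids: dict[str, int] = {}
    edges = []
    for source, relation, target in triples:
        # Assign each distinct entity a numeric node id on first sight.
        for name in (source, target):
            node_ids.setdefault(name, len(node_ids))
        edges.append({
            "id": len(edges),
            "s": node_ids[source],   # source node id
            "t": node_ids[target],   # target node id
            "v": {"interaction": relation},
        })
    nodes = [{"id": node_id, "v": {"name": name}} for name, node_id in node_ids.items()]
    return [
        {"CXVersion": "2.0", "hasFragments": False},
        {"nodes": nodes},
        {"edges": edges},
        {"status": [{"success": True}]},
    ]


network = triples_to_cx2([
    ("p(HGNC:MDM2)", "decreases", "p(HGNC:TP53)"),
    ("p(HGNC:TP53)", "increases", "p(HGNC:CDKN1A)"),
])
```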

Uploading to NDEx

Uploads the generated CX2 networks to NDEx for sharing and visualization.
To use this feature, provide your NDEx email and password as arguments.

Usage

The package supports processing multiple documents in one run and requires an OpenAI API key.

Parameters:

  • api_key (required): Your OpenAI API key used for LLM calls.
  • pmc_ids (optional, list): A list of PMC IDs to process.
  • pdf_paths (optional, list): A list of PDF file paths to process.
  • txt_paths (optional, list): A list of TXT file paths to process.
  • ndex_email (optional): NDEx account email for authentication.
  • ndex_password (optional): NDEx account password for authentication.
  • upload_to_ndex (flag): If set, uploads the generated networks to NDEx.
  • style_path (optional): Path to a Cytoscape style JSON file; defaults to the file in the data directory.
  • custom_name (optional): Custom name for the network; defaults to the timestamp.
  • pmid_for_file (optional): A PMID or PMCID to associate with a local file input. When set, the tool fetches metadata for that identifier and uses it to set the network name.
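
As a rough picture of how these parameters map onto command-line flags, here is a minimal argparse sketch. It mirrors the parameter names listed above, but it is not the package's actual parser: `build_parser`, the defaults, and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented parameters."""
    parser = argparse.ArgumentParser(prog="textToKnowledgeGraph")
    parser.add_argument("--api_key", required=True, help="OpenAI API key")
    # List-valued inputs accept zero or more space-separated values.
    parser.add_argument("--pmc_ids", nargs="*", default=[], help="PMC IDs to process")
    parser.add_argument("--pdf_paths", nargs="*", default=[], help="PDF files to process")
    parser.add_argument("--txt_paths", nargs="*", default=[], help="TXT files to process")
    parser.add_argument("--ndex_email", help="NDEx account email")
    parser.add_argument("--ndex_password", help="NDEx account password")
    # A flag: present means True, absent means False.
    parser.add_argument("--upload_to_ndex", action="store_true")
    parser.add_argument("--style_path", help="Cytoscape style JSON file")
    parser.add_argument("--custom_name", help="Custom network name")
    parser.add_argument("--pmid_for_file", help="PMID/PMCID used for metadata lookup")
    return parser


args = build_parser().parse_args([
    "--api_key", "YOUR_OPENAI_API_KEY",
    "--pmc_ids", "PMC123456", "PMC234567",
    "--upload_to_ndex",
])
```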

Network Naming and Metadata

When processing a file, there are three options for naming the generated network:

  1. Automatic Naming (Default): If you provide no custom name or PMID/PMCID, the network name defaults to a timestamp (e.g., 20250304_1537).

  2. Custom Network Name: Provide a custom network name using the --custom_name parameter. The network will use that exact name.

  3. Metadata-Based Naming: If you supply a PMID or PMCID via the --pmid_for_file parameter, the package will:

  • Fetch metadata (title, abstract, and authors) from E-Utilities.
  • Set the network name to the format FirstAuthorLastName et al.: PMID_Number. For example, if the first author is "Wen‐Cheng Lu" and the PMID is 35080342, the network name will be: Lu et al.: 35080342
  • Set additional network attributes:
    • description: The paper's title and abstract combined.
    • reference: Displayed as PMID: 35080342 (with "PMID: " prefixed).

Note: When processing a PMC article (using a PMC ID input), the package automatically fetches metadata from E-Utilities. In this case, you do not need to provide the --pmid_for_file parameter because the metadata-based naming is handled internally.
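
The three naming options can be sketched as a small function. This is an illustration under stated assumptions, not the package's code: `network_name` and `network_attributes` are hypothetical helpers, the precedence of a custom name over metadata when both are given is assumed, the last name is taken as the final whitespace-separated token of the author string, and the title and abstract are simply joined with a space.

```python
from datetime import datetime


def network_name(custom_name=None, first_author=None, pmid=None, now=None) -> str:
    """Pick a network name following the three documented options."""
    if custom_name:
        # Assumption: an explicit custom name wins over metadata-based naming.
        return custom_name
    if first_author and pmid:
        last_name = first_author.split()[-1]
        return f"{last_name} et al.: {pmid}"
    # Default: timestamp such as 20250304_1537.
    return (now or datetime.now()).strftime("%Y%m%d_%H%M")


def network_attributes(title: str, abstract: str, pmid) -> dict:
    """Build the description and reference attributes (joining rule is assumed)."""
    return {
        "description": f"{title} {abstract}",
        "reference": f"PMID: {pmid}",
    }


by_metadata = network_name(first_author="Wen-Cheng Lu", pmid=35080342)
by_default = network_name(now=datetime(2025, 3, 4, 15, 37))
by_custom = network_name(custom_name="My_Network", first_author="Wen-Cheng Lu", pmid=35080342)
attrs = network_attributes("Example title.", "Example abstract.", 35080342)
```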

Command Line Examples

Example A: Processing a PMC Article (PMCID input)

python -m textToKnowledgeGraph.main \
  --api_key YOUR_OPENAI_API_KEY \
  --pmc_ids PMC123456 PMC234567 \
  --pdf_paths /path/to/paper1.pdf /path/to/paper2.pdf \
  --txt_paths /path/to/document1.txt \
  --ndex_email your_email@example.com \
  --ndex_password your_password \
  --upload_to_ndex

Example B: Processing Local Files (PDF/TXT input)

python -m textToKnowledgeGraph.main \
  --api_key YOUR_OPENAI_API_KEY \
  --pmc_ids PMC123456 PMC234567 \
  --pdf_paths /path/to/paper1.pdf /path/to/paper2.pdf \
  --txt_paths /path/to/document1.txt \
  --ndex_email your_email@example.com \
  --ndex_password your_password \
  --upload_to_ndex \
  --custom_name My_Network \
  --pmid_for_file 35080342

Interactive Python Examples

from textToKnowledgeGraph import main

main(
    api_key="YOUR_OPENAI_API_KEY",
    pmc_ids=["PMC123456", "PMC234567"],
    pdf_paths=["/path/to/paper1.pdf", "/path/to/paper2.pdf"],
    txt_paths=["/path/to/document1.txt"],
    ndex_email="your_email@example.com",  # optional
    ndex_password="your_password",        # optional
    style_path=None,                      # uses default style file if None
    upload_to_ndex=False,
    custom_name="My_Network",            # optional
    pmid_for_file=35080342,              # optional
)
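
To avoid hard-coding secrets in scripts, the credentials can be read from environment variables before calling `main`. The sketch below is one way to do that; the helper `build_credentials` and the variable names `OPENAI_API_KEY`, `NDEX_EMAIL`, and `NDEX_PASSWORD` are assumptions for illustration and are not defined by the package.

```python
import os


def build_credentials(env=None) -> dict:
    """Collect API and NDEx credentials from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {
        "api_key": api_key,
        # NDEx credentials are optional, so missing values become None.
        "ndex_email": env.get("NDEX_EMAIL"),
        "ndex_password": env.get("NDEX_PASSWORD"),
    }


credentials = build_credentials({
    "OPENAI_API_KEY": "YOUR_OPENAI_API_KEY",
    "NDEX_EMAIL": "your_email@example.com",
})
```

The resulting dictionary could then be unpacked into the call, for example `main(pmc_ids=["PMC123456"], **credentials)`, assuming `main` accepts these keyword arguments as shown in the example above.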

Expected Output

  • BEL Statements: Extracted from the processed documents.
  • CX2 Networks: Generated for each input and saved in a dedicated results/ directory named after each document or PMC ID.
  • NDEx Upload: If enabled, the networks will be uploaded to NDEx.
  • Example of CX2 network: [Image: CX2 network for paper PMC8354587]

Notes

  • The default style file is located at <repo_root>/data/cx_style.json. You can override this with the --style_path parameter.
  • Make sure your input files exist and that your API key is valid.
  • The package logs detailed processing steps and total runtime for batch processing.
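
A quick pre-flight check can catch missing or mistyped input paths before any LLM calls are made. The following is a minimal sketch; `missing_inputs` is a hypothetical helper and not part of the package.

```python
import tempfile
from pathlib import Path


def missing_inputs(pdf_paths=(), txt_paths=()) -> list[str]:
    """Return a list of problems found with the given input paths."""
    problems = []
    expected = [(p, ".pdf") for p in pdf_paths] + [(p, ".txt") for p in txt_paths]
    for path, suffix in expected:
        candidate = Path(path)
        if not candidate.is_file():
            problems.append(f"{path}: file not found")
        elif candidate.suffix.lower() != suffix:
            problems.append(f"{path}: expected a {suffix} file")
    return problems


# Demonstration with one existing TXT file and one missing PDF.
with tempfile.TemporaryDirectory() as tmp:
    existing_txt = Path(tmp) / "document1.txt"
    existing_txt.write_text("example", encoding="utf-8")
    report = missing_inputs(
        pdf_paths=[str(Path(tmp) / "paper1.pdf")],
        txt_paths=[str(existing_txt)],
    )
```

An empty list means every path exists and has the expected extension.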

python -m nltk.downloader stopwords
