
A Python package for working with the Europe PMC API to search and retrieve scientific literature.


PyEuropePMC


PyEuropePMC is a robust Python toolkit for automated search, extraction, and analysis of scientific literature from Europe PMC.

✨ Key Features

  • ๐Ÿ” Comprehensive Search API - Query Europe PMC with advanced search options
  • ๏ฟฝ Advanced Query Builder - Fluent API for building complex search queries with type safety
  • ๏ฟฝ๐Ÿ“„ Full-Text Retrieval - Download PDFs, XML, and HTML content from open access articles
  • ๐Ÿ”ฌ XML Parsing & Conversion - Parse full text XML and convert to plaintext, markdown, extract tables and metadata
  • ๐Ÿ“Š Multiple Output Formats - JSON, XML, Dublin Core (DC)
  • ๐Ÿ“ฆ Bulk FTP Downloads - Efficient bulk PDF downloads from Europe PMC FTP servers
  • ๐Ÿ”„ Smart Pagination - Automatic handling of large result sets
  • ๐Ÿ›ก๏ธ Robust Error Handling - Built-in retry logic and connection management
  • ๐Ÿง‘โ€๐Ÿ’ป Type Safety - Extensive use of type annotations and validation
  • โšก Rate Limiting - Respectful API usage with configurable delays
  • ๐Ÿงช Extensively Tested - 200+ tests with 90%+ code coverage
  • ๐Ÿ“‹ Systematic Review Tracking - PRISMA-compliant search logging and audit trails
  • ๐Ÿ“ˆ Advanced Analytics - Publication trends, citation analysis, quality metrics, and duplicate detection
  • ๐Ÿ“‰ Rich Visualizations - Interactive plots and dashboards using matplotlib and seaborn
  • ๐Ÿ”— External API Enrichment - Enhance metadata with CrossRef, Unpaywall, Semantic Scholar, and OpenAlex
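The retry logic and rate limiting listed above can be sketched in a few lines of plain Python. This is an illustrative pattern only, not PyEuropePMC's actual implementation; the function and parameter names below are invented for the sketch:

```python
import time

def call_with_retries(fetch, max_retries=3, backoff=0.01):
    """Retry a flaky callable, sleeping with exponential backoff between attempts."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...

# Simulate an endpoint that fails twice, then succeeds
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return {"hitCount": 42}

result = call_with_retries(flaky_fetch)
print(calls["n"], result["hitCount"])  # 3 42
```

The fixed sleep between successful requests (the "configurable delays" feature) is the same idea with a constant interval instead of an exponential one.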

๐Ÿ“ Project Structure

The repository is organized as follows:

  • src/pyeuropepmc/ - Main package source code
  • tests/ - Unit and integration tests
  • docs/ - Documentation and guides
  • examples/ - Example scripts and usage demonstrations
  • benchmarks/ - Performance benchmarking scripts and results
  • data/ - Downloads, outputs, and generated data files
  • conf/ - Configuration files for RDF mapping and other settings

🚀 Quick Start

Installation

pip install pyeuropepmc

Basic Usage

from pyeuropepmc.search import SearchClient

# Search for papers
with SearchClient() as client:
    results = client.search("CRISPR gene editing", pageSize=10)

    for paper in results["resultList"]["result"]:
        print(f"Title: {paper['title']}")
        print(f"Authors: {paper.get('authorString', 'N/A')}")
        print("---")

Advanced Search with QueryBuilder

from pyeuropepmc import QueryBuilder

# Build complex queries with fluent API
qb = QueryBuilder()
query = (qb
    .keyword("cancer", field="title")
    .and_()
    .keyword("immunotherapy")
    .and_()
    .date_range(start_year=2020, end_year=2023)
    .and_()
    .citation_count(min_count=10)
    .build())

print(f"Generated query: {query}")
# Output: (TITLE:cancer) AND immunotherapy AND (PUB_YEAR:[2020 TO 2023]) AND (CITED:[10 TO *])
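For intuition, a fluent builder like this can be implemented by accumulating clauses and joining them at build() time. The toy class below mimics the generated Europe PMC query syntax shown above; it is not the library's actual QueryBuilder:

```python
class ToyQueryBuilder:
    """Minimal fluent builder that joins clauses with AND, mimicking the output above."""
    def __init__(self):
        self.parts = []

    def keyword(self, term, field=None):
        # Field-scoped terms use Europe PMC's FIELD:value syntax
        self.parts.append(f"(TITLE:{term})" if field == "title" else term)
        return self  # returning self is what enables method chaining

    def date_range(self, start_year, end_year):
        self.parts.append(f"(PUB_YEAR:[{start_year} TO {end_year}])")
        return self

    def build(self):
        return " AND ".join(self.parts)

query = (ToyQueryBuilder()
    .keyword("cancer", field="title")
    .keyword("immunotherapy")
    .date_range(2020, 2023)
    .build())
print(query)  # (TITLE:cancer) AND immunotherapy AND (PUB_YEAR:[2020 TO 2023])
```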

Advanced Search with Parsing

# Search and automatically parse results
with SearchClient() as client:
    papers = client.search_and_parse(
        query="COVID-19 AND vaccine",
        pageSize=50,
        sort="CITED desc"
    )

for paper in papers:
    print(f"Citations: {paper.get('citedByCount', 0)}")
    print(f"Title: {paper.get('title', 'N/A')}")
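Under the hood, Europe PMC's REST API pages through large result sets with a cursorMark token, which is what the smart-pagination feature automates. The loop looks roughly like this; the stub below stands in for real HTTP calls and is illustrative only:

```python
def fetch_page(cursor):
    """Stub for one REST request; real responses carry a nextCursorMark token."""
    pages = {
        "*": {"result": ["paper1", "paper2"], "nextCursorMark": "c1"},
        "c1": {"result": ["paper3"], "nextCursorMark": "c1"},  # cursor unchanged: last page
    }
    return pages[cursor]

def fetch_all(fetch):
    """Follow nextCursorMark until it stops advancing (the end-of-results signal)."""
    cursor, papers = "*", []
    while True:
        page = fetch(cursor)
        papers.extend(page["result"])
        if page["nextCursorMark"] == cursor:  # no new cursor: all pages consumed
            break
        cursor = page["nextCursorMark"]
    return papers

papers = fetch_all(fetch_page)
print(papers)  # ['paper1', 'paper2', 'paper3']
```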

Full-Text Content Retrieval

from pyeuropepmc.fulltext import FullTextClient

# Initialize full-text client
fulltext_client = FullTextClient()

# Download PDF
pdf_path = fulltext_client.download_pdf_by_pmcid("PMC1234567", output_dir="./downloads")

# Download XML
xml_content = fulltext_client.download_xml_by_pmcid("PMC1234567")

# Bulk FTP downloads
from pyeuropepmc.ftp_downloader import FTPDownloader

ftp_downloader = FTPDownloader()
results = ftp_downloader.bulk_download_and_extract(
    pmcids=["1234567", "2345678"],
    output_dir="./bulk_downloads"
)

Full-Text XML Parsing

Parse full text XML files and extract structured information:

from pyeuropepmc import FullTextClient, FullTextXMLParser

# Download and parse XML
with FullTextClient() as client:
    xml_path = client.download_xml_by_pmcid("PMC3258128")

# Parse the XML
with open(xml_path, 'r', encoding='utf-8') as f:
    parser = FullTextXMLParser(f.read())

# Extract metadata
metadata = parser.extract_metadata()
print(f"Title: {metadata['title']}")
print(f"Authors: {', '.join(metadata['authors'])}")

# Convert to different formats
plaintext = parser.to_plaintext()  # Plain text
markdown = parser.to_markdown()     # Markdown format

# Extract tables
tables = parser.extract_tables()
for table in tables:
    print(f"Table: {table['label']} - {len(table['rows'])} rows")

# Extract references
references = parser.extract_references()
print(f"Found {len(references)} references")

Advanced Analytics and Visualization

Analyze search results with built-in analytics and create visualizations:

from pyeuropepmc import (
    SearchClient,
    to_dataframe,
    citation_statistics,
    quality_metrics,
    remove_duplicates,
    plot_publication_years,
    create_summary_dashboard,
)

# Search and convert to DataFrame
with SearchClient() as client:
    response = client.search("machine learning", pageSize=100)
    papers = response.get("resultList", {}).get("result", [])

# Convert to pandas DataFrame for analysis
df = to_dataframe(papers)

# Remove duplicates
df = remove_duplicates(df, method="title", keep="most_cited")

# Get citation statistics
stats = citation_statistics(df)
print(f"Mean citations: {stats['mean_citations']:.2f}")
print(f"Highly cited (top 10%): {stats['citation_distribution']['90th_percentile']:.0f}")

# Assess quality metrics
metrics = quality_metrics(df)
print(f"Open access: {metrics['open_access_percentage']:.1f}%")
print(f"With PDF: {metrics['with_pdf_percentage']:.1f}%")

# Create visualizations
plot_publication_years(df, save_path="publications_by_year.png")
create_summary_dashboard(df, save_path="analysis_dashboard.png")

External API Enrichment

Enhance paper metadata with data from CrossRef, Unpaywall, Semantic Scholar, and OpenAlex:

from pyeuropepmc import PaperEnricher, EnrichmentConfig

# Configure enrichment with multiple APIs
config = EnrichmentConfig(
    enable_crossref=True,
    enable_semantic_scholar=True,
    enable_openalex=True,
    enable_unpaywall=True,
    unpaywall_email="your@email.com"  # Required for Unpaywall
)

# Enrich paper metadata
with PaperEnricher(config) as enricher:
    result = enricher.enrich_paper(doi="10.1371/journal.pone.0308090")

    # Access merged data from all sources
    merged = result["merged"]
    print(f"Title: {merged['title']}")
    print(f"Citations: {merged['citation_count']}")
    print(f"Open Access: {merged['is_oa']}")

    # Access individual source data
    if "crossref" in result["sources"]:
        print(f"Funders: {result['crossref']['funders']}")

    if "semantic_scholar" in result["sources"]:
        print(f"Influential Citations: {result['semantic_scholar']['influential_citation_count']}")

Features:

  • 🔄 Automatic data merging from multiple sources
  • 📊 Citation metrics from multiple databases
  • 🔓 Open access status and full-text URLs
  • 💰 Funding information
  • 🏷️ Topic classifications and fields of study
  • ⚡ Optional caching for performance
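The optional caching can be pictured as memoizing enrichment lookups by DOI, so repeated queries for the same paper don't hit the external APIs again. A stdlib-only sketch of the idea (the real PaperEnricher's cache configuration is not shown here, and `enrich_cached` is an invented name):

```python
from functools import lru_cache

api_calls = {"n": 0}

@lru_cache(maxsize=1024)
def enrich_cached(doi):
    """Stand-in for one round of external API lookups, memoized per DOI."""
    api_calls["n"] += 1  # count how often we actually hit the APIs
    return f"metadata for {doi}"

enrich_cached("10.1371/journal.pone.0308090")
enrich_cached("10.1371/journal.pone.0308090")  # served from cache, no second API hit
print(api_calls["n"])  # 1
```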

See examples/09-enrichment for more details.

Knowledge Graph Structure Options 🕸️

PyEuropePMC supports flexible knowledge graph structures for different use cases:

from pyeuropepmc.mappers import RDFMapper

mapper = RDFMapper()

# Metadata-only KG (for citation networks and bibliometrics)
metadata_graphs = mapper.save_metadata_rdf(
    entities_data,
    output_dir="rdf_output"
)  # Papers + authors + institutions

# Content-only KG (for text analysis and document processing)
content_graphs = mapper.save_content_rdf(
    entities_data,
    output_dir="rdf_output"
)  # Papers + sections + references + tables

# Complete KG (for comprehensive analysis)
complete_graphs = mapper.save_complete_rdf(
    entities_data,
    output_dir="rdf_output"
)  # All entities and relationships

# Use configured default from conf/rdf_map.yml
graphs = mapper.save_rdf(entities_data, output_dir="rdf_output")

Use Cases:

  • 📊 Citation Networks: Use metadata-only KGs for bibliometric analysis
  • 📝 Text Mining: Use content-only KGs for NLP and information extraction
  • 🔬 Full Analysis: Use complete KGs for comprehensive research workflows
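Whichever structure you pick, the mapper writes standard RDF files that you can sanity-check without extra dependencies. A minimal triple count over N-Triples text, shown here on an inline sample rather than a real file from rdf_output:

```python
# Inline sample standing in for a serialized RDF output file (N-Triples format)
sample_nt = """\
<http://example.org/paper/1> <http://purl.org/dc/terms/title> "A sample paper" .
<http://example.org/paper/1> <http://purl.org/dc/terms/creator> <http://example.org/author/1> .
<http://example.org/author/1> <http://xmlns.com/foaf/0.1/name> "Jane Doe" .
"""

def count_triples(nt_text):
    """Count statements in N-Triples text: one triple per non-empty, non-comment line."""
    lines = [ln for ln in nt_text.splitlines() if ln.strip() and not ln.startswith("#")]
    return len(lines)

# Distinct subjects give a rough entity count (first token of each line)
subjects = {ln.split()[0] for ln in sample_nt.splitlines() if ln.strip()}
print(count_triples(sample_nt), len(subjects))  # 3 2
```

For anything beyond a sanity check (SPARQL queries, graph merging), a dedicated RDF library such as rdflib is the better tool.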

See examples/kg_structure_demo.py for a complete working example.

Unified Processing Pipeline 🏗️

The unified pipeline collapses the multi-step workflow of XML parsing → enrichment → RDF conversion into a single call:

from pyeuropepmc import PaperProcessingPipeline, PipelineConfig

# Simple configuration
config = PipelineConfig(
    enable_enrichment=True,      # Enable metadata enrichment
    enable_crossref=True,        # CrossRef API
    enable_semantic_scholar=True, # Semantic Scholar API
    enable_openalex=True,        # OpenAlex API
    enable_ror=True,             # ROR institution data
    crossref_email="your@email.com",  # Required for higher CrossRef rate limits
    output_format="turtle",      # RDF output format
    output_dir="output"          # Where to save RDF files
)

# Create unified pipeline
pipeline = PaperProcessingPipeline(config)

# Process single paper - replaces 8+ separate steps!
result = pipeline.process_paper(
    xml_content=xml_string,
    doi="10.1038/nature11476",
    save_rdf=True
)

print(f"Generated {result['triple_count']} RDF triples")
print(f"Output saved to: {result['output_file']}")

# Process multiple papers in batch
xml_contents = {
    "10.1038/nature11476": xml_content_1,
    "10.1038/nature11477": xml_content_2,
}

batch_results = pipeline.process_papers(xml_contents)
for doi, result in batch_results.items():
    print(f"{doi}: {result['triple_count']} triples")

What it does automatically:

  • ✅ Parses XML and extracts entities (paper, authors, sections, tables, figures, references)
  • ✅ Enriches metadata from external APIs (citations, fields of study, etc.)
  • ✅ Converts everything to RDF with proper relationships
  • ✅ Saves structured output files
  • ✅ Handles errors gracefully

Before vs After:

# OLD: Complex multi-step workflow (8+ steps)
parser = FullTextXMLParser()
parser.parse(xml_content)
paper, authors, sections, tables, figures, references = build_paper_entities(parser)
enricher = PaperEnricher(config)
enrichment_data = enricher.enrich_paper(doi)
rdf_mapper = RDFMapper()
paper.to_rdf(graph, related_entities=...)
rdf_mapper.serialize_graph(graph, format='turtle')

# NEW: Single pipeline call (3 steps)
config = PipelineConfig(...)
pipeline = PaperProcessingPipeline(config)
result = pipeline.process_paper(xml_content, doi=doi)

See examples/pipeline_demo.py for a complete working example.

📚 Documentation

📖 Read the Full Documentation ← Start Here!


Note: Enable GitHub Pages first! See Setup Guide for instructions.

📊 Performance

Benchmarks run weekly on Monday at 02:00 UTC. Last updated: Pending first run

Metric                 Value
Total Requests         Pending
Average Response Time  Pending
Success Rate           Pending

Benchmark results will be automatically updated weekly by GitHub Actions.

๐Ÿค Contributing

We welcome contributions! See our Contributing Guide for details.

📄 License

Distributed under the MIT License. See LICENSE for more information.

๐ŸŒ Links

  • 📖 Documentation: GitHub Pages - Full documentation site
  • 📦 PyPI Package: pyeuropepmc - Install with pip
  • 💻 GitHub Repository: pyEuropePMC - Source code
  • 🐛 Issue Tracker: GitHub Issues - Report bugs or request features


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyeuropepmc-1.14.0.tar.gz (239.3 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyeuropepmc-1.14.0-py3-none-any.whl (285.7 kB)


File details

Details for the file pyeuropepmc-1.14.0.tar.gz.

File metadata

  • Download URL: pyeuropepmc-1.14.0.tar.gz
  • Upload date:
  • Size: 239.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.10.19 Linux/6.11.0-1018-azure

File hashes

Hashes for pyeuropepmc-1.14.0.tar.gz
Algorithm    Hash digest
SHA256       a1c3ced9a63a85657b8b94c6d729d4d16975f51933acbbf0e8539596cacc81b6
MD5          8c67733c54db2800e89aeeff748b804f
BLAKE2b-256  10faae11500a46d3808c896448617865add4e9f26be6fdbc6e392436366e3603


File details

Details for the file pyeuropepmc-1.14.0-py3-none-any.whl.

File metadata

  • Download URL: pyeuropepmc-1.14.0-py3-none-any.whl
  • Upload date:
  • Size: 285.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.10.19 Linux/6.11.0-1018-azure

File hashes

Hashes for pyeuropepmc-1.14.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       c33887681fd9842cf13a89d8e5c2b5377e4742bccae1bcf8a2047420ad32e4e7
MD5          5fde6b196203b281014b6a72fb104323
BLAKE2b-256  551f938550d4b228566be1011ea6e0e8dec40cd6c86be3ffb1d8ad50c5cb4579

