
A Python package for working with the Europe PMC API to search and retrieve scientific literature.


PyEuropePMC


PyEuropePMC is a robust Python toolkit for automated search, extraction, and analysis of scientific literature from Europe PMC.

✨ Key Features

  • ๐Ÿ” Comprehensive Search API - Query Europe PMC with advanced search options
  • ๏ฟฝ Advanced Query Builder - Fluent API for building complex search queries with type safety
  • ๏ฟฝ๐Ÿ“„ Full-Text Retrieval - Download PDFs, XML, and HTML content from open access articles
  • ๐Ÿ”ฌ XML Parsing & Conversion - Parse full text XML and convert to plaintext, markdown, extract tables and metadata
  • ๐Ÿ“Š Multiple Output Formats - JSON, XML, Dublin Core (DC)
  • ๐Ÿ“ฆ Bulk FTP Downloads - Efficient bulk PDF downloads from Europe PMC FTP servers
  • ๐Ÿ”„ Smart Pagination - Automatic handling of large result sets
  • ๐Ÿ›ก๏ธ Robust Error Handling - Built-in retry logic and connection management
  • ๐Ÿง‘โ€๐Ÿ’ป Type Safety - Extensive use of type annotations and validation
  • โšก Rate Limiting - Respectful API usage with configurable delays
  • ๐Ÿงช Extensively Tested - 200+ tests with 90%+ code coverage
  • ๐Ÿ“‹ Systematic Review Tracking - PRISMA-compliant search logging and audit trails
  • ๐Ÿ“ˆ Advanced Analytics - Publication trends, citation analysis, quality metrics, and duplicate detection
  • ๐Ÿ“‰ Rich Visualizations - Interactive plots and dashboards using matplotlib and seaborn
  • ๐Ÿ”— External API Enrichment - Enhance metadata with CrossRef, Unpaywall, Semantic Scholar, and OpenAlex
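The retry logic and rate limiting listed above can be sketched in a few lines of plain Python. This is an illustrative pattern only, not PyEuropePMC's actual implementation; the function and parameter names below are invented for the sketch:

```python
import time

def call_with_retries(fetch, max_retries=3, backoff=0.01):
    """Retry a flaky callable, sleeping with exponential backoff between attempts."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...

# Simulate an endpoint that fails twice, then succeeds
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return {"hitCount": 42}

result = call_with_retries(flaky_fetch)
print(calls["n"], result["hitCount"])  # 3 42
```

The fixed sleep between successful requests (the "configurable delays" feature) is the same idea with a constant interval instead of an exponential one.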

๐Ÿ“ Project Structure

The repository is organized as follows:

  • src/pyeuropepmc/ - Main package source code
  • tests/ - Unit and integration tests
  • docs/ - Documentation and guides
  • examples/ - Example scripts and usage demonstrations
  • benchmarks/ - Performance benchmarking scripts and results
  • data/ - Downloads, outputs, and generated data files
  • conf/ - Configuration files for RDF mapping and other settings

🚀 Quick Start

Installation

pip install pyeuropepmc

Basic Usage

from pyeuropepmc.search import SearchClient

# Search for papers
with SearchClient() as client:
    results = client.search("CRISPR gene editing", pageSize=10)

    for paper in results["resultList"]["result"]:
        print(f"Title: {paper['title']}")
        print(f"Authors: {paper.get('authorString', 'N/A')}")
        print("---")

Advanced Search with QueryBuilder

from pyeuropepmc import QueryBuilder

# Build complex queries with fluent API
qb = QueryBuilder()
query = (qb
    .keyword("cancer", field="title")
    .and_()
    .keyword("immunotherapy")
    .and_()
    .date_range(start_year=2020, end_year=2023)
    .and_()
    .citation_count(min_count=10)
    .build())

print(f"Generated query: {query}")
# Output: (TITLE:cancer) AND immunotherapy AND (PUB_YEAR:[2020 TO 2023]) AND (CITED:[10 TO *])
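For intuition, a fluent builder like this can be implemented by accumulating clauses and joining them at build() time. The toy class below mimics the generated Europe PMC query syntax shown above; it is not the library's actual QueryBuilder:

```python
class ToyQueryBuilder:
    """Minimal fluent builder that joins clauses with AND, mimicking the output above."""
    def __init__(self):
        self.parts = []

    def keyword(self, term, field=None):
        # Field-scoped terms use Europe PMC's FIELD:value syntax
        self.parts.append(f"(TITLE:{term})" if field == "title" else term)
        return self  # returning self is what enables method chaining

    def date_range(self, start_year, end_year):
        self.parts.append(f"(PUB_YEAR:[{start_year} TO {end_year}])")
        return self

    def build(self):
        return " AND ".join(self.parts)

query = (ToyQueryBuilder()
    .keyword("cancer", field="title")
    .keyword("immunotherapy")
    .date_range(2020, 2023)
    .build())
print(query)  # (TITLE:cancer) AND immunotherapy AND (PUB_YEAR:[2020 TO 2023])
```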

Advanced Search with Parsing

# Search and automatically parse results
with SearchClient() as client:
    papers = client.search_and_parse(
        query="COVID-19 AND vaccine",
        pageSize=50,
        sort="CITED desc"
    )

for paper in papers:
    print(f"Citations: {paper.get('citedByCount', 0)}")
    print(f"Title: {paper.get('title', 'N/A')}")
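Under the hood, Europe PMC's REST API pages through large result sets with a cursorMark token, which is what the smart-pagination feature automates. The loop looks roughly like this; the stub below stands in for real HTTP calls and is illustrative only:

```python
def fetch_page(cursor):
    """Stub for one REST request; real responses carry a nextCursorMark token."""
    pages = {
        "*": {"result": ["paper1", "paper2"], "nextCursorMark": "c1"},
        "c1": {"result": ["paper3"], "nextCursorMark": "c1"},  # cursor unchanged: last page
    }
    return pages[cursor]

def fetch_all(fetch):
    """Follow nextCursorMark until it stops advancing (the end-of-results signal)."""
    cursor, papers = "*", []
    while True:
        page = fetch(cursor)
        papers.extend(page["result"])
        if page["nextCursorMark"] == cursor:  # no new cursor: all pages consumed
            break
        cursor = page["nextCursorMark"]
    return papers

papers = fetch_all(fetch_page)
print(papers)  # ['paper1', 'paper2', 'paper3']
```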

Full-Text Content Retrieval

from pyeuropepmc.fulltext import FullTextClient

# Initialize full-text client
fulltext_client = FullTextClient()

# Download PDF
pdf_path = fulltext_client.download_pdf_by_pmcid("PMC1234567", output_dir="./downloads")

# Download XML
xml_content = fulltext_client.download_xml_by_pmcid("PMC1234567")

# Bulk FTP downloads
from pyeuropepmc.ftp_downloader import FTPDownloader

ftp_downloader = FTPDownloader()
results = ftp_downloader.bulk_download_and_extract(
    pmcids=["1234567", "2345678"],
    output_dir="./bulk_downloads"
)

Full-Text XML Parsing

Parse full text XML files and extract structured information:

from pyeuropepmc import FullTextClient, FullTextXMLParser

# Download and parse XML
with FullTextClient() as client:
    xml_path = client.download_xml_by_pmcid("PMC3258128")

# Parse the XML
with open(xml_path, 'r', encoding='utf-8') as f:
    parser = FullTextXMLParser(f.read())

# Extract metadata
metadata = parser.extract_metadata()
print(f"Title: {metadata['title']}")
print(f"Authors: {', '.join(metadata['authors'])}")

# Convert to different formats
plaintext = parser.to_plaintext()  # Plain text
markdown = parser.to_markdown()     # Markdown format

# Extract tables
tables = parser.extract_tables()
for table in tables:
    print(f"Table: {table['label']} - {len(table['rows'])} rows")

# Extract references
references = parser.extract_references()
print(f"Found {len(references)} references")

Advanced Analytics and Visualization

Analyze search results with built-in analytics and create visualizations:

from pyeuropepmc import (
    SearchClient,
    to_dataframe,
    citation_statistics,
    quality_metrics,
    remove_duplicates,
    plot_publication_years,
    create_summary_dashboard,
)

# Search and convert to DataFrame
with SearchClient() as client:
    response = client.search("machine learning", pageSize=100)
    papers = response.get("resultList", {}).get("result", [])

# Convert to pandas DataFrame for analysis
df = to_dataframe(papers)

# Remove duplicates
df = remove_duplicates(df, method="title", keep="most_cited")

# Get citation statistics
stats = citation_statistics(df)
print(f"Mean citations: {stats['mean_citations']:.2f}")
print(f"Highly cited (top 10%): {stats['citation_distribution']['90th_percentile']:.0f}")

# Assess quality metrics
metrics = quality_metrics(df)
print(f"Open access: {metrics['open_access_percentage']:.1f}%")
print(f"With PDF: {metrics['with_pdf_percentage']:.1f}%")

# Create visualizations
plot_publication_years(df, save_path="publications_by_year.png")
create_summary_dashboard(df, save_path="analysis_dashboard.png")

External API Enrichment

Enhance paper metadata with data from CrossRef, Unpaywall, Semantic Scholar, and OpenAlex:

from pyeuropepmc import PaperEnricher, EnrichmentConfig

# Configure enrichment with multiple APIs
config = EnrichmentConfig(
    enable_crossref=True,
    enable_semantic_scholar=True,
    enable_openalex=True,
    enable_unpaywall=True,
    unpaywall_email="your@email.com"  # Required for Unpaywall
)

# Enrich paper metadata
with PaperEnricher(config) as enricher:
    result = enricher.enrich_paper(doi="10.1371/journal.pone.0308090")

    # Access merged data from all sources
    merged = result["merged"]
    print(f"Title: {merged['title']}")
    print(f"Citations: {merged['citation_count']}")
    print(f"Open Access: {merged['is_oa']}")

    # Access individual source data
    if "crossref" in result["sources"]:
        print(f"Funders: {result['crossref']['funders']}")

    if "semantic_scholar" in result["sources"]:
        print(f"Influential Citations: {result['semantic_scholar']['influential_citation_count']}")

Features:

  • 🔄 Automatic data merging from multiple sources
  • 📊 Citation metrics from multiple databases
  • 🔓 Open access status and full-text URLs
  • 💰 Funding information
  • 🏷️ Topic classifications and fields of study
  • ⚡ Optional caching for performance
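The optional caching can be pictured as memoizing enrichment lookups by DOI, so repeated queries for the same paper don't hit the external APIs again. A stdlib-only sketch of the idea (the real PaperEnricher's cache configuration is not shown here, and `enrich_cached` is an invented name):

```python
from functools import lru_cache

api_calls = {"n": 0}

@lru_cache(maxsize=1024)
def enrich_cached(doi):
    """Stand-in for one round of external API lookups, memoized per DOI."""
    api_calls["n"] += 1  # count how often we actually hit the APIs
    return f"metadata for {doi}"

enrich_cached("10.1371/journal.pone.0308090")
enrich_cached("10.1371/journal.pone.0308090")  # served from cache, no second API hit
print(api_calls["n"])  # 1
```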

See examples/09-enrichment for more details.

Knowledge Graph Structure Options 🕸️

PyEuropePMC supports flexible knowledge graph structures for different use cases:

from pyeuropepmc.mappers import RDFMapper

mapper = RDFMapper()

# Metadata-only KG (for citation networks and bibliometrics)
metadata_graphs = mapper.save_metadata_rdf(
    entities_data,
    output_dir="rdf_output"
)  # Papers + authors + institutions

# Content-only KG (for text analysis and document processing)
content_graphs = mapper.save_content_rdf(
    entities_data,
    output_dir="rdf_output"
)  # Papers + sections + references + tables

# Complete KG (for comprehensive analysis)
complete_graphs = mapper.save_complete_rdf(
    entities_data,
    output_dir="rdf_output"
)  # All entities and relationships

# Use configured default from conf/rdf_map.yml
graphs = mapper.save_rdf(entities_data, output_dir="rdf_output")

Use Cases:

  • 📊 Citation Networks: Use metadata-only KGs for bibliometric analysis
  • 📝 Text Mining: Use content-only KGs for NLP and information extraction
  • 🔬 Full Analysis: Use complete KGs for comprehensive research workflows
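Whichever structure you pick, the mapper writes standard RDF files that you can sanity-check without extra dependencies. A minimal triple count over N-Triples text, shown here on an inline sample rather than a real file from rdf_output:

```python
# Inline sample standing in for a serialized RDF output file (N-Triples format)
sample_nt = """\
<http://example.org/paper/1> <http://purl.org/dc/terms/title> "A sample paper" .
<http://example.org/paper/1> <http://purl.org/dc/terms/creator> <http://example.org/author/1> .
<http://example.org/author/1> <http://xmlns.com/foaf/0.1/name> "Jane Doe" .
"""

def count_triples(nt_text):
    """Count statements in N-Triples text: one triple per non-empty, non-comment line."""
    lines = [ln for ln in nt_text.splitlines() if ln.strip() and not ln.startswith("#")]
    return len(lines)

# Distinct subjects give a rough entity count (first token of each line)
subjects = {ln.split()[0] for ln in sample_nt.splitlines() if ln.strip()}
print(count_triples(sample_nt), len(subjects))  # 3 2
```

For anything beyond a sanity check (SPARQL queries, graph merging), a dedicated RDF library such as rdflib is the better tool.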

See examples/kg_structure_demo.py for a complete working example.

Unified Processing Pipeline 🏗️

The unified pipeline collapses the multi-step workflow of XML parsing → enrichment → RDF conversion into a single call:

from pyeuropepmc import PaperProcessingPipeline, PipelineConfig

# Simple configuration
config = PipelineConfig(
    enable_enrichment=True,      # Enable metadata enrichment
    enable_crossref=True,        # CrossRef API
    enable_semantic_scholar=True, # Semantic Scholar API
    enable_openalex=True,        # OpenAlex API
    enable_ror=True,             # ROR institution data
    crossref_email="your@email.com",  # Required for higher CrossRef rate limits
    output_format="turtle",      # RDF output format
    output_dir="output"          # Where to save RDF files
)

# Create unified pipeline
pipeline = PaperProcessingPipeline(config)

# Process single paper - replaces 8+ separate steps!
result = pipeline.process_paper(
    xml_content=xml_string,
    doi="10.1038/nature11476",
    save_rdf=True
)

print(f"Generated {result['triple_count']} RDF triples")
print(f"Output saved to: {result['output_file']}")

# Process multiple papers in batch
xml_contents = {
    "10.1038/nature11476": xml_content_1,
    "10.1038/nature11477": xml_content_2,
}

batch_results = pipeline.process_papers(xml_contents)
for doi, result in batch_results.items():
    print(f"{doi}: {result['triple_count']} triples")

What it does automatically:

  • ✅ Parses XML and extracts entities (paper, authors, sections, tables, figures, references)
  • ✅ Enriches metadata from external APIs (citations, fields of study, etc.)
  • ✅ Converts everything to RDF with proper relationships
  • ✅ Saves structured output files
  • ✅ Handles errors gracefully

Before vs After:

# OLD: Complex multi-step workflow (8+ steps)
parser = FullTextXMLParser()
parser.parse(xml_content)
paper, authors, sections, tables, figures, references = build_paper_entities(parser)
enricher = PaperEnricher(config)
enrichment_data = enricher.enrich_paper(doi)
rdf_mapper = RDFMapper()
paper.to_rdf(graph, related_entities=...)
rdf_mapper.serialize_graph(graph, format='turtle')

# NEW: Single pipeline call (3 steps)
config = PipelineConfig(...)
pipeline = PaperProcessingPipeline(config)
result = pipeline.process_paper(xml_content, doi=doi)

See examples/pipeline_demo.py for a complete working example.

📚 Documentation

📖 Read the Full Documentation ← Start Here!


Note: Enable GitHub Pages first! See Setup Guide for instructions.

📊 Performance

Benchmarks run weekly on Monday at 02:00 UTC. Last updated: Pending first run

Metric                 Value
Total Requests         Pending
Average Response Time  Pending
Success Rate           Pending

Benchmark results will be automatically updated weekly by GitHub Actions.

๐Ÿค Contributing

We welcome contributions! See our Contributing Guide for details.

📄 License

Distributed under the MIT License. See LICENSE for more information.

๐ŸŒ Links

  • 📖 Documentation: GitHub Pages - Full documentation site
  • 📦 PyPI Package: pyeuropepmc - Install with pip
  • 💻 GitHub Repository: pyEuropePMC - Source code
  • 🐛 Issue Tracker: GitHub Issues - Report bugs or request features


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyeuropepmc-1.14.0.tar.gz (239.3 kB)


Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pyeuropepmc-1.14.0-py3-none-any.whl (285.7 kB)


File details

Details for the file pyeuropepmc-1.14.0.tar.gz.

File metadata

  • Download URL: pyeuropepmc-1.14.0.tar.gz
  • Upload date:
  • Size: 239.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.10.19 Linux/6.11.0-1018-azure

File hashes

Hashes for pyeuropepmc-1.14.0.tar.gz
Algorithm    Hash digest
SHA256       a1c3ced9a63a85657b8b94c6d729d4d16975f51933acbbf0e8539596cacc81b6
MD5          8c67733c54db2800e89aeeff748b804f
BLAKE2b-256  10faae11500a46d3808c896448617865add4e9f26be6fdbc6e392436366e3603


File details

Details for the file pyeuropepmc-1.14.0-py3-none-any.whl.

File metadata

  • Download URL: pyeuropepmc-1.14.0-py3-none-any.whl
  • Upload date:
  • Size: 285.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/2.2.1 CPython/3.10.19 Linux/6.11.0-1018-azure

File hashes

Hashes for pyeuropepmc-1.14.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       c33887681fd9842cf13a89d8e5c2b5377e4742bccae1bcf8a2047420ad32e4e7
MD5          5fde6b196203b281014b6a72fb104323
BLAKE2b-256  551f938550d4b228566be1011ea6e0e8dec40cd6c86be3ffb1d8ad50c5cb4579

