# PyEuropePMC

A Python package for working with the Europe PMC API to search and retrieve scientific literature.
PyEuropePMC is a robust Python toolkit for automated search, extraction, and analysis of scientific literature from Europe PMC.
## Key Features
- Comprehensive Search API - Query Europe PMC with advanced search options
- Advanced Query Builder - Fluent API for building complex search queries with type safety
- Full-Text Retrieval - Download PDFs, XML, and HTML content from open access articles
- XML Parsing & Conversion - Parse full-text XML, convert it to plaintext or Markdown, and extract tables and metadata
- Multiple Output Formats - JSON, XML, Dublin Core (DC)
- Bulk FTP Downloads - Efficient bulk PDF downloads from Europe PMC FTP servers
- Smart Pagination - Automatic handling of large result sets
- Robust Error Handling - Built-in retry logic and connection management
- Type Safety - Extensive use of type annotations and validation
- Rate Limiting - Respectful API usage with configurable delays
- Extensively Tested - 200+ tests with 90%+ code coverage
- Systematic Review Tracking - PRISMA-compliant search logging and audit trails
- Advanced Analytics - Publication trends, citation analysis, quality metrics, and duplicate detection
- Rich Visualizations - Interactive plots and dashboards using matplotlib and seaborn
- External API Enrichment - Enhance metadata with CrossRef, Unpaywall, Semantic Scholar, and OpenAlex
## Project Structure

The repository is organized as follows:

- `src/pyeuropepmc/` - Main package source code
- `tests/` - Unit and integration tests
- `docs/` - Documentation and guides
- `examples/` - Example scripts and usage demonstrations
- `benchmarks/` - Performance benchmarking scripts and results
- `data/` - Downloads, outputs, and generated data files
- `conf/` - Configuration files for RDF mapping and other settings
## Quick Start

### Installation

```bash
pip install pyeuropepmc
```

### Basic Usage

```python
from pyeuropepmc.search import SearchClient

# Search for papers
with SearchClient() as client:
    results = client.search("CRISPR gene editing", pageSize=10)
    for paper in results["resultList"]["result"]:
        print(f"Title: {paper['title']}")
        print(f"Authors: {paper.get('authorString', 'N/A')}")
        print("---")
```
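The smart-pagination feature handles large result sets for you, but the underlying idea is easy to reproduce by hand. The sketch below is a hypothetical helper (`iter_all_results` and `fake_fetch` are not part of the pyeuropepmc API): it pages through any fetch callable that returns Europe PMC-style response dicts, stopping when a short page signals the end of the results.

```python
from typing import Callable, Iterator

def iter_all_results(fetch_page: Callable[[int, int], dict],
                     page_size: int = 100) -> Iterator[dict]:
    """Yield every result across pages until a short or empty page appears.

    `fetch_page(page, page_size)` is any callable returning a dict with
    results nested under resultList -> result, as Europe PMC responses are.
    """
    page = 1
    while True:
        response = fetch_page(page, page_size)
        results = response.get("resultList", {}).get("result", [])
        yield from results
        if len(results) < page_size:  # last (possibly partial) page reached
            break
        page += 1

# Stubbed fetcher standing in for a real API call, so the sketch is runnable
def fake_fetch(page: int, page_size: int) -> dict:
    data = [{"id": str(i)} for i in range(5)]
    chunk = data[(page - 1) * page_size : page * page_size]
    return {"resultList": {"result": chunk}}

papers = list(iter_all_results(fake_fetch, page_size=2))
print(len(papers))  # 5
```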
### Advanced Search with QueryBuilder

```python
from pyeuropepmc import QueryBuilder

# Build complex queries with a fluent API
qb = QueryBuilder()
query = (
    qb.keyword("cancer", field="title")
    .and_()
    .keyword("immunotherapy")
    .and_()
    .date_range(start_year=2020, end_year=2023)
    .and_()
    .citation_count(min_count=10)
    .build()
)

print(f"Generated query: {query}")
# Output: (TITLE:cancer) AND immunotherapy AND (PUB_YEAR:[2020 TO 2023]) AND (CITED:[10 TO *])
```
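If you prefer working with Europe PMC's raw field syntax directly, the same query string can also be assembled by hand. This hypothetical helper (`build_query` is not a library function) mirrors the output shown above:

```python
def build_query(title_term: str, keyword: str,
                start_year: int, end_year: int, min_citations: int) -> str:
    """Assemble a Europe PMC query string using the raw field syntax."""
    parts = [
        f"(TITLE:{title_term})",
        keyword,
        f"(PUB_YEAR:[{start_year} TO {end_year}])",
        f"(CITED:[{min_citations} TO *])",
    ]
    return " AND ".join(parts)

query = build_query("cancer", "immunotherapy", 2020, 2023, 10)
print(query)
# (TITLE:cancer) AND immunotherapy AND (PUB_YEAR:[2020 TO 2023]) AND (CITED:[10 TO *])
```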
### Advanced Search with Parsing

```python
# Search and automatically parse results
papers = client.search_and_parse(
    query="COVID-19 AND vaccine",
    pageSize=50,
    sort="CITED desc",
)

for paper in papers:
    print(f"Citations: {paper.get('citedByCount', 0)}")
    print(f"Title: {paper.get('title', 'N/A')}")
```
### Full-Text Content Retrieval

```python
from pyeuropepmc.fulltext import FullTextClient
from pyeuropepmc.ftp_downloader import FTPDownloader

# Initialize the full-text client
fulltext_client = FullTextClient()

# Download a PDF
pdf_path = fulltext_client.download_pdf_by_pmcid("PMC1234567", output_dir="./downloads")

# Download XML
xml_content = fulltext_client.download_xml_by_pmcid("PMC1234567")

# Bulk FTP downloads
ftp_downloader = FTPDownloader()
results = ftp_downloader.bulk_download_and_extract(
    pmcids=["1234567", "2345678"],
    output_dir="./bulk_downloads",
)
```
### Full-Text XML Parsing

Parse full-text XML files and extract structured information:

```python
from pyeuropepmc import FullTextClient, FullTextXMLParser

# Download the XML
with FullTextClient() as client:
    xml_path = client.download_xml_by_pmcid("PMC3258128")

# Parse the XML
with open(xml_path, "r") as f:
    parser = FullTextXMLParser(f.read())

# Extract metadata
metadata = parser.extract_metadata()
print(f"Title: {metadata['title']}")
print(f"Authors: {', '.join(metadata['authors'])}")

# Convert to different formats
plaintext = parser.to_plaintext()  # Plain text
markdown = parser.to_markdown()    # Markdown format

# Extract tables
tables = parser.extract_tables()
for table in tables:
    print(f"Table: {table['label']} - {len(table['rows'])} rows")

# Extract references
references = parser.extract_references()
print(f"Found {len(references)} references")
```
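Europe PMC delivers full text as JATS XML, so the structure FullTextXMLParser works with can also be inspected directly with the standard library. A minimal sketch on an inline JATS-style fragment (the fragment itself is illustrative):

```python
import xml.etree.ElementTree as ET

# A tiny JATS-style fragment standing in for a downloaded full-text file
JATS = """<article>
  <front><article-meta>
    <title-group><article-title>Example Paper</article-title></title-group>
    <contrib-group>
      <contrib contrib-type="author"><name>
        <surname>Doe</surname><given-names>Jane</given-names>
      </name></contrib>
    </contrib-group>
  </article-meta></front>
</article>"""

root = ET.fromstring(JATS)

# Title lives under front/article-meta/title-group/article-title
title = root.findtext(".//article-title")

# Authors are contrib elements with contrib-type="author"
authors = [
    f"{n.findtext('given-names')} {n.findtext('surname')}"
    for n in root.findall(".//contrib[@contrib-type='author']/name")
]

print(title)    # Example Paper
print(authors)  # ['Jane Doe']
```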
### Advanced Analytics and Visualization

Analyze search results with built-in analytics and create visualizations:

```python
from pyeuropepmc import (
    SearchClient,
    to_dataframe,
    citation_statistics,
    quality_metrics,
    remove_duplicates,
    plot_publication_years,
    create_summary_dashboard,
)

# Search and collect results
with SearchClient() as client:
    response = client.search("machine learning", pageSize=100)
    papers = response.get("resultList", {}).get("result", [])

# Convert to a pandas DataFrame for analysis
df = to_dataframe(papers)

# Remove duplicates
df = remove_duplicates(df, method="title", keep="most_cited")

# Get citation statistics
stats = citation_statistics(df)
print(f"Mean citations: {stats['mean_citations']:.2f}")
print(f"Highly cited (top 10%): {stats['citation_distribution']['90th_percentile']:.0f}")

# Assess quality metrics
metrics = quality_metrics(df)
print(f"Open access: {metrics['open_access_percentage']:.1f}%")
print(f"With PDF: {metrics['with_pdf_percentage']:.1f}%")

# Create visualizations
plot_publication_years(df, save_path="publications_by_year.png")
create_summary_dashboard(df, save_path="analysis_dashboard.png")
```
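Title-based deduplication is easy to picture with plain Python. The sketch below illustrates what a `method="title", keep="most_cited"` step might do conceptually (a hypothetical `dedupe_by_title` helper, not the library's actual implementation):

```python
def dedupe_by_title(papers: list[dict]) -> list[dict]:
    """Keep the most-cited record per normalized title."""
    best: dict[str, dict] = {}
    for paper in papers:
        # Normalize the title so trivial variants collapse to one key
        key = paper.get("title", "").strip().lower()
        current = best.get(key)
        if current is None or paper.get("citedByCount", 0) > current.get("citedByCount", 0):
            best[key] = paper
    return list(best.values())

papers = [
    {"title": "Deep Learning", "citedByCount": 5},
    {"title": "deep learning ", "citedByCount": 12},
    {"title": "Graph Mining", "citedByCount": 3},
]
unique = dedupe_by_title(papers)
print(len(unique))  # 2
```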
### External API Enrichment

Enhance paper metadata with data from CrossRef, Unpaywall, Semantic Scholar, and OpenAlex:

```python
from pyeuropepmc import PaperEnricher, EnrichmentConfig

# Configure enrichment with multiple APIs
config = EnrichmentConfig(
    enable_crossref=True,
    enable_semantic_scholar=True,
    enable_openalex=True,
    enable_unpaywall=True,
    unpaywall_email="your@email.com",  # Required for Unpaywall
)

# Enrich paper metadata
with PaperEnricher(config) as enricher:
    result = enricher.enrich_paper(doi="10.1371/journal.pone.0308090")

# Access merged data from all sources
merged = result["merged"]
print(f"Title: {merged['title']}")
print(f"Citations: {merged['citation_count']}")
print(f"Open Access: {merged['is_oa']}")

# Access individual source data
if "crossref" in result["sources"]:
    print(f"Funders: {result['crossref']['funders']}")
if "semantic_scholar" in result["sources"]:
    print(f"Influential Citations: {result['semantic_scholar']['influential_citation_count']}")
```

Features:

- Automatic data merging from multiple sources
- Citation metrics from multiple databases
- Open access status and full-text URLs
- Funding information
- Topic classifications and fields of study
- Optional caching for performance

See `examples/09-enrichment` for more details.
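The "automatic data merging" step can be pictured as a priority merge over per-source dicts. This is purely illustrative (`merge_sources` is hypothetical; PaperEnricher's real merge logic may differ):

```python
def merge_sources(sources: dict[str, dict], priority: list[str]) -> dict:
    """Merge per-source metadata dicts, letting higher-priority sources win."""
    merged: dict = {}
    for name in reversed(priority):           # lowest priority first...
        merged.update(sources.get(name, {}))  # ...so later updates override
    return merged

sources = {
    "crossref": {"title": "A Study", "funders": ["XYZ"]},
    "openalex": {"title": "A Study (preprint)", "citation_count": 42},
}
merged = merge_sources(sources, priority=["crossref", "openalex"])
print(merged["title"])  # A Study  (CrossRef outranks OpenAlex here)
```

Fields unique to a source (here `funders` and `citation_count`) survive the merge, while conflicting fields resolve by priority.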
## Knowledge Graph Structure Options

PyEuropePMC supports flexible knowledge graph structures for different use cases:

```python
from pyeuropepmc.mappers import RDFMapper

mapper = RDFMapper()

# Metadata-only KG (for citation networks and bibliometrics):
# papers + authors + institutions
metadata_graphs = mapper.save_metadata_rdf(
    entities_data,
    output_dir="rdf_output",
)

# Content-only KG (for text analysis and document processing):
# papers + sections + references + tables
content_graphs = mapper.save_content_rdf(
    entities_data,
    output_dir="rdf_output",
)

# Complete KG (for comprehensive analysis): all entities and relationships
complete_graphs = mapper.save_complete_rdf(
    entities_data,
    output_dir="rdf_output",
)

# Use the configured default from conf/rdf_map.yml
graphs = mapper.save_rdf(entities_data, output_dir="rdf_output")
```

Use Cases:

- Citation Networks: Use metadata-only KGs for bibliometric analysis
- Text Mining: Use content-only KGs for NLP and information extraction
- Full Analysis: Use complete KGs for comprehensive research workflows

See `examples/kg_structure_demo.py` for a complete working example.
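Since Dublin Core is among the supported metadata formats, a metadata-only graph can be pictured as a handful of DC triples per paper. A stdlib sketch that emits N-Triples (the URIs and the `paper_triples` helper are illustrative, not the library's actual mapping):

```python
def paper_triples(pmcid: str, title: str, author: str) -> str:
    """Serialize one paper node as N-Triples with Dublin Core predicates."""
    subject = f"<https://example.org/paper/{pmcid}>"
    lines = [
        f'{subject} <http://purl.org/dc/terms/title> "{title}" .',
        f'{subject} <http://purl.org/dc/terms/creator> "{author}" .',
    ]
    return "\n".join(lines)

nt = paper_triples("PMC3258128", "Example Paper", "Jane Doe")
print(nt)
```

In practice a library such as rdflib would manage namespaces and serialization formats; the point here is only the shape of the metadata-only graph.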
## Unified Processing Pipeline

The unified pipeline dramatically simplifies the workflow of XML parsing → enrichment → RDF conversion:

```python
from pyeuropepmc import PaperProcessingPipeline, PipelineConfig

# Simple configuration
config = PipelineConfig(
    enable_enrichment=True,           # Enable metadata enrichment
    enable_crossref=True,             # CrossRef API
    enable_semantic_scholar=True,     # Semantic Scholar API
    enable_openalex=True,             # OpenAlex API
    enable_ror=True,                  # ROR institution data
    crossref_email="your@email.com",  # Required for higher CrossRef rate limits
    output_format="turtle",           # RDF output format
    output_dir="output",              # Where to save RDF files
)

# Create the unified pipeline
pipeline = PaperProcessingPipeline(config)

# Process a single paper - replaces 8+ separate steps!
result = pipeline.process_paper(
    xml_content=xml_string,
    doi="10.1038/nature11476",
    save_rdf=True,
)
print(f"Generated {result['triple_count']} RDF triples")
print(f"Output saved to: {result['output_file']}")

# Process multiple papers in batch
xml_contents = {
    "10.1038/nature11476": xml_content_1,
    "10.1038/nature11477": xml_content_2,
}
batch_results = pipeline.process_papers(xml_contents)
for doi, result in batch_results.items():
    print(f"{doi}: {result['triple_count']} triples")
```

What it does automatically:

- Parses XML and extracts entities (paper, authors, sections, tables, figures, references)
- Enriches metadata from external APIs (citations, fields of study, etc.)
- Converts everything to RDF with proper relationships
- Saves structured output files
- Handles errors gracefully

Before vs. after:

```python
# OLD: complex multi-step workflow (8+ steps)
parser = FullTextXMLParser()
parser.parse(xml_content)
paper, authors, sections, tables, figures, references = build_paper_entities(parser)
enricher = PaperEnricher(config)
enrichment_data = enricher.enrich_paper(doi)
rdf_mapper = RDFMapper()
paper.to_rdf(graph, related_entities=...)
rdf_mapper.serialize_graph(graph, format="turtle")

# NEW: single pipeline call (3 steps)
config = PipelineConfig(...)
pipeline = PaperProcessingPipeline(config)
result = pipeline.process_paper(xml_content, doi=doi)
```

See `examples/pipeline_demo.py` for a complete working example.
## Documentation

Read the Full Documentation - Start Here!

Quick Links:

- Quick Start Guide - Get started in 5 minutes
- Query Builder - Advanced query building
- API Reference - Complete API documentation
- Examples - Code examples and use cases
- Features - Explore all features
- XML Coverage Analysis - Parser coverage and benchmark results

Note: Enable GitHub Pages first! See the Setup Guide for instructions.
## Performance

Benchmarks run weekly on Monday at 02:00 UTC. Last updated: pending first run.

| Metric | Value |
|---|---|
| Total Requests | Pending |
| Average Response Time | Pending |
| Success Rate | Pending |

Benchmark results are updated automatically each week by GitHub Actions.
## Contributing

We welcome contributions! See our Contributing Guide for details.

## License

Distributed under the MIT License. See LICENSE for more information.

## Links

- Documentation: GitHub Pages - full documentation site
- PyPI Package: pyeuropepmc - install with pip
- GitHub Repository: pyEuropePMC - source code
- Issue Tracker: GitHub Issues - report bugs or request features