A powerful Python package for searching NCBI GEO and PubMed databases with advanced filtering
Project description
ncbi-geo-pubmed-search
A powerful Python package for searching NCBI GEO and PubMed databases with advanced filtering
Features
- 🔍 Unified Search Interface: Search both PubMed and GEO databases with a single command
- 🏥 PubMed Search: Find scientific articles with advanced filtering by date, field, and more
- 🧬 GEO Search: Discover gene expression datasets with organism and dataset type filtering
- 🛡️ Robust Error Handling: Built-in retry logic with exponential backoff for API rate limits
- 📊 Flexible Output: Export results to Excel or CSV formats
- 🔧 Highly Configurable: Customize API delays, retries, and search parameters
- 🌍 Environment Variable Support: Secure credential management
- 📈 Result Statistics: Get summary statistics for your search results
Installation
Install from PyPI:
pip install ncbi-geo-pubmed-search
Or install the latest development version:
pip install git+https://github.com/yourusername/ncbi-geo-pubmed-search.git
Quick Start
Basic Usage
from ncbi_geo_pubmed import NCBISearcher
# Initialize with your email (required by NCBI)
searcher = NCBISearcher(email="your.email@example.com")
# Search both PubMed and GEO
results = searcher.search(
search_terms=["cancer", "immunotherapy"],
start_year=2020,
end_year=2024
)
# Access results
pubmed_df = results["pubmed"]
geo_df = results["geo_all"]
print(f"Found {len(pubmed_df)} PubMed articles")
print(f"Found {len(geo_df)} GEO datasets")
Using Environment Variables
Set your credentials as environment variables for security:
export NCBI_EMAIL="your.email@example.com"
export NCBI_API_KEY="your_optional_api_key"
Then use without explicit credentials:
from ncbi_geo_pubmed import NCBISearcher
searcher = NCBISearcher()
results = searcher.search(["aging", "senescence"], 2022, 2024)
Search Only PubMed
# Search PubMed with specific field
pubmed_results = searcher.search_pubmed(
search_terms=["CRISPR", "gene editing"],
start_year=2023,
end_year=2024,
field="Title/Abstract", # Search in title and abstract
retmax=500
)
# Save results
pubmed_results = searcher.search_pubmed(
search_terms=["diabetes", "metabolism"],
start_year=2020,
end_year=2024,
output_folder="./results",
save_format="excel"
)
Search Only GEO
# Search GEO with organism filter
geo_results = searcher.search_geo(
search_terms=["RNA-seq", "single cell"],
organisms=["Homo sapiens", "Mus musculus"],
dataset_type="expression profiling by high throughput sequencing",
retmax=1000
)
# Access organism-specific results
human_datasets = geo_results["geo_homo_sapiens"]
mouse_datasets = geo_results["geo_mus_musculus"]
Advanced Usage
Custom Configuration
# Initialize with custom settings
searcher = NCBISearcher(
email="your.email@example.com",
api_key="your_api_key", # Optional, provides higher rate limits
request_delay=0.5, # Delay between requests (seconds)
max_retries=5, # Maximum retry attempts
backoff_factor=2 # Exponential backoff multiplier
)
Combined Search with All Options
results = searcher.search(
search_terms=["alzheimer", "neurodegeneration", "tau"],
start_year=2020,
end_year=2024,
databases=["pubmed", "geo"], # Which databases to search
organisms=["Homo sapiens"], # GEO organism filter
retmax=2000, # Max results per database
pubmed_field="Title/Abstract", # PubMed search field
geo_dataset_type="expression profiling by array",
output_folder="./alzheimer_results",
save_format="excel",
combine_results=True # Save all results in one file
)
# Get summary statistics
stats = searcher.get_stats(results)
print(stats)
Batch Processing
# Process multiple search topics
topics = [
{"terms": ["COVID-19", "long COVID"], "years": (2021, 2024)},
{"terms": ["cancer", "immunotherapy"], "years": (2022, 2024)},
{"terms": ["CRISPR", "base editing"], "years": (2020, 2024)}
]
all_results = {}
for topic in topics:
key = "_".join(topic["terms"])
results = searcher.search(
search_terms=topic["terms"],
start_year=topic["years"][0],
end_year=topic["years"][1],
retmax=100
)
all_results[key] = results
Output Examples
PubMed Results DataFrame
| PMID | Title | Authors | Journal | Year | DOI | Citation |
|---|---|---|---|---|---|---|
| 12345678 | Cancer immunotherapy... | Smith J, et al. | Nature | 2023 | 10.1038/... | Nature. 123(45):678-90 (2023) |
GEO Results DataFrame
| GEO_ID | Accession | Title | Organism | Platform | Samples | DatasetType |
|---|---|---|---|---|---|---|
| 200012345 | GSE12345 | Single-cell RNA-seq... | Homo sapiens | GPL20301 | 10000 | expression profiling... |
Error Handling
The package includes robust error handling:
from ncbi_geo_pubmed import NCBISearcher, NCBISearchError, RateLimitError
try:
searcher = NCBISearcher(email="your.email@example.com")
results = searcher.search(["cancer"], 2020, 2024)
except RateLimitError:
print("Rate limit exceeded. Try again later or use an API key.")
except NCBISearchError as e:
print(f"Search failed: {e}")
Command Line Interface (CLI)
The package also provides a command-line interface:
# Basic search
ncbi-search --email your.email@example.com --terms "cancer,immunotherapy" --start 2020 --end 2024
# Search with all options
ncbi-search \
--email your.email@example.com \
--api-key your_api_key \
--terms "aging,senescence" \
--start 2020 \
--end 2024 \
--databases pubmed geo \
--organisms "Homo sapiens" "Mus musculus" \
--output ./results \
--format excel
Requirements
- Python >= 3.7
- biopython >= 1.79
- pandas >= 1.3.0
- openpyxl >= 3.0.9
- requests >= 2.25.0
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (
git checkout -b feature/AmazingFeature) - Commit your changes (
git commit -m 'Add some AmazingFeature') - Push to the branch (
git push origin feature/AmazingFeature) - Open a Pull Request
Support
If you encounter any problems or have questions:
- Check the FAQ
- Look through existing issues
- Open a new issue with a detailed description
Citation
If you use this package in your research, please cite:
@software{ncbi-geo-pubmed-search,
author = {{ Your Name }},
title = {{ ncbi-geo-pubmed-search: A powerful Python package for searching NCBI GEO and PubMed databases with advanced filtering }},
year = {2024},
url = {https://github.com/yourusername/ncbi-geo-pubmed-search}
}
Acknowledgments
- NCBI for providing the E-utilities API
- The BioPython community for the excellent Bio.Entrez module
- All contributors and users of this package
Made with ❤️ for the bioinformatics community
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file ncbi_geo_pubmed_search-1.0.0.tar.gz.
File metadata
- Download URL: ncbi_geo_pubmed_search-1.0.0.tar.gz
- Upload date:
- Size: 20.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2ea3d63e1191eba4808467de6bb413312c7424af6bd3b1d5bc4ce85ac2ed8e82
|
|
| MD5 |
c89f650de1d64a13d724f3253428b99b
|
|
| BLAKE2b-256 |
6e8ea7919ec9ac3590b7bd71b83d4909d8cf706099b8d0f5420565cf757672ae
|
File details
Details for the file ncbi_geo_pubmed_search-1.0.0-py3-none-any.whl.
File metadata
- Download URL: ncbi_geo_pubmed_search-1.0.0-py3-none-any.whl
- Upload date:
- Size: 16.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f549992d9fff77547f22366f9efd410c4054f698d7076d5aa0595386cbf037cc
|
|
| MD5 |
656168dcd7b915b6b1708b676b4877c3
|
|
| BLAKE2b-256 |
ce6e7a20dd570b49c5c5898e2bec04eab12ed33e88083434d46f66d7487adc0e
|