ncbi-geo-pubmed-search

A powerful Python package for searching NCBI GEO and PubMed databases with advanced filtering

Features

🔍 Unified Search Interface: Search both PubMed and GEO databases with a single command
🏥 PubMed Search: Find scientific articles with advanced filtering by date, field, and more
🧬 GEO Search: Discover gene expression datasets with organism and dataset type filtering
🛡️ Robust Error Handling: Built-in retry logic with exponential backoff for API rate limits
📊 Flexible Output: Export results to Excel or CSV formats
🔧 Highly Configurable: Customize API delays, retries, and search parameters
🌍 Environment Variable Support: Read your NCBI email and API key from NCBI_EMAIL and NCBI_API_KEY instead of hard-coding them
📈 Result Statistics: Get summary statistics for your search results

Installation

Install from PyPI:

pip install ncbi-geo-pubmed-search

Or install the latest development version:

pip install git+https://github.com/yourusername/ncbi-geo-pubmed-search.git

Quick Start

Basic Usage

from ncbi_geo_pubmed import NCBISearcher

# Initialize with your email (required by NCBI)
searcher = NCBISearcher(email="your.email@example.com")

# Search both PubMed and GEO
results = searcher.search(
    search_terms=["cancer", "immunotherapy"],
    start_year=2020,
    end_year=2024
)

# Access results
pubmed_df = results["pubmed"]
geo_df = results["geo_all"]

print(f"Found {len(pubmed_df)} PubMed articles")
print(f"Found {len(geo_df)} GEO datasets")

Using Environment Variables

Set your credentials as environment variables to keep them out of your source code:

export NCBI_EMAIL="your.email@example.com"
export NCBI_API_KEY="your_optional_api_key"

Then use without explicit credentials:

from ncbi_geo_pubmed import NCBISearcher

searcher = NCBISearcher()
results = searcher.search(["aging", "senescence"], 2022, 2024)
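
The fallback from explicit arguments to environment variables could look roughly like the sketch below. This is not the package's own code: `resolve_credentials` is a hypothetical helper written for illustration, and only the variable names `NCBI_EMAIL` and `NCBI_API_KEY` come from the example above.

```python
import os


def resolve_credentials(email=None, api_key=None, env=None):
    """Return (email, api_key), preferring explicit arguments over the environment.

    Hypothetical helper for illustration; not part of ncbi-geo-pubmed-search.
    """
    env = os.environ if env is None else env
    email = email or env.get("NCBI_EMAIL")
    api_key = api_key or env.get("NCBI_API_KEY")
    if not email:
        # The email is required; the API key stays optional.
        raise ValueError("No email given and NCBI_EMAIL is not set")
    return email, api_key
```

Passing `env` explicitly makes the lookup easy to test without touching the real process environment.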

Search Only PubMed

# Search PubMed with specific field
pubmed_results = searcher.search_pubmed(
    search_terms=["CRISPR", "gene editing"],
    start_year=2023,
    end_year=2024,
    field="Title/Abstract",  # Search in title and abstract
    retmax=500
)

# Run another search and save the results to ./results as an Excel file
pubmed_results = searcher.search_pubmed(
    search_terms=["diabetes", "metabolism"],
    start_year=2020,
    end_year=2024,
    output_folder="./results",
    save_format="excel"
)
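
To make the parameters above more concrete, here is a sketch of how search terms, a field, and a year range might be combined into a single PubMed query string. The helper `build_pubmed_query` is hypothetical, and how the package actually joins terms is not documented here, so the operator is left as a parameter. The `[Title/Abstract]` and `[dp]` (date of publication) field tags are standard PubMed search syntax.

```python
def build_pubmed_query(search_terms, start_year, end_year, field=None, operator="AND"):
    """Combine terms, an optional field tag, and a publication-year range.

    Hypothetical helper; the package may build its queries differently.
    """
    tag = f"[{field}]" if field else ""
    # Quote each term so multi-word phrases such as "gene editing" stay together.
    terms = [f'"{term}"{tag}' for term in search_terms]
    joined = f" {operator} ".join(terms)
    # start:end[dp] restricts results to a range of publication years.
    return f"({joined}) AND {start_year}:{end_year}[dp]"
```

For the first example above, this sketch produces `("CRISPR"[Title/Abstract] AND "gene editing"[Title/Abstract]) AND 2023:2024[dp]`.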

Search Only GEO

# Search GEO with organism filter
geo_results = searcher.search_geo(
    search_terms=["RNA-seq", "single cell"],
    organisms=["Homo sapiens", "Mus musculus"],
    dataset_type="expression profiling by high throughput sequencing",
    retmax=1000
)

# Access organism-specific results
human_datasets = geo_results["geo_homo_sapiens"]
mouse_datasets = geo_results["geo_mus_musculus"]
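
The two keys above suggest a naming pattern: the organism name lowercased, with spaces replaced by underscores, behind a `geo_` prefix. Assuming that pattern holds for other organisms (it is inferred from these two examples only), a small hypothetical helper avoids typing keys by hand:

```python
def organism_key(organism, prefix="geo"):
    """Turn 'Homo sapiens' into 'geo_homo_sapiens'.

    Hypothetical helper; the key pattern is inferred from the examples above.
    """
    return f"{prefix}_" + "_".join(organism.lower().split())


def get_organism_table(geo_results, organism):
    """Return the table for one organism, or None if the key is absent."""
    return geo_results.get(organism_key(organism))
```

Using `.get` rather than indexing means an organism with no results yields `None` instead of a `KeyError`.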

Advanced Usage

Custom Configuration

# Initialize with custom settings
searcher = NCBISearcher(
    email="your.email@example.com",
    api_key="your_api_key",  # Optional, provides higher rate limits
    request_delay=0.5,       # Delay between requests (seconds)
    max_retries=5,           # Maximum retry attempts
    backoff_factor=2         # Exponential backoff multiplier
)
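
To get a feel for how these three settings interact, the sketch below computes one plausible retry schedule. The formula is an assumption (the initial delay multiplied by the backoff factor once per failed attempt); the package's exact schedule is not documented here, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(request_delay=0.5, max_retries=5, backoff_factor=2):
    """Return the assumed wait, in seconds, before each retry attempt.

    Hypothetical helper: delay = request_delay * backoff_factor ** attempt.
    """
    return [request_delay * backoff_factor ** attempt for attempt in range(max_retries)]
```

Under this assumption, the settings above give waits of 0.5, 1, 2, 4, and 8 seconds, so a request that exhausts all five retries spends 15.5 seconds waiting.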

Combined Search with All Options

results = searcher.search(
    search_terms=["alzheimer", "neurodegeneration", "tau"],
    start_year=2020,
    end_year=2024,
    databases=["pubmed", "geo"],  # Which databases to search
    organisms=["Homo sapiens"],    # GEO organism filter
    retmax=2000,                   # Max results per database
    pubmed_field="Title/Abstract", # PubMed search field
    geo_dataset_type="expression profiling by array",
    output_folder="./alzheimer_results",
    save_format="excel",
    combine_results=True  # Save all results in one file
)

# Get summary statistics
stats = searcher.get_stats(results)
print(stats)

Batch Processing

# Process multiple search topics
topics = [
    {"terms": ["COVID-19", "long COVID"], "years": (2021, 2024)},
    {"terms": ["cancer", "immunotherapy"], "years": (2022, 2024)},
    {"terms": ["CRISPR", "base editing"], "years": (2020, 2024)}
]

all_results = {}
for topic in topics:
    key = "_".join(topic["terms"])
    results = searcher.search(
        search_terms=topic["terms"],
        start_year=topic["years"][0],
        end_year=topic["years"][1],
        retmax=100
    )
    all_results[key] = results
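
The keys built in the loop above keep the terms as written, so they can contain spaces and mixed case (for example "COVID-19_long COVID"). If you plan to reuse the keys as file or folder names, a normalizing helper may be handy. `topic_key` below is a hypothetical helper, not part of the package:

```python
import re


def topic_key(terms):
    """Build a filesystem-friendly key from a list of search terms."""
    slugs = []
    for term in terms:
        # Lowercase, then collapse anything that is not a letter, digit, or hyphen.
        slug = re.sub(r"[^a-z0-9-]+", "_", term.lower()).strip("_")
        slugs.append(slug)
    return "_".join(slugs)
```

With this helper, `["COVID-19", "long COVID"]` becomes `covid-19_long_covid`.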

Output Examples

PubMed Results DataFrame

PMID	Title	Authors	Journal	Year	DOI	Citation
12345678	Cancer immunotherapy...	Smith J, et al.	Nature	2023	10.1038/...	Nature. 123(45):678-90 (2023)
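
If results are exported to CSV, they can also be post-processed with the standard library alone. The sketch below counts articles per year. It assumes the exported file has a header row with the column names shown above; the file name in the usage comment is hypothetical.

```python
import csv
from collections import Counter


def count_by_year(csv_file):
    """Count rows per value of the 'Year' column in an exported PubMed CSV."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Year"] for row in reader)


# Usage (the path is hypothetical):
# with open("results/pubmed.csv", newline="", encoding="utf-8") as handle:
#     print(count_by_year(handle))
```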

GEO Results DataFrame

GEO_ID	Accession	Title	Organism	Platform	Samples	DatasetType
200012345	GSE12345	Single-cell RNA-seq...	Homo sapiens	GPL20301	10000	expression profiling...

Error Handling

The package exposes two exception classes, NCBISearchError and RateLimitError, which you can catch around a search:

from ncbi_geo_pubmed import NCBISearcher, NCBISearchError, RateLimitError

try:
    searcher = NCBISearcher(email="your.email@example.com")
    results = searcher.search(["cancer"], 2020, 2024)
except RateLimitError:
    print("Rate limit exceeded. Try again later or use an API key.")
except NCBISearchError as e:
    print(f"Search failed: {e}")
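
If a RateLimitError still surfaces after the package's built-in retries, one option is an outer retry loop in your own code. The sketch below is generic and hypothetical: `call_with_retries` is not part of the package, and the exception type to retry on is passed in by the caller.

```python
import time


def call_with_retries(func, retry_on, max_retries=3, delay=1.0,
                      backoff_factor=2, sleep=time.sleep):
    """Call func(); on a retry_on exception, wait and try again.

    Re-raises the last exception once max_retries retries are used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise
            # Wait longer after each failure: delay, delay * factor, ...
            sleep(delay * backoff_factor ** attempt)


# Usage with the names from the example above:
# results = call_with_retries(
#     lambda: searcher.search(["cancer"], 2020, 2024),
#     retry_on=RateLimitError,
# )
```

The `sleep` parameter exists so the waiting can be replaced in tests.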

Command Line Interface (CLI)

The package also provides a command-line interface, ncbi-search:

# Basic search
ncbi-search --email your.email@example.com --terms "cancer,immunotherapy" --start 2020 --end 2024

# Search with all options
ncbi-search \
    --email your.email@example.com \
    --api-key your_api_key \
    --terms "aging,senescence" \
    --start 2020 \
    --end 2024 \
    --databases pubmed geo \
    --organisms "Homo sapiens" "Mus musculus" \
    --output ./results \
    --format excel
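
To drive the CLI from a Python script, you can assemble the argument list and hand it to `subprocess.run`. The sketch below uses only the flags shown in the examples above; the helper names are hypothetical, and the CLI may accept options that are not covered here.

```python
import subprocess


def build_ncbi_search_command(email, terms, start, end, databases=None,
                              organisms=None, output=None, fmt=None):
    """Assemble an argv list for ncbi-search using only flags shown above."""
    cmd = ["ncbi-search", "--email", email, "--terms", ",".join(terms),
           "--start", str(start), "--end", str(end)]
    if databases:
        cmd += ["--databases", *databases]
    if organisms:
        cmd += ["--organisms", *organisms]
    if output:
        cmd += ["--output", output]
    if fmt:
        cmd += ["--format", fmt]
    return cmd


def run_ncbi_search(**kwargs):
    """Run the CLI and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_ncbi_search_command(**kwargs), check=True)
```

Passing a list rather than a shell string means values with spaces, such as "Homo sapiens", need no extra quoting.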

Requirements

Python >= 3.7
biopython >= 1.79
pandas >= 1.3.0
openpyxl >= 3.0.9
requests >= 2.25.0

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

Support

If you encounter any problems or have questions:

Check the FAQ
Look through existing issues
Open a new issue with a detailed description

Citation

If you use this package in your research, please cite:

@software{ncbi-geo-pubmed-search,
  author = {{ Your Name }},
  title = {{ ncbi-geo-pubmed-search: A powerful Python package for searching NCBI GEO and PubMed databases with advanced filtering }},
  year = {2024},
  url = {https://github.com/yourusername/ncbi-geo-pubmed-search}
}

Acknowledgments

NCBI for providing the E-utilities API
The BioPython community for the excellent Bio.Entrez module
All contributors and users of this package

Made with ❤️ for the bioinformatics community

Release details

Version 1.0.0, released Jul 25, 2025

Download files

Source Distribution

ncbi_geo_pubmed_search-1.0.0.tar.gz (20.0 kB)

Uploaded Jul 25, 2025 Source

Built Distribution

ncbi_geo_pubmed_search-1.0.0-py3-none-any.whl (16.4 kB)

Uploaded Jul 25, 2025 Python 3

File details

Details for the file ncbi_geo_pubmed_search-1.0.0.tar.gz.

File metadata

Download URL: ncbi_geo_pubmed_search-1.0.0.tar.gz
Upload date: Jul 25, 2025
Size: 20.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for ncbi_geo_pubmed_search-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`2ea3d63e1191eba4808467de6bb413312c7424af6bd3b1d5bc4ce85ac2ed8e82`
MD5	`c89f650de1d64a13d724f3253428b99b`
BLAKE2b-256	`6e8ea7919ec9ac3590b7bd71b83d4909d8cf706099b8d0f5420565cf757672ae`

File details

Details for the file ncbi_geo_pubmed_search-1.0.0-py3-none-any.whl.

File metadata

Download URL: ncbi_geo_pubmed_search-1.0.0-py3-none-any.whl
Upload date: Jul 25, 2025
Size: 16.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.3

File hashes

Hashes for ncbi_geo_pubmed_search-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`f549992d9fff77547f22366f9efd410c4054f698d7076d5aa0595386cbf037cc`
MD5	`656168dcd7b915b6b1708b676b4877c3`
BLAKE2b-256	`ce6e7a20dd570b49c5c5898e2bec04eab12ed33e88083434d46f66d7487adc0e`
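
A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch; the function names are illustrative, and the path you pass in is wherever you saved the file.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat regardless of file size.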
