PhD Scraper for Academic Positions

A Python module to scrape PhD offers from academicpositions.com.

Features

  • 🔍 Scrape PhD positions with filtering by country and field
  • 📋 Extract detailed information: title, university, requirements, deadlines, etc.
  • 💾 Export to JSON, CSV, or Markdown formats
  • 🔄 Iterator support for memory-efficient processing
  • ⚡ Concurrent fetching with rate limiting
  • 🖥️ Command-line interface included

Installation

# Clone the repository, or navigate to an existing checkout
cd PhDFinder

# Install the package in editable mode (this also installs its declared dependencies)
pip install -e .

# Or install only the dependencies
pip install requests beautifulsoup4

Quick Start

Python API

from phd_scraper import AcademicPositionsScraper, PhDPosition

# Create scraper instance
scraper = AcademicPositionsScraper()

# Get PhD positions (basic usage)
positions = scraper.get_phd_positions(max_pages=2)

# Print results
for pos in positions:
    print(f"{pos.title} at {pos.university}")
    print(f"  Location: {pos.location}")
    print(f"  Deadline: {pos.deadline}")
    print(f"  URL: {pos.url}")
    print()

Filter by Country and Field

# Get Computer Science PhDs in Germany
positions = scraper.get_phd_positions(
    max_pages=3,
    country="germany",
    field="computer-science"
)

# Get Physics PhDs in Switzerland
positions = scraper.get_phd_positions(
    country="switzerland",
    field="physics"
)

Search with Keywords

# Search for specific keywords
positions = scraper.search_positions(
    keywords=["machine learning", "deep learning", "AI"],
    country="germany",
    max_pages=5
)
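
The README does not say how keywords are matched. As a rough illustration only, the sketch below assumes "any keyword matches, ignoring case, on whole words"; the helper name matches_keywords is hypothetical and is not part of phd_scraper. Whole-word matching matters for short keywords such as "AI", which would otherwise match inside words like "training".

```python
import re
from typing import Iterable


def matches_keywords(text: str, keywords: Iterable[str]) -> bool:
    """Return True if any keyword occurs in `text` as a whole word or phrase.

    Hypothetical helper: the package's real matching rules may differ.
    """
    for keyword in keywords:
        # \b anchors the match at word boundaries; re.escape keeps the
        # keyword literal even if it contains regex metacharacters.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True
    return False


keywords = ["machine learning", "deep learning", "AI"]
hit_phrase = matches_keywords("PhD in Deep Learning for Robotics", keywords)
hit_short = matches_keywords("PhD Position in AI and Strategy", ["AI"])
miss_substring = matches_keywords("PhD position in training methods", ["AI"])
print(hit_phrase, hit_short, miss_substring)
```

Word boundaries behave poorly for keywords that end in punctuation (for example "C++"), so a real implementation might need a different rule for those.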

Export Results

from phd_scraper.utils import export_to_json, export_to_csv, export_to_markdown

# Get positions
positions = scraper.get_phd_positions(max_pages=2)

# Export to different formats
export_to_json(positions, "phd_positions.json")
export_to_csv(positions, "phd_positions.csv")
export_to_markdown(positions, "phd_positions.md")
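
To give a feel for what the exported files might contain, here is a minimal standard-library sketch that serializes two records to JSON and CSV in memory. The dictionary layout and the helpers to_json and to_csv are assumptions made for illustration; the actual output of export_to_json and export_to_csv may be structured differently.

```python
import csv
import io
import json

# Hypothetical records, loosely based on the example output in this README.
rows = [
    {"title": "PhD Position in AI and Strategy",
     "university": "ETH Zürich",
     "deadline": None},
    {"title": "Doctoral student in Radiofrequency ranging for Lunar orbits",
     "university": "KTH Royal Institute of Technology",
     "deadline": "2026-01-31 (Europe/Stockholm)"},
]


def to_json(records: list) -> str:
    # ensure_ascii=False keeps characters such as "ü" readable in the output.
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records: list) -> str:
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)  # None is written as an empty cell
    return buffer.getvalue()


json_text = to_json(rows)
csv_text = to_csv(rows)
print(json_text)
print(csv_text)
```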

Memory-Efficient Iterator

# Process positions one at a time (good for large datasets)
for position in scraper.iter_positions(country="sweden"):
    print(position.summary())
    # Process each position without loading all into memory
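
If iter_positions is a lazy generator, as the description above suggests, it can be combined with itertools.islice to stop early without fetching further pages. The sketch below uses a hypothetical stand-in generator, fake_iter_positions, so that it runs without network access; it is not the package's implementation.

```python
from itertools import islice
from typing import Iterator, List

fetched_pages: List[int] = []  # records which "pages" were requested


def fake_iter_positions(pages: int = 100, per_page: int = 20) -> Iterator[str]:
    """Hypothetical stand-in for scraper.iter_positions()."""
    for page in range(1, pages + 1):
        # A real scraper would perform an HTTP request for this page here.
        fetched_pages.append(page)
        for index in range(per_page):
            yield f"position {page}-{index}"


# Take only the first five items; the generator is never advanced further.
first_five = list(islice(fake_iter_positions(), 5))
print(first_five)
print(fetched_pages)
```

Because the generator is only advanced five times, only the first page is ever "fetched" in this sketch.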

Command-Line Interface

# Basic usage - get 2 pages of positions
python -m phd_scraper --pages 2

# Filter by country and field
python -m phd_scraper --country germany --field computer-science

# Export to JSON
python -m phd_scraper --output positions.json --format json --pages 3

# Export to CSV
python -m phd_scraper --output positions.csv --format csv

# Search with keywords
python -m phd_scraper --keywords "machine learning" "neural networks" --pages 5

# List available filters
python -m phd_scraper --list-filters

# Fast mode (skip detailed info)
python -m phd_scraper --no-details --pages 10

# Verbose output
python -m phd_scraper --verbose --pages 1
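
For readers curious how flags like these are typically declared, the following argparse sketch mirrors the options shown above. It is an assumption about the shape of the interface, not the package's actual parser; the defaults and the "markdown" format choice are guesses. It also shows why --keywords accepts several quoted phrases: nargs="+" collects every value up to the next flag.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in this README."""
    parser = argparse.ArgumentParser(prog="phd_scraper")
    parser.add_argument("--pages", type=int, default=1)  # default is a guess
    parser.add_argument("--country")
    parser.add_argument("--field")
    parser.add_argument("--keywords", nargs="+", default=[])
    parser.add_argument("--output")
    parser.add_argument("--format", choices=["json", "csv", "markdown"],
                        default="json")  # choices and default are guesses
    parser.add_argument("--no-details", action="store_true")
    parser.add_argument("--list-filters", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--keywords", "machine learning", "neural networks", "--pages", "5"]
)
print(args.keywords, args.pages, args.no_details)
```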

Available Filters

Countries

  • germany, sweden, belgium, switzerland, netherlands, finland
  • norway, austria, france, united-kingdom, united-states
  • italy, spain, denmark, luxembourg

Fields

  • computer-science, physics, chemistry, biology, mathematics
  • engineering, medicine, economics, social-science, geosciences
  • artificial-intelligence, machine-learning, psychology, law

Data Model

Each PhDPosition object contains:

Field            Description
title            Position title
university       University/employer name
location         Full location (city, country)
country          Country name
city             City name
deadline         Application deadline
published_date   When the position was published
job_type         Type of position (PhD)
fields           Research fields/disciplines
description      Full job description
requirements     Qualifications needed
benefits         What the position offers
url              Link to the job posting
apply_url        Direct application link
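
The table above could be modelled as a dataclass along the following lines. This is a sketch only: the class name PhDPositionSketch, the field types, the defaults, and the body of summary() are all assumptions, since the README lists the field names but not how PhDPosition is defined.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PhDPositionSketch:
    """Hypothetical model with the fields listed in the table above."""
    title: str
    university: str
    url: str
    location: Optional[str] = None
    country: Optional[str] = None
    city: Optional[str] = None
    deadline: Optional[str] = None
    published_date: Optional[str] = None
    job_type: str = "PhD"
    fields: List[str] = field(default_factory=list)
    description: Optional[str] = None
    requirements: Optional[str] = None
    benefits: Optional[str] = None
    apply_url: Optional[str] = None

    def summary(self) -> str:
        # Assumed format; the real summary() may look different.
        deadline = self.deadline or "Unspecified"
        return f"{self.title} at {self.university} (deadline: {deadline})"


pos = PhDPositionSketch(
    title="PhD Position in AI and Strategy",
    university="ETH Zürich",
    url="https://example.org/ad/1",  # placeholder URL
    fields=["Management", "Artificial Intelligence"],
)
print(pos.summary())
print(sorted(asdict(pos)))
```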

Configuration

scraper = AcademicPositionsScraper(
    request_delay=1.5,      # Delay between requests (seconds)
    timeout=30,             # Request timeout (seconds)
    max_retries=3,          # Number of retry attempts
    user_agent="Custom UA"  # Custom user agent string
)
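
To illustrate how request_delay and max_retries could interact, here is a minimal retry loop. It is a sketch under stated assumptions: the helper fetch_with_retries is hypothetical, max_retries is treated as the total number of attempts, and the delay is applied between failed attempts. The scraper's real behaviour may differ. The fetch and sleep functions are injected so the example runs instantly and without network access.

```python
import time
from typing import Callable, List


def fetch_with_retries(
    fetch: Callable[[str], str],
    url: str,
    max_retries: int = 3,
    request_delay: float = 1.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying on OSError up to max_retries attempts."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fetch(url)
        except OSError as exc:  # ConnectionError and TimeoutError derive from it
            last_error = exc
            if attempt < max_retries:
                sleep(request_delay)  # wait before the next attempt
    raise RuntimeError(
        f"giving up on {url} after {max_retries} attempts"
    ) from last_error


calls: List[str] = []
delays: List[float] = []


def flaky_fetch(url: str) -> str:
    """Fails twice, then succeeds (simulated network hiccup)."""
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"


result = fetch_with_retries(flaky_fetch, "https://example.org/page",
                            sleep=delays.append)
print(result, len(calls), delays)
```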

Utility Functions

from phd_scraper.utils import (
    filter_positions,
    deduplicate_positions,
    sort_positions
)

# Filter positions
filtered = filter_positions(
    positions,
    keywords=["AI", "robotics"],
    countries=["germany", "switzerland"],
    has_deadline=True
)

# Remove duplicates
unique = deduplicate_positions(positions)

# Sort by a chosen attribute (here, the application deadline)
sorted_pos = sort_positions(positions, by="deadline")
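
The README does not describe how duplicates are detected or how missing deadlines are ordered. One plausible approach, sketched below with hypothetical helpers dedupe_by_url and sort_by_deadline, is to treat the posting URL as the identity of a position and to sort positions without a deadline last. The sketch assumes deadlines are ISO-style date strings (or None), which sort correctly as plain text.

```python
from types import SimpleNamespace
from typing import Iterable, List


def dedupe_by_url(positions: Iterable) -> List:
    """Keep the first position seen for each URL (assumed identity key)."""
    seen = set()
    unique = []
    for pos in positions:
        if pos.url not in seen:
            seen.add(pos.url)
            unique.append(pos)
    return unique


def sort_by_deadline(positions: Iterable) -> List:
    """Earliest deadline first; positions without a deadline go last."""
    return sorted(positions, key=lambda p: (p.deadline is None, p.deadline or ""))


# Stand-in objects with only the attributes this sketch needs.
positions = [
    SimpleNamespace(title="A", url="https://example.org/a", deadline="2026-03-01"),
    SimpleNamespace(title="B", url="https://example.org/b", deadline=None),
    SimpleNamespace(title="A (duplicate)", url="https://example.org/a", deadline="2026-03-01"),
    SimpleNamespace(title="C", url="https://example.org/c", deadline="2026-01-31"),
]

unique = dedupe_by_url(positions)
ordered_titles = [p.title for p in sort_by_deadline(unique)]
print(ordered_titles)
```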

Example Output

[1] PhD Position in AI and Strategy
    University: ETH Zürich
    Location: Zurich, Switzerland
    Deadline: Unspecified
    Fields: Business Administration, Management, Artificial Intelligence
    URL: https://academicpositions.com/ad/eth-zurich/2026/...

[2] Doctoral student in Radiofrequency ranging for Lunar orbits
    University: KTH Royal Institute of Technology
    Location: Stockholm, Sweden
    Deadline: 2026-01-31 (Europe/Stockholm)
    Fields: Physics, Space Science
    URL: https://academicpositions.com/ad/kth-royal-institute-of-technology/2025/...
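
The deadline strings in the output above come in two shapes: a date followed by a time zone in parentheses, or "Unspecified". If the deadline is needed as a date object, a small parser along these lines could be used. The helper parse_deadline is hypothetical, and it assumes deadlines always follow one of the two formats shown here.

```python
import re
from datetime import date
from typing import Optional, Tuple

# ISO date, optionally followed by a time zone name in parentheses.
DEADLINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})(?:\s+\(([^)]+)\))?$")


def parse_deadline(text: str) -> Optional[Tuple[date, Optional[str]]]:
    """Return (date, timezone name or None), or None if no date is present."""
    match = DEADLINE_RE.match(text.strip())
    if match is None:
        return None  # e.g. "Unspecified"
    try:
        parsed = date.fromisoformat(match.group(1))
    except ValueError:
        return None  # looks like a date but is not valid, e.g. month 13
    return parsed, match.group(2)


with_zone = parse_deadline("2026-01-31 (Europe/Stockholm)")
unspecified = parse_deadline("Unspecified")
print(with_zone, unspecified)
```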

Important Notes

  • Rate Limiting: The scraper includes built-in delays to be respectful of the server
  • Terms of Service: Please review academicpositions.com's terms before scraping
  • Data Accuracy: Always verify position details on the original website
  • Updates: Website structure may change; report issues if scraping fails
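
In addition to reading the terms of service, a site's robots.txt can be checked programmatically with the standard library's urllib.robotparser. The sketch below parses an inline sample so it runs offline; the sample rules and the user agent name are made up for illustration and do not reflect the actual robots.txt of academicpositions.com.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

user_agent = "my-phd-bot"  # hypothetical user agent string
allowed = parser.can_fetch(user_agent, "https://example.org/jobs/phd")
blocked = parser.can_fetch(user_agent, "https://example.org/private/data")
print(allowed, blocked)
```

Against a live site, the same parser can load the file itself by calling set_url() with the robots.txt address followed by read().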

License

MIT License
