Skip to main content

A comprehensive toolkit for downloading and filtering Kepler DR25 FITS files from NASA's MAST archive

Project description

Kepler-Downloader-DR25

Research-Oriented Toolkit for Kepler Data Analysis

This toolkit is designed for researchers studying NASA Kepler Space Telescope data. It follows the data structure requirements of machine learning frameworks like ExoMiner and AstroNet while remaining flexible for diverse research applications beyond these specific frameworks.

Key Design Philosophy

  • Research-First Approach: Built to support various astronomical research workflows
  • ML-Ready: Compatible with ExoMiner, AstroNet, and other machine learning pipelines
  • Flexible Architecture: Not limited to specific frameworks - adaptable to custom research needs
  • Scientific Rigor: Maintains data integrity and validation standards required for publication-quality research

Overview

A comprehensive toolkit for downloading and filtering Kepler DR25 FITS files from NASA's MAST archive with intelligent mode detection, DVT validation, and universal filtering capabilities.

This project provides two main scripts and utility tools:

Main Scripts:

  1. get-kepler-dr25.py - Main downloader with DVT filtering for ExoMiner mode and retry capabilities.
  2. filter-get-kepler-dr25.py - Universal filter with mode detection, conversion, and download of missing files.

Utility Tools (in util/ folder):

  • util/rebuild_database.py - Rebuild SQLite database from existing filesystem (useful for recovery)
  • util/check_missing_kics.py - Compare CSV with downloaded KICs to find missing ones
  • util/generate_stats.py - Generate comprehensive statistics for completed jobs
  • util/test_health_report.py - Diagnostic tool to verify database and health report contents

Why This Toolkit?

The Challenge with Kepler Data:

  • Full TCE dataset: ~400+ GB disk space required
  • KOI dataset: ~200+ GB disk space required
  • Common problems without proper tooling:
    • Missing KICs due to network timeouts
    • Corrupted FITS files from incomplete downloads
    • Database inconsistencies from concurrent writes
    • No recovery mechanism for partial failures
    • Manual tracking of thousands of files

What This Toolkit Solves:

Researchers studying exoplanets, stellar variability, or other phenomena in Kepler data need efficient, reliable tools that:

  • Prevent data loss: Redis buffering ensures zero database corruption even with network failures
  • Handle scale: Successfully manages datasets with 17,000+ KICs (tested with full TCE catalog)
  • Ensure completeness: Automatic retry mechanism and missing KIC detection
  • Support modern ML workflows: Compatible with ExoMiner, AstroNet, custom models
  • Provide verification: Health reports confirm data integrity and completeness

Key Features & Performance Metrics

Proven Performance:

  • 5.5x faster than traditional bulk query approaches
  • 99.9% success rate on 17,000+ KIC downloads (full TCE dataset)
  • Zero database corruption with Redis write-ahead buffering
  • Automatic recovery from network failures and timeouts

Core Capabilities:

  • Research-Ready Formats - Supports both ExoMiner/AstroNet structure and MAST standard
  • Universal filtering - Extract KOI subset from 400GB TCE data without re-downloading
  • Mode detection - Automatically detects and converts between formats
  • DVT validation - Ensures ML model compatibility with DVT file checking
  • Job-based organization - Each run creates a timestamped job directory for reproducibility
  • Parallel processing - 4-8 workers handle concurrent downloads efficiently
  • Health reporting - Comprehensive analysis confirms data completeness
  • Smart recovery - Automatically retries failed downloads and detects missing KICs

The standard Kepler light curve products available on the MAST archive are from the final Data Release 25 (DR25) processing.

Data Source and Attribution

NASA Kepler Mission Data

This toolkit downloads data from NASA's Kepler Space Telescope mission, which operated from 2009-2018 and revolutionized exoplanet science by discovering thousands of exoplanets. The data is hosted and distributed by:

  • MAST (Mikulski Archive for Space Telescopes) - NASA's data archive hosted at the Space Telescope Science Institute (STScI)
  • Data Release: DR25 (Data Release 25) - The final and most complete processing of Kepler data
  • Archive URL: https://archive.stsci.edu/kepler/

Data Usage and Citation

When using Kepler data downloaded with this toolkit, please:

  1. Acknowledge NASA and the Kepler mission in your publications

  2. Cite the appropriate Kepler papers:

    • Kepler Mission: Borucki et al. (2010) Science, 327, 977
    • Kepler Data Characteristics: Thompson et al. (2016) Kepler Data Release Notes (KSCI-19044-005)
    • DR25 Release: Available at MAST Kepler archive
  3. Include the standard acknowledgment:

    "This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program."

Data Products

The toolkit downloads the following NASA Kepler data products:

  • Light Curve Files (*_llc.fits) - Time series photometry data
  • Data Validation Files (*_dvt.fits) - Transit model fits and diagnostics
  • Target Pixel Files (*_tpf.fits) - Pixel-level photometry (when requested)

Terms of Use

All Kepler data is in the public domain. There are no restrictions on the use of Kepler data. However, proper attribution and citations are expected as a matter of professional courtesy and scientific integrity.

For more information about Kepler data:

Directory Structure

Kepler-Downloader-DR25/
├── input/
│   └── your_targets.csv
├── input_samples/
│   ├── cumulative_koi_2025.09.06_13.27.56.csv  # Kepler Objects of Interest
│   └── q1_q17_dr25_tce_2025.09.06_13.29.19.csv # Threshold Crossing Events
├── kepler_downloads/                # Default output directory
│   └── job-YYYYMMDD_HHMMSS/        # Timestamped job directory
│       ├── download_records.db      # SQLite database with all records
│       ├── health_check_report.txt  # Post-download health analysis
│       ├── reports/                  # DVT filtering and other reports
│       └── [Data directories based on mode]
├── util/                            # Utility scripts
│   ├── rebuild_database.py          # Rebuild database from filesystem
│   ├── check_missing_kics.py        # Find missing KICs
│   ├── generate_stats.py            # Generate job statistics
│   └── test_health_report.py        # Diagnostic tool
├── get-kepler-dr25.py              # Main downloader with DVT filtering
├── filter-get-kepler-dr25.py       # Universal filter with mode detection
├── setup.py                         # Package installation script
├── requirements.txt                 # Python dependencies
├── README.md                        # This file
├── QUICKSTART.md                   # Quick start guide
├── CHANGELOG.md                     # Version history
└── LICENSE                          # Apache 2.0 license

Output Formats

ExoMiner Format (Default)

kepler_downloads/job-*/
└── Kepler/
    └── XXXX/                  # First 4 digits of 9-digit KIC
        └── XXXXXXXXX/         # Full 9-digit KIC
            ├── *_llc.fits     # Light curve files
            ├── *_dvt.fits     # Data validation files (required)

Standard MAST Format (--no-exominer flag)

kepler_downloads/job-*/
└── mastDownload/
    └── Kepler/
        └── kplrXXXXXXXXX_lc/  # Light curve files
        └── kplrXXXXXXXXX_dv/  # Data validation and report files

Real-World Use Cases

Example: Filtering KOI from TCE Dataset

Problem: You downloaded the full TCE dataset (400+ GB, 17,230 KICs) but only need KOI data (8,214 KICs)

Traditional Approach: Re-download 200+ GB of KOI data, taking hours and risking incomplete downloads

With This Toolkit:

# Filter existing TCE data to extract KOI subset
python filter-get-kepler-dr25.py \
  --input-csv input_samples/cumulative_koi_2025.09.06_13.27.56.csv \
  --source-job kepler_downloads/job-with-tce-data

# Result: 7,141 KICs copied, 1,073 missing KICs automatically downloaded
# Time saved: Hours of redundant downloads
# Storage saved: 200+ GB of duplicate data

Security & Trust

PyPI Verified Security Scan

This package implements comprehensive security measures:

  • Trusted Publishing: Cryptographically verified releases via GitHub OIDC
  • Attestations: PEP 740 compliant package attestations
  • Signed Packages: Sigstore keyless signing for supply chain security
  • SBOM: Software Bill of Materials for dependency transparency
  • Security Scanning: Automated vulnerability scanning in CI/CD

For detailed security information, see SECURITY.md.

Installation

Prerequisites

  1. Python 3.7+ with pip installed

  2. Redis Server (required for reliable database operations)

    Install Redis:

    # macOS
    brew install redis && brew services start redis
    
    # Ubuntu/Debian
    sudo apt install redis-server && sudo systemctl start redis
    
    # Docker
    docker run -d -p 6379:6379 --name redis-kepler redis:latest
    

Installation Options

Option 1: Install from PyPI (Recommended)

# Install the package
pip install kepler-downloader-dr25

# Use command-line tools
kepler-download input/your_targets.csv
kepler-filter --input-csv input/kics.csv --source-job kepler_downloads/job-XXX
kepler-stats kepler_downloads/job-XXX

Option 2: Install from GitHub

# Clone and install
git clone https://github.com/akira921x/Kepler-Downloader-DR25.git
cd Kepler-Downloader-DR25
pip install -r requirements.txt

# Use scripts directly
python get-kepler-dr25.py input/your_targets.csv

Python Dependencies

Required packages (automatically installed with pip):

  • pandas - Data processing
  • astroquery - MAST archive interface
  • redis - Redis client for buffering
  • requests - HTTP requests
  • numpy - Numerical operations
  • tqdm - Progress bars

Quick Start

# Install from PyPI
pip install kepler-downloader-dr25

# Quick test with 3 targets
echo "006922244,007799349,011446443" > test.csv
kepler-download test.csv

# Download real datasets
kepler-download cumulative_koi.csv    # ~8,200 KOIs, ~200GB
kepler-download q1_q17_dr25_tce.csv   # ~17,000 TCEs, ~400GB

# Filter existing data (save time & storage)
kepler-filter --input-csv koi.csv --source-job kepler_downloads/job-XXX

See QUICKSTART.md for detailed quick start guide.

Usage

1. Downloading Data

Basic Download (ExoMiner Format - Default)

# Download from CSV files in input/ folder (ExoMiner format by default)
python get-kepler-dr25.py input/your_targets.csv

# Verbose mode (detailed output)
python get-kepler-dr25.py input/your_targets.csv --verbose

Standard MAST Format

# Download with Standard MAST structure (no DVT requirement)
python get-kepler-dr25.py input/your_targets.csv --no-exominer

# Strict DVT mode - skip KICs without DVT immediately (ExoMiner is default)
python get-kepler-dr25.py input/your_targets.csv --strict-dvt

# Backup KICs without DVT instead of deleting (ExoMiner is default)
python get-kepler-dr25.py input/your_targets.csv --backup-no-dvt

Advanced Options

# Custom configuration
python get-kepler-dr25.py input/your_targets.csv \
  --workers 8 \
  --batch-size 50 \
  --output-dir custom_output

# Retry failed downloads from a previous run
python get-kepler-dr25.py input/your_targets.csv --retry-failed

2. Filtering Existing Data

The universal filter script can process any CSV with KIC IDs and intelligently handle mode conversions.

Basic Filtering

# Filter existing job with KOI data
python filter-get-kepler-dr25.py \
  --input-csv input_samples/cumulative_koi_2025.09.06_13.27.56.csv \
  --source-job kepler_downloads/job-20250906_020543

# Use Standard format instead of default ExoMiner
python filter-get-kepler-dr25.py \
  --input-csv input/custom_kics.csv \
  --source-job kepler_downloads/job-20250906_020543 \
  --no-exominer

Mode Conversion

# Convert from Standard to ExoMiner format (ExoMiner is default target)
python filter-get-kepler-dr25.py \
  --input-csv input/kics.csv \
  --source-job kepler_downloads/standard_job \
  --force-mode  # Required when modes don't match

# Disable DVT validation for ExoMiner
python filter-get-kepler-dr25.py \
  --input-csv input/kics.csv \
  --source-job kepler_downloads/job-20250906 \
  --no-validate-dvt

Download Missing KICs

# Filter and download missing KICs from MAST
python filter-get-kepler-dr25.py \
  --input-csv input/kics.csv \
  --source-job kepler_downloads/job-20250906

# Skip downloading missing KICs
python filter-get-kepler-dr25.py \
  --input-csv input/kics.csv \
  --source-job kepler_downloads/job-20250906 \
  --no-download-missing

Command-Line Options

get-kepler-dr25.py

Option Description Required Default
csv_file Input CSV file with KIC IDs Yes -
--output-dir Output directory No kepler_downloads
--workers Number of parallel workers No 4
--batch-size KICs per batch No 50
--no-exominer Use Standard MAST format instead of ExoMiner No False (ExoMiner enabled)
--strict-dvt Skip KICs without DVT immediately (ExoMiner mode) No False
--backup-no-dvt Backup instead of delete no-DVT KICs (ExoMiner mode) No False
--retry-failed Retry failed downloads from a previous run. No False
--verbose Enable verbose logging No False

filter-get-kepler-dr25.py

Option Description Required Default
--input-csv Input CSV with KIC IDs Yes -
--source-job Source job folder Yes -
--no-exominer Use Standard MAST format instead of ExoMiner No False (ExoMiner enabled)
--output-dir Output directory No Auto-generated timestamp
--force-mode Force mode conversion even if incompatible No False
--no-validate-dvt Disable DVT validation for ExoMiner mode No False
--no-download-missing Skip downloading missing KICs from MAST No False (download enabled)
--workers Number of parallel workers for downloads No 4
--batch-size Batch size for downloads No 50
--verbose Enable verbose logging No False

Mode Detection and Compatibility

The filter script automatically detects job modes:

ExoMiner/AstroNet Mode (Default)

  • Structure: Kepler/XXXX/XXXXXXXXX/
  • Requires DVT files for each KIC
  • Optimized for machine learning workflows
  • Compatible with ExoMiner, AstroNet, and similar ML frameworks
  • Supports custom research pipelines requiring structured data organization

Standard Mode

  • Structure: mastDownload/Kepler/kplr*_lc/
  • MAST's default organization
  • No DVT requirement
  • Compatible with traditional analysis tools

Mode Compatibility Rules

  1. Same mode: Direct copy, no conversion needed
  2. Different modes: Requires --force-mode flag
  3. ExoMiner target: DVT validation enabled by default
  4. Mode mismatch: Detailed report explains incompatibility

Health Reports

Both scripts generate comprehensive health reports:

Download Health Report

  • Download statistics (success/failure rates)
  • DVT coverage for ExoMiner mode
  • File inventory by type
  • Performance metrics
  • Failed KIC list with errors

Filter Health Report

  • Source job analysis (mode, structure, statistics)
  • Mode compatibility assessment
  • Processing statistics
  • DVT validation results (ExoMiner)
  • Recommendations for issues

DVT Filtering (ExoMiner Mode)

When using ExoMiner mode, the system handles DVT (Data Validation) files:

  1. During Download (get-kepler-dr25.py --exominer):

    • Downloads both LLC and DVT files
    • Tracks DVT availability
    • Post-download filtering removes KICs without DVT
    • Optional backup with --backup-no-dvt
  2. During Filtering (filter-get-kepler-dr25.py):

    • Validates DVT presence for ExoMiner target
    • Moves no-DVT KICs to backup
    • Reports DVT coverage statistics

Performance

Expected Performance

Download Performance (with 4 workers, default settings):

  • Processing rate: ~50-60 KICs/minute
  • Small datasets (< 100 KICs): 2-5 minutes
  • Medium datasets (1,000 KICs): 20-30 minutes
  • KOI dataset (~8,200 KICs): 2.5-3 hours (~200 GB)
  • Full TCE dataset (~17,000 KICs): 5-6 hours (~400+ GB)

Why Traditional Methods Fail at Scale:

  • No automatic retry for network timeouts
  • Database corruption from concurrent writes
  • No progress tracking or recovery mechanism
  • Missing KIC detection requires manual verification
  • Network interruptions cause incomplete FITS files

Optimization Tips

  1. Increase workers for faster downloads: --workers 8
  2. Use larger batches: --batch-size 100
  3. Ensure Redis is running for optimal performance
  4. Use --strict-dvt to skip no-DVT KICs early
  5. Run during off-peak hours for better MAST response

Database Features

Tables Created

download_records

  • KIC ID and success status
  • File counts (LLC and DVT)
  • DVT presence flag
  • Error messages
  • Removal reasons (ExoMiner mode)

file_inventory

  • Complete file catalog
  • File types and sizes
  • Download timestamps

removed_kics (ExoMiner mode)

  • KICs removed for lacking DVT
  • File statistics before removal
  • Removal timestamps

filter_operations (filter script)

  • Source and target modes
  • Operation type (copy/download)
  • Success status
  • DVT validation results

Utility Tools

Check Missing KICs

# Compare CSV with downloaded KICs
python util/check_missing_kics.py input/target.csv kepler_downloads/job-20250907_015817

# Output: Creates missing_kics_job-20250907_015817.csv

Generate Statistics

# Generate comprehensive statistics for a job
python util/generate_stats.py kepler_downloads/job-20250907_015817

# Export statistics to CSV
python util/generate_stats.py kepler_downloads/job-20250907_015817 --export stats.csv

Rebuild Database

# Rebuild database from filesystem (useful for recovery)
python util/rebuild_database.py kepler_downloads/job-20250907_015817

Test Health Report

# Verify database and health report contents
python util/test_health_report.py

Troubleshooting

Redis Connection Issues

# Check Redis status
redis-cli ping  # Should return PONG

# Start Redis
brew services start redis  # macOS
sudo systemctl start redis  # Linux

Database Shows All Zeros (Old Downloads)

This was a known bug (missing conn.commit()) that has been fixed. For existing downloads:

# Rebuild the database from filesystem
python util/rebuild_database.py kepler_downloads/job-YYYYMMDD_HHMMSS

# This will scan all FITS files and recreate the database

Mode Incompatibility

  • Check health report for details
  • Use --force-mode to override (use cautiously)
  • Consider target mode requirements

DVT Missing (ExoMiner)

  • Some KICs don't have DVT files in MAST
  • Use --backup-no-dvt to preserve data
  • Consider Standard mode for analysis

Download Failures

  • Check network connectivity
  • Verify KIC exists in MAST
  • Review error messages in health report
  • Retry with --retry-failed

Related Projects

Machine Learning Frameworks for Exoplanet Detection

  • ExoMiner - NASA's deep learning model for exoplanet detection
  • AstroNet - Google's neural network for identifying exoplanets

This toolkit provides data in formats compatible with these frameworks while maintaining flexibility for custom research applications.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Development Setup

# Clone the repository
git clone https://github.com/akira921x/Kepler-Downloader-DR25.git
cd Kepler-Downloader-DR25

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e .

# Run tests (if available)
python -m pytest tests/

Version History

See CHANGELOG.md for a detailed version history.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Copyright 2025 Kepler-Downloader-DR25 Project

Acknowledgments

  • NASA Kepler Mission Team
  • MAST Archive at STScI
  • ExoMiner and AstroNet teams for ML framework specifications
  • Open source community for invaluable tools and libraries

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kepler_downloader_dr25-1.2.2.tar.gz (3.7 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

kepler_downloader_dr25-1.2.2-py3-none-any.whl (41.4 kB view details)

Uploaded Python 3

File details

Details for the file kepler_downloader_dr25-1.2.2.tar.gz.

File metadata

  • Download URL: kepler_downloader_dr25-1.2.2.tar.gz
  • Upload date:
  • Size: 3.7 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for kepler_downloader_dr25-1.2.2.tar.gz
Algorithm Hash digest
SHA256 791da09dbd51f137f59d85ffc26e33d287eb98eadbb3b83132b8addcf2c26fce
MD5 7b027e8c3039ea569ac00ada69e1e0de
BLAKE2b-256 a3985da5c8354ed549f438c72137980d2120309c1763223faf307714151a310f

See more details on using hashes here.

Provenance

The following attestation bundles were made for kepler_downloader_dr25-1.2.2.tar.gz:

Publisher: release.yml on akira921x/Kepler-Downloader-DR25

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file kepler_downloader_dr25-1.2.2-py3-none-any.whl.

File metadata

File hashes

Hashes for kepler_downloader_dr25-1.2.2-py3-none-any.whl
Algorithm Hash digest
SHA256 ca53bd26f9772e000a704b9e93a45418541aa2a89fb8695fbd8ef02f16e584d8
MD5 7e943c4901a7f546373a4e8cf2d17e99
BLAKE2b-256 cbb71764d7baaf970e8b5264cea45ddcd97852ee810cb8820f7ab5a98abd232b

See more details on using hashes here.

Provenance

The following attestation bundles were made for kepler_downloader_dr25-1.2.2-py3-none-any.whl:

Publisher: release.yml on akira921x/Kepler-Downloader-DR25

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page