A comprehensive toolkit for downloading and filtering Kepler DR25 FITS files from NASA's MAST archive
Project description
Kepler-Downloader-DR25
Research-Oriented Toolkit for Kepler Data Analysis
This toolkit is designed for researchers studying NASA Kepler Space Telescope data. It follows the data structure requirements of machine learning frameworks like ExoMiner and AstroNet while remaining flexible for diverse research applications beyond these specific frameworks.
Key Design Philosophy
- Research-First Approach: Built to support various astronomical research workflows
- ML-Ready: Compatible with ExoMiner, AstroNet, and other machine learning pipelines
- Flexible Architecture: Not limited to specific frameworks - adaptable to custom research needs
- Scientific Rigor: Maintains data integrity and validation standards required for publication-quality research
Overview
A comprehensive toolkit for downloading and filtering Kepler DR25 FITS files from NASA's MAST archive with intelligent mode detection, DVT validation, and universal filtering capabilities.
This project provides two main scripts and utility tools:
Main Scripts:
get-kepler-dr25.py- Main downloader with DVT filtering for ExoMiner mode and retry capabilities.filter-get-kepler-dr25.py- Universal filter with mode detection, conversion, and download of missing files.
Utility Tools (in util/ folder):
util/rebuild_database.py- Rebuild SQLite database from existing filesystem (useful for recovery)util/check_missing_kics.py- Compare CSV with downloaded KICs to find missing onesutil/generate_stats.py- Generate comprehensive statistics for completed jobsutil/test_health_report.py- Diagnostic tool to verify database and health report contents
Why This Toolkit?
The Challenge with Kepler Data:
- Full TCE dataset: ~400+ GB disk space required
- KOI dataset: ~200+ GB disk space required
- Common problems without proper tooling:
- Missing KICs due to network timeouts
- Corrupted FITS files from incomplete downloads
- Database inconsistencies from concurrent writes
- No recovery mechanism for partial failures
- Manual tracking of thousands of files
What This Toolkit Solves:
Researchers studying exoplanets, stellar variability, or other phenomena in Kepler data need efficient, reliable tools that:
- Prevent data loss: Redis buffering ensures zero database corruption even with network failures
- Handle scale: Successfully manages datasets with 17,000+ KICs (tested with full TCE catalog)
- Ensure completeness: Automatic retry mechanism and missing KIC detection
- Support modern ML workflows: Compatible with ExoMiner, AstroNet, custom models
- Provide verification: Health reports confirm data integrity and completeness
Key Features & Performance Metrics
Proven Performance:
- 5.5x faster than traditional bulk query approaches
- 99.9% success rate on 17,000+ KIC downloads (full TCE dataset)
- Zero database corruption with Redis write-ahead buffering
- Automatic recovery from network failures and timeouts
Core Capabilities:
- Research-Ready Formats - Supports both ExoMiner/AstroNet structure and MAST standard
- Universal filtering - Extract KOI subset from 400GB TCE data without re-downloading
- Mode detection - Automatically detects and converts between formats
- DVT validation - Ensures ML model compatibility with DVT file checking
- Job-based organization - Each run creates a timestamped job directory for reproducibility
- Parallel processing - 4-8 workers handle concurrent downloads efficiently
- Health reporting - Comprehensive analysis confirms data completeness
- Smart recovery - Automatically retries failed downloads and detects missing KICs
The standard Kepler light curve products available on the MAST archive are from the final Data Release 25 (DR25) processing.
Data Source and Attribution
NASA Kepler Mission Data
This toolkit downloads data from NASA's Kepler Space Telescope mission, which operated from 2009-2018 and revolutionized exoplanet science by discovering thousands of exoplanets. The data is hosted and distributed by:
- MAST (Mikulski Archive for Space Telescopes) - NASA's data archive hosted at the Space Telescope Science Institute (STScI)
- Data Release: DR25 (Data Release 25) - The final and most complete processing of Kepler data
- Archive URL: https://archive.stsci.edu/kepler/
Data Usage and Citation
When using Kepler data downloaded with this toolkit, please:
-
Acknowledge NASA and the Kepler mission in your publications
-
Cite the appropriate Kepler papers:
- Kepler Mission: Borucki et al. (2010) Science, 327, 977
- Kepler Data Characteristics: Thompson et al. (2016) Kepler Data Release Notes (KSCI-19044-005)
- DR25 Release: Available at MAST Kepler archive
-
Include the standard acknowledgment:
"This research has made use of the NASA Exoplanet Archive, which is operated by the California Institute of Technology, under contract with the National Aeronautics and Space Administration under the Exoplanet Exploration Program."
Data Products
The toolkit downloads the following NASA Kepler data products:
- Light Curve Files (*_llc.fits) - Time series photometry data
- Data Validation Files (*_dvt.fits) - Transit model fits and diagnostics
- Target Pixel Files (*_tpf.fits) - Pixel-level photometry (when requested)
Terms of Use
All Kepler data is in the public domain. There are no restrictions on the use of Kepler data. However, proper attribution and citations are expected as a matter of professional courtesy and scientific integrity.
For more information about Kepler data:
- MAST Kepler Archive: https://archive.stsci.edu/kepler/
- NASA Exoplanet Archive: https://exoplanetarchive.ipac.caltech.edu/
- Kepler Mission Page: https://www.nasa.gov/mission_pages/kepler/main/index.html
Directory Structure
Kepler-Downloader-DR25/
├── input/
│ └── your_targets.csv
├── input_samples/
│ ├── cumulative_koi_2025.09.06_13.27.56.csv # Kepler Objects of Interest
│ └── q1_q17_dr25_tce_2025.09.06_13.29.19.csv # Threshold Crossing Events
├── kepler_downloads/ # Default output directory
│ └── job-YYYYMMDD_HHMMSS/ # Timestamped job directory
│ ├── download_records.db # SQLite database with all records
│ ├── health_check_report.txt # Post-download health analysis
│ ├── reports/ # DVT filtering and other reports
│ └── [Data directories based on mode]
├── util/ # Utility scripts
│ ├── rebuild_database.py # Rebuild database from filesystem
│ ├── check_missing_kics.py # Find missing KICs
│ ├── generate_stats.py # Generate job statistics
│ └── test_health_report.py # Diagnostic tool
├── get-kepler-dr25.py # Main downloader with DVT filtering
├── filter-get-kepler-dr25.py # Universal filter with mode detection
├── setup.py # Package installation script
├── requirements.txt # Python dependencies
├── README.md # This file
├── QUICKSTART.md # Quick start guide
├── CHANGELOG.md # Version history
└── LICENSE # Apache 2.0 license
Output Formats
ExoMiner Format (Default)
kepler_downloads/job-*/
└── Kepler/
└── XXXX/ # First 4 digits of 9-digit KIC
└── XXXXXXXXX/ # Full 9-digit KIC
├── *_llc.fits # Light curve files
├── *_dvt.fits # Data validation files (required)
Standard MAST Format (--no-exominer flag)
kepler_downloads/job-*/
└── mastDownload/
└── Kepler/
└── kplrXXXXXXXXX_lc/ # Light curve files
└── kplrXXXXXXXXX_dv/ # Data validation and report files
Real-World Use Cases
Example: Filtering KOI from TCE Dataset
Problem: You downloaded the full TCE dataset (400+ GB, 17,230 KICs) but only need KOI data (8,214 KICs)
Traditional Approach: Re-download 200+ GB of KOI data, taking hours and risking incomplete downloads
With This Toolkit:
# Filter existing TCE data to extract KOI subset
python filter-get-kepler-dr25.py \
--input-csv input_samples/cumulative_koi_2025.09.06_13.27.56.csv \
--source-job kepler_downloads/job-with-tce-data
# Result: 7,141 KICs copied, 1,073 missing KICs automatically downloaded
# Time saved: Hours of redundant downloads
# Storage saved: 200+ GB of duplicate data
Security & Trust
This package implements comprehensive security measures:
- Trusted Publishing: Cryptographically verified releases via GitHub OIDC
- Attestations: PEP 740 compliant package attestations
- Signed Packages: Sigstore keyless signing for supply chain security
- SBOM: Software Bill of Materials for dependency transparency
- Security Scanning: Automated vulnerability scanning in CI/CD
For detailed security information, see SECURITY.md.
Installation
Prerequisites
-
Python 3.8+ with pip installed
-
Redis Server (optional but recommended for reliability)
-
Operating System: Windows, macOS, or Linux
Install Redis:
# macOS brew install redis && brew services start redis # Ubuntu/Debian sudo apt install redis-server && sudo systemctl start redis # Windows (Option 1: WSL2) wsl --install # Then follow Ubuntu instructions above # Windows (Option 2: Docker) docker run -d -p 6379:6379 --name redis-kepler redis:latest # Windows (Option 3: Memurai - Redis for Windows) # Download from: https://www.memurai.com/get-memurai
Installation Options
Option 1: Install from PyPI (Recommended)
# Install the package
pip install kepler-downloader-dr25
# Use command-line tools
kepler-download input/your_targets.csv
kepler-filter --input-csv input/kics.csv --source-job kepler_downloads/job-XXX
kepler-stats kepler_downloads/job-XXX
Option 2: Install from GitHub
# Clone and install
git clone https://github.com/akira921x/Kepler-Downloader-DR25.git
cd Kepler-Downloader-DR25
pip install -r requirements.txt
# Use scripts directly
python get-kepler-dr25.py input/your_targets.csv
Python Dependencies
Required packages (automatically installed with pip):
pandas- Data processingastroquery- MAST archive interfaceredis- Redis client for bufferingrequests- HTTP requestsnumpy- Numerical operationstqdm- Progress bars
Quick Start
# Install from PyPI
pip install kepler-downloader-dr25
# Quick test with 3 targets
echo "006922244,007799349,011446443" > test.csv
kepler-download test.csv
# Download real datasets
kepler-download cumulative_koi.csv # ~8,200 KOIs, ~200GB
kepler-download q1_q17_dr25_tce.csv # ~17,000 TCEs, ~400GB
# Filter existing data (save time & storage)
kepler-filter --input-csv koi.csv --source-job kepler_downloads/job-XXX
See Quick Start Guide for detailed instructions.
Usage
1. Downloading Data
Basic Download (ExoMiner Format - Default)
# Download from CSV files in input/ folder (ExoMiner format by default)
python get-kepler-dr25.py input/your_targets.csv
# Verbose mode (detailed output)
python get-kepler-dr25.py input/your_targets.csv --verbose
Standard MAST Format
# Download with Standard MAST structure (no DVT requirement)
python get-kepler-dr25.py input/your_targets.csv --no-exominer
# Strict DVT mode - skip KICs without DVT immediately (ExoMiner is default)
python get-kepler-dr25.py input/your_targets.csv --strict-dvt
# Backup KICs without DVT instead of deleting (ExoMiner is default)
python get-kepler-dr25.py input/your_targets.csv --backup-no-dvt
Advanced Options
# Custom configuration
python get-kepler-dr25.py input/your_targets.csv \
--workers 8 \
--batch-size 50 \
--output-dir custom_output
# Retry failed downloads from a previous run
python get-kepler-dr25.py input/your_targets.csv --retry-failed
2. Filtering Existing Data
The universal filter script can process any CSV with KIC IDs and intelligently handle mode conversions.
Basic Filtering
# Filter existing job with KOI data
python filter-get-kepler-dr25.py \
--input-csv input_samples/cumulative_koi_2025.09.06_13.27.56.csv \
--source-job kepler_downloads/job-20250906_020543
# Use Standard format instead of default ExoMiner
python filter-get-kepler-dr25.py \
--input-csv input/custom_kics.csv \
--source-job kepler_downloads/job-20250906_020543 \
--no-exominer
Mode Conversion
# Convert from Standard to ExoMiner format (ExoMiner is default target)
python filter-get-kepler-dr25.py \
--input-csv input/kics.csv \
--source-job kepler_downloads/standard_job \
--force-mode # Required when modes don't match
# Disable DVT validation for ExoMiner
python filter-get-kepler-dr25.py \
--input-csv input/kics.csv \
--source-job kepler_downloads/job-20250906 \
--no-validate-dvt
Download Missing KICs
# Filter and download missing KICs from MAST
python filter-get-kepler-dr25.py \
--input-csv input/kics.csv \
--source-job kepler_downloads/job-20250906
# Skip downloading missing KICs
python filter-get-kepler-dr25.py \
--input-csv input/kics.csv \
--source-job kepler_downloads/job-20250906 \
--no-download-missing
Command-Line Options
get-kepler-dr25.py
| Option | Description | Required | Default |
|---|---|---|---|
csv_file |
Input CSV file with KIC IDs | Yes | - |
--output-dir |
Output directory | No | kepler_downloads |
--workers |
Number of parallel workers | No | 4 |
--batch-size |
KICs per batch | No | 50 |
--no-exominer |
Use Standard MAST format instead of ExoMiner | No | False (ExoMiner enabled) |
--strict-dvt |
Skip KICs without DVT immediately (ExoMiner mode) | No | False |
--backup-no-dvt |
Backup instead of delete no-DVT KICs (ExoMiner mode) | No | False |
--retry-failed |
Retry failed downloads from a previous run. | No | False |
--verbose |
Enable verbose logging | No | False |
filter-get-kepler-dr25.py
| Option | Description | Required | Default |
|---|---|---|---|
--input-csv |
Input CSV with KIC IDs | Yes | - |
--source-job |
Source job folder | Yes | - |
--no-exominer |
Use Standard MAST format instead of ExoMiner | No | False (ExoMiner enabled) |
--output-dir |
Output directory | No | Auto-generated timestamp |
--force-mode |
Force mode conversion even if incompatible | No | False |
--no-validate-dvt |
Disable DVT validation for ExoMiner mode | No | False |
--no-download-missing |
Skip downloading missing KICs from MAST | No | False (download enabled) |
--workers |
Number of parallel workers for downloads | No | 4 |
--batch-size |
Batch size for downloads | No | 50 |
--verbose |
Enable verbose logging | No | False |
Mode Detection and Compatibility
The filter script automatically detects job modes:
ExoMiner/AstroNet Mode (Default)
- Structure:
Kepler/XXXX/XXXXXXXXX/ - Requires DVT files for each KIC
- Optimized for machine learning workflows
- Compatible with ExoMiner, AstroNet, and similar ML frameworks
- Supports custom research pipelines requiring structured data organization
Standard Mode
- Structure:
mastDownload/Kepler/kplr*_lc/ - MAST's default organization
- No DVT requirement
- Compatible with traditional analysis tools
Mode Compatibility Rules
- Same mode: Direct copy, no conversion needed
- Different modes: Requires
--force-modeflag - ExoMiner target: DVT validation enabled by default
- Mode mismatch: Detailed report explains incompatibility
Health Reports
Both scripts generate comprehensive health reports:
Download Health Report
- Download statistics (success/failure rates)
- DVT coverage for ExoMiner mode
- File inventory by type
- Performance metrics
- Failed KIC list with errors
Filter Health Report
- Source job analysis (mode, structure, statistics)
- Mode compatibility assessment
- Processing statistics
- DVT validation results (ExoMiner)
- Recommendations for issues
DVT Filtering (ExoMiner Mode)
When using ExoMiner mode, the system handles DVT (Data Validation) files:
-
During Download (
get-kepler-dr25.py --exominer):- Downloads both LLC and DVT files
- Tracks DVT availability
- Post-download filtering removes KICs without DVT
- Optional backup with
--backup-no-dvt
-
During Filtering (
filter-get-kepler-dr25.py):- Validates DVT presence for ExoMiner target
- Moves no-DVT KICs to backup
- Reports DVT coverage statistics
Performance
Expected Performance
Download Performance (with 4 workers, default settings):
- Processing rate: ~50-60 KICs/minute
- Small datasets (< 100 KICs): 2-5 minutes
- Medium datasets (1,000 KICs): 20-30 minutes
- KOI dataset (~8,200 KICs): 2.5-3 hours (~200 GB)
- Full TCE dataset (~17,000 KICs): 5-6 hours (~400+ GB)
Why Traditional Methods Fail at Scale:
- No automatic retry for network timeouts
- Database corruption from concurrent writes
- No progress tracking or recovery mechanism
- Missing KIC detection requires manual verification
- Network interruptions cause incomplete FITS files
Optimization Tips
- Increase workers for faster downloads:
--workers 8 - Use larger batches:
--batch-size 100 - Ensure Redis is running for optimal performance
- Use
--strict-dvtto skip no-DVT KICs early - Run during off-peak hours for better MAST response
Database Features
Tables Created
download_records
- KIC ID and success status
- File counts (LLC and DVT)
- DVT presence flag
- Error messages
- Removal reasons (ExoMiner mode)
file_inventory
- Complete file catalog
- File types and sizes
- Download timestamps
removed_kics (ExoMiner mode)
- KICs removed for lacking DVT
- File statistics before removal
- Removal timestamps
filter_operations (filter script)
- Source and target modes
- Operation type (copy/download)
- Success status
- DVT validation results
Utility Tools
Check Missing KICs
# Compare CSV with downloaded KICs
python util/check_missing_kics.py input/target.csv kepler_downloads/job-20250907_015817
# Output: Creates missing_kics_job-20250907_015817.csv
Generate Statistics
# Generate comprehensive statistics for a job
python util/generate_stats.py kepler_downloads/job-20250907_015817
# Export statistics to CSV
python util/generate_stats.py kepler_downloads/job-20250907_015817 --export stats.csv
Rebuild Database
# Rebuild database from filesystem (useful for recovery)
python util/rebuild_database.py kepler_downloads/job-20250907_015817
Test Health Report
# Verify database and health report contents
python util/test_health_report.py
Troubleshooting
Redis Connection Issues
# Check Redis status
redis-cli ping # Should return PONG
# Start Redis
brew services start redis # macOS
sudo systemctl start redis # Linux
Database Shows All Zeros (Old Downloads)
This was a known bug (missing conn.commit()) that has been fixed. For existing downloads:
# Rebuild the database from filesystem
python util/rebuild_database.py kepler_downloads/job-YYYYMMDD_HHMMSS
# This will scan all FITS files and recreate the database
Mode Incompatibility
- Check health report for details
- Use
--force-modeto override (use cautiously) - Consider target mode requirements
DVT Missing (ExoMiner)
- Some KICs don't have DVT files in MAST
- Use
--backup-no-dvtto preserve data - Consider Standard mode for analysis
Download Failures
- Check network connectivity
- Verify KIC exists in MAST
- Review error messages in health report
- Retry with
--retry-failed
Related Projects
Machine Learning Frameworks for Exoplanet Detection
- ExoMiner - NASA's deep learning model for exoplanet detection
- AstroNet - Google's neural network for identifying exoplanets
This toolkit provides data in formats compatible with these frameworks while maintaining flexibility for custom research applications.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Development Setup
# Clone the repository
git clone https://github.com/akira921x/Kepler-Downloader-DR25.git
cd Kepler-Downloader-DR25
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install in development mode
pip install -e .
# Run tests (if available)
python -m pytest tests/
Version History
See CHANGELOG.md for a detailed version history.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Copyright 2025 Kepler-Downloader-DR25 Project
Acknowledgments
- NASA Kepler Mission Team
- MAST Archive at STScI
- ExoMiner and AstroNet teams for ML framework specifications
- Open source community for invaluable tools and libraries
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file kepler_downloader_dr25-1.4.0.tar.gz.
File metadata
- Download URL: kepler_downloader_dr25-1.4.0.tar.gz
- Upload date:
- Size: 3.7 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
44399e46315c378a6ece7802210a93da17791297aa30fa5c829c8e4c61b59b4f
|
|
| MD5 |
d2829d78795665fcd196785806fd4d0d
|
|
| BLAKE2b-256 |
bed7e38c7964b4fc04ae18ba62a1a0b3d594784bf24c94a6dc605ceceb100c45
|
File details
Details for the file kepler_downloader_dr25-1.4.0-py3-none-any.whl.
File metadata
- Download URL: kepler_downloader_dr25-1.4.0-py3-none-any.whl
- Upload date:
- Size: 41.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4ed850a1a102e05d1149f8b2afafd62e2ea6e2eea4f08f85e1fe9dbdca47ae34
|
|
| MD5 |
45afe04eb45dedff40ddb9b8b33f81da
|
|
| BLAKE2b-256 |
4a65b09f82192982dc7f759545df7367fc7a269eb7060e06c2ad1c0e58e9c074
|