Kaggle Discussion Extractor

Requires Python 3.8+. License: MIT. Browser automation: Playwright.

A Python tool for extracting and analyzing discussions from Kaggle competitions. It extracts replies hierarchically, preserving parent-child relationships, follows paginated discussion lists, and writes clean Markdown output.

🚀 Key Features

Hierarchical Discussion Extraction

  • Complete Thread Preservation: Maintains the full discussion structure with parent-child relationships
  • Smart Reply Numbering: Automatic hierarchical numbering (1, 1.1, 1.2, 2, 2.1, etc.)
  • No Content Duplication: Keeps each parent reply's content separate from that of its nested replies, so no text appears twice
  • Deep Nesting Support: Handles multiple levels of nested replies
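
The numbering scheme described above can be sketched as a depth-first walk over a reply tree. This is only an illustration under stated assumptions: the `Reply` class and `number_replies` function are hypothetical names, not part of the package's API, and the real implementation may work differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Reply:
    """Hypothetical reply node; not the package's internal type."""
    author: str
    children: List["Reply"] = field(default_factory=list)


def number_replies(replies: List[Reply], prefix: str = "") -> List[Tuple[str, str]]:
    """Return (label, author) pairs in depth-first order: 1, 1.1, 1.2, 2, ..."""
    numbered: List[Tuple[str, str]] = []
    for index, reply in enumerate(replies, start=1):
        # Top-level replies get "1", "2", ...; nested ones extend the parent label.
        label = f"{prefix}.{index}" if prefix else str(index)
        numbered.append((label, reply.author))
        numbered.extend(number_replies(reply.children, prefix=label))
    return numbered


thread = [
    Reply("user1", [Reply("user2"), Reply("user3")]),
    Reply("user4"),
]
for label, author in number_replies(thread):
    print(label, author)
```

Because the label of a nested reply is built from its parent's label, the same function handles any nesting depth without special cases.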

Rich Metadata Extraction

  • Author Information: Names, usernames, profile URLs
  • Competition Rankings: Extracts "Nth in this Competition" rankings
  • User Badges: Competition Host, Expert, Master, Grandmaster badges
  • Engagement Metrics: Upvotes/downvotes for all posts and replies
  • Timestamps: Full timestamp extraction for temporal analysis

Advanced Capabilities

  • Pagination Support: Automatically handles multi-page discussion lists
  • Batch Processing: Extract all discussions from a competition at once
  • Rate Limiting: Built-in delays to respect server resources
  • Error Recovery: Robust error handling with detailed logging
  • Markdown Output: Exports each discussion as a cleanly formatted Markdown file
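
To illustrate what rate limiting and error recovery can look like in an async scraper, here is a minimal sketch. It is not the package's code: `fetch_all`, its parameters, and the retry policy are assumptions made for the example, and the `fetch` callable stands in for whatever actually loads a page.

```python
import asyncio
import logging
from typing import Awaitable, Callable, List, Optional

logger = logging.getLogger("extractor_sketch")


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    delay_seconds: float = 1.0,
    max_attempts: int = 3,
) -> List[Optional[str]]:
    """Fetch URLs one at a time, pausing between requests and retrying failures.

    A URL that still fails after max_attempts yields None instead of
    aborting the whole batch.
    """
    results: List[Optional[str]] = []
    for url in urls:
        result: Optional[str] = None
        for attempt in range(1, max_attempts + 1):
            try:
                result = await fetch(url)
                break
            except Exception as exc:  # log and retry rather than crash
                logger.warning(
                    "Attempt %d/%d failed for %s: %s",
                    attempt, max_attempts, url, exc,
                )
                # Wait longer after each failed attempt.
                await asyncio.sleep(delay_seconds * attempt)
        results.append(result)
        # Fixed pause between URLs to avoid hammering the server.
        await asyncio.sleep(delay_seconds)
    return results
```

Processing URLs sequentially keeps the request rate predictable; a concurrent design would need a semaphore or similar limiter to achieve the same effect.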

📦 Installation

Method 1: Install from PyPI (Recommended)

pip install kaggle-discussion-extractor
playwright install chromium

Method 2: Install from Source

# Clone the repository
git clone https://github.com/yourusername/kaggle-discussion-extractor.git
cd kaggle-discussion-extractor

# Install in development mode
pip install -e .
playwright install chromium

🎯 Quick Start

Command Line Usage

# Extract all discussions from a competition
kaggle-discussion-extractor https://www.kaggle.com/competitions/neurips-2025

# Extract only 10 discussions
kaggle-discussion-extractor https://www.kaggle.com/competitions/neurips-2025 --limit 10

# Enable development mode for detailed logging
kaggle-discussion-extractor https://www.kaggle.com/competitions/neurips-2025 --dev-mode

# Run with visible browser (useful for debugging)
kaggle-discussion-extractor https://www.kaggle.com/competitions/neurips-2025 --no-headless

Python API Usage

import asyncio
from kaggle_discussion_extractor import KaggleDiscussionExtractor

async def extract_discussions():
    # Initialize extractor
    extractor = KaggleDiscussionExtractor(dev_mode=True)
    
    # Extract discussions
    success = await extractor.extract_competition_discussions(
        competition_url="https://www.kaggle.com/competitions/neurips-2025",
        limit=5  # Optional: limit number of discussions
    )
    
    if success:
        print("Extraction completed successfully!")
    else:
        print("Extraction failed!")

# Run the extraction
asyncio.run(extract_discussions())

📋 CLI Options

| Option | Description | Default |
|---|---|---|
| competition_url | URL of the Kaggle competition (required) | - |
| --limit, -l | Number of discussions to extract | All |
| --dev-mode, -d | Enable detailed logging | False |
| --no-headless | Run browser in visible mode | False (headless) |
| --version, -v | Show version information | - |
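
For readers who want to see how the options above map onto parsed values, the following `argparse` sketch mirrors the table. It is an assumption-based reconstruction, not the contents of the package's `cli.py`; the `build_parser` name, help strings, and version text are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="kaggle-discussion-extractor")
    parser.add_argument("competition_url", help="URL of the Kaggle competition")
    parser.add_argument(
        "--limit", "-l", type=int, default=None,
        help="Number of discussions to extract (default: all)",
    )
    parser.add_argument(
        "--dev-mode", "-d", action="store_true",
        help="Enable detailed logging",
    )
    parser.add_argument(
        "--no-headless", action="store_true",
        help="Run the browser in visible mode",
    )
    # Placeholder version text; the real CLI reports its own version.
    parser.add_argument(
        "--version", "-v", action="version", version="%(prog)s (sketch)",
    )
    return parser


args = build_parser().parse_args(
    ["https://www.kaggle.com/competitions/neurips-2025", "-l", "10", "--no-headless"]
)
print(args.limit, args.dev_mode, args.no_headless)
```

Note that `--no-headless` is a negative flag: passing it sets `no_headless` to `True`, which corresponds to `headless=False` in the Python API.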

📁 Output Structure

The extractor creates a kaggle_discussions_extracted directory containing one numbered Markdown file per discussion:

kaggle_discussions_extracted/
├── 01_Discussion_Title.md
├── 02_Another_Discussion.md
├── 03_Third_Discussion.md
└── ...
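
The file names in the tree above appear to combine a zero-padded index with a sanitized title. A minimal sketch of such a naming rule follows; the `discussion_filename` helper and the exact sanitization are assumptions inferred from the example names, so the package may treat punctuation or long titles differently.

```python
import re


def discussion_filename(index: int, title: str) -> str:
    """Build a name like '01_Discussion_Title.md' (hypothetical naming rule)."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    safe_title = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return f"{index:02d}_{safe_title}.md"


print(discussion_filename(1, "Discussion Title"))
print(discussion_filename(2, "Another Discussion"))
```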

Sample Output Format

# Discussion Title

**URL**: https://www.kaggle.com/competitions/neurips-2025/discussion/123456
**Total Comments**: 15
**Extracted**: 2025-01-15T10:30:00

---

## Main Post

**Author**: username (@username)
**Rank**: 27th in this Competition
**Badges**: Competition Host
**Upvotes**: 36

Main discussion content goes here...

---

## Replies

### Reply 1

- **Author**: user1 (@user1)
- **Rank**: 154th in this Competition
- **Upvotes**: 11
- **Timestamp**: Tue Jun 17 2025 11:54:57 GMT+0300

Content of reply 1...

  #### Reply 1.1

  - **Author**: user2 (@user2)
  - **Upvotes**: 6
  - **Timestamp**: Sun Jun 29 2025 04:20:43 GMT+0300

  Nested reply content...

  #### Reply 1.2

  - **Author**: user3 (@user3)
  - **Upvotes**: 2
  - **Timestamp**: Wed Jul 16 2025 12:50:34 GMT+0300

  Another nested reply...

---

### Reply 2

- **Author**: user4 (@user4)
- **Upvotes**: -3

Content of reply 2...

---
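
Because the output is plain Markdown with a regular layout, it can be post-processed with a few regular expressions. The sketch below reads reply numbers and vote counts from text in the sample format above. It assumes the headings and bullet labels look exactly as in the sample; `parse_replies` is a hypothetical helper, not something the package provides.

```python
import re
from typing import Dict, List

# Matches "### Reply 1" and indented "#### Reply 1.1" headings.
REPLY_HEADING = re.compile(r"^\s*#{3,}\s+Reply\s+([\d.]+)\s*$")
# Matches "- **Upvotes**: 11" bullets, including negative counts.
UPVOTES = re.compile(r"^\s*-\s+\*\*Upvotes\*\*:\s*(-?\d+)\s*$")


def parse_replies(markdown: str) -> List[Dict[str, object]]:
    """Collect reply number, nesting depth, and vote count from extracted text."""
    replies: List[Dict[str, object]] = []
    current = None
    for line in markdown.splitlines():
        heading = REPLY_HEADING.match(line)
        if heading:
            number = heading.group(1)
            current = {
                "number": number,
                "depth": number.count(".") + 1,
                "upvotes": None,
            }
            replies.append(current)
            continue
        votes = UPVOTES.match(line)
        if votes and current is not None:
            current["upvotes"] = int(votes.group(1))
    return replies


sample = """\
## Replies

### Reply 1

- **Author**: user1 (@user1)
- **Upvotes**: 11

  #### Reply 1.1

  - **Author**: user2 (@user2)
  - **Upvotes**: 6

### Reply 2

- **Author**: user4 (@user4)
- **Upvotes**: -3
"""
for reply in parse_replies(sample):
    print(reply)
```

The main post's vote line is ignored here because it is not a bullet and appears before any reply heading.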

⚙️ Configuration

Development Mode

Enable development mode to see detailed logs and debugging information:

extractor = KaggleDiscussionExtractor(dev_mode=True)

What dev_mode does:

  • Enables DEBUG level logging
  • Shows detailed progress information
  • Displays browser automation steps
  • Provides error stack traces
  • Logs DOM element detection details
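
As a rough picture of how a `dev_mode` flag typically switches log verbosity, consider the sketch below. It uses only the standard `logging` module; the `configure_logging` function and the logger name are assumptions for illustration and do not describe how the package configures logging internally.

```python
import logging


def configure_logging(dev_mode: bool = False) -> logging.Logger:
    """Return a logger at DEBUG level in dev mode, INFO otherwise (sketch)."""
    logger = logging.getLogger("kaggle_extractor_sketch")
    logger.setLevel(logging.DEBUG if dev_mode else logging.INFO)
    if not logger.handlers:
        # Attach a handler only once so repeated calls don't duplicate output.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = configure_logging(dev_mode=True)
log.debug("Visible only when dev_mode is enabled")
log.info("Visible in both modes")
```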

Browser Mode

Run with a visible browser window for debugging:

extractor = KaggleDiscussionExtractor(headless=False)

🧪 Examples

Basic Example

import asyncio

from kaggle_discussion_extractor import KaggleDiscussionExtractor

async def main():
    extractor = KaggleDiscussionExtractor()
    
    await extractor.extract_competition_discussions(
        "https://www.kaggle.com/competitions/neurips-2025"
    )

asyncio.run(main())

Advanced Example with Logging

import asyncio
import logging
from kaggle_discussion_extractor import KaggleDiscussionExtractor

# Setup custom logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

async def extract_with_monitoring():
    extractor = KaggleDiscussionExtractor(
        dev_mode=True,  # Enable detailed logging
        headless=True   # Run in background
    )
    
    logger.info("Starting extraction...")
    
    success = await extractor.extract_competition_discussions(
        competition_url="https://www.kaggle.com/competitions/neurips-2025",
        limit=20  # Extract first 20 discussions
    )
    
    if success:
        logger.info("✅ Extraction completed successfully!")
        logger.info("Check 'kaggle_discussions_extracted' directory for results")
    else:
        logger.error("❌ Extraction failed!")

if __name__ == "__main__":
    asyncio.run(extract_with_monitoring())

🔧 Development

Setup Development Environment

# Clone repository
git clone https://github.com/yourusername/kaggle-discussion-extractor.git
cd kaggle-discussion-extractor

# Install development dependencies
pip install -e ".[dev]"
playwright install chromium

# Run tests
pytest tests/

Project Structure

kaggle_discussion_extractor/
├── __init__.py          # Package initialization
├── core.py             # Main extraction logic
└── cli.py              # Command-line interface

🤝 Contributing

Contributions are welcome! Please read our Contributing Guidelines for details on how to submit pull requests, report issues, and contribute to the project.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with Playwright for reliable browser automation
  • Inspired by the need for better Kaggle competition analysis tools
  • Thanks to the open-source community for continuous support

📊 Features Comparison

| Feature | This Tool | Other Tools |
|---|---|---|
| Hierarchical Replies | ✅ Perfect (1, 1.1, 1.2) | ❌ Flat structure |
| No Content Duplication | ✅ Smart separation | ❌ Duplicated content |
| Pagination Support | ✅ All pages | ❌ Single page only |
| Author Rankings | ✅ Full metadata | ❌ Basic info only |
| Rate Limiting | ✅ Respectful delays | ❌ Aggressive scraping |
| Error Recovery | ✅ Robust handling | ❌ Fails on errors |

Made with ❤️ for the Kaggle community
