Idealista Scraper

Production-ready web scraper for Idealista real estate listings with async support, resumable sessions, and MongoDB-compatible output.

Features

  • Async scraping with configurable concurrency
  • Multi-key Scrapfly rotation for high-volume scraping
  • Resumable sessions with progress checkpoints (see the sketch after this list)
  • S3 image upload with automatic retries
  • MongoDB-compatible JSONL output
  • Rich CLI with progress indicators
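
The resumable-session feature hinges on progress state being persisted as scraping proceeds. As a conceptual illustration only (the file name and schema below are assumptions, not the package's actual progress format):

# Conceptual sketch of checkpoint-based resumption; illustrative only.
import json
from pathlib import Path

CHECKPOINT = Path("output/.progress.json")  # hypothetical location

def load_checkpoint() -> dict:
    """Return saved progress, or a fresh state if no checkpoint exists."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"last_page": 0}

def save_checkpoint(state: dict) -> None:
    """Persist progress after each page so an interrupted run loses little work."""
    CHECKPOINT.write_text(json.dumps(state))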

Installation

# Clone the repository
git clone https://github.com/antonyngigge/idealistaScraper.git
cd idealistaScraper

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode
pip install -e .

# Verify installation
idealista-scraper --help

Configuration

Create a .env.local file in the project root:

# Scrapfly API keys (supports multiple for rotation)
SCRAPFLY_KEY_1=your_key_here
SCRAPFLY_KEY_2=optional_second_key
# ... up to SCRAPFLY_KEY_15

# Or single key
SCRAPFLY_KEY=your_key_here

# AWS S3 (optional, for image upload)
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=eu-north-1
S3_BUCKET_NAME=your-bucket-name

# MongoDB (optional, for direct import)
MONGODB_URI=mongodb://localhost:27017
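
For reference, the numbered keys map naturally onto a round-robin rotation. A minimal sketch using only the standard library, assuming the variables are already present in the process environment (the package itself presumably loads .env.local):

import itertools
import os

def load_scrapfly_keys() -> list[str]:
    # Prefer numbered keys SCRAPFLY_KEY_1 .. SCRAPFLY_KEY_15, fall back to SCRAPFLY_KEY.
    keys = [os.environ[f"SCRAPFLY_KEY_{i}"]
            for i in range(1, 16) if f"SCRAPFLY_KEY_{i}" in os.environ]
    if not keys and "SCRAPFLY_KEY" in os.environ:
        keys = [os.environ["SCRAPFLY_KEY"]]
    return keys

# Each request takes the next key in the cycle, spreading load across accounts.
key_cycle = itertools.cycle(load_scrapfly_keys())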

Quick Start

# Scrape rental listings from Madrid (10 pages)
idealista-scraper scrape listings --location madrid --type rental --pages 10

# Scrape from Portugal
idealista-scraper scrape listings --country portugal --location lisboa --type sale

# Check scraping progress
idealista-scraper status

# Clean data for MongoDB import
idealista-scraper clean-mongo output/rental_properties.jsonl --output cleaned.jsonl

# Upload images to S3
idealista-scraper upload-images --bucket my-bucket

Multi-Country Support

The scraper supports multiple Idealista country domains:

Country  | Domain
---------|---------------------------
Spain    | https://www.idealista.com
Portugal | https://www.idealista.pt
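
In code, this table plus the IDEALISTA_DEFAULT_COUNTRY fallback (described below) amounts to a small lookup. An illustrative sketch, not the scraper's internal resolver:

import os

# Domains from the table above; the fallback env var is documented below.
DOMAINS = {
    "spain": "https://www.idealista.com",
    "portugal": "https://www.idealista.pt",
}

def base_url(country: str | None = None) -> str:
    country = country or os.environ.get("IDEALISTA_DEFAULT_COUNTRY", "spain")
    return DOMAINS[country.lower()]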

Discovering Available Regions

# List supported countries
idealista-scraper regions --list-countries

# List regions in Spain (uses API)
idealista-scraper regions --country spain

# List regions in Portugal for sale properties
idealista-scraper regions --country portugal --type sale

# List common regions without API call
idealista-scraper regions --country spain --common

Scraping Different Countries

# Spain (default)
idealista-scraper scrape listings --location madrid --type rental

# Portugal
idealista-scraper scrape listings --country portugal --location lisboa --type sale

# Set default country via environment variable
export IDEALISTA_DEFAULT_COUNTRY=portugal
idealista-scraper scrape listings --location porto --type rental

CLI Reference

Scraping Commands

# Scrape listings
idealista-scraper scrape listings --location <city> --type <rental|sale> --pages <n>
idealista-scraper scrape listings --location madrid --type rental --pages 10

# Scrape property details from URL list
idealista-scraper scrape properties --input urls.txt --output properties.jsonl

# Scrape agent data
idealista-scraper scrape agents --limit 100

Data Processing

# Transform HTML to JSONL
idealista-scraper transform raw.html --output mongo.jsonl --agents agents.jsonl

# Clean for MongoDB (UUID to ObjectID, BSON fixes)
idealista-scraper clean-mongo properties.jsonl --output cleaned.jsonl

Pipeline Automation

# Interactive mode
idealista-scraper pipeline

# Full pipeline: Scrape -> Transform -> Clean -> Upload
idealista-scraper pipeline --preset full --location barcelona --pages 20

# Quick pipeline: Scrape -> Transform only
idealista-scraper pipeline --preset quick

# Export pipeline: Clean -> Upload (existing data)
idealista-scraper pipeline --preset export --bucket my-bucket

Utilities

# Check progress and statistics
idealista-scraper status

# Resume interrupted session
idealista-scraper resume

# Estimate credit usage
idealista-scraper estimate --pages 100 --asp    # With ASP (25x credits)
idealista-scraper estimate --pages 100 --no-asp # Without ASP (1x credits)
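
The multipliers make the estimate simple arithmetic: 100 pages cost about 100 × 25 = 2,500 credits with ASP versus 100 × 1 = 100 without. A one-line sketch of the calculation (exact accounting may vary by Scrapfly plan):

# Back-of-envelope estimate based on the 25x / 1x multipliers above.
def estimate_credits(pages: int, asp: bool) -> int:
    return pages * (25 if asp else 1)

print(estimate_credits(100, asp=True))   # 2500
print(estimate_credits(100, asp=False))  # 100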

# Test if ASP is required
idealista-scraper test-asp

# Show configuration info
idealista-scraper info

# Clean output files
idealista-scraper clean --cache     # Cache only
idealista-scraper clean --progress  # Progress files only
idealista-scraper clean --all       # Everything

# Upload images to S3
idealista-scraper upload-images --input image_urls.jsonl --bucket my-bucket
idealista-scraper upload-images --bucket my-bucket --resume  # Resume upload

Python Library Usage

import asyncio

from idealista_scraper import (
    PropertyDetailsScraper,
    AgentDetailsScraper,
    HTMLParser,
    S3ImageUploader,
    MongoCleaner,
    clean_html_content,
)

# Parse HTML content (html_content is your raw page HTML)
parser = HTMLParser()
data = parser.parse(html_content)

# Clean data for MongoDB
cleaner = MongoCleaner()
cleaned = cleaner.clean_record(data)

# Upload images: upload_image is a coroutine, so run it inside an event loop
async def upload_one(url: str, s3_key: str) -> None:
    uploader = S3ImageUploader(bucket_name="my-bucket")
    await uploader.upload_image(url, s3_key)

asyncio.run(upload_one(url, s3_key))

Output Files

All output is saved to the output/ directory:

File                     | Description
-------------------------|------------------------------------
rental_properties.jsonl  | Scraped property data
raw_listings.jsonl       | Raw listing data
image_urls.jsonl         | Property image URLs for S3 upload
agent_properties.jsonl   | Agent data
properties_cleaned.jsonl | MongoDB-ready cleaned data
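
Because the cleaned file is MongoDB-ready JSONL, it can be loaded directly. A minimal import sketch using pymongo, where the database and collection names are assumptions:

import json
import os

from pymongo import MongoClient

client = MongoClient(os.environ.get("MONGODB_URI", "mongodb://localhost:27017"))
collection = client["idealista"]["properties"]  # hypothetical db/collection names

# Each JSONL line is one document; skip blank lines.
with open("output/properties_cleaned.jsonl", encoding="utf-8") as fh:
    docs = [json.loads(line) for line in fh if line.strip()]

if docs:
    collection.insert_many(docs)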

Project Structure

idealista_scraper/
├── cli/              # Typer CLI interface
├── scraping/         # Web scraping modules
├── parsing/          # HTML parsing
├── transform/        # Data transformation and MongoDB cleaning
├── client/           # Scrapfly client management
├── cache/            # Caching layers
├── session/          # Session management
├── output/           # File output writers
├── upload/           # S3 upload functionality
└── utils/            # Utilities (paths, config)

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/unit/

# Run with integration tests (requires API keys)
pytest tests/ --integration

# Lint code
ruff check idealista_scraper/

# Type check
mypy idealista_scraper/

Alternative Entry Points

The package can be run in multiple ways:

# Console script (installed via pip)
idealista-scraper --help

# Short alias
idealista --help

# Module execution
python -m idealista_scraper --help
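
Module execution works because the package ships a __main__ module. A minimal sketch of that standard pattern; the import path of the Typer app is an assumption, not confirmed by this README:

# idealista_scraper/__main__.py (sketch; actual import path may differ)
from idealista_scraper.cli import app  # hypothetical location of the Typer app

if __name__ == "__main__":
    app()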

License

MIT License - see LICENSE file for details.

Author

Antony Ngigge
