# Idealista Scraper

Production-ready web scraper for Idealista real estate listings, with async support, resumable sessions, and MongoDB-compatible output.
## Features
- Async scraping with configurable concurrency
- Multi-key Scrapfly rotation for high-volume scraping
- Resumable sessions with progress checkpoints
- S3 image upload with automatic retries
- MongoDB-compatible JSONL output
- Rich CLI with progress indicators
## Installation

```bash
# Clone the repository
git clone https://github.com/antonyngigge/idealistaScraper.git
cd idealistaScraper

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode
pip install -e .

# Verify installation
idealista-scraper --help
```
## Configuration

Create a `.env.local` file in the project root:

```bash
# Scrapfly API keys (supports multiple for rotation)
SCRAPFLY_KEY_1=your_key_here
SCRAPFLY_KEY_2=optional_second_key
# ... up to SCRAPFLY_KEY_15

# Or single key
SCRAPFLY_KEY=your_key_here

# AWS S3 (optional, for image upload)
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=eu-north-1
S3_BUCKET_NAME=your-bucket-name

# MongoDB (optional, for direct import)
MONGODB_URI=mongodb://localhost:27017
```
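How the package consumes the numbered keys is internal to it; as a rough sketch of the idea, assuming plain `os.environ` access and simple round-robin rotation (both assumptions, not the package's actual implementation):

```python
import os
from itertools import cycle

def load_scrapfly_keys() -> list[str]:
    """Collect SCRAPFLY_KEY_1..SCRAPFLY_KEY_15, falling back to SCRAPFLY_KEY."""
    keys = [
        os.environ[f"SCRAPFLY_KEY_{i}"]
        for i in range(1, 16)
        if f"SCRAPFLY_KEY_{i}" in os.environ
    ]
    if not keys and "SCRAPFLY_KEY" in os.environ:
        keys = [os.environ["SCRAPFLY_KEY"]]
    return keys

# Example: two keys configured, rotated round-robin
os.environ["SCRAPFLY_KEY_1"] = "key-a"
os.environ["SCRAPFLY_KEY_2"] = "key-b"
rotation = cycle(load_scrapfly_keys())
print(next(rotation), next(rotation), next(rotation))  # key-a key-b key-a
```

Rotating requests across several keys spreads load and raises the effective rate limit for high-volume runs.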
## Quick Start

```bash
# Scrape rental listings from Madrid (10 pages)
idealista-scraper scrape listings --location madrid --type rental --pages 10

# Scrape from Portugal
idealista-scraper scrape listings --country portugal --location lisboa --type sale

# Check scraping progress
idealista-scraper status

# Clean data for MongoDB import
idealista-scraper clean-mongo output/rental_properties.jsonl --output cleaned.jsonl

# Upload images to S3
idealista-scraper upload-images --bucket my-bucket
```
## Multi-Country Support
The scraper supports multiple Idealista country domains:
| Country | Domain |
|---|---|
| Spain | https://www.idealista.com |
| Portugal | https://www.idealista.pt |
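The table above maps directly to a lookup; how the scraper resolves it internally is not shown here, but a minimal sketch of the mapping (`base_url` is a hypothetical helper name):

```python
COUNTRY_DOMAINS = {
    "spain": "https://www.idealista.com",
    "portugal": "https://www.idealista.pt",
}

def base_url(country: str) -> str:
    """Resolve a country name (case-insensitive) to its Idealista domain."""
    try:
        return COUNTRY_DOMAINS[country.lower()]
    except KeyError:
        raise ValueError(f"Unsupported country: {country!r}") from None

print(base_url("Portugal"))  # https://www.idealista.pt
```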
### Discovering Available Regions

```bash
# List supported countries
idealista-scraper regions --list-countries

# List regions in Spain (uses API)
idealista-scraper regions --country spain

# List regions in Portugal for sale properties
idealista-scraper regions --country portugal --type sale

# List common regions without API call
idealista-scraper regions --country spain --common
```
### Scraping Different Countries

```bash
# Spain (default)
idealista-scraper scrape listings --location madrid --type rental

# Portugal
idealista-scraper scrape listings --country portugal --location lisboa --type sale

# Set default country via environment variable
export IDEALISTA_DEFAULT_COUNTRY=portugal
idealista-scraper scrape listings --location porto --type rental
```
## CLI Reference

### Scraping Commands

```bash
# Scrape listings
idealista-scraper scrape listings --location <city> --type <rental|sale> --pages <n>
idealista-scraper scrape listings --location madrid --type rental --pages 10

# Scrape property details from URL list
idealista-scraper scrape properties --input urls.txt --output properties.jsonl

# Scrape agent data
idealista-scraper scrape agents --limit 100
```
### Data Processing

```bash
# Transform HTML to JSONL
idealista-scraper transform raw.html --output mongo.jsonl --agents agents.jsonl

# Clean for MongoDB (UUID to ObjectID, BSON fixes)
idealista-scraper clean-mongo properties.jsonl --output cleaned.jsonl
```
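The exact cleaning the package performs is internal; as a hedged sketch of the UUID-to-ObjectId idea only (hypothetical helper names, and a deterministic hash-based mapping that is an assumption, not the package's method), a MongoDB ObjectId is 12 bytes, i.e. 24 hex characters:

```python
import hashlib
import uuid

def uuid_to_objectid_hex(value: str) -> str:
    """Map a UUID string to a 24-char hex string shaped like a MongoDB ObjectId."""
    return hashlib.md5(uuid.UUID(value).bytes).hexdigest()[:24]

def clean_record(record: dict) -> dict:
    """Rewrite a UUID `_id` into Extended-JSON ObjectId form; other fields pass through."""
    out = dict(record)
    if "_id" in out:
        out["_id"] = {"$oid": uuid_to_objectid_hex(out["_id"])}
    return out

rec = {"_id": "12345678-1234-5678-1234-567812345678", "price": 1200}
cleaned = clean_record(rec)
```

The `{"$oid": ...}` form is MongoDB Extended JSON, which `mongoimport` recognizes as an ObjectId.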
### Pipeline Automation

```bash
# Interactive mode
idealista-scraper pipeline

# Full pipeline: Scrape -> Transform -> Clean -> Upload
idealista-scraper pipeline --preset full --location barcelona --pages 20

# Quick pipeline: Scrape -> Transform only
idealista-scraper pipeline --preset quick

# Export pipeline: Clean -> Upload (existing data)
idealista-scraper pipeline --preset export --bucket my-bucket
```
### Utilities

```bash
# Check progress and statistics
idealista-scraper status

# Resume interrupted session
idealista-scraper resume

# Estimate credit usage
idealista-scraper estimate --pages 100 --asp     # With ASP (25x credits)
idealista-scraper estimate --pages 100 --no-asp  # Without ASP (1x credits)

# Test if ASP is required
idealista-scraper test-asp

# Show configuration info
idealista-scraper info

# Clean output files
idealista-scraper clean --cache     # Cache only
idealista-scraper clean --progress  # Progress files only
idealista-scraper clean --all       # Everything

# Upload images to S3
idealista-scraper upload-images --input image_urls.jsonl --bucket my-bucket
idealista-scraper upload-images --bucket my-bucket --resume  # Resume upload
```
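Per the `estimate` command above, ASP (anti-scraping protection bypass) multiplies credit cost by 25x. The arithmetic reduces to a simple product (the one-request-per-page default here is a simplifying assumption):

```python
def estimate_credits(pages: int, asp: bool, requests_per_page: int = 1) -> int:
    """Rough credit estimate: ASP requests cost 25x base credits (per the CLI note above)."""
    multiplier = 25 if asp else 1
    return pages * requests_per_page * multiplier

print(estimate_credits(100, asp=True))   # 2500
print(estimate_credits(100, asp=False))  # 100
```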
## Python Library Usage

```python
from idealista_scraper import (
    PropertyDetailsScraper,
    AgentDetailsScraper,
    HTMLParser,
    S3ImageUploader,
    MongoCleaner,
    clean_html_content,
)

# Parse HTML content
parser = HTMLParser()
data = parser.parse(html_content)

# Clean data for MongoDB
cleaner = MongoCleaner()
cleaned = cleaner.clean_record(data)

# Upload images (a coroutine, so it must be awaited inside an async context)
uploader = S3ImageUploader(bucket_name="my-bucket")
await uploader.upload_image(url, s3_key)
```
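Since `upload_image` is a coroutine, it needs an event loop to run. A self-contained driver sketch, with a stand-in coroutine replacing the real `S3ImageUploader` (the real method performs the actual S3 PUT):

```python
import asyncio

async def upload_image(url: str, s3_key: str) -> str:
    # Stand-in for S3ImageUploader.upload_image
    await asyncio.sleep(0)
    return s3_key

async def upload_all(pairs: list[tuple[str, str]], limit: int = 4) -> list[str]:
    """Upload concurrently, with a semaphore bounding in-flight requests."""
    sem = asyncio.Semaphore(limit)

    async def bounded(url: str, key: str) -> str:
        async with sem:
            return await upload_image(url, key)

    return await asyncio.gather(*(bounded(u, k) for u, k in pairs))

pairs = [("https://example.com/a.jpg", "images/a.jpg"),
         ("https://example.com/b.jpg", "images/b.jpg")]
keys = asyncio.run(upload_all(pairs))
print(keys)  # ['images/a.jpg', 'images/b.jpg']
```

Bounding concurrency with a semaphore is the usual way to avoid overwhelming S3 (or your own bandwidth) when fanning out many uploads at once.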
## Output Files

All output is saved to the `output/` directory:

| File | Description |
|---|---|
| `rental_properties.jsonl` | Scraped property data |
| `raw_listings.jsonl` | Raw listing data |
| `image_urls.jsonl` | Property image URLs for S3 upload |
| `agent_properties.jsonl` | Agent data |
| `properties_cleaned.jsonl` | MongoDB-ready cleaned data |
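Each file is JSON Lines: one JSON document per line. Reading one back needs nothing beyond the standard library (the sample records below are made up for illustration):

```python
import io
import json

# Stand-in for open("output/rental_properties.jsonl")
sample = io.StringIO(
    '{"title": "Piso en Malasana", "price": 1500}\n'
    '{"title": "Atico en Chueca", "price": 2100}\n'
)
records = [json.loads(line) for line in sample if line.strip()]
print(len(records), records[0]["price"])  # 2 1500
```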
## Project Structure

```text
idealista_scraper/
├── cli/        # Typer CLI interface
├── scraping/   # Web scraping modules
├── parsing/    # HTML parsing
├── transform/  # Data transformation and MongoDB cleaning
├── client/     # Scrapfly client management
├── cache/      # Caching layers
├── session/    # Session management
├── output/     # File output writers
├── upload/     # S3 upload functionality
└── utils/      # Utilities (paths, config)
```
## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/unit/

# Run with integration tests (requires API keys)
pytest tests/ --integration

# Lint code
ruff check idealista_scraper/

# Type check
mypy idealista_scraper/
```
## Alternative Entry Points

The package can be run in multiple ways:

```bash
# Console script (installed via pip)
idealista-scraper --help

# Short alias
idealista --help

# Module execution
python -m idealista_scraper --help
```
## License

MIT License - see LICENSE file for details.

## Author

Antony Ngigge