Idealista Scraper

Production-ready web scraper for Idealista real estate listings with async support, resumable sessions, and MongoDB-compatible output.

Features

  • Async scraping with configurable concurrency
  • Multi-key Scrapfly rotation for high-volume scraping
  • Resumable sessions with progress checkpoints (see the sketch after this list)
  • S3 image upload with automatic retries
  • MongoDB-compatible JSONL output
  • Rich CLI with progress indicators
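
The resumable-session feature hinges on progress state being persisted as scraping proceeds. As a conceptual illustration only (the file name and schema below are assumptions, not the package's actual progress format):

# Conceptual sketch of checkpoint-based resumption; illustrative only.
import json
from pathlib import Path

CHECKPOINT = Path("output/.progress.json")  # hypothetical location

def load_checkpoint() -> dict:
    """Return saved progress, or a fresh state if no checkpoint exists."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"last_page": 0}

def save_checkpoint(state: dict) -> None:
    """Persist progress after each page so an interrupted run loses little work."""
    CHECKPOINT.write_text(json.dumps(state))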

Installation

# Clone the repository
git clone https://github.com/antonyngigge/idealistaScraper.git
cd idealistaScraper

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode
pip install -e .

# Verify installation
idealista-scraper --help

Configuration

Create a .env.local file in the project root:

# Scrapfly API keys (supports multiple for rotation)
SCRAPFLY_KEY_1=your_key_here
SCRAPFLY_KEY_2=optional_second_key
# ... up to SCRAPFLY_KEY_15

# Or single key
SCRAPFLY_KEY=your_key_here

# AWS S3 (optional, for image upload)
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=eu-north-1
S3_BUCKET_NAME=your-bucket-name

# MongoDB (optional, for direct import)
MONGODB_URI=mongodb://localhost:27017
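
For reference, the numbered keys map naturally onto a round-robin rotation. A minimal sketch using only the standard library, assuming the variables are already present in the process environment (the package itself presumably loads .env.local):

import itertools
import os

def load_scrapfly_keys() -> list[str]:
    # Prefer numbered keys SCRAPFLY_KEY_1 .. SCRAPFLY_KEY_15, fall back to SCRAPFLY_KEY.
    keys = [os.environ[f"SCRAPFLY_KEY_{i}"]
            for i in range(1, 16) if f"SCRAPFLY_KEY_{i}" in os.environ]
    if not keys and "SCRAPFLY_KEY" in os.environ:
        keys = [os.environ["SCRAPFLY_KEY"]]
    return keys

# Each request takes the next key in the cycle, spreading load across accounts.
key_cycle = itertools.cycle(load_scrapfly_keys())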

Quick Start

# Scrape rental listings from Madrid (10 pages)
idealista-scraper scrape listings --location madrid --type rental --pages 10

# Scrape from Portugal
idealista-scraper scrape listings --country portugal --location lisboa --type sale

# Check scraping progress
idealista-scraper status

# Clean data for MongoDB import
idealista-scraper clean-mongo output/rental_properties.jsonl --output cleaned.jsonl

# Upload images to S3
idealista-scraper upload-images --bucket my-bucket

Multi-Country Support

The scraper supports multiple Idealista country domains:

Country  | Domain
---------|---------------------------
Spain    | https://www.idealista.com
Portugal | https://www.idealista.pt
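
In code, this table plus the IDEALISTA_DEFAULT_COUNTRY fallback (described below) amounts to a small lookup. An illustrative sketch, not the scraper's internal resolver:

import os

# Domains from the table above; the fallback env var is documented below.
DOMAINS = {
    "spain": "https://www.idealista.com",
    "portugal": "https://www.idealista.pt",
}

def base_url(country: str | None = None) -> str:
    country = country or os.environ.get("IDEALISTA_DEFAULT_COUNTRY", "spain")
    return DOMAINS[country.lower()]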

Discovering Available Regions

# List supported countries
idealista-scraper regions --list-countries

# List regions in Spain (uses API)
idealista-scraper regions --country spain

# List regions in Portugal for sale properties
idealista-scraper regions --country portugal --type sale

# List common regions without API call
idealista-scraper regions --country spain --common

Scraping Different Countries

# Spain (default)
idealista-scraper scrape listings --location madrid --type rental

# Portugal
idealista-scraper scrape listings --country portugal --location lisboa --type sale

# Set default country via environment variable
export IDEALISTA_DEFAULT_COUNTRY=portugal
idealista-scraper scrape listings --location porto --type rental

CLI Reference

Scraping Commands

# Scrape listings
idealista-scraper scrape listings --location <city> --type <rental|sale> --pages <n>
idealista-scraper scrape listings --location madrid --type rental --pages 10

# Scrape property details from URL list
idealista-scraper scrape properties --input urls.txt --output properties.jsonl

# Scrape agent data
idealista-scraper scrape agents --limit 100

Data Processing

# Transform HTML to JSONL
idealista-scraper transform raw.html --output mongo.jsonl --agents agents.jsonl

# Clean for MongoDB (UUID to ObjectID, BSON fixes)
idealista-scraper clean-mongo properties.jsonl --output cleaned.jsonl

Pipeline Automation

# Interactive mode
idealista-scraper pipeline

# Full pipeline: Scrape -> Transform -> Clean -> Upload
idealista-scraper pipeline --preset full --location barcelona --pages 20

# Quick pipeline: Scrape -> Transform only
idealista-scraper pipeline --preset quick

# Export pipeline: Clean -> Upload (existing data)
idealista-scraper pipeline --preset export --bucket my-bucket

Utilities

# Check progress and statistics
idealista-scraper status

# Resume interrupted session
idealista-scraper resume

# Estimate credit usage
idealista-scraper estimate --pages 100 --asp    # With ASP (25x credits)
idealista-scraper estimate --pages 100 --no-asp # Without ASP (1x credits)
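
The multipliers make the estimate simple arithmetic: 100 pages cost about 100 × 25 = 2,500 credits with ASP versus 100 × 1 = 100 without. A one-line sketch of the calculation (exact accounting may vary by Scrapfly plan):

# Back-of-envelope estimate based on the 25x / 1x multipliers above.
def estimate_credits(pages: int, asp: bool) -> int:
    return pages * (25 if asp else 1)

print(estimate_credits(100, asp=True))   # 2500
print(estimate_credits(100, asp=False))  # 100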

# Test if ASP is required
idealista-scraper test-asp

# Show configuration info
idealista-scraper info

# Clean output files
idealista-scraper clean --cache     # Cache only
idealista-scraper clean --progress  # Progress files only
idealista-scraper clean --all       # Everything

# Upload images to S3
idealista-scraper upload-images --input image_urls.jsonl --bucket my-bucket
idealista-scraper upload-images --bucket my-bucket --resume  # Resume upload

Python Library Usage

import asyncio

from idealista_scraper import (
    PropertyDetailsScraper,
    AgentDetailsScraper,
    HTMLParser,
    S3ImageUploader,
    MongoCleaner,
    clean_html_content,
)

# Parse HTML content (html_content is your raw page HTML)
parser = HTMLParser()
data = parser.parse(html_content)

# Clean data for MongoDB
cleaner = MongoCleaner()
cleaned = cleaner.clean_record(data)

# Upload images: upload_image is a coroutine, so run it inside an event loop
async def upload_one(url: str, s3_key: str) -> None:
    uploader = S3ImageUploader(bucket_name="my-bucket")
    await uploader.upload_image(url, s3_key)

asyncio.run(upload_one(url, s3_key))

Output Files

All output is saved to the output/ directory:

File                     | Description
-------------------------|------------------------------------
rental_properties.jsonl  | Scraped property data
raw_listings.jsonl       | Raw listing data
image_urls.jsonl         | Property image URLs for S3 upload
agent_properties.jsonl   | Agent data
properties_cleaned.jsonl | MongoDB-ready cleaned data
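
Because the cleaned file is MongoDB-ready JSONL, it can be loaded directly. A minimal import sketch using pymongo, where the database and collection names are assumptions:

import json
import os

from pymongo import MongoClient

client = MongoClient(os.environ.get("MONGODB_URI", "mongodb://localhost:27017"))
collection = client["idealista"]["properties"]  # hypothetical db/collection names

# Each JSONL line is one document; skip blank lines.
with open("output/properties_cleaned.jsonl", encoding="utf-8") as fh:
    docs = [json.loads(line) for line in fh if line.strip()]

if docs:
    collection.insert_many(docs)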

Project Structure

idealista_scraper/
├── cli/              # Typer CLI interface
├── scraping/         # Web scraping modules
├── parsing/          # HTML parsing
├── transform/        # Data transformation and MongoDB cleaning
├── client/           # Scrapfly client management
├── cache/            # Caching layers
├── session/          # Session management
├── output/           # File output writers
├── upload/           # S3 upload functionality
└── utils/            # Utilities (paths, config)

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/unit/

# Run with integration tests (requires API keys)
pytest tests/ --integration

# Lint code
ruff check idealista_scraper/

# Type check
mypy idealista_scraper/

Alternative Entry Points

The package can be run in multiple ways:

# Console script (installed via pip)
idealista-scraper --help

# Short alias
idealista --help

# Module execution
python -m idealista_scraper --help
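
Module execution works because the package ships a __main__ module. A minimal sketch of that standard pattern; the import path of the Typer app is an assumption, not confirmed by this README:

# idealista_scraper/__main__.py (sketch; actual import path may differ)
from idealista_scraper.cli import app  # hypothetical location of the Typer app

if __name__ == "__main__":
    app()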

License

MIT License - see LICENSE file for details.

Author

Antony Ngigge
