AI-Cursor-Scraping-Assistant

A tool that leverages Cursor AI and MCP (Model Context Protocol) to generate web scrapers for various types of websites. It helps you quickly analyze a site and produce a working Scrapy or Camoufox scraper with minimal effort.

Project Overview

This project contains two main components:

  1. Cursor Rules - A set of rules that teach Cursor AI how to analyze websites and create different types of Scrapy spiders
  2. MCP Tools - A collection of Model Context Protocol tools that enhance Cursor's capabilities for web scraping tasks

Prerequisites

  • Cursor AI installed
  • Python 3.10+ installed
  • Basic knowledge of web scraping concepts

Installation

Clone this repository to your local machine:

git clone https://github.com/TheWebScrapingClub/AI-Cursor-Scraping-Assistant.git
cd AI-Cursor-Scraping-Assistant

Install the required dependencies:

pip install mcp camoufox scrapy

If you plan to use Camoufox, you'll need to fetch its browser binary:

python -m camoufox fetch

Setup

Setting Up MCP Server

The MCP server provides tools that help Cursor AI analyze web pages and generate XPath selectors. To start the MCP server:

  1. Navigate to the MCPfiles directory:

    cd MCPfiles
    
  2. Update the CAMOUFOX_FILE_PATH in xpath_server.py to point to your local Camoufox_template.py file.

  3. Start the MCP server:

    python xpath_server.py
    
  4. In Cursor, connect to the MCP server by configuring it in the settings or using the MCP panel.
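To give a concrete picture of what the server provides, here is a minimal sketch of an MCP server in the style of xpath_server.py, built with the FastMCP helper from the mcp package. The strip_css tool name and its body are illustrative assumptions, not the actual implementation:

import re

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("scraping-assistant")  # hypothetical server name

@mcp.tool()
def strip_css(html: str) -> str:
    """Remove <style> blocks and inline style attributes so the model
    sees a smaller, structure-only version of the page (illustrative)."""
    html = re.sub(r"<style[^>]*>.*?</style>", "", html,
                  flags=re.DOTALL | re.IGNORECASE)
    return re.sub(r'\sstyle="[^"]*"', "", html)

if __name__ == "__main__":
    mcp.run()  # serves the tools over stdio by default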

Cursor Rules

The cursor-rules directory contains rules that teach Cursor AI how to analyze websites and create different types of scrapers. These rules are automatically loaded when you open the project in Cursor.

Detailed Cursor Rules Explanation

The cursor-rules directory contains a set of MDC (Markdown Configuration) files that guide Cursor's behavior when creating web scrapers:

prerequisites.mdc

This rule handles initial setup tasks before creating any scrapers:

  • Gets the full path of the current project using pwd
  • Stores the path in context for later use by other rules
  • Confirms the execution of preliminary actions before proceeding

website-analysis.mdc

This comprehensive rule guides Cursor through website analysis:

  • Identifies the type of Scrapy spider to build (PLP, PDP, etc.)
  • Fetches and stores homepage HTML and cookies
  • Strips CSS using the MCP tool to simplify HTML analysis
  • Checks cookies for anti-bot protection (Akamai, Datadome, PerimeterX, etc.)
  • For PLP scrapers: fetches category pages, analyzes structure, looks for JSON data
  • For PDP scrapers: fetches product pages, analyzes structure, looks for JSON data
  • Detects schema.org markup and modern frameworks like Next.js
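As a rough illustration of that cookie check, the sketch below maps well-known vendor cookie names to the systems that set them; the actual rule may inspect more signals than this:

# Illustrative cookie-based anti-bot detection; the cookie names are
# the commonly documented signatures for each vendor.
ANTIBOT_COOKIES = {
    "Akamai": ("_abck", "bm_sz", "ak_bmsc"),
    "Datadome": ("datadome",),
    "PerimeterX": ("_px2", "_px3", "_pxhd"),
}

def detect_antibot(cookies: dict) -> list[str]:
    """Return the vendors whose signature cookies appear in the jar."""
    return [
        vendor
        for vendor, names in ANTIBOT_COOKIES.items()
        if any(name in cookies for name in names)
    ]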

scrapy-step-by-step-process.mdc

This rule provides the execution flow for creating scrapers:

  • Outlines the sequence of steps to follow
  • References other rule files in the correct order
  • Ensures prerequisite actions are completed before scraper creation
  • Guides Cursor to analyze the website before generating code

scrapy.mdc

This extensive rule contains Scrapy best practices:

  • Defines recommended code organization and directory structure
  • Details file naming conventions and module organization
  • Provides component architecture guidelines
  • Offers strategies for code splitting and reuse
  • Includes performance optimization recommendations
  • Covers security practices, error handling, and logging
  • Provides specific syntax examples and code snippets
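To make those conventions concrete, here is a minimal spider skeleton of the kind the rule steers Cursor toward; the spider name, settings, and selectors are placeholders rather than the rule's exact output:

import scrapy

class ExampleSpider(scrapy.Spider):
    """Placeholder spider illustrating the rule's conventions."""
    name = "example"
    start_urls = ["https://example.com/category"]  # placeholder URL
    custom_settings = {
        "DOWNLOAD_DELAY": 1,  # polite crawling by default
        "RETRY_TIMES": 2,     # basic error resilience
        "LOG_LEVEL": "INFO",
    }

    def parse(self, response):
        for product in response.xpath("//li[@class='product']"):
            yield {
                "name": product.xpath(".//h2/text()").get(),
                "url": response.urljoin(product.xpath(".//a/@href").get()),
            }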

scraper-models.mdc

This rule defines the different types of scrapers that can be created:

  • E-commerce PLP: Details the data structure, field definitions, and implementation steps
  • E-commerce PDP: Details the data structure, field definitions, and implementation steps
  • Field mapping guidelines for all scraper types
  • Step-by-step instructions for creating each type of scraper
  • Default settings recommendations
  • Anti-bot countermeasures for different protection systems
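For instance, the PLP data structure might translate into a Scrapy Item along these lines (the field names are illustrative, not the rule's exact schema):

import scrapy

class ProductListingItem(scrapy.Item):
    # Hypothetical PLP fields; the rule defines the authoritative list.
    product_name = scrapy.Field()
    price = scrapy.Field()
    currency = scrapy.Field()
    product_url = scrapy.Field()
    image_url = scrapy.Field()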

Usage

Here's how to use the AI-Cursor-Scraping-Assistant:

  1. Open the project in Cursor AI
  2. Make sure the MCP server is running
  3. Ask Cursor to create a scraper with a prompt like:
    Write an e-commerce PLP scraper for the website gucci.com
    

Cursor will then:

  1. Analyze the website structure
  2. Check for anti-bot protection
  3. Extract the relevant HTML elements
  4. Generate a complete Scrapy spider based on the website type

Available Scraper Types

You can request different types of scrapers:

  • E-commerce PLP (Product Listing Page) - Scrapes product catalogs/category pages
  • E-commerce PDP (Product Detail Page) - Scrapes detailed product information

For example:

Write an e-commerce PDP scraper for nike.com

Advanced Usage

Camoufox Integration

The project includes a Camoufox template for creating stealth scrapers that can bypass certain anti-bot measures. The MCP tools help you:

  1. Fetch page content using Camoufox
  2. Generate XPath selectors for the desired elements
  3. Create a complete Camoufox scraper based on the template
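A minimal fetch with Camoufox's synchronous API looks roughly like the sketch below; the URL and selector are placeholders, and the bundled Camoufox_template.py remains the authoritative starting point:

from camoufox.sync_api import Camoufox

# Rough sketch only; see MCPfiles/Camoufox_template.py for the template.
with Camoufox(headless=True) as browser:
    page = browser.new_page()
    page.goto("https://example.com/product/123")  # placeholder URL
    title = page.locator("//h1").first.text_content()  # // is parsed as XPath
    print(title)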

Custom Scrapers

You can extend the functionality by adding new scraper types to the cursor-rules files. The modular design allows for easy customization.

Project Structure

AI-Cursor-Scraping-Assistant/
├── MCPfiles/
│   ├── xpath_server.py     # MCP server with web scraping tools
│   └── Camoufox_template.py # Template for Camoufox scrapers
├── cursor-rules/
│   ├── website-analysis.mdc    # Rules for analyzing websites
│   ├── scrapy.mdc              # Best practices for Scrapy
│   ├── scrapy-step-by-step-process.mdc # Guide for creating scrapers
│   ├── scraper-models.mdc      # Templates for different scraper types
│   └── prerequisites.mdc       # Setup requirements
└── README.md

TODO: Future Enhancements

The following features are planned for future development:

Proxy Integration

  • Add proxy support when requested by the operator
  • Implement proxy rotation strategies
  • Support for different proxy providers
  • Handle proxy authentication
  • Integrate with popular proxy services

Improved XPath Generation and Validation

  • Add validation mechanisms for generated XPath selectors
  • Implement feedback loop for selector refinement
  • Control flow management for reworking selectors
  • Auto-correction of problematic selectors
  • Handle edge cases like dynamic content and AJAX loading
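One way such validation might work (not yet part of the codebase) is to replay generated selectors against the stored page HTML and flag any that match nothing, e.g. with lxml:

from lxml import html

def validate_xpaths(page_html: str, xpaths: list[str]) -> dict[str, bool]:
    """Hypothetical validation pass: True if the selector matches a node."""
    tree = html.fromstring(page_html)
    return {xp: bool(tree.xpath(xp)) for xp in xpaths}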

Other Planned Features

  • Support for more scraper types (news sites, social media, etc.)
  • Integration with additional anti-bot bypass techniques
  • Enhanced JSON extraction capabilities
  • Support for more complex navigation patterns
  • Multi-page scraping optimizations

References

This project is based on articles from The Web Scraping Club; visit the site for more information on web scraping techniques and best practices.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
