
Independent URL and Product scraping pipelines for Amazon


Amazon Scraper Pipelines


Production-ready, Selenium-based pipelines for scraping Amazon search results and product details with a powerful FastAPI web interface.

🚀 Features

  • ✅ Modular Design: URL and product scraping pipelines can be run independently or together
  • ✅ FastAPI REST API: Full-featured API for programmatic access to all scraping pipelines
  • ✅ Beautiful Web UI: Browser-based interface for running scrapers without writing code
  • ✅ Configurable Scraping: Control search terms, number of URLs per term, timeouts, and headless mode
  • ✅ Timestamped Artifacts: All outputs stored under timestamped folders for easy versioning
  • ✅ YAML-based Locators: Page locators externalized into YAML for easier maintenance
  • ✅ Detailed Logging: Structured logs for each stage and overall pipeline execution
  • ✅ In-memory Data Access: Optionally return scraped data as Python dicts in addition to JSON files
  • ✅ Download API: Download scraped data via REST endpoints

📦 Installation

pip install amazon-scraper-pipelines

Requirements

  • Python 3.8+
  • Chrome/Chromium browser (for Selenium)

Dependencies:

pip install fastapi uvicorn selenium webdriver-manager pydantic jinja2 python-multipart pyyaml

🎯 Quick Start

🚀 Running the FastAPI Server

You can start the server in several ways depending on your workflow.

Option 1: Run main.py directly (when main.py calls uvicorn.run)

# main.py
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "scrapper.router.api:app",
        host="127.0.0.1",
        port=8080,
        reload=True
    )

Then start the server with:

python main.py

Option 2: Use uvicorn from command line

uvicorn scrapper.router.api:app --host 127.0.0.1 --port 8080

Option 3: Development mode with auto-reload (recommended during development)

uvicorn scrapper.router.api:app --host 127.0.0.1 --port 8080 --reload

Option 4: Run on all network interfaces

uvicorn scrapper.router.api:app --host 0.0.0.0 --port 8080

Understanding the uvicorn command:

  • scrapper.router.api:app → The app object inside scrapper/router/api.py (app = FastAPI())
  • --host 127.0.0.1 → Binds to localhost only (most secure for local development)
  • --port 8080 → Server listens on port 8080
  • --reload → Auto-reloads the server when code changes (development only, NOT for production)
  • --host 0.0.0.0 → Makes the server accessible from other machines on your network

Server will be available at:

  • Web UI: http://127.0.0.1:8080/
  • Interactive API docs: http://127.0.0.1:8080/docs (FastAPI's default Swagger UI)

Using the Web Interface

  1. Open http://127.0.0.1:8080/ in your browser
  2. Choose a scraper tab (Main Scraper / URL Scraper / Product Scraper)
  3. Configure your scraping options
  4. Click "Start Scraping"
  5. Download results when complete

Using Python Directly

from scrapper.pipeline.main_pipeline import AmazonScrapingPipeline


# Run full pipeline: Search → URLs → Products
pipeline = AmazonScrapingPipeline(
    search_terms=['laptop', 'wireless mouse'],
    target_links=5,
    headless=True,
    return_url_data=True,
    return_prod_data=True
)


# Returns in this fixed order
url_artifact, url_data, product_artifact, product_data = pipeline.run_pipeline()


print(f"✅ URLs saved to: {url_artifact.url_file_path}")
print(f"✅ Products saved to: {product_artifact.product_file_path}")
print(f"📊 Total URLs: {url_data['total_urls']}")
print(f"📊 Scraped products: {product_data['total_scraped']}")

โš™๏ธ Antiโ€‘bot & network tips

For best results when running the scrapers:

  1. Use a fast and stable internet connection
    High latency or frequent disconnects can cause timeouts, incomplete loads, and more frequent bot challenges.

  2. Set headless=False while debugging
    Run the browser in visible mode during development to see what the scraper is doing, inspect page behavior, and understand where it fails.

  3. Use a VPN or proxy if you frequently see CAPTCHAs
    Switch to a different region or IP (respecting all legal and platform terms) when Amazon starts showing CAPTCHAs too often.

  4. Extend the code to handle bot detection for your use case
    The project is open for customization: adjust delays, headers, proxies, and Selenium behavior, and add your own strategies to better handle bot detection and anti-scraping defenses.
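
As a starting point for tip 4, the delay idea can be sketched as a small helper. This is an illustration only; polite_sleep is a hypothetical name, not part of the package:

```python
import random
import time

def polite_sleep(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Sleep for a random interval so requests don't arrive at a fixed, bot-like rhythm.

    Returns the delay actually used, so callers can log it.
    """
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Usage between product-page visits:
# for url in urls:
#     scrape(url)      # your Selenium scraping call
#     polite_sleep()
```

Randomized jitter is usually preferable to a fixed sleep, since constant inter-request intervals are an easy signal for anti-bot systems.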


๐ŸŒ FastAPI Web Interface

The web interface provides three scraping modes accessible via tabs:

1. Main Scraper (Full Pipeline)

Runs both URL and Product scraping in sequence.

  • Search Terms: Enter one search term per line
  • Target Links: Number of product URLs to scrape per search term
  • Headless Mode: Run browser without visible window
  • Return URL Data: Include scraped URLs in API response
  • Return Product Data: Include scraped product details in API response

2. URL Scraper

Collects only product URLs from Amazon search results.

  • Outputs a JSON file with URLs organized by search term
  • Useful when you want to review URLs before scraping product details

3. Product Scraper

Scrapes detailed product information from a previously generated URL file.

  • Upload a urls.json file from a previous URL scrape
  • Extracts price, specifications, reviews, and more

🔌 REST API Endpoints

Health Check

GET /api

Response:

{
  "message": "Amazon Scraper Router API is running.",
  "version": "1.0.0"
}

Main Scraper (Full Pipeline)

POST /api/mainscrape
Content-Type: application/json

Request Body:

{
  "search_terms": ["laptop", "wireless mouse"],
  "target_links": 5,
  "headless": true,
  "return_url_data": true,
  "return_prod_data": true
}

Parameters:

Parameter          Type              Default    Description
search_terms       list[str]         required   List of Amazon search terms
target_links       int | list[int]   required   Number of product URLs per search term
headless           bool              true       Run browser in headless mode
return_url_data    bool              false      Include URL data in response
return_prod_data   bool              false      Include product data in response
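
The request schema above can be mirrored client-side. The server itself validates with Pydantic models that may differ in detail; this stdlib dataclass is only an illustrative sketch:

```python
from dataclasses import dataclass, asdict
from typing import List, Union

@dataclass
class MainScrapeRequest:
    # Required fields
    search_terms: List[str]
    target_links: Union[int, List[int]]
    # Optional fields, with the defaults documented above
    headless: bool = True
    return_url_data: bool = False
    return_prod_data: bool = False

payload = asdict(MainScrapeRequest(search_terms=["laptop"], target_links=5))
# `payload` is a plain dict, ready to send as the JSON body of POST /api/mainscrape
```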

Response (with both return flags true):

{
  "status": "success",
  "url_artifact": {
    "url_file_path": "Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
    "download_url": "/api/download/url-data/12_04_2025_14_58_45"
  },
  "product_artifact": {
    "product_file_path": "Artifacts/12_04_2025_14_58_45/ProductData/products.json",
    "download_url": "/api/download/product-data/12_04_2025_14_58_45"
  },
  "url_data": {
    "total_products": 2,
    "total_urls": 10,
    "products": {
      "laptop": {
        "count": 5,
        "urls": ["https://www.amazon.in/..."]
      }
    }
  },
  "product_data": {
    "total_scraped": 10,
    "total_failed": 0,
    "products": {
      "laptop": [
        {
          "Product Name": "...",
          "Product Price": "₹49,999",
          "Ratings": "4.5",
          "Technical Details": {},
          "Customer Reviews": []
        }
      ]
    }
  }
}
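
A client can walk the response above like any nested dict. For example, with an abridged stand-in for the parsed JSON:

```python
# `data` stands in for the parsed JSON response shown above (abridged)
data = {
    "status": "success",
    "url_data": {"total_urls": 10, "products": {"laptop": {"count": 5, "urls": []}}},
    "product_data": {"total_scraped": 10, "total_failed": 0},
}

if data["status"] == "success":
    total_urls = data["url_data"]["total_urls"]       # overall URL count
    failed = data["product_data"]["total_failed"]     # products that failed to scrape
    # URLs collected per search term
    per_term = {term: info["count"]
                for term, info in data["url_data"]["products"].items()}
```

Note that url_data and product_data are only present when the corresponding return flags were true in the request.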

URL Scraper

POST /api/urlscrape
Content-Type: application/json

Request Body:

{
  "search_terms": ["laptop"],
  "target_links": 10,
  "headless": true,
  "return_url_data": true
}

Response:

{
  "status": "success",
  "url_artifact": {
    "url_file_path": "Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
    "download_url": "/api/download/url-data/12_04_2025_14_58_45"
  },
  "url_data": {
    "total_products": 1,
    "total_urls": 10,
    "products": {
      "laptop": {
        "count": 10,
        "urls": [
          "https://www.amazon.in/...",
          "https://www.amazon.in/..."
        ]
      }
    }
  }
}

Product Scraper

POST /api/productscrape
Content-Type: multipart/form-data

Form Data:

Field              Type   Description
file               File   JSON file containing URLs
headless           bool   Run browser in headless mode
return_prod_data   bool   Include product data in response

Example using cURL:

curl -X POST "http://127.0.0.1:8080/api/productscrape" \
  -F "file=@urls.json" \
  -F "headless=true" \
  -F "return_prod_data=true"

Response:

{
  "status": "success",
  "url_file_path": "Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
  "url_artifact": {
    "url_file_path": "...",
    "download_url": "/api/download/file?path=..."
  },
  "product_artifact": {
    "product_file_path": "Artifacts/12_04_2025_14_58_45/ProductData/products.json",
    "download_url": "/api/download/product-data/12_04_2025_14_58_45"
  },
  "product_data": {
    "total_scraped": 10,
    "total_failed": 0,
    "products": {}
  }
}

Download Endpoints

Download URL data by timestamp:

GET /api/download/url-data/{timestamp}
# Example: GET /api/download/url-data/12_04_2025_14_58_45

Download product data by timestamp:

GET /api/download/product-data/{timestamp}
# Example: GET /api/download/product-data/12_04_2025_14_58_45

Download by file path:

GET /api/download/file?path=Artifacts/12_04_2025_14_58_45/UrlData/urls.json
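
Client code can assemble these download URLs from a timestamp or file path; a small sketch (BASE and the helper names are illustrative, not part of the package):

```python
from urllib.parse import quote

BASE = "http://127.0.0.1:8080"   # adjust to wherever your server runs

def url_data_endpoint(timestamp: str) -> str:
    """Download URL for url-data of a given run timestamp."""
    return f"{BASE}/api/download/url-data/{timestamp}"

def file_endpoint(path: str) -> str:
    """Download URL for an arbitrary artifact path; query-encodes the path."""
    return f"{BASE}/api/download/file?path={quote(path, safe='/')}"
```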

Results Endpoints

Get results by timestamp:

GET /api/results/{timestamp}

List all available results:

GET /api/results

Response:

{
  "results": [
    {
      "timestamp": "12_04_2025_14_58_45",
      "files": {
        "url_file": "Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
        "product_file": "Artifacts/12_04_2025_14_58_45/ProductData/products.json"
      },
      "download_urls": {
        "url_data": "/api/download/url-data/12_04_2025_14_58_45",
        "product_data": "/api/download/product-data/12_04_2025_14_58_45"
      }
    }
  ]
}

๐Ÿ Python API

1. URL Scraping Pipeline

Collects product URLs from Amazon search results and saves them to a JSON file.

from scrapper.pipeline.url_pipeline import AmazonUrlScrapingPipeline


pipeline = AmazonUrlScrapingPipeline(
    search_terms=['laptop pc', 'wireless mouse'],
    target_links=[5, 3],  # 5 laptops, 3 mice
    headless=True,
    return_url_data=True
)


url_artifact, url_data = pipeline.run()


print(f"URLs saved to: {url_artifact.url_file_path}")
print(f"Total URLs: {url_data['total_urls']}")

Parameters:

  • search_terms: list[str] | str - Amazon search terms
  • target_links: int | list[int] - URLs to scrape per term (default: 5)
  • headless: bool - Run browser in headless mode (default: False)
  • wait_timeout: int - Element wait timeout in seconds (default: 5)
  • page_load_timeout: int - Page load timeout in seconds (default: 15)
  • return_url_data: bool - Return URL data in memory (default: False)

Returns:

  • When return_url_data=False: (UrlDataArtifact,)
  • When return_url_data=True: (UrlDataArtifact, dict)
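
Because the tuple length depends on return_url_data, callers that don't control the flag can normalize the result. This helper is a hypothetical convenience, not part of the package:

```python
def unpack_url_result(result):
    """Normalize the pipeline's variable-length result tuple.

    Returns (artifact, data); data is None when the pipeline ran with
    return_url_data=False and only the artifact was returned.
    """
    if len(result) == 2:
        return result[0], result[1]
    return result[0], None

# unpack_url_result(pipeline.run()) always yields a 2-tuple
```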

2. Product Scraping Pipeline

Reads a URL JSON file and scrapes detailed information for each product URL.

from scrapper.pipeline.prodcut_pipeline import AmazonProductScrapingPipeline


pipeline = AmazonProductScrapingPipeline(
    url_file_path="Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
    headless=True,
    return_prod_data=True
)


product_artifact, product_data = pipeline.run()


print(f"Products saved to: {product_artifact.product_file_path}")
print(f"Success: {product_artifact.scraped_count}")
print(f"Failed: {product_artifact.failed_count}")

Parameters:

  • url_file_path: str | Path - Path to URL JSON file (required)
  • headless: bool - Run browser in headless mode (default: False)
  • wait_timeout: int - Element wait timeout in seconds (default: 10)
  • page_load_timeout: int - Page load timeout in seconds (default: 20)
  • return_prod_data: bool - Return product data in memory (default: False)

Returns:

  • When return_prod_data=False: (ProductDataArtifact,)
  • When return_prod_data=True: (ProductDataArtifact, dict)

3. End-to-End Pipeline (Main)

Runs both URL and product scraping in sequence: Search → URLs → Products

from scrapper.pipeline.main_pipeline import AmazonScrapingPipeline


pipeline = AmazonScrapingPipeline(
    search_terms=['laptop', 'wireless mouse'],
    target_links=[5, 3],
    headless=True,
    return_url_data=True,
    return_prod_data=True
)


# ALWAYS returns in this fixed order
url_artifact, url_data, product_artifact, product_data = pipeline.run_pipeline()


print(f"✅ URLs: {url_data['total_urls']}")
print(f"✅ Products: {product_data['total_scraped']}")

Parameters:

Parameter           Type              Default   Description
search_terms        list[str] | str   required  Amazon search terms
target_links        int | list[int]   5         URLs to scrape per term
headless            bool              False     Run in headless mode
wait_timeout        int               5         Wait timeout (seconds)
page_load_timeout   int               15        Page load timeout (seconds)
return_url_data     bool              False     Return URL data in memory
return_prod_data    bool              False     Return product data in memory

Return Value (Fixed Order):

url_artifact, url_data, product_artifact, product_data = pipeline.run_pipeline()
Variable           Type                  Description
url_artifact       UrlDataArtifact       Contains url_file_path
url_data           dict | None           URL data (None unless return_url_data=True)
product_artifact   ProductDataArtifact   Contains product_file_path, scraped_count, failed_count
product_data       dict | None           Product data (None unless return_prod_data=True)

๐Ÿ“ Output Structure

All artifacts are saved under timestamped directories:

Artifacts/
└── 12_04_2025_14_58_45/           # Timestamp: MM_DD_YYYY_HH_MM_SS
    ├── UrlData/
    │   └── urls.json              # Collected product URLs
    └── ProductData/
        └── products.json          # Detailed product data
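
The timestamp folder names follow MM_DD_YYYY_HH_MM_SS and can be parsed or generated with the standard library, e.g. to sort runs chronologically:

```python
from datetime import datetime

STAMP_FMT = "%m_%d_%Y_%H_%M_%S"   # e.g. 12_04_2025_14_58_45

def parse_stamp(name: str) -> datetime:
    """Turn an artifact folder name back into a datetime."""
    return datetime.strptime(name, STAMP_FMT)

def make_stamp(when: datetime) -> str:
    """Build an artifact folder name from a datetime."""
    return when.strftime(STAMP_FMT)
```

Note that because the month comes first, the folder names do not sort chronologically as plain strings; parse them first if ordering matters.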

URL JSON Format

{
  "total_products": 2,
  "total_urls": 3,
  "products": {
    "laptop": {
      "count": 1,
      "urls": ["https://www.amazon.in/..."]
    },
    "wireless mouse": {
      "count": 2,
      "urls": [
        "https://www.amazon.in/...",
        "https://www.amazon.in/..."
      ]
    }
  }
}
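
The counters in this format are redundant by design (count mirrors len(urls), and the totals summarize the per-term entries), which makes files easy to sanity-check. A small sketch, assuming those invariants hold:

```python
def check_url_json(doc: dict) -> bool:
    """Sanity-check a urls.json document against its own counters."""
    products = doc["products"]
    counts_ok = all(info["count"] == len(info["urls"]) for info in products.values())
    totals_ok = (doc["total_products"] == len(products)
                 and doc["total_urls"] == sum(i["count"] for i in products.values()))
    return counts_ok and totals_ok
```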

Product JSON Format

{
  "total_scraped": 3,
  "total_failed": 0,
  "products": {
    "laptop": [
      {
        "Product Name": "Apple MacBook Air M2",
        "Product Price": "₹99,999",
        "Ratings": "4.5",
        "Total Reviews": "1,234 ratings",
        "Category": "Computers & Accessories",
        "Product URL": "https://www.amazon.in/...",
        "Technical Details": {
          "Brand": "Apple",
          "Processor": "M2",
          "RAM": "8GB"
        },
        "Customer Reviews": [
          {
            "reviewer": "John Doe",
            "rating": "5.0",
            "title": "Excellent laptop",
            "content": "Fast and reliable..."
          }
        ]
      }
    ]
  }
}
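
Since products are grouped by search term, downstream processing (CSV export, analysis) often starts by flattening the nesting. An illustrative sketch:

```python
def flatten_products(doc: dict) -> list:
    """Flatten a products.json document into (search_term, name, price) rows."""
    rows = []
    for term, items in doc["products"].items():
        for item in items:
            rows.append((term, item.get("Product Name"), item.get("Product Price")))
    return rows
```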

🔧 Advanced Examples

Example: Different Link Counts per Search Term

pipeline = AmazonScrapingPipeline(
    search_terms=['laptop', 'wireless mouse', 'keyboard'],
    target_links=[10, 5, 3],  # 10 laptops, 5 mice, 3 keyboards
    headless=True,
    return_url_data=True,
    return_prod_data=True
)


url_artifact, url_data, product_artifact, product_data = pipeline.run_pipeline()

Example: URL Scraping Only

from scrapper.pipeline.url_pipeline import AmazonUrlScrapingPipeline


pipeline = AmazonUrlScrapingPipeline(
    search_terms=['gaming laptop'],
    target_links=20,
    headless=True,
    return_url_data=True
)


url_artifact, url_data = pipeline.run()


# Review URLs before product scraping
for term, data in url_data['products'].items():
    print(f"{term}: {data['count']} URLs")

Example: Product Scraping from Existing URLs

from scrapper.pipeline.prodcut_pipeline import AmazonProductScrapingPipeline


pipeline = AmazonProductScrapingPipeline(
    url_file_path="Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
    headless=True,
    return_prod_data=True
)


product_artifact, product_data = pipeline.run()


print(f"Success: {product_artifact.scraped_count}")
print(f"Failed: {product_artifact.failed_count}")

๐ŸŒ Using the REST API

Example: Python Requests

import requests


# Main scraper
response = requests.post(
    'http://127.0.0.1:8080/api/mainscrape',
    json={
        'search_terms': ['laptop'],
        'target_links': 5,
        'headless': True,
        'return_url_data': True,
        'return_prod_data': True
    }
)


data = response.json()
print(f"Status: {data['status']}")
print(f"URLs: {data['url_data']['total_urls']}")
print(f"Products: {data['product_data']['total_scraped']}")


# Download files
timestamp = "12_04_2025_14_58_45"
url_file = requests.get(f'http://127.0.0.1:8080/api/download/url-data/{timestamp}')
with open('urls.json', 'wb') as f:
    f.write(url_file.content)

Example: JavaScript/Node.js

// Main scraper
const response = await fetch('http://127.0.0.1:8080/api/mainscrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    search_terms: ['laptop'],
    target_links: 5,
    headless: true,
    return_url_data: true,
    return_prod_data: true
  })
});


const data = await response.json();
console.log(`URLs: ${data.url_data.total_urls}`);
console.log(`Products: ${data.product_data.total_scraped}`);

๐Ÿ› ๏ธ Project Layout

project/
├── Artifacts/
│   └── <timestamp_folder>/
│       ├── UrlData/
│       │   └── urls.json
│       └── ProductData/
│           └── products.json
├── logs/
│   ├── *.log
│   └── ...
├── static/
│   ├── css/
│   │   └── style.css
│   └── js/
│       └── app.js
├── templates/
│   ├── index.html
│   ├── base.html
│   └── about.html
└── scrapper/
    ├── config/
    │   ├── urls_locators.yaml
    │   └── product_locators.yaml
    ├── constant/
    │   └── configuration.py
    ├── entity/
    │   ├── artifact_entity.py
    │   ├── config_entity.py
    │   ├── product_locator_entity.py
    │   └── url_locator_entity.py
    ├── exception/
    │   └── custom_exception.py
    ├── logger/
    │   └── logging.py
    ├── pipeline/
    │   ├── main_pipeline.py
    │   ├── url_pipeline.py
    │   └── prodcut_pipeline.py
    ├── router/
    │   └── api.py              # FastAPI application
    ├── src/
    │   ├── multi_product_scrapper.py
    │   ├── multi_url_scrapper.py
    │   └── url_scrapper.py
    └── util/
        └── main_utils.py

📊 Logging

Logs are stored in the logs/ directory with timestamps:

logs/
├── 12_04_2025_14_58_45.log
├── 12_04_2025_15_30_12.log
└── ...

Log Levels:

  • INFO: Normal operations
  • WARNING: Potential issues
  • ERROR: Errors during scraping
  • DEBUG: Detailed debugging information
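
The actual logger lives in scrapper/logger/logging.py; as an illustration of the pattern (not the package's implementation), a per-run timestamped file logger can be set up like this:

```python
import logging
from datetime import datetime
from pathlib import Path

def make_run_logger(log_dir: str = "logs") -> logging.Logger:
    """Create a logger writing to <log_dir>/<MM_DD_YYYY_HH_MM_SS>.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%m_%d_%Y_%H_%M_%S")
    logger = logging.getLogger(f"scraper_run_{stamp}")
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(Path(log_dir) / f"{stamp}.log")
        handler.setFormatter(logging.Formatter("%(asctime)s - %(levelname)s - %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```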

🚨 Important Notes

Legal & Ethical Considerations

  • โš ๏ธ Educational purposes only - Use responsibly
  • โš ๏ธ Respect Amazon's Terms of Service and robots.txt
  • โš ๏ธ Use reasonable delays between requests
  • โš ๏ธ Do not overload Amazon's servers
  • โš ๏ธ Check local laws regarding web scraping
  • โš ๏ธ This tool should not be used for commercial scraping without proper authorization

Technical Considerations

  • Amazon's DOM structure may change; locators may need updates
  • Anti-bot mechanisms may block excessive requests
  • Headless mode is recommended for production use
  • Use proxies for large-scale scraping
  • The FastAPI server runs on port 8080 by default (configurable)
  • For production deployment, use a proper ASGI server like Gunicorn with Uvicorn workers

🔄 Typical Workflows

Option 1: Use Web UI

  1. Start the server: uvicorn scrapper.router.api:app --host 127.0.0.1 --port 8080
  2. Open http://127.0.0.1:8080/
  3. Select a scraper tab
  4. Configure options and click "Start Scraping"
  5. Download results when complete

Option 2: Use REST API

# Full pipeline
curl -X POST "http://127.0.0.1:8080/api/mainscrape" \
  -H "Content-Type: application/json" \
  -d '{
    "search_terms": ["laptop"],
    "target_links": 5,
    "headless": true,
    "return_url_data": true,
    "return_prod_data": true
  }'


# Download results
curl -O "http://127.0.0.1:8080/api/download/url-data/12_04_2025_14_58_45"

Option 3: Use Python Directly

from scrapper.pipeline.main_pipeline import AmazonScrapingPipeline


pipeline = AmazonScrapingPipeline(
    search_terms=['laptop pc', 'wireless mouse'],
    target_links=[1, 2],
    headless=True,
    return_url_data=True,
    return_prod_data=True
)


url_artifact, url_data, product_artifact, product_data = pipeline.run_pipeline()

Option 4: Run Stages Independently

# 1) Collect URLs
python -m scrapper.pipeline.url_pipeline


# 2) Scrape products (update url_file_path first)
python -m scrapper.pipeline.prodcut_pipeline

📄 License

Proprietary License - All rights reserved.

This software is proprietary. No part of this code may be used, copied, modified, or distributed without explicit written permission from the copyright holder.


๐Ÿ‘จโ€๐Ÿ’ป Support

For support, bug reports, or feature requests:


🔄 Version History

1.0.0 (Current)

  • ✅ Initial release
  • ✅ URL scraping pipeline
  • ✅ Product scraping pipeline
  • ✅ End-to-end pipeline
  • ✅ FastAPI REST API
  • ✅ Web UI
  • ✅ Download endpoints
  • ✅ Comprehensive logging
  • ✅ YAML-based locators

🎓 More Information

For interactive API documentation with live testing capabilities, visit http://127.0.0.1:8080/docs (Swagger UI) or http://127.0.0.1:8080/redoc (ReDoc), FastAPI's default documentation routes.

(Available when the FastAPI server is running)


Made with โค๏ธ for Amazon scraping workflows by Dhruv Saxena

Also Visit: dhruvsaxena25.com for more details.


Disclaimer: This project is proprietary. No one is allowed to use, copy, modify, or distribute any part of this code without explicit permission from the owner.
