Independent URL and Product scraping pipelines for Amazon
Amazon Scraper Pipelines
Production-ready, Selenium-based pipelines for scraping Amazon search results and product details with a powerful FastAPI web interface.
Features
- Modular Design: URL and product scraping pipelines can be run independently or together
- FastAPI REST API: Full-featured API for programmatic access to all scraping pipelines
- Beautiful Web UI: Browser-based interface for running scrapers without writing code
- Configurable Scraping: Control search terms, number of URLs per term, timeouts, and headless mode
- Timestamped Artifacts: All outputs stored under timestamped folders for easy versioning
- YAML-based Locators: Page locators externalized into YAML for easier maintenance
- Detailed Logging: Structured logs for each stage and overall pipeline execution
- In-memory Data Access: Optionally return scraped data as Python dicts in addition to JSON files
- Download API: Download scraped data via REST endpoints
Installation
pip install amazon-scrapper-pipeline
Requirements
- Python 3.8+
- Chrome/Chromium browser (for Selenium)
Dependencies:
pip install fastapi uvicorn selenium webdriver-manager pydantic jinja2 python-multipart pyyaml
Quick Start
Running the FastAPI Server
You can start the server in several ways depending on your workflow.
Option 1: Run main.py directly (if uvicorn.run is inside)

```python
# main.py
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "scrapper.router.api:app",
        host="127.0.0.1",
        port=8080,
        reload=True
    )
```

```shell
python main.py
```
Option 2: Use uvicorn from the command line

```shell
uvicorn scrapper.router.api:app --host 127.0.0.1 --port 8080
```

Option 3: Development mode with auto-reload (recommended during development)

```shell
uvicorn scrapper.router.api:app --host 127.0.0.1 --port 8080 --reload
```

Option 4: Run on all network interfaces

```shell
uvicorn scrapper.router.api:app --host 0.0.0.0 --port 8080
```
Understanding the uvicorn command:
- `scrapper.router.api:app`: the `app` object inside the `scrapper/router/api.py` file (`app = FastAPI()`)
- `--host 127.0.0.1`: binds to localhost only (most secure for local development)
- `--port 8080`: server listens on port 8080
- `--reload`: auto-reloads the server when code changes (development only, NOT for production)
- `--host 0.0.0.0`: makes the server accessible from other machines on your network
Server will be available at:
- Web UI: http://127.0.0.1:8080/
- API Docs (Swagger): http://127.0.0.1:8080/docs
- API Docs (ReDoc): http://127.0.0.1:8080/redoc
Using the Web Interface
- Open http://127.0.0.1:8080/ in your browser
- Choose a scraper tab (Main Scraper / URL Scraper / Product Scraper)
- Configure your scraping options
- Click "Start Scraping"
- Download results when complete
Using Python Directly
```python
from scrapper.pipeline.main_pipeline import AmazonScrapingPipeline

# Run full pipeline: Search → URLs → Products
pipeline = AmazonScrapingPipeline(
    search_terms=['laptop', 'wireless mouse'],
    target_links=5,
    headless=True,
    return_url_data=True,
    return_prod_data=True
)

# Returns in this fixed order
url_artifact, url_data, product_artifact, product_data = pipeline.run_pipeline()

print(f"URLs saved to: {url_artifact.url_file_path}")
print(f"Products saved to: {product_artifact.product_file_path}")
print(f"Total URLs: {url_data['total_urls']}")
print(f"Scraped products: {product_data['total_scraped']}")
```
Anti-bot & network tips
For best results when running the scrapers:
- Use a fast and stable internet connection. High latency or frequent disconnects can cause timeouts, incomplete loads, and more frequent bot challenges.
- Set `headless=False` while debugging. Run the browser in visible mode during development to see what the scraper is doing, inspect page behavior, and understand where it fails.
- Use a VPN or proxy if you frequently see CAPTCHAs. Switch to a different region or IP (respecting all legal and platform terms) when Amazon starts showing CAPTCHAs too often.
- Extend the code to handle bot detection for your use case. The project is open for customization: adjust delays, headers, proxies, and Selenium behavior, and add your own strategies to better handle bot detection and anti-scraping defenses.
FastAPI Web Interface
The web interface provides three scraping modes accessible via tabs:
1. Main Scraper (Full Pipeline)
Runs both URL and Product scraping in sequence.
- Search Terms: Enter one search term per line
- Target Links: Number of product URLs to scrape per search term
- Headless Mode: Run browser without visible window
- Return URL Data: Include scraped URLs in API response
- Return Product Data: Include scraped product details in API response
2. URL Scraper
Collects only product URLs from Amazon search results.
- Outputs a JSON file with URLs organized by search term
- Useful when you want to review URLs before scraping product details
3. Product Scraper
Scrapes detailed product information from a previously generated URL file.
- Upload a `urls.json` file from a previous URL scrape
- Extracts price, specifications, reviews, and more
REST API Endpoints
Health Check
```
GET /api
```

Response:

```json
{
  "message": "Amazon Scraper Router API is running.",
  "version": "1.0.0"
}
```
Main Scraper (Full Pipeline)
```
POST /api/mainscrape
Content-Type: application/json
```

Request Body:

```json
{
  "search_terms": ["laptop", "wireless mouse"],
  "target_links": 5,
  "headless": true,
  "return_url_data": true,
  "return_prod_data": true
}
```
Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `search_terms` | `list[str]` | required | List of Amazon search terms |
| `target_links` | `int \| list[int]` | required | Number of product URLs to scrape per term |
| `headless` | `bool` | `true` | Run browser in headless mode |
| `return_url_data` | `bool` | `false` | Include URL data in response |
| `return_prod_data` | `bool` | `false` | Include product data in response |
Response (with both return flags true):

```json
{
  "status": "success",
  "url_artifact": {
    "url_file_path": "Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
    "download_url": "/api/download/url-data/12_04_2025_14_58_45"
  },
  "product_artifact": {
    "product_file_path": "Artifacts/12_04_2025_14_58_45/ProductData/products.json",
    "download_url": "/api/download/product-data/12_04_2025_14_58_45"
  },
  "url_data": {
    "total_products": 2,
    "total_urls": 10,
    "products": {
      "laptop": {
        "count": 5,
        "urls": ["https://www.amazon.in/..."]
      }
    }
  },
  "product_data": {
    "total_scraped": 10,
    "total_failed": 0,
    "products": {
      "laptop": [
        {
          "Product Name": "...",
          "Product Price": "₹49,999",
          "Ratings": "4.5",
          "Technical Details": {},
          "Customer Reviews": []
        }
      ]
    }
  }
}
```
URL Scraper
```
POST /api/urlscrape
Content-Type: application/json
```

Request Body:

```json
{
  "search_terms": ["laptop"],
  "target_links": 10,
  "headless": true,
  "return_url_data": true
}
```
Response:

```json
{
  "status": "success",
  "url_artifact": {
    "url_file_path": "Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
    "download_url": "/api/download/url-data/12_04_2025_14_58_45"
  },
  "url_data": {
    "total_products": 1,
    "total_urls": 10,
    "products": {
      "laptop": {
        "count": 10,
        "urls": [
          "https://www.amazon.in/...",
          "https://www.amazon.in/..."
        ]
      }
    }
  }
}
```
Product Scraper
```
POST /api/productscrape
Content-Type: multipart/form-data
```

Form Data:

| Field | Type | Description |
|---|---|---|
| `file` | File | JSON file containing URLs |
| `headless` | `bool` | Run browser in headless mode |
| `return_prod_data` | `bool` | Include product data in response |
Example using cURL:

```shell
curl -X POST "http://127.0.0.1:8080/api/productscrape" \
  -F "file=@urls.json" \
  -F "headless=true" \
  -F "return_prod_data=true"
```
Response:

```json
{
  "status": "success",
  "url_file_path": "Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
  "url_artifact": {
    "url_file_path": "...",
    "download_url": "/api/download/file?path=..."
  },
  "product_artifact": {
    "product_file_path": "Artifacts/12_04_2025_14_58_45/ProductData/products.json",
    "download_url": "/api/download/product-data/12_04_2025_14_58_45"
  },
  "product_data": {
    "total_scraped": 10,
    "total_failed": 0,
    "products": {}
  }
}
```
Download Endpoints
Download URL data by timestamp:

```
GET /api/download/url-data/{timestamp}
# Example: GET /api/download/url-data/12_04_2025_14_58_45
```

Download product data by timestamp:

```
GET /api/download/product-data/{timestamp}
# Example: GET /api/download/product-data/12_04_2025_14_58_45
```

Download by file path:

```
GET /api/download/file?path=Artifacts/12_04_2025_14_58_45/UrlData/urls.json
```
Results Endpoints
Get results by timestamp:

```
GET /api/results/{timestamp}
```

List all available results:

```
GET /api/results
```

Response:

```json
{
  "results": [
    {
      "timestamp": "12_04_2025_14_58_45",
      "files": {
        "url_file": "Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
        "product_file": "Artifacts/12_04_2025_14_58_45/ProductData/products.json"
      },
      "download_urls": {
        "url_data": "/api/download/url-data/12_04_2025_14_58_45",
        "product_data": "/api/download/product-data/12_04_2025_14_58_45"
      }
    }
  ]
}
```
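The listing endpoint essentially enumerates the timestamped run folders under `Artifacts/`. If you prefer to work with artifacts on disk directly, a similar scan can be done locally; this is a sketch only, and the helper name and the assumption that every run folder follows the documented layout are mine, not part of the package:

```python
from pathlib import Path
from typing import List

def list_results(artifacts_dir: str) -> List[dict]:
    """Enumerate timestamped run folders and the artifact files they contain."""
    results = []
    for run_dir in sorted(Path(artifacts_dir).iterdir()):
        if not run_dir.is_dir():
            continue
        url_file = run_dir / "UrlData" / "urls.json"
        product_file = run_dir / "ProductData" / "products.json"
        results.append({
            "timestamp": run_dir.name,
            "files": {
                # None marks a stage that never ran for this timestamp
                "url_file": str(url_file) if url_file.exists() else None,
                "product_file": str(product_file) if product_file.exists() else None,
            },
        })
    return results
```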
Python API
1. URL Scraping Pipeline
Collects product URLs from Amazon search results and saves them to a JSON file.
```python
from scrapper.pipeline.url_pipeline import AmazonUrlScrapingPipeline

pipeline = AmazonUrlScrapingPipeline(
    search_terms=['laptop pc', 'wireless mouse'],
    target_links=[5, 3],  # 5 laptops, 3 mice
    headless=True,
    return_url_data=True
)
url_artifact, url_data = pipeline.run()

print(f"URLs saved to: {url_artifact.url_file_path}")
print(f"Total URLs: {url_data['total_urls']}")
```
Parameters:
- `search_terms`: `list[str] | str` - Amazon search terms
- `target_links`: `int | list[int]` - URLs to scrape per term (default: 5)
- `headless`: `bool` - Run browser in headless mode (default: False)
- `wait_timeout`: `int` - Element wait timeout in seconds (default: 5)
- `page_load_timeout`: `int` - Page load timeout in seconds (default: 15)
- `return_url_data`: `bool` - Return URL data in memory (default: False)

Returns:
- When `return_url_data=False`: `(UrlDataArtifact,)`
- When `return_url_data=True`: `(UrlDataArtifact, dict)`
2. Product Scraping Pipeline
Reads a URL JSON file and scrapes detailed information for each product URL.
```python
from scrapper.pipeline.prodcut_pipeline import AmazonProductScrapingPipeline

pipeline = AmazonProductScrapingPipeline(
    url_file_path="Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
    headless=True,
    return_prod_data=True
)
product_artifact, product_data = pipeline.run()

print(f"Products saved to: {product_artifact.product_file_path}")
print(f"Success: {product_artifact.scraped_count}")
print(f"Failed: {product_artifact.failed_count}")
```
Parameters:
- `url_file_path`: `str | Path` - Path to URL JSON file (required)
- `headless`: `bool` - Run browser in headless mode (default: False)
- `wait_timeout`: `int` - Element wait timeout in seconds (default: 10)
- `page_load_timeout`: `int` - Page load timeout in seconds (default: 20)
- `return_prod_data`: `bool` - Return product data in memory (default: False)

Returns:
- When `return_prod_data=False`: `(ProductDataArtifact,)`
- When `return_prod_data=True`: `(ProductDataArtifact, dict)`
3. End-to-End Pipeline (Main)
Runs both URL and product scraping in sequence: Search → URLs → Products

```python
from scrapper.pipeline.main_pipeline import AmazonScrapingPipeline

pipeline = AmazonScrapingPipeline(
    search_terms=['laptop', 'wireless mouse'],
    target_links=[5, 3],
    headless=True,
    return_url_data=True,
    return_prod_data=True
)

# ALWAYS returns in this fixed order
url_artifact, url_data, product_artifact, product_data = pipeline.run_pipeline()

print(f"URLs: {url_data['total_urls']}")
print(f"Products: {product_data['total_scraped']}")
```
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `search_terms` | `list[str] \| str` | required | Amazon search terms |
| `target_links` | `int \| list[int]` | 5 | URLs to scrape per term |
| `headless` | `bool` | False | Run in headless mode |
| `wait_timeout` | `int` | 5 | Wait timeout (seconds) |
| `page_load_timeout` | `int` | 15 | Page load timeout (seconds) |
| `return_url_data` | `bool` | False | Return URL data in memory |
| `return_prod_data` | `bool` | False | Return product data in memory |
Return Value (Fixed Order):
```python
url_artifact, url_data, product_artifact, product_data = pipeline.run_pipeline()
```

| Variable | Type | Description |
|---|---|---|
| `url_artifact` | `UrlDataArtifact` | Contains `url_file_path` |
| `url_data` | `dict \| None` | URL data (`None` unless `return_url_data=True`) |
| `product_artifact` | `ProductDataArtifact` | Contains `product_file_path`, `scraped_count`, `failed_count` |
| `product_data` | `dict \| None` | Product data (`None` unless `return_prod_data=True`) |
Output Structure
All artifacts are saved under timestamped directories:

```
Artifacts/
└── 12_04_2025_14_58_45/        # Timestamp: MM_DD_YYYY_HH_MM_SS
    ├── UrlData/
    │   └── urls.json           # Collected product URLs
    └── ProductData/
        └── products.json       # Detailed product data
```
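The folder names follow the MM_DD_YYYY_HH_MM_SS pattern shown above, which maps directly onto a `strftime` format string. A small sketch for generating or parsing such names (the format string is inferred from the example above, not taken from the package source):

```python
from datetime import datetime
from typing import Optional

TIMESTAMP_FORMAT = "%m_%d_%Y_%H_%M_%S"  # MM_DD_YYYY_HH_MM_SS

def new_run_folder_name(now: Optional[datetime] = None) -> str:
    """Build a timestamped run-folder name like '12_04_2025_14_58_45'."""
    return (now or datetime.now()).strftime(TIMESTAMP_FORMAT)

def parse_run_timestamp(name: str) -> datetime:
    """Recover the datetime encoded in a run-folder name."""
    return datetime.strptime(name, TIMESTAMP_FORMAT)
```

Parsing the folder name back into a `datetime` makes it easy to sort runs chronologically or prune old artifacts.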
URL JSON Format
```json
{
  "total_products": 2,
  "total_urls": 3,
  "products": {
    "laptop": {
      "count": 1,
      "urls": ["https://www.amazon.in/..."]
    },
    "wireless mouse": {
      "count": 2,
      "urls": [
        "https://www.amazon.in/...",
        "https://www.amazon.in/..."
      ]
    }
  }
}
```
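Because the product scraper consumes this file, it can be handy to sanity-check a `urls.json` before uploading it. A minimal sketch, with validation rules inferred from the format above (the function name is mine, not part of the package):

```python
import json
from pathlib import Path

def check_url_file(path: str) -> dict:
    """Load a urls.json file and verify its counts are self-consistent."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    products = data["products"]
    # Top-level totals must agree with the per-term entries
    assert data["total_products"] == len(products)
    assert data["total_urls"] == sum(p["count"] for p in products.values())
    for term, p in products.items():
        assert p["count"] == len(p["urls"]), f"count mismatch for {term!r}"
    return data
```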
Product JSON Format
```json
{
  "total_scraped": 3,
  "total_failed": 0,
  "products": {
    "laptop": [
      {
        "Product Name": "Apple MacBook Air M2",
        "Product Price": "₹99,999",
        "Ratings": "4.5",
        "Total Reviews": "1,234 ratings",
        "Category": "Computers & Accessories",
        "Product URL": "https://www.amazon.in/...",
        "Technical Details": {
          "Brand": "Apple",
          "Processor": "M2",
          "RAM": "8GB"
        },
        "Customer Reviews": [
          {
            "reviewer": "John Doe",
            "rating": "5.0",
            "title": "Excellent laptop",
            "content": "Fast and reliable..."
          }
        ]
      }
    ]
  }
}
```
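The nested product JSON is convenient for the API, but for analysis you often want one flat row per product (e.g. to write a CSV). A sketch of flattening it; the field names follow the format above, while the function name and chosen columns are mine:

```python
from typing import List

def flatten_products(data: dict) -> List[dict]:
    """Turn the nested products.json structure into one flat dict per product."""
    rows = []
    for term, items in data["products"].items():
        for item in items:
            rows.append({
                "search_term": term,
                "name": item.get("Product Name"),
                "price": item.get("Product Price"),
                "rating": item.get("Ratings"),
                "url": item.get("Product URL"),
                "review_count": len(item.get("Customer Reviews", [])),
            })
    return rows
```

Each row can then be fed straight to `csv.DictWriter` or a pandas DataFrame.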
Advanced Examples
Example: Different Link Counts per Search Term
```python
pipeline = AmazonScrapingPipeline(
    search_terms=['laptop', 'wireless mouse', 'keyboard'],
    target_links=[10, 5, 3],  # 10 laptops, 5 mice, 3 keyboards
    headless=True,
    return_url_data=True,
    return_prod_data=True
)
url_artifact, url_data, product_artifact, product_data = pipeline.run_pipeline()
```
Example: URL Scraping Only
```python
from scrapper.pipeline.url_pipeline import AmazonUrlScrapingPipeline

pipeline = AmazonUrlScrapingPipeline(
    search_terms=['gaming laptop'],
    target_links=20,
    headless=True,
    return_url_data=True
)
url_artifact, url_data = pipeline.run()

# Review URLs before product scraping
for term, data in url_data['products'].items():
    print(f"{term}: {data['count']} URLs")
```
Example: Product Scraping from Existing URLs
```python
from scrapper.pipeline.prodcut_pipeline import AmazonProductScrapingPipeline

pipeline = AmazonProductScrapingPipeline(
    url_file_path="Artifacts/12_04_2025_14_58_45/UrlData/urls.json",
    headless=True,
    return_prod_data=True
)
product_artifact, product_data = pipeline.run()

print(f"Success: {product_artifact.scraped_count}")
print(f"Failed: {product_artifact.failed_count}")
```
Using the REST API
Example: Python Requests
```python
import requests

# Main scraper
response = requests.post(
    'http://127.0.0.1:8080/api/mainscrape',
    json={
        'search_terms': ['laptop'],
        'target_links': 5,
        'headless': True,
        'return_url_data': True,
        'return_prod_data': True
    }
)
data = response.json()
print(f"Status: {data['status']}")
print(f"URLs: {data['url_data']['total_urls']}")
print(f"Products: {data['product_data']['total_scraped']}")

# Download files
timestamp = "12_04_2025_14_58_45"
url_file = requests.get(f'http://127.0.0.1:8080/api/download/url-data/{timestamp}')
with open('urls.json', 'wb') as f:
    f.write(url_file.content)
```
Example: JavaScript/Node.js
```javascript
// Main scraper
const response = await fetch('http://127.0.0.1:8080/api/mainscrape', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    search_terms: ['laptop'],
    target_links: 5,
    headless: true,
    return_url_data: true,
    return_prod_data: true
  })
});
const data = await response.json();
console.log(`URLs: ${data.url_data.total_urls}`);
console.log(`Products: ${data.product_data.total_scraped}`);
```
Project Layout
```
project/
├── Artifacts/
│   └── <timestamp_folder>/
│       ├── UrlData/
│       │   └── urls.json
│       └── ProductData/
│           └── products.json
├── logs/
│   ├── *.log
│   └── ...
├── static/
│   ├── css/
│   │   └── style.css
│   └── js/
│       └── app.js
├── templates/
│   ├── index.html
│   ├── base.html
│   └── about.html
└── scrapper/
    ├── config/
    │   ├── urls_locators.yaml
    │   └── product_locators.yaml
    ├── constant/
    │   └── configuration.py
    ├── entity/
    │   ├── artifact_entity.py
    │   ├── config_entity.py
    │   ├── product_locator_entity.py
    │   └── url_locator_entity.py
    ├── exception/
    │   └── custom_exception.py
    ├── logger/
    │   └── logging.py
    ├── pipeline/
    │   ├── main_pipeline.py
    │   ├── url_pipeline.py
    │   └── prodcut_pipeline.py
    ├── router/
    │   └── api.py              # FastAPI application
    ├── src/
    │   ├── multi_product_scrapper.py
    │   ├── multi_url_scrapper.py
    │   └── url_scrapper.py
    └── util/
        └── main_utils.py
```
Logging
Logs are stored in the logs/ directory with timestamps:

```
logs/
├── 12_04_2025_14_58_45.log
├── 12_04_2025_15_30_12.log
└── ...
```
Log Levels:
- INFO: Normal operations
- WARNING: Potential issues
- ERROR: Errors during scraping
- DEBUG: Detailed debugging information
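A per-run log file like the ones above can be produced with the standard `logging` module. This is only a sketch of how such a setup might look; the project's own `scrapper/logger/logging.py` may differ, and the format string here is an assumption:

```python
import logging
import os
from datetime import datetime

def setup_logger(log_dir: str = "logs") -> logging.Logger:
    """Create a logger writing to logs/<MM_DD_YYYY_HH_MM_SS>.log."""
    os.makedirs(log_dir, exist_ok=True)
    log_file = os.path.join(
        log_dir, datetime.now().strftime("%m_%d_%Y_%H_%M_%S") + ".log"
    )
    logger = logging.getLogger("scrapper")
    logger.setLevel(logging.DEBUG)  # capture all levels listed above
    handler = logging.FileHandler(log_file)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    return logger
```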
Important Notes
Legal & Ethical Considerations
- Educational purposes only; use responsibly
- Respect Amazon's Terms of Service and robots.txt
- Use reasonable delays between requests
- Do not overload Amazon's servers
- Check local laws regarding web scraping
- This tool should not be used for commercial scraping without proper authorization
Technical Considerations
- Amazon's DOM structure may change; locators may need updates
- Anti-bot mechanisms may block excessive requests
- Headless mode is recommended for production use
- Use proxies for large-scale scraping
- The FastAPI server runs on port 8080 by default (configurable)
- For production deployment, use a proper ASGI server like Gunicorn with Uvicorn workers
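The production-deployment note above can be made concrete with Gunicorn managing Uvicorn workers; the worker count and bind address below are illustrative, not prescribed by the project:

```shell
pip install gunicorn
gunicorn "scrapper.router.api:app" \
  --worker-class uvicorn.workers.UvicornWorker \
  --workers 4 \
  --bind 0.0.0.0:8080
```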
Typical Workflows
Option 1: Use Web UI
- Start the server: `uvicorn scrapper.router.api:app --host 127.0.0.1 --port 8080`
- Open http://127.0.0.1:8080/
- Select a scraper tab
- Configure options and click "Start Scraping"
- Download results when complete
Option 2: Use REST API
```shell
# Full pipeline
curl -X POST "http://127.0.0.1:8080/api/mainscrape" \
  -H "Content-Type: application/json" \
  -d '{
    "search_terms": ["laptop"],
    "target_links": 5,
    "headless": true,
    "return_url_data": true,
    "return_prod_data": true
  }'

# Download results
curl -O "http://127.0.0.1:8080/api/download/url-data/12_04_2025_14_58_45"
```
Option 3: Use Python Directly
```python
from scrapper.pipeline.main_pipeline import AmazonScrapingPipeline

pipeline = AmazonScrapingPipeline(
    search_terms=['laptop pc', 'wireless mouse'],
    target_links=[1, 2],
    headless=True,
    return_url_data=True,
    return_prod_data=True
)
url_artifact, url_data, product_artifact, product_data = pipeline.run_pipeline()
```
Option 4: Run Stages Independently
```shell
# 1) Collect URLs
python -m scrapper.pipeline.url_pipeline

# 2) Scrape products (update url_file_path first)
python -m scrapper.pipeline.prodcut_pipeline
```
License
Proprietary License - All rights reserved.
This software is proprietary. No part of this code may be used, copied, modified, or distributed without explicit written permission from the copyright holder.
Support
For support, bug reports, or feature requests:
- Email: support.dhruv@dhruvsaxena25.com
- Issues: Create an issue on the repository
- Documentation: http://127.0.0.1:8080/docs (when server is running)
Version History
1.0.0 (Current)
- Initial release
- URL scraping pipeline
- Product scraping pipeline
- End-to-end pipeline
- FastAPI REST API
- Web UI interface
- Download endpoints
- Comprehensive logging
- YAML-based locators
More Information
For interactive API documentation with live testing capabilities, visit:
- Swagger UI: http://127.0.0.1:8080/docs
- ReDoc: http://127.0.0.1:8080/redoc
(Available when the FastAPI server is running)
Made with ❤️ for Amazon scraping workflows by Dhruv Saxena
Also Visit: dhruvsaxena25.com for more details.
Disclaimer: This project is proprietary. No one is allowed to use, copy, modify, or distribute any part of this code without explicit permission from the owner.