A scraper for Amazon product details and reviews using ASIN

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

AmazonScraper Python Package Documentation

AmazonScraper is a Python package for scraping product details from Amazon.in using a product's ASIN. It retrieves various pieces of information including the product title, pricing, categories, technical details, additional specifications, ratings, and feature highlights.

Overview
Features
Requirements
Installation
Usage
- Basic Usage
- Saving the HTML
API Reference
- AmazonScraper Class
- Data Models
Error Handling
License
Contributing
Disclaimer

Overview

AmazonScraper is designed to extract detailed information from an Amazon.in product page by parsing its HTML. With this package, users can retrieve:

- Product Title
- Pricing Details (MRP and Selling Price)
- Category Tags (Breadcrumbs)
- Technical and Additional Specifications
- Detailed Product Information (Bullet Points)
- Ratings and Review Counts
- Feature Highlights (About Section)

Features

- Comprehensive Scraping: Extracts all relevant product data in one go.
- Robust Error Handling: Gracefully handles missing data or page fetch errors.
- Structured Data Models: Returns data in easy-to-use Python dataclasses.
- Customizable HTML Saving: Option to save the prettified HTML for debugging.

Requirements

- Python 3.7 or higher
- httpx for HTTP requests
- BeautifulSoup (beautifulsoup4) for HTML parsing
- pydantic (included in the install command below)

Install the dependencies via pip:

pip install httpx beautifulsoup4 pydantic

Installation

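The package is published on PyPI as dibkb-scraper, so it can be installed with `pip install dibkb-scraper`. Alternatively, clone or download the repository and keep the package structure (including the models and utils modules) intact, then include the package in your project as needed.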

Usage

Basic Usage

The snippet below shows how to use AmazonScraper:

from dibkb_scraper import AmazonScraper

# Initialize the scraper with a valid Amazon ASIN
asin = "B00935MGKK"
scraper = AmazonScraper(asin)

# Retrieve all product details
product_details = scraper.get_all_details()

# Access and print various product attributes
print("Title:", product_details.product.title)
print("MRP:", product_details.product.pricing.mrp)
print("Selling Price:", product_details.product.pricing.selling_price)
print("Categories:", product_details.product.categories)
print("Highlights:", product_details.product.description.highlights)
print("Technical Specs:", product_details.product.specifications.technical)
print("Additional Specs:", product_details.product.specifications.additional)
print("Detail Bullets:", product_details.product.specifications.details)
print("Ratings:", product_details.product.ratings.rating)
print("Review Count:", product_details.product.ratings.review_count)

Saving the HTML

To save the prettified HTML of the product page (useful for debugging), use the page_html_to_text method:

# Saves the HTML under the custom name "B00935MGKK_page";
# with no argument, the file name defaults to the ASIN ('B00935MGKK.txt')
scraper.page_html_to_text("B00935MGKK_page")
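
The file-naming rule described above (use the given name, otherwise fall back to the ASIN) can be sketched as follows. This is a minimal illustration, not the package's code: `output_filename` and `save_page_text` are hypothetical helpers, and the `.txt` suffix is taken from the comment in the example above.

```python
from pathlib import Path
from typing import Optional


def output_filename(asin: str, name: Optional[str] = None) -> str:
    """Pick the output file name: the custom name if given, else the ASIN."""
    # Assumption: a '.txt' suffix is always appended.
    return f"{name or asin}.txt"


def save_page_text(html: str, asin: str, name: Optional[str] = None,
                   directory: str = ".") -> Path:
    """Write the (already prettified) HTML text to disk and return the path."""
    path = Path(directory) / output_filename(asin, name)
    path.write_text(html, encoding="utf-8")
    return path
```

Calling `save_page_text("<html></html>", "B00935MGKK")` would write `B00935MGKK.txt` in the current directory under these assumptions.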

API Reference

AmazonScraper Class

`__init__(self, asin: str)`

Parameters:
- asin (str): The Amazon Standard Identification Number of the product.
Description: Initializes the scraper, constructs the product URL, sets HTTP headers, and retrieves the HTML content.
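
As a rough illustration of the URL-construction step, the sketch below validates an ASIN and builds a product URL. The helper name `build_product_url`, the 10-character alphanumeric check, and the `/dp/<ASIN>` path are assumptions made for illustration; the package's constructor may work differently.

```python
import re

# Assumption: an ASIN is 10 uppercase letters or digits.
ASIN_PATTERN = re.compile(r"[A-Z0-9]{10}")


def build_product_url(asin: str) -> str:
    """Return an Amazon.in product URL for the ASIN (hypothetical helper)."""
    asin = asin.strip().upper()
    if not ASIN_PATTERN.fullmatch(asin):
        raise ValueError(f"Not a valid-looking ASIN: {asin!r}")
    # Assumption: product pages are reachable under /dp/<ASIN>.
    return f"https://www.amazon.in/dp/{asin}"


print(build_product_url("B00935MGKK"))
```

Rejecting malformed ASINs before any request is made keeps obviously bad input from turning into a page fetch error later.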

`page_html_to_text(self, name: Optional[str] = None)`

Parameters:
- name (Optional[str]): Optional filename for the output text file. Defaults to the ASIN if not provided.
Description: Saves the prettified HTML of the product page into a text file.

`get_product_title(self) -> Optional[str]`

Returns: The product title as a string, or None if not found.
Description: Extracts and returns the product title from the page.

`get_mrp(self) -> Optional[float]`

Returns: The Maximum Retail Price (MRP) as a float, or None if not found.
Description: Extracts the MRP from the designated HTML element.

`get_selling_price(self) -> Optional[float]`

Returns: The selling price as a float, or None if not found.
Description: Retrieves the selling price from the page.
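
Both price methods return `Optional[float]`, which implies converting displayed price text into a number. A minimal sketch of such a conversion is shown here; `parse_price` is a hypothetical helper, and the input formats (a currency symbol plus comma-grouped digits) are assumptions about what the page text looks like.

```python
import re
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[float]:
    """Convert price text such as '₹1,299.00' to a float, or None if absent."""
    if not text:
        return None
    # Take the first run of digits with optional commas and a decimal part.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


print(parse_price("₹1,299.00"))
print(parse_price("Currently unavailable"))
```

Returning None rather than raising matches the documented "or None if not found" behaviour of `get_mrp` and `get_selling_price`.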

`get_tags(self) -> List[str]`

Returns: A list of category tags (breadcrumbs) as strings.
Description: Extracts breadcrumb links that indicate product categories.

`get_technical_info(self) -> Dict[str, str]`

Returns: A dictionary of technical specifications in key-value pairs.
Description: Parses the technical details table from the product page.

`get_additional_info(self) -> Dict[str, str]`

Returns: A dictionary containing additional product details.
Description: Extracts further details from the secondary details table.
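
The two table-based methods both turn table rows into a `Dict[str, str]`. The sketch below shows that idea on a small hand-written table. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; the class name `TableToDict` and the sample markup are assumptions, not the package's code or Amazon's actual HTML.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class TableToDict(HTMLParser):
    """Collect <th>/<td> text from each table row into a dictionary."""

    def __init__(self) -> None:
        super().__init__()
        self.result: Dict[str, str] = {}
        self._cell: Optional[str] = None  # tag of the cell currently open
        self._key: List[str] = []
        self._value: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._key, self._value = [], []
        elif tag in ("th", "td"):
            self._cell = tag

    def handle_data(self, data):
        if self._cell == "th":
            self._key.append(data)
        elif self._cell == "td":
            self._value.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None
        elif tag == "tr":
            # Collapse runs of whitespace and store the finished row.
            key = " ".join("".join(self._key).split())
            value = " ".join("".join(self._value).split())
            if key:
                self.result[key] = value


def table_to_dict(html: str) -> Dict[str, str]:
    parser = TableToDict()
    parser.feed(html)
    return parser.result


sample = (
    "<table>"
    "<tr><th> Brand </th><td> Acme </td></tr>"
    "<tr><th>Item Weight</th><td><span>1.2 kg</span></td></tr>"
    "</table>"
)
print(table_to_dict(sample))
```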

`get_product_details(self) -> Dict[str, str]`

Returns: A dictionary of detailed product information (e.g., bullet points).
Description: Retrieves information from the "detail bullets" section of the product page.

`get_ratings(self) -> Ratings`

Returns: A Ratings object containing the product's average rating and the total review count.
Description: Extracts rating and review count, with a fallback method if the primary extraction fails.
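
To make the return shape concrete, here is a small sketch that turns rating text into a `Ratings`-like object. The dataclass mirrors the attributes documented under Data Models; `parse_ratings` and the text formats ("4.3 out of 5 stars", "1,234 ratings") are assumptions for illustration and do not reproduce the package's primary or fallback extraction.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ratings:
    rating: Optional[float] = None
    review_count: Optional[int] = None


def parse_ratings(rating_text: str, count_text: str) -> Ratings:
    """Extract the average rating and review count from display text."""
    rating_match = re.search(r"\d+(?:\.\d+)?", rating_text)
    count_match = re.search(r"\d[\d,]*", count_text)
    return Ratings(
        rating=float(rating_match.group()) if rating_match else None,
        review_count=(
            int(count_match.group().replace(",", "")) if count_match else None
        ),
    )


print(parse_ratings("4.3 out of 5 stars", "1,234 ratings"))
```

Each field falls back to None independently, so a page that shows a rating but no review count still yields a partially filled result.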

`get_about(self) -> Union[List[str], Dict[str, str]]`

Returns: A list of product description highlights, or an error dictionary if extraction fails.
Description: Retrieves the feature bullets from the "feature-bullets" section.
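
Because `get_about` may return either a list or an error dictionary, callers need to check the type before iterating. One way to do that is sketched below; `highlights_or_empty` is a hypothetical helper, and the exact keys of the error dictionary are not documented, so the sketch does not rely on them.

```python
from typing import Dict, List, Union

AboutResult = Union[List[str], Dict[str, str]]


def highlights_or_empty(result: AboutResult) -> List[str]:
    """Return the highlight strings, or an empty list for an error result."""
    if isinstance(result, dict):
        # An error dictionary was returned; its keys are not assumed here.
        return []
    # Drop blank entries and surrounding whitespace.
    return [item.strip() for item in result if item.strip()]


print(highlights_or_empty([" Long battery life ", "", "USB-C charging"]))
print(highlights_or_empty({"error": "section not found"}))
```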

`get_all_details(self) -> AmazonProductResponse`

Returns: An AmazonProductResponse object that consolidates all scraped product details. If the page fails to load, the response includes an error message.
Description: Aggregates all product data into a structured response.

Data Models

The package uses several dataclasses to organize the scraped data:

Ratings

Attributes:
- rating (Optional[float]): The average product rating.
- review_count (Optional[int]): The total number of reviews.

Pricing

Attributes:
- mrp (Optional[float]): The Maximum Retail Price.
- selling_price (Optional[float]): The current selling price.

Description

Attributes:
- highlights (List[str]): A list of product highlight points.

Specifications

Attributes:
- technical (Dict[str, str]): Technical specifications from the product page.
- additional (Dict[str, str]): Additional product details.
- details (Dict[str, str]): Detailed information extracted from the bullet points.

Product

Attributes:
- title (Optional[str]): The product title.
- pricing (Pricing): Pricing details of the product.
- categories (List[str]): Category tags (breadcrumbs).
- description (Description): Feature highlights.
- specifications (Specifications): Detailed specifications.
- ratings (Ratings): Rating information.

AmazonProductResponse

Attributes:
- product (Product): An object containing all the scraped product details.
- error (Optional[str]): An error message if the scraping process fails.
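
The attribute lists above can be written out as dataclasses to show how the models nest. This is a sketch reconstructed from the documented attributes only; the default values are assumptions, and the package's real definitions may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Ratings:
    rating: Optional[float] = None
    review_count: Optional[int] = None


@dataclass
class Pricing:
    mrp: Optional[float] = None
    selling_price: Optional[float] = None


@dataclass
class Description:
    highlights: List[str] = field(default_factory=list)


@dataclass
class Specifications:
    technical: Dict[str, str] = field(default_factory=dict)
    additional: Dict[str, str] = field(default_factory=dict)
    details: Dict[str, str] = field(default_factory=dict)


@dataclass
class Product:
    title: Optional[str] = None
    pricing: Pricing = field(default_factory=Pricing)
    categories: List[str] = field(default_factory=list)
    description: Description = field(default_factory=Description)
    specifications: Specifications = field(default_factory=Specifications)
    ratings: Ratings = field(default_factory=Ratings)


@dataclass
class AmazonProductResponse:
    product: Product = field(default_factory=Product)
    error: Optional[str] = None


response = AmazonProductResponse(product=Product(title="Example Product"))
print(response.product.title, response.product.pricing.mrp)
```

With these assumed defaults, every nested attribute path used in the Basic Usage example (such as `product.pricing.mrp`) resolves even when nothing was scraped.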

Error Handling

- Page Fetch Errors: If the scraper fails to retrieve the page (e.g., due to network issues or an invalid ASIN), the error field of the returned AmazonProductResponse contains an error message.
- Parsing Exceptions: Individual methods handle exceptions internally, so a missing element does not break the entire scraping process.
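
In practice this means checking `error` before reading product fields. The sketch below shows one way to do so; `describe` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real responses so the example runs without network access.

```python
from types import SimpleNamespace


def describe(response) -> str:
    """Summarize a response, checking the error field first."""
    if response.error:
        return f"Scrape failed: {response.error}"
    # Individual fields may still be None when an element was missing.
    title = response.product.title or "(title not found)"
    return f"Scraped: {title}"


# Stand-ins shaped like AmazonProductResponse (illustrative values only).
ok = SimpleNamespace(error=None, product=SimpleNamespace(title="Example Product"))
failed = SimpleNamespace(error="Failed to fetch page",
                         product=SimpleNamespace(title=None))

print(describe(ok))
print(describe(failed))
```

With the real package, the same function would be called as `describe(scraper.get_all_details())`.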

License

This project is licensed under the MIT License. See the LICENSE file for full details.

Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request with your suggestions or improvements.

Disclaimer

This package is provided for educational and research purposes only. Users must comply with Amazon's terms of service and applicable laws when scraping websites. Use the package responsibly.

Release history

- 0.3.4: Apr 18, 2025
- 0.3.3: Apr 18, 2025
- 0.3.2: Apr 18, 2025
- 0.3.1: Apr 17, 2025
- 0.3.0: Apr 17, 2025
- 0.2.9: Apr 17, 2025
- 0.2.8: Apr 17, 2025
- 0.2.7: Feb 27, 2025
- 0.2.6: Feb 27, 2025 (this version)
- 0.2.5: Feb 25, 2025
- 0.2.4: Feb 24, 2025
- 0.2.3: Feb 24, 2025
- 0.2.2: Feb 24, 2025
- 0.2.1: Feb 21, 2025
- 0.2.0: Feb 21, 2025
- 0.1.8: Feb 21, 2025
- 0.1.7: Feb 21, 2025
- 0.1.6: Feb 19, 2025
- 0.1.3: Feb 18, 2025
- 0.1.2: Feb 18, 2025

Download files

Source Distribution

dibkb_scraper-0.2.6.tar.gz (11.8 kB)

Uploaded Feb 27, 2025 Source

Built Distribution

dibkb_scraper-0.2.6-py3-none-any.whl (9.9 kB)

Uploaded Feb 27, 2025 Python 3

File details

Details for the file dibkb_scraper-0.2.6.tar.gz.

File metadata

Download URL: dibkb_scraper-0.2.6.tar.gz
Upload date: Feb 27, 2025
Size: 11.8 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.1

File hashes

Hashes for dibkb_scraper-0.2.6.tar.gz
Algorithm	Hash digest
SHA256	`a593dd9bb9539be0033ef039719893928ee6c0b9f311c482e13f8f77851a2f40`
MD5	`dc7d68f54a919f8658b8bef96c75dba3`
BLAKE2b-256	`43fb604ca127604b5d386c7aacc5b5c8cb2813db6e81bfe2aa9a7628a9ec990e`

File details

Details for the file dibkb_scraper-0.2.6-py3-none-any.whl.

File metadata

Download URL: dibkb_scraper-0.2.6-py3-none-any.whl
Upload date: Feb 27, 2025
Size: 9.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.13.1

File hashes

Hashes for dibkb_scraper-0.2.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`dc27a9b9bd2818b36111f0b5b1e0cf9b7528f9cd8c55511dbfbf03c58cbc9db9`
MD5	`bdf25424adf4cb12180bff1aebf37a71`
BLAKE2b-256	`f51b34e0d2ebb03f8cdb9c2975f73679089dc021f127c9eda1fb443c5d7fa6bb`
