
product-harvest

Extract structured product data from any web page using Schema.org — no AI required.

Features

  • Zero config — reads the structured data sites already embed for Google
  • Three extraction methods — JSON-LD (preferred), Open Graph, and meta tag fallback
  • Rich product data — name, price, currency, brand, SKU, GTIN, availability, rating, reviews, images, categories
  • Works with any ecommerce site — Shopify, WooCommerce, BigCommerce, Amazon, and millions of sites using Schema.org
  • Batch processing — extract from multiple URLs in one command
  • Multiple output formats — plain text, CSV, JSON, JSONL
  • URL file input — read URLs from a file or stdin pipe
  • No AI/LLM required — pure parsing of existing structured data
  • Minimal dependencies — only requests + beautifulsoup4

How It Works

Most ecommerce sites embed Schema.org structured data in their HTML to help Google understand their products. product-harvest reads this existing data — no scraping rules, no CSS selectors, no AI needed.

<!-- This is what sites already have in their HTML for Google: -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Wireless Headphones",
  "offers": { "price": "149.99", "priceCurrency": "USD" }
}
</script>

product-harvest just finds and parses it.
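The core idea can be shown in a few lines. The sketch below is a minimal, stdlib-only illustration of finding and parsing JSON-LD Product blocks; the real package uses beautifulsoup4 and handles many more edge cases, and the function name here is invented for this example:

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks.
JSONLD_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def extract_jsonld_products(html):
    """Return every Schema.org object typed as Product found in the HTML."""
    products = []
    for block in JSONLD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is common in the wild; skip it
        # A script tag may hold a single object or a list of objects.
        for obj in data if isinstance(data, list) else [data]:
            if isinstance(obj, dict) and obj.get("@type") == "Product":
                products.append(obj)
    return products

sample_html = '''<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Wireless Headphones",
 "offers": {"price": "149.99", "priceCurrency": "USD"}}
</script>'''

for p in extract_jsonld_products(sample_html):
    print(p["name"], p["offers"]["price"])
```

No CSS selectors, no per-site rules: because the data is already machine-readable JSON, one parser covers every compliant site.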

Installation

pip install product-harvest

Requires Python 3.8+.

Quick Start

Extract product data from a URL:

productx https://shop.example.com/product

Output:

ProMax Wireless Headphones    USD 149.99

With full details:

productx https://shop.example.com/product --detail --format json
[
  {
    "name": "ProMax Wireless Headphones",
    "price": "149.99",
    "currency": "USD",
    "brand": "ProMax",
    "availability": "InStock",
    "rating": "4.5",
    "review_count": "328",
    "sku": "WH-PM-2024",
    "category": "Electronics > Audio > Headphones",
    "description": "Premium noise-cancelling wireless headphones...",
    "image": "https://shop.example.com/images/headphones.jpg",
    "url": "https://shop.example.com/headphones",
    "source_url": "https://shop.example.com/headphones",
    "extraction_method": "json-ld"
  }
]

Batch processing from a URL file:

productx --file product-urls.txt --format csv -o products.csv

Pipe URLs from another tool:

cat urls.txt | productx --stdin --format jsonl

CLI Reference

productx [OPTIONS] URL...

Input

Flag               Description
URL...             Product page URLs (positional)
--file, -F FILE    Read URLs from a file (one per line)
--stdin            Read URLs from stdin

Output

Flag                Default   Description
--format, -f        plain     Output format: plain, csv, json, jsonl
--output, -o FILE   stdout    Write to a file instead of stdout
--detail, -d        off       Include all extracted fields

Fields Extracted

Field          Source        Description
name           All methods   Product name
price          All methods   Price value
currency       All methods   Currency code (USD, EUR, etc.)
brand          JSON-LD, OG   Brand name
availability   JSON-LD, OG   InStock, OutOfStock, etc.
rating         JSON-LD       Average rating
review_count   JSON-LD       Number of reviews
sku            JSON-LD       Stock keeping unit
gtin           JSON-LD       Global Trade Item Number
category       JSON-LD, OG   Product category
description    All methods   Product description
image          JSON-LD, OG   Product image URL
url            All methods   Canonical product URL
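In raw Schema.org markup, several of these fields live inside nested objects (offers, brand, aggregateRating). Here is a simplified sketch of how a Product object might be flattened into the field names above; the helper and its exact mapping are illustrative, not the package's internals:

```python
def flatten_product(obj):
    """Map a nested Schema.org Product dict onto flat field names."""
    offer = obj.get("offers") or {}
    if isinstance(offer, list):          # some sites list multiple offers
        offer = offer[0] if offer else {}
    brand = obj.get("brand")
    if isinstance(brand, dict):          # brand may be a nested Brand object
        brand = brand.get("name")
    rating = obj.get("aggregateRating") or {}
    return {
        "name": obj.get("name"),
        "price": offer.get("price"),
        "currency": offer.get("priceCurrency"),
        "brand": brand,
        # availability is often a URL like https://schema.org/InStock;
        # keep only the last path segment.
        "availability": (offer.get("availability") or "").rsplit("/", 1)[-1],
        "rating": rating.get("ratingValue"),
        "review_count": rating.get("reviewCount"),
        "sku": obj.get("sku"),
        "gtin": obj.get("gtin13") or obj.get("gtin"),
    }

sample = {
    "@type": "Product", "name": "Wireless Headphones", "sku": "WH-1",
    "brand": {"@type": "Brand", "name": "ProMax"},
    "offers": {"price": "149.99", "priceCurrency": "USD",
               "availability": "https://schema.org/InStock"},
    "aggregateRating": {"ratingValue": "4.5", "reviewCount": "328"},
}
print(flatten_product(sample)["availability"])  # InStock
```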

Python API

from product_extractor import ProductExtractor

ex = ProductExtractor()

# From URL
products = ex.from_url("https://shop.example.com/product")
for p in products:
    print(f"{p.name}: {p.currency} {p.price} ({p.brand})")

# From HTML string
products = ex.from_html(html_content, source_url="https://example.com")

# Batch URLs
products = ex.from_urls([
    "https://shop1.com/product-a",
    "https://shop2.com/product-b",
])

# Access all fields
p = products[0]
print(p.name, p.price, p.currency, p.brand, p.sku)
print(p.rating, p.review_count, p.availability)
print(p.to_dict())  # as dictionary

Extraction Priority

product-harvest tries three methods in order and returns the first successful result:

  1. JSON-LD (most reliable) — <script type="application/ld+json"> with @type: Product
  2. Open Graph — <meta property="og:*"> and <meta property="product:*">
  3. Meta tags + price regex (fallback) — page title + price pattern matching
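The fallback chain above is a simple first-hit-wins loop. The sketch below is a hypothetical illustration of that pattern; the function names and toy extractors are invented for this example and are not the package's internals:

```python
def first_successful(html, extractors):
    """Run extractors in priority order; return (method_name, results)
    for the first one that yields anything, else ("none", [])."""
    for name, extract in extractors:
        results = extract(html)
        if results:
            return name, results
    return "none", []

# Toy stand-ins for the three real extraction methods.
def jsonld(html):
    return ["jsonld-product"] if "ld+json" in html else []

def opengraph(html):
    return ["og-product"] if 'property="og:' in html else []

def meta_fallback(html):
    return ["meta-product"] if "<title>" in html else []

CHAIN = [("json-ld", jsonld), ("open-graph", opengraph), ("meta", meta_fallback)]

# A page with only Open Graph tags falls through to the second method.
method, items = first_successful('<meta property="og:title" content="X">', CHAIN)
print(method)  # open-graph
```

This is why the `extraction_method` field appears in the JSON output: it records which rung of the ladder produced the result.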

Comparison

Feature              product-harvest   Crawl4AI   Scrapling   oxylabs/amazon-scraper
No AI/LLM needed     Yes               No         Yes         Yes
No API key needed    Yes               Depends    Yes         No
Works on any site    Yes*              Yes        Yes         Amazon only
Zero config          Yes               No         No          No
Structured output    Yes               Yes        No          Yes
pip install          Yes               Yes        Yes         No
Dependencies         2                 Many       Many        N/A
Price                Free              Free       Free        Paid proxy

*Requires the site to have Schema.org or Open Graph product data.

Also by Thunderbit

  • email-harvest — Extract email addresses from text, files, and web pages
  • phone-harvest — Extract and identify phone numbers from text, files, and web pages
  • image-harvest — Discover and batch download images from web pages

License

MIT


Built by Thunderbit — AI-powered web scraper and data extraction tools.
