product-harvest

Extract structured product data from any web page using Schema.org JSON-LD — no AI required.
Features
- Zero config — reads the structured data sites already embed for Google
- Three extraction methods — JSON-LD (preferred), Open Graph, and meta tag fallback
- Rich product data — name, price, currency, brand, SKU, GTIN, availability, rating, reviews, images, categories
- Works with any ecommerce site — Shopify, WooCommerce, BigCommerce, Amazon, and millions of sites using Schema.org
- Batch processing — extract from multiple URLs in one command
- Multiple output formats — plain text, CSV, JSON, JSONL
- URL file input — read URLs from a file or stdin pipe
- No AI/LLM required — pure parsing of existing structured data
- Minimal dependencies — only `requests` and `beautifulsoup4`
How It Works
Most ecommerce sites embed Schema.org structured data in their HTML to help Google understand their products. product-harvest reads this existing data — no scraping rules, no CSS selectors, no AI needed.
```html
<!-- This is what sites already have in their HTML for Google: -->
<script type="application/ld+json">
{
  "@type": "Product",
  "name": "Wireless Headphones",
  "offers": { "price": "149.99", "priceCurrency": "USD" }
}
</script>
```
product-harvest just finds and parses it.
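The core of that parsing can be sketched in a few lines. The standalone illustration below uses only the standard library's `html.parser` (the package itself depends on `beautifulsoup4`); it is an approximation of the idea, not the library's actual implementation:

```python
import json
from html.parser import HTMLParser

class JSONLDCollector(HTMLParser):
    """Collect the text content of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

def extract_products(html):
    """Return every Schema.org Product object found in JSON-LD blocks."""
    collector = JSONLDCollector()
    collector.feed(html)
    products = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON rather than failing the whole page
        # A block may hold a single object or a list of objects
        items = data if isinstance(data, list) else [data]
        products += [i for i in items
                     if isinstance(i, dict) and i.get("@type") == "Product"]
    return products

html = '''<script type="application/ld+json">
{"@type": "Product", "name": "Wireless Headphones",
 "offers": {"price": "149.99", "priceCurrency": "USD"}}
</script>'''
print(extract_products(html)[0]["name"])  # Wireless Headphones
```

Real pages add wrinkles this sketch ignores (an `@graph` array, nested `Offer` lists), which is where a dedicated library earns its keep.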
Installation
pip install product-harvest
Requires Python 3.8+.
Quick Start
Extract product data from a URL:
productx https://shop.example.com/product
Output:
ProMax Wireless Headphones USD 149.99
With full details:
productx https://shop.example.com/product --detail --format json
```json
[
  {
    "name": "ProMax Wireless Headphones",
    "price": "149.99",
    "currency": "USD",
    "brand": "ProMax",
    "availability": "InStock",
    "rating": "4.5",
    "review_count": "328",
    "sku": "WH-PM-2024",
    "category": "Electronics > Audio > Headphones",
    "description": "Premium noise-cancelling wireless headphones...",
    "image": "https://shop.example.com/images/headphones.jpg",
    "url": "https://shop.example.com/headphones",
    "source_url": "https://shop.example.com/headphones",
    "extraction_method": "json-ld"
  }
]
```
Batch processing from a URL file:
productx --file product-urls.txt --format csv -o products.csv
Pipe URLs from another tool:
cat urls.txt | productx --stdin --format jsonl
CLI Reference
productx [OPTIONS] URL...
Input
| Flag | Description |
|---|---|
| `URL...` | Product page URLs (positional) |
| `--file, -F FILE` | Read URLs from file (one per line) |
| `--stdin` | Read URLs from stdin |
Output
| Flag | Default | Description |
|---|---|---|
| `--format, -f` | `plain` | Output format: `plain`, `csv`, `json`, `jsonl` |
| `--output, -o FILE` | stdout | Write to file |
| `--detail, -d` | off | Include all fields |
Fields Extracted
| Field | Source | Description |
|---|---|---|
| `name` | All methods | Product name |
| `price` | All methods | Price value |
| `currency` | All methods | Currency code (USD, EUR, etc.) |
| `brand` | JSON-LD, OG | Brand name |
| `availability` | JSON-LD, OG | InStock, OutOfStock, etc. |
| `rating` | JSON-LD | Average rating |
| `review_count` | JSON-LD | Number of reviews |
| `sku` | JSON-LD | Stock Keeping Unit |
| `gtin` | JSON-LD | Global Trade Item Number |
| `category` | JSON-LD, OG | Product category |
| `description` | All methods | Product description |
| `image` | JSON-LD, OG | Product image URL |
| `url` | All methods | Canonical product URL |
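Note that raw JSON-LD often encodes `availability` as a full Schema.org URL such as `https://schema.org/InStock` rather than the short form shown in the table. A minimal normalizer for that convention (an illustration of the mapping, not the library's actual code):

```python
def normalize_availability(value):
    """Reduce a Schema.org availability value to its short form.

    JSON-LD typically encodes availability as a URL such as
    "https://schema.org/InStock" (or the older "http://schema.org/InStock");
    bare values like "InStock" pass through unchanged.
    """
    if not value:
        return None
    # Strip any trailing slash, then keep the last path segment
    return value.rstrip("/").rsplit("/", 1)[-1]

print(normalize_availability("https://schema.org/InStock"))  # InStock
print(normalize_availability("OutOfStock"))                  # OutOfStock
```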
Python API
```python
from product_extractor import ProductExtractor

ex = ProductExtractor()

# From URL
products = ex.from_url("https://shop.example.com/product")
for p in products:
    print(f"{p.name}: {p.currency} {p.price} ({p.brand})")

# From HTML string
products = ex.from_html(html_content, source_url="https://example.com")

# Batch URLs
products = ex.from_urls([
    "https://shop1.com/product-a",
    "https://shop2.com/product-b",
])

# Access all fields
p = products[0]
print(p.name, p.price, p.currency, p.brand, p.sku)
print(p.rating, p.review_count, p.availability)
print(p.to_dict())  # as dictionary
```
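The `to_dict()` method pairs naturally with the standard library's `csv` module for custom exports. A sketch using a hard-coded sample record in the shape of the JSON output above (with the library installed, you would build `records` from `p.to_dict()` instead):

```python
import csv
import io

# Sample record matching the JSON output shown earlier; replace with
# [p.to_dict() for p in products] when using the library.
records = [
    {"name": "ProMax Wireless Headphones", "price": "149.99",
     "currency": "USD", "brand": "ProMax"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "price", "currency", "brand"])
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

For a plain file, swap `io.StringIO()` for `open("products.csv", "w", newline="")`.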
Extraction Priority
product-harvest tries three methods in order and returns the first successful result:

1. JSON-LD (most reliable) — `<script type="application/ld+json">` with `@type: Product`
2. Open Graph — `<meta property="og:*">` and `<meta property="product:*">`
3. Meta tags + price regex (fallback) — page title + price pattern matching
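The Open Graph fallback amounts to collecting `og:*` and `product:*` meta properties from the page head. A minimal standalone sketch using the standard library (an approximation, not the library's actual implementation):

```python
from html.parser import HTMLParser

class OGProductParser(HTMLParser):
    """Collect og:* and product:* meta properties into a dict."""
    def __init__(self):
        super().__init__()
        self.props = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        prop = a.get("property", "") or ""
        if prop.startswith(("og:", "product:")):
            self.props[prop] = a.get("content")

parser = OGProductParser()
parser.feed('<meta property="og:title" content="Wireless Headphones">'
            '<meta property="product:price:amount" content="149.99">'
            '<meta property="product:price:currency" content="USD">')
print(parser.props["og:title"], parser.props["product:price:amount"])
# Wireless Headphones 149.99
```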
Comparison
| Feature | product-harvest | Crawl4AI | Scrapling | oxylabs/amazon-scraper |
|---|---|---|---|---|
| No AI/LLM needed | Yes | No | Yes | Yes |
| No API key needed | Yes | Depends | Yes | No |
| Works on any site | Yes* | Yes | Yes | Amazon only |
| Zero config | Yes | No | No | No |
| Structured output | Yes | Yes | No | Yes |
| pip install | Yes | Yes | Yes | No |
| Dependencies | 2 | Many | Many | N/A |
| Price | Free | Free | Free | Paid proxy |
*Requires the site to have Schema.org or Open Graph product data.
Also by Thunderbit
- email-harvest — Extract email addresses from text, files, and web pages
- phone-harvest — Extract and identify phone numbers from text, files, and web pages
- image-harvest — Discover and batch download images from web pages
License
Built by Thunderbit — AI-powered web scraper and data extraction tools.