Skip to main content

Extract structured product data from any web page using Schema.org JSON-LD — no AI required.

Project description

product-harvest

Extract structured product data from any web page using Schema.org — no AI required.

Features

  • Zero config — reads the structured data sites already embed for Google
  • Three extraction methods — JSON-LD (preferred), Open Graph, and meta tag fallback
  • Rich product data — name, price, currency, brand, SKU, GTIN, availability, rating, reviews, images, categories
  • Works with any ecommerce site — Shopify, WooCommerce, BigCommerce, Amazon, and millions of sites using Schema.org
  • Batch processing — extract from multiple URLs in one command
  • Multiple output formats — plain text, CSV, JSON, JSONL
  • URL file input — read URLs from a file or stdin pipe
  • No AI/LLM required — pure parsing of existing structured data
  • Minimal dependencies — only requests + beautifulsoup4

How It Works

Most ecommerce sites embed Schema.org structured data in their HTML to help Google understand their products. product-harvest reads this existing data — no scraping rules, no CSS selectors, no AI needed.

<!-- This is what sites already have in their HTML for Google: -->
<script type="application/ld+json">
{
  "@type": "Product",
  "name": "Wireless Headphones",
  "offers": { "price": "149.99", "priceCurrency": "USD" }
}
</script>

product-harvest just finds and parses it.

Installation

pip install product-harvest

Requires Python 3.8+.

Quick Start

Extract product data from a URL:

productx https://shop.example.com/product

Output:

ProMax Wireless Headphones    USD 149.99

With full details:

productx https://shop.example.com/product --detail --format json
[
  {
    "name": "ProMax Wireless Headphones",
    "price": "149.99",
    "currency": "USD",
    "brand": "ProMax",
    "availability": "InStock",
    "rating": "4.5",
    "review_count": "328",
    "sku": "WH-PM-2024",
    "category": "Electronics > Audio > Headphones",
    "description": "Premium noise-cancelling wireless headphones...",
    "image": "https://shop.example.com/images/headphones.jpg",
    "url": "https://shop.example.com/headphones",
    "source_url": "https://shop.example.com/headphones",
    "extraction_method": "json-ld"
  }
]

Batch processing from a URL file:

productx --file product-urls.txt --format csv -o products.csv

Pipe URLs from another tool:

cat urls.txt | productx --stdin --format jsonl

CLI Reference

productx [OPTIONS] URL...

Input

Flag Description
URL... Product page URLs (positional)
--file, -F FILE Read URLs from file (one per line)
--stdin Read URLs from stdin

Output

Flag Default Description
--format, -f plain Output: plain, csv, json, jsonl
--output, -o FILE stdout Write to file
--detail, -d off Include all fields

Fields Extracted

Field Source Description
name All methods Product name
price All methods Price value
currency All methods Currency code (USD, EUR, etc.)
brand JSON-LD, OG Brand name
availability JSON-LD, OG InStock, OutOfStock, etc.
rating JSON-LD Average rating
review_count JSON-LD Number of reviews
sku JSON-LD Stock Keeping Unit
gtin JSON-LD Global Trade Item Number
category JSON-LD, OG Product category
description All methods Product description
image JSON-LD, OG Product image URL
url All methods Canonical product URL

Python API

from product_extractor import ProductExtractor

ex = ProductExtractor()

# From URL
products = ex.from_url("https://shop.example.com/product")
for p in products:
    print(f"{p.name}: {p.currency} {p.price} ({p.brand})")

# From HTML string
products = ex.from_html(html_content, source_url="https://example.com")

# Batch URLs
products = ex.from_urls([
    "https://shop1.com/product-a",
    "https://shop2.com/product-b",
])

# Access all fields
p = products[0]
print(p.name, p.price, p.currency, p.brand, p.sku)
print(p.rating, p.review_count, p.availability)
print(p.to_dict())  # as dictionary

Extraction Priority

product-harvest tries three methods in order and returns the first successful result:

  1. JSON-LD (most reliable) — <script type="application/ld+json"> with @type: Product
  2. Open Graph<meta property="og:*"> and <meta property="product:*">
  3. Meta tags + price regex (fallback) — page title + price pattern matching

Comparison

Feature product-harvest Crawl4AI Scrapling oxylabs/amazon-scraper
No AI/LLM needed Yes No Yes Yes
No API key needed Yes Depends Yes No
Works on any site Yes* Yes Yes Amazon only
Zero config Yes No No No
Structured output Yes Yes No Yes
pip install Yes Yes Yes No
Dependencies 2 Many Many N/A
Price Free Free Free Paid proxy

*Requires the site to have Schema.org or Open Graph product data.

Also by Thunderbit

  • email-harvest — Extract email addresses from text, files, and web pages
  • phone-harvest — Extract and identify phone numbers from text, files, and web pages
  • image-harvest — Discover and batch download images from web pages

License

MIT


Built by Thunderbit — AI-powered web scraper and data extraction tools.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

product_harvest-0.1.0.tar.gz (13.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

product_harvest-0.1.0-py3-none-any.whl (11.1 kB view details)

Uploaded Python 3

File details

Details for the file product_harvest-0.1.0.tar.gz.

File metadata

  • Download URL: product_harvest-0.1.0.tar.gz
  • Upload date:
  • Size: 13.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.2

File hashes

Hashes for product_harvest-0.1.0.tar.gz
Algorithm Hash digest
SHA256 f74308f3a22c409737dd6d0af97a297f86b5d0a998d932e71c2cb2041c8eabee
MD5 858371afe3653f371837b2787083f6a9
BLAKE2b-256 8f773c278b1bd6c51ef0fb653f52d7b81c272b03da2c85bf9af34fe8e242aa3f

See more details on using hashes here.

File details

Details for the file product_harvest-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for product_harvest-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 6285ebffcd2efc5dde03f81fc2136801eed152fe420bbda04eb04d8441c72857
MD5 ce14bc1f24cac9142c6ed6e0336b9647
BLAKE2b-256 27e081636cf5fcd550217c01c39013d8d5cc2627192209566d51fa4387c6d451

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page