Simple CLI to scrape product data, images, and collections from Shopify stores

Shopify Spy

Shopify Spy is a command-line tool for scraping product and collection data from any Shopify store. Built on Scrapy, it extracts detailed data including high-value information like vendor names and inventory levels.

To find Shopify stores to scrape, try searching Google with site:myshopify.com.

Installation

Both pipx and uv install CLI tools into isolated environments, so they won't conflict with other Python projects:

# pipx
pipx install shopify-spy

# uv
uv tool install shopify-spy

Or install with pip if you want it in a specific virtual environment:

pip install shopify-spy

Requires Python 3.10+.

Quick Start

# Scrape a single store
shopify-spy scrape https://www.example.com

# Scrape multiple stores
shopify-spy scrape https://store1.com https://store2.com https://store3.com

# Download product images
shopify-spy scrape https://www.example.com --images

# Include collections
shopify-spy scrape https://www.example.com --collections

# Scrape multiple stores from a file
shopify-spy scrape --url-file stores.txt

# Specify output directory
shopify-spy scrape https://www.example.com --output ./my-data

Results are saved as JSONL in the output directory (default: ./output). Use --format to choose JSON, CSV, or XML instead.

Commands

scrape

Scrape products and collections from Shopify stores.

shopify-spy scrape [URL...] [OPTIONS]

Arguments:

  • URL... - One or more Shopify store URLs (optional if using --url-file)

Options:

  • --url-file, -f FILE - File containing URLs (one per line)
  • --products / --no-products - Scrape products (default: yes)
  • --collections / --no-collections - Scrape collections (default: no)
  • --images / --no-images - Download images (default: no)
  • --output, -o PATH - Output directory (default: ./output)
  • --format, -F FORMAT - Output format: json, jsonl, csv, xml (default: jsonl)
  • --config, -c FILE - Path to YAML config file
  • --concurrent INT - Concurrent requests per domain (default: 16)
  • --throttle / --no-throttle - Auto-throttle requests (default: yes)
  • --user-agent, -A TEXT - Custom User-Agent header
  • --verbose, -v - Show debug output
  • --quiet, -q - Show only warnings and errors

init

Create a default configuration file.

shopify-spy init [PATH]

Arguments:

  • PATH - Where to create the config file (default: ./shopify-spy.yaml)

Options:

  • --force, -f - Overwrite existing file

Configuration

Shopify Spy can be configured via YAML file. Create one with shopify-spy init:

# shopify-spy.yaml
scrape:
  products: true      # Scrape product data
  collections: false  # Scrape collection data
  images: false       # Download product images

output:
  dir: ./output       # Output directory for results
  format: jsonl       # Output format: json, jsonl, csv, xml
  images_subdir: images  # Subdirectory for downloaded images

network:
  concurrent_requests: 16  # Concurrent requests per domain
  timeout: 180             # Download timeout (seconds)
  retries: 2               # Retry failed requests
  # user_agent: MyBot/1.0 (+https://example.com)  # Custom user agent
  respect_robots_txt: true

throttle:
  enabled: true            # Auto-throttle based on server response
  start_delay: 1           # Initial download delay (seconds)
  max_delay: 60            # Maximum download delay (seconds)
  target_concurrency: 1.0  # Target concurrent requests (higher = faster)

Config file search order:

  1. Path specified with --config
  2. ./shopify-spy.yaml
  3. ~/.config/shopify-spy/config.yaml

CLI options override config file settings.

Output

Results are saved in the output directory (JSONL by default, configurable via --format):

output/
  shopify_spider_2024-01-15T10-30-00.jsonl
  images/
    full/
      <image files>

Each line in the JSONL file contains a product or collection with full metadata from Shopify's JSON API.

Image Metadata

When using --images, each item includes a scraped_images field with download info:

{
  "image_urls": ["https://cdn.shopify.com/.../product.jpg"],
  "scraped_images": [
    {
      "url": "https://cdn.shopify.com/.../product.jpg",
      "path": "full/abc123def.jpg",
      "checksum": "d41d8cd98f00b204e9800998ecf8427e",
      "status": "downloaded"
    }
  ],
  "product": { ... }
}

The path is relative to the images directory (output/images/ by default).
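
To locate the downloaded files on disk, join each path with the images directory. The following is a minimal sketch based on the fields in the example item above; image_paths is a hypothetical helper, and the default images_dir assumes the default output settings.

```python
from pathlib import Path


def image_paths(item: dict, images_dir: str = "output/images") -> list[Path]:
    """Return on-disk paths for the images recorded on one scraped item."""
    root = Path(images_dir)
    return [
        root / entry["path"]
        for entry in item.get("scraped_images", [])
        if entry.get("path")  # skip entries with no stored file
    ]


if __name__ == "__main__":
    sample = {
        "scraped_images": [
            {"url": "https://cdn.shopify.com/.../product.jpg", "path": "full/abc123def.jpg"}
        ]
    }
    for path in image_paths(sample):
        print(path)
```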

Parsing Output

With jq:

# Extract product titles
cat output/*.jsonl | jq '.product.title'

# Get prices
cat output/*.jsonl | jq '{title: .product.title, price: .product.variants[0].price}'

With Python:

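import json

# Substitute the file name produced by your own run
with open("output/shopify_spider_2024-01-15T10-30-00.jsonl", encoding="utf-8") as f:
    for line in f:
        item = json.loads(line)
        # Skip items without a "product" key (for example, collections)
        if "product" in item:
            print(item["product"]["title"])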

With pandas:

import pandas as pd

df = pd.read_json("output/shopify_spider_2024-01-15T10-30-00.jsonl", lines=True)
products = pd.json_normalize(df["product"])

With polars:

import polars as pl

df = pl.read_ndjson("output/shopify_spider_2024-01-15T10-30-00.jsonl")
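
The examples above read only the first variant's price. As a standard-library sketch, the function below yields one row per variant instead. It assumes the product.title and product.variants[].price fields used in the jq example, and that items without a "product" key (such as collections) can be skipped; variant_rows is a hypothetical name.

```python
import json
from typing import Iterable, Iterator


def variant_rows(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one {"title", "price"} row per product variant in JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = json.loads(line)
        product = item.get("product")
        if product is None:
            continue  # assumed: not a product item, e.g. a collection
        for variant in product.get("variants", []):
            yield {"title": product.get("title"), "price": variant.get("price")}


if __name__ == "__main__":
    demo = ['{"product": {"title": "Mug", "variants": [{"price": "9.00"}]}}']
    for row in variant_rows(demo):
        print(row)
```

An open file object can be passed directly as lines, since iterating over a file yields one line at a time.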

Limitations

Standard Shopify stores only. This tool works with Shopify stores that use Liquid themes, which covers nearly all Shopify sites. Headless stores built on Hydrogen or other custom storefronts are not supported, because they use the Storefront GraphQL API instead of the JSON endpoints this tool relies on.

Rate limiting. Auto-throttling is enabled by default, but scraping very large stores may still result in temporary bans. You can adjust the throttle settings in the config file, or disable throttling for faster scraping at a higher risk of a ban:

# Disable throttling (faster but riskier)
shopify-spy scrape https://example.com --no-throttle
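
To get a feel for how the throttle settings interact, here is a simplified sketch of the delay rule described in Scrapy's AutoThrottle documentation. It assumes that Shopify Spy's start_delay, max_delay, and target_concurrency settings map onto that extension, which this README does not confirm; next_delay is a hypothetical function, not Scrapy's or Shopify Spy's code.

```python
def next_delay(
    current_delay: float,
    latency: float,
    *,
    target_concurrency: float = 1.0,
    min_delay: float = 0.0,
    max_delay: float = 60.0,
    status: int = 200,
) -> float:
    """Simplified AutoThrottle-style update of the download delay (seconds)."""
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target = latency / target_concurrency
    # Move halfway from the current delay towards the target.
    new_delay = (current_delay + target) / 2.0
    # Keep the result within the configured bounds.
    new_delay = min(max(new_delay, min_delay), max_delay)
    # Non-200 responses are not allowed to lower the delay.
    if status != 200 and new_delay < current_delay:
        return current_delay
    return new_delay


if __name__ == "__main__":
    delay = 1.0  # start_delay from the config above
    for latency in (0.5, 4.0, 4.0):
        delay = next_delay(delay, latency)
        print(delay)
```

Under this rule, a higher target_concurrency lowers the target delay for a given latency, which matches the "higher = faster" note in the config file.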

Advanced Usage

For advanced Scrapy configuration or custom pipelines, you can use Shopify Spy as a library:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from shopify_spy.spiders.shopify import ShopifySpider

process = CrawlerProcess(get_project_settings())
process.crawl(ShopifySpider, url="https://example.com", products=True)
process.start()

Feedback

Found a bug or have a suggestion? Open an issue.

License

MIT

Credits

Icon by Bartama Graphic.
