Skip to main content

Scrape, cluster, and analyze product feedback from public channels

Project description

Sift

Scrape user feedback from public product channels, cluster complaints with ML, and generate AI-powered product insights — all from your terminal.

What It Does

  • Scrapes G2, app stores, YouTube, Hacker News, GitHub issues, Product Hunt, forums, changelogs, and public exports for real user feedback about any product
  • Anonymizes reviews at ingestion — no usernames stored, only a clickable link to the source
  • Deduplicates feedback across sources using hash-based IDs so you never count the same review twice
  • Clusters complaints and pain points using sentence embeddings + UMAP + HDBSCAN
  • Analyzes each cluster with an LLM to name themes, summarize issues, and rate severity
  • Compares multiple products to surface shared vs. unique pain points

How It Works

G2 / App Stores / YouTube / HN / GitHub / Forums / Exports
          │
          └──> Feedback Items ──> Dedup Filter ──> Sentence Embeddings ──> UMAP + HDBSCAN
                       (anonymized)                                      (all-MiniLM-L12-v2)

                                          ┌── Clustered Themes ──> LLM Analysis ──> Report (MD + JSON)
              Multi-Product Comparison <──┘

Install

Prerequisites: Python 3.11+ and an OpenAI-compatible LLM endpoint.

pip install getsift

Quick Start

# 1. Install
pip install getsift

# 2. Set up (creates config.yaml and .env with your API keys)
sift init

# 3. Run — launches the interactive Rich frontend
sift

That's it. sift opens an interactive terminal UI where you pick products and sources. No CLI arguments needed.

CLI Commands

# Interactive mode (default — just run sift)
sift

# First-run setup wizard (creates config.yaml + .env)
sift init

# Scripted/automation use:
sift analyze "Notion" "Obsidian" --source g2
sift scrape "Slack" --source g2 --source app_store

# Debug logging
sift analyze "Notion" --verbose

Configuration

Edit config.yaml to tune the pipeline:

Section Key Options
sources default_sources, disabled_sources
reddit subreddits, max_posts, max_comments_per_post
g2 request_delay, max_pages, user_agent_rotation
app_store / play_store product-to-app/package mappings, locale, item limits
youtube video_ids, max_comments_per_video
github_issues product-to-repo mappings, item limits
support_forums / changelogs URL templates or product URL mappings
discord_exports / linkedin_comments public/export JSON paths or URLs
clustering embedding_model, umap_n_neighbors, hdbscan_min_cluster_size
llm model, temperature, max_tokens
logging level (INFO or DEBUG), format

LLM endpoint and API keys are set via .env:

LLM_API_KEY=your-key
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini
YOUTUBE_API_KEY=optional-youtube-key
GITHUB_TOKEN=optional-github-token

Any OpenAI-compatible API works — OpenAI, Anthropic (via proxy), Ollama, OpenCode, etc.

Data Sources

Source Method Requirements
G2 Web scraping (BeautifulSoup) None — includes User-Agent rotation and polite request delays
App Store Apple customer reviews RSS Product app IDs in config.yaml
Play Store Public app details/reviews page Product package names in config.yaml
YouTube comments YouTube Data API YOUTUBE_API_KEY and product video IDs
Hacker News Algolia HN Search API None
GitHub issues GitHub Search API Product repos; optional GITHUB_TOKEN
Product Hunt comments Public product pages Optional product slugs
Support forums Configured public search URLs Forum URL templates
Changelogs Configured public changelog URLs Product URL mappings
Discord exports Public/exported JSON JSON file paths or URLs
LinkedIn comments Public/exported JSON JSON file paths or URLs
Reddit PRAW (official API) Currently disabled in sources.disabled_sources until API approval

To reactivate Reddit later, remove reddit from sources.disabled_sources and add it to sources.default_sources if you want it in default runs.

Privacy: Usernames and PII are stripped at ingestion. Only the review text and a clickable source link are retained.

Output

Reports are saved to output/ in two formats:

  • Markdown — human-readable with severity badges, representative quotes, and comparison tables
  • JSON — machine-readable structured data for dashboards or downstream tools

Each report includes:

  • Overall product insights (LLM-generated)
  • Top pain points ranked by severity
  • Per-cluster summaries with representative user quotes
  • For multi-product runs: shared vs. unique pain points + competitive insights

Architecture

sift/
├── scrapers/          # Source adapters for public feedback channels
├── pipeline/          # Embeddings, clustering, LLM analysis, comparison, rate limiting, deduplication
├── models/            # Data classes (FeedbackItem, ClusterResult, ProductReport)
├── ui/                # Rich terminal frontend, setup wizard, interactive menus
├── config.py          # YAML + env var configuration loader
└── cli.py             # Click CLI (analyze, scrape, init commands)
tests/                 # Tests covering all modules

Running Tests

python -m pytest tests/ -v

Roadmap

  • Reactivate Reddit source after API approval
  • Web app with dashboard UI
  • Continuous monitoring mode (track sentiment over time)
  • Additional review sites (Trustpilot, Capterra)
  • Slack/email alerting for new complaint spikes

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

getsift-0.1.0.tar.gz (61.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

getsift-0.1.0-py3-none-any.whl (63.3 kB view details)

Uploaded Python 3

File details

Details for the file getsift-0.1.0.tar.gz.

File metadata

  • Download URL: getsift-0.1.0.tar.gz
  • Upload date:
  • Size: 61.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for getsift-0.1.0.tar.gz
Algorithm Hash digest
SHA256 0925cd211bca2b88dbb14e614a7522cf6121d2e146692506aa161e1a51473087
MD5 e54d177d9c3b9c27e4970721dd959013
BLAKE2b-256 4031de5e43a8891f207177c2ec35ee6acc755a9cfa56d115061c8c96c9073404

See more details on using hashes here.

Provenance

The following attestation bundles were made for getsift-0.1.0.tar.gz:

Publisher: package.yml on Gitter09/sift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file getsift-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: getsift-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 63.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for getsift-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 41b25b0b4ff0639df43e97cdc75cbb6d9e00f97fc2d5c4d5f96f04bcf70b2652
MD5 cb936ce9d947ad38705e381bee252928
BLAKE2b-256 7e7f6a5f12baa85bbe20fdecf2137be2c8879380b3faa3e3dba3113e9fd5110a

See more details on using hashes here.

Provenance

The following attestation bundles were made for getsift-0.1.0-py3-none-any.whl:

Publisher: package.yml on Gitter09/sift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page