Scrape, cluster, and analyze product feedback from public channels
Project description
Sift
Scrape user feedback from public product channels, cluster complaints with ML, and generate AI-powered product insights — all from your terminal.
What It Does
- Scrapes G2, app stores, YouTube, Hacker News, GitHub issues, Product Hunt, forums, changelogs, and public exports for real user feedback about any product
- Anonymizes reviews at ingestion — no usernames stored, only a clickable link to the source
- Deduplicates feedback across sources using hash-based IDs so you never count the same review twice
- Clusters complaints and pain points using sentence embeddings + UMAP + HDBSCAN
- Analyzes each cluster with an LLM to name themes, summarize issues, and rate severity
- Compares multiple products to surface shared vs. unique pain points
How It Works
G2 / App Stores / YouTube / HN / GitHub / Forums / Exports
│
└──> Feedback Items ──> Dedup Filter ──> Sentence Embeddings ──> UMAP + HDBSCAN
(anonymized) (all-MiniLM-L12-v2)
┌── Clustered Themes ──> LLM Analysis ──> Report (MD + JSON)
Multi-Product Comparison <──┘
Install
Prerequisites: Python 3.11+ and an OpenAI-compatible LLM endpoint.
pip install getsift
Quick Start
# 1. Install
pip install getsift
# 2. Set up (creates config.yaml and .env with your API keys)
sift init
# 3. Run — launches the interactive Rich frontend
sift
That's it. sift opens an interactive terminal UI where you pick products and sources. No CLI arguments needed.
CLI Commands
# Interactive mode (default — just run sift)
sift
# First-run setup wizard (creates config.yaml + .env)
sift init
# Scripted/automation use:
sift analyze "Notion" "Obsidian" --source g2
sift scrape "Slack" --source g2 --source app_store
# Debug logging
sift analyze "Notion" --verbose
Configuration
Edit config.yaml to tune the pipeline:
| Section | Key Options |
|---|---|
sources |
default_sources, disabled_sources |
reddit |
subreddits, max_posts, max_comments_per_post |
g2 |
request_delay, max_pages, user_agent_rotation |
app_store / play_store |
product-to-app/package mappings, locale, item limits |
youtube |
video_ids, max_comments_per_video |
github_issues |
product-to-repo mappings, item limits |
support_forums / changelogs |
URL templates or product URL mappings |
discord_exports / linkedin_comments |
public/export JSON paths or URLs |
clustering |
embedding_model, umap_n_neighbors, hdbscan_min_cluster_size |
llm |
model, temperature, max_tokens |
logging |
level (INFO or DEBUG), format |
LLM endpoint and API keys are set via .env:
LLM_API_KEY=your-key
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini
YOUTUBE_API_KEY=optional-youtube-key
GITHUB_TOKEN=optional-github-token
Any OpenAI-compatible API works — OpenAI, Anthropic (via proxy), Ollama, OpenCode, etc.
Data Sources
| Source | Method | Requirements |
|---|---|---|
| G2 | Web scraping (BeautifulSoup) | None — includes User-Agent rotation and polite request delays |
| App Store | Apple customer reviews RSS | Product app IDs in config.yaml |
| Play Store | Public app details/reviews page | Product package names in config.yaml |
| YouTube comments | YouTube Data API | YOUTUBE_API_KEY and product video IDs |
| Hacker News | Algolia HN Search API | None |
| GitHub issues | GitHub Search API | Product repos; optional GITHUB_TOKEN |
| Product Hunt comments | Public product pages | Optional product slugs |
| Support forums | Configured public search URLs | Forum URL templates |
| Changelogs | Configured public changelog URLs | Product URL mappings |
| Discord exports | Public/exported JSON | JSON file paths or URLs |
| LinkedIn comments | Public/exported JSON | JSON file paths or URLs |
| PRAW (official API) | Currently disabled in sources.disabled_sources until API approval |
To reactivate Reddit later, remove
sources.disabled_sourcesand add it tosources.default_sourcesif you want it in default runs.Privacy: Usernames and PII are stripped at ingestion. Only the review text and a clickable source link are retained.
Output
Reports are saved to output/ in two formats:
- Markdown — human-readable with severity badges, representative quotes, and comparison tables
- JSON — machine-readable structured data for dashboards or downstream tools
Each report includes:
- Overall product insights (LLM-generated)
- Top pain points ranked by severity
- Per-cluster summaries with representative user quotes
- For multi-product runs: shared vs. unique pain points + competitive insights
Architecture
sift/
├── scrapers/ # Source adapters for public feedback channels
├── pipeline/ # Embeddings, clustering, LLM analysis, comparison, rate limiting, deduplication
├── models/ # Data classes (FeedbackItem, ClusterResult, ProductReport)
├── ui/ # Rich terminal frontend, setup wizard, interactive menus
├── config.py # YAML + env var configuration loader
└── cli.py # Click CLI (analyze, scrape, init commands)
tests/ # Tests covering all modules
Running Tests
python -m pytest tests/ -v
Roadmap
- Reactivate Reddit source after API approval
- Web app with dashboard UI
- Continuous monitoring mode (track sentiment over time)
- Additional review sites (Trustpilot, Capterra)
- Slack/email alerting for new complaint spikes
License
MIT
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file getsift-0.1.0.tar.gz.
File metadata
- Download URL: getsift-0.1.0.tar.gz
- Upload date:
- Size: 61.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0925cd211bca2b88dbb14e614a7522cf6121d2e146692506aa161e1a51473087
|
|
| MD5 |
e54d177d9c3b9c27e4970721dd959013
|
|
| BLAKE2b-256 |
4031de5e43a8891f207177c2ec35ee6acc755a9cfa56d115061c8c96c9073404
|
Provenance
The following attestation bundles were made for getsift-0.1.0.tar.gz:
Publisher:
package.yml on Gitter09/sift
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
getsift-0.1.0.tar.gz -
Subject digest:
0925cd211bca2b88dbb14e614a7522cf6121d2e146692506aa161e1a51473087 - Sigstore transparency entry: 1646143514
- Sigstore integration time:
-
Permalink:
Gitter09/sift@7fe6d7d5e0aee517c168446567db74af06ce6aca -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/Gitter09
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
package.yml@7fe6d7d5e0aee517c168446567db74af06ce6aca -
Trigger Event:
workflow_dispatch
-
Statement type:
File details
Details for the file getsift-0.1.0-py3-none-any.whl.
File metadata
- Download URL: getsift-0.1.0-py3-none-any.whl
- Upload date:
- Size: 63.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
41b25b0b4ff0639df43e97cdc75cbb6d9e00f97fc2d5c4d5f96f04bcf70b2652
|
|
| MD5 |
cb936ce9d947ad38705e381bee252928
|
|
| BLAKE2b-256 |
7e7f6a5f12baa85bbe20fdecf2137be2c8879380b3faa3e3dba3113e9fd5110a
|
Provenance
The following attestation bundles were made for getsift-0.1.0-py3-none-any.whl:
Publisher:
package.yml on Gitter09/sift
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
getsift-0.1.0-py3-none-any.whl -
Subject digest:
41b25b0b4ff0639df43e97cdc75cbb6d9e00f97fc2d5c4d5f96f04bcf70b2652 - Sigstore transparency entry: 1646143608
- Sigstore integration time:
-
Permalink:
Gitter09/sift@7fe6d7d5e0aee517c168446567db74af06ce6aca -
Branch / Tag:
refs/tags/v0.1.0 - Owner: https://github.com/Gitter09
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
package.yml@7fe6d7d5e0aee517c168446567db74af06ce6aca -
Trigger Event:
workflow_dispatch
-
Statement type: