Scrape, cluster, and analyze product feedback from public channels

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

These details have not been verified by PyPI

Project description

Sift

Scrape user feedback from public product channels, cluster complaints with ML, and generate AI-powered product insights — all from your terminal.

What It Does

Scrapes G2, app stores, YouTube, Hacker News, GitHub issues, Product Hunt, forums, changelogs, and public exports for real user feedback about any product
Anonymizes reviews at ingestion — no usernames stored, only a clickable link to the source
Deduplicates feedback across sources using hash-based IDs so you never count the same review twice
Clusters complaints and pain points using sentence embeddings + UMAP + HDBSCAN
Analyzes each cluster with an LLM to name themes, summarize issues, and rate severity
Compares multiple products to surface shared vs. unique pain points

How It Works

G2 / App Stores / YouTube / HN / GitHub / Forums / Exports
          │
          └──> Feedback Items ──> Dedup Filter ──> Sentence Embeddings ──> UMAP + HDBSCAN
                       (anonymized)                                      (all-MiniLM-L12-v2)

                                          ┌── Clustered Themes ──> LLM Analysis ──> Report (MD + JSON)
              Multi-Product Comparison <──┘

Install

Prerequisites: Python 3.11+ and an OpenAI-compatible LLM endpoint.

pip install getsift

Quick Start

# 1. Install
pip install getsift

# 2. Set up (creates config.yaml and .env with your API keys)
sift init

# 3. Run — launches the interactive Rich frontend
sift

That's it. sift opens an interactive terminal UI where you pick products and sources. No CLI arguments needed.

CLI Commands

# Interactive mode (default — just run sift)
sift

# First-run setup wizard (creates config.yaml + .env)
sift init

# Scripted/automation use:
sift analyze "Notion" "Obsidian" --source g2
sift scrape "Slack" --source g2 --source app_store

# Debug logging
sift analyze "Notion" --verbose

Configuration

Edit config.yaml to tune the pipeline:

Section	Key Options
`sources`	`default_sources`, `disabled_sources`
`reddit`	`subreddits`, `max_posts`, `max_comments_per_post`
`g2`	`request_delay`, `max_pages`, `user_agent_rotation`
`app_store` / `play_store`	product-to-app/package mappings, locale, item limits
`youtube`	`video_ids`, `max_comments_per_video`
`github_issues`	product-to-repo mappings, item limits
`support_forums` / `changelogs`	URL templates or product URL mappings
`discord_exports` / `linkedin_comments`	public/export JSON paths or URLs
`clustering`	`embedding_model`, `umap_n_neighbors`, `hdbscan_min_cluster_size`
`llm`	`model`, `temperature`, `max_tokens`
`logging`	`level` (`INFO` or `DEBUG`), `format`

LLM endpoint and API keys are set via .env:

LLM_API_KEY=your-key
LLM_BASE_URL=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini
YOUTUBE_API_KEY=optional-youtube-key
GITHUB_TOKEN=optional-github-token

Any OpenAI-compatible API works — OpenAI, Anthropic (via proxy), Ollama, OpenCode, etc.

Data Sources

Source	Method	Requirements
G2	Web scraping (BeautifulSoup)	None — includes User-Agent rotation and polite request delays
App Store	Apple customer reviews RSS	Product app IDs in `config.yaml`
Play Store	Public app details/reviews page	Product package names in `config.yaml`
YouTube comments	YouTube Data API	`YOUTUBE_API_KEY` and product video IDs
Hacker News	Algolia HN Search API	None
GitHub issues	GitHub Search API	Product repos; optional `GITHUB_TOKEN`
Product Hunt comments	Public product pages	Optional product slugs
Support forums	Configured public search URLs	Forum URL templates
Changelogs	Configured public changelog URLs	Product URL mappings
Discord exports	Public/exported JSON	JSON file paths or URLs
LinkedIn comments	Public/exported JSON	JSON file paths or URLs
Reddit	PRAW (official API)	Currently disabled in `sources.disabled_sources` until API approval

To reactivate Reddit later, remove reddit from sources.disabled_sources and add it to sources.default_sources if you want it in default runs.

Privacy: Usernames and PII are stripped at ingestion. Only the review text and a clickable source link are retained.

Output

Reports are saved to output/ in two formats:

Markdown — human-readable with severity badges, representative quotes, and comparison tables
JSON — machine-readable structured data for dashboards or downstream tools

Each report includes:

Overall product insights (LLM-generated)
Top pain points ranked by severity
Per-cluster summaries with representative user quotes
For multi-product runs: shared vs. unique pain points + competitive insights

Architecture

sift/
├── scrapers/          # Source adapters for public feedback channels
├── pipeline/          # Embeddings, clustering, LLM analysis, comparison, rate limiting, deduplication
├── models/            # Data classes (FeedbackItem, ClusterResult, ProductReport)
├── ui/                # Rich terminal frontend, setup wizard, interactive menus
├── config.py          # YAML + env var configuration loader
└── cli.py             # Click CLI (analyze, scrape, init commands)
tests/                 # Tests covering all modules

Running Tests

python -m pytest tests/ -v

Roadmap

Reactivate Reddit source after API approval
Web app with dashboard UI
Continuous monitoring mode (track sentiment over time)
Additional review sites (Trustpilot, Capterra)
Slack/email alerting for new complaint spikes

License

MIT

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

hxrshyt

These details have not been verified by PyPI

Release history Release notifications | RSS feed

This version

0.1.0

May 27, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

getsift-0.1.0.tar.gz (61.7 kB view details)

Uploaded May 27, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

getsift-0.1.0-py3-none-any.whl (63.3 kB view details)

Uploaded May 27, 2026 Python 3

File details

Details for the file getsift-0.1.0.tar.gz.

File metadata

Download URL: getsift-0.1.0.tar.gz
Upload date: May 27, 2026
Size: 61.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for getsift-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`0925cd211bca2b88dbb14e614a7522cf6121d2e146692506aa161e1a51473087`
MD5	`e54d177d9c3b9c27e4970721dd959013`
BLAKE2b-256	`4031de5e43a8891f207177c2ec35ee6acc755a9cfa56d115061c8c96c9073404`

See more details on using hashes here.

Provenance

The following attestation bundles were made for getsift-0.1.0.tar.gz:

Publisher: package.yml on Gitter09/sift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: getsift-0.1.0.tar.gz
- Subject digest: 0925cd211bca2b88dbb14e614a7522cf6121d2e146692506aa161e1a51473087
- Sigstore transparency entry: 1646143514
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: Gitter09/sift@7fe6d7d5e0aee517c168446567db74af06ce6aca
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Gitter09
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: package.yml@7fe6d7d5e0aee517c168446567db74af06ce6aca
- Trigger Event: workflow_dispatch

File details

Details for the file getsift-0.1.0-py3-none-any.whl.

File metadata

Download URL: getsift-0.1.0-py3-none-any.whl
Upload date: May 27, 2026
Size: 63.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for getsift-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`41b25b0b4ff0639df43e97cdc75cbb6d9e00f97fc2d5c4d5f96f04bcf70b2652`
MD5	`cb936ce9d947ad38705e381bee252928`
BLAKE2b-256	`7e7f6a5f12baa85bbe20fdecf2137be2c8879380b3faa3e3dba3113e9fd5110a`

See more details on using hashes here.

Provenance

The following attestation bundles were made for getsift-0.1.0-py3-none-any.whl:

Publisher: package.yml on Gitter09/sift

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: getsift-0.1.0-py3-none-any.whl
- Subject digest: 41b25b0b4ff0639df43e97cdc75cbb6d9e00f97fc2d5c4d5f96f04bcf70b2652
- Sigstore transparency entry: 1646143608
- Sigstore integration time: May 27, 2026
Source repository:
- Permalink: Gitter09/sift@7fe6d7d5e0aee517c168446567db74af06ce6aca
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/Gitter09
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: package.yml@7fe6d7d5e0aee517c168446567db74af06ce6aca
- Trigger Event: workflow_dispatch

getsift 0.1.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Sift

What It Does

How It Works

Install

Quick Start

CLI Commands

Configuration

Data Sources

Output

Architecture

Running Tests

Roadmap

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance