Scrape comments and replies from Threads by keyword

threadscraper

A keyword-based CLI scraper for Threads comments and replies — no account required.

Python 3.10+ | License: MIT | PyPI


Features

  • Keyword search — search Threads by any keyword and collect all matching comments and replies
  • No account required — session tokens are fetched automatically via headless browser on first run
  • Auto token refresh — detects expired tokens after 3 consecutive 403s and silently refreshes via headless Chromium
  • Text cleaning — removes URLs, @mentions, #hashtags, emoji, and normalizes whitespace
  • Deduplication — skips posts already scraped, tracked across restarts via checkpoint file
  • Resume support — interrupted scrapes continue from where they left off
  • Configurable via CLI — limit, output file, delay range, minimum comment length, checkpoint toggle
  • CSV output with columns: post_code, post_id, post_text, comment_id, comment_text, username, like_count, reply_count, timestamp, keyword, type
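
The text cleaning step listed above can be sketched roughly as follows. This is a minimal illustration, not the package's actual implementation: the function name `clean_text`, the regular expressions, and the emoji ranges are all assumptions, and the emoji pattern covers only common code point blocks.

```python
import re

# Hypothetical patterns for illustration; the package's real rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@[\w.]+")
HASHTAG_RE = re.compile(r"#\w+")
# Rough emoji coverage (assumption): common pictograph blocks only, not exhaustive.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters used for flags
    "\U0000FE0F\U0000200D"   # variation selector and zero-width joiner
    "]+"
)


def clean_text(text: str) -> str:
    """Remove URLs, @mentions, #hashtags, and emoji, then collapse whitespace."""
    # URLs go first so that a "#fragment" inside a link is not treated as a hashtag.
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    # split() with no arguments drops all runs of whitespace, including newlines.
    return " ".join(text.split())
```

Replacing each match with a space rather than an empty string keeps neighbouring words from being glued together; the final `split`/`join` then normalizes the spacing.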

Installation

pip install threads-comment-scraper
playwright install chromium

Usage

# Scrape by inline keywords
threads-scraper --keywords "politik indonesia,pilkada"

# Use a keywords file
threads-scraper --keywords-file keywords.txt

# With all options
threads-scraper --keywords-file keywords.txt \
  --output data.csv \
  --limit 5000 \
  --delay-min 2 \
  --delay-max 5 \
  --min-length 15
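
As a rough picture of what a delay range such as `--delay-min 2 --delay-max 5` means, the sketch below waits a random time between the two bounds before each request. The helper name `polite_sleep` and the choice of a uniform distribution are assumptions; the tool may pick its delays differently.

```python
import random
import time


def polite_sleep(delay_min: float = 2.0, delay_max: float = 5.0, sleep=time.sleep) -> float:
    """Sleep for a random duration in [delay_min, delay_max] and return it.

    The `sleep` parameter is injectable so the helper can be tested without waiting.
    """
    if delay_min < 0 or delay_max < delay_min:
        raise ValueError("expected 0 <= delay_min <= delay_max")
    # Assumption: delays are drawn uniformly between the two bounds.
    delay = random.uniform(delay_min, delay_max)
    sleep(delay)
    return delay
```

Randomizing the wait, rather than sleeping a fixed interval, spreads requests out less predictably while still keeping the average request rate low.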

keywords.txt format

Lines starting with # are treated as comments and ignored.

# Politik
politik indonesia
pilkada

# Ekonomi
ekonomi indonesia
bbm naik
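
A parser for this format might look like the sketch below. The function names `parse_keywords` and `load_keywords` are hypothetical, and skipping blank lines and trimming surrounding whitespace are assumptions based on the sample file above.

```python
from pathlib import Path


def parse_keywords(text: str) -> list[str]:
    """Return one keyword per non-empty line, ignoring lines that start with '#'."""
    keywords = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines are separators and carry no keyword.
        if not line or line.startswith("#"):
            continue
        keywords.append(line)
    return keywords


def load_keywords(path: str) -> list[str]:
    """Read a keywords file from disk (UTF-8 is assumed) and parse it."""
    return parse_keywords(Path(path).read_text(encoding="utf-8"))
```

Applied to the sample file above, `parse_keywords` would yield the four keywords `politik indonesia`, `pilkada`, `ekonomi indonesia`, and `bbm naik`.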

CLI reference

Argument          Default      Description
--keywords        (none)       Comma-separated keyword string
--keywords-file   (none)       Path to .txt file, one keyword per line
--limit           unlimited    Maximum total comments to collect
--output          output.csv   Output CSV file path
--min-length      10           Minimum character count per comment
--delay-min       2.0          Minimum seconds between requests
--delay-max       5.0          Maximum seconds between requests
--no-checkpoint   off          Disable resume behavior (start fresh)
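
To illustrate the resume behavior that `--no-checkpoint` turns off, here is a minimal sketch of a checkpoint that remembers already-scraped posts across restarts. The class name `Checkpoint`, the JSON file format, and keying on `post_code` are all assumptions; the package's real checkpoint file may be structured differently.

```python
import json
from pathlib import Path


class Checkpoint:
    """Tracks which post codes were already scraped, persisted to a JSON file."""

    def __init__(self, path: str, enabled: bool = True):
        self.path = Path(path)
        self.enabled = enabled  # False mimics --no-checkpoint: always start fresh
        self.seen: set[str] = set()
        if enabled and self.path.exists():
            self.seen = set(json.loads(self.path.read_text(encoding="utf-8")))

    def is_new(self, post_code: str) -> bool:
        """True if this post has not been scraped in this or an earlier run."""
        return post_code not in self.seen

    def mark_done(self, post_code: str) -> None:
        """Record a finished post and, if enabled, write the set to disk."""
        self.seen.add(post_code)
        if self.enabled:
            self.path.write_text(json.dumps(sorted(self.seen)), encoding="utf-8")
```

Writing the file after every finished post costs a little I/O but means an interrupted run loses at most the post that was in progress.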

Output CSV columns

Column         Description
post_code      Original post shortcode from the URL (e.g. DYeZUeiElWy)
post_id        Numeric media ID used by the GraphQL API
post_text      Text of the top-level post being replied to
comment_id     Numeric ID of the comment or reply
comment_text   Cleaned comment/reply text
username       Poster's Threads username
like_count     Number of likes on the comment
reply_count    Number of direct replies to the comment
timestamp      Unix timestamp of the comment
keyword        The search keyword that found this post
type           comment (top-level) or reply
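
The output can be loaded with the standard library alone. The sketch below assumes the file has a header row with the column names above and that `timestamp` holds whole seconds since the Unix epoch; the function name `read_comments` and the sample row are made up for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# Column order as listed in the table above.
COLUMNS = ["post_code", "post_id", "post_text", "comment_id", "comment_text",
           "username", "like_count", "reply_count", "timestamp", "keyword", "type"]


def read_comments(handle):
    """Yield rows as dicts, with counts as int and timestamp as an aware UTC datetime."""
    for row in csv.DictReader(handle):
        row["like_count"] = int(row["like_count"] or 0)
        row["reply_count"] = int(row["reply_count"] or 0)
        # Assumption: the timestamp is in seconds, not milliseconds.
        row["timestamp"] = datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc)
        yield row


# Made-up sample row for illustration only; the IDs and text are not real data.
SAMPLE = (
    ",".join(COLUMNS) + "\n"
    "DYeZUeiElWy,1234567890,contoh post,987654321,contoh komentar,"
    "example_user,3,1,1700000000,pilkada,comment\n"
)

rows = list(read_comments(io.StringIO(SAMPLE)))
# Keep only top-level comments, dropping replies.
top_level = [r for r in rows if r["type"] == "comment"]
```

For a real file, replace the `io.StringIO` sample with `open("output.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is likewise an assumption.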

Notes

  • For educational and research purposes only
  • Respect Threads' Terms of Service
  • Use reasonable delays (--delay-min, --delay-max) to avoid overloading servers
  • The first run launches a headless browser to capture fresh session tokens — this is normal and takes ~10 seconds
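
The auto token refresh rule described under Features (refresh after 3 consecutive 403 responses) can be pictured as a small counter. The sketch below is an assumption about how such a rule could be written, not the package's code: the class name `TokenRefreshGuard` and the `refresh` callback are hypothetical, and the callback stands in for whatever launches the headless browser.

```python
from typing import Callable


class TokenRefreshGuard:
    """Calls `refresh` once `threshold` HTTP 403 responses arrive in a row."""

    def __init__(self, refresh: Callable[[], None], threshold: int = 3):
        self.refresh = refresh
        self.threshold = threshold
        self.consecutive_403 = 0

    def record(self, status_code: int) -> bool:
        """Record a response status; return True if a refresh was triggered."""
        if status_code != 403:
            # Any other status breaks the streak.
            self.consecutive_403 = 0
            return False
        self.consecutive_403 += 1
        if self.consecutive_403 >= self.threshold:
            self.refresh()
            self.consecutive_403 = 0
            return True
        return False
```

Requiring several 403s in a row, instead of reacting to the first one, avoids launching a browser for a single transient rejection.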

Credit

Made by @galihkjaya

Download files

Source Distribution

threads_comment_scraper-0.1.0.tar.gz (11.4 kB)

Uploaded Source

Built Distribution

threads_comment_scraper-0.1.0-py3-none-any.whl (14.0 kB)

Uploaded Python 3

File details

Details for the file threads_comment_scraper-0.1.0.tar.gz.

File metadata

  • Download URL: threads_comment_scraper-0.1.0.tar.gz
  • Size: 11.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.1

File hashes

Hashes for threads_comment_scraper-0.1.0.tar.gz
Algorithm Hash digest
SHA256 774fadff986d5527df147c283ac6093192a4e532c5e0a05f2b2cd7aab0b4f185
MD5 d462e1058bcb103105ef996c81045972
BLAKE2b-256 c07ced86b74dc2be4f127e6c40973c05a7b916dfd21a2162d546a6baf7271804

File details

Details for the file threads_comment_scraper-0.1.0-py3-none-any.whl.

File hashes

Hashes for threads_comment_scraper-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8f572a659eea7e0d055bc307de90429b2dad58c4c1ac9eda4bee734dab94098f
MD5 df5cfef95e3018819f6448627a24b6b6
BLAKE2b-256 d8ea097cf0a48f7753810ef8c684b9a40cefa840f66ca1961d8fa9fefe334b7d
