Scrape comments and replies from Threads by keyword
Threads Comment Scraper
A keyword-based CLI scraper for Threads comments and replies — no account required.
Features
- Keyword search: search Threads by any keyword and collect all matching comments and replies
- No account required: session tokens are fetched automatically via a headless browser on first run
- Auto token refresh: detects expired tokens after 3 consecutive 403s and silently refreshes via headless Chromium
- Text cleaning: removes URLs, @mentions, #hashtags, emoji, and normalizes whitespace
- Deduplication: skips posts already scraped, tracked across restarts via checkpoint file
- Resume support: interrupted scrapes continue from where they left off
- Configurable via CLI: limit, output file, delay range, minimum comment length, checkpoint toggle
- CSV output with columns:
post_code,post_id,post_text,comment_id,comment_text,username,like_count,reply_count,timestamp,keyword,type
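The text-cleaning step listed above can be illustrated with a small sketch. This is not the package's actual implementation: the function name `clean_text` and all regular expressions are assumptions made for illustration, and the emoji ranges are a rough approximation rather than full Unicode coverage.

```python
import re

# All patterns below are illustrative assumptions, not the package's own rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@[\w.]+")
HASHTAG_RE = re.compile(r"#\w+")
EMOJI_RE = re.compile(
    "["
    "\U0001F000-\U0001FAFF"  # emoticons, pictographs, transport, flags
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def clean_text(text: str) -> str:
    """Strip URLs, @mentions, #hashtags and emoji, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE, EMOJI_RE):
        text = pattern.sub(" ", text)
    # split() with no arguments drops leading/trailing and repeated whitespace
    return " ".join(text.split())


cleaned = clean_text("Halo @user.name cek https://example.com/a?b=1 #pilkada 😀  seru   banget")
print(cleaned)
```

URLs are removed first so that a `#fragment` or `@` inside a link is not treated as a hashtag or mention. The mention pattern is deliberately simple and would also clip the domain part of an email address.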
Installation
```bash
pip install threads-comment-scraper
playwright install chromium
```
Usage
```bash
# Scrape by inline keywords
threads-scraper --keywords "politik indonesia,pilkada"

# Use a keywords file
threads-scraper --keywords-file keywords.txt

# With all options
threads-scraper --keywords-file keywords.txt \
  --output data.csv \
  --limit 5000 \
  --delay-min 2 \
  --delay-max 5 \
  --min-length 15
```
keywords.txt format
Lines starting with # are treated as comments and ignored.
```text
# Politik
politik indonesia
pilkada

# Ekonomi
ekonomi indonesia
bbm naik
```
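To make the file format concrete, here is a minimal sketch of how such a file and the comma-separated `--keywords` string could be parsed. The function names `parse_keywords_file` and `parse_inline_keywords` are hypothetical, and skipping blank lines and trimming surrounding whitespace are assumptions; the package may handle these cases differently.

```python
def parse_keywords_file(text: str) -> list[str]:
    """Return one keyword per non-empty line, ignoring lines that start with '#'."""
    keywords = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line (assumed skipped) or comment line
        keywords.append(line)
    return keywords


def parse_inline_keywords(value: str) -> list[str]:
    """Split a comma-separated --keywords value into trimmed keywords."""
    return [part.strip() for part in value.split(",") if part.strip()]


sample = """# Politik
politik indonesia
pilkada

# Ekonomi
ekonomi indonesia
bbm naik
"""

file_keywords = parse_keywords_file(sample)
inline_keywords = parse_inline_keywords("politik indonesia,pilkada")
print(file_keywords)
print(inline_keywords)
```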
CLI reference
| Argument | Default | Description |
|---|---|---|
| `--keywords` | none | Comma-separated keyword string |
| `--keywords-file` | none | Path to .txt file, one keyword per line |
| `--limit` | unlimited | Maximum total comments to collect |
| `--output` | `output.csv` | Output CSV file path |
| `--min-length` | `10` | Minimum character count per comment |
| `--delay-min` | `2.0` | Minimum seconds between requests |
| `--delay-max` | `5.0` | Maximum seconds between requests |
| `--no-checkpoint` | off | Disable resume behavior (start fresh) |
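The argument table can be read as an `argparse` definition. The sketch below is a reconstruction from the table only, not the package's real parser: the `build_parser` helper, the attribute types, and the use of `None` for "no default" and "unlimited" are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(prog="threads-scraper")
    parser.add_argument("--keywords", default=None,
                        help="Comma-separated keyword string")
    parser.add_argument("--keywords-file", default=None,
                        help="Path to .txt file, one keyword per line")
    parser.add_argument("--limit", type=int, default=None,
                        help="Maximum total comments to collect (None = unlimited)")
    parser.add_argument("--output", default="output.csv",
                        help="Output CSV file path")
    parser.add_argument("--min-length", type=int, default=10,
                        help="Minimum character count per comment")
    parser.add_argument("--delay-min", type=float, default=2.0,
                        help="Minimum seconds between requests")
    parser.add_argument("--delay-max", type=float, default=5.0,
                        help="Maximum seconds between requests")
    parser.add_argument("--no-checkpoint", action="store_true",
                        help="Disable resume behavior (start fresh)")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(["--keywords", "pilkada", "--limit", "100"])
print(args)
```

Note that `argparse` turns dashes into underscores, so `--delay-min` becomes `args.delay_min` and `--no-checkpoint` becomes `args.no_checkpoint`.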
Output CSV columns
| Column | Description |
|---|---|
| `post_code` | Original post shortcode from the URL (e.g. `DYeZUeiElWy`) |
| `post_id` | Numeric media ID used by the GraphQL API |
| `post_text` | Text of the top-level post being replied to |
| `comment_id` | Numeric ID of the comment or reply |
| `comment_text` | Cleaned comment/reply text |
| `username` | Poster's Threads username |
| `like_count` | Number of likes on the comment |
| `reply_count` | Number of direct replies to the comment |
| `timestamp` | Unix timestamp of the comment |
| `keyword` | The search keyword that found this post |
| `type` | `comment` (top-level) or `reply` |
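For downstream analysis, the output file can be loaded with the standard library. The sketch below assumes that `timestamp` is in whole seconds and that the count columns are plain integers; the sample row is made up for illustration and does not come from a real scrape. The `load_rows` helper is hypothetical.

```python
import csv
import io
from datetime import datetime, timezone


def load_rows(fileobj):
    """Read scraper output and convert counts and timestamps to native types."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["like_count"] = int(row["like_count"])
        row["reply_count"] = int(row["reply_count"])
        # Assumption: the timestamp column holds Unix seconds (not milliseconds).
        row["timestamp"] = datetime.fromtimestamp(int(row["timestamp"]), tz=timezone.utc)
        rows.append(row)
    return rows


# Made-up sample data standing in for a real output.csv.
sample = io.StringIO(
    "post_code,post_id,post_text,comment_id,comment_text,username,"
    "like_count,reply_count,timestamp,keyword,type\n"
    "ABC123,1,example post,2,example comment,someuser,3,0,1700000000,pilkada,comment\n"
)

rows = load_rows(sample)
top_level = [r for r in rows if r["type"] == "comment"]
print(rows[0]["timestamp"].isoformat(), len(top_level))
```

With a real file, replace the `io.StringIO` object with `open("output.csv", newline="", encoding="utf-8")`. The ID columns are left as strings because large numeric IDs are safer to treat as opaque identifiers.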
Notes
- For educational and research purposes only
- Respect Threads' Terms of Service
- Use reasonable delays (--delay-min, --delay-max) to avoid overloading servers
- The first run launches a headless browser to capture fresh session tokens; this is normal and takes ~10 seconds
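As an illustration of what a delay range means in practice, the sketch below waits a random interval between the two bounds before each request. Whether the tool draws the delay uniformly is an assumption, and `polite_sleep` is a hypothetical helper, not part of the package.

```python
import random
import time


def polite_sleep(delay_min: float = 2.0, delay_max: float = 5.0, sleep=time.sleep) -> float:
    """Sleep for a random duration in [delay_min, delay_max] and return it.

    The sleep function is injectable so the helper can be exercised without waiting.
    """
    if delay_min < 0 or delay_min > delay_max:
        raise ValueError("expected 0 <= delay_min <= delay_max")
    delay = random.uniform(delay_min, delay_max)
    sleep(delay)
    return delay


# Record the requested delay instead of actually sleeping.
recorded = []
chosen = polite_sleep(2.0, 5.0, sleep=recorded.append)
print(f"would wait {chosen:.2f} seconds")
```

Randomizing the interval spreads requests out instead of hitting the server on a fixed rhythm; widening the range or raising the minimum lowers the request rate further.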
Credit
Made by @galihkjaya @Nathaniel7