API-only tweet search scraping (Twitter/X web GraphQL)

Scweet v4 Documentation

Scweet v4 preserves the familiar v3 surface (Scweet(...), scrape/ascrape, output filenames, resume flags) while moving tweet search scraping to an API-only core and introducing DB-first account provisioning (SQLite).

What’s In v4 (and What Isn’t)

Supported:

  • Tweet search scraping (GraphQL SearchTimeline), API-only.
  • Multiple provisioning sources: .env, accounts.txt, cookies.json, Netscape cookies.txt, direct cookies= payload.
  • Local SQLite state: account leasing/cooldowns, resume checkpoints, run stats, manifest cache.
  • Optional internal cookie bootstrap with nodriver (credentials -> cookies). No scraping via browser.

Not implemented (v4.x):

  • Profile scraping and follower/following scraping are stubbed (methods exist for signature compatibility, but engines return 501 Not implemented).
  • API login/provisioning using credentials
  • API profile timeline scraping
  • Richer scraping query input

Installation

pip install Scweet

Notes:

  • Python >= 3.9
  • API HTTP stack uses curl_cffi (sync or async sessions depending on config).
  • nodriver is only needed if you want credentials-based cookie bootstrap (install note below).
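
If you plan to use credentials-based bootstrap, install nodriver alongside Scweet. The plain package name is assumed here; the docs do not mention a packaged extra:

pip install nodriver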

Quickstart

The simplest “I already have cookies” flow:

from Scweet import Scweet

scweet = Scweet.from_sources(
    db_path="scweet_state.db",
    cookies={"auth_token": "...", "ct0": "..."}, # ct0 is optional here.
)

tweets = scweet.scrape(
    since="2026-02-01",
    until="2026-02-07",
    words=["bitcoin"],
    limit=200,
    resume=True,
    save_dir="outputs",
    custom_csv_name="bitcoin.csv",
)

print(len(tweets))  # list[dict] of raw GraphQL tweet objects

Notebook Usage (Async)

In notebooks/Jupyter, you almost always have an active event loop. Use await ascrape(...) instead of scrape(...).

from Scweet import Scweet

scweet = Scweet.from_sources(
    db_path="scweet_state.db",
    cookies_file="cookies.json",
    provision_on_init=True,
)

tweets = await scweet.ascrape(
    since="2026-02-01",
    until="2026-02-07",
    words=["bitcoin"],
    limit=50,
    resume=True,
    save_dir="outputs",
    custom_csv_name="nb_bitcoin.csv",
)

Configuration

There are three ways to configure Scweet v4:

  1. Scweet.from_sources(...) (convenience)
  2. ScweetConfig.from_sources(...) (recommended for typed discoverability)
  3. Pass a full config dict/model to Scweet(config=...) for advanced tuning

Recommended: ScweetConfig.from_sources(...)

from Scweet import Scweet, ScweetConfig

cfg = ScweetConfig.from_sources(
    db_path="scweet_state.db",
    accounts_file="accounts.txt",
    cookies_file="cookies.json",    # or cookies.txt (Netscape)
    # env_path=".env",              # legacy: supports only a single account
    bootstrap_strategy="auto",      # auto|token_only|nodriver_only|none
    resume_mode="hybrid_safe",      # legacy_csv|db_cursor|hybrid_safe
    output_format="both",           # csv|json|both|none
    strict=False,
    proxy={"host": "127.0.0.1", "port": 8080},
    api_http_impersonate="chrome124",
    overrides={
        "operations": {"account_lease_ttl_s": 600},
        "output": {"dedupe_on_resume_by_tweet_id": True},
    },
)

scweet = Scweet(config=cfg)

Advanced: overrides={...}

All “power knobs” live in nested config sections; overrides is a deep-merge patch applied on top (a combined sketch follows the field list below).

Common advanced fields:

  • operations.account_lease_ttl_s
  • operations.account_requests_per_min
  • operations.api_page_size
  • pool.n_splits, pool.concurrency
  • output.dedupe_on_resume_by_tweet_id
  • manifest.manifest_url, manifest.update_on_init, manifest.ttl_s
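
A minimal sketch combining a few of these knobs (values here are illustrative, not recommendations):

from Scweet import Scweet

scweet = Scweet.from_sources(
    db_path="scweet_state.db",
    cookies_file="cookies.json",
    overrides={
        "operations": {"account_requests_per_min": 15, "api_page_size": 50},
        "pool": {"n_splits": 3, "concurrency": 3},
        "manifest": {"update_on_init": True, "ttl_s": 7200},
    },
)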

Full Configuration Reference (ScweetConfig)

ScweetConfig is the full v4 configuration model (Pydantic) and is the source of truth for all runtime behavior.

You can provide configuration as:

  • A ScweetConfig instance (recommended for IDE discoverability).
  • A dict (same shape as below).

To print the current defaults as JSON:

from Scweet import ScweetConfig
import json

print(json.dumps(ScweetConfig().model_dump(mode="json"), indent=2))

Note on defaults:

  • ScweetConfig() contains baseline defaults for all fields.
  • ScweetConfig.from_sources(...) applies v4-friendly defaults (notably engine.kind="api" since v4 tweet search scraping is API-only).

engine

Controls API HTTP behavior and keeps legacy compatibility fields.

  • engine.kind — type: "api" | "browser" | "auto"; default: "browser". Legacy/compat field. v4 tweet search scraping is API-only, so you should treat this as "api".
  • engine.api_http_mode — type: "auto" | "sync" | "async"; default: "auto". How HTTP calls are executed under the hood. auto prefers async sessions when available and falls back to sync. Use ascrape() in async environments; use scrape() in sync scripts.
  • engine.api_http_impersonate — type: str | None; default: None. curl_cffi impersonation string (example: "chrome124"). Affects API sessions and transaction-id bootstrap. If unset, curl_cffi defaults are used (or SCWEET_HTTP_IMPERSONATE).

storage

SQLite state DB configuration.

  • storage.db_path — type: str; default: "scweet_state.db". SQLite path for Scweet state (accounts, leases, resume, manifest cache).
  • storage.enable_wal — type: bool; default: True. Enable WAL mode (recommended) for better concurrency/perf on SQLite.
  • storage.busy_timeout_ms — type: int; default: 5000. SQLite busy timeout in milliseconds.

accounts

Account provisioning sources and bootstrap policy.

  • accounts.accounts_file — type: str | None; default: None. Path to accounts.txt (colon-separated) provisioning source.
  • accounts.cookies_file — type: str | None; default: None. Path to cookies.json or Netscape cookies.txt.
  • accounts.cookies_path — type: str | None; default: None. Legacy alias for a cookies path (kept for backward compatibility). Prefer cookies_file.
  • accounts.env_path — type: str | None; default: None. Path to a dotenv-style .env (legacy single-account provisioning).
  • accounts.cookies — type: Any; default: None. The legacy cookies= payload. Accepted forms include a cookie dict/list, Cookie header string, auth_token string, JSON string, or a file path string. See the cookies section.
  • accounts.provision_on_init — type: bool; default: True. If True, Scweet(...) imports any provided sources into the DB during initialization.
  • accounts.bootstrap_strategy — type: "auto" | "token_only" | "nodriver_only" | "none"; default: "auto". Controls whether Scweet may bootstrap missing auth material (auth_token -> cookies, and/or credentials -> cookies via nodriver).

pool

Work splitting and concurrency.

  • pool.n_splits — type: int; default: 5. Split the date window into N intervals (tasks). More splits can increase parallelism but also increase overhead.
  • pool.concurrency — type: int; default: 5. Max concurrent workers. Effective concurrency is limited by the number of eligible accounts in the DB.

runtime

Runtime behavior and nodriver (credentials bootstrap) controls.

  • runtime.proxy — type: str | dict | None; default: None. Default proxy used for API HTTP and nodriver bootstrap. Can be a URL ("http://user:pass@host:port"), a dict ({"host": "...", "port": 8080, "username": "...", "password": "..."}), or a requests-style proxies dict. Per-account proxy overrides can be stored in the DB.
  • runtime.user_agent — type: str | None; default: None. User-Agent override for nodriver bootstrap only.
  • runtime.api_user_agent — type: str | None; default: None. User-Agent override for API HTTP requests. By default, Scweet does not set a UA for curl_cffi sessions (to avoid impersonation fingerprint mismatches).
  • runtime.headless — type: bool; default: True. nodriver option: headless mode for bootstrap/login.
  • runtime.scroll_ratio — type: int; default: 30. Legacy field (browser-scraping era). Currently unused in v4 API-only scraping.
  • runtime.code_callback — type: callable | None; default: None. Optional callback used by nodriver bootstrap to request user-provided login codes (email/2FA).
  • runtime.strict — type: bool; default: False. If True, some failures become exceptions (for example: manifest refresh when update_on_init=True, or "no usable accounts" instead of returning empty outputs).

operations

Account leasing, rate limiting, retries, and cooldown policy.

  • operations.account_lease_ttl_s — type: int; default: 120. How long a leased account stays reserved before expiring (crash safety).
  • operations.account_lease_heartbeat_s — type: float; default: 30.0. How often workers extend the lease while running. Set 0 to disable heartbeats.
  • operations.account_daily_requests_limit — type: int; default: 5000. Per-account daily cap (UTC) on requests/pages; accounts above this cap become ineligible for leasing until reset.
  • operations.account_daily_tweets_limit — type: int; default: 50000. Per-account daily cap (UTC) on tweets returned; accounts above this cap become ineligible for leasing until reset.
  • operations.cooldown_default_s — type: float; default: 120.0. Cooldown used for rate limits when no reset header is available.
  • operations.transient_cooldown_s — type: float; default: 120.0. Cooldown used for transient/network/5xx failures.
  • operations.auth_cooldown_s — type: float; default: 2592000.0. Cooldown used for auth failures (401/403/404). Default is 30 days.
  • operations.cooldown_jitter_s — type: float; default: 10.0. Adds random jitter to cooldowns to avoid synchronized retries.
  • operations.account_requests_per_min — type: int; default: 30. Per-account request rate limit (token bucket).
  • operations.account_min_delay_s — type: float; default: 0.0. Minimum delay between requests (per account worker).
  • operations.api_page_size — type: int; default: 20. GraphQL page size (count). Larger values reduce requests but can increase per-request payload. Max 100.
  • operations.task_retry_base_s — type: int; default: 1. Base delay (seconds) used for task retries.
  • operations.task_retry_max_s — type: int; default: 30. Max delay (seconds) for exponential backoff on transient errors.
  • operations.max_task_attempts — type: int; default: 3. Max retries per task before failing.
  • operations.max_fallback_attempts — type: int; default: 3. Max fallback retries per task before failing (used for continuation/edge cases).
  • operations.max_account_switches — type: int; default: 2. Max times a task can switch accounts after auth errors before failing.
  • operations.scheduler_min_interval_s — type: int; default: 300. Minimum interval size (seconds) used when splitting [since, until] into pool.n_splits tasks; limits how many splits are allowed.

resume

Resume policy for resume=True.

  • resume.mode — type: "legacy_csv" | "db_cursor" | "hybrid_safe"; default: "hybrid_safe". How Scweet decides where to continue when resuming (CSV timestamp, DB cursor checkpoint, or hybrid).

output

File outputs (return value is always list[dict]).

  • output.save_dir — type: str; default: "outputs". Default directory for output files (can be overridden per call via save_dir=).
  • output.format — type: "csv" | "json" | "both" | "none"; default: "csv". Which files Scweet writes: CSV only, JSON only, both, or none.
  • output.dedupe_on_resume_by_tweet_id — type: bool; default: False. If True and resume=True, Scweet avoids appending duplicates (by tweet id) to CSV and JSON outputs.

manifest

GraphQL request manifest controls (query ids + features).

  • manifest.manifest_url — type: str | None; default: None. Remote manifest URL (your hosted scweet_manifest.json). If unset, Scweet uses the packaged manifest (with a built-in fallback).
  • manifest.ttl_s — type: int; default: 3600. Cache TTL (seconds) for the remote manifest (stored in SQLite).
  • manifest.update_on_init — type: bool; default: False. If True, Scweet attempts a best-effort remote refresh during init (raises if runtime.strict=True).

Account Provisioning (DB-First)

Scweet imports account sources into SQLite. Scraping workers lease accounts from the DB.

Provisioning sources:

  • env_path=".env": legacy single-account provisioning
  • accounts_file="accounts.txt": multiple accounts, credentials and/or tokens
  • cookies_file="cookies.json" or Netscape cookies.txt
  • cookies=... payload: dict/list/header/raw token/path string/JSON string (sketched below)
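
For illustration, each of the following cookies= payloads matches one of the accepted forms listed above (all values are placeholders):

from Scweet import Scweet

# Cookie dict
scweet = Scweet.from_sources(db_path="scweet_state.db", cookies={"auth_token": "...", "ct0": "..."})

# Cookie header string
scweet = Scweet.from_sources(db_path="scweet_state.db", cookies="auth_token=...; ct0=...")

# Raw auth_token string
scweet = Scweet.from_sources(db_path="scweet_state.db", cookies="<auth_token>")

# Path to a cookies file (JSON or Netscape)
scweet = Scweet.from_sources(db_path="scweet_state.db", cookies="cookies.json")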

Provision on init vs manual provisioning

If accounts.provision_on_init=True (default), Scweet imports accounts during Scweet(...) init when you provide any sources.

If you want a two-step flow:

from Scweet import Scweet

scweet = Scweet.from_sources(db_path="scweet_state.db", provision_on_init=False)
print(scweet.provision_accounts(accounts_file="accounts.txt"))

Bootstrap strategy (accounts.bootstrap_strategy)

Controls how Scweet can create missing auth material:

  • auto (default): allow auth_token bootstrap and credentials (nodriver) bootstrap
  • token_only: allow auth_token bootstrap only
  • nodriver_only: allow credentials bootstrap only (not recommended; API login is coming soon)
  • none: do not bootstrap; accounts missing auth are imported but marked unusable

Credentials bootstrap requires nodriver and a record containing a login identifier + password (email or username + password).
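
For example, to import token-based accounts while never launching a browser login (a sketch; token_only is usually the safer choice until API login lands):

from Scweet import Scweet

scweet = Scweet.from_sources(
    db_path="scweet_state.db",
    accounts_file="accounts.txt",
    bootstrap_strategy="token_only",  # auth_token bootstrap only, no nodriver
)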

Scraping API

scrape(...) and ascrape(...)

  • scrape(...) is sync (good for normal scripts)
  • ascrape(...) is async (required in notebooks/async apps)

Important parameters:

  • since, until: date bounds (YYYY-MM-DD); v4 normalizes internally.
  • words: list of keywords (OR query) or "a//b" string (legacy split).
  • from_account, to_account, mention_account, hashtag, lang
  • display_type: "Top" or "Latest" (legacy value "Recent" is treated as "Latest").
  • limit: best-effort per-run cap. Due to concurrency/page size, you may overshoot slightly.
  • resume: append to existing outputs + attempt continuation using resume mode policy.
  • More scraping params coming soon.

Accepted-but-ignored (v3 compatibility):

  • filter_replies, proximity, geocode, minreplies, minlikes, minretweets

These parameters remain in the public signature for backward compatibility, but are not currently applied to the v4 API search query.
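
Putting the supported filters together (a sketch, reusing the scweet instance from the quickstart; parameter names as listed above):

tweets = scweet.scrape(
    since="2026-02-01",
    until="2026-02-07",
    hashtag="bitcoin",
    lang="en",
    display_type="Latest",
    limit=100,
    resume=False,
)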

How many tweets will I get?

limit is a per-run target, not “total across runs”. If you run the same query multiple times with resume=True, outputs will append unless you use dedupe (below).

Output (Return + Files)

Return value:

  • scrape/ascrape returns list[dict] of raw GraphQL tweet objects (tweet_results.result).
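
Since the return value mirrors the raw web GraphQL objects, a common next step is flattening a few fields. The keys below ("legacy", "id_str", "full_text") follow the typical tweet_results.result layout and are an assumption about upstream GraphQL, not a Scweet guarantee:

for t in tweets:
    legacy = t.get("legacy", {})  # assumption: standard web GraphQL tweet layout
    print(legacy.get("id_str"), legacy.get("created_at"), legacy.get("full_text"))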

File outputs are controlled by config.output.format:

  • csv (default): curated CSV schema with important fields
  • json: raw tweet objects saved as JSON array
  • both: write CSV + JSON
  • none: don’t write files

Dedupe on resume (CSV and JSON)

To avoid writing duplicates across runs when appending:

Scweet(config={"output": {"dedupe_on_resume_by_tweet_id": True}})

When enabled and resume=True, Scweet reads existing tweet ids from the current output files and skips any new tweet whose id already exists.

Limitations:

  • It prevents new duplicates while appending; it does not retroactively clean old files.
  • JSON dedupe loads the whole existing JSON array to build the id set (can be slow for huge files).

Resume Modes

config.resume.mode controls how resume=True chooses where to start (example after the list):

  • legacy_csv: v3 behavior (override since based on max CSV timestamp)
  • db_cursor: use DB checkpoint (since + cursor)
  • hybrid_safe: prefer DB checkpoint, fallback to CSV timestamp
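
To pin a specific mode, e.g. DB checkpoints only (a sketch using the resume_mode convenience parameter shown in the configuration section):

from Scweet import Scweet

scweet = Scweet.from_sources(
    db_path="scweet_state.db",
    cookies_file="cookies.json",
    resume_mode="db_cursor",
)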

Compatibility rule:

  • If you import the legacy facade (from Scweet.scweet import Scweet), resume is forced to legacy_csv.

Logging

Scweet uses the standard Python logging module and installs no handlers.

Notebook-friendly setup:

import logging, sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s:%(lineno)d | %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
    force=True,
)

Common messages:

  • Import account reuse ...: DB already had usable auth for that token/username.
  • Import account creds bootstrap ...: account needs nodriver login bootstrap.
  • No usable accounts available ...: scraping can’t lease an eligible account (set runtime.strict=True to raise).

Local DB Maintenance (ScweetDB)

Scweet stores state in SQLite. You can inspect and maintain it via ScweetDB:

from Scweet import ScweetDB

db = ScweetDB("scweet_state.db")
print(db.accounts_summary())
print(db.list_accounts(limit=10, eligible_only=True, include_cookies=True))
print(db.clear_leases(expired_only=True))
print(db.reset_account_cooldowns(clear_leases=True, include_unusable=True))
print(db.collapse_duplicates_by_auth_token(dry_run=True))

Secrets are redacted by default (fingerprints + cookie keys). Use reveal_secrets=True only when necessary.

Proxy, User-Agent, HTTP Mode, Impersonation

Proxy

Set runtime.proxy as any of the following (example after the list):

  • {"host": "...", "port": 8080, "username": "...", "password": "..."} (dict)
  • "http://user:pass@host:port" (string)
  • {"http": "...", "https": "..."} (requests/curl-style dict)

This proxy is used for:

  • API calls (curl_cffi sessions)
  • Transaction-id bootstrap (if enabled)
  • Token bootstrap (best-effort)
  • nodriver login bootstrap

User-Agent policy

  • runtime.user_agent: used by nodriver only
  • runtime.api_user_agent: overrides API HTTP User-Agent (by default, curl_cffi uses its own UA consistent with impersonation)

curl_cffi impersonation

Control via engine.api_http_impersonate (e.g. "chrome124").

API HTTP mode

engine.api_http_mode controls how sessions are created (example after the list):

  • auto (default): prefer async if available
  • async: force async
  • sync: force sync
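
For example, forcing sync sessions with a pinned impersonation profile (a sketch; the overrides shape follows the configuration reference above):

from Scweet import Scweet

scweet = Scweet.from_sources(
    db_path="scweet_state.db",
    cookies_file="cookies.json",
    api_http_impersonate="chrome124",
    overrides={"engine": {"api_http_mode": "sync"}},
)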

Manifest (Query IDs + Features)

Twitter’s web GraphQL layer changes frequently. Scweet externalizes the most common drift points into a small manifest:

  • GraphQL query_id for SearchTimeline
  • endpoint template
  • features dict passed to GraphQL

Defaults:

  • Local packaged manifest: Scweet/v4/default_manifest.json (plus a built-in fallback)

Override via URL:

from Scweet import Scweet

scweet = Scweet.from_sources(
    db_path="scweet_state.db",
    cookies_file="cookies.json",
    manifest_url="https://gist.githubusercontent.com/<user>/<gist>/raw/scweet_manifest.json",
    update_manifest=True,  # force refresh at init (best-effort unless strict)
)

Caching:

  • Remote manifests are cached in SQLite (manifest_cache) for manifest.ttl_s seconds.

Important limitation:

  • The manifest does not protect against all future breaking changes (auth flows, variable schema, response shape).

Exceptions and Strict Mode

Key exceptions:

  • Scweet.v4.exceptions.AccountPoolExhausted: no eligible account could be leased.

Strict mode:

  • runtime.strict=True turns some “best effort” behavior into exceptions (recommended for production workflows where silent empty output is not acceptable).
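
A defensive pattern for production runs (a sketch; the exception path is as documented above, and strict= is assumed to be accepted by Scweet.from_sources as it is by ScweetConfig.from_sources):

from Scweet import Scweet
from Scweet.v4.exceptions import AccountPoolExhausted

scweet = Scweet.from_sources(
    db_path="scweet_state.db",
    cookies_file="cookies.json",
    strict=True,  # raise instead of silently returning empty output
)

try:
    tweets = scweet.scrape(since="2026-02-01", until="2026-02-07", words=["bitcoin"])
except AccountPoolExhausted:
    # No eligible account could be leased: provision more accounts or
    # clear cooldowns via ScweetDB, then retry.
    raise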

Future Work (Planned)

  • Implement profile/follows APIs (API-only) to replace legacy browser behavior.
  • Improve resume semantics for “total tweets across runs” and stronger cross-run dedupe.
  • Expand manifest coverage (variable schema, optional toggles) where possible.
  • Add first-class docs website (recommendation: MkDocs Material).
