Archive X bookmarks, likes, authored tweets, and official X exports locally.
tweetxvault
A Python CLI tool for archiving your Twitter/X bookmarks, likes, and authored tweets into a local LanceDB database, with support for importing official X archive exports into the same store. Runs unattended via cron, supports incremental sync with crash-safe resume, and preserves raw API responses so you never lose data.
Features
- Incremental sync — fetches only new items by default; resumes interrupted backfills automatically
- Official X archive import — imports authored tweets, deleted tweets, likes, and exported media from official X archive ZIPs/directories into the same local archive
- Raw capture preservation — every API response page is stored verbatim alongside parsed tweet records
- Secondary object extraction — archives canonical tweet objects, attached-tweet relations, media metadata, URL refs, and article payloads alongside collection memberships
- Crash-safe checkpoints — sync state advances atomically with data writes; safe to kill mid-run
- Full-text and semantic search — built-in FTS (tantivy) and optional ONNX-based vector embeddings for hybrid search
- Automatic query ID discovery — scrapes Twitter's JS bundles to stay current with GraphQL endpoint changes
- Browser cookie extraction — reads session cookies from Firefox plus Chromium-family browsers like Chrome, Chromium, Brave, Edge, Opera, Opera GX, Vivaldi, and Arc
- Rate limit handling — exponential backoff, cooldown periods, and configurable retry limits
- Export — export your archive to JSON or a self-contained HTML viewer
Requirements
- Python 3.12+
- uv (recommended) or pip
- Unix-like runtime only today (Linux/macOS). Windows is not supported yet because the CLI currently depends on fcntl, resource, and strftime("%-d")
- A Twitter/X account logged in via Firefox or a supported Chromium-family browser, or session cookies obtained manually
Installation
Install from PyPI:
pip install tweetxvault
Install globally with uv:
uv tool install tweetxvault
Install globally with pipx:
pipx install tweetxvault
To enable semantic search (vector embeddings):
pip install "tweetxvault[embed]"
Or install the embedding extra as a global tool:
uv tool install "tweetxvault[embed]"
pipx install "tweetxvault[embed]"
Install from source:
git clone https://github.com/lhl/tweetxvault.git
cd tweetxvault
uv sync
Install your local checkout as a global editable tool while developing on HEAD:
uv tool install -e .
Re-run that command with --force after dependency or metadata changes in
pyproject.toml.
To enable semantic search from source:
uv sync --extra embed
Run once without installing:
uvx tweetxvault --help
The [embed] extra installs onnxruntime, tokenizers, and huggingface-hub. GPU acceleration (CUDA/ROCm) is used automatically when available; CPU works fine too (~150-200 tweets/s).
Authentication
tweetxvault needs your auth_token and ct0 session cookies from Twitter/X. There are three ways to provide them (checked in this order):
1. Environment variables (simplest)
export TWEETXVAULT_AUTH_TOKEN="your_auth_token"
export TWEETXVAULT_CT0="your_ct0_token"
export TWEETXVAULT_USER_ID="your_numeric_user_id" # required for likes and own-tweet sync
2. Config file
Create ~/.config/tweetxvault/config.toml:
[auth]
auth_token = "your_auth_token"
ct0 = "your_ct0_token"
user_id = "your_numeric_user_id"
3. Browser auto-extraction
If you're logged into x.com in Firefox, Chrome, Chromium, Brave, Edge, Opera, Opera GX, Vivaldi, or Arc, tweetxvault will try them in that order and stop after the first browser profile that yields valid X cookies.
Firefox is read from its profile database directly. Chromium-family browsers use browser-cookie3 for cookie decryption and OS keyring access.
To force a specific browser or profile for one command:
uv run tweetxvault auth check --browser chrome
uv run tweetxvault sync all --browser brave --profile "Profile 2"
uv run tweetxvault sync all --browser firefox --profile-path /path/to/profile
How --browser behaves:
- --browser, --profile, and --profile-path force tweetxvault to take auth_token and ct0 from that browser/profile. user_id still uses the normal precedence order: TWEETXVAULT_USER_ID -> auth.user_id -> browser twid.
- That means you can still pin user_id explicitly for likes or authored-tweet sync if needed, even while forcing cookies from a specific browser profile.
- If you do not set user_id explicitly, tweetxvault will use the browser profile's twid cookie when available.
For example, this uses Firefox cookies from the selected profile, but still pins user_id from the environment:
export TWEETXVAULT_USER_ID="123456789"
uv run tweetxvault sync likes --browser firefox --profile my-profile
This matters most if you use multiple X accounts. Make sure the selected browser profile and resolved user_id belong to the same account. If you mix cookies from one account with a user_id from another, likes/authored-tweet sync may fail, and tweetxvault's archive-owner guardrail will refuse writes if the local archive already belongs to a different user.
Run uv run tweetxvault auth check --browser ... first if you want to verify which sources are being used before a sync.
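The user_id precedence described above can be sketched as a small resolver. This is a hypothetical helper for illustration only (the names come from this README; the actual resolution code in tweetxvault may differ, including how the twid cookie is decoded):

```python
import os

def resolve_user_id(config, browser_twid=None):
    # Precedence sketch: TWEETXVAULT_USER_ID -> config [auth].user_id ->
    # browser twid cookie. Logic assumed, not tweetxvault's actual source.
    env = os.environ.get("TWEETXVAULT_USER_ID")
    if env:
        return env
    cfg = (config.get("auth") or {}).get("user_id")
    if cfg:
        return cfg
    # The twid cookie is typically URL-encoded as 'u%3D<numeric id>'.
    if browser_twid and browser_twid.startswith("u%3D"):
        return browser_twid[len("u%3D"):]
    return browser_twid
```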
To persist a browser preference in the environment or config:
export TWEETXVAULT_BROWSER="chrome"
export TWEETXVAULT_BROWSER_PROFILE="Profile 2"
export TWEETXVAULT_BROWSER_PROFILE_PATH="/path/to/profile"
Legacy Firefox-only override is still supported:
export TWEETXVAULT_FIREFOX_PROFILE_PATH="/path/to/your/firefox/profile"
Verify your setup
uv run tweetxvault auth check
uv run tweetxvault auth check --interactive
This probes the API without writing any data and reports credential status and endpoint readiness. --interactive opens a picker over discovered browser profiles with valid X cookies.
Usage
Syncing
# Sync bookmarks and likes (incremental by default)
uv run tweetxvault sync all
# Sync just bookmarks, likes, or your own authored tweets
uv run tweetxvault sync bookmarks
uv run tweetxvault sync likes
uv run tweetxvault sync tweets
# Force a specific browser profile for this run
uv run tweetxvault sync all --browser chrome --profile "Profile 2"
# Full re-sync from scratch (resets sync state, does not delete existing data)
uv run tweetxvault sync all --full
# Continue past duplicates without resetting state
uv run tweetxvault sync all --backfill
# Clear a saved historical backfill cursor and run only the head pass
uv run tweetxvault sync likes --head-only
# Rewalk existing pages to refresh article-bearing tweets after article fields change
uv run tweetxvault sync bookmarks --article-backfill
# Limit to N pages per collection
uv run tweetxvault sync all --limit 5
If the [embed] extra is installed, new tweets are automatically embedded after each sync on a best-effort basis; if embedding fails, sync still succeeds and you can retry later with tweetxvault embed.
--article-backfill updates stored raw_json and normalized secondary rows inline, so it does not require a follow-up tweetxvault rehydrate.
tweetxvault sync all still covers bookmarks + likes only; authored tweets stay opt-in via tweetxvault sync tweets.
--head-only is the escape hatch when an old saved backfill cursor is no longer useful: it clears that cursor for the targeted collection and runs only the normal head pass. It cannot be combined with --full, --backfill, or --article-backfill.
Importing an X archive
# Import an official X archive ZIP or extracted directory
uv run tweetxvault import x-archive ~/Downloads/twitter-archive.zip
# Clear previously imported archive-owned rows/media and reimport from scratch
uv run tweetxvault import x-archive ~/Downloads/twitter-archive.zip --regen
# Fetch TweetDetail for every remaining sparse archive tweet after the automatic bulk tweets/likes reconciliation
uv run tweetxvault import x-archive ~/Downloads/twitter-archive.zip --enrich
# Run a bounded TweetDetail follow-up after the automatic bulk tweets/likes reconciliation
uv run tweetxvault import x-archive ~/Downloads/twitter-archive --detail-lookups 100
# Sample/debug a large archive without touching the normal follow-up path
uv run tweetxvault import x-archive ~/Downloads/twitter-archive.zip --regen --debug --limit 1000
# Continue pending TweetDetail follow-up later without re-reading the archive ZIP
uv run tweetxvault import enrich
# Or run the follow-up in bounded batches
uv run tweetxvault import enrich --limit 500
The importer maps authored tweets, deleted authored tweets, likes, and exported tweets_media/ files into the same LanceDB archive used by live sync. It applies the same archive-owner guardrail as sync, runs bulk live tweets / likes reconciliation automatically when auth is available, and keeps sparse archive-only rows in a tracked pending state until you choose how much per-tweet follow-up to run. If the archive only includes a video poster/thumbnail and not the main media file, that media row stays pending so tweetxvault media download can fill the gap later. Official X archives currently do not include bookmarks, so the importer warns about the missing bookmark dataset; that warning is expected today.
TweetDetail is X's per-tweet detail API: tweetxvault uses it to fill in metadata the archive often lacks for liked tweets, such as author fields, timestamps, full media metadata, and thread context.
Import follow-up options:
- Default import does no per-tweet TweetDetail pass. It only imports the archive and runs the bulk live collection reconciliation.
- --detail-lookups N runs a bounded TweetDetail pass for at most N pending sparse tweets after the bulk live syncs.
- --enrich runs the TweetDetail pass for all currently pending sparse tweets after the bulk live syncs.
- --regen clears archive-import-owned rows, import manifests, and copied archive media files before reimporting. It leaves live-synced rows intact.
- Archive import itself works without live auth, but the automatic reconciliation and any TweetDetail follow-up only run when auth is available.
- Archive import/enrich uses a head-only live reconciliation pass; it does not resume an old saved likes/tweets backfill from the normal sync state machine.
- If the same archive digest was already imported, a plain re-run still short-circuits, but --enrich reuses the existing import and runs only the follow-up enrichment instead of re-importing the ZIP contents.
- If an import is interrupted during the archive-write phase, rerunning the same import x-archive ... command is the normal recovery path. If the archive write already completed and only the follow-up was interrupted, use tweetxvault import enrich or rerun with --enrich.
- tweetxvault import enrich reruns that same archive-specific follow-up later against already imported archive data, without needing the original ZIP or directory path again.
- tweetxvault import enrich --limit N limits the TweetDetail phase only; it still reruns the archive-specific bulk live reconciliation first.
- tweetxvault threads expand is the broader TweetDetail-based context/thread capture command; use it when you want parents, replies, and linked status URLs beyond the archive-placeholder follow-up.
- Interactive TTY runs show tqdm progress bars for hashing, tweet/like import, media copy, and detail enrichment by default.
- Non-interactive runs stay quiet by default aside from warnings/errors, so cron/piped runs do not get interactive progress output.
- --debug adds per-phase timing diagnostics on top of that interactive progress output.
- --limit N requires --debug and is a sampled diagnostic import: tweetxvault still hashes and parses the full archive files, but only imports the first N authored tweets, deleted tweets, likes, and media files after load. Sampled runs are stored as sampled, not completed, and skip the automatic live follow-up unless you explicitly ask for --enrich / --detail-lookups.
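The "same archive digest" short-circuit mentioned above relies on content hashing of the archive file. A minimal sketch of that idea (the choice of SHA-256 and streaming chunk size are assumptions for illustration; this README does not specify tweetxvault's actual digest scheme):

```python
import hashlib

def archive_digest(path, chunk_size=1 << 20):
    # Stream the archive in 1 MiB chunks so large ZIPs never need to fit
    # in memory; identical bytes always yield the identical digest, which
    # is what makes "already imported" detection possible.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```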
Importing old "Grailbird" archives (pre-2018)
Twitter archives exported before ~2018 use an older format called "Grailbird" (CSV-based, with tweets.csv in the root and monthly JS files under data/js/tweets/). These cannot be imported directly — convert them first with the shipped tweetxvault import grailbird command:
# Convert the old archive to modern format
tweetxvault import grailbird ~/TwitterArchive-2015 ~/TwitterArchive-2015-converted
# Then import normally
tweetxvault import x-archive ~/TwitterArchive-2015-converted
The converter reads tweets.csv and data/js/user_details.js (if present) and produces a modern archive directory with data/tweets.js, data/account.js, and data/manifest.js. If user_details.js is missing, tweetxvault still imports the converted archive, but it leaves the local archive owner unset so the first later authenticated sync can establish the real owner metadata instead of locking the archive to a fake placeholder id. For checkout-based use, the repo also keeps a compatibility wrapper at python convert_grailbird.py .... See docs/GRAILBIRD.md for details on what gets converted and known limitations.
Viewing your archive
# View recent bookmarks in a terminal table
uv run tweetxvault view bookmarks
# View likes, oldest first
uv run tweetxvault view likes --sort oldest
# View your authored tweets
uv run tweetxvault view tweets
# View all archived tweets
uv run tweetxvault view all --limit 50
Terminal views render tweet timestamps in your local timezone. Sort order uses tweet created_at, not collection position. For likes, that means uv run tweetxvault view likes --sort oldest shows the oldest liked tweet by tweet timestamp when one is known; X does not expose a reliable liked_at timestamp for reconstructing the exact order in which you liked posts.
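The local-timezone rendering described above amounts to a standard UTC-to-local conversion, e.g. (plain stdlib illustration, not tweetxvault's rendering code):

```python
from datetime import datetime, timezone

# Tweet created_at values arrive as UTC timestamps; converting for display
# keeps the same instant while switching to the machine's local zone.
created_at = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
local = created_at.astimezone()  # local timezone, identical instant
```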
Searching
# Search posts and articles together
uv run tweetxvault search "machine learning"
# Limit search to result types and/or collections
uv run tweetxvault search "machine learning" --type article
uv run tweetxvault search "machine learning" --type post --collection bookmark,like
# Sort search results chronologically instead of by relevance
uv run tweetxvault search "machine learning" --sort newest
uv run tweetxvault search "machine learning" --sort oldest
# Force a specific search mode
uv run tweetxvault search "llama" --mode fts
uv run tweetxvault search "llama" --mode vector
uv run tweetxvault search "llama" --mode hybrid
# Adjust result count
uv run tweetxvault search "transformer architecture" --limit 50
Search modes:
- fts — keyword matching via full-text search (always available)
- vector — semantic similarity via embeddings (requires
uv sync --extra embed+tweetxvault embed) - hybrid — combines FTS and vector results with reranking (best quality)
- auto (default) — uses hybrid for post-only searches when embeddings exist; otherwise falls back to FTS so article results stay included
Chronological search sorting reorders the returned match set by created_at after retrieval, so it preserves the selected search mode and still defaults to relevance-first candidate selection.
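That post-retrieval reordering can be sketched as follows, with hypothetical result rows standing in for real search hits (the row shape is assumed for illustration):

```python
from datetime import datetime, timezone

# Relevance-ordered candidates as they come back from FTS/vector search.
results = [
    {"id": "3", "score": 0.91, "created_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": "1", "score": 0.88, "created_at": datetime(2022, 1, 9, tzinfo=timezone.utc)},
    {"id": "2", "score": 0.75, "created_at": datetime(2023, 7, 20, tzinfo=timezone.utc)},
]

def chronological(rows, newest_first=True):
    # Reorder only the already-retrieved match set; which candidates were
    # selected in the first place stays relevance-driven.
    return sorted(rows, key=lambda r: r["created_at"], reverse=newest_first)

newest = chronological(results)
```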
Filters:
- --type — comma-delimited result types: post, article
- --collection — comma-delimited archive collections: bookmark, like, tweet
- --sort — relevance (default), newest, or oldest
Embeddings
# Generate embeddings for all archived tweets (resumes if interrupted)
uv run tweetxvault embed
# Regenerate all embeddings from scratch
uv run tweetxvault embed --regen
Uses the all-MiniLM-L6-v2 model (384 dimensions) via ONNX Runtime. The model is downloaded automatically from Hugging Face Hub on first run.
If you already generated embeddings with an older tweetxvault version, run uv run tweetxvault embed --regen once after upgrading so stored vectors are rebuilt with the current normalized cosine-search setup.
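"Normalized cosine-search setup" means stored vectors are unit-length, so cosine similarity reduces to a plain dot product. A quick pure-Python illustration of that property (not tweetxvault's code):

```python
import math

def normalize(vec):
    # Scale to unit length; after this, dot(a, b) equals cos(angle(a, b)).
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])
cosine = dot(a, b)  # valid cosine similarity because both are unit vectors
```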
Exporting
# Export all archived tweets to JSON
uv run tweetxvault export json
# Export a specific collection
uv run tweetxvault export json --collection bookmarks
uv run tweetxvault export json --collection tweets
# Export to a specific path
uv run tweetxvault export json --out ~/exports/my-bookmarks.json
# Export as a self-contained HTML viewer
uv run tweetxvault export html
uv run tweetxvault export html --collection likes --out ~/exports/likes.html
JSON exports now include normalized media, urls, and article sections alongside each exported tweet row.
HTML exports now render tweet media, URL metadata, and full article bodies when those rows exist in the archive.
Media + URL Enrichment
# Download all pending archived media files into the local data dir
uv run tweetxvault media download
# Only download photos
uv run tweetxvault media download --photos-only
# Fetch final URL, canonical URL, title, and description metadata
uv run tweetxvault unfurl
# Retry previously failed URL unfurls
uv run tweetxvault unfurl --retry-failed
Thread Expansion
# Expand archived tweets through TweetDetail to capture parents/context rows
uv run tweetxvault threads expand
# Expand a specific thread target by URL or ID
uv run tweetxvault threads expand https://x.com/dimitrispapail/status/2026531440414925307
uv run tweetxvault threads expand 2026531440414925307
# Re-fetch an explicit target even if it was already expanded before
uv run tweetxvault threads expand --refresh 2026531440414925307
Explicit thread targets are idempotent by default: previously expanded targets are skipped unless you pass --refresh. Linked status-URL targets are attempted at most once per run, even if the same target appears in multiple archived URL refs.
tweetxvault threads expand is incremental by default, so it is safe to rerun after archive import if you want more TweetDetail-based context capture later.
Article Refresh
# Refresh preview-only archived article rows via TweetDetail
uv run tweetxvault articles refresh
# Refresh every archived article row, not just preview-only ones
uv run tweetxvault articles refresh --all
# Refresh a specific article-bearing tweet by URL or ID
uv run tweetxvault articles refresh https://x.com/dimitrispapail/status/2026531440414925307
uv run tweetxvault articles refresh 2026531440414925307
Maintenance
# Show archive totals, per-collection coverage, sync recency, and storage health
uv run tweetxvault stats
# Compact the LanceDB archive (reduces file count after many syncs)
uv run tweetxvault optimize
# Rebuild normalized tweet fields and secondary objects from stored raw JSON,
# including any previously captured TweetDetail/thread-expansion payloads
uv run tweetxvault rehydrate
# Force-refresh query IDs from Twitter's JS bundles
uv run tweetxvault auth refresh-ids
tweetxvault stats reports overall post/article totals, per-collection counts plus first/last tweet timestamps, latest sync/capture times, storage health details such as DB/media size and version count with an actionable optimize hint, and follow-up queues for archive enrichment, rehydrate gaps, and pending thread expansion. The command now ends with a short legend that explains the backfill states plus the difference between archive enrich, local rehydrate gaps, membership thread targets, and linked-status thread targets.
Unattended sync via cron
# Sync bookmarks and likes every 6 hours
0 */6 * * * cd /path/to/tweetxvault && uv run tweetxvault sync all 2>> /tmp/tweetxvault.log
A process lock prevents overlapping runs.
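This kind of single-instance guard is the classic Unix advisory-lock pattern (and fits the fcntl dependency noted under Requirements). A sketch of how such a lock works; the lock-file path and exact implementation in tweetxvault are assumptions here:

```python
import fcntl

def acquire_lock(path):
    # Try to take a non-blocking exclusive lock. Returns the open file on
    # success (keep it open for the lifetime of the run); returns None if
    # another process already holds the lock.
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None
```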
Configuration
All configuration is optional. Defaults work out of the box with browser cookie extraction.
Sync tuning (config.toml or env vars)
| Setting | Default | Env var |
|---|---|---|
| sync.page_delay | 2.0 s | TWEETXVAULT_PAGE_DELAY |
| sync.max_retries | 3 | TWEETXVAULT_MAX_RETRIES |
| sync.backoff_base | 2.0 s | TWEETXVAULT_BACKOFF_BASE |
| sync.detail_max_retries | 2 | TWEETXVAULT_DETAIL_MAX_RETRIES |
| sync.detail_backoff_base | 30.0 s | TWEETXVAULT_DETAIL_BACKOFF_BASE |
| sync.cooldown_threshold | 3 consecutive 429s | TWEETXVAULT_COOLDOWN_THRESHOLD |
| sync.cooldown_duration | 300.0 s | TWEETXVAULT_COOLDOWN_DURATION |
| sync.timeout | 30.0 s | TWEETXVAULT_TIMEOUT |
TweetDetail-heavy jobs such as tweetxvault import enrich, tweetxvault threads expand,
and tweetxvault articles refresh pace themselves from X's live rate-limit headers when
those headers are present. If X does not expose reset/remaining headers for a request, the
client falls back to the shared sync.detail_* retry settings and cooldown controls.
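Header-driven pacing like the above can be sketched as follows. The header names are the standard X API rate-limit headers; the pacing formula and the None-means-fallback convention are illustrative assumptions, not tweetxvault's actual code:

```python
import time

def pace_from_headers(headers, now=None, min_delay=1.0):
    # Derive a per-request delay from X-style rate-limit headers. Returns
    # None when the headers are missing or unparsable, signalling the
    # caller to fall back to the configured sync.detail_* settings.
    now = time.time() if now is None else now
    try:
        remaining = int(headers["x-rate-limit-remaining"])
        reset = float(headers["x-rate-limit-reset"])  # epoch seconds
    except (KeyError, ValueError):
        return None
    if remaining <= 0:
        return max(reset - now, min_delay)  # window exhausted: wait for reset
    # Spread the remaining request budget over the time left in the window.
    return max((reset - now) / remaining, min_delay)
```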
Data storage
Data paths are resolved by platformdirs so they follow OS conventions, but the current runtime target is Unix-like systems only. On Linux the defaults are:
| Purpose | Default path |
|---|---|
| Config | ~/.config/tweetxvault/ |
| Archive (LanceDB) | ~/.local/share/tweetxvault/archive.lancedb/ |
| Cache (query IDs) | ~/.cache/tweetxvault/ |
Override with XDG_CONFIG_HOME, XDG_DATA_HOME, XDG_CACHE_HOME.
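On Linux, the resolution platformdirs performs for these paths boils down to honoring the XDG override when set and falling back to the home-directory default otherwise. A simplified, Linux-only sketch of that logic (platformdirs itself handles more platforms and edge cases):

```python
import os
from pathlib import Path

def xdg_dir(env_var, fallback, app="tweetxvault"):
    # XDG_* override wins when set; otherwise use the conventional default
    # under the user's home directory, then append the app subdirectory.
    base = os.environ.get(env_var) or str(Path.home() / fallback)
    return os.path.join(base, app)

config_dir = xdg_dir("XDG_CONFIG_HOME", ".config")
data_dir = xdg_dir("XDG_DATA_HOME", ".local/share")
cache_dir = xdg_dir("XDG_CACHE_HOME", ".cache")
```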
How it works
tweetxvault calls Twitter's internal GraphQL API — the same endpoints the web app uses. It:
- Resolves session cookies (env/config/browser extraction)
- Discovers current GraphQL query IDs by parsing Twitter's JS bundles (with a 24h TTL cache and static fallbacks)
- Fetches timeline pages with the proper headers, feature flags, and cursor pagination
- Stores raw API responses + collection tweet rows + normalized secondary objects in a single LanceDB table
- Tracks sync state per collection so the next run picks up where it left off
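The last three steps can be sketched as a checkpointed pagination loop. This is illustrative only (a toy in-memory store instead of LanceDB, and an assumed page shape), but it shows why committing data and cursor together makes the sync safe to kill mid-run:

```python
def sync_collection(fetch_page, store, cursor=None):
    # Each page's tweets and the advancing cursor are committed in one
    # atomic write, so an interrupted run resumes from the last stored
    # checkpoint instead of restarting from scratch.
    while True:
        page = fetch_page(cursor)  # one raw timeline page
        if not page["tweets"]:
            break
        store.write_atomic(raw=page,
                           tweets=page["tweets"],
                           cursor=page["next_cursor"])
        cursor = page["next_cursor"]
        if cursor is None:
            break
    return cursor

class MemoryStore:
    # Toy stand-in for the LanceDB-backed archive.
    def __init__(self):
        self.pages, self.cursor = [], None
    def write_atomic(self, raw, tweets, cursor):
        self.pages.append(tweets)
        self.cursor = cursor
```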
Development
uv sync --extra embed
# Run tests
uv run pytest
# Lint and format
uv run ruff check
uv run ruff format --check
See docs/ for architecture docs, the implementation plan, and research notes.
Similar projects
- twitter-web-exporter — Browser extension (Tampermonkey/Violentmonkey) that intercepts Twitter's GraphQL responses in-page; exports bookmarks, likes, tweets, followers, and DMs to JSON/CSV/HTML with bulk media download
- tweethoarder — Python CLI archiver for likes, bookmarks, tweets, reposts, and home feed into SQLite with JSON/Markdown/CSV/HTML export
- Siftly — Self-hosted AI bookmark manager (Next.js + SQLite + Anthropic API) with entity extraction, vision analysis, and mindmap visualization
- TweetVault (helioLJ) — Self-hosted bookmark archive (Go + Next.js + PostgreSQL) with tag management; imports via twitter-web-exporter ZIP
- twitter-likes-export — Minimal Python scripts to export likes via Twitter's GraphQL API with optional media download
- download_twitter_likes — Playwright-based media downloader that scrolls your likes page and saves images/GIFs/videos
License
Apache 2.0