X (Twitter) data crawling built on Camoufox, intended for personal use.

X-CrawlFox 🦊

A free, highly anonymous, human-like scraping CLI tool for X (Twitter) and search engines.

🌐 English | 中文


🚀 Key Features

Free and highly customizable, with incremental crawling and built-in human-like behavior to evade anti-bot protection.

  • Human-like Interaction: Integrates Camoufox fingerprint obfuscation to simulate real human scrolling, random delays, and typing interactions, significantly reducing the risk of detection.
  • Timeline Scraping: Supports crawling "Following" and "For you" feeds with configurable item limits.
  • Deep News Scraping: Automatically scrapes the "Today's News" sidebar, with support for clicking into details to extract Grok summaries and related popular posts.
  • Incremental Account Monitoring: Supports multi-account monitoring with automatic tracking of the last crawled tweet ID to only fetch new content.
  • One-click Composite Tasks: Launch composite tasks (Timeline, News, Monitoring, Search) via a unified JSON configuration file.
  • Automatic State Management: Automatically saves login sessions (Cookie) and crawling progress (Crawler State).
  • Multi-Search Engine Support: Supports 18 different search engines, including Google, Bing, Baidu, Brave, DuckDuckGo, and more.

📦 Quick Start

Installation

  1. Install from PyPI:

    pip install x-crawlfox
    
  2. Build from source: This project uses uv for package management.

    git clone https://github.com/Jiutwo/x-crawlfox.git
    cd x-crawlfox
    uv sync
    

How to Use

1. Initialize Config Directory

Before first use, run the following command to generate the .x-crawlfox configuration folder and default settings in the current directory:

x-crawlfox init

# To save the configuration to the user home directory (Global Mode):
x-crawlfox init --global

2. Account Login or Cookie Export (Required)

You must have a logged-in session (Cookie) before scraping.

Note: Scraping immediately with a newly registered account is risky; it is recommended to use the account normally for a while first.

Method 1: Export via Cookie Editor Extension (Recommended)

Use the browser extension Cookie Editor to export your current session cookies as JSON and save them to .x-crawlfox/x_cookies.json.

The .x-crawlfox folder can be located in the current directory or the user home directory. X-CrawlFox will automatically recognize and convert the Cookie Editor format to the required internal format upon loading.
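As a rough illustration of what that conversion involves, the sketch below maps a Cookie Editor export (a JSON array of cookie objects) to a Playwright-style cookie list. The Cookie Editor field names (`expirationDate`, `sameSite` values like `no_restriction`) follow its documented export format; the target shape is an assumption about what the Camoufox/Playwright browser layer expects, not the tool's actual internal code.

```python
import json

# Maps Cookie Editor sameSite values to Playwright's "None"/"Lax"/"Strict"
_SAMESITE_MAP = {"no_restriction": "None", "lax": "Lax", "strict": "Strict"}

def convert_cookie_editor(cookies):
    """Convert a Cookie Editor JSON export (list of cookie dicts) into a
    Playwright-style cookie list. The output shape is an assumption about
    the crawler's internal format, for illustration only."""
    converted = []
    for c in cookies:
        converted.append({
            "name": c["name"],
            "value": c["value"],
            "domain": c.get("domain", ".x.com"),
            "path": c.get("path", "/"),
            # Cookie Editor exports "expirationDate"; Playwright uses "expires"
            "expires": c.get("expirationDate", -1),
            "httpOnly": c.get("httpOnly", False),
            "secure": c.get("secure", True),
            "sameSite": _SAMESITE_MAP.get(
                str(c.get("sameSite", "lax")).lower(), "Lax"),
        })
    return converted
```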

Method 2: Command Line Login

x-crawlfox x login

Complete the login in the popup browser window, then return to the terminal and press Enter to save the state. The login state will be automatically saved to .x-crawlfox/x_cookies.json.

If X blocks the login as a "suspicious attempt," please switch to Method 1.

3. Scrape Personal Timeline

# Scrape the first 20 items from the Following feed
# Add --no-headless to visualize the process
x-crawlfox x timeline --type Following --max-items 20

# Scrape the For You feed
x-crawlfox x timeline --type "For you" --max-items 50

4. Scrape Today's News

# Scrape sidebar list only
x-crawlfox x news

# Deep scraping: Enter details to get summaries and related posts
x-crawlfox x news --detail --max-items 3

5. Scrape/Monitor Specific User

# Fetch the latest 20 tweets from a specific user
x-crawlfox x user elonmusk --max-tweets 20

# Incremental fetch: Only get new content since the last run
x-crawlfox x user elonmusk --only-new

Run multi-account monitoring independently (reads x.monitor from crawl_config.json):

x-crawlfox x monitor

You can also specify a custom config file (flat list format):

x-crawlfox x monitor --config my_accounts.json
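A flat-list config holds the same per-account objects that appear under `x.monitor` in `crawl_config.json` (see the composite-task example later in this README); a `my_accounts.json` along those lines might look like:

```json
[
    { "username": "elonmusk", "only_new": true, "max_tweets": 10 },
    { "username": "OpenAI",   "only_new": true, "max_tweets": 10 }
]
```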

6. Search Engine Scraping

X-CrawlFox supports scraping search results from 18 search engines (8 CN + 10 Global) via the se subcommand. No login is required.

Single engine search

# Fast mode: navigate directly to the search URL (default)
x-crawlfox se search "LangGraph" --engine google --max-results 10

# Simulate mode: open homepage and type like a human (better anti-detection)
x-crawlfox se search "LangGraph" --engine google --mode simulate

# Time filter: hour | day | week | month | year
x-crawlfox se search "AI news" --engine bing --time-range day

# Domain restriction
x-crawlfox se search "python async" --engine google --site github.com

# File type filter
x-crawlfox se search "machine learning" --engine baidu --filetype pdf

# Exact phrase match
x-crawlfox se search "anything" --engine duckduckgo --exact-phrase "large language model"

# Disable headless mode (useful when bot detection is triggered)
x-crawlfox se search "隐私工具" --engine qwant --no-headless

Multi-engine search — query multiple engines in one run and merge results into a single .jsonl file:

x-crawlfox se multi "AI Agent" --engines google,bing,duckduckgo --max-results 10
x-crawlfox se multi "量化投资" --engines baidu,sogou,jisilu,wechat
x-crawlfox se multi "rust async" --engines google,bing --time-range month

Available engines

Region   Engines
CN       baidu, bing-cn, bing-int, 360, sogou, wechat, toutiao, jisilu
Global   google, google-hk, bing, duckduckgo, yahoo, startpage, brave, ecosia, qwant, wolframalpha

Results are saved as .jsonl to the output/ directory (e.g. output/se_google_LangGraph_20260419_120000.jsonl).
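Since each `.jsonl` file holds one JSON object per line, results are easy to post-process. A minimal loader (the per-record field names, such as `title` or `url`, depend on the engine and are not assumed here):

```python
import json

def load_results(path):
    """Read a .jsonl results file: one JSON object per non-empty line."""
    results = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                results.append(json.loads(line))
    return results
```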


7. One-click Composite Tasks

Edit .x-crawlfox/crawl_config.json, then run:

x-crawlfox x all

You can also specify a different config file path via --config:

x-crawlfox x all --config /path/to/crawl_config.json

Example crawl_config.json format:

{
    "global": {
        "output_dir": "output",
        "headless": true
    },
    "x": {
        "timeline": [
            { "type": "For you",   "max_scrolls": 2, "max_items": 10 },
            { "type": "Following", "max_scrolls": 3, "max_items": 10 }
        ],
        "news": {
            "enabled": true,
            "detail": true,
            "max_items": 5
        },
        "monitor": [
            { "username": "elonmusk", "only_new": true, "max_tweets": 10 },
            { "username": "OpenAI",   "only_new": true, "max_tweets": 10 }
        ]
    }
}

📂 Storage & Configuration (.x-crawlfox)

To protect privacy and support persistence, X-CrawlFox uses the .x-crawlfox folder to store sensitive data:

  1. Storage Location:

    • Local Mode: The program first checks if .x-crawlfox exists in the current working directory. If found, all data is stored here (ideal for account isolation).
    • Global Mode: If the local directory does not exist, it defaults to ~/.x-crawlfox in the user home directory (Windows: %USERPROFILE%\.x-crawlfox).
  2. Stored Content:

    • x_cookies.json: Stores X login cookies and auth tokens. Do not share this file.
    • crawl_config.json: Unified configuration file for the all and monitor commands.
    • x_crawl_state.json: Stores the last tweet ID fetched for each monitored account to enable incremental fetching.
  3. Output Location: All scraping results are saved in .jsonl format in the output/ directory for easy analysis or database import.
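For orientation, a state file mapping each monitored account to its last-seen tweet ID might look like the following. This layout is purely illustrative; the actual keys in `x_crawl_state.json` may differ:

```json
{
    "elonmusk": { "last_tweet_id": "1780000000000000000" },
    "OpenAI":   { "last_tweet_id": "1779990000000000000" }
}
```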


🙏 Acknowledgments

This project is deeply inspired by the open-source community and integrates excellent open-source projects such as Camoufox. Sincere thanks to all the open-source libraries and developers who provide foundational support for this project.


⚠️ Disclaimer

This tool is for educational and research purposes only. Please comply with the X (Twitter) Terms of Service. The developers are not responsible for any account restrictions or legal issues resulting from the use of this tool.
