
Asynchronous Cloudflare scraper and crawler

Project description

cfcrawler


Asynchronous Cloudflare scraper and crawler, built as a drop-in replacement for HTTPX. Crawl websites that have Cloudflare enabled, easier than ever!

This library is an HTTP client designed to crawl websites protected by Cloudflare, even when their bot detection system is active. If you're already using httpx, you can switch to this library easily since it's a drop-in replacement: just change the import, and you're good to go.
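
For example, if your existing code already uses httpx, the switch can be as small as changing the import; the helper name below is just illustrative:

# before:
# from httpx import AsyncClient
# after:
from cfcrawler import AsyncClient

async def fetch(url):
    # same call style as httpx
    client = AsyncClient()
    return await client.get(url)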

Installation

To install the base package:

pip install cfcrawler

To install the fake-useragent backend for rotating user agents:

pip install cfcrawler[ua]

Getting started

To use the library, simply replace your httpx client with ours! It's a drop-in replacement :)

from cfcrawler import AsyncClient

async def get(url):
    client = AsyncClient()
    response = await client.get(url)
    return response
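
Since the client is asynchronous, call the helper above from an event loop. A minimal way to do that (the URL is just an example, and the response object is assumed to mirror httpx's):

import asyncio

response = asyncio.run(get("https://example.com"))
print(response.status_code)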

By default, we use one random user agent, which is still undetected in our tests. To rotate the user agent, you need to explicitly call the rotate_useragent method. Out of the box there is a small pool of user agents, but you can also install the optional extra dependency to get a much larger pool to rotate through.

from cfcrawler import AsyncClient

client = AsyncClient(use_fake_useragent_library=True) # one random user agent is selected
# do something
client.rotate_useragent() # user agent is rotated
# do something
client.rotate_useragent() # user agent is rotated again
# do something
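
For example, in a simple crawl loop you might rotate the user agent after every request; the URLs below are placeholders:

import asyncio
from cfcrawler import AsyncClient

async def crawl(urls):
    client = AsyncClient(use_fake_useragent_library=True)
    for url in urls:
        response = await client.get(url)
        # ... process the response here ...
        client.rotate_useragent()  # pick a new user agent for the next request

asyncio.run(crawl(["https://example.com/a", "https://example.com/b"]))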

You can also specify which browser you want to use:

from cfcrawler.types import Browser
from cfcrawler import AsyncClient

client = AsyncClient(browser=Browser.CHROME)

If you wish to have your own user agent pool, you can pass a factory callable to the client:

from cfcrawler import AsyncClient

def my_useragent_factory():
    return "My User Agent" # your implementation

client = AsyncClient(user_agent_factory=my_useragent_factory)
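
For instance, the factory could cycle through a fixed pool of strings; the values below are placeholders, not real browser user agents:

from itertools import cycle
from cfcrawler import AsyncClient

# Placeholder pool; fill with real browser user-agent strings.
_user_agents = cycle([
    "UserAgent-A",
    "UserAgent-B",
    "UserAgent-C",
])

def my_useragent_factory():
    return next(_user_agents)

client = AsyncClient(user_agent_factory=my_useragent_factory)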

You can also use asyncer to syncify the implementation:

from cfcrawler import AsyncClient
from asyncer import syncify

def get(url):
    client = AsyncClient()
    return syncify(client.get)(url)

How this library works

The Problem

Websites using Cloudflare often rely on a bot detection mechanism that works by checking your TLS Fingerprint. So, what's a TLS Fingerprint? When you connect to a website over HTTPS, the first thing that happens is the exchange of a "Client Hello" message. This message tells the server some basic info about your client, like the TLS versions you support and the list of cipher suites you offer, among other things.

What's a Cipher Suite?

A cipher suite is a set of cryptographic algorithms that the client and server use to establish a secure connection. Each browser or client has its own specific list of cipher suites, and their order is unique. For instance, Chrome has its own list, Firefox has another, and Python's requests library has a completely different one.
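
You can see Python's own cipher-suite list (and its order) with the standard-library ssl module; a real browser advertises a different list in a different order, which is exactly what makes the fingerprint distinctive:

import ssl

context = ssl.create_default_context()
for cipher in context.get_ciphers():
    print(cipher["name"])  # e.g. "TLS_AES_256_GCM_SHA384"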

The Detection

Cloudflare figures out if you're not a real browser by comparing your TLS Fingerprint—which is a combination of the TLS version and cipher suite order—with your user-agent. If there's a mismatch, like if your user-agent says you're Chrome but your cipher suites suggest you're a Python script, Cloudflare knows you're not a browser and blocks the request.
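
Conceptually, the check boils down to: look up the fingerprint you would expect for the browser named in the user-agent, then compare it with the fingerprint observed in the Client Hello. A simplified, purely hypothetical sketch (the fingerprint values are made up):

# Hypothetical illustration only; real systems use JA3-style fingerprints.
EXPECTED_FINGERPRINTS = {
    "chrome": "cd08e31494f9531f560d64c695473da9",   # made-up value
    "firefox": "b20b44b18b853ef29ab773e921b03422",  # made-up value
}

def looks_like_a_bot(user_agent: str, observed_fingerprint: str) -> bool:
    for browser, expected in EXPECTED_FINGERPRINTS.items():
        if browser in user_agent.lower():
            # The user-agent claims this browser, but the TLS fingerprint disagrees.
            return observed_fingerprint != expected
    return True  # user-agent doesn't match any known browser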

How My Library Helps

This library handles that problem by aligning your TLS Fingerprint with your user-agent, making it harder for Cloudflare to detect that you're not a real browser. The best part? It's just 10 lines of code! In the source, check out cfcrawler/tls.py to see how it works.
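
The general technique (not the library's exact code; see cfcrawler/tls.py for that) is to build an ssl.SSLContext whose cipher-suite order matches the browser named in the user-agent, and hand it to the HTTP client. A rough sketch with httpx, where the cipher string is illustrative rather than a real browser's list:

import ssl
import httpx

# Illustrative cipher order only; a real implementation mirrors the target browser.
BROWSER_LIKE_CIPHERS = (
    "ECDHE-ECDSA-AES128-GCM-SHA256:"
    "ECDHE-RSA-AES128-GCM-SHA256:"
    "ECDHE-ECDSA-AES256-GCM-SHA384"
)

context = ssl.create_default_context()
context.set_ciphers(BROWSER_LIKE_CIPHERS)

# httpx accepts an SSLContext, so the TLS fingerprint and the User-Agent
# header can be kept consistent with each other.
client = httpx.AsyncClient(verify=context)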

Coming Next

  1. CF JS Challenge solver
  2. Captcha solver integration (2Captcha, etc.)


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cfcrawler-0.1.0.tar.gz (9.4 kB)

Uploaded Source

Built Distribution

cfcrawler-0.1.0-py3-none-any.whl (8.3 kB)

Uploaded Python 3

File details

Details for the file cfcrawler-0.1.0.tar.gz.

File metadata

  • Download URL: cfcrawler-0.1.0.tar.gz
  • Upload date:
  • Size: 9.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for cfcrawler-0.1.0.tar.gz

  • SHA256: e063b4cee5acda3c13d44f2d34cdae9490e7a6c47195d6e87b029dd841a7d638
  • MD5: 7a6e5852408c414d384255f605727d3e
  • BLAKE2b-256: abfb19c11ecb30a68fd774f927a1e9b1fad8449a33710fd3c0cc0628fdc7883d

See more details on using hashes here.

File details

Details for the file cfcrawler-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: cfcrawler-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 8.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.5

File hashes

Hashes for cfcrawler-0.1.0-py3-none-any.whl

  • SHA256: d816838bbffb5ec94ef4918fc1ddadf89fcbb6de435f4c0794f0c362e269113f
  • MD5: 4b23f2090739f36ae618050487e6bb52
  • BLAKE2b-256: 393ae230de878e17d19318646e966488c6b24984d43ae07e4c700375841c5ca5

See more details on using hashes here.
