Undetected web-scraping & seamless HTML parsing in Python!

The Easiest Way to Scrape the Web

Requires Python 3.10+.

Features

  • Realistic HTTP Requests:
    • Mimics Chrome browser for undetected scraping using curl_cffi
    • Automatically rotates User Agents between requests
    • Tracks and updates the Referer header to simulate realistic request chains
    • Built-in retry logic for failed requests (e.g. 429, 503, 522)
  • Faster and Easier Parsing:
    • Extract emails, phone numbers, images, and links from responses
    • Automatically extract metadata (title, description, author, etc.) from HTML-based responses
    • Seamlessly convert responses into Lxml and BeautifulSoup objects for further parsing
    • Easily convert full or specific sections of HTML to Markdown

Install

$ pip install stealth_requests

Sending Requests

Stealth-Requests mimics the API of the requests package, allowing you to use it in nearly the same way.

You can send one-off requests like this:

import stealth_requests as requests

resp = requests.get('https://link-here.com')

Or you can use a StealthSession object, which keeps track of certain headers, such as the Referer header, between requests.

from stealth_requests import StealthSession

with StealthSession() as session:
    resp = session.get('https://link-here.com')
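
To picture what tracking the Referer header between requests means, here is a minimal standard-library sketch. It is not how StealthSession is implemented; the RefererChain class and its headers_for method are hypothetical names used only for illustration.

```python
class RefererChain:
    """Remember the last URL visited and offer it as the next Referer.

    Hypothetical helper for illustration only, not part of Stealth-Requests.
    """

    def __init__(self):
        self.last_url = None

    def headers_for(self, url):
        # The first request has no previous page, so no Referer is sent.
        headers = {}
        if self.last_url is not None:
            headers["Referer"] = self.last_url
        # The page being requested becomes the Referer of the next request.
        self.last_url = url
        return headers
```

Each call returns the headers for one request and then records that URL, so a sequence of calls forms a chain of pages that each refer back to the previous one.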

Stealth-Requests has a built-in retry feature that automatically waits 2 seconds and retries the request if it fails due to certain status codes (like 429, 503, etc.).

To enable retries, just pass the number of retry attempts using the retry argument:

import stealth_requests as requests

resp = requests.get('https://link-here.com', retry=3)
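
The wait-and-retry behaviour described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's actual code: fetch_with_retry, RETRY_STATUSES, and the send callable are hypothetical, and the status set simply reuses the codes named in the feature list (429, 503, 522).

```python
import time

# Status codes treated as retryable in this sketch (from the feature list above).
RETRY_STATUSES = {429, 503, 522}


def fetch_with_retry(send, retry=0, delay=2.0, sleep=time.sleep):
    """Call send() once, then retry up to `retry` times on a retryable status.

    `send` is any zero-argument callable returning an object with a
    `status_code` attribute (a hypothetical stand-in for a real request).
    """
    resp = send()
    for _ in range(retry):
        if resp.status_code not in RETRY_STATUSES:
            break
        sleep(delay)  # fixed wait between attempts
        resp = send()
    return resp
```

Passing sleep as a parameter keeps the sketch easy to test without actually waiting.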

Sending Requests With Asyncio

Stealth-Requests supports asyncio through AsyncStealthSession, which is used like StealthSession but with async/await:

import asyncio
from stealth_requests import AsyncStealthSession

async def main():
    async with AsyncStealthSession() as session:
        resp = await session.get('https://link-here.com')

asyncio.run(main())
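
A common reason to use an async session is to fetch several pages concurrently. The sketch below assumes only that session.get is awaitable, as in the example above; fetch_all is a hypothetical helper, not part of the package.

```python
import asyncio


async def fetch_all(session, urls):
    """Fetch several URLs concurrently through one shared async session.

    `session` is any object with an awaitable get(url) method.
    Results come back in the same order as `urls`.
    """
    return await asyncio.gather(*(session.get(url) for url in urls))
```

Inside an async with AsyncStealthSession() as session: block, this would be called as responses = await fetch_all(session, urls).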

Accessing Page Metadata

The response returned from this package is a StealthResponse, which has all of the same methods and attributes as a standard requests response object, plus a few added features. One of these is automatic parsing of metadata from the <head> of HTML-based responses. It is exposed through the meta property, which provides the following fields:

  • title: str | None
  • author: str | None
  • description: str | None
  • thumbnail: str | None
  • canonical: str | None
  • twitter_handle: str | None
  • keywords: tuple[str, ...] | None
  • robots: tuple[str, ...] | None

Here's an example of how to get the title of a page:

import stealth_requests as requests

resp = requests.get('https://link-here.com')
print(resp.meta.title)
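
For readers curious where such fields come from, here is a rough standard-library sketch that pulls a few of them out of an HTML document. It is an assumption about the general technique, not the package's parser; HeadMetaParser and parse_head are hypothetical names, and only title, description, author, and keywords are handled.

```python
from html.parser import HTMLParser


class HeadMetaParser(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name, content = attrs.get("name"), attrs.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data


def parse_head(html):
    """Return a dict of a few metadata fields; missing ones are None."""
    parser = HeadMetaParser()
    parser.feed(html)
    keywords = parser.meta.get("keywords")
    return {
        "title": parser.title.strip() if parser.title else None,
        "description": parser.meta.get("description"),
        "author": parser.meta.get("author"),
        # Comma-separated keywords become a tuple, like the field list above.
        "keywords": tuple(k.strip() for k in keywords.split(",")) if keywords else None,
    }
```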

Extracting Emails, Phone Numbers, Images, and Links

The StealthResponse object includes some helpful properties for extracting common data:

import stealth_requests as requests

resp = requests.get('https://link-here.com')

print(resp.emails)
# Output: ('info@example.com', 'support@example.com')

print(resp.phone_numbers)
# Output: ('+1 (800) 123-4567', '212-555-7890')

print(resp.images)
# Output: ('https://example.com/logo.png', 'https://cdn.example.com/banner.jpg')

print(resp.links)
# Output: ('https://example.com/about', 'https://example.com/contact')
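
As a rough idea of how this kind of extraction can work, the sketch below finds email-like strings with a regular expression. The pattern and the find_emails name are illustrative assumptions; the package may use a different pattern and different rules.

```python
import re

# Deliberately simple pattern: local part, "@", domain, dot, TLD of 2+ letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(text):
    """Return unique email-like strings in order of first appearance."""
    # dict.fromkeys removes duplicates while preserving order.
    return tuple(dict.fromkeys(EMAIL_RE.findall(text)))
```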

Extracting HTML Tables

The StealthResponse object can parse HTML tables into dictionaries, where each key is a column header and the value is a list of that column's cell values.

For example, given a page with this table:

Name    Age
Jacob   30
Jake    25

You can extract it like this:

import stealth_requests as requests

resp = requests.get('https://link-here.com')

# Each table becomes a dict: {column_name: [values]}
for table in resp.tables:
    print(table)
# Output: {'Name': ['Jacob', 'Jake'], 'Age': ['30', '25']}

Tables without recognizable headers are automatically skipped.
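
The column-oriented dictionary format can be illustrated with a small standard-library parser. This is a sketch of the idea only, not the package's implementation: TableParser and parse_tables are hypothetical names, headers are taken from th cells, nested tables are not handled, and tables with no th cells are skipped.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Turn each <table> with <th> headers into {header: [column values]}."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._headers = None   # header texts of the table being read
        self._rows = None      # data rows of the table being read
        self._row = None       # cells of the row being read
        self._cell = None      # text pieces of the cell being read
        self._cell_tag = None  # "th" or "td"

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._headers, self._rows = [], []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell, self._cell_tag = [], tag

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell).strip()
            if self._cell_tag == "th":
                self._headers.append(text)
            else:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header-only rows carry no data cells
                self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            if self._headers:  # skip tables without headers
                self.tables.append({
                    header: [row[i] for row in self._rows if i < len(row)]
                    for i, header in enumerate(self._headers)
                })
            self._headers = self._rows = None


def parse_tables(html):
    parser = TableParser()
    parser.feed(html)
    return parser.tables
```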

More Parsing Options

To make parsing HTML faster, I've also added two popular parsing packages to Stealth-Requests: Lxml and BeautifulSoup4. To use these add-ons, you need to install the parsers extra:

$ pip install 'stealth_requests[parsers]'

To get an Lxml tree, use resp.tree(); to get a BeautifulSoup object, use resp.soup().

For simple parsing, I've also added the following convenience methods, from the Lxml package, right into the StealthResponse object:

  • text_content(): Get all text content in a response
  • xpath(): Evaluate XPath expressions directly, without building your own Lxml tree

Converting Responses to Markdown

In some cases, it’s easier to work with a webpage in Markdown format rather than HTML. After making a GET request that returns HTML, you can use the resp.markdown() method to convert the response into a Markdown string, providing a simplified and readable version of the page content!

markdown() has two optional parameters:

  1. content_xpath: An XPath expression, given as a string, that narrows down which part of the page is converted to Markdown. This is useful if you don't want the header and footer of a webpage in the output.
  2. ignore_links: A boolean that tells Html2Text whether to omit links from the Markdown output.
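
A small wrapper shows how the two parameters might be combined. It assumes both can be passed as keyword arguments; article_markdown and the '//article' expression are hypothetical choices for illustration, and the right XPath depends on the page being scraped.

```python
def article_markdown(resp, xpath="//article"):
    """Convert only the nodes matched by `xpath` to Markdown, without links.

    `resp` is expected to be a StealthResponse (or any object with a
    compatible markdown() method). The default XPath is a guess at where
    a page keeps its main content.
    """
    return resp.markdown(content_xpath=xpath, ignore_links=True)
```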

Using Proxies

Stealth-Requests supports proxy usage through a proxies dictionary argument, similar to the standard requests package.

You can pass both HTTP and HTTPS proxy URLs when making a request:

import stealth_requests as requests

proxies = {
    "http": "http://username:password@proxyhost:port",
    "https": "http://username:password@proxyhost:port",
}

resp = requests.get('https://link-here.com', proxies=proxies)
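
Proxy credentials that contain characters such as @ or : need to be percent-encoded before they are placed in a URL. The helper below is a hypothetical convenience (build_proxies is not part of Stealth-Requests) that builds a proxies dictionary in the shape shown above using only the standard library.

```python
from urllib.parse import quote


def build_proxies(host, port, username=None, password=None):
    """Build a requests-style proxies dict for an HTTP proxy.

    Credentials are percent-encoded so that characters like '@' and ':'
    do not break the URL.
    """
    auth = ""
    if username is not None:
        auth = quote(username, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    url = f"http://{auth}{host}:{port}"
    # The same proxy URL is used for both HTTP and HTTPS traffic.
    return {"http": url, "https": url}
```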

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests.

Before submitting a pull request, please format your code with Ruff: uvx ruff format stealth_requests/
