
Get authentic, ready-to-use browser headers with ease



Generate ready-to-use headers for a web request with the least chance of being blocked by anti-scraping techniques applied by websites. The headers are based on the most common headers sent by browsers and operating systems, and are ordered correctly (servers check the ordering, even though the web standards say it should not be considered). If you just need real-world user agents, check out my simple-useragent package.


 


 

Features

  • Authentic: All header values, combinations and their ordering are verified to work with most web servers.
  • Complete: Generates all Sec-Ch-Ua, Sec-Ch-Ua-Mobile, Sec-Fetch-Dest, ... headers for Chrome-based browsers.
  • Powerful: Pass your own user agent in or use the convenience functions to get common, real-world user agents.
  • Wide Support: Covers almost all common user agents on Windows, macOS, Linux, Android and iOS: Google Chrome, Firefox, Safari, Edge, Opera, Whale and QQ.
  • Lightweight: Designed to consume minimal system resources and optimized for performance.
  • Simple: Easy to use and understand with a clean and simple API.
  • Compatible: Supports Python 3.8 and above. Runs on Windows, macOS and Linux.
  • Tested: Has 99% test coverage and is continuously tested.
  • Open Source: Provides transparency and allows community contributions for continuous development.

 

Installation

Just install the package from PyPI using pip:

 pip install simple-header

 

Usage

Quickstart

Just import the package and use the get_dict() convenience function.

 import simple_header as sh

 sh.get_dict(url="https://www.example.com/cat/pics.html")
 # {'User-Agent': 'Mozilla/5.0 ...', 'Host': 'www.example.com', 'Sec-Ch-Ua': '"Not A(Brand";v="99", ...', ...}
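
The returned dictionary can be passed straight to your HTTP client; a minimal sketch, assuming the third-party requests library is installed (it is not a dependency of this package):

 import requests
 import simple_header as sh

 url = "https://www.example.com/cat/pics.html"
 # The generated headers are sent with the actual request.
 response = requests.get(url, headers=sh.get_dict(url=url))
 print(response.status_code)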

 

Advanced Usage

Import the package and use the full-fledged get() function. For a detailed explanation of the function parameters, see Settings and Parameters.

 import simple_header as sh

 # Get a Header instance with a random mobile user agent to scrape the desired url.
 header = sh.get(url="https://www.example.com/cat/pics.html", mobile=True)
 header.dict
 # {'User-Agent': 'Mozilla/5.0 ...', 'Host': 'www.example.com', 'Connection': 'keep-alive', ...}
 
 # Access more attributes of the Header instance (just a few examples).
 header.connection  # 'keep-alive'
 header.referer  # 'https://www.example.com'  <- url without path
 header.user_agent.string  # 'Mozilla/5.0 ...'  <- randomly chosen user agent
 header.user_agent.os  # 'Windows'
 header.sec_ch_ua  # '"Not A(Brand";v="99", "Microsoft Edge";v="108", "Chromium";v="108"'
 header.sec_fetch_mode  # ['navigate', 'same-origin', 'cors']  <- multiple values possible (list of strings)
 
 # Overwrite auto language detection (.com = 'en-US' -> 'de-DE') and set custom seed.
 header = sh.get(url="https://www.example.com/cat/pics.html", language="de-DE", seed=3)
 header.referer  # 'https://www.web.de/'  <- referer from pool of common German websites
 header.accept_language  # 'de-DE,de;q=0.5'  <- language set to German
 
 sh.get(url="https...com", user_agent="Mozilla/5.0 ...")  # Header instance with given user agent string.
 # Header('Mozilla/5.0 ...', 'https...com', 'keep-alive', ...)
 
 ua = sh.sua.get(num=2, mobile=True)  # List of the two most common mobile user agents as UserAgent instances.
 sh.get(url="https...com", user_agent=ua[0])  # Header instance with the previously fetched UserAgent instance passed.
 # Header('Mozilla/5.0 ...', 'https...com', 'keep-alive', ...)

 

You can also get more than one Header instance at once with the get_list() function. The get_dict() function returns a dictionary with the headers, directly usable in a request.

 # Get a list of 10 Header instances, each with the passed user agent string.
 sh.get_list(url="https...com", user_agent="Mozilla/5.0 ...", num=10)
 # [Header(...), Header(...), ...]
 
 sh.get_dict(url="https://www.example.com/cat/pics.html") # Dictionary with just the headers.
 # {'User-Agent': 'Mozilla/5.0 ...', 'Host': 'www.example.com', 'Connection': 'keep-alive', ...} 
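
If you send many requests, the Header instances from get_list() can be rotated so each request uses a different plausible combination; a minimal sketch, assuming the requests library and hypothetical target pages:

 import requests
 import simple_header as sh

 base_url = "https://www.example.com"
 pages = [f"{base_url}/cat/pics{i}.html" for i in range(5)]  # hypothetical pages
 headers = sh.get_list(url=base_url, num=len(pages))

 for page, header in zip(pages, headers):
     # Each request gets its own header combination via the .dict attribute.
     response = requests.get(page, headers=header.dict)
     print(page, response.status_code)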

 

You can also fetch user agents directly. For a full explanation, check the simple-useragent package.

 # Fetch a specified number of random mobile user agent instances.
 sh.sua.get(num=2, shuffle=True, mobile=True)
 # [UserAgent('Mozilla/5.0 (iPhone ...'), UserAgent('Mozilla/5.0 (iPhone; ...')]

 sh.sua.get_list(force_cached=True)  # List of all available desktop user agents as strings.
 # ['Mozilla/5.0 ...', 'Mozilla/5.0 ...', 'Mozilla/5.0 ...', ...]
  
 sh.sua.get_dict()  # Dictionary with all desktop and mobile user agents.
 # {'desktop': ['Mozilla/5.0 ...', ...], 'mobile': ['Mozilla/5.0 (iPhone ...', ...]}

 

The UserAgent instance offers attributes for the user agent properties. You can also access the properties with dictionary syntax.

 # Parse a custom string directly to the UserAgent class and access its attributes.
 obj = sh.sua.parse('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36')
 obj.string  # 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit ...'
 obj.browser  # 'Chrome', 'Firefox', 'Safari', 'Edge', 'IE', 'Opera', 'Whale', 'QQ Browser', 'Samsung Browser', 'Other'
 obj.browser_version  # '110', '109', '537', ...
 obj.browser_version_minor  # '0', '1', '36', ...
 obj['os']  # 'Windows', 'macOS', 'Linux', 'Android', 'iOS', 'Other'
 obj['os_version']  # '10', '7', '11', '14', ...
 obj['os_version_minor']  # '0', '1', '2', ...
 obj['mobile']  # True / False

 

Settings and Parameters

The functions can take the following parameters (a combined example follows the list):

  • url: The url of the website you want to scrape.
  • language: The language of the website you want to scrape or where the request is made from (default: None = auto-detect).
  • user_agent: A custom user agent string or a UserAgent instance to use for header generation (default: None = random user agent).
  • mobile: If no user_agent is passed: Generate a mobile or desktop user agent (default: False = desktop).
  • seed: The random seed for referer selection and header value combinations (default: None = most plausible values chosen, max: 720).
  • num: The number of Header instances to fetch; only used by the get_list() function (default: 10, max: 720).
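
A call that combines most of these parameters might look like the following; a minimal sketch with illustrative values only:

 import simple_header as sh

 headers = sh.get_list(
     url="https://www.example.de/shop/items.html",  # illustrative German URL
     language="de-DE",  # overwrite the auto-detection based on the top-level domain
     mobile=True,       # generate mobile user agents (ignored if user_agent is passed)
     seed=42,           # select a specific referer/header-value combination
     num=5,             # number of Header instances (get_list() only)
 )
 headers[0].accept_language  # 'de-DE,de;q=0.5'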

 

Notes:

  • The src/simple_header/inspect_headers.py file contains a commented-out Flask app to validate which headers your browser or scraper sends.
  • The language auto-detection is based on the top-level domain of the url. You can overwrite it with the language parameter, by giving it a language (e.g. 'de-DE') or a country code (e.g. 'de'). Fallback for unknown or non-country domains (.org, .dev, ...) is 'en-US'.
  • For each language there is a pool of common websites, which is used to pick a plausible referer. The url to scrape, stripped of its path, is also used as a referer (e.g. 'https://www.example.com/cat/pics.html' -> 'https://www.example.com'). The referer makes the request look more realistic, as if the user were browsing between different pages of the website.
  • The seed parameter is used to set the random seed for referer selection and header values (if multiple are available). This is useful if your request got blocked by the server, so you can try again with another seed (see the retry sketch after these notes). There are around 720 different combinations/seeds possible.
  • The order of the headers is important, as most servers and bot-detectors check for that, even if the web standards say it should not be considered. I manually tested for every browser and OS which headers are sent and in which order.
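
A minimal retry sketch for the seed mechanism, assuming the requests library and a hypothetical rule that treats HTTP 403 as "blocked":

 import requests
 import simple_header as sh

 url = "https://www.example.com/cat/pics.html"

 for seed in range(1, 6):  # cycle through a handful of the ~720 possible seeds
     header = sh.get(url=url, seed=seed)
     response = requests.get(url, headers=header.dict)
     if response.status_code != 403:  # hypothetical block detection
         break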

 

Development

As an open-source project, I strive for transparency and collaboration in my development process. I greatly appreciate any contributions members of our community can provide. Whether you are fixing bugs, proposing features, improving documentation, or spreading awareness - your involvement strengthens the project. Please review the code of conduct to understand how we work together respectfully.

 

Contributors

Thank you so much for giving feedback, implementing features and improving the code and project!


 

Credits

Full credits are in the ACKNOWLEDGMENTS file.

 

License

Provided under the terms of the GNU GPLv3 License © Lennart Haack 2024.

See LICENSE file for details. For the licenses of used third party libraries and software, please refer to the ACKNOWLEDGMENTS file.
