Crawling the web made easy.

These details have not been verified by PyPI

Project links

Homepage

License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

BitCrawler

What is it?

Bitcrawler is a Python package for crawling and scraping the web. It aims to bring simplicity, speed, and extensibility to any crawling project, and it can be extended to add crawling behavior and functionality for specific use cases.

Installation

pip install bitcrawler

Documentation

See the documentation at https://bitcrawler.readthedocs.io/en/latest/bitcrawler.html#bitcrawler for more details on usage.

Example Crawler

Crawling begins by fetching the supplied start URL. The crawler then follows links discovered on each page until it reaches the specified crawl depth or runs out of links.
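To make the depth limit concrete, here is a minimal sketch of a depth-limited, breadth-first traversal over a toy in-memory link graph. It is not bitcrawler's implementation: the function name `crawl_order`, the `LINKS` dictionary, the breadth-first order, and the convention that the start URL sits at depth 0 are all assumptions made for illustration.

```python
from collections import deque


def crawl_order(start_url, links, max_depth):
    """Return URLs in the order a breadth-first, depth-limited crawl visits them.

    `links` maps a URL to the list of URLs found on that page (hypothetical
    stand-in for fetching and parsing real pages).
    """
    seen = {start_url}
    order = []
    queue = deque([(start_url, 0)])  # The start URL is treated as depth 0.
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # Do not follow links beyond the depth limit.
        for link in links.get(url, []):
            if link not in seen:  # Visit each URL at most once.
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages.
LINKS = {
    "http://test.com": ["http://test.com/a", "http://test.com/b"],
    "http://test.com/a": ["http://test.com/c"],
    "http://test.com/c": ["http://test.com/d"],
}

print(crawl_order("http://test.com", LINKS, max_depth=1))
```

With a depth limit of 1, only the start page and the pages it links to directly are visited; raising the limit to 2 also reaches `http://test.com/c`.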

A bitcrawler.webpage.Webpage instance is returned for each page fetched. For details on its members, see the Webpage class documentation (https://bitcrawler.readthedocs.io/en/latest/bitcrawler.html#module-bitcrawler.webpage).

Simple Usage

from bitcrawler.crawler import Crawler

crawler = Crawler()
# Returns a list of bitcrawler.webpage.Webpage objects.
# See the Webpage class for more details on its members.
crawled_pages = crawler.crawl('http://test.com')
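As a rough sketch of what could be done with the returned list, the helper below collects each page's URL and status code. The `url` and `response` attributes appear in the Advanced Usage example; `status_code` is an assumption that `response` behaves like a `requests` response object, and the helper name `summarize` and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def summarize(pages):
    """Return (url, status_code) pairs; status_code is None when no response exists."""
    rows = []
    for page in pages:
        response = page.response
        # Assumption: the response exposes `status_code`, as requests responses do.
        status = response.status_code if response is not None else None
        rows.append((page.url, status))
    return rows


# Stand-in objects so the sketch runs without network access or bitcrawler installed.
fake_pages = [
    SimpleNamespace(url="http://test.com", response=SimpleNamespace(status_code=200)),
    SimpleNamespace(url="http://test.com/missing", response=None),
]
print(summarize(fake_pages))
```

If the assumptions hold, passing `crawled_pages` from the snippet above in place of `fake_pages` should work the same way.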

Advanced Usage

The example below subclasses Crawler and overrides the parse method. The parse method is always called at the end of crawling and is passed all the pages fetched. Here, each page is parsed with BeautifulSoup and its title is printed alongside its URL.

from bs4 import BeautifulSoup
from bitcrawler.crawler import Crawler
from bitcrawler import webpage

class MyCrawler(Crawler):
    # parse is always called by the `crawl` method and is provided
    # a webpage.Webpage class instance for each URL.
    # See the webpage.Webpage class for details about the object.
    def parse(self, webpages):
        for page in webpages:
            # Only parse when a response exists, its status is not an error
            # (`response.ok`), and the document is HTML. The '' default
            # avoids an AttributeError when the Content-Type header is missing.
            if page.response and \
               page.response.ok and \
               page.response.headers.get('content-type', '').startswith('text/html'):
                soup = BeautifulSoup(page.response.text, "html.parser")
                print(page.url, "- ", soup.title)
        return webpages

# Initializes the crawler with the configuration specified by parameters.
crawler = MyCrawler(
    user_agent='python-requests', # The User Agent to use for all requests.
    crawl_delay=0, # Number of seconds to wait between web requests.
    crawl_depth=2, # The max depth from following links (Default is 5).
    cross_site=False, # If true, domains other than the original domain can be crawled.
    respect_robots=True, # If true, the robots.txt standard will be followed.
    respect_robots_crawl_delay=True, # If true, the robots.txt crawl delay will be followed.
    multithreading=True, # If true, parallelizes requests for faster crawling.
    max_threads=100, # If multithreading is true, this determines the number of threads.
    webpage_builder=webpage.WebpageBuilder, # Advanced Usage - Allows the WebpageBuilder class to be overridden to allow modification.
    request_kwargs={'timeout': 10}, # Additional keyword arguments that you would like to pass into any request made.
    reppy_cache_capacity=100, # The number of robots.txt objects to cache. Eliminates the need to fetch robots.txt file many times.
    reppy_cache_policy=None, # Advanced Usage - See docs for details.
    reppy_ttl_policy=None, # Advanced Usage - See docs for details.
    reppy_args=tuple()) # Advanced Usage - See docs for details.

# Crawls pages starting from "http://test.com"
# Returns a list of bitcrawler.webpage.Webpage objects.
# See the Webpage class for more details on its members.
crawled_pages = crawler.crawl(
    url="http://test.com", # The start URL to crawl from.
    allowed_domains=[], # A list of allowed domains. `cross_site` must be True. Ex. ['python.org',...]
    disallowed_domains=[], # A list of disallowed domains. `cross_site` must be True and `allowed_domains` empty.
    page_timeout=10) # The amount of time before a page retrieval/build times out.
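The comments on `cross_site`, `allowed_domains`, and `disallowed_domains` describe how the three settings interact. The sketch below is one reading of those comments, not the library's code: the function `is_domain_allowed` is hypothetical, it compares exact host names only (no subdomain matching), and it assumes that a non-empty allow list also excludes the start domain unless that domain is listed.

```python
from urllib.parse import urlparse


def is_domain_allowed(url, start_url, cross_site=False,
                      allowed_domains=(), disallowed_domains=()):
    """Decide whether `url` may be crawled under the documented domain options."""
    domain = urlparse(url).netloc.lower()
    start_domain = urlparse(start_url).netloc.lower()
    if not cross_site:
        # Without cross_site, only the original domain is crawled.
        return domain == start_domain
    if allowed_domains:
        # An allow list takes precedence; the deny list is ignored.
        return domain in allowed_domains
    # With no allow list, everything except the denied domains is crawled.
    return domain not in disallowed_domains


start = "http://test.com"
print(is_domain_allowed("http://python.org/about", start))
print(is_domain_allowed("http://python.org/about", start,
                        cross_site=True, allowed_domains=["python.org"]))
```

The first call stays on the original domain and rejects the URL; the second accepts it because `cross_site` is enabled and `python.org` is on the allow list.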

Release history

- 0.1.0 (Mar 9, 2021), this version
- 0.0.4 (Feb 27, 2021)
- 0.0.3 (Feb 21, 2021)
- 0.0.2 (Feb 21, 2021)
- 0.0.1 (Feb 21, 2021)

Download files

Source Distribution

bitcrawler-0.1.0.tar.gz (12.6 kB)

Uploaded Mar 9, 2021 Source

Built Distribution

bitcrawler-0.1.0-py3-none-any.whl (18.3 kB)

Uploaded Mar 9, 2021 Python 3

File details

Details for the file bitcrawler-0.1.0.tar.gz.

File metadata

Download URL: bitcrawler-0.1.0.tar.gz
Upload date: Mar 9, 2021
Size: 12.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.2

File hashes

Hashes for bitcrawler-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`facbce097a6b2789bd8365a876cc5d0ead83bdef56f85a82ea2ef0402650709c`
MD5	`ae124499ffbe294493de8b1f8b4965b5`
BLAKE2b-256	`f84cafe26d55ff4946a0423db3c095338af0571ffdad5c3600708f78253d84a0`

File details

Details for the file bitcrawler-0.1.0-py3-none-any.whl.

File metadata

Download URL: bitcrawler-0.1.0-py3-none-any.whl
Upload date: Mar 9, 2021
Size: 18.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.25.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.59.0 CPython/3.8.2

File hashes

Hashes for bitcrawler-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d7cfb319ce8774e6d00fcb399a3db03a540e9e737ca50e7add0a88fcafc00ad7`
MD5	`a5fa31b03b597e51b7fb7c4e66774b95`
BLAKE2b-256	`603c879562428435a3a53c9131a53e139cc6ccc8e852eeae9869ef7df313e314`
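One way to use the digests listed above is to recompute the SHA256 of a downloaded file and compare it with the published value. The sketch below uses only the standard library; the helper names `sha256_of` and `matches` are hypothetical, and the file path passed in is assumed to point at a local copy of the download.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("bitcrawler-0.1.0.tar.gz", "facbce097a6b2789bd8365a876cc5d0ead83bdef56f85a82ea2ef0402650709c")` would check a local copy of the source distribution against the SHA256 value listed for it.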
