Small customizable multiprocessing multi-proxy crawler.

A highly customizable crawler that uses multiprocessing and proxies to download one or more websites according to user-supplied filter, search, and save functions.

REMEMBER THAT DDOS IS ILLEGAL. DO NOT USE THIS SOFTWARE FOR ILLEGAL PURPOSES.

Installing TinyCrawler

pip install tinycrawler

Preview (Test case)

This is a preview of the console when running test_base.py.

preview

Usage example

from typing import Optional

from bs4 import BeautifulSoup
from tinycrawler import TinyCrawler


def url_validator(url: str) -> bool:
    """Return whether the page at the given url should be downloaded."""
    return "http://www.example.com/my/path" in url


def file_parser(request_url: str, text: str, logger: 'Log') -> Optional[str]:
    """Return parsed downloaded page as a text document to be saved.
        request_url: str, the url of given downloaded page
        text: str, the content of the page
        logger: 'Log', a logger for reporting errors or information

        Return None if the page should not be saved.
    """

    # The 'lxml' parser requires the lxml package to be installed.
    soup = BeautifulSoup(text, 'lxml')

    example = soup.find("div", {"class": "example"})
    if example is None:
        return None

    return example.get_text()


my_crawler = TinyCrawler(
    use_cli=True,  # True to use the command line interface, False otherwise
    directory="my_path_for_website"  # Directory where the website is saved
)

my_crawler.load_proxies("path/to/my/proxies.json")
my_crawler.set_url_validator(url_validator)
my_crawler.set_file_parser(file_parser)

my_crawler.run("http://www.example.com/my/path/index.html")
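The url validator above accepts any url that contains the target string anywhere, including inside a query string on another host. A stricter variant could parse the url first. The following is only a sketch: `strict_url_validator`, `ALLOWED_HOST`, and `ALLOWED_PATH_PREFIX` are hypothetical names chosen for illustration, and accepting `https` in addition to `http` is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical values for illustration; adjust them to the site being crawled.
ALLOWED_HOST = "www.example.com"
ALLOWED_PATH_PREFIX = "/my/path"


def strict_url_validator(url: str) -> bool:
    """Return True only for urls on the allowed host and under the allowed path."""
    parts = urlsplit(url)
    # Assumption: both http and https pages are acceptable.
    if parts.scheme not in ("http", "https"):
        return False
    # Compare the parsed host, so the target string appearing in a query does not match.
    if parts.hostname != ALLOWED_HOST:
        return False
    # Require a path boundary so "/my/pathology" is not treated as "/my/path".
    path = parts.path
    return path == ALLOWED_PATH_PREFIX or path.startswith(ALLOWED_PATH_PREFIX + "/")
```

Since it has the same signature as `url_validator`, it should be usable in the same way, for example `my_crawler.set_url_validator(strict_url_validator)`.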

Proxies are expected to be in the following format:

[
  {
    "ip": "89.236.17.108",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  },
  {
    "ip": "128.199.141.151",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  }
]
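Before handing a proxies file to the crawler, it may help to check that it matches the format shown above. The sketch below uses only the standard library and is not part of TinyCrawler: `validate_proxies` and `load_proxy_file` are hypothetical helpers, and restricting `type` to `http` and `https` is an assumption based on the example entries.

```python
import json
from typing import Any, List

# Assumption: only the proxy types that appear in the example are valid.
ALLOWED_TYPES = {"http", "https"}


def validate_proxies(proxies: Any) -> List[dict]:
    """Raise ValueError if proxies does not match the documented format."""
    if not isinstance(proxies, list):
        raise ValueError("proxies must be a JSON array")
    for i, proxy in enumerate(proxies):
        if not isinstance(proxy, dict):
            raise ValueError(f"entry {i}: expected an object")
        ip, port, types = proxy.get("ip"), proxy.get("port"), proxy.get("type")
        if not isinstance(ip, str) or not ip:
            raise ValueError(f"entry {i}: 'ip' must be a non-empty string")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(port, bool) or not isinstance(port, int) or not 0 < port < 65536:
            raise ValueError(f"entry {i}: 'port' must be an integer in 1..65535")
        if not isinstance(types, list) or not types or not all(
            isinstance(t, str) and t in ALLOWED_TYPES for t in types
        ):
            raise ValueError(f"entry {i}: 'type' must be a non-empty list of {sorted(ALLOWED_TYPES)}")
    return proxies


def load_proxy_file(path: str) -> List[dict]:
    """Read a proxies JSON file and validate its contents."""
    with open(path, encoding="utf-8") as handle:
        return validate_proxies(json.load(handle))
```

Calling `load_proxy_file("path/to/my/proxies.json")` before `my_crawler.load_proxies(...)` would surface a malformed entry early, with the index of the offending entry in the error message.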

License

The software is released under the MIT license.
