Skip to main content

Small customizable multiprocessing multi-proxy crawler.

Project description

travis sonar_quality sonar_maintainability sonar_coverage Maintainability pip

A small crawler that uses multiprocessing and arbitrarily many proxies to download one or more websites following a given filter, search and save functions.

REMEMBER THAT DDOS IS ILLEGAL. DO NOT THIS SOFTWARE FOR ILLEGAL PORPOSES.

Installing TinyCrawler

pip install tinycrawler

Usage example

from tinycrawler import TinyCrawler
from bs4 import BeautifulSoup


def url_validator(url):
    if "http://www.example.com/my/path" not in url:
        return False

    return True


def file_parser(url, text, logger):
    return None
    soup = BeautifulSoup(text, 'lxml')

    example = soup.find("div", {"class": "example"})
    if example is None:
        return None

    return example.get_text()


my_crawler = TinyCrawler(
    seed="http://www.example.com/my/path/index.html"
)

my_crawler.load_proxies("path/to/my/proxies.json")
my_crawler.set_url_validator(url_validator)
my_crawler.set_file_parser(file_parser)

my_crawler.run("http://www.example.com/my/path/index.html")

Proxies are expected to be in the following format:

[
  {
    "ip": "89.236.17.108",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  },
  {
    "ip": "128.199.141.151",
    "port": 3128,
    "type": [
      "https",
      "http"
    ]
  }
]

License

The software is released under the MIT license.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tinycrawler-1.0.0.tar.gz (11.1 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page