A multi-threaded, open source web crawler

Project description


  • Use multiple threads to visit web pages

  • Extract web page data using XPath expressions or CSS selectors

  • Extract urls from a web page and visit extracted urls

  • Write extracted data to an output file

  • Set HTTP session parameters such as: cookies, SSL certificates, proxies

  • Set HTTP request parameters such as: header, body, authentication

  • Download files from the urls

  • Supports Python 2 and Python 3
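
To illustrate the first feature, here is a minimal sketch of visiting several pages with a pool of worker threads. It uses only the Python 3 standard library and is not xcrawler's actual implementation; `fetch` and `visit_all` are hypothetical names, and `fetch` returns fake content instead of making an HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    # Hypothetical stand-in for an HTTP request; returns fake page content.
    return "<html>content of %s</html>" % url


def visit_all(urls, number_of_threads=3):
    # Each url is handed to one of the worker threads.
    # pool.map() returns results in the same order as the input urls.
    with ThreadPoolExecutor(max_workers=number_of_threads) as pool:
        return list(pool.map(fetch, urls))


pages = visit_all(["page-a", "page-b", "page-c"])
```

Because the work is I/O bound in a real crawler, threads can overlap the time spent waiting on network responses.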


pip install xcrawler

When installing the lxml library on Windows, you may encounter "Microsoft Visual C++ is required" errors.
To install the lxml library on Windows:
  1. Download and install Microsoft Windows SDK:

  2. Click the Start Menu, search for and open the command prompt:

    • For Python 2.6, 2.7, 3.0, 3.1, 3.2: CMD Shell

    • For Python 3.3, 3.4: Windows SDK 7.1 Command Prompt

  3. Install lxml

setenv /x86 /release && SET DISTUTILS_USE_SDK=1 && set STATICBUILD=true && pip install lxml


A page scraper extracts data and urls from a web page. It does so through the following methods:
  • extract returns the data extracted from a web page
  • visit returns the next Pages to be visited
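
The way the two methods cooperate can be sketched with a small single-threaded crawl loop. This is an illustration under stated assumptions, not xcrawler's code: `SITE`, `StubPage`, `TitleScraper` and `crawl` are hypothetical stand-ins, and the "site" is an in-memory dictionary instead of real web pages.

```python
from collections import deque

# Hypothetical in-memory "site": each url maps to the links found on that page.
SITE = {"index": ["a", "b"], "a": [], "b": ["a"]}


class StubPage:
    """Stand-in for a page: a url, the scraper that handles it, and its links."""

    def __init__(self, url, scraper):
        self.url = url
        self.scraper = scraper
        self.links = SITE[url]


class TitleScraper:
    def extract(self, page):
        # Return the data scraped from this page.
        return "title of " + page.url

    def visit(self, page):
        # Return the next pages to be visited.
        return [StubPage(link, TitleScraper()) for link in page.links]


def crawl(start_pages):
    queue = deque(start_pages)
    seen = {page.url for page in start_pages}
    results = []
    while queue:
        page = queue.popleft()
        results.append(page.scraper.extract(page))
        for next_page in page.scraper.visit(page):
            if next_page.url not in seen:  # do not visit the same url twice
                seen.add(next_page.url)
                queue.append(next_page)
    return results


results = crawl([StubPage("index", TitleScraper())])
```

Here each page carries its own scraper, so different kinds of pages can be handled by different scrapers, as in the CSS example.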

The crawler can be configured before it crawls web pages. Configurable settings include:
  • the number of threads used to visit web pages
  • the name of the output file
  • the request timeout
To run the crawler call:

Examples of how to use xcrawler can be found at:

XPath Example

from xcrawler import XCrawler, Page, PageScraper

class Scraper(PageScraper):
    def extract(self, page):
        topics = page.xpath("//a[@class='question-hyperlink']/text()")
        return topics

start_pages = [ Page("", Scraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "stackoverflow_example_crawler_output.csv"

CSS Example

from xcrawler import XCrawler, Page, PageScraper

class StackOverflowItem:
    def __init__(self):
        self.title = None
        self.votes = None
        self.tags = None
        self.url = None

class UrlsScraper(PageScraper):
    def visit(self, page):
        hrefs = page.css_attr(".question-summary h3 a", "href")
        urls = page.to_urls(hrefs)
        return [Page(url, QuestionScraper()) for url in urls]

class QuestionScraper(PageScraper):
    def extract(self, page):
        item = StackOverflowItem()
        item.title = page.css_text("h1 a")[0]
        item.votes = page.css_text(".question .vote-count-post")[0].strip()
        item.tags = page.css_text(".question .post-tag")[0]
        item.url = page.url
        return item

start_pages = [ Page("", UrlsScraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "stackoverflow_css_crawler_output.csv"
crawler.config.number_of_threads = 3
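
The example above converts extracted href values to urls before visiting them. Assuming this conversion means resolving relative hrefs against the url of the current page (an assumption, not confirmed by the text), the idea can be sketched with the Python 3 standard library. `to_absolute_urls` is a hypothetical helper and the example.com urls are placeholders.

```python
from urllib.parse import urljoin


def to_absolute_urls(base_url, hrefs):
    # Relative hrefs are resolved against base_url; absolute hrefs are kept.
    return [urljoin(base_url, href) for href in hrefs]


urls = to_absolute_urls(
    "https://example.com/questions/tagged/python",
    ["/questions/1/title", "https://example.com/questions/2"],
)
```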

File Example

from xcrawler import XCrawler, Page, PageScraper

class WikimediaItem:
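    def __init__(self):
        self.name = None  # file name; the attribute name "name" is assumed
        self.base64 = None

class EncodedScraper(PageScraper):
    def extract(self, page):
        url = page.xpath("//div[@class='fullImageLink']/a/@href")[0]
        item = WikimediaItem()
        item.name = url.split("/")[-1]  # last path segment of the file url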
        item.base64 = page.file(url)
        return item

start_pages = [ Page("", EncodedScraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "wikimedia_file_example_output.csv"
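
The field name in the example above suggests that the downloaded file is stored as base64 text. Assuming that is the case, the original bytes could be recovered from the output with the standard library as sketched below. `decode_file` is a hypothetical helper and the sample bytes are made up for illustration.

```python
import base64


def decode_file(encoded):
    # Turn base64 text back into the raw bytes of the file.
    return base64.b64decode(encoded)


# Made-up sample data standing in for a value read from the output file.
encoded = base64.b64encode(b"\x89PNG fake image bytes").decode("ascii")
raw = decode_file(encoded)
```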

Session Example

from xcrawler import XCrawler, Page, PageScraper
from requests.auth import HTTPBasicAuth

class Scraper(PageScraper):
    def extract(self, page):
        return str(page)

start_pages = [ Page("", Scraper()) ]
crawler = XCrawler(start_pages)
crawler.config.output_file_name = "router_session_example_output.csv"
crawler.config.session.headers = {"User-Agent": "Custom User Agent",
                                  "Accept-Language": "fr"}
crawler.config.session.auth = HTTPBasicAuth('admin', 'admin')
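
For context, HTTP Basic authentication sends the credentials in an Authorization header as base64 of "username:password". The sketch below builds that header value by hand with the standard library to show what goes over the wire; `basic_auth_header` is a hypothetical helper, not part of xcrawler or requests.

```python
import base64


def basic_auth_header(username, password):
    # The Basic scheme joins the credentials with a colon and base64-encodes them.
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header = basic_auth_header("admin", "admin")
# header == "Basic YWRtaW46YWRtaW4="
```

Base64 is an encoding, not encryption, so Basic authentication should only be used over a trusted or encrypted connection.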

Request Example

from xcrawler import XCrawler, Page, PageScraper

class Scraper(PageScraper):
    def extract(self, page):
        return str(page)

start_page = Page("", Scraper())
start_page.request.cookies = {"theme": "classic"}
crawler = XCrawler([start_page])
crawler.config.request_timeout = (5, 5)
crawler.config.output_file_name = "router_request_example_output.csv"


For more information about xcrawler, see the source code and its Python docstrings.
The documentation can also be accessed at runtime with Python’s built-in help function:
>>> import xcrawler
>>> help(xcrawler.Config)
    # Information about the Config class
>>> help(xcrawler.PageScraper.extract)
    # Information about the extract method of the PageScraper class


GNU GPL v2.0


Source Distribution

xcrawler-1.3.0.tar.gz (30.0 kB view details)

Uploaded Source

File details

Details for the file xcrawler-1.3.0.tar.gz.

File metadata

  • Download URL: xcrawler-1.3.0.tar.gz
  • Upload date:
  • Size: 30.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for xcrawler-1.3.0.tar.gz
Algorithm    Hash digest
SHA256       2536c3a903384fc727f35b4a73ab6a1b9e153a9e5586eabfa0b13bee55a7e2c4
MD5          e0585a2c4d97eb13d70e631a706d2719
BLAKE2b-256  e622c42c3907f45bc36c3e8c04f8ec95eed62659e7c720e4db0933ad5e120a4d
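
A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using the standard library; `sha256_of_file` and `matches_expected` are hypothetical helper names, and the path passed to them is whatever location the file was saved to.

```python
import hashlib

# SHA256 digest listed above for xcrawler-1.3.0.tar.gz.
EXPECTED_SHA256 = "2536c3a903384fc727f35b4a73ab6a1b9e153a9e5586eabfa0b13bee55a7e2c4"


def sha256_of_file(path, chunk_size=65536):
    # Read the file in chunks so large files do not have to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    # True only if the file's digest equals the expected digest.
    return sha256_of_file(path) == expected
```

For example, `matches_expected("xcrawler-1.3.0.tar.gz")` should return True for an intact download saved under that name in the current directory.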
