
Web Scraping Library Based on Selenium

Project description

RACCY

OVERVIEW

Raccy is a multithreaded web scraping library built on Selenium. It can be used for web automation, web scraping, and data mining.

REQUIREMENTS

  • Python 3.7+
  • Works on Linux, Windows, and Mac

ARCHITECTURE OVERVIEW

  • UrlDownloaderWorker: responsible for downloading the URLs of the items to be scraped and enqueuing them in ItemUrlQueue.

  • ItemUrlQueue: receives item URLs from UrlDownloaderWorker and queues them up for the CrawlerWorker(s).

  • CrawlerWorker: fetches item web pages, extracts data from them, and enqueues the data in DatabaseQueue.

  • DatabaseQueue: receives scraped item data from CrawlerWorker(s) and queues it up for the DatabaseWorker.

  • DatabaseWorker: receives scraped data from DatabaseQueue and stores it in a persistent database.
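The pipeline above can be sketched with only the standard library. This is an illustrative model of the data flow, not raccy's actual internals: three threads play the roles of UrlDownloaderWorker, CrawlerWorker, and DatabaseWorker, connected by two queues standing in for ItemUrlQueue and DatabaseQueue.

```python
import queue
import threading

url_queue = queue.Queue()   # plays the role of ItemUrlQueue
db_queue = queue.Queue()    # plays the role of DatabaseQueue
results = []                # plays the role of the persistent database


def url_downloader():
    # UrlDownloaderWorker: discover item URLs and enqueue them
    for page in range(1, 4):
        url_queue.put(f"https://example.com/page/{page}")
    url_queue.put(None)  # sentinel: no more URLs


def crawler():
    # CrawlerWorker: "fetch" each URL and push extracted data downstream
    while (url := url_queue.get()) is not None:
        db_queue.put({"url": url, "data": f"scraped from {url}"})
    db_queue.put(None)  # propagate the sentinel


def database_worker():
    # DatabaseWorker: persist each record (here: append to a list)
    while (record := db_queue.get()) is not None:
        results.append(record)


workers = [threading.Thread(target=f)
           for f in (url_downloader, crawler, database_worker)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(results))  # all three records flowed through the pipeline
```

Because each stage only talks to a queue, stages can be scaled independently (e.g. several CrawlerWorker threads draining one ItemUrlQueue), which is what makes the design multithreaded-friendly.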

INSTALL

pip install raccy

TUTORIAL

from raccy import (
    UrlDownloaderWorker, CrawlerWorker, DatabaseWorker, WorkersManager
)
import ro as model
from selenium import webdriver
from shutil import which

config = model.Config()
config.DATABASE = model.SQLiteDatabase('quotes.sqlite3')


class Quote(model.Model):
    quote_id = model.PrimaryKeyField()
    quote = model.TextField()
    author = model.CharField(max_length=100)


class UrlDownloader(UrlDownloaderWorker):
    start_url = 'https://quotes.toscrape.com/page/1/'
    max_url_download = 10

    def job(self):
        url = self.driver.current_url
        self.url_queue.put(url)
        self.follow(xpath="//a[contains(text(), 'Next')]", callback=self.job)


class Crawler(CrawlerWorker):

    def parse(self, url):
        # Note: the Selenium 3-style locators below were removed in
        # Selenium 4, which uses driver.find_elements(By.XPATH, ...).
        self.driver.get(url)
        quotes = self.driver.find_elements_by_xpath("//div[@class='quote']")
        for q in quotes:
            quote = q.find_element_by_xpath(".//span[@class='text']").text
            author = q.find_element_by_xpath(".//span/small").text

            data = {
                'quote': quote,
                'author': author
            }
            self.log.info(data)
            self.db_queue.put(data)


class Db(DatabaseWorker):

    def save(self, data):
        Quote.objects.create(**data)


def get_driver():
    # Resolve the ChromeDriver binary; adjust the path on Linux/Mac.
    driver_path = which('.\\chromedriver.exe')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument("--start-maximized")
    # executable_path is the Selenium 3 argument; Selenium 4 passes a
    # Service(driver_path) object instead.
    driver = webdriver.Chrome(executable_path=driver_path, options=options)
    return driver


if __name__ == '__main__':
    manager = WorkersManager()
    manager.add_driver(get_driver)
    manager.start()
    print('Done scraping...........')
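Once the run finishes, the scraped rows live in quotes.sqlite3 and can be read back with the standard library's sqlite3 module. The table and column layout below mirrors the Quote model above, but the exact table name raccy generates is an assumption; check the schema of quotes.sqlite3 if yours differs. The sketch is demonstrated against an in-memory database seeded with one sample row so it runs standalone.

```python
import sqlite3

# Use 'quotes.sqlite3' here to read the real scraped database; ':memory:'
# keeps this example self-contained.
conn = sqlite3.connect(':memory:')

# Assumed schema mirroring the Quote model (table name is hypothetical).
conn.execute("""CREATE TABLE quote (
    quote_id INTEGER PRIMARY KEY,
    quote    TEXT,
    author   VARCHAR(100)
)""")
conn.execute("INSERT INTO quote (quote, author) VALUES (?, ?)",
             ("So many books, so little time.", "Frank Zappa"))

rows = conn.execute("SELECT author, quote FROM quote").fetchall()
for author, quote in rows:
    print(f"{author}: {quote}")
conn.close()
```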

Author

  • Afriyie Daniel

Hope you enjoy using it!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distribution

raccy-2.0.0-py3-none-any.whl (18.1 kB)

Uploaded Python 3

File details

Details for the file raccy-2.0.0-py3-none-any.whl.

File metadata

  • Download URL: raccy-2.0.0-py3-none-any.whl
  • Upload date:
  • Size: 18.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.7.8

File hashes

Hashes for raccy-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7b6d94ae059630da8eee4c28737cc8ebfa3817a87bff73fb62be7b4ea0b03f23
MD5 3a059082dfc95bff172cabdc2f3b3925
BLAKE2b-256 2488bb399f066a4e4cfcee591718d32cfe8a5b8205a90ddc727aa331db8942f3

