Web Scraping Library Based on Selenium
RACCY
OVERVIEW
Raccy is a multithreaded web scraping library based on Selenium with a built-in ORM. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. Currently, the ORM supports only SQLite databases.
REQUIREMENTS
- Python 3.7+
- Works on Linux and Windows
ARCHITECTURE OVERVIEW
- UrlDownloaderWorker: responsible for downloading the URLs of items to be scraped and enqueuing them in ItemUrlScheduler
- ItemUrlScheduler: receives item URLs from UrlDownloaderWorker and enqueues them for feeding to CrawlerWorker
- CrawlerWorker: fetches item web pages, scrapes or extracts data from them, and enqueues the data in DatabaseScheduler
- DatabaseScheduler: receives scraped item data from CrawlerWorker(s) and enqueues it for feeding to DatabaseWorker
- DatabaseWorker: receives scraped data from DatabaseScheduler and stores it in a persistent database
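Together, these components form a producer/consumer pipeline over thread-safe queues. The following is a minimal conceptual sketch of that hand-off, using only the Python standard library; the queue and function names here are illustrative stand-ins, not part of raccy's API:

import queue
import threading

url_queue = queue.Queue()   # stands in for ItemUrlScheduler
data_queue = queue.Queue()  # stands in for DatabaseScheduler

def url_downloader():
    # UrlDownloaderWorker: discover item URLs and enqueue them
    for n in range(1, 4):
        url_queue.put('https://example.com/page/%d' % n)
    url_queue.put(None)  # sentinel: no more URLs

def crawler():
    # CrawlerWorker: fetch each URL, extract data, enqueue it
    while True:
        url = url_queue.get()
        if url is None:
            break
        data_queue.put({'url': url, 'quote': '...'})
    data_queue.put(None)  # sentinel: no more data

def database_worker():
    # DatabaseWorker: persist each scraped item
    while True:
        item = data_queue.get()
        if item is None:
            break
        print('saving', item)

threads = [threading.Thread(target=fn)
           for fn in (url_downloader, crawler, database_worker)]
for t in threads:
    t.start()
for t in threads:
    t.join()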
INSTALL
pip install raccy
TUTORIAL
from raccy import (
    model, UrlDownloaderWorker, CrawlerWorker, DatabaseWorker
)
from raccy.utils.driver import next_btn_handler, close_driver
from selenium import webdriver
from shutil import which

# register a SQLite database file with the ORM
config = model.Config()
config.DATABASE = model.SQLiteDatabase('quotes.sqlite3')

# each Model subclass maps to a database table; Meta.db_name sets the table name
class Quote(model.Model):
    quote_id = model.PrimaryKeyField()
    quote = model.TextField()
    author = model.CharField(max_length=100)

    class Meta:
        db_name = 'quote_table'

class UrlDownloader(UrlDownloaderWorker):
    start_url = 'https://quotes.toscrape.com/page/1/'
    urls_scraped = 0

    def job(self):
        # enqueue the current page URL, then click through to the next page
        while True:
            url = self.driver.current_url
            self.scheduler.put(url)
            next_btn_handler(self.driver, "//a[contains(text(), 'Next')]")
            if self.urls_scraped > 10:
                self.log.info('Closing................')
                break
            # the counter is shared state, so guard the increment with the mutex
            with self.mutex:
                self.urls_scraped += 1
        close_driver(self.driver, self.log)

class Crawler(CrawlerWorker):
    def parse(self, url):
        self.driver.get(url)
        # find_element(s)_by_xpath is the Selenium 3 API; Selenium 4 replaces
        # it with find_element(s)(By.XPATH, ...)
        quotes = self.driver.find_elements_by_xpath("//div[@class='quote']")
        for q in quotes:
            quote = q.find_element_by_xpath(".//span[@class='text']").text
            author = q.find_element_by_xpath(".//span/small").text
            data = {
                'quote': quote,
                'author': author
            }
            self.log.info(data)
            # hand the scraped item off to the DatabaseScheduler
            self.db_scheduler.put(data)

class Db(DatabaseWorker):
    def save(self, data):
        # persist each scraped item through the ORM
        Quote.objects.create(**data)

def get_driver():
    # expects chromedriver.exe in the working directory (Windows-style path)
    driver_path = which('.\\chromedriver.exe')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument("--start-maximized")
    # executable_path is the Selenium 3 API; Selenium 4 uses a Service object
    driver = webdriver.Chrome(executable_path=driver_path, options=options)
    return driver

if __name__ == '__main__':
    workers = []

    urldownloader = UrlDownloader(get_driver())
    urldownloader.start()
    workers.append(urldownloader)

    # run five crawler threads, each with its own driver instance
    for _ in range(5):
        crawler = Crawler(get_driver())
        crawler.start()
        workers.append(crawler)

    db = Db()
    db.start()
    workers.append(db)

    for worker in workers:
        worker.join()

    print('Done scraping...........')
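When the run finishes, the scraped quotes are stored in quotes.sqlite3. A quick way to inspect the output is the standard library's sqlite3 module; this sketch assumes the ORM maps field names directly to column names in the quote_table table defined above:

import sqlite3

conn = sqlite3.connect('quotes.sqlite3')
for row in conn.execute('SELECT author, quote FROM quote_table LIMIT 5'):
    print(row)
conn.close()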
AUTHOR
- Afriyie Daniel

Hope you enjoy using it!