Corgi Browser

Project description

CorgiBrowser: Scalable Web Crawling Framework

CorgiBrowser is an open-source Python framework focused on simplifying web crawling and scraping. Built with scalability, efficiency, and ethical data collection in mind, it is designed for researchers, developers, and analysts who require robust data acquisition capabilities.

Documentation

readthedocs.org/projects/corgibrowser/

Introduction

CorgiBrowser grew out of the need for a scalable solution to the challenges of modern web crawling and scraping. With the internet's exponential data growth, existing frameworks often fall short in scalability and customizability. CorgiBrowser is an all-in-one framework that focuses on ethical data practices and presents a pioneering approach to distributed crawling and data management.

Key Features

  • Scalability: Supports large-scale data collection with a microservices architecture, enabling horizontal scaling on cloud platforms.
  • Distributed Crawling: Offers configurable crawlers with priority settings for tailored crawling strategies.
  • Use of Custom Scraping Templates: Facilitates the integration of custom templates for precise data extraction.
  • Ethical Crawling: Complies with robots.txt standards and employs throttling to minimize the impact on web resources.
  • Cloud Integration: Works with cloud storage solutions for efficient data management and scalability.
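The ethical-crawling behavior described above (honoring robots.txt and throttling requests) can be sketched with Python's standard library. This is an illustrative sketch only, not CorgiBrowser's internal implementation; the `allowed` helper and the sample robots.txt are assumptions for the example.

```python
import urllib.robotparser

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse a robots.txt body and decide whether user_agent may fetch url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Example robots.txt with a disallowed path and a crawl delay.
ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

print(allowed(ROBOTS, "corgi", "https://example.com/news"))       # allowed
print(allowed(ROBOTS, "corgi", "https://example.com/private/x"))  # disallowed
```

A polite crawler would additionally read `RobotFileParser.crawl_delay(user_agent)` and sleep between requests to the same host, which is the throttling behavior the feature list refers to.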

Depencencies

Getting Started

To install CorgiBrowser, run the following command:

pip install corgibrowser
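The snippets below read Azure Storage credentials from environment variables via python-dotenv, so a `.env` file is expected in the working directory. A minimal example (the variable names match the snippets; the values are placeholders):

```
AZURE_STORAGE_ACCOUNT_NAME=your-account-name
AZURE_STORAGE_ACCOUNT_KEY=your-account-key
```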

To initialize a Crawler instance:

import os
from dotenv import load_dotenv
from corgibrowser.corgi_cloud_integration.cloud_integration import CloudIntegration
from corgibrowser.corgi_datasets.DataSetsManager import DataSetsManager
from corgibrowser.corgi_settings.SettingsManager import SettingsManager
from corgibrowser.corgi_crawler.crawler import WebCrawler

# Load Settings Manager
settings_manager = SettingsManager()
load_dotenv()
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_NAME"] = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_KEY"] = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")

# Set up cloud integration
cloud_integration = CloudIntegration(settings_manager=settings_manager)
cloud_integration.initialize()

# Add Initial URLs
for url in DataSetsManager.load_usa_newspaper_urls():
    cloud_integration.add_url_to_queue(url)

# Crawl
crawler = WebCrawler(cloud_integration=cloud_integration, settings_manager=settings_manager)
crawler.initialize()
crawler.start()

To initialize a Scraper instance:

import os
from dotenv import load_dotenv
from corgibrowser.corgi_cloud_integration.cloud_integration import CloudIntegration
from corgibrowser.corgi_settings.SettingsManager import SettingsManager
from corgibrowser.corgi_webscraping.scraper import Scraper

# Load Settings Manager
settings_manager = SettingsManager()
load_dotenv()
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_NAME"] = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_KEY"] = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")

# Set up cloud integration
cloud_integration = CloudIntegration(settings_manager=settings_manager)
cloud_integration.initialize()

# Scrape
scraper = Scraper(cloud_integration=cloud_integration, settings_manager=settings_manager)
scraper.initialize()
scraper.start()

Demos

Link to demo applications and tutorials.

Background

Developed for Jose Enriquez's Master's Thesis in Computer Engineering, CorgiBrowser aims to democratize access to web data through ethical and efficient crawling. Its objective is to merge web crawling, cloud technologies, and data analysis: an integration that enhances scalability, efficiency, and the ability to perform comprehensive data processing, establishing a new benchmark in data collection technologies.

Contributing to CorgiBrowser

Contributors are welcome! Check out the Open Issues on GitHub for starting points.

License

CorgiBrowser is released under the MIT License, promoting open and unrestricted use and contribution.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

corgibrowser-0.1.0.tar.gz (55.0 kB)

Uploaded Source

Built Distribution

corgibrowser-0.1.0-py3-none-any.whl (72.6 kB)

Uploaded Python 3

File details

Details for the file corgibrowser-0.1.0.tar.gz.

File metadata

  • Download URL: corgibrowser-0.1.0.tar.gz
  • Upload date:
  • Size: 55.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.10.4

File hashes

Hashes for corgibrowser-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8aa8e8c28e80bb283109d49bc4fc92f92a64c5ea34fc12356dbd99d8d8b7d51e
MD5 6b0a8d3b08c2a78c11f2ceb99182319d
BLAKE2b-256 d6ab775541ec51487401b640b650a9304fb2e5b9ec92827d54d5c516106680af

See more details on using hashes here.

File details

Details for the file corgibrowser-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for corgibrowser-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 f9de21af2afd1593d751e8a0732566ad3ac59dc62635e4aadac9a5221f61578e
MD5 c250228cae82437124be7f0d2d8d5aa7
BLAKE2b-256 9e75f843720ff10a1d23380a8b22848de090e555a87964ddb9d2605f318aa12d

