Corgi Browser
Project description
CorgiBrowser: Scalable Web Crawling Framework
CorgiBrowser is an open-source Python framework focused on simplifying the process of web crawling and scraping. Built with scalability, efficiency, and ethical data collection in mind, it is designed for researchers, developers, and analysts who require robust data acquisition capabilities.
Documentation
readthedocs.org/projects/corgibrowser/
Table of Contents
- Introduction
- Key Features
- Dependencies
- Getting Started
- Demos
- Background
- Contributing to CorgiBrowser
- License
Introduction
CorgiBrowser grew out of the need for a scalable solution to the challenges of modern web crawling and scraping. With the internet's exponential data growth, existing frameworks often fall short in scalability and customizability. CorgiBrowser is an all-in-one framework that focuses on ethical data practices and takes a distributed approach to crawling and data management.
Key Features
- Scalability: Supports large-scale data collection with a microservices architecture, enabling horizontal scaling on cloud platforms.
- Distributed Crawling: Offers configurable crawlers with priority settings for tailored crawling strategies.
- Use of Custom Scraping Templates: Facilitates the integration of custom templates for precise data extraction.
- Ethical Crawling: Complies with robots.txt standards and employs throttling to minimize the impact on web resources.
- Cloud Integration: Works with cloud storage solutions for efficient data management and scalability.
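To illustrate the robots.txt side of ethical crawling (independently of CorgiBrowser's internal implementation, which is not shown here), Python's standard `urllib.robotparser` can decide whether a given URL may be fetched under a site's policy:

```python
from urllib.robotparser import RobotFileParser

# A minimal, self-contained sketch: parse a robots.txt policy and
# check URLs against it before fetching. CorgiBrowser performs an
# equivalent check internally; this is not its actual API.
rules = """
User-agent: *
Disallow: /private/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("CorgiBot", "https://example.com/news/today"))  # True
print(parser.can_fetch("CorgiBot", "https://example.com/private/x"))   # False
```

Throttling works the same way in spirit: before each request, a crawler sleeps long enough to respect the site's rate expectations rather than hitting it as fast as possible.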
Dependencies
- Python 3.9+
- Works on Linux, Windows
- Azure Storage account (with future support for local storage)
Getting Started
To install CorgiBrowser, run the following command:
pip install corgibrowser
To initialize a Crawler instance:
import os
from dotenv import load_dotenv
from corgibrowser.corgi_cloud_integration.cloud_integration import CloudIntegration
from corgibrowser.corgi_datasets.DataSetsManager import DataSetsManager
from corgibrowser.corgi_settings.SettingsManager import SettingsManager
from corgibrowser.corgi_crawler.crawler import WebCrawler

# Load settings and Azure credentials from the environment
settings_manager = SettingsManager()
load_dotenv()
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_NAME"] = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_KEY"] = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")

# Set up cloud integration
cloud_integration = CloudIntegration(settings_manager=settings_manager)
cloud_integration.initialize()

# Add initial URLs
for url in DataSetsManager.load_usa_newspaper_urls():
    cloud_integration.add_url_to_queue(url)

# Crawl
crawler = WebCrawler(cloud_integration=cloud_integration, settings_manager=settings_manager)
crawler.initialize()
crawler.start()
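The examples here read the Azure credentials from a `.env` file via python-dotenv. A minimal file in the working directory would look like this (placeholder values, not real credentials):

```
AZURE_STORAGE_ACCOUNT_NAME=yourstorageaccount
AZURE_STORAGE_ACCOUNT_KEY=your-base64-account-key
```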
To initialize a Scraper instance:
import os
from dotenv import load_dotenv
from corgibrowser.corgi_cloud_integration.cloud_integration import CloudIntegration
from corgibrowser.corgi_settings.SettingsManager import SettingsManager
from corgibrowser.corgi_webscraping.scraper import Scraper

# Load settings and Azure credentials from the environment
settings_manager = SettingsManager()
load_dotenv()
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_NAME"] = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
settings_manager.CLOUD["AZURE_STORAGE_ACCOUNT_KEY"] = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")

# Set up cloud integration
cloud_integration = CloudIntegration(settings_manager=settings_manager)
cloud_integration.initialize()

# Scrape
scraper = Scraper(cloud_integration=cloud_integration, settings_manager=settings_manager)
scraper.initialize()
scraper.start()
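The scraper's extraction behavior can be tailored with custom templates (see Key Features). CorgiBrowser's actual template API is not shown here; as a rough sketch of the idea, a template maps a fetched page to the fields of interest. The example below uses only Python's standard `html.parser`, and `extract_title` is a hypothetical helper name:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text of the <title> element of a page."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_title(html: str) -> str:
    # Hypothetical template logic: pull one field out of a fetched page.
    parser = TitleExtractor()
    parser.feed(html)
    return parser.title.strip()

print(extract_title("<html><head><title>Front Page</title></head></html>"))
# Front Page
```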
Demos
Link to demo applications and tutorials.
Background
Developed for Jose Enriquez's Master's Thesis in Computer Engineering, CorgiBrowser aims to democratize access to web data through ethical and efficient crawling. Its objective is to merge web crawling, cloud technologies, and data analysis in a single framework, improving scalability, efficiency, and the capacity for comprehensive data processing.
Contributing to CorgiBrowser
Contributors are welcome! Check out the Open Issues on GitHub for starting points.
License
CorgiBrowser is released under the MIT License, promoting open and unrestricted use and contribution.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file corgibrowser-0.1.0.tar.gz
File metadata
- Download URL: corgibrowser-0.1.0.tar.gz
- Upload date:
- Size: 55.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | 8aa8e8c28e80bb283109d49bc4fc92f92a64c5ea34fc12356dbd99d8d8b7d51e
MD5 | 6b0a8d3b08c2a78c11f2ceb99182319d
BLAKE2b-256 | d6ab775541ec51487401b640b650a9304fb2e5b9ec92827d54d5c516106680af
File details
Details for the file corgibrowser-0.1.0-py3-none-any.whl
File metadata
- Download URL: corgibrowser-0.1.0-py3-none-any.whl
- Upload date:
- Size: 72.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.0.0 CPython/3.10.4
File hashes
Algorithm | Hash digest
---|---
SHA256 | f9de21af2afd1593d751e8a0732566ad3ac59dc62635e4aadac9a5221f61578e
MD5 | c250228cae82437124be7f0d2d8d5aa7
BLAKE2b-256 | 9e75f843720ff10a1d23380a8b22848de090e555a87964ddb9d2605f318aa12d