A sophisticated Python module and command-line tool for web crawling

YiraBot

Overview

YiraBot is a Python tool for web data collection. It is primarily a command-line tool, and it can also be imported as a module in Python projects. Aimed at developers, data enthusiasts, and researchers, it provides commands and methods for crawling pages, extracting content, checking robots.txt, and parsing sitemaps.

Key Features

Command-Line Focus

  • Intuitive Command-Line Interface: Execute various tasks through simple yet powerful commands, making web crawling accessible and efficient.
  • Versatile Usage: Ideal for quick tasks or complex data extraction processes, all manageable through the command line.

Module Integration

  • Python Library Flexibility: In addition to its command-line prowess, YiraBot can be imported and used as a Python module, offering extended functionality in Python scripts.

Ethical and Efficient Crawling

  • Respect for Robots.txt: Adheres to ethical scraping standards by complying with each website's robots.txt policies.
  • Rich Data Extraction: Capable of extracting meta tags, images, links, and parsing sitemaps for comprehensive web analysis.
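
To illustrate what a robots.txt check involves, here is a minimal sketch using Python's standard-library urllib.robotparser. It is not YiraBot's own implementation; the rules, the user agent name "ExampleBot", and the URLs are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, kept in a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/index.html"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data"))
```

A crawler that follows this pattern would skip any URL for which the check returns False.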

User Experience

  • Data Export Capabilities: Saves extracted data to files for easy analysis and record-keeping.

Cross-Platform Compatibility

  • Universal Application: Works seamlessly across various operating systems.

Ideal Use Cases

  • Academic Research: Gathering data from diverse web sources for scholarly studies.
  • SEO and Website Audits: Reviewing meta tags, links, and content for SEO analysis.
  • Website Monitoring: Tracking updates or changes across web pages.
  • Data Gathering for Machine Learning and Analysis: Collecting web data for machine learning models and data projects.

Installation

Ensure Python and pip are installed on your system, then install YiraBot with:

pip install YiraBot

Command-Line Usage

yirabot <command> [arguments]

Examples

Displaying the help menu:

yirabot

Crawling a webpage:

yirabot crawl example.com

Crawling a webpage and saving the extracted data to a file:

yirabot crawl example.com -file

Crawling a webpage to get the content:

yirabot crawl-content example.com

Use YiraBot in Your Own Projects

Usage:

Import and use YiraBot in your Python script as follows:

from yirabot import Yirabot

# Create an instance of YiraBot
bot = Yirabot()

# Example usage
html_content = bot.get_html('https://example.com')
print(html_content)

Methods:

  • get_html(url): Retrieves the HTML content of a webpage.
  • is_allowed_by_robots_txt(url): Checks if crawling a URL is allowed by robots.txt.
  • parse_sitemap(url): Parses the sitemap of a website to find URLs.
  • crawl(url): Crawls a URL and extracts various information.
  • crawl_content(url): Extracts detailed content like paragraphs, headings, and lists.
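
The methods above can be combined. The sketch below defines a small hypothetical helper, polite_crawl, which is not part of YiraBot. It assumes that is_allowed_by_robots_txt(url) returns a truthy value when crawling is permitted; the return types of these methods are not documented here, so treat that as an assumption.

```python
def polite_crawl(bot, url):
    """Crawl url only if the bot reports that robots.txt allows it.

    `bot` is any object exposing is_allowed_by_robots_txt(url) and crawl(url),
    the two method names listed above. Returns None when crawling is
    disallowed (assumption: the check returns a truthy value when allowed).
    """
    if not bot.is_allowed_by_robots_txt(url):
        return None
    return bot.crawl(url)
```

With a Yirabot instance named bot, this would be called as polite_crawl(bot, 'https://example.com'). YiraBot may already perform the robots.txt check internally; the explicit call only makes the decision visible to the calling code.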

Examples

Crawling a Webpage

data = bot.crawl('https://example.com')
print(data)

Extracting Content

content = bot.crawl_content('https://example.com')
print(content)
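
For readers curious about what extracting headings and paragraphs can look like, here is a minimal standard-library sketch. It is not how YiraBot implements crawl_content, and the names ContentCollector and extract_content are hypothetical. It works on an HTML string, such as one returned by get_html.

```python
from html.parser import HTMLParser


class ContentCollector(HTMLParser):
    """Collect the text of headings (h1-h6) and paragraphs from an HTML string."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self.paragraphs = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        # Start capturing only at a heading or paragraph, and only if idle.
        if self._current is None and (tag in self.HEADINGS or tag == "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                target = self.paragraphs if tag == "p" else self.headings
                target.append(text)
            self._current = None


def extract_content(html: str) -> dict:
    """Return the headings and paragraphs found in html."""
    collector = ContentCollector()
    collector.feed(html)
    collector.close()
    return {"headings": collector.headings, "paragraphs": collector.paragraphs}
```

Text inside inline tags such as <b> is kept as part of the enclosing paragraph, while text outside headings and paragraphs is ignored.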

Checking if a Webpage Is Crawlable

crawlable = bot.is_allowed_by_robots_txt('https://example.com')
print(crawlable)

Parsing a Website's Sitemap to Find URLs

urls = bot.parse_sitemap("https://example.com")
print(urls)
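
As background on what sitemap parsing involves, the sketch below reads the <loc> entries from a sitemap document with the standard-library xml.etree.ElementTree. It is an illustration under assumptions, not YiraBot's implementation: the function name sitemap_urls and the sample document are made up, and the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, kept in a string so no network access is needed.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def sitemap_urls(xml_text: str) -> list:
    """Return every non-empty <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```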

Contributing

Contributions to the YiraBot project are welcome. Feel free to fork the repository, make your changes, and submit a pull request.

License

YiraBot is open-source software licensed under the MIT License.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

YiraBot-1.0.6.3.tar.gz (8.2 kB)

Uploaded Source

Built Distribution

YiraBot-1.0.6.3-py3-none-any.whl (10.0 kB)

Uploaded Python 3
