A sophisticated Python module and command-line tool for web crawling

YiraBot

Overview

YiraBot aims to make web crawling simple and accessible. Whether you're a seasoned developer, a data enthusiast, or new to Python, its command-line interface and importable Python module let you extract data from the web with very little setup.

Key Features

Command-Line Simplicity

  • Easy-to-Use Commands: Experience the ease of web crawling with intuitive and powerful commands.
  • Versatility for All Tasks: Whether it's a quick data extraction or a more complex scraping job, YiraBot is up to the task, all from the command line.

Module Integration

  • Enhanced Scripting Flexibility: Not just a command-line tool, YiraBot also integrates seamlessly into your Python scripts, expanding your data scraping capabilities.

Ethical and Efficient Crawling

  • Adherence to Web Standards: YiraBot respects the rules of the web by complying with robots.txt policies.
  • Comprehensive Data Extraction: From meta tags to images and links, YiraBot is thorough, ensuring you get all the data you need.

User-Friendly Experience

  • Hassle-Free Data Export: Save crawl results to a file or export extracted content as JSON with a single flag (-file, -json).
  • Cross-Platform Compatibility: YiraBot works smoothly whether you're on Linux, Windows, or macOS.

Ideal Uses

  • Academic Research: Effortlessly gather data from various web sources.
  • SEO and Website Analysis: Conduct comprehensive reviews of website content and SEO elements.
  • Website Monitoring: Keep track of changes to web pages.
  • Machine Learning Data Collection: Easily collect data for machine learning models and analysis.

Getting Started

Make sure Python and pip are installed on your system, then run:

pip install YiraBot

Command-Line Usage

Display the help menu:

yirabot

Explore YiraBot's capabilities:

  • Basic crawl: yirabot crawl example.com
  • Save crawl to a file: yirabot crawl example.com -file
  • Extract content: yirabot crawl-content example.com
  • Content to JSON: yirabot crawl-content example.com -json
  • Check website issues: yirabot check example.com
  • Fetch a webpage's HTML: yirabot get-html example.com

Use YiraBot in Your Own Projects

Usage:

Import and use YiraBot in your Python script as follows:

from yirabot import Yirabot

# Create an instance of YiraBot
bot = Yirabot()

# Example usage
html_content = bot.get_html('https://example.com')
print(html_content)

Methods:

  • get_html(url): Retrieves the HTML content of a webpage.
  • is_allowed_by_robots_txt(url): Checks if crawling a URL is allowed by robots.txt.
  • parse_sitemap(url): Parses the sitemap of a website to find URLs.
  • crawl(url): Crawls a URL and extracts various information.
  • crawl_content(url): Extracts detailed content like paragraphs, headings, and lists.
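
The methods above can be combined. The following is a minimal sketch, not part of YiraBot itself, of a helper that crawls a URL only when the robots.txt check passes. It assumes is_allowed_by_robots_txt returns a truthy value when crawling is permitted; crawl_if_allowed and FakeBot are hypothetical names introduced here so the sketch runs without network access.

```python
def crawl_if_allowed(bot, url):
    """Crawl `url` only when the bot reports that robots.txt permits it.

    `bot` is any object exposing is_allowed_by_robots_txt(url) and crawl(url),
    the two method names listed above. Returns None when crawling is disallowed.
    """
    if not bot.is_allowed_by_robots_txt(url):
        return None
    return bot.crawl(url)


class FakeBot:
    """Hypothetical stand-in for a Yirabot instance, used for a dry run."""

    def __init__(self, allowed):
        self.allowed = allowed

    def is_allowed_by_robots_txt(self, url):
        return self.allowed

    def crawl(self, url):
        # A real crawl returns extracted page data; its shape is not assumed here.
        return {"url": url}


print(crawl_if_allowed(FakeBot(True), "https://example.com"))
print(crawl_if_allowed(FakeBot(False), "https://example.com"))
```

In a real script, a Yirabot instance would be passed in place of FakeBot.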

Examples

Crawling a Webpage

data = bot.crawl('https://example.com')
print(data)
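
To give a rough idea of what extracting meta tags, images, and links from a page involves, here is a standard-library sketch. It is not YiraBot's implementation, and the structure returned by bot.crawl may differ; PageSummaryParser and summarize are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Collects link targets, image sources, and named meta tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta[attrs["name"]] = attrs["content"]


def summarize(html):
    """Parse an HTML string and return the collected links, images, and meta tags."""
    parser = PageSummaryParser()
    parser.feed(html)
    parser.close()
    return {"links": parser.links, "images": parser.images, "meta": parser.meta}


sample = (
    '<html><head><meta name="description" content="Demo page"></head>'
    '<body><a href="/about">About</a><img src="/logo.png"></body></html>'
)
print(summarize(sample))
```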

Extracting Content

content = bot.crawl_content('https://example.com')
print(content)
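
As an illustration of the kind of content extraction described here (headings, paragraphs, and lists), the following sketch uses only the standard library. It is an assumption-laden simplification rather than YiraBot's own code: ContentParser is a hypothetical name, and nested block elements (for example a paragraph inside a list item) are not handled.

```python
from html.parser import HTMLParser


class ContentParser(HTMLParser):
    """Groups text into headings, paragraphs, and list items."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.content = {"headings": [], "paragraphs": [], "list_items": []}
        self._bucket = None  # which list the current element's text goes into
        self._text = []      # text fragments seen inside the current element

    def _bucket_for(self, tag):
        if tag in self.HEADINGS:
            return "headings"
        if tag == "p":
            return "paragraphs"
        if tag == "li":
            return "list_items"
        return None

    def handle_starttag(self, tag, attrs):
        bucket = self._bucket_for(tag)
        if bucket:
            self._bucket = bucket
            self._text = []

    def handle_data(self, data):
        if self._bucket:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bucket and self._bucket_for(tag) == self._bucket:
            # Collapse runs of whitespace before storing the text.
            text = " ".join("".join(self._text).split())
            if text:
                self.content[self._bucket].append(text)
            self._bucket = None


parser = ContentParser()
parser.feed("<h1>Title</h1><p>First <b>bold</b> paragraph.</p><ul><li>One</li><li>Two</li></ul>")
parser.close()
print(parser.content)
```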

Checking Whether a Webpage Is Crawlable

crawlable = bot.is_allowed_by_robots_txt('https://example.com')
print(crawlable)
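
For readers curious about what a robots.txt check involves, this sketch uses Python's built-in urllib.robotparser on an in-memory rule set. It only illustrates the idea and makes no claim about how YiraBot performs the check internally; the is_allowed helper, the "ExampleBot" user agent, and the sample rules are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Sample rules (hypothetical): everything is allowed except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="ExampleBot"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "https://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "https://example.com/private/data.html"))
```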

Parsing a Sitemap to Find URLs

urls = bot.parse_sitemap("https://example.com")
print(urls)
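
Sitemaps are XML documents that list page addresses in loc elements. A minimal sketch of pulling those URLs out with the standard library follows; it works on an XML string rather than fetching anything, and urls_from_sitemap is a hypothetical helper, not a YiraBot function. The value returned by bot.parse_sitemap may be shaped differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text):
    """Return every <loc> value found in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(urls_from_sitemap(SAMPLE))
```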

Contributing

Contributions to the YiraBot project are welcome. Feel free to fork the repository, make your changes, and submit a pull request.

License

YiraBot is open-source software licensed under the MIT License.
