A sophisticated Python module and command-line tool for web crawling

YiraBot

Overview

Meet YiraBot – your new web crawling and SEO analysis companion! Designed for simplicity and ease of use, YiraBot makes web scraping accessible to everyone. Whether you're a seasoned developer, a data enthusiast, or just exploring Python, YiraBot streamlines web data extraction, turning it into an effortless and satisfying task.

Key Features

Command-Line Simplicity

  • User-Friendly Commands: Jump right into web crawling with straightforward and powerful commands.
  • Ready for Any Task: From quick data grabs to intricate scraping jobs, YiraBot handles it all through the command line.

Module Integration

  • Scripting Made Easy: Beyond the command line, YiraBot can be imported as a module and called directly from your Python scripts.

Ethical and Efficient Crawling

  • Respecting Web Standards: YiraBot adheres to robots.txt policies, ensuring responsible web scraping.
  • Thorough Data Extraction: Extract everything from meta tags to images and links – YiraBot doesn't miss a beat.

User-Friendly Experience

  • Simple Data Export: Exporting your data is straightforward with YiraBot's easy options.
  • Cross-Platform Performance: Enjoy seamless operation across Linux, Windows, and macOS.

Ideal Uses

  • Academic Research: Gather web data effortlessly for your research projects.
  • SEO and Website Analysis: Dive deep into website content and SEO elements for comprehensive insights.
  • Website Monitoring: Keep tabs on changes and updates across web pages.
  • Machine Learning Data Gathering: Conveniently collect data sets for machine learning purposes.

Getting Started

First things first: make sure Python and pip are installed on your system. Then, you're just one command away:

pip install YiraBot

Command-Line Usage

Kick things off with the help menu:

yirabot

Dive into YiraBot's Capabilities:

  • Basic Crawl: 'yirabot crawl example.com'
  • Save Crawl to a File: 'yirabot crawl example.com -file' (or -json)
  • Content Crawl: 'yirabot crawl-content example.com'
  • Check Website for Issues: 'yirabot check example.com'
  • Clone a Webpage: 'yirabot get-html example.com'
  • Crawl Authentication Protected Pages: 'yirabot session'

Using YiraBot in Your Projects

Easily integrate YiraBot in your scripts like so:

from yirabot import Yirabot

# Create a YiraBot instance
bot = Yirabot()

# Example usage
html_content = bot.get_html('https://example.com')
print(html_content)

Methods:

  • get_html(url): Retrieves the HTML content of a webpage.
  • is_allowed_by_robots_txt(url): Checks if a URL is permitted for crawling by robots.txt.
  • parse_sitemap(url): Finds URLs by parsing a website's sitemap.
  • crawl(url): Performs a comprehensive crawl of a URL.
  • crawl_content(url): Extracts detailed content like text, headings, and lists.
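
The methods listed above can be combined so that a page is crawled only when robots.txt permits it. The following is a minimal sketch, not part of YiraBot itself: crawl_if_allowed is a hypothetical helper name, and it assumes that is_allowed_by_robots_txt returns a truthy value when crawling is permitted, which the method description suggests but does not state outright.

```python
def crawl_if_allowed(bot, url):
    """Crawl `url` only when the bot reports that robots.txt permits it.

    `bot` is any object exposing is_allowed_by_robots_txt(url) and crawl(url),
    such as a Yirabot instance. Returns the crawl result, or None when the
    crawl is skipped.
    """
    # Assumption: a truthy return value means crawling is permitted.
    if not bot.is_allowed_by_robots_txt(url):
        return None
    return bot.crawl(url)
```

With a real instance this would be called as crawl_if_allowed(Yirabot(), 'https://example.com'). Taking the bot as a parameter keeps the helper easy to exercise with a stand-in object.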

Examples

Crawl a Webpage:

data = bot.crawl('https://example.com')
print(data)

Extract Web Content:

content = bot.crawl_content('https://example.com')
print(content)
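
To give a sense of what heading extraction involves, here is a small sketch built on Python's standard html.parser module. It is an illustration only and makes no claim about how crawl_content works internally or what structure it returns; HeadingCollector and extract_headings are hypothetical names.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (tag, text) pairs for every h1-h6 element in a document."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # heading tag currently open, if any
        self._buffer = []     # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside an open heading.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings.append((tag, "".join(self._buffer).strip()))
            self._current = None


def extract_headings(html):
    """Return the headings found in an HTML string, in document order."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings
```

The HTML string passed to extract_headings could come from bot.get_html(url), assuming that method returns the page source as a string.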

Check Crawlability of a Webpage:

crawlable = bot.is_allowed_by_robots_txt('https://example.com')
print(crawlable)
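
For background, a robots.txt check of this kind can be sketched with Python's standard urllib.robotparser module. This is not YiraBot's implementation: is_path_allowed and SAMPLE_ROBOTS are hypothetical names, the wildcard user agent is an assumption, and the rules are parsed from an in-memory string instead of being fetched from a site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; a real crawler would download /robots.txt first.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_path_allowed(robots_txt, url, user_agent="*"):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under these sample rules, any URL whose path starts with /private/ is refused and everything else is allowed.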

Discover URLs from a Website's Sitemap:

urls = bot.parse_sitemap('https://example.com')
print(urls)
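
Sitemap discovery generally comes down to reading the loc entries from a sitemap XML file. The sketch below shows that step with the standard library's xml.etree.ElementTree, using the namespace defined by the sitemaps.org protocol. It is an assumption-laden illustration and not the code behind parse_sitemap; extract_sitemap_urls and SAMPLE_SITEMAP are hypothetical names, and the XML is supplied as a string instead of being downloaded.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap files that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Illustrative sitemap; a real one would be fetched from the target site.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def extract_sitemap_urls(sitemap_xml):
    """Return every <loc> value in a sitemap XML string, in document order."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]
```

Because the search matches loc elements at any depth, the same function also lists the child sitemap URLs of a sitemap index file.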

Contributing

Your contributions are what make YiraBot even better. Fork the repository, make your changes, and create a pull request to join in!

License

YiraBot is open source and released under the MIT License.
