A sophisticated Python module and command-line tool for web crawling
Overview
YiraBot isn't just another web scraping tool; it's about making web crawling simple and accessible for everyone. Whether you're a seasoned developer, a data enthusiast, or just dabbling in Python, YiraBot is designed to make your life easier. With its user-friendly command-line interface and Python module flexibility, YiraBot streamlines the process of extracting data from the web, making it a straightforward and enjoyable experience.
Key Features
Command-Line Simplicity
- Easy-to-Use Commands: Experience the ease of web crawling with intuitive and powerful commands.
- Versatility for All Tasks: Whether it's a quick data extraction or a more complex scraping job, YiraBot is up to the task, all from the command line.
Module Integration
- Enhanced Scripting Flexibility: Not just a command-line tool, YiraBot also integrates seamlessly into your Python scripts, expanding your data scraping capabilities.
Ethical and Efficient Crawling
- Adherence to Web Standards: YiraBot respects the rules of the web by complying with robots.txt policies.
- Comprehensive Data Extraction: From meta tags to images and links, YiraBot is thorough, ensuring you get all the data you need.
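To make "meta tags, images, and links" concrete, here is a minimal sketch of that kind of extraction using only Python's standard library. It is not YiraBot's implementation; the class name `PageSummaryParser`, the helper `summarize_html`, and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Collects meta tags, image sources, and link targets from HTML."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.images = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            # Meta tags are keyed by "name" or, for Open Graph, "property".
            key = attributes.get("name") or attributes.get("property")
            if key and attributes.get("content") is not None:
                self.meta[key] = attributes["content"]
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])
        elif tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])


def summarize_html(html):
    """Return the meta tags, images, and links found in an HTML string."""
    parser = PageSummaryParser()
    parser.feed(html)
    parser.close()
    return {"meta": parser.meta, "images": parser.images, "links": parser.links}


# Hypothetical sample page, so the sketch runs without network access.
SAMPLE_HTML = """
<html><head>
<meta name="description" content="An example page">
</head><body>
<a href="/about">About</a>
<img src="/logo.png" alt="Logo">
</body></html>
"""

summary = summarize_html(SAMPLE_HTML)
print(summary)
```

The structure of the data YiraBot itself returns may differ from this dictionary.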
User-Friendly Experience
- Hassle-Free Data Export: Exporting your data is a breeze with YiraBot's straightforward options.
- Cross-Platform Compatibility: YiraBot works smoothly whether you're on Linux, Windows, or macOS.
Ideal Uses
- Academic Research: Effortlessly gather data from various web sources.
- SEO and Website Analysis: Conduct comprehensive reviews of website content and SEO elements.
- Website Monitoring: Stay updated with changes and updates on web pages.
- Machine Learning Data Collection: Easily collect data for machine learning models and analysis.
Getting Started
Make sure Python and pip are installed on your system, then run:
pip install YiraBot
Command-Line Usage
Display the help menu:
yirabot
Explore YiraBot's capabilities:
- Basic crawl: yirabot crawl example.com
- Save crawl to a file: yirabot crawl example.com -file
- Extract content: yirabot crawl-content example.com
- Content to JSON: yirabot crawl-content example.com -json
- Check website issues: yirabot check example.com
- Clone a webpage: yirabot get-html example.com
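If you want to drive these commands from a script, one option is to call the CLI through `subprocess`. The sketch below only uses the subcommands and flags listed above; the helper names `build_command` and `run_yirabot` are hypothetical, and the runner returns `None` when `yirabot` is not found on the PATH.

```python
import shutil
import subprocess


def build_command(subcommand, target, flag=None):
    """Return the argument list for a yirabot invocation."""
    command = ["yirabot", subcommand, target]
    if flag:
        command.append(flag)
    return command


def run_yirabot(subcommand, target, flag=None):
    """Run yirabot if it is installed; return None otherwise."""
    if shutil.which("yirabot") is None:
        return None
    return subprocess.run(
        build_command(subcommand, target, flag),
        capture_output=True,
        text=True,
        check=False,  # inspect returncode instead of raising on failure
    )


# Building the command has no side effects, so it is safe to print.
print(build_command("crawl-content", "example.com", "-json"))
```

Calling `run_yirabot("check", "example.com")` would then execute the same command as typing `yirabot check example.com` in a terminal.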
Use YiraBot in Your Own Projects
Usage:
Import and use YiraBot in your Python script as follows:
from yirabot import Yirabot
# Create an instance of YiraBot
bot = Yirabot()
# Example usage
html_content = bot.get_html('https://example.com')
print(html_content)
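When fetching several pages, it can help to keep going after one URL fails. The following is a minimal sketch of that pattern. The helper `fetch_pages` and the offline `StubBot` are hypothetical, and because the exceptions that `get_html` may raise are not documented here, the sketch catches `Exception` broadly.

```python
def fetch_pages(bot, urls):
    """Fetch each URL with bot.get_html, keeping results and failures apart."""
    pages, failures = {}, {}
    for url in urls:
        try:
            pages[url] = bot.get_html(url)
        except Exception as error:  # assumption: error types are undocumented
            failures[url] = str(error)
    return pages, failures


class StubBot:
    """Offline stand-in used only to demonstrate the helper."""

    def get_html(self, url):
        if "broken" in url:
            raise ValueError("simulated failure")
        return f"<html><!-- {url} --></html>"


pages, failures = fetch_pages(
    StubBot(), ["https://example.com", "https://broken.example.com"]
)
print(sorted(pages), failures)
```

In a real script you would pass a `Yirabot()` instance in place of `StubBot()`.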
Methods:
- get_html(url): Retrieves the HTML content of a webpage.
- is_allowed_by_robots_txt(url): Checks if crawling a URL is allowed by robots.txt.
- parse_sitemap(url): Parses the sitemap of a website to find URLs.
- crawl(url): Crawls a URL and extracts various information.
- crawl_content(url): Extracts detailed content like paragraphs, headings, and lists.
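These methods can be combined into a small pipeline: read the sitemap, skip pages that robots.txt disallows, and extract content from the rest. The sketch below assumes that `parse_sitemap` returns an iterable of URL strings and that `is_allowed_by_robots_txt` returns a truthy value when crawling is permitted; neither return type is confirmed above. `crawl_site` and `StubBot` are hypothetical names.

```python
def crawl_site(bot, site_url, limit=5):
    """Return {url: content} for sitemap URLs that robots.txt permits."""
    results = {}
    # Assumption: parse_sitemap returns an iterable of URL strings.
    for url in list(bot.parse_sitemap(site_url))[:limit]:
        if not bot.is_allowed_by_robots_txt(url):
            continue  # skip pages the site asks crawlers to avoid
        results[url] = bot.crawl_content(url)
    return results


class StubBot:
    """Offline stand-in that mimics the method names listed above."""

    def parse_sitemap(self, url):
        return [f"{url}/", f"{url}/private", f"{url}/about"]

    def is_allowed_by_robots_txt(self, url):
        return not url.endswith("/private")

    def crawl_content(self, url):
        return {"source": url}


results = crawl_site(StubBot(), "https://example.com")
print(list(results))
```

The `limit` parameter keeps a first run small; raise it once the output looks right.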
Examples
Crawling a Webpage
data = bot.crawl('https://example.com')
print(data)
Extracting Content
content = bot.crawl_content('https://example.com')
print(content)
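To keep extracted content for later, you can write it to a JSON file from your own script. This sketch assumes the value returned by `crawl_content` is a dictionary or list; `default=str` is a fallback in case it contains values that `json` cannot serialize directly. The helper `save_as_json` and the sample data (including its keys) are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_as_json(data, path):
    """Write data to path as indented UTF-8 JSON and return the path."""
    text = json.dumps(data, indent=2, ensure_ascii=False, default=str)
    Path(path).write_text(text, encoding="utf-8")
    return path


# Hypothetical stand-in for the result of bot.crawl_content(...).
sample_content = {
    "headings": ["Example Domain"],
    "paragraphs": ["This domain is for use in illustrative examples."],
}

with tempfile.TemporaryDirectory() as folder:
    output = save_as_json(sample_content, Path(folder) / "content.json")
    restored = json.loads(output.read_text(encoding="utf-8"))

print(restored == sample_content)
```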
Checking Whether a Webpage Is Crawlable
crawlable = bot.is_allowed_by_robots_txt('https://example.com')
print(crawlable)
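For readers curious about what a robots.txt check involves, Python's standard library provides `urllib.robotparser`. The sketch below is an independent illustration, not YiraBot's internals; the rules, the `ExampleBot` user agent, and the helper `is_allowed` are assumptions. It parses rules from a string so it runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: everything is allowed except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data.html"))
```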
Parsing a Website's Sitemap to Find URLs
urls = bot.parse_sitemap("https://example.com")
print(urls)
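A sitemap is an XML file whose `<loc>` elements list page URLs. As a rough illustration of what parsing one involves (again using the standard library rather than YiraBot's own code), the sketch below reads URLs from a sitemap string. The sample sitemap and the helper `extract_sitemap_urls` are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap, so the sketch runs without network access.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""


def extract_sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


sitemap_urls = extract_sitemap_urls(SAMPLE_SITEMAP)
print(sitemap_urls)
```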
Contributing
Contributions to the YiraBot project are welcome. Feel free to fork the repository, make your changes, and submit a pull request.
License
YiraBot is open-source software licensed under the MIT License.
Hashes for YiraBot-1.0.7.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 78875610313d0ea8db14d971844d1c98e37c75f43dcae571184ee657740f2de1
MD5 | b05ac9644ad71f2cbf974cd009fabfbe
BLAKE2b-256 | a6ff69a022aca3395b20daf57acb85c30b7890716483c870bd595acc27c3d7d4