A sophisticated Python module and command-line tool for web crawling
Overview
Meet YiraBot – your new web crawling and SEO analysis companion! Designed for simplicity and ease of use, YiraBot makes web scraping accessible to everyone. Whether you're a seasoned developer, a data enthusiast, or just exploring Python, YiraBot streamlines web data extraction, turning it into an effortless and satisfying task.
Key Features
Command-Line Simplicity
- User-Friendly Commands: Jump right into web crawling with straightforward and powerful commands.
- Ready for Any Task: From quick data grabs to intricate scraping jobs, YiraBot handles it all through the command line.
Module Integration
- Scripting Made Easy: More than a command-line tool – YiraBot integrates flawlessly with your Python scripts for enhanced scraping capabilities.
Ethical and Efficient Crawling
- Respecting Web Standards: YiraBot adheres to robots.txt policies, ensuring responsible web scraping.
- Thorough Data Extraction: Extract everything from meta tags to images and links – YiraBot doesn't miss a beat.
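To illustrate what respecting robots.txt involves, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not YiraBot's own implementation; the robots.txt body, the `ExampleBot` user agent, and the `is_allowed` helper are hypothetical and chosen only so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A public page versus a page under the disallowed /private/ prefix.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data"))
```

A crawler that follows this rule simply skips any URL for which the check returns False.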
User-Friendly Experience
- Simple Data Export: Exporting your data is straightforward with YiraBot's easy options.
- Cross-Platform Performance: Enjoy seamless operation across Linux, Windows, and macOS.
Ideal Uses
- Academic Research: Gather web data effortlessly for your research projects.
- SEO and Website Analysis: Dive deep into website content and SEO elements for comprehensive insights.
- Website Monitoring: Keep tabs on changes and updates across web pages.
- Machine Learning Data Gathering: Conveniently collect data sets for machine learning purposes.
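As an illustration of the website monitoring use case, one possible approach is to fingerprint a page's HTML and compare fingerprints between visits. The sketch below assumes the HTML is already available as a string (for example, fetched in an earlier step); `content_fingerprint` and `has_changed` are hypothetical helpers, not part of YiraBot.

```python
import hashlib


def content_fingerprint(html: str) -> str:
    """Return a SHA-256 hex digest of the page HTML."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, html: str) -> bool:
    """Return True if the HTML no longer matches the stored fingerprint."""
    return content_fingerprint(html) != previous_fingerprint


# Two hypothetical snapshots of the same page taken at different times.
first_snapshot = "<html><body><h1>Prices</h1><p>10 EUR</p></body></html>"
second_snapshot = "<html><body><h1>Prices</h1><p>12 EUR</p></body></html>"

stored = content_fingerprint(first_snapshot)
print(has_changed(stored, first_snapshot))
print(has_changed(stored, second_snapshot))
```

Hashing the raw HTML flags any byte-level difference, so pages with rotating ads or timestamps may need their text extracted before fingerprinting.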
Getting Started
First things first – make sure Python and Pip are installed on your system. Then, you're just one command away:
pip install YiraBot
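To confirm the installation succeeded, one option is to look up the installed distribution with the standard library. This is a minimal sketch; `installed_version` is a hypothetical helper, and it assumes the distribution is registered under the same name used in the pip command.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints a version string when YiraBot is installed, otherwise None.
print(installed_version("YiraBot"))
```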
Command-Line Usage
Kick things off with the help menu:
yirabot
Dive into YiraBot's Capabilities:
- Basic Crawl: 'yirabot crawl example.com'
- Save Crawl to a File: 'yirabot crawl example.com -file' (or -json)
- Content Crawl: 'yirabot crawl-content example.com'
- Check Website for Issues: 'yirabot check example.com'
- Clone a Webpage: 'yirabot get-html example.com'
- Crawl Authentication Protected Pages: 'yirabot session'
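The commands above can also be assembled from a Python script. The sketch below only builds the argument list for `yirabot crawl` with the `-file` or `-json` flags listed above; `build_crawl_command` is a hypothetical helper, and nothing is executed here.

```python
from typing import List, Optional

# Save flags taken from the command list above.
SAVE_FLAGS = ("-file", "-json")


def build_crawl_command(target: str, save_flag: Optional[str] = None) -> List[str]:
    """Build the argument list for a `yirabot crawl` invocation."""
    if save_flag is not None and save_flag not in SAVE_FLAGS:
        raise ValueError(f"unsupported save flag: {save_flag!r}")
    command = ["yirabot", "crawl", target]
    if save_flag is not None:
        command.append(save_flag)
    return command


print(" ".join(build_crawl_command("example.com")))
print(" ".join(build_crawl_command("example.com", "-json")))
```

Passing the returned list to `subprocess.run` would run the command, assuming the `yirabot` executable is on the PATH.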
Using YiraBot in Your Projects
Easily integrate YiraBot in your scripts like so:
from yirabot import Yirabot
# Create a YiraBot instance
bot = Yirabot()
# Example usage
html_content = bot.get_html('https://example.com')
print(html_content)
Methods:
- get_html(url): Retrieves the HTML content of a webpage.
- is_allowed_by_robots_txt(url): Checks if a URL is permitted for crawling by robots.txt.
- parse_sitemap(url): Finds URLs by parsing a website's sitemap.
- crawl(url): Performs a comprehensive crawl of a URL.
- crawl_content(url): Extracts detailed content like text, headings, and lists.
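These methods can be combined. The sketch below fetches a page only when robots.txt permits it, using the two method names listed above. It assumes `is_allowed_by_robots_txt` returns a truthy value for permitted URLs and that `get_html` returns a string; `fetch_if_allowed` and `StubBot` are hypothetical, with the stub standing in for a real `Yirabot` instance so the example runs offline.

```python
from typing import Optional


def fetch_if_allowed(bot, url: str) -> Optional[str]:
    """Return the page HTML, or None when robots.txt disallows the URL."""
    if not bot.is_allowed_by_robots_txt(url):
        return None
    return bot.get_html(url)


class StubBot:
    """Hypothetical stand-in exposing the same two method names."""

    def is_allowed_by_robots_txt(self, url: str) -> bool:
        # Assumption: a truthy result means the URL may be crawled.
        return "/private/" not in url

    def get_html(self, url: str) -> str:
        return "<html><body>stub page</body></html>"


bot = StubBot()
print(fetch_if_allowed(bot, "https://example.com/"))
print(fetch_if_allowed(bot, "https://example.com/private/report"))
```

In a real script, `StubBot()` would be replaced by `Yirabot()`.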
Examples
Crawl a Webpage:
data = bot.crawl('https://example.com')
print(data)
Extract Web Content:
content = bot.crawl_content('https://example.com')
print(content)
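The structure returned by `crawl_content` is not documented here, so the following is only a standard-library sketch of the general idea of pulling headings and links out of HTML. It is not YiraBot's implementation; `HeadingAndLinkCollector` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class HeadingAndLinkCollector(HTMLParser):
    """Collect heading text and link targets from an HTML document."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._heading_parts = None  # list of text pieces while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._heading_parts = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._heading_parts is not None:
            self._heading_parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._heading_parts is not None:
            text = "".join(self._heading_parts).strip()
            if text:
                self.headings.append(text)
            self._heading_parts = None


SAMPLE_HTML = '<h1>Welcome</h1><p>See the <a href="/docs">docs</a>.</p><h2>News</h2>'

collector = HeadingAndLinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.headings)
print(collector.links)
```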
Check Crawlability of a Webpage:
crawlable = bot.is_allowed_by_robots_txt('https://example.com')
print(crawlable)
Discover URLs from a Website's Sitemap:
urls = bot.parse_sitemap("https://example.com")
print(urls)
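For context, a sitemap is an XML file whose `<loc>` elements list page URLs. The sketch below shows one way to read them with the standard library. It is an illustration of the format rather than a description of how `parse_sitemap` works internally; `extract_sitemap_urls` and the sample sitemap are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap, inlined so the sketch needs no network access.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""


def extract_sitemap_urls(xml_text: str) -> List[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


for url in extract_sitemap_urls(SITEMAP_XML):
    print(url)
```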
Contributing
Your contributions are what make YiraBot even better. Fork the repository, make your changes, and create a pull request to join in!
License
YiraBot is open source and released under the MIT License.
Hashes for YiraBot-1.0.7.3.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 3e12e44a03817c881c37c3992b030de820f33eaa68230ea09b3025fa7a7d024b
MD5 | 54b99854e84d2a2e0151138f88690019
BLAKE2b-256 | 54323aeb3dd0c586f49789ea5b2aeede76de0ed08671793824afc47faf15ccb1