A sophisticated Python module and command-line tool for web crawling
Project description
YiraBot
Overview
YiraBot is a web crawling tool for collecting data from web pages. It is primarily a Python-based command-line tool, and it can also be imported as a module in Python projects. It is aimed at developers, data enthusiasts, and researchers who want a simple interface for crawling.
Key Features
Command-Line Focus
- Intuitive Command-Line Interface: Execute various tasks through simple yet powerful commands, making web crawling accessible and efficient.
- Versatile Usage: Ideal for quick tasks or complex data extraction processes, all manageable through the command line.
Module Integration
- Python Library Flexibility: In addition to its command-line prowess, YiraBot can be imported and used as a Python module, offering extended functionality in Python scripts.
Ethical and Efficient Crawling
- Respect for robots.txt: Follows ethical scraping standards by complying with each website's robots.txt policy.
- Rich Data Extraction: Capable of extracting meta tags, images, links, and parsing sitemaps for comprehensive web analysis.
User Experience
- Data Export Capabilities: Extracted data can be saved to a file for later analysis and record-keeping.
Cross-Platform Compatibility
- Universal Application: Works seamlessly across various operating systems.
Ideal Use Cases
- Academic Research: Gathering data from diverse web sources for scholarly studies.
- SEO and Website Audits: Reviewing meta tags, links, and content for SEO analysis.
- Website Monitoring: Tracking updates or changes across web pages.
- Data Gathering for Machine Learning and Analysis: Collecting web data for machine learning models and data projects.
Installation
Ensure that Python and pip are installed on your system, then install YiraBot with:
pip install YiraBot
Command-Line Usage
yirabot <command> [arguments]
Examples
Displaying the help menu:
yirabot
Crawling a webpage:
yirabot crawl example.com
Crawling a webpage and saving the extracted data to a file:
yirabot crawl example.com -file
Crawling a webpage to get the content:
yirabot crawl-content example.com
Using YiraBot in Your Own Projects
Usage:
Import and use YiraBot in your Python script as follows:
from yirabot import Yirabot
# Create an instance of YiraBot
bot = Yirabot()
# Example usage
html_content = bot.get_html('https://example.com')
print(html_content)
Methods:
- get_html(url): Retrieves the HTML content of a webpage.
- is_allowed_by_robots_txt(url): Checks if crawling a URL is allowed by robots.txt.
- parse_sitemap(url): Parses the sitemap of a website to find URLs.
- crawl(url): Crawls a URL and extracts various information.
- crawl_content(url): Extracts detailed content like paragraphs, headings, and lists.
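The methods above can be combined so that a page is crawled only when robots.txt permits it. The following is a minimal sketch rather than part of YiraBot itself: `crawl_if_allowed` is a hypothetical helper, and it assumes that `is_allowed_by_robots_txt` returns a truthy value when crawling is permitted. The return type of `crawl` is not documented here, so the result is passed through unchanged.

```python
def crawl_if_allowed(bot, url):
    """Hypothetical helper: crawl `url` only if robots.txt permits it.

    `bot` is any object exposing is_allowed_by_robots_txt(url) and
    crawl(url), such as a Yirabot instance. Returns the crawl result,
    or None when the URL is disallowed.
    """
    # Assumption: a truthy return value means crawling is allowed.
    if not bot.is_allowed_by_robots_txt(url):
        return None
    return bot.crawl(url)
```

With the package installed, this could be called as `crawl_if_allowed(Yirabot(), 'https://example.com')`.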
Examples
Crawling a Webpage
data = bot.crawl('https://example.com')
print(data)
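When using the module rather than the command line, a crawl result can be written to disk by hand. This is a sketch under assumptions: `save_crawl_result` is a hypothetical helper, the structure returned by `crawl` is assumed to consist of dicts, lists, and strings, and the JSON layout is not necessarily the format the `-file` option produces.

```python
import json
from pathlib import Path


def save_crawl_result(data, path):
    """Hypothetical helper: write a crawl result to `path` as JSON."""
    # Assumption: `data` is built from dicts, lists, and strings.
    # default=str converts any other value to its string form
    # instead of raising a TypeError.
    text = json.dumps(data, indent=2, ensure_ascii=False, default=str)
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target
```

For example, `save_crawl_result(data, 'example_com.json')` would store the result of the crawl above next to the script.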
Extracting Content
content = bot.crawl_content('https://example.com')
print(content)
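To illustrate the kind of extraction involved (paragraphs, headings, and lists), here is a minimal standard-library sketch. It is not YiraBot's implementation: `ContentExtractor` and `extract_content` are hypothetical names, and the dictionary layout is an assumption rather than the format `crawl_content` returns.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collects the text of headings, paragraphs, and list items."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.content = {"headings": [], "paragraphs": [], "list_items": []}
        self._bucket = None  # key of the list currently being filled
        self._parts = []     # text fragments of the current element

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._bucket = "headings"
        elif tag == "p":
            self._bucket = "paragraphs"
        elif tag == "li":
            self._bucket = "list_items"
        else:
            return  # inline tags such as <a> or <b> keep the current bucket
        self._parts = []

    def handle_data(self, data):
        if self._bucket is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag in ("p", "li"):
            if self._bucket is not None:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._parts).split())
                if text:
                    self.content[self._bucket].append(text)
            self._bucket = None
            self._parts = []


def extract_content(html):
    """Return headings, paragraphs, and list items found in `html`."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return parser.content
```

The sketch handles flat markup only; block elements nested inside one another (for example a paragraph inside a list item) would need extra bookkeeping.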
Checking Whether a Webpage Is Crawlable
crawlable = bot.is_allowed_by_robots_txt('https://example.com')
print(crawlable)
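For readers unfamiliar with how a robots.txt check works, the sketch below evaluates a set of rules with Python's standard `urllib.robotparser`. It is an illustration only, not YiraBot's internals: `is_allowed` is a hypothetical function, and the rules are passed in as text so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if `user_agent` may fetch `url` under the given rules.

    `robots_txt` is the text content of a robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/index.html"))
print(is_allowed(rules, "https://example.com/private/page"))
```

With these rules, the first URL is permitted and the second falls under the `Disallow: /private/` line.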
Parsing a Website's Sitemap to Find URLs
urls = bot.parse_sitemap("https://example.com")
print(urls)
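Sitemaps are XML documents whose `<loc>` elements hold page URLs. The sketch below shows how such a file can be read with the standard library. It is not YiraBot's implementation: `extract_sitemap_urls` is a hypothetical function that works on sitemap text already in hand, and it assumes the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(sitemap_xml):
    """Return every <loc> value found in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    ]
```

Sitemap index files also use `<loc>`, so for those the function returns the URLs of the nested sitemaps rather than of individual pages.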
Contributing
Contributions to the YiraBot project are welcome. Feel free to fork the repository, make your changes, and submit a pull request.
License
YiraBot is open-source software licensed under the MIT License.
Hashes for YiraBot-1.0.6.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 4eaba8f74361dbb33ff9005836570b5c74a47ba25a4abb98800fdea97fde7a36
MD5 | d7faf5ed94281cb54bc1919db1385931
BLAKE2b-256 | bb88e027f56d7763f562de86ab6cb17b0765aa0f731edfd606a7d31eda674931