A sophisticated Python module and command-line tool for web crawling

YiraBot

Overview

YiraBot is a web-crawling tool for collecting data from websites. It is primarily a Python-based command-line tool, and it can also be imported as a module in Python projects. Aimed at developers, data enthusiasts, and researchers, it extracts meta tags, images, and links, parses sitemaps, and honors robots.txt.

Key Features

Command-Line Focus

Intuitive Command-Line Interface: Execute various tasks through simple yet powerful commands, making web crawling accessible and efficient.
Versatile Usage: Ideal for quick tasks or complex data extraction processes, all manageable through the command line.

Module Integration

Python Library Flexibility: In addition to its command-line prowess, YiraBot can be imported and used as a Python module, offering extended functionality in Python scripts.

Ethical and Efficient Crawling

Respect for robots.txt: Adheres to ethical scraping standards by complying with each website's robots.txt policy.
Rich Data Extraction: Capable of extracting meta tags, images, links, and parsing sitemaps for comprehensive web analysis.
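
To make the robots.txt check concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not YiraBot's own implementation, which this page does not show; the robots.txt body, the `ExampleBot` user agent, and the helper name `is_allowed` are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body. A real crawler would download it from
# https://<host>/robots.txt before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/index.html"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data"))
```

Parsing the rules from a string keeps the sketch runnable offline; in practice the rules would be fetched once per host and reused for every URL on that host.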

User Experience

Data Export Capabilities: Crawl results can be written to a file for later analysis and record-keeping.
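
As a rough illustration of exporting crawl results, the sketch below writes a dictionary to a JSON file. The output format YiraBot actually uses is not stated here, so JSON, the `export_to_json` helper, and the shape of the sample data are all assumptions.

```python
import json
import tempfile
from pathlib import Path


def export_to_json(data: dict, path: Path) -> Path:
    """Write crawl results to `path` as pretty-printed UTF-8 JSON."""
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


if __name__ == "__main__":
    # Hypothetical crawl result; the structure YiraBot produces may differ.
    result = {
        "url": "https://example.com",
        "meta": {"description": "Example Domain"},
        "links": ["https://www.iana.org/domains/example"],
    }
    with tempfile.TemporaryDirectory() as tmp:
        out = export_to_json(result, Path(tmp) / "example.com.json")
        print(out.read_text(encoding="utf-8"))
```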

Cross-Platform Compatibility

Universal Application: Works seamlessly across various operating systems.

Ideal Use Cases

Academic Research: Gathering data from diverse web sources for scholarly studies.
SEO and Website Audits: Reviewing meta tags, links, and content for SEO analysis.
Website Monitoring: Tracking updates or changes across web pages.
Data Gathering for Machine Learning and Analysis: Collecting web data for machine learning models and data projects.

Installation

Ensure Python and pip are installed on your system, then install YiraBot with:

pip install YiraBot

Command-Line Usage

yirabot <command> [arguments]

Examples

Displaying the help menu:

yirabot

Crawling a webpage:

yirabot crawl example.com

Crawling a webpage and saving the extracted data to a file:

yirabot crawl example.com -file

Crawling a webpage to get the content:

yirabot crawl-content example.com
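
The commands above can also be driven from a script, for example to crawl several sites in a row. The sketch below only assembles and runs the command lines shown in this section; the helper names `build_command` and `run_yirabot` are hypothetical, and it assumes the `yirabot` executable is on your PATH.

```python
import shutil
import subprocess


def build_command(command: str, target: str, save_to_file: bool = False) -> list[str]:
    """Assemble a `yirabot <command> <target> [-file]` argument list."""
    args = ["yirabot", command, target]
    if save_to_file:
        args.append("-file")
    return args


def run_yirabot(command: str, target: str, save_to_file: bool = False) -> int:
    """Run the command and return its exit code."""
    if shutil.which("yirabot") is None:
        raise FileNotFoundError("yirabot executable not found on PATH")
    return subprocess.run(build_command(command, target, save_to_file)).returncode


if __name__ == "__main__":
    # Print the command lines instead of running them, so this works
    # even when YiraBot is not installed.
    for site in ["example.com", "example.org"]:
        print(" ".join(build_command("crawl", site, save_to_file=True)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a URL contains characters such as `&` or `?`.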

Using YiraBot in Your Own Projects

Usage:

Import and use YiraBot in your Python script as follows:

from yirabot import Yirabot

# Create an instance of YiraBot
bot = Yirabot()

# Example usage
html_content = bot.get_html('https://example.com')
print(html_content)

Methods:

get_html(url): Retrieves the HTML content of a webpage.
is_allowed_by_robots_txt(url): Checks if crawling a URL is allowed by robots.txt.
parse_sitemap(url): Parses the sitemap of a website to find URLs.
crawl(url): Crawls a URL and extracts various information.
crawl_content(url): Extracts detailed content like paragraphs, headings, and lists.

Examples

Crawling a Webpage

data = bot.crawl('https://example.com')
print(data)
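
The structure of the value returned by `crawl` is not documented on this page. To give a sense of the kind of extraction involved, the following sketch pulls meta tags, links, and images out of an HTML string with the standard `html.parser` module. It is an independent illustration, not YiraBot's code; `PageInfoParser`, `extract_page_info`, and the sample HTML are assumptions.

```python
from html.parser import HTMLParser

SAMPLE_HTML = """
<html><head><title>Demo</title>
<meta name="description" content="A demo page">
</head><body>
<a href="/about">About</a>
<img src="/logo.png" alt="Logo">
</body></html>
"""


class PageInfoParser(HTMLParser):
    """Collect named meta tags, link targets, and image sources."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and "name" in attr and "content" in attr:
            self.meta[attr["name"]] = attr["content"]
        elif tag == "a" and attr.get("href"):
            self.links.append(attr["href"])
        elif tag == "img" and attr.get("src"):
            self.images.append(attr["src"])


def extract_page_info(html: str) -> dict:
    """Return the meta tags, links, and images found in an HTML string."""
    parser = PageInfoParser()
    parser.feed(html)
    return {"meta": parser.meta, "links": parser.links, "images": parser.images}


if __name__ == "__main__":
    print(extract_page_info(SAMPLE_HTML))
```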

Extracting Content

content = bot.crawl_content('https://example.com')
print(content)
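
Content extraction of this kind, covering headings, paragraphs, and list items, can be sketched with the standard library as well. The example below is an illustration under stated assumptions rather than YiraBot's implementation: `ContentParser`, `extract_content`, and the output keys are hypothetical, and the sketch does not handle block elements nested inside one another.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BUCKETS = {"p": "paragraphs", "li": "list_items"}


class ContentParser(HTMLParser):
    """Group the text of headings, paragraphs, and list items."""

    def __init__(self):
        super().__init__()
        self.content = {"headings": [], "paragraphs": [], "list_items": []}
        self._bucket = None  # key of the element currently being read
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._bucket = "headings"
        elif tag in BUCKETS:
            self._bucket = BUCKETS[tag]
        else:
            return  # inline tags such as <b> keep the current element open
        self._text = []

    def handle_data(self, data):
        if self._bucket is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._bucket is not None and (tag in HEADINGS or tag in BUCKETS):
            text = " ".join("".join(self._text).split())  # collapse whitespace
            if text:
                self.content[self._bucket].append(text)
            self._bucket = None


def extract_content(html: str) -> dict:
    """Return headings, paragraphs, and list items found in an HTML string."""
    parser = ContentParser()
    parser.feed(html)
    return parser.content


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>First <b>bold</b> paragraph.</p><ul><li>One</li><li>Two</li></ul>"
    print(extract_content(sample))
```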

Checking if a Webpage Is Crawlable

crawlable = bot.is_allowed_by_robots_txt('https://example.com')
print(crawlable)
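
A common pattern is to combine this check with `get_html` so that a page is fetched only when robots.txt permits it. The sketch below assumes that `is_allowed_by_robots_txt` returns a truthy value for allowed URLs and that `get_html` returns the page as a string; neither return type is stated on this page. `fetch_if_allowed` and `StubBot` are hypothetical names, and the stub stands in for a real bot so the example runs offline.

```python
from typing import Optional


def fetch_if_allowed(bot, url: str) -> Optional[str]:
    """Fetch url with bot.get_html only when robots.txt permits it."""
    if not bot.is_allowed_by_robots_txt(url):
        return None
    return bot.get_html(url)


class StubBot:
    """Offline stand-in exposing the same two method names as the bot."""

    def __init__(self, blocked):
        self.blocked = set(blocked)

    def is_allowed_by_robots_txt(self, url):
        return url not in self.blocked

    def get_html(self, url):
        return f"<html><!-- {url} --></html>"


if __name__ == "__main__":
    bot = StubBot(blocked=["https://example.com/private"])
    print(fetch_if_allowed(bot, "https://example.com"))
    print(fetch_if_allowed(bot, "https://example.com/private"))
```

With YiraBot installed, the stub could be swapped for the `bot` instance created earlier, provided the assumptions about return values hold.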

Parsing the Sitemap of a Website to Find URLs

urls = bot.parse_sitemap("https://example.com")
print(urls)
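
For readers unfamiliar with sitemaps, the sketch below shows what extracting URLs from one involves, using the standard `xml.etree.ElementTree` module and the sitemaps.org namespace. It is not YiraBot's implementation; `extract_sitemap_urls` and the sample document are assumptions, and sitemap index files are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap; a real one is usually served at /sitemap.xml.
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
"""


def extract_sitemap_urls(xml_data: bytes) -> list[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_data)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```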

Contributing

Contributions to the YiraBot project are welcome. Feel free to fork the repository, make your changes, and submit a pull request.

License

YiraBot is open-source software licensed under the MIT License.


Release history

Version	Release date
1.0.9.2	Mar 3, 2024
1.0.9.1	Mar 3, 2024
1.0.9	Mar 2, 2024
1.0.8	Feb 2, 2024
1.0.7.3.1	Jan 27, 2024
1.0.7.3	Jan 27, 2024
1.0.7.2	Jan 26, 2024
1.0.7.1	Jan 25, 2024
1.0.7	Jan 25, 2024
1.0.6.4	Jan 24, 2024
1.0.6.3 (this version)	Jan 24, 2024
1.0.6.2	Jan 23, 2024
1.0.6.1	Jan 23, 2024
1.0.5	Jan 23, 2024
1.0.4	Jan 22, 2024
1.0.3	Jan 22, 2024
1.0.2	Jan 22, 2024
1.0.1	Jan 22, 2024
1.0.0	Jan 22, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

YiraBot-1.0.6.3.tar.gz (8.2 kB)

Uploaded Jan 24, 2024 Source

Built Distribution

YiraBot-1.0.6.3-py3-none-any.whl (10.0 kB)

Uploaded Jan 24, 2024 Python 3

Hashes for YiraBot-1.0.6.3.tar.gz

Algorithm	Hash digest
SHA256	`de4f6055bf0a297bac769f013833343f9c15f258574b170d282a813e9a76918c`
MD5	`a0ab53bbad17ebe775b45072172ca86b`
BLAKE2b-256	`a9bc676b8dfa61f4d172b333924ec38afee3857fe991eda8594fbb84fad484bf`

Hashes for YiraBot-1.0.6.3-py3-none-any.whl

Algorithm	Hash digest
SHA256	`9026bfca249fb37b36778808f058e11f4d066eba9d096417cbe50d567187a6b7`
MD5	`7dc120060ddbda5b205fe0d4ec91e73c`
BLAKE2b-256	`b543b8c38194aff1087264a20a2405c32c51291bade6c8ddbf0d493b961d1573`
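
A downloaded file can be checked against the SHA256 digests listed above with a few lines of standard-library Python. This is a minimal sketch: the helper names `sha256_of_file` and `verify_sha256` are hypothetical, and the file name in the demo assumes the source distribution was saved to the current directory.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path: str, expected: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    # Digest copied from the table above for the source distribution.
    expected = "de4f6055bf0a297bac769f013833343f9c15f258574b170d282a813e9a76918c"
    archive = "YiraBot-1.0.6.3.tar.gz"  # assumed download location
    if os.path.exists(archive):
        print("match" if verify_sha256(archive, expected) else "MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```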