A sophisticated Python module and command-line tool for web crawling

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

YiraBot

Overview

Meet YiraBot – your new web crawling and SEO analysis companion! Designed for simplicity and ease of use, YiraBot makes web scraping accessible to everyone. Whether you're a seasoned developer, a data enthusiast, or just exploring Python, YiraBot streamlines web data extraction, turning it into an effortless and satisfying task.

Key Features

Command-Line Simplicity

User-Friendly Commands: Jump right into web crawling with straightforward and powerful commands.
Ready for Any Task: From quick data grabs to intricate scraping jobs, YiraBot handles it all through the command line.

Module Integration

Scripting Made Easy: More than a command-line tool – YiraBot integrates flawlessly with your Python scripts for enhanced scraping capabilities.

Ethical and Efficient Crawling

Respecting Web Standards: YiraBot adheres to robots.txt policies, ensuring responsible web scraping.
Thorough Data Extraction: Extract everything from meta tags to images and links – YiraBot doesn't miss a beat.
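
To make the robots.txt point above concrete, here is a minimal sketch of such a check using Python's standard `urllib.robotparser`. It is not YiraBot's own implementation, which this page does not show; the robots.txt content, the `ExampleBot` user agent, and the `is_allowed` helper are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ExampleBot" is a made-up user agent for this example.
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/data"))
```

A crawler would normally download `/robots.txt` from the target host first and run a check like this before each request.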

User-Friendly Experience

Simple Data Export: Exporting your data is straightforward with YiraBot's easy options.
Cross-Platform Performance: Enjoy seamless operation across Linux, Windows, and macOS.

Ideal Uses

Academic Research: Gather web data effortlessly for your research projects.
SEO and Website Analysis: Dive deep into website content and SEO elements for comprehensive insights.
Website Monitoring: Keep tabs on changes and updates across web pages.
Machine Learning Data Gathering: Conveniently collect data sets for machine learning purposes.

Getting Started

First things first – make sure Python and pip are installed on your system. Then, you're just one command away:

pip install YiraBot

Command-Line Usage

Kick things off with the help menu:

yirabot

Dive into YiraBot's Capabilities:

Basic Crawl: 'yirabot crawl example.com'
Save Crawl to a File: 'yirabot crawl example.com -file' (or -json)
Content Crawl: 'yirabot crawl-content example.com'
Check Website for Issues: 'yirabot check example.com'
Clone a Webpage: 'yirabot get-html example.com'
Crawl Authentication-Protected Pages: 'yirabot session'
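
If you want to drive the commands above from a Python script, one option is to build the argument list and hand it to `subprocess`. The sketch below is only an illustration: `build_crawl_command` and `run_crawl` are hypothetical helper names, and the only flags used are the `-file` and `-json` options listed above. It assumes `yirabot` is installed and on your PATH when `run_crawl` is actually called.

```python
import subprocess
from typing import List, Optional


def build_crawl_command(url: str, save_as: Optional[str] = None) -> List[str]:
    """Build the argument list for 'yirabot crawl', optionally with a save flag."""
    cmd = ["yirabot", "crawl", url]
    if save_as == "file":
        cmd.append("-file")
    elif save_as == "json":
        cmd.append("-json")
    elif save_as is not None:
        raise ValueError("save_as must be 'file', 'json', or None")
    return cmd


def run_crawl(url: str, save_as: Optional[str] = None) -> subprocess.CompletedProcess:
    """Run the crawl command and capture its output (requires yirabot on PATH)."""
    return subprocess.run(
        build_crawl_command(url, save_as),
        capture_output=True,
        text=True,
        check=False,
    )


# Only builds the command here; call run_crawl() to execute it.
print(build_crawl_command("example.com", save_as="json"))
```

Passing the command as a list avoids shell quoting problems when URLs contain special characters.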

Using YiraBot in Your Projects

Easily integrate YiraBot in your scripts like so:

from yirabot import Yirabot

# Create a YiraBot instance
bot = Yirabot()

# Example usage
html_content = bot.get_html('https://example.com')
print(html_content)

Methods:

get_html(url): Retrieves the HTML content of a webpage.
is_allowed_by_robots_txt(url): Checks if a URL is permitted for crawling by robots.txt.
parse_sitemap(url): Finds URLs by parsing a website's sitemap.
crawl(url): Performs a comprehensive crawl of a URL.
crawl_content(url): Extracts detailed content like text, headings, and lists.

Examples

Crawl a Webpage:

data = bot.crawl('https://example.com')
print(data)
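
Once you have a crawl result, you may want to keep it on disk. The following is a minimal sketch that assumes the result is a dictionary of mostly JSON-friendly values; the actual structure returned by `crawl()` is not documented here, so the `sample` data and the `save_as_json` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_as_json(data, path) -> Path:
    """Write data to path as pretty-printed UTF-8 JSON and return the path."""
    path = Path(path)
    # default=str turns values json cannot encode (e.g. datetimes) into strings.
    text = json.dumps(data, indent=2, ensure_ascii=False, default=str)
    path.write_text(text, encoding="utf-8")
    return path


# Hypothetical crawl result; the real structure returned by crawl() may differ.
sample = {
    "url": "https://example.com",
    "title": "Example Domain",
    "links": ["https://www.iana.org/domains/example"],
}

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp:
    out = save_as_json(sample, Path(tmp) / "crawl.json")
    print(out.read_text(encoding="utf-8"))
```

In a real script you would pass the value returned by `bot.crawl(...)` instead of `sample`.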

Extract Web Content:

content = bot.crawl_content('https://example.com')
print(content)
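
For a rough idea of what extracting headings and list items from a page involves, here is a small standard-library sketch built on `html.parser`. It is not how YiraBot works internally (that is not shown on this page); `OutlineParser`, `extract_outline`, and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect heading text (h1-h6) and list item text from an HTML document."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []    # list of (tag, text) tuples
        self.list_items = []  # list of strings
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "li":
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if tag == "li":
                self.list_items.append(text)
            else:
                self.headings.append((tag, text))
            self._current = None


def extract_outline(html: str) -> dict:
    """Return the headings and list items found in html."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return {"headings": parser.headings, "list_items": parser.list_items}


# Hypothetical page content for the example.
SAMPLE_HTML = "<h1>Title</h1><p>Intro</p><ul><li>One</li><li>Two</li></ul><h2>Next</h2>"
print(extract_outline(SAMPLE_HTML))
```

This sketch does not handle nested lists; a production crawler would typically use a full HTML parsing library.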

Check Crawlability of a Webpage:

crawlable = bot.is_allowed_by_robots_txt('https://example.com')
print(crawlable)
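
The robots.txt check pairs naturally with fetching. Below is a minimal sketch of a guard that only downloads a page when crawling is permitted. It relies solely on the `is_allowed_by_robots_txt` and `get_html` methods listed above and assumes the former returns a truthy value when crawling is allowed. `fetch_if_allowed` is a hypothetical helper, and `FakeBot` is a stand-in so the example runs without network access.

```python
from typing import Optional


def fetch_if_allowed(bot, url: str) -> Optional[str]:
    """Return the page HTML if robots.txt permits crawling, otherwise None."""
    if not bot.is_allowed_by_robots_txt(url):
        return None
    return bot.get_html(url)


class FakeBot:
    """Stand-in with the same two method names, used only for this example."""

    def __init__(self, allowed: bool):
        self.allowed = allowed

    def is_allowed_by_robots_txt(self, url: str) -> bool:
        return self.allowed

    def get_html(self, url: str) -> str:
        return "<html><body>" + url + "</body></html>"


print(fetch_if_allowed(FakeBot(allowed=True), "https://example.com"))
print(fetch_if_allowed(FakeBot(allowed=False), "https://example.com"))
```

With the real library you would pass a `Yirabot()` instance in place of `FakeBot`.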

Discover URLs from a Website's Sitemap:

urls = bot.parse_sitemap('https://example.com')
print(urls)
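
If you are curious what sitemap parsing looks like under the hood, the sketch below pulls `<loc>` entries out of a sitemap with the standard library's `xml.etree.ElementTree`. It illustrates the general technique under the sitemaps.org schema and is not YiraBot's actual implementation; `urls_from_sitemap` and the sample XML are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> value found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, inlined so the example runs offline.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/about</loc></url>"
    "</urlset>"
)

print(urls_from_sitemap(SAMPLE_SITEMAP))
```

Real sitemaps are often split across a sitemap index, in which case each `<loc>` points to another sitemap file that needs to be fetched and parsed in turn.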

Contributing

Your contributions are what make YiraBot even better. Fork the repository, make your changes, and create a pull request to join in!

License

YiraBot is open source and released under the MIT License.

Project details

Release history

1.0.9.2: Mar 3, 2024
1.0.9.1: Mar 3, 2024
1.0.9: Mar 2, 2024
1.0.8: Feb 2, 2024
1.0.7.3.1: Jan 27, 2024 (this version)
1.0.7.3: Jan 27, 2024
1.0.7.2: Jan 26, 2024
1.0.7.1: Jan 25, 2024
1.0.7: Jan 25, 2024
1.0.6.4: Jan 24, 2024
1.0.6.3: Jan 24, 2024
1.0.6.2: Jan 23, 2024
1.0.6.1: Jan 23, 2024
1.0.5: Jan 23, 2024
1.0.4: Jan 22, 2024
1.0.3: Jan 22, 2024
1.0.2: Jan 22, 2024
1.0.1: Jan 22, 2024
1.0.0: Jan 22, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

YiraBot-1.0.7.3.1.tar.gz (11.1 kB)

Uploaded Jan 27, 2024 Source

Built Distribution

YiraBot-1.0.7.3.1-py3-none-any.whl (12.9 kB)

Uploaded Jan 27, 2024 Python 3

Hashes for YiraBot-1.0.7.3.1.tar.gz
Algorithm	Hash digest
SHA256	`bc92ce48eeb63073293af17bb5b7c558ffcf8e35360d6237d4fbcfdb399e4da2`
MD5	`6df5113027b69d734635a4a261a9d241`
BLAKE2b-256	`38dd386c696aa8e8dde68721b7a5b5d0701ef7426adfb6675467c0a102106b49`

Hashes for YiraBot-1.0.7.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`3e12e44a03817c881c37c3992b030de820f33eaa68230ea09b3025fa7a7d024b`
MD5	`54b99854e84d2a2e0151138f88690019`
BLAKE2b-256	`54323aeb3dd0c586f49789ea5b2aeede76de0ed08671793824afc47faf15ccb1`