Yet Another Internet Crawler

YACRAWLER - Yet Another Internet Crawler

Introduction

YACRAWLER is a simple web crawler written in Python. It is designed to be easy to use and flexible, allowing users to customize the crawling behavior and output format.

YACRAWLER is fully asynchronous, making it efficient and capable of handling large amounts of data quickly. It uses the aiohttp library for making HTTP requests and asyncio for managing the asynchronous tasks.
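
To illustrate the asynchronous model, here is a minimal sketch of a depth-limited crawl that caps the number of concurrent fetches, using only asyncio. It is not YACRAWLER's implementation: crawl, fake_fetch, and FAKE_SITE are hypothetical names, and the fetcher is an in-memory stand-in for a real HTTP client such as aiohttp.

```python
import asyncio
from typing import Awaitable, Callable, Iterable

# Hypothetical in-memory "site": each URL maps to the URLs it links to.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c"],
    "/c": ["/d"],
    "/d": [],
}


async def fake_fetch(url: str) -> list[str]:
    """Stand-in for an HTTP request; yields control like real I/O would."""
    await asyncio.sleep(0)
    return FAKE_SITE.get(url, [])


async def crawl(
    start_url: str,
    max_depth: int,
    max_workers: int,
    fetch: Callable[[str], Awaitable[Iterable[str]]],
) -> set[str]:
    """Fetch pages level by level, with at most max_workers fetches in flight."""
    semaphore = asyncio.Semaphore(max_workers)
    visited: set[str] = set()
    frontier = [start_url]

    async def bounded_fetch(url: str) -> Iterable[str]:
        async with semaphore:  # limits concurrency
            return await fetch(url)

    for _ in range(max_depth + 1):  # depth 0 is the start URL itself
        if not frontier:
            break
        visited.update(frontier)
        results = await asyncio.gather(*(bounded_fetch(u) for u in frontier))
        next_frontier: list[str] = []
        for links in results:
            for link in links:
                if link not in visited and link not in next_frontier:
                    next_frontier.append(link)
        frontier = next_frontier
    return visited
```

Calling asyncio.run(crawl("/", 1, 2, fake_fetch)) fetches the start page and the pages it links to directly, with no more than two requests in flight at once.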

YACRAWLER's terminal user interface is built with Textual, a modern library for building rich text-based user interfaces in Python. Textual provides a simple API for creating interactive applications with rich text and widgets.

Example Usage

To use YACRAWLER, create an instance of the CrawlerTuiApp class and pass it the necessary parameters. Here is an example:

from yacrawler.core import Pipeline
from yacrawler.tui import CrawlerTuiApp
from yacrawler.utilities.aioadapter import AioRequest
from yacrawler.utilities.discoverers import SimpleRegexDiscoverer
from yacrawler.utilities.processors import parse_to_dict, write_dict_to_file

# Processors that handle the crawled data
pipeline = Pipeline(
    processors=[
        parse_to_dict,
        write_dict_to_file,
    ]
)

app = CrawlerTuiApp(
    start_url="https://blog.yurin.top",
    max_depth=3,
    max_workers=10,
    request_adapter=AioRequest(),  # makes HTTP requests with aiohttp
    discoverer_adapter=SimpleRegexDiscoverer(),  # finds new URLs with regular expressions
    pipeline=pipeline,
)

Then start the crawling process from the command line:

python -m yacrawler YOUR_FILE.app

Features

Pipelines

Pipelines are a powerful feature of YACRAWLER that allow users to customize the processing of the crawled data. Users can define their own processors and add them to the pipeline to perform tasks such as parsing the HTML content, extracting specific information, and writing the data to a file.

Note: Pipeline processors are strongly type-checked, so you can't add a processor that doesn't match the type of the data it is expected to process.
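
As an illustration of what this kind of type checking can look like, the sketch below chains processors and rejects any processor whose annotated input type differs from the previous processor's return type. TypedPipeline, parse_title, and title_length are hypothetical and do not reproduce YACRAWLER's Pipeline API.

```python
from typing import Any, Callable, get_type_hints


class TypedPipeline:
    """Runs processors in order after checking that their annotations line up."""

    def __init__(self, processors: list[Callable[[Any], Any]]) -> None:
        self._check(processors)
        self.processors = processors

    @staticmethod
    def _check(processors: list[Callable[[Any], Any]]) -> None:
        previous_name = None
        previous_output = None
        for processor in processors:
            hints = get_type_hints(processor)
            output = hints.pop("return", None)
            if len(hints) != 1:
                raise TypeError(
                    f"{processor.__name__} must take exactly one annotated argument"
                )
            (expected_input,) = hints.values()
            # Compare this processor's input with the previous processor's output.
            if previous_output is not None and expected_input != previous_output:
                raise TypeError(
                    f"{processor.__name__} expects {expected_input!r}, "
                    f"but {previous_name} returns {previous_output!r}"
                )
            previous_name = processor.__name__
            previous_output = output

    def run(self, data: Any) -> Any:
        for processor in self.processors:
            data = processor(data)
        return data


def parse_title(html: str) -> dict:
    """Extract the text between <title> and </title>."""
    start = html.find("<title>") + len("<title>")
    end = html.find("</title>")
    return {"title": html[start:end]}


def title_length(page: dict) -> int:
    return len(page["title"])
```

With these definitions, TypedPipeline([parse_title, title_length]) is accepted, while TypedPipeline([title_length, parse_title]) raises a TypeError at construction time because an int is not the str that parse_title expects.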

Customizable Request Adapters

YACRAWLER allows users to customize the request adapter to use their own HTTP client or library. The default request adapter is AioRequest, which uses the aiohttp library to make HTTP requests asynchronously.
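
The idea behind a swappable request adapter can be sketched as follows. The interface shown here (a single async fetch method returning the response body) is an assumption made for illustration, not YACRAWLER's actual adapter contract; RequestAdapter, CannedRequest, and page_size are hypothetical names. The example adapter serves pre-recorded pages, which can be handy for testing without network access.

```python
import asyncio
from typing import Protocol


class RequestAdapter(Protocol):
    """Hypothetical interface: anything with an async fetch(url) -> str."""

    async def fetch(self, url: str) -> str: ...


class CannedRequest:
    """Adapter that serves pre-recorded pages instead of hitting the network."""

    def __init__(self, pages: dict[str, str]) -> None:
        self.pages = pages

    async def fetch(self, url: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real request would
        return self.pages.get(url, "")


async def page_size(adapter: RequestAdapter, url: str) -> int:
    """Code written against the interface works with any conforming adapter."""
    body = await adapter.fetch(url)
    return len(body)
```

Because page_size depends only on the fetch method, an aiohttp-based adapter and the canned one above would be interchangeable from its point of view.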

Customizable Discoverer Adapters

YACRAWLER allows users to customize the discoverer adapter to use their own method for discovering new URLs to crawl. The default discoverer adapter is SimpleRegexDiscoverer, which uses regular expressions to discover new URLs from the HTML content of the crawled pages.
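
A regular-expression-based discoverer can be sketched roughly like this. The pattern and the discover_links function are illustrative assumptions, not the code behind SimpleRegexDiscoverer: the sketch pulls href values out of the HTML, resolves them against the page URL, drops fragments and non-HTTP schemes, and removes duplicates.

```python
import re
from urllib.parse import urldefrag, urljoin

# Matches href="..." or href='...' and captures the quoted value.
HREF_PATTERN = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def discover_links(base_url: str, html: str) -> list[str]:
    """Return absolute HTTP(S) links found in html, in order of appearance."""
    links: list[str] = []
    for raw in HREF_PATTERN.findall(html):
        # Resolve relative links and strip any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, raw))
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links
```

A regex is simple and fast but can miss or misread links in unusual markup; an HTML parser is a common alternative when accuracy matters more.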

License

YACRAWLER is licensed under the MIT License. See the LICENSE file for more information.

Acknowledgments

YACRAWLER is built using the following libraries:

  • aiohttp: A library for making HTTP requests asynchronously.
  • asyncio: Python's standard library module for managing asynchronous tasks.
  • Textual: A library for building rich text-based user interfaces in Python.
  • aiofiles: A library for handling file I/O operations asynchronously.

Contributing

Contributions are welcome! If you have any ideas for improvements or features, please open an issue or submit a pull request.

Download files

Source Distribution

yacrawler-0.1.3.tar.gz (15.4 kB view details)

Uploaded Source

Built Distribution

yacrawler-0.1.3-py3-none-any.whl (18.4 kB view details)

Uploaded Python 3

File details

Details for the file yacrawler-0.1.3.tar.gz.

File metadata

  • Download URL: yacrawler-0.1.3.tar.gz
  • Upload date:
  • Size: 15.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for yacrawler-0.1.3.tar.gz
Algorithm Hash digest
SHA256 07d5ce196de6a1c5f9adc20d56b17b5016139ef9802318815493d9a782ad7c1d
MD5 c7ab94b79c9cb4223e0eceb5f142052f
BLAKE2b-256 b4b7530ad8d647d8b77d88e33bc23db8900fc028620a67c57c2e9362040f7cb1

Provenance

The following attestation bundles were made for yacrawler-0.1.3.tar.gz:

Publisher: publish.yml on LiYulin-s/yacrawler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file yacrawler-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: yacrawler-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 18.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for yacrawler-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 75800536eb23b15b04768b4c68445f0bb1eafd885148f74de2420d9441efd347
MD5 4ecbe71002d60e5458fa83ad26323175
BLAKE2b-256 784679fafa0bdf70b27600bcda37c13fa237f57a67acf073a9b631fccdf6fd80

Provenance

The following attestation bundles were made for yacrawler-0.1.3-py3-none-any.whl:

Publisher: publish.yml on LiYulin-s/yacrawler

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
