
AsyncURLCrawler

AsyncURLCrawler navigates web pages concurrently, following hyperlinks to collect URLs. It traverses pages using a breadth-first search (BFS) algorithm. Before crawling, check the robots.txt of the target domains.
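The BFS traversal can be sketched as follows. This is an illustrative toy over an in-memory link graph, not the package's internals; the function name and data here are made up for the example.

```python
from collections import deque


def bfs_urls(seed, links):
    """Breadth-first traversal over a link graph (dict: url -> list of urls)."""
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for neighbor in links.get(url, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order


# Toy link graph: "a" links to "b" and "c", and "b" links to "d".
print(bfs_urls("a", {"a": ["b", "c"], "b": ["d"]}))  # ['a', 'b', 'c', 'd']
```

BFS visits all pages at a given link depth before descending, which is why shallow URLs are collected first.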

👉 For complete documentation read here

👉 Source code on Github here


Install Package

pip install AsyncURLCrawler
pip install AsyncURLCrawler==<version>

👉 The official page of the project on PyPI.


Usage Example in Code

Here is a simple Python script showing how to use the package:

import asyncio
import os
from AsyncURLCrawler.parser import Parser
from AsyncURLCrawler.crawler import Crawler
import yaml


async def main():
    parser = Parser(
        delay_start=0.1, 
        max_retries=5, 
        request_timeout=1,
        user_agent="Mozilla",
    )
    crawler = Crawler( 
        seed_urls=["https://pouyae.ir"],
        parser=parser,
        exact=True,
        deep=False,
        delay=0.1,
    )
    result = await crawler.crawl()
    output_path = "."  # directory where result.yaml will be written
    with open(os.path.join(output_path, "result.yaml"), "w") as file:
        for key in result:
            result[key] = list(result[key])  # sets are not YAML-serializable
        yaml.dump(result, file)


if __name__ == "__main__":
    asyncio.run(main())

This is the output for the above code:

https://pouyae.ir:
- https://github.com/PouyaEsmaeili/AsyncURLCrawler
- https://pouyae.ir/images/pouya3.jpg
- https://github.com/PouyaEsmaeili/CryptographicClientSideUserState
- https://github.com/PouyaEsmaeili/RateLimiter
- https://pouyae.ir/
- https://github.com/luizdepra/hugo-coder/
- https://duman.pouyae.ir/
- https://pouyae.ir/projects/
- https://pouyae.ir/images/pouya4.jpg
- https://pouyae.ir/images/pouya5.jpg
- https://pouyae.ir/gallery/
- https://github.com/PouyaEsmaeili
- https://pouyae.ir/blog/
- https://www.linkedin.com/in/pouya-esmaeili-9124b839/
- https://pouyae.ir/about/
- https://stackoverflow.com/users/13118327/pouya-esmaeili?tab=profile
- https://pouyae.ir/contact-me/
- https://github.com/PouyaEsmaeili/SnowflakeID
- https://pouyae.ir/images/pouya2.jpg
- https://github.com/PouyaEsmaeili/gFuzz
- https://linktr.ee/pouyae
- https://gohugo.io/
- https://pouyae.ir/images/pouya1.jpg

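As the script above suggests, the crawl result maps each seed URL to a set of discovered URLs, and sets must be converted to lists before YAML serialization. A minimal stdlib-only illustration of that conversion step (the data is made up):

```python
# result maps each seed URL to a set of discovered URLs (illustrative data).
result = {"https://pouyae.ir": {"https://pouyae.ir/blog/", "https://gohugo.io/"}}

# Convert each set to a sorted list so the mapping is YAML-serializable
# and the output order is deterministic.
serializable = {key: sorted(value) for key, value in result.items()}
print(serializable)
```

Sorting is optional but makes repeated runs produce diff-friendly output files.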
👉 There is also a blog post about using AsyncURLCrawler to find malicious URLs in a web page. Read here.


Commandline Tool

The script can be customized using the src/cmd/cmd.py file, which accepts various arguments to configure the crawler's behavior:

| Argument | Description |
| --- | --- |
| `--url` | Specifies a list of URLs to crawl. At least one URL must be provided. |
| `--exact` | Optional flag; if set, restricts crawling to the specified subdomain/domain only. Defaults to False. |
| `--deep` | Optional flag; if set, the crawler explores all visited URLs. Not recommended due to potential resource intensity. If `--deep` is set, `--exact` is ignored. |
| `--delay` | Sets the delay between consecutive HTTP requests, in seconds. |
| `--output` | Specifies the path of the output file, which is saved in YAML format. |
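Putting the flags together, an invocation might look like the following. The values are illustrative; check the argument parsing in src/cmd/cmd.py for the authoritative usage.

```shell
python src/cmd/cmd.py \
    --url https://pouyae.ir \
    --exact \
    --delay 0.1 \
    --output ./output/result.yaml
```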

Run Commandline Tool in Docker Container 🐳

There is a Dockerfile in src/cmd for running the above-mentioned command-line tool in a Docker container.

docker build -t crawler .
docker run -v my_dir:/src/output --name crawler crawler

After the container finishes, the resulting output file is available in the my_dir volume defined above. To configure the tool for your needs, edit the CMD instruction in the Dockerfile.


Build and Publish to Python Package Index(PyPi)

Requirements:

python3 -m pip install --upgrade build
python3 -m pip install --upgrade twine

👉 For more details check Packaging Python Projects.

Build and upload:

python3 -m build
python3 -m twine upload --repository pypi dist/*

Build Documentation with Sphinx

Install packages listed in docs/doc-requirements.txt.

cd docs
pip install -r doc-requirements.txt
make clean
make html

HTML files will be generated in docs/build. Push them to the repository and deploy on pages.dev.


Workflow

  • Branch off, implement features and merge them to main. Remove feature branches.
  • Update version in pyproject.toml and push to main.
  • Add release tag in Github.
  • Build and push the package to PyPi.
  • Build documentation and push the HTML files to the AsyncURLCrawlerDocs repo.
  • Documentation will be deployed on pages.dev automatically.

Contact

Find me @ My Homepage


Disclaimer

⚠️ Use at your own risk. The author and contributors are not responsible for any misuse or consequences resulting from the use of this project.

