rbdt is a Python library (written in Rust) for parsing robots.txt files for large-scale batch processing.

rbdt

🚨🚨🚨🚨

rbdt is a work in progress. It is currently being extracted from another (private) project so that it can be open sourced and engineered more carefully.

🚨🚨🚨🚨


rbdt features:

  • MIT license, have fun.
  • Written in Rust, so it is fast.
  • Callable from Python, so it is useful.
  • Has been and continues to be run against millions of unique robots.txt files.
  • Forgiving: corrects some typical mistakes in hand-written files, such as recognizing that "dissallow" was probably meant to be "disallow".
  • Intentionally provides direct access to the parsed robots.txt representation (unlike Reppy or Google's parser).
  • Can compare user agents to determine which one the website owner has given more privilege, both heuristically and logically.
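
The forgiving parsing, the direct access to the parsed representation, and the heuristic privilege comparison listed above can be illustrated with a small standalone sketch. This is not rbdt's API or implementation, which this README does not document. The names `DIRECTIVE_FIXES`, `parse_robots`, and `disallow_count` are hypothetical, the misspelling table is illustrative, and counting Disallow rules is only one assumed heuristic.

```python
from collections import defaultdict

# Hypothetical table mapping common misspellings to canonical directives.
# Illustrative only; these are not rbdt's actual correction rules.
DIRECTIVE_FIXES = {
    "dissallow": "disallow",
    "dissallows": "disallow",
    "disalow": "disallow",
    "useragent": "user-agent",
    "user agent": "user-agent",
}


def parse_robots(text):
    """Parse robots.txt text into {user_agent: [(directive, path), ...]}."""
    groups = defaultdict(list)
    current_agents = []
    in_rules = False  # True once the current group has seen a rule line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        key = DIRECTIVE_FIXES.get(key, key)  # forgive known misspellings
        if key == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            groups[value.lower()]  # register the agent even if it has no rules
        elif key in ("allow", "disallow"):
            in_rules = True
            for agent in current_agents:
                groups[agent].append((key, value))
    return dict(groups)


def disallow_count(rules):
    """Crude heuristic: fewer non-empty Disallow rules means more access."""
    return sum(1 for directive, path in rules if directive == "disallow" and path)


sample = """\
User-agent: goodbot
Dissallow: /tmp/

User-agent: otherbot
Disallow: /tmp/
Disallow: /private/
"""

parsed = parse_robots(sample)
# The agent with the fewest Disallow rules is treated as the more privileged.
more_privileged = min(parsed, key=lambda agent: disallow_count(parsed[agent]))
```

Here `parsed` is a plain dictionary that can be inspected directly, and `more_privileged` is "goodbot" because its group has one Disallow rule against two for "otherbot". A logical comparison would need to reason about which paths each rule set actually covers, which this sketch does not attempt.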

rbdt anti-features:

  • rbdt isn't meant to be used as part of a web crawler, but as part of a large-scale analysis of robots.txt files. If it ends up being useful for web crawlers eventually, that's great, but incidental.

Development

maturin develop
python py_tests/tests.py

Contributions

File a ticket or send a PR if you'd like.

To Do

  • Real Open Sourcing Hours
    • Changelog
    • Write documentation and put it somewhere
    • Branch protection for main: no direct writes, only PRs
    • automated tests
  • Crawl-delay parsing and restructuring of the data representation.
  • Be able to detect whether a crawler can access a specific page.
  • More tests of all the various edge cases.
  • Benchmarks (maybe someday, maybe never).
  • Publish it as a Rust library as well (maybe).
  • Get Rust tests working (maybe).
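
For reference, the two to-do items about Crawl-delay and per-page access describe checks that Python's standard library `urllib.robotparser` already exposes for crawler use. The sketch below uses only that standard module, not rbdt, and is meant to show the kind of question those items refer to. The bot name and URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from memory, no network access

# Can a given crawler access a specific page?
allowed = parser.can_fetch("examplebot", "https://example.com/index.html")
blocked = parser.can_fetch("examplebot", "https://example.com/private/data.html")

# Crawl-delay for that crawler, or None if the file does not set one.
delay = parser.crawl_delay("examplebot")
```

The standard library answers these questions but keeps its parsed rules internal, which is the gap rbdt's direct access to the parsed representation is meant to fill.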

Built Distributions

  • rbdt-0.0.3_alpha3-cp38-cp38-macosx_10_7_x86_64.whl (915.0 kB): CPython 3.8, macOS 10.7+, x86-64
  • rbdt-0.0.3_alpha3-cp36-cp36m-macosx_10_7_x86_64.whl (912.9 kB): CPython 3.6m, macOS 10.7+, x86-64
