rbdt is a Python library (written in Rust) for parsing robots.txt files for large-scale batch processing.

rbdt

🚨🚨🚨🚨

rbdt is a work in progress. It is currently being extracted from another (private) project so that it can be open sourced and better engineered.

🚨🚨🚨🚨

rbdt features:

  • MIT license, have fun.
  • Written in Rust, so it is fast.
  • Callable from Python, so it is useful.
  • Has been and continues to be run against millions of unique robots.txt files.
  • Forgiving: corrects some typical mistakes in hand-written files, such as recognizing that "dissallows" was probably meant to be "disallow".
  • Intentionally provides direct access to the parsed robots.txt representation (unlike Reppy or Google's parser).
  • Can compare which of two user agents the website owner has given more privilege, both heuristically and logically.
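
To make the "forgiving" and "direct access" points concrete, here is a minimal sketch in pure Python. It is not rbdt's API or its implementation: the names parse_robots, normalize_directive and KNOWN_DIRECTIVES are hypothetical, and fuzzy matching with difflib is an assumed stand-in for whatever correction rules rbdt actually applies. It only illustrates mapping misspelled directives onto known ones and returning the parsed rules as a plain data structure.

```python
import difflib
from collections import defaultdict

# Directives this sketch recognizes (an assumption, not rbdt's actual list).
KNOWN_DIRECTIVES = ["user-agent", "allow", "disallow", "sitemap", "crawl-delay"]


def normalize_directive(raw):
    """Map a possibly misspelled directive name onto a known one, or None."""
    name = raw.strip().lower()
    if name in KNOWN_DIRECTIVES:
        return name
    # Fuzzy match so that e.g. "dissallows" still resolves to "disallow".
    matches = difflib.get_close_matches(name, KNOWN_DIRECTIVES, n=1, cutoff=0.8)
    return matches[0] if matches else None


def parse_robots(text):
    """Return {user_agent: [(directive, path), ...]} for a robots.txt body."""
    groups = defaultdict(list)
    current_agents = []
    last_was_agent = False
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        directive = normalize_directive(key)
        value = value.strip()
        if directive is None:
            continue  # unrecognizable line: skip it rather than fail
        if directive == "user-agent":
            # Consecutive User-agent lines share one group of rules.
            if not last_was_agent:
                current_agents = []
            agent = value.lower()
            current_agents.append(agent)
            groups[agent]  # make sure the agent appears even with no rules
            last_was_agent = True
        else:
            last_was_agent = False
            if directive in ("allow", "disallow"):
                for agent in current_agents:
                    groups[agent].append((directive, value))
    return dict(groups)


sample = """User-agent: *
Dissallows: /private/
Allow: /public/

User-agent: GoogleBot
Disalow: /
"""
parsed = parse_robots(sample)
```

Because the result is an ordinary dictionary, a batch analysis can inspect or aggregate the rules directly instead of only asking "may this agent fetch this URL?".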

rbdt anti-features:

  • rbdt isn't meant to be used as part of a web crawler, but as part of a large-scale analysis of robots.txt files. If it ends up being useful for web crawlers eventually, that's great, but only incidental.

Development

maturin develop
python py_tests/tests.py

Releases

rbdt uses GitHub CI/CD to publish releases to PyPI. Tag the commit with the version and it will end up on PyPI.

Contributions

File a ticket or send a PR if you'd like.

To Do

  • Real Open Sourcing Hours
    • Changelog
    • Write documentation and put it somewhere
    • Branch protection for main: no direct writes, only PRs
    • Automated tests
  • Crawl-delay parsing and restructuring of the data representation.
  • Be able to detect whether a crawler can access a specific page.
  • More tests of all the various edge cases.
  • Benchmarks (maybe someday, maybe never).
  • Publish it as a Rust library as well (maybe).
  • Get Rust tests working (maybe).

