rbdt is a Python library (written in Rust) for parsing robots.txt files for large-scale batch processing.
rbdt
🚨🚨🚨🚨
rbdt is a work in progress, currently being extracted out of another (private) project for the purpose of open sourcing and better software engineering.
🚨🚨🚨🚨
rbdt features:
- MIT license, have fun.
- Written in Rust, so it is fast.
- Callable from Python, so it is useful.
- Has been and continues to be run against millions of unique robots.txt files.
- Forgiving: corrects some typical mistakes in files written by hand, like recognizing that `dissallows` was probably meant to be `disallow`.
- Intentionally provides direct access to the parsed robots.txt representation (unlike Reppy or Google's parser).
- Ability to compare which user agent has more privilege given to it by the website owner, both heuristically and logically.
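To make the "forgiving" and "direct access" points above concrete, here is a minimal pure-Python sketch of the idea. It is not rbdt's API or its actual algorithm: the function names, the fuzzy-matching approach via `difflib`, the 0.8 cutoff, and the dict-of-lists representation are all assumptions chosen for illustration, and only the `User-agent`, `Allow`, and `Disallow` directives are handled.

```python
import difflib
from collections import defaultdict

# Directives this sketch understands (an assumption, not rbdt's actual list).
KNOWN_DIRECTIVES = ["user-agent", "allow", "disallow"]


def normalize_directive(name):
    """Map a possibly misspelled directive name to a known one, or None."""
    key = name.strip().lower()
    if key in KNOWN_DIRECTIVES:
        return key
    # Fuzzy matching tolerates small typos such as "dissallows".
    matches = difflib.get_close_matches(key, KNOWN_DIRECTIVES, n=1, cutoff=0.8)
    return matches[0] if matches else None


def parse_robots(text):
    """Parse robots.txt text into {user_agent: [(directive, value), ...]}."""
    groups = defaultdict(list)
    current_agents = []
    last_was_agent = False
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        directive = normalize_directive(name)
        if directive is None:
            continue  # unrecognized line: skip it rather than fail
        value = value.strip()
        if directive == "user-agent":
            # Consecutive User-agent lines share one group of rules.
            if not last_was_agent:
                current_agents = []
            current_agents.append(value.lower())
            last_was_agent = True
        else:
            for agent in current_agents:
                groups[agent].append((directive, value))
            last_was_agent = False
    return dict(groups)


sample = "User-agent: *\nDissallows: /private\nAllow: /public\n"
parsed = parse_robots(sample)
print(parsed)
```

The result is a plain dictionary, so a batch job can inspect or aggregate the parsed rules directly instead of only asking yes/no access questions.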
rbdt anti-features:
- rbdt isn't meant to be used as part of a web crawler, but as part of a large-scale analysis of robots.txt files. If it ends up being useful for web crawlers eventually, that's great, but only incidental.
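As an example of the analysis-style question this is aimed at (which of two user agents has been given more access), here is a rough sketch of a purely logical comparison. It is not how rbdt implements its comparison: the function names and the `(directive, value)` rule format are hypothetical, and the sketch only looks at `Disallow` prefixes, ignoring `Allow` rules and wildcards.

```python
def blocked_prefixes(rules):
    """Collect the non-empty Disallow path prefixes from (directive, value) pairs."""
    return {value for directive, value in rules if directive == "disallow" and value}


def covers(blockers, prefix):
    """True if some blocker prefix already blocks everything under `prefix`."""
    return any(prefix.startswith(blocker) for blocker in blockers)


def compare_privilege(rules_a, rules_b):
    """Return 'a', 'b', 'equal', or 'incomparable' for who is granted more access."""
    blocked_a = blocked_prefixes(rules_a)
    blocked_b = blocked_prefixes(rules_b)
    # A is at least as privileged as B when everything blocked for A is also blocked for B.
    a_ge_b = all(covers(blocked_b, p) for p in blocked_a)
    b_ge_a = all(covers(blocked_a, p) for p in blocked_b)
    if a_ge_b and b_ge_a:
        return "equal"
    if a_ge_b:
        return "a"
    if b_ge_a:
        return "b"
    return "incomparable"


# Hypothetical rule sets for two user agents on the same site.
agent_a = [("disallow", "/private/")]
agent_b = [("disallow", "/private/"), ("disallow", "/search")]
print(compare_privilege(agent_a, agent_b))  # prints "a"
```

The "incomparable" result matters at scale: many sites block different paths for different agents, so neither agent is strictly more privileged, which is where a heuristic comparison becomes useful.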
Development
maturin develop
python py_tests/tests.py
Releases
rbdt uses GitHub CI/CD to publish releases to PyPI. Tag the commit with the version and it will end up on PyPI.
Contributions
File a ticket or send a PR if you'd like.
To Do
- Real Open Sourcing Hours
- Changelog
- Write documentation and put it somewhere
- Branch protection for main: no direct writes, only PRs
- Automated tests
- Crawl-delay parsing and restructuring of the data representation.
- Be able to detect whether a crawler can access a specific page.
- More tests of all the various edge cases.
- Benchmarks (maybe someday, maybe never).
- Publish it as a Rust library as well (maybe).
- Get Rust tests working (maybe).