rbdt
🚨🚨🚨🚨
rbdt is a work in progress, currently being extracted out of another (private) project for the purpose of open sourcing and better software engineering.
🚨🚨🚨🚨
rbdt is a Python library (written in Rust) for parsing robots.txt files in large-scale batch processing.
rbdt features:
- MIT license, have fun.
- Written in Rust, so it is fast.
- Callable from Python, so it is useful.
- Has been and continues to be run against millions of unique robots.txt files.
- Forgiving: corrects some typical mistakes in files written by hand, such as recognizing that `dissallow` was probably meant to be `disallow`.
- Intentionally provides direct access to the parsed robots.txt representation (unlike Reppy or Google's parser).
- Ability to compare which user agent has more privilege given to it by the website owner, both heuristically and logically.
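
To illustrate what "forgiving" parsing and "direct access to the parsed representation" can look like, here is a minimal pure-Python sketch. It is not rbdt's API or implementation: the function names (`normalize_directive`, `parse_robots`), the alias table, and the dictionary layout are all assumptions made for this example.

```python
# Illustrative sketch only; rbdt's real correction rules and data model may differ.

# Hypothetical table of misspellings mapped to the directive they probably meant.
DIRECTIVE_ALIASES = {
    "dissallow": "disallow",
    "disalow": "disallow",
    "useragent": "user-agent",
    "user agent": "user-agent",
}


def normalize_directive(name):
    """Lowercase a directive name and map known misspellings to the real one."""
    key = name.strip().lower()
    return DIRECTIVE_ALIASES.get(key, key)


def parse_robots(text):
    """Parse robots.txt text into {user_agent: {"allow": [...], "disallow": [...]}}."""
    rules = {}
    current_agents = []
    last_was_agent = False
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace; skip lines without a directive.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        directive = normalize_directive(name)
        value = value.strip()
        if directive == "user-agent":
            # Consecutive User-agent lines share one group of rules.
            if not last_was_agent:
                current_agents = []
            agent = value.lower()
            current_agents.append(agent)
            rules.setdefault(agent, {"allow": [], "disallow": []})
            last_was_agent = True
        elif directive in ("allow", "disallow"):
            for agent in current_agents:
                rules[agent][directive].append(value)
            last_was_agent = False
        else:
            # Other directives (Sitemap, Crawl-delay, ...) are ignored in this sketch.
            last_was_agent = False
    return rules


sample = """\
User-agent: *
Dissallow: /private
Allow: /public

User-agent: GoogleBot
User-agent: bingbot
Disallow: /tmp  # scratch space
"""
parsed = parse_robots(sample)
print(parsed["*"]["disallow"])
```

Because the result is a plain dictionary, a batch job can inspect or aggregate the rules directly instead of only asking yes/no "can this agent fetch this URL" questions.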
rbdt anti-features:
- rbdt isn't meant to be used as part of a web crawler, but as part of a large-scale analysis of robots.txt files. If it ends up being useful for web crawlers eventually, that's great, but only incidental.
Development
maturin develop
python py_tests/tests.py
Releases
rbdt uses GitHub CI/CD to publish releases to PyPI. Tag the commit with the version and it will end up on PyPI.
Contributions
File a ticket or send a PR if you'd like.
To Do
- Real Open Sourcing Hours
- Changelog
- Write documentation and put it somewhere
- Branch protection for main: no direct writes, only PRs
- Automated tests
- Crawl-delay parsing and restructuring of the data representation.
- Be able to detect whether a crawler can access a specific page.
- More tests of all the various edge cases.
- Benchmarks (maybe someday, maybe never).
- Publish it as a Rust library as well (maybe).
- Get Rust tests working (maybe).
File details
Details for the file rbdt-0.0.4_alpha2.tar.gz.
File metadata
- Download URL: rbdt-0.0.4_alpha2.tar.gz
- Upload date:
- Size: 7.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.5.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ce0eacec68cb1f57a92f29d09d7afd817952738446b42b768b356ae5ad583583 |
| MD5 | cd891492527791331ad6fd41ffcce389 |
| BLAKE2b-256 | 92ba337deecde74bb9604022c8bda169a7664f3f8d5c4df380ee40d96f003c26 |
File details
Details for the file rbdt-0.0.4_alpha2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: rbdt-0.0.4_alpha2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.8 MB
- Tags: CPython 3.9, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.5.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d524d66b588756f199a4f81902d165b7dae7b1b23214dc5d4223ee38fe07771e |
| MD5 | e836b734a16b6c276108e53b995625da |
| BLAKE2b-256 | f78040d1f5d2a86052c476cdd4bff3cb51703f0d45895349cc2414fed4a4eb21 |
File details
Details for the file rbdt-0.0.4_alpha2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: rbdt-0.0.4_alpha2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.8 MB
- Tags: CPython 3.8, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.5.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e4d4c93def8140dc3fc25b52d1ed758ccb2fac35c0e09c518b72b1462235f860 |
| MD5 | 3029e5f5145b16a7b7bbb8d8d4e00520 |
| BLAKE2b-256 | bb2e0f91abb050ef32b1afe0d6f6f3e349f6b8c50249045d189230b589e6a2fe |
File details
Details for the file rbdt-0.0.4_alpha2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: rbdt-0.0.4_alpha2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.8 MB
- Tags: CPython 3.7m, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.5.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5d6ea8893284b0ec668107c808cbce6c1ed7c3a3cbf48320a2a6476ed3abc9c1 |
| MD5 | 1246b64b8df42c17b1f5270418dfa331 |
| BLAKE2b-256 | 7a4ebf9b4c5b7ddad51e8281dd889a0c915bc3e8e7b1e5e5a4252548de951db4 |
File details
Details for the file rbdt-0.0.4_alpha2-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: rbdt-0.0.4_alpha2-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 1.8 MB
- Tags: CPython 3.6m, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.5.0 importlib_metadata/4.8.2 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 54a80666eaa7e3c23bdfe4e4e1c59b82937cf26e5f665ad59e26b7e9c7a84e65 |
| MD5 | ba1db4191668ea6eca5f338f82c36a2f |
| BLAKE2b-256 | 5a001339cddae18e1c355ef12b8237ee905b77e6624da2a8cbef398e0c5c90a3 |