A web visualization and crawling tool
WebVisCrawl
a really nice web crawler that focuses on branching out across the internet, rather than collecting all your data, stealing it, and selling it to some company that's going to use it to train an AI model.
DISCLAIMER
while this project does make use of web crawling, it is not representative of all use cases of web crawling. this project does not respect robots.txt files, although it takes safety measures to avoid aggressive crawling. use this project at your own risk and for educational purposes only. no one is liable but you if you cause trouble.
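to give an idea of what a safety measure against aggressive crawling can look like, here is a minimal sketch of a per-host delay. this is an assumption for illustration, not the project's actual code: the HostThrottle class, its min_delay parameter and the example.com style URLs are all hypothetical.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Hypothetical helper: enforce a minimum delay between hits to one host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock      # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_hit = {}      # host -> time of the most recent request

    def wait(self, url):
        """Block until this URL's host may be hit again; return seconds waited."""
        host = urlparse(url).netloc
        last = self._last_hit.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the time of this request for the next call.
        self._last_hit[host] = self._clock()
        return waited
```

a crawler would call wait(url) right before each request, so different hosts never block each other but repeated hits on one host are spaced out.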
running
create a venv and install the packages listed in requirements.txt. then run python main.py <START_URL>, or run it with -h for help.
to visualise, run python vis.py --head <START_URL> and the html should open in your web browser. this script also accepts -h for help.
speed tests
tests were done on a macbook pro m2 (13in) under maximum load, without intellij running.
the original implementation, which used multithreading, took about 69.2s.
the new implementation uses multithreading and multiprocessing (it has additional delays to ensure that EVERYTHING is processed before quitting):
https://hackclub.com to three levels: (using the old implementation with no real safeties and no debugging)
- 1 process
  - 76.88s, 4501 nodes, 7663 edges
  - 89.53s, 4792 nodes, 8058 edges
  - 92.21s, 5555 nodes, 8500 edges
  - 59.06s, 4405 nodes, 7052 edges
  - 90.55s, 4159 nodes, 7283 edges
- 2 processes
  - 50.07s, 4977 nodes, 7963 edges
  - 37.63s, 2322 nodes, 3067 edges (exception in thread on both processes (thread_counter) after finish)
  - 40.19s, 956 nodes, 1541 edges
  - 38.43s, 3655 nodes, 6203 edges
  - 36.08s, 1285 nodes, 1786 edges
- 4 processes:
- 8 processes:
since then, i have reworked the implementation to make error handling robust, because even a single node erroring can cause a bunch more to be missed. in the process, the crawler may have slowed down a lot, but it should be more accurate.
after that, i redesigned the entire crawler to use queues and central message processing instead of the earlier approach, to reduce the chances of race conditions and of the code dying. it also uses bloom filters.
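the queue based design can be sketched roughly like this. it is a minimal illustration under assumptions, not the project's real code: the BloomFilter class, the crawl function, the fetch_links callback and the use of the bloom filter for "already scheduled" checks are all hypothetical, and it uses threads only to stay short. workers just fetch and report back, and only the central loop touches shared state, which is what keeps race conditions out.

```python
import hashlib
import queue
import threading


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def crawl(start, fetch_links, max_depth=3, num_workers=4):
    """Crawl max_depth levels; fetch_links(url) returns the links on a page."""
    tasks, results = queue.Queue(), queue.Queue()

    def worker():
        while True:
            item = tasks.get()
            if item is None:          # sentinel: shut this worker down
                return
            url, depth = item
            try:
                links = fetch_links(url)
            except Exception:
                links = []            # a failed page must still report back
            results.put((url, depth, links))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    seen = BloomFilter()
    seen.add(start)
    nodes, edges = {start}, set()
    tasks.put((start, 0))
    pending = 1                       # tasks sent out but not yet processed
    while pending:
        url, depth, links = results.get()
        pending -= 1
        for link in links:
            nodes.add(link)
            edges.add((url, link))
            # Only schedule unseen links that are still within the depth limit.
            if depth + 1 < max_depth and link not in seen:
                seen.add(link)
                tasks.put((link, depth + 1))
                pending += 1
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return nodes, edges
```

the trade-off of a bloom filter is that a false positive makes the crawler skip a page it never visited, in exchange for a fixed, small memory footprint. the pending counter is what lets the central loop know that everything has been processed before quitting.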
https://hackclub.com to three levels: (using newer implementation)
- 1 process
  - 252% cpu, 24.10s, 5786 nodes, 13376 edges
  - 304% cpu, 24.59s, 5134 nodes, 12645 edges
  - 309% cpu, 21.09s, 5153 nodes, 12316 edges
  - 328% cpu, 21.20s, 5226 nodes, 13191 edges
  - 349% cpu, 24.54s, 5709 nodes, 12572 edges
- 2 processes
  - 393% cpu, 15.29s, 5165 nodes, 11732 edges
  - 388% cpu, 19.64s, 4559 nodes, 10296 edges*
  - 392% cpu, 18.50s, 5598 nodes, 12410 edges
  - 339% cpu, 19.00s, 4754 nodes, 9577 edges*
  - 354% cpu, 17.19s, 5231 nodes, 11774 edges
- 4 processes
  - 501% cpu, 16.34s, 5149 nodes, 11129 edges
  - 476% cpu, 16.98s, 4681 nodes, 9674 edges*
  - 493% cpu, 16.42s, 5251 nodes, 11402 edges
  - 481% cpu, 17.22s, 5760 nodes, 11717 edges
  - 482% cpu, 15.55s, 4888 nodes, 11470 edges*
- 8 processes
  - 577% cpu, 15.24s, 5320 nodes, 10127 edges
  - 610% cpu, 18.26s, 5665 nodes, 11293 edges
  - 594% cpu, 16.19s, 5335 nodes, 11936 edges
  - 578% cpu, 15.22s, 4312 nodes, 8807 edges*
  - 578% cpu, 15.35s, 5811 nodes, 13200 edges
note*: the dip in nodes and edges on the starred runs (especially with 8 processes) is most likely due to rate limiting. from testing, since 8 processes run faster than 1, pages can often be fetched faster than websites can start rate limiting, so the crawler gets slightly more successful requests.
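to compare the runs of the newer implementation at a glance, the times can be averaged per process count. this is just arithmetic on the numbers listed above; the times dict and the summarize helper are names made up for this sketch.

```python
from statistics import mean

# Wall-clock times in seconds, copied from the newer-implementation runs
# (https://hackclub.com to three levels).
times = {
    1: [24.10, 24.59, 21.09, 21.20, 24.54],
    2: [15.29, 19.64, 18.50, 19.00, 17.19],
    4: [16.34, 16.98, 16.42, 17.22, 15.55],
    8: [15.24, 18.26, 16.19, 15.22, 15.35],
}


def summarize(times):
    """Return {processes: (mean_seconds, speedup_vs_1_process)}."""
    baseline = mean(times[1])
    return {n: (mean(runs), baseline / mean(runs)) for n, runs in times.items()}


summary = summarize(times)
for n, (avg, speedup) in summary.items():
    print(f"{n} process(es): mean {avg:.2f}s, {speedup:.2f}x vs 1 process")
```

this works out to means of roughly 23.10s, 17.92s, 16.50s and 16.05s for 1, 2, 4 and 8 processes, so most of the gain comes from going from 1 to 2 processes and it flattens out after 4. with only five runs each and rate limiting in play, treat these as rough numbers.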
File details
Details for the file webviscrawl-0.0.1.tar.gz.
File metadata
- Download URL: webviscrawl-0.0.1.tar.gz
- Upload date:
- Size: 15.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fc274eadd14c4e349f32bac6e3401a0318aa96d0169f5e4e65dcb881c66da4d8 |
| MD5 | 4ccefbdbd85c5a8f5bf84ad9116406ec |
| BLAKE2b-256 | 3e4f68f71a0875c5c036edb0c0442fef693e1a3a0826bb045002b6e688e3c049 |
File details
Details for the file webviscrawl-0.0.1-py3-none-any.whl.
File metadata
- Download URL: webviscrawl-0.0.1-py3-none-any.whl
- Upload date:
- Size: 17.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 567bc3730e1af42c7ed030fa3d3d3a59048276f7d5997f798fc2feea904b971b |
| MD5 | 99dfb3468397d47ce00e2f67b32d6f42 |
| BLAKE2b-256 | c65f0f6aeb2a0cc7fd967251a5ef464800b7d511a54031d5f03e1081a14b816a |