10 projects
Trough
brozzler
Distributed web crawling with browsers
snakebite-py3
Pure Python HDFS client
surt
Sort-friendly URI Reordering Transform (SURT) Python package.
warcprox
WARC-writing MITM HTTP/S proxy
urlcanon
URL canonicalization library for Python and Java
doublethink
RethinkDB Python library
ujson-ia
Ultra fast JSON encoder and decoder for Python (Internet Archive fork)
rethinkstuff
Rudimentary RethinkDB Python library with some smarts, perhaps some dumbs
warctools
Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)