torcrawl

A Python script to crawl and extract (regular or onion) webpages through the TOR network.

Maintainers

MikeMeliz

Project description

TorCrawl.py is a Python script designed for anonymous web scraping via the Tor network.

It combines ease of use with the robust privacy features of Tor, allowing for secure and untraceable data collection. Ideal for both novice and experienced programmers, this tool is essential for responsible data gathering in the digital age.

What makes it simple and easy to use?

If you are a terminal maniac, you know that things have to be simple and clear. Passing the output to other tools is necessary, and accuracy is key.

With a single argument you can read an .onion webpage or a regular one through the TOR network, and with pipes you can pass the output to any other tool you prefer.

$ torcrawl -u http://www.github.com/ | grep 'google-analytics'
    <meta name="google-analytics" content="UA-XXXXXX- ">
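
When a shell pipe is not convenient, the same filtering can be done from Python. The sketch below is only an illustration: grep_lines is a hypothetical helper, not part of torcrawl, and it assumes the page source has already been captured as a string (for example from the command's standard output).

```python
def grep_lines(text, needle):
    """Return the lines of `text` that contain `needle`, like a basic grep."""
    return [line for line in text.splitlines() if needle in line]


# Stand-in for the HTML that the extractor prints to the terminal.
sample_html = """<html>
<head>
    <meta name="google-analytics" content="UA-XXXXXX- ">
    <title>Example</title>
</head>
</html>"""

for line in grep_lines(sample_html, "google-analytics"):
    print(line)
```

Only the matching line is printed, with its original indentation kept.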

To crawl the links of a webpage, add -c and all of its internal links are written to a file. Use -d to set how many levels deep to follow those links, and -p to wait a number of seconds between requests.

$ torcrawl -v -u http://www.github.com/ -c -d 2 -p 2
# TOR is ready!
# URL: http://www.github.com/
# Your IP: XXX.XXX.XXX.XXX
# Crawler started from http://www.github.com/ with 2 depth crawl and 2 second(s) delay:
# Step 1 completed with: 11 results
# Step 2 completed with: 112 results
# File created on /path/to/project/links.txt
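
To make the idea of depth and pause concrete, here is a minimal sketch of a depth-limited, breadth-first crawl. It is not torcrawl's implementation, and how the real crawler orders or deduplicates links is not described here. The crawl function, the get_links callback and the in-memory SITE mapping are all hypothetical and stand in for real HTTP requests.

```python
import time


def crawl(start, get_links, depth=1, pause=0):
    """Collect links reachable within `depth` steps of `start`.

    `get_links(url)` is supplied by the caller and returns the links
    found on a page; here it replaces a real network fetch.
    """
    seen = {start}
    frontier = [start]
    found = []
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    found.append(link)
                    next_frontier.append(link)
            if pause:
                time.sleep(pause)  # wait between requests
        frontier = next_frontier
    return found


# Hypothetical site map used instead of real pages.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c", "http://example.com/"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/d"],
}


def fake_get_links(url):
    return SITE.get(url, [])


print(crawl("http://example.com/", fake_get_links, depth=1))
print(crawl("http://example.com/", fake_get_links, depth=2))
```

With depth 1 only the links on the start page are returned; with depth 2 the links found on those pages are added as well, which is why each extra step tends to multiply the number of results.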

[!TIP]
Crawling is not illegal, but violating copyright is. It’s always best to double-check a website’s T&C before you start crawling it. Some websites publish a robots.txt file to tell crawlers which pages not to visit.
This crawler will allow you to go around this, but we always recommend respecting robots.txt.
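
If you want to check robots.txt yourself before crawling, Python's standard library can do it. This is a small sketch, independent of torcrawl: the allowed helper and the ROBOTS_TXT sample are made up for illustration, and a real file would be downloaded from the site's /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a download.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, url, user_agent="*"):
    """Return True if `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "http://example.com/index.html"))
print(allowed(ROBOTS_TXT, "http://example.com/private/data.html"))
```

The first URL is permitted and the second is not, so a polite script would skip it.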

Installation

Easy Installation:

from PyPI:
pip install torcrawl
with homebrew:
Coming soon...

Manual Installation:

Clone this repository:
git clone https://github.com/MikeMeliz/TorCrawl.py.git
Install dependencies:
pip install -r requirements.txt
Install and Start TOR Service:
1. Debian/Ubuntu:
  apt-get install tor
  service tor start
2. Windows: Download tor.exe, and:
  tor.exe --service install
  tor.exe --service start
3. MacOS:
  brew install tor
  brew services start tor
4. For different distros, visit:
  TOR Setup Documentation
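
After starting the service, a quick way to see whether something is listening on the SOCKS port is to try a TCP connection. The sketch below assumes the default address and port listed under Arguments (127.0.0.1 and 9050). port_is_open is a hypothetical helper, and a successful connection only shows that the port is open, not that the listener is actually TOR.

```python
import socket


def port_is_open(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat all as "not open".
        return False
```

Calling port_is_open() with no arguments checks the default TOR address; pass another host or port if your proxy is configured differently.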

Arguments

arg	Long	Description
General:
-h	--help	Help message
-v	--verbose	Show more information about the progress
-u	--url *.onion	URL of Webpage to crawl or extract
-w	--without	Without using TOR Network
-rua	--random-ua	Enable random user-agent rotation for requests (works with both TOR and clearnet)
-rpr	--random-proxy	Enable random proxy rotation from res/proxies.txt (requires -w flag, one proxy per line, format: host:port)
-px	--proxy	IP address for SOCKS5 proxy (Default: 127.0.0.1 for using TOR)
-pr	--proxyport	Port for SOCKS5 proxy (Default: 9050)
-f	--folder	The directory which will contain the generated files
-V	--version	Show version and exit
Extract:
-e	--extract	Extract page's code to terminal or file (Default: Terminal)
-i	--input filename	Input file with URL(s) (separated by line)
-o	--output [filename]	Output page(s) to file(s) (for one page)
-y	--yara	Perform yara keyword search: h = search entire html object, t = search only text
Crawl:
-c	--crawl	Crawl website (Default output on website/links.txt)
-d	--depth	Set depth of crawler's travel (Default: 1)
-p	--pause	Seconds of pause between requests (Default: 0)
-l	--log	Log file with visited URLs and their response code
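
The proxy list used by -rpr is described above as one proxy per line in host:port format. As a rough illustration of that format only, the sketch below parses such text and picks one entry at random. parse_proxies and the sample addresses are hypothetical, and this is not how torcrawl itself reads res/proxies.txt.

```python
import random


def parse_proxies(text):
    """Parse 'host:port' lines into (host, port) tuples, skipping blanks.

    A line without a valid numeric port raises ValueError.
    """
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        host, _, port = line.rpartition(":")
        proxies.append((host, int(port)))
    return proxies


# Made-up entries in the documented one-per-line format.
SAMPLE = """\
10.0.0.1:8080

10.0.0.2:3128
"""

proxies = parse_proxies(SAMPLE)
print(proxies)
print(random.choice(proxies))  # one proxy chosen at random
```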

Usage & Examples

As Extractor:

To just extract a single webpage to terminal:

$ python torcrawl.py -u http://www.github.com
<!DOCTYPE html>
...
</html>

Extract into a file (github.htm) without the use of TOR:

$ python torcrawl.py -w -u http://www.github.com -o github.htm
## File created on /script/path/github.htm

Extract to terminal and find only the line with google-analytics:

$ python torcrawl.py -u http://www.github.com | grep 'google-analytics'
    <meta name="google-analytics" content="UA-*******-*">

Extract to file and find only the line with google-analytics using yara:

$ python torcrawl.py -v -w -u https://github.com -e -y h
...

Note: Update res/keyword.yar to search for other keywords. Use -y h for raw html searching and -y t for text search only.
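
The difference between the two search modes (whole HTML versus text only) can be shown with a small sketch. Note the substitution: it uses plain case-insensitive substring matching instead of YARA rules, and the text extraction is deliberately crude (it also keeps script and style content). All names here are hypothetical and not taken from torcrawl.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def visible_text(html):
    """Return the text content of `html` with tags and attributes removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(part.strip() for part in parser.parts if part.strip())


def contains_keyword(html, keyword, text_only=False):
    """Search the raw HTML, or only its text when `text_only` is True."""
    haystack = visible_text(html) if text_only else html
    return keyword.lower() in haystack.lower()


page = (
    '<html><head><meta name="google-analytics" content="UA-0"></head>'
    "<body><p>Hello world</p></body></html>"
)
print(contains_keyword(page, "google-analytics"))                  # raw HTML
print(contains_keyword(page, "google-analytics", text_only=True))  # text only
```

The keyword lives in a tag attribute, so it is found in the raw HTML but not in the extracted text. That is the practical reason to choose one mode over the other.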

Extract a set of webpages (imported from file) to terminal:

$ python torcrawl.py -i links.txt
...
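
The input file is a plain list of URLs, one per line. If you build that list from another script, a sketch like the following can write it while dropping blank and repeated entries. write_url_file and read_url_file are hypothetical helpers, and the temporary folder is only there so the example cleans up after itself.

```python
import tempfile
from pathlib import Path


def write_url_file(path, urls):
    """Write one URL per line, skipping blanks and repeated entries."""
    cleaned = (url.strip() for url in urls)
    unique = list(dict.fromkeys(url for url in cleaned if url))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


def read_url_file(path):
    """Read the file back as a list of non-empty, stripped lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "links.txt"
    write_url_file(target, [
        "http://www.github.com/",
        "http://www.github.com/",  # duplicate, dropped
        "",                        # blank, dropped
        "https://example.com/",
    ])
    print(read_url_file(target))
```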

As Crawler:

Crawl the links of the webpage without the use of TOR, also show verbose output (really helpful):

$ python torcrawl.py -v -w -u http://www.github.com/ -c
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com/ with step 1 and wait 0
## Step 1 completed with: 11 results
## File created on /script/path/links.txt

Crawl the webpage with depth 2 (2 clicks), waiting 5 seconds before crawling the next page:

$ python torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 5
## TOR is ready!
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com with step 2 and wait 5
## Step 1 completed with: 11 results
## Step 2 completed with: 112 results
## File created on /script/path/links.txt

As Both:

You can crawl a page and also extract the webpages into a folder with a single command:

$ python torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 5 -e
## TOR is ready!
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com with step 1 and wait 5
## Step 1 completed with: 11 results
## File created on /script/path/FolderName/index.htm
## File created on /script/path/FolderName/projects.html
## ...

Note: The default (and, for now, only) file for the crawler's links is links.txt. Also, to extract right after the crawl you have to pass the -e argument.

Following the same logic, you can pipe all these pages to grep (for example) and search for specific text:

$ python torcrawl.py -u http://www.github.com/ -c -e | grep '</html>'
</html>
</html>
...

As Both + Keyword Search:

You can crawl a page, perform a keyword search and extract the webpages that match the findings into a folder with a single command:

$ python torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 5 -e -y h
## TOR is ready!
## URL: http://www.github.com/
## Your IP: *.*.*.*
## Crawler Started from http://www.github.com with step 1 and wait 5
## Step 1 completed with: 11 results
## File created on /script/path/FolderName/index.htm
## File created on /script/path/FolderName/projects.html
## ...

Note: Update res/keyword.yar to search for other keywords. Use -y h for raw html searching and -y t for text search only.

Demo

TorCrawl-Demo

Contribution

Feel free to contribute to this project! Just fork it, make your changes on your fork, and open a pull request against the current branch!

Any advice, help or questions will be appreciated!

License

“GPL” stands for “General Public License”. Using the GNU GPL will require that all the released improved versions be free software (More info).

Changelog

v1.34:
    * Readiness for PyPI and Homebrew
    * Added --version argument
v1.33:
    * Added User-Agent rotation
    * Implemented Proxy rotation
    * Introduced dependabot
v1.32:
    * Removed 1 second default pause between requests
    * Several improvements on results
    * Improved logs
v1.31:
    * Fixed Input Link NoneType Error
    * Fixed name mismatch  
v1.3:
    * Make yara search optional
v1.21:
    * Fixed typos of delay (-d)
    * Fixed TypeError and IndexError
v1.2:
    * Migrated to Python3
    * Option to generate log file (-l)
    * PEP8 Fixes
    * Fix double folder generation (http:// domain.com)

Release history

1.35: Jan 3, 2026
1.34: Dec 28, 2025 (this version)
1.33: Dec 26, 2025

Download files

Source Distribution

torcrawl-1.34.tar.gz (36.6 kB)

Uploaded Dec 28, 2025 Source

Built Distribution

torcrawl-1.34-py3-none-any.whl (35.9 kB)

Uploaded Dec 28, 2025 Python 3

File details

Details for the file torcrawl-1.34.tar.gz.

File metadata

Download URL: torcrawl-1.34.tar.gz
Upload date: Dec 28, 2025
Size: 36.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for torcrawl-1.34.tar.gz
Algorithm	Hash digest
SHA256	`4046f37065b8e8bb9d903add4910c97651a51554a23a17b6bbe18ca884cd1f21`
MD5	`84d746e6717b3d87fdc887622be795d6`
BLAKE2b-256	`7c1be9e7de322ead8c3a65e7e4ee2a5f4fda7eeb9ae5d514368daec694d09588`
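
To check a downloaded file against the SHA256 value above, you can compute the digest locally with hashlib. This is a minimal sketch: sha256_of and verify are hypothetical helper names, and the path you pass is wherever you saved the archive.

```python
import hashlib

# SHA256 listed above for torcrawl-1.34.tar.gz.
EXPECTED = "4046f37065b8e8bb9d903add4910c97651a51554a23a17b6bbe18ca884cd1f21"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED):
    """Return True if the file's SHA256 matches `expected`."""
    return sha256_of(path) == expected.lower()
```

For example, verify("torcrawl-1.34.tar.gz") should return True for an intact download saved under that name in the current directory.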

Provenance

The following attestation bundles were made for torcrawl-1.34.tar.gz:

Publisher: publish.yml on MikeMeliz/TorCrawl.py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: torcrawl-1.34.tar.gz
- Subject digest: 4046f37065b8e8bb9d903add4910c97651a51554a23a17b6bbe18ca884cd1f21
- Sigstore transparency entry: 780761473
- Sigstore integration time: Dec 28, 2025
Source repository:
- Permalink: MikeMeliz/TorCrawl.py@aaeb5d5f4bee872f09cc4f66c19e381f9436e1d1
- Branch / Tag: refs/tags/v1.34
- Owner: https://github.com/MikeMeliz
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@aaeb5d5f4bee872f09cc4f66c19e381f9436e1d1
- Trigger Event: release

File details

Details for the file torcrawl-1.34-py3-none-any.whl.

File metadata

Download URL: torcrawl-1.34-py3-none-any.whl
Upload date: Dec 28, 2025
Size: 35.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for torcrawl-1.34-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e03e58e6a9cbf7132ba58f162a4f3a6dedecde54d6a0f07ac0adfe1255ac15b8`
MD5	`7a322c87ca8aa1a42539f536464e90d4`
BLAKE2b-256	`f365b9d6f8b9b3658110885ebd8beb9b5043f17e950f2870ad1ae32380e1ee6a`

Provenance

The following attestation bundles were made for torcrawl-1.34-py3-none-any.whl:

Publisher: publish.yml on MikeMeliz/TorCrawl.py

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: torcrawl-1.34-py3-none-any.whl
- Subject digest: e03e58e6a9cbf7132ba58f162a4f3a6dedecde54d6a0f07ac0adfe1255ac15b8
- Sigstore transparency entry: 780761474
- Sigstore integration time: Dec 28, 2025
Source repository:
- Permalink: MikeMeliz/TorCrawl.py@aaeb5d5f4bee872f09cc4f66c19e381f9436e1d1
- Branch / Tag: refs/tags/v1.34
- Owner: https://github.com/MikeMeliz
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@aaeb5d5f4bee872f09cc4f66c19e381f9436e1d1
- Trigger Event: release
