
SecretScraper

Overview

SecretScraper is a web scraper tool that crawls target websites, scrapes their responses, and extracts secret information via regular expressions.

Features

  • Web crawler: crawls the target URLs, extracts all linked URLs, and follows new links until the depth or page-number limit is reached
  • Support domain white list and black list (see the example after this list)
  • Support multiple targets; read target URLs from a file
  • Scalable customization: header, proxy, timeout, cookie, crawl depth, follow redirect, etc.
  • Built-in regex rules to search for sensitive information
  • Flexible configuration in YAML format
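
For example, the domain white and black lists map to the -d and -D options documented below. An illustrative invocation (the wildcard patterns are placeholders):

secretscraper -u https://example.com -d '*.example.com' -D 'static.*'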

Prerequisite

  • Platform: macOS or Ubuntu; Windows is not supported for now (this project depends on Hyperscan).
  • Python Version: 3.11

Usage

Install

pip install secretscraper

Basic Usage

Start with a single target:

secretscraper -u https://scrapeme.live/shop/

Start with multiple targets:

secretscraper -f urls
# urls
http://scrapeme.live/1
http://scrapeme.live/2
http://scrapeme.live/3
http://scrapeme.live/4
http://scrapeme.live/1

A sample output:

> secretscraper -u http://127.0.0.1:8888
Target urls num: 1
Max depth: 1, Max page num: 1000
Output file: /Users/padishah/Documents/Files/Python_WorkSpace/secretscraper/src/secretscraper/crawler.log
Target URLs: http://127.0.0.1:8888

1 URLs from http://127.0.0.1:8888 [200] (depth:0):
http://127.0.0.1:8888/index.html [200]

1 Domains:
127.0.0.1:8888


13 Secrets found in http://127.0.0.1:8888/1.js 200:
Email: 3333333qqqxxxx@qq.com
Shiro: =deleteme
JS Map: xx/static/asdfaf.js.map
Email: example@example.com
Swagger: static/swagger-ui.html
ID Card: 130528200011110000
URL as a Value: redirect=http://
Phone: 13273487666
Internal IP:  192.168.1.1
Cloud Key: Accesskeyid
Cloud Key: AccessKeySecret
Shiro: rememberme=
Internal IP:  10.0.0.1


1 JS from http://127.0.0.1:8888:
http://127.0.0.1:8888/1.js [200]

All supported options:

> secretscraper --help
Usage: secretscraper [OPTIONS]

  Main commands

Options:
  -V, --version                Show version and exit.
  --debug                      Enable debug.
  -a, --ua TEXT                Set User-Agent
  -c, --cookie TEXT            Set cookie
  -d, --allow-domains TEXT     Domain white list, wildcard(*) is supported,
                               separated by commas, e.g. *.example.com,
                               example*
  -D, --disallow-domains TEXT  Domain black list, wildcard(*) is supported,
                               separated by commas, e.g. *.example.com,
                               example*
  -f, --url-file FILE          Target urls file, separated by line break
  -i, --config FILE            Set config file, defaults to settings.yml
  -m, --mode [1|2]             Set crawl mode, 1(normal) for max_depth=1,
                               2(thorough) for max_depth=2, default 1
  --max-page INTEGER           Max page number to crawl, default 100000
  --max-depth INTEGER          Max depth to crawl, default 1
  -o, --outfile FILE           Output result to specified file
  -s, --status TEXT            Filter response status to display, separated by
                               commas, e.g. 200,300-400
  -x, --proxy TEXT             Set proxy, e.g. http://127.0.0.1:8080,
                               http://127.0.0.1:7890
  -H, --hide-regex             Hide regex search result
  -F, --follow-redirects       Follow redirects
  -u, --url TEXT               Target url
  -l, --local PATH             Local file or directory, scan local
                               file/directory recursively
  --help                       Show this message and exit.
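
Options can be combined. A hypothetical invocation that crawls through a local proxy, displays only 200 responses, and writes results to a file (all values are placeholders):

secretscraper -u https://scrapeme.live/shop/ -x http://127.0.0.1:8080 -s 200 -o result.log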

Advanced Usage

Thorough Crawl

By default the crawler runs in normal mode (-m 1) with the max depth set to 1, which means only the start URLs are crawled. To change that, specify --max-depth <number>, or simply use -m 2 to run the crawler in thorough mode, which is equivalent to --max-depth 2.

secretscraper -u https://scrapeme.live/shop/ -m 2
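
This is equivalent to specifying the depth explicitly:

secretscraper -u https://scrapeme.live/shop/ --max-depth 2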

Write Results to File

secretscraper -u https://scrapeme.live/shop/ -o result.log

Hide Regex Result

Use the -H option to hide regex matches; only the discovered links will be displayed.

secretscraper -u https://scrapeme.live/shop/ -H

Extract Secrets from Local Files

secretscraper -l <dir or file>

Customize Configuration

The built-in configuration is shown below. You can supply a custom configuration via -i settings.yml.

verbose: false
debug: false
loglevel: warning
logpath: log

proxy: "" # http://127.0.0.1:7890
max_depth: 1 # 0 for no limit
max_page_num: 1000 # 0 for no limit
timeout: 5
follow_redirects: false
workers_num: 1000
headers:
  Accept: "*/*"
  Cookie: ""
  User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0

rules:
  - name: Swagger
    regex: \b[\w/]+?((swagger-ui.html)|(\"swagger\":)|(Swagger UI)|(swaggerUi)|(swaggerVersion))\b
    loaded: true
  - name: ID Card
    regex: \b((\d{8}(0\d|10|11|12)([0-2]\d|30|31)\d{3}\$)|(\d{6}(18|19|20)\d{2}(0[1-9]|10|11|12)([0-2]\d|30|31)\d{3}(\d|X|x)))\b
    loaded: true
  - name: Phone
    regex: \b((?:(?:\+|00)86)?1(?:(?:3[\d])|(?:4[5-79])|(?:5[0-35-9])|(?:6[5-7])|(?:7[0-8])|(?:8[\d])|(?:9[189]))\d{8})\b
    loaded: true
  - name: JS Map
    regex: \b([\w/]+?\.js\.map)
    loaded: true
  - name: URL as a Value
    regex: (\b\w+?=(https?)(://|%3a%2f%2f))
    loaded: true
  - name: Email
    regex: \b(([a-z0-9][_|\.])*[a-z0-9]+@([a-z0-9][-|_|\.])*[a-z0-9]+\.([a-z]{2,}))\b
    loaded: true
  - name: Internal IP
    regex: '[^0-9]((127\.0\.0\.1)|(10\.\d{1,3}\.\d{1,3}\.\d{1,3})|(172\.((1[6-9])|(2\d)|(3[01]))\.\d{1,3}\.\d{1,3})|(192\.168\.\d{1,3}\.\d{1,3}))'
    loaded: true
  - name: Cloud Key
    regex: \b((accesskeyid)|(accesskeysecret)|\b(LTAI[a-z0-9]{12,20}))\b
    loaded: true
  - name: Shiro
    regex: (=deleteMe|rememberMe=)
    loaded: true
  - name: Suspicious API Key
    regex: "[\"'][0-9a-zA-Z]{32}['\"]"
    loaded: true
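
You can also extend the rules list with your own patterns, following the same format. A hypothetical rule for AWS access key IDs (the name and regex below are illustrative, not part of the built-in set):

rules:
  - name: AWS Access Key ID   # hypothetical custom rule, not built in
    regex: \b(AKIA[0-9A-Z]{16})\b
    loaded: true

Save the modified file and load it with secretscraper -u <target> -i settings.yml.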

TODO

  • Scan local file
  • Support Windows
  • Support headless browser
  • Extract links via regex
  • Support url-finder output format, add --tree option
  • Add regex doc reference

Change Log

2024.4.26 Version 1.3

  • New Features
    • Support scanning local files

2024.4.15

  • Add status to URL results
  • All crawler tests passed
