Multifunctional tool for HTTP reconnaissance, web crawling, and web directory brute force. Based on PSpider.

Killer features:

  1. Fast multi-URL crawling
  2. Fast multi-URL directory brute force
  3. Finds new domains without DNS brute force (for example, https://mail.ru --> 105 domains of *.mail.ru)
  4. To Do: dynamic dictionary creation for brute force
  5. To Do: deduplication based on Simhash
  6. Headless browsing and form fill-up as an additional option
  7. To Do: add proper output to jsonl + html reports
  8. To Do: collect query parameters (for GET and POST)
  9. To Do: better deduplication based on page hash
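
Feature 3 can be pictured as collecting in-scope hostnames from the URLs that crawling discovers, with no DNS queries involved. The sketch below is an illustration only: collect_subdomains is a hypothetical helper, not a function exported by ansferatu.

```python
from urllib.parse import urlsplit


def collect_subdomains(urls, scope):
    """Return the sorted, unique hostnames in `urls` that belong to `scope`.

    Hypothetical helper: it only illustrates the idea of finding domains
    from crawled links instead of DNS brute force.
    """
    scope = scope.lower().lstrip(".")
    found = set()
    for url in urls:
        # hostname is None for URLs without a network location (e.g. mailto:)
        host = (urlsplit(url).hostname or "").lower()
        if host == scope or host.endswith("." + scope):
            found.add(host)
    return sorted(found)
```

The suffix check uses "." + scope so that a look-alike such as notmail.ru is not counted as part of mail.ru.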

Installation

Ansferatu is a regular Python package. It requires Python 3.8+.

From PyPI:

pip3 install ansferatu

From source / GitHub:

pip3 install git+https://github.com/frostbits-security/ansferatu.git
# or, from a local checkout:
pip3 install .

Headless / form-filling support (optional). The --headless and --fill-forms modes rely on Playwright. Install the optional extra and download the Chromium runtime:

pip3 install 'ansferatu[headless]'
playwright install chromium

Installing the package exposes an ansferatu console command (equivalent to python3 -m ansferatu).

How to run

After installation, run via the ansferatu command:

ansferatu crawl --url https://mail.ru -o ./results/ --limit 1

Use as a library

The package can be imported into other Python tools:

from ansferatu import common_crawler, common_brute_from_file

common_crawler(
    url_list=["https://example.com"],
    scope=["example.com"],
    exclude_codes_list=[403, 404, 401],
    visit_count_limit=10,
    max_deep=2,
    threads=10,
    output_file="results.jsonl",
)

For lower-level control, build the spider directly:

from ansferatu.spider import WebSpider, TaskFetch

Docker

Build docker image:

docker build -t ansferatu .

Run the container (the image's entrypoint is the ansferatu command):

docker run --rm -it -v /tmp/ansferatu_out:/ansferatu/results ansferatu \
  crawl --url https://mail.ru -o /ansferatu/results/ --limit 1

Modes

crawl - crawls web sites. The main parameter is "visit_count_limit".

ansferatu crawl --url https://deti.mail.ru -o /home/sabotaged/BB/mail.ru/

crawl --headless - same crawl but with Playwright headless extraction for qualifying pages. Requires the headless extra: pip install 'ansferatu[headless]' && playwright install chromium.

ansferatu crawl --headless --url https://example.com -o ./results/

crawl --fill-forms - extends headless crawl with form detection and interaction. Detects <form> elements on pages, fills fields with smart defaults (email, password, search, etc.), submits forms and clicks buttons, then captures the resulting POST responses and new URLs. Implies --headless.

ansferatu crawl --fill-forms --url https://example.com -o ./results/

brute - classic web directory brute force. Requires a wordlist.

ansferatu brute --url https://news.mail.ru -w ./wordlists/fuzz_big.txt -o /home/sabotaged/BB/mail.ru/

Modes task flow (queues and owners)

crawl puts start tasks into QueueFetch, then the queues are filled and drained by the workers shown below:

flowchart LR
  start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
  qf --> fetchers[Fetchers<br/>multi-threading]
  fetchers -->|TaskExtract| qe[QueueExtract<br/>priority keys deep url content]
  fetchers -->|TaskHTMLHandle| qh[QueueHTMLHandle<br/>priority keys deep url content]
  qe --> extractor[Extractor]
  extractor -->|TaskFetch| qf
  qh --> html[HTML Handler]
  html -->|TaskSave if item| qs[QueueSave<br/>priority keys deep url item]
  qs --> saver[Saver]

  proxieser[Proxieser] -.->|optional| qp[QueueProxies]
  qp -.->|optional| fetchers
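
The queue ownership in the flowchart above can be sketched with the standard library alone. This is a toy model under stated assumptions, not ansferatu's implementation: run_crawl, fetch and extract_links are hypothetical names, the HTML Handler and Saver are collapsed into one save stage, and proxies are left out.

```python
import queue
import threading


def run_crawl(start_urls, fetch, extract_links, max_deep=1, fetcher_count=4):
    """Toy fetch -> extract -> save pipeline; returns {url: content}."""
    queue_fetch = queue.PriorityQueue()  # (deep, url): shallow URLs first
    queue_extract = queue.Queue()        # (deep, url, content)
    queue_save = queue.Queue()           # (url, content)
    in_flight = queue.Queue()            # used only to count unfinished tasks
    seen, seen_lock, saved = set(start_urls), threading.Lock(), {}

    def submit(target, task):
        in_flight.put(None)              # count the task before queueing it
        target.put(task)

    def fetcher():
        while True:
            deep, url = queue_fetch.get()
            content = fetch(url)
            submit(queue_extract, (deep, url, content))
            submit(queue_save, (url, content))
            in_flight.task_done()        # children were counted first

    def extractor():
        while True:
            deep, url, content = queue_extract.get()
            if deep < max_deep:
                for link in extract_links(content):
                    with seen_lock:
                        is_new = link not in seen
                        seen.add(link)
                    if is_new:
                        submit(queue_fetch, (deep + 1, link))
            in_flight.task_done()

    def saver():
        while True:
            url, content = queue_save.get()
            saved[url] = content
            in_flight.task_done()

    for worker in [fetcher] * fetcher_count + [extractor, saver]:
        threading.Thread(target=worker, daemon=True).start()
    for url in start_urls:
        submit(queue_fetch, (0, url))
    in_flight.join()                     # returns once every task is finished
    return saved
```

Each worker counts the tasks it creates before marking its own task done, so the in-flight counter cannot reach zero while the Extractor is still feeding QueueFetch.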

crawl --headless extends the regular crawl with a Playwright-based headless browser pipeline. Qualifying pages (decided by HeadlessCandidate) are routed to a single-threaded headless engine instead of the normal Extractor + HTML Handler path. The headless engine intercepts CDP network events to discover URLs and captures the fully-rendered page for the HTML Handler.

flowchart LR
  start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
  qf --> fetchers[Fetchers<br/>multi-thread]

  fetchers -->|HeadlessCandidate?| decision{is<br/>candidate?}

  decision -->|No| qe[QueueExtract]
  decision -->|No| qh[QueueHTMLHandle]
  decision -->|Yes| qhl[QueueHeadless<br/>dedup: VisitLimit]

  qhl --> headless[HeadlessThread<br/>single thread<br/>Playwright + CDP]

  headless -->|intercepted URLs<br/>TaskFetch| qf
  headless -->|normalized page<br/>TaskHTMLHandle| qh

  qe --> extractor[Extractor]
  extractor -->|TaskFetch| qf

  qh --> html[HTML Handler<br/>_normalize_content]
  html -->|TaskSave| qs[QueueSave]
  qs --> saver[Saver]

Key points:

  • HeadlessCandidate decides which fetched pages qualify. Currently: root/index-like URLs (is_absolute) and HTML responses with status 200/301/302.
  • HeadlessExtractor (Playwright) uses lazy browser init on the worker thread to avoid thread-affinity issues. It hooks page.on("request") to capture all network URLs, then returns both discovered TaskFetch items and a TaskHTMLHandle with a normalized dict (status_code, url, html_text, headers, title, etc.).
  • CommonHTMLHandler accepts both requests.Response objects (regular path) and the normalized dict (headless path) via _normalize_content().
  • Deduplication: VisitLimit.check_headless_visited() prevents the same URL from being sent to headless twice. UrlFilter continues to deduplicate the fetch queue as usual.
  • When a fetched URL qualifies for headless, it skips the regular Extractor and HTML Handler; only the headless pipeline processes it.
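
The routing decision and the two content shapes from the key points above could look roughly like the sketch below. Both functions are hypothetical stand-ins for HeadlessCandidate and _normalize_content; the exact "root/index-like" rule, and the assumption that the URL check and the response check must both pass, are guesses.

```python
from urllib.parse import urlsplit

HEADLESS_STATUS_CODES = {200, 301, 302}


def is_headless_candidate(url, status_code, content_type):
    """Guess whether a fetched page should be routed to the headless queue."""
    path = urlsplit(url).path
    last_segment = path.rsplit("/", 1)[-1]
    # Assumption: "root/index-like" means "/" or a file named index.*
    looks_like_root = path in ("", "/") or last_segment.startswith("index.")
    is_html = "text/html" in (content_type or "").lower()
    return looks_like_root and is_html and status_code in HEADLESS_STATUS_CODES


def normalize_content(content):
    """Return one dict shape for both the regular and the headless path."""
    if isinstance(content, dict):
        return content  # headless path: already a normalized dict
    # Regular path: any object exposing the attributes of requests.Response.
    return {
        "status_code": content.status_code,
        "url": content.url,
        "html_text": content.text,
        "headers": dict(content.headers),
    }
```

Normalizing at the handler boundary lets the HTML Handler stay unaware of which pipeline produced the page.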

crawl --fill-forms extends the headless pipeline with a two-phase form interaction system. Phase 1 (cheap): HeadlessExtractor calls FormDetector.detect(page) on the already-loaded page to produce universal form descriptors. Phase 2 (expensive, deferred): HeadlessFormInteractor picks up form tasks from a dedicated queue, opens the page in a separate browser, fills fields via FormFiller, submits, and captures results.

flowchart LR
  start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
  qf --> fetchers[Fetchers<br/>multi-thread]

  fetchers -->|HeadlessCandidate?| decision{is<br/>candidate?}

  decision -->|No| qe[QueueExtract]
  decision -->|No| qh[QueueHTMLHandle]
  decision -->|Yes| qhl[QueueHeadless<br/>dedup: VisitLimit]

  qhl --> headless[HeadlessThread<br/>single thread<br/>Playwright + CDP]

  headless -->|intercepted URLs<br/>TaskFetch| qf
  headless -->|normalized page<br/>TaskHTMLHandle| qh
  headless -->|form descriptors<br/>TaskFormInteract| qfi[QueueFormInteract]

  qfi --> forminteract[FormInteractThread<br/>single thread<br/>separate Playwright browser]
  forminteract -->|POST response URLs<br/>TaskFetch| qf
  forminteract -->|POST response page<br/>TaskHTMLHandle| qh

  qe --> extractor[Extractor]
  extractor -->|TaskFetch| qf

  qh --> html[HTML Handler<br/>_normalize_content]
  html -->|TaskSave| qs[QueueSave]
  qs --> saver[Saver]

Key points for form interaction:

  • FormDetector scans the already-loaded page DOM for <form> elements. Pure detection, no extra navigation (~50ms overhead). Returns universal form descriptors.
  • Form descriptor schema: {form_selector, action, method, fields[], buttons[], page_url}. Designed to be self-contained so HeadlessFormInteractor needs no extra DOM inspection.
  • FormFiller maps input types/names to smart defaults (email, password, search, etc.). Supports custom value overrides via dict.
  • HeadlessFormInteractor runs in a dedicated thread with its own Playwright browser. It navigates to the page, fills fields, submits/clicks, and captures network traffic + the resulting page data. Results flow back through the normal URL_FETCH and HTM_HANDLE queues.
  • Budget cap: FormDetector.max_forms_per_page (default 5) and HeadlessFormInteractor.max_interactions_per_page prevent runaway on form-heavy pages.
  • The form interaction pipeline is fully independent from the headless extraction pipeline — separate queue, separate thread, separate browser instance.
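
To make the descriptor schema and the "smart defaults" idea concrete, here is a minimal sketch. The top-level descriptor keys follow the schema listed above; the per-field keys (name, type), the default values and the fill_values helper are assumptions for illustration, not FormFiller's real API.

```python
# Assumed defaults, keyed by input type; also matched against field names.
DEFAULT_VALUES = {
    "email": "test@example.com",
    "password": "Passw0rd!123",
    "search": "test",
    "tel": "5550100",
    "url": "https://example.com",
}
FALLBACK_VALUE = "test"


def fill_values(descriptor, overrides=None):
    """Map each field name in a form descriptor to the value to type in."""
    overrides = overrides or {}
    values = {}
    for field in descriptor["fields"]:
        name = field["name"]
        field_type = field.get("type", "text")
        if name in overrides:                    # custom value wins
            values[name] = overrides[name]
        elif field_type in DEFAULT_VALUES:       # match on the input type
            values[name] = DEFAULT_VALUES[field_type]
        else:                                    # fall back to the field name
            values[name] = next(
                (value for key, value in DEFAULT_VALUES.items()
                 if key in name.lower()),
                FALLBACK_VALUE,
            )
    return values


example_descriptor = {
    "form_selector": "form#login",
    "action": "/login",
    "method": "post",
    "fields": [
        {"name": "user_email", "type": "text"},
        {"name": "pass", "type": "password"},
        {"name": "remember", "type": "text"},
    ],
    "buttons": [{"selector": "button[type=submit]"}],
    "page_url": "https://example.com/login",
}
```

Calling fill_values(example_descriptor, overrides={"remember": "1"}) would fill the e-mail and password fields from the defaults and use the override for the last field.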

brute skips extraction and only handles and saves the results of fetches:

flowchart LR
  start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
  qf --> fetchers[Fetchers<br/>multi-threading]
  fetchers -->|TaskHTMLHandle| qh[QueueHTMLHandle<br/>priority keys deep url content]
  qh --> html[HTML Handler]
  html -->|TaskSave if item| qs[QueueSave<br/>priority keys deep url item]
  qs --> saver[Saver]

  proxieser[Proxieser] -.->|optional| qp[QueueProxies]
  qp -.->|optional| fetchers

How to change settings

Besides parsing the console arguments, ansferatu has a settings file for:

  • blacklisted extensions for requests
  • blacklisted extensions for parsing
  • number of HTTP request workers
  • number of CPU-bound workers
  • HTTP error_limit
  • limit of requests to one host
  • HTTP request headers
  • ignored content types for the report
  • deduplication mode

The default file is stored in modules\settings\default_config.yaml

To change settings, copy modules\settings\default_config.yaml to modules\settings\config.yaml and then edit config.yaml.
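
The copy-then-edit workflow suggests that a user-provided config.yaml takes precedence over the default file. A minimal sketch of that selection, assuming exactly this precedence (pick_settings_file is a hypothetical helper, and the real loader may behave differently):

```python
from pathlib import Path


def pick_settings_file(settings_dir):
    """Prefer config.yaml when it exists, else fall back to the default file.

    Assumption: the user copy simply replaces the default settings file.
    """
    settings_dir = Path(settings_dir)
    custom = settings_dir / "config.yaml"
    if custom.is_file():
        return custom
    return settings_dir / "default_config.yaml"
```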

How we avoid loops

checkRecursion() - checks whether something has gone wrong and requests have started to repeat the same path again and again, like /blog/article/blog/article/... This sometimes happens because URL extraction is imperfect.
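
One way to detect such a path is to look for a run of path segments that repeats back to back. The sketch below illustrates that idea and is not the body of checkRecursion(); has_repeating_path and its max_repeats threshold are assumptions.

```python
from urllib.parse import urlsplit


def has_repeating_path(url, max_repeats=2):
    """True if some run of path segments occurs max_repeats times in a row.

    max_repeats must be at least 2; the default flags /a/b/a/b/ but also
    a short path such as /en/en/, so a higher value may suit real sites.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    for size in range(1, len(segments) // max_repeats + 1):
        for start in range(len(segments) - size * max_repeats + 1):
            block = segments[start:start + size]
            repeats = (
                segments[start + i * size:start + (i + 1) * size] == block
                for i in range(1, max_repeats)
            )
            if all(repeats):
                return True
    return False
```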

check_limits() - checks how many times we have accessed the parent directory.
How it works: let's use http://www.example.com/blog/articles/my_article_1.php as an example.

  1. We check how many times we have visited http://www.example.com/blog/articles/
  2. If the count crosses crawl_limit, we mark this path as over_limit_pages.
  3. We add +1 to the crawl limit counter of the upper path (http://www.example.com/blog/).
  4. Go to step 1 for the upper path (if that path also contains a large number of URLs, we avoid that loop as well).

Step by step, once all limits are crossed, we end up banning visits to the whole website.
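
The four steps can be sketched as a small counter over parent directories. This is one reading of the description, not the actual check_limits() code: DirectoryLimiter and allow() are hypothetical names, and the choice to reject the visit that crosses the limit is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


class DirectoryLimiter:
    def __init__(self, crawl_limit=3):
        self.crawl_limit = crawl_limit
        self.visits = Counter()
        self.over_limit_pages = set()

    @staticmethod
    def _parents(url):
        """Yield parent directories from the deepest one up to the site root."""
        parts = urlsplit(url)
        segments = [s for s in parts.path.split("/") if s][:-1]  # drop last part
        base = f"{parts.scheme}://{parts.netloc}/"
        for depth in range(len(segments), -1, -1):
            yield base + "".join(s + "/" for s in segments[:depth])

    def allow(self, url):
        """Record a visit and say whether the URL may still be crawled."""
        parents = list(self._parents(url))
        if any(parent in self.over_limit_pages for parent in parents):
            return False
        allowed = True
        for parent in parents:
            self.visits[parent] += 1                 # step 1 (and step 3)
            if self.visits[parent] <= self.crawl_limit:
                break                                # within limit: stop climbing
            self.over_limit_pages.add(parent)        # step 2
            allowed = False                          # loop on = step 4
        return allowed
```

Under this reading, each directory that goes over its limit costs its parent one visit, so a site is fully banned only after enough of its subtrees have been exhausted.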

How retries work

We have two types of error limit:

  1. How many times a URL is retried
  2. How many times the same URL is added to the queue

The retries limit should be less than the error limit.

When we get a connection error for a URL, we retry it until the retries limit is reached and then leave the URL alone for a while. After that we keep adding the URL to the queue (it may start answering after a while), and if it is still unavailable we ban it. If the URL does answer, we reset the count.
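
A minimal sketch of the two limits, assuming a single error counter per URL. ErrorTracker, its method names and the "retry" / "defer" / "ban" outcomes are illustrative and do not come from ansferatu.

```python
from collections import Counter


class ErrorTracker:
    def __init__(self, retries_limit=3, error_limit=6):
        if retries_limit >= error_limit:
            raise ValueError("retries_limit should be less than error_limit")
        self.retries_limit = retries_limit
        self.error_limit = error_limit
        self.errors = Counter()
        self.banned = set()

    def record_error(self, url):
        """Count a connection error and decide what to do with the URL."""
        self.errors[url] += 1
        if self.errors[url] >= self.error_limit:
            self.banned.add(url)
            return "ban"      # still unavailable after being re-queued
        if self.errors[url] < self.retries_limit:
            return "retry"    # retry right away
        return "defer"        # leave it for a while, re-queue later

    def record_success(self, url):
        """The URL answered: reset its error count."""
        self.errors.pop(url, None)
```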

Wappalyzer role

Wappalyzer works with the app.json file. This file contains a regexp database for searching for anything in the server response (cookies, headers, scripts, text in HTML, etc.).

The idea is to use Wappalyzer's regex engine to search for "bad places":

  • All inputs
<input type="email">
<input type="password">
<input type="search">
<input type="submit">
  • SSRF
formcontrolname="url"
  • Submit buttons
<button class="aa" type="submit">Search</button>
  • File uploads
<input type="file">

Wappalyzer could be used as a simple vulnerability scanner:

  1. Send a specific request.
  2. Run a regexp search on the server's answer.
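
The regex search for "bad places" described above can be illustrated with plain re patterns. The pattern names and expressions below are assumptions written for this example; they are not entries from app.json.

```python
import re

# Illustrative signatures, one per category listed above.
BAD_PLACE_PATTERNS = {
    "input": re.compile(r"<input\b[^>]*>", re.I),
    "ssrf_candidate": re.compile(r"formcontrolname\s*=\s*[\"']url[\"']", re.I),
    "submit_button": re.compile(
        r"<button\b[^>]*\btype\s*=\s*[\"']submit[\"'][^>]*>", re.I
    ),
    "file_upload": re.compile(
        r"<input\b[^>]*\btype\s*=\s*[\"']file[\"'][^>]*>", re.I
    ),
}


def find_bad_places(html_text):
    """Return {signature name: matched snippets} for every signature that hits."""
    hits = {}
    for name, pattern in BAD_PLACE_PATTERNS.items():
        matches = pattern.findall(html_text)
        if matches:
            hits[name] = matches
    return hits
```

Regular expressions are a coarse fit for HTML; they are used here only because the text describes a regex-driven engine.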

Deduplication

  • Content length + word_count
  • Content length prediction (not fully tested)
  • To Do: Similarity check
    • Check changes in HTML (search for new functions)
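
The first mode (content length + word_count) can be sketched as a fingerprint set. ContentDeduplicator is a hypothetical class, and scoping the fingerprint to a host is an assumption.

```python
class ContentDeduplicator:
    """Treats two responses from one host as duplicates when both the
    content length and the word count match."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, host, text):
        # Fingerprint: (host, byte length, whitespace-separated word count)
        fingerprint = (host, len(text.encode("utf-8")), len(text.split()))
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False
```

This is cheap but coarse: two different pages with the same size and word count collide, which is one reason a similarity check is listed as a To Do.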

Development

Editable install (changes to the source are picked up immediately):

pip3 install -e '.[headless,dev]'

Run the test suite:

pytest

Building & publishing to PyPI

The project is configured with pyproject.toml (PEP 621). To build the distribution artifacts (source distribution + wheel):

pip3 install build
python3 -m build          # writes dist/ansferatu-<version>.tar.gz and .whl

Validate and upload with Twine:

pip3 install twine
twine check dist/*

# Test upload first (recommended): https://test.pypi.org
twine upload --repository testpypi dist/*

# Real upload
twine upload dist/*

Notes:

  • Bump version in pyproject.toml (and __version__ in ansferatu/__init__.py) before each release; PyPI rejects re-uploads of an existing version.
  • Uploading requires a PyPI account and an API token (configure it via ~/.pypirc or the TWINE_USERNAME=__token__ / TWINE_PASSWORD=<token> environment variables).
  • The package name ansferatu must be available on PyPI for the first upload.
