Multifunctional tool for HTTP reconnaissance, web crawling and web directory bruteforce.
Project description
Multifunctional tool for HTTP reconnaissance, web crawling, and web directory bruteforce. Based on PSpider.
Killer features:
- Fast multiurl crawling
- Fast multiurl directory bruteforce
- Find new domains without DNS bruteforce. (for example https://mail.ru --> 105 Domains of *.mail.ru)
- To Do: dynamic dictionary creation for brute-force
- To Do: deduplication based on Simhash
- Headless browsing and form filling as an additional option
- To Do: add proper output to jsonl + html reports
- To Do: Collect query parameters (for get and post)
- To Do: better deduplication based on page hash
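Finding new domains without DNS bruteforce comes down to collecting the hostnames of links seen during a crawl. A minimal sketch of that idea, assuming a hypothetical helper named collect_subdomains (it is not part of the ansferatu API):

```python
from urllib.parse import urlparse


def collect_subdomains(urls, base_domain):
    """Return the sorted hostnames from `urls` that fall under `base_domain`."""
    base = base_domain.lower().lstrip(".")
    found = set()
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Keep the base domain itself and anything ending in ".<base>".
        if host == base or host.endswith("." + base):
            found.add(host)
    return sorted(found)


links = [
    "https://mail.ru/",
    "https://news.mail.ru/politics/",
    "https://deti.mail.ru/forum",
    "https://example.com/about",
    "//notmail.ru/x",
]
print(collect_subdomains(links, "mail.ru"))
```

The suffix check uses a leading dot so that look-alike hosts such as notmail.ru are not counted as subdomains.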
Installation
Ansferatu is a regular Python package. It requires Python 3.8+.
From PyPI:
pip3 install ansferatu
From source / GitHub:
pip3 install git+https://github.com/frostbits-security/ansferatu.git
# or, from a local checkout:
pip3 install .
Headless / form-filling support (optional). The --headless and
--fill-forms modes rely on Playwright.
Install the optional extra and download the Chromium runtime:
pip3 install 'ansferatu[headless]'
playwright install chromium
Installing the package exposes an ansferatu console command (equivalent to
python3 -m ansferatu).
How to run
After installation, run via the ansferatu command:
ansferatu crawl --url https://mail.ru -o ./results/ --limit 1
Use as a library
The package can be imported into other Python tools:
from ansferatu import common_crawler, common_brute_from_file
common_crawler(
    url_list=["https://example.com"],
    scope=["example.com"],
    exclude_codes_list=[403, 404, 401],
    visit_count_limit=10,
    max_deep=2,
    threads=10,
    output_file="results.jsonl",
)
For lower-level control, build the spider directly:
from ansferatu.spider import WebSpider, TaskFetch
Docker
Build docker image:
docker build -t ansferatu .
Run the container (the image's entrypoint is the ansferatu command):
docker run --rm -it -v /tmp/ansferatu_out:/ansferatu/results ansferatu \
crawl --url https://mail.ru -o /ansferatu/results/ --limit 1
Modes
crawl - crawls web sites. The main parameter is "visit_count_limit".
ansferatu crawl --url https://deti.mail.ru -o /home/sabotaged/BB/mail.ru/
crawl --headless - same crawl but with Playwright headless extraction for qualifying pages.
Requires the headless extra: pip install 'ansferatu[headless]' && playwright install chromium.
ansferatu crawl --headless --url https://example.com -o ./results/
crawl --fill-forms - extends headless crawl with form detection and interaction.
Detects <form> elements on pages, fills fields with smart defaults (email, password, search, etc.),
submits forms and clicks buttons, then captures the resulting POST responses and new URLs.
Implies --headless.
ansferatu crawl --fill-forms --url https://example.com -o ./results/
brute - classic web directories bruteforce. Needs wordlist.
ansferatu brute --url https://news.mail.ru -w ./wordlists/fuzz_big.txt -o /home/sabotaged/BB/mail.ru/
Modes task flow (queues and owners)
crawl puts start tasks into QueueFetch, then the queues are filled and drained by the workers shown below:
flowchart LR
start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
qf --> fetchers[Fetchers<br/>multi-threading]
fetchers -->|TaskExtract| qe[QueueExtract<br/>priority keys deep url content]
fetchers -->|TaskHTMLHandle| qh[QueueHTMLHandle<br/>priority keys deep url content]
qe --> extractor[Extractor]
extractor -->|TaskFetch| qf
qh --> html[HTML Handler]
html -->|TaskSave if item| qs[QueueSave<br/>priority keys deep url item]
qs --> saver[Saver]
proxieser[Proxieser] -.->|optional| qp[QueueProxies]
qp -.->|optional| fetchers
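The queue hand-off in the flowchart above can be illustrated with a toy, single-threaded version. This is only a sketch under stated assumptions: FAKE_SITE stands in for real HTTP fetches, the crawl function and the tuple layouts are hypothetical, and the real project runs fetchers in multiple threads.

```python
import queue

# Hypothetical in-memory "site": URL -> links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": ["https://example.com/"],
}


def crawl(start_url, max_deep=2):
    """Drain fetch -> extract -> save queues in one thread; return saved items."""
    queue_fetch = queue.PriorityQueue()
    queue_extract = queue.PriorityQueue()
    queue_save = queue.Queue()
    seen = {start_url}
    queue_fetch.put((0, 0, start_url))  # (priority, deep, url)

    while not queue_fetch.empty():
        priority, deep, url = queue_fetch.get()
        links = FAKE_SITE.get(url, [])  # the "Fetcher" step
        queue_extract.put((priority, deep, url, links))
        queue_save.put({"url": url, "deep": deep})  # the "HTML Handler" step

        while not queue_extract.empty():  # the "Extractor" step
            _, e_deep, _, found = queue_extract.get()
            if e_deep >= max_deep:
                continue  # do not schedule pages deeper than max_deep
            for link in found:
                if link not in seen:
                    seen.add(link)
                    # Shallower pages get a smaller priority number, so they go first.
                    queue_fetch.put((e_deep + 1, e_deep + 1, link))

    saved = []
    while not queue_save.empty():  # the "Saver" step
        saved.append(queue_save.get())
    return saved


for item in crawl("https://example.com/"):
    print(item)
```

Using the depth as the priority means shallow pages are fetched before deep ones, which is one simple way to keep a crawl breadth-first.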
crawl --headless extends the regular crawl with a Playwright-based headless browser pipeline.
Qualifying pages (decided by HeadlessCandidate) are routed to a single-threaded headless
engine instead of the normal Extractor + HTML Handler path. The headless engine intercepts
CDP network events to discover URLs and captures the fully-rendered page for the HTML Handler.
flowchart LR
start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
qf --> fetchers[Fetchers<br/>multi-thread]
fetchers -->|HeadlessCandidate?| decision{is<br/>candidate?}
decision -->|No| qe[QueueExtract]
decision -->|No| qh[QueueHTMLHandle]
decision -->|Yes| qhl[QueueHeadless<br/>dedup: VisitLimit]
qhl --> headless[HeadlessThread<br/>single thread<br/>Playwright + CDP]
headless -->|intercepted URLs<br/>TaskFetch| qf
headless -->|normalized page<br/>TaskHTMLHandle| qh
qe --> extractor[Extractor]
extractor -->|TaskFetch| qf
qh --> html[HTML Handler<br/>_normalize_content]
html -->|TaskSave| qs[QueueSave]
qs --> saver[Saver]
Key points:
- HeadlessCandidate decides which fetched pages qualify. Currently: root/index-like URLs (is_absolute) and HTML responses with status 200/301/302.
- HeadlessExtractor (Playwright) uses lazy browser init on the worker thread to avoid thread-affinity issues. It hooks page.on("request") to capture all network URLs, then returns both discovered TaskFetch items and a TaskHTMLHandle with a normalized dict (status_code, url, html_text, headers, title, etc.).
- CommonHTMLHandler accepts both requests.Response objects (regular path) and the normalized dict (headless path) via _normalize_content().
- Deduplication: VisitLimit.check_headless_visited() prevents the same URL from being sent to headless twice. UrlFilter continues to deduplicate the fetch queue as usual.
- When a fetched URL qualifies for headless, it skips the regular Extractor and HTML Handler; only the headless pipeline processes it.
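The candidate check and the headless deduplication described above might look roughly like the sketch below. The names HeadlessGate and is_root_like are hypothetical, and the exact meaning of "root/index-like" (here: the site root or an index.* file) is an assumption, not the project's actual rule.

```python
from urllib.parse import urlparse

ALLOWED_STATUSES = {200, 301, 302}


def is_root_like(url):
    """True for paths such as '', '/', '/index.html' or '/blog/index.php'."""
    path = urlparse(url).path
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    return path in ("", "/") or last_segment.startswith("index.")


class HeadlessGate:
    """Decides whether a fetched page goes to the headless queue, at most once per URL."""

    def __init__(self):
        self._visited = set()

    def accept(self, url, status_code, content_type):
        if status_code not in ALLOWED_STATUSES:
            return False
        if "text/html" not in content_type.lower():
            return False
        if not is_root_like(url):
            return False
        if url in self._visited:
            return False  # already sent to headless once
        self._visited.add(url)
        return True


gate = HeadlessGate()
print(gate.accept("https://example.com/", 200, "text/html; charset=utf-8"))
print(gate.accept("https://example.com/", 200, "text/html; charset=utf-8"))
```

Running the cheap status and content-type checks before touching the visited set keeps non-qualifying URLs from consuming the one allowed headless visit.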
crawl --fill-forms extends the headless pipeline with a two-phase form interaction system.
Phase 1 (cheap): HeadlessExtractor calls FormDetector.detect(page) on the already-loaded page
to produce universal form descriptors. Phase 2 (expensive, deferred): HeadlessFormInteractor
picks up form tasks from a dedicated queue, opens the page in a separate browser, fills fields
via FormFiller, submits, and captures results.
flowchart LR
start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
qf --> fetchers[Fetchers<br/>multi-thread]
fetchers -->|HeadlessCandidate?| decision{is<br/>candidate?}
decision -->|No| qe[QueueExtract]
decision -->|No| qh[QueueHTMLHandle]
decision -->|Yes| qhl[QueueHeadless<br/>dedup: VisitLimit]
qhl --> headless[HeadlessThread<br/>single thread<br/>Playwright + CDP]
headless -->|intercepted URLs<br/>TaskFetch| qf
headless -->|normalized page<br/>TaskHTMLHandle| qh
headless -->|form descriptors<br/>TaskFormInteract| qfi[QueueFormInteract]
qfi --> forminteract[FormInteractThread<br/>single thread<br/>separate Playwright browser]
forminteract -->|POST response URLs<br/>TaskFetch| qf
forminteract -->|POST response page<br/>TaskHTMLHandle| qh
qe --> extractor[Extractor]
extractor -->|TaskFetch| qf
qh --> html[HTML Handler<br/>_normalize_content]
html -->|TaskSave| qs[QueueSave]
qs --> saver[Saver]
Key points for form interaction:
- FormDetector scans the already-loaded page DOM for <form> elements. Pure detection, no extra navigation (~50ms overhead). Returns universal form descriptors.
- Form descriptor schema: {form_selector, action, method, fields[], buttons[], page_url}. Designed to be self-contained so HeadlessFormInteractor needs no extra DOM inspection.
- FormFiller maps input types/names to smart defaults (email, password, search, etc.). Supports custom value overrides via dict.
- HeadlessFormInteractor runs in a dedicated thread with its own Playwright browser. It navigates to the page, fills fields, submits/clicks, and captures network traffic + the resulting page data. Results flow back through the normal URL_FETCH and HTM_HANDLE queues.
- Budget cap: FormDetector.max_forms_per_page (default 5) and HeadlessFormInteractor.max_interactions_per_page prevent runaway on form-heavy pages.
- The form interaction pipeline is fully independent from the headless extraction pipeline: separate queue, separate thread, separate browser instance.
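To make the descriptor schema and the "smart defaults with overrides" idea concrete, here is a small sketch. The dataclass mirrors the field names listed above, but the default values, the shape of each fields[] entry, and the fill_values helper are assumptions for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical defaults; the values used by the real FormFiller are not documented here.
SMART_DEFAULTS = {
    "email": "test@example.com",
    "password": "Passw0rd!",
    "search": "test",
    "text": "test",
}


@dataclass
class FormDescriptor:
    form_selector: str
    action: str
    method: str
    page_url: str
    fields: list = field(default_factory=list)   # assumed shape: [{"name": ..., "type": ...}]
    buttons: list = field(default_factory=list)


def fill_values(descriptor, overrides=None):
    """Map each field name to an override if given, else to a default for its type."""
    overrides = overrides or {}
    values = {}
    for entry in descriptor.fields:
        name = entry["name"]
        input_type = entry.get("type", "text")
        if name in overrides:
            values[name] = overrides[name]
        else:
            # Unknown input types fall back to the plain-text default.
            values[name] = SMART_DEFAULTS.get(input_type, SMART_DEFAULTS["text"])
    return values


login = FormDescriptor(
    form_selector="form#login",
    action="/login",
    method="post",
    page_url="https://example.com/",
    fields=[{"name": "user_email", "type": "email"}, {"name": "pw", "type": "password"}],
)
print(fill_values(login, overrides={"pw": "hunter2"}))
```

Because the descriptor carries everything needed to fill the form, the function works on plain data and needs no browser or DOM access.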
brute skips extraction and only handles and saves the results of fetches:
flowchart LR
start([Start Task]) -->|set_start_task| qf[QueueFetch<br/>priority keys deep url repeat]
qf --> fetchers[Fetchers<br/>multi-threading]
fetchers -->|TaskHTMLHandle| qh[QueueHTMLHandle<br/>priority keys deep url content]
qh --> html[HTML Handler]
html -->|TaskSave if item| qs[QueueSave<br/>priority keys deep url item]
qs --> saver[Saver]
proxieser[Proxieser] -.->|optional| qp[QueueProxies]
qp -.->|optional| fetchers
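In brute mode the start tasks come from a wordlist rather than from extracted links. A minimal sketch of turning wordlist lines into candidate URLs, assuming a hypothetical helper build_brute_urls and assuming that blank lines and lines starting with # should be skipped:

```python
from urllib.parse import quote, urljoin


def build_brute_urls(base_url, words):
    """Yield unique candidate URLs for each usable wordlist entry."""
    base = base_url if base_url.endswith("/") else base_url + "/"
    seen = set()
    for raw in words:
        word = raw.strip()
        if not word or word.startswith("#"):
            continue  # skip blank lines and comments
        # Percent-encode unsafe characters and keep the path relative to the base.
        url = urljoin(base, quote(word.lstrip("/")))
        if url not in seen:
            seen.add(url)
            yield url


wordlist = ["admin\n", "", "# comment", "/admin", ".git/config", "a b"]
for candidate in build_brute_urls("https://news.mail.ru", wordlist):
    print(candidate)
```

Deduplicating at this stage avoids queueing the same path twice when a wordlist contains both admin and /admin.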
How to change settings
In addition to command-line arguments, ansferatu reads a settings file that controls:
- blacklisted extensions for requests
- blacklisted extensions for parsing
- number of HTTP request workers
- number of CPU-bound workers
- HTTP error_limit
- request limit per host
- HTTP request headers
- content types ignored in the report
- deduplication mode
The default file is stored in modules\settings\default_config.yaml.
To change settings, copy modules\settings\default_config.yaml to modules\settings\config.yaml and edit config.yaml.
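One common way to combine a default file with a user file is a recursive merge in which user values win. The sketch below shows that technique on plain dicts; whether ansferatu merges the two files or simply prefers config.yaml is an assumption, and the setting keys shown are hypothetical.

```python
import copy

# Hypothetical setting names, standing in for the parsed default_config.yaml.
DEFAULTS = {
    "fetcher_threads": 10,
    "error_limit": 5,
    "headers": {"User-Agent": "ansferatu", "Accept": "*/*"},
    "blacklist_extensions": [".png", ".jpg"],
}


def merge_settings(defaults, overrides):
    """Return a new dict: defaults updated by overrides, merging nested dicts."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale, not combined.
            merged[key] = copy.deepcopy(value)
    return merged


user_config = {"error_limit": 3, "headers": {"User-Agent": "custom-agent"}}
print(merge_settings(DEFAULTS, user_config))
```

With this approach a user file only needs to list the values that differ from the defaults.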
How we avoid loops
checkRecursion() - detects when something goes wrong and requests start repeating the same path again and again, for example /blog/article/blog/article/... This sometimes happens because the URL extraction process is imperfect.
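A path loop of this kind can be detected by looking for a block of path segments that is immediately followed by a copy of itself. The function below is a sketch of that idea, not the project's checkRecursion() implementation; the name has_repeating_path and the min_repeats parameter are hypothetical.

```python
from urllib.parse import urlparse


def has_repeating_path(url, min_repeats=2):
    """True if some run of path segments occurs min_repeats times back to back."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    total = len(segments)
    for size in range(1, total // min_repeats + 1):
        for start in range(0, total - size * min_repeats + 1):
            block = segments[start:start + size]
            # Compare the block with each of the following (min_repeats - 1) blocks.
            if all(
                segments[start + k * size:start + (k + 1) * size] == block
                for k in range(1, min_repeats)
            ):
                return True
    return False


print(has_repeating_path("https://example.com/blog/article/blog/article/post"))
print(has_repeating_path("https://example.com/blog/article/post"))
```

Note that with min_repeats=2 a legitimate path such as /a/a/b is also flagged, so a higher threshold may be preferable on sites that really do repeat segment names.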
check_limits() - checks how many times we have accessed a parent directory.
How it works, using http://www.example.com/blog/articles/my_article_1.php as an example:
1. We check how many times we have visited http://www.example.com/blog/articles/.
2. If the count crosses crawl_limit, we mark this path as over_limit_pages.
3. We add +1 to the crawl limit counter of the upper path (http://www.example.com/blog/).
4. Go to step 1 for the upper path (if it also contains a large number of URLs, it is limited in the same way).
Step by step, once all limits have been crossed, we stop visiting the website entirely.
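The escalation from a directory to its parent can be sketched as follows. This is one possible reading of the steps above, not the actual check_limits() code: the class name ParentLimiter is hypothetical, and it is an assumption that the visit which crosses the limit is itself rejected.

```python
from collections import Counter
from urllib.parse import urlparse


class ParentLimiter:
    """Counts visits per parent directory and escalates overflow to upper paths."""

    def __init__(self, crawl_limit=3):
        self.crawl_limit = crawl_limit
        self.visits = Counter()
        self.over_limit = set()

    @staticmethod
    def _parents(url):
        """Return parent directories, from the deepest up to the site root."""
        parsed = urlparse(url)
        root = f"{parsed.scheme}://{parsed.netloc}"
        parts = [p for p in parsed.path.split("/") if p][:-1]
        return [
            root + "/" + "".join(p + "/" for p in parts[:depth])
            for depth in range(len(parts), -1, -1)
        ]

    def allow(self, url):
        parents = self._parents(url)
        if any(p in self.over_limit for p in parents):
            return False  # some directory above this URL is already banned
        for parent in parents:
            self.visits[parent] += 1
            if self.visits[parent] <= self.crawl_limit:
                break
            # Limit crossed: mark the path and charge one visit to the upper path.
            self.over_limit.add(parent)
        return parents[0] not in self.over_limit


limiter = ParentLimiter(crawl_limit=2)
for name in ("a1.php", "a2.php", "a3.php"):
    print(name, limiter.allow("http://www.example.com/blog/articles/" + name))
```

When the root directory itself ends up in over_limit, every further URL on that host is rejected, which matches the final "ban the website" step.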
How retries work
We have two types of error limit:
- a limit on retrying a URL
- a limit on adding the same URL to the queue again
The retries limit should be less than the error limit.
When we get a connection error for a URL, we retry it until the retries limit is reached and then leave the URL alone for a while. After that we keep adding it to the queue (it may start answering after a while), and if it is still unavailable we ban it. If the URL does answer, we reset the count.
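The interplay of the two limits can be sketched with a single error counter and two thresholds. This is a simplified model under stated assumptions: the class RetryPolicy, its method names, and the "retry" / "defer" / "ban" return values are hypothetical and do not come from the project.

```python
class RetryPolicy:
    """Tracks consecutive connection errors per URL against two limits."""

    def __init__(self, retries_limit=2, error_limit=5):
        if retries_limit >= error_limit:
            raise ValueError("retries_limit should be less than error_limit")
        self.retries_limit = retries_limit
        self.error_limit = error_limit
        self.errors = {}     # url -> consecutive connection errors
        self.banned = set()

    def on_success(self, url):
        """The URL answered, so its error count starts over."""
        self.errors.pop(url, None)

    def on_error(self, url):
        """Return 'retry' (try again now), 'defer' (re-queue for later) or 'ban'."""
        count = self.errors.get(url, 0) + 1
        self.errors[url] = count
        if count >= self.error_limit:
            self.banned.add(url)
            return "ban"
        if count <= self.retries_limit:
            return "retry"
        return "defer"


policy = RetryPolicy(retries_limit=2, error_limit=5)
for attempt in range(5):
    print(attempt + 1, policy.on_error("https://example.com/slow"))
```

Keeping the retries limit below the error limit leaves room for the deferred attempts between the last immediate retry and the ban.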
Wappalazer role
Wappalazer works with the app.json file. This file contains a regexp database for searching the server response (cookies, headers, scripts, text in HTML, etc.).
The idea is to use Wappalazer's regex engine to search for "bad places":
- All inputs
<input type="email">
<input type="password">
<input type="search">
<input type="submit">
- SSRF
formcontrolname="url"
- Submit buttons
<button class="aa" type="submit">Search</button>
- File uploads
<input type="file">
Wappalazer could also be used as a simple vulnerability scanner:
- Send a specific request
- Run a regexp search on the server's answer.
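The "bad place" search can be illustrated with a few regular expressions built from the examples listed above. The patterns, their names, and the find_bad_places helper are illustrative assumptions; they are not entries from app.json.

```python
import re

# Hypothetical patterns modelled on the examples listed above.
PATTERNS = {
    "input": re.compile(
        r'<input\b[^>]*\btype=["\']?(email|password|search|submit)\b', re.I
    ),
    "ssrf": re.compile(r'formcontrolname=["\']url["\']', re.I),
    "submit_button": re.compile(r'<button\b[^>]*\btype=["\']submit["\']', re.I),
    "file_upload": re.compile(r'<input\b[^>]*\btype=["\']?file\b', re.I),
}


def find_bad_places(html):
    """Return the sorted names of all patterns that match the response body."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(html))


page = (
    '<form><input type="password" name="p">'
    '<button class="aa" type="submit">Search</button></form>'
)
print(find_bad_places(page))
```

Regex matching on raw HTML is approximate by nature, so hits from a search like this are best treated as pointers for manual review.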
Deduplication
- Content length + word_count
- Content length prediction (not fully tested)
- To Do: Similarity check
- Check changes in HTML (search for new functions)
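The first mode, content length plus word count, can be sketched as a fingerprint stored in a set. The class name LengthWordDedup is hypothetical, and keying the fingerprint by host is an assumption made here so that unrelated sites do not collide.

```python
class LengthWordDedup:
    """Treats two responses as duplicates if host, byte length and word count match."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, host, body):
        key = (host, len(body.encode("utf-8")), len(body.split()))
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


dedup = LengthWordDedup()
print(dedup.is_duplicate("example.com", "<h1>Not found</h1>"))
print(dedup.is_duplicate("example.com", "<h1>Not found</h1>"))
print(dedup.is_duplicate("example.com", "<h1>Welcome to the site</h1>"))
```

This check is cheap but coarse: two different pages with the same length and word count are merged, which is the trade-off that a similarity-based check would address.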
Development
Editable install (changes to the source are picked up immediately):
pip3 install -e '.[headless,dev]'
Run the test suite:
pytest
Building & publishing to PyPI
The project is configured with pyproject.toml (PEP 621). To build the
distribution artifacts (source distribution + wheel):
pip3 install build
python3 -m build # writes dist/ansferatu-<version>.tar.gz and .whl
Validate and upload with Twine:
pip3 install twine
twine check dist/*
# Test upload first (recommended): https://test.pypi.org
twine upload --repository testpypi dist/*
# Real upload
twine upload dist/*
Notes:
- Bump version in pyproject.toml (and __version__ in ansferatu/__init__.py) before each release; PyPI rejects re-uploads of an existing version.
- Uploading requires a PyPI account and an API token (configure it via ~/.pypirc or the TWINE_USERNAME=__token__ / TWINE_PASSWORD=<token> environment variables).
- The package name ansferatu must be available on PyPI for the first upload.
File details
Details for the file ansferatu-0.1.0.tar.gz.
File metadata
- Download URL: ansferatu-0.1.0.tar.gz
- Upload date:
- Size: 170.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3f47ae4b105a959c392abf88eeb8f4f957358f735c4d24bf1f49117988d7184f |
| MD5 | d80d2f5a98805fb0bddc633774719de4 |
| BLAKE2b-256 | 043a28cc84d1c6a8c33c6b34a10e1b0bb39c4ce918b9bcb560bbd1a81e85e874 |
File details
Details for the file ansferatu-0.1.0-py3-none-any.whl.
File metadata
- Download URL: ansferatu-0.1.0-py3-none-any.whl
- Upload date:
- Size: 175.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0bf42960b1a1edefb013e1d083168866d39a5dc0bc074a37b39a748ae0af6b44 |
| MD5 | c2c8a7ba8c1ce58b7eeadc970f76bd83 |
| BLAKE2b-256 | 596efd173e1053b149c0b018391db7aa5bb8ef859ee741ebd7bc8741866d8714 |