Crawl, collect & document your codebase in Markdown - now with integrated LLM analysis.
Crawlect
Now with integrated LLM analysis and the Crawlect-GUI alpha.
Crawl, Collect & Document Your Codebase in Markdown.
Crawlect is a Python module designed to crawl a given directory, collect the relevant files and their contents, document the entire structure in a clean, readable Markdown file, and analyze the whole project with feedback from an LLM API.
Whether you're analyzing someone else's code or sharing your own, Crawlect makes it effortless to generate a comprehensive project snapshot, complete with syntax-highlighted code blocks, a tree-like structure overview, and fine-tuned filtering rules.
It gets even easier with its dedicated GUI companion, Crawlect-GUI (alpha): simplified settings, parameter retention, easy trial and error; give your keyboard a well-deserved break!
Crawlect is a study project initiated by Yves Guillo & Alexandre Jenzer and supervised by Matthieu Ammiguet during the Python module of He-Arc's CAS-IDD (2025).
Why Crawlect?
When starting on a new project, whether you're reviewing, refactoring, or collaborating, understanding its structure and key files is essential. Crawlect does the heavy lifting by:
- Analyzing your codebase with integrated LLM API calls,
- Crawling your project directory (recursively if needed),
- Parsing `.gitignore` / `.dockerignore` / `.crawlectignore` rules to mimic your dev setup,
- Masking sensitive data (like `.env` values),
- Automatically generating a well-organized, shareable `.md` file,
- Embedding file contents in Markdown-formatted code blocks.
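All of this is driven by a single command. As a minimal sketch (the folder name `myproject` is just an example; ignore-file parsing, `.env` masking, and the tree view are already on by default):

```bash
# Crawl ./myproject recursively and write the digest next to it.
crawlect -p ./myproject -o ./myproject-digest.md
```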
Use cases
- Quickly understand an unfamiliar codebase
- Auto-document your projects
- Share code context with collaborators
- Safely include `.env` files without leaking sensitive values
- Enhance your workflow with LLM code analysis
Think of Crawlect as your Markdown minion: obedient, efficient, and allergic to messy folders.
Getting Started
Crawlect is written in Python and requires minimal setup. Install the package or clone it to set up the virtual environment, and you’re ready to document codebases like a Markdown ninja.
Requires Python 3.10+
Install Crawlect via pip (The Handy Way™)
Tired of fiddling with virtual environments just to run a CLI tool? We got you.
You can install Crawlect globally in just a few seconds:
pip install crawlect
Then summon your loyal markdown minion from anywhere on your system:
crawlect -p . -o digest.md -open
No clutter, no drama. It just works.
Why choose the pip route?
- You can call `crawlect` from any folder; no need to `cd` into the repo.
- Great for repeated use or for integrating it into your tooling.
- Keeps your system clean (no extra scripts, venvs, or manual installs) and allows the use of Crawlect-GUI!
(Optional) Local Dev Mode
If you're planning to tinker with the code itself, we’ve still got you covered.
1. Clone the repo
git clone https://github.com/yvesguillo/crawlect.git
cd crawlect
2. Run the setup script
Make it executable first if needed (Linux/macOS):
chmod +x setup.sh
Then run:
./setup.sh
This script creates a `venv`, activates it, and installs the dependencies.
Or install the requirements manually:
pip install -r ./requirements.txt
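On Windows, or if you prefer to do things by hand, the standard venv workflow achieves the same as `setup.sh` (a sketch; the environment name `.venv` is arbitrary):

```bash
# Create and activate a virtual environment, then install the dependencies.
python -m venv .venv
source .venv/bin/activate        # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
```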
3. Run Crawlect
python -m crawlect -p . -o ../digest.md -open
This scans the current folder and generates a Markdown file named `digest.md` in the parent directory.
4. Teardown (optional)
When you're done (if you used the `setup.sh` script), you can clean everything up with:
./teardown.sh
This deactivates and removes the `venv` and optional artifacts.
CLI Options
Here are the most useful options Crawlect understands:
| Option | Description |
|---|---|
| `-p`, `--path` | Path to crawl (default: current folder `.`). |
| `-o`, `--output` | Static output file path (e.g. `./digest.md`). |
| `-op`, `--output_prefix` | Prefix for a dynamic, unique output file name (e.g. `./digest`). |
| `-os`, `--output_suffix` | Suffix for a dynamic, unique output file name (e.g. `.md`). |
| `-r`, `--recur` | Enable recursive crawling (default: enabled). Use `--no-recur` to disable. |
| `-d`, `--depth` | Scan depth limit (default: infinite). |
| `--crawlig` | Apply `.crawlectignore` exclusion rules if present (default: enabled). Use `--no-crawlig` to disable. |
| `--gitig` | Apply `.gitignore` exclusion rules if present (default: enabled). Use `--no-gitig` to disable. |
| `--dockig` | Apply `.dockerignore` exclusion rules if present (default: enabled). Use `--no-dockig` to disable. |
| `--xenv` | Sanitize `.env` values to mitigate the risk of leaking sensitive information (default: enabled). Use `--no-xenv` to disable. |
| `--tree` | Include a directory tree in the output file (default: enabled). Use `--no-tree` to disable. |
| `-llmapi`, `--llm-api` | LLM provider to use (e.g. `openai` or `ollama`). |
| `-llmhost`, `--llm-host` | Host URL for the LLM API (only required for Ollama). |
| `-llmkey`, `--llm-api-key` | API key for the LLM (only required for OpenAI). |
| `-llmmod`, `--llm-model` | Model name to use (e.g. `gpt-4.1-nano` or `llama3`). |
| `-llmreq`, `--llm-request` | LLM tasks to perform: `review`, `docstring`, `readme`. |
| `-llmcust`, `--llm-custom-requests` | List of custom LLM analysis prompts (each prompt wrapped in `"`). |
| `-open`, `--open` | Open the output files once generated (default: disabled). |
| `-verbose`, `--verbose` | Toggle verbosity (default: enabled). Use `--no-verbose` to disable. |
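For instance, the dynamic output options, which none of the examples below use, can be combined with a depth limit; a sketch (the prefix and suffix values are arbitrary):

```bash
# Dynamic output: the file name is built from the prefix and suffix so each run
# gets its own digest; depth is limited to 2 and the tree view is skipped.
crawlect -p . -op ./digest -os .md -d 2 --no-tree
```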
Examples
Scan the `awesomeproject` folder with a depth limit of 2 and write its `digest.md` to the parent folder, including a project folder tree, ignoring the `.gitignore` and `.dockerignore` rules but still applying `.crawlectignore` filtering, without sanitizing `.env` files, then open the generated file with the default system reader.
crawlect -p ./awesomeproject \
-o ../digest.md \
-d 2 \
--no-gitig \
--no-dockig \
--no-xenv \
-open
Scan the current folder and write its `digest.md` to the parent folder, then ask OpenAI's `gpt-4.1-nano` model to review the codebase and create docstrings.
crawlect -p . \
-o ../digest.md \
--llm-api openai \
--llm-api-key yoursupersecretkey \
--llm-model gpt-4.1-nano \
--llm-request review docstring \
--llm-custom-requests "Is this project awesome?" "Can you write a poem about this codebase?"
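A side note unrelated to Crawlect itself: if you'd rather not leave the key in your shell history, plain shell substitution works, e.g. reading it from a file (the `~/.openai_key` path is just an example):

```bash
crawlect -p . -o ../digest.md \
--llm-api openai \
--llm-api-key "$(cat ~/.openai_key)" \
--llm-model gpt-4.1-nano \
--llm-request review
```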
How LLM Feature Works
Ever wish your code could write its own README, explain its quirks, or fill in those long-forgotten docstrings?
With Crawlect’s LLM-powered analysis, it can.
What happens under the hood?
When you add the `--llm-*` parameters to your command, Crawlect:
- Generates the full project digest as Markdown.
- Reads that digest and sends it to your favorite LLM (OpenAI or Ollama, your pick).
- Injects it into a custom prompt depending on your request:
  - `review`: ask the model to review and critique the code.
  - `docstring`: generate docstrings for classes and functions.
  - `readme`: draft a clean, professional README.md based on your project.
The responses are then written to a second file (`<output path>.analysis.md`).
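The built-in tasks can be combined in a single run, so the digest is generated and injected only once; a sketch with OpenAI (the model and key values are illustrative):

```bash
crawlect -p . -o ./digest.md \
--llm-api openai \
--llm-api-key yoursupersecretkey \
--llm-model gpt-4.1-nano \
--llm-request review docstring readme
```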
Supported LLMs
- Ollama – Running your own local Ollama service? If not, no worries, give LLM-Serve a try. Then simply use `--llm-api ollama`, provide your `--llm-host` (e.g. `http://localhost:11434`), choose your `--llm-model` (e.g. `llama3`), and you are good to go!
- OpenAI – Use `--llm-api openai`, supply your `--llm-api-key`, and pick your `--llm-model` (e.g. `gpt-4.1-nano`).
Example
Scan the current folder and write its `digest.md` to the parent folder, then ask Ollama to run the `llama3` model and generate README documentation from the codebase.
crawlect -p . \
-o ../digest.md \
--llm-api ollama \
--llm-host http://localhost:11434 \
--llm-model llama3 \
--llm-request readme
Crawlect writes your digest, then generates an `<output path>.analysis.md` file packed with insights.
And yes! This very README.md was generated that way. Well… with a bit of editing, but much faster. Spend less time on boilerplate and more on content and style.
Bonus
Crawlect injects the entire codebase (in Markdown format) once, then asks the LLM to perform each task with that shared context. This means:
- No repeated uploads.
- Coherent answers across tasks.
- Responses that actually make sense together.
How Filtering Works
Crawlect supports standard `.gitignore`-style filtering. You can use:
- `.crawlectignore` (optional custom rules; your secret weapon, auto-detected and parsed like Git would)
- `.gitignore` and `.dockerignore` (auto-detected and parsed like Git would)
These filters follow the standard `.gitignore` syntax.
Bonus: Crawlect also excludes the ignore files themselves from the digest, so your `.crawlectignore` (or any other ignore file) won't show up in the output unless you disable that filtering.
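As an illustration, a hypothetical `.crawlectignore` using that syntax might look like this:

```gitignore
# Skip virtual environments and caches
.venv/
__pycache__/
*.pyc

# Skip bulky or generated artifacts
dist/
*.log

# ...but keep a specific file that a broader rule would exclude
!important.log
```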
Example Output
Digest
Here's a sneak peek at what Crawlect produces as a digest:
# my-awesome-project
2025.05.10 14:22
Generated with Crawlect.
## File structure
- **src/**
- [main.py](#main.py)
- [utils.py](#utils.py)
## Files:
### main.py
[`src/main.py`](src/main.py)
```python
from .utils import un_plus_un
def main():
    print(f"Hello! Did you know that one plus one is strictly similar to \n{un_plus_un()}?")
```
### utils.py
[`src/utils.py`](src/utils.py)
```python
def un_plus_un():
    return "deux"
```
Analysis
The LLM code analysis looks like this:
////////////
// REVIEW //
////////////
<Markdown-formatted review by the LLM>
///////////////
// DOCSTRING //
///////////////
<Auto-generated docstrings with file/class/function structure>
////////////
// README //
////////////
<Markdown README suggestion ready to paste>
/////////////////////
// CUSTOM_PROMPT_1 //
/////////////////////
**Prompt**: `Is this project awesome?`
<Markdown-formatted response>
/////////////////////
// CUSTOM_PROMPT_2 //
/////////////////////
**Prompt**: `Can you write a poem about this codebase?`
<Markdown-formatted response>
Roadmap & Crazy Ideas
- Project analysis from repo URL.
- Standardized review scores.
- HTML output.
Contributing
Got ideas? Spot a bug? Wanna make this thing even cooler?
Feel free to fork, star, or open an issue; we’d love to hear from you!
References and thanks
- Markdown code syntax table - from jincheng9 on GitHub
- Argparse boolean argument treatment - from Codemia
- `gitignore_parser` by Michael Herrmann
If you find Crawlect useful, give it a ☆ to support the project!
File details
Details for the file crawlect-1.0.7.tar.gz.
File metadata
- Download URL: crawlect-1.0.7.tar.gz
- Upload date:
- Size: 67.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `6408a07630e5ca319d0c73a7303f97f5b9ab72bbddc2d30a28dd787eb00e641a` |
| MD5 | `c843a6c9d492eb19528f5e54178b0e55` |
| BLAKE2b-256 | `98129dc9d8570d20e0b7dc2b0d6cad522627ef1850757e9dc53b2e85b98ecee0` |
File details
Details for the file crawlect-1.0.7-py3-none-any.whl.
File metadata
- Download URL: crawlect-1.0.7-py3-none-any.whl
- Upload date:
- Size: 25.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `4f68600e7e50d742ad2df3b8a3139fa6957918a9efd9c9262376fcfdecf93f0a` |
| MD5 | `fe5d51cc36c012d38d89352edb82bea3` |
| BLAKE2b-256 | `a711d024ff65f56fda208e9f05b2d48e54f63e369601c878db380e7d8b58e272` |