Point it at a repo, get back 'this is an e-commerce app that does X' — pattern-based application functional-category inference from routes, data models, and README.
Project description
app-classifier
Point it at a repo, get back "this is an e-commerce app that does X".
Pattern-based application functional-category inference from routes, data models, and README. Zero runtime dependencies (pure stdlib). Optional LLM polish — bring your own provider.
What problem this solves
Onboarding to a new repo, every engineer asks the same questions: "What does this thing do? Is it a CRUD app or a queue worker? What database? What ports does it need?" The README is usually wrong or stale. Codeowners are unavailable. You end up grepping for clues.
app-classifier answers those questions in under a second for any repo, on disk, with no network calls:
- What kind of app is this? — e-commerce, blog, social network, admin panel, REST API, auth/SSO, file management, scheduling, or messaging (9 categories, weighted-pattern matching, confidence-scored)
- What does it do? — a 2-3 sentence functional description, deterministically composed, optionally LLM-polished
- How does it deploy? — runtime + version, framework, web server, databases, caches, ports, env vars, container base image, runtime CVEs
Quick start
pip install app-classifier
app-classifier ./my-repo
=== my-repo ===
Category: e-commerce (78% confidence)
Runtime: python 3.11
Framework: FastAPI
Deploys as: ASGI server (uvicorn / hypercorn / daphne)
Databases: PostgreSQL, SQLAlchemy ORM
Cache/Queue: Redis, Celery
Features: online shopping, messaging
📋 Summary: my-repo · python 3.11 · FastAPI · 23 HTTP route(s) · 5 data model(s) · DB: PostgreSQL, SQLAlchemy ORM
📝 What it does:
my-repo is a e-commerce application. Primary functionality: online shopping, messaging.
It models entities like Cart, Order, Product, User serving authenticated users.
🌐 HTTP Routes (23 found):
GET /products → list_products
POST /cart/add → add_to_cart
POST /checkout → checkout
...
Python API
from app_classifier import classify
result = classify("./my-repo")
print(result.app_category) # 'e-commerce'
print(result.app_category_confidence) # 0.78
print(result.detected_features) # ['online shopping', 'messaging']
print(result.functional_description) # "my-repo is a e-commerce application. ..."
# Full structured access
for route in result.routes:
print(route.method, route.path, route.handler)
for model in result.data_models:
print(model.name, model.framework, model.fields_hint)
# JSON-serializable
import json
print(json.dumps(result.to_dict(), indent=2))
Just the deployment data?
Skip the classifier, use hosting directly:
from app_classifier import analyze_hosting_requirements
report = analyze_hosting_requirements("./my-repo")
print(report.runtime) # {'language': 'python', 'version': '3.11'}
print(report.web_server) # {'framework': 'FastAPI', 'deployment_target': '...'}
print(report.databases) # [{'name': 'PostgreSQL', ...}, ...]
print(report.ports) # [{'port': 8000, 'source': 'Dockerfile', ...}]
print(report.web_server_vulnerabilities) # CVEs on the container base image
Optional: LLM polish
classify_async accepts ANY async callable as the LLM provider — no SDK pinned. If the LLM gives a useful response, the deterministic functional_description is replaced with the polished version; on any failure (timeout / parse error / hallucination guard / no provider) the deterministic version is kept.
# OpenAI shim
async def my_openai_provider(prompt, max_tokens=400, temperature=0.2):
import openai
client = openai.AsyncOpenAI()
resp = await client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens, temperature=temperature,
)
return resp.choices[0].message.content
# Anthropic shim
async def my_anthropic_provider(prompt, max_tokens=400, temperature=0.2):
import anthropic
client = anthropic.AsyncAnthropic()
resp = await client.messages.create(
model="claude-haiku-4-5",
max_tokens=max_tokens, temperature=temperature,
messages=[{"role": "user", "content": prompt}],
)
return resp.content[0].text
# Local llama.cpp / Ollama shim
async def my_ollama_provider(prompt, max_tokens=400, temperature=0.2):
import httpx
async with httpx.AsyncClient() as client:
r = await client.post("http://localhost:11434/api/generate", json={
"model": "llama3", "prompt": prompt, "stream": False,
"options": {"num_predict": max_tokens, "temperature": temperature},
})
return r.json().get("response")
# Use any of the above
import asyncio
from app_classifier import classify_async
result = asyncio.run(classify_async("./my-repo", llm_provider=my_openai_provider))
print(result.functional_description)
What detection is supported
Runtimes
Python, Java (JDK 8+), Node.js, Go, Ruby, PHP, Rust — detected from manifest files, Dockerfiles, version files (.nvmrc, .python-version, .ruby-version).
Web frameworks (route extraction)
| Language | Frameworks |
|---|---|
| Python | Flask, FastAPI, Django |
| Java | Spring Boot, Struts 2 (struts.xml), classic Spring |
| Node | Express, Fastify, NestJS, Next.js |
Data model ORMs
| ORM | Detected from |
|---|---|
| JPA / Hibernate | @Entity, @Table annotations |
| SQLAlchemy | class X(Base) |
| Django ORM | class X(models.Model) |
Databases / caches
PostgreSQL, MySQL, MongoDB, H2, Oracle, SQL Server, MariaDB, Redis, RabbitMQ, Kafka, Elasticsearch, Celery.
Container/deployment
Dockerfile (FROM, EXPOSE, ENV), docker-compose, Kubernetes manifests, Helm charts, k8s deployment YAML, Heroku Procfile, Vercel / Netlify configs.
Runtime CVEs (web-server vulnerabilities)
Curated CVE manifest for nginx, Apache HTTPD, Tomcat, OpenJDK / Eclipse Temurin / Amazon Corretto. ~30 high-impact CVEs covered out of the box. PRs welcome.
App categories (functional fingerprints)
e-commerce, blog/content, social network, admin panel/dashboard, REST API service, authentication/SSO, file/document management, scheduling/booking, messaging/notification. Each is matched by a weighted regex pattern against routes + model names + README.
How it works
- Walk every manifest/config file in the repo (capped at 800 files for speed)
- Each file extracts language-specific signals (Maven artifact IDs, npm package names, Python deps, Dockerfile FROM, k8s containerPort, etc.) →
HostingReport - Walk source files to extract HTTP routes + data models per framework
- Pattern-match routes + model names + README purpose against 9 category fingerprints (weighted regex)
- Compose the 2-3 sentence functional description deterministically
- (Optional) Hand the structured signals to your LLM for a polished rewrite
Time budget: under 1 second on a 5K-file repo. Bounded scan caps file count + per-file read size.
Design principles
- No network. Every signal comes from on-disk content. Bundled CVE manifest, no live API calls.
- No SDK pin. The LLM step is provider-agnostic — bring your own callable. We never
import openai. - No surprises. Failures on individual files don't kill the pass. Confidence is always reported; the consumer decides whether to trust it.
- Pure read. We never modify the target repo.
Contributing
PRs welcome on three axes:
- More category fingerprints —
_CATEGORY_FINGERPRINTSinclassifier.py. Each is{ name, feature_label, signals: [(regex, weight), ...] }. - More CVE entries —
data/web_server_cves.json. Schema is documented in the file header. - More framework extractors — route + model extraction for Ruby on Rails, Phoenix, ASP.NET Core, Gin, Rocket, etc. would all be welcome.
Run the test suite
pip install -e ".[test]"
pytest
Tests use fixture directories under tests/fixtures/ — point the classifier at each, assert the expected category + features.
What this is NOT
- Not a security scanner. It surfaces runtime CVEs on the container base image, but the rest of the code is for understanding, not vulnerability detection.
- Not a deployment tool. It tells you what the deployment looks like; it doesn't deploy anything.
- Not a replacement for a README. It generates a structural sketch; humans still write the narrative.
If you want a full security analysis + fix pipeline that uses this internally, see Codefixer (closed-source).
License
MIT — see LICENSE. Use it however you want. Attribution appreciated but not required.
Acknowledgements
Extracted from Codefixer's hosting_requirements + app_description analyzers. The category-fingerprint approach was inspired by Sourcegraph's "what is this repo?" tooling and the way Backstage classifies services.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file app_classifier-0.1.0.tar.gz.
File metadata
- Download URL: app_classifier-0.1.0.tar.gz
- Upload date:
- Size: 43.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
2f1f995b5b0cb63b0947b1f52c0248260634c71db03694a74968f4ed5ee7ae5f
|
|
| MD5 |
aa3c0cf6650cb449d9f7dba44b261d9d
|
|
| BLAKE2b-256 |
5bcb42a0c1dd575d01349f7ef1b6aeb0ff6ea7a1f7c6bb40d312c9ae9219fa0d
|
File details
Details for the file app_classifier-0.1.0-py3-none-any.whl.
File metadata
- Download URL: app_classifier-0.1.0-py3-none-any.whl
- Upload date:
- Size: 39.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d391fcc9bfbcd71d883b1cc5dcb35a7a4c252cf61ecada9ac07e6a0a77b02b0b
|
|
| MD5 |
9efbbce70dbb851e2446d2aaa9afa99a
|
|
| BLAKE2b-256 |
99316c1f898d1b9e6b713bdc9771296fc443374baf608035263b731092b7f761
|