A fast, lightweight, zero-dependency programming language detector for Python.

codelang-detect

A fast, lightweight, regex-based programming language detector for Python.


codelang-detect identifies the programming language of a given code snippet. It is designed from the ground up to be fast and accurate, with zero external dependencies. It suits code pre-processing, file routing, or any application that needs a quick, reliable language check without pulling in heavy libraries.

Key Features

  • ⚡️ Blazing Fast: Built on a system of weighted, compiled regular expressions. Performance is measured in microseconds.
  • 🎯 Highly Accurate: Demonstrably more accurate than popular alternatives on a curated suite of real-world and tricky code snippets.
  • 📦 Zero Dependencies: Pure Python. pip install codelang-detect is all you need. No heavyweight models, no external binaries.
  • 🔧 Simple API: A single function call: detect(code).
  • 💻 CLI Included: Use it directly from your terminal or in shell scripts.

Why codelang-detect?

Many existing language detectors have significant trade-offs:

  • Heavy ML Models (e.g., guesslang): Often have complex or outdated dependencies (like older TensorFlow versions) that make installation difficult. They are also significantly slower for single detections.
  • Comprehensive Tools (e.g., pygments): Excellent for syntax highlighting, but detection is not their primary goal. As the benchmarks show, pygments' guessing can be unreliable on complex snippets.
  • Platform-Specific Tools (e.g., GitHub's linguist): The industry standard, but it is a Ruby gem, which makes it difficult to integrate into a Python environment.

codelang-detect fills the gap for a "just right" solution: a lightweight, portable, and fast detector that delivers best-in-class accuracy.

Benchmark: Accuracy & Performance

On a curated set of 36 code snippets designed to test real-world accuracy, codelang-detect was both more accurate and roughly 8x to 11x faster than the other popular, lightweight libraries tested.

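| Library                | Accuracy | Avg. Time / Sample (µs) | Dependencies |
----------------------------------------------------------------------------
| codelang-detect (Ours) | 100%     | ~173                    | None         |
| Pygments               | 22.2%    | ~1395                   | None         |
| WhatsThatCode          | 30.6%    | ~1881                   | None         |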

Benchmarks run on Python 3.13. Your results may vary.

As the results show, codelang-detect is not only the most accurate solution on this test suite but also ~8x faster than Pygments and ~11x faster than WhatsThatCode, all while maintaining zero dependencies.
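For readers who want to reproduce this kind of measurement on their own snippets, here is a minimal sketch of a harness that computes accuracy and average time per sample. It is not the script used to produce the numbers above. The name benchmark, its parameters, and the (code, expected) case format are assumptions made for illustration.

```python
import time


def benchmark(detect_fn, cases, repeats=100):
    """Return (accuracy in percent, average microseconds per call).

    detect_fn: any callable that maps a code string to a label.
    cases: list of (code, expected_label) pairs (hypothetical format).
    """
    # Accuracy: share of cases where the detector returns the expected label.
    correct = sum(1 for code, expected in cases if detect_fn(code) == expected)
    accuracy = correct / len(cases) * 100

    # Timing: run every case `repeats` times and average over all calls.
    start = time.perf_counter()
    for _ in range(repeats):
        for code, _expected in cases:
            detect_fn(code)
    elapsed = time.perf_counter() - start
    avg_us = elapsed / (repeats * len(cases)) * 1e6

    return accuracy, avg_us
```

Passing the same cases to each library's detection function, wrapped so that all of them return comparable labels, gives numbers that can be compared on one machine. Absolute timings will differ between machines and Python versions.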

Detailed accuracy breakdown
--- Accuracy Benchmark ---
| Test Case          | Expected   | Codelang-Detect (Ours) | Pygments               | WhatsThatCode          |
--------------------------------------------------------------------------------------------------------------
| cs_simple          | cs         | cs                  ✅ | unknown             ❌ | java                ❌ |
| cs_lambda          | cs         | cs                  ✅ | scdoc               ❌ | unknown             ❌ |
| cs_full            | cs         | cs                  ✅ | gdscript            ❌ | unknown             ❌ |
| py_simple          | py         | py                  ✅ | py                  ✅ | py                  ✅ |
| py_class           | py         | py                  ✅ | perl6               ❌ | py                  ✅ |
| java_simple        | java       | java                ✅ | py                  ❌ | java                ✅ |
| java_full          | java       | java                ✅ | teratermmacro       ❌ | unknown             ❌ |
| js_arrow           | js         | js                  ✅ | gdscript            ❌ | unknown             ❌ |
| yaml_k8s           | yaml       | yaml                ✅ | actionscript3       ❌ | unknown             ❌ |
| sh_shebang         | sh         | sh                  ✅ | sh                  ✅ | sh                  ✅ |
| kt_data_class      | kt         | kt                  ✅ | ssp                 ❌ | unknown             ❌ |
| swift_func         | swift      | swift               ✅ | gdscript            ❌ | unknown             ❌ |
| scala_case_class   | scala      | scala               ✅ | unknown             ❌ | unknown             ❌ |
| sql_select         | sql        | sql                 ✅ | scdoc               ❌ | unknown             ❌ |
| cbl_simple         | cbl        | cbl                 ✅ | componentpascal     ❌ | unknown             ❌ |
| plain_text         | unknown    | unknown             ✅ | unknown             ✅ | unknown             ✅ |
| cs_async_method    | cs         | cs                  ✅ | gdscript            ❌ | cs                  ✅ |
| cs_linq_query      | cs         | cs                  ✅ | gdscript            ❌ | js                  ❌ |
| py_async_http      | py         | py                  ✅ | py                  ✅ | unknown             ❌ |
| py_pandas          | py         | py                  ✅ | py                  ✅ | unknown             ❌ |
| java_streams       | java       | java                ✅ | py                  ❌ | unknown             ❌ |
| js_promise_fetch   | js         | js                  ✅ | gdscript            ❌ | unknown             ❌ |
| js_react_component | js         | js                  ✅ | py                  ❌ | unknown             ❌ |
| ts_interface       | ts         | ts                  ✅ | gdscript            ❌ | unknown             ❌ |
| kt_coroutine       | kt         | kt                  ✅ | py                  ❌ | py                  ❌ |
| swift_struct       | swift      | swift               ✅ | gdscript            ❌ | unknown             ❌ |
| scala_future       | scala      | scala               ✅ | py                  ❌ | unknown             ❌ |
| go_http_server     | go         | go                  ✅ | py                  ❌ | go                  ✅ |
| sql_join           | sql        | sql                 ✅ | scdoc               ❌ | unknown             ❌ |
| yaml_dockercompose | yaml       | yaml                ✅ | scdoc               ❌ | unknown             ❌ |
| sh_env_check       | sh         | sh                  ✅ | sh                  ✅ | sh                  ✅ |
| rb_class           | rb         | rb                  ✅ | tsql                ❌ | rb                  ✅ |
| php_router         | php        | php                 ✅ | javascript+php      ❌ | php                 ✅ |
| rust_result        | rs         | rs                  ✅ | ecl                 ❌ | unknown             ❌ |
| c_function_pointer | c          | c                   ✅ | c                   ✅ | unknown             ❌ |
| plain_text_doc     | unknown    | unknown             ✅ | unknown             ✅ | unknown             ✅ |

Note: Libraries like guesslang and enry were excluded from the final benchmark due to significant installation issues with modern Python versions and their respective dependencies.

Installation

pip install codelang-detect

Usage

As a Python Library

The API is a single function. detect takes a string of code and returns the file extension of the detected language (without a leading dot), or unknown when no language is recognized.

from codelang_detect import detect

# Example 1: Python
python_code = "class User:\n    def __init__(self, name): self.name = name"
print(detect(python_code))
# Output: py

# Example 2: C#
csharp_code = "public class Person { public string Name { get; set; } }"
print(detect(csharp_code))
# Output: cs

# Example 3: Non-code
unknown_text = "This is just a regular sentence."
print(detect(unknown_text))
# Output: unknown

As a Command-Line Tool (CLI)

You can also use codelang-detect directly from your terminal to analyze files or stdin.

# Analyze a file
codelang-detect my_script.js
# Output: js

# Pipe content into the CLI
cat deployment.yaml | codelang-detect
# Output: yaml

Supported Languages

codelang-detect currently provides high-quality detection for the following languages. The returned extension is shown in parentheses:

  • C (c)
  • C++ (cpp)
  • C# (cs)
  • COBOL (cbl)
  • Dart (dart)
  • Go (go)
  • Java (java)
  • JavaScript (js)
  • Kotlin (kt)
  • PHP (php)
  • Python (py)
  • R (r)
  • Ruby (rb)
  • Rust (rs)
  • Scala (scala)
  • Shell (sh)
  • Solidity (sol)
  • SQL (sql)
  • Swift (swift)
  • TypeScript (ts)
  • YAML (yaml)

How It Works

No magic here. codelang-detect uses a curated list of regular expressions for each language. Each regex is assigned a "weight" based on how uniquely it identifies a language.

For example:

  • The pattern async Task< is a very strong signal for C# and gets a high weight.
  • The keyword def is a strong signal for Python but could also appear in Scala or Ruby, so it gets a moderate weight.
  • The keyword class is a weak signal, as it appears in many languages, and requires more context to be useful.

The library runs all regexes against the input code, sums the weights for each language, and returns the language with the highest score. It's simple, transparent, and incredibly fast.
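The scoring scheme described above can be sketched in a few lines. This is an illustration only, not the library's code. The RULES table, its patterns and weights, and the names score and detect_sketch are hypothetical. It also assumes each pattern contributes its weight at most once per input, which the description does not specify.

```python
import re

# Hypothetical rules: (compiled pattern, weight) per language.
# These are NOT the patterns or weights shipped with codelang-detect.
RULES = {
    "cs": [
        (re.compile(r"\basync\s+Task<"), 10),  # very strong C# signal
        (re.compile(r"\bclass\b"), 1),         # weak, shared by many languages
    ],
    "py": [
        (re.compile(r"^\s*def\s+\w+\s*\(.*\)\s*:", re.MULTILINE), 5),
        (re.compile(r"\bclass\b"), 1),
    ],
    "rb": [
        (re.compile(r"^\s*def\s+\w+", re.MULTILINE), 3),
        (re.compile(r"^\s*end\s*$", re.MULTILINE), 3),
    ],
}


def score(code):
    """Sum the weights of all matching patterns for each language."""
    totals = {}
    for lang, rules in RULES.items():
        total = sum(weight for pattern, weight in rules if pattern.search(code))
        if total > 0:
            totals[lang] = total
    return totals


def detect_sketch(code):
    """Return the highest-scoring language, or "unknown" if nothing matched."""
    totals = score(code)
    if not totals:
        return "unknown"
    return max(totals, key=totals.get)
```

With these toy rules, a Python function definition matches both the Python and Ruby def patterns, but the more specific Python pattern carries the higher weight and wins. A Ruby method closed with end accumulates two Ruby signals and outscores Python.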

Contributing

Contributions are welcome and appreciated! This project was started to fill a gap, and community help is the best way to make it the definitive tool for this job.

Whether it's improving regexes, adding support for a new language, or fixing a bug, please feel free to:

  1. Open an issue to discuss the change.
  2. Fork the repository and submit a pull request.

When adding a language or fixing a misidentification, please add relevant code snippets to tests/test_data.json. This helps verify your changes and prevents future regressions. We follow a simple principle: if a human can't reliably distinguish a short snippet, the detector probably can't either, so focus on realistic test cases.

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
