codelang-detect

A fast, lightweight, regex-based programming language detector for Python.


Codelang-detect identifies the programming language of a given code snippet. It is designed from the ground up to be fast, accurate, and free of external dependencies. It suits pre-processing code, routing files, or any application that needs a quick and reliable language check without pulling in heavy libraries.

Key Features

  • ⚡️ Blazing Fast: Built on a system of weighted, compiled regular expressions. Performance is measured in microseconds.
  • 🎯 Highly Accurate: Demonstrably more accurate than popular alternatives on a curated suite of real-world and tricky code snippets.
  • 📦 Zero Dependencies: Pure Python. pip install codelang-detect is all you need. No heavyweight models, no external binaries.
  • 🔧 Simple API: A single function call: detect(code).
  • 💻 CLI Included: Use it directly from your terminal or in shell scripts.

Why codelang-detect?

Many existing language detectors have significant trade-offs:

  • Heavy ML Models (e.g., guesslang): Often have complex or outdated dependencies (like older TensorFlow versions) that make installation difficult. They are also significantly slower for single detections.
  • Comprehensive Tools (e.g., pygments): Excellent for syntax highlighting, but detection isn't their primary goal. As the benchmarks show, Pygments' guessing can be unreliable on complex snippets.
  • Platform-Specific Tools (e.g., GitHub's linguist): The industry standard, but it's a Ruby Gem, making it difficult to integrate into a Python environment.

codelang-detect fills the gap for a "just right" solution: a lightweight, portable, and fast detector that delivers best-in-class accuracy.

Benchmark: Accuracy & Performance

The results speak for themselves. On a curated set of 36 code snippets designed to test real-world accuracy, codelang-detect is both significantly more accurate and roughly 8 to 11 times faster than other popular, lightweight libraries.

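| Library                | Accuracy | Avg. Time / Sample (µs) | Dependencies |
----------------------------------------------------------------------------
| codelang-detect (Ours) | 100%     | ~173                    | None         |
| Pygments               | 22.2%    | ~1395                   | None         |
| WhatsThatCode          | 30.6%    | ~1881                   | None         |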

Benchmarks run on Python 3.13. Your results may vary.

As the results show, codelang-detect is not only the most accurate solution on this test suite but also ~8x faster than Pygments and ~11x faster than WhatsThatCode, all while maintaining zero dependencies.
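
The two numeric columns come down to simple arithmetic: correct answers divided by total cases, and elapsed time divided by the number of calls. The harness below is a minimal sketch of that calculation, not the project's actual benchmark script. The name run_benchmark, its parameters, and the (code, expected) case format are assumptions made for illustration. The detector is passed in as a function so the same harness can time any library.

```python
import time
from typing import Callable, List, Tuple


def run_benchmark(
    detect_fn: Callable[[str], str],
    cases: List[Tuple[str, str]],
    repeats: int = 100,
) -> Tuple[float, float]:
    """Return (accuracy in percent, average microseconds per detection).

    `cases` is a list of (code, expected_extension) pairs; this layout is
    a hypothetical choice for the sketch.
    """
    # Accuracy: share of cases where the detector returns the expected label.
    correct = sum(1 for code, expected in cases if detect_fn(code) == expected)
    accuracy = 100.0 * correct / len(cases)

    # Timing: run every case `repeats` times and average over all calls.
    start = time.perf_counter()
    for _ in range(repeats):
        for code, _expected in cases:
            detect_fn(code)
    elapsed = time.perf_counter() - start
    avg_us = elapsed / (repeats * len(cases)) * 1e6

    return accuracy, avg_us
```

With the package installed, run_benchmark(detect, cases) would measure codelang-detect itself. As a check on the arithmetic, 8 correct answers out of 36 cases gives 22.2%, which matches the Pygments row when its correct results in the breakdown are counted.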

Detailed accuracy breakdown
--- Accuracy Benchmark ---
| Test Case          | Expected   | Codelang-Detect (Ours) | Pygments               | WhatsThatCode          |
--------------------------------------------------------------------------------------------------------------
| cs_simple          | cs         | cs                  ✅ | unknown             ❌ | java                ❌ |
| cs_lambda          | cs         | cs                  ✅ | scdoc               ❌ | unknown             ❌ |
| cs_full            | cs         | cs                  ✅ | gdscript            ❌ | unknown             ❌ |
| py_simple          | py         | py                  ✅ | py                  ✅ | py                  ✅ |
| py_class           | py         | py                  ✅ | perl6               ❌ | py                  ✅ |
| java_simple        | java       | java                ✅ | py                  ❌ | java                ✅ |
| java_full          | java       | java                ✅ | teratermmacro       ❌ | unknown             ❌ |
| js_arrow           | js         | js                  ✅ | gdscript            ❌ | unknown             ❌ |
| yaml_k8s           | yaml       | yaml                ✅ | actionscript3       ❌ | unknown             ❌ |
| sh_shebang         | sh         | sh                  ✅ | sh                  ✅ | sh                  ✅ |
| kt_data_class      | kt         | kt                  ✅ | ssp                 ❌ | unknown             ❌ |
| swift_func         | swift      | swift               ✅ | gdscript            ❌ | unknown             ❌ |
| scala_case_class   | scala      | scala               ✅ | unknown             ❌ | unknown             ❌ |
| sql_select         | sql        | sql                 ✅ | scdoc               ❌ | unknown             ❌ |
| cbl_simple         | cbl        | cbl                 ✅ | componentpascal     ❌ | unknown             ❌ |
| plain_text         | unknown    | unknown             ✅ | unknown             ✅ | unknown             ✅ |
| cs_async_method    | cs         | cs                  ✅ | gdscript            ❌ | cs                  ✅ |
| cs_linq_query      | cs         | cs                  ✅ | gdscript            ❌ | js                  ❌ |
| py_async_http      | py         | py                  ✅ | py                  ✅ | unknown             ❌ |
| py_pandas          | py         | py                  ✅ | py                  ✅ | unknown             ❌ |
| java_streams       | java       | java                ✅ | py                  ❌ | unknown             ❌ |
| js_promise_fetch   | js         | js                  ✅ | gdscript            ❌ | unknown             ❌ |
| js_react_component | js         | js                  ✅ | py                  ❌ | unknown             ❌ |
| ts_interface       | ts         | ts                  ✅ | gdscript            ❌ | unknown             ❌ |
| kt_coroutine       | kt         | kt                  ✅ | py                  ❌ | py                  ❌ |
| swift_struct       | swift      | swift               ✅ | gdscript            ❌ | unknown             ❌ |
| scala_future       | scala      | scala               ✅ | py                  ❌ | unknown             ❌ |
| go_http_server     | go         | go                  ✅ | py                  ❌ | go                  ✅ |
| sql_join           | sql        | sql                 ✅ | scdoc               ❌ | unknown             ❌ |
| yaml_dockercompose | yaml       | yaml                ✅ | scdoc               ❌ | unknown             ❌ |
| sh_env_check       | sh         | sh                  ✅ | sh                  ✅ | sh                  ✅ |
| rb_class           | rb         | rb                  ✅ | tsql                ❌ | rb                  ✅ |
| php_router         | php        | php                 ✅ | javascript+php      ❌ | php                 ✅ |
| rust_result        | rs         | rs                  ✅ | ecl                 ❌ | unknown             ❌ |
| c_function_pointer | c          | c                   ✅ | c                   ✅ | unknown             ❌ |
| plain_text_doc     | unknown    | unknown             ✅ | unknown             ✅ | unknown             ✅ |

Note: Libraries like guesslang and enry were excluded from the final benchmark due to significant installation issues with modern Python versions and their respective dependencies.

Installation

pip install codelang-detect

Usage

As a Python Library

The API is dead simple. The detect function takes a string of code and returns the file extension of the detected language, or unknown when no language is recognized.

from codelang_detect import detect

# Example 1: Python
python_code = "class User:\n    def __init__(self, name): self.name = name"
print(detect(python_code))
# Output: py

# Example 2: C#
csharp_code = "public class Person { public string Name { get; set; } }"
print(detect(csharp_code))
# Output: cs

# Example 3: Non-code
unknown_text = "This is just a regular sentence."
print(detect(unknown_text))
# Output: unknown
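
Because detect returns a plain string, it slots easily into routing logic. The following is a small sketch of grouping snippets by detected extension. The helper group_by_language is hypothetical and not part of the library; it takes the detector as an argument so it can be tried with any function of the same shape.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    snippets: Iterable[str],
    detect_fn: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Bucket snippets under whatever label detect_fn returns for them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for snippet in snippets:
        groups[detect_fn(snippet)].append(snippet)
    return dict(groups)
```

With the package installed, calling group_by_language(snippets, detect) would key the result by extensions such as py or cs, with unrecognized text collected under unknown.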

As a Command-Line Tool (CLI)

You can also use codelang-detect directly from your terminal to analyze files or stdin.

# Analyze a file
codelang-detect my_script.js
# Output: js

# Pipe content into the CLI
cat deployment.yaml | codelang-detect
# Output: yaml
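
The CLI can also be driven from Python when a script should not import the package directly. This is a minimal sketch assuming the codelang-detect executable is on PATH and prints only the extension to stdout, as in the examples above. The wrapper name detect_via_cli and its command parameter are made up for illustration.

```python
import subprocess
from typing import Sequence


def detect_via_cli(code: str, command: Sequence[str] = ("codelang-detect",)) -> str:
    """Pipe `code` to the CLI on stdin and return its trimmed stdout.

    Assumes the command prints just the detected extension; raises
    subprocess.CalledProcessError if the command exits with a non-zero status.
    """
    result = subprocess.run(
        list(command),
        input=code,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

The command parameter is there so the wrapper can be pointed at a different executable path or exercised with a stand-in process during testing.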

Supported Languages

codelang-detect currently provides high-quality detection for the following languages, listed alphabetically by name with the returned extension in parentheses:

  • C (c)
  • C++ (cpp)
  • C# (cs)
  • COBOL (cbl)
  • CSS (css)
  • Dart (dart)
  • Go (go)
  • Groovy (groovy)
  • HTML (html)
  • Java (java)
  • JavaScript (js)
  • JSON (json)
  • Kotlin (kt)
  • PHP (php)
  • Python (py)
  • R (r)
  • Ruby (rb)
  • Rust (rs)
  • Scala (scala)
  • Shell (sh)
  • Solidity (sol)
  • SQL (sql)
  • Swift (swift)
  • TypeScript (ts)
  • XML (xml)
  • YAML (yaml)

How It Works

No magic here. codelang-detect uses a curated list of regular expressions for each language. Each regex is assigned a "weight" based on how uniquely it identifies a language.

For example:

  • The pattern async Task< is a very strong signal for C# and gets a high weight.
  • The keyword def is a strong signal for Python but could also appear in Scala or Ruby, so it gets a moderate weight.
  • The keyword class is a weak signal, as it appears in many languages, and requires more context to be useful.

The library runs all regexes against the input code, sums the weights for each language, and returns the language with the highest score. It's simple, transparent, and incredibly fast.
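
The scoring scheme described above can be sketched in a few lines. This is a toy illustration, not the library's implementation: the patterns, the weights, the RULES table, and the minimum-score threshold used to return unknown are all assumptions chosen to mirror the examples in this section.

```python
import re
from collections import defaultdict

# Hypothetical (compiled pattern, weight) rules per language.
# Higher weights mark patterns that are more specific to one language.
RULES = {
    "cs": [
        (re.compile(r"\basync\s+Task<"), 10.0),   # very strong C# signal
        (re.compile(r"\bclass\s+\w+"), 1.0),      # weak: shared by many languages
    ],
    "py": [
        (re.compile(r"^\s*def\s+\w+\s*\(.*\)\s*:", re.M), 5.0),
        (re.compile(r"\bclass\s+\w+"), 1.0),
    ],
    "rb": [
        (re.compile(r"^\s*def\s+\w+", re.M), 3.0),
        (re.compile(r"^\s*end\s*$", re.M), 4.0),
    ],
}


def score(code):
    """Sum the weights of every matching pattern, per language."""
    scores = defaultdict(float)
    for lang, rules in RULES.items():
        for pattern, weight in rules:
            if pattern.search(code):
                scores[lang] += weight
    return dict(scores)


def toy_detect(code, threshold=2.0):
    """Return the top-scoring language, or 'unknown' below the threshold."""
    scores = score(code)
    if not scores:
        return "unknown"
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"


print(toy_detect("public async Task<int> GetAsync() { return 1; }"))  # cs
print(toy_detect("def greet(name):\n    return name"))                # py
print(toy_detect("def greet(name)\n  name\nend"))                     # rb
print(toy_detect("class Foo"))                                        # unknown
```

In the last call the class pattern matches for two languages, but its low weight alone does not clear the assumed threshold, which is one way to model "requires more context to be useful".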

Running Tests

This project uses pytest for testing. To run the test suite, first install the development dependencies and then run pytest:

# Install development dependencies
pip install -r requirements-dev.txt

# Run the test suite
pytest

Contributing

Contributions are welcome and appreciated! This project was started to fill a gap, and community help is the best way to make it the definitive tool for this job.

Whether it's improving regexes, adding support for a new language, or fixing a bug, please feel free to:

  1. Open an issue to discuss the change.
  2. Fork the repository and submit a pull request.

When adding a language or fixing a misidentification, please add relevant code snippets to tests/test_data.json. This helps verify your changes and prevents future regressions. We follow a simple principle: if a human can't reliably distinguish a short snippet, the detector probably can't either, so focus on realistic test cases.
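
A regression check over such snippets can be very small. The layout of tests/test_data.json is not described here, so the entry shape below (name, code, expected) is purely hypothetical, as is the helper find_regressions. Treat this as a sketch of the idea and check the real file before copying anything.

```python
import json


def find_regressions(detect_fn, cases):
    """Return (name, expected, actual) for every case the detector gets wrong."""
    failures = []
    for case in cases:
        actual = detect_fn(case["code"])
        if actual != case["expected"]:
            failures.append((case["name"], case["expected"], actual))
    return failures


# Hypothetical entries; the project's actual JSON schema may differ.
SAMPLE_JSON = """
[
  {"name": "py_simple", "code": "def f(x):\\n    return x", "expected": "py"},
  {"name": "plain_text", "code": "Just a sentence.", "expected": "unknown"}
]
"""
cases = json.loads(SAMPLE_JSON)
```

An empty list from find_regressions(detect, cases) would mean every listed snippet is still detected as expected.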

License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
