Code snippets language classification tool

Project description

Codcat

Natural Language Processing (NLP) is a rapidly growing field that aims to help machines understand and interpret human language. With the increasing use of code hosting platforms such as GitHub, there is a growing need to categorize program code accurately by programming language. This matters most in large repositories that mix several programming languages, where language labels let developers navigate the codebase and search for specific code snippets.

The goal of this NLP project is to develop a model that can automatically categorize program code by programming language. The model will be trained on a large dataset of code snippets from various programming languages, and will use NLP techniques to extract features and patterns from the code.

The project involves several steps: data collection, pre-processing, feature extraction, model selection, and evaluation. The dataset will be sourced from public code sources, including GitHub, GitLab, and Stack Overflow. The collected data will then be pre-processed to remove irrelevant information and to standardize the format of the code snippets, using techniques such as tokenization and stop-word removal.
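The pre-processing step can be illustrated with a minimal sketch. It is not codcat's actual pipeline: the token pattern, the `STOP_TOKENS` set, and the function names `normalize`, `tokenize`, and `preprocess` are assumptions chosen for illustration only.

```python
import re

# Hypothetical token pattern: identifiers, numbers, or any single
# non-whitespace symbol. codcat's real tokenizer may differ.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")

# Hypothetical "stop words" for code: symbols assumed to be too common
# across languages to be informative.
STOP_TOKENS = {"(", ")", ","}


def normalize(snippet: str) -> str:
    """Standardize line endings and trim surrounding whitespace."""
    return snippet.replace("\r\n", "\n").strip()


def tokenize(snippet: str) -> list[str]:
    """Split a code snippet into identifier, number, and symbol tokens."""
    return TOKEN_RE.findall(normalize(snippet))


def preprocess(snippet: str) -> list[str]:
    """Tokenize a snippet and drop the stop tokens."""
    return [tok for tok in tokenize(snippet) if tok not in STOP_TOKENS]


print(tokenize("#include <stdio.h>"))
# ['#', 'include', '<', 'stdio', '.', 'h', '>']
print(preprocess("def foo(bar): return bar"))
# ['def', 'foo', 'bar', ':', 'return', 'bar']
```

Which tokens count as stop words for source code is a design choice. Symbols such as `#` or `{` can be strong language signals, so a real stop list would have to be chosen from the data.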

Once the data is pre-processed, features will be extracted from the code snippets using NLP techniques such as Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and embeddings, which capture the content of the code. These features will then be used to train and evaluate several machine learning models, including Naive Bayes, Random Forest, CNNs, RNNs, and Transformers.
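As a rough illustration of the simplest combination named above, the sketch below pairs Bag of Words counts with a hand-written multinomial Naive Bayes classifier using add-one smoothing. It is not the model shipped with codcat: the class name `NaiveBayes`, the token pattern, and the four training snippets are all made up for this example.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical token pattern: identifiers or single non-whitespace symbols.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\S")


def bag_of_words(snippet):
    """Map a snippet to its token counts (the Bag of Words features)."""
    return Counter(TOKEN_RE.findall(snippet))


class NaiveBayes:
    """Multinomial Naive Bayes over token counts, with add-one smoothing."""

    def fit(self, snippets, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for snippet, label in zip(snippets, labels):
            self.token_counts[label].update(bag_of_words(snippet))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict_one(self, snippet):
        bow = bag_of_words(snippet)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok, n in bow.items():
                if tok in self.vocab:  # tokens never seen in training are skipped
                    score += n * math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, snippets):
        return [self.predict_one(s) for s in snippets]


# Made-up toy training data, for illustration only.
train_snippets = [
    "def add(a, b): return a + b",
    "import os\nprint(os.getcwd())",
    "#include <stdlib.h>\nint main(void) { return 0; }",
    "int add(int a, int b) { return a + b; }",
]
train_labels = ["python", "python", "c", "c"]

clf = NaiveBayes().fit(train_snippets, train_labels)
print(clf.predict(["def foo(bar): return bar", "#include <stdio.h>"]))
# ['python', 'c']
```

TF-IDF would replace the raw counts with weights that discount tokens appearing in most snippets. The classifier interface would stay the same.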

The final model will be evaluated on a test set of code snippets to assess its accuracy and generalizability.
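A minimal sketch of such an evaluation follows, assuming a simple hold-out split and plain accuracy as the metric. The helper names `train_test_split`, `accuracy`, and `per_language_recall`, and the example labels, are hypothetical. The project's actual evaluation protocol is not described here.

```python
import random
from collections import Counter


def train_test_split(items, labels, test_fraction=0.2, seed=0):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


def accuracy(y_true, y_pred):
    """Fraction of snippets whose predicted language matches the true one."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_language_recall(y_true, y_pred):
    """For each true language, the share of its snippets labeled correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Made-up labels, for illustration only.
y_true = ["python", "c", "c", "java"]
y_pred = ["python", "c", "python", "java"]
print(accuracy(y_true, y_pred))             # 0.75
print(per_language_recall(y_true, y_pred))  # {'python': 1.0, 'c': 0.5, 'java': 1.0}
```

Per-language recall is shown next to accuracy because overall accuracy can hide weak performance on languages that are rare in the test set.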

Prerequisites

To run codcat locally, you need Python 3.9 or later.

Installation

Install codcat with pip:

pip install codcat

or with your favorite package manager.

Example

Input

from codcat.downloader import load
model = load('base-tiny')
print(model.predict(['def foo(bar): return bar', '#include <stdio.h>']))

Output

['python' 'c']
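Building on the example above, predictions can be used to group snippets by language. The sketch below assumes only that the model exposes `predict` on a list of strings, as shown in the example. The helper `group_by_language` and the `StubModel` class are hypothetical and are not part of codcat. The stub lets the block run without downloading a model.

```python
from collections import defaultdict


def group_by_language(model, snippets):
    """Group snippets under the language label predicted for each one."""
    snippets = list(snippets)
    labels = model.predict(snippets)
    groups = defaultdict(list)
    for snippet, label in zip(snippets, labels):
        groups[str(label)].append(snippet)  # str() also handles NumPy string labels
    return dict(groups)


class StubModel:
    """Hypothetical stand-in for a loaded model; NOT codcat's classifier."""

    def predict(self, snippets):
        return ["c" if s.lstrip().startswith("#include") else "python" for s in snippets]


groups = group_by_language(
    StubModel(), ["def foo(bar): return bar", "#include <stdio.h>"]
)
print(groups)
# {'python': ['def foo(bar): return bar'], 'c': ['#include <stdio.h>']}
```

With codcat installed, the stub would be replaced by the object returned from `load('base-tiny')`.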

Download files

Source Distribution

codcat-0.2.0.tar.gz (2.6 MB)

Uploaded Source

Built Distribution

codcat-0.2.0-py3-none-any.whl (13.4 kB)

Uploaded Python 3

File details

Details for the file codcat-0.2.0.tar.gz.

File metadata

  • Download URL: codcat-0.2.0.tar.gz
  • Size: 2.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for codcat-0.2.0.tar.gz
Algorithm Hash digest
SHA256 caa446a064c6ca4a3eef0d3bdc044fcbd1e7e30b49470d03795a62f2b3a5fc5a
MD5 e4fe7f3cbbdd3ff175b7c9c77edc433f
BLAKE2b-256 ff0e012a57e3fa782483a3901dfe9c6fdce61f1bb43589561660a5ee1b04eda1

File details

Details for the file codcat-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: codcat-0.2.0-py3-none-any.whl
  • Size: 13.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.16

File hashes

Hashes for codcat-0.2.0-py3-none-any.whl
Algorithm Hash digest
SHA256 39b26ddd776fa48fc1927eed2e146c9c4b1728ca64fe246e817620d206d1707c
MD5 4f9545eeb8b78b67ddae64cf3a8144f7
BLAKE2b-256 31f0fbf26b92b046b9c20113843fad5eda500a79d328d1234336725f34fff1f8
