This package calculates code similarity features for Java source files.

Java Features

This module calculates code similarity features between pairs of Java files.

Installation

Using pip via PyPI

pip install java-features

Using pip via GitHub

pip install git+https://github.com/tomytw/java-features.git@0.1.1
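
To confirm that the installation succeeded, a minimal sketch using the standard-library `importlib.metadata` can report the installed version. The helper name `installed_version` is hypothetical and not part of java-features; the distribution name `java-features` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "java-features" is the distribution name used in the pip commands above.
    found = installed_version("java-features")
    print(found if found is not None else "java-features is not installed")
```

If the GitHub install above was used, the reported version should presumably match the pinned tag, 0.1.1.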

Usage

Importing the package

from java_features.utilities.main_utility import get_all_filepaths, generate_init_data
from java_features.utilities.stats_scoring_utility import (
    initialize_stats_config,
    initialize_tokens_percentile,
    build_token_stats_features,
    initialize_features_percentile,
    build_main_style_stats_features,
)
from java_features.FeaturesCalculation import initialize_config, build_style_sequence, create_features_result_df

  1. The main utility collects the file paths of the code files and generates the initial data for the DataFrame.

  2. The stats scoring utility calculates the features that are based on percentile statistics.

  3. The FeaturesCalculation module calculates the main features and combines all features into the result DataFrame.

    Column explanation:

    • Line Pos 1 and Line Pos 2 are used together with the CLTS dicts to locate the actual positions of the duplicated code segments.
    • The columns from CLTS onward, i.e. from column number 7 (index 6), are the features that a model can use to predict plagiarism.
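
The column layout described above can be sketched with pandas. This is an illustration under assumptions: the toy frame is made up, and every column name in it other than CLTS, Line Pos 1, and Line Pos 2 is a hypothetical placeholder. Only the rule that the features start at index 6 comes from the text.

```python
import pandas as pd

FIRST_FEATURE_INDEX = 6  # column number 7, as stated in the column explanation


def split_result_columns(result_df: pd.DataFrame,
                         first_feature_index: int = FIRST_FEATURE_INDEX):
    """Split a result frame by position into (metadata columns, feature columns)."""
    metadata = result_df.iloc[:, :first_feature_index]
    features = result_df.iloc[:, first_feature_index:]
    return metadata, features


# Toy frame: the names before CLTS are placeholders, not the library's real
# column names, and the position data is left empty.
toy = pd.DataFrame(
    [["A.java", "B.java", 0, 1, None, None, 0.8, 0.5]],
    columns=["file_1", "file_2", "meta_3", "meta_4",
             "Line Pos 1", "Line Pos 2", "CLTS", "other_feature"],
)
metadata, features = split_result_columns(toy)
print(list(features.columns))  # ['CLTS', 'other_feature']
```

Splitting by position rather than by name keeps the sketch independent of the exact feature names, which depend on the configuration in use.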

Boilerplate code to generate feature pairs between all files in a directory

# imports
from java_features.utilities.main_utility import get_all_filepaths, generate_init_data
from java_features.utilities.stats_scoring_utility import (
    initialize_stats_config,
    initialize_tokens_percentile,
    build_token_stats_features,
    initialize_features_percentile,
    build_main_style_stats_features,
)
from java_features.FeaturesCalculation import initialize_config, build_style_sequence, create_features_result_df
import pandas as pd

# initialize the config (determines the list of feature names that will be used)
# initialize_config can receive the path of an ini file; see plag-configs-example.ini for an example
(PERCENTILES_DEFINE_TOKEN, ADD_PERCENTILE_FEATURE_COLS,
 PERCENTILES_DEFINE_FEATURE, USED_MAIN_FEATURES) = initialize_config()
initialize_stats_config(PERCENTILES_DEFINE_TOKEN, ADD_PERCENTILE_FEATURE_COLS, PERCENTILES_DEFINE_FEATURE, USED_MAIN_FEATURES)

# replace with the directory that contains your code files; make sure every file has a unique name
DIR = '{YOUR_DIRECTORY_NAME}'

# nerf feature (reduces false positives caused by files that contain only skeleton code)
NERF_FEATURES = False
MINIMAL_PAIR_HAVE_SAME_SEGMENT = 0.25

# initialization of the dataframe
filepaths = get_all_filepaths(DIR)
all_data = generate_init_data(filepaths)
columns = ['filename', 'raw_code', 'sequence_line', 'line_pos', 'raw_sequence_line', 'sequence', 'line_len']
initial_codes_df = pd.DataFrame(all_data, columns=columns)
initial_codes_df = initial_codes_df.sort_values(by=['line_len']).reset_index(drop=True)

# start to calculate features
main_codes_df = initial_codes_df.copy(deep=True)
build_style_sequence(main_codes_df)
result_scoring_df = create_features_result_df(main_codes_df, same_segment_nerf=NERF_FEATURES, minimal_pair_have_same_segment=MINIMAL_PAIR_HAVE_SAME_SEGMENT)
main_codes_df['sequence_len'] = main_codes_df['sequence'].str.len()
initialize_tokens_percentile(main_codes_df)
build_token_stats_features(result_scoring_df, main_codes_df)
initialize_features_percentile(result_scoring_df)
build_main_style_stats_features(result_scoring_df)

# output the result to csv file
result_scoring_df.to_csv('features-output.csv', index=False)
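
The boilerplate above requires every file in the directory to have a unique name. A minimal pre-check could look like the sketch below. It assumes the inputs are `.java` files found recursively under the directory; how `get_all_filepaths` actually traverses the directory is not documented here. The helper `find_duplicate_filenames` is hypothetical and not part of java-features.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def find_duplicate_filenames(directory: str, pattern: str = "*.java") -> Dict[str, List[Path]]:
    """Map each file name that occurs more than once to the paths sharing it."""
    paths_by_name: Dict[str, List[Path]] = defaultdict(list)
    # rglob walks the directory recursively; this traversal is an assumption.
    for path in Path(directory).rglob(pattern):
        if path.is_file():
            paths_by_name[path.name].append(path)
    return {name: sorted(paths) for name, paths in paths_by_name.items() if len(paths) > 1}
```

Calling `find_duplicate_filenames(DIR)` before `get_all_filepaths(DIR)` and stopping when the returned dict is non-empty is one way to catch name clashes before any features are calculated.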
