Skip to main content

Search and download Java files that contain StackOverflow links from Searchcode, and compare them with code snippets from StackOverflow questions or answers.

Project description

Py-SCSO-Compare

This is neither an official Searchcode nor an official Stackoverflow application! Just something I wrote for my bachelor thesis.

It gathers code-files from open-source projects via the searchcode.com api, then gathers code-snippets from StackOverflow via it's api using links found in the comments of the open-source files and finally submits them to MOSS and parses the results locally into an index file. Which is used to generate statistics and diagrams to visualize the amount of copy&paste happening within open-source projects.

Requirements

Python 3+ required + the contents of the requirements.txt file.

Usage

All of the scripts should be used as CLI applications, but I also restructered them to modules. So you can use them in that way as well. Anyway if you want to run them as CLI apps just clone/download this repo and run the scripts in order of appearance.

dsc_cli.py


$ py dsc_cli.py -h
usage: dsc_cli.py [-h] [-i] [-r REPO]

Download Java Code from searchcode, that contain the a StackOverflow Link.

optional arguments:
  -h, --help            show this help message and exit
  -i, --info            only get the number of results.
  -r REPO, --repo REPO  specify the repo to search by giving the repo_id.

exlf_cli.py


$ py exlf_cli.py -h
usage: exlf_cli.py [-h] [-r] [-o] [-c] [-v] F

Scans Java files for a StackOverflow links and returns those in a csv
sanitized as much as possible.

positional arguments:
  F                  file to be scanned.

optional arguments:
  -h, --help         show this help message and exit
  -r, --recursive    scan a directory recursively.
  -o, --output-file  save output in csv file found in data/extracted_data.csv.
  -c, --copy-line    copy first line of the scanned file(s), removing comment
                     characters like "//". This works in tandem with
                     dsc_cli.py which writes the link to the raw file in the
                     first line with a preceding "//".
  -v, --verbose      gives a more detailed output

dso_cli.py


$ py dso_cli.py -h
usage: dso_cli.py [-h] [-q] [-b] [-a] [-o OUTPUT_FILE] [-i] [-v] I

Download code snippets from StackOverflow

positional arguments:
  I                     The id of the entity, either an answer or a question,
                        from which the code snippet(s) will be downloaded.

optional arguments:
  -h, --help            show this help message and exit
  -q, --question        Get the code snippet(s) from a question body instead.
  -b, --best-answer     When the question option is used, this option tells
                        the program to get the highest rated answer of the
                        specified question.
  -a, --accepted-answer
                        When the question option is used, this option tells
                        the program to get the accepted answer of the
                        specified question. If there is no accepted answer the
                        highest rated answer is used instead.
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Saves extracted code snippet to file with the
                        specified name, or if there are more than one to a
                        folder of the same name.
  -i, --input-file      Parses data from CSV file and uses that data to get
                        code snippets and downloads them into
                        data/extracted_data/. REQUIRED HEADERS:
                        Stackoverflow_Links, SC_Filepath. OPTIONAL HEADER:
                        Download.
  -v, --verbose         gives a more detailed output

moss_cli_client.py


$ py moss_client_cli.py -h
usage: moss_client_cli.py [-h] [-p] [-o] [-j JOIN_FILE] [-b] U F

MOSS CLI client for submitting java files to the service and downloading the
report from the service locally. Will go through the sub folders of the given
folder and submit the java files for plagiarism checks and download the
reports locally, creating a linking file in the process

positional arguments:
  U                     Your user-id for the MOSS service.
  F                     The folder whose contents you want to submit.

optional arguments:
  -h, --help            show this help message and exit
  -p, --parse           Parses the moss reports into a csv file.
  -o, --only-parse      Only parses the local moss reports and does not submit
                        files and download the reports. Requires the reports
                        and the links_to_reports html file created normally by
                        this app.
  -j JOIN_FILE, --join-file JOIN_FILE
                        When the parse or only-parse option is given, joins
                        the parsed data with the parsed data.
  -b, --batch-mode      Only submits a 100 folders to the Moss Service, also
                        looks for already processed folders so that it does
                        not submit those again.

process_data_cli.py

No arguments needed, just run the following

$ py process_data_cli.py

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

Py-SCSO-Compare-1.0.1.tar.gz (14.3 kB view hashes)

Uploaded Source

Built Distribution

Py_SCSO_Compare-1.0.1-py3-none-any.whl (17.3 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page