Search and download Java files that contain StackOverflow links from Searchcode, and compare them with code snippets from StackOverflow questions or answers.

Py-SCSO-Compare

This is neither an official Searchcode nor an official StackOverflow application. It is just something I wrote for my bachelor thesis.

It gathers code files from open-source projects via the searchcode.com API, then gathers code snippets from StackOverflow via its API, using links found in the comments of the open-source files. Finally, it submits them to MOSS and parses the results locally into an index file, which is used to generate statistics and diagrams that visualize the amount of copy-and-paste happening within open-source projects.

Requirements

Requires Python 3 and the packages listed in the requirements.txt file.

Usage

All of the scripts are meant to be used as CLI applications, but I also restructured them into modules, so you can use them that way as well. To run them as CLI apps, clone or download this repo and run the scripts in the order they appear below.

dsc_cli.py


$ py dsc_cli.py -h
usage: dsc_cli.py [-h] [-i] [-r REPO]

Download Java Code from searchcode, that contain the a StackOverflow Link.

optional arguments:
  -h, --help            show this help message and exit
  -i, --info            only get the number of results.
  -r REPO, --repo REPO  specify the repo to search by giving the repo_id.

exlf_cli.py


$ py exlf_cli.py -h
usage: exlf_cli.py [-h] [-r] [-o] [-c] [-v] F

Scans Java files for a StackOverflow links and returns those in a csv
sanitized as much as possible.

positional arguments:
  F                  file to be scanned.

optional arguments:
  -h, --help         show this help message and exit
  -r, --recursive    scan a directory recursively.
  -o, --output-file  save output in csv file found in data/extracted_data.csv.
  -c, --copy-line    copy first line of the scanned file(s), removing comment
                     characters like "//". This works in tandem with
                     dsc_cli.py which writes the link to the raw file in the
                     first line with a preceding "//".
  -v, --verbose      gives a more detailed output
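
To give a rough idea of what this scanning step involves, here is a minimal sketch, not the actual code of exlf_cli.py. The URL pattern, the function names `extract_links` and `first_line_without_comment`, and the trailing-punctuation cleanup are all assumptions made for illustration.

```python
import re
from typing import List

# Assumed link shapes: /questions/<id>/..., /q/<id>, /a/<id>.
# The real script may accept more (or fewer) variants.
SO_LINK = re.compile(
    r"https?://(?:www\.)?stackoverflow\.com/(?:questions|q|a)/\d+[^\s\"'<>)]*",
    re.IGNORECASE,
)


def extract_links(java_source: str) -> List[str]:
    """Return StackOverflow links in order of appearance, without duplicates."""
    links: List[str] = []
    for match in SO_LINK.finditer(java_source):
        # Drop sentence punctuation that often trails a link in a comment.
        link = match.group(0).rstrip(".,;")
        if link not in links:
            links.append(link)
    return links


def first_line_without_comment(java_source: str) -> str:
    """Mimic the idea behind --copy-line: first line minus leading '//'."""
    first = java_source.splitlines()[0] if java_source else ""
    return first.lstrip("/ ").strip()
```

A regular expression is enough here because only the link text matters, not the Java syntax around it.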

dso_cli.py


$ py dso_cli.py -h
usage: dso_cli.py [-h] [-q] [-b] [-a] [-o OUTPUT_FILE] [-i] [-v] I

Download code snippets from StackOverflow

positional arguments:
  I                     The id of the entity, either an answer or a question,
                        from which the code snippet(s) will be downloaded.

optional arguments:
  -h, --help            show this help message and exit
  -q, --question        Get the code snippet(s) from a question body instead.
  -b, --best-answer     When the question option is used, this option tells
                        the program to get the highest rated answer of the
                        specified question.
  -a, --accepted-answer
                        When the question option is used, this option tells
                        the program to get the accepted answer of the
                        specified question. If there is no accepted answer the
                        highest rated answer is used instead.
  -o OUTPUT_FILE, --output-file OUTPUT_FILE
                        Saves extracted code snippet to file with the
                        specified name, or if there are more than one to a
                        folder of the same name.
  -i, --input-file      Parses data from CSV file and uses that data to get
                        code snippets and downloads them into
                        data/extracted_data/. REQUIRED HEADERS:
                        Stackoverflow_Links, SC_Filepath. OPTIONAL HEADER:
                        Download.
  -v, --verbose         gives a more detailed output
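
As an illustration of the snippet-extraction idea, the sketch below pulls code blocks out of a post body. It assumes the body arrives as HTML in which code blocks are wrapped in `<pre><code>` (as StackOverflow renders them) and uses only the standard library. The names `CodeBlockParser` and `extract_snippets` are hypothetical, and this is not necessarily how dso_cli.py does it.

```python
from html.parser import HTMLParser
from typing import List


class CodeBlockParser(HTMLParser):
    """Collects the text of every <pre><code>...</code></pre> block."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.snippets: List[str] = []
        self._in_pre = False
        self._in_code = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
        elif tag == "code" and self._in_pre:
            # Only <code> nested in <pre> counts; inline <code> is skipped.
            self._in_code = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "code" and self._in_code:
            self._in_code = False
            self.snippets.append("".join(self._buffer))
        elif tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_code:
            self._buffer.append(data)


def extract_snippets(body_html: str) -> List[str]:
    """Return the code blocks found in a post body, in document order."""
    parser = CodeBlockParser()
    parser.feed(body_html)
    parser.close()
    return parser.snippets
```

Skipping inline `<code>` spans keeps short identifiers mentioned in prose from being treated as snippets.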

moss_client_cli.py


$ py moss_client_cli.py -h
usage: moss_client_cli.py [-h] [-p] [-o] [-j JOIN_FILE] [-b] U F

MOSS CLI client for submitting java files to the service and downloading the
report from the service locally. Will go through the sub folders of the given
folder and submit the java files for plagiarism checks and download the
reports locally, creating a linking file in the process

positional arguments:
  U                     Your user-id for the MOSS service.
  F                     The folder whose contents you want to submit.

optional arguments:
  -h, --help            show this help message and exit
  -p, --parse           Parses the moss reports into a csv file.
  -o, --only-parse      Only parses the local moss reports and does not submit
                        files and download the reports. Requires the reports
                        and the links_to_reports html file created normally by
                        this app.
  -j JOIN_FILE, --join-file JOIN_FILE
                        When the parse or only-parse option is given, joins
                        the parsed data with the parsed data.
  -b, --batch-mode      Only submits a 100 folders to the Moss Service, also
                        looks for already processed folders so that it does
                        not submit those again.
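
The batch-mode behaviour described above can be sketched as follows. How the real script detects already processed folders is not documented here, so this sketch simply assumes a set of processed folder names is available; `next_batch` and its parameters are hypothetical.

```python
from typing import Iterable, List, Set


def next_batch(folders: Iterable[str], processed: Set[str], limit: int = 100) -> List[str]:
    """Pick up to `limit` folders that have not been processed yet."""
    batch: List[str] = []
    for folder in folders:
        if folder in processed:
            continue  # already submitted in an earlier run
        batch.append(folder)
        if len(batch) == limit:
            break
    return batch
```

Calling this repeatedly, and adding each returned batch to `processed` after submission, works through a large folder tree 100 folders at a time.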

process_data_cli.py

No arguments are needed; just run the following:

$ py process_data_cli.py
