Vendetect

A command-line tool for automatically detecting vendored and copy/pasted code between repositories.

Description 🧑‍🎓

Vendetect helps identify copied or vendored code between repositories, making it easier to detect when code has been copied with or without attribution. The tool uses similarity detection algorithms to compare code files and highlight matching sections.

Key features:

  • Compare code between two repositories (local or remote)
  • Analyze specific subdirectories within repositories
  • Identify files with similar code and display them side-by-side
  • Show similarity percentages for matched code
  • Filter by file types and adjust similarity thresholds
  • Support for different programming languages through Pygments lexers
  • Similarity scoring that does not rely solely on symbol names; Vendetect also considers code semantics

Installation 🚀

Using pip

pip install vendetect

Using uv

uv tool install vendetect

From source

Clone the repository and install:

git clone https://github.com/trailofbits/vendetect.git
cd vendetect
uv tool install .

Development installation

For development with all dependencies:

git clone https://github.com/trailofbits/vendetect.git
cd vendetect
uv sync --group dev
source .venv/bin/activate

Usage 🏃

Basic usage

vendetect TEST_REPO SOURCE_REPO

Where:

  • TEST_REPO: Path or URL to the repository you want to check for copied code
  • SOURCE_REPO: Path or URL to the repository that is the potential source of the code

Examples

# Compare two local repositories
vendetect /path/to/my/project /path/to/another/project

# Compare a local project with a remote repository
vendetect /path/to/my/project https://github.com/example/repo.git

# Compare only specific subdirectories within repositories
vendetect /path/to/my/project https://github.com/example/repo.git \
  --test-subdir src/components \
  --source-subdir lib/ui

# Filter by file types and adjust similarity threshold
vendetect /path/to/my/project /path/to/another/project \
  --type py --type js \
  --min-similarity 0.8

Options

--format FORMAT              Output format: rich, csv, or json (default: rich)
--output OUTPUT              Output file path (default: stdout)
--force                      Force overwrite of existing output file
--type FILE_TYPES, -t        File extension to consider (can be used multiple times)
--min-similarity THRESHOLD   Minimum similarity threshold (range: 0.0-1.0, default: 0.5)
--test-subdir DIR, -ts       Subdirectory within TEST_REPO to analyze
--source-subdir DIR, -ss     Subdirectory within SOURCE_REPO to analyze
--incremental                Enable incremental result reporting
--batch-size SIZE            Number of files to process per batch (default: 100)
--max-history-depth DEPTH    Maximum commit history depth (default: -1 = entire history)
--log-level LEVEL            Sets the log level (default: INFO)
--debug                      Equivalent to --log-level=DEBUG
--quiet                      Equivalent to --log-level=CRITICAL
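
When Vendetect is driven from a script, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only flags listed in the options above; the helper names build_vendetect_command and run_vendetect are hypothetical and not part of Vendetect itself, and the validation simply mirrors the documented ranges.

```python
import subprocess
from typing import List, Optional, Sequence


def build_vendetect_command(
    test_repo: str,
    source_repo: str,
    *,
    file_types: Sequence[str] = (),
    min_similarity: Optional[float] = None,
    output_format: Optional[str] = None,
    output_path: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list from flags documented in the options table.

    This helper is illustrative only; it is not shipped with Vendetect.
    """
    # The options table documents a 0.0-1.0 range for --min-similarity.
    if min_similarity is not None and not 0.0 <= min_similarity <= 1.0:
        raise ValueError("min_similarity must be within 0.0-1.0")
    # The options table documents three output formats.
    if output_format is not None and output_format not in {"rich", "csv", "json"}:
        raise ValueError("output_format must be rich, csv, or json")

    argv = ["vendetect", test_repo, source_repo]
    for ext in file_types:
        # --type may be repeated once per file extension.
        argv += ["--type", ext]
    if min_similarity is not None:
        argv += ["--min-similarity", str(min_similarity)]
    if output_format is not None:
        argv += ["--format", output_format]
    if output_path is not None:
        argv += ["--output", output_path]
    return argv


def run_vendetect(argv: List[str]) -> None:
    """Run the assembled command; requires vendetect to be on PATH."""
    subprocess.run(argv, check=True)
```

Building the argument list separately from running it keeps the command easy to log or inspect before anything is executed.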

Advanced Features

Subdirectory Analysis

When working with large repositories, you can focus analysis on specific subdirectories:

# Analyze only the src/ directory in both repositories
vendetect /path/to/my/project /path/to/another/project \
  --test-subdir src --source-subdir src

# Compare frontend code in one repo with backend in another
vendetect /path/to/frontend-repo /path/to/backend-repo \
  --test-subdir client/src --source-subdir server/utils

This is particularly useful for:

  • Focusing on relevant code sections
  • Reducing analysis time for large repositories
  • Comparing similar modules across different project structures

File Type Filtering

Control which files are analyzed by specifying file extensions:

# Only analyze Python files
vendetect /path/to/my/project /path/to/another/project --type py

# Analyze multiple file types
vendetect /path/to/my/project /path/to/another/project --type py --type js --type ts

Similarity Thresholds

Adjust the minimum similarity threshold to filter results:

# Show only high-confidence matches (80% similarity or higher)
vendetect /path/to/my/project /path/to/another/project --min-similarity 0.8

# Show all potential matches (lower threshold)
vendetect /path/to/my/project /path/to/another/project --min-similarity 0.3

Output Formats

Vendetect supports three output formats:

  1. rich (default): Interactive console output with syntax highlighting and side-by-side code comparison
  2. csv: Comma-separated values format with columns for Test File, Source File, Test Slice Start, Test Slice End, Source Slice Start, Source Slice End, and Similarity
  3. json: JSON format with detailed information about each detection, including file paths, similarity scores, and matched code slices

Example using CSV output:

vendetect /path/to/my/project /path/to/another/project --format csv --output results.csv

Example using JSON output:

vendetect /path/to/my/project /path/to/another/project --format json --output results.json
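
The CSV output lends itself to post-processing. The sketch below reads such a file and keeps the strongest matches. It assumes the file starts with a header row whose names match the column list above exactly and that Similarity is stored as a number between 0.0 and 1.0; neither detail is confirmed here, so check a real results.csv first. The Detection class, load_detections function, and sample rows are made up for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, TextIO


@dataclass(frozen=True)
class Detection:
    test_file: str
    source_file: str
    similarity: float


def load_detections(stream: TextIO, min_similarity: float = 0.0) -> List[Detection]:
    """Parse CSV rows and return matches sorted by descending similarity.

    Header names ("Test File", "Source File", "Similarity") are assumed
    from the column list in this README, not verified against real output.
    """
    detections = []
    for row in csv.DictReader(stream):
        similarity = float(row["Similarity"])
        if similarity >= min_similarity:
            detections.append(
                Detection(row["Test File"], row["Source File"], similarity)
            )
    return sorted(detections, key=lambda d: d.similarity, reverse=True)


# Made-up sample standing in for the contents of results.csv.
SAMPLE = """Test File,Source File,Test Slice Start,Test Slice End,Source Slice Start,Source Slice End,Similarity
src/b.py,lib/c.py,10,20,30,40,0.55
src/a.py,lib/a.py,1,40,5,44,0.92
"""

top_matches = load_detections(io.StringIO(SAMPLE), min_similarity=0.8)
```

To process a real file, pass an open file handle instead of the in-memory sample, for example open("results.csv", newline="").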

How it works 🧐

Vendetect uses a combination of techniques to identify similar code:

  1. It fingerprints all source code files in both repositories based upon their semantics rather than syntax
  2. For each file pair, it computes a similarity score
  3. It identifies specific sections (slices) of code that match between files
  4. By default, results are presented in a rich console format with side-by-side comparison; CSV and JSON output are also available
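
To make the first two steps concrete, here is a minimal sketch of one common way to fingerprint code and score a file pair: overlapping token k-grams compared with Jaccard similarity. This is not Vendetect's actual algorithm, which this README does not specify. The regex tokenizer, the normalization of identifiers to ID and numbers to NUM, the window size k, and all function names are assumptions chosen only to show why a score need not depend on symbol names.

```python
import keyword
import re
from typing import List, Set, Tuple

# Crude tokenizer: identifiers/keywords, integer runs, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source: str) -> List[str]:
    """Replace identifiers and numbers with placeholders, keep keywords."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalpha() or tok[0] == "_":
            # Python keywords carry structure; other names are interchangeable.
            tokens.append(tok if keyword.iskeyword(tok) else "ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)
    return tokens


def fingerprint(source: str, k: int = 4) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping k-token windows in the normalized source."""
    tokens = normalize(source)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def similarity(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two fingerprints, in the range 0.0-1.0."""
    fa, fb = fingerprint(a, k), fingerprint(b, k)
    if not fa and not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)


original = "def add(a, b):\n    return a + b\n"
renamed = "def plus(x, y):\n    return x + y\n"
unrelated = "for i in range(10):\n    print(i)\n"
```

In this sketch, similarity(original, renamed) is 1.0 because renaming symbols leaves the normalized token stream unchanged, while similarity(original, unrelated) is lower because the two snippets share few token windows.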

The tool can handle:

  • Local file system repositories
  • Git repositories (with history support)
  • Remote git repositories (automatically cloned for analysis)

Requirements 🛒

  • Python 3.11 or higher
  • Git (optional, for repository history analysis)

Contributing 🧑‍💻

Contributions are welcome! Check out the issues for ideas on where to start.

Development setup

# Install development dependencies
uv sync --group dev

# Source virtual env
source .venv/bin/activate

# Run tests
pytest

# Lint code
ruff check

# Type checking
mypy

Contact 💬

If you'd like to file a bug report or feature request, please use our issues page. Feel free to contact us or reach out in Empire Hacking for help using or extending Vendetect.

License 📝

This utility was developed by Trail of Bits.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Contact us if you're looking for an exception to the terms.

© 2025, Trail of Bits.

