Tag source files with real-world stories.

srctag

What is it?

Given a user-provided list of tags, srctag associates source files with the relevant tags and scores each association by mining the commit history.

For example, axios is a famous JavaScript library. We can extract some features (tags) it provides from the README and pass them to srctag:

| File | XMLHttpRequests | HTTP requests (Node.js) | Promise API support | Request/response interception | Request/response data transformation | Request cancellation | Automatic JSON data transforms | Automatic serialization of data objects | Client-side XSRF protection |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lib/adapters/http.js | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| lib/adapters/xhr.js | 0.980769231 | 0.981132075 | 0.980769231 | 0.979591837 | 0.981132075 | 0.98 | 0.98 | 0.960784314 | 0.981132075 |
| lib/utils.js | 0.961538462 | 0.962264151 | 0.961538462 | 0.959183673 | 0.962264151 | 0.94 | 0.96 | 0.980392157 | 0.962264151 |
| lib/platform/browser/index.js | 0.942307692 | 0.924528302 | 0.846153846 | 0.795918367 | 0.830188679 | 0.72 | 0.84 | 0.705882353 | 0.924528302 |
| lib/helpers/buildURL.js | 0.923076923 | 0.867924528 | 0.884615385 | 0.836734694 | 0.811320755 | 0.88 | 0.82 | 0.843137255 | 0.773584906 |
| lib/core/dispatchRequest.js | 0.903846154 | 0.943396226 | 0.903846154 | 0.897959184 | 0.905660377 | 0.96 | 0.86 | 0.882352941 | 0.886792453 |
| lib/helpers/toFormData.js | 0.884615385 | 0.905660377 | 0.923076923 | 0.857142857 | 0.943396226 | 0.78 | 0.94 | 0.941176471 | 0.867924528 |
| lib/axios.js | 0.865384615 | 0.773584906 | 0.942307692 | 0.918367347 | 0.924528302 | 0.9 | 0.92 | 0.901960784 | 0.943396226 |
| lib/defaults/index.js | 0.846153846 | 0.830188679 | 0.826923077 | 0.87755102 | 0.886792453 | 0.86 | 0.9 | 0.862745098 | 0.849056604 |
| lib/core/Axios.js | 0.826923077 | 0.886792453 | 0.865384615 | 0.93877551 | 0.867924528 | 0.92 | 0.88 | 0.921568627 | 0.830188679 |
| lib/core/AxiosError.js | 0.807692308 | 0.849056604 | 0.673076923 | 0.816326531 | 0.773584906 | 0.84 | 0.78 | 0.803921569 | 0.811320755 |
| lib/helpers/parseHeaders.js | 0.788461538 | 0.811320755 | 0.653846154 | 0.551020408 | 0.452830189 | 0.68 | 0.4 | 0.450980392 | 0.339622642 |
| lib/helpers/isURLSameOrigin.js | 0.769230769 | 0.698113208 | 0.403846154 | 0.571428571 | 0.641509434 | 0 | 0.7 | 0.352941176 | 0.905660377 |
| lib/platform/node/index.js | 0.75 | 0.735849057 | 0.788461538 | 0.653061224 | 0.735849057 | 0.44 | 0.64 | 0.529411765 | 0.735849057 |
| lib/platform/browser/classes/FormData.js | 0.730769231 | 0.716981132 | 0.711538462 | 0.428571429 | 0.716981132 | 0 | 0.56 | 0.078431373 | 0.698113208 |
| lib/helpers/fromDataURI.js | 0.711538462 | 0.754716981 | 0.769230769 | 0.428571429 | 0.509433962 | 0.34 | 0.42 | 0.078431373 | 0.679245283 |
| lib/platform/index.js | 0.692307692 | 0.660377358 | 0.519230769 | 0.367346939 | 0.566037736 | 0.44 | 0.5 | 0.529411765 | 0.641509434 |
| lib/platform/browser/classes/URLSearchParams.js | 0.673076923 | 0.641509434 | 0.807692308 | 0.591836735 | 0.698113208 | 0.44 | 0.74 | 0.764705882 | 0.509433962 |
| lib/helpers/cookies.js | 0.653846154 | 0.679245283 | 0.692307692 | 0.306122449 | 0.641509434 | 0.42 | 0.68 | 0.352941176 | 0.79245283 |
| lib/core/transformData.js | 0.634615385 | 0.79245283 | 0.75 | 0.734693878 | | | | | |

srctag then reports how relevant each source file is to each tag. You can process this data in whichever format you prefer: CSV, pandas, or even networkx with Graphviz.
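As a minimal sketch of the CSV route, the snippet below parses a small excerpt of the table above with the standard library and lists the highest-scoring files for one tag. The column layout (a `File` column followed by one column per tag) and the `top_files` helper are assumptions for illustration; check the file actually written by srctag before relying on it.

```python
import csv
import io

# Excerpt of the relevance table above. The CSV layout is an assumption.
CSV_TEXT = """File,XMLHttpRequests,Request cancellation
lib/adapters/http.js,1,1
lib/adapters/xhr.js,0.980769231,0.98
lib/helpers/isURLSameOrigin.js,0.769230769,0
lib/helpers/cookies.js,0.653846154,0.42
"""


def top_files(csv_text, tag, n=3):
    """Return the n (file, score) pairs with the highest score for `tag`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    scored = [(row["File"], float(row[tag])) for row in reader]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


for path, score in top_files(CSV_TEXT, "Request cancellation"):
    print(f"{score:.2f}  {path}")
```

The same ranking could be done on the pandas DataFrame instead; the standard-library version is shown only because it has no extra dependencies.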

How to use?

Installation

Requires Python 3.8 or later and the sentence-transformers library.

# Full installation, including the embedding dependencies
pip install "srctag[embedding]"

# Without the embedding extra; install sentence-transformers manually afterwards
pip install srctag

Use as LIB

You can check the links below for more detailed information:

import pathlib
import sys
import warnings

import networkx

from srctag.collector import Collector
from srctag.storage import Storage
from srctag.tagger import Tagger

axios_repo = pathlib.Path(__file__).parent.parent / "axios"
if not axios_repo.is_dir():
    warnings.warn(f"clone axios to {axios_repo} first")
    sys.exit(0)

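# Collector: gather metadata (such as commit messages) from the repository
collector = Collector()
collector.config.repo_root = axios_repo
collector.config.max_depth_limit = -1
# only consider files whose path matches this regex
collector.config.include_regex = r"lib.*"

ctx = collector.collect_metadata()

# Storage: embed the collected metadata into the vector database
storage = Storage()
storage.embed_ctx(ctx)

# Tagger: relate each file to the tags listed below
tagger = Tagger()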
tagger.config.tags = [
    "XMLHttpRequests from browser",
    "HTTP requests from node.js",
    "Promise API support",
    "Request and response interception",
    "Request and response data transformation",
    "Request cancellation",
    "Automatic JSON data transforms",
    "Automatic serialization of data objects",
    "Client-side XSRF protection"
]
tag_result = tagger.tag(storage)

# access the pandas.DataFrame
print(tag_result.scores_df)

# csv dump
tag_result.export_csv()

# dot file dump
graph = tag_result.export_networkx()
networkx.drawing.nx_pydot.write_dot(graph, sys.stdout)

Use as CLI

$ srctag tag --help
Usage: srctag tag [OPTIONS]

  tag your repo

Options:
  --repo-root TEXT             Repository root directory
  --max-depth-limit INTEGER    Maximum depth limit
  --include-regex TEXT         File include regex pattern
  --tags-file FILENAME         Path to a text file containing tags
  --output-path TEXT           Output file path for CSV
  --file-level TEXT            Scan file level, FILE or DIR, default to FILE
  --st-model TEXT              Sentence Transformer Model
  --commit-include-regex TEXT  Commit message include regex pattern
  --help                       Show this message and exit.
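
A possible way to drive the command from a script, using only options listed in the help output above. The one-tag-per-line layout of the tags file, the file names, and the `build_command` helper are assumptions for illustration, not documented behavior.

```python
import pathlib
import tempfile


def build_command(repo_root, tags_file, output_path, include_regex=None):
    """Assemble an argument list for `srctag tag` (hypothetical helper)."""
    cmd = [
        "srctag", "tag",
        "--repo-root", str(repo_root),
        "--tags-file", str(tags_file),
        "--output-path", str(output_path),
    ]
    if include_regex is not None:
        cmd += ["--include-regex", include_regex]
    return cmd


tags = ["Promise API support", "Request cancellation"]

# Assumption: the tags file holds one tag per line.
workdir = pathlib.Path(tempfile.mkdtemp())
tags_file = workdir / "tags.txt"
tags_file.write_text("\n".join(tags) + "\n", encoding="utf-8")

cmd = build_command("axios", tags_file, workdir / "result.csv", include_regex=r"lib.*")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

The command is only printed here, so the sketch can be tried without srctag installed.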

Goal & Motivation

Diff Analysis

This project was initially created to address the following problem. Complex business projects often have numerous modules and many contributors. Because the modules are tightly coupled, one developer's change can easily affect code owned by another. Detecting such issues through code review is time-consuming, labor-intensive, and prone to oversights.

We aim to help evaluate the potential impact of a change on various functionalities, guiding subsequent testing efforts.

We also have a work-in-progress GitHub Actions project for evaluating pull requests: https://github.com/williamfzc/srctag-action
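
One way such an evaluation could work, sketched under assumptions: take the files touched by a change, look up their relevance scores, and rank the tags by the highest score among those files. The scores are copied from the axios table above; the `impacted_tags` helper, the max-aggregation, and the 0.5 threshold are illustrative choices, not srctag behavior.

```python
# Relevance scores for two files and three tags, taken from the axios example.
SCORES = {
    "lib/helpers/cookies.js": {
        "Request cancellation": 0.42,
        "Automatic JSON data transforms": 0.68,
        "Client-side XSRF protection": 0.79245283,
    },
    "lib/helpers/isURLSameOrigin.js": {
        "Request cancellation": 0.0,
        "Automatic JSON data transforms": 0.7,
        "Client-side XSRF protection": 0.905660377,
    },
}


def impacted_tags(changed_files, scores, threshold=0.5):
    """Rank tags by the highest score among the changed files (hypothetical helper)."""
    best = {}
    for path in changed_files:
        for tag, score in scores.get(path, {}).items():
            best[tag] = max(best.get(tag, 0.0), score)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    # Keep only tags that clear the (arbitrary) threshold.
    return [(tag, score) for tag, score in ranked if score >= threshold]


changed = ["lib/helpers/cookies.js", "lib/helpers/isURLSameOrigin.js"]
for tag, score in impacted_tags(changed, SCORES):
    print(f"{score:.2f}  {tag}")
```

Tags that clear the threshold point at the functionality worth retesting first.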

API for LLM

With the rise of large language models (LLMs), many teams are considering how to make LLMs understand an entire codebase. So far, LLMs understand implementation-level details well, but their grasp of the business functionality that the code represents is limited.

We also hope to use this approach to enable LLMs to establish associations between code files and specific business functionalities at a lower cost, enhancing their overall understanding of the code repository.

How does it actually work?

  • Collector: Collects sufficient metadata from the code repository, such as commit messages.
  • Storage: Organizes this metadata and embeds it into a vector database in an appropriate form.
  • Tagger: Searches for relevant files based on the existing tag list and further establishes associations.
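
The three stages can be illustrated with a toy, standard-library-only sketch. It is not srctag's implementation: the commit messages are made up, a bag-of-words cosine similarity stands in for the sentence-transformer embeddings and the vector database, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Collector stage: file -> commit messages that touched it (made-up sample data).
COMMITS = {
    "lib/adapters/xhr.js": ["fix xsrf cookie header in browser requests"],
    "lib/cancel/CancelToken.js": ["add request cancellation via cancel token"],
}


def embed(text):
    """Storage stage: turn text into a sparse word-count vector (toy embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tag_files(commits, tags):
    """Tagger stage: score every (file, tag) pair by similarity."""
    vectors = {path: embed(" ".join(msgs)) for path, msgs in commits.items()}
    return {
        path: {tag: cosine(vec, embed(tag)) for tag in tags}
        for path, vec in vectors.items()
    }


scores = tag_files(COMMITS, ["Request cancellation", "Client-side XSRF protection"])
for path, row in scores.items():
    best = max(row, key=row.get)
    print(path, "->", best)
```

Word counts only match identical words, which is why a real embedding model is the better fit for free-form commit messages.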

License

Apache 2.0
