
Sourced

Tooling around mass-scale Python source code analysis.

Usage

Currently there are two datasets: pypi-all and pypi-popular. If you intend to keep your sample size low, I highly recommend pypi-popular, since it is far more likely to give you relevant results than pypi-all.

You can check out any number of datasets with different sample sizes:

$ sourced datasets create \
    --source pypi-popular \
    --sample-size 10 \
    playground

By default it will download all the source code under ~/.cache/sourced/0.0.1/<name>, but it might be more pleasant to keep it in a separate directory outside of your home:

$ sourced datasets create \
    --source pypi-popular \
    --sample-size 5000 \
    --base-data-dir /mnt/my-giant-disk/sourced-datasets \
    top-5000-packages

All these datasets are accessible through the CLI as long as those paths exist:

$ sourced datasets list
playground /path/to/.cache/sourced/0.0.1/playground
top-5000-packages /path/to/my-giant-disk/sourced-datasets/top-5000-packages

Running analyses on source code

As soon as you have a dataset checked out, you can run analyses on it with the tooling offered in this package. Here is a simple program that parses every file in the dataset to find the most common name:

from __future__ import annotations

import ast
import tokenize
from argparse import ArgumentParser
from collections import Counter

from sourced import Sourced


def most_common_name(file: str) -> dict[str, int]:
    usage: dict[str, int] = {}
    try:
        # tokenize.open() honors the file's encoding declaration (PEP 263).
        with tokenize.open(file) as stream:
            tree = ast.parse(stream.read())
    except Exception:
        # Not every file on PyPI is valid Python 3; skip anything that
        # fails to decode or parse.
        return usage

    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            usage.setdefault(node.id, 0)
            usage[node.id] += 1
    return usage


def main():
    parser = ArgumentParser()
    parser.add_argument("dataset")

    options = parser.parse_args()
    sourced = Sourced()

    # run_on() maps the analysis function over every file in the dataset
    # and yields per-file results, which we merge into a single Counter.
    results = Counter()
    for result in sourced.run_on(options.dataset, most_common_name):
        results.update(result)

    for name, count in results.most_common(n=20):
        print(f"{name}: {count}")


if __name__ == "__main__":
    main()

$ python examples/python_specific_source.py playground
Found 10 sources
Collected 959 files from 10 unique projects.
self: 24489
os: 1821
str: 1735
request: 1157
response: 1064
value: 1029
pytest: 984
mock: 966
name: 837
r: 770
isinstance: 715
len: 705
cmd: 701
client: 674
params: 672
path: 668
key: 659
pool: 623
int: 599
config: 553
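
The same run_on interface works for any per-file analysis, not just name counting. As a rough sketch following the same pattern (collect_imports is an illustrative helper written for this example, not part of the package), here is a variant that tallies which top-level modules are imported most often across the dataset:

from __future__ import annotations

import ast
import tokenize
from collections import Counter

from sourced import Sourced


def collect_imports(file: str) -> dict[str, int]:
    # Record the top-level module of every import in a single file.
    usage: Counter[str] = Counter()
    try:
        with tokenize.open(file) as stream:
            tree = ast.parse(stream.read())
    except Exception:
        # Skip files that fail to decode or parse.
        return usage

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                usage[alias.name.partition(".")[0]] += 1
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) refer back to the package itself.
            usage[node.module.partition(".")[0]] += 1
    return usage


results: Counter[str] = Counter()
for result in Sourced().run_on("playground", collect_imports):
    results.update(result)

for module, count in results.most_common(n=10):
    print(f"{module}: {count}")

Because run_on yields one mapping per file, all aggregation stays in the caller; the analysis function itself only needs to be a pure function from a file path to a result.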

