Sourced

Tooling around mass-scale Python source code analysis.

Usage

Currently there are two datasets: pypi-all and pypi-popular. I highly recommend pypi-popular if you intend to keep your sample size low, since a small sample drawn from it is much more likely to give relevant results than one drawn from pypi-all.

You can check out any number of datasets with different sample sizes:

$ sourced datasets create \
    --source pypi-popular \
    --sample-size 10 \
    playground

By default it will download all the source code under ~/.cache/sourced/0.0.1/<name>, but it might be more pleasant to have a separate directory outside of your home:

$ sourced datasets create \
    --source pypi-popular \
    --sample-size 5000 \
    --base-data-dir /mnt/my-giant-disk/sourced-datasets \
    top-5000-packages
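
If you want to locate a dataset from your own scripts, the layout implied by the two commands above can be sketched as follows. This is only an illustration: `expected_dataset_dir` is a hypothetical helper, not part of sourced, and it assumes a dataset always lives at `<base-data-dir>/<name>`, with the default base being the `~/.cache/sourced/0.0.1` directory mentioned above.

```python
from pathlib import Path
from typing import Optional

# Default base directory as stated in this README (assumed to stay the same
# across releases, which may not hold).
DEFAULT_BASE_DATA_DIR = Path("~/.cache/sourced/0.0.1")


def expected_dataset_dir(name: str, base_data_dir: Optional[str] = None) -> Path:
    """Return the directory where a dataset called `name` is expected to live."""
    if base_data_dir is not None:
        base = Path(base_data_dir)
    else:
        base = DEFAULT_BASE_DATA_DIR.expanduser()
    return base / name


print(expected_dataset_dir("top-5000-packages", "/mnt/my-giant-disk/sourced-datasets"))
```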

All these datasets are accessible through the CLI as long as those paths exist:

$ sourced datasets list
playground /path/to/.cache/sourced/0.0.1/playground
top-5000-packages /path/to/my-giant-disk/sourced-datasets/top-5000-packages
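
If you need the same listing inside a script, one option is to call the CLI and parse its output. The sketch below assumes the output format is exactly what is shown above (one `<name> <path>` pair per line); `parse_dataset_list` and `list_datasets` are hypothetical helpers, not part of sourced.

```python
import subprocess
from pathlib import Path


def parse_dataset_list(output: str) -> dict[str, Path]:
    """Map dataset names to paths, assuming one "<name> <path>" pair per line."""
    datasets: dict[str, Path] = {}
    for line in output.splitlines():
        # Split on the first run of whitespace only, so paths containing
        # spaces stay in one piece.
        parts = line.strip().split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, path = parts
        datasets[name] = Path(path)
    return datasets


def list_datasets() -> dict[str, Path]:
    """Run `sourced datasets list` (must be on PATH) and parse what it prints."""
    result = subprocess.run(
        ["sourced", "datasets", "list"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_dataset_list(result.stdout)
```

`parse_dataset_list` is kept separate from the subprocess call so it can be tried on a captured listing without having sourced installed.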

Running analyses on source code

Once you have a dataset checked out, you can run any analysis on it with the tooling offered in this package. Here is a simple program that parses every file in the dataset to find the most common names:

from __future__ import annotations

import ast
import tokenize
from argparse import ArgumentParser
from collections import Counter

from sourced import Sourced


def most_common_name(file: str) -> dict[str, int]:
    usage: dict[str, int] = {}
    try:
        with tokenize.open(file) as stream:
            tree = ast.parse(stream.read())
    except Exception:
        # Skip files that cannot be read, decoded, or parsed.
        return usage

    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            usage.setdefault(node.id, 0)
            usage[node.id] += 1
    return usage


def main():
    parser = ArgumentParser()
    parser.add_argument("dataset")

    options = parser.parse_args()
    sourced = Sourced()

    results = Counter()
    for result in sourced.run_on(options.dataset, most_common_name):
        results.update(result)

    for name, count in results.most_common(n=20):
        print(f"{name}: {count}")


if __name__ == "__main__":
    main()

$ python examples/python_specific_source.py playground
Found 10 sources
Collected 959 files from 10 unique projects.
self: 24489
os: 1821
str: 1735
request: 1157
response: 1064
value: 1029
pytest: 984
mock: 966
name: 837
r: 770
isinstance: 715
len: 705
cmd: 701
client: 674
params: 672
path: 668
key: 659
pool: 623
int: 599
config: 553
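
The output above mixes project-level names with builtins such as str, isinstance, len and int. If you only care about the former, one possible post-processing step is to drop everything defined in the standard `builtins` module before printing. This is a sketch, not something the package provides; `without_builtins` is a hypothetical helper, and the sample counts are copied from the run above.

```python
import builtins
from collections import Counter

# Every name exposed by the builtins module (str, len, isinstance, ...).
BUILTIN_NAMES = frozenset(dir(builtins))


def without_builtins(counts: Counter) -> Counter:
    """Return a copy of `counts` with builtin names removed."""
    return Counter(
        {name: count for name, count in counts.items() if name not in BUILTIN_NAMES}
    )


sample = Counter(
    {"self": 24489, "os": 1821, "str": 1735, "isinstance": 715, "len": 705, "int": 599}
)
print(without_builtins(sample).most_common())
# prints [('self', 24489), ('os', 1821)]
```

In the program above, this would be applied to `results` just before the final loop.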

Release history

0.1.0 (this version): Dec 4, 2022

0.1.0a0 (pre-release): Dec 2, 2022

Download files

Source Distribution

sourced-0.1.0.tar.gz (10.4 kB), uploaded Dec 4, 2022

Built Distribution

sourced-0.1.0-py3-none-any.whl (11.4 kB, Python 3), uploaded Dec 4, 2022

Hashes for sourced-0.1.0.tar.gz

Algorithm	Hash digest
SHA256	`8e8def3e7508c916a3d4da441964802e2a303fd134e365cc4dda251609b99db8`
MD5	`ee2124e876328c388b585d2d85a29114`
BLAKE2b-256	`ed4de48027b4a4768336fde04cd3b55230362f59c9a03c3f3e6a251d923b72f8`

Hashes for sourced-0.1.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`c53731bc097d1da6f566ba9a9d164c155912ce1b446d41cc5e8d1563642292d0`
MD5	`603b3fc447ed540577804a320034d90f`
BLAKE2b-256	`08b14b9b892512e7fdc560ca458ea5c2b90be22338a130e3947a64be6a9affb7`