Skip to main content

Tools to investigate the state of PyPI packages.

Project description

PyPI package rot

PyPI Downloads License CI

Investigating the state of package rot on PyPI.

Introduction

The objective of this dataset is to provide a screenshot of the state of PyPI packages, so to facilitate investigations on the state of the Python ecosystem. While its first intended goal is to facilitate the identification of package rot, it can be used for other purposes as well.

At this time, we are building the dataset still and we are about 200k packages of 600k. We expect to have the dataset ready by the end of 2024.

Installing

As usual, you can install the package with pip:

pip install pypi-package-rot

CLI

The package provides the following CLI utilities to allow you to replicate the dataset. Do keep in mind that this operation will take a long time, as it involves scraping informations from websites that allow about 1 request per second.

Perpetual scraper

This command will scrape PyPI packages and store their metadata in the cache directory. Since the API RATE allows only about 1 request per second, it will take a long time (currently about 8 days) to retrieve all the packages. This utility will run in perpetuity, running and retrieving new packages continuously.

The email is needed to build an adequate user-agent string for the requests, so that PyPI can contact you if you are abusing the API.

pypi_package_rot perpetual_scraper --email "your@email.com"

Build the dataset

After having downloaded for a while the metadata of the packages, you can build the summarized anonymized dataset with the following command:

pypi_package_rot perpetual_builder --verbose --output "pypi.rot.v1.summarized.csv"  --email "your@email.com"

Alternatively, to include all informations, you can use the following command:

pypi_package_rot perpetual_builder --verbose --full --output "pypi.rot.v1.full.json"  --email "your@email.com"

Part of the procedure included testing whether the URLs in the metadata are still valid. This is done by sending a HEAD request to the URL and checking the status code. Such operations may take a long time (about one month at the time of writing). As per the previous command, the email is needed to build an adequate user-agent string for the requests, so that websites can contact you if you are abusing their services.

Contributing

Do you have some ideas to improve this dataset or associated analysis? Feel free to open an issue or a pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pypi_package_rot-0.1.0.tar.gz (12.8 kB view details)

Uploaded Source

File details

Details for the file pypi_package_rot-0.1.0.tar.gz.

File metadata

  • Download URL: pypi_package_rot-0.1.0.tar.gz
  • Upload date:
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.0.0 CPython/3.11.5

File hashes

Hashes for pypi_package_rot-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d9c68064eeb6c838fa2a252c61fc6953fd1161559c45f8ada03e712c884a860b
MD5 84ea9d017fb25d4ddf600c65bcd9e081
BLAKE2b-256 17d64b228f530377f57f6448c730572bc5bd319312e60698159c5e53588e56df

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page