
pelutils

Various utilities useful for Python projects. Features include

  • A simple and powerful logger with colourful printing and stacktrace logging
  • Parsing for combining config files and command-line arguments - especially useful for algorithms with several parameters
  • A timer inspired by Matlab's tic and toc
  • Simple code profiler
  • An extension to the built-in dataclass for saving and loading data
  • Table formatting
  • Miscellaneous standalone functions - see pelutils/__init__.py
  • Data-science submodule with extra utilities for statistics, plotting with matplotlib, and machine learning using PyTorch
  • Linear time unique function in the style of numpy.unique (sketched just below this list)
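
A minimal sketch of the unique function (the import location pelutils.ds is an assumption based on the changelog; only the basic call is shown):

import numpy as np

from pelutils.ds import unique  # Assumed import location

a = np.array([3, 1, 3, 2, 1])
# Linear-time alternative to np.unique that does not sort the result
u = unique(a)
print(u)  # The unique elements of a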

pelutils supports Python 3.7-3.9.


Timing and Code Profiling

A simple time taker inspired by Matlab's tic and toc, which also includes profiling tooling.

import multiprocessing as mp
import time

from pelutils import TT, TickTock

# Time a task
TT.tick()
<some task>
seconds_used = TT.tock()

# Profile a for loop
for i in range(100):
    TT.profile("Repeated code")
    <some task>
    TT.profile("Subtask")
    <some subtask>
    TT.end_profile()
    TT.end_profile()
print(TT)  # Prints a table view of profiled code sections

# Alternative syntax using with statement
with TT.profile("The best task"):
    <some task>

# When using multiprocessing, it can be useful to simulate multiple hits of the same profile
with mp.Pool() as p, TT.profile("Processing 100 items on multiple threads", hits=100):
    p.map(<some function>, <100 items>)
# Similar for very quick loops
a = 0
with TT.profile("Adding 1 to a", hits=100):
    for _ in range(100):
        a += 1

# Examples so far use a global TickTock instance, which is convenient,
# but it can also be desirable to use multiple separate timers, e.g.
tt1 = TickTock()
tt2 = TickTock()
t1_interval = 1  # Do task 1 every second
t2_interval = 2  # Do task 2 every other second
tt1.tick()
tt2.tick()
while True:
    if tt1.tock() > t1_interval:
        <task 1>
        tt1.tick()
    if tt2.tock() > t2_interval:
        <task 2>
        tt2.tick()
    time.sleep(0.01)

Data Storage

The DataStorage class is an augmentation of the dataclass that includes save and load functionality. This simplifies saving data, as a single save call stores all fields, and unlike e.g. a dictionary, type hints are preserved when loading.

Currently works specifically with:

  • Numpy arrays (numpy.ndarray)
  • Torch tensors (torch.Tensor)
  • Any json serializable data (as determined by the rapidjson library)

All other data is pickled.

Classes using this functionality must inherit from DataStorage and be decorated with @dataclass.

It is further possible to give arguments to the class definition:

  • json_name: Name of the saved json file
  • indent: How many spaces to use for indenting in the json file

Usage example:

from dataclasses import dataclass

import numpy as np

from pelutils import DataStorage

@dataclass
class ResultData(DataStorage, json_name="game.json", indent=4):
    shots: int
    goalscorers: list
    dists: np.ndarray

rdata = ResultData(shots=1, goalscorers=["Max Fenger"], dists=np.ones(22)*10)
rdata.save("max")
# Now shots and goalscorers are saved in <pwd>/max/game.json and dists in <pwd>/max/dists.npy

# Then to load
rdata = ResultData.load("max")
print(rdata.goalscorers)  # ["Max Fenger"]

Parsing

A parsing tool for combining command-line and config file arguments, especially useful for parametric methods such as machine learning. The first argument must always be a path to the job location, which can for instance be used for storing log files, results, and plots.

Consider the execution of a file main.py with the command line call

python main.py path/to/put/results -c path/to/config/file.ini --data-path path/to/data

The config file could contain

[DEFAULT]
fp16
learning-rate=1e-4

[LOWLR]
learning-rate=1e-5

[NOFP16]
fp16=False

where main.py contains

from pelutils.parser import Parser, Argument, Option, Flag  # Assumed import path

options = [
    # Mandatory argument with set abbreviation -p
    Argument("--data-path", help="Path to where data is located", abbrv="-p"),
    # Optional argument with auto-generated abbreviation -l
    Option("--learning-rate", default=1e-5, help="Learning rate to use for gradient descent steps"),
    # Boolean flag with auto-generated abbreviation -f
    Flag("--fp16", help="Use mixed precision for training")
]
parser = Parser(*options, multiple_jobs=True)  # Two jobs are specified in the config file, so multiple_jobs=True
location = parser.location  # Experiments are stored here. In this case path/to/put/results
job_descriptions = parser.parse()
parser.document_settings()  # Save a config file to reproduce the experiment
# Run each experiment
for job_description in job_descriptions:
    # Get location of this job as job_description.location
    run_experiment(job_description)

This could then be run with CLI arguments only, with a config file, or with a combination of the two, where CLI arguments take precedence:

python main.py data/my-big-experiment --learning-rate 1e-5
python main.py data/my-big-experiment --config cfg.ini
python main.py data/my-big-experiment --config cfg.ini --learning-rate 1e-5

Logging

The logging submodule contains a simple yet feature-rich logger that fits common needs. It can be imported directly from pelutils, e.g. from pelutils import log.

import multiprocessing as mp

from pelutils import log, Logger, LogLevels

# Configure logger for the script
log.configure("path/to/save/log.log")

# Start logging
for i in range(70):  # Nice
    log("Execution %i" % i)

# Sections
log.section("New section in the logfile")

# Adjust logging levels
log.warning("Will be logged")
with log.level(LogLevels.ERROR):  # Only log at ERROR level or above
    log.warning("Will not be logged")
with log.no_log:
    log.section("I will not be logged")

# Error handling
# The zero-division error and stacktrace is logged
with log.log_errors:
    0 / 0
# Entire chained stacktrace is logged
with log.log_errors:
    try:
        0 / 0
    except ZeroDivisionError as e:
        raise ValueError("Denominator must be non-zero") from e

# User input - acts like built-in input but logs both prompt and user input
inp = log.input("Continue [Y/n]? ")
# Parse yes/no user input
cont = log.parse_bool_input(inp, default=True)

# Log all logs from a function at the same time
# This is especially useful when using multiple threads so logging does not get mixed up
def fun(x):
    log("Hello there")
    log("General Kenobi!")
with mp.Pool() as p:
    p.map(log.collect_logs(fun), args)  # args: an iterable of arguments for fun

# It is also possible to create multiple loggers by importing the Logger class, e.g.
log2 = Logger()
log2.configure("path/to/save/log2.log")

Data Science

This submodule contains various utility functions for data science and machine learning. To make sure the necessary requirements are installed, install using

pip install pelutils[ds]

Note that in some terminals (e.g. zsh), you will have to escape the brackets:

pip install pelutils\[ds\]

Deep Learning

All PyTorch functions work independently of whether CUDA is available or not.

# Inference only: No gradients should be tracked in the following function
# Same as putting entire function body inside `with torch.no_grad()`
@no_grad
def infer():
    <code that includes feedforwarding>
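
A concrete, runnable sketch of the above (assuming no_grad is exposed from pelutils.ds):

import torch

from pelutils.ds import no_grad  # Assumed import location

model = torch.nn.Linear(10, 2)

@no_grad
def infer(x: torch.Tensor) -> torch.Tensor:
    # No gradients are tracked in here, just as inside `with torch.no_grad()`
    return model(x)

out = infer(torch.randn(4, 10))
assert not out.requires_grad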

Statistics

Includes various commonly used statistical functions.

import numpy as np
import scipy.stats

from pelutils.ds.stats import z, corr_ci  # Assumed import path

# Get one-sided z value for an exponential(lambda=2) distribution with a
# significance level of 1 % (scipy parametrizes exponentials by scale = 1/lambda)
zval = z(alpha=0.01, two_sided=False, distribution=scipy.stats.expon(scale=1/2))

# Get correlation, confidence interval, and p value for two vectors
a, b = np.random.randn(100), np.random.randn(100)
r, lower_r, upper_r, p = corr_ci(a, b, alpha=0.01)

Plotting

pelutils provides plotting utilities based on matplotlib. Most notable is the Figure context class, which attempts to remedy some common grievances with matplotlib, e.g. having to remember the correct kwargs and rcParams for setting font sizes, grid line colours, etc., and which adds type hints to the fig and ax objects produced by plt.subplots.

import matplotlib.pyplot as plt
import numpy as np

from pelutils.ds.plots import Figure

x, y = np.random.randn(2, 100)  # Example data

# The following makes a plot and saves it to `plot.png`.
# The seaborn style is used for demonstration, but if the `style` argument
# is not given, the default matplotlib style is used.
# The figure and font size are also given for demonstration, but their default
# values are increased compared to matplotlib's default, as these are generally
# too small for finished plots.
with Figure("plot.png", figsize=(20, 10), style="seaborn", fontsize=20):
    plt.scatter(x, y, label="Data")
    plt.grid()
    plt.title("Very nice plot")
# The figure is automatically saved to `plot.png` and closed, such that
# plt.plot can be used again from here.
# Figure changes `matplotlib.rcParams`, but these changes are also undone
# at the end of the `with` statement.

# For more complex plots, it is also possible to access the `fig` and `ax`
# variables usually assigned as `fig, ax = plt.subplots()`.
# These are type hinted, so no more remembering if it is `ax.title()` or
# `ax.set_title()`.
with Figure("plot.png") as f:
    f.fig  # fig available as attribute on the Figure instance
    f.ax.set_title("Very nice plot")  # The same goes for `ax`

The plotting utilities also include binning functions for creating nice histograms. The get_bins function produces bins based on a binning function, of which three are provided:

  • linear_binning: Bins are spaced evenly from the lowest to the largest value of the data.
  • log_binning: Bins are log-spaced from the lowest to the largest value of the data, which is assumed to be positive.
  • normal_binning: Bins are distributed according to the distribution of the data, such that there are more bins close to the center of the data. This is useful if the data somewhat resembles a normal distribution, as the resolution will be greatest where there is the most data.

It is also possible to provide custom binning functions.

get_bins provides both x and y coordinates, making it simple to use with argument unpacking:

import matplotlib.pyplot as plt
import numpy as np
from pelutils.ds.plots import get_bins, normal_binning

# Generate normally distributed data
x = np.random.randn(100)
# Plot distribution
plt.plot(*get_bins(x, binning_fn=normal_binning))
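
The same pattern works with the other binning functions; a small sketch using linear_binning and log_binning (log-spaced bins require positive data):

import matplotlib.pyplot as plt
import numpy as np

from pelutils.ds.plots import get_bins, linear_binning, log_binning

# Positive data, so log-spaced bins are also applicable
x = np.abs(np.random.randn(1000)) + 0.1
plt.plot(*get_bins(x, binning_fn=linear_binning), label="Linear bins")
plt.plot(*get_bins(x, binning_fn=log_binning), label="Log bins")
plt.legend()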

Finally, different smoothing functions are provided. The two most common are moving_avg and exponential_avg, which smooth the data using a moving average and exponential smoothing, respectively.

The double_moving_avg is special in that the number of smoothed data points does not depend on the number of given data points, but is instead based on a given number of samples. This prevents the resulting smoothed curve from being jagged, as can happen with the other smoothing functions. It also has two smoothness parameters, which allow a large degree of control over the smoothing.

Apart from the smoothness parameters, all smoothing functions have the same call signature:

import matplotlib.pyplot as plt
import numpy as np

from pelutils.ds.plots import double_moving_avg

# Generate noisy data
n = 100
x = np.linspace(-1, 1, n)
y = np.random.randn(n)

# Plot data along with smoothed curve
plt.plot(*double_moving_avg(x, y))
# If x is not given, it is assumed to go from 0 to n-1 in steps of 1
plt.plot(*double_moving_avg(y))
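
Because the call signature is shared, swapping in one of the other smoothers is a one-line change. A sketch using moving_avg with default smoothness parameters:

import matplotlib.pyplot as plt
import numpy as np

from pelutils.ds.plots import moving_avg

n = 100
x = np.linspace(-1, 1, n)
y = np.random.randn(n)

plt.plot(x, y, alpha=0.3, label="Noisy data")
plt.plot(*moving_avg(x, y), label="Moving average")
plt.legend()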

Examples of all the plotting utilities are shown in the examples directory.

Supported platforms

Precompiled wheels are provided for most common platforms. Notably, they are not provided for 32-bit systems. If no wheel is provided, pip should attempt a source install. If all else fails, it is possible to install from source by pointing pip to Github directly:

pip install git+https://github.com/peleiden/pelutils.git@release#egg=pelutils

It is also possible to install from source using pip's --no-binary option.
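
For example, something like the following should force a source build (--no-binary takes the package name):

pip install pelutils --no-binary pelutils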

Source installs can also be necessary if you are using a numpy version that is incompatible with the one the precompiled wheels were built with. If that is the case, you will probably see errors in the style of

ImportError: numpy.core.multiarray failed to import
RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd

History

1.0.0 - Breaking changes, first stable release

  • Removed log.tqdm
  • Added binning functions to pelutils.ds.plot
  • Added pelutils.ds.distributions with scipy distributions that use same notation as Jim Pitman's "Probability"
  • Added examples directory, which currently only has plotting examples
  • Removed subfolder attribute on DataStorage as this was cause for much confusion and many issues
  • Changed pickle extension from p to pkl
  • Added option to clear existing data folder when saving with DataStorage
  • Added indent option to json files saved by DataStorage
  • Made naming of different submodules consistent
  • Added linecounter entry point that allows plotting development of code in repositories over time
  • Made get_repo return the absolute repository path
  • Made logger use absolute instead of relative paths preventing issues with changing working directory
  • Renamed vline to hline in Table as the line was horizontal rather than vertical
  • Added jsonl.dumps and jsonl.loads and made all functions in jsonl module use same naming scheme as json
  • Added support for Pytorch tensors to c_ptr
  • Added log.no_log for disabling logging
  • Renamed Levels to LogLevels as that makes it clearer in usage context
  • unique can now take non-contiguous arrays and also has an axis argument
  • Added a command line entry to examples, pelexamples
  • Added a binary_search for searching in ordered iterables in log(n) time
  • Added moving average function and variations thereof for plotting noisy data with uneven spacing
  • Made TickTock.tick void and TickTock.tock raise TickTockException if .tick not called first
  • Changed to rapidjson instead of the built-in json module for all json operations, for strictness and that sweet performance
  • Added reverse_line_iterator for iterating through a file backwards
  • Renamed throws to raises
  • Renamed HISTORY.md to CHANGELOG.md
  • Made get_timestamp arguments keyword-only
  • Added restore_argv decorator that allows safe testing of simulated sys.argv
  • Added except_keys function for removing a list of keys from a dictionary
  • Renamed thousand_seps to thousands_seperators
  • Made C functions work on most platforms
  • Added is_windows function
  • Removed TickTock.profile_iter
  • Added SimplePool for testing with multiprocessing.Pool
  • Renamed MainTest to UnitTestCollection
  • Removed BatchFeedForward
  • Added Figure plotting context class for easier plotting settings control
  • Added convenience function for adding dates on the x axis, get_dateticks

Bug fixes

  • Fixed a bug where a backslash would sometimes be printed from the logger before square brackets
  • Fixed raises throwing errors if error was not caught
  • Made thousands_seperators work with negative numbers
  • Made corr_ci handle all iterable types correctly

0.6.9 - Nice

  • Made load_jsonl load the file lazily

0.6.7

  • Logger can now be used without writing to file

0.6.6

  • Fixed parser naming when using config files and not multiple_jobs
  • Fixed parser naming when using cli only and multiple_jobs

0.6.5 - Breaking changes

  • Parser.parse now returns only a single experiment dict if multiple_jobs is False
  • Improved logger error messages
  • Added Parser.is_explicit to check if an argument was given explicitly, either from CLI or a config file
  • Fixed bug in parser, where if a type was not given, values from config files would not be used
  • Made fields that should not be used externally private in parser
  • Made pelutils.ds.unique slightly faster

0.6.4 - Breaking changes

  • Commit is now logged as DEBUG
  • Removed BatchFeedForward.update_net
  • BatchFeedForward no longer requires batch size and increase factor as an argument
  • Removed reset_cuda function, as it was too small and obscure a function and broke distributed training
  • Added ignore_missing field to DataStorage for ignoring missing fields in stored data

0.6.3 - Breaking changes

  • Fixed bug where TickTock profiles would sometimes not be printed in the correct order
  • Removed TickTock.reset
  • Added __len__ and __iter__ methods to TickTock
  • Added option to print standard deviation for profiles
  • Renamed TimeUnit to TimeUnits to follow enum naming scheme
  • Time unit lengths are now given in units/s rather than s/unit

0.6.2

  • TickTock.__str__ now raises a ValueError if profiling is still ongoing to prevent incorrect calculations
  • Printing a TickTock instance now indents percentage of time spent to indicate task subsets

0.6.1

  • Added subfolder argument to Parser.document_settings

0.6.0 - Breaking changes

  • A global instance of TickTock, TT, has been added - similar to log
  • Added TickTock.profile_iter for performing profiling over a for loop
  • Fixed wrong error being thrown when keyboard interrupting within with TT.profile(...)
  • All collected logs are now logged upon an exception being thrown when using log.log_errors and collect_logs
  • Made log.log_errors capable of handling chained exceptions
  • Made log.throw private, as it had little use and could be exploited
  • get_repo no longer throws an error if a repository has not been found
  • Added utility functions for reading and writing .jsonl files
  • Fixed incorrect torch installations breaking importing pelutils

0.5.9

  • Add split_path function which splits a path into components
  • Fix bug in MainTest where test files were not deleted

0.5.7

  • Logger prints to stderr instead of stdout at level WARNING or above
  • Added log.tqdm that disables printing while looping over a tqdm object
  • Fixed from __future__ import annotations breaking DataStorage

0.5.6

  • DataStorage can save all picklable formats + torch.Tensor specifically

0.5.5

  • Test logging now uses Levels.DEBUG by default
  • Added TickTock.fuse_multiple for combining several TickTock instances
  • Fixed bugs when using multiple TickTock instances
  • Allow multiple hits in single profile
  • Now possible to profile using with statement
  • Added method to logger to parse boolean user input
  • Added method to Table for adding vertical lines manually

0.5.4 - Breaking changes

  • Change log error colour

  • Replace default log level with print level that defaults to Levels.INFO

    __call__ now always defaults to Levels.INFO

  • Print microseconds as us instead of mus

0.5.3

  • Fixed missing regex requirement

0.5.2

  • Allowed disabling printing by default in logger

0.5.1

  • Fixed accidental rich formatting in logger
  • Fixed logger crashing when not configured

0.5.0 - Breaking changes

  • Added np.unique-style unique function to ds that runs in linear time but does not sort
  • Replaced verbose/non-verbose logging with logging levels similar to built-in logging module
  • Added with_print option to log.__call__
  • Undid change from 0.3.4 such that None is now logged again
  • Added format module. Currently supports tables
  • Updated stringification of profiles to include percentage of parent profile
  • Added throws function that checks if a functions throws an exception of a specific type
  • Use Rich for printing to console when logging

0.4.1

  • Added append mode to logger to append to old log files instead of overwriting

0.4.0

  • Added ds submodule for data science and machine learning utilities

    This includes PyTorch utility functions, statistics, and matplotlib default values

0.3.4

  • Logger now raises errors normally instead of using throw method

0.3.3

  • get_repo now accepts a custom path to search for a repo, as opposed to always using the working dir

0.3.2

  • log.input now also accepts iterables as input

    For such inputs, it will return a generator of user inputs

0.3.1 - Breaking changes

  • Added functionality to logger for logging repository commit

  • Removed function get_commit

  • Added function get_repo which returns repository path and commit

    It attempts to find a repository by searching from working directory and upwards

  • Updates to examples in README and other minor documentation changes

  • set_seeds no longer returns seed, as this is already given as input to the function

0.3.0 - Breaking changes

  • Only works for Python 3.7+

  • If logger has not been configured, it now does no logging instead of crashing

    This prevents dependencies that use the logger from crashing the program if it is not used

  • log.throw now also logs the actual error rather than just the stack trace

  • log now has public property is_verbose

  • Fixed with log.log_errors always throwing errors

  • Added code samples to README

  • Parser no longer automatically determines if experiments should be placed in subfolders

    Instead, this is given explicitly as an argument to __init__

    It also supports boolean flags in the config file

0.2.13

  • Readd clean method to logger

0.2.12 - Breaking changes

  • The logger is now solely a global variable

    Different loggers are handled internally in the global _Logger instance

0.2.11

  • Add catch property to logger to allow automatically logging errors in a with block
  • All code is now indented using spaces

0.2.10

  • Allow finer verbosity control in logger
  • Allow multiple log commands to be collected and logged at the same time
  • Add decorator for aforementioned feature
  • Change thousand_seps from TickTock method to stand-alone function in __init__
  • Verbose logging now has same signature as normal logging

0.2.8

  • Add code to execute code with specific environment variables

0.2.7

  • Fix error where the full stacktrace was not printed by log.throw

  • set_seeds now checks if torch is available

    This means torch seeds are still set without needing it as a dependency

0.2.6 - Breaking changes

  • Make Unverbose class private and update documentation
  • Update formatting when using .input

0.2.5

  • Add input method to logger

0.2.4

  • Better logging of errors

0.2.1 - Breaking changes

  • Removed torch as dependency

0.2.0 - Breaking changes

  • Logger is now a global variable, log

    Logging should happen by importing the log variable and calling .configure to set it up

    To reset the logger, .clean can be called

  • It is still possible to just import Logger and use it in the traditional way, though .configure should be called first

  • Changed timestamp function to give a cleaner output

  • get_commit now returns None if gitpython is not installed

0.1.2

  • Update documentation for logger and ticktock
  • Fix bug where separator was not an argument to Logger.__call__

0.1.0

  • Include DataStorage
  • Logger can throw errors and handle separators
  • TickTock includes time handling and units
  • Minor parser path changes

0.0.1

  • Logger, Parser, and TickTock added from previous projects
