Samplitude (s8e) is a statistical distributions command line tool

samplitude

CLI generation and plotting of random variables:

$ samplitude "sin(0.31415) | sample(6) | round | cli"
0.0
0.309
0.588
0.809
0.951
1.0

The word samplitude is a portmanteau of sample and amplitude. This project also started as an étude, hence should be pronounced sampl-étude.

samplitude is a chain starting with a generator, followed by zero or more filters, followed by a consumer. Most generators are infinite (with the exception of range and lists and possibly stdin). Some of the filters can turn infinite generators into finite generators (like sample and gobble), and some filters can turn finite generators into infinite generators, such as choice.
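
The chain can be pictured with plain Python iterators. The following is a minimal sketch of the idea, not samplitude's actual implementation; the helper names uniform_gen, sample and cli_lines are hypothetical stand-ins for a generator, a filter and a consumer.

```python
import itertools
import random


def uniform_gen(a, b, rng):
    """Infinite generator: yields uniform random numbers forever."""
    while True:
        yield rng.uniform(a, b)


def sample(gen, n):
    """Filter: turns a (possibly infinite) generator into one with at most n items."""
    return itertools.islice(gen, n)


def cli_lines(gen):
    """Consumer: exhausts its input and returns one text line per element."""
    return [str(x) for x in gen]


rng = random.Random(42)  # fixed seed so the run is repeatable
lines = cli_lines(round(x, 2) for x in sample(uniform_gen(0, 5, rng), 5))
print("\n".join(lines))
```

Nothing is computed until the consumer pulls values through the chain, which is why an infinite generator is harmless as long as a filter such as sample bounds it first.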

Consumers are filters that necessarily flush the input; list, cli, tojson, unique, and the plotting tools, hist, scatter and line are examples of consumers. The list consumer is a Jinja2 built-in, and other Jinja2 consumers are sum, min, and max:

samplitude "sin(0.31415) | sample(5) | round | max | cli"
0.951

For simplicity, s8e is an alias for samplitude.

Generators

In addition to the standard range function, we support the following infinite generators:

  • exponential(lambd): lambd is 1.0 divided by the desired mean.
  • uniform(a, b): Get a random number in the range [a, b) or [a, b] depending on rounding.
  • gauss(mu, sigma): mu is the mean, and sigma is the standard deviation.
  • normal(mu, sigma): as above
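  • lognormal(mu, sigma): mu and sigma are the mean and standard deviation of the underlying normal distribution, i.e. of the logarithm of the samples.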
  • triangular(low, high): Continuous distribution bounded by given lower and upper limits, and having a given mode value in-between.
  • beta(alpha, beta): Conditions on the parameters are alpha > 0 and beta > 0. Returned values range between 0 and 1.
  • gamma(alpha, beta): Conditions on the parameters are alpha > 0 and beta > 0.
  • weibull(alpha, beta): alpha is the scale parameter and beta is the shape parameter.
  • pareto(alpha): Pareto distribution. alpha is the shape parameter.
  • vonmises(mu, kappa): mu is the mean angle, expressed in radians between 0 and 2*pi, and kappa is the concentration parameter, which must be greater than or equal to zero. If kappa is equal to zero, this distribution reduces to a uniform random angle over the range 0 to 2*pi.
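
The note on lambd can be checked with Python's standard random.expovariate, which uses the same parameter convention. Whether samplitude delegates to that function is an assumption; the sketch only illustrates that lambd = 1 / mean.

```python
import random
import statistics

desired_mean = 4.0
lambd = 1.0 / desired_mean  # rate parameter, as described for exponential(lambd)

rng = random.Random(0)  # fixed seed for repeatability
draws = [rng.expovariate(lambd) for _ in range(100_000)]
empirical_mean = statistics.fmean(draws)
print(round(empirical_mean, 1))
```

With this many draws the empirical mean should land close to the desired mean of 4.0.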

We have a special infinite generator (filter) that works on finite generators:

  • choice,

whose behaviour is explained below.

For input from files, either use words with a specified environment variable DICTIONARY, or pipe through

  • stdin()

which reads from stdin.

If the file is a csv file, there is a csv generator that reads a csv file with Pandas and outputs the first column (if nothing else is specified). Specify the column with either an integer index or a column name:

>>> s8e "csv('iris.csv', 'virginica') | counter | cli"
0 50
1 50
2 50

Finally, we have combinations and permutations that are inherited from itertools and behave exactly like those.

s8e "'ABC' | permutations | cli"

However, the output of this is rather non-UNIXy, with the abstractions leaking through:

s8e "'HT' | permutations | cli"
('H', 'T')
('T', 'H')

So to get a better output, we can use an elementwise join elt_join:

s8e "'HT' | permutations | elt_join | cli"
H T
T H

which also takes a separator as argument:

samplitude "'HT' | permutations | elt_join(';') | cli"
H;T
T;H

This is already supported by Jinja's map function (notice the quotes around join):

samplitude "'HT' | permutations | map('join', ';') | cli"
H;T
T;H
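
In plain Python, the element-wise join amounts to joining each tuple produced by itertools.permutations. This is a sketch of the equivalent computation, not of how elt_join is implemented:

```python
from itertools import permutations

# Each permutation is a tuple; joining its elements gives one line of text.
joined = [";".join(p) for p in permutations("HT")]
print("\n".join(joined))
```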

We can thus count the number of permutations of a set of size 10:

s8e "range(10) | permutations | len"
3628800
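
The count is 10!, which can be cross-checked directly with itertools and math.factorial (a plain-Python sketch of the same computation):

```python
import math
from itertools import permutations

n = 10
# Iterate lazily and count, so the tuples are never stored in memory.
count = sum(1 for _ in permutations(range(n)))
print(count, count == math.factorial(n))
```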

The product generator takes two generators and computes a cross-product of these. In addition,

A warning about infinity

All generators are (potentially) infinite generators, and must be sampled with sample(n) before consuming!

Usage and installation

Install with

pip install samplitude

or, to get the bleeding-edge release,

pip install git+https://github.com/pgdr/samplitude

Examples

This is pure Jinja2:

>>> samplitude "range(5) | list"
[0, 1, 2, 3, 4]

However, to get a more UNIXy output, we use cli instead of list:

>>> s8e "range(5) | cli"
0
1
2
3
4

To limit the output, we use sample(n):

>>> s8e "range(1000) | sample(5) | cli"
0
1
2
3
4

That isn't very helpful on the range generator, but is much more helpful on an infinite generator, such as the uniform generator:

>>> s8e "uniform(0, 5) | sample(5) | cli"
3.3900198868059235
1.2002767137709318
0.40999391897569126
1.9394585953696264
4.37327472704115

We can round the output in case we don't need as many digits (note that round is a generator as well and can be placed on either side of sample):

>>> s8e "uniform(0, 5) | round(2) | sample(5) | cli"
4.58
4.33
1.87
2.09
4.8

Selection and modifications

The sample behavior is equivalent to that of the head program, or of take in languages such as Haskell. The head alias is supported:

>>> samplitude "uniform(0, 5) | round(2) | head(5) | cli"
4.58
4.33
1.87
2.09
4.8

drop is also available:

>>> s8e "uniform(0, 5) | round(2) | drop(2) | head(3) | cli"
1.87
2.09
4.8
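
Both filters map naturally onto itertools.islice. The following is a sketch of the semantics described above, with head and drop written as hypothetical plain-Python helpers rather than samplitude's own code:

```python
from itertools import count, islice


def head(gen, n):
    """Keep only the first n elements."""
    return islice(gen, n)


def drop(gen, n):
    """Skip the first n elements and pass the rest through."""
    return islice(gen, n, None)


# count(0) is an infinite generator 0, 1, 2, ...
result = list(head(drop(count(0), 2), 3))
print(result)  # [2, 3, 4]
```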

To shift and scale distributions, we can use the shift(s) and scale(s) filters. To get a Poisson point process starting at 15, we can run

>>> s8e "poisson(0.3) | shift(15)"  # equivalent to exponential(0.3)...

Both shift and scale work on generators, so to add sin(0.1) and sin(0.2), we can run

>>> s8e "sin(0.1) | shift(sin(0.2)) | sample(10) | cli"

Figure: sin(0.1)+sin(0.2) line
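
A sketch of what this chain computes, under two assumptions read off the examples in this document: sin(x) yields sin(0), sin(x), sin(2x), ..., and shift with a generator argument adds the two streams element by element. The names sin_gen and shift are hypothetical helpers.

```python
import math
from itertools import count, islice


def sin_gen(step):
    """Infinite generator yielding sin(0), sin(step), sin(2*step), ..."""
    return (math.sin(k * step) for k in count())


def shift(gen, other):
    """Element-wise sum of two generators (assumed meaning of shift(generator))."""
    return (a + b for a, b in zip(gen, other))


summed = list(islice(shift(sin_gen(0.1), sin_gen(0.2)), 10))
for value in summed:
    print(round(value, 3))
```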

Choices and other operations

Using choice with a finite generator gives an infinite generator that chooses from the provided generator:

>>> samplitude "range(0, 11, 2) | choice | sample(6) | cli"
8
0
8
10
4
6
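
One way to picture choice is as a generator that first collects its finite input and then picks from it forever. This is a minimal sketch of that idea, not necessarily how samplitude implements it; the rng argument is added here only to make the run repeatable.

```python
import random
from itertools import islice


def choice(gen, rng):
    """Turn a finite iterable into an infinite stream of random picks from it."""
    pool = list(gen)  # the finite input must be consumed before picking
    while True:
        yield rng.choice(pool)


rng = random.Random(42)  # fixed seed for repeatability
picks = list(islice(choice(range(0, 11, 2), rng), 6))
print(picks)
```

Because the input is materialised into a list, this approach cannot work on an infinite generator.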

Jinja2 supports more generic lists, e.g., lists of strings. Hence, we can write

>>> s8e "['win', 'draw', 'loss'] | choice | sample(6) | sort | cli"
draw
draw
draw
loss
win
win

... and as in Python, strings are also iterable:

>>> s8e "'HT' | cli"
H
T

... so we can flip six coins with

>>> s8e "'HT' | choice | sample(6) | cli"
H
T
T
H
H
H

We can flip 100 coins and count the output with counter (which is collections.Counter)

>>> s8e "'HT' | choice | sample(100) | counter | cli"
H 47
T 53

The sort functionality works as expected on a Counter object (a dict type), so if we want the output sorted by key, we can run

>>> s8e "range(1,7) | choice | sample(100) | counter | sort | elt_join | cli" 42 # seed=42
1 17
2 21
3 12
4 21
5 13
6 16

There is a minor hack to sort by value, namely by swapping the Counter twice:

>>> s8e "range(1,7) | choice | sample(100) |
         counter | swap | sort | swap | elt_join | cli" 42 # seed=42
3 12
5 13
6 16
1 17
2 21
4 21

The swap filter does an element-wise reverse, with element-wise reverse defined on a dictionary as a list of (value, key) for each key-value pair in the dictionary.
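
The swap, sort, swap trick can be reproduced in plain Python. The sketch below uses a hypothetical swap helper that follows the definition above, applied to the counts from the seeded run:

```python
from collections import Counter


def swap(pairs):
    """Reverse each pair; a dict is treated as its (key, value) items."""
    items = pairs.items() if isinstance(pairs, dict) else pairs
    return [(b, a) for a, b in items]


counts = Counter({1: 17, 2: 21, 3: 12, 4: 21, 5: 13, 6: 16})  # counts from the run above
# After the first swap the pairs are (value, key), so sorting orders by value.
by_value = swap(sorted(swap(counts)))
print(by_value)
```

Ties in the value (here 21) are broken by the key, since tuples compare element by element.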

Using stdin() as a generator, we can pipe into samplitude. Beware that stdin() flushes the input, hence stdin (currently) does not work with infinite input streams.

>>> ls | samplitude "stdin() | choice | sample(1) | cli"
some_file

Then, if we ever wanted to shuffle ls we can run

>>> ls | samplitude "stdin() | shuffle | cli"
some_file
>>> cat FILE | samplitude "stdin() | cli"
# NOOP; cats FILE

The fun powder plot

For fun, if you have installed matplotlib, we support plotting, hist being the most useful.

>>> samplitude "normal(100, 5) | sample(1000) | hist"

Figure: normal distribution

An exponential distribution can be plotted with exponential(lambd). Note that the cli output must be the last filter in the chain, as that is a command-line utility only:

>>> s8e "normal(100, 5) | sample(1000) | hist | cli"

Figure: exponential distribution

To suppress output after plotting, you can use the gobble filter to empty the pipe:

>>> s8e "normal(100, 5) | sample(1000) | hist | gobble"

Although hist is the most useful, one could imagine running s8e on time series, where a line plot makes the most sense:

>>> s8e "sin(22/700) | sample(200) | line"

Figure: sine and line

The scatter function can also be used, but requires that the input stream is a stream of pairs, which can be obtained either by the product generator, or via the pair or counter filter:

s8e "normal(100, 10) | sample(10**5) | round(0) | counter | scatter"

Figure: scatter normal

Fourier

A Fourier transform is offered as a filter fft:

>>> samplitude "sin(0.1) | shift(sin(0.2)) | sample(1000) | fft | line | gobble"

Figure: fft line
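
To see what the spectrum of this signal should look like, here is a standard-library sketch that uses a naive discrete Fourier transform instead of an FFT. What exactly the fft filter emits is not specified here, so looking at magnitudes is an assumption, and dft_magnitudes is a hypothetical helper.

```python
import cmath
import math

N = 1000
# sin(0.1*k) + sin(0.2*k), the same signal as in the command above
signal = [math.sin(0.1 * k) + math.sin(0.2 * k) for k in range(N)]


def dft_magnitudes(xs, bins):
    """Magnitudes of the first `bins` coefficients of a naive O(N*bins) DFT."""
    n = len(xs)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * f * k / n) for k, x in enumerate(xs)))
        for f in range(bins)
    ]


magnitudes = dft_magnitudes(signal, N // 2)
# The two strongest bins sit near 0.1*N/(2*pi) and 0.2*N/(2*pi).
peaks = sorted(sorted(range(len(magnitudes)), key=magnitudes.__getitem__)[-2:])
print(peaks)
```

The two sine components show up as two isolated peaks, one at roughly twice the frequency of the other.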

Your own filter

If you use Samplitude programmatically, you can register your own filter by sending a dictionary

{'name1' : filter1,
 'name2' : filter2,
 ...,
 'namen': filtern,
}

to the samplitude function.

Example: secretary problem

Suppose you want to emulate the secretary problem ...

Intermezzo: The problem

For those not familiar, you are a boss, Alice, who wants to hire a new secretary Bob. Suppose you want to hire the tallest Bob of all your candidates, but the candidates arrive in a stream, and you only know the number of candidates. For each candidate, you have to accept (hire) or reject the candidate. Once you have rejected a candidate, you cannot undo the decision.

The solution to this problem is to look at the first n/e candidates (e ≈ 2.71828 being Euler's number) without hiring any of them, and thereafter accept the first candidate taller than all of those first n/e candidates.
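
The classical analysis says this rule picks the very tallest candidate with probability approaching 1/e ≈ 0.37 as n grows. The following plain-Python simulation is a sketch for checking that figure empirically; pick_with_cutoff is a hypothetical helper and is independent of samplitude.

```python
import math
import random


def pick_with_cutoff(heights, cutoff):
    """Reject the first `cutoff` candidates, then accept the first one taller
    than all of them; fall back to the last candidate if none qualifies."""
    best_seen = max(heights[:cutoff], default=-math.inf)
    for h in heights[cutoff:]:
        if h > best_seen:
            return h
    return heights[-1]


rng = random.Random(1)  # fixed seed for repeatability
n, trials = 100, 5000
cutoff = int(n / math.e)  # 36 for n = 100
hits = 0
for _ in range(trials):
    heights = [rng.gauss(170, 10) for _ in range(n)]
    hits += pick_with_cutoff(heights, cutoff) == max(heights)
success_rate = hits / trials
print(round(success_rate, 2))
```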

A Samplitude solution

Let normal(170, 10) be the candidate generator, and let n=100. We create a filter secretary that takes a stream and an integer (n) and picks according to the solution. In order to assess the quality of the solution, we want to restream the entire population, and annotate the one we choose. Let (c, False) denote a candidate we rejected, and (c, True) denote the candidate we accepted.

def secretary(gen, n):
    import math
    explore = int(n / math.e)  # size of the look-only phase, about n/e
    target = None  # tallest candidate seen during the look-only phase
    candidate_found = False
    i = 0
    for c in gen:
        if i <= explore:
            if target is None or c > target:
                target = c
            yield (c, False)
        else:
            if c > target and not candidate_found:
                candidate_found = True
                yield (c, True)
            elif i == n-1 and not candidate_found:
                yield (c, True)  # we failed, must pick last candidate!
                return
            else:
                yield (c, False)
        i += 1
        if i == n:
            return

Now, to emulate the secretary problem with Samplitude:

from samplitude import samplitude as s8e

# insert above secretary function

n = 100
filters = {'secretary': secretary}

solution = s8e('normal(170, 10) | secretary(%d) | list' % n, filters=filters)
solution = eval(solution)  # Samplitude returns an eval-able string
cands = map(lambda x: x[0], solution)
opt = [s[0] for s in solution if s[1]][0]
# the next line prints in which position the candidate is
print(1+sorted(cands, reverse=True).index(opt), '/', n)

In about 67% of the cases we can expect a rank of ~1/100 or ~2/100, whereas in the remaining 33% of the cases we expect a rank somewhere a bit below 50th.

Figure: Secretary selection
