Useful transforms for supporting apache beam pipelines.
Project description
Carling
Via Wikipedia:
Carlings are pieces of timber laid fore and aft under the deck of a ship, from one beam to another. They serve as a foundation for the whole body of the ship.
Useful transforms for supporting our apache beam pipelines.
Mapping transform utils
carling.Label
Labels all elements.
carling.Select
Removes all columns which are not specified in *keys
.
carling.Project
Transforms each element into a tuple of values of the specified columns.
carling.IndexBy
Transforms each element V
into a tuple (K, V)
.
K
is the projection of V
by *keys
, which is equal to the tuple
produced by the Project
transform.
carling.Stringify
Transforms each element into its JSON representation.
carling.IndexBySingle
Transforms each element V
into a tuple (K, V)
.
The difference between IndexBySingle(key)
and IndexBy(key)
with a single
key is as follows:
IndexBySingle
produces the index as a plain value.IndexBy
produces the index as a single-element tuple.
carling.RenameFromTo
Rename columns according to from_to_key_mapping
.
carling.Exclude
Removes all columns specified in *keys
.
Grouping transform utils
Generic grouping transform utils
carling.UniqueOnly
Produces elements that are the only elements per key after deduplication.
Given a PCollection
of (K, V)
,
this transform produces the collection of all V
s that do not share
the same corresponding K
s with any other elements after deduplicating
all equivalent (K, V)
pairs.
This transform is equivalent to SingletonOnly
with apache_beam.Distinct
.
[(1, "A"), (2, "B1"), (2, "B2"), (3, "C"), (3, "C"), (4, "A")]
will be
transformed into ["A", "C", "A"]
.
carling.SingletonOnly
Produces elements that are the only elements per key.
Given a PCollection
of (K, V)
,
this transform produces the collection of all V
s that do not share
the same corresponding K
s with any other elements.
[(1, "A"), (2, "B1"), (2, "B2"), (3, "C"), (3, "C"), (4, "A")]
will be
transformed into ["A", "A"]
.
carling.Intersection
Produces the intersection of given PCollection
s.
Given a list of PCollection
s,
this transform produces every element that appears in all collections of
the list.
Elements are deduplicated before taking the intersection.
carling.FilterByKey
Filters elements by their keys.
The constructor receives one or more PCollection
s of K
s,
which are regarded as key lists.
Given a PCollection
of (K, V)
,
this transform discards all elements with K
s that do not appear
in the key lists.
If multiple collections are given to the constructor, this transform treats the intersection of them as the key list.
carling.FilterByKeyUsingSideInput
Filters a single collection by a single lookup collection, using a common key.
Given: - a PCollection
(lookup_entries) of (V)
, as a lookup collection - a PCollection
(pcoll) of (V)
, as values to be filtered - a common key (filter_key)
A dictionary called filter_dict
- is created by mapping the value of filter_key
for each entry in lookup_entries
to True.
Then, for each item in pcoll, the value associated with filter_key
checkd against
filter_dict
, and if it is found, the entry passes through. Otherwise, the entry is
discarded.
Note: lookup_entries
will be used as a side input, so care
must be taken regarding the size of the lookup_entries
carling.DifferencePerKey
Produces the difference per key between two PCollection
s.
Given two PCollection
s of V
,
this transform indexes the collections by the specified keys primary_keys
,
compares corresponding two V
lists for every K
,
and produces the difference per K
.
If there is no difference, this transform produces nothing.
Two V
lists are considered to be different if the numbers of elements
differ or two elements of the lists with a same index differ
at one of the specified columns columns
.
carling.SortedSelectPerKey
- Groups items by a set of
keys
-- column names per row - Emits the "MAX" value for each collection as defined by the
key_fn
- Can emit "MIN" by passing
reverse=True
kwarg
carling.PartitionRowsContainingNone
Emits two tagged PCollections:
- Default (
result[None]
): Rows are guaranteed not to have anyNone
values result["contains_none"]
: Rows for which at least one column had aNone
value
Categorical
carling.CreateCategoricalDicts
For a set of columnular data inputs this function takes:
- cat_cols:
Type: `[str]`
An array of "categorical" columns
- existing_dicts:
Type: `PCollection[(string, string, int)]`
Rows of tuples of type:
(column, previously_seen_value, mapped_unique_int)
Mapping a set of "previously seen" keys to unique int values for each
column.
Not optional.
If none exist, pass an empty PCollection.
It then creates a transform which takes a pcollection and
- looks at the input pcoll for unseen values in each categorical column
- creates new unique integers for each distinct unseen value, starting at max(previous value for column)+1
- ammends the existing mappings with (col, unseen_value, new_unique_int)
Output is:
- Type: `PCollection[(string, string, int)]`
This is useful for preparing data to be trained by eg. LightGBM
carling.ReplaceCategoricalColumns
-
Utilizes the "categorical dictionary rows" generated by
CreateCategoricalDicts
which is a list of pairs of type of(column, value,unique_int)
. -
Replaces each column with the appropriate value found in the mapping.
Test Utils
carling.test_utils.pprint_equal_to
This module contains the equal_to
function from apache beam, but adapted to output results using pretty print. Reading the results as a large, unformatted string makes it harder to pick out what changed/is missing.
General Util
carling.LogSample
Print items of the given PCollection
to the log.
LogSample
prints the JSON representations of the input items to the Python's
standard logging system.
To avoid too much log entries being printed, LogSample
limits the number of
logged items. The constructor parameter n
determines the limit.
By default, LogSample
prints logs with the INFO
log level. The constructor
parameter level
determines the level.
carling.ReifyMultiValueOption
Prepares multi-value delimited options. Useful in contexts where you want to create a multi-value option in a template environment.
- inputs:
- delimited string option
- optional delimiter string (default is "|")
- output:
- Type:
PCollection[str]
- Type:
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file carling-0.4.0.tar.gz
.
File metadata
- Download URL: carling-0.4.0.tar.gz
- Upload date:
- Size: 16.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.7.17 Linux/6.5.0-1016-azure
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f256b432b9a822b1079661a3095823cb852692b144267d3900dd56706e5ce6c9 |
|
MD5 | 178173224f9474585d3e65cd0da3a6f3 |
|
BLAKE2b-256 | 65e90c064e685034e2e4127037f6ab4a57a92a1cc001ce2dcd5d3ced86681f7e |
File details
Details for the file carling-0.4.0-py3-none-any.whl
.
File metadata
- Download URL: carling-0.4.0-py3-none-any.whl
- Upload date:
- Size: 18.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/1.5.1 CPython/3.7.17 Linux/6.5.0-1016-azure
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1e42de1e359cf24309e10a33cafa1ada72ae8b0ad4b311703c48dec103f8f6b4 |
|
MD5 | edea013b66af4e3902887fb9d8ec083c |
|
BLAKE2b-256 | aa7a050846109e0c9f878260b5d472247370d34e5dcf390a8eac2d8e0ed0b805 |