datawelder

Joins large dataframes together

Perform SQL-like JOINs on large file-like dataframes.

Do you have tons of larger-than-memory datasets lying around on your file system?
Do you dream of an easy way to join them together?
Do you want to achieve this without using a database?

If the answers to the above questions are “yes”, then datawelder is for you!

Example

First, examine our toy dataset. It contains country names and currencies in two separate tables.

$ head -n 5 sampledata/names.csv
iso3,name
AND,Principality of Andorra
ARE,United Arab Emirates
AFG,Islamic Republic of Afghanistan
ATG,Antigua and Barbuda
$ head -n 5 sampledata/currencies.csv
iso3,currency
AND,Euro
ARE,Dirham
AFG,Afghani
ATG,Dollar

We can join these two dataframes as follows:

$ python -m datawelder.partition sampledata/names.csv partitions/names 5
$ python -m datawelder.partition sampledata/currencies.csv partitions/currencies 5
$ python -m datawelder.join out.csv partitions/names partitions/currencies --format csv
$ grep AND out.csv
AND,Principality of Andorra,AND,Euro

Tweaking

You can specify the partition key explicitly:

$ python -m datawelder.partition sampledata/names.csv partitions/names 5 --keyindex 0
$ python -m datawelder.partition sampledata/names.csv partitions/names 5 --keyname iso3

You can specify any format parameters (e.g. CSV delimiter) explicitly:

$ python -m datawelder.partition sampledata/names.csv partitions/names 5 --fmtparams delimiter=',' lineterminator='\n'

Similarly, for output:

$ python -m datawelder.join out.csv partitions/names partitions/currencies --format csv --fmtparams delimiter=';'
$ grep AND out.csv
AND;Principality of Andorra;AND;Euro
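
The parameter names used above (delimiter, lineterminator) coincide with the formatting parameters of Python's standard csv module. Assuming datawelder hands them to that module (the description does not say so explicitly), their effect can be sketched as follows. The rewrite_csv helper is hypothetical and not part of datawelder.

```python
import csv
import io

def rewrite_csv(text, in_params, out_params):
    """Read CSV text with one set of format parameters, write it with another."""
    rows = list(csv.reader(io.StringIO(text, newline=""), **in_params))
    buf = io.StringIO(newline="")
    csv.writer(buf, **out_params).writerows(rows)
    return buf.getvalue()

source = "iso3,name\nAND,Principality of Andorra\n"
converted = rewrite_csv(
    source,
    in_params={"delimiter": ",", "lineterminator": "\n"},   # as for partition
    out_params={"delimiter": ";", "lineterminator": "\n"},  # as for join
)
print(converted)
```

Note the quotes around the semicolon in the shell command: an unquoted ; would end the command before the delimiter value reached the program.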

Other formats work transparently:

$ python -m datawelder.partition sampledata/names.json partitions/names 5 --keyname iso3

Once you partition your datasets, it doesn’t matter what format they were originally in. You can merge them with any other partitioned dataset with ease:

$ python -m datawelder.join out.json partitions/names partitions/currencies --format json --subs 1
$ head -n 5 out.json
{"iso3": "AGO", "name": "Republic of Angola", "iso3_1": "AGO", "currency": "Kwanza"}
{"iso3": "AND", "name": "Principality of Andorra", "iso3_1": "AND", "currency": "Euro"}
{"iso3": "ARM", "name": "Republic of Armenia", "iso3_1": "ARM", "currency": "Dram"}
{"iso3": "ATF", "name": "French Southern and Antarctic Lands", "iso3_1": "ATF", "currency": "Euro"}
{"iso3": "AZE", "name": "Republic of Azerbaijan", "iso3_1": "AZE", "currency": "Manat"}

You can also select a subset of fields to keep (similar to SQL SELECT):

$ python -m datawelder.join out.csv partitions/names partitions/currencies --format csv --select name,currency --subs 1
$ head -n 5 out.csv
name,currency
Republic of Angola,Kwanza
Principality of Andorra,Euro
Republic of Armenia,Dram
French Southern and Antarctic Lands,Euro

A column name can be prefixed by the number of the dataframe it came from, counting from zero. For example, 1.currency means “the currency field from dataframe 1”, i.e. the second dataframe passed to the join (here, partitions/currencies).

You can also rename the selected fields as desired (again, similar to SQL SELECT):

$ python -m datawelder.join out.csv partitions/names partitions/currencies --format csv --select 'name as country_name, currency as curr' --subs 1
$ head -n 5 out.csv
country_name,curr
Republic of Angola,Kwanza
Principality of Andorra,Euro
Republic of Armenia,Dram
French Southern and Antarctic Lands,Euro
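
As a rough illustration of the SELECT-style syntax, the sketch below parses a select string and applies it to one joined row. The functions parse_select and apply_select are hypothetical and not part of datawelder, and the row layout (one dict per dataframe) is an assumption made for the example; datawelder's own parser may accept more or less than this.

```python
import re

# One select item: optional "<n>." prefix, a field name, optional "as <alias>".
ITEM = re.compile(r"\s*(?:(\d+)\.)?(\w+)(?:\s+as\s+(\w+))?\s*", re.IGNORECASE)

def parse_select(expr):
    """Parse 'field [as alias], ...' into (frame, field, alias) tuples.

    frame is an int when the field is written as '<n>.<field>', else None.
    """
    columns = []
    for part in expr.split(","):
        match = ITEM.fullmatch(part)
        if match is None:
            raise ValueError(f"cannot parse select item: {part!r}")
        frame, field, alias = match.groups()
        columns.append((int(frame) if frame else None, field, alias or field))
    return columns

def apply_select(columns, records):
    """Pick the selected fields from a list of per-dataframe records (dicts)."""
    output = {}
    for frame, field, alias in columns:
        if frame is not None:
            output[alias] = records[frame][field]
        else:
            # Unprefixed name: take it from the first dataframe that has it.
            output[alias] = next(r[field] for r in records if field in r)
    return output

row = [
    {"iso3": "AND", "name": "Principality of Andorra"},  # dataframe 0
    {"iso3": "AND", "currency": "Euro"},                 # dataframe 1
]
selected = apply_select(parse_select("name as country_name, 1.currency as curr"), row)
print(selected)
```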

Finally, you can use multiple processes for joining via --subs. The default is the number of CPUs. Because the processes run concurrently, the order of the rows in the output file may vary from run to run, but this does not affect the integrity of the data.

$ python -m datawelder.join out.csv partitions/names partitions/currencies --format csv --select '0.name as name, 1.currency as curr' --subs 4

How does it work?

First, datawelder partitions each dataset using a partition key. We used 5 partitions because the datasets are tiny, but you can specify an arbitrary number of partitions when working with real data.

In this case, it automatically identified the format of the file as CSV. You can give it a helping hand by specifying the format and relevant parameters (e.g. field separator, quoting, etc.) manually.

We did not specify a partition key in the above example, so datawelder picked the default for us: the first column. You can override this with --keyindex or --keyname.
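
The partition-then-join idea can be sketched in a few lines of plain Python. This is not datawelder's implementation: the hash function (CRC32), the in-memory lists standing in for partition files, the choice of an inner join, and the function names are all assumptions made for illustration. The point is that records with the same key always land in the same partition number, so partition i of one dataset only ever needs to be matched against partition i of the other. Unlike the output shown earlier, the sketch keeps the key column only once.

```python
import zlib

def partition(records, key, num_partitions):
    """Split records into num_partitions buckets by a stable hash of the key."""
    buckets = [[] for _ in range(num_partitions)]
    for record in records:
        index = zlib.crc32(str(record[key]).encode("utf-8")) % num_partitions
        buckets[index].append(record)
    return buckets

def join_partitions(left_parts, right_parts, key):
    """Inner-join two partitioned datasets one partition pair at a time."""
    for left, right in zip(left_parts, right_parts):
        # With file-backed partitions, only this pair would need to be loaded.
        lookup = {}
        for record in right:
            lookup.setdefault(record[key], []).append(record)
        for record in left:
            for match in lookup.get(record[key], []):
                extra = {k: v for k, v in match.items() if k != key}
                yield {**record, **extra}

names = [
    {"iso3": "AND", "name": "Principality of Andorra"},
    {"iso3": "ARE", "name": "United Arab Emirates"},
]
currencies = [
    {"iso3": "ARE", "currency": "Dirham"},
    {"iso3": "AND", "currency": "Euro"},
]

joined = sorted(
    join_partitions(partition(names, "iso3", 5), partition(currencies, "iso3", 5), "iso3"),
    key=lambda row: row["iso3"],
)
print(joined)
```

Because each partition pair is independent of the others, the pairs can also be handed to separate worker processes, which is consistent with the note above that row order may differ between runs.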

Features

Parallelization across multiple cores via subprocess/multiprocessing
Access to cloud storage (e.g. S3) for reading and writing via smart_open, so you do not have to store anything locally
Read/write various file formats (CSV, JSON, pickle) out of the box
Flexible API for dealing with file format edge cases

Release history

0.9.0: Apr 13, 2021 (this version)
0.8.0: Mar 10, 2021
0.7.0: Feb 8, 2021
0.6.0: Jan 20, 2021
0.5.1: Jan 19, 2021
0.5.0: Jan 12, 2021
0.4.2: Jan 8, 2021
0.4.1: Jan 8, 2021
0.4.0: Jan 6, 2021
0.3.3: Jan 2, 2021
0.3.2: Jan 2, 2021
0.3.1: Dec 22, 2020
0.3.0: Dec 10, 2020
0.2.7: Dec 10, 2020
0.2.6: Dec 8, 2020
0.2.5: Dec 8, 2020
0.2.4: Nov 27, 2020
0.2.3: Nov 21, 2020
0.2.2: Nov 2, 2020
0.2.1: Oct 22, 2020
0.2.0: Oct 19, 2020
0.1.1: Oct 16, 2020
0.1.0: Oct 16, 2020

Download files

Source Distribution

datawelder-0.9.0.tar.gz (19.5 kB)

Uploaded Apr 13, 2021 Source

File details

Details for the file datawelder-0.9.0.tar.gz.

File metadata

Download URL: datawelder-0.9.0.tar.gz
Upload date: Apr 13, 2021
Size: 19.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.8

File hashes

Hashes for datawelder-0.9.0.tar.gz
Algorithm	Hash digest
SHA256	`0c75b34ea7fac0e4535b54305ae921b4c4da0f2be7cf1039b3497f16e45688b4`
MD5	`59cd21cb15e488f702468dcc845cfe56`
BLAKE2b-256	`a6317c2d002d0a17c9a1a68305b5f2691aed1bde32c2daf673598fa567874838`
