Skip to main content

Joins large dataframes together

Project description

Perform SQL-like JOINs on large file-like dataframes.

  • Do you have tons of larger-than-memory datasets lying around on your file system?

  • Do you dream of an easy way to join them together?

  • Do you want to achieve this without using a database?

If the answers to the above questions are “yes”, then datawelder is for you!

Example

First, examine our toy dataset. It contains country names and currencies in two separate tables.

$ head -n 5 sampledata/names.csv
iso3,name
AND,Principality of Andorra
ARE,United Arab Emirates
AFG,Islamic Republic of Afghanistan
ATG,Antigua and Barbuda
$ head -n 5 sampledata/currencies.csv
iso3,currency
AND,Euro
ARE,Dirham
AFG,Afghani
ATG,Dollar

We can join these two dataframes as follows:

$ python -m datawelder.partition sampledata/names.csv partitions/names 5
$ python -m datawelder.partition sampledata/currencies.csv partitions/currencies 5
$ python -m datawelder.join out.csv partitions/names partitions/currencies --format csv
$ grep AND out.csv
AND,Principality of Andorra,AND,Euro

Tweaking

You can specify the partition key explicitly:

$ python -m datawelder.partition sampledata/names.csv partitions/names 5 --keyindex 0
$ python -m datawelder.partition sampledata/names.csv partitions/names 5 --keyname iso3

You can specify any format parameters (e.g. CSV delimiter) explicitly:

$ python -m datawelder.partition sampledata/names.csv partitions/names 5 --fmtparams delimiter=',' lineterminator='\n'

Similarly, for output:

$ python -m datawelder.join out.csv partitions/names partitions/currencies --format csv --fmtparams delimiter=;
$ grep AND out.csv
AND;Principality of Andorra;AND;Euro

Other formats work transparently:

$ python -m datawelder.partition sampledata/names.json partitions/names 5 --keyname iso3

Once you partition your datasets, it doesn’t matter what format they were originally in. You can merge them with any other partitioned dataset with ease:

$ python -m datawelder.join out.json partitions/names partitions/currencies --format json --subs 1
$ head -n 5 out.json
{"iso3": "AGO", "name": "Republic of Angola", "iso3_1": "AGO", "currency": "Kwanza"}
{"iso3": "AND", "name": "Principality of Andorra", "iso3_1": "AND", "currency": "Euro"}
{"iso3": "ARM", "name": "Republic of Armenia", "iso3_1": "ARM", "currency": "Dram"}
{"iso3": "ATF", "name": "French Southern and Antarctic Lands", "iso3_1": "ATF", "currency": "Euro"}
{"iso3": "AZE", "name": "Republic of Azerbaijan", "iso3_1": "AZE", "currency": "Manat"}

You can also select a subset of fields to keep (similar to SQL SELECT):

$ python -m datawelder.join out.csv partitions/names partitions/currencies --format csv --select name,currency --subs 1
$ head -n 5 out.csv
name,currency
Republic of Angola,Kwanza
Principality of Andorra,Euro
Republic of Armenia,Dram
French Southern and Antarctic Lands,Euro

The name of each column is prefixed by the number of the dataframe it came from. For example, 1.currency means “the currency field from dataframe 1”.

You can also rename the selected fields as desired (again, similar to SQL SELECT):

$ python -m datawelder.join out.csv partitions/names partitions/currencies --format csv --select 'name as country_name, currency as curr' --subs 1
$ head -n 5 out.csv
country_name,curr
Republic of Angola,Kwanza
Principality of Andorra,Euro
Republic of Armenia,Dram
French Southern and Antarctic Lands,Euro

Finally, you can use multiple processes for joining. The default is the number of CPUs. The order of the rows in the output file may differ due to race conditions, but this does not affect the integrity of the data.

$ python -m datawelder.join out.csv partitions/names partitions/currencies --format csv --select '0.name as name, 1.currency as curr' --subs 4

How does it work?

First, datawelder partitions each dataset using a partition key. We used 5 partitions because the datasets are tiny, but you can specify an arbitrary partition size when working with real data.

In this case, it automatically identified the format of the file as CSV. You can give it a helping hand by specifying the format and relevant parameters (e.g. field separator, quoting, etc) manually.

We did not specify a partition key to use in the above example, so datawelder picked a default for us (you can override this). In the above example, we split each dataset into 5 partititions using the default key (whatever is the first column), but you can override that.

Features

  • Parallelization across multiple cores via subprocess/multiprocessing

  • Access to cloud storage for reading and writing e.g. S3 via smart_open. You do not have to store anything locally.

  • Read/write various file formats (CSV, JSON, pickle) out of the box

  • Flexible API for dealing with file format edge cases

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

datawelder-0.9.0.tar.gz (19.5 kB view details)

Uploaded Source

File details

Details for the file datawelder-0.9.0.tar.gz.

File metadata

  • Download URL: datawelder-0.9.0.tar.gz
  • Upload date:
  • Size: 19.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.1 importlib_metadata/3.10.1 pkginfo/1.7.0 requests/2.25.1 requests-toolbelt/0.9.1 tqdm/4.60.0 CPython/3.8.8

File hashes

Hashes for datawelder-0.9.0.tar.gz
Algorithm Hash digest
SHA256 0c75b34ea7fac0e4535b54305ae921b4c4da0f2be7cf1039b3497f16e45688b4
MD5 59cd21cb15e488f702468dcc845cfe56
BLAKE2b-256 a6317c2d002d0a17c9a1a68305b5f2691aed1bde32c2daf673598fa567874838

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page