Data wrangling with simplicity, complete audit transparency, and speed

whyqd: simplicity, transparency, speed

What is it?

More research, less wrangling

whyqd (/wɪkɪd/) is a curatorial toolkit intended to produce well-structured and predictable data for research analysis.

It provides an intuitive method for creating schema-to-schema crosswalks for restructuring messy data to conform to a standardised metadata schema. It supports rapid and continuous transformation of messy data using a simple series of steps. Once complete, you can import wrangled data into more complex analytical or database systems.

whyqd plays well with your existing Python-based data-analytical tools. It uses Ray and Modin as a drop-in replacement for Pandas to support processing of large datasets, and Pydantic for data models.

Each definition is saved as a JSON Schema-compliant file. This permits others to read and scrutinise your approach, validate your methodology, or even use your crosswalks to import and transform data in production.

Once complete, a transform file can be shared, along with your input data, and anyone can import and validate your crosswalk to verify that your output data is the product of these inputs.

Why use it?

whyqd allows you to get to work without requiring you to achieve buy-in from anyone or change your existing code.

If you don't want to spend days or weeks slogging through data when all you want to do is test whether your source data are even useful,
If you already have a workflow and established software which includes Python and pandas, and don't want to change your code every time your source data changes,

If you want to go from a Cthulhu dataset like this:

UNDP Human Development Index 2007-2008: a beautiful example of messy data.

To this:

	country_name	indicator_name	reference	year	values
0	Hong Kong, China (SAR)	HDI rank	e	2008	21
1	Singapore	HDI rank	nan	2008	25
2	Korea (Republic of)	HDI rank	nan	2008	26
3	Cyprus	HDI rank	nan	2008	28
4	Brunei Darussalam	HDI rank	nan	2008	30
5	Barbados	HDI rank	e,g,f	2008	31

With a readable set of scripts to ensure that your process can be audited and repeated:

schema_scripts = [
    f"UNITE > 'reference' < {REFERENCE_COLUMNS}",
    "RENAME > 'country_name' < ['Country']",
    "PIVOT_LONGER > ['indicator_name', 'values'] < ['HDI rank', 'HDI Category', 'Human poverty index (HPI-1) - Rank;;2008', 'Human poverty index (HPI-1) - Value (%);;2008', 'Probability at birth of not surviving to age 40 (% of cohort);;2000-05', 'Adult illiteracy rate (% aged 15 and older);;1995-2005', 'Population not using an improved water source (%);;2004', 'Children under weight for age (% under age 5);;1996-2005', 'Population below income poverty line (%) - $1 a day;;1990-2005', 'Population below income poverty line (%) - $2 a day;;1990-2005', 'Population below income poverty line (%) - National poverty line;;1990-2004', 'HPI-1 rank minus income poverty rank;;2008']",
    "SEPARATE > ['indicator_name', 'year'] < ';;'::['indicator_name']",
    "DEBLANK",
    "DEDUPE",
]

Then whyqd may be for you.
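
To make the reshaping in those scripts concrete, here is a minimal plain-Python sketch of what a pivot-longer step followed by a separate step does to a row. It is not whyqd's implementation: the function names `pivot_longer` and `separate` are hypothetical stand-ins, and the sample row is made-up toy data.

```python
# Illustrative only: plain-Python stand-ins for the PIVOT_LONGER and SEPARATE
# steps. These functions are hypothetical and are not part of whyqd.

def pivot_longer(rows, id_fields, value_fields,
                 name_to="indicator_name", value_to="values"):
    """Turn each wide row into one long row per value field."""
    long_rows = []
    for row in rows:
        for field in value_fields:
            long_row = {key: row[key] for key in id_fields}
            long_row[name_to] = field       # the old column header
            long_row[value_to] = row[field]  # the cell it held
            long_rows.append(long_row)
    return long_rows


def separate(rows, source, into, sep=";;"):
    """Split one field on a separator into several named fields."""
    out = []
    for row in rows:
        parts = row[source].split(sep)
        # Pad with None so every target name is present even without a match.
        parts += [None] * (len(into) - len(parts))
        new_row = dict(row)
        new_row.update(zip(into, parts))
        out.append(new_row)
    return out


# Toy input: one wide row with two indicator columns (values are invented).
wide = [
    {
        "country_name": "Exampleland",
        "HDI rank": 1,
        "Adult illiteracy rate (% aged 15 and older);;1995-2005": 2.5,
    }
]

long_rows = pivot_longer(
    wide,
    id_fields=["country_name"],
    value_fields=[
        "HDI rank",
        "Adult illiteracy rate (% aged 15 and older);;1995-2005",
    ],
)
tidy = separate(long_rows, source="indicator_name", into=["indicator_name", "year"])

for row in tidy:
    print(row)
```

Each wide row becomes one row per indicator, and the `;;` suffix in a column header moves into its own `year` field. In this sketch a header without a `;;` suffix simply gets `None` for the year.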

How does it work?

Crosswalks are mappings of the relationships between fields defined in different metadata schemas. Ideally, these are one-to-one, where a field in one has an exact match in the other. In practice, it's more complicated than that.
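
As a rough illustration of that idea, a crosswalk can be thought of as data that lists, for each destination field, the source fields that feed it. The sketch below is an assumption made for explanation only: the `CROSSWALK` structure, the `apply_crosswalk` function, and the source column names `Ref 1` and `Ref 2` are hypothetical and do not reflect whyqd's own crosswalk format.

```python
# Illustrative only: a crosswalk expressed as plain data. The structure and
# function name are hypothetical, not whyqd's crosswalk format.

CROSSWALK = {
    # destination field: source fields that feed it
    "country_name": ["Country"],       # one-to-one: a straight rename
    "reference": ["Ref 1", "Ref 2"],   # many-to-one: values are joined
}


def apply_crosswalk(row, crosswalk, sep=","):
    """Build a destination row from a source row using the mapping."""
    out = {}
    for dest, sources in crosswalk.items():
        # Skip missing or empty source values before joining.
        values = [str(row[s]) for s in sources if row.get(s) not in (None, "")]
        out[dest] = sep.join(values) if values else None
    return out


source_row = {"Country": "Barbados", "Ref 1": "e", "Ref 2": "g"}
mapped = apply_crosswalk(source_row, CROSSWALK)
print(mapped)
```

The one-to-one case is the ideal described above. The many-to-one case is one example of why real crosswalks get more complicated.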

Your workflow is:

Define a single destination schema,
Derive a source schema from a data source,
Review your source data structure,
Develop a crosswalk to define the relationship between source and destination,
Transform and validate your outputs,
Share your output data, transform definitions, and a citation.

It starts like this:

import whyqd as qd

Install and then read the quickstart.

There are four worked tutorials to guide you through typical scenarios.

Installation

You'll need at least Python 3.9, then install with your favourite package manager:

pip install whyqd

To derive a source schema from tabular data, import from DATASOURCE_PATH, define its MIMETYPE, and derive a schema:

import whyqd as qd

datasource = qd.DataSourceDefinition()
datasource.derive_model(source=DATASOURCE_PATH, mimetype=MIMETYPE)
schema_source = qd.SchemaDefinition()
schema_source.derive_model(data=datasource.get)
schema_source.fields.set_categories(name=CATEGORY_FIELD, 
                                    terms=datasource.get_data())
schema_source.save()
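
For readers new to the idea of deriving a schema from data, the following standard-library sketch shows the general principle: read the column names and guess a type for each column. It is not how whyqd derives its model. The helper names `guess_type` and `derive_fields`, the type labels, and the sample CSV are all assumptions made for illustration.

```python
import csv
import io

# Illustrative only: a toy schema derivation using the standard library.
# This is not whyqd's derive_model; names and type labels are hypothetical.


def guess_type(values):
    """Guess 'integer', 'number' or 'string' from a column's text values."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "string"
    for caster, name in ((int, "integer"), (float, "number")):
        try:
            for value in non_empty:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def derive_fields(csv_text):
    """Return one {'name', 'type'} entry per column in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return [
        {"name": name, "type": guess_type([row[name] for row in rows])}
        for name in (reader.fieldnames or [])
    ]


# Made-up sample data with a text, an integer, and a decimal column.
SAMPLE = "Country,HDI rank,Value\nA,1,0.5\nB,2,\n"
fields = derive_fields(SAMPLE)
print(fields)
```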

Get started...

Changelog

The version history can be found in the changelog.

Background and funding

whyqd was created to serve a continuous data wrangling process, including collaboration on more complex messy sources, ensuring the integrity of the source data, and producing a complete audit trail from data imported to our database, back to source. You can see the product of that at openLocal.uk.

whyqd received initial funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 101017536. Technical development support is from EOSC Future through the RDA Open Call mechanism, based on evaluations of external, independent experts.

The 'backronym' for whyqd /wɪkɪd/ is Whythawk Quantitative Data. Whythawk is an open data science and open research technical consultancy.

Licence

The whyqd Python distribution is licensed under the terms of the BSD 3-Clause license. All documentation is released under Attribution 4.0 International (CC BY 4.0). whyqd tradenames and marks are copyright Whythawk.

Release history

Version	Release date
1.2.8 (this version)	Apr 18, 2026
1.2.7	Apr 16, 2026
1.2.6	Apr 16, 2026
1.2.5	Sep 30, 2025
1.2.4	Sep 29, 2025
1.2.3	Sep 12, 2025
1.2.2	Sep 11, 2025
1.2.1	Jul 19, 2025
1.2.0	Mar 13, 2025
1.1.3	Mar 8, 2024
1.1.2	Mar 8, 2024
1.1.1	Feb 12, 2024
1.1.0	Dec 12, 2023
1.0.9	Dec 12, 2023
1.0.8	Aug 10, 2023
1.0.7	Aug 9, 2023
1.0.6	Aug 9, 2023
1.0.5	Aug 9, 2023
1.0.4	Aug 8, 2023
1.0.3	Aug 7, 2023
1.0.2	Jul 5, 2023
1.0.1	Jul 5, 2023
1.0.0	May 10, 2023
0.6.2	Jan 13, 2022
0.6.1	Nov 14, 2021
0.6.0	Nov 14, 2021
0.5.0	Aug 23, 2021
0.3.1	May 8, 2020
0.3.0	May 3, 2020
0.2.2	Apr 16, 2020
0.2.1	Apr 16, 2020
0.2.0	Feb 23, 2020
0.1.1	Feb 20, 2020
0.1.0	Feb 19, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

whyqd-1.2.8.tar.gz (3.1 MB)

Uploaded Apr 18, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

whyqd-1.2.8-py3-none-any.whl (93.4 kB)

Uploaded Apr 18, 2026 Python 3

File details

Details for the file whyqd-1.2.8.tar.gz.

File metadata

Download URL: whyqd-1.2.8.tar.gz
Upload date: Apr 18, 2026
Size: 3.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.0.1 CPython/3.11.4 Linux/6.6.87.2-microsoft-standard-WSL2

File hashes

Hashes for whyqd-1.2.8.tar.gz
Algorithm	Hash digest
SHA256	`05ce9a28bd7545d9337c26d96d3702ae7926408a9460d34e49cfaac72b6c83ae`
MD5	`e3e0b35f1ca893b50aba77c3e7f62853`
BLAKE2b-256	`9397e47ec32e294c03914dcfcff5890dcaa7ced899fa8972b814bfb29333b435`
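
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. The sketch below assumes the archive has been saved in the current working directory under its original file name. The helper names `sha256_of_file` and `matches_published_hash` are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# SHA256 digest listed above for whyqd-1.2.8.tar.gz.
EXPECTED = "05ce9a28bd7545d9337c26d96d3702ae7926408a9460d34e49cfaac72b6c83ae"

# Assumed location of the downloaded archive; adjust to where you saved it.
archive = Path("whyqd-1.2.8.tar.gz")
if archive.exists():
    print("hash matches:", matches_published_hash(archive, EXPECTED))
else:
    print("archive not found; download it first")
```

Reading in chunks keeps memory use flat regardless of file size. The same function works for the wheel file if you swap in its digest.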

File details

Details for the file whyqd-1.2.8-py3-none-any.whl.

File metadata

Download URL: whyqd-1.2.8-py3-none-any.whl
Upload date: Apr 18, 2026
Size: 93.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.0.1 CPython/3.11.4 Linux/6.6.87.2-microsoft-standard-WSL2

File hashes

Hashes for whyqd-1.2.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4a85f5f00b5fffae978d22f5ac194ce60a75970c17d00e7e39e9c51f1c098d46`
MD5	`e21163362ce8880a0d50d1e5df25e789`
BLAKE2b-256	`bea2d6a0a40ac5bca5748247edb491360089c8f101bb49d000b707ef2a76a65b`
