
Automatic Crosswalk

Project description

autocrosswalk: A generic approach to crosswalking

This library automates crosswalks from one dataframe to another.

Please contact the authors below if you find any bugs or have any suggestions for improvement. Thank you!

Author: Nicolaj Søndergaard Mühlbach (n.muhlbach at gmail dot com, muhlbach at mit dot edu)

Code dependencies

This code has the following dependencies:

  • Python >=3.6
  • pandas >=1.3
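As a quick way to confirm that an environment meets the minimums listed above, the following sketch compares the running Python and the installed pandas against them. The helper names `version_tuple` and `meets_minimum` are hypothetical and not part of autocrosswalk; the version parsing is deliberately simple and only reads the leading numeric components.

```python
import sys
from importlib import import_module


def version_tuple(version):
    """Turn a string such as '1.3.5' into (1, 3, 5), ignoring suffixes like 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version, minimum):
    """Return True if `version` is at least `minimum` (both dotted strings)."""
    return version_tuple(version) >= version_tuple(minimum)


# Python itself: compare major and minor version against 3.6
python_ok = sys.version_info[:2] >= (3, 6)
print("Python >= 3.6:", python_ok)

# pandas: only checked if it can be imported
try:
    pandas = import_module("pandas")
    print("pandas >= 1.3:", meets_minimum(pandas.__version__, "1.3"))
except ImportError:
    print("pandas is not installed")
```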

Installation

This library has no heavy dependencies. The bundled example data is stored in parquet format, so loading it requires a parquet reader, e.g., pyarrow or fastparquet (possibly together with brotli for compressed files). Install one of them if you want to use the example data. To install the library itself, run pip install autocrosswalk.
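Before loading the example data, it can help to check whether a parquet engine is available. The sketch below looks for pyarrow and fastparquet, the two engines that pandas can use to read parquet files. The function name `available_parquet_engines` is a hypothetical helper, not part of autocrosswalk.

```python
from importlib.util import find_spec

# Candidate packages that pandas can use as a parquet engine
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines(candidates=PARQUET_ENGINES):
    """Return the candidate packages that are importable in this environment."""
    return [name for name in candidates if find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print("Parquet engine(s) found:", ", ".join(engines))
else:
    print("No parquet engine found; install one, e.g. pip install pyarrow")
```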

Usage

# Libraries
from autocrosswalk.main import AutoCrosswalk
from autocrosswalk.tools import load_example_data

# Load example data
data = load_example_data()

# Separate into old and new data, i.e., we crosswalk the 'data_from' to 'data_to' 
data_from = data.loc[data["DB"]=="db_20_0"]
data_to = data.loc[data["DB"]=="db_26_1"]

# Instantiate
autocrosswalk = AutoCrosswalk(n_best_match=3,
                              prioritize_exact_match=True,
                              enforce_completeness=True,
                              verbose=2)

# Generate crosswalk file
df_crosswalk = autocrosswalk.generate_crosswalk(df_from=data_from,
                                                df_to=data_to,
                                                numeric_key=['O*NET-SOC Code'],
                                                text_key=['Job title'])

# Perform crosswalk
df_updated = autocrosswalk.perform_crosswalk(crosswalk=df_crosswalk,
                                             df=data_from,
                                             values=["Data Value"],
                                             by=['Date', 'DB',
                                                 'Category', 'Element ID',
                                                 'Element Name','Element description'])

# Check if number of unique keys match
print(len(df_updated["O*NET-SOC Code"].unique()) == len(data_to["O*NET-SOC Code"].unique()))
print(len(df_updated["Job title"].unique()) == len(data_to["Job title"].unique()))

# Now, 'df_updated' has all new keys from 'data_to'!
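The usage above does not spell out how keys are matched. As a rough illustration of what an n-best-match crosswalk on a text key could look like, the sketch below ranks candidate titles by string similarity using the standard-library difflib module. This is an assumption made for illustration only and is not autocrosswalk's actual algorithm; the function `best_matches` and the job titles are made up, and the parameter names merely mirror those shown in the usage example.

```python
from difflib import SequenceMatcher


def best_matches(title, candidates, n_best_match=3, prioritize_exact_match=True):
    """Rank candidate titles by similarity to `title` and keep the top n.

    If an exact match exists and `prioritize_exact_match` is set,
    only that match is returned.
    """
    if prioritize_exact_match and title in candidates:
        return [(title, 1.0)]
    scored = [
        (candidate, SequenceMatcher(None, title.lower(), candidate.lower()).ratio())
        for candidate in candidates
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_best_match]


# Made-up titles standing in for the 'Job title' key in old and new data
old_titles = ["Data Clerk", "Web Developer"]
new_titles = ["Data Entry Clerk", "Web Developer", "Web Designer"]

for old in old_titles:
    print(old, "->", best_matches(old, new_titles, n_best_match=2))
```

A real crosswalk would also have to decide how to split or aggregate values when one old key maps to several new keys, which this sketch leaves out.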

Download files


Source Distribution

autocrosswalk-0.0.24.tar.gz (1.2 MB view details)

Uploaded Source

Built Distribution

autocrosswalk-0.0.24-py3-none-any.whl (1.2 MB view details)

Uploaded Python 3

File details

Details for the file autocrosswalk-0.0.24.tar.gz.

File metadata

  • Download URL: autocrosswalk-0.0.24.tar.gz
  • Size: 1.2 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/0.0.0 pkginfo/1.8.2 readme-renderer/27.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.4.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.12

File hashes

Hashes for autocrosswalk-0.0.24.tar.gz
Algorithm Hash digest
SHA256 13762265398400ae11767e2771a7dc25cbc77c2df3e66515e382950f0a18eada
MD5 22881fa6b440b4cfa7682a7b77d2cf23
BLAKE2b-256 12ef00f70c0e003a000ba79e2260e15235be7cf492d38b483ca170f4b21d04df
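A downloaded archive can be checked against the SHA256 digest listed above. The following is a minimal sketch using only the standard library; the local path `ARCHIVE_PATH` is an assumption about where the file was saved, and the helper names are hypothetical.

```python
import hashlib
import hmac
import os

# SHA256 digest as published above for the source distribution
PUBLISHED_SHA256 = "13762265398400ae11767e2771a7dc25cbc77c2df3e66515e382950f0a18eada"
# Assumed local download location; adjust to wherever the file was saved
ARCHIVE_PATH = "autocrosswalk-0.0.24.tar.gz"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare the file's digest with the expected one, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


if os.path.exists(ARCHIVE_PATH):
    print("SHA256 matches:", matches_published_hash(ARCHIVE_PATH, PUBLISHED_SHA256))
else:
    print("Archive not found at", ARCHIVE_PATH)
```

pip can also enforce hashes at install time through its hash-checking mode (`--require-hashes` with a requirements file), which avoids the manual step.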


File details

Details for the file autocrosswalk-0.0.24-py3-none-any.whl.

File metadata

  • Download URL: autocrosswalk-0.0.24-py3-none-any.whl
  • Size: 1.2 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/0.0.0 pkginfo/1.8.2 readme-renderer/27.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.4.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.8.12

File hashes

Hashes for autocrosswalk-0.0.24-py3-none-any.whl
Algorithm Hash digest
SHA256 5a49639a58e56c804620e2f8c707b1e7f520fb90c58bd2a3c4c708519b221233
MD5 775ef19331876c6523935063416386c9
BLAKE2b-256 ac42c5e4bcd6234b14508e37c046b56431c2a8e4894cd7ae663f9ff35afddf37

