
netCBS

netCBS efficiently creates network-based measures using CBS POPNET network tables (e.g. family, colleagues, neighbors, schoolmates, housemates). For example: compute the average income of a person’s parents, or the average income of the parents of their classmates, using CBS network links.

Installation

pip install netcbs

Quick start

See the accompanying notebook for a walkthrough and worked examples.

The core function is transform(query, df_sample, df_agg, ...).

Inputs

  • df_sample: your “ego” sample. Must contain:

    • RINPERSOON (unique person identifier). Note: RINPERSOONS must be R
  • df_agg: the table containing variables you want to aggregate for alters reached by the network traversal. Must contain:

    • RINPERSOON. Note: RINPERSOONS must be R
    • all variables referenced in the query’s aggregation-variable list (e.g. Income, Age)
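As a quick sanity check before calling transform, you can verify these column requirements yourself. A minimal sketch; check_inputs is a hypothetical helper, not part of the netcbs API:

```python
def check_inputs(sample_cols, agg_cols, query_vars):
    # Hypothetical pre-flight check mirroring the input requirements above:
    # both tables need RINPERSOON, and df_agg must also carry every variable
    # named in the query's aggregation list.
    problems = []
    if "RINPERSOON" not in sample_cols:
        problems.append("df_sample is missing RINPERSOON")
    if "RINPERSOON" not in agg_cols:
        problems.append("df_agg is missing RINPERSOON")
    for var in query_vars:
        if var not in agg_cols:
            problems.append(f"df_agg is missing aggregation variable {var!r}")
    return problems

# Example: df_agg lacks the Age column requested by the query.
issues = check_inputs(
    sample_cols=["RINPERSOON", "Birthyear"],
    agg_cols=["RINPERSOON", "Income"],
    query_vars=["Income", "Age"],
)
```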

Query format

A query describes:

  1. Which variables to aggregate (first segment), and
  2. Which network hops to traverse (one or more context segments), ending in sample.

Format:

"[Var1, Var2, ...] -> ContextA[types] -> ContextB[types] -> ... -> sample"

  • The first segment must be in square brackets: "[Income]" or "[Income, Age]".
  • Each context is one of: Family, Colleagues, Neighbors, Schoolmates, Housemates.
  • Context type selector is either:
    • [all] (use all relationship codes valid for that context), or
    • [101,102,...] (explicit relationship codes)
  • The final segment should be sample (write it in lowercase; the match is case-sensitive).

Example:

query = "[Income, Age] -> Family[301] -> Schoolmates[all] -> sample"

This means: find the aggregated Income and Age of parents (301) of the schoolmates of the people in the sample (df_sample).
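The segment structure can be illustrated with a small parser sketch. This is illustrative only (parse_query is not the library's actual API) and skips the validation netcbs performs:

```python
def parse_query(query):
    # Split on "->": first segment holds the variables to aggregate,
    # the middle segments are the context hops, the last must be "sample".
    segments = [s.strip() for s in query.split("->")]
    if segments[-1] != "sample":
        raise ValueError("query must end in 'sample'")
    variables = [v.strip() for v in segments[0].strip("[]").split(",")]
    contexts = segments[1:-1]
    return variables, contexts

variables, contexts = parse_query(
    "[Income, Age] -> Family[301] -> Schoolmates[all] -> sample"
)
# variables -> ["Income", "Age"]; contexts -> ["Family[301]", "Schoolmates[all]"]
```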

Usage

import polars as pl  
import netcbs

query = "[Income, Age] -> Family[301] -> Schoolmates[all] -> sample"

df_out = netcbs.transform(
    query=query,
    df_sample=df_sample,     # must contain: RINPERSOON
    df_agg=df_agg,           # must contain: RINPERSOON, Income, Age
    year=2021,
    format_file="parquet",   # "parquet" (recommended) or "csv"
    agg_funcs=("avg", "sum", "count"),  # DuckDB aggregate function names (strings)
    return_pandas=False, 
)

About agg_funcs (important)

agg_funcs must be a sequence of DuckDB aggregate function names as strings, e.g.:

  • "avg", "sum", "count", "min", "max" (and other DuckDB aggregates)

The output columns are named:

"_"

So with agg_funcs=("avg","sum") and "[Income, Age]", you get:

  • avg_Income, sum_Income, avg_Age, sum_Age
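The naming convention can be reproduced with a one-liner (output_columns is a hypothetical helper used only to illustrate the pattern):

```python
def output_columns(agg_funcs, variables):
    # One column per (variable, function) pair, named "{agg_func}_{variable}".
    # Variables vary in the outer loop, so all columns for one variable
    # appear together, matching the order shown above.
    return [f"{func}_{var}" for var in variables for func in agg_funcs]

cols = output_columns(("avg", "sum"), ["Income", "Age"])
# -> ["avg_Income", "sum_Income", "avg_Age", "sum_Age"]
```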

How it works

  1. Validate query
    validate_query() checks:

    • query structure
    • df_sample has RINPERSOON
    • df_agg has RINPERSOON and all requested aggregation variables
    • each context and relationship-type selector is valid
    • (optionally) referenced CBS files exist for the requested year
  2. Resolve network files
    For each hop, format_path() selects the latest available version of the CBS network file for the requested year.

    • For format_file="parquet", files are expected under a geconverteerde data subfolder.
    • For format_file="csv", files are read with read_csv_auto(..., delim=';').
  3. Traverse the network
    DuckDB reads each network file, filters by the requested relationship codes, and joins hop-by-hop from egos to alters.

  4. Aggregate
    DuckDB joins the final reached persons to df_agg and computes the requested aggregates, grouped by the original sample person.

  5. Join back to sample
    Results are left-joined back onto the sample so every sample person remains in the output (missing networks produce null aggregates).
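The steps above can be sketched with a single-hop toy example. This uses the stdlib sqlite3 as a stand-in for DuckDB, with made-up table and column names; netcbs performs the equivalent filter, joins, and grouped aggregation internally:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Sample egos (stand-in for df_sample).
cur.execute("CREATE TABLE sample (rin INTEGER)")
cur.executemany("INSERT INTO sample VALUES (?)", [(1,), (2,)])

# One network hop: ego -> alter with a relationship-type code.
cur.execute("CREATE TABLE net (src INTEGER, dst INTEGER, rtype INTEGER)")
cur.executemany("INSERT INTO net VALUES (?,?,?)",
                [(1, 10, 301), (1, 11, 301), (2, 12, 302)])

# Variables to aggregate for the reached alters (stand-in for df_agg).
cur.execute("CREATE TABLE agg (rin INTEGER, income REAL)")
cur.executemany("INSERT INTO agg VALUES (?,?)",
                [(10, 100.0), (11, 200.0), (12, 300.0)])

rows = cur.execute("""
    SELECT s.rin,
           AVG(a.income)   AS avg_income,
           COUNT(a.income) AS count_income
    FROM sample s
    LEFT JOIN net n ON n.src = s.rin AND n.rtype IN (301)  -- filter by code
    LEFT JOIN agg a ON a.rin = n.dst                        -- attach variables
    GROUP BY s.rin                                          -- per sample person
    ORDER BY s.rin
""").fetchall()
# Ego 1 reaches two type-301 alters (avg 150.0, count 2); ego 2 reaches none,
# but the LEFT JOIN keeps it in the output with null avg and count 0.
```

The left joins are what guarantee step 5: every sample person survives into the result even when no network links match.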

Contributing

Please refer to the repository’s CONTRIBUTING guide for issues and pull requests.

License and citation

netCBS is published under the MIT license.
For academic citation: Garcia-Bernardo, J. (2024). netCBS: Package to efficiently create network measures using CBS networks in the RA. (v0.1). Zenodo. https://doi.org/10.5281/zenodo.13908121

Contact

Developed and maintained by the ODISSEI Social Data Science (SoDa) team.
Questions or suggestions: please open an issue or contact via the ODISSEI SoDa website.
