Data caching, scoring and testing for causal discovery

causaliq-data

This package provides data handling, statistical testing, and scoring infrastructure for causal discovery and Bayesian network operations.

Installation

Install from PyPI:

pip install causaliq-data

Status

🚧 Active Development - This repository is currently in active development, which involves:

  • migrating functionality from the legacy monolithic discovery repo
  • restructuring classes to reduce module size and improve maintainability and usability
  • ensuring CausalIQ development standards are met

Features

Currently implemented:

  • Release v0.1.0 - Foundation Data: CausalIQ-compliant Data provider interface and concrete implementations, with data stored internally as pandas DataFrames or NumPy 2D arrays.
  • Release v0.2.0 - Score Functions: Comprehensive scoring framework for Bayesian networks and DAGs with entropy-based (BIC, AIC, log-likelihood), Bayesian (BDE, K2, BDJ, BDS), and Gaussian (BGE, BIC-g, loglik-g) score types.
  • Release v0.3.0 - Independence Tests: Statistical independence testing with Chi-squared and Mutual Information tests for conditional independence (X ⊥ Y | Z), supporting multiple data sources and designed for constraint-based algorithms. Includes data preprocessing utilities for cleaning datasets before causal discovery.
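To make the entropy-based scores listed above concrete, here is a minimal sketch of a BIC score for a single categorical node, written with plain pandas and NumPy. It is not this package's implementation: the function name `bic_node_score`, the use of natural logarithms, and counting parent configurations as the product of parent cardinalities are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def bic_node_score(df: pd.DataFrame, node: str, parents: list) -> float:
    """Illustrative BIC score for one categorical node (hypothetical helper)."""
    n = len(df)
    # N_jk: number of rows with parent configuration j and node state k
    counts = df.groupby(parents + [node]).size().rename("n_jk").reset_index()
    # N_j: number of rows with parent configuration j (all rows if no parents)
    if parents:
        counts["n_j"] = counts.groupby(parents)["n_jk"].transform("sum")
    else:
        counts["n_j"] = n
    counts = counts[counts["n_jk"] > 0]  # empty cells contribute nothing
    loglik = float((counts["n_jk"] * np.log(counts["n_jk"] / counts["n_j"])).sum())
    # Free parameters: (node states - 1) per parent configuration (assumed convention)
    r = df[node].nunique()
    q = int(np.prod([df[p].nunique() for p in parents]))
    return float(loglik - 0.5 * np.log(n) * (r - 1) * q)


# Toy data: compare the score of Y with and without X as a parent
toy = pd.DataFrame({"X": list("aabb"), "Y": list("ccdd")})
print(bic_node_score(toy, "Y", ["X"]), bic_node_score(toy, "Y", []))
```

A higher score indicates a better trade-off between fit and complexity under these conventions; the package's own scores may differ in log base or penalty details.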

Planned releases (supporting new functionality):

  • none planned yet

Upcoming Key Innovations

🧩 Plugin Architecture

  • Use by third-party software - the ability to use these data capabilities in third-party structure learning algorithms, so that comparisons are based on a common scoring or conditional independence framework and the performance optimisations also speed up third-party algorithms.

🏛️ Stability Integration

  • Stable scores - stable resolution of equal-score situations for unstable algorithms e.g. Tabu
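As an illustration of what stable resolution of equal scores could look like, the sketch below picks the best-scoring arc change and breaks ties by arc name rather than by iteration order. This feature is still upcoming, so everything here is an assumption: `pick_best_change`, the `(score_delta, (from, to))` tuple layout, and rounding scores to a fixed number of decimals are hypothetical choices, not the package's design.

```python
from typing import Iterable, Tuple

Arc = Tuple[str, str]


def pick_best_change(
    changes: Iterable[Tuple[float, Arc]], ndigits: int = 10
) -> Tuple[float, Arc]:
    """Return the highest-scoring change; ties go to the alphabetically first arc."""
    # Rounding makes scores that differ only by floating-point noise compare equal;
    # the arc tuple then gives an ordering that ignores the input order.
    ranked = sorted(changes, key=lambda c: (-round(c[0], ndigits), c[1]))
    return ranked[0]


candidates = [(1.5, ("B", "C")), (1.5, ("A", "B")), (0.2, ("A", "C"))]
print(pick_best_change(candidates))
```

Because the result depends only on the scores and the arc names, reordering the variables or the candidate list does not change which arc is chosen.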

🧠 LLM-assisted Causal Discovery

  • Data values - Data values and variable names may provide part of the context for LLM-assisted causal discovery
  • Knowledge integration - incorporation of LLM and human expertise in scores and priors via the CausalIQ Knowledge package.
  • Relationship explanations - natural language descriptions of relationships in data

⚡ Optimised Performance

  • GPU Data provider - support for optimised data handling on GPU hardware
  • Intelligent data scanning - reduce number of full-row data scans

🎲 Enhanced Distribution Support

  • Mixed Types: scores and independence tests that support mixtures of continuous and categorical variables

Integration with CausalIQ Ecosystem

  • 🔍 CausalIQ Discovery makes use of this package to provide objective functions and conditional independence tests for structure learning algorithms.
  • 🧪 CausalIQ Analysis uses score functions as part of the evaluation of learnt graphs.
  • 💎 CausalIQ Core makes use of the BNFit interface to estimate parameters based on data.
  • 🤖 CausalIQ Workflow uses the in-memory randomisation of this package for stability experiments.
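The in-memory randomisation mentioned in the last item can be pictured as reordering rows and columns without changing the data values, so that an algorithm's sensitivity to ordering can be measured. The sketch below uses pandas and NumPy directly; `randomise_order` is a hypothetical helper and is not the package's API.

```python
import numpy as np
import pandas as pd


def randomise_order(df: pd.DataFrame, seed: int) -> pd.DataFrame:
    """Return a copy of df with rows and columns in a seeded random order."""
    rng = np.random.default_rng(seed)
    row_order = rng.permutation(len(df))                # shuffled row positions
    col_order = rng.permutation(df.columns.to_numpy())  # shuffled column names
    return df.iloc[row_order][list(col_order)].reset_index(drop=True)


toy = pd.DataFrame({"A": [1, 2, 3], "B": list("xyz"), "C": [0.1, 0.2, 0.3]})
print(randomise_order(toy, seed=42))
```

Using the same seed reproduces the same ordering, which keeps stability experiments repeatable.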

LLM Support

The following provides project-specific context for this repo which should be provided after the personal and ecosystem context:

I wish to migrate the code in legacy/code/data following all CausalIQ development guidelines
so that the legacy repo can use the migrated code instead. I also want my legacy Bayesian Network
code to be able to use the BNFit interface (see bnfit_interface_spec.md). I would start by migrating
the Data abstract class and pandas.py. Please do this a little at a time and advise me what you intend
to do before making any changes.

Quick Start

import pandas as pd
from causaliq_data.pandas import Pandas
from causaliq_data.indep import indep
from causaliq_data.score import node_score

# Load your data
df = pd.read_csv("your_dataset.csv")
data = Pandas(df)

# Test independence: Is X independent of Y given Z?
result = indep("X", "Y", ["Z"], data.as_df(), types=["mi", "x2"])
p_value = result.loc["p_value", "mi"]
print(f"Independence test p-value: {p_value}")

# Score a node with its parents
score = node_score("Y", ["X", "Z"], data, "bic")
print(f"BIC score: {score}")
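For readers who want to see what a mutual-information style conditional independence test computes, the following sketch derives the G statistic (2 × N × conditional mutual information) and its degrees of freedom for categorical data. It is independent of this package: `g_statistic` is a hypothetical name, and the degrees-of-freedom convention (no adjustment for empty cells) is an assumption.

```python
import numpy as np
import pandas as pd


def g_statistic(df: pd.DataFrame, x: str, y: str, z: list):
    """Return (G, degrees of freedom) for testing X independent of Y given Z."""
    g = 0.0
    # One contingency table per configuration of the conditioning set
    strata = df.groupby(z) if z else [(None, df)]
    for _, stratum in strata:
        observed = pd.crosstab(stratum[x], stratum[y]).to_numpy(dtype=float)
        # Expected counts if X and Y were independent within this stratum
        expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
        mask = observed > 0  # empty cells contribute zero
        g += 2.0 * float((observed[mask] * np.log(observed[mask] / expected[mask])).sum())
    dof = (df[x].nunique() - 1) * (df[y].nunique() - 1)
    for name in z:
        dof *= df[name].nunique()
    return g, dof


toy = pd.DataFrame({"X": [0, 0, 1, 1], "Y": [0, 0, 1, 1], "Z": [0, 1, 0, 1]})
print(g_statistic(toy, "X", "Y", ["Z"]))
```

A p-value can then be obtained from the chi-squared survival function, for example `scipy.stats.chi2.sf(g, dof)`; a small p-value is evidence against independence.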

Getting started

Prerequisites

  • Git
  • Latest stable versions of Python 3.9, 3.10, 3.11 and 3.12

Clone the new repo locally and check that it works

Clone the causaliq-data repo locally as normal:

git clone https://github.com/causaliq/causaliq-data.git

Set up the Python virtual environments and activate the default Python virtual environment. You may see messages from VSCode (if you are using it as your IDE) that new Python environments are being created as the scripts/setup-env runs - these messages can be safely ignored at this stage.

scripts/setup-env -Install
scripts/activate

Check that the causaliq-data CLI is working, check that all CI tests pass, and start up the local mkdocs webserver. None of these steps should report any errors.

causaliq-data --help
scripts/check_ci
mkdocs serve

Enter http://127.0.0.1:8000/ in a browser and check that the causaliq-data documentation is visible.

If all of the above works, this confirms that the code is working successfully on your system.

Documentation

Full API documentation is available at: http://127.0.0.1:8000/ (when running mkdocs serve)

Contributing

This repository is part of the CausalIQ ecosystem. For development setup:

  1. Clone the repository
  2. Run scripts/setup-env -Install to set up environments
  3. Run scripts/check_ci to verify all tests pass
  4. Start documentation server with mkdocs serve

Supported Python Versions: 3.9, 3.10, 3.11, 3.12
Default Python Version: 3.11
License: MIT
