Data caching, scoring and testing for causal discovery

causaliq-data

This package provides data handling, statistical testing, and scoring infrastructure for causal discovery and Bayesian network operations.

Installation

Install from PyPI:

pip install causaliq-data

Status

🚧 Active Development - This repository is currently in active development, which involves:

  • migrating functionality from the legacy monolithic discovery repo
  • restructuring classes to reduce module size and improve maintainability and usability
  • ensuring CausalIQ development standards are met

Features

Currently implemented:

  • Release v0.1.0 - Foundation Data: CausalIQ-compliant Data provider interface and concrete implementations, with data stored internally as pandas DataFrames or NumPy 2D arrays.
  • Release v0.2.0 - Score Functions: Comprehensive scoring framework for Bayesian networks and DAGs with entropy-based (BIC, AIC, log-likelihood), Bayesian (BDE, K2, BDJ, BDS), and Gaussian (BGE, BIC-g, loglik-g) score types.
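
To make the entropy-based scores listed above concrete, here is a minimal sketch of a BIC score for a discrete DAG, written with only the standard library. It does not use the causaliq-data API: the function names `node_loglik_and_params` and `bic_score`, the list-of-dicts data layout and the `{node: [parents]}` graph layout are all assumptions made for illustration.

```python
import math
from collections import Counter


def node_loglik_and_params(rows, node, parents):
    """Return (log-likelihood, free-parameter count) for one node (hypothetical helper)."""
    # Count joint (parent configuration, node value) and parent-only occurrences.
    joint = Counter((tuple(r[p] for p in parents), r[node]) for r in rows)
    marginal = Counter(tuple(r[p] for p in parents) for r in rows)
    # Maximum-likelihood log-likelihood: sum of n * log(n / n_parent_config).
    loglik = sum(n * math.log(n / marginal[cfg]) for (cfg, _), n in joint.items())
    # Free parameters: (node states - 1) per observed parent configuration.
    n_states = len({r[node] for r in rows})
    return loglik, (n_states - 1) * len(marginal)


def bic_score(rows, dag):
    """BIC = log-likelihood - 0.5 * log(N) * free parameters, summed over nodes."""
    n = len(rows)
    total = 0.0
    for node, parents in dag.items():
        loglik, k = node_loglik_and_params(rows, node, parents)
        total += loglik - 0.5 * math.log(n) * k
    return total


# Toy data in which B usually copies A (8 matching rows, 2 mismatches).
rows = (
    [{"A": 0, "B": 0}] * 4 + [{"A": 1, "B": 1}] * 4
    + [{"A": 0, "B": 1}, {"A": 1, "B": 0}]
)
dependent = bic_score(rows, {"A": [], "B": ["A"]})    # graph A -> B
independent = bic_score(rows, {"A": [], "B": []})     # empty graph
```

On this toy data the graph with the edge A -> B scores higher than the empty graph, because the gain in log-likelihood outweighs the penalty for the extra parameter.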

Planned releases (supporting legacy functionality):

  • Release v0.3.0 - CI Tests: conditional independence tests
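
As an illustration of what a conditional independence test on categorical data computes, the sketch below evaluates the G statistic (likelihood-ratio statistic) and its degrees of freedom. It is not the planned causaliq-data interface: the name `g_statistic`, the helper `expand` and the list-of-dicts layout are assumptions. A p-value would normally follow from the chi-squared distribution, which is left out here to keep the example free of third-party dependencies.

```python
import math
from collections import Counter


def g_statistic(rows, x, y, given=()):
    """Return (G, degrees of freedom) for testing x independent of y given `given`."""
    def z_of(r):
        return tuple(r[v] for v in given)

    n_xyz = Counter((r[x], r[y], z_of(r)) for r in rows)
    n_xz = Counter((r[x], z_of(r)) for r in rows)
    n_yz = Counter((r[y], z_of(r)) for r in rows)
    n_z = Counter(z_of(r) for r in rows)
    # G = 2 * sum over observed cells of n_xyz * log(n_xyz * n_z / (n_xz * n_yz)).
    g = 2.0 * sum(
        n * math.log(n * n_z[z] / (n_xz[(xv, z)] * n_yz[(yv, z)]))
        for (xv, yv, z), n in n_xyz.items()
    )
    x_states = len({r[x] for r in rows})
    y_states = len({r[y] for r in rows})
    dof = (x_states - 1) * (y_states - 1) * len(n_z)
    return g, dof


def expand(z, counts):
    """Build rows from a {(x, y): count} table for one value of Z (hypothetical helper)."""
    return [{"X": x, "Y": y, "Z": z} for (x, y), n in counts.items() for _ in range(n)]


# X and Y are exactly independent within each Z stratum, but both depend on Z.
rows = (
    expand(0, {(0, 0): 4, (0, 1): 2, (1, 0): 2, (1, 1): 1})
    + expand(1, {(0, 0): 1, (0, 1): 2, (1, 0): 2, (1, 1): 4})
)
g_marginal, dof_marginal = g_statistic(rows, "X", "Y")
g_conditional, dof_conditional = g_statistic(rows, "X", "Y", given=("Z",))
```

Here the conditional statistic is zero because the cell counts factorise within each stratum of Z, while the marginal statistic is positive.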

Upcoming Key Innovations

🧩 Plugin Architecture

  • Use by third-party software - the ability to use these data capabilities in third-party structure learning algorithms, so that comparisons rest on a common scoring or conditional independence framework and the performance optimisations also speed up third-party algorithms.

🏛️ Stability Integration

  • Stable scores - stable resolution of equal-score situations for unstable algorithms e.g. Tabu
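
One way to picture stable resolution of equal-score situations is a deterministic tie-break, so that the chosen change does not depend on the order in which candidates are enumerated. The sketch below is only an assumption about what such a rule could look like, not the CausalIQ approach: `best_change`, the `(score, change)` tuples and the rounding tolerance are all hypothetical.

```python
def best_change(candidates, digits=10):
    """Pick the highest-scoring change; resolve ties by the change tuple itself.

    candidates: list of (score, (operation, from_node, to_node)) tuples.
    Scores are rounded so that floating-point noise does not hide a tie.
    """
    return min(candidates, key=lambda c: (-round(c[0], digits), c[1]))


candidates = [
    (-12.5, ("add", "B", "A")),
    (-12.5, ("add", "A", "B")),   # same score as the entry above
    (-14.0, ("add", "A", "C")),
]
chosen = best_change(candidates)
chosen_reversed = best_change(list(reversed(candidates)))
```

Both calls return the same change whatever the input order, which is the property a greedy search such as Tabu needs to give repeatable results.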

🧠 LLM-assisted Causal Discovery

  • Data values - Data values and variable names may provide part of the context for LLM-assisted causal discovery
  • Knowledge integration - incorporation of LLM and human expertise in scores and priors via the CausalIQ Knowledge package.
  • Relationship explanations - natural language descriptions of relationships in data

⚡ Optimised Performance

  • GPU Data provider - support for optimised data handling on GPU hardware
  • Intelligent data scanning - reduce number of full-row data scans
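
A simple way to reduce full-row scans is to cache the counts for each set of variables, since scores and tests repeatedly ask for the same contingency tables. The sketch below assumes this memoisation strategy purely for illustration. The class `CountCache` and its `scans` counter are hypothetical and say nothing about how causaliq-data will implement the feature.

```python
from collections import Counter


class CountCache:
    """Memoise value counts per variable set (hypothetical illustration)."""

    def __init__(self, rows):
        self._rows = rows
        self._cache = {}
        self.scans = 0  # number of full passes made over the data

    def counts(self, variables):
        # Sort the variables so that ["A", "B"] and ["B", "A"] share one entry.
        key = tuple(sorted(variables))
        if key not in self._cache:
            self.scans += 1
            self._cache[key] = Counter(
                tuple(r[v] for v in key) for r in self._rows
            )
        return self._cache[key]


rows = [{"A": 0, "B": 1}, {"A": 0, "B": 1}, {"A": 1, "B": 0}]
cache = CountCache(rows)
first = cache.counts(["A", "B"])
second = cache.counts(["B", "A"])   # served from the cache, no new scan
```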

🎲 Enhanced Distribution Support

  • Mixed types - scores and independence tests that support mixtures of continuous and categorical variables

Integration with CausalIQ Ecosystem

  • 🔍 CausalIQ Discovery makes use of this package to provide objective functions and conditional independence tests for structure learning algorithms.
  • 🧪 CausalIQ Analysis uses score functions as part of the evaluation of learnt graphs.
  • 💎 CausalIQ Core makes use of the BNFit interface to estimate parameters based on data.
  • 🤖 CausalIQ Workflow uses the in-memory randomisation of this package for stability experiments.
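
The in-memory randomisation mentioned above can be pictured as a seeded reordering of the data, so that a stability experiment can check whether a learnt graph changes when only the ordering changes. The sketch below assumes that rows and columns are the things being reordered. The function `randomise` and its signature are hypothetical, not the package's API.

```python
import random


def randomise(rows, columns, seed):
    """Return (rows, columns) with row and column order shuffled reproducibly."""
    rng = random.Random(seed)          # private generator, global state untouched
    new_columns = list(columns)
    rng.shuffle(new_columns)
    # Rebuild each row so that its key order follows the shuffled column order.
    new_rows = [{c: r[c] for c in new_columns} for r in rows]
    rng.shuffle(new_rows)
    return new_rows, new_columns


rows = [{"A": i, "B": i % 2, "C": i % 3} for i in range(6)]
rows_1, cols_1 = randomise(rows, ["A", "B", "C"], seed=1)
rows_1_again, cols_1_again = randomise(rows, ["A", "B", "C"], seed=1)
```

The same seed always reproduces the same ordering, and the data values themselves are unchanged.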

LLM Support

The following project-specific context for this repo should be provided after the personal and ecosystem context:

I wish to migrate the code in legacy/code/data following all CausalIQ development guidelines
so that the legacy repo can use the migrated code instead. I also want my legacy Bayesian Network
code to be able to use the BNFit interface (see bnfit_interface_spec.md). I would start by migrating
the Data abstract class and pandas.py. Please do this a little at a time and advise me what you intend
to do before making any changes.

Quick Start

# To be completed - example will score a known graph

Getting started

Prerequisites

  • Git
  • Latest stable versions of Python 3.9, 3.10, 3.11 and 3.12

Clone the new repo locally and check that it works

Clone the causaliq-data repo locally as normal:

git clone https://github.com/causaliq/causaliq-data.git

Set up the Python virtual environments and activate the default Python virtual environment. You may see messages from VSCode (if you are using it as your IDE) that new Python environments are being created as the scripts/setup-env runs - these messages can be safely ignored at this stage.

scripts/setup-env -Install
scripts/activate

Check that the causaliq-data CLI is working, check that all CI tests pass, and start up the local mkdocs webserver. None of these should report any errors.

causaliq-data --help
scripts/check_ci
mkdocs serve

Enter http://127.0.0.1:8000/ in a browser and check that the causaliq-data documentation is visible.

If all of the above works, this confirms that the code is working successfully on your system.

Documentation

Full API documentation is available at: http://127.0.0.1:8000/ (when running mkdocs serve)

Contributing

This repository is part of the CausalIQ ecosystem. For development setup:

  1. Clone the repository
  2. Run scripts/setup-env -Install to set up environments
  3. Run scripts/check_ci to verify all tests pass
  4. Start documentation server with mkdocs serve

Supported Python Versions: 3.9, 3.10, 3.11, 3.12
Default Python Version: 3.11
License: MIT
