Analysis toolkit for time-series natural language datasets

chronai

Read the chronai documentation

The name chronai is a contraction of chrono, relating to time, and ai. The motivation for this repo is to formalise some time-series analysis algorithms for datasets with the following features:

  1. time - obviously
  2. text - some natural language field
  3. category - some partition of this data, e.g. user_id, area_code, etc.

The motivating example for this tool-kit is Google Search data, for example:

user_id  search_datetime       search_text
1        2025-05-08T13:56:35Z  what does chrono mean
1        2025-05-08T13:58:23Z  what does NLP mean
2        2025-05-08T13:58:23Z  how to center a div
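As a rough illustration of the three features (time, text, category), the sample rows could be held in a small record type. This is only a sketch: the `Record` class and `parse_row` helper are hypothetical names chosen for this example and are not part of chronai.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    """One row of the dataset (hypothetical type, for illustration only)."""
    category: str        # the partition key, e.g. user_id or area_code
    timestamp: datetime  # timezone-aware event time
    text: str            # the natural language field


def parse_row(user_id: str, search_datetime: str, search_text: str) -> Record:
    # The sample timestamps end in a literal "Z", meaning UTC.
    ts = datetime.strptime(search_datetime, "%Y-%m-%dT%H:%M:%SZ")
    return Record(category=user_id, timestamp=ts.replace(tzinfo=timezone.utc), text=search_text)


rows = [
    ("1", "2025-05-08T13:56:35Z", "what does chrono mean"),
    ("1", "2025-05-08T13:58:23Z", "what does NLP mean"),
    ("2", "2025-05-08T13:58:23Z", "how to center a div"),
]
records = [parse_row(*row) for row in rows]
```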

The motivating technology for this tool-kit is LLMs. Anthropic's December 2024 paper Clio: Privacy-Preserving Insights into Real-World AI Use details a neat methodology for gleaning observations and insights from a large corpus of conversations with Claude. Nothing was particularly novel about this type of analysis, except that LLMs were used in lieu of traditional NLP techniques. No fancy LDA or even less fancy TF-IDF, just calls to see what Claude thought, and a lot of them. I've seen this implemented over and over again in the last 12 months; this tool-kit is an attempt to formalise this "new" data science.

What's injectable?

  1. Pipeline Orchestrator [Default: Dagster]: These tools should play nicely with pipeline orchestration tools such as Dagster, Airflow, Prefect, and Celery. If our algorithms are to leverage LLMs, we must expect latency and unexpected failures, so it seems responsible to expect these jobs to often be run on such pipeline orchestration tools.

  2. LLM Provider [Default: lambda.ai]: The LLM in question should be injectable, of course. Lambda Inference fits particularly well for these sorts of workloads, as there is a large range of cheap models and rate limits are not an issue.

  3. Embeddings Model [Default: HuggingFace]: Of course if you wish to bring your own embeddings model, that's also allowed.

  4. Trad. DS Routines [Default: Sklearn]: Now and then we will need to use traditional DS techniques. These will also be injectable, but must follow the standard sklearn model interface.
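One way to picture "injectable" is structural typing: a routine depends only on a small interface, and any provider that satisfies it can be passed in. The sketch below is an assumption about how this might look, not chronai's actual API. The names `LLMProvider`, `EmbeddingsModel`, `Clusterer`, `TopicLabeller`, `EchoLLM`, and `LengthEmbeddings` are all hypothetical; only the `fit_predict` method name is borrowed from the sklearn clustering convention.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class LLMProvider(Protocol):
    """Hypothetical interface: anything that can turn a prompt into text."""
    def complete(self, prompt: str) -> str: ...


@runtime_checkable
class EmbeddingsModel(Protocol):
    """Hypothetical interface: anything that can embed a batch of texts."""
    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


@runtime_checkable
class Clusterer(Protocol):
    """Mirrors the sklearn clustering convention of exposing fit_predict."""
    def fit_predict(self, X, y=None): ...


class EchoLLM:
    """Test double: returns the last line of the prompt unchanged."""
    def complete(self, prompt: str) -> str:
        return prompt.strip().splitlines()[-1]


class LengthEmbeddings:
    """Test double: embeds each text as [character count, space count]."""
    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(t)), float(t.count(" "))] for t in texts]


class TopicLabeller:
    """Hypothetical routine that receives its dependencies via the constructor."""
    def __init__(self, llm: LLMProvider, embedder: EmbeddingsModel):
        self.llm = llm
        self.embedder = embedder

    def label(self, texts: Sequence[str]) -> str:
        prompt = "Summarise the common topic of these texts:\n" + "\n".join(texts)
        return self.llm.complete(prompt)


labeller = TopicLabeller(llm=EchoLLM(), embedder=LengthEmbeddings())
```

Swapping in a real provider would then only require a class with the same method names, and the test doubles above hint at how mocks for the injectables could be written.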

What can (should) we do now (later)?

  • Sessionization: The act of taking a set of semantically consistent and temporally relevant NL texts and grouping them into one session (see glossary).

  • Topic Trends: Akin to what Google Trends does.

  • Temporal Tagging: Tagging a value of some categorical dimension with topics that are relevant within a temporal window. For example, user_123 tagged with Looking for holiday destination for the window [2025-05-01, 2025-05-10].

  • Clio: While the original paper did not concern itself with time, we should.
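To make the sessionization idea concrete, here is a minimal sketch that groups events per category when they are both close in time and similar in content. It is not chronai's implementation: the `sessionize` and `token_overlap` names, the 30-minute gap, and the 0.2 similarity threshold are all assumptions for illustration, and token overlap is a crude stand-in for the semantic check that an injected embeddings model or LLM would provide.

```python
from datetime import datetime, timedelta
from typing import Callable, Iterable

Event = tuple[str, datetime, str]  # (category, timestamp, text)


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity on lowercase tokens (placeholder for semantic similarity)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def sessionize(
    events: Iterable[Event],
    max_gap: timedelta = timedelta(minutes=30),
    min_similarity: float = 0.2,
    similarity: Callable[[str, str], float] = token_overlap,
) -> list[list[Event]]:
    """Group events into sessions per category, comparing each event to the previous one."""
    sessions: list[list[Event]] = []
    open_session: dict[str, list[Event]] = {}  # category -> its most recent session
    for event in sorted(events, key=lambda e: (e[0], e[1])):
        category, ts, text = event
        current = open_session.get(category)
        if current is not None:
            _, prev_ts, prev_text = current[-1]
            close_in_time = ts - prev_ts <= max_gap
            if close_in_time and similarity(prev_text, text) >= min_similarity:
                current.append(event)
                continue
        # Otherwise start a new session for this category.
        current = [event]
        open_session[category] = current
        sessions.append(current)
    return sessions


events = [
    ("1", datetime(2025, 5, 8, 13, 56, 35), "what does chrono mean"),
    ("1", datetime(2025, 5, 8, 13, 58, 23), "what does NLP mean"),
    ("2", datetime(2025, 5, 8, 13, 58, 23), "how to center a div"),
]
sessions = sessionize(events)
```

With the sample search data, the two queries from user 1 fall within the time gap and share most of their tokens, so they land in one session, while user 2's query forms its own. The `similarity` parameter is where an injected model could be plugged in.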

TODO

  • Set-up proj structure
  • Set-up docs
  • Set-up tests
  • Set-up publishing to pypi w/ CICD
  • Sessionization (s12n) README.md
  • Topic Trends (tt) README.md
  • heirarchy_tree (ht) README.md
  • Set-up code cov
  • Set-up appropriate mocks for the injectables
  • Build class interfaces
  • Set-up READMEs for the above routines.
  • Set-up branch rules
