Analysis toolkit for time-series natural language datasets
Project description
chronai
The name chronai is a contraction of chrono, relating to time, and ai. The motivation for this repo is to formalise some time-series analysis algorithms for datasets with the following features:
- `time`: obviously
- `text`: some natural language field
- `category`: some partition of this data, e.g. `user_id`, `area_code`, etc.
The motivating example for this toolkit is Google Search data:
| user_id | search_datetime | search_text |
|---|---|---|
| 1 | 2025-05-08T13:56:35Z | what does chrono mean |
| 1 | 2025-05-08T13:58:23Z | what does NLP mean |
| 2 | 2025-05-08T13:58:23Z | how to center a div |
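As a rough illustration of the data shape described above, the rows of the example table could be carried around as small typed records. This is only a sketch: `Record` and `parse_row` are hypothetical names invented here, not part of chronai's published interface.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    """One row of a time-series natural language dataset (hypothetical type)."""

    category: str   # partition key, e.g. user_id or area_code
    time: datetime  # when the text was produced
    text: str       # the natural language field


def parse_row(user_id: str, search_datetime: str, search_text: str) -> Record:
    """Build a Record from one row of the example table."""
    # Parse the trailing "Z" explicitly so this also works on Python
    # versions where datetime.fromisoformat does not accept it.
    ts = datetime.strptime(search_datetime, "%Y-%m-%dT%H:%M:%SZ")
    return Record(category=user_id, time=ts.replace(tzinfo=timezone.utc), text=search_text)


rows = [
    ("1", "2025-05-08T13:56:35Z", "what does chrono mean"),
    ("1", "2025-05-08T13:58:23Z", "what does NLP mean"),
    ("2", "2025-05-08T13:58:23Z", "how to center a div"),
]
records = [parse_row(*row) for row in rows]
```

Keeping the partition key generic (`category` rather than `user_id`) means the same record type would also fit datasets partitioned by something like `area_code`.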
The motivating technology for this toolkit is LLMs. Anthropic's December 2024 paper Clio: Privacy-Preserving Insights into Real-World AI Use details a neat methodology for gleaning observations and insights from a large corpus of conversations with Claude. Nothing was particularly novel about this type of analysis, except that LLMs were used in lieu of traditional NLP techniques. No fancy LDA or even less fancy TF-IDF, just calls to see what Claude thought, and a lot of them. I've seen this implemented over and over again in the last 12 months; this toolkit is an attempt to formalise this "new" data science.
What's injectable?
- Pipeline Orchestrator [Default: Dagster]: These tools should play nicely with pipeline orchestration tools such as Dagster, Airflow, Prefect, Celery, etc. If our algorithms are to leverage LLMs, we must expect latency and unexpected failures, so it seems responsible to expect these jobs to often be run on such pipeline orchestration tools.
- LLM Provider [Default: lambda.ai]: The LLM in question should be injectable, of course. Lambda Inference fits particularly well for these sorts of workloads, as there is a large range of cheap models and rate limits are not an issue.
- Embeddings Model [Default: HuggingFace]: Of course, if you wish to bring your own embeddings model, that's also allowed.
- Trad. DS Routines [Default: Sklearn]: Now and then we will need to use traditional DS techniques. These will also be injectable, but must follow the `sklearn` standard model interface.
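One way the injectables above could be expressed is with structural interfaces that any provider can satisfy. The sketch below is an assumption about how that might look, not chronai's actual class design: `LLMProvider`, `EmbeddingsModel`, `Toolkit`, and their method names are all hypothetical, and the two stand-in classes exist only to show how mocks could be swapped in.

```python
from __future__ import annotations

from typing import Protocol, Sequence


class LLMProvider(Protocol):
    """Hypothetical interface: anything that can complete a prompt."""

    def complete(self, prompt: str) -> str: ...


class EmbeddingsModel(Protocol):
    """Hypothetical interface: anything that can embed a batch of texts."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class Toolkit:
    """Hypothetical container that receives its dependencies from the caller."""

    def __init__(self, llm: LLMProvider, embeddings: EmbeddingsModel) -> None:
        self.llm = llm
        self.embeddings = embeddings

    def label(self, text: str) -> str:
        # Delegate to whichever provider was injected.
        return self.llm.complete(f"Give a short topic label for: {text}")


class EchoLLM:
    """Stand-in provider that returns the text after the last ': '."""

    def complete(self, prompt: str) -> str:
        return prompt.rsplit(": ", 1)[-1]


class LengthEmbeddings:
    """Stand-in embeddings model: one dimension, the text length."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]


toolkit = Toolkit(llm=EchoLLM(), embeddings=LengthEmbeddings())
label = toolkit.label("how to center a div")
```

Because `Protocol` classes are matched structurally, a real client wrapper would only need to expose the same method names; it would not have to inherit from anything in this package.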
What can (should) we do now (later)?
- Sessionization: The act of taking a set of semantically consistent and temporally relevant NL texts and grouping them into one `session` (see glossary).
  - Akin to what Google Trends does.
- Temporal Tagging: Tagging a certain categorical dimension with topics that have some relevance to a temporal window. For example, `user_123` tagged with `Looking for holiday destination` for the window `['2025-05-01', '2025-05-10']`.
- Clio: While the original paper did not concern itself with time, we should.
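To make the sessionization idea concrete, here is a minimal sketch under stated assumptions: events are grouped when they are both close in time and judged related by a caller-supplied function. The `sessionize` function, the `max_gap` parameter, and the word-overlap check are all hypothetical illustrations; in practice the relatedness check is where an injected LLM or embeddings model would sit.

```python
from __future__ import annotations

from datetime import datetime, timedelta
from typing import Callable

Event = tuple[datetime, str]


def sessionize(
    events: list[Event],
    max_gap: timedelta,
    is_related: Callable[[str, str], bool],
) -> list[list[Event]]:
    """Group one category's (time, text) events into sessions.

    A new session starts when the gap to the previous event exceeds
    max_gap, or when is_related says the two texts do not belong together.
    """
    sessions: list[list[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        if sessions:
            prev_time, prev_text = sessions[-1][-1]
            close_in_time = event[0] - prev_time <= max_gap
            if close_in_time and is_related(prev_text, event[1]):
                sessions[-1].append(event)
                continue
        sessions.append([event])
    return sessions


def shares_a_word(a: str, b: str) -> bool:
    # Crude placeholder for a semantic check; an LLM or embeddings
    # comparison could be injected here instead.
    return bool(set(a.lower().split()) & set(b.lower().split()))


events = [
    (datetime(2025, 5, 8, 13, 56, 35), "what does chrono mean"),
    (datetime(2025, 5, 8, 13, 58, 23), "what does NLP mean"),
    (datetime(2025, 5, 8, 18, 0, 0), "what does LDA mean"),
]
sessions = sessionize(events, max_gap=timedelta(minutes=30), is_related=shares_a_word)
```

With a 30-minute gap, the first two searches fall into one session and the evening search starts a second one. The function compares each event only with the last event of the current session, which is a simplification; comparing against the whole session would be another reasonable choice.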
TODO
- Set up docs
- Set up tests
- Set up appropriate mocks for the injectables
- Build class interfaces
- Set up publishing w/ CICD
- Set up READMEs for the above routines
- Set up glossary
- Set up branch rules
File details
Details for the file chronai-0.0.1.tar.gz.
File metadata
- Download URL: chronai-0.0.1.tar.gz
- Upload date:
- Size: 20.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 37b128be89a177ae77dc6f46f3902ad30e4af77bef5b7d77385b294060d01cd0 |
| MD5 | 8225cd685dec4610316fe9fcc18730fb |
| BLAKE2b-256 | 23d944379aeac714278bd302ca118d76d3bffd89c14176fc030e382976f68502 |
Provenance
The following attestation bundles were made for chronai-0.0.1.tar.gz:
Publisher: release.yaml on jskerman/chronai
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: chronai-0.0.1.tar.gz
  - Subject digest: 37b128be89a177ae77dc6f46f3902ad30e4af77bef5b7d77385b294060d01cd0
  - Sigstore transparency entry: 209192485
  - Sigstore integration time:
- Permalink: jskerman/chronai@a9cd9f7f68c6eab916d39b12fc699fbb25b80294
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/jskerman
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@a9cd9f7f68c6eab916d39b12fc699fbb25b80294
- Trigger Event: release
File details
Details for the file chronai-0.0.1-py3-none-any.whl.
File metadata
- Download URL: chronai-0.0.1-py3-none-any.whl
- Upload date:
- Size: 7.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5ed0ed9e76a8794330f4ef356b21780b383a93e62de0e120b0860f912577216b |
| MD5 | d4deac8320c189b1acc6f561dd9b7e02 |
| BLAKE2b-256 | b729c5127ac1ad51f65343b5aace01f47fd9d979a4d9767a049b8623fc14e267 |
Provenance
The following attestation bundles were made for chronai-0.0.1-py3-none-any.whl:
Publisher: release.yaml on jskerman/chronai
- Statement:
  - Statement type: https://in-toto.io/Statement/v1
  - Predicate type: https://docs.pypi.org/attestations/publish/v1
  - Subject name: chronai-0.0.1-py3-none-any.whl
  - Subject digest: 5ed0ed9e76a8794330f4ef356b21780b383a93e62de0e120b0860f912577216b
  - Sigstore transparency entry: 209192492
  - Sigstore integration time:
- Permalink: jskerman/chronai@a9cd9f7f68c6eab916d39b12fc699fbb25b80294
- Branch / Tag: refs/tags/v0.0.1
- Owner: https://github.com/jskerman
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yaml@a9cd9f7f68c6eab916d39b12fc699fbb25b80294
- Trigger Event: release