
This package helps data scientists and data analysts capture notes while they work through data science tasks. The captured notes can then be searched and analyzed.

Project description


Knowledge Management for Data Science

What is this tool used for?

This tool is used for knowledge management in data science. In software design, conceptual domain models inform lower-level models; in data science work, experiments yield knowledge that informs modeling choices. Data science models are almost always informed by a variety of analysis experiments. Organizing the experiments and the knowledge artifacts that support modeling decisions is a prerequisite for reproducible data science models and for improving model quality and performance.

Please see knowledge application development context for a description of a typical knowledge application development setting.

Why do you need this tool?

The above narrative suggests that the ability to retrieve knowledge about experiments and historical models in an ad hoc manner is critical in data science. It is, and it is also grossly underserved. Knowledge management tools exist for domain-specific models, and for DevOps and MLOps, but tools for analytics and model development are siloed, so information fragments over time. Analysts and data scientists often have to visit experiment-tracking tools, data catalogs, or MLOps tools to fetch the information they need to develop a model. In a subsequent iteration of that model, the contextual information that informed its development is often lost, and the development team, possibly with new members, must reconstruct it again. This library is a step toward fixing this problem. The central idea is to organize work as a sequence of steps that are fairly standard in data analysis, and to capture knowledge in context while those steps are performed.
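The central idea can be sketched in plain Python. This is a minimal, hypothetical illustration only: the names (`Observation`, `capture`, `search`, `STANDARD_STEPS`) are invented here and are not the kmds API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical sketch of capturing knowledge in context: each note is tied
# to a standard analysis step so it can be retrieved ad hoc later. These
# names do not come from kmds; they only illustrate the idea described above.

STANDARD_STEPS = [
    "data acquisition",
    "exploratory analysis",
    "feature engineering",
    "model selection",
    "evaluation",
]

@dataclass
class Observation:
    step: str       # which standard step the note belongs to
    note: str       # the knowledge captured in context
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

journal: list[Observation] = []

def capture(step: str, note: str) -> None:
    """Record a note against one of the standard analysis steps."""
    if step not in STANDARD_STEPS:
        raise ValueError(f"unknown step: {step!r}")
    journal.append(Observation(step, note))

def search(keyword: str) -> list[Observation]:
    """Ad hoc retrieval: find every note mentioning a keyword."""
    return [o for o in journal if keyword.lower() in o.note.lower()]

capture("exploratory analysis", "Target variable is heavily right-skewed.")
capture("model selection", "Log-transforming the target improved RMSE.")

print([o.step for o in search("target")])
# prints ['exploratory analysis', 'model selection']
```

Because every note carries its step and timestamp, a later iteration of the model can recover not just what was learned but where in the workflow it was learned.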

How is it related to process guidelines and vocabularies for machine learning?

Initiatives such as CRISP-DM provide guidelines and processes for developing data science projects. Projects such as OpenML provide standardized semantic vocabularies for machine learning tasks. These are excellent tools, but the guidelines they provide are task-focused. The gap between a conceptual idea and the final, or even candidate, data science tasks for a project is filled with many assumptions and experimental evaluations. This tool aims to capture the information in those assumptions and evaluations, as well as the ordering among them.

Who would use this tool?

This tool is for data scientists and data analysts.

How do you use this tool?

  1. You install this library along with the other Python dependencies you need for your analysis task.
  2. Review the basic recipe for capturing your observations.
  3. Review the templates section to find the example relevant to you. For analytics projects, review the analytics template. For machine learning projects, review the machine learning template.
  4. Start using the tool in your projects, drawing on the recipe and template from the two steps above.

Note:

  1. The examples are based on using the files in the package, but it is quite straightforward to connect to S3 storage to get your data files; see the connection notes document for details. MinIO provides a sandbox where you can try this.

  2. Please see the wiki pages section of the repository for design perspectives and documentation. This is a work in progress.

Licensing and Feature Questions

  1. The tool is open source with an Apache 2.0 license.

  2. If you are interested in the following features, please set up a meeting with me:

    1. Help with a data analysis task for your use case.
    2. Developing an ontology-based solution similar to the above for your specific use case.
    3. Customizing this tool with other extensions, for example to integrate a feature store or metadata management tool from your data science stack.
  3. If this problem resonates with you as a developer and you would like to contribute, submit an issue; if the feature makes sense, we can discuss the possibility of a PR for it. Of course, you can also fork this repository and use it per the licensing information. Thank you for your interest.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

kmds-0.2.16.tar.gz (5.6 MB)

Uploaded Source

Built Distribution

kmds-0.2.16-py3-none-any.whl (5.8 MB)

Uploaded Python 3

File details

Details for the file kmds-0.2.16.tar.gz.

File metadata

  • Download URL: kmds-0.2.16.tar.gz
  • Upload date:
  • Size: 5.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.12.1 Darwin/23.4.0

File hashes

Hashes for kmds-0.2.16.tar.gz

  • SHA256: 38b92e217ae143ac6434c0cef0d3e99cf8b7fe59b5d14c2bdda78e549b53a9c3
  • MD5: 137098a905e67de810a5fb512b3674b9
  • BLAKE2b-256: 4a128a949c3e0ac5e4e3c8c31bd72b083b615be6d301d2e9a498d2ffec711901

See more details on using hashes here.

File details

Details for the file kmds-0.2.16-py3-none-any.whl.

File metadata

  • Download URL: kmds-0.2.16-py3-none-any.whl
  • Upload date:
  • Size: 5.8 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.7.1 CPython/3.12.1 Darwin/23.4.0

File hashes

Hashes for kmds-0.2.16-py3-none-any.whl

  • SHA256: fe216e70767099556f31543962ab954c07c0409e631639e4ce7086d1d11eb644
  • MD5: ba8b9929be7a0b09de73c15a91c7733b
  • BLAKE2b-256: 6d6da1adc3c81831087985ccac2d53b593210d2b188ec327eabd657608bd4c4b

See more details on using hashes here.
