kern-refinery

The open-source data-centric IDE for NLP.

These details have not been verified by PyPI

Project links

Homepage

Project description

Open-source data-centric IDE for NLP. Combining (semi-)automated labeling, extensive data management and neural search capabilities.

Kern AI refinery (abbr. refinery) is like the data-centric sibling of your favorite programming environment. It provides an easy-to-use interface for weak supervision as well as extensive data management, neural search, and monitoring to ensure that the quality of your training data is as good as possible.

refinery doesn't get rid of manual labeling, but it makes sure that your valuable time is spent well.

Showcase GIF of refinery

DEMO: You can interact with the application in a (mostly read-only) online playground. Check it out here

refinery consists of multiple microservices to enable a scalable and optimized workload balance, so this is the central repository used to orchestrate the system. It builds on top of 🤗 Hugging Face and spaCy to leverage pre-built language models for your NLP tasks, as well as qdrant for neural search. Our microservices natively support GPU acceleration.

🧑‍💻 Why refinery? Built for developers with collaboration in mind
🤓 Features
☕ Installation
📘 Documentation and tutorials
😵‍💫 Need help?
🪢 Community and contact
🙌 Contributing
🗺️ Roadmap
❓ FAQ
🐍 Python SDK
🏠 Architecture
🏫 Glossary
👩‍💻👨‍💻 Team and contributors
🌟 Star History
📃 License

🧑‍💻 Why refinery? Built for developers with collaboration in mind

There are already many other labeling tools out there, so why did we decide to build yet another one?

Open-source and developer-oriented

We believe that there is a lack of open-source, developer-oriented tools for data-centric NLP. In other words: developers and scientists should be able to participate in the refinement of raw data to training data, but with the programmatic approach they love. That's why we made sure that integrations with tools like 🤗 Hugging Face and spaCy are as easy as possible.

For automation or quality control

The labeling workflow in refinery is designed to integrate heuristics like labeling functions or active learning modules, which are combined via weak supervision. This way, you can either prototype the training data for your model from scratch, or improve existing training data continuously. We designed our workflow and data management such that you can do exactly that, and find the spots in your data that need to be re-visited.

Improving collaboration with subject matter experts

While doing so, we aim to improve the collaboration between engineers and subject matter experts (SMEs). In the past, we've seen how our application was used in meetings to discuss label patterns in the form of labeling functions and distant supervisors. We believe that data-centric AI is the best way to leverage collaboration.

Integrations

Lastly, refinery supports SDK actions like pulling and pushing data. Data-centric AI redefines labeling to be more than a one-time job by giving it an iterative workflow, so we aim to give you more power every day by providing end-to-end capabilities, growing the large-scale availability of high-quality training data. Use our SDK to program integrations with your existing landscapes.

Your benefits

You gain better insights into the data labeling workflow, receive implicit documentation for your training data (which you can use to discuss findings), and can ultimately build better models in less time.

Our goal is to make labeling feel more like a programmatic and enjoyable task, instead of something tedious and repetitive. refinery is our contribution to this goal. And we're constantly aiming to improve this contribution.

If you like what we're working on, please leave a ⭐!

🤓 Features

(Semi-)automated labeling workflow for NLP tasks

Both manual and programmatic for classifications and span-labeling
Integration with state-of-the-art libraries and frameworks
Creation and management of lookup lists/knowledge bases to support during labeling
Neural search-based retrieval of similar records and outliers
Sliceable labeling sessions to drill-down on specific subsets
Multiple labeling tasks possible per project

Extensive data management and monitoring

Integration with 🤗 Hugging Face to automatically create document- and token-level embeddings
JSON-based data model for up- and downloads
Overview of project metrics like label distributions and confusion matrix
Data accessible and extendable via our Python SDK

Team workspaces in the managed version

Allow multiple users to label your data
Automated calculation of inter-annotator agreements

☕ Installation

From pip

pip install kern-refinery

Once the library is installed, go to the directory where you want to store the data and run refinery start. This will automatically git clone this repository first if you haven't done so yet. To stop the server, run refinery stop.

From repository

TL;DR:

$ git clone https://github.com/code-kern-ai/refinery.git
$ cd refinery

If you're on Mac/Linux:

$ ./start

If you're on Windows:

$ start.bat

To stop, type ./stop (Mac/Linux) or stop.bat.

refinery consists of multiple services that need to be run together. To make this easy, we've prepared a setup file that automatically pulls and connects the respective services for you. The file is part of this repository, so you can just clone it and run ./start (Mac/Linux) or start.bat (Windows) in the repository. After a few minutes (now is a good time to grab a coffee ☕), the setup is done and you can access http://localhost:4455 in your browser. To stop the server, run ./stop (Mac/Linux) or stop.bat (Windows).

You're ready to start! 🙌 🎉

If you run into any issues during installation, please don't hesitate to reach out to us (see community section below).

Persisting data

By default, we store the data in the directory refinery/postgres-data. If you want to change that path, you need to modify the variable LOCAL_VOLUME in the start script for your operating system. To remove data, simply delete the volume folder. Only delete it if you no longer need the data - this is irreversible!

📘 Documentation and tutorials

The best way to start with refinery is our quick start.

You can find extensive guides in our README docs and tutorials on our YouTube channel. We've also prepared a repository with sample projects which you can clone.

If you need help writing your first labeling functions, look into our template functions repository.

You can find our changelog here.

😵‍💫 Need help?

No worries, we've got you. If you have questions, please open a ticket in the "q&a" category of our forum.

🪢 Community and contact

Feel free to join our Discord, where we discuss recent findings in data-centric AI.

We send out a (mostly) weekly newsletter about recent findings in data-centric AI, product highlights in development and more. You can subscribe to the newsletter here.

Also, you can follow us on Twitter and LinkedIn.

To reach out to us, please use our contact form.

🙌 Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated. You can do so by providing feedback about desired features and bugs you might detect.

If you want to actively participate in extending the code base, reach out to us. We'll explain how the architecture is set up, so you can customize the application as you desire.

🗺️ Roadmap

Our goal is to provide you with an easy-to-use, yet powerful open-source tool, which helps you to build the best training data for your model. We'll focus on the following high-level tasks:

Further labeling task options in the area of NLP
Extensive user-, label- and data-management capabilities
Improving the developer experience continuously
Continuously making the whole system more efficient to provide you with realtime insights
Providing you with great content to learn more about data-centric AI and how to implement it in refinery
Integrations to your favorite ML frameworks and applications

You can find our short- to midterm feature plans in the public roadmap.

❓ FAQ

Concept questions

What is a heuristic?

Heuristics are the ingredients for scaling your data labeling. They don't have to be 100% accurate; a heuristic can be, for example, a simple Python function expressing some domain knowledge. When you add and run several of these heuristics, you create what is called a noisy label matrix, which is matched against the reference data that you labeled manually. This allows us to analyze correlations, conflicts, overlaps, the number of hits for a data set, and the accuracy of each heuristic.
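To make the idea concrete, here is a minimal sketch of two toy heuristics and the noisy label matrix they produce. The record layout (a dict with a `headline` string), the function names and the label names are assumptions made for this illustration; they are not taken from refinery's actual labeling function interface.

```python
from typing import Callable, Dict, List, Optional

# Assumed record shape for this sketch: attribute name -> raw text.
Record = Dict[str, str]
Heuristic = Callable[[Record], Optional[str]]


def lf_dips(record: Record) -> Optional[str]:
    """Vote NEGATIVE if the headline mentions a dip or a drop."""
    text = record["headline"].lower()
    if "dips" in text or "drops" in text:
        return "NEGATIVE"
    return None  # abstain: this heuristic has no opinion on the record


def lf_surges(record: Record) -> Optional[str]:
    """Vote POSITIVE if the headline mentions gains or a surge."""
    text = record["headline"].lower()
    if "gains" in text or "surges" in text:
        return "POSITIVE"
    return None


def build_label_matrix(
    records: List[Record], heuristics: List[Heuristic]
) -> List[List[Optional[str]]]:
    """One row per record, one column per heuristic; None means 'no hit'."""
    return [[heuristic(record) for heuristic in heuristics] for record in records]


records = [
    {"headline": "T. Rowe Price (TROW) Dips More Than Broader Markets"},
    {"headline": "Hypothetical Corp Surges After Earnings"},  # made-up record
    {"headline": "Markets Close Flat"},  # made-up record
]
label_matrix = build_label_matrix(records, [lf_dips, lf_surges])
for row in label_matrix:
    print(row)
```

Each heuristic is allowed to abstain, which is why the matrix is sparse and noisy rather than a complete labeling.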

How can I build an active learning model?

We use pre-trained models to create embeddings first. Once this is done, the embeddings are available in the application (both for building active learning heuristics and for neural search). In our active learning IDE, you can then build a simple classification or extraction head on top of the embedding, and we'll manage the execution in a containerized environment.

How do I know whether my heuristic is good?

A heuristic can be “good” with respect to both coverage and precision. For coverage, there is basically no limitation at all; for precision, we generally recommend a value above 70%, depending on how many heuristics you have. The more heuristics you have, the more overlaps and conflicts there will be, and the better weak supervision can work.
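As a rough sketch of how these two numbers can be computed for a single heuristic, assume that coverage is the share of records the heuristic fires on and that precision is measured only on records that also have a manual label. These definitions and all names below are assumptions for illustration, not refinery's internal implementation.

```python
from typing import List, Optional, Tuple


def coverage_and_precision(
    votes: List[Optional[str]], manual: List[Optional[str]]
) -> Tuple[float, Optional[float]]:
    """Return (coverage, precision) for one heuristic.

    votes[i] is the heuristic's label for record i (None = no hit);
    manual[i] is the manually assigned reference label (None = unlabeled).
    """
    coverage = sum(v is not None for v in votes) / len(votes) if votes else 0.0
    # Precision can only be judged where the heuristic fired AND a manual label exists.
    compared = [(v, m) for v, m in zip(votes, manual) if v is not None and m is not None]
    if not compared:
        return coverage, None
    correct = sum(1 for v, m in compared if v == m)
    return coverage, correct / len(compared)


votes = ["NEGATIVE", None, "NEGATIVE", "NEGATIVE", None]
manual = ["NEGATIVE", "POSITIVE", "POSITIVE", "NEGATIVE", None]
coverage, precision = coverage_and_precision(votes, manual)
print(f"coverage={coverage:.2f}, precision={precision:.2f}")
```

In this toy data the heuristic fires on three of five records and is right on two of the three that can be checked, so it would fall slightly below the 70% guideline mentioned above.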

If I can automate the labeling, why should I train a model at all?

Technically, you could use refinery for inference. However, the best results are achieved if a supervised learning model is trained on the generated labels, as these models improve generalization. It’s just a best practice. If you want to use the model for inference, check out our open-source library weak-nlp.

I have less than 1,000 records - Do I need this?

You can definitely use the system for smaller datasets, too! It not only shines via programmatic labeling, but also has a simple and beautiful UI. Go for it 😁

Technical questions

Help!! I forgot my password!

No worries, you can send a reset link even on your local machine. However, the link isn't sent to your email, but to MailHog. Access it via http://localhost:4436.

I want to install a library for my labeling function

For this, we need to change the requirements.txt of the lf-exec-env, the containerized execution environment for your labeling functions. Please just open an issue, and we'll integrate your library as soon as possible.

Which data formats are supported?

We’ve structured our data formats around JSON, so you can upload most file types natively. This includes spreadsheets, text files, CSV data, generic JSON and many more.

How can I upload data?

We use pandas internally for matching your data to our JSON-based data model. You can upload the data via our UI, or via our Python SDK.
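If you want to check what your file looks like as JSON records before uploading, a small standard-library sketch like the following can help. It only illustrates the target shape (a list of flat JSON objects, one per record); it is not how refinery performs the matching internally, and the column names are taken from the export example in this FAQ.

```python
import csv
import io
import json
from typing import Dict, List

# Stand-in for the contents of a CSV file you might want to upload.
csv_text = """headline,date
T. Rowe Price (TROW) Dips More Than Broader Markets,Jun-30-22 06:00PM
"""


def csv_to_records(text: str) -> List[Dict[str, str]]:
    """Turn CSV text into a list of flat dicts, one per row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


records = csv_to_records(csv_text)
print(json.dumps(records, indent=4))
```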

How can I download data, and what format does it have?

You can download your data in our UI or via the Python SDK, where we also provide e.g. adapters to Rasa. The export looks something like this:

[
    {
        "running_id": "0",
        "headline": "T. Rowe Price (TROW) Dips More Than Broader Markets",
        "date": "Jun-30-22 06:00PM\u00a0\u00a0",
        "headline__sentiment__MANUAL": null,
        "headline__sentiment__WEAK_SUPERVISION": "NEGATIVE",
        "headline__sentiment__WEAK_SUPERVISION__confidence": 0.62,
        "headline__entities__MANUAL": null,
        "headline__entities__WEAK_SUPERVISION": [
            "STOCK", "STOCK", "STOCK", "STOCK", "STOCK", "STOCK", "O", "O", "O", "O", "O"
        ],
        "headline__entities__WEAK_SUPERVISION__confidence": [
            0.98, 0.98, 0.98, 0.98, 0.98, 0.98, 0.00, 0.00, 0.00, 0.00, 0.00
        ]
    }
]
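The key names in the export follow the pattern attribute__task__SOURCE, with an optional __confidence suffix. The sketch below reads such an export and picks one label per record. Preferring manual labels over weak supervision and the 0.5 confidence threshold are assumptions chosen for this example, and `resolve_label` is a hypothetical helper, not part of the SDK.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Shortened copy of the export shown above.
export_json = """
[
    {
        "running_id": "0",
        "headline": "T. Rowe Price (TROW) Dips More Than Broader Markets",
        "headline__sentiment__MANUAL": null,
        "headline__sentiment__WEAK_SUPERVISION": "NEGATIVE",
        "headline__sentiment__WEAK_SUPERVISION__confidence": 0.62
    }
]
"""


def resolve_label(
    record: Dict[str, Any], attribute: str, task: str, min_confidence: float = 0.5
) -> Tuple[Optional[str], Optional[str]]:
    """Return (label, source), preferring manual labels over weak supervision."""
    prefix = f"{attribute}__{task}"
    manual = record.get(f"{prefix}__MANUAL")
    if manual is not None:
        return manual, "MANUAL"
    weak = record.get(f"{prefix}__WEAK_SUPERVISION")
    confidence = record.get(f"{prefix}__WEAK_SUPERVISION__confidence")
    # Only accept a weakly supervised label if it is confident enough.
    if weak is not None and confidence is not None and confidence >= min_confidence:
        return weak, "WEAK_SUPERVISION"
    return None, None


records = json.loads(export_json)
for record in records:
    label, source = resolve_label(record, "headline", "sentiment")
    print(record["running_id"], label, source)
```

Raising `min_confidence` is a simple way to keep only the records whose weakly supervised label you trust enough for training.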

Service and hosting questions

Are there options for an enterprise on-prem solution?

If you're interested in running the multi-user version on your premises, please reach out to us. We can help you to set up the deployment and prepare your project(s) e.g. with workshops.

I don't want to label myself. What are my options?

Do you want to outsource your labeling, and let your engineers use refinery as a mission control for your training data? Reach out to us, so we can discuss how we can help you with your use case.

How can I reach support?

In our open-source solution, you can reach out to us via Discord. For our managed version, you have an in-app chat to directly contact our support team.

🐍 Python SDK

You can extend your projects by using our Python SDK. With it, you can easily export labeled data of your current project and import new files both programmatically and via CLI (rsdk pull and rsdk push <file_name>). It also comes with adapters, e.g. to Rasa.

🏠 Architecture

Our architecture follows some main patterns:

Shared service database to efficiently transfer large data loads. To avoid redundant code in the services, we use submodules to share the data model
Containerized function execution for labeling functions, active learning and the record IDE
Machine learning logic is implemented in stand-alone libraries (e.g. sequence-learn)

Architecture refinery

Some edges are not displayed for simplicity's sake.
The colors of the edges have no implicit meaning, and are only used for better readability.

Service overview (maintained by Kern AI)

Service	Description
ml-exec-env	Execution environment for the active learning module. Containerized function as a service to build active learning models using scikit-learn and sequence-learn.
embedder	Embedder for refinery. Manages the creation of document- and token-level embeddings using the embedders library.
weak-supervisor	Weak supervision for refinery. Manages the integration of heuristics such as labeling functions, active learners or zero-shot classifiers. Uses the weak-nlp library for the actual integration logic and algorithms.
record-ide-env	Execution environment for the record IDE. Containerized function as a service to build record-specific "quick-and-dirty" code snippets for exploration and debugging.
config	Configuration of refinery. Amongst others, this manages endpoints and available language models for spaCy.
tokenizer	Tokenizer for refinery. Manages the creation and storage of spaCy tokens for text-based record attributes and supports multiple language models.
gateway	Gateway for refinery. Manages incoming requests and holds the workflow logic. To interact with the gateway, the UI or Python SDK can be used.
authorizer	Evaluates whether a user has access to certain resources.
websocket	Websocket module for refinery. Enables asynchronous notifications inside the application.
lf-exec-env	Execution environment for labeling functions. Containerized function as a service to execute user-defined Python scripts.
ac-exec-env	Execution environment for attribute calculation. Containerized function as a service to generate new attributes using Python scripts.
updater	Updater for refinery. Manages migration logic to new versions if required.
neural-search	Neural search for refinery. Manages similarity search powered by Qdrant and outlier detection, both based on vector representations of the project records.
zero-shot	Zero-shot module for refinery. Enables the integration of 🤗 Hugging Face zero-shot classifiers as an off-the-shelf no-code heuristic.
entry	Login and registration screen for refinery. Implemented via Ory Kratos.
ui	UI for refinery. Used to interact with the whole system; to find out how to best work with the system, check out our docs.
doc-ock	Usage statistics collection for refinery. If users allow it, this collects product insight data used to optimize the user experience.
gateway-proxy	Gateway proxy for refinery. Manages incoming requests and forwards them to the gateway. Used by the Python SDK.

Service overview (open-source 3rd party)

Service	Description
qdrant/qdrant	Qdrant - Vector Search Engine for the next generation of AI applications
postgres/postgres	PostgreSQL: The World's Most Advanced Open Source Relational Database
minio/minio	Multi-Cloud ☁️ Object Storage
mailhog/MailHog	Web and API based SMTP testing
ory/kratos	Next-gen identity server (think Auth0, Okta, Firebase) with Ory-hardened authentication, MFA, FIDO2, TOTP, WebAuthn, profile management, identity schemas, social sign in, registration, account recovery, passwordless. Golang, headless, API-only - without templating or theming headaches. Available as a cloud service.
ory/oathkeeper	A cloud native Identity & Access Proxy / API (IAP) and Access Control Decision API that authenticates, authorizes, and mutates incoming HTTP(s) requests. Inspired by the BeyondCorp / Zero Trust white paper. Written in Go.

Integrations overview (maintained by Kern AI)

Integration	Description
refinery-python	Official Python SDK for Kern AI refinery.
sequence-learn	With sequence-learn, you can build models for named entity recognition as quickly as if you were building a sklearn classifier.
embedders	With embedders, you can easily convert your texts into sentence- or token-level embeddings within a few lines of code. Use cases for this include similarity search between texts, information extraction such as named entity recognition, or basic text classification. Integrates 🤗 Hugging Face transformer models.
weak-nlp	With weak-nlp, you can integrate heuristics like labeling functions and active learners based on weak supervision. Automate data labeling and improve label quality.

Integrations overview (open-source 3rd party)

Integration	Description
huggingface/transformers	🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
scikit-learn/scikit-learn	scikit-learn: machine learning in Python
explosion/spaCy	💫 Industrial-strength Natural Language Processing (NLP) in Python

Submodules overview

Although they are not shown in the architecture, we use git submodules for internal code management.

Submodule	Description
submodule-model	Data model for refinery. Manages entities and their access for multiple services, e.g. the gateway.
submodule-s3	S3 related AWS and Minio logic.

🏫 Glossary

Term	Meaning
Weak supervision	Technique/methodology to integrate different kinds of noisy and imperfect heuristics like labeling functions. It can be used not only to automate data labeling, but generally as an approach to improve your existing label quality.
Neural search	Embedding-based approach to retrieve information; instead of telling a machine a set of constraints, neural search analyzes the vector space of data (encoded via e.g. pre-trained neural networks). Can be used e.g. to find nearest neighbors.
Active learning	As data is labeled manually, a model is trained continuously to support the annotator. Can be used e.g. stand-alone, or as a heuristic for weak supervision.
Vector encoding (embedding)	Using pre-trained models such as transformers from 🤗 Hugging Face, texts can be transformed into vector space. This is both helpful for neural search and active learning (in the latter case, simple classifiers can be applied on top of the embedding, which enables fast re-training on the vector representations).
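To illustrate the weak supervision entry above, the following sketch combines the votes of several heuristics for one record with a precision-weighted vote. This is a deliberately simplified stand-in and not the algorithm used by weak-nlp or refinery; the function name, the weighting scheme and the example numbers are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def weighted_vote(
    votes: List[Optional[str]], precisions: List[float]
) -> Tuple[Optional[str], float]:
    """Combine heuristic votes for one record into (label, confidence).

    votes[i] is the label from heuristic i (None = abstain);
    precisions[i] is that heuristic's estimated precision, used as its weight.
    """
    scores: Dict[str, float] = defaultdict(float)
    for vote, precision in zip(votes, precisions):
        if vote is not None:
            scores[vote] += precision
    if not scores:
        return None, 0.0  # no heuristic fired on this record
    label = max(scores, key=lambda name: scores[name])
    # Confidence: the winning label's share of the total weight.
    confidence = scores[label] / sum(scores.values())
    return label, confidence


label, confidence = weighted_vote(
    ["NEGATIVE", "POSITIVE", "NEGATIVE", None], [0.8, 0.7, 0.9, 0.6]
)
print(label, round(confidence, 2))
```

Conflicting votes lower the confidence, which is one way to surface the records that need to be revisited manually.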

Missing anything in the glossary? Add the term in an issue with the tag "enhancement".

👩‍💻👨‍💻 Team and contributors

Henrik Wenck	Johannes Hötter	Anton Pullem	Lina Lumburovska	Moritz Feuerpfeil	Leo Püttmann	Simon Degraf
Felix Kirsch	Jens Wittmeyer	Mikhail Kochikov	Simon Witzke	Shamanth Shetty

🌟 Star History

📃 License

refinery is licensed under the Apache License, Version 2.0. View a copy of the License file.

Project details


Release history

This version

1.3.0

Sep 28, 2022

1.2.0

Sep 12, 2022

1.1.1

Aug 24, 2022

1.1.0

Aug 19, 2022

1.0.3

Aug 3, 2022

1.0.2

Jul 17, 2022

1.0.1

Jul 17, 2022

1.0.0

Jul 16, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distribution


kern_refinery-1.3.0-py2.py3-none-any.whl (18.3 kB view details)

Uploaded Sep 28, 2022 Python 2, Python 3

File details

Details for the file kern_refinery-1.3.0-py2.py3-none-any.whl.

File metadata

Download URL: kern_refinery-1.3.0-py2.py3-none-any.whl
Upload date: Sep 28, 2022
Size: 18.3 kB
Tags: Python 2, Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.10.3

File hashes

Hashes for kern_refinery-1.3.0-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`4d28797472e7cc24550529916ba61a51e6ffbd4d156aea72c25d606482c6ee3c`
MD5	`e6e84dd1eefdf91755d12f94c97d5a74`
BLAKE2b-256	`c4ea5528873acc4b8b9557bc6d4cd78df9cc2c6853f85777ddfbdcdda2603560`

See more details on using hashes here.
