XAI human-in-the-loop information extraction framework

POTATO: exPlainable infOrmation exTrAcTion framewOrk

POTATO is a human-in-the-loop XAI framework for extracting and evaluating interpretable graph features for any classification problem.

Built systems

To help you get started with rule systems, we provide rule-based features prebuilt with POTATO on different datasets (see, for example, our paper Offensive text detection on English Twitter with deep learning models and rule-based systems for the HASOC2021 shared task). You can find them, along with more information, under features/.

Install and Quick Start

Setup

The tool depends heavily on the tuw-nlp repository. You can install tuw-nlp with pip:

pip install tuw-nlp

Then follow the tuw-nlp instructions to set up the package.

Then install POTATO from pip:

pip install xpotato

Or you can install it from source, from the root of the cloned repository:

pip install -e .

Usage

First, import the classes you need from xpotato:

from xpotato.dataset.dataset import Dataset
from xpotato.models.trainer import GraphTrainer

Define the labeled sentences you want to classify:

sentences = [("fuck absolutely everything about today.", "HOF"),
            ("I just made food and I'm making myself sick to my stomach. Lol, wtf is this shit", "HOF"),
            ("RT [USER]: America is the most fucked up country [URL]", "HOF"),
            ("you'd be blind to not see the heart eyes i have for you.", "NOT"),
            ("It's hard for me to give a fuck now", "HOF"),
            ("tell me everything", "NOT"),
            ("Bitch YES [URL]", "HOF"),
            ("Eight people a minute....", "NOT"),
            ("RT [USER]: im not fine, i need you", "NOT"),
            ("Holy shit.. 3 months and I'll be in Italy", "HOF"),
            ("Now I do what I want 🤪", "NOT"),
            ("[USER] you'd immediately stop", "NOT"),
            ("Just... shut the fuck up", "HOF"),
            ("RT [USER]: ohhhh shit a [USER] [URL]", "HOF"),
            ("all i want is for yara to survive tonight", "NOT"),
            ("fuck them", "HOF")]

Initialize the dataset with a label encoding, then parse the sentences into graphs. Three graph types are currently supported: ud, fourlang, and amr.

dataset = Dataset(sentences, label_vocab={"NOT":0, "HOF": 1})
dataset.set_graphs(dataset.parse_graphs(graph_format="ud"))
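
Before building the Dataset, it can help to confirm that every label in the data has an entry in label_vocab. The check below is a minimal sketch in plain Python; check_labels is a hypothetical helper and not part of xpotato.

```python
from collections import Counter

def check_labels(sentences, label_vocab):
    """Return label counts; raise if a label is missing from label_vocab.

    Hypothetical helper for illustration, not an xpotato function.
    """
    counts = Counter(label for _, label in sentences)
    unknown = sorted(set(counts) - set(label_vocab))
    if unknown:
        raise ValueError(f"labels missing from label_vocab: {unknown}")
    return counts

# A three-sentence sample in the same (text, label) format as above.
sample = [("tell me everything", "NOT"),
          ("Eight people a minute....", "NOT"),
          ("Bitch YES [URL]", "HOF")]
counts = check_labels(sample, {"NOT": 0, "HOF": 1})
print(dict(counts))
```

Running the same check on the full sentence list also shows how balanced the labels are before any rules are written.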

Rules

Once the dataset is prepared and the graphs are parsed, we can write rules to match labels. Rules can be written manually or extracted automatically (POTATO also provides a frontend that tries to do both).

The simplest rule would be just a node in the graph:

#the syntax of the rules is List[List[rules that we want to match], List[rules that shouldn't be in the matched graphs], Label of the rule]
rule_to_match = [[["(u_1 / fuck)"], [], "HOF"]]
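
To make the three-part rule layout concrete, the sketch below unpacks each rule into its parts. unpack_rule is a hypothetical helper written for this illustration; it only inspects the nested lists and does not call xpotato.

```python
from typing import List, Tuple

def unpack_rule(rule: list) -> Tuple[List[str], List[str], str]:
    """Split one rule into (patterns to match, patterns to exclude, label).

    Hypothetical helper: it checks the list layout described above,
    not the validity of the graph patterns themselves.
    """
    if len(rule) != 3:
        raise ValueError("a rule needs exactly three parts")
    positive, negative, label = rule
    return list(positive), list(negative), label

rule_to_match = [[["(u_1 / fuck)"], [], "HOF"]]
for rule in rule_to_match:
    positive, negative, label = unpack_rule(rule)
    print(f"{label}: match {positive}, exclude {negative}")
```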

Initialize the rule matcher:

from xpotato.graph_extractor.extract import FeatureEvaluator
evaluator = FeatureEvaluator()

Match the rules in the dataset:

#match single feature
df = dataset.to_dataframe()
evaluator.match_features(df, rule_to_match)

The function will return a dataframe with the matched instances:

| | Sentence | Predicted label | Matched rule |
|---|---|---|---|
| 0 | fuck absolutely everything about today. | HOF | [['(u_1 / fuck)'], [], 'HOF'] |
| 1 | I just made food and I'm making myself sick to my stomach. Lol, wtf is this shit | | |
| 2 | RT [USER]: America is the most fucked up country [URL] | | |
| 3 | you'd be blind to not see the heart eyes i have for you. | | |
| 4 | It's hard for me to give a fuck now | HOF | [['(u_1 / fuck)'], [], 'HOF'] |
| 5 | tell me everything | | |
| 6 | Bitch YES [URL] | | |
| 7 | Eight people a minute.... | | |
| 8 | RT [USER]: im not fine, i need you | | |
| 9 | Holy shit.. 3 months and I'll be in Italy | | |
| 10 | Now I do what I want 🤪 | | |
| 11 | [USER] you'd immediately stop | | |
| 12 | Just... shut the fuck up | HOF | [['(u_1 / fuck)'], [], 'HOF'] |
| 13 | RT [USER]: ohhhh shit a [USER] [URL] | | |
| 14 | all i want is for yara to survive tonight | | |
| 15 | fuck them | HOF | [['(u_1 / fuck)'], [], 'HOF'] |
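
To judge how useful a single rule is, the matched rows can be compared with the gold labels. The sketch below does this by hand for the rule above, using the gold labels of the 16 example sentences and the row indices that received a prediction (0, 4, 12 and 15). precision_recall is a hypothetical helper, not an xpotato function.

```python
# Gold labels of the 16 example sentences, in order.
gold = ["HOF", "HOF", "HOF", "NOT", "HOF", "NOT", "HOF", "NOT",
        "NOT", "HOF", "NOT", "NOT", "HOF", "HOF", "NOT", "HOF"]
# Row indices where the rule produced a prediction.
matched_rows = {0, 4, 12, 15}

def precision_recall(gold, matched_rows, label):
    """Precision and recall of one rule that predicts `label` on matched_rows."""
    true_pos = sum(1 for i in matched_rows if gold[i] == label)
    total_label = sum(1 for g in gold if g == label)
    precision = true_pos / len(matched_rows) if matched_rows else 0.0
    recall = true_pos / total_label if total_label else 0.0
    return precision, recall

precision, recall = precision_recall(gold, matched_rows, "HOF")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The rule is precise on this sample (all four matches are HOF) but covers fewer than half of the nine HOF sentences, which is why more rules are usually needed.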

One of the core features of our tool is that we are also able to match subgraphs:

#match a simple graph feature
evaluator.match_features(df, [[["(u_1 / fuck :obj (u_2 / everything))"], [], "HOF"]])

This will return only one match instead of four:

| | Sentence | Predicted label | Matched rule |
|---|---|---|---|
| 0 | fuck absolutely everything about today. | HOF | [['(u_1 / fuck :obj (u_2 / everything))'], [], 'HOF'] |
| 1 | I just made food and I'm making myself sick to my stomach. Lol, wtf is this shit | | |
| 2 | RT [USER]: America is the most fucked up country [URL] | | |
| 3 | you'd be blind to not see the heart eyes i have for you. | | |
| 4 | It's hard for me to give a fuck now | | |
| 5 | tell me everything | | |
| 6 | Bitch YES [URL] | | |
| 7 | Eight people a minute.... | | |
| 8 | RT [USER]: im not fine, i need you | | |
| 9 | Holy shit.. 3 months and I'll be in Italy | | |
| 10 | Now I do what I want 🤪 | | |
| 11 | [USER] you'd immediately stop | | |
| 12 | Just... shut the fuck up | | |
| 13 | RT [USER]: ohhhh shit a [USER] [URL] | | |
| 14 | all i want is for yara to survive tonight | | |
| 15 | fuck them | | |

We can also add negated features that we don't want to match (this won't match the first row where 'absolutely' is present):

#match a feature with a negated condition
evaluator.match_features(df, [[["(u_1 / fuck)"], ["(u_2 / absolutely)"], "HOF"]])

| | Sentence | Predicted label | Matched rule |
|---|---|---|---|
| 0 | fuck absolutely everything about today. | | |
| 1 | I just made food and I'm making myself sick to my stomach. Lol, wtf is this shit | | |
| 2 | RT [USER]: America is the most fucked up country [URL] | | |
| 3 | you'd be blind to not see the heart eyes i have for you. | | |
| 4 | It's hard for me to give a fuck now | HOF | [['(u_1 / fuck)'], ['(u_2 / absolutely)'], 'HOF'] |
| 5 | tell me everything | | |
| 6 | Bitch YES [URL] | | |
| 7 | Eight people a minute.... | | |
| 8 | RT [USER]: im not fine, i need you | | |
| 9 | Holy shit.. 3 months and I'll be in Italy | | |
| 10 | Now I do what I want 🤪 | | |
| 11 | [USER] you'd immediately stop | | |
| 12 | Just... shut the fuck up | HOF | [['(u_1 / fuck)'], ['(u_2 / absolutely)'], 'HOF'] |
| 13 | RT [USER]: ohhhh shit a [USER] [URL] | | |
| 14 | all i want is for yara to survive tonight | | |
| 15 | fuck them | HOF | [['(u_1 / fuck)'], ['(u_2 / absolutely)'], 'HOF'] |
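
The interplay of positive and negated patterns can be illustrated without any parsing. The sketch below is a toy stand-in, not POTATO's matcher: it treats each sentence as a bag of lowercased words instead of a graph, and it assumes that a rule fires when all positive patterns are present and no negated pattern is. toy_match is a hypothetical name.

```python
import re

def toy_match(text, positive, negative):
    """Bag-of-words stand-in for graph matching (not POTATO's matcher).

    Assumption: all positive words must be present, no negated word may be.
    """
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    has_all_positive = all(word in tokens for word in positive)
    has_any_negative = any(word in tokens for word in negative)
    return has_all_positive and not has_any_negative

texts = ["fuck absolutely everything about today.",
         "It's hard for me to give a fuck now",
         "tell me everything",
         "Just... shut the fuck up",
         "fuck them"]
matches = [t for t in texts if toy_match(t, ["fuck"], ["absolutely"])]
print(matches)
```

On these five sentences the toy version selects the same three sentences as the graph rule above. Real graph patterns additionally constrain edges between nodes, which a bag of words cannot express.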

If we don't want to specify exact nodes, regular expressions can be used in place of node and edge names:

#regex can be used to match any node (this will match instances where 'fuck' is connected to any node with 'obj' edge)
evaluator.match_features(df, [[["(u_1 / fuck :obj (u_2 / .*))"], [], "HOF"]])

| | Sentence | Predicted label | Matched rule |
|---|---|---|---|
| 0 | fuck absolutely everything about today. | HOF | [['(u_1 / fuck :obj (u_2 / .*))'], [], 'HOF'] |
| 1 | I just made food and I'm making myself sick to my stomach. Lol, wtf is this shit | | |
| 2 | RT [USER]: America is the most fucked up country [URL] | | |
| 3 | you'd be blind to not see the heart eyes i have for you. | | |
| 4 | It's hard for me to give a fuck now | | |
| 5 | tell me everything | | |
| 6 | Bitch YES [URL] | | |
| 7 | Eight people a minute.... | | |
| 8 | RT [USER]: im not fine, i need you | | |
| 9 | Holy shit.. 3 months and I'll be in Italy | | |
| 10 | Now I do what I want 🤪 | | |
| 11 | [USER] you'd immediately stop | | |
| 12 | Just... shut the fuck up | | |
| 13 | RT [USER]: ohhhh shit a [USER] [URL] | | |
| 14 | all i want is for yara to survive tonight | | |
| 15 | fuck them | HOF | [['(u_1 / fuck :obj (u_2 / .*))'], [], 'HOF'] |

We can also train regex rules on training data. Training automatically replaces the regex '.*' with the nodes that are statistically 'good enough' according to the provided dataframe.

#train the regex rule: '.*' is replaced with concrete nodes found in the dataframe
evaluator.train_feature("HOF", "(u_1 / fuck :obj (u_2 / .*))", df)

This will return '(u_1 / fuck :obj (u_2 / everything|they))' (the '.*' was replaced with everything and they).
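
One way to picture this refinement is a precision filter over the words that filled the wildcard. The sketch below is an assumption about the general idea, not POTATO's actual training procedure: refine_wildcard, the min_precision threshold and the observation list are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def refine_wildcard(pattern, observations, label, min_precision=0.9):
    """Replace '.*' with an alternation of words seen mostly with `label`.

    observations: (word, gold_label) pairs for nodes that filled the wildcard.
    Hypothetical sketch; POTATO's own selection criterion may differ.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for word, gold in observations:
        totals[word] += 1
        if gold == label:
            hits[word] += 1
    keep = sorted(w for w in totals if hits[w] / totals[w] >= min_precision)
    if not keep:
        return pattern  # nothing reliable enough, keep the wildcard
    return pattern.replace(".*", "|".join(keep))

# Illustrative observations, not taken from the dataset above.
observations = [("everything", "HOF"), ("they", "HOF"), ("nothing", "NOT")]
refined = refine_wildcard("(u_1 / fuck :obj (u_2 / .*))", observations, "HOF")
print(refined)
```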

Learning rules

To extract rules automatically, train a model on the graph features of the dataset and rank the features by relevance:

df = dataset.to_dataframe()
trainer = GraphTrainer(df)
#extract features
features = trainer.prepare_and_train()

from sklearn.model_selection import train_test_split

train, val = train_test_split(df, test_size=0.2, random_state=1234)

#save train and validation, this is important for the frontend to work
train.to_pickle("train_dataset")
val.to_pickle("val_dataset")

import json

#also save the ranked features
with open("features.json", "w+") as f:
    json.dump(features, f)

You can also save the parsed graphs for evaluation or for caching:

import pickle
with open("graphs.pickle", "wb") as f:
    pickle.dump(val.graph, f)

To see the full code, check the Jupyter notebook at notebooks/examples.ipynb.

Frontend

Once the DataFrame with the parsed graphs is ready, the UI can be started to inspect and modify the extracted rules. The frontend is a Streamlit app. The simplest way to start it is shown here (the training and validation datasets must be provided):

streamlit run frontend/app.py -- -t notebooks/train_dataset -v notebooks/val_dataset -g ud

It can also be started with the extracted features:

streamlit run frontend/app.py -- -t notebooks/train_dataset -v notebooks/val_dataset -g ud -sr notebooks/features.json

If you have already used the UI to extract features manually and want to load them, you can run:

streamlit run frontend/app.py -- -t notebooks/train_dataset -v notebooks/val_dataset -g ud -sr notebooks/features.json -hr notebooks/manual_features.json
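
When the frontend is launched from a script, the three command variants above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in this section (-t, -v, -g, -sr, -hr); build_frontend_command and its parameter names are hypothetical, and the function does not start Streamlit itself.

```python
import shlex

def build_frontend_command(train, val, graph_format="ud",
                           suggested_rules=None, hand_rules=None):
    """Assemble the streamlit command line; optional flags are added if given.

    Hypothetical helper: parameter names are ours, the flags come from
    the commands shown above.
    """
    cmd = ["streamlit", "run", "frontend/app.py", "--",
           "-t", train, "-v", val, "-g", graph_format]
    if suggested_rules:
        cmd += ["-sr", suggested_rules]
    if hand_rules:
        cmd += ["-hr", hand_rules]
    return cmd

cmd = build_frontend_command("notebooks/train_dataset", "notebooks/val_dataset",
                             suggested_rules="notebooks/features.json")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root to launch the app.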

Unsupervised mode

If labels are missing or only partially provided, the frontend can also be started in unsupervised mode. The user annotates a few examples at the start, and the system then gradually suggests rules based on those examples.

A dataset without labels can be initialized with:

sentences = [("fuck absolutely everything about today.", ""),
            ("I just made food and I'm making myself sick to my stomach. Lol, wtf is this shit", ""),
            ("RT [USER]: America is the most fucked up country [URL]", ""),
            ("you'd be blind to not see the heart eyes i have for you.", ""),
            ("It's hard for me to give a fuck now", ""),
            ("tell me everything", ""),
            ("Bitch YES [URL]", ""),
            ("Eight people a minute....", ""),
            ("RT [USER]: im not fine, i need you", ""),
            ("Holy shit.. 3 months and I'll be in Italy", ""),
            ("Now I do what I want 🤪", ""),
            ("[USER] you'd immediately stop", ""),
            ("Just... shut the fuck up", ""),
            ("RT [USER]: ohhhh shit a [USER] [URL]", ""),
            ("all i want is for yara to survive tonight", ""),
            ("fuck them", "")]

Then, the frontend can be started:

streamlit run frontend/app.py -- -t notebooks/unsupervised_dataset -g ud -m unsupervised

Evaluate

If you have the features ready and you want to evaluate them on a test set, you can run:

python scripts/evaluate.py -t ud -f notebooks/features.json -d notebooks/val_dataset

The result will be a csv file with the labels and the matched rules:

| | Sentence | Predicted label | Matched rule |
|---|---|---|---|
| 0 | RT [USER]: ohhhh shit a [USER] [URL] | HOF | ['(u_48 / shit)'] |
| 1 | [USER] you'd immediately stop | HOF | ['(u_40 / user :punct (u_42 / LSB))'] |
| 2 | fuck absolutely everything about today. | HOF | ['(u_1 / fuck)'] |

Contributing

We welcome all contributions! Please fork this repository and create a branch for your modifications. We suggest getting in touch with us first, by opening an issue or by emailing Adam Kovacs or Gabor Recski at firstname.lastname@tuwien.ac.at.

Citing

License

MIT license
