Customizable tool for easy manual annotation

Humannotator

Library for conveniently creating simple customizable annotators for manual annotation of your data
Jenia Kim, Lawrence Vriend

Works well with Jupyter notebooks.

Use case

The humannotator provides an easy way to set up custom annotators. This tool is for you if manual annotation is part of your workflow and you are looking for a solution that is:

  • Lightweight
  • Customizable
  • Easy to set up
  • Integrated with Jupyter/pandas/Python

Quick start

Install the humannotator

Install with conda:

    conda install -c lcvriend humannotator

Or use pip:

    pip install humannotator

Create a simple annotator

  1. Load the data
  2. Define the tasks
  3. Instantiate the annotator

    import pandas as pd
    from humannotator import Annotator

    # load data
    df = pd.read_csv('examples/popcorn_classics.csv', sep=';', index_col=0)

    # set up the annotator
    ratings = [
        'One bag',
        'Two bags',
        'Three bags',
        'Four bags',
        'Five-bagger',
    ]
    annotator = Annotator(df, name='VFA | Rate my popcorn classics')
    annotator.tasks['Bags of popcorn'] = ratings

    # run annotator
    annotator(user='GT')

In Jupyter this renders the annotator in HTML and CSS.

Annotate your data

  • Use the annotator by calling it: annotator().
  • The annotator keeps track of where you were.
  • Highlight phrases with the 'phrases' argument.
  • The annotator stores user (if provided) and timestamp with the annotation.

Access your annotations

  • The annotations are conveniently stored in a pandas DataFrame.
  • Access the annotations with the annotated attribute.
  • Get the indices of the records without annotation with unannotated.
  • Return the data merged with its annotations with the merged method.
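
To make the relationship between these views concrete, here is a minimal pandas-only sketch. It does not call the humannotator API: the frames `data` and `annotations` and their contents are made up for illustration, and the code only mimics what unannotated and merged are described as returning.

```python
import pandas as pd

# Hypothetical data to annotate, indexed by record id.
data = pd.DataFrame(
    {"title": ["Jaws", "Alien", "Heat"]},
    index=pd.Index([1, 2, 3], name="id"),
)

# Hypothetical annotations: one row per annotated record.
annotations = pd.DataFrame(
    {"Bags of popcorn": ["Five-bagger", "Four bags"], "user": ["GT", "GT"]},
    index=pd.Index([1, 3], name="id"),
)

# Ids present in the data but missing from the annotations.
unannotated = data.index.difference(annotations.index)

# Data joined with its annotations; unannotated rows get missing values.
merged = data.join(annotations, how="left")

print(list(unannotated))
print(merged)
```

Because the annotations share the index of the data, both operations reduce to ordinary index arithmetic in pandas.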

Store your annotations

  • Store the annotator with the save method.
  • Load the annotator with the load method.

Load data

The annotator accepts list, dict, Series and DataFrame objects as data.
The data will be converted to a dataframe internally.

Dataframes

  • By default, the annotator will use the dataframe's index and all columns.
  • Use load_data to easily create a data object if you need more control:
    1. id_col sets the column to be used as index.
    2. item_cols sets the column or columns to be displayed.
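
The effect of id_col and item_cols can be pictured with plain pandas. This is only a sketch of the described behavior, not a call to load_data itself (its exact signature is not shown here); the frame and its column names are invented for the example.

```python
import pandas as pd

# Hypothetical raw data with an id column and two content columns.
raw = pd.DataFrame(
    {
        "doc_id": ["a1", "b2", "c3"],
        "text": ["first item", "second item", "third item"],
        "source": ["web", "print", "web"],
    }
)

id_col = "doc_id"      # column to use as index
item_cols = ["text"]   # column(s) to display while annotating

# Roughly what choosing an id column and item columns amounts to.
items = raw.set_index(id_col)[item_cols]
print(items)
```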

Define tasks

Tasks can be set up through subscription or with the task_factory.

Setting up tasks with the task factory

Create a task by passing task_factory:

  • the kind of task
  • the name of the task
  • (optionally) an instruction
  • (optionally) a list of dependencies
  • whether it is nullable (default is False)
  • any kwargs necessary (depends on the kind of task)

Typically:

    task_factory(
        'kind',
        'name',
        instruction='instruction',
        dependencies=dependencies,
        nullable=False,  # or True
        **kwargs,
    )

Passing a dict or list to kind will create a categorical task.
In this case the categories kwarg is ignored.

Setting up tasks through subscription

It is also possible to instantiate an annotator and add tasks through subscription:

    a = Annotator()
    a.tasks['topic'] = ['economy', 'politics', 'media', 'other']
    a.tasks['factual'] = bool, "Is the article factual?", False

To add a task like this, you minimally need to provide the kind of task you are trying to create. Optionally, you can add an instruction, nullability, dependencies and any other kwargs (as a dictionary). Change the order in which tasks are prompted to the user with the order attribute on tasks.

Available tasks

kind      kwargs      dtype             description
str                   object            String
regex     regex       object            String validated by regex
int                   Int64             Nullable integer
float                 float64           Float
bool                  bool              Boolean
category  categories  CategoricalDtype  Categorical variable
date                  datetime64[ns]    Date
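
The dtype column refers to pandas dtypes. As a small pandas-only illustration (independent of humannotator), the snippet below shows why Int64 is listed as a nullable integer: unlike int64, it can hold missing values without falling back to floats. The category labels are borrowed from the subscription example above.

```python
import pandas as pd

# 'Int64' (capital I) is pandas' nullable integer dtype.
counts = pd.Series([3, None, 7], dtype="Int64")
print(counts.dtype)
print(counts.isna().tolist())

# A categorical dtype restricts values to a fixed set of categories.
topic_dtype = pd.CategoricalDtype(["economy", "politics", "media", "other"])
topics = pd.Series(["media", "economy"], dtype=topic_dtype)
print(topics.cat.categories.tolist())
```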

Dependencies

A dependency consists of a condition and a value, which can be passed as a tuple:

    ("col1 == 'x'", False)

The condition is a pandas query statement. Before prompting the user for input, the condition is evaluated on the current annotation. If the query evaluates to True then the value will be assigned automatically.
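
The mechanism can be sketched with pandas alone. This illustrates the logic described above and is not humannotator's internal code: the helper `resolve_dependency` and the column name `col1` are hypothetical.

```python
import pandas as pd


def resolve_dependency(annotation: dict, dependency: tuple):
    """Return (True, value) if the condition holds for this annotation,
    otherwise (False, None). Hypothetical helper for illustration only."""
    condition, value = dependency
    # Put the current annotation in a one-row frame so it can be queried.
    row = pd.DataFrame([annotation])
    matched = not row.query(condition).empty
    return (True, value) if matched else (False, None)


dependency = ("col1 == 'x'", False)

# Condition holds: the value False would be assigned without prompting.
print(resolve_dependency({"col1": "x"}, dependency))

# Condition does not hold: the user would be prompted as usual.
print(resolve_dependency({"col1": "y"}, dependency))
```

The first element of the result distinguishes "condition matched, assign this value" from "no match", which matters here because the assigned value itself may be False or None.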

Annotator

Calling the annotator

The annotator detects if it is run from Jupyter. If so, the annotator will render itself in html and css. If not, the annotator will render itself as text. You can annotate a selection of records by passing a list of ids to the annotator call. If you want to reannotate ids that have already been annotated, then set redo to True when calling the annotator.
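
How the detection works internally is not documented here. A common heuristic for telling a Jupyter kernel apart from a plain interpreter looks roughly like the sketch below; the function name `in_jupyter` is made up, and humannotator may well use a different check.

```python
def in_jupyter() -> bool:
    """Best-effort check for a Jupyter kernel (heuristic, not humannotator's code)."""
    try:
        from IPython import get_ipython
    except ImportError:
        # IPython is not installed, so this cannot be a Jupyter kernel.
        return False
    shell = get_ipython()
    # Jupyter notebooks and JupyterLab run a ZMQ-based interactive shell;
    # a plain Python process has no active shell at all.
    return shell is not None and shell.__class__.__name__ == "ZMQInteractiveShell"


print("html" if in_jupyter() else "text")
```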

Instantiating the annotator

arguments

tasks : Task, list of Task objects, Tasks, Annotations or DataFrame

Annotation task(s).
If passed a DataFrame, then the tasks will be inferred from it.
Annotation data in the dataframe will also be initialized.

data : data, list-/dict-like, Series or DataFrame, default None

Data to be annotated.
If `data` is not already a data object,
then it will be passed through `load_data`.
The annotator can be instantiated without data,
but will only work after data is loaded.

user : str, default None

Name of the user.

name : str, default 'HUMANNOTATOR'

Name of the annotator.

save_data : boolean, default False

Set flag to True if you want to store the data with the annotator.
This will ensure that the pickled object will contain the data.

other parameters

DISPLAY
text_display : boolean, default None

If True will display the annotator in plain text instead of html.

DATA
item_cols : str or list of str, default None

Name(s) of dataframe column(s) to display when annotating.
By default: display all columns.

id_col : str, default None

Name of dataframe column to use as index.
By default: use the dataframe's index.

HIGHLIGHTER
phrases : str, list of str, default None

Phrases to highlight in the display.
The phrases can be regexes.
It is also possible to pass in a dict where:
- the keys are the phrases
- the values are the css styling

escape : boolean, default False

Set escape to True in order to escape the phrases.

flags : int, default 0 (no flags)

Flags to pass through to the re module, e.g. re.IGNORECASE.
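
To show how phrases, escape and flags interact, here is a minimal sketch built on the standard re module. It is not humannotator's highlighter: the function `highlight`, its default styling and the `<span>` markup are assumptions made for the example.

```python
import re


def highlight(text, phrases, escape=False, flags=0, style="background-color: yellow"):
    """Wrap every match of each phrase in a styled <span>.

    `phrases` may be a str, a list of str, or a dict mapping phrase -> css.
    Hypothetical helper for illustration only.
    """
    if isinstance(phrases, str):
        phrases = [phrases]
    if not isinstance(phrases, dict):
        phrases = {phrase: style for phrase in phrases}

    # Phrases are applied one after another; a later phrase could therefore
    # also match inside markup inserted for an earlier one.
    for phrase, css in phrases.items():
        # With escape=True the phrase is matched literally, not as a regex.
        pattern = re.escape(phrase) if escape else phrase
        text = re.sub(
            pattern,
            lambda m, css=css: f'<span style="{css}">{m.group(0)}</span>',
            text,
            flags=flags,
        )
    return text


print(highlight("Popcorn and POPCORN", "popcorn", flags=re.IGNORECASE))
print(highlight("a.b axb", "a.b", escape=True))
print(highlight("red blue", {"red": "color: red"}))
```

Without escape, the phrase `a.b` in the second call would be read as a regex and also match `axb`.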

TRUNCATER
truncate : boolean, default True

Set to False to not truncate items.

trunc_limit : int, default 32

The number of words beyond which an item will be truncated.
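
Word-based truncation of this kind can be sketched in a few lines. The function `truncate_words` and the trailing " ..." marker are assumptions for illustration; how humannotator marks a truncated item is not stated here.

```python
def truncate_words(text: str, limit: int = 32, suffix: str = " ...") -> str:
    """Cut `text` after `limit` words (hypothetical helper, illustration only)."""
    words = text.split()
    if len(words) <= limit:
        # Short enough: return the item unchanged.
        return text
    return " ".join(words[:limit]) + suffix


print(truncate_words("one two three four", limit=3))
print(truncate_words("one two", limit=3))
```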
