
Pipeline for decontextualization of scientific snippets.


QA Decontextualization

See experiments for a description of how to run the experiments investigating this method.

Set Up

conda create -n qa_decontext python=3.9
conda activate qa_decontext
pip install -e .

Quick Start

1. Set your OpenAI API key
export OPENAI_API_KEY='yourkey'
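You can also set the key from Python before importing the library (this assumes qa_decontext reads the OPENAI_API_KEY environment variable, as the shell export above suggests):

import os

# Equivalent to the shell export above (assumes the library reads this variable).
os.environ["OPENAI_API_KEY"] = "yourkey"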
2. Decontextualize

To decontextualize a snippet using some context, you can pass both the snippet and context to the decontextualization function.

from qa_decontext import decontextualize

context = """\
Data collection. Subreddits are sub-communities on Reddit oriented around specific interests or topics, such as technology or politics. Sampling from Reddit as a whole would bias the model towards the most commonly discussed content. But by sampling posts from individual subreddits, we can control the kinds of posts we use to train our model. To collect a diverse training dataset, we have randomly sampled 1000 posts each from the subreddits politics, business, science, and AskReddit, and 1000 additional posts from the Reddit frontpage. All posts in our sample appeared between January 2007 and March 2015, and to control for length effects, contain between 300 and 400 characters. This results in a total training dataset of 5000 posts.

We compare the predictions of logistic regression models based on unigram bag-of-words features (BOW), sentiment signals (SENT), the linguistic features from our earlier analyses (LING), and combinations of these features. BOW and SENT provide baselines for the task. We compute BOW features using term frequency-inverse document frequency (TF-IDF) and category-based features by normalizing counts for each category by the number of words in each document. The BOW classifiers are trained with regularization (L2 penalties of 1.5).

We now apply our dogmatism classifier to a larger dataset of posts, examining how dogmatic language shapes the Reddit community. Concretely, we apply the BOW+LING model trained on the full Reddit dataset to millions of new unannotated posts, labeling these posts with a probability of dogmatism according to the classifier (0=non-dogmatic, 1=dogmatic). We then use these dogmatism annotations to address four research questions."""

snippet = "Concretely, we apply the BOW+LING model trained on the full Reddit dataset to millions of new unannotated posts, labeling these posts with a probability of dogmatism according to the classifier (0=non-dogmatic, 1=dogmatic)."

decontextualize(snippet, context)
> "[REF0] apply the BOW+LING [bag-of-words and linguistic features] model trained on the full Reddit dataset [different subreddit representing different topics, such as politics, business, science and other other posts in the Reddit home page] to millions of new unannotated posts, labeling these posts with a probability of dogmatism according to the classifier (0=non-dogmatic, 1=dogmatic)."

Often, you want to use a whole paper or a list of papers as context. You can specify papers either as paths to JSON files in a specific format or as S2ORC IDs.

paper_1 = "path/to/paper.json"
paper_2 = "path/to/cited_paper.json"

decontextualize(snippet, paper_1)
decontextualize(snippet, [paper_1, paper_2])

Below is the json file format:

{
    "title": "<title>",
    "abstract": "<abstract>",
    "full_text": [{
        "section_name" : "<section_title>",
        "paragraphs": ["<paragraph>", ...]
    }, ...]
}

Subsection names should be separated from their supersection name with ":::". For example, the subsection "Metrics" of the "Methods" section would have the section_name: "Methods ::: Metrics".
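As an illustration, a minimal paper file with a nested section could be written like this (a sketch; the title, abstract, and paragraph text are placeholders):

import json

# Sketch of a paper file in the expected format (all values are placeholders).
paper = {
    "title": "Example Paper Title",
    "abstract": "A short abstract of the paper.",
    "full_text": [
        {
            "section_name": "Methods ::: Metrics",  # "Metrics" subsection of "Methods"
            "paragraphs": ["First paragraph of the subsection.", "Second paragraph."],
        }
    ],
}

with open("path/to/paper.json", "w") as f:
    json.dump(paper, f)

The resulting file can then be passed to decontextualize as shown above.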

You can also specify papers with S2ORC IDs, as long as the full text of the paper is available in S2ORC.

decontextualize(snippet, "52118895")
decontextualize(snippet, ["52118895", "65318895"])
3. Customize the decontextualization pipeline

By default we use GPT-4 to answer questions based on the full document, but you can customize the different parts of the pipeline by including a config in the call to decontextualize. The pipeline consists of three parts: a question generator (qgen), a question-answering system (qa), and a synthesizer (synth) that rewrites the snippet to include the answers to the questions. Each component of the pipeline can be specified using a config. The config can be either a dictionary or a path to a YAML file with the following structure (the values shown are the defaults):

qgen:
    model_name: "text-davinci-003"
    max_questions: 3
    template: "templates/qgen.yaml"
qa:
    retriever: null  # "dense" for contriever or "tfidf" for BM25
    model_name: "gpt4"
    template: "templates/qa.yaml"
synth:
    model_name: "text-davinci-003"
    template: "templates/synth.yaml"

Then pass the config to decontextualize:

decontextualize(snippet, context, config="config.yaml")
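The config can also be passed as a dictionary with the same structure (a sketch; the values mirror the defaults above, with the retriever switched to "tfidf" as an example):

config = {
    "qgen": {
        "model_name": "text-davinci-003",
        "max_questions": 3,
        "template": "templates/qgen.yaml",
    },
    "qa": {
        "retriever": "tfidf",  # example: retrieve with BM25 instead of using the full document
        "model_name": "gpt4",
        "template": "templates/qa.yaml",
    },
    "synth": {
        "model_name": "text-davinci-003",
        "template": "templates/synth.yaml",
    },
}

decontextualize(snippet, context, config=config)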
4. Debugging

For debugging purposes, it's useful to have access to the intermediate outputs of the pipeline. To expose these, set the return_metadata argument to True.
new_snippet, metadata = decontextualize(snippet, paper_1, return_metadata=True)

The returned metadata has the following structure:

{
    "qgen": [
        {"qid": "<question_1_id>", "question": "<question_1>"},
        ...
    ],
    "qa-retrieval": [
        {"<question_1_id>": ["<doc_1>", "<doc_2>", ...]},
        ...
    ],
    "qa-answers": {
        "<question_1_id>": "answer",
        "<question_2_id>": "<answer>",
        ...
    },
    "cost": <cost_in_dollars>
}
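For example, the generated questions, their answers, and the total cost could be inspected like this (a sketch based on the structure above):

new_snippet, metadata = decontextualize(snippet, paper_1, return_metadata=True)

# Print each generated question with the answer the QA step produced for it.
for q in metadata["qgen"]:
    print("Q:", q["question"])
    print("A:", metadata["qa-answers"][q["qid"]])

print("Cost ($):", metadata["cost"])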

Documentation

def decontextualize(
    snippet: str,
    context: Union[str, list[str], Path, list[Path]],
    config: Union[dict, str, Path] = "configs/default.yaml",
    return_metadata: bool = False
) -> Union[str, Tuple[str, dict]]
"""Decontextualizes the snippet using the given context according to the given config.

Args:
    snippet: The text snippet to decontextualize.
    context: The context to incorporate in the decontextualization. This can be:
        * a string with the context.
        * a list of context strings (each item should be a paragraph).
        * a path to a json file with the full paper.
        * a list of paths to json files with the full paper, all of which should be used as context.
        * a string containing the S2ORC ID of a paper to use.
        * a list of strings containing the S2ORC IDs of the papers to use as context.

Returns:
    A string with the decontextualized version of the snippet.

    If `return_metadata=True`, the intermediate results for each step of the pipeline (as described above) are additionally returned.
"""
