POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis
Description
POSNoise is a preprocessing method that systematically masks topic-related content in text documents. The motivation becomes clear when we take a closer look at the field of authorship analysis, particularly Authorship Attribution (AA) and Authorship Verification (AV).
AA is concerned with determining which of several possible authors wrote a particular anonymous text. AV, on the other hand, deals with the so-called fundamental problem of whether two documents were written by the same person. To accomplish both tasks, we rely on the assumption that every person has his or her own writing style, which differs to a certain extent from the writing styles of other people. This is exactly where the topic can pose a serious challenge.
Consider, for example, the following scenario: D1 and D2 are two documents that deal with the same topic but were written by two different people. An AV method could mistakenly conclude that D1 and D2 were written by the same author, simply because both documents are very similar in terms of topic. However, the goal of AA and AV is to decide authorship based on the actual writing style, not on the topic.
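The scenario above can be made concrete with a toy sketch. The three sentences, the author assignment, and the word-overlap (Jaccard) score are all invented for illustration; this is not an actual AV method, only a demonstration of how a naive similarity measure tracks topic rather than style.

```python
import re


def word_set(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(text_a, text_b):
    """Share of distinct words that two texts have in common."""
    a, b = word_set(text_a), word_set(text_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical documents: d1 and d3 by author A, d2 by author B.
d1 = "The striker scored a late goal and the team won the football match."
d2 = "Our team won the football match after the striker scored a late goal."
d3 = "The recipe needs fresh basil and the sauce must simmer for an hour."

same_topic_different_author = jaccard(d1, d2)
same_author_different_topic = jaccard(d1, d3)

# The topic-driven pair scores far higher than the same-author pair.
print(same_topic_different_author > same_author_different_topic)
```

A method that relies on such surface overlap would pair d1 with d2, even though they come from different authors in this made-up setup.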
To prevent AA and AV methods from focusing on the topic, and to force them instead to concentrate on linguistic patterns such as function words or punctuation marks (which are closely related to writing style), topic masking approaches can be used. POSNoise is one such approach.
Installation
The easiest way to install POSNoise is to use pip, where you can choose between (1) the PyPI repository and (2) this repository.
- (1) pip install posnoise
- (2) pip install git+https://github.com/Halvani/POSNoise.git
The latter will pull and install the latest commit from this repository as well as the required Python dependencies.
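To confirm that the installation succeeded, one option is to query the installed distribution with the standard library. This is a minimal sketch: the helper name `installed_version` is hypothetical, and it assumes the distribution is registered under the name `posnoise`, as in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints a version string if posnoise is installed, otherwise None.
print(installed_version("posnoise"))
```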
Quickstart
from posnoise import POSNoise
# By default POSNoise loads the "Large" spaCy model
posnoise_instance = POSNoise()
# To use another model, import the size enum and pass it explicitly, e.g.
# from posnoise.core import SpacyModelSize
# posnoise_instance = POSNoise(spacy_model_size=SpacyModelSize.Medium)
document = "Fitzgerald made her first tour of Australia in July 1954 for the Australian-based American promoter Lee Gordon."
posnoised_doc = posnoise_instance.pos_noise(document)
print(document)
# Fitzgerald made her first tour of Australia in July 1954 for the Australian-based American promoter Lee Gordon.
print(posnoised_doc)
# § made her first # of § in § µ for the §-Ø @ # § §.
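When several documents need masking, it is probably sensible to create the POSNoise instance once and reuse it, since loading a spaCy model takes noticeable time. The helper below is a sketch under that assumption; `mask_corpus` is a hypothetical name, and it relies only on the `pos_noise` method shown above.

```python
def mask_corpus(masker, documents):
    """Apply masker.pos_noise to every document, preserving input order.

    `masker` is any object with a pos_noise(text) method, for example a
    POSNoise instance created once and shared across all documents.
    """
    return [masker.pos_noise(document) for document in documents]


# Usage with the real library (requires posnoise and a spaCy model):
# from posnoise import POSNoise
# masked_docs = mask_corpus(POSNoise(), ["First text.", "Second text."])
```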
All part-of-speech (POS) placeholders used in POSNoise to replace topic-related words or tokens are listed below:
| Category | Tag | Examples |
|---|---|---|
| Noun | # | { house, music, bird, tree, air, … } |
| Proper noun | § | { David, Vivien, London, USA, COVID-19, … } |
| Verb | Ø | { eat, laugh, dance, travel, hiking, … } |
| Adjective | @ | { red, shiny, fascinating, phenomenal, … } |
| Adverb | © | { financially, foolishly, angrily, … } |
| Numeral | μ | { 0, 5, 2013, 3.14159, III, IV, MMXIV, … } |
| Symbol | $ | { £, ©, §, %, #, … } |
| Other | ¥ | { xfgh, pdl, jklw, … } |
Source: POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis ➜ Table 2.
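The table can be read as a lookup from POS category to placeholder. The sketch below illustrates that idea on hand-tagged tokens and is not POSNoise's actual implementation: the Universal POS tag names used as keys, the `mask_tokens` helper, and the `keep` set are assumptions. The `keep` set stands in for whatever mechanism leaves words such as "made" and "first" untouched in the Quickstart output.

```python
# Placeholder per coarse POS tag, following the table above.
# Keying by Universal POS labels is an assumption of this sketch.
PLACEHOLDERS = {
    "NOUN": "#", "PROPN": "§", "VERB": "Ø", "ADJ": "@",
    "ADV": "©", "NUM": "μ", "SYM": "$", "X": "¥",
}


def mask_tokens(tagged_tokens, keep=frozenset()):
    """Replace each (token, tag) pair by its placeholder unless it is kept.

    Tokens whose tag has no placeholder (pronouns, prepositions, ...) and
    tokens listed in `keep` pass through unchanged.
    """
    masked = []
    for token, tag in tagged_tokens:
        if token.lower() in keep or tag not in PLACEHOLDERS:
            masked.append(token)
        else:
            masked.append(PLACEHOLDERS[tag])
    return masked


# Hand-assigned tags for the start of the Quickstart sentence.
tagged = [
    ("Fitzgerald", "PROPN"), ("made", "VERB"), ("her", "PRON"),
    ("first", "ADJ"), ("tour", "NOUN"), ("of", "ADP"), ("Australia", "PROPN"),
]

print(" ".join(mask_tokens(tagged, keep={"made", "first"})))
# § made her first # of §
```

In the real library the tags come from the spaCy pipeline, so results depend on the loaded model.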
Features
- Effective way to mask topic-related content with custom POS placeholders
- Automatic NLP pipeline creation (loads and installs the spaCy models on demand, while providing feedback)
- No API dependency (after downloading the spaCy models, POSNoise can be used completely offline)
- Documented source code
Citation
If you find this library helpful, please invest a few minutes and cite it in your paper/project:
@inproceedings{HalvaniGranerPOSNoise:2021,
author = {Oren Halvani and Lukas Graner},
editor = {Delphine Reinhardt and Tilo M{\"{u}}ller},
title = {{POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis}},
booktitle = {{ARES} 2021: The 16th International Conference on Availability, Reliability and Security, Vienna, Austria, August 17-20, 2021},
pages = {47:1--47:12},
publisher = {{ACM}},
year = {2021},
url = {https://doi.org/10.1145/3465481.3470050},
doi = {10.1145/3465481.3470050},
timestamp = {Sun, 04 Aug 2024 19:40:49 +0200},
biburl = {https://dblp.org/rec/conf/IEEEares/HalvaniG21.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
License
The POSNoise package is released under the Apache-2.0 license. See LICENSE for further details.
Last Remarks
As is usual with open source projects, we developers do not earn any money with what we do, but are primarily interested in giving something back to the community with fun, passion and joy. Nevertheless, we would be very happy if you rewarded all the time that has gone into the project with just a small star 🤗