A Comprehensive Multimodal Argument Mining Toolkit.
MAMKit: Multimodal Argument Mining Toolkit
Introduction
MAMKit is an open-source, publicly available PyTorch toolkit designed to access and develop datasets, models, and benchmarks for Multimodal Argument Mining (MAM). It provides a flexible interface for accessing and integrating datasets, models, and preprocessing strategies through composition or custom definition. MAMKit is designed to be extendible, ensure replicability, and provide a shared interface as a common foundation for experimentation in the field.
At the time of writing, MAMKit offers 4 datasets, 4 tasks and 6 distinct model architectures, along with audio and text processing capabilities, organized in 5 main components.
Datasets | Tasks |
---|---|
UkDebates | Argumentative Sentence Detection (ASD) |
MArgγ | Argumentative Relation Classification (ARC) |
MM-USED | Argumentative Sentence Detection (ASD), Argumentative Component Classification (ACC) |
MM-USED-fallacy | Argumentative Fallacy Classification (AFC) |
Model | Text Encoding | Audio Encoding | Fusion |
---|---|---|---|
BiLSTM | GloVe + BiLSTM | (Wav2Vec2 ∨ MFCCs) + BiLSTM | Concat-Late |
MM-BERT | BERT | (Wav2Vec2 ∨ HuBERT ∨ WavLM) + BiLSTM | Concat-Late |
MM-RoBERTa | RoBERTa | (Wav2Vec2 ∨ HuBERT ∨ WavLM) + BiLSTM | Concat-Late |
CSA | BERT | (Wav2Vec2 ∨ HuBERT ∨ WavLM) + Transformer | Concat-Early |
Ensemble | BERT | (Wav2Vec2 ∨ HuBERT ∨ WavLM) + Transformer | Avg-Late |
Mul_TA | BERT | (Wav2Vec2 ∨ HuBERT ∨ WavLM) + Transformer | Cross |
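
The Fusion column names how text and audio representations are combined. As a rough illustration only (not MAMKit's implementation), the sketch below contrasts two late-fusion variants on plain Python lists. It assumes that Concat-Late means concatenating the two pooled embeddings before a shared classifier and that Avg-Late means averaging per-modality class probabilities. The function names and toy numbers are hypothetical.

```python
from typing import List


def concat_late(text_emb: List[float], audio_emb: List[float]) -> List[float]:
    """Join the two pooled embeddings into one vector for a shared classifier head."""
    return text_emb + audio_emb


def avg_late(text_probs: List[float], audio_probs: List[float]) -> List[float]:
    """Average class probabilities predicted separately by each modality."""
    if len(text_probs) != len(audio_probs):
        raise ValueError("both modalities must predict the same number of classes")
    return [(t + a) / 2 for t, a in zip(text_probs, audio_probs)]


# Toy values chosen for illustration only.
fused_embedding = concat_late([0.25, 0.5], [1.0, 2.0, 3.0])
fused_probs = avg_late([0.75, 0.25], [0.25, 0.75])
print(fused_embedding)  # [0.25, 0.5, 1.0, 2.0, 3.0]
print(fused_probs)      # [0.5, 0.5]
```

Under this reading, concatenation lets one classifier see both modalities at once, while averaging keeps each modality's predictor independent.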
🔧 Installation
Clone the repository and install the requirements:
git clone git@github.com:TBA_AFTER_ACCEPTANCE/mamkit.git
cd mamkit
pip install -r requirements.txt
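
After installing, a quick sanity check is to confirm that the packages can be located by the interpreter. The snippet below is a small sketch using only the standard library; is_installed is a hypothetical helper, and checking torch is an assumption based on MAMKit being described as a PyTorch toolkit. Run it from the repository root so that the mamkit package is on the Python path.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the current Python path."""
    return importlib.util.find_spec(package_name) is not None


# 'mamkit' is the import name used throughout this README.
for name in ("mamkit", "torch"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```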
⚙️ Usage
Data
MAMKit provides a modular interface for defining new datasets and for loading datasets from the literature.
Load a Dataset
The example that follows illustrates how to load a dataset. In this case, a dataset is loaded using the UKDebates class from mamkit.data.datasets, which extends the Loader interface and implements specific functionalities for data loading and retrieval.
Users can specify the task and input mode (text-only, audio-only, or text-audio) when loading the data, with options to use default splits or load splits from previous works. The example uses splits from Mancini et al. (2022).
The get_splits method of the loader returns data splits in the form of data.datasets.SplitInfo objects. Each SplitInfo wraps split-specific data, each implementing PyTorch's Dataset interface and compliant with the specified input modality (here, text-only).
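from pathlib import Path
from mamkit.data.datasets import UKDebates, InputMode

# Assumption: Path('data') is a placeholder for any local dataset directory.
base_data_path = Path('data')
loader = UKDebates(
    task_name='asd',
    input_mode=InputMode.TEXT_ONLY,
    base_data_path=base_data_path)
# Retrieve the splits used by Mancini et al. (2022).
split_info = loader.get_splits('mancini-et-al-2022')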
The Loader interface also allows users to integrate methods defining custom splits as follows:
from typing import List

from mamkit.data.datasets import SplitInfo

def custom_splits(self) -> List[SplitInfo]:
    # Fixed index ranges over the loader's data.
    train_df = self.data.iloc[:50]
    val_df = self.data.iloc[50:100]
    test_df = self.data.iloc[100:]
    fold_info = self.build_info_from_splits(train_df=...)
    return [fold_info]

# Register the method under a key, then request it like any other split.
loader.add_splits(method=custom_splits,
                  key='custom')
split_info = loader.get_splits('custom')
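
To make the add_splits / get_splits pattern concrete, here is a minimal, self-contained sketch of a key-to-method registry. ToyLoader and ToySplit are hypothetical stand-ins written for this illustration; they are not MAMKit's Loader or SplitInfo, and plain lists replace the tabular data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToySplit:
    """Hypothetical stand-in for SplitInfo: one train/val/test fold."""
    train: List[str]
    val: List[str]
    test: List[str]


class ToyLoader:
    """Hypothetical stand-in for a Loader that maps split keys to split methods."""

    def __init__(self, data: List[str]) -> None:
        self.data = data
        self.splitters: Dict[str, Callable[["ToyLoader"], List[ToySplit]]] = {}

    def add_splits(self, method: Callable[["ToyLoader"], List[ToySplit]], key: str) -> None:
        # Register the function under a key so it can be requested by name later.
        self.splitters[key] = method

    def get_splits(self, key: str) -> List[ToySplit]:
        if key not in self.splitters:
            raise KeyError(f"unknown split key: {key!r}")
        # The registered function receives the loader, mirroring the `self` argument.
        return self.splitters[key](self)


def custom_splits(loader: ToyLoader) -> List[ToySplit]:
    # Fixed index ranges, scaled down to a 10-item toy dataset.
    return [ToySplit(train=loader.data[:6], val=loader.data[6:8], test=loader.data[8:])]


toy_loader = ToyLoader(data=[f"sentence-{i}" for i in range(10)])
toy_loader.add_splits(method=custom_splits, key="custom")
folds = toy_loader.get_splits("custom")
print(len(folds), len(folds[0].train), len(folds[0].val), len(folds[0].test))  # 1 6 2 2
```

Returning a list of folds, even when there is only one, keeps single-split and cross-validation setups behind the same call.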
Add a New Dataset
To add a new dataset, users need to create a new class that extends the Loader interface and implements the required functionalities for data loading and retrieval. The new class should be placed in the mamkit.data.datasets module.
Modelling
The toolkit provides a modular interface for defining models: users can compose models from pre-defined components, define custom models, or instantiate models from the literature.
Load a Model
The following example demonstrates how to instantiate a model with a configuration found in the literature.
This configuration is identified by a key, ConfigKey, containing all the defining information. The key is used to fetch the precise configuration of the model from the configs package. Subsequently, the model is retrieved from the models package and configured with the specific parameters outlined in the configuration.
from mamkit.configs.base import ConfigKey
from mamkit.configs.text import TransformerConfig
from mamkit.data.datasets import InputMode
from mamkit.models.text import Transformer

# Key identifying the configuration to fetch.
config_key = ConfigKey(
    dataset='mmused',
    task_name='asd',
    input_mode=InputMode.TEXT_ONLY,
    tags={'mancini-et-al-2022'})

config = TransformerConfig.from_config(
    key=config_key)

# Build the model from the retrieved parameters.
model = Transformer(
    model_card=config.model_card,
    dropout_rate=config.dropout_rate,
    ...)
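
The key-based lookup described above can be pictured as a dictionary indexed by a hashable key. The sketch below is an assumption about the general pattern, not MAMKit's code: ToyConfigKey, REGISTRY, and the stored values are all hypothetical, and a frozenset stands in for the tag set.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet


@dataclass(frozen=True)
class ToyConfigKey:
    """Hypothetical stand-in for ConfigKey; frozen so instances are hashable."""
    dataset: str
    task_name: str
    input_mode: str
    tags: FrozenSet[str]


# Hypothetical registry: the values are placeholders, not published settings.
REGISTRY: Dict[ToyConfigKey, Dict[str, Any]] = {
    ToyConfigKey("mmused", "asd", "text-only", frozenset({"mancini-et-al-2022"})): {
        "model_card": "bert-base-uncased",
        "dropout_rate": 0.1,
    },
}


def from_config(key: ToyConfigKey) -> Dict[str, Any]:
    """Fetch the configuration registered under an equal key."""
    try:
        return REGISTRY[key]
    except KeyError:
        raise KeyError(f"no configuration registered for {key}") from None


# An equal key built elsewhere resolves to the same entry.
lookup_key = ToyConfigKey("mmused", "asd", "text-only", frozenset({"mancini-et-al-2022"}))
toy_config = from_config(lookup_key)
print(toy_config["model_card"])  # bert-base-uncased
```

Because the key compares by value, two experiments that describe the same dataset, task, input mode, and tags retrieve identical settings.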
Custom Model Definition
Defining a custom model is straightforward, as the example below illustrates. It entails creating the model within the models package by extending the AudioOnlyModel, TextOnlyModel, or TextAudioModel class in the models.audio, models.text, or models.text_audio module, respectively, depending on the input modality handled by the model.
# In the models.text module, next to TextOnlyModel:
class Transformer(TextOnlyModel):
    def __init__(
            self,
            model_card,
            head,
            dropout_rate=0.0,
            is_transformer_trainable: bool = False,
    ): ...

# Elsewhere, the model is imported and instantiated as usual.
from mamkit.models.text import Transformer

model = Transformer(
    model_card='bert-base-uncased',
    dropout_rate=0.1, ...)
Training
Our models are designed to be encapsulated into a PyTorch LightningModule, which can be trained using PyTorch Lightning's Trainer class.
The following example demonstrates how to wrap and train a model using PyTorch Lightning.
import lightning

from mamkit.utility.model import to_lighting_model

# Wrap the model into a LightningModule.
model = to_lighting_model(model=model,
                          num_classes=config.num_classes,
                          loss_function=...,
                          optimizer_class=...)

trainer = lightning.Trainer(max_epochs=100,
                            accelerator='gpu',
                            ...)

trainer.fit(model,
            train_dataloaders=train_dataloader,
            val_dataloaders=val_dataloader)
Benchmarking
The mamkit.configs package simplifies reproducing literature results in a structured manner. Upon loading the dataset, experiment-specific configurations can be easily retrieved via a configuration key. This makes it possible to instantiate a processor with the same feature processor employed in the experiment.
In the example below, we adopt a configuration akin to Mancini et al. (2022), employing a BiLSTM model with audio encoded as MFCC features. Hence, we define an MFCCExtractor processor using configuration parameters.
from mamkit.configs.audio import BiLSTMMFCCsConfig
from mamkit.configs.base import ConfigKey
from mamkit.data.datasets import UKDebates, InputMode
from mamkit.data.processing import MFCCExtractor, UnimodalProcessor
from mamkit.models.audio import BiLSTM

loader = UKDebates(task_name='asd',
                   input_mode=InputMode.AUDIO_ONLY)

config = BiLSTMMFCCsConfig.from_config(
    key=ConfigKey(dataset='ukdebates',
                  input_mode=InputMode.AUDIO_ONLY,
                  task_name='asd',
                  tags='mancini-et-al-2022'))

for split_info in loader.get_splits(
        key='mancini-et-al-2022'):
    # Extract MFCC features with the parameters stored in the configuration.
    processor = UnimodalProcessor(
        features_processor=MFCCExtractor(
            mfccs=config.mfccs, ...))
    split_info.train = processor(split_info.train)
    ...
    model = BiLSTM(embedding_dim=config.embedding_dim, ...)
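
The loop above applies one processor to every fold. As a rough, library-free sketch of that idea, the code below wraps a feature function in a callable and maps it over a list of samples. ToyProcessor and frame_energy are hypothetical and only mimic the shape of the workflow; frame_energy is a simple stand-in feature, not an MFCC computation.

```python
from typing import Callable, List, Sequence


def frame_energy(signal: Sequence[float], frame_size: int = 2) -> List[float]:
    """Toy feature: sum of squares over consecutive, non-overlapping frames."""
    return [
        sum(x * x for x in signal[start:start + frame_size])
        for start in range(0, len(signal), frame_size)
    ]


class ToyProcessor:
    """Hypothetical stand-in: applies a feature function to every sample."""

    def __init__(self, features_processor: Callable[[Sequence[float]], List[float]]) -> None:
        self.features_processor = features_processor

    def __call__(self, samples: List[Sequence[float]]) -> List[List[float]]:
        return [self.features_processor(sample) for sample in samples]


toy_train = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5]]
toy_processor = ToyProcessor(features_processor=frame_energy)
print(toy_processor(toy_train))  # [[5.0, 25.0], [0.5]]
```

Keeping the feature function as a constructor argument means the same wrapper can be reused with a different extractor without touching the loop.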
🧠 Structure
The toolkit is organized into five main components: configs, data, models, modules, and utility. In addition, the toolkit provides a demos directory for running all the experiments presented in the paper.
The figure below illustrates the toolkit's structure.
📚 Website and Documentation
The documentation is available here.
The website is available here.
Our website provides a comprehensive overview of the toolkit, including installation instructions, usage examples, and a detailed description of the toolkit's components. Moreover, the website provides a detailed description of the datasets, tasks, and models available in the toolkit, together with a leaderboard of the results obtained on the datasets with the current models.
🤝 Contributing
We welcome contributions to MAMKit! Please refer to the contributing guidelines for more information.
📧 Contact Us
For any questions or suggestions, don't hesitate to contact us: Eleonora Mancini, Federico Ruggeri.
📖 Citation
If you use MAMKit in your research, please cite the following paper:
@inproceedings{TBAmamkit,
title={MAMKit: A Comprehensive Multimodal Argument Mining Toolkit},
author={TBA},
booktitle={TBA},
year={TBA}
}