MELD: A multilingual and multidomain dataset for named entity recognition (NER)

MELD: Melding Diverse Multilingual and Multi-Domain Datasets for Named Entity Recognition Evaluation


MELD is a multilingual and multi-domain dataset for Named Entity Recognition (NER) constructed from 60 existing datasets. Built with reproducibility and extensibility in mind, MELD currently provides gold-standard annotations for 60 languages across up to 14 domains with a total of 601 normalized entity labels. MELD was primarily designed for diverse multilingual and multi-domain evaluation but also includes all training and validation sets from its source datasets where available.

Key Features

  • Standardized Formats: All datasets are converted to a consistent parquet format, preserving nested and discontinuous annotations, and document boundaries where available.
  • Highly Multilingual: Gold-standard annotations for 60 languages and silver-standard annotations derived from Wikipedia for 134 additional languages.
  • Multi-domain: 14 diverse domains including legal, biomedical, financial, and social media text. Domain diversity is more limited for languages other than English.
  • Structural Validation: Several structural issues in source datasets are identified and automatically resolved during processing, such as misaligned span indices and inconsistent IOB labels.
  • Reproducible: Fully end-to-end reproducible from published source.
  • Extensible: Designed to be extended further through its modular data processing framework. If a data format is already supported, adding new datasets can be as simple as defining a single JSON file.
  • Zero-Shot Ready: Provides a normalized entity label mapping specifically designed for zero-shot NER evaluation.
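
As an illustration of the kind of issue covered by structural validation, the sketch below repairs inconsistent IOB labels. It shows one common repair strategy (an I- tag that does not continue an entity of the same type is promoted to B-) and is not necessarily the rule MELD applies; the function name repair_iob is hypothetical.

```python
def repair_iob(tags):
    """Return a copy of `tags` in which every entity starts with a B- tag."""
    repaired = []
    previous_type = None  # entity type of the previous token, None if outside
    for tag in tags:
        if tag == "O":
            repaired.append(tag)
            previous_type = None
            continue
        prefix, _, entity_type = tag.partition("-")
        # An I- tag is only valid directly after a tag of the same type.
        if prefix == "I" and previous_type != entity_type:
            prefix = "B"
        repaired.append(f"{prefix}-{entity_type}")
        previous_type = entity_type
    return repaired


print(repair_iob(["O", "I-PER", "I-PER", "O", "B-LOC", "I-ORG"]))
# ['O', 'B-PER', 'I-PER', 'O', 'B-LOC', 'B-ORG']
```

The input list is left unchanged, so the original and repaired sequences can be compared to count how many tags were affected.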

Installation

To start working with our dataset, install MELD using pip:

pip install meld-data

To reproduce sentence-level tokenization from source, the sentence-segmentation extra needs to be enabled:

pip install 'meld-data[sentence-segmentation]'

For development, we recommend managing your environment with uv:

git clone https://github.com/kgnlp/meld.git
cd meld
uv sync

Listing Available Datasets

To list all datasets available for download:

meld-data list

Download MELD

NOTE: It is recommended to log into a HuggingFace account with huggingface-cli login before downloading datasets to avoid running into API rate limits, particularly when reproducing MELD from source.

To get started, the preprocessed, redistributable subset of MELD can be downloaded using:

meld-data download -v info path/to/download_directory

By default, this downloads the meld:open profile, which includes all datasets available in preprocessed form on the HuggingFace Hub. MELD Open can also be used independently of the meld package: it is available with the original entity labels as kgnlp/meld-open and with normalized entity labels as kgnlp/meld-open-normalized. Because of licensing restrictions, datasets not included in meld:open are automatically downloaded from their original source and processed locally. To download all datasets including CoNLL-2003:

meld-data download -v info --datasets meld:full path/to/download_directory

NOTE: Currently, the initially downloaded data will contain the original unnormalized entity labels from each dataset. To apply our label normalization, the meld-data hf command can be used.

Notice regarding CoNLL-2003:

Because of copyright restrictions, we cannot redistribute the Reuters Corpus data itself, on which CoNLL-2003 is based. Please refer to the Reuters Corpus licensing information for specific terms and conditions before downloading or using this dataset. This restriction applies only to the CoNLL-2003 dataset and does not affect other datasets in MELD.

MELD Directory Structure

The download_directory passed to meld-data download will contain a downloads subdirectory for source datasets processed by MELD and a meld subdirectory containing the final processed data. The downloads subdirectory can be deleted once data processing is complete. Each dataset in meld contains a meld_metadata.json including additional metadata, statistics, and paths for each subset and split. The NER data itself will be stored in parquet format, optionally in subdirectories for each subset if a dataset includes more than one subset:

download_directory/
├─ downloads/
│  └─ ... # Source datasets processed by MELD
└─ meld/
   ├─ CrossNER/
   │  ├─ meld_metadata.json
   │  ├─ literature/
   │  │  ├─ train.parquet
   │  │  ├─ test.parquet
   │  │  └─ validation.parquet
   │  └─ ...
   ├─ AnatEM/
   │  ├─ meld_metadata.json
   │  ├─ train.parquet
   │  ├─ test.parquet
   │  └─ validation.parquet
   └─ ...
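
The layout above can be explored with the Python standard library alone. The following is a minimal sketch that assumes only what the tree shows: one meld_metadata.json per dataset and parquet files named after their split, optionally inside a subset directory. The function names list_splits and load_metadata are hypothetical and not part of the meld package.

```python
import json
from pathlib import Path


def list_splits(meld_dir):
    """Map each (dataset, subset) pair to the names of its available splits."""
    splits = {}
    for metadata_file in sorted(Path(meld_dir).glob("*/meld_metadata.json")):
        dataset_dir = metadata_file.parent
        for parquet_file in sorted(dataset_dir.rglob("*.parquet")):
            relative = parquet_file.relative_to(dataset_dir)
            # Files directly in the dataset directory have no subset.
            subset = relative.parent.as_posix() if relative.parent.parts else None
            key = (dataset_dir.name, subset)
            splits.setdefault(key, []).append(parquet_file.stem)
    return splits


def load_metadata(meld_dir, dataset):
    """Read the meld_metadata.json of one dataset as a plain dictionary."""
    path = Path(meld_dir) / dataset / "meld_metadata.json"
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For example, list_splits("path/to/download_directory/meld") would contain an entry for ("CrossNER", "literature") and one for ("AnatEM", None) given the tree above. The parquet files themselves can then be opened with any parquet reader.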

Download Groups of Datasets

Download all redistributable datasets (default):

meld-data download -v info --datasets meld:open path/to/download_directory

Download all non-proprietary datasets that include a test set for evaluation:

meld-data download -v info --datasets meld:non-proprietary-eval path/to/download_directory

Download all non-proprietary datasets including Polyglot-NER:

meld-data download -v info --datasets meld:non-proprietary path/to/download_directory

Download all datasets supported by MELD including CoNLL-2003:

meld-data download -v info --datasets meld:full path/to/download_directory

Profiles and individual dataset names can also be mixed. E.g., for downloading MELD Open and CoNLL-2003:

meld-data download -v info --datasets 'meld:open,CoNLL-2003' path/to/download_directory
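
When downloads are scripted, the same command line can be assembled programmatically. The sketch below builds the argument list from the flags shown in this README (-v info, --datasets, -r); the helper build_download_command is a hypothetical wrapper, not part of the meld package.

```python
import shlex


def build_download_command(target_dir, datasets=None, reproduce=False):
    """Assemble the argument list for a `meld-data download` call."""
    command = ["meld-data", "download", "-v", "info"]
    if datasets:
        # Profiles and dataset names share one comma-separated argument.
        command += ["--datasets", ",".join(datasets)]
    if reproduce:
        command.append("-r")
    command.append(target_dir)
    return command


command = build_download_command(
    "path/to/download_directory", ["meld:open", "CoNLL-2003"]
)
print(shlex.join(command))
```

The returned list can be passed directly to subprocess.run(command, check=True), which avoids shell quoting issues with the comma-separated dataset list.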

Download Specific Datasets

Individual datasets can be downloaded by passing their names as a comma-separated list. Dataset names are case-sensitive and must match the output of the list command.

meld-data download -v info --datasets "CoNLL-2003,SCIERC,Few-NERD" path/to/download_directory

Reproducing MELD from Source

By default, meld-data download downloads the already processed version of datasets contained in meld-open to save bandwidth and processing time. To process all datasets from their original source data, add the -r/--reproduce flag. E.g., for reproducing meld:open from source:

meld-data download -v info -r path/to/download_directory

We use the SaT (Segment any Text) sentence tokenizer introduced by Frohmann et al. (2024) to split long documents into sentences where no canonical sentence tokenization is available. To avoid slightly different boundaries being generated, e.g., due to GPU non-determinism, sentence boundaries bundled with the MELD package are used by default even when -r/--reproduce is set. To also reproduce the sentence boundaries from scratch, use:

meld-data download -v info -r --sentence-span-path path/to/new/segmentations path/to/download_directory

The directory passed as --sentence-span-path will contain parquet files in the same format as those bundled with the MELD package.

Convert to HuggingFace Datasets Format

The meld-data hf subcommand can be used to convert locally processed MELD data to a format compatible with the HuggingFace datasets library and optionally apply our normalized entity label mapping. For instance, for converting processed datasets belonging to the meld:open subset with normalized entity labels:

meld-data hf -d meld:open --normalize-labels /path/to/processed/meld/data /path/to/converted/data

Note that datasets converted in this way should not be uploaded to the HuggingFace Hub unless the constituent dataset's licensing requirements are fulfilled. See meld-data hf --help for additional options.
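
Conceptually, label normalization maps each dataset-specific label to a shared label so that the same entity type carries the same name across datasets. The following is a minimal sketch of that idea only: the mapping entries, the span dictionaries, and the function normalize_spans are all hypothetical and do not reflect the actual mapping or data schema shipped with MELD.

```python
# Hypothetical mapping from (dataset, original label) to a normalized label.
LABEL_MAP = {
    ("CoNLL-2003", "PER"): "person",
    ("CoNLL-2003", "LOC"): "location",
    ("WNUT2017", "person"): "person",
}


def normalize_spans(dataset, spans, label_map, drop_unmapped=False):
    """Return copies of `spans` with their labels replaced via `label_map`."""
    normalized = []
    for span in spans:
        new_label = label_map.get((dataset, span["label"]))
        if new_label is None:
            if drop_unmapped:
                continue
            new_label = span["label"]  # keep the original label unchanged
        normalized.append({**span, "label": new_label})
    return normalized


spans = [
    {"start": 0, "end": 2, "label": "PER"},
    {"start": 5, "end": 6, "label": "MISC"},
]
print(normalize_spans("CoNLL-2003", spans, LABEL_MAP))
```

Keying the mapping by dataset as well as label allows two datasets to use the same label string with different meanings.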

Included Datasets

MELD integrates 60 NER datasets spanning 194 languages (60 with gold-standard test sets), 14 domains, and 601 normalized entity labels. The table below provides a general overview of the included datasets:

Name | Primary Domain | Languages | Annotation Type | License
AgCNER | Agriculture | zho | gold-standard | CC 0
AgriNER | Agriculture | eng | gold-standard | CC BY-SA 4.0
AnatEM | Biomedical | eng | gold-standard | CC BY-SA 3.0
BC2GM | Biomedical | eng | gold-standard | CC BY 4.0
BC4CHEMD | Biomedical | eng | gold-standard | Unspecified
BC5CDR | Biomedical | eng | gold-standard | Public Domain
BioRED | Biomedical | eng | gold-standard | Public Domain
JNLPBA | Biomedical | eng | gold-standard | GENIA Project License (CC BY 3.0 annotations)
NCBI-Disease | Biomedical | eng | gold-standard | Public Domain
CANTEMIST | Clinical | spa | gold-standard | CC BY 4.0
EBM-NLP | Clinical | eng | gold-standard | Unspecified
RaTE-NER | Clinical | eng | gold-standard, silver-standard | CC BY-NC 4.0
FiNER-139 | Finance | eng | gold-standard | CC BY-SA 4.0
TASTEset | Food | eng | gold-standard | MIT
Arabic-Cross-Dialectal-NER | General | apc, ary, arz | gold-standard | Unspecified
Naamapadam | General | asm, ben, guj, hin, kan, mal, mar, ori, pan, tam, tel | gold-standard, silver-standard | CC 0
Thai-NER | General | tha | gold-standard | CC BY 4.0
Turku-NER-corpus | General | fin | gold-standard | CC BY-SA 4.0
TurkuONE | General | fin | gold-standard | CC BY-ND-NC 1.0, CC BY-SA 3.0, CC BY-SA 4.0
NYTK-NerKor | General, Law, Literature, News, Wikipedia | hun | gold-standard | CC BY-SA 4.0
UniversalNER | General, Literature, News, Wikipedia | 15 languages | gold-standard | CC BY-SA 4.0
E-NER | Law | eng | gold-standard | CC BY-NC-SA 4.0
German-LER | Law | deu | gold-standard | CC BY 4.0
LegalNERo | Law | ron | gold-standard | CC BY-NC-ND 4.0
Herodotos-Project-NER | Literature | lat | gold-standard | AGPL-3.0 license
CLEANANERCorp | News | ara | gold-standard | GPL 3.0
CoNLL-2003 | News | eng | gold-standard | Proprietary text (See Download MELD for details)
EverestNER | News | nep | gold-standard | Non-commercial
FiNER-ORD | News | eng | gold-standard | CC BY-NC 4.0
FoNE | News | fao | gold-standard | CC BY 4.0
idner-news-2k | News | ind | gold-standard | MIT
MasakhaNER-X | News | 20 languages | gold-standard | CC BY-NC 4.0
PhoNER-COVID19 | News | vie | gold-standard | Research and Education Purposes Only
pioNER | News, Wikipedia | hye | gold-standard, silver-standard | Apache 2.0
FabNER | Science | eng | gold-standard | CC BY 4.0
SciER | Science | eng | gold-standard | GPL 3.0
SCIERC | Science | eng | gold-standard | Unspecified
SciREX | Science | eng | gold-standard | Apache 2.0
SOFC-Exp | Science | eng | gold-standard | CC BY 4.0
SoMeSci | Science | eng | gold-standard | CC BY 4.0
WIESP2022 | Science | eng | gold-standard | CC BY 4.0
WLP | Science | eng | gold-standard | MIT
DanfeNER | Social Media | nep | gold-standard | Non-commercial
HarveyNER | Social Media | eng | gold-standard | Unspecified
MIT-Movie | Social Media | eng | gold-standard | Unspecified
MIT-Restaurant | Social Media | eng | gold-standard | Unspecified
Tweebank-NER | Social Media | eng | gold-standard | Apache 2.0
TweetNER7 | Social Media | eng | gold-standard | Non-commercial
Weibo-NER | Social Media | zho | gold-standard | CC BY-SA 3.0
WNUT2017 | Social Media | eng | gold-standard | CC BY 4.0
StackOverflowNER | Software | eng | gold-standard | MIT
FindVehicle | Transportation | eng | gold-standard | Unspecified
CrossNER | Wikipedia | eng | gold-standard | MIT
Few-NERD | Wikipedia | eng | gold-standard | CC BY-SA 4.0
Japanese-Wikipedia | Wikipedia | jpn | gold-standard | CC BY-SA 3.0
MultiCoNER | Wikipedia | MULTI, ben, deu, eng, fas, fra, hin, ita, por, spa, swe, ukr, zho | silver-standard | CC BY 4.0
MultiNERd | Wikipedia | deu, eng, fra, ita, nld, pol, por, rus, spa, zho | silver-standard | CC BY-NC-SA 4.0
Polyglot-NER | Wikipedia | 40 languages | silver-standard | Unspecified
WikiANN | Wikipedia | 175 languages | silver-standard | Unspecified
WikiNEuRal | Wikipedia | deu, eng, fra, ita, nld, pol, por, rus, spa | silver-standard | CC BY-NC-SA 4.0

BibTeX Citations

Get citation for MELD:

meld-data cite

Get citations for specific datasets:

meld-data cite conll-2003,scierc

Get citations for all datasets:

meld-data cite --all

Citation

When using MELD, please cite our paper:

@inproceedings{glocker2026meld,
  title = {MELD: Melding Diverse Multilingual and Multi-Domain Datasets for
           Named Entity Recognition Evaluation},
  author = {Glocker, Kevin and Kuhlmann, Marco},
  booktitle = {Proceedings of the Fifteenth Language Resources and Evaluation
               Conference (LREC 2026)},
  month = {May},
  year = {2026},
  pages = {1889--1903},
  address = {Palma, Mallorca, Spain},
  publisher = {European Language Resources Association (ELRA)},
  editor = {Piperidis, Stelios and Bel, Núria and van den Heuvel, Henk and Ide,
            Nancy and Krek, Simon and Toral, Antonio},
  doi = {10.63317/32qrd24xac2e},
}

When using the PhoNER COVID19 subset, also cite the following article in accordance with its terms of use:

@inproceedings{PhoNER_COVID19,
  title = {{COVID-19 Named Entity Recognition for Vietnamese}},
  author = {Thinh Hung Truong and Mai Hoang Dao and Dat Quoc Nguyen},
  booktitle = {Proceedings of the 2021 Conference of the North American Chapter
               of the Association for Computational Linguistics: Human Language
               Technologies},
  year = {2021},
}

To retrieve citations for other datasets in MELD, see BibTeX Citations.

API Reference

Documentation for the package can be found here.

Contributing

We welcome contributions to expand the dataset! Documentation and guidelines for adding new datasets to MELD are coming soon.

License

This project is licensed under the MIT License. See the LICENSE file for details.
