
An NER Data Preparation Tool


NER Data Processor


NER Data Processor is a Python library to help you easily prepare datasets for Named Entity Recognition (NER) and Coreference Resolution tasks. It transforms raw text into formats ready for training token classification models using Hugging Face or other frameworks.


📦 Installation

✅ From PyPI (Recommended)

pip install ner-data-processor

🛠️ From GitHub

git clone https://github.com/rajboopathiking/NER_DATA_PREPROCESSING.git
cd NER_DATA_PREPROCESSING
pip install -r requirements.txt

🚀 Getting Started

from ner_data_processor.Ner_Data_Preparation import Custom_Ner_Dataset

ner = Custom_Ner_Dataset()

📊 Dataset Format

Input should be a pandas DataFrame with two columns:

  • text: Sentence or paragraph
  • entities: List of labeled entities with their tags

Example:

| text | entities |
| --- | --- |
| Arun Kumar Jagatramka vs Ultrabulk AS on 22 Sept | [Arun Kumar Jagatramka - PLAINTIFF, Ultrabulk AS - Defender] |
| Author Biren Vaishnav | [Biren Vaishnav - PERSON] |
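
A hypothetical way to build such a DataFrame with pandas, for illustration only. The exact per-entity representation is an assumption; the table above suggests "entity text - TAG" pairs, so plain strings in that shape are used here:

import pandas as pd

# assumed entity representation: "entity text - TAG" strings, as rendered above
df = pd.DataFrame({
    "text": [
        "Arun Kumar Jagatramka vs Ultrabulk AS on 22 Sept",
        "Author Biren Vaishnav",
    ],
    "entities": [
        ["Arun Kumar Jagatramka - PLAINTIFF", "Ultrabulk AS - Defender"],
        ["Biren Vaishnav - PERSON"],
    ],
})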

⚙️ API Overview

extract_DataFrame(df)

Convert the annotated DataFrame into a span-based entity format.

data = ner.extract_DataFrame(df)

Output:

| text | entities |
| --- | --- |
| Arun Kumar Jagatramka vs Ultrabulk AS on... | [(0, 21, PLAINTIFF), (25, 37, Defender)] |
| Author Biren Vaishnav | [(7, 21, PERSON)] |
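
The spans appear to be (start, end, label) triples of character offsets into text, with an exclusive end index. A quick sanity check against the first row above:

text = "Arun Kumar Jagatramka vs Ultrabulk AS on 22 Sept"
for start, end, label in [(0, 21, "PLAINTIFF"), (25, 37, "Defender")]:
    # prints "PLAINTIFF -> Arun Kumar Jagatramka" and "Defender -> Ultrabulk AS"
    print(label, "->", text[start:end])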

to_dataset(data)

Convert span-format data into a token-label format for model training.

import pandas as pd
df = pd.DataFrame(ner.to_dataset(data))

Output:

| id | tokens | ner_tags |
| --- | --- | --- |
| 0 | [Arun, Kumar, Jagatramka, ...] | [B-PLAINTIFF, I-PLAINTIFF, I-PLAINTIFF, ...] |
| 1 | [Author, Biren, Vaishnav] | [O, B-PERSON, I-PERSON] |
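
The tags follow the standard BIO scheme: B- marks the first token of an entity, I- a continuation of the same entity, and O a token outside any entity. Pairing tokens with tags for the second row makes this concrete:

for token, tag in zip(["Author", "Biren", "Vaishnav"], ["O", "B-PERSON", "I-PERSON"]):
    print(f"{token:<9}{tag}")
# Author   O
# Biren    B-PERSON
# Vaishnav I-PERSON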

Create label maps

import numpy as np

labels = []
for i in df["ner_tags"]:
    labels.extend(i)
labels = np.unique(labels).tolist()

Output:

['B-DATE', 'B-Defender', 'B-LOC', 'B-ORG', 'B-PERSON', 'B-PLAINTIFF',
 'I-DATE', 'I-Defender', 'I-LOC', 'I-ORG', 'I-PERSON', 'I-PLAINTIFF', 'O']
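
Token-classification models typically also need integer mappings in both directions. A minimal sketch building them from the list above (the label2id / id2label names follow the common Hugging Face convention and are not part of this library's API):

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(label2id["O"])  # 12, given the sorted label list above
print(id2label[0])    # 'B-DATE'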

to_huggingface_dataset(df, labels)

Convert your processed DataFrame into a Hugging Face DatasetDict.

dataset = ner.to_huggingface_dataset(df, labels)
dataset = dataset.train_test_split(test_size=0.1)

Output:

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3
    }),
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1
    })
})
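
From here the split feeds into the usual Hugging Face token-classification pipeline. A hedged sketch of the standard tokenize-and-align step follows; the checkpoint name is arbitrary, and it assumes ner_tags are stored as label strings (if the library already encodes them as integers, drop the label2id lookup):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint
label2id = {label: i for i, label in enumerate(labels)}

def tokenize_and_align(batch):
    # tokenize pre-split words, then re-align one label per sub-token
    encoded = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    encoded["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous = None
        aligned = []
        for word_id in encoded.word_ids(batch_index=i):
            if word_id is None or word_id == previous:
                aligned.append(-100)  # special tokens / extra sub-tokens: ignored by the loss
            else:
                aligned.append(label2id[tags[word_id]])
            previous = word_id
        encoded["labels"].append(aligned)
    return encoded

tokenized = dataset.map(tokenize_and_align, batched=True)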

coreference_model(text)

Run a basic coreference resolution model over the input text.

text = "John is Victim. He is Innocent"
result = ner.coreference_model(text)

Output:

{
  "mentions": [
    {
      "text": "He",
      "refers_to": "John",
      "span": [13, 15]
    }
  ]
}
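
A minimal sketch of using that output to rewrite the text, assuming span holds [start, end) character offsets of each mention; the resolve helper below is illustrative, not part of the library:

def resolve(text, mentions):
    # replace mentions right-to-left so earlier offsets stay valid
    for m in sorted(mentions, key=lambda m: m["span"][0], reverse=True):
        start, end = m["span"]
        text = text[:start] + m["refers_to"] + text[end:]
    return text

print(resolve(text, result["mentions"]))  # substitutes each pronoun with its antecedent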

🪪 License

This project is licensed under the MIT License.

