An NER Data Preparation Tool
NER Data Processor
NER Data Processor is a Python library to help you easily prepare datasets for Named Entity Recognition (NER) and Coreference Resolution tasks. It transforms raw text into formats ready for training token classification models using Hugging Face or other frameworks.
📦 Installation
✅ From PyPI (Recommended)
pip install ner-data-processor
🛠️ From GitHub
git clone https://github.com/rajboopathiking/NER_DATA_PREPROCESSING.git
cd NER_DATA_PREPROCESSING
pip install -r requirements.txt
🚀 Getting Started
from ner_data_processor.Ner_Data_Preparation import Custom_Ner_Dataset
ner = Custom_Ner_Dataset()
📊 Dataset Format
Input should be a pandas DataFrame with two columns:
- text: sentence or paragraph
- entities: list of labeled entities with their tags
Example:
| text | entities |
|---|---|
| Arun Kumar Jagatramka vs Ultrabulk AS on 22 Sept | [Arun Kumar Jagatramka - PLAINTIFF, Ultrabulk AS - Defender] |
| Author Biren Vaishnav | [Biren Vaishnav - PERSON] |
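For illustration, a DataFrame like the one above could be built as follows. The exact shape of each entity entry (here an (entity_text, label) pair) is an assumption drawn from the table; adapt it to the structure your annotations actually use.

import pandas as pd

# Sketch of the expected input. Each entity is assumed to be an
# (entity_text, label) pair, matching the example table above.
df = pd.DataFrame({
    "text": [
        "Arun Kumar Jagatramka vs Ultrabulk AS on 22 Sept",
        "Author Biren Vaishnav",
    ],
    "entities": [
        [("Arun Kumar Jagatramka", "PLAINTIFF"), ("Ultrabulk AS", "Defender")],
        [("Biren Vaishnav", "PERSON")],
    ],
})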
⚙️ API Overview
extract_DataFrame(df)
Convert the annotated DataFrame into a span-based entity format.
data = ner.extract_DataFrame(df)
Output:
| text | entities |
|---|---|
| Arun Kumar Jagatramka vs Ultrabulk AS on... | [(0, 21, PLAINTIFF), (25, 37, Defender)] |
| Author Biren Vaishnav | [(7, 21, PERSON)] |
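The numbers are character offsets into text (end-exclusive), so a quick sanity check is to slice each span back out. The snippet below assumes extract_DataFrame returns a DataFrame with the two columns shown above.

# Slice each (start, end, label) span out of its sentence to verify the offsets.
for _, row in data.iterrows():
    for start, end, label in row["entities"]:
        print(label, "->", row["text"][start:end])
# e.g. PLAINTIFF -> Arun Kumar Jagatramka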
to_dataset(data)
Convert span-format data into token-label (BIO) format for model training.
import pandas as pd
df = pd.DataFrame(ner.to_dataset(data))
Output:
| id | tokens | ner_tags |
|---|---|---|
| 0 | [Arun, Kumar, Jagatramka, ...] | [B-PLAINTIFF, I-PLAINTIFF, I-PLAINTIFF, ...] |
| 1 | [Author, Biren, Vaishnav] | [O, B-PERSON, I-PERSON] |
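To eyeball the result, you can pair each token with its tag; this relies only on the tokens and ner_tags columns shown above.

# Print the first example token by token alongside its BIO tag.
first = df.iloc[0]
for token, tag in zip(first["tokens"], first["ner_tags"]):
    print(f"{token:<15} {tag}")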
Create label maps
import numpy as np

# Collect every tag that appears in the dataset and deduplicate it.
labels = []
for tags in df["ner_tags"]:
    labels.extend(tags)
labels = np.unique(labels).tolist()
Output:
['B-DATE', 'B-Defender', 'B-LOC', 'B-ORG', 'B-PERSON', 'B-PLAINTIFF',
'I-DATE', 'I-Defender', 'I-LOC', 'I-ORG', 'I-PERSON', 'I-PLAINTIFF', 'O']
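Token-classification models typically also need integer mappings between labels and ids; these follow directly from the list above.

# Map every label to an integer id and back, for use in a model config.
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for i, label in enumerate(labels)}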
to_huggingface_dataset(df, labels)
Convert your processed DataFrame into a Hugging Face DatasetDict.
dataset = ner.to_huggingface_dataset(df, labels)
dataset = dataset.train_test_split(test_size=0.1)
Output:
DatasetDict({
train: Dataset({
features: ['id', 'tokens', 'ner_tags'],
num_rows: 3
}),
test: Dataset({
features: ['id', 'tokens', 'ner_tags'],
num_rows: 1
})
})
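From here a typical next step is sub-word tokenization with label alignment. The sketch below uses the standard Hugging Face recipe; it assumes ner_tags still hold the string labels shown earlier and that label2id was built as above, and the checkpoint name is only an example.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # example checkpoint

def tokenize_and_align(example):
    # Tokenize pre-split words and re-align the word-level tags to sub-words.
    encoded = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    aligned, previous = [], None
    for word_id in encoded.word_ids():
        if word_id is None:
            aligned.append(-100)            # special tokens: ignored by the loss
        elif word_id != previous:
            aligned.append(label2id[example["ner_tags"][word_id]])
        else:
            aligned.append(-100)            # only label the first sub-token of a word
        previous = word_id
    encoded["labels"] = aligned
    return encoded

tokenized = dataset.map(tokenize_and_align)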
coreference_model(text)
Run a basic coreference resolution model on the input text.
text = "John is Victim. He is Innocent"
result = ner.coreference_model(text)
Output:
{
"mentions": [
{
"text": "He",
"refers_to": "John",
"span": [13, 15]
}
]
}
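One way to consume this output, assuming span holds end-exclusive character offsets into the input text: substitute each mention with its referent, working right to left so earlier offsets stay valid.

# Replace each mention with the entity it refers to, from the end of the text backwards.
resolved = text
for mention in sorted(result["mentions"], key=lambda m: m["span"][0], reverse=True):
    start, end = mention["span"]
    resolved = resolved[:start] + mention["refers_to"] + resolved[end:]
print(resolved)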
🪪 License
This project is licensed under the MIT License.