Skip to main content

A NER Data Preparing Tool

Project description

NER_DATA_PREPROCESSING :

Collecting Data For NER:

this tool helps to create a ner and corefer dataset easily . To train a Token classification and corefer resolution need a dataset. it not like a raw dataset. we want to convert text (sentence) to required format. lets see how this framework/library used in your project. lets go ...

Step - 1:

  1. install via git

Download :

 ```bash
 git clone https://github.com/rajboopathiking/NER_DATA_PREPROCESSING.git
 ```

optional (if you already in correct folder)

 ```bash 
 cd NER_DATA_PREPROCESSING
 ```

requirements.txt -->> installation :

```bash
pip install requirements.txt
```

2 ) install via pypi

```bash
pip install ner_data_processor
```
```python
from ner_data_processor.Ner_Data_Preparation import Custom_Ner_Dataset
ner = Custom_Ner_Dataset()
```

Step - 2:

DataSet Format : pandas Dataframe with text(Arun Kumar Jagatramka vs Ultrabulk AS ) and exact word and entity (Arun Kumar Jagatramka - PLAINTIFF)

  text	                                               |             entities
0	Arun Kumar Jagatramka vs Ultrabulk AS on 22 Se...	  | [Arun Kumar Jagatramka - PLAINTIFF, Ultrabulk ...
1	Author Biren Vaishnav	                              |  [Biren Vaishnav - PERSON]
2	The Supreme Court ruled in favor of Jane Smith.	    |   [Supreme Court - LOC, Jane Smith - PLAINTIFF]
3	The Gujarat High Court issued a judgment in Ah...	  |  [Gujarat High Court - ORG, Ahmedabad - LOC]

API Documentation :

output for example only

  1. install via Github

  2. extract_DataFrame(df) >>

    ner = Custom_Ner_Dataset()
    data = ner.extract_DataFrame(df)
    

    output :

text	entities
0	Arun Kumar Jagatramka vs Ultrabulk AS on 22 Se...	[(0, 21, PLAINTIFF), (25, 37, Defender), (41, ...
1	Author Biren Vaishnav	[(7, 21, PERSON)]
2	The Supreme Court ruled in favor of Jane Smith.	[(4, 17, LOC), (36, 46, PLAINTIFF)]
3	The Gujarat High Court issued a judgment in Ah...	[(4, 22, ORG), (44, 53, LOC)]
  1. to_dataset(data) >>

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(ner.to_dataset(data))
    

    output :

          id	                                                     tokens	ner_tags
     0	0	[Arun, Kumar, Jagatramka, vs, Ultrabulk, AS, o...	 [B-PLAINTIFF, I-PLAINTIFF, I-PLAINTIFF, O, B-D...
     1	1	[Author, Biren, Vaishnav]	[O, B-PERSON, I-PERSON]
     2	8	[The, Supreme, Court, ruled, in, favor, of, Ja...	  [O, B-LOC, I-LOC, O, O, O, O, B-PLAINTIFF, I-P...
     3	9	[The, Gujarat, High, Court, issued, a, judgmen...	  [O, B-ORG, I-ORG, I-ORG, O, O, O, O, B-LOC, O]
    
  2. Create _label_maps to create Huggingface Dataset :

```python
labels = []
for i in df["ner_tags"].tolist():
  labels.extend(i)
labels = np.unique(labels).tolist()
labels
```

output :

   ['B-DATE',
 'B-Defender',
 'B-LOC',
 'B-ORG',
 'B-PERSON',
 'B-PLAINTIFF',
 'I-DATE',
 'I-Defender',
 'I-LOC',
 'I-ORG',
 'I-PERSON',
 'I-PLAINTIFF',
 'O']
  1. to_huggingface_dataset(data,labels) >>

    dataset = ner.to_huggingface_dataset(df,labels)
    dataset = dataset.train_test_split(test_size=0.1)
    dataset
    

    output :

     DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 3
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 1
    })
    

    })

  2. coreference_model(text) >>>

    ner.coreference_model(text:str)  
    
    input : text = "John is Victim. He is Innocent"
    output : He mentions John it returns in json format which text,mentions,and span ...
    

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ner_data_processor-0.0.3.tar.gz (6.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ner_data_processor-0.0.3-py3-none-any.whl (6.1 kB view details)

Uploaded Python 3

File details

Details for the file ner_data_processor-0.0.3.tar.gz.

File metadata

  • Download URL: ner_data_processor-0.0.3.tar.gz
  • Upload date:
  • Size: 6.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.10.16

File hashes

Hashes for ner_data_processor-0.0.3.tar.gz
Algorithm Hash digest
SHA256 c0d24f0cb3a7da46af77d25eda4c89a8e425645d0e354a5016c52c56f227086e
MD5 bf8fbe0a0d94e0bcf1da453d2d7296d6
BLAKE2b-256 28a73508a3fcfac14e5ded4ca7f99a59f58a59fdfe5d377367881c32e385e9c7

See more details on using hashes here.

File details

Details for the file ner_data_processor-0.0.3-py3-none-any.whl.

File metadata

File hashes

Hashes for ner_data_processor-0.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 b35315f50c7e44bf23a3887d7cfcb0b0f139c5d59bce188ecfa21e4309b8be15
MD5 8205c81d07451617c1a618c74e0221a4
BLAKE2b-256 48d058d70b2d7d344927987ab4c572ca8285660c44b53445f0ab577f9a75a540

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page