A NER Data Preparing Tool

Project description

NER_DATA_PREPROCESSING:

Collecting Data for NER:

This tool helps you create NER and coreference datasets easily. Training a token classification or coreference resolution model requires a labeled dataset, not raw text: each sentence must first be converted into the required format. The steps below show how to use this library in your project.

Step - 1:

  1. Install via Git

Download:

 ```bash
 git clone https://github.com/rajboopathiking/NER_DATA_PREPROCESSING.git
 ```

Optional (skip this if you are already in the correct folder):

 ```bash 
 cd NER_DATA_PREPROCESSING
 ```

Install the dependencies listed in requirements.txt:

```bash
pip install -r requirements.txt
```

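  2. Install via PyPI

```bash
pip install ner-data-processor
```

Then import the class and create an instance:

```python
# The import package is assumed to use underscores (hyphens are invalid in Python names).
from ner_data_processor.Ner_Data_Preparation import Custom_Ner_Dataset
ner = Custom_Ner_Dataset()
```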

Step - 2:

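Dataset format: a pandas DataFrame with a text column holding the sentence (for example, Arun Kumar Jagatramka vs Ultrabulk AS) and an entities column listing each exact phrase with its entity label (for example, Arun Kumar Jagatramka - PLAINTIFF).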

|   | text | entities |
|---|------|----------|
| 0 | Arun Kumar Jagatramka vs Ultrabulk AS on 22 Se... | [Arun Kumar Jagatramka - PLAINTIFF, Ultrabulk ... |
| 1 | Author Biren Vaishnav | [Biren Vaishnav - PERSON] |
| 2 | The Supreme Court ruled in favor of Jane Smith. | [Supreme Court - LOC, Jane Smith - PLAINTIFF] |
| 3 | The Gujarat High Court issued a judgment in Ah... | [Gujarat High Court - ORG, Ahmedabad - LOC] |
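
As a minimal sketch, an input DataFrame in this shape could be built with pandas as follows. It assumes each entities value is a Python list of "phrase - LABEL" strings, which is what the preview suggests but is not confirmed here. The helper `missing_phrases` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Assumption: the entities column holds a list of "phrase - LABEL" strings.
rows = [
    {"text": "Author Biren Vaishnav",
     "entities": ["Biren Vaishnav - PERSON"]},
    {"text": "The Supreme Court ruled in favor of Jane Smith.",
     "entities": ["Supreme Court - LOC", "Jane Smith - PLAINTIFF"]},
]
df = pd.DataFrame(rows)


def missing_phrases(frame):
    """Return (text, phrase) pairs whose phrase does not occur verbatim in the text."""
    missing = []
    for text, entities in zip(frame["text"], frame["entities"]):
        for item in entities:
            # Split on the last " - " so phrases containing hyphens stay intact.
            phrase, _, _label = item.rpartition(" - ")
            if phrase not in text:
                missing.append((text, phrase))
    return missing


print(missing_phrases(df))
```

Checking that every phrase appears verbatim in its sentence before conversion is a cheap way to catch typos in the annotations.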

API Documentation :

The outputs shown here are examples only.

  1. install via Github

  2. extract_DataFrame(df) >>

    ner = Custom_Ner_Dataset()
    data = ner.extract_DataFrame(df)
    

    output :

|   | text | entities |
|---|------|----------|
| 0 | Arun Kumar Jagatramka vs Ultrabulk AS on 22 Se... | [(0, 21, PLAINTIFF), (25, 37, Defender), (41, ... |
| 1 | Author Biren Vaishnav | [(7, 21, PERSON)] |
| 2 | The Supreme Court ruled in favor of Jane Smith. | [(4, 17, LOC), (36, 46, PLAINTIFF)] |
| 3 | The Gujarat High Court issued a judgment in Ah... | [(4, 22, ORG), (44, 53, LOC)] |

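
The (start, end, label) tuples in this output are character offsets into the text, with an exclusive end. The sketch below shows one way such spans could be derived from "phrase - LABEL" strings. It is not the library's implementation: `find_spans` is a hypothetical helper, and it only locates the first occurrence of each phrase.

```python
def find_spans(text, entities):
    """Return sorted (start, end, label) tuples; end is exclusive."""
    spans = []
    for item in entities:
        # Split on the last " - " to separate the phrase from its label.
        phrase, _, label = item.rpartition(" - ")
        start = text.find(phrase)
        if not phrase or start == -1:
            continue  # malformed item or phrase not present verbatim
        spans.append((start, start + len(phrase), label))
    return sorted(spans)


spans = find_spans(
    "The Supreme Court ruled in favor of Jane Smith.",
    ["Supreme Court - LOC", "Jane Smith - PLAINTIFF"],
)
print(spans)  # [(4, 17, 'LOC'), (36, 46, 'PLAINTIFF')]
```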
  1. to_dataset(data) >>

    import pandas as pd
    import numpy as np
    df = pd.DataFrame(ner.to_dataset(data))
    

    output :

|   | id | tokens | ner_tags |
|---|----|--------|----------|
| 0 | 0 | [Arun, Kumar, Jagatramka, vs, Ultrabulk, AS, o... | [B-PLAINTIFF, I-PLAINTIFF, I-PLAINTIFF, O, B-D... |
| 1 | 1 | [Author, Biren, Vaishnav] | [O, B-PERSON, I-PERSON] |
| 2 | 8 | [The, Supreme, Court, ruled, in, favor, of, Ja... | [O, B-LOC, I-LOC, O, O, O, O, B-PLAINTIFF, I-P... |
| 3 | 9 | [The, Gujarat, High, Court, issued, a, judgmen... | [O, B-ORG, I-ORG, I-ORG, O, O, O, O, B-LOC, O] |
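
The ner_tags column follows the BIO scheme: the first token of an entity is tagged B-LABEL, the following tokens of the same entity I-LABEL, and everything else O. A minimal sketch of that conversion from character spans is shown below. The regex tokenizer and the name `to_bio` are assumptions made for illustration; the library's own tokenization may differ.

```python
import re


def to_bio(text, spans):
    """Tokenize text and tag each token using (start, end, label) character spans."""
    tokens, tags = [], []
    # Assumed tokenizer: runs of word characters, or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if start <= match.start() and match.end() <= end:
                # The first token of an entity gets B-, later tokens get I-.
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


tokens, tags = to_bio("Author Biren Vaishnav", [(7, 21, "PERSON")])
print(tokens)  # ['Author', 'Biren', 'Vaishnav']
print(tags)    # ['O', 'B-PERSON', 'I-PERSON']
```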
    
  2. Create the label list needed to build the Hugging Face Dataset:

```python
# Collect every tag that appears in the dataset, then deduplicate and sort.
labels = []
for tags in df["ner_tags"].tolist():
    labels.extend(tags)
labels = np.unique(labels).tolist()
labels
```

output :

    ['B-DATE',
     'B-Defender',
     'B-LOC',
     'B-ORG',
     'B-PERSON',
     'B-PLAINTIFF',
     'I-DATE',
     'I-Defender',
     'I-LOC',
     'I-ORG',
     'I-PERSON',
     'I-PLAINTIFF',
     'O']

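
Token classification models generally work with integer class ids rather than tag strings, so a sorted label list like this one is typically turned into a pair of lookup maps. The sketch below uses the conventional names `label2id` and `id2label`; whether the library builds such maps internally is not stated here, and `encode_tags` is a hypothetical helper.

```python
labels = ['B-DATE', 'B-Defender', 'B-LOC', 'B-ORG', 'B-PERSON', 'B-PLAINTIFF',
          'I-DATE', 'I-Defender', 'I-LOC', 'I-ORG', 'I-PERSON', 'I-PLAINTIFF', 'O']

# Each label's id is its position in the sorted list.
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}


def encode_tags(tags):
    """Map a list of tag strings to their integer ids."""
    return [label2id[tag] for tag in tags]


print(encode_tags(["O", "B-PERSON", "I-PERSON"]))  # [12, 4, 10]
```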
  1. to_huggingface_dataset(data,labels) >>

    dataset = ner.to_huggingface_dataset(df,labels)
    dataset = dataset.train_test_split(test_size=0.1)
    dataset
    

    output :

    DatasetDict({
        train: Dataset({
            features: ['id', 'tokens', 'ner_tags'],
            num_rows: 3
        })
        test: Dataset({
            features: ['id', 'tokens', 'ner_tags'],
            num_rows: 1
        })
    })

  2. coreference_model(text) >>>

    ner.coreference_model(text:str)  
    
    input : text = "John is Victim. He is Innocent"
    output : "He" is resolved as a mention of "John". The result is returned in JSON format containing the text, mentions, and spans ...
    
