NLP text processing toolkit for Deep Learning
FERN
English Version | 中文版
Fern defines a standard development workflow for NLP models. With Fern's help, text preprocessing, model building and model training can be implemented quickly. Its modules cover the following functions:
- Text preprocessing: data downloader, data cleaner, data transformer and data splitter
- Model building: model saving, loading and architecture printing
- Model training: step/epoch training and evaluation, evaluation function setting, loss function setting and label weight setting
Fern's main design goal is to eliminate the repetitive glue code that recurs across NLP projects, so as to reduce boilerplate and avoid the stray bugs that creep in when data is passed between processing stages.
INSTALL
- Install from PyPI

  $ pip install Fern2

- Install from source code

  $ pip install -e git+https://github.com/Jasonsey/Fern.git@develop
TUTORIAL
This is a quick tutorial covering the basics of all classes. For more usage details, see the docstrings of the functions in the source code.
DATA PREPARATION
- Data download

  from fern.utils.data import BaseDownloader

  loader = BaseDownloader(host=config.HOST, user=config.USER, password=config.PASSWORD)
  loader.read_msssql(sql=config.SQL)
  loader.save(config.SOURCE_PATH)
- Load the downloaded data from disk

  loader.load(config.SOURCE_PATH)
- Data cleaning

  from fern.utils.data import BaseCleaner

  class DataCleaner(BaseCleaner):
      def clean_label(self, row):
          return row['LABEL']

      def clean_data(self, row):
          data = row['DATA']
          res = do_clean(data)
          return res

  cleaner = DataCleaner(stop_words=config.STOP_WORDS, user_words=config.USER_WORDS)
  cleaner.clean(loader.data)
- Data transforming

  from fern.utils.data import BaseTransformer

  class DataTransformer(BaseTransformer):
      def transform_label(self, label):
          res = np.zeros([1] + self.output_shape, np.float32)
          for i in range(len(str(label))):
              number = int(str(label)[i])
              res[:, i, number] = 1.0
          return res

  transformer = DataTransformer(
      data=cleaner.data,
      word_path=config.WORDS_LIBRARY,
      min_len=config.MIN_SEQ_LEN,
      max_len=config.MAX_SEQ_LEN,
      min_freq=config.MAX_WORD_FREQ,
      output_shape=config.OUTPUT_SHAPE,
      filter_data=True)
  transformer.transform(data=cleaner.data)
  transformer.save(config.TRANSFORMED_DATA)
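The `transform_label` override above one-hot encodes each digit of a numeric label. The same idea can be sketched in standalone NumPy (an illustration only, outside Fern; the function name and the `[num_digits, 10]` shape are assumptions for the example):

```python
import numpy as np

def one_hot_digits(label, output_shape):
    # Encode each digit of `label` as a one-hot row.
    # output_shape is assumed to be [num_digits, 10].
    res = np.zeros([1] + output_shape, np.float32)
    for i, ch in enumerate(str(label)):
        res[:, i, int(ch)] = 1.0
    return res

encoded = one_hot_digits(42, [2, 10])
# encoded[0, 0] is the one-hot vector for digit 4, encoded[0, 1] for digit 2
```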
- Data segmentation

  from fern.utils.data import BaseSplitter

  splitter = BaseSplitter(rate_val=config.RATE_VAL)
  splitter.split(transformer.data)
  splitter.save(config.SPLIT_DATA)
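`BaseSplitter`'s exact behavior lives in the Fern source; as a rough sketch, a rate-based train/validation split like the one `rate_val` suggests can be done with plain NumPy (not Fern's implementation, the function name is made up for the example):

```python
import numpy as np

def split_train_val(data, rate_val, seed=0):
    # Shuffle indices, then carve off a `rate_val` fraction for validation.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_val = int(len(data) * rate_val)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return [data[i] for i in train_idx], [data[i] for i in val_idx]

data_train, data_val = split_train_val(list(range(10)), rate_val=0.2)
```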
MODEL SEARCH
- Configure the list of models to be searched
  from tensorflow.keras import layers, Model
  from fern.utils.model import BaseModel

  class TextConv1D_1(BaseModel):
      def build(self):
          inp = layers.Input(shape=(self.max_seq_len,))
          x = layers.Dense(12)(inp)
          oup = layers.Activation('softmax')(x)
          model = Model(inputs=inp, outputs=oup, name=self.name)
          return model

  class TextConv1D_2(BaseModel):
      def build(self):
          inp = layers.Input(shape=(self.max_seq_len,))
          x = layers.Dense(24)(inp)
          oup = layers.Activation('softmax')(x)
          model = Model(inputs=inp, outputs=oup, name=self.name)
          return model

  UNOPTIMIZED_MODELS = [TextConv1D_1, TextConv1D_2]
- Searching for the best model

  import tensorflow as tf
  from fern.utils.train import BaseTrainer

  best_score = 0
  best_epoch = 0
  best_model = ''
  for model in UNOPTIMIZED_MODELS:
      tf.keras.backend.clear_session()
      try:
          my_model = model(
              output_shape=config.OUTPUT_SHAPE,
              max_seq_len=config.MAX_SEQ_LEN,
              library_len=library_len)
          trainer = BaseTrainer(
              model=my_model,
              path_data=config.SPLIT_DATA,
              lr=config.LR,
              batch_size=config.BATCH_SIZE)
          score, epoch = trainer.train(config.EPOCHS, early_stop=config.EARLY_STOP)
          if score > best_score:
              best_score = score
              best_epoch = epoch
              best_model = my_model.name
      except Exception as error:
          print(f'{model.__name__} failed: {error}')
  print(f'Best Model: {best_model}, Best Score: {best_score}, Best Epoch: {best_epoch}')
TRAINING THE BEST MODEL
my_model = UNOPTIMIZED_MODELS[0](
output_shape=config.OUTPUT_SHAPE,
max_seq_len=config.MAX_SEQ_LEN,
library_len=library_len)
trainer = BaseTrainer(
model=my_model,
path_data=config.SPLIT_DATA,
lr=config.LR,
batch_size=config.BATCH_SIZE)
_ = trainer.train(config.BEST_EPOCH, mode='server')
trainer.save(config.MODEL_PATH)
VARIABLE NAMING RULE
To keep names consistent, the following conventions apply to variables whose naming could otherwise diverge:

- Data variables of the same type follow the pattern data_<split> / label_<split>:

  data_train, data_val
  label_train, label_val

- Indicator variables of the same type follow the pattern <split>_<metric>:

  val_loss, val_acc, val_binary_acc
  train_loss, train_acc

- Other variables are named a_b, where the variable first belongs to a and then to b:

  path_dataset
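A toy illustration of these conventions side by side (all values are dummies for the example, not Fern APIs):

```python
# Data variables: data_<split> and label_<split>
data_train, data_val = [[1, 2], [3, 4], [5, 6]], [[7, 8]]
label_train, label_val = [0, 1, 0], [1]

# Indicator variables: <split>_<metric>
val_loss, val_acc = 0.42, 0.88
train_loss, train_acc = 0.35, 0.91

# Other variables: a_b, e.g. "the path of the dataset"
path_dataset = 'data/dataset.csv'
```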
CHANGE LOG