NLP text processing toolkit for Deep Learning
FERN
English Version | 中文版
Fern defines a standard development workflow for NLP models. With Fern's help, text preprocessing, model building and model training can be implemented quickly. Its modules cover the following functions:
- Text preprocessing: data downloader, data cleaner, data transformer and data splitter
- Model building: model saving, loading and architecture printing
- Model training: step/epoch training and evaluation, evaluation function setting, loss function setting and label weight setting
Fern's main design goal is to eliminate the repetitive glue code that recurs across NLP projects, so as to reduce boilerplate and avoid the stray bugs that creep in when data is passed between processing stages.
INSTALL
- Install from PyPI

  $ pip install Fern2

- Install from source code

  $ pip install -e git+https://github.com/Jasonsey/Fern.git@develop
TUTORIAL
This is a quick tutorial covering the basics of all classes. For more usage details, see the docstrings of the functions in the source code.
DATA PREPARATION
- Data download

  from fern.utils.data import BaseDownloader

  loader = BaseDownloader(host=config.HOST, user=config.USER, password=config.PASSWORD)
  loader.read_msssql(sql=config.SQL)
  loader.save(config.SOURCE_PATH)
- Load the downloaded data from disk

  loader.load(config.SOURCE_PATH)
- Data cleaning

  from fern.utils.data import BaseCleaner

  class DataCleaner(BaseCleaner):
      def clean_label(self, row):
          return row['LABEL']

      def clean_data(self, row):
          data = row['DATA']
          res = do_clean(data)
          return res

  cleaner = DataCleaner(stop_words=config.STOP_WORDS, user_words=config.USER_WORDS)
  cleaner.clean(loader.data)
- Data transforming

  from fern.utils.data import BaseTransformer

  class DataTransformer(BaseTransformer):
      def transform_label(self, label):
          res = np.zeros([1] + self.output_shape, np.float32)
          for i in range(len(str(label))):
              number = int(str(label)[i])
              res[:, i, number] = 1.0
          return res

  transformer = DataTransformer(
      data=cleaner.data,
      word_path=config.WORDS_LIBRARY,
      min_len=config.MIN_SEQ_LEN,
      max_len=config.MAX_SEQ_LEN,
      min_freq=config.MAX_WORD_FREQ,
      output_shape=config.OUTPUT_SHAPE,
      filter_data=True)
  transformer.transform(data=cleaner.data)
  transformer.save(config.TRANSFORMED_DATA)
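The `transform_label` override above one-hot encodes each digit of a numeric label. The same idea can be sketched in standalone NumPy (an illustration only, outside Fern; the function name and the `[num_digits, 10]` shape are assumptions for the example):

```python
import numpy as np

def one_hot_digits(label, output_shape):
    # Encode each digit of `label` as a one-hot row.
    # output_shape is assumed to be [num_digits, 10].
    res = np.zeros([1] + output_shape, np.float32)
    for i, ch in enumerate(str(label)):
        res[:, i, int(ch)] = 1.0
    return res

encoded = one_hot_digits(42, [2, 10])
# encoded[0, 0] is the one-hot vector for digit 4, encoded[0, 1] for digit 2
```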
- Data segmentation

  from fern.utils.data import BaseSplitter

  splitter = BaseSplitter(rate_val=config.RATE_VAL)
  splitter.split(transformer.data)
  splitter.save(config.SPLIT_DATA)
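`BaseSplitter`'s exact behavior lives in the Fern source; as a rough sketch, a rate-based train/validation split like the one `rate_val` suggests can be done with plain NumPy (not Fern's implementation, the function name is made up for the example):

```python
import numpy as np

def split_train_val(data, rate_val, seed=0):
    # Shuffle indices, then carve off a `rate_val` fraction for validation.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_val = int(len(data) * rate_val)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return [data[i] for i in train_idx], [data[i] for i in val_idx]

data_train, data_val = split_train_val(list(range(10)), rate_val=0.2)
```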
MODEL SEARCH
- Configure the list of models to be searched
  from tensorflow.keras import layers, Model
  from fern.utils.model import BaseModel

  class TextConv1D_1(BaseModel):
      def build(self):
          inp = layers.Input(shape=(self.max_seq_len,))
          x = layers.Dense(12)(inp)
          oup = layers.Activation('softmax')(x)
          model = Model(inputs=inp, outputs=oup, name=self.name)
          return model

  class TextConv1D_2(BaseModel):
      def build(self):
          inp = layers.Input(shape=(self.max_seq_len,))
          x = layers.Dense(24)(inp)
          oup = layers.Activation('softmax')(x)
          model = Model(inputs=inp, outputs=oup, name=self.name)
          return model

  UNOPTIMIZED_MODELS = [TextConv1D_1, TextConv1D_2]
- Searching for the best model

  import tensorflow as tf
  from fern.utils.train import BaseTrainer

  best_score = 0
  best_epoch = 0
  best_model = ''
  for model in UNOPTIMIZED_MODELS:
      tf.keras.backend.clear_session()
      try:
          my_model = model(
              output_shape=config.OUTPUT_SHAPE,
              max_seq_len=config.MAX_SEQ_LEN,
              library_len=library_len)
          trainer = BaseTrainer(
              model=my_model,
              path_data=config.SPLIT_DATA,
              lr=config.LR,
              batch_size=config.BATCH_SIZE)
          score, epoch = trainer.train(config.EPOCHS, early_stop=config.EARLY_STOP)
          if score > best_score:
              best_score = score
              best_epoch = epoch
              best_model = my_model.name
      except Exception as error:
          print(f'{model.__name__} failed: {error}')
  print(f'Best Model: {best_model}, Best Score: {best_score}, Best Epoch: {best_epoch}')
TRAINING THE BEST MODEL
my_model = UNOPTIMIZED_MODELS[0](
output_shape=config.OUTPUT_SHAPE,
max_seq_len=config.MAX_SEQ_LEN,
library_len=library_len)
trainer = BaseTrainer(
model=my_model,
path_data=config.SPLIT_DATA,
lr=config.LR,
batch_size=config.BATCH_SIZE)
_ = trainer.train(config.BEST_EPOCH, mode='server')
trainer.save(config.MODEL_PATH)
VARIABLE NAMING RULE
To keep names consistent, the following conventions apply to variables whose naming could otherwise diverge:

- Data variables of the same type follow the pattern data_<split> / label_<split>:

  data_train, data_val
  label_train, label_val

- Indicator variables of the same type follow the pattern <split>_<metric>:

  val_loss, val_acc, val_binary_acc
  train_loss, train_acc

- Other variables are named a_b, where the variable first belongs to a and then to b:

  path_dataset
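A toy illustration of these conventions side by side (all values are dummies for the example, not Fern APIs):

```python
# Data variables: data_<split> and label_<split>
data_train, data_val = [[1, 2], [3, 4], [5, 6]], [[7, 8]]
label_train, label_val = [0, 1, 0], [1]

# Indicator variables: <split>_<metric>
val_loss, val_acc = 0.42, 0.88
train_loss, train_acc = 0.35, 0.91

# Other variables: a_b, e.g. "the path of the dataset"
path_dataset = 'data/dataset.csv'
```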
CHANGE LOG