Skip to main content

Useful AI library for chemistry.

Project description

TensorFlow Keras Codacy Badge

KME Word Segmentation

AI for tokenize chemical IUPAC name using tensorflow and keras.

Prepare training dataset

What to have

  1. CHAR_INDICES: dictionary with key is character [string], value is number [int] (use to preprocess text to number)

  2. Dict_cut: Input text with determined (by '|' ) where to be cut (use for create label)

Cyclo|prop|ane| |Non|a|-|1|,|8|-|di|yne| 
|1|,|3|-|di|chlorocyclo|hex|ane| |Hept|a|-|1|,|5|-|di|ene
  1. Dict: Input text (raw) (use for train model)
Cyclopropane Nona-1,8-diyne 1,3-dichlorocyclohexane Hepta-1,5-diene

Create dataset

  1. Make JSON value as array of chemical name (dataset, dataset_cut)
  2. Split array for training dataset (90%) (dataset_train, dataset_cut_train) and validation dataset (10%) (dataset_val, dataset_cut_val)
  3. Join item in each array together into text
  4. Create dataset using create_dataset function that take dataset_cut then return X_train (size: [text_length, look_back]) (dataset_cut that have been cut '|') and label (position where to cut 1 = cut, 0 = not cut)
  5. Use tf.data.Dataset.from_tensor_slices((X, y)).batch_size(128) to make data easy to be train

Create Model

  1. We use 1xEmbedding layer, 1xBidirection LSTM layer, Dense Layer
  2. Compiled model optimizer = Adam, loss_function = Categorical Crossentropy (becase we classify 2 label output 1 = cut, 0 = not cut) call_back = [EarlyStopping, ModelCheckpoint] Early stopping : Stop train model if validation_loss is being increase ModelCheckpoint : Save model that has minimum validation_loss

After Train Model

  • The output of the model is array (size: [batch_size, 2] determined which position to be cut (value = 0 -> not cut ; 1 -> cut))
[1 1 1 1 1 0 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0]
  • Tokenize dataset (text which didn't determined where to be cut) with label (output from model) using word_tokenize function that return array of text that has been cut
['1', ',', '2', '-', 'di', 'h', 'ydrox', 'y', '-', '2', '-', 'meth', 'yl', 'prop', 'ane']

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

kmeseg-0.1.1-py3-none-any.whl (6.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page