
Useful AI library for chemistry.


KME Word Segmentation

A model that tokenizes chemical IUPAC names, built with TensorFlow and Keras.

Prepare training dataset

What to have

  1. CHAR_INDICES: a dictionary whose keys are characters [string] and whose values are numbers [int] (used to preprocess text into numbers)

  2. Dict_cut: the input text with the cut positions marked by '|' (used to create the labels)

Cyclo|prop|ane| |Non|a|-|1|,|8|-|di|yne| |1|,|3|-|di|chlorocyclo|hex|ane| |Hept|a|-|1|,|5|-|di|ene

  3. Dict: the raw input text (used to train the model)

Cyclopropane Nona-1,8-diyne 1,3-dichlorocyclohexane Hepta-1,5-diene
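
The relationship between the three inputs can be sketched in a few lines of plain Python. This is only an illustration, not the package's own code: the helper names strip_cuts, build_char_indices and encode are hypothetical, and reserving index 0 for padding and unknown characters is an assumption.

```python
from typing import Dict, List


def strip_cuts(text_cut: str, sep: str = "|") -> str:
    """Recover the raw text (Dict) from the marked text (Dict_cut)."""
    return text_cut.replace(sep, "")


def build_char_indices(text: str) -> Dict[str, int]:
    """Map each distinct character to an integer.

    Index 0 is left unused here so it can stand for padding (assumption).
    """
    return {ch: i + 1 for i, ch in enumerate(sorted(set(text)))}


def encode(text: str, char_indices: Dict[str, int]) -> List[int]:
    """Turn text into a list of integers; unknown characters become 0."""
    return [char_indices.get(ch, 0) for ch in text]


dict_cut = "Cyclo|prop|ane| |Non|a|-|1|,|8|-|di|yne"
raw = strip_cuts(dict_cut)
char_indices = build_char_indices(raw)
encoded = encode(raw, char_indices)
print(raw)
```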

Create dataset

  1. Store the chemical names as JSON arrays (dataset, dataset_cut)
  2. Split each array into a training dataset (90%) (dataset_train, dataset_cut_train) and a validation dataset (10%) (dataset_val, dataset_cut_val)
  3. Join the items of each array into a single text
  4. Create the dataset with the create_dataset function, which takes dataset_cut and returns X_train (size: [text_length, look_back]) (dataset_cut with the '|' markers removed) and the labels (cut positions: 1 = cut, 0 = not cut)
  5. Use tf.data.Dataset.from_tensor_slices((X, y)).batch(128) to batch the data for training
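
Step 4 could look roughly like the sketch below. It is not the package's create_dataset, only one plausible reading of the description: each row of X is assumed to be a window of the look_back most recent character indices ending at the current character (left-padded with 0), and a character is assumed to get label 1 when it opens a new token, that is, when it is the first character or directly follows a '|'.

```python
from typing import Dict, List, Tuple


def create_dataset(
    text_cut: str,
    char_indices: Dict[str, int],
    look_back: int = 5,
    sep: str = "|",
) -> Tuple[List[List[int]], List[int]]:
    """Build (X, labels) from text whose cut positions are marked by `sep`."""
    chars: List[int] = []
    labels: List[int] = []
    starts_token = True  # assumption: the first character opens a token
    for ch in text_cut:
        if ch == sep:
            starts_token = True  # the next real character starts a token
            continue
        chars.append(char_indices.get(ch, 0))
        labels.append(1 if starts_token else 0)
        starts_token = False

    # Left-pad so every character has a full window that ends on it.
    padded = [0] * (look_back - 1) + chars
    X = [padded[i:i + look_back] for i in range(len(chars))]
    return X, labels


demo_indices = {"d": 1, "e": 2, "i": 3, "n": 4, "y": 5}
X_demo, y_demo = create_dataset("di|yne", demo_indices, look_back=3)
print(X_demo)
print(y_demo)
```

X then has one row per character of the unmarked text, which matches the stated size [text_length, look_back]. For the Categorical Crossentropy loss used later, the 0/1 labels would still need to be one-hot encoded.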

Create Model

  1. The model uses one Embedding layer, one Bidirectional LSTM layer, and a Dense layer
  2. The model is compiled with optimizer = Adam and loss_function = Categorical Crossentropy (because the output is classified into 2 labels: 1 = cut, 0 = not cut), and trained with call_back = [EarlyStopping, ModelCheckpoint]
     • EarlyStopping: stops training when validation_loss starts to increase
     • ModelCheckpoint: saves the model with the minimum validation_loss
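
A model of this shape might be put together as in the sketch below. The layer sizes, the patience value and the checkpoint file name are all assumptions, since the description does not give them, and build_model and make_callbacks are hypothetical helper names rather than functions of this package. The labels are assumed to be one-hot encoded (shape [N, 2]) to match Categorical Crossentropy.

```python
def build_model(vocab_size: int, embed_dim: int = 32, lstm_units: int = 64):
    """Embedding -> Bidirectional LSTM -> Dense(2, softmax)."""
    # Imported inside the function so the sketch can be loaded
    # even where TensorFlow is not installed.
    import tensorflow as tf

    model = tf.keras.Sequential([
        # vocab_size must exceed the largest character index (0 = padding).
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units)),
        # Two outputs: probability of "not cut" (0) and "cut" (1).
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def make_callbacks(checkpoint_path: str = "best_model.keras", patience: int = 3):
    """EarlyStopping and ModelCheckpoint, both watching validation loss."""
    import tensorflow as tf

    return [
        # Stop once val_loss has not improved for `patience` epochs.
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=patience),
        # Keep only the model with the lowest val_loss seen so far.
        tf.keras.callbacks.ModelCheckpoint(
            filepath=checkpoint_path, monitor="val_loss", save_best_only=True
        ),
    ]
```

Training would then be along the lines of model.fit(train_ds, validation_data=val_ds, epochs=..., callbacks=make_callbacks()), where train_ds and val_ds are the batched tf.data datasets from the previous section.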

After Training the Model

  • The output of the model is an array (size: [batch_size, 2]) that determines which positions are cut (value 0 = not cut; 1 = cut)
[1 1 1 1 1 0 1 1 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0]
  • Tokenize the dataset (the text without cut markers) with the labels (the output of the model) using the word_tokenize function, which returns an array of the text pieces after cutting
['1', ',', '2', '-', 'di', 'h', 'ydrox', 'y', '-', '2', '-', 'meth', 'yl', 'prop', 'ane']
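
The example above is consistent with reading a label of 1 as "a new token starts at this character". Under that reading, the tokenizing step can be sketched as follows. This is an illustrative reimplementation, not the package's word_tokenize, and probs_to_labels is a hypothetical helper that reduces a [batch_size, 2] output to 0/1 labels by picking the larger of the two scores.

```python
from typing import List, Sequence


def probs_to_labels(probs: Sequence[Sequence[float]]) -> List[int]:
    """Pick class 1 (cut) where its score exceeds that of class 0 (not cut)."""
    return [1 if p[1] > p[0] else 0 for p in probs]


def word_tokenize(text: str, labels: Sequence[int]) -> List[str]:
    """Split text so that every character labelled 1 opens a new token."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    tokens: List[str] = []
    for ch, label in zip(text, labels):
        if label == 1 or not tokens:
            tokens.append(ch)   # start a new token
        else:
            tokens[-1] += ch    # extend the current token
    return tokens


labels = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1,
          1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
tokens = word_tokenize("1,2-dihydroxy-2-methylpropane", labels)
print(tokens)
```

Run on the label array shown above, this reproduces the token list from the example.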


Built Distribution

kmeseg-0.1.1-py3-none-any.whl (6.7 kB view details)

Uploaded Python 3

File details

Details for the file kmeseg-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: kmeseg-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 6.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.4.2 importlib_metadata/4.8.1 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.7

File hashes

Hashes for kmeseg-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 c015c6bda2a16fda6e051621a8b8410652c56635d51ee692045899252da7335f
MD5 475ee1eedb918f0aa03df829fadd62c7
BLAKE2b-256 d0d2f365af9b3656267cbf9345d6e0bda80bedfe75b8fb26204e54c7933b9ea3
