Using molmapnets for regression, with descriptors, fingerprints, or both. Tested on the ESOL dataset.
 
%config Completer.use_jedi = False
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_theme(palette='Set2')
colors = sns.color_palette()
colors
import torch
from torch import nn, optim
import torch.nn.functional as F
torch.set_default_dtype(torch.float64)

from torch.utils.data import Dataset, DataLoader, random_split
from chembench import dataset
from molmap import MolMap
RDKit WARNING: [08:42:17] Enabling RDKit 2019.09.3 jupyter extensions
from molmapnets.data import SingleFeatureData, DoubleFeatureData
from molmapnets.models import MolMapRegression

Feature extraction

The chembench package collects several datasets for benchmarking models. Here we'll use the ESOL dataset, which contains measured aqueous solubility values (log solubility in mols per litre) for 1128 small organic molecules. The data can be loaded with

data = dataset.load_ESOL()
total samples: 1128

We have the SMILES (Simplified Molecular Input Line Entry System) string for each compound and its corresponding solubility measurement:

data.df.head()
smiles measured log solubility in mols per litre
0 OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)... -0.77
1 Cc1occc1C(=O)Nc2ccccc2 -3.30
2 CC(C)=CCCC(C)=CC(=O) -2.06
3 c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 -7.87
4 c1ccsc1 -1.33
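
Before extracting features, it can help to glance at the regression target. Since seaborn is already imported, a quick histogram of the measured solubility (an optional sketch, purely for inspection):

# Distribution of the regression target (log solubility in mols per litre)
sns.histplot(np.asarray(data.y).ravel(), bins=30)
plt.xlabel('measured log solubility (mol/L)')
plt.ylabel('count')
plt.show()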

Using MolMap we can extract features with the SMILES strings as input. We can specify the feature type ftype, the pairwise feature distance metric metric, and the feature grid arrangement method fmap_type:

MolMap?
Init signature:
MolMap(
    ftype='descriptor',
    flist=None,
    fmap_type='grid',
    fmap_shape=None,
    split_channels=True,
    metric='cosine',
    var_thr=0.0001,
)
Docstring:      <no docstring>
Init docstring:
paramters
-----------------
ftype: {'fingerprint', 'descriptor'}, feature type
flist: feature list, if you want use some of the features instead of all features, each element in flist should be the id of a feature
fmap_shape: None or tuple, size of molmap, only works when fmap_type is 'scatter', if None, the size of feature map will be calculated automatically
fmap_type:{'scatter', 'grid'}, default: 'gird', if 'scatter', will return a scatter mol map without an assignment to a grid
split_channels: bool, if True, outputs will split into various channels using the types of feature
metric: {'cosine', 'correlation'}, default: 'cosine', measurement of feature distance
var_thr: float, defalt is 1e-4, meaning that feature will be included only if the conresponding variance larger than this value. Since some of the feature has pretty low variances, we can remove them by increasing this threshold
File:           ~/git/bidd-molmap/molmap/map.py
Type:           type
Subclasses:     
descriptor = MolMap(ftype='descriptor', metric='cosine',)
fingerprint = MolMap(ftype='fingerprint', metric='cosine')

After setting up the feature objects, we use their .fit method to learn the feature map layout. During this step we specify the algorithm (method) used to embed the higher-dimensional feature relationships into a 2D representation:

descriptor.fit(verbose=0, method='umap', min_dist=0.1, n_neighbors=15,)
2021-07-23 08:42:33,461 - INFO - [bidd-molmap] - Applying grid feature map(assignment), this may take several minutes(1~30 min)
2021-07-23 08:42:36,598 - INFO - [bidd-molmap] - Finished
fingerprint.fit(verbose=0, method='umap', min_dist=0.1, n_neighbors=10,)
/Users/olivier/.local/lib/python3.6/site-packages/umap/umap_.py:1461: UserWarning: Using precomputed metric; transform will be unavailable for new data
  "Using precomputed metric; transform will be unavailable for new data"
2021-07-23 08:43:03,717 - INFO - [bidd-molmap] - Applying grid feature map(assignment), this may take several minutes(1~30 min)
2021-07-23 09:13:47,973 - INFO - [bidd-molmap] - Finished

We can visualise the feature maps easily with MolMap itself, but those plots are omitted here to avoid crashing the notebook; instead, a quick matplotlib look at the transformed arrays is shown below.

Regression using the descriptor map

X = descriptor.batch_transform(data.x)
100%|##########| 1128/1128 [05:51<00:00,  3.57it/s]
X.shape
(1128, 37, 37, 13)
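
Although the built-in MolMap visualisations were skipped above, we can still peek at the transformed features directly. The sketch below simply plots the first few descriptor channels of the first molecule with matplotlib:

# Show the first four descriptor channels of the first molecule
fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for c, ax in enumerate(axes):
    ax.imshow(X[0, :, :, c], cmap='viridis')
    ax.set_title(f'channel {c}')
    ax.axis('off')
plt.show()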

In PyTorch, the training data for computer vision problems has the shape (n_channels, height, width), while the features extracted by MolMap have the shape (height, width, n_channels), so we first move the channel dimension in front of the spatial dimensions.

torch.movedim(torch.from_numpy(X), -1, 1).shape
torch.Size([1128, 13, 37, 37])
Y = data.y
Y.shape
(1128, 1)

Now from these feature maps we can create the dataset suitable for training models in PyTorch

esol = SingleFeatureData(data.y, X)
train, val, test = random_split(esol, [904,112,112], generator=torch.Generator().manual_seed(7))
len(train), len(val), len(test)
(904, 112, 112)
train_loader = DataLoader(train, batch_size=8, shuffle=True)
val_loader = DataLoader(val, batch_size=8, shuffle=True)
test_loader = DataLoader(test, batch_size=8, shuffle=True)
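
For reference, a dataset wrapper like SingleFeatureData only needs to convert the arrays to tensors and move the channel axis, as discussed above. The class below is a minimal illustrative sketch; the actual molmapnets implementation may differ.

class FeatureMapDataset(Dataset):
    """Illustrative sketch: (N, H, W, C) features -> (C, H, W) tensors per sample."""
    def __init__(self, y, x):
        self.x = torch.movedim(torch.from_numpy(x), -1, 1)  # channels first
        self.y = torch.from_numpy(np.asarray(y))
    def __len__(self):
        return len(self.y)
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]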

And we can get one batch of data by turning the data loader into an iterator

x, t = next(iter(train_loader))
t
tensor([[-6.0250],
        [-1.1550],
        [-7.9200],
        [-0.6000],
        [-1.4800],
        [-2.1700],
        [-0.4900],
        [ 0.3200]])
x.shape
torch.Size([8, 13, 37, 37])

Finally, with the data prepared, we can train the models. These runs are just tests to show that the models work as expected; the hyperparameters can certainly be tuned further for better results.

model = MolMapRegression()

epochs = 5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()
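
As one example of the tuning mentioned above, a learning-rate schedule could be added at this point. The line below is an optional sketch and is not used in the runs that follow:

# Optional: halve the learning rate every two epochs (call scheduler.step() once per epoch)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)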

And the training loop

for epoch in range(epochs):

    running_loss = 0.0
    for i, (xb, yb) in enumerate(train_loader):

        xb, yb = xb.to(device), yb.to(device)

        # zero gradients
        optimizer.zero_grad()

        # forward propagation
        pred = model(xb)

        # loss calculation
        loss = criterion(pred, yb)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if (i+1) % 50 == 0:    
            print('[Epoch: %d, Iter: %5d] Training loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / (i+1)))

print('Training finished')
/Users/olivier/opt/anaconda3/envs/molmap/lib/python3.6/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ../c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
[Epoch: 1, Iter:    50] Training loss: 4.805
[Epoch: 1, Iter:   100] Training loss: 4.312
[Epoch: 2, Iter:    50] Training loss: 3.147
[Epoch: 2, Iter:   100] Training loss: 2.458
[Epoch: 3, Iter:    50] Training loss: 1.508
[Epoch: 3, Iter:   100] Training loss: 1.397
[Epoch: 4, Iter:    50] Training loss: 1.194
[Epoch: 4, Iter:   100] Training loss: 1.141
[Epoch: 5, Iter:    50] Training loss: 1.016
[Epoch: 5, Iter:   100] Training loss: 1.153
Training finished

Loss on validation data set

running_loss = 0.0
with torch.no_grad():
    for i, (xb, yb) in enumerate(val_loader):

        xb, yb = xb.to(device), yb.to(device)

        # forward propagation
        pred = model(xb)

        # loss calculation
        loss = criterion(pred, yb)
        running_loss += loss.item()
        if (i+1) % 3 == 0:    
            print('[Iter: %5d] Validation loss: %.3f' %
                    (i + 1, running_loss / (i+1)))
[Iter:     3] Validation loss: 1.125
[Iter:     6] Validation loss: 1.235
[Iter:     9] Validation loss: 1.122
[Iter:    12] Validation loss: 1.409
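
The held-out test split created earlier has not been used yet; a single average loss over it can be computed with a short loop like this sketch, reusing the objects defined above:

model.eval()
test_loss = 0.0
with torch.no_grad():
    for xb, yb in test_loader:
        xb, yb = xb.to(device), yb.to(device)
        test_loss += criterion(model(xb), yb).item() * len(yb)
print('Average test MSE: %.3f' % (test_loss / len(test)))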

Regression using the fingerprint map

X_fingerprint = fingerprint.batch_transform(data.x)
100%|##########| 1128/1128 [03:16<00:00,  5.58it/s]
X_fingerprint.shape
(1128, 126, 126, 12)

Now from these feature maps we can create the dataset suitable for training models in PyTorch

esol_fingerprint = SingleFeatureData(data.y, X_fingerprint)
train_fingerprint, val_fingerprint, test_fingerprint = random_split(esol_fingerprint, [904,112,112], generator=torch.Generator().manual_seed(7))
len(train_fingerprint), len(val_fingerprint), len(test_fingerprint)
(904, 112, 112)
train_loader_fingerprint = DataLoader(train_fingerprint, batch_size=8, shuffle=True)
val_loader_fingerprint = DataLoader(val_fingerprint, batch_size=8, shuffle=True)
test_loader_fingerprint = DataLoader(test_fingerprint, batch_size=8, shuffle=True)

And we can get one batch of data by turning the data loader into an iterator

x, t = next(iter(train_loader_fingerprint))
t.shape
torch.Size([8, 1])
x.shape
torch.Size([8, 12, 126, 126])

Now we set up the regression model. Different feature maps have different numbers of channels, so the number of input channels is passed explicitly.

model_fingerprint = MolMapRegression(conv_in1=12)

epochs = 5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_fingerprint.to(device)
optimizer = optim.Adam(model_fingerprint.parameters(), lr=0.001)
criterion = nn.MSELoss()

And the training loop

for epoch in range(epochs):

    running_loss = 0.0
    for i, (xb, yb) in enumerate(train_loader_fingerprint):

        xb, yb = xb.to(device), yb.to(device)

        # zero gradients
        optimizer.zero_grad()

        # forward propagation
        pred = model_fingerprint(xb)

        # loss calculation
        loss = criterion(pred, yb)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if (i+1) % 50 == 0:    
            print('[Epoch: %d, Iter: %5d] Training loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / (i+1)))

print('Training finished')
[Epoch: 1, Iter:    50] Training loss: 4.696
[Epoch: 1, Iter:   100] Training loss: 3.811
[Epoch: 2, Iter:    50] Training loss: 1.911
[Epoch: 2, Iter:   100] Training loss: 1.795
[Epoch: 3, Iter:    50] Training loss: 1.104
[Epoch: 3, Iter:   100] Training loss: 1.065
[Epoch: 4, Iter:    50] Training loss: 0.911
[Epoch: 4, Iter:   100] Training loss: 0.917
[Epoch: 5, Iter:    50] Training loss: 0.583
[Epoch: 5, Iter:   100] Training loss: 0.585
Training finished

Loss on validation data set

running_loss = 0.0
with torch.no_grad():
    for i, (xb, yb) in enumerate(val_loader_fingerprint):

        xb, yb = xb.to(device), yb.to(device)

        # forward propagation
        pred = model_fingerprint(xb)

        # loss calculation
        loss = criterion(pred, yb)
        running_loss += loss.item()
        if (i+1) % 3 == 0:    
            print('[Iter: %5d] Validation loss: %.3f' %
                    (i + 1, running_loss / (i+1)))
[Iter:     3] Validation loss: 0.935
[Iter:     6] Validation loss: 0.854
[Iter:     9] Validation loss: 1.148
[Iter:    12] Validation loss: 1.107

Regression using both feature maps

If we want to use both feature maps, we have to prepare the training data differently.

We can feed both feature maps to DoubleFeatureData as a tuple:

double_feature = DoubleFeatureData(data.y, (X, X_fingerprint))
train_double, val_double, test_double = random_split(double_feature, [904,112,112], generator=torch.Generator().manual_seed(7))
len(train_double), len(val_double), len(test_double)
(904, 112, 112)
train_loader_double = DataLoader(train_double, batch_size=8, shuffle=True)
val_loader_double = DataLoader(val_double, batch_size=8, shuffle=True)
test_loader_double = DataLoader(test_double, batch_size=8, shuffle=True)

And we can get one batch of data by turning the data loader into an iterator

x, t = next(iter(train_loader_double))
t.shape
torch.Size([8, 1])
x1, x2 = x
x1.shape, x2.shape
(torch.Size([8, 13, 37, 37]), torch.Size([8, 12, 126, 126]))
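
Each sample now yields a pair of feature maps plus one target, which is why the training loop below unpacks (x1, x2). A rough, illustrative sketch of such a paired dataset (not the actual DoubleFeatureData implementation) could look like this:

class PairedFeatureDataset(Dataset):
    """Illustrative sketch: returns ((descriptor map, fingerprint map), target)."""
    def __init__(self, y, x_pair):
        x1, x2 = x_pair
        self.x1 = torch.movedim(torch.from_numpy(x1), -1, 1)  # channels first
        self.x2 = torch.movedim(torch.from_numpy(x2), -1, 1)
        self.y = torch.from_numpy(np.asarray(y))
    def __len__(self):
        return len(self.y)
    def __getitem__(self, idx):
        return (self.x1[idx], self.x2[idx]), self.y[idx]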

Now we set up the regression model with two inputs; each feature map has its own number of channels, so both are passed explicitly.

model_double = MolMapRegression(conv_in1=13, conv_in2=12)

epochs = 5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_double.to(device)
optimizer = optim.Adam(model_double.parameters(), lr=0.001)
criterion = nn.MSELoss()

And the training loop

for epoch in range(epochs):

    running_loss = 0.0
    for i, ((x1, x2), yb) in enumerate(train_loader_double):

        x1, x2, yb = x1.to(device), x2.to(device), yb.to(device)

        # zero gradients
        optimizer.zero_grad()

        # forward propagation
        pred = model_double((x1, x2))

        # loss calculation
        loss = criterion(pred, yb)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if (i+1) % 50 == 0:    
            print('[Epoch: %d, Iter: %5d] Training loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / (i+1)))

print('Training finished')
[Epoch: 1, Iter:    50] Training loss: 4.954
[Epoch: 1, Iter:   100] Training loss: 4.000
[Epoch: 2, Iter:    50] Training loss: 3.246
[Epoch: 2, Iter:   100] Training loss: 2.585
[Epoch: 3, Iter:    50] Training loss: 1.431
[Epoch: 3, Iter:   100] Training loss: 1.264
[Epoch: 4, Iter:    50] Training loss: 0.928
[Epoch: 4, Iter:   100] Training loss: 0.963
[Epoch: 5, Iter:    50] Training loss: 0.629
[Epoch: 5, Iter:   100] Training loss: 0.606
Training finished

Loss on validation data set

running_loss = 0.0
with torch.no_grad():
    for i, ((x1, x2), yb) in enumerate(val_loader_double):

        x1, x2, yb = x1.to(device), x2.to(device), yb.to(device)

        # forward propagation
        pred = model_double((x1, x2))

        # loss calculation
        loss = criterion(pred, yb)
        running_loss += loss.item()
        if (i+1) % 3 == 0:    
            print('[Iter: %5d] Validation loss: %.3f' %
                    (i + 1, running_loss / (i+1)))

print('Validation finished')
[Iter:     3] Validation loss: 0.876
[Iter:     6] Validation loss: 0.909
[Iter:     9] Validation loss: 1.017
[Iter:    12] Validation loss: 0.975
Validation finished