Using molmapnets for classification, with descriptors, fingerprints, or both. Tested on the BACE dataset.

Per the dataset's own documentation:

Beta-Secretase 1 (BACE) is a transmembrane aspartic-acid protease human protein encoded by the BACE1 gene. BACE is essential for the generation of beta-amyloid peptide in neural tissue [1], a component of amyloid plaques widely believed to be critical in the development of Alzheimer's, rendering BACE an attractive therapeutic target for this devastating disease [2].

The BACE dataset provided herein comprises small molecule inhibitors across a three order of magnitude (nM to μM) range of IC50s along with previously undisclosed crystallographic structures. Specifically, we provide 154 BACE inhibitors for affinity, 20 for pose, and 34 for free energy prediction. This dataset was kindly provided by Novartis.

%config Completer.use_jedi = False
import numpy as np
import pandas as pd
import torch
from torch import nn, optim
torch.set_default_dtype(torch.float64)

from torch.utils.data import Dataset, DataLoader, random_split
from chembench import dataset
from molmap import MolMap, feature
from molmapnets.data import SingleFeatureData, DoubleFeatureData
from molmapnets.models import MolMapMultiClassClassification

Data loading and feature extraction

data = dataset.load_BACE()
total samples: 1513

Take a look at the data

data.df.head()
smiles Class
0 O1CC[C@@H](NC(=O)[C@@H](Cc2cc3cc(ccc3nc2N)-c2c... 1
1 Fc1cc(cc(F)c1)C[C@H](NC(=O)[C@@H](N1CC[C@](NC(... 1
2 S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](... 1
3 S1(=O)(=O)C[C@@H](Cc2cc(O[C@H](COCC)C(F)(F)F)c... 1
4 S1(=O)(=O)N(c2cc(cc3c2n(cc3CC)CC1)C(=O)N[C@H](... 1

This is a two-class classification data set

data.df.Class.nunique()
2
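It is also worth checking how balanced the two classes are. This quick check was not part of the original run and uses only the pandas API:

# relative frequency of each class over all 1513 molecules
data.df['Class'].value_counts(normalize=True)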

Create feature map objects

bitsinfo = feature.fingerprint.Extraction().bitsinfo
flist = bitsinfo[bitsinfo.Subtypes.isin(['PubChemFP'])].IDs.tolist()

flist[:5]
['PubChemFP0', 'PubChemFP1', 'PubChemFP2', 'PubChemFP3', 'PubChemFP4']
descriptor = MolMap(ftype='descriptor', metric='cosine',)
fingerprint = MolMap(ftype='fingerprint', fmap_type='scatter', flist=flist)
descriptor.fit(verbose=0, method='umap', min_dist=0.1, n_neighbors=15,)
2021-07-23 11:52:06,977 - INFO - [bidd-molmap] - Applying grid feature map(assignment), this may take several minutes(1~30 min)
2021-07-23 11:52:10,233 - INFO - [bidd-molmap] - Finished
fingerprint.fit(verbose=0, method='umap', min_dist=0.1, n_neighbors=15,)
2021-07-23 11:52:11,410 - INFO - [bidd-molmap] - Applying naive scatter feature map...
2021-07-23 11:52:11,430 - INFO - [bidd-molmap] - Finished

Feature extraction

X1 = descriptor.batch_transform(data.x)
X2 = fingerprint.batch_transform(data.x)
100%|##########| 1513/1513 [10:31<00:00,  2.66it/s]
100%|##########| 1513/1513 [02:24<00:00, 10.50it/s]
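A quick shape sanity check can help here. It was not printed in the original run, and the channels-last layout is an assumption about MolMap's output, inferred from the channels-first batches that appear further down:

X1.shape, X2.shape
# likely (1513, 37, 37, 13) and (1513, 52, 52, 1) respectively (channels-last);
# the PyTorch batches produced later are channels-first, e.g. torch.Size([8, 13, 37, 37])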

We also need to transform the outcome variable

Y = pd.get_dummies(data.df['Class']).values
Y.shape
(1513, 2)
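Note that nn.CrossEntropyLoss expects integer class indices rather than one-hot rows, which is why the training loop below converts yb with torch.max(yb, 1)[1]. An equivalent way to get the integer labels directly (not used in the original notebook) would be:

y_int = Y.argmax(axis=1)        # back from one-hot rows to 0/1 class indices
y_int[:5], np.bincount(y_int)   # first few labels and the counts per class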

We can visualise the feature maps easily with MolMap, but the visualisations are omitted here to keep the notebook small.
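For reference, a minimal way to eyeball one transformed molecule without MolMap's built-in plots; this is only a sketch, assuming the channels-last layout mentioned above, and uses plain matplotlib:

import matplotlib.pyplot as plt

plt.imshow(X1[0].sum(axis=-1), cmap='viridis')   # collapse the descriptor channels of one sample
plt.title('Descriptor feature map, first molecule')
plt.colorbar()
plt.show()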

Classification using only the descriptor map

single_feature = SingleFeatureData(Y, X1)
train, val, test = random_split(single_feature, [1113, 200, 200], generator=torch.Generator().manual_seed(7))
len(train), len(val), len(test)
(1113, 200, 200)
train_loader = DataLoader(train, batch_size=8, shuffle=True)
val_loader = DataLoader(val, batch_size=8, shuffle=True)
test_loader = DataLoader(test, batch_size=8, shuffle=True)

And we can get one batch of data by turning the data loader into an iterator

x, t = next(iter(train_loader))
t.shape
torch.Size([8, 2])
torch.max(t, 1)[1]
tensor([0, 0, 1, 0, 0, 0, 0, 0])
x.shape
torch.Size([8, 13, 37, 37])

Finally, with the data prepared, we can train the models. These runs are only meant to show that the models work as expected; the training loop can certainly be fine-tuned to achieve better results.

model = MolMapMultiClassClassification(n_class=2)

epochs = 5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
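Before training it can help to sanity-check the model size and the device it ended up on. This check was not part of the original run:

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'Trainable parameters: {n_params:,}, device: {next(model.parameters()).device}')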

Predicted probabilities. The model's forward pass returns log-probabilities (the rows of .exp() below sum to one), so exponentiating gives the class probabilities

model(x).exp()
tensor([[0.5299, 0.4701],
        [0.5300, 0.4700],
        [0.5299, 0.4701],
        [0.5299, 0.4701],
        [0.5300, 0.4700],
        [0.5299, 0.4701],
        [0.5300, 0.4700],
        [0.5299, 0.4701]], grad_fn=<ExpBackward>)

And we can get the predicted classes by taking the argmax

torch.max(model(x).exp(), 1)[1]
tensor([0, 0, 0, 0, 0, 0, 0, 0])
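Since exp is monotonic, taking the argmax of the raw log-probabilities gives the same classes; the .exp() is only needed when we want actual probabilities. A quick check (not in the original notebook):

torch.equal(torch.max(model(x), 1)[1], torch.max(model(x).exp(), 1)[1])   # True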

And the training loop

for epoch in range(epochs):

    running_loss = 0.0
    for i, (xb, yb) in enumerate(train_loader):
        
        # fix output shape for loss function        
        yb = torch.max(yb, 1)[1]
        
        xb, yb = xb.to(device), yb.to(device)

        # zero gradients
        optimizer.zero_grad()

        # forward propagation
        pred = model(xb)

        # loss calculation
        loss = criterion(pred, yb)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if (i+1) % 50 == 0:    
            print('[Epoch: %d, Iter: %5d] Training loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / (i+1)))

print('Training finished')
[Epoch: 1, Iter:    50] Training loss: 0.691
[Epoch: 1, Iter:   100] Training loss: 0.690
[Epoch: 2, Iter:    50] Training loss: 0.694
[Epoch: 2, Iter:   100] Training loss: 0.693
[Epoch: 3, Iter:    50] Training loss: 0.624
[Epoch: 3, Iter:   100] Training loss: 0.631
[Epoch: 4, Iter:    50] Training loss: 0.600
[Epoch: 4, Iter:   100] Training loss: 0.605
[Epoch: 5, Iter:    50] Training loss: 0.605
[Epoch: 5, Iter:   100] Training loss: 0.586
Training finished

And let's look at the prediction accuracy on the validation data set

correct = 0
total = 0

with torch.no_grad():
    for i, (xb, yb) in enumerate(val_loader):

        yb = torch.max(yb, 1)[1]
        xb, yb = xb.to(device), yb.to(device)

        pred = model(xb)
        pred = torch.max(pred, 1)[1]

        # accuracy calculation
        total += yb.size(0)
        correct += (pred==yb).sum().item()

        
print('Accuracy of the network on the validation data: %d %%' % (
    100 * correct / total))
Accuracy of the network on the validation data: 72 %

Hmm, not very good, but again we only trained for 5 epochs!
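As a side note, the accuracy loop above (and the one repeated for the double-feature model below) could be factored into a small helper. A sketch of one possible version, not part of the original notebook:

def evaluate(net, loader):
    """Classification accuracy of net over loader; handles single- and double-feature batches."""
    correct, total = 0, 0
    with torch.no_grad():
        for xb, yb in loader:
            yb = torch.max(yb, 1)[1].to(device)          # one-hot rows -> class indices
            if isinstance(xb, (list, tuple)):            # double-feature batch: (descriptor, fingerprint)
                xb = tuple(part.to(device) for part in xb)
            else:                                        # single-feature batch
                xb = xb.to(device)
            pred = torch.max(net(xb), 1)[1]
            total += yb.size(0)
            correct += (pred == yb).sum().item()
    return correct / total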

Classification using both feature maps

Now we can feed both feature maps to the model as a tuple

double_feature = DoubleFeatureData(Y, (X1, X2))
train_double, val_double, test_double = random_split(double_feature, [1113, 200, 200], generator=torch.Generator().manual_seed(7))
len(train_double), len(val_double), len(test_double)
(1113, 200, 200)
train_loader_double = DataLoader(train_double, batch_size=8, shuffle=True)
val_loader_double = DataLoader(val_double, batch_size=8, shuffle=True)
test_loader_double = DataLoader(test_double, batch_size=8, shuffle=True)

And we can get one batch of data by turning the data loader into an iterator

x, t = next(iter(train_loader_double))
t.shape
torch.Size([8, 2])
x1, x2 = x
x1.shape, x2.shape
(torch.Size([8, 13, 37, 37]), torch.Size([8, 1, 52, 52]))

And now the classification model. The two feature maps have different numbers of channels, so these are passed explicitly.

model_double = MolMapMultiClassClassification(conv_in1=13, conv_in2=1)

epochs = 5
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_double.to(device)
optimizer = optim.Adam(model_double.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

And the training loop

for epoch in range(epochs):

    running_loss = 0.0
    for i, ((x1, x2), yb) in enumerate(train_loader_double):

        # fix output shape for loss function        
        yb = torch.max(yb, 1)[1]

        x1, x2, yb = x1.to(device), x2.to(device), yb.to(device)

        # zero gradients
        optimizer.zero_grad()

        # forward propagation
        pred = model_double((x1, x2))

        # loss calculation
        loss = criterion(pred, yb)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if (i+1) % 50 == 0:    
            print('[Epoch: %d, Iter: %5d] Training loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / (i+1)))

print('Training finished')
[Epoch: 1, Iter:    50] Training loss: 0.988
[Epoch: 1, Iter:   100] Training loss: 0.858
[Epoch: 2, Iter:    50] Training loss: 0.709
[Epoch: 2, Iter:   100] Training loss: 0.705
[Epoch: 3, Iter:    50] Training loss: 0.705
[Epoch: 3, Iter:   100] Training loss: 0.710
[Epoch: 4, Iter:    50] Training loss: 0.701
[Epoch: 4, Iter:   100] Training loss: 0.699
[Epoch: 5, Iter:    50] Training loss: 0.685
[Epoch: 5, Iter:   100] Training loss: 0.687
Training finished

Accuracy on the validation data set

correct = 0
total = 0

with torch.no_grad():
    for i, ((x1, x2), yb) in enumerate(val_loader_double):

        yb = torch.max(yb, 1)[1]
        x1, x2, yb = x1.to(device), x2.to(device), yb.to(device)

        pred = model_double((x1, x2))
        pred = torch.max(pred, 1)[1]

        # accuracy calculation
        total += yb.size(0)
        correct += (pred==yb).sum().item()

        
print('Accuracy of the network on the validation data: %d %%' % (
    100 * correct / total))
Accuracy of the network on the validation data: 50 %

Only chance-level accuracy this time, but again, five epochs is hardly a fair trial; the point here is that both the single- and double-feature models train end to end.
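One natural follow-up: neither of the held-out test loaders was actually used above, so a final check could reuse the evaluate helper sketched earlier (again, just a sketch, and results would depend on longer training):

print('Test accuracy, descriptor-only model: %.3f' % evaluate(model, test_loader))
print('Test accuracy, double-feature model:  %.3f' % evaluate(model_double, test_loader_double))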