Prepare MolMap generated feature maps for training.

There are three tasks to accomplish:

  1. The feature maps extracted from MolMap are NumPy arrays, while we need Torch tensors;
  2. In PyTorch, training data for computer vision problems takes the shape (n_channels, height, width), while the features extracted from MolMap take the shape (height, width, n_channels), so we'll have to correct this;
  3. For model training, Torch expects data stored in a Dataset object, so we'll also need to create these objects.

The inputs are of shape (n_samples, height, width, n_channels); we correct them to (n_samples, n_channels, height, width).

import numpy as np
import torch

X = np.random.rand(100, 37, 37, 13)
X.shape
(100, 37, 37, 13)
torch.movedim(torch.from_numpy(X), -1, 1).shape
torch.Size([100, 13, 37, 37])

For different tasks the outcomes have different shapes: for regression we have a scalar output, while for classification we have a vector.

y_reg = np.random.rand(100, 1)
y_reg.shape
(100, 1)
y_clf = np.random.rand(100, 8)
y_clf.shape
(100, 8)

Single feature

Now we build the Dataset object expected by Torch models, using a single feature map.

class SingleFeatureData[source]

SingleFeatureData(*args, **kwds) :: Dataset

Process a single feature map for model training. y: target; X: feature map
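The class body isn't shown here; below is a minimal sketch of what SingleFeatureData could look like, assuming it performs the NumPy-to-tensor conversion and channel reordering described above (the actual implementation may differ):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class SingleFeatureData(Dataset):
    """Wrap targets y and a feature map X as a torch Dataset.

    Minimal sketch: converts the NumPy arrays to tensors and moves the
    channel axis from last to second, i.e. (n, H, W, C) -> (n, C, H, W).
    """
    def __init__(self, y, X):
        self.X = torch.movedim(torch.from_numpy(X), -1, 1)
        self.y = torch.from_numpy(y)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
```

Each item is an (x, y) pair, which is the convention DataLoader expects when collating batches.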

Regression data

d_reg = SingleFeatureData(y_reg, X)
d_reg.X.shape
torch.Size([100, 13, 37, 37])
d_reg.y.shape
torch.Size([100, 1])

Split data

from torch.utils.data import random_split, DataLoader

train, val, test = random_split(d_reg, [50, 30, 20], generator=torch.Generator().manual_seed(7))

len(train), len(val), len(test)
(50, 30, 20)
train_loader = DataLoader(train, batch_size=8, shuffle=True)
val_loader = DataLoader(val, batch_size=8, shuffle=True)
test_loader = DataLoader(test, batch_size=8, shuffle=True)

And we can get one batch of data by turning the data loader into an iterator:

x, t = next(iter(train_loader))
t
tensor([[0.7104],
        [0.9351],
        [0.0879],
        [0.6092],
        [0.3251],
        [0.7344],
        [0.4595],
        [0.7092]], dtype=torch.float64)
x.shape
torch.Size([8, 13, 37, 37])

Classification data

d_clf = SingleFeatureData(y_clf, X)
d_clf.X.shape
torch.Size([100, 13, 37, 37])
d_clf.y.shape
torch.Size([100, 8])

Split data

train, val, test = random_split(d_clf, [50, 30, 20], generator=torch.Generator().manual_seed(7))

len(train), len(val), len(test)
(50, 30, 20)
train_loader = DataLoader(train, batch_size=8, shuffle=True)
val_loader = DataLoader(val, batch_size=8, shuffle=True)
test_loader = DataLoader(test, batch_size=8, shuffle=True)

And we can get one batch of data by turning the data loader into an iterator:

x, t = next(iter(train_loader))
t
tensor([[0.8310, 0.9343, 0.9739, 0.7343, 0.3363, 0.9877, 0.7220, 0.2365],
        [0.7070, 0.1497, 0.9926, 0.2526, 0.6560, 0.3483, 0.2039, 0.1662],
        [0.7821, 0.8387, 0.5680, 0.8080, 0.2574, 0.7177, 0.1681, 0.9655],
        [0.6966, 0.7496, 0.9704, 0.0409, 0.5455, 0.4679, 0.1694, 0.7986],
        [0.4942, 0.2321, 0.6251, 0.0752, 0.2691, 0.9629, 0.6358, 0.1475],
        [0.0159, 0.9606, 0.3611, 0.4873, 0.6847, 0.2638, 0.8886, 0.5483],
        [0.9255, 0.7321, 0.9346, 0.9178, 0.5032, 0.4853, 0.4863, 0.8786],
        [0.9479, 0.0577, 0.3369, 0.2861, 0.2183, 0.3099, 0.5837, 0.3486]],
       dtype=torch.float64)
x.shape
torch.Size([8, 13, 37, 37])

Double features

And a Dataset object using two feature maps:

class DoubleFeatureData[source]

DoubleFeatureData(*args, **kwds) :: Dataset

Process a pair of feature maps for model training. y: target; X: tuple of two feature maps
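Likewise, a minimal sketch of DoubleFeatureData, assuming it applies the same conversion to each feature map in the tuple (the actual implementation may differ):

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class DoubleFeatureData(Dataset):
    """Wrap targets y and a tuple of two feature maps as a torch Dataset.

    Minimal sketch: converts both NumPy arrays to tensors and moves the
    channel axis from last to second, i.e. (n, H, W, C) -> (n, C, H, W).
    """
    def __init__(self, y, X):
        X1, X2 = X
        self.X1 = torch.movedim(torch.from_numpy(X1), -1, 1)
        self.X2 = torch.movedim(torch.from_numpy(X2), -1, 1)
        self.y = torch.from_numpy(y)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # Return the two feature maps as a tuple, so a batch unpacks
        # as x1, x2 = x after x, t = next(iter(loader))
        return (self.X1[idx], self.X2[idx]), self.y[idx]
```

Returning the features as a tuple lets the default collate function stack each map separately, which is why `x1, x2 = x` works on a batch below.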

X1 = np.random.rand(100, 37, 37, 13)
X2 = np.random.rand(100, 37, 37, 3)
X1.shape, X2.shape
((100, 37, 37, 13), (100, 37, 37, 3))
d_reg = DoubleFeatureData(y_reg, (X1, X2))
d_reg.X1.shape, d_reg.X2.shape
(torch.Size([100, 13, 37, 37]), torch.Size([100, 3, 37, 37]))
d_reg.y.shape
torch.Size([100, 1])

Split data

train, val, test = random_split(d_reg, [50, 30, 20], generator=torch.Generator().manual_seed(7))

len(train), len(val), len(test)
(50, 30, 20)
train_loader = DataLoader(train, batch_size=8, shuffle=True)
val_loader = DataLoader(val, batch_size=8, shuffle=True)
test_loader = DataLoader(test, batch_size=8, shuffle=True)

And we can get one batch of data by turning the data loader into an iterator:

x, t = next(iter(train_loader))
t
tensor([[0.7344],
        [0.4688],
        [0.6977],
        [0.5588],
        [0.3702],
        [0.0779],
        [0.8502],
        [0.7523]], dtype=torch.float64)
x1, x2 = x
x1.shape, x2.shape
(torch.Size([8, 13, 37, 37]), torch.Size([8, 3, 37, 37]))