Prepare MolMap-generated feature maps for training.
We have three tasks to accomplish:
- The feature maps extracted from MolMap are NumPy arrays, while we need Torch tensors;
- In PyTorch, training data for computer vision problems takes the shape (n_channels, height, width), while the features extracted from MolMap take the shape (height, width, n_channels), so we have to reorder the axes;
- For model training, Torch expects data stored in a Dataset object, so we also need to create these objects.
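The three steps above can be sketched in a single Dataset subclass. This is a minimal illustration, not the library's implementation: the class name FeatureMapDataset and the cast to float32 are assumptions made for the example.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class FeatureMapDataset(Dataset):
    """Hypothetical dataset wrapping channel-last NumPy feature maps."""

    def __init__(self, y, X):
        # Task 1: NumPy arrays -> Torch tensors. The float32 cast is an
        # assumption, chosen because it is PyTorch's default parameter dtype.
        X = torch.from_numpy(np.asarray(X)).float()
        # Task 2: (n_samples, height, width, n_channels)
        #      -> (n_samples, n_channels, height, width)
        self.X = torch.movedim(X, -1, 1)
        self.y = torch.from_numpy(np.asarray(y)).float()

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # Task 3: the Dataset protocol returns one (input, target) pair.
        return self.X[idx], self.y[idx]


demo = FeatureMapDataset(np.random.rand(10, 1), np.random.rand(10, 37, 37, 13))
print(demo.X.shape, demo.y.shape, len(demo))
```

Doing the conversion once in the constructor keeps per-item access cheap, at the cost of holding the full tensor in memory.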
The inputs have the shape (n_samples, height, width, n_channels); we reorder them to (n_samples, n_channels, height, width).
import numpy as np
import torch
from torch.utils.data import DataLoader, random_split

X = np.random.rand(100, 37, 37, 13)
X.shape
torch.movedim(torch.from_numpy(X), -1, 1).shape
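For reference, the same reordering can also be written with permute, and torch.from_numpy keeps NumPy's float64 dtype. The sketch below checks both points on a small stand-in array; the final cast to float32 is an assumption about what a downstream model would expect, not something the code above does.

```python
import numpy as np
import torch

X_demo = np.random.rand(4, 37, 37, 13)
tensor = torch.from_numpy(X_demo)         # shares memory with X_demo, dtype float64

moved = torch.movedim(tensor, -1, 1)      # move the channel axis to position 1
permuted = tensor.permute(0, 3, 1, 2)     # the same result with an explicit axis order

print(moved.shape, torch.equal(moved, permuted), moved.dtype)

# Assumed follow-up step: cast to float32 to match default model weights.
moved32 = moved.float()
print(moved32.dtype)
```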
Different tasks have outcomes of different shapes. For regression, each sample has a scalar target, while for classification each sample has a vector target.
y_reg = np.random.rand(100, 1)
y_reg.shape
y_clf = np.random.rand(100, 8)
y_clf.shape
Regression data
d_reg = SingleFeatureData(y_reg, X)
d_reg.X.shape
d_reg.y.shape
Split data
train, val, test = random_split(d_reg, [50, 30, 20], generator=torch.Generator().manual_seed(7))
len(train), len(val), len(test)
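Passing a seeded generator makes the split reproducible. The following sketch illustrates this with a stand-in TensorDataset (an assumption, used only so the example runs on its own): two splits made with the same seed select the same sample indices.

```python
import torch
from torch.utils.data import TensorDataset, random_split

data = TensorDataset(torch.rand(100, 13, 37, 37), torch.rand(100, 1))


def split(seed):
    # A fresh generator with a fixed seed gives the same permutation each time.
    gen = torch.Generator().manual_seed(seed)
    return random_split(data, [50, 30, 20], generator=gen)


first = split(7)
second = split(7)

# Each part is a Subset; its .indices attribute lists the selected samples.
same = all(list(a.indices) == list(b.indices) for a, b in zip(first, second))
print([len(part) for part in first], same)
```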
train_loader = DataLoader(train, batch_size=8, shuffle=True)
val_loader = DataLoader(val, batch_size=8, shuffle=True)
test_loader = DataLoader(test, batch_size=8, shuffle=True)
We can get one batch of data by creating an iterator from the data loader:
x, t = next(iter(train_loader))
t
x.shape
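To go through a whole split rather than a single batch, loop over the loader directly. The sketch below uses a stand-in TensorDataset with 50 samples (an assumption mirroring the size of the training split); with batch_size=8 this yields seven batches, the last one holding the remaining two samples.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

train_demo = TensorDataset(torch.rand(50, 13, 37, 37), torch.rand(50, 1))
loader = DataLoader(train_demo, batch_size=8, shuffle=True)

batch_sizes = []
for x_batch, t_batch in loader:
    # x_batch: (batch, n_channels, height, width); t_batch: (batch, 1)
    batch_sizes.append(x_batch.shape[0])

print(len(loader), batch_sizes)

# drop_last=True discards the final, smaller batch.
loader_full = DataLoader(train_demo, batch_size=8, shuffle=True, drop_last=True)
print(len(loader_full))
```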
Classification data
d_clf = SingleFeatureData(y_clf, X)
d_clf.X.shape
d_clf.y.shape
Split data
train, val, test = random_split(d_clf, [50, 30, 20], generator=torch.Generator().manual_seed(7))
len(train), len(val), len(test)
train_loader = DataLoader(train, batch_size=8, shuffle=True)
val_loader = DataLoader(val, batch_size=8, shuffle=True)
test_loader = DataLoader(test, batch_size=8, shuffle=True)
We can get one batch of data by creating an iterator from the data loader:
x, t = next(iter(train_loader))
t
x.shape
X1 = np.random.rand(100, 37, 37, 13)
X2 = np.random.rand(100, 37, 37, 3)
X1.shape, X2.shape
d_reg = DoubleFeatureData(y_reg, (X1, X2))
d_reg.X1.shape, d_reg.X2.shape
d_reg.y.shape
Split data
train, val, test = random_split(d_reg, [50, 30, 20], generator=torch.Generator().manual_seed(7))
len(train), len(val), len(test)
train_loader = DataLoader(train, batch_size=8, shuffle=True)
val_loader = DataLoader(val, batch_size=8, shuffle=True)
test_loader = DataLoader(test, batch_size=8, shuffle=True)
We can get one batch of data by creating an iterator from the data loader:
x, t = next(iter(train_loader))
t
x1, x2 = x
x1.shape, x2.shape
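The batch layout above, a pair of inputs followed by the target, can be reproduced with a small two-input dataset. This is only a sketch under assumptions: TwoInputDataset and to_channel_first are hypothetical names, and DoubleFeatureData itself may be implemented differently.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


def to_channel_first(X):
    # (n_samples, height, width, n_channels) -> (n_samples, n_channels, height, width)
    return torch.movedim(torch.from_numpy(X).float(), -1, 1)


class TwoInputDataset(Dataset):
    """Hypothetical dataset holding two feature maps per sample."""

    def __init__(self, y, X):
        X1, X2 = X
        self.X1, self.X2 = to_channel_first(X1), to_channel_first(X2)
        self.y = torch.from_numpy(y).float()

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # Returning a nested tuple lets the default collate function
        # batch each input separately.
        return (self.X1[idx], self.X2[idx]), self.y[idx]


demo = TwoInputDataset(
    np.random.rand(16, 1),
    (np.random.rand(16, 37, 37, 13), np.random.rand(16, 37, 37, 3)),
)
(x1_b, x2_b), t_b = next(iter(DataLoader(demo, batch_size=8)))
print(x1_b.shape, x2_b.shape, t_b.shape)
```

The two inputs may have different channel counts (13 and 3 here) because each is batched on its own.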