10  Convolutional networks

The simplest machine learning models assume that the observed data values are unstructured, meaning that the elements of the data vectors x are treated as if we know nothing in advance about how the individual elements might relate to each other. If we were to apply a fixed random permutation to the ordering of these variables, consistently across all training and test data, the performance of the models considered so far would be unchanged.
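To make this concrete, the following sketch (our own illustration, using PyTorch; it is not part of the chapter's example) shows that a fully connected layer can absorb a fixed permutation of its inputs into its weight matrix, so the family of functions the network can represent, and hence its achievable performance, is unchanged:

import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(784, 50)    # first layer of a fully connected network
perm = torch.randperm(784)    # a fixed random permutation of the input features

x = torch.randn(8, 784)       # a batch of 8 unstructured input vectors
x_perm = x[:, perm]           # the same data with its features permuted

# A layer whose weight columns are permuted in the same way
layer_perm = nn.Linear(784, 50)
with torch.no_grad():
    layer_perm.weight.copy_(layer.weight[:, perm])
    layer_perm.bias.copy_(layer.bias)

# The outputs agree: the permutation has been absorbed into the weights
print(torch.allclose(layer(x), layer_perm(x_perm)))  # True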

Many applications of machine learning, however, involve structured data in which there are additional relationships between input variables. For example, the words in natural language form a sequence, and if we were to model language as a generative autoregressive process, then we would expect each word to depend more strongly on the immediately preceding words and less strongly on words much earlier in the sequence. Likewise, the pixels of an image have a well-defined spatial relationship to each other, in which the input variables are arranged in a two-dimensional grid and nearby pixels have highly correlated values.

We have already seen that our knowledge of the structure of specific data modalities can be utilized through the addition of a regularization term to the error function in the training objective, through data augmentation, or through modifications to the model architecture. These approaches can help guide the model to respect certain properties such as invariance (Section 9.1.3) and equivariance (Section 9.1.4) with respect to transformations of the input data. In this chapter we look at an architectural approach called a convolutional neural network (CNN), which, as we will see, can be viewed as a sparsely connected multilayer network with parameter sharing, designed to encode invariances and equivariances specific to image data.
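As a preview of this view, the following sketch (our own illustration, not part of the chapter's running example) writes a one-dimensional convolution as multiplication by a matrix in which each row contains the same kernel values, shifted by one position, with zeros elsewhere. The equivalent fully connected layer is therefore sparse, and its nonzero weights are shared across rows:

import torch
import torch.nn.functional as F

kernel = torch.tensor([1.0, 2.0, -1.0])   # a single 1-D kernel of width 3
x = torch.randn(7)                        # an input signal of length 7

# Build the equivalent sparse, weight-shared matrix of shape (5, 7)
out_len = x.numel() - kernel.numel() + 1
W = torch.zeros(out_len, x.numel())
for i in range(out_len):
    W[i, i:i + 3] = kernel                # the same three weights on every row, shifted

# The matrix product reproduces the convolution exactly
y_matrix = W @ x
y_conv = F.conv1d(x.view(1, 1, -1), kernel.view(1, 1, -1)).flatten()
print(torch.allclose(y_matrix, y_conv))   # True

The code below defines a small convolutional network for 28×28 grayscale images, such as MNIST digits, in two equivalent ways: first as a subclass of nn.Module, and then as an nn.Sequential pipeline that flattens the feature maps with an einops Rearrange layer.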

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops.layers.torch import Rearrange

# Define the network as a subclass of nn.Module (the 'old' style)
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)   # 1x28x28 -> 10x24x24
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)  # 10x12x12 -> 20x8x8
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)  # 20 * 4 * 4 = 320 flattened features
        self.fc2 = nn.Linear(50, 10)   # 10 output classes

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))                   # -> 10x12x12
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))  # -> 20x4x4
        x = x.view(-1, 320)            # flatten to (batch, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

# Define the same architecture using nn.Sequential (the 'new' style)
conv_net_new = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5),
    nn.MaxPool2d(kernel_size=2),
    nn.ReLU(),
    nn.Conv2d(10, 20, kernel_size=5),
    nn.MaxPool2d(kernel_size=2),
    nn.ReLU(),
    nn.Dropout2d(),
    Rearrange('b c h w -> b (c h w)'),  # flatten 20x4x4 feature maps to 320
    nn.Linear(320, 50),
    nn.ReLU(),
    nn.Dropout(),
    nn.Linear(50, 10),
    nn.LogSoftmax(dim=1)
)

# Create a random tensor representing a batch of one 1x28x28 grayscale image
x = torch.randn(1, 1, 28, 28)

# Pass the tensor through the old network
conv_net_old = Net()
y_old = conv_net_old(x)
print("Output from the old network:", y_old)
print("Output shape from the old network:", y_old.shape)

# Pass the tensor through the new network
y_new = conv_net_new(x)
print("Output from the new network:", y_new)
print("Output shape from the new network:", y_new.shape)

10.1 Computer vision

10.1.1 Image data

10.1.2 Convolutional filters

10.1.3 Feature detectors

10.1.4 Translation equivariance

10.1.5 Padding

10.1.6 Strided convolutions

10.1.7 Multi-dimensional convolutions

10.1.8 Pooling

10.1.9 Multilayer convolutions

10.1.10 Example network architectures

LeNet, ImageNet, AlexNet, VGG16

10.2 Visualizing trained CNNs

10.2.1 Visual cortex

10.2.2 Visualizing trained filters

10.2.3 Saliency maps

10.2.4 Adversarial attacks

10.2.5 Synthetic images

10.3 Object detection

10.4 Image segmentation

10.5 Style transfer