Blog Needs a Name


Netpicking Part 1: Hello MNIST

The Hello World! introduction to neural network libraries has the user write a small network for the MNIST dataset, train it, test it, get 90% accuracy or more, and thereby get a feel for how the library works. When I started using PyTorch, I followed such a tutorial on their website. But I wondered why a network with Conv2d and ReLU was picked in the tutorial. Why not a different convolution or a Linear layer with a Sigmoid activation?
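
For reference, here is a sketch of the kind of network I mean. This is not the exact tutorial code, just the general shape such tutorials use: a Conv2d layer, a ReLU, and a Linear classifier at the end.

```python
import torch
import torch.nn as nn

# A sketch of the kind of network a typical MNIST tutorial uses
# (not the exact PyTorch tutorial code): Conv2d + ReLU feature
# extraction followed by a Linear classifier.
tutorial_style = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1x28x28 -> 8x28x28
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 8x28x28 -> 8x14x14
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                  # 10 class scores
)

scores = tutorial_style(torch.randn(1, 1, 28, 28))  # shape: (1, 10)
```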

When someone designs a neural network, why do they pick a particular architecture? Obviously, some conventions have set in, but the primary reason is performance: Network A got a higher accuracy (or AUC) than Network B, so use A. Is there more information I can use when making this decision? Let’s look at a bunch of networks and find out.

Measurements and Baselines

Comparing a bunch of networks seems similar to a programming contest:

A regular program's          is somewhat like a neural net's
Accuracy on test cases       Accuracy on test set
Compilation Time             Training Time
Binary size                  Number of parameters
Memory usage                 Memory required to process a single input
Algorithm Complexity         Number of ops in the computation

I designed Basic, a simple[1] model with bias as the baseline for the metrics. I trained it for 4 epochs, with the resulting performance:

network    weights    memory usage    training time    number of ops    accuracy
Basic      7850       1588            52.14s           4                0.912
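
Based on the footnote, Basic is little more than a single Linear layer and a Softmax. A minimal PyTorch sketch (training loop omitted):

```python
import torch.nn as nn

# My reading of Basic from the footnote: a single Linear layer
# (784 -> 10, with bias) followed by Softmax.
basic = nn.Sequential(
    nn.Flatten(),            # 1x28x28 -> 784
    nn.Linear(784, 10),      # 784*10 weights + 10 biases = 7850
    nn.Softmax(dim=1),       # 10 class probabilities
)

print(sum(p.numel() for p in basic.parameters()))  # 7850
```

The 784 × 10 weights plus 10 biases account for the 7850 parameters in the table.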

The next step is to design a bunch of “similar” networks and obtain their performance metrics.

Getting a bunch of networks

I quickly got bored writing different-yet-similar neural nets manually. Yak shaving to the rescue! I ended up generating 1000 neural networks following a sequential template. The networks are classified along two axes: computation layer and activation layer. The networks are distributed as shown in the table below:

↓ Computation / Activation →    None    ReLU    SELU    Sigmoid    Tanh
Linear                          55      23      24      25         23
Conv1d                          59      23      24      21         23
Conv2d                          61      23      23      20         23
Conv3d                          57      23      23      22         25
Conv1dThenLinear                34      17      17      17         15
Conv2dThenLinear                38      14      16      16         16
Conv3dThenLinear                33      18      16      18         15
ResNetStyle[2]                  0       100     0       0          0

For example, there are 23 networks where the computation layer is Conv2d and the activation layer is ReLU.
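
I am not reproducing the actual generator here, but a hypothetical sketch of what “following a sequential template” could look like (the widths, depth range, and helper name below are made up for illustration):

```python
import random
import torch.nn as nn

# Hypothetical sketch of a sequential-template generator, not the
# code used for the 1000 networks: pick an activation, then stack a
# random number of computation blocks.
ACTIVATIONS = {"None": None, "ReLU": nn.ReLU, "SELU": nn.SELU,
               "Sigmoid": nn.Sigmoid, "Tanh": nn.Tanh}

def make_linear_network(activation="ReLU", depth=None, width=64):
    depth = depth or random.randint(1, 4)
    layers = [nn.Flatten()]
    in_features = 28 * 28
    for _ in range(depth):
        layers.append(nn.Linear(in_features, width))
        if ACTIVATIONS[activation] is not None:
            layers.append(ACTIVATIONS[activation]())
        in_features = width
    layers.append(nn.Linear(in_features, 10))
    return nn.Sequential(*layers)

net = make_linear_network("SELU")
```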

So, 1001 networks, each trained for 4 epochs. The MNIST test set contains 10000 samples, and each prediction is a vector of 10 class scores. That gives a raw dataset of shape (1001, 4, 10000, 10), and a summary dataset of shape (1001, 4, 20). Time to crunch the numbers!
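
As a sketch of how the raw scores reduce back down to per-attempt accuracies (the array names are made up, and the stand-in data would be replaced by the real scores and labels):

```python
import numpy as np

# Sketch: raw_scores holds each network's per-sample class scores for
# every attempt; labels holds the ground truth for the test set.
raw_scores = np.random.rand(1001, 4, 10000, 10)   # stand-in data
labels = np.random.randint(0, 10, size=10000)     # stand-in labels

predictions = raw_scores.argmax(axis=-1)            # (1001, 4, 10000)
accuracy = (predictions == labels).mean(axis=-1)    # (1001, 4)
```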

Picking the “best” network

Let’s use the below analogy:

There are 1000 students writing the MNIST exam. The exam has 10000 questions, multiple choice. The students use approved study material, which contains 60000 practice questions. Each student has taken the test 4 times. I have also written this exam, and I have a Basic idea of what a good score is. I want to hire one or more students who perform well on this exam.

The following are ten of the many queries that can be posed:

  1. How did the students perform over 4 attempts?

    Attempt #    ≥ 80%    ≥ 90%    ≥ 95%    ≥ 99%    ≥ 100%
    1            983      792      352      0        0
    2            990      879      461      0        0
    3            990      907      509      0        0
    4            993      924      527      5        0

    So the exam was easy, but very few got close to a perfect score. Let’s just consider the “good” students: those that scored above 80% on all 4 attempts. Of these, the “really good” students are those that scored above 98% on their last attempt; they get a gold star.
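
    In code, this filter is a couple of boolean reductions over the summary accuracies (a sketch, assuming an accuracy array with one row per network and one column per attempt):

    ```python
    import numpy as np

    # Sketch: accuracy has shape (num_networks, 4), one row per network,
    # one column per attempt. Thresholds follow the text above.
    accuracy = np.random.rand(1001, 4)           # stand-in data

    good = (accuracy > 0.80).all(axis=1)         # above 80% on all 4 attempts
    gold_star = good & (accuracy[:, -1] > 0.98)  # and above 98% on the last one
    print(good.sum(), gold_star.sum())
    ```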

  2. I know the students studied together and developed common strategies. Which strategy led to more students scoring high marks?

    [Figure “score1”: accuracy distributions grouped by computation layer]

    Okay, ResNetStyle is first[3] (deeper networks are better, skip connections are magic, blah blah), but what about everyone else? Unsurprisingly, Conv2d networks are second-best, but Conv3d networks seem to do an equally good job (lower maximum, but higher median and smaller spread). Adding Linear layers after convolution layers does not seem to be beneficial; perhaps the networks didn’t have enough training epochs.

    [Figure “score2”: accuracy distributions grouped by activation]

    • Argh! The move to consider networks without any activation was useless. Networks without activations are just linear functions; composing each such network’s layers collapses into a single matrix that is effectively equal to the one used in Basic.

    • As expected, networks that use the ReLU activation have a higher accuracy on average than any of the others.

    • Networks that use SELU activation are not as good as those with ReLU, but are more consistent.

    • Sigmoid and Tanh activations are both risky choices.
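
    The plots above boil down to a groupby over the summary data. A sketch with hypothetical column names and a few made-up rows:

    ```python
    import pandas as pd

    # Sketch: one row per network, with its computation layer,
    # activation, and final-attempt accuracy (values are illustrative).
    df = pd.DataFrame({
        "computation": ["Conv2d", "Conv2d", "Linear", "ResNetStyle"],
        "activation":  ["ReLU",   "Tanh",   "SELU",   "ReLU"],
        "accuracy":    [0.975,    0.941,    0.952,    0.991],
    })

    print(df.groupby("computation")["accuracy"].describe())
    print(df.groupby("activation")["accuracy"].agg(["median", "max", "std"]))
    ```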

  3. With the Basic strategy, I spent only 52 seconds studying for the test. How about the others?

    [Figure “time”: training time distributions]

    • The Conv1d, Linear, and Conv1dThenLinear networks take similar amounts of time to train. Does this mean that the reshape operation is slow? The other networks use 2D-convolutions or higher.

    • The gold stars are all across the board for ResNetStyle networks, and generally on the higher end for the others. However, the gold star in Conv3dThenLinear takes the least amount of training time in its class; are Conv3d networks slower to train?

  4. With the Basic strategy, I had only 7850 keywords as part of my notes. How about the others?

    [Figure “params”: parameter count distributions]

    Again, the gold stars are on the higher end of the distributions, though this could simply mean the gold-star networks are deeper or wider.
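
    Counting the “keywords” is straightforward in PyTorch:

    ```python
    import torch.nn as nn

    def count_parameters(model: nn.Module) -> int:
        # Number of trainable parameters ("keywords in my notes").
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(count_parameters(nn.Sequential(nn.Flatten(), nn.Linear(784, 10))))  # 7850
    ```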

  5. With the Basic strategy, I used only 1588 pages for rough work. How about the others?

    [Figure “mem”: memory usage distributions]

    This plot is similar to the previous one. The memory required to hold the intermediate tensors is related to the layers that output those tensors; examining the gold stars individually may give more information.
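
    One way to approximate this metric is to record the size of every intermediate tensor with forward hooks. Whether this is exactly how the memory usage column was computed is an assumption; it is just a sketch of the idea:

    ```python
    import torch
    import torch.nn as nn

    def intermediate_bytes(model: nn.Module, example_input: torch.Tensor) -> int:
        # Sum the sizes of all leaf-module outputs for one forward pass.
        sizes = []

        def hook(module, inputs, output):
            if isinstance(output, torch.Tensor):
                sizes.append(output.nelement() * output.element_size())

        leaves = [m for m in model.modules() if len(list(m.children())) == 0]
        handles = [m.register_forward_hook(hook) for m in leaves]
        with torch.no_grad():
            model(example_input)
        for h in handles:
            h.remove()
        return sum(sizes)

    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10), nn.Softmax(dim=1))
    print(intermediate_bytes(model, torch.randn(1, 1, 28, 28)))
    ```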

  6. With the Basic strategy, I needed only 4 steps to get an answer every time. How about the others?

    [Figure “ops”: distributions of the number of ops]

    Since all the networks are sequential, more operations means a deeper network. Here “deeper networks are better” becomes visible: the gold stars sit at the higher ends, though not all of the deepest networks earned one.

  7. Every student took the test 4 times. How did the scores change over each attempt?

    [Figure “changes”: accuracy changes across the 4 attempts]

    This graph doesn’t tell us much. It does make a case for early stopping: in most cases, the first two epochs are sufficient to pick the best-trained network. There should be a better way to understand this data; were there any networks that were horrible in the first two epochs and then suddenly found a wonderful local optimum?
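
    A sketch of how to look for such late bloomers, using the same hypothetical per-attempt accuracy array as before (the thresholds are arbitrary):

    ```python
    import numpy as np

    # Sketch: accuracy has shape (num_networks, 4). Find networks that
    # were poor after two attempts but much better by the fourth.
    accuracy = np.random.rand(1001, 4)                      # stand-in data

    late_bloomers = (accuracy[:, 1] < 0.80) & (accuracy[:, 3] >= 0.95)
    print(np.flatnonzero(late_bloomers))                    # indices of such networks
    ```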

  8. How many questions were easy, weird, or confusing?

    Out of the 10000 samples in the test set,

    • 4247 samples were easy questions. All the networks predicted these correctly, so it is impossible to distinguish between the networks using any of these samples.

    • 8 samples were weird questions. More than 90% of the networks predicted these incorrectly, but they all agreed on the same incorrect answer.

      [Figure “weird”: the 8 weird samples]

    • 5 samples were confusing questions. There was no clear agreement among the networks as to what the answer was.

      [Figure “confusing”: the 5 confusing samples]
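
    A sketch of how this split can be computed; the thresholds are my reading of “more than 90%” and “no clear agreement”, not the exact ones used:

    ```python
    import numpy as np

    # Sketch: predictions holds each network's final-attempt answer per
    # sample; labels is the ground truth.
    predictions = np.random.randint(0, 10, size=(1001, 10000))  # stand-in data
    labels = np.random.randint(0, 10, size=10000)               # stand-in labels

    correct = predictions == labels                  # (num_networks, num_samples)
    frac_correct = correct.mean(axis=0)              # per-sample fraction correct

    easy = frac_correct == 1.0                       # everyone got it right

    def most_common_share(col):
        # Share of networks that gave the single most common answer.
        counts = np.bincount(col, minlength=10)
        return counts.max() / counts.sum()

    agreement = np.apply_along_axis(most_common_share, 0, predictions)
    weird = (frac_correct < 0.10) & (agreement > 0.90)   # wrong, but unanimous
    confusing = agreement < 0.50                         # no clear majority answer
    print(easy.sum(), weird.sum(), confusing.sum())
    ```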

  9. Let’s take the student with the best score. Is this person the best overall?

    The network with the highest accuracy is ResNetStyle_75, with an accuracy of 99.1%. To be the best overall, it should have the highest accuracy for each class of inputs, so let’s look at the percentile of ResNetStyle_75 at predicting each digit correctly:

    name              0        1        2        3        4        5        6        7         8        9
    ResNetStyle_75    99.19    94.06    98.59    99.09    90.33    96.27    96.68    100.00    94.56    97.58

    So there are some networks that are more accurate than ResNetStyle_75 at predicting individual classes. ResNetStyle_75’s lowest percentile is at predicting 4s correctly.
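
    A sketch of how such per-digit percentiles can be computed (stand-in arrays; in reality the row for ResNetStyle_75 would be looked up by name rather than picked by argmax):

    ```python
    import numpy as np

    predictions = np.random.randint(0, 10, size=(1001, 10000))  # stand-in data
    labels = np.random.randint(0, 10, size=10000)               # stand-in labels

    # Per-digit accuracy for every network: shape (num_networks, 10).
    per_digit = np.stack([
        (predictions[:, labels == d] == d).mean(axis=1) for d in range(10)
    ], axis=1)

    best = per_digit.sum(axis=1).argmax()  # stand-in for ResNetStyle_75's index
    percentile = (per_digit <= per_digit[best]).mean(axis=0) * 100
    print(percentile.round(2))             # one percentile per digit
    ```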

  10. How does the best student compare to the Basic method?

    network           weights    memory usage    training time    number of ops    accuracy
    ResNetStyle_75    531605     254083          577.6s           28               0.991
    Basic             7850       1588            52.14s           4                0.912
    Ratio             67.7       160             11.07            7.0              1.08

    For an 8% increase in accuracy, ResNetStyle_75 required 67x the weights, 160x the memory, and 11x the training time. How many members would an ensemble of Basic networks need to reach a similar combined accuracy? 11? 67? 160?
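
    For what it’s worth, here is a sketch of how such an ensemble could be combined, by averaging the softmax outputs of independently trained Basic networks; whether any N gets close to 99% is exactly the open question:

    ```python
    import torch
    import torch.nn as nn

    def make_basic():
        # Same shape as Basic: 784 -> 10 Linear with bias, then Softmax.
        return nn.Sequential(nn.Flatten(), nn.Linear(784, 10), nn.Softmax(dim=1))

    ensemble = [make_basic() for _ in range(11)]   # each would be trained separately

    def ensemble_predict(x):
        with torch.no_grad():
            scores = torch.stack([net(x) for net in ensemble]).mean(dim=0)
        return scores.argmax(dim=1)

    print(ensemble_predict(torch.randn(2, 1, 28, 28)))  # untrained here, so random guesses
    ```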

Closing Notes

Picking the best neural network, not surprisingly, depends on the definition of best (questions 3, 4, 5, and 6). Some use cases may value resource efficiency: a simple network with few parameters that is fast, even if somewhat error-prone, is easier to use in a constrained environment. Other cases may have heavy consequences attached to a wrong prediction, and so will use a large, overparametrized network, located on a server with multiple GPUs, to avoid errors at all costs. Maybe the two extremes could work in tandem: the small network can provide a quick prediction, which can be checked by requesting a prediction from the large network if needed.
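
A sketch of that tandem idea: trust the small network when it is confident, and ask the large one otherwise. The 0.9 confidence threshold is an arbitrary choice for illustration, and the small network is assumed to output probabilities (as Basic does, via its Softmax).

```python
import torch
import torch.nn as nn

def cascade_predict(small_net, large_net, x, threshold=0.9):
    # Use the small network's answer when it is confident, otherwise
    # fall back to the large network for just those inputs.
    with torch.no_grad():
        small_scores = small_net(x)                  # assumed to be probabilities
        confidence, prediction = small_scores.max(dim=1)
        unsure = confidence < threshold
        if unsure.any():
            prediction[unsure] = large_net(x[unsure]).argmax(dim=1)
    return prediction

small = nn.Sequential(nn.Flatten(), nn.Linear(784, 10), nn.Softmax(dim=1))
large = nn.Sequential(nn.Flatten(), nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 10))
print(cascade_predict(small, large, torch.randn(4, 1, 28, 28)))
```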


  1. The network is just a 784x10 matrix multiplication, adding a bias vector, and a Softmax layer. ↩︎

  2. The computation layer is a ResNet BasicBlock. ↩︎

  3. I realized after running all the networks that I could’ve modified the BasicBlock to use different activations instead of just ReLU, which would’ve given a nice square matrix of subplots, and info about how the BasicBlock architecture is affected by different activations. ↩︎