Cheat Sheet for Codelab: TensorFlow and Deep Learning without a PhD

At our recent joint event by GDG Seattle and Seattle Data/Analytics/Machine Learning on 1/27 — TensorFlow and Deep Learning without a PhD Session 1: CNN, I helped as a coach for Martin Gorner’s codelab.

For developers who have just started learning machine learning, deep learning, and TensorFlow, the flood of new terminology can be overwhelming. To add to the confusion, there are often several names for the same thing. So I compiled this cheat sheet and am sharing the notes I took as I went through the codelab. Hopefully it helps our attendees, most of whom are beginners in TensorFlow and deep learning, as well as any of you taking the codelab on your own.

The codelab teaches you how to classify handwritten digits from the MNIST dataset, the “Hello World” of deep learning. Once you grasp the concepts introduced in this codelab, you should have a good idea of what deep learning is and how it works.

Below are some of the deep learning terms commonly used in the codelab:


  • Training data — data used for training the model.
  • Validation data — also called holdout, development, or dev data. Used to tune hyperparameters during training; you can also use it to evaluate your algorithms.
  • Test data — data the network hasn’t seen, used to evaluate the model.
  • Overfitting — the model fits the training data very well but doesn’t generalize to unseen data. You can spot it when the training loss and the test loss diverge. Overfitting is often caused by too many neurons or not enough data.


The model formula in the codelab: Y = softmax(X*W + b)

  • Input X, [100, 784] — each mini-batch contains 100 grayscale images of shape [28, 28], each flattened into a vector of 784 pixels. Note that for a color image the shape would be [28, 28, 3].
  • Weights W, [784, 10] — there are 10 digits (classes) to classify, with one weight per pixel, so there are 784 weights per digit/class.
  • Biases b, [10] — one bias per digit/class. Since the biases have a different shape than the inputs and weights, Python/NumPy broadcasting adds them to each row of the result.
  • Softmax — an activation function (it determines how strongly a neuron fires) that turns a weighted sum into a probability (a number between 0 and 1). Often used in the last layer of a neural network for classification.
  • Output Y, [100, 10] — each of the 100 images gets a prediction Y. The formula X*W + b produces weighted sums, sometimes called scores or logits. We then use the softmax function to turn the scores/logits into a prediction of which digit the image contains.
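To make the shapes concrete, here is a minimal NumPy sketch of the forward pass (an illustrative stand-in for the codelab’s TensorFlow code; the random input and zero-initialized weights are assumptions for the demo):

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max before exponentiating for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Shapes from the codelab: a mini-batch of 100 flattened 28x28 images.
rng = np.random.default_rng(0)
X = rng.random((100, 784))   # input pixels
W = np.zeros((784, 10))      # one weight per pixel per class
b = np.zeros(10)             # one bias per class, broadcast across the 100 rows

Y = softmax(X @ W + b)       # predictions, shape [100, 10]
print(Y.shape)               # (100, 10); each row sums to 1
```

With all-zero weights every logit is 0, so softmax assigns each digit the same probability, 0.1; training moves the weights away from this uniform guess.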

Loss function

The error function is cross entropy, -∑ Yᵢ′ * log(Yᵢ), which measures how far our prediction is from the actual digit label.

  • Y′ — the actual probability distribution, one-hot encoded. Also called the label or ground truth.
  • Y — the computed probability, i.e. the prediction made by the deep learning model.
  • Gradient descent — an optimization algorithm that repeatedly steps in the direction of the negative gradient; the gradient is the vector of partial derivatives (the slope) of the loss with respect to the weights and biases.
  • Learning rate — the size of the step you take at each gradient descent update.
  • Mini-batch — a small subset of the training data processed in one training step; this tutorial uses a mini-batch size of 100.
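A quick NumPy illustration of the cross-entropy formula (the label and the two candidate predictions are made up for the example):

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # -sum(Y' * log(Y)); with a one-hot Y', only the true class term is non-zero.
    return -np.sum(y_true * np.log(y_pred))

y_true = np.zeros(10)          # one-hot label: the image is a "3"
y_true[3] = 1.0

confident = np.full(10, 0.01)
confident[3] = 0.91            # softmax output that puts 91% on the right digit
unsure = np.full(10, 0.1)      # uniform softmax output

print(cross_entropy(y_true, confident))  # small loss
print(cross_entropy(y_true, unsure))     # larger loss
```

Because the label is one-hot, the loss reduces to -log of the probability assigned to the true class, so a confident correct prediction earns a near-zero loss.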

Training process

  1. Input data: training digits and labels
  2. Initialize weights and biases with random values
  3. Compute the loss function by comparing the predictions with the labels
  4. Use gradient descent to figure out the direction and how much to adjust the weights and biases
  5. Update weights and biases
  6. Repeat with the next mini-batch of training images and labels until we finish processing all training data.
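The six steps above can be sketched end to end in NumPy for the single-layer softmax model (random data stands in for MNIST, the same batch is reused each iteration for brevity, and the learning rate is an arbitrary small value):

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, k = 100, 784, 10                      # batch size, pixels, classes
X = rng.random((n, d))                      # step 1: input images (random stand-in)
Y_true = np.eye(k)[rng.integers(0, k, n)]   # step 1: one-hot labels

W = 0.01 * rng.standard_normal((d, k))      # step 2: random initial weights
b = np.zeros(k)                             # step 2: initial biases
lr = 0.002                                  # learning rate

losses = []
for step in range(100):                     # step 6: repeat (same batch here)
    Y = softmax(X @ W + b)                  # forward pass
    losses.append(-np.mean(np.sum(Y_true * np.log(Y), axis=1)))  # step 3
    grad_logits = (Y - Y_true) / n          # step 4: gradient of cross entropy
                                            # w.r.t. the logits is (Y - Y')
    W -= lr * (X.T @ grad_logits)           # step 5: update weights...
    b -= lr * grad_logits.sum(axis=0)       # ...and biases

print(losses[0], losses[-1])                # loss decreases over the iterations
```

The convenient fact that softmax plus cross entropy has the simple gradient (Y − Y′) with respect to the logits is what makes this loop so short.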

Protip: during training, the errors you get are often about the shape of one of the tensors, so pay attention to the shapes and to how you reshape.

Optimizations in the codelab

The codelab starts with a simple model at 92% accuracy and then improves it to 99% accuracy with these optimizations:

  • Add more layers to the neural network.
  • Use the ReLU activation function instead of sigmoid.
  • Learning rate decay — slowly reduce the learning rate over time. The codelab uses exponential decay.
  • Dropout — a regularization technique that prevents overfitting. During training, randomly drop neurons by setting their outputs to 0.
  • Use a Convolutional Neural Network (CNN) instead of a simple neural network.
  • Increase the CNN filter size and the number of filters, and apply dropout.
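As an example of one of these optimizations, here is a sketch of an exponential learning rate decay schedule (the constants are illustrative, not necessarily the codelab’s exact values):

```python
import math

def decayed_lr(step, max_lr=0.003, min_lr=0.0001, decay_speed=2000.0):
    # Decay exponentially from max_lr toward min_lr as training progresses.
    return min_lr + (max_lr - min_lr) * math.exp(-step / decay_speed)

print(decayed_lr(0))       # starts at max_lr
print(decayed_lr(10000))   # approaches min_lr late in training
```

Early in training, large steps make fast progress; later, smaller steps keep the model from bouncing around the minimum.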

Protip: follow the codelab instructions and try it out on your own. Only when you get really stuck, then look at the solutions and identify the difference between your code and the solutions.

Convolutional Neural Networks

  • Filter — also called a patch or kernel: a small patch of weights that you slide (convolve) across the input image. In TensorFlow the filter weights form a 4-D tensor of shape [filter_height, filter_width, in_channels, out_channels].
  • Stride — the number of entries by which the filter moves at each step. In TensorFlow’s tf.nn.conv2d it is given as a list of four integers, e.g. [1, stride, stride, 1].
  • Padding — you specify the padding as either “VALID” or “SAME” in TensorFlow. VALID means no padding; SAME pads the input so that, with a stride of 1, the output is the same size as the input.
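The output size for each padding mode follows TensorFlow’s sizing rules; a small helper makes the arithmetic concrete (assuming square inputs and filters):

```python
import math

def conv_output_size(input_size, filter_size, stride, padding):
    # TensorFlow's sizing rules for tf.nn.conv2d.
    if padding == "SAME":
        # Pads as needed, so only the stride shrinks the output.
        return math.ceil(input_size / stride)
    if padding == "VALID":
        # No padding: the filter must fit entirely inside the input.
        return math.ceil((input_size - filter_size + 1) / stride)
    raise ValueError(padding)

print(conv_output_size(28, 5, 1, "SAME"))   # 28
print(conv_output_size(28, 5, 1, "VALID"))  # 24
print(conv_output_size(28, 5, 2, "SAME"))   # 14
```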


What is a tensor? A tensor is a multi-dimensional array.

  • Scalar — single value
  • Vector — one dimensional
  • Matrix — two dimensional
  • Placeholder (tf.placeholder) — a placeholder for data fed in at run time, for example the input training data.
  • Variable (tf.Variable) — something we ask TensorFlow to figure out for us, such as the weights and biases.
  • Optimizer — computes the gradients (partial derivatives) of the error function and updates the variables. TensorFlow has a full library of optimizers. The codelab uses GradientDescentOptimizer for the simple network and later switches to AdamOptimizer.
  • Session (tf.Session) — in TensorFlow you first create a session and then call its run() method to perform the computations. Note: in recent versions of TensorFlow with eager execution, you no longer need a session.
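The rank of each kind of tensor is easy to see with NumPy’s ndim (NumPy as a stand-in here; TensorFlow tensors report the same information through their shape):

```python
import numpy as np

scalar = np.array(3.0)               # rank 0: a single value
vector = np.array([1.0, 2.0, 3.0])   # rank 1: one dimensional
matrix = np.eye(2)                   # rank 2: two dimensional
batch  = np.zeros((100, 28, 28, 1))  # rank 4: a mini-batch of grayscale images

for t in (scalar, vector, matrix, batch):
    print(t.ndim, t.shape)
```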

Protip: go to the TensorFlow API documentation to look up the details of each API.
