Deep Learning Curriculum

From Colettapedia

ML basics - stuff to know by heart

  • simplest classifier: always predict the class that is seen most often
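  • A useful model has to beat random guessing, and ideally the majority-class baseline above.
  • Prerequisites: a working Anaconda installation, a working Jupyter notebook, and the ability to plot a histogram.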
  • You can build a function, solve it analytically, or solve it numerically.
    • Analytical method: solve for the derivative, as learned in high school. If you know the function, you can get the gradient easily.
    • Numerical method: how would a blind man walk down to find the Colorado River at the bottom of the Grand Canyon? He feels the local slope and keeps stepping downhill.
  • Linear algebra view of the data matrix X, with n by M dimensions: understand rows versus columns.
  • Hyperplane: the simplest classifier you can think of. It divides space into two using the simplest, most general rule, with no variation to the rule, so overfitting is minimal. A curve, by contrast, could overfit.
  • Train, test, validate: does the model generalize?
  • Hyperplane cannot solve the XOR dataset.
  • Most classifier algorithms have tunable settings, called hyperparameters, which are often optimized using search. Naive Bayes has the fewest.
  • If you choose the parameters by picking whatever scores best on the test data, you're cheating.
  • General workflow: make a prediction, get the error, update the weights. How much do you overshoot? Updates that are too large cause oscillations.
  • Eta (the learning rate): how much of the error is applied in each weight update.
  • In general neural networks are universal function approximators, find a function approximating the data.
  • The Lego piece for building a neural network is the neuron: it fires if the summed signals exceed a certain threshold (the activation function).
    • The threshold is what allows us to NON-linearly map an input to an output.
  • Just find a good YouTube video to show in class; they can explain it better than you.
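The predict, get error, update loop and the role of eta can be sketched numerically. This is a minimal illustration, not part of the original notes: the function f(w) = (w - 3)^2, the starting point, and the three eta values are assumptions chosen only to show slow convergence, oscillation, and divergence.

```python
def gradient_descent(eta, w0=0.0, steps=50):
    """Minimize the toy function f(w) = (w - 3)^2 by stepping against its gradient."""
    w = w0
    history = [w]
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # analytical derivative of (w - 3)^2
        w = w - eta * grad      # eta scales how far we move downhill
        history.append(w)
    return history

# Hypothetical learning rates picked to show three regimes.
small = gradient_descent(eta=0.1)        # conservative: creeps toward w = 3
oscillating = gradient_descent(eta=0.9)  # overshoots, bouncing from side to side of w = 3
diverging = gradient_descent(eta=1.1)    # overshoots more each step and blows up

print("eta=0.1 final w:", small[-1])
print("eta=0.9 first steps:", oscillating[:4])
print("eta=1.1 final w:", diverging[-1])
```

With this function each step multiplies the distance to the minimum by (1 - 2 * eta), which is why values of eta above 0.5 flip sides on every step and values above 1 diverge.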

DL 101

  • What this lab is not:
    • intro to ML
    • rigorous mathematical formalism
  • Takeaways: by the end of this you should be able to
    • adapt this workflow to your own case
    • know where to go for more info
  • Like a regression
  • Line of best fit: you have a bunch of characteristics; take the dot product between variables and weights, i.e. a weighted sum, with the weights optimized to minimize the error.
  • Intermediate step: in a cats vs dogs classifier, pointy-ear detectors and fluffiness detectors are learned higher-level features.
  • DCNN: take a whole bunch of stacked regressions; together they form one giant function.
  • Loss: a measure of how wrong we are = (Predicted - actual)
  • Gradient operator: gives you the direction of steepest increase of the loss; step the opposite way to minimize the loss.
  • minimize how wrong we are.
  • Universal approximation theorem: a network can learn to approximate any mapping from input to output, i.e. any function that's representable by math.
  • A day in the life of a weight in the network: the gradient tells it which direction to move but not how far.
  • Learning rate: scales how far. You want to be conservative, at the cost of slower convergence.
  • Convolutions: think of 2-D pictures in one dimension first. Apply a set of weights, aka a "kernel", to the source pixels; the weighted sum gives the destination pixels, aka the "activation map". Each output is connected to a small region rather than the whole input.
    • Edge detector: a single convolution applied with a single kernel. Slide the weights over by one each step. Convolution is translation equivariant.
    • Rectified activation (ReLU): forces the output to be non-negative.
    • Kernels are detectors
    • 2D case GIF
    • Edge detector running in real time on the Taj Mahal: something really simple becomes ridiculously powerful.
  • Now bundle convolutions in a layer. Learn the weights to come up with crazy powerful detectors.
  • Convolutional NN - learning by flow chart
  • A network with enough capacity can memorize everything it has seen, but if someone walks in off the street it's not going to know their name.
  • Subsample = pooling
  • locally connected vs fully connected
  • RGB is 3 channels, each with its own kernel weights. If the input is grayscale, you need to change the architecture or change the input.
  • A kernel is a bundle of weights, and a layer is a bundle of kernels
  • A channel is the result of one kernel: 64 kernels give 64 channels, i.e. new images, i.e. output feature maps. It makes sense to look in a local region while looking across all the channels.
  • Visualize what the DL architecture is looking at
  • Attention map: what a given kernel is paying attention to. The first layer had edge detectors (e.g. vertical edges); the second layer gets corners and curves.
  • Show the image crops that most activate a kernel.
  • In layer 3 we get textures and primitives, combining lower-level concepts into higher-level ones. Kernels here detect wheels, faces, leopard print.
  • Deeper layers build on the simpler concepts learned in the previous layer.
  • A filter will fire when it sees a dog or a person
  • Hyperparameters: knobs to tweak.
  • Myth busters:
    • CNNs only work with big data? False: one-shot learning handles the low-data case.
    • CNNs only work for images? False: applications in many domains. 1-D convolutions are good for EEG brain wave data, speech recordings, detecting seizures.
    • Neural networks are all black boxes? Used to be true; now more like gray boxes, becoming more white, thanks to interpretation of what CNNs have learned.
    • Deconvnets help interpret what CNNs have learned.
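  • Train/validate/test is a 2-step check that the process generalizes. With a 1-step check, testing too many models means one of them can look good by sheer chance, and the held-out data has effectively been absorbed into model selection. The test set is about validating the end-to-end process.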
  • validation set is to tune hyperparameters
  • test set is for validating process. Kaggle leaderboard is the test set, only allowed to test 1-5 times per day.
  • Look at training loss, if you've flatlined, either the model converged, or you ain't gonna converge. Hopefully train loss and test loss converge to the same value.
  • Then look at validation loss. You can still improve as long as it's decreasing. Early stopping: if validation loss goes up again, stop.
  • If training loss explodes, make sure that no value in the gradient is too high.
  • Adam gives every weight its own separate learning rate calculated based on its history.
  • Catastrophic forgetting - set a low learning rate, or else you'll jump out of the bowl.
  • Learning rates can only go down, typically. Can do a warm up to prevent.
  • Transfer learning addresses the small data problem
  • Larger more complicated networks
  • FCN8 higher resolution segmentation map
  • bottom = input, top = output, bottom and top corresponding to a hierarchy of semantics, high level, low level.
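The kernel, activation map, and rectified-activation bullets above can be made concrete in one dimension. The sketch below is an illustration only: the signal and the hand-picked `edge_kernel` are hypothetical, and the sliding weighted sum is implemented the way deep learning frameworks usually do it (cross-correlation, stride 1, no padding).

```python
import numpy as np

def conv1d(signal, kernel):
    """Slide the kernel over the signal one position at a time and take weighted sums."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

def relu(x):
    """Rectified activation: negative responses are set to zero."""
    return np.maximum(x, 0.0)

signal = np.array([0, 0, 0, 1, 1, 1, 0, 0], dtype=float)
edge_kernel = np.array([-1.0, 1.0])  # hypothetical detector for a rising edge

activation_map = relu(conv1d(signal, edge_kernel))

# Translation equivariance: shifting the input by one shifts the activation map by one.
shifted_map = relu(conv1d(np.roll(signal, 1), edge_kernel))

print(activation_map)
print(shifted_map)
```

The kernel fires only where the signal steps up; the falling edge produces a negative response that the rectifier removes. In a trained network the kernel weights would be learned rather than hand-picked.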


  • Aids to diagnosis
  • Create new metrics
  • average radiologist has 5 minutes per case
  • Computer vision tasks, from least complicated to most: a gradient of difficulty (from Stanford University)
    • Image classification: we don't care where the object is, only what it is.
    • Localization: assumes 1 or more objects are present; where and what is the object in the given image?
    • Detection: assumes 0 or more objects; is there something in this image, and if so, where and what is it?
    • Image segmentation: Create a mask to cover the object in the picture
  • harder tasks need more (labelled) data
  • Semantic segmentation: pixel-wise classification - output image is a contour
  • Use metrics that account for the class imbalance problem.
  • NVIDIA DIGITS: one-node multiGPU
  • AlexNet is image in, label out
  • But segmentation is image in, image out (or contour out).
  • Current SOTA is UNet for segmentation
  • DENSE = fully connected
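Why segmentation needs metrics that account for class imbalance can be seen with a toy mask. This is a sketch under assumptions: the 10 by 10 image, the 4-pixel object, and the "predict background everywhere" model are all made up for illustration.

```python
import numpy as np

# Hypothetical ground truth: a 10x10 image where only 4 pixels belong to the object.
truth = np.zeros((10, 10), dtype=int)
truth[4:6, 4:6] = 1

# A useless "model" that labels every pixel as background.
all_background = np.zeros_like(truth)

# Pixel-wise accuracy looks excellent because background dominates.
pixel_accuracy = (all_background == truth).mean()

# Recall on the object pixels shows the model found none of them.
foreground_recall = (all_background[truth == 1] == 1).mean()

print("pixel accuracy:", pixel_accuracy)
print("foreground recall:", foreground_recall)
```

Since semantic segmentation is pixel-wise classification and the object is often a small fraction of the image, plain accuracy rewards ignoring the object, which is the motivation for overlap-based metrics.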


  • Continuous Ranked Probability Score (CRPS)
    • 2 networks that predict
    • networks capture certainty
    • predict a distribution, not just a single number. CDF
    • certainty means steep curve. Safer bet to predict with less confidence if you're not sure.
    • Bin the target so the task looks like a classification problem, then differentiate.
  • Entropy: how concentrated the predicted distribution is.
  • output of softmax is not a probability statistic.
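The idea of predicting a distribution and scoring its CDF can be sketched with a discretized CRPS. The binned form below (mean squared gap between the predicted CDF and a step function at the true value) is one common discretization and is an assumption here, not necessarily the exact variant these notes refer to; the 5-bin example predictions are made up.

```python
import numpy as np

def crps_binned(bin_probs, true_bin):
    """Discretized CRPS: mean squared gap between predicted CDF and the truth's step CDF."""
    bin_probs = np.asarray(bin_probs, dtype=float)
    predicted_cdf = np.cumsum(bin_probs)
    # The "perfect" CDF jumps from 0 to 1 exactly at the true bin.
    step_cdf = (np.arange(len(bin_probs)) >= true_bin).astype(float)
    return np.mean((predicted_cdf - step_cdf) ** 2)

def entropy(bin_probs):
    """Shannon entropy in nats: low when the mass is concentrated, high when spread out."""
    p = np.asarray(bin_probs, dtype=float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))

true_bin = 2
confident_right = [0.0, 0.0, 1.0, 0.0, 0.0]  # steep CDF at the right place
hedged = [0.2, 0.2, 0.2, 0.2, 0.2]           # flat, unsure prediction
confident_wrong = [0.0, 0.0, 0.0, 0.0, 1.0]  # steep CDF at the wrong place

crps_right = crps_binned(confident_right, true_bin)
crps_hedged = crps_binned(hedged, true_bin)
crps_wrong = crps_binned(confident_wrong, true_bin)

print(crps_right, crps_hedged, crps_wrong)
print(entropy(confident_right), entropy(hedged))
```

Lower is better: the hedged prediction scores worse than being confidently right but clearly better than being confidently wrong, which matches the note that lower confidence is the safer bet when unsure.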


  • good for multigpu out of the box
  • good for distributed
  • MXNet is fast
  • imperative (eager execution): interactive, something happens when you run it. Get back the intermediate layers/outputs. Great for debugging network. Don't care about speed.


Batch Size

  • The larger the batch size, the faster training runs.
  • Tune batch size and learning rate together to get better results.


  • Gradient descent is a first-order method.
  • Some optimizers approximate second-order information.
  • This helps prevent getting trapped in saddle points.
  • True second-order optimization methods are "dog-slow".
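The first-order versus second-order trade-off can be illustrated on a small quadratic. Everything in this sketch is an assumption made for illustration: the matrix `A`, the vector `b`, the learning rate, and the use of a plain Newton step as the second-order method.

```python
import numpy as np

# Toy loss f(w) = 0.5 * w^T A w - b^T w, whose Hessian is the constant matrix A.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])  # symmetric positive definite
b = np.array([1.0, 2.0])

def gradient(w):
    return A @ w - b

w_star = np.linalg.solve(A, b)  # exact minimizer, for reference

# First order: many small steps using only the gradient.
w_gd = np.zeros(2)
eta = 0.1
for _ in range(10):
    w_gd = w_gd - eta * gradient(w_gd)

# Second order: one Newton step rescales the gradient by the inverse Hessian.
w_newton = np.zeros(2)
w_newton = w_newton - np.linalg.solve(A, gradient(w_newton))

print("gradient descent error after 10 steps:", np.linalg.norm(w_gd - w_star))
print("Newton error after 1 step:", np.linalg.norm(w_newton - w_star))
```

On a quadratic the Newton step lands on the minimum immediately, but it requires the full Hessian and a linear solve. With n weights that is an n by n matrix, which is why exact second-order methods become impractically slow for large networks and cheaper approximations are used instead.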

Batch Normalization

  • normalization within a batch
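A minimal sketch of what "normalization within a batch" means in training mode, assuming a 2-D input of shape (samples, features). The `gamma`, `beta`, and `eps` parameters follow the usual formulation; the random batch is made up for illustration.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) using the mean and variance of the current batch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly zero mean, unit variance per feature
    return gamma * x_hat + beta              # learnable scale and shift (fixed here)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # hypothetical batch: 32 samples, 4 features

normalized = batch_norm(batch)
print(normalized.mean(axis=0))
print(normalized.std(axis=0))
```

At inference time, implementations typically replace the per-batch statistics with running averages collected during training; that part is omitted from this sketch.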


  • Co-adaptation: a single path through the network dominates.
  • Forcing the network to learn multiple paths acts like an ensemble. It slows down training, is roughly equivalent to Tikhonov regularization, and prevents overfitting.
  • Overfitting means memorizing the training set as opposed to the underlying characteristics.
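The notes above do not name the technique, but breaking co-adaptation by forcing multiple paths is how dropout is usually described, so the sketch below assumes dropout is meant. It uses the common "inverted dropout" formulation; `keep_prob` and the all-ones activations are illustrative choices.

```python
import numpy as np

def dropout(activations, keep_prob, rng, training=True):
    """Inverted dropout: randomly zero units during training, rescale the survivors."""
    if not training:
        return activations  # at inference every path is used, with no rescaling needed
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob  # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
activations = np.ones(10_000)  # hypothetical layer output

dropped = dropout(activations, keep_prob=0.8, rng=rng)
at_inference = dropout(activations, keep_prob=0.8, rng=rng, training=False)

print("fraction zeroed:", (dropped == 0).mean())
print("mean after dropout:", dropped.mean())
```

Because a different random subset of units is silenced on every training step, no single path can dominate, and the network behaves somewhat like an average over many thinned networks.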


  • HooChang GAN paper
  • DeepMedic
  • DeepPatient
  • Andrew Ng paper on Atrial Fibrillation
  • N-body

Famous architectures


  • What is deep learning? A subset of machine learning with automatic feature extraction.
    • Neural network architecture: a series of layers; at each layer a different task is performed
    • Input layer & output layer
  • What is machine learning? feature extraction
  • ILSVRC top-5 error on ImageNet: DL performs better than human accuracy for image classification
  • Labeled public datasets for supervised learning
    • curve fitting, linear regression, data, model, objective, optimization
    • y = mx + b: fit a slope and intercept by minimizing the residuals, i.e. the differences between the fitted line and the actual values. The algorithm that does this, by minimizing the sum of squared residuals, is called ordinary least squares.
  • concept of architecture
  • pretrained model - they give you the architecture, the labels and the weights
  • Post-training: visualize weights and activations.
  • The Dice metric is the F1 score computed over pixels.
    • Ground truth contour versus predicted contour: Dice = 2 * area of intersection / (area of ground truth + area of prediction)
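The Dice coefficient is usually defined as 2|A ∩ B| / (|A| + |B|) for a ground-truth mask A and a predicted mask B, which is the same number as the F1 score over pixels. The sketch below checks that equivalence on two made-up 4 by 4 masks; returning 1.0 when both masks are empty is a convention assumed here.

```python
import numpy as np

def dice(truth, pred):
    """Dice coefficient between two binary masks."""
    truth = np.asarray(truth, dtype=bool)
    pred = np.asarray(pred, dtype=bool)
    total = truth.sum() + pred.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * np.logical_and(truth, pred).sum() / total

# Hypothetical masks: two 2x2 squares that overlap in one column.
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 2:4] = True

dice_score = dice(truth, pred)

# The same number via pixel-wise F1 = 2*TP / (2*TP + FP + FN).
tp = np.sum(truth & pred)
fp = np.sum(~truth & pred)
fn = np.sum(truth & ~pred)
f1_score = 2 * tp / (2 * tp + fp + fn)

print(dice_score, f1_score)
```

Here the masks share 2 of their 4 pixels each, so both computations give 0.5.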

Use cases

  • real-time classification (pizza, hotdog, etc)
  • recommendation engine
  • handwritten digits
  • Autonomous vehicles