# Deep Learning Curriculum

## ML basics - stuff to know by heart

- simplest classifier: always predict the class that is seen most often (the majority-class baseline)
- any real model must at least beat random
- setup check: a working Anaconda installation, a working Jupyter notebook, plot a histogram
- You can build a function, solve it analytically, or solve it numerically.
- the analytical method of solving for a derivative is taught in high school: if you know the function, you can get the gradient easily
- How would a blind man walk down to find the Colorado River at the bottom of the Grand Canyon? Feel the slope underfoot and step downhill: numerical gradient descent.
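
A minimal sketch of the canyon idea, numerical gradient descent on an assumed toy function f(x) = (x - 3)^2 (the function and step count are illustrative, not from the notes):

```python
# Numerical gradient descent on a toy "canyon" f(x) = (x - 3)^2.
def f(x):
    return (x - 3) ** 2

def numerical_gradient(f, x, h=1e-5):
    # Central difference: "feel the slope underfoot" without knowing f's formula.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 0.0     # start somewhere on the canyon wall
eta = 0.1   # learning rate: how big a step to take downhill
for _ in range(100):
    x -= eta * numerical_gradient(f, x)
print(x)    # converges near 3.0, the bottom of the "canyon"
```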

- linear algebra view of the data matrix X (n samples by m features): understand rows versus columns
- Hyperplane: the simplest classifier you can think of. It divides the space in two with the most general, simplest rule; because the rule has no flexibility, overfitting will be minimal. A curve, by contrast, can overfit.
- Train/test/validate: does it generalize?
- A hyperplane cannot solve the XOR dataset.
- Most classifier algorithms have settings called hyperparameters, often optimized using search; Naive Bayes has among the fewest.
- If you tune the parameters by picking whatever does best on the test data, you're cheating.
- General workflow: make a prediction, get the error, update the weights. How much do you overshoot? Oscillations.
- Eta (the learning rate): how much of the error to apply at each update
- In general, neural networks are universal function approximators: they find a function approximating the data.
- The Lego piece for building a neural network: a unit that fires if the summed signals exceed a certain threshold (the activation function)
- The threshold is what allows us to NON-linearly map an input to an output (see the sketch below).
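
A minimal sketch of that Lego piece: a single neuron with a hard threshold. The hand-picked weights compute OR; no weights for this single hyperplane can compute XOR, which is why layers get stacked:

```python
import numpy as np

def neuron(x, w, b):
    # Weighted sum of inputs, then a hard threshold (step activation).
    return 1 if np.dot(w, x) + b > 0 else 0

# Weights chosen by hand to compute OR: a single hyperplane separates the classes.
w, b = np.array([1.0, 1.0]), -0.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, neuron(np.array(x), w, b))
# No choice of w and b makes this neuron compute XOR: the four XOR points
# are not linearly separable, which is why we stack layers.
```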

- Just find a good YouTube video to show in class; they can explain it better than you.

## DL 101

- What this lab is not:
  - an intro to ML
  - a rigorous mathematical formalism

- Takeaways: by the end of this you should be able to
  - adapt this workflow to your own use case
  - know where to go for more info

- Like a regression
- line of best fit: you have a bunch of characteristics; take the dot product between variables and weights (a weighted sum), optimized to minimize the error
- Intermediate step: cats-vs-dogs detectors (pointy-ear detector, fluffy detector): learned higher-level features
- DCNN: take a whole bunch of regressions and stack them: a giant function
- Loss: a measure of how wrong we are = (predicted - actual)
- gradient operator: gives you the direction in which to step in order to minimize the loss
- minimize how wrong we are
- universal function approximation theorem: learn any mapping from input to output, any function that's representable by math
- A day in the life of a weight in the network: the gradient tells you which direction but doesn't tell you how far.
- the learning rate scales how far. You want to be conservative, even though it converges more slowly
- Convolutions: slide a small grid of weights, aka the "kernel", over the source pixels; each destination pixel is a weighted sum of a local patch, and the output is the "activation map". Each output is connected to a small region rather than the whole image (see the sketch after this list).
- Edge detector: a single convolution applied with a single kernel, sliding the weights over by one pixel at a time. Translation equivariant.
- rectified activation (ReLU): force the output to be non-negative
- Kernels are detectors
- 2D case GIF
- edge detection in real time on the Taj Mahal: something really simple becomes ridiculously powerful
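
A minimal sketch of a single convolution with a hand-made kernel; the Sobel-style vertical-edge kernel and tiny image are illustrative assumptions:

```python
import numpy as np

def convolve2d(image, kernel):
    # Slide the kernel over the image one pixel at a time ("valid" mode):
    # each output pixel is a weighted sum of a small local patch.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A classic vertical-edge kernel: the kernel IS the detector.
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]])
image = np.zeros((6, 6))
image[:, 3:] = 1.0                 # left half dark, right half bright
print(convolve2d(image, kernel))   # strong response along the vertical edge
```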

- Now bundle convolutions into a layer. Learn the weights to come up with a crazy powerful detector.
- Convolutional NN - learning by flow chart
- A network big enough to memorize everything it has seen still fails on novelty: if someone walks in off the street, it's not going to know their name.
- Subsample = pooling
- locally connected vs fully connected
- RGB is 3 channels, each with its own kernel of weights. If the input is grayscale, you need to change the architecture or change the input.
- A kernel is a bundle of weights, and a layer is a bundle of kernels (see the sketch after this list)
- A channel is the result of one kernel: 64 kernels give 64 channels, i.e. new images, i.e. the output feature map. It makes sense to look in a local region, across all the input channels
- Visualize what the DL architecture is looking at
- attention map of what that kernel is paying attention to. In the first layer we had edge detectors (vertical edges); in the second layer we get corners and curves
- the crops that most activate it
- at layer 3 we get textures and primitives: lower-level concepts combine into higher-level ones. One kernel detects wheels, faces, leopard print.
- deeper layers build on the simpler concepts learned in the previous layers.
- A filter will fire when it sees a dog or a person
- Hyperparameters: knobs to tweak.
- Myth busters:
  - CNNs only work with big data? False: one-shot learning handles the low-data case.
  - CNNs only work for images? False: applications in many domains; 1-D convolutions are good for EEG brain-wave data, speech recordings, detecting seizures.
  - Neural networks are all black boxes? Used to be true; now more like gray boxes, becoming whiter as we interpret what CNNs have learned.
- Deconvnets make it easier to interpret what CNNs have learned.
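
A minimal sketch of kernels bundled into layers with pooling and a fully connected head, written against MXNet's Gluon API (MXNet shows up later in these notes); the layer sizes are illustrative assumptions:

```python
import mxnet as mx
from mxnet.gluon import nn

# A layer is a bundle of kernels: Conv2D(channels=16) learns 16 kernels,
# producing 16 output channels (a new feature map).
net = nn.Sequential()
net.add(nn.Conv2D(channels=16, kernel_size=3, activation='relu'),  # locally connected
        nn.MaxPool2D(pool_size=2),                                 # subsample = pooling
        nn.Conv2D(channels=32, kernel_size=3, activation='relu'),
        nn.MaxPool2D(pool_size=2),
        nn.Flatten(),
        nn.Dense(10))                                              # fully connected head
net.initialize()
x = mx.nd.random.uniform(shape=(1, 3, 32, 32))  # one RGB image: 3 input channels
print(net(x).shape)                             # (1, 10)
```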

- Train/test/validate is a two-step check that the process generalizes. With a one-step (train/test) setup, by testing too many models we absorb the test set into our choices by sheer chance. The test set is about validating the end-to-end process.
- the validation set is for tuning hyperparameters
- the test set is for validating the process. A Kaggle leaderboard is a test set: you're only allowed to submit 1-5 times per day.
- Look at the training loss: if it has flatlined, either the model converged or it ain't gonna converge. Hopefully train loss and test loss converge to the same value.
- Then look at the validation loss. The model can still improve as long as it's decreasing. Early stopping: if the validation loss goes up again, stop (see the sketch after this list).
- If the training loss explodes, make sure no value in the gradient is too high (gradient clipping).
- Adam gives every weight its own separate learning rate calculated based on its history.
- Catastrophic forgetting: when fine-tuning, set a low learning rate, or else you'll jump out of the bowl.
- Learning rates typically only go down over training. A warm-up, ramping the rate up at the start, can prevent early instability.
- Transfer learning addresses the small data problem
- Larger more complicated networks
- FCN-8 gives a higher-resolution segmentation map
- bottom = input, top = output; bottom and top correspond to a hierarchy of semantics, from low-level to high-level
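
A minimal early-stopping sketch; `train_one_epoch` and `evaluate` are hypothetical placeholders standing in for real training and validation code:

```python
import random

def train_one_epoch():
    pass  # hypothetical placeholder: update model weights on the training set

def evaluate(epoch):
    # hypothetical placeholder: pretend validation loss falls, then rises (overfitting)
    return (epoch - 10) ** 2 / 100.0 + random.random() * 0.01

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch()
    val_loss = evaluate(epoch)
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0   # still improving: keep going
    else:
        bad_epochs += 1                      # validation loss turned up
        if bad_epochs >= patience:
            print("early stop at epoch", epoch)
            break                            # stop before we overfit
```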

## Practice

- Aids to diagnosis
- Create new metrics
- average radiologist has 5 minutes per case
- computer vision tasks, from least complicated to most: a gradient of difficulty (from Stanford University)
  - image classification: we don't care where it is, just what the object is
  - localization: where and what is the given object in the image, assuming 1 or more objects
  - detection: 0 or more objects; is there something in this image, and if so, where and what is it?
  - image segmentation: create a mask to cover the object in the picture

- harder tasks need more (labelled) data
- Semantic segmentation: pixel-wise classification; the output image is a contour (mask)
- metrics to address the class-imbalance problem
- NVIDIA DIGITS: single-node, multi-GPU
- AlexNet is image in, label out
- But segmentation is image in image out or contour out
- Current SOTA for segmentation is U-Net
- DENSE = fully connected

## Evaluation

- Continuous Rank Probability Score (CRPS)
- 2 networks that predict
- networks capture certainty
- predict a distribution, not just a single number: a CDF
- certainty means a steep curve; it's a safer bet to predict with less confidence if you're not sure
- create bins to turn it into what looks like a classification problem, then differentiate (see the sketch below)
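
A minimal sketch of CRPS against a binned CDF; the bin range, sigmoid-shaped CDFs, and example targets are illustrative assumptions:

```python
import numpy as np

def crps(pred_cdf, actual, bin_edges):
    # CRPS compares the predicted CDF to the ideal step function
    # that jumps from 0 to 1 at the actual value.
    heaviside = (bin_edges >= actual).astype(float)
    return np.mean((pred_cdf - heaviside) ** 2)

bin_edges = np.linspace(0, 100, 101)   # discretize the target range into bins
# A confident (steep) CDF centered at 50 vs. a hedged (shallow) one.
steep   = 1 / (1 + np.exp(-(bin_edges - 50) / 1.0))
shallow = 1 / (1 + np.exp(-(bin_edges - 50) / 10.0))
print(crps(steep, 50, bin_edges))    # low score: confident and right
print(crps(steep, 80, bin_edges))    # high score: confident and wrong
print(crps(shallow, 80, bin_edges))  # the hedged CDF scores better when wrong
```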

- Entropy: how concentrated the distribution is (see the sketch below)
- the output of softmax is not a calibrated probability
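
A minimal sketch of entropy as a concentration measure over a softmax-style output; the example vectors are made up:

```python
import numpy as np

def entropy(p):
    # Shannon entropy: low when the distribution is concentrated on one class.
    p = np.asarray(p)
    return -np.sum(p * np.log(p + 1e-12))

print(entropy([0.98, 0.01, 0.01]))  # ~0.1: concentrated, "confident"
print(entropy([0.34, 0.33, 0.33]))  # ~1.1: spread out, uncertain
```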

## MXNet

- good for multigpu out of the box
- good for distributed
- MXNet is fast
- imperative (eager execution): interactive, something happens when you run it, and you get back the intermediate layers/outputs. Great for debugging a network when you don't care about speed.

## Hyperparameters

### Batch Size

- the larger the batch size, the faster training runs
- tune the combination of batch size and learning rate to get better results (see the sketch below)
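
One common way to tie the two together is the linear scaling rule; this heuristic is an assumption here, not something stated in the notes:

```python
# Linear scaling rule (a common heuristic, not from the notes): when you
# multiply the batch size by k, multiply the learning rate by k as well.
base_lr, base_batch = 0.1, 256
batch_size = 1024
lr = base_lr * batch_size / base_batch
print(lr)  # 0.4
```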

### Momentum

- gradient descent is a first-order method
- momentum cheaply approximates second-order behavior (see the sketch below)
- it prevents you from getting trapped in saddle points
- true second-order optimization methods are "dog-slow"
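
A minimal sketch of the momentum update; the values of `eta` and `mu` are illustrative:

```python
import numpy as np

def momentum_step(w, grad, v, eta=0.01, mu=0.9):
    # Velocity accumulates a decaying average of past gradients, which
    # keeps you moving through flat regions and past saddle points.
    v = mu * v - eta * grad
    return w + v, v

w, v = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, -0.5])     # pretend gradient at w
w, v = momentum_step(w, grad, v)
print(w, v)
```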

### Batch Normalization

- normalization of activations within a batch (see the sketch below)
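
A minimal sketch of train-time batch normalization (running statistics for inference are omitted); `gamma` and `beta` stand for the learned scale and shift:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature using statistics computed within the batch,
    # then scale (gamma) and shift (beta) with learned parameters.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(32, 4) * 10 + 5    # a batch of 32 samples, 4 features
y = batch_norm(x)
print(y.mean(axis=0), y.std(axis=0))   # ~0 and ~1 per feature
```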

### Dropout

- co-adaptation: a single path through the network dominates
- dropout forces the network to learn multiple paths, like an ensemble; it slows down training, is the equivalent of Tikhonov regularization, and prevents overfitting (see the sketch below)
- overfitting = memorizing the training set as opposed to the underlying characteristics
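
A minimal sketch of inverted dropout, a common implementation of this idea (the notes don't specify one):

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    # Randomly zero a fraction p of activations so no single path can
    # dominate; scale the survivors ("inverted dropout") so the expected
    # activation stays the same at inference time.
    if not training:
        return x
    mask = (np.random.rand(*x.shape) >= p).astype(x.dtype)
    return x * mask / (1 - p)

x = np.ones((2, 8))
print(dropout(x, p=0.5))  # roughly half zeros, survivors scaled to 2.0
```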

## Papers

- HooChang GAN paper
- DeepMedic
- DeepPatient
- Andrew Ng paper on Atrial Fibrillation
- N-body

## Famous architectures

## General

- What is deep learning? subset of machine learning with automatic feature extraction.
- Neural network architecture: a series of layers; at each layer a different task is performed
- Input layer & output layer

- What is machine learning? feature extraction (traditionally hand-engineered)
- ILSVRC top-5 error on ImageNet: DL performs better than human accuracy for image classification
- Labeled public datasets for supervised learning
- curve fitting: linear regression; data, model, objective, optimization
- y = mx + b: fit a slope and intercept to minimize the residual, i.e., the difference between the fitted line and the actual data. The algorithm to do that is called ordinary least squares (see the sketch below).
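
A minimal ordinary-least-squares sketch on made-up data, using `np.polyfit` for the closed-form fit:

```python
import numpy as np

# Fit y = m*x + b by minimizing the squared residuals.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])   # roughly y = 2x + 1 plus noise
m, b = np.polyfit(x, y, deg=1)            # closed-form least squares
residuals = y - (m * x + b)
print(m, b, (residuals ** 2).sum())
```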

- concept of architecture
- pretrained model: they give you the architecture, the labels, and the weights
- post-training: visualize the weights and activations
- Dice metric is the F1 score, pixel-wise.
- ground-truth contour versus predicted contour: Dice = 2 * (area of intersection) / (sum of the two areas) (see the sketch below)
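
A minimal sketch of the Dice metric on two binary masks; the masks are made up:

```python
import numpy as np

def dice(pred, truth):
    # Dice = 2 * |A intersect B| / (|A| + |B|), on boolean masks.
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

truth = np.zeros((8, 8)); truth[2:6, 2:6] = 1   # ground-truth mask
pred  = np.zeros((8, 8)); pred[3:7, 3:7] = 1    # predicted mask, shifted
print(dice(pred, truth))  # 2*9/(16+16) ~ 0.56
```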

## Use cases

- real-time classification (pizza, hotdog, etc)
- recommendation engine
- handwritten digits
- Autonomous vehicles