Deep Learning Curriculum


ML basics - stuff to know by heart

  • simplest classifier: always predict the class that is seen most often (the majority-class baseline)
  • your model should at least beat random guessing (and that baseline)
  • a working Anaconda installation, a working Jupyter notebook, plotting a histogram
  • You can write down a function and solve it analytically, or solve it numerically.
    • analytical method: solve for the derivative by hand, as learned in high-school calculus. If you know the function in closed form you can get the gradient easily
    • How would a blind man walk down to find the Colorado River at the bottom of the Grand Canyon? Feel the local slope and step downhill - numerical gradient descent (see the sketch after this list)
  • linear algebra view of the data: X is an n-by-m matrix; understand which dimension is the rows (samples) versus the columns (features)
  • Hyperplane - the simplest classifier you can think of: divide the space in two. It is the most general, simplest rule, with no flexibility in the rule, so a model where overfitting will be minimal; a curve, by contrast, could overfit.
  • Train/test/validate: does it generalize?
  • A hyperplane cannot solve the XOR dataset (XOR is not linearly separable).
  • Most classifier algorithms have settings - called hyperparameters - often optimized using search; Naive Bayes has among the fewest.
  • If you update the parameters by picking whatever does best on the test data, you're cheating.
  • General workflow: make a prediction, get the error, update the weights; if you update too aggressively you overshoot and get oscillations.
  • Eta (the learning rate) - how much of the error you apply at each update
  • In general, neural networks are universal function approximators: they find a function approximating the data.
  • Lego piece for building a neural network: the neuron fires if the summed signals exceed a certain threshold (the activation function)
    • The threshold is what allows us to NON-linearly map an input to an output.
  • Just find a good YouTube video to show in class; they can explain it better than you.
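A minimal numerical gradient-descent sketch in Python/NumPy, referenced above: the toy quadratic loss, the step count, and the value of eta are illustrative assumptions, not from the course material.

    import numpy as np

    def loss(w):
        # Toy quadratic "canyon": minimum at w = 3.0
        return (w - 3.0) ** 2

    def numerical_gradient(f, w, h=1e-5):
        # Central finite difference: no closed-form derivative needed
        return (f(w + h) - f(w - h)) / (2 * h)

    eta = 0.1          # learning rate: how much of the gradient to apply
    w = 0.0            # start far from the minimum
    for step in range(50):
        grad = numerical_gradient(loss, w)
        w -= eta * grad    # step downhill, like feeling for the slope
    print(w)               # approaches 3.0

A larger eta would overshoot the minimum and oscillate; a tiny eta converges but slowly, which is the trade-off the notes describe.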

DL 101

  • What this lab is not:
    • intro to ML
    • rigorous mathematical formalism
  • Takeaways: by the end of this you should be able to
    • adapt this workflow to your own case
    • know where to go for more info
  • Like a regression
  • line of best fit: you have a bunch of characteristics, take the dot product between the variables and the weights (a weighted sum), optimized to minimize the error
  • Intermediary step: a cats-vs-dogs detector built from pointy-ear detectors and fluffy detectors: learned higher-level features
  • DCNN: take a whole bunch of regressions and stack them - a giant function
  • Loss: a measure of how wrong we are = (Predicted - actual)
  • gradient operator - gives you the direction in which to move in order to minimize the loss
  • minimize how wrong we are.
  • universal function approximation theorem - learn any mapping from input to output, any function that's representable by math.
  • A day in the life of a weight in the network: the gradient tells you which direction to move but doesn't tell you how far.
  • a learning rate: scales how far. You want to be conservative; it converges more slowly but safely.
  • Convolutions (shown first in 1-D, then on 2-D pictures): apply a kernel of weights to a patch of source pixels and take a weighted sum to produce the destination pixels, a.k.a. the "activation map". Each output is connected to a small region rather than the whole input. (See the convolution sketch after this list.)
    • Edge detector: a single convolution applied with a single kernel; slide the weights over by one pixel at a time. Translation equivariant.
    • rectified activation (ReLU) - zero out negative outputs
    • Kernels are detectors
    • 2D case GIF
    • edge detector running in real time on the Taj Mahal: we've taken something really simple and it becomes ridiculously powerful
  • Now bundle convolutions into a layer. Learn the weights to come up with a crazy powerful detector
  • Convolutional NN - learning by flow chart
  • You can learn enough to memorize everything you've seen, but if someone walks in off the street the model is not going to know their name (memorization vs. generalization).
  • Subsample = pooling
  • locally connected vs fully connected
  • RGB is 3 channels, so a kernel of weights has 3 channels. If the input is grayscale, you need to change the architecture or change the input.
  • A kernel is a bundle of weights, and a layer is a bundle of kernels
  • A channel is the result of one kernel: 64 kernels give 64 channels, i.e. new images, i.e. the output feature map - it makes sense to look in a local region, and to look across all the convolution channels
  • Visualize what the DL architecture is looking at
  • attention map of what that kernel is paying attention to. We had edge detectors and vertical edges, and get a corner and curves in the second layer
  • the crops that most activate it
  • layer 3: we can get textures and primitives - combining lower-level concepts into higher-level patterns. This kernel detects wheels, faces, leopard print.
  • deeper layers build on the simpler concepts learned in the previous layers.
  • A filter will fire when it sees a dog or a person
  • Hyperparameter, knobs to tweak.
  • Myth busters:
    • CNNs only work with big data? False: 1 shot learning, low data case
    • CNNs only work for images? False: Applications in many domains - 1 d convolutions good for EEG brain wave data, speech records, detecting seizures.
    • Neural networks are all black boxes. Used to be true, more like gray boxes, becoming more white. Interpretation of what CNNs have learned.
    • Deconvnets make it possible to interpret what CNNs have learned.
  • Train/validate/test is a two-step split to make sure the process generalizes: with a one-step split, by testing too many models we absorb the test set by sheer chance. The test set is about validating the end-to-end process.
  • validation set is to tune hyperparameters
  • test set is for validating process. Kaggle leaderboard is the test set, only allowed to test 1-5 times per day.
  • Look at the training loss: if it has flatlined, either the model has converged or it isn't going to converge. Hopefully train loss and test loss converge to the same value.
  • Then look at the validation loss. You can still improve as long as it's decreasing. Early stopping: if the validation loss starts going up again, stop. (See the early-stopping sketch after this list.)
  • If the training loss explodes, make sure that no value in the gradient is too high.
  • Adam gives every weight its own separate learning rate calculated based on its history.
  • Catastrophic forgetting - set a low learning rate when fine-tuning, or else you'll jump out of the bowl.
  • Learning rates typically only go down during training. A warm-up phase at the start can prevent this.
  • Transfer learning addresses the small data problem
  • Larger more complicated networks
  • FCN-8 gives a higher-resolution segmentation map
  • bottom = input, top = output; bottom and top correspond to a hierarchy of semantics, from low level to high level.
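The convolution sketch referenced above: a single 2-D convolution with a hand-set vertical-edge-detector kernel followed by a rectified (ReLU) activation, in NumPy. The kernel values and the toy image are illustrative assumptions, not taken from the lab.

    import numpy as np

    def conv2d(image, kernel):
        # "Valid" convolution: slide the kernel over the image one pixel at a
        # time and take a weighted sum at each position (the activation map).
        kh, kw = kernel.shape
        oh = image.shape[0] - kh + 1
        ow = image.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    # Kernel (a bundle of weights) that responds to dark-to-bright vertical edges
    kernel = np.array([[-1, 0, 1],
                       [-1, 0, 1],
                       [-1, 0, 1]], dtype=float)

    # Toy image: dark on the left, bright on the right -> one vertical edge
    image = np.zeros((5, 8))
    image[:, 4:] = 1.0

    activation_map = conv2d(image, kernel)
    relu = np.maximum(activation_map, 0)   # rectified activation: zero out negatives
    print(relu)                            # fires only along the edge

Because the same weights are slid over every position, the detector fires wherever the edge appears: that is the translation equivariance mentioned above.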
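The early-stopping sketch referenced above. This is only the stopping logic; the validation-loss curve is simulated by a placeholder function (simulated_val_loss is a hypothetical stand-in for a real training loop and evaluation step).

    # Early stopping: keep training while validation loss improves; stop once it
    # starts going back up, after a little patience.

    def simulated_val_loss(epoch):
        # Stand-in curve: decreases, bottoms out around epoch 12, then rises
        return 1.0 / (epoch + 1) + 0.01 * max(0, epoch - 12)

    best_loss = float("inf")
    best_epoch = 0
    patience = 3                     # how many non-improving epochs to tolerate
    for epoch in range(100):
        val_loss = simulated_val_loss(epoch)
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            # in a real loop: save a checkpoint of the weights here
        elif epoch - best_epoch >= patience:
            print(f"early stop at epoch {epoch}, best was epoch {best_epoch}")
            break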

Practice

  • Aids to diagnosis
  • Create new metrics
  • average radiologist has 5 minutes per case
  • computer vision tasks, from least complicated to most - a gradient of difficulty (from Stanford University):
    • image classification: we don't care where the object is, just what it is
    • localization: where and what is the object in the given image, under the assumption of 1 or more objects
    • detection: 0 or more objects assumed - is there something in this image, and if so, where and what is it?
    • Image segmentation: create a mask to cover the object in the picture
  • harder tasks need more (labelled) data
  • Semantic segmentation: pixel-wise classification - output image is a contour
  • metrics to handle the class imbalance problem (see the sketch after this list)
  • NVIDIA DIGITS: one-node multiGPU
  • AlexNet is image in, label out
  • But segmentation is image in image out or contour out
  • Current SOTA is UNet for segmentation
  • DENSE = fully connected
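A minimal sketch of metrics that are more informative than raw accuracy under class imbalance (precision, recall, F1), assuming binary labels; the toy labels below are illustrative.

    import numpy as np

    def precision_recall_f1(y_true, y_pred):
        # Counts for the positive (rare) class
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # 95% negative cases: "always predict 0" scores 95% accuracy but 0 recall
    y_true = np.array([0] * 95 + [1] * 5)
    y_pred = np.zeros(100, dtype=int)
    print(precision_recall_f1(y_true, y_pred))   # (0.0, 0.0, 0.0)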

Evaluation

  • Continuous Ranked Probability Score (CRPS)
    • 2 networks that predict
    • networks capture certainty
    • predict a distribution, not just a single number. CDF
    • certainty means a steep curve. It's a safer bet to predict with less confidence if you're not sure.
    • Create bins to turn it into what looks like a classification problem, then differentiate (see the CRPS sketch after this list)
  • Entropy - how concentrated the distribution is
  • the output of softmax is not a calibrated probability.
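A hedged sketch of a discretized CRPS computation: the predicted CDF over value bins is compared against the step-function CDF of the true value. The bin grid and the two example CDFs below are illustrative assumptions.

    import numpy as np

    def crps(predicted_cdf, true_value, bin_edges):
        # Step-function CDF of the ground truth: 0 below the true value, 1 at/above it
        true_cdf = (bin_edges >= true_value).astype(float)
        # Mean squared difference between predicted and true CDFs over the bins
        return np.mean((predicted_cdf - true_cdf) ** 2)

    bins = np.arange(0, 600)                            # example value bins
    confident = 1 / (1 + np.exp(-(bins - 300) / 2))     # steep CDF: high certainty
    hedged    = 1 / (1 + np.exp(-(bins - 300) / 30))    # shallow CDF: less certain

    print(crps(confident, 300, bins))   # small score when the confident guess is right
    print(crps(confident, 350, bins))   # large penalty when confidently wrong
    print(crps(hedged, 350, bins))      # hedging is the safer bet when unsure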

MxNet

  • good for multigpu out of the box
  • good for distributed
  • MXNet is fast
  • imperative (eager execution): interactive - something happens when you run each line; you get back the intermediate layers/outputs; great for debugging a network when you don't care about speed (see the sketch below)
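A small sketch of imperative execution, assuming the MXNet 1.x NDArray API; the shapes and operations are illustrative.

    # Imperative / eager execution: each line runs immediately, so you can
    # inspect intermediate results as you go.
    import mxnet as mx

    x = mx.nd.random.uniform(shape=(2, 3))   # executes right now, no graph compilation
    w = mx.nd.ones((3, 4))
    hidden = mx.nd.dot(x, w)                 # intermediate output, available immediately
    print(hidden)                            # inspect it - great for debugging
    activated = mx.nd.relu(hidden)
    print(activated.asnumpy())               # copy back to NumPy whenever you like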

Hyperparameters

Batch Size

  • the larger the batch size, the faster training goes
  • tune the combination of batch size and learning rate together to get better results (see the sketch below)
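A hedged sketch of the common linear-scaling heuristic for coupling batch size and learning rate; the base batch size and base learning rate here are illustrative assumptions, not values from the course.

    # Linear-scaling heuristic: when you grow the batch size, grow the learning
    # rate proportionally so each epoch makes comparable progress.
    base_batch_size = 32
    base_lr = 0.01

    def scaled_lr(batch_size, base_batch=base_batch_size, lr=base_lr):
        return lr * batch_size / base_batch

    for bs in (32, 64, 256):
        print(bs, scaled_lr(bs))   # 0.01, 0.02, 0.08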

Momentum

  • gradient descent is a first order method
  • momentum approximates a second-order method
  • it prevents you from getting trapped in saddle points (see the sketch below)
  • true second-order optimization methods are "dog-slow"
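A minimal sketch of SGD with momentum on a toy loss; the momentum coefficient 0.9 is the usual default, and the quadratic loss is an illustrative assumption.

    import numpy as np

    def grad(w):
        # Gradient of a toy quadratic loss (w - 3)^2
        return 2 * (w - 3.0)

    eta, mu = 0.05, 0.9      # learning rate and momentum coefficient
    w, v = 0.0, 0.0          # weight and its velocity
    for _ in range(100):
        v = mu * v - eta * grad(w)   # velocity accumulates past gradients
        w = w + v                    # the accumulated velocity carries you through
                                     # flat regions and saddle points
    print(w)  # approaches 3.0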

Batch Normalization

  • normalize activations using statistics computed within each batch (see the sketch below)
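A minimal sketch of the batch-normalization forward pass in training mode, assuming a 2-D batch of activations; gamma, beta, and eps are the standard learnable scale, shift, and numerical-stability constant.

    import numpy as np

    def batch_norm_forward(x, gamma, beta, eps=1e-5):
        # Normalize each feature using the mean/variance of the current batch,
        # then rescale and shift with the learnable parameters gamma and beta.
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    x = np.random.randn(8, 4) * 10 + 5            # batch of 8 samples, 4 features
    out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
    print(out.mean(axis=0), out.std(axis=0))      # ~0 mean, ~1 std per feature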

Dropout

  • coadaptation, single path through the network dominates
  • Forcing the network to learn multiple paths, like an ensemble; it slows down training, is roughly equivalent to Tikhonov regularization, and prevents overfitting (see the sketch below)
  • i.e., memorizing the training set as opposed to learning the underlying characteristics
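A minimal sketch of (inverted) dropout at training time, assuming a drop probability p; scaling the kept units by 1/(1-p) keeps the expected activation unchanged, so the layer is the identity at test time.

    import numpy as np

    def dropout(x, p=0.5, training=True):
        # Randomly zero out activations so no single path can dominate.
        if not training:
            return x
        mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
        return x * mask

    activations = np.ones((2, 6))
    print(dropout(activations, p=0.5))            # roughly half the units zeroed
    print(dropout(activations, training=False))   # unchanged at test time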

Papers

  • HooChang GAN paper
  • DeepMedic
  • DeepPatient
  • Andrew Ng paper on Atrial Fibrillation
  • N-body

Famous architectures

General

  • What is deep learning? subset of machine learning with automatic feature extraction.
    • Neural network architecture - a series of layers; at each layer a different task is performed
    • Input layer & output layer
  • What is machine learning? Learning from data, but with manual feature extraction
  • ILSVRC-Top 5 Error on ImageNet - DL performs better than Human accuracy for image classification
  • Labeled public datasets for supervised learning
    • curve fitting, linear regression: data, model, objective, optimization
    • y = mx + b: trying to fit a slope, minimize the residual, i.e., the difference between the fitted line and the actual data; the algorithm to do that is called ordinary least squares.
  • concept of architecture
  • pretrained model - they give you the architecture, the labels and the weights
  • post-training - visualize the weights and activations
  • Dice metric is the F1 score, pixelized (see the sketch after this list).
    • Ground-truth contour versus predicted contour: Dice = 2 × area of intersection / (area of ground truth + area of prediction)
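A minimal sketch of the Dice metric on binary masks, assuming ground-truth and predicted masks of the same shape; the toy masks below are illustrative.

    import numpy as np

    def dice(ground_truth, prediction, eps=1e-7):
        # Dice = 2 * |intersection| / (|ground truth| + |prediction|),
        # i.e. the F1 score computed pixel-wise on binary masks.
        gt = ground_truth.astype(bool)
        pr = prediction.astype(bool)
        intersection = np.logical_and(gt, pr).sum()
        return 2.0 * intersection / (gt.sum() + pr.sum() + eps)

    gt = np.zeros((10, 10))
    gt[2:8, 2:8] = 1       # filled ground-truth contour
    pr = np.zeros((10, 10))
    pr[3:9, 3:9] = 1       # filled predicted contour, shifted by one pixel
    print(dice(gt, pr))    # partial overlap -> Dice < 1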


Use cases

  • real-time classification (pizza, hotdog, etc)
  • recommendation engine
  • handwritten digits
  • Autonomous vehicles