ML From Scratch, Part 0: Introduction
“As an apprentice, every new magician must prove to his own satisfaction, at least once, that there is truly great power in magic.” - The Flying Sorcerers, by David Gerrold and Larry Niven
How do you know if you really understand something? You could just rely on the subjective experience of feeling like you understand. This sounds plausible - surely you of all people should know, right? But this runs head-first into in the Dunning-Kruger effect. Introspection is not a reliable guide to self-knowledge.
A more objective criterion is suggested by this pithy quote:
“What I cannot create, I do not understand.” - Richard Feynman
This is a very famous quote, but it’s not entirely unambiguous. If we’re going to use it as a guide, we’ll first have to break it down a little.
The most common interpretation might be, “what I cannot explain to a layperson or to a curious child, I do not understand.” Feynman unambiguously valued the ability to explain complex physics in plain English, as exemplified in this anecdote:
Before the commercial announcement of the Connection Machine CM-1 and all of our future products, Richard would give a sentence-by-sentence critique of the planned presentation. “Don’t say ‘reflected acoustic wave.’ Say [echo].” Or, “Forget all that ‘local minima’ stuff. Just say there’s a bubble caught in the crystal and you have to shake it out.” Nothing made him angrier than making something simple sound complicated. - Danny Hillis
As has often been remarked, explaining things well is often just as beneficial to the teacher as to the student; it helps reinforce ideas and builds intuition.
If that is all that Feynman had meant, though, why use the term “create” at all? Surely “explain” or “teach” is closer to the meaning discussed above. So while “explain in simple terms” is certainly part of it, “create” includes more than just that. Feynman gives us a clue in this story from his autobiography:
“During the conference I was staying with my sister in Syracuse. I brought the paper home and said to her,”I can’t understand these things that Lee and Yang are saying. It’s all so complicated."
“No,” she said, “what you mean is not that you can’t understand it, but that you didn’t invent it. You didn’t figure it out your own way, from hearing the clue. What you should do is imagine you’re a student again, and take this paper upstairs, read every line of it, and check the equations. Then you’ll understand it very easily.”
I took her advice, and checked through the whole thing, and found it to be very obvious and simple. I had been afraid to read it, thinking it was too difficult." - Richard Feynman, Surely You’re Joking, Mr. Feynman!
So in the context of math or physics, “create” means something closer to “derive from first principles by hand.” This is a very strong criteria! If a person could go into an empty office with a stack of scratch paper and a supply of sharp pencils, write down all first principles and proceed to derive every important theorem in their chosen field by hand then it must be conceded that such a person has some real knowledge.
In the context of computer science and programming, “create” might mean something like, “write a program from scratch that implements the given algorithm.” Since machine learning straddles the two, “create” means both: pose a machine learning problem mathematically, reduce the problem to some tractable form on paper, then write and implement an algorithm to produce a numerical approximation of the answer.
Now, if someone attempts this exercise, one of two things will happen. First, they may succeed completely on their first try. If so, great! They’ve proved what they set out to prove. But more likely, they’ll only succeed partially and get stuck at some point. Well, now they have the opportunity to correct a deficiency in their own understanding that they weren’t previously aware of, which is also a great outcome. After all, Feynman didn’t go in empty-handed - he took the challenging paper with him, and surely referenced it often. But at the end, his own notes would record his own complete derivation from start to finish and therefore serve as a testimonial to his own understanding.
It was in the spirit of the above considerations that in the fall of 2018 I set myself a goal: I would, over the course of the next year, derive and implement a representative sample of fundamental models and algorithms from machine learning, entirely from scratch and (insofar as was possible) entirely from my own understanding. Where I found my understanding sufficient, this would be an exercise in recreational programming; where my understanding failed me, it would be a chance to shore up the foundations.
This is possibly less insane than it may appear. Although there are aspects of machine learning that are very technical, for the most part the implementation of practical algorithms requires little more than some moderately advanced statistics, a few semesters of linear algebra, some general familiarity with numerical optimization and of course basic programming skills.
Because an open-ended project like this has a tendency to get out of control, I also decided to set some ground rules to help keep things sane.
First, mathematical derivations are in scope. This usually means posing and solving an optimization problem of some form, such as MLE. This is straight-forward for most of the algorithms on my list but could get a little hairy for things like backpropagation (which requires some fairly non-trivial matrix calculus) or SVMs (which basically requires the entire theory of Quadradic Programming). In practice, the presentation of these derivations is bottlenecked by the necessity of typesetting the equations in \(\LaTeX\), so these will typically be little more than sketches of the proofs.
Second, the algorithms used will be state-of-the-art, or at least reasonably so. For example, while we could solve linear regression with gradient descent, it would be a bit of a cop-out. Instead, we’ll implement what modern statistical software actually does under the hood. One particular consequence of this rule is that I will be implementing vectorized versions of the algorithms whenever possible: while iterating over every example in the training set is often easier to understand, it’s also pretty far removed from the realities of modern implementations which rely heavily on vectorization or even GPU acceleration for performance.
Third, I will implement and test all algorithms on some data set. As the Agile crowd would say, working software is the primary measure of progress. For convenience, I will use Python 3 and allow myself
numpy arrays… but not
numpy.linalg or other high-level libraries like
scipy.optimize; matrix multiplication is about the most complex operation we’ll let the libraries do for us. I considered not using
numpy at all, but it allows us to express algorithms in vectorized notation and frankly the algorithms are both clearer and more realistic with it. This restriction only applies to the implementation of the algorithm itself and excludes tests - I will routinely use higher-level libraries (like
sklearn.datasets) when testing the algorithm.
Fourth and finally, I’ll be publishing write-ups as I go. I’ve found in practice this can be more time consuming than the original exercise. However attempting to explain each algorithm in simple terms to a broad audience should help me to understand them a little better as well.
While I want to touch on every aspect of machine learning, there’s little point in implementing minor variations of basically the same algorithms over and over. Instead, let’s pick one or two representative algorithms from each category and leave it at that. We want to make sure that we get reasonable coverage over the types of ML problems (supervised/unsupervised, regression/classification, etc.), as well as good coverage over the most important algorithms that crop up repeatedly in ML.
Here’s a tentative list of algorithms I would like to tackle:
|Regression||Linear Regression||QR Decomposition|
|Classification||Logistic Regression||Gradient Descent|
|Classification||Decision Tree||Recursive Partition|
|Clustering||Gaussian Mixture Model||EM Algorithm|
|Clustering||Hierarchical Clustering||Agglomerative Clustering|
|Dimension Reduction||Principal Component Analysis||QR Algorithm|
|Recommendation||Low-Rank Matrix Approximation||Alternating Projections|
|Regression||General Additive Models||Backfitting|
|Classification||Support Vector Machines||SMO Algorithm|
Other candidates I considered but ultimately decided were out-of-scope:
- Factor Analysis - We already have PCA for dimensional reduction and GMM as an example of using the EM algorithm to solve for latent random variables.
- K-Means - We’ll do GMM instead.
- K-Nearest Neighbors - A naive algorithm is trivial while a serious algorithm would mostly be implementing a spatial index (such as R-Trees) which takes us pretty far afield from learning algorithms.
- Ensemble models - e.g. Random Forest or Boosted Trees. Not a good fit for the “from scratch” approach and can best be understood as “composing” two or more other mature models.
- CNN, RNN, etc. - We’ll do the vanilla deep neural network from scratch but more advanced topologies are best explored with a framework with automated differentiation.
- Learning-to-Rank - e.g. Bradley-Terry-Luce, etc. These can generally be reduced to logistic regression or viewed as latent variable models and solved with the EM algorithm.
- Felligi-Sunter Record Linkage - another take on the EM algorithm.