ML From Scratch, Part 0: Introduction


Motivation

“As an apprentice, every new magician must prove to his own satisfaction, at least once, that there is truly great power in magic.” - The Flying Sorcerers, by David Gerrold and Larry Niven

How do you know if you really understand something? You could just rely on the subjective experience of feeling like you understand. This sounds plausible - surely you of all people should know, right? But this runs head-first into in the Dunning-Kruger effect. Introspection is not a reliable guide to self-knowledge.

A more objective criterion is suggested by this pithy quote:

“What I cannot create, I do not understand.” - Richard Feynman

This is a very famous quote, but it’s not entirely unambiguous. If we’re going to use it as a guide, we’ll first have to break it down a little.

The most common interpretation might be, “what I cannot explain to a layperson or to a curious child, I do not understand.” Feynman unambiguously valued the ability to explain complex physics in plain English, as exemplified in this anecdote:

Before the commercial announcement of the Connection Machine CM-1 and all of our future products, Richard would give a sentence-by-sentence critique of the planned presentation. “Don’t say ‘reflected acoustic wave.’ Say [echo].” Or, “Forget all that ‘local minima’ stuff. Just say there’s a bubble caught in the crystal and you have to shake it out.” Nothing made him angrier than making something simple sound complicated. - Danny Hillis

As has often been remarked, explaining things well is often just as beneficial to the teacher as to the student; it helps reinforce ideas and builds intuition.

If that is all that Feynman had meant, though, why use the term “create” at all? Surely “explain” or “teach” is closer to the meaning discussed above. So while “explain in simple terms” is certainly part of it, “create” includes more than just that. Feynman gives us a clue in this story from his autobiography:

“During the conference I was staying with my sister in Syracuse. I brought the paper home and said to her,”I can’t understand these things that Lee and Yang are saying. It’s all so complicated."

“No,” she said, “what you mean is not that you can’t understand it, but that you didn’t invent it. You didn’t figure it out your own way, from hearing the clue. What you should do is imagine you’re a student again, and take this paper upstairs, read every line of it, and check the equations. Then you’ll understand it very easily.”

I took her advice, and checked through the whole thing, and found it to be very obvious and simple. I had been afraid to read it, thinking it was too difficult." - Richard Feynman, Surely You’re Joking, Mr. Feynman!

So in the context of math or physics, “create” means something closer to “derive from first principles by hand.” This is a very strong criteria! If a person could go into an empty office with a stack of scratch paper and a supply of sharp pencils, write down all first principles and proceed to derive every important theorem in their chosen field by hand then it must be conceded that such a person has some real knowledge.

In the context of computer science and programming, “create” might mean something like, “write a program from scratch that implements the given algorithm.” Since machine learning straddles the two, “create” means both: pose a machine learning problem mathematically, reduce the problem to some tractable form on paper, then write and implement an algorithm to produce a numerical approximation of the answer.

Now, if someone attempts this exercise, one of two things will happen. First, they may succeed completely on their first try. If so, great! They’ve proved what they set out to prove. But more likely, they’ll only succeed partially and get stuck at some point. Well, now they have the opportunity to correct a deficiency in their own understanding that they weren’t previously aware of, which is also a great outcome. After all, Feynman didn’t go in empty-handed - he took the challenging paper with him, and surely referenced it often. But at the end, his own notes would record his own complete derivation from start to finish and therefore serve as a testimonial to his own understanding.

Ground Rules

It was in the spirit of the above considerations that in the fall of 2018 I set myself a goal: I would, over the course of the next year, derive and implement a representative sample of fundamental models and algorithms from machine learning, entirely from scratch and (insofar as was possible) entirely from my own understanding. Where I found my understanding sufficient, this would be an exercise in recreational programming; where my understanding failed me, it would be a chance to shore up the foundations.

This is possibly less insane than it may appear. Although there are aspects of machine learning that are very technical, for the most part the implementation of practical algorithms requires little more than some moderately advanced statistics, a few semesters of linear algebra, some general familiarity with numerical optimization and of course basic programming skills.

Because an open-ended project like this has a tendency to get out of control, I also decided to set some ground rules to help keep things sane.

First, mathematical derivations are in scope. This usually means posing and solving an optimization problem of some form, such as MLE. This is straight-forward for most of the algorithms on my list but could get a little hairy for things like backpropagation (which requires some fairly non-trivial matrix calculus) or SVMs (which basically requires the entire theory of Quadradic Programming). In practice, the presentation of these derivations is bottlenecked by the necessity of typesetting the equations in \(\LaTeX\), so these will typically be little more than sketches of the proofs.

Second, the algorithms used will be state-of-the-art, or at least reasonably so. For example, while we could solve linear regression with gradient descent, it would be a bit of a cop-out. Instead, we’ll implement what modern statistical software actually does under the hood. One particular consequence of this rule is that I will be implementing vectorized versions of the algorithms whenever possible: while iterating over every example in the training set is often easier to understand, it’s also pretty far removed from the realities of modern implementations which rely heavily on vectorization or even GPU acceleration for performance.

Third, I will implement and test all algorithms on some data set. As the Agile crowd would say, working software is the primary measure of progress. For convenience, I will use Python 3 and allow myself numpy arrays… but not numpy.linalg or other high-level libraries like scipy.optimize; matrix multiplication is about the most complex operation we’ll let the libraries do for us. I considered not using numpy at all, but it allows us to express algorithms in vectorized notation and frankly the algorithms are both clearer and more realistic with it. This restriction only applies to the implementation of the algorithm itself and excludes tests - I will routinely use higher-level libraries (like numpy.linalg, pandas, scipy, or sklearn.datasets) when testing the algorithm.

Fourth and finally, I’ll be publishing write-ups as I go. I’ve found in practice this can be more time consuming than the original exercise. However attempting to explain each algorithm in simple terms to a broad audience should help me to understand them a little better as well.

Project Scope

While I want to touch on every aspect of machine learning, there’s little point in implementing minor variations of basically the same algorithms over and over. Instead, let’s pick one or two representative algorithms from each category and leave it at that. We want to make sure that we get reasonable coverage over the types of ML problems (supervised/unsupervised, regression/classification, etc.), as well as good coverage over the most important algorithms that crop up repeatedly in ML.

Here’s a tentative list of algorithms I would like to tackle:

Problem Model Algorithm Article
Regression Linear Regression QR Decomposition Part 1
Classification Logistic Regression Gradient Descent Part 2
Classification Neural Network Backpropagation Part 3
Classification Decision Tree Recursive Partition Part 4
Clustering Gaussian Mixture Model EM Algorithm Part 5
Clustering Hierarchical Clustering Agglomerative Clustering TBD
Dimension Reduction Principal Component Analysis QR Eigenvalue Algorithm TBD
Recommendation Low-Rank Matrix Approximation Alternating Projections TBD
Regression General Additive Models Backfitting TBD
Classification Support Vector Machines SMO Algorithm TBD

Other candidates I considered but ultimately decided were out-of-scope:

  • Factor Analysis - We already have PCA for dimensional reduction and GMM as an example of using the EM algorithm to solve for latent random variables.
  • K-Means - We’ll do GMM instead, since k-means is just GMM with hard assignment.
  • K-Nearest Neighbors - A naive algorithm is trivial while a serious algorithm would mostly involve implementing a spatial index (such as R-Trees) which takes us pretty far afield from learning algorithms.
  • Ensemble models - e.g. Random Forest or Boosted Trees. Not a good fit for the “from scratch” approach and can best be understood as “composing” two or more other mature models.
  • CNN, RNN, etc. - We’ll do the vanilla deep neural network from scratch but more advanced topologies are best explored with a framework with automated differentiation.
  • Learning-to-Rank - e.g. Bradley-Terry-Luce, etc. These can generally be reduced to logistic regression or viewed as latent variable models and solved with the EM algorithm.
  • Felligi-Sunter Record Linkage - another take on the EM algorithm, and requiring to many prerequisites like Jaro-Winkler distance.

Bottom-Up Approach to Machine Learning

In the spirit of the Feynman technique, let’s spend a few minutes talking through the problem in plain English and see if we can understand why machine learning seems to focus so heavily on a few mathematical techniques and approaches; this, in turn, should make it clear why it’s worth understanding these techniques in depth.

The problem, in the broadest possible terms, is to get a computer to learn how to do something. This is in contrast to traditional programming, where the computer does not usually “learn” anything, but follows a program written by a human programmer. Computers also aren’t very good at “doing” most things, although they are very good (and very fast!) at the few things they can do.

So, what are computers good at? In decreasing order (increasing by the amount of time it takes) computers can do the following:

  1. Addition and subtraction
  2. Multiplication
  3. Division
  4. Comparing two numbers to decide what to do next
  5. Other math functions like exp(), log(), sin(), cos(), etc.
  6. Remembering a billion numbers
  7. Looking something up in a file or database
  8. Talking to another computer over a network

This fairly standard set of costs actually leads directly to some important insights that guide research into practical machine learning.

First, we want to restrict ourselves as much as possible to simple arithmetic. While we may occasionally allow ourselves a division or even, gasp, an exponentiation, we really want to stick to fast operations like addition, multiplication, taking the greater of two numbers with max(a,b), or taking the sign of a number with sign(a).

Second, any “learning” we do should be in the form of updating a structured set of numbers. We call these the “parameters” to distinguished them from the “data.” The parameters may be shaped like a vector, a matrix, or a tree, but if we want to combine parameters and data with simple arithmetic, then both must ultimately be represented as data structures with numeric values.

On the other hand, we want to avoid representing learning as a formatted string or program. For example, the internal state of our learning algorithm could literally be a a string describing a C program:

float f(float* x) { 
    float z = 42; 
    if ( x[0] < 5 && x[1] > 2 ) z -= 10; 
    if ( x[2] > 7 || x[5] == 2 ) z += 3; 
    for ( int i=6; i<11; i++ ) { z += x[i]; }
    if ( x[1] == 1 ) {
        for ( int i=3; i<5; i++ ) { z -= 2 * x[i]; }
    }
    return z;
}

To apply this to data, we would compile this C program and pass our data into the function f(). To “learn”, the algorithm would add, remove, and modify individual lines, characters, or perhaps syntactic statements or expressions. This is sometimes called genetic programming. To be 100% clear, this is only bad if we allow arbitrary programs involving AND, OR, NOT, if/else, while, for, intermediate variables, and the like. Genetic programming can work well if the “genes” of the program are very carefully designed. Indeed, it is sometimes used as the “top level” learning algorithm in so-called automated machine learning frameworks such as TPOT. However, for the kind of fitting and optimization we’re mainly interested in, genetic algorithms are hopelessly inefficient.

Why is learning an arbitrary program problematic? Does it simply not work? Surely any equation we write down could also be represented in a more general form as a program, and surely we could find that program by exhaustive breadth-first search if necessary. And isn’t it also true that every program has a Gödel number? So how is this fundamentally any different than learning a set of numbers?

The problem isn’t that it doesn’t work, or that there’s anything wrong with that approach in theory. The problem is simply that the space of programs we would need to search is extremely large (the number of legal programs grows exponentially with the length of the program with very high fan-out), and it is exceedingly difficult to know if we’re getting “closer” to the right answer or not. That’s a bad combination and means that “sufficient time” is often a lot longer than we’re willing to wait.

To illustrate that second point, consider this (correct) program which finds the greatest common divisor of a pair of numbers:

def gcd(x, y):
    while y != 0:
        (x, y) = (y, x % y)
    return x

Let’s say that the "def gcd(x, y)" is fixed as part of the problem specification. Then there’s literally not a single character, word, or symbol we could change in the body of that function which would not make it incorrect. If I change % to *, it doesn’t terminate, if I change y != 0 to y != 1 it’s so completely wrong it can never return a correct solution even by accident, and so on. Therefore, in the space of all possible programs, this correct program is surrounded on all sides by wildly incorrect programs. That means that a greedy or even an evolutionary algorithm is unlikely to find this elegant program. It is possible to find it via exhaustive breadth first search (where depth is the length of the program) but this is brute force and hard to scale.

So, in practical machine learning, we do not try to learn arbitrary programs, we learn parameters for functions from some family of functions. For example, let’s say a data point is represented as the vector \(\vec{x} \in \mathbb{R}^n\) and our parameters are the vector \(\vec{p} \in \mathbb{R}^n\). Then, keeping in mind that we mostly want to stick to arithmetic, the simplest thing we could do is a dot product between these two vectors:

\[ f(\vec{x} ; \vec{p} ) = \vec{x} \cdot \vec{p} = \sum_{i=1}^n x_i p_i \]

That looks too simple to work, but in fact we’ll see in the next article in this series, that it works surprisingly well for a very large class of problems. Not for all problems of course; and throughout the series we will gradually add complexity to the representation. This will, in turn, create problems for us in terms of fitting/training these more complex models. In parallel, we will develop ever more powerful techniques to deal with these problems as they arise. In particular we will see again and again how a well chosen representation will allow us to find very fast algorithms for learning optimal parameters.

Conclusion

Next time, we’ll start with linear regression, followed by logistic regression and some simple neural networks. As new articles are added, you can find them collected under the “from scratch” tag.