[C5] Andrew Ng - Structuring Machine Learning Projects
About this Course
You will learn how to build a successful machine learning project. If you aspire to be a technical leader in AI, and know how to set direction for your team‘s work, this course will show you how.
Much of this content has never been taught elsewhere, and is drawn from my experience building and shipping many deep learning products. This course also has two "flight simulators" that let you practice decision-making as a machine learning project leader. This provides "industry experience" that you might otherwise get only after years of ML work experience.
After 2 weeks, you will:
- Understand how to diagnose errors in a machine learning system, and
- Be able to prioritize the most promising directions for reducing error
- Understand complex ML settings, such as mismatched training/test sets, and comparing to and/or surpassing human-level performance
- Know how to apply end-to-end learning, transfer learning, and multi-task learning
I‘ve seen teams waste months or years through not understanding the principles taught in this course. I hope this two week course will save you months of time.
This is a standalone course, and you can take this so long as you have basic machine learning knowledge. This is the third course in the Deep Learning Specialization.
ML Strategy (1)
Why ML Strategy - 2m
0:00
Hi, welcome to this course on how to structure your machine learning project, that is, on machine learning strategy. I hope that through this course you will learn how to much more quickly and efficiently get your machine learning systems working. So, what is machine learning strategy? Let's start with a motivating example. Let's say you are working on your cat classifier. And after working on it for some time, you've gotten your system to have 90% accuracy, but this isn't good enough for your application. You might then have a lot of ideas as to how to improve your system. For example, you might think, well, let's collect more data, more training data. Or you might say, maybe your training set isn't diverse enough yet; you should collect images of cats in more diverse poses, or maybe a more diverse set of negative examples. Or maybe you want to train the algorithm longer with gradient descent. Or maybe you want to try a different optimization algorithm, like the Adam optimization algorithm. Or maybe trying a bigger network or a smaller network, or maybe you want to try dropout or maybe L2 regularization. Or maybe you want to change the network architecture, such as changing activation functions, changing the number of hidden units, and so on and so on. When trying to improve a deep learning system, you often have a lot of ideas or things you could try. And the problem is that if you choose poorly, it is entirely possible that you end up spending six months charging in some direction only to realize after six months that it didn't do any good. For example, I've seen some teams spend literally six months collecting more data only to realize after six months that it barely improved the performance of their system. So, assuming you don't have six months to waste on your problem, wouldn't it be nice if you had quick and effective ways to figure out which of all of these ideas, and maybe even other ideas, are worth pursuing and which ones you can safely discard. So what I hope to do in this course is teach you a number of strategies, that is, ways of analyzing a machine learning problem that will point you in the direction of the most promising things to try. What I will do in this course also is share with you a number of lessons I've learned through building and shipping a large number of deep learning products. And I think these materials are actually quite unique to this course. I don't see a lot of these ideas being taught in universities' deep learning courses, for example. It turns out also that machine learning strategy is changing in the era of deep learning, because the things you could do are now different with deep learning algorithms than with the previous generation of machine learning algorithms. I hope that these ideas will help you become much more effective at getting your deep learning systems to work.
Orthogonalization - 10m
0:00
One of the challenges with building machine learning systems is that there‘s so many things you could try, so many things you could change. Including, for example, so many hyperparameters you could tune.
0:10
One of the things I've noticed about the most effective machine learning people is that they're very clear-eyed about what to tune in order to try to achieve one effect. This is a process we call orthogonalization. Let me tell you what I mean.
0:25
Here‘s a picture of an old school television, with a lot of knobs that you could tune to adjust the picture in various ways.
0:35
So for these old TV sets, maybe there was one knob to adjust how tall vertically your image is and another knob to adjust how wide it is. Maybe another knob to adjust how trapezoidal it is, another knob to adjust how much to move the picture left and right, another one to adjust how much the picture‘s rotated, and so on.
0:58
And what TV designers had spent a lot of time doing was to build the circuitry, really often analog circuitry back then, to make sure each of the knobs had a relatively interpretable function. Such as one knob to tune this, one knob to tune this, one knob to tune this, and so on.
1:17
In contrast, imagine if you had a knob that tunes 0.1 x how tall the image is, + 0.3 x how wide the image is,- 1.7 x how trapezoidal the image is, + 0.8 times the position of the image on the horizontal axis, and so on. If you tune this knob, then the height of the image, the width of the image, how trapezoidal it is, how much it shifts, it all changes all at the same time. If you have a knob like that, it‘d be almost impossible to tune the TV so that the picture gets centered in the display area. So in this context, orthogonalization refers to that the TV designers had designed the knobs so that each knob kind of does only one thing. And this makes it much easier to tune the TV, so that the picture gets centered where you want it to be.
2:14
Here‘s another example of orthogonalization. If you think about learning to drive a car, a car has three main controls, which are steering, the steering wheel decides how much you go left or right, acceleration, and braking. So these three controls, or really one control for steering and another two controls for your speed. It makes it relatively interpretable, what your different actions through different controls will do to your car. But now imagine if someone were to build a car so that there was a joystick, where one axis of the joystick controls 0.3 x your steering angle,- 0.8 x your speed. And you had a different control that controls 2 x the steering angle, + 0.9 x the speed of your car. In theory, by tuning these two knobs, you could get your car to steer at the angle and at the speed you want. But it‘s much harder than if you had just one single control for controlling the steering angle, and a separate, distinct set of controls for controlling the speed. So the concept of orthogonalization refers to that, if you think of one dimension of what you want to do as controlling a steering angle, and another dimension as controlling your speed. Then you want one knob to just affect the steering angle as much as possible, and another knob, in the case of the car, is really acceleration and braking, that controls your speed. But if you had a control that mixes the two together, like a control like this one that affects both your steering angle and your speed, something that changes both at the same time, then it becomes much harder to set the car to the speed and angle you want. And by having orthogonal, orthogonal means at 90 degrees to each other. By having orthogonal controls that are ideally aligned with the things you actually want to control, it makes it much easier to tune the knobs you have to tune. To tune the steering wheel angle, and your accelerator, your braking, to get the car to do what you want. So how does this relate to machine learning?
4:32
For a supervised learning system to do well, you usually need to tune the knobs of your system to make sure that four things hold true. First, is that you usually have to make sure that you‘re at least doing well on the training set. So performance on the training set needs to pass some acceptability assessment. For some applications, this might mean doing comparably to human level performance. But this will depend on your application, and we‘ll talk more about comparing to human level performance next week.
5:04
But after doing well on the training set, you then hope that this leads to also doing well on the dev set. And you then hope that this also does well on the test set. And finally, you hope that doing well on the test set on the cost function results in your system performing in the real world. So you hope that this results in happy cat picture app users, for example. So to relate back to the TV tuning example, if the picture of your TV was either too wide or too narrow, you wanted one knob to tune in order to adjust that. You don't want to have to carefully adjust five different knobs, which also affect different things. You want one knob to just affect the width of your TV image. So in a similar way, if your algorithm is not fitting the training set well on the cost function, you want one knob, yes, that's my attempt to draw a knob, or maybe one specific set of knobs that you can use, to make sure you can tune your algorithm to make it fit well on the training set. So the knobs you use to tune this are, you might train a bigger network.
6:16
Or you might switch to a better optimization algorithm, like the Adam optimization algorithm, and so on, into some other options we‘ll discuss later this week and next week.
6:28
In contrast, if you find that the algorithm is not fitting the dev set well, then there‘s a separate set of knobs. Yes, that‘s my not very artistic rendering of another knob, you want to have a distinct set of knobs to try. So for example, if your algorithm is not doing well on the dev set, it‘s doing well on the training set but not on the dev set, then you have a set of knobs around regularization that you can use to try to make it satisfy the second criteria. So the analogy is, now that you‘ve tuned the width of your TV set, if the height of the image isn‘t quite right, then you want a different knob in order to tune the height of the TV image. And you want to do this hopefully without affecting the width of your TV image too much. And getting a bigger training set would be another knob you could use, that helps your learning algorithm generalize better to the dev set. Now, having adjusted the width and height of your TV image, well, what if it doesn‘t meet the third criteria? What if you do well on the dev set but not on the test set? If that happens, then the knob you tune is, you probably want to get a bigger dev set. Because if it does well on the dev set but not the test set, it probably means you‘ve overtuned to your dev set, and you need to go back and find a bigger dev set.
7:52
And finally, if it does well on the test set, but it isn‘t delivering to you a happy cat picture app user, then what that means is that you want to go back and change either the dev set or the cost function.
8:13
Because if doing well on the test set according to some cost function doesn't correspond to your algorithm doing what you need it to do in the real world, then it means that either your dev/test set distribution isn't set correctly, or your cost function isn't measuring the right thing. I know I'm going over these examples quite quickly, but we'll go much more into detail on these specific knobs later this week and next week. So if you aren't following all the details right now, don't worry about it. But I want to give you a sense of this orthogonalization process, that you want to be very clear about which of these maybe four issues, the different things you could tune, you are trying to address. And when I train a neural network, I tend not to use early stopping. It's not a bad technique; quite a lot of people do it. But I personally find early stopping difficult to think about. Because this is one knob that simultaneously affects how well you fit the training set, because if you stop early, you fit the training set less well. It also simultaneously is often done to improve your dev set performance. So this is one knob that is less orthogonalized, because it simultaneously affects two things. It's like a knob that simultaneously affects both the width and the height of your TV image. And it doesn't mean that it's bad, not to use; you can use it if you want. But when you have more orthogonalized controls, such as these other ones that I'm writing down here, then it just makes the process of tuning your network much easier. So I hope that gives you a sense of what orthogonalization means. Just like when you look at the TV image, it's nice if you can say, my TV image is too wide, so I'm going to tune this knob, or it's too tall, so I'm going to tune that knob, or it's too trapezoidal, so I'm going to have to tune that knob. In machine learning, it's nice if you can look at your system and say, this piece of it is wrong. It does not do well on the training set, it does not do well on the dev set, it does not do well on the test set, or it's doing well on the test set but just not in the real world. And then you can figure out exactly what's wrong, and have exactly one knob, or a specific set of knobs, that helps to just solve the problem that is limiting the performance of your machine learning system. So what we're going to do this week and next week is go through how to diagnose what exactly is the bottleneck to your system's performance, as well as identify the specific set of knobs you could use to tune your system to improve that aspect of its performance. So let's start going more into the details of this process.
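To make this diagnostic chain concrete, here is a minimal sketch in Python. It is not from the course; the error thresholds and the knob lists are illustrative assumptions, chosen only to show how each of the four criteria points to its own distinct set of knobs.

```python
# Minimal sketch of the orthogonalization checklist described above.
# The thresholds and knob lists are illustrative assumptions, not from the course.

def suggest_knobs(train_err, dev_err, test_err, happy_real_world_users,
                  acceptable_train_err=0.05):
    """Check the four criteria in order and return the set of knobs to reach for."""
    if train_err > acceptable_train_err:
        return ["bigger network", "better optimizer (e.g. Adam)", "train longer"]
    if dev_err - train_err > 0.02:       # fits the training set but not the dev set
        return ["regularization", "bigger training set"]
    if test_err - dev_err > 0.02:        # overtuned to the dev set
        return ["bigger dev set"]
    if not happy_real_world_users:       # metric / dev set don't reflect reality
        return ["change dev/test set", "change cost function or metric"]
    return ["nothing obvious - ship it"]

print(suggest_knobs(train_err=0.08, dev_err=0.10, test_err=0.11,
                    happy_real_world_users=True))
```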
Single number evaluation metric - 7m
0:00
Whether you‘re tuning hyperparameters, or trying out different ideas for learning algorithms, or just trying out different options for building your machine learning system. You‘ll find that your progress will be much faster if you have a single real number evaluation metric that lets you quickly tell if the new thing you just tried is working better or worse than your last idea. So when teams are starting on a machine learning project, I often recommend that you set up a single real number evaluation metric for your problem. Let‘s look at an example.
0:32
You‘ve heard me say before that applied machine learning is a very empirical process. We often have an idea, code it up, run the experiment to see how it did, and then use the outcome of the experiment to refine your ideas. And then keep going around this loop as you keep on improving your algorithm. So let‘s say for your classifier, you had previously built some classifier A. And by changing the hyperparameters and the training sets or some other thing, you‘ve now trained a new classifier, B. So one reasonable way to evaluate the performance of your classifiers is to look at its precision and recall. The exact details of what‘s precision and recall don‘t matter too much for this example. But briefly, the definition of precision is, of the examples that your classifier recognizes as cats,
1:23
What percentage actually are cats?
1:32
So if classifier A has 95% precision, this means that when classifier A says something is a cat, there's a 95% chance it really is a cat. And recall is, of all the images that really are cats, what percentage were correctly recognized by your classifier? So what percentage of actual cats are correctly recognized?
2:04
So if classifier A is 90% recall, this means that of all of the images in, say, your dev set that really are cats, classifier A accurately pulled out 90% of them. So don't worry too much about the definitions of precision and recall. It turns out that there's often a tradeoff between precision and recall, and you care about both. You want that, when the classifier says something is a cat, there's a high chance it really is a cat. But of all the images that are cats, you also want it to pull out a large fraction of them as cats. So it might be reasonable to try to evaluate the classifiers in terms of their precision and recall. The problem with using precision and recall as your evaluation metric is that if classifier A does better on recall, which it does here, and classifier B does better on precision, then you're not sure which classifier is better.
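For reference, the standard definitions behind the informal description above can be written as:

```latex
\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}},
\qquad
\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
```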
3:03
And if you‘re trying out a lot of different ideas, a lot of different hyperparameters, you want to rather quickly try out not just two classifiers, but maybe a dozen classifiers and quickly pick out the, quote, best ones, so you can keep on iterating from there.
3:19
And with two evaluation metrics, it is difficult to know how to quickly pick one of the two or quickly pick one of the ten.
3:29
So what I recommend is rather than using two numbers, precision and recall, to pick a classifier, you just have to find a new evaluation metric that combines precision and recall.
3:41
In the machine learning literature, the standard way to combine precision and recall is something called an F1 score. And the details of the F1 score aren't too important, but informally, you can think of this as the average of precision, P, and recall, R. Formally, the F1 score is defined by this formula: 2 / (1/P + 1/R). And in mathematics, this function is called the harmonic mean of precision P and recall R. But less formally, you can think of this as some way that averages precision and recall.
4:22
Only instead of taking the arithmetic mean, you take the harmonic mean, which is defined by this formula. And it has some advantages in terms of trading off precision and recall. But in this example, you can then see right away that classifier A has a better F1 score. And assuming F1 score is a reasonable way to combine precision and recall, you can then quickly select classifier A over classifier B.
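As a quick numerical illustration: classifier A's 95% precision and 90% recall come from the lecture, while classifier B's numbers are assumptions made up here, since the transcript only shows them on a slide.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: 2 / (1/P + 1/R)."""
    return 2 / (1 / precision + 1 / recall)

print(f1(0.95, 0.90))   # classifier A -> ~0.924
print(f1(0.98, 0.85))   # classifier B (assumed numbers) -> ~0.910
```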
4:48
So what I've found for a lot of machine learning teams is that having a well-defined dev set, which is how you're measuring precision and recall, plus a single number evaluation metric, sometimes I'll call it a single real number
5:04
evaluation metric, allows you to quickly tell if classifier A or classifier B is better, and therefore having a dev set plus a single number evaluation metric tends to speed up iterating.
5:21
It speeds up this iterative process of improving your machine learning algorithm. Let‘s look at another example.
5:29
Let‘s say you‘re building a cat app for cat lovers in four major geographies, the US, China, India, and other, the rest of the world. And let‘s say that your two classifiers achieve different errors
5:45
in data from these four different geographies. So algorithm A achieves 3% error on pictures submitted by US users and so on.
5:56
So it might be reasonable to keep track of how well your classifiers do in these different markets or these different geographies. But by tracking four numbers, it's very difficult to look at these numbers and quickly decide if algorithm A or algorithm B is superior. And if you're testing a lot of different classifiers, then it's just difficult to look at all these numbers and quickly pick one. So what I recommend in this example is, in addition to tracking your performance in the four different geographies, to also compute the average. And assuming that average performance is a reasonable single real number evaluation metric, by computing the average, you can quickly tell that it looks like algorithm C has the lowest average error. And you might then go ahead with that one. You have to pick an algorithm to keep on iterating from. So your workflow in machine learning is often: you have an idea, you implement it and try it out, and you want to know whether your idea helped. So what we've seen in this video is that having a single number evaluation metric can really improve your efficiency, or the efficiency of your team, in making those decisions. Now we're not yet done with the discussion on how to effectively set up evaluation metrics. In the next video, I'm going to share with you how to set up optimizing as well as satisficing metrics. So let's take a look at the next video.
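Here is a tiny sketch of "track per-geography errors, but decide on the average." Only the 3% US error for algorithm A is from the lecture; every other number below is made up purely for illustration.

```python
# The 3% US error for algorithm A comes from the lecture; every other number
# here is made up purely to illustrate "compute the average and pick the lowest".
errors = {
    "A": {"US": 0.03, "China": 0.07, "India": 0.05, "Other": 0.09},
    "B": {"US": 0.05, "China": 0.06, "India": 0.05, "Other": 0.10},
    "C": {"US": 0.04, "China": 0.05, "India": 0.04, "Other": 0.08},
}

avg_error = {name: sum(e.values()) / len(e) for name, e in errors.items()}
best = min(avg_error, key=avg_error.get)
print(avg_error, "->", best)   # algorithm C has the lowest average error in this made-up table
```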
Satisficing and Optimizing metric - 5m
0:00
It's not always easy to combine all the things you care about into a single real number evaluation metric. In those cases, I've found it sometimes useful to set up satisficing as well as optimizing metrics. Let me show you what I mean. Let's say that you've decided you care about the classification accuracy of your cat classifier. This could be F1 score or some other measure of accuracy. But let's say that in addition to accuracy, you also care about the running time, that is, how long it takes to classify an image. Classifier A takes 80 milliseconds, B takes 95 milliseconds, and C takes 1,500 milliseconds, that's 1.5 seconds, to classify an image. So one thing you could do is combine accuracy and running time into an overall evaluation metric, such as maybe an overall cost of accuracy minus 0.5 times running time. But maybe it seems a bit artificial to combine accuracy and running time using a formula like this, like a linear weighted sum of these two things. So here's something else you could do instead, which is that you might want to choose the classifier that maximizes accuracy, but subject to the constraint that the running time, that is the time it takes to classify an image, has to be less than or equal to 100 milliseconds. So in this case we would say that accuracy is an optimizing metric, because you want to maximize accuracy; you want to do as well as possible on accuracy. But the running time is what we call a satisficing metric, meaning that it just has to be good enough: it just needs to be less than 100 milliseconds, and beyond that you don't really care, or at least you don't care that much. So this would be a pretty reasonable way to trade off, or to put together, accuracy as well as running time. And it may be the case that so long as the running time is less than 100 milliseconds, your users won't care that much whether it's 100 milliseconds or 50 milliseconds or even faster. And by defining optimizing as well as satisficing metrics, this gives you a clear way to pick the, quote, best classifier, which in this case would be classifier B, because of all the ones with a running time better than 100 milliseconds, it has the best accuracy. So more generally, if you have N metrics that you care about, it's sometimes reasonable to pick one of them to be optimizing, so you want to do as well as is possible on that one, and then N minus 1 to be satisficing, meaning that so long as they reach some threshold, such as running time faster than 100 milliseconds, you don't care how much better they are beyond that threshold, but they do have to reach it. Here's another example. Let's say you're building a system to detect wake words, also called trigger words. So this refers to the voice control devices, like the Amazon Echo, which you wake up by saying Alexa, or some Google devices which you wake up by saying okay Google, or some Apple devices which you wake up by saying hey Siri, or some Baidu devices which you wake up by saying ni hao Baidu. Right, so these are the wake words you use to tell one of these voice control devices to wake up and listen to something you want to say. So you might care about the accuracy of your trigger word detection system, that is, when someone says one of these trigger words, how likely are you to actually wake up your device. And you might also care about the number of false positives.
So when no one actually said this trigger word, how often does it randomly wake up? So in this case, maybe one reasonable way of combining these two evaluation metrics might be to maximize accuracy, so when someone says one of the trigger words, maximize the chance that your device wakes up, subject to having at most one false positive every 24 hours of operation, right? So that your device randomly wakes up only once per day on average when no one is actually talking to it. So in this case, accuracy is the optimizing metric, and the number of false positives every 24 hours is the satisficing metric, where you'd be satisfied so long as there is at most one false positive every 24 hours. To summarize, if there are multiple things you care about, say that one is the optimizing metric, which you want to do as well as possible on, and one or more are satisficing metrics, where you'll be satisficed as long as they do better than some threshold. You now have an almost automatic way of quickly looking at multiple classifiers and picking the, quote, best one. Now, these evaluation metrics must be evaluated, or calculated, on a training set or a development set or maybe on the test set. So one of the things you also need to do is set up training, dev or development, as well as test sets. In the next video, I want to share with you some guidelines for how to set up training, dev, and test sets. So let's go on to the next video.
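Here is a minimal sketch of the "one optimizing metric, N-1 satisficing metrics" selection rule, using the running times from the lecture. The accuracy values are assumptions, chosen only so that the outcome matches the lecture's choice of classifier B.

```python
# Sketch of "one optimizing metric, N-1 satisficing metrics" using the
# accuracy / running-time example from the lecture (accuracies are assumed).
classifiers = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]

# Satisficing: runtime must be <= 100 ms. Optimizing: maximize accuracy.
feasible = [c for c in classifiers if c["runtime_ms"] <= 100]
best = max(feasible, key=lambda c: c["accuracy"])
print(best["name"])   # -> "B", the most accurate classifier that meets the threshold
```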
Train/dev/test distributions - 6m
0:00
The way you set up your training, dev (or development), and test sets can have a huge impact on how rapidly you or your team can make progress on building a machine learning application. And some teams, even teams in very large companies, set up these data sets in ways that really slow down, rather than speed up, the progress of the team. Let's take a look at how you can set up these data sets to maximize your team's efficiency. In this video, I want to focus on how you set up your dev and test sets. So, the dev set is also called the development set, or sometimes called the hold out cross validation set. And the workflow in machine learning is that you try a lot of ideas, train up different models on the training set, and then use the dev set to evaluate the different ideas and pick one. And keep innovating to improve dev set performance until, finally, you have one classifier that you're happy with, which you then evaluate on your test set. Now, let's say, by way of example, that you're building a cat classifier, and you are operating in these regions: the US, UK, other European countries, South America, India, China, other Asian countries, and Australia. So, how do you set up your dev set and your test set? Well, one way you could do so is to pick four of these regions. I'm going to use these four, but it could be four randomly chosen regions. And say that data from these four regions will go into the dev set. And the other four regions, I'm going to use these four, could be a randomly chosen four as well, those will go into the test set. It turns out this is a very bad idea, because in this example, your dev and test sets come from different distributions. I would, instead, recommend that you find a way to make your dev and test sets come from the same distribution. So, here's what I mean. One picture to keep in mind is that, I think, setting up your dev set, plus your single real number evaluation metric, is like placing a target and telling your team where you think is the bull's eye you want to aim at. Because, what happens once you've established that dev set and the metric is that the team can innovate very quickly, try different ideas, run experiments, and very quickly use the dev set and the metric to evaluate classifiers and try to pick the best one. So, machine learning teams are often very good at shooting different arrows into targets and innovating to get closer and closer to hitting the bullseye, that is, doing well on your metric on your dev set. And the problem with how we've set up the dev and test sets in the example on the left is that your team might spend months innovating to do well on the dev set, only to realize that, when you finally go to test them on the test set, the data from these four countries or these four regions at the bottom might be very different than the regions in your dev set. So, you might have a nasty surprise and realize that all the months of work you spent optimizing to the dev set is not giving you good performance on the test set. So, having dev and test sets from different distributions is like setting a target, having your team spend months trying to aim closer and closer to the bull's eye, only to realize after months of work that you'll say, "Oh wait, to test it, I'm going to move the target over here." And the team might say, "Well, why did you make us spend months optimizing for a different bull's eye when suddenly you can move the bull's eye to a different location somewhere else?"
So, to avoid this, what I recommend instead is that you take all this data and randomly shuffle it into the dev and test set, so that both the dev and test sets have data from all eight regions, and the dev and test sets really come from the same distribution, which is the distribution of all of your data mixed together. Here's another example. This is, actually, a true story but with some details changed. So, I know a machine learning team that actually spent several months optimizing on a dev set which was comprised of loan approvals for medium income zip codes. So, the specific machine learning problem was: given an input x about a loan application, can you predict y, which is whether or not they'll repay the loan? So, this helps you decide whether or not to approve a loan. And so, the dev set came from loan applications from medium income zip codes. Zip codes are what we call postal codes in the United States. But, after working on this for a few months, the team then suddenly decided to test this on data from low income zip codes, or low income postal codes. And, of course, the distribution of data for medium income and low income zip codes is very different. And the classifier, which they spent so much time optimizing in the former case, just didn't work well at all on the latter case. And so, this particular team actually wasted about three months of time and had to go back and really redo a lot of work. And, what happened here was, the team spent three months aiming for one target, and then, after three months, the manager asked, "Oh, how are you doing on hitting this other target?" This is a totally different location. And it just was a very frustrating experience for the team. So, what I recommend for setting up a dev set and test set is: choose a dev set and test set to reflect data you expect to get in the future and consider important to do well on. And, in particular, the dev set and the test set here should come from the same distribution. So, whatever type of data you expect to get in the future, and want to do well on, try to get data that looks like that. And, whatever that data is, put it into both your dev set and your test set. Because that way, you're putting the target where you actually want to hit, and you're having the team innovate very efficiently to hit that same target, hopefully the same target well. Since we haven't talked yet about how to set up a training set, we'll talk about the training set in a later video. But the important takeaway from this video is that setting up the dev set, as well as the evaluation metric, is really defining what target you want to aim at. And hopefully, by setting the dev set and the test set to the same distribution, you're really aiming at whatever target you hope your machine learning team will hit. The way you choose your training set will affect how well you can actually hit that target. But we can talk about that separately in a later video. So, I know some machine learning teams that could literally have saved themselves months of work had they followed the guidelines in this video. So, I hope these guidelines will help you, too. Next, it turns out that the size of your dev and test sets, how to choose the size of them, is also changing in the era of deep learning. Let's talk about that in the next video.
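A minimal sketch of the advice above: pool the data from all regions, shuffle it, and carve the dev and test sets out of the same mixed distribution. The function name, the tuple format of `examples`, and the split fractions are all assumptions for illustration.

```python
import random

# Minimal sketch: mix data from all eight regions, then split, so the dev and
# test sets come from the same distribution. `examples` is a hypothetical list
# of (image, label, region) tuples; the 2.5%/2.5% split sizes are assumptions.
def make_dev_test(examples, dev_frac=0.025, test_frac=0.025, seed=0):
    examples = list(examples)
    random.Random(seed).shuffle(examples)          # shuffle across all regions
    n_dev = int(len(examples) * dev_frac)
    n_test = int(len(examples) * test_frac)
    dev = examples[:n_dev]
    test = examples[n_dev:n_dev + n_test]
    train = examples[n_dev + n_test:]
    return train, dev, test
```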
Size of the dev and test sets - 5m
0:00
In the last video, you saw how your dev and test sets should come from the same distribution, but how large should they be? The guidelines to help set up your dev and test sets are changing in the deep learning era. Let's take a look at some best practices. You might have heard of the rule of thumb in machine learning of taking all the data you have and using a 70/30 split into a train and test set; or, if you had to set up train, dev, and test sets, maybe you would use 60% training, 20% dev, and 20% test. In earlier eras of machine learning, this was pretty reasonable, especially back when data set sizes were just smaller. So if you had a hundred examples in total, these 70/30 or 60/20/20 rules of thumb would be pretty reasonable. If you had a thousand examples, or maybe ten thousand examples, these things are not unreasonable. But in the modern machine learning era, we are now used to working with much larger data set sizes. So let's say you have a million training examples; it might be quite reasonable to set up your data so that you have 98% in the training set, 1% dev, and 1% test, where I'm using D and T to abbreviate the dev and test sets. Because if you have a million examples, then 1% of that is 10,000 examples, and that might be plenty enough for a dev set or for a test set. So, in the modern deep learning era where sometimes we have much larger data sets, it's quite reasonable to use much less than 20 or 30% of your data for a dev set or a test set. And because deep learning algorithms have such a huge hunger for data, I'm seeing that, for problems where we have large data sets, a much larger fraction of the data goes into the training set. So, how about the test set? Remember, the purpose of your test set is that, after you finish developing a system, the test set helps evaluate how good your final system is. The guideline is to set your test set to be big enough to give high confidence in the overall performance of your system. So, unless you need to have a very accurate measure of how well your final system is performing, maybe you don't need millions and millions of examples in a test set, and maybe for your application, if you think that having 10,000 examples gives you enough confidence in estimating the performance of your system, then that might be enough. And this could be much less than, say, 30% of the overall data set, depending on how much data you have. For some applications, maybe you don't need high confidence in the overall performance of your final system. Maybe all you need is a train and dev set, and I think not having a test set might be okay. In fact, what sometimes happens is that people talk about using train/test splits, but what they are actually doing is iterating on the test set. So rather than a test set, what they have is a train/dev split and no test set. If you're actually tuning to this set, to this dev set and this test set, it's really better to call it the dev set. Although I think in the history of machine learning, not everyone has been completely clean or scrupulous about calling it the dev set when it is really being treated as a dev set rather than a test set. But, if all you care about is having some data that you train on and having some data to tune to, and you're just going to ship the final system and not worry too much about how it was actually doing, I think it is healthier to just call them the train and dev sets and acknowledge that you have no test set. Is this a bit unusual?
I'm definitely not recommending not having a test set when building a system. I do find it reassuring to have a separate test set you can use to get an unbiased estimate of how well your system is doing before you ship it. But if you have a very large dev set, so that you think you won't overfit the dev set too badly, maybe it's not totally unreasonable to just have a train/dev set, although it's not what I usually recommend. So to summarize, in the era of big data, I think the old rule of thumb of a 70/30 split no longer applies. And the trend has been to use more data for training and less for dev and test, especially when you have a very large data set. And the rule of thumb is really to try to set the dev set big enough for its purpose, which is to help you evaluate different ideas, say A or B, and pick the better one. And the purpose of the test set is to help you evaluate your final classifier. You just have to set your test set big enough for that purpose, and that could be much less than 30% of the data. So, I hope that gives some guidance or some suggestions on how to set up your dev and test sets in the deep learning era. Next, it turns out that sometimes, partway through a machine learning problem, you might want to change your evaluation metric, or change your dev and test sets. Let's talk about when you might want to do that.
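Just to make the arithmetic of the 98/1/1 split explicit, assuming one million examples as in the lecture:

```python
m = 1_000_000                       # total examples, as in the lecture's scenario
n_train = m * 98 // 100             # 980,000 for training
n_dev   = m // 100                  #  10,000 for the dev set
n_test  = m - n_train - n_dev       #  10,000 for the test set
print(n_train, n_dev, n_test)       # 980000 10000 10000
```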
When to change dev/test sets and metrics - 11m
0:00
You've seen how setting up a dev set and evaluation metric is like placing a target somewhere for your team to aim at. But sometimes, partway through a project, you might realize you put your target in the wrong place. In that case, you should move your target. Let's take a look at an example. Let's say you build a cat classifier to try to find lots of pictures of cats to show to your cat loving users, and the metric that you decided to use is classification error. So algorithms A and B have, respectively, 3 percent error and 5 percent error, so it seems like Algorithm A is doing better. But let's say you try out these algorithms, you look at them, and Algorithm A, for some reason, is letting through a lot of pornographic images. So if you ship Algorithm A, the users would see more cat images, because it has only 3 percent error at identifying cats, but it also shows the users some pornographic images, which is totally unacceptable both for your company and for your users. In contrast, Algorithm B has 5 percent error, so it classifies fewer images correctly, but it doesn't let through pornographic images. So from your company's point of view, as well as from a user acceptance point of view, Algorithm B is actually a much better algorithm, because it's not letting through any pornographic images. So, what has happened in this example is that Algorithm A is doing better on the evaluation metric, it's getting 3 percent error, but it is actually a worse algorithm. In this case, the evaluation metric plus the dev set prefers Algorithm A, because they're saying, look, Algorithm A has lower error, which is the metric you're using, but you and your users prefer Algorithm B, because it's not letting through pornographic images. So when this happens, when your evaluation metric is no longer correctly rank ordering preferences between algorithms, in this case mispredicting that Algorithm A is the better algorithm, then that's a sign that you should change your evaluation metric, or perhaps your development set or test set. In this case, the misclassification error metric that you're using can be written as follows: it's one over m_dev, the number of examples in your development set, times the sum from i equals 1 to m_dev of the indicator of whether or not the prediction on example i in your development set, y-hat of i, is not equal to the actual label y of i, where y-hat denotes the predicted value. The indicator function notation counts up the number of examples on which the thing inside it is true, so this formula just counts up the number of misclassified examples. The problem with this evaluation metric is that it treats pornographic and non-pornographic images equally, but you really want your classifier to not mislabel pornographic images, like maybe misclassifying a pornographic image as a cat image and therefore showing it to an unsuspecting user, who would then be very unhappy with unexpectedly seeing porn. One way to change this evaluation metric would be to add a weight term here; we call this w(i), where w(i) is going to be equal to 1 if x(i) is non-porn, and maybe 10, or maybe an even larger number like 100, if x(i) is porn. So this way you're giving a much larger weight to examples that are pornographic, so that the error term goes up much more if the algorithm makes a mistake on classifying a pornographic image as a cat image. In this example, you're giving 10 times bigger weight to classifying pornographic images correctly.
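Written out, the plain and weighted dev set error metrics described in words above are (using y-hat for the prediction and the indicator function for the zero-one error):

```latex
\mathrm{Error} = \frac{1}{m_{\mathrm{dev}}} \sum_{i=1}^{m_{\mathrm{dev}}} \mathcal{I}\{\hat{y}^{(i)} \neq y^{(i)}\}
\qquad
\mathrm{Error}_w = \frac{1}{m_{\mathrm{dev}}} \sum_{i=1}^{m_{\mathrm{dev}}} w^{(i)}\,\mathcal{I}\{\hat{y}^{(i)} \neq y^{(i)}\},
\quad
w^{(i)} = \begin{cases} 1 & x^{(i)} \text{ is non-porn} \\ 10 & x^{(i)} \text{ is porn} \end{cases}
```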
If you want this to still be normalized, technically the normalization constant becomes the sum over i of w(i), so that this error is still between zero and one. The details of this weighting aren't important, and actually, to implement this weighting, you need to go through your dev and test sets and label the pornographic images in them, so you can implement this weighting function. But the high-level takeaway is: if you find that your evaluation metric is not giving the correct rank order preference for what is actually the better algorithm, then it's time to think about defining a new evaluation metric. And this is just one possible way that you could define an evaluation metric. The goal of the evaluation metric is to accurately tell you, given two classifiers, which one is better for your application. For the purpose of this video, don't worry too much about the details of how we define a new error metric. The point is that if you're not satisfied with your old error metric, then don't keep coasting with an error metric you're unsatisfied with; instead, try to define a new one that you think better captures your preferences in terms of what's actually a better algorithm. One thing you might notice is that so far we've only talked about how to define a metric to evaluate classifiers. That is, we've defined an evaluation metric that helps us better rank order classifiers when they are performing at varying levels in terms of screening out porn. And this is actually an example of orthogonalization, where I think you should take a machine learning problem and break it into distinct steps. One step is to figure out how to define a metric that captures what you want to do, and I would worry separately about how to actually do well on this metric. So think of the machine learning task as two distinct steps. To use the target analogy, the first step is to place the target, that is, define where you want to aim. And then, as a completely separate step, which you can tune separately, figure out how to aim accurately, or how to shoot at the target. Defining the metric is step one, and you do something else for step two. In terms of shooting at the target, maybe your learning algorithm is optimizing some cost function that looks like this, where you are minimizing a sum of losses on your training set. One thing you could do is to also modify this in order to incorporate these weights, and maybe end up changing this normalization constant as well, so it's just 1 over the sum of w(i). Again, the details of how you define J aren't important, but the point is that with the philosophy of orthogonalization, think of placing the target as one step, and aiming and shooting at the target as a distinct step which you do separately. In other words, I encourage you to think of defining the metric as one step, and only after you define the metric, figure out how to do well on that metric, which might mean changing the cost function J that your neural network is optimizing. Before going on, let's look at just one more example. Let's say that your two cat classifiers A and B have, respectively, 3 percent error and 5 percent error as evaluated on your dev set, or maybe even on your test set, which are images downloaded off the internet, so high quality, well framed images.
But maybe when you deploy your algorithm product, you find that Algorithm B actually looks like it's performing better, even though it's doing worse on your dev set. And you find that you've been training off very nice, high quality images downloaded off the internet, but when you deploy this in the mobile app, users are uploading all sorts of pictures: they're much less well framed, maybe the cat isn't fully in the frame, the cats have funny facial expressions, maybe the images are much blurrier. And when you test out your algorithms, you find that Algorithm B is actually doing better. So this would be another example of your metric and dev/test sets falling down. The problem is that you're evaluating on dev and test sets of very nice, high resolution, well-framed images, but what your users really care about is for the algorithm to do well on the images they are uploading, which are maybe less professional shots, blurrier, and less well framed. So the guideline is: if doing well on your metric and your current dev set, or dev and test sets' distribution, does not correspond to doing well on the application you actually care about, then change your metric and your dev/test set. In other words, if we discover that your dev/test set has these very high quality images, but evaluating on this dev/test set is not predictive of how well your app actually performs, because your app needs to deal with lower quality images, then that's a good time to change your dev/test set so that your data better reflects the type of data you actually need to do well on. But the overall guideline is: if your current metric and the data you are evaluating on don't correspond to doing well on what you actually care about, then change your metrics and/or your dev/test set to better capture what you need your algorithm to actually do well on. Having an evaluation metric and a dev set allows you to much more quickly make decisions about whether Algorithm A or Algorithm B is better. It really speeds up how quickly you and your team can iterate. So my recommendation is, even if you can't define the perfect evaluation metric and dev set, just set something up quickly and use that to drive the speed of your team's iterating. And if later down the line you find out that it wasn't a good one, and you have a better idea, change it at that time; it's perfectly okay. But what I recommend against for most teams is to run for too long without any evaluation metric and dev set, because that can slow down the efficiency with which your team can iterate and improve your algorithm. So that's it on when to change your evaluation metric and/or dev and test sets. I hope that these guidelines help you set up your whole team to have a well-defined target that you can iterate efficiently towards improving performance.
Why human-level performance? - 5m
0:01
In the last few years, a lot more machine learning teams have been talking about comparing the machine learning systems to human level performance. Why is this? I think there are two main reasons. First is that because of advances in deep learning, machine learning algorithms are suddenly working much better and so it has become much more feasible in a lot of application areas for machine learning algorithms to actually become competitive with human-level performance. Second, it turns out that the workflow of designing and building a machine learning system, the workflow is much more efficient when you‘re trying to do something that humans can also do. So in those settings, it becomes natural to talk about comparing, or trying to mimic human-level performance. Let‘s see a couple examples of what this means.
0:46
I‘ve seen on a lot of machine learning tasks that as you work on a problem over time, so the x-axis, time, this could be many months or even many years over which some team or some research community is working on a problem. Progress tends to be relatively rapid as you approach human level performance. But then after a while, the algorithm surpasses human-level performance and then progress and accuracy actually slows down. And maybe it keeps getting better but after surpassing human level performance it can still get better, but performance, the slope of how rapid the accuracy‘s going up, often that slows down. And the hope is it achieves some theoretical optimum level of performance.
1:32
And over time, as you keep training the algorithm, maybe bigger and bigger models on more and more data, the performance approaches but never surpasses some theoretical limit, which is called the Bayes optimal error. So Bayes optimal error, think of this as the best possible error.
1:59
And there's just no way for any function mapping from x to y to surpass a certain level of accuracy. So for example, for speech recognition, if x is audio clips, some audio is just so noisy it is impossible to tell what is in the correct transcription. So the perfect level of accuracy may not be 100%. Or for cat recognition, maybe some images are so blurry that it is just impossible for anyone or anything to tell whether or not there's a cat in that picture. So, the perfect level of accuracy may not be 100%. And Bayes optimal error, or Bayesian optimal error, or sometimes Bayes error for short, is the error of the very best theoretical function for mapping from x to y.
2:52
That can never be surpassed.
2:56
So it should be no surprise that this purple line, no matter how many years you work on a problem you can never surpass Bayes error, Bayes optimal error. And it turns out that progress is often quite fast until you surpass human level performance.
3:12
And it sometimes slows down after you surpass human level performance. And I think there are two reasons for that, for why progress often slows down when you surpass human level performance. One reason is that human level performance is for many tasks not that far from Bayes‘ optimal error. People are very good at looking at images and telling if there‘s a cat or listening to audio and transcribing it. So, by the time you surpass human level performance maybe there‘s not that much head room to still improve.
3:42
But the second reason is that so long as your performance is worse than human level performance, then there are actually certain tools you could use to improve performance that are harder to use once you‘ve surpassed human level performance. So here‘s what I mean.
3:59
For tasks that humans are quite good at, and this includes looking at pictures and recognizing things, or listening to audio, or reading language, really natural data tasks that humans tend to be very good at. For tasks that humans are good at, so long as your machine learning algorithm is still worse than the human, you can get labeled data from humans. That is, you can ask people, or hire humans, to label examples for you, so that you can have more data to feed your learning algorithm. Something we'll talk about next week is manual error analysis. But so long as humans are still performing better than any other algorithm, you can ask people to look at examples that your algorithm's getting wrong, and try to gain insight in terms of why a person got it right but the algorithm got it wrong. And we'll see next week that this helps improve your algorithm's performance.
4:48
And you can also get a better analysis of bias and variance, which we'll talk about in a little bit. But so long as your algorithm is still doing worse than humans, you have these important tactics for improving your algorithm. Whereas once your algorithm is doing better than humans, then these three tactics are harder to apply.
5:07
So, this is maybe another reason why comparing to human level performance is helpful, especially on tasks that humans do well.
5:17
And that's why machine learning algorithms tend to be really good at trying to replicate tasks that people can do, and kind of catch up to and maybe slightly surpass human-level performance. In particular, even if you already know what bias and variance are, it turns out that knowing how well humans can do on a task can help you understand better how much you should try to reduce bias and how much you should try to reduce variance. I want to show you an example of this in the next video.
Avoidable bias - 6m
0:02
We talked about how you want your learning algorithm to do well on the training set, but sometimes you don't actually want to do too well. And knowing what human-level performance is can tell you exactly how well, but not too well, you want your algorithm to do on the training set. Let me show you what I mean. We have used cat classification a lot, and given a picture, let's say humans have near-perfect accuracy, so the human-level error is one percent. In that case, if your learning algorithm achieves 8 percent training error and 10 percent dev error, then maybe you want to do better on the training set. So the fact that there's a huge gap between how well your algorithm does on your training set versus how humans do shows that your algorithm isn't even fitting the training set well. So in terms of tools to reduce bias or variance, in this case I would say focus on reducing bias. So you want to do things like train a bigger neural network or run training longer, to just try to do better on the training set. But now let's look at the same training error and dev error, and imagine that human-level performance was not 1%. So, copying the same numbers over, but in a different application, or maybe on a different data set, let's say that human-level error is actually 7.5%. Maybe the images in your data set are so blurry that even humans can't tell whether there's a cat in the picture. This example is maybe slightly contrived, because humans are actually very good at looking at pictures and telling if there's a cat in them or not. But for the sake of this example, let's say your data set's images are so blurry or so low resolution that even humans get 7.5% error. In this case, even though your training error and dev error are the same as in the other example, you see that maybe you're actually doing just fine on the training set; it's doing only a little bit worse than human-level performance. And in this second example, you would maybe want to focus on reducing this component, reducing the variance in your learning algorithm. So you might try regularization to try to bring your dev error closer to your training error, for example. So in the earlier course's discussion on bias and variance, we were mainly assuming that there were tasks where Bayes error is nearly zero. So to explain what just happened here, for our cat classification example, think of human-level error as a proxy, or as an estimate, for Bayes error, or for Bayes optimal error. And for computer vision tasks, this is a pretty reasonable proxy, because humans are actually very good at computer vision, and so whatever a human can do is maybe not too far from Bayes error. By definition, human-level error is worse than Bayes error, because nothing can be better than Bayes error, but human-level error might not be too far from Bayes error. So the surprising thing we saw here is that, depending on what human-level error is, or really, this is approximately Bayes error, or so we assume it to be, but depending on what we think is achievable, with the same training error and dev error in these two cases, we decided to focus on bias reduction tactics or on variance reduction tactics. And what happened is, in the example on the left, 8% training error is really high when you think you could get it down to 1%, and so bias reduction tactics could help you do that.
Whereas in the example on the right, if you think that Bayes error is 7.5%, and here we're using human-level error as an estimate, or as a proxy, for Bayes error, but if you think that Bayes error is close to 7.5%, then you know there's not that much headroom for reducing your training error further. You don't really want it to be much better than 7.5%, because you could achieve that only by starting to overfit. And instead, there's much more room for improvement in terms of taking this 2% gap and trying to reduce that by using variance reduction techniques, such as regularization or maybe getting more training data. So to give these things a couple of names, this is not widely used terminology, but I found it a useful terminology and a useful way of thinking about it: I'm going to call the difference between Bayes error, or our approximation of Bayes error, and the training error the avoidable bias. So what you want is maybe to keep improving your training performance until you get down to Bayes error, but you don't actually want to do better than Bayes error. You can't actually do better than Bayes error unless you're overfitting. And the difference between your training error and the dev error is a measure of the variance problem of your algorithm. And the term avoidable bias acknowledges that there's some bias, or some minimum level of error, that you just cannot get below, which is to say, if Bayes error is 7.5%, you don't actually want to get below that level of error. So rather than saying that if your training error is 8%, then the 8% is a measure of bias in this example, you're saying that the avoidable bias is maybe 0.5%, or 0.5% is a measure of the avoidable bias, whereas 2% is a measure of the variance, and so there's much more room in reducing this 2% than in reducing this 0.5%. Whereas, in contrast, in the example on the left, the 7% is a measure of the avoidable bias, whereas 2% is a measure of how much variance you have. And so in the example on the left, there's much more potential in focusing on reducing that avoidable bias. So in this example, understanding human-level error, understanding your estimate of Bayes error, really causes you, in different scenarios, to focus on different tactics, whether bias avoidance tactics or variance avoidance tactics. There's quite a lot more nuance in how you factor human-level performance into how you make decisions in choosing what to focus on. So in the next video, we'll go deeper into understanding what human-level performance really means.
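As a small sketch of this bookkeeping (the numbers are the two scenarios from the lecture, with human-level error standing in as the Bayes error proxy):

```python
def diagnose(human_err, train_err, dev_err):
    """Split the error into avoidable bias and variance, and suggest a focus."""
    avoidable_bias = train_err - human_err   # gap to the Bayes error proxy
    variance = dev_err - train_err           # gap between train and dev error
    focus = "bias reduction" if avoidable_bias > variance else "variance reduction"
    return round(avoidable_bias, 3), round(variance, 3), focus

print(diagnose(human_err=0.010, train_err=0.08, dev_err=0.10))  # (0.07, 0.02, 'bias reduction')
print(diagnose(human_err=0.075, train_err=0.08, dev_err=0.10))  # (0.005, 0.02, 'variance reduction')
```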
Understanding human-level performance - 11m
0:00
The term human-level performance is sometimes used casually in research articles. But let me show you how we can define it a bit more precisely. And in particular, use the definition of the phrase, human-level performance, that is most useful for helping you drive progress in your machine learning project.
0:19
So remember from our last video that one of the uses of this phrase, human-level error, is that it gives us a way of estimating Bayes error. What is the best possible error any function could, either now or in the future, ever, ever achieve? So bearing that in mind, let‘s look at a medical image classification example. Let‘s say that you want to look at a radiology image like this, and make a diagnosis classification decision.
0:49
And suppose that a typical human, an untrained human, achieves 3% error on this task; a typical doctor, maybe a typical radiologist, achieves 1% error; an experienced doctor does even better, 0.7% error; and a team of experienced doctors, that is, if you get a team of experienced doctors and have them all look at the image and discuss and debate it, together their consensus opinion achieves 0.5% error. So the question I want to pose to you is, how should you define human-level error? Is human-level error 3%, 1%, 0.7% or 0.5%? Feel free to pause this video to think about it if you wish. And to answer that question, I would urge you to bear in mind that one of the most useful ways to think of human error is as a proxy or an estimate for Bayes error. Here's how I would define human-level error: if you want a proxy or an estimate for Bayes error, then given that a team of experienced doctors discussing and debating can achieve 0.5% error, we know that Bayes error is less than or equal to 0.5%. Because some system, this team of doctors, can achieve 0.5% error, by definition the optimal error has got to be 0.5% or lower. We don't know how much better it is; maybe there's an even larger team of even more experienced doctors who could do even better, so maybe it's even a little bit better than 0.5%. But we know the optimal error cannot be higher than 0.5%. So what I would do in this setting is use 0.5% as our estimate for Bayes error, and I would define human-level performance as 0.5%, at least if you're hoping to use human-level error in the analysis of bias and variance as we saw in the last video.
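As a quick sketch of that rule (with made-up variable names and the error rates from this example), the Bayes-error proxy is simply the best error achieved by any human or team of humans:

```python
# Hypothetical error rates from the medical imaging example (illustration only).
human_errors = {
    "typical human": 0.03,
    "typical doctor": 0.01,
    "experienced doctor": 0.007,
    "team of experienced doctors": 0.005,
}

# For bias/variance analysis, take the best performance achieved by any human
# (or team of humans) as the estimate of Bayes error: Bayes error <= min(...).
bayes_error_estimate = min(human_errors.values())
print(bayes_error_estimate)   # 0.005
```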
2:56
Now, for the purpose of publishing a research paper or for the purpose of deploying a system, maybe there‘s a different definition of human-level error that you can use which is so long as you surpass the performance of a typical doctor. That seems like maybe a very useful result if accomplished, and maybe surpassing a single radiologist, a single doctor‘s performance might mean the system is good enough to deploy in some context.
3:22
So maybe the takeaway from this is to be clear about what your purpose is in defining the term human-level error. And if it is to show that you can surpass a single human and therefore argue for deploying your system in some context, maybe this is the appropriate definition. But if your goal is the proxy for Bayes error, then this is the appropriate definition. To see why this matters, let‘s look at an error analysis example.
3:51
Let‘s say, for a medical imaging diagnosis example, that your training error is 5% and your dev error is 6%. And in the example from the previous slide, our human-level performance, and I‘m going to think of this as proxy for Bayes error.
4:12
Depending on whether you defined it as a typical doctor's performance, an experienced doctor, or a team of doctors, you would have either 1%, 0.7% or 0.5% for this. And remember also our definitions from the previous video: the gap between Bayes error, or our estimate of Bayes error, and the training error is what we're calling a measure of the avoidable bias, and the gap between training error and dev error is a measure or an estimate of how much of a variance problem you have in your learning algorithm.
4:44
So in this first example, whichever of these choices you make, the measure of avoidable bias will be something like 4%. It will be somewhere between 4%, if you use the 1% figure, and 4.5%, if you use 0.5%, whereas the variance, the gap between training and dev error, is 1%.
5:06
So in this example, I would say it doesn't really matter which of the definitions of human-level error you use, whether the typical doctor's error, the single experienced doctor's error, or the team of experienced doctors' error. Whether the avoidable bias is 4% or 4.5%, it is clearly bigger than the variance problem, and so in this case you should focus on bias reduction techniques such as training a bigger network. Now let's look at a second example. Let's say your training error is 1% and your dev error is 5%. Then again it doesn't really matter, it seems almost academic, whether the human-level performance is 1%, 0.7% or 0.5%, because whichever of these definitions you use, your measure of avoidable bias will be somewhere between 0%, if you use the 1% figure, and 0.5%. That's the gap between the human-level performance and your training error, whereas this other gap is 4%. So this 4% is going to be much bigger than the avoidable bias either way, and so this would suggest you should focus on variance reduction techniques such as regularization or getting a bigger training set. But where it really matters is if your training error is 0.7%, so you're doing really well now, and your dev error is 0.8%. In this case, it really matters that you use your estimate for Bayes error as 0.5%.
6:36
Because in this case, your measure of how much avoidable bias you have is 0.2% which is twice as big as your measure for your variance, which is just 0.1%.
6:48
And so this suggests that maybe both bias and variance are problems, but the avoidable bias is a bit bigger of a problem. And in this example, 0.5%, as we discussed on the previous slide, was the best measure of Bayes error, because a team of human doctors could achieve that performance. If you used 0.7% as your proxy for Bayes error, you would have estimated the avoidable bias as pretty much 0%, and you might have missed the fact that you actually should try to do better on your training set.
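To see how sensitive the diagnosis is to the choice of proxy, here is a minimal sketch using the numbers from this last example (the 0.7% training error and 0.8% dev error are taken from the slide; the loop and formatting are my own):

```python
train_error, dev_error = 0.007, 0.008
variance = dev_error - train_error   # 0.1% regardless of the Bayes proxy

# Three candidate Bayes-error proxies: typical doctor, experienced doctor, team of doctors.
for bayes_proxy in (0.010, 0.007, 0.005):
    avoidable_bias = train_error - bayes_proxy
    print(f"proxy={bayes_proxy:.1%}: avoidable bias={avoidable_bias:+.1%}, variance={variance:.1%}")

# Only the 0.5% (team) estimate reveals the remaining 0.2% of avoidable bias;
# the 1% and 0.7% proxies make the avoidable bias look like zero or even negative.
```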
7:18
So I hope this gives a sense also of why making progress in a machine learning problem gets harder as you approach human-level performance. In this example, once you've approached 0.7% error, unless you're very careful about estimating Bayes error, you might not know how far away you are from Bayes error, and therefore how much you should be trying to reduce avoidable bias. In fact, if all you knew was that a single typical doctor achieves 1% error, it might be very difficult to know if you should be trying to fit your training set even better.
7:54
And this problem arose only when you‘re doing very well on your problem already, only when you‘re doing 0.7%, 0.8%, really close to human-level performance.
8:04
Whereas in the two examples on the left, when you are further away from human-level performance, it was easier to target your focus on bias or variance. So this is maybe an illustration of why, as you approach human-level performance, it is actually harder to tease out the bias and variance effects, and therefore why progress on your machine learning project just gets harder as you're doing really well.
8:25
So just to summarize what we've talked about: if you're trying to understand bias and variance where you have an estimate of human-level error, for a task that humans can do quite well, you can use human-level error as a proxy or an approximation for Bayes error.
8:47
And so the difference between your estimate of Bayes error and your training error tells you how much avoidable bias is a problem, how much avoidable bias there is. And the difference between training error and dev error tells you how much variance is a problem, whether your algorithm's able to generalize from the training set to the dev set. And the big difference between our discussion here and what we saw in an earlier course is that instead of comparing training error to 0%,
9:18
and just calling that the estimate of the bias, in this video we have a more nuanced analysis in which there is no particular expectation that you should get 0% error, because sometimes Bayes error is nonzero and it is just not possible for anything to do better than a certain threshold of error.
9:41
And so in the earlier course, we were measuring training error and seeing how much bigger training error was than zero, and just using that to try to understand how big our bias is. That turns out to work just fine for problems where Bayes error is nearly 0%, such as recognizing cats: humans are near perfect for that, so Bayes error is also nearly zero, and so that approach works okay. But for problems where the data is noisy, like speech recognition on very noisy audio, where it's just impossible sometimes to hear what was said and to get the correct transcription, having a better estimate for Bayes error can help you better estimate avoidable bias and variance, and therefore make better decisions on whether to focus on bias reduction tactics or on variance reduction tactics.
10:30
So to recap, having an estimate of human-level performance gives you an estimate of Bayes error. And this allows you to more quickly make decisions as to whether you should focus on trying to reduce a bias or trying to reduce the variance of your algorithm.
10:45
And these techniques will tend to work well until you surpass human-level performance, whereupon you might no longer have a good estimate of Bayes error that still helps you make this decision really clearly.
10:58
Now, one of the exciting developments in deep learning has been that for more and more tasks we‘re actually able to surpass human-level performance. In the next video, let‘s talk more about the process of surpassing human-level performance.
Surpassing human-level performance - 6m
0:00
Many teams find it exciting to surpass human-level performance on a specific recognition or classification task. Let's talk over some of the things you'll see if you try to accomplish this yourself. We've discussed before how machine learning progress gets harder as you approach or even surpass human-level performance. Let's talk over one more example of why that's the case. Let's say you have a problem where a team of humans discussing and debating achieves 0.5% error, a single human achieves 1% error, and your algorithm achieves 0.6% training error and 0.8% dev error. So in this case, what is the avoidable bias? This one is relatively easy to answer: 0.5% is your estimate of Bayes error, so you're not going to use the 1% number as the reference; your avoidable bias is this difference, maybe 0.1%, and your variance is 0.2%. So there's maybe more to be gained from reducing your variance than your avoidable bias. But now let's take a harder example. Let's say a team of humans and a single human perform the same as before, but your algorithm gets 0.3% training error and 0.4% dev error. Now, what is the avoidable bias? It's now actually much harder to answer. Given that your training error is 0.3%, does this mean you've overfitted by 0.2%, or is Bayes error actually 0.1%, or maybe 0.2%, or maybe 0.3%? You don't really know. Based on the information given in this example, you don't have enough information to tell whether you should focus on reducing bias or reducing variance in your algorithm, and that reduces the efficiency with which you can make progress. Moreover, if your error is already better than even a team of humans looking at and discussing and debating the right label for an example, then it's also harder to rely on human intuition to tell your algorithm in what ways it could still improve its performance. So in this example, once you've surpassed this 0.5% threshold, your options, your ways of making progress on the machine learning problem, are just less clear. It doesn't mean you can't make progress; you might still be able to make significant progress, but some of the tools you have for pointing you in a clear direction just don't work as well. Now, there are many problems where machine learning significantly surpasses human-level performance. For example, online advertising, estimating how likely someone is to click on an ad: learning algorithms probably do that much better today than any human could. Or making product recommendations, recommending movies or books to you: I think websites today can do that much better than maybe even your closest friends can. Also logistics, predicting how long it will take you to drive from A to B, or how long it will take a delivery vehicle to drive from A to B. Or trying to predict whether someone will repay a loan, and therefore whether or not you should approve a loan offer. All of these are problems where I think today machine learning far surpasses a single human's performance. Notice something about these four examples: all four are learning from structured data, where you might have a database of what users have clicked on, a database of products bought before, databases of how long it takes to get from A to B, and a database of previous loan applications and their outcomes.
And these are not natural perception problems; these are not computer vision, speech recognition, or natural language processing tasks. Humans tend to be very good at natural perception tasks, so it is possible, but it's just a bit harder, for computers to surpass human-level performance on natural perception tasks. And finally, all of these are problems where there are teams that have access to huge amounts of data. For example, the best systems for all four of these applications have probably looked at far more data for that application than any human could possibly look at, and so that has also made it relatively easy for a computer to surpass human-level performance. The fact that there's so much data that a computer can examine means it can find statistical patterns better than even the human mind can. Beyond these problems, today there are speech recognition systems that can surpass human-level performance, and there are also some computer vision, some image recognition tasks, where computers have surpassed human-level performance. But because humans are very good at these natural perception tasks, I think it was harder for computers to get there. And then there are some medical tasks, for example reading ECGs or diagnosing skin cancer, or certain narrow radiology tasks, where computers are getting really good and maybe surpassing a single human's performance. And I guess one of the exciting things about recent advances in deep learning is that even for these tasks we can now surpass human-level performance in some cases, although it has been a bit harder because humans tend to be very good at these natural perception tasks. So surpassing human-level performance is often not easy, but given enough data there have been lots of deep learning systems that have surpassed human-level performance on a single supervised learning problem. So if this is relevant to an application you're working on, I hope that maybe someday you manage to get your deep learning system to also surpass human-level performance.
Improving your model performance - 4m
0:00
You've heard about orthogonalization, how to set up your dev and test sets, human-level performance as a proxy for Bayes error, and how to estimate your avoidable bias and variance. Let's pull it all together into a set of guidelines for how to improve the performance of your learning algorithm. So, I think getting a supervised learning algorithm to work well means fundamentally hoping or assuming that you can do two things. First is that you can fit the training set pretty well, and you can think of this as roughly saying that you can achieve low avoidable bias. And the second thing you're assuming you can do well is that doing well on the training set generalizes pretty well to the dev set or the test set, and this is sort of saying that variance is not too bad. And in the spirit of orthogonalization, there's a certain set of knobs you can use to fix avoidable bias issues, such as training a bigger network or training longer, and a separate set of knobs you can use to address variance problems, such as regularization or getting more training data. So, to summarize the process we've seen in the last several videos: if you want to improve the performance of your machine learning system, I would recommend looking at the difference between your training error and your proxy for Bayes error; this gives you a sense of the avoidable bias, in other words, just how much better you think you should be trying to do on your training set. And then look at the difference between your dev error and your training error as an estimate of how much of a variance problem you have, in other words, how much harder you should be working to make your performance generalize from the training set to the dev set, which it wasn't trained on explicitly.
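As a compact reference, here is a minimal sketch (my own summary with hypothetical names, not an official recipe from the course) that computes the two gaps and returns the corresponding family of tactics discussed in this video:

```python
# Tactics from this video, grouped by which gap they attack.
bias_tactics = [
    "train a bigger model",
    "train longer / use a better optimizer (momentum, RMSprop, Adam)",
    "try a different NN architecture / hyperparameter search",
]
variance_tactics = [
    "get more training data",
    "regularization (L2, dropout, data augmentation)",
    "try a different NN architecture / hyperparameter search",
]

def recommend(bayes_estimate, train_error, dev_error):
    """Compare the two gaps and return the tactics for the larger one."""
    avoidable_bias = train_error - bayes_estimate
    variance = dev_error - train_error
    return bias_tactics if avoidable_bias >= variance else variance_tactics

# Example: the medical imaging numbers from the previous video.
print(recommend(bayes_estimate=0.005, train_error=0.05, dev_error=0.06))
```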
1:57
So, to whatever extent you want to try to reduce avoidable bias, I would try to apply tactics like training a bigger model, so you can just do better on your training set, or training longer, or using a better optimization algorithm, such as adding momentum or RMSprop, or using a better algorithm like Adam,
2:27
or one other thing you could try is to just find a better neural network architecture or a better set of hyperparameters. This could include everything from changing the activation function to changing the number of layers or hidden units (although if you do that, it would usually be in the direction of increasing the model size), to trying out other models or other model architectures, such as recurrent neural networks and convolutional neural networks, which we'll see in later courses. Whether or not a new neural network architecture will fit your training set better is sometimes hard to tell in advance, but sometimes you can get much better results with a better architecture. Next, to the extent that you find out variance is a problem, some of the many techniques you could try include the following: you can try to get more data, because getting more data to train on could help you generalize better to dev set data that your algorithm didn't see; you could try regularization, and this includes things like L2 regularization or dropout or data augmentation, which we talked about in the previous course; or, once again, you can also try various NN architecture and hyperparameter searches to see if that can help you find a neural network architecture that is better suited for your problem. I think this notion of bias, or avoidable bias, and variance is one of those things that's easily learnt but tough to master. If you're able to systematically apply the concepts from this week's videos, you actually will be much more efficient and much more systematic and much more strategic than a lot of machine learning teams in terms of how to systematically go about improving the performance of your machine learning system. So this week's homework will allow you to practice and exercise your understanding of these concepts. Best of luck with this week's homework, and I look forward to also seeing you in next week's videos.
(Reading) Machine Learning flight simulator - 2m
To help you practice strategies for machine learning, the following exercise will present an in-depth scenario and ask how you would act. Consider airplane pilots whose training involves time spent in flight simulators. These flight simulators accelerate the pilots' learning by allowing them to experience a volume and variety of scenarios that they otherwise might have needed a much longer time to acquire.
The following exercise is a “flight simulator” for machine learning. Rather than you needing to spend years working on a machine learning project before you get to experience certain scenarios, you’ll get to experience them right here.
Personal note from Andrew: I’ve found practicing with scenarios like these to be useful for training PhD students and advanced Deep Learning researchers. This is the first time this type of “airplane simulator” for machine learning strategy has ever been made broadly available. I hope this helps you gain “real experience” with machine learning much faster than even full-time machine learning researchers typically do from work experience.
(Optional) Heroes of Deep Learning - Andrej Karpathy interview - 15m
0:02
So welcome Andrej, I'm really glad you could join me today. >> Yeah, thank you for having me. >> So a lot of people already know your work in deep learning, but not everyone knows your personal story. So let us start by you telling us, how did you end up doing all this work in deep learning? >> Yeah, absolutely. So I think my first exposure to deep learning was when I was an undergraduate at the University of Toronto. Geoff Hinton was there, and he was teaching a class on deep learning. At that time, it was Restricted Boltzmann Machines trained on MNIST digits. And I just really liked the way Geoff talked about training the network, like the mind of the network, he was using these terms. And I just thought there was a flavor of something magical happening when this was training on those digits. So that was my first exposure to it, although I didn't get into it in a lot of detail at that time. And then when I was doing my master's degree at the University of British Columbia,
0:57
I took a class on machine learning, and that's the first time I delved deeper into these networks and so on. And what was interesting is that I was very interested in artificial intelligence, and so I took classes in artificial intelligence. But a lot of what I was seeing there was just not very satisfying. It was a lot of depth-first search, breadth-first search, alpha-beta pruning, and all these things, and I was not satisfied. And so when I was seeing neural networks for the first time in machine learning, machine learning being a term that I think is more technical and not as well known, most people talk about artificial intelligence, machine learning was the more technical term, I would almost say. So I was dissatisfied with artificial intelligence, and when I saw machine learning, I was like, this is the AI that I want to spend time on, this is what's really interesting. And that's what took me down those directions: this is almost a new computing paradigm, I would say.
1:48
Because normally, humans write code, but here in this case, the optimization writes code. You're creating the input/output specification, then you have lots of examples of it, and then the optimization writes code, and sometimes it can write code better than you. And so I thought that was just a very new way of thinking about programming, and that's what intrigued me about it. >> Then through your work, one of the things you've come to be known for is that you're now this human benchmark for the image classification competition. How did that come about? >> So basically, the ImageNet challenge is sometimes compared to the World Cup of computer vision. A lot of people care about this benchmark, and its error rate goes down over time. And it was not obvious to me where a human would be on this scale. I'd done a similar smaller-scale experiment on the CIFAR-10 dataset earlier. What I did with CIFAR-10 is I was just looking at these 32 x 32 images and trying to classify them myself. At the time, this was only ten categories, so it's fairly simple to create an interface for it. And I think I had an error rate of about 6% on that. And then based on what I was seeing and how hard the task was, I predicted what the lowest error rate we'd achieve would be. Okay, I can't remember the exact numbers, I think I guessed 10%, and we're now down to 3 or 2% or something crazy. So that was my first fun experiment with a human baseline. And I thought it was really important, for the same purposes that you point out in some of your lectures: you really want that number to understand how well humans are doing, so we can compare machine learning algorithms to it. And for ImageNet, it seemed that there was a discrepancy between how important this benchmark was, and how much focus there was on getting a lower number, and us not even understanding how humans are doing on this benchmark. So I created this Javascript interface, and I was showing myself the images. The problem with ImageNet is you don't have just 10 categories, you have 1,000, so it was almost a UI challenge. Obviously, I can't remember 1,000 categories, so how do I make it so that it's fair? So I listed out all the categories, and I gave myself examples of them. And so for each image, I was scrolling through 1,000 categories and just trying to see, based on the examples I was seeing for each category, what this image might be. And I thought it was an extremely instructive exercise by itself. I mean, I did not understand that a third of ImageNet is dogs and dog species, and so it was interesting to see that the network spends a huge amount of time caring about dogs; a third of its performance comes from dogs. And yeah, so this was something that I did for maybe a week or two. I put everything else on hold. I thought it was a very fun exercise. I got a number in the end, and then I thought that one person is not enough. I wanted to have multiple other people, and so I was trying to organize within the lab to get other people to do the same thing. And I think people are not as willing to contribute, say, a week or two of pretty painstaking work, just sitting down for five hours and trying to figure out which dog breed this is. And so I was not able to get enough data in that respect, but we got at least some approximate performance, which I thought was fun. And then this was picked up, and it wasn't obvious to me at the time.
I just wanted to know the number, but this became like a thing. [LAUGH] And people really liked the fact that this happened, and refer to it jokingly as the reference human. And of course, that's hilarious to me, yeah. [LAUGH] >> Were you surprised when software finally surpassed your performance? >> Absolutely. So yeah, absolutely. I mean, especially, sometimes it's really hard to see in the image what it is; it's just a tiny blob, a black dot somewhere there, and I'm not seeing it. I'm guessing between like 20 categories, and the network just gets it, and I don't understand how that comes about. So there's some superhumanness to it. But also, I think the network is extremely good at these kinds of fine statistics of textures.
5:46
I think in that respect, I was not surprised that the network could better measure those fine statistics across lots of images. In many cases, I was surprised because some of the images require you to read. It's just a bottle, and you can't see what it is, but it actually tells you what it is in text. And so as a human, I can read it, and it's fine, but the network would have to learn to read to identify the object, because it wasn't obvious just from the appearance. >> One of the things you've become well-known for, and that the deep learning community has been grateful to you for, has been your teaching the class and putting it online. Tell me a little bit about how that came about. >> Yeah, absolutely. So I felt very strongly that this technology was transformative in that a lot of people want to use it. It's almost like a hammer, and what I wanted to do, and was in a position to do, was to hand out this hammer to a lot of people. And I just found that very compelling. It's not necessarily advisable from the perspective of a PhD student, because you're putting your research on hold. I mean, this became like 120% of my time, and I had to put all of my research on hold. I taught the class twice, and each time it's maybe four months, and that time is basically spent entirely on the class. So it's not super advisable from that perspective, but it was basically the highlight of my PhD, and it's not even related to research. I think teaching the class was definitely the highlight of my PhD. Just seeing the students, the fact that they were really excited; it was a very different class. Normally, you're being taught things that were discovered in the 1800s or something like that, but we were able to come to class and say, look, there's this paper from a week ago, or even yesterday, and there are new results. And I think the undergraduate students and the other students really enjoyed that aspect of the class, and the fact that they actually understood it. This is not nuclear physics or rocket science; you need to know calculus and linear algebra, and you can actually understand everything that happens under the hood. So I think just the fact that it's so powerful, the fact that it keeps changing on a daily basis, people felt like they were right at the forefront of something big, and I think that's why people really enjoyed that class a lot. >> And you've really helped a lot of people and handed out a lot of hammers. >> Yeah. >> As someone who's been doing deep learning for quite some time now, the field is evolving rapidly. I'd be curious to hear, how has your own thinking, how has your understanding of deep learning changed over these many years? >> Yeah, it's basically like when I was seeing Restricted Boltzmann machines for the first time on those digits.
8:08
It wasn't obvious to me how this technology was going to be used and how big of a deal it would be. And also, when I was starting to work in computer vision, convolutional networks were around, but they were not something that much of the computer vision community expected to be using anytime soon. I think the perception was that this works for small cases but would never scale to large images. And that was just extremely incorrect. [LAUGH] So basically, I'm just surprised by how general the technology is and how good the results are. That was the largest surprise, I would say, and it's not only that this worked so well on, say, ImageNet. The other thing that I think no one saw coming, or at least for sure I did not see coming, is that you can take these pretrained networks and transfer them; you can fine-tune them on arbitrary other tasks. Because now you're not just solving ImageNet, where you need millions of examples; this also happens to be a very general feature extractor, and I think that's a second insight that fewer people saw coming. And there were these papers that just listed all the things that people had been working on in computer vision, scene classification, action recognition, object recognition, base attributes and so on, and people were just crushing each task just by fine-tuning the network. And so that, to me, was very surprising. >> Yes, and somehow I guess supervised learning gets most of the press, and even though pretrained fine-tuning or transfer learning is actually working very well, people seem to talk less about that for some reason. >> Right, exactly.
9:36
Yeah, I think what has not worked out as well are some of the hopes around unsupervised learning, which I think is really why a lot of researchers got into the field around 2007 and so on. And I think the promise of that has still not been delivered, and I find that also surprising: the supervised learning part worked so well, while unsupervised learning is still in a state where it's not obvious how it's going to be used or how it's going to work, even though a lot of people are still deep believers, to use the term, in this area. >> So I know that you're one of the people who's been thinking a lot about the long-term future of AI. Do you want to share your thoughts on that? >> So I spent the last maybe year and a half at OpenAI thinking a lot about these topics, and it seems to me like the field will split into two trajectories. One will be applied AI, which is just making these neural networks, training them, mostly with supervised learning, potentially unsupervised learning, and getting better, say, image recognizers or something like that. And I think the other will be artificial general intelligence directions, which is how do you get neural networks that are an entire dynamical system that thinks and speaks and can do everything that a human can do, and is intelligent in that way. And I think what's been interesting is that, for example in computer vision, the way we approached it in the beginning, I think, was wrong, in that we tried to break it down into different parts. So we were like, okay, humans recognize people, humans recognize scenes, humans recognize objects, so we're just going to do everything that humans do, and now we have different areas, and once we have all those things, we're going to figure out how to put them together. And I think that was the wrong approach, and we've seen how that played out historically. And so I think there's something similar that's likely going on on a higher level with AI. So people are asking, well, okay, people plan, people do experiments to figure out how the world works, people talk to other people, so we need language, and people are trying to decompose it by function, accomplish each piece, and then put it together into some kind of brain. And I just think it's the incorrect approach. And so what I've been a much bigger fan of is not decomposing it that way, but having a single kind of neural network that is the complete dynamical system, so that you're always working with a full agent. And then the question is, how do you actually create objectives such that when you optimize over the weights that make up that brain, you get intelligent behavior out? And so that's been something that I've been thinking about a lot at OpenAI. I think there are a lot of different ways that people have thought about approaching this problem.
12:11
For example, going in a supervised learning direction, I have this essay online. It's not an essay, it's a short story that I wrote. And the short story tries to come up with a hypothetical world of what it might look like if the way we approach this AGI is just by scaling up supervised learning, which we know works. And so that gets into something that looks like Amazon Mechanical Turk, where people operate lots of robot bodies and perform tasks, and then we train on that as a supervised learning dataset to imitate humans, and what that might look like, and so on. And then there are other directions, like unsupervised learning from algorithmic information theory, things like AIXI, or from artificial life, things that look more like artificial evolution. And so that's what I spend my time thinking a lot about. And I think I have the correct answer, but I'm not willing to reveal it here. [LAUGH] >> I can at least learn more by reading your blog post. >> Yeah, absolutely.
13:03
So you‘ve already given out a lot of advice, and today, there are a lot of people still wanting to enter the field of AI into deep learning. So for people in that position, what advice do you have for them? >> Yeah, absolutely. So I think when people talk to me about CS231n and why they thought it was a very useful course, what I keep hearing again and again is just people appreciate the fact that we got all the way through the low-level details. And they were not working with the library, they saw the real code. And they saw how everything was implemented, and implemented chunks of it themselves. And so just going all the way down and understanding everything under you,
13:42
it's really important to not abstract away things. You need to have a full understanding of the whole stack. And that's where I learned the most myself as well, when I was learning this stuff: implementing it myself from scratch was the most important part. It was the piece that I felt gave me the best kind of bang for the buck in terms of understanding. So I wrote my own library. It's called ConvNetJS. It was written in Javascript, and it implements convolutional neural networks. That was my way of learning about the implementation. And so that's something that I keep advising people: don't start out by working with some framework or something else. Work with a framework once you have written something yourself at the lowest level of detail, so you understand everything underneath you and you are comfortable with it. Then it's possible to use some of these frameworks that abstract some of it away from you, but you know what's under the hood. And so that's been something that helped me the most. That's something that people appreciate the most when they take 231n, and that's what I would advise a lot of people. >> So rather than just calling a library, running a neural network, and having it all happen like that. >> Yeah, where it's some kind of sequence of layers, and I know that when I add some dropout layers it makes it work better; that's not what you want. In that case, you're not going to be able to debug effectively, you're not going to be able to improve on models effectively.
14:48
Yeah, with that advice, I'm really glad that the deep learning AI course starts with many weeks of Python programming first, and then [INAUDIBLE]. >> Yeah, good, good. >> Thank you very much for sharing your insights and advice. You're already a hero to many people in the deep learning world, so I'm really glad, really grateful you could join us here today. >> Yeah, thank you for having me.
ML Strategy (2)
Carrying out error analysis - 10m
0:00
Hello, and welcome back. If you're trying to get a learning algorithm to do a task that humans can do, and if your learning algorithm is not yet at the performance of a human, then manually examining the mistakes that your algorithm is making can give you insight into what to do next. This process is called error analysis. Let's start with an example. Let's say you're working on your cat classifier, and you've achieved 90% accuracy, or equivalently 10% error, on your dev set, and let's say this is much worse than you're hoping to do. Maybe one of your teammates looks at some of the examples that the algorithm is misclassifying, and notices that it is miscategorizing some dogs as cats. And if you look at these two dogs, maybe they look a little bit like a cat, at least at first glance. So maybe your teammate comes to you with a proposal for how to make the algorithm do better, specifically on dogs. You can imagine building a focused effort, maybe to collect more dog pictures, or maybe to design features specific to dogs, or something, in order to make your cat classifier do better on dogs, so it stops misrecognizing these dogs as cats. So the question is, should you go ahead and start a project focused on the dog problem?
1:19
There could be several months of work you could do in order to make your algorithm make fewer mistakes on dog pictures.
1:27
So is that worth your effort? Well, rather than spending a few months doing this, only to risk finding out at the end that it wasn't that helpful, here's an error analysis procedure that can let you very quickly tell whether or not this could be worth your effort. Here's what I recommend you do. First, get about, say, 100 mislabeled dev set examples, then examine them manually. Just count them up one at a time, to see how many of these mislabeled examples in your dev set are actually pictures of dogs. Now, suppose that it turns out that 5% of your 100 mislabeled dev set examples are pictures of dogs. That is, if 5 out of 100 of these mislabeled dev set examples are dogs, what this means is that of a typical set of 100 examples you're getting wrong, even if you completely solve the dog problem, you only get 5 out of 100 more correct. Or in other words, if only 5% of your errors are dog pictures, then the best you could easily hope to do, if you spend a lot of time on the dog problem, is that your error might go down from 10% error to 9.5% error. So this is a 5% relative decrease in error, from 10% down to 9.5%. And so you might reasonably decide that this is not the best use of your time. Or maybe it is, but at least this gives you a ceiling, an upper bound, on how much you could improve performance by working on the dog problem.
3:10
In machine learning, sometimes we call this the ceiling on performance. Which just means, what‘s in the best case? How well could working on the dog problem help you?
3:22
But now, suppose something else happens. Suppose that when we look at your 100 mislabeled dev set examples, you find that 50 of them are actually dog images, so 50% of them are dog pictures. Now you could be much more optimistic about spending time on the dog problem. In this case, if you actually solve the dog problem, your error would go down from this 10%, down to potentially 5% error. And you might decide that halving your error could be worth a lot of effort focused on reducing the problem of dogs being misrecognized as cats. I know that in machine learning, sometimes we speak disparagingly of hand engineering things, or using too much human insight. But if you're building applied systems, then this simple counting procedure, error analysis, can save you a lot of time in terms of deciding what's the most important, or what's the most promising, direction to focus on.
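The ceiling arithmetic is simple enough to write down directly; here is a tiny sketch (my own, using the 5% and 50% figures from this example):

```python
# Best-case ("ceiling") error if one category of mistakes were completely eliminated.
def ceiling(overall_error, fraction_of_errors_in_category):
    return overall_error * (1 - fraction_of_errors_in_category)

print(ceiling(0.10, 0.05))  # 5% of errors are dogs  -> best case roughly 9.5% error
print(ceiling(0.10, 0.50))  # 50% of errors are dogs -> best case roughly 5% error
```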
4:19
In fact, if you're looking at 100 mislabeled dev set examples, maybe this is a 5 to 10 minute effort to manually go through the 100 examples and count up how many of them are dogs. And depending on the outcome, whether it's more like 5%, or 50%, or something else, this, in just 5 to 10 minutes, gives you an estimate of how worthwhile this direction is, and could help you make a much better decision about whether or not to spend the next few months focused on trying to solve the problem of misrecognized dogs. In this slide, we described using error analysis to evaluate whether or not a single idea, dogs in this case, is worth working on. Sometimes you can also evaluate multiple ideas in parallel doing error analysis. For example, let's say you have several ideas for improving your cat detector. Maybe you can improve performance on dogs. Or maybe you notice that sometimes what are called great cats, such as lions, panthers, cheetahs, and so on, are being recognized as small cats, or house cats, so you could maybe find a way to work on that. Or maybe you find that some of your images are blurry, and it would be nice if you could design something that just works better on blurry images.
5:37
And maybe you have some ideas on how to do that.
5:41
So if carrying out error analysis to evaluate these three ideas, what I would do is create a table like this.
5:50
And I usually do this in a spreadsheet, but using an ordinary text file will also be okay.
5:57
And on the left side, this goes through the set of images you plan to look at manually. So this maybe goes from 1 to 100, if you look at 100 pictures. And the columns of this table, of the spreadsheet, will correspond to the ideas you‘re evaluating. So the dog problem, the problem of great cats, and blurry images. And I usually also leave space in the spreadsheet to write comments. So remember, during error analysis, you‘re just looking at dev set examples that your algorithm has misrecognized.
6:30
So if you find that the first misrecognized image is a picture of a dog, then I'd put a check mark there. And to help myself remember these images, sometimes I'll make a note in the comments; maybe that was a pit bull picture. If the second picture was blurry, then make a note there. If the third one was a lion at the zoo on a rainy day that was misrecognized, then that's both a great cat and a blurry image; make a note in the comment section, rainy day at zoo, and it was the rain that made it blurry, and so on.
7:05
Then finally, having gone through some set of images, I would count up what percentage of these errors fall into each of the error categories: what percentage were attributed to the dog, great cat, or blurry categories. So maybe 8% of the images you examine turn out to be dogs, maybe 43% great cats, and 61% blurry. This just means going down each column and counting up what percentage of images have a check mark in that column. As you're partway through this process, sometimes you notice other categories of mistakes. So, for example, you might find that Instagram-style filters, those fancy image filters, are also messing up your classifier. In that case, it's actually okay, partway through the process, to add another column for the multi-colored filters, the Instagram filters and the Snapchat filters, and then go through and count those up as well, and figure out what percentage comes from that new error category.
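As a small sketch of this counting procedure (the records and category names below are made up for illustration; the course itself just uses a spreadsheet), you could tally the check marks like this:

```python
from collections import Counter

# Each record lists which error categories apply to one misclassified dev set image.
mislabeled_examples = [
    {"categories": ["dog"], "comment": "pit bull"},
    {"categories": ["blurry"], "comment": ""},
    {"categories": ["great cat", "blurry"], "comment": "rainy day at zoo"},
    # ... typically around 100 examples
]

counts = Counter()
for example in mislabeled_examples:
    counts.update(example["categories"])   # one image can tick several columns

n = len(mislabeled_examples)
for category, count in counts.most_common():
    print(f"{category}: {count / n:.0%} of examined errors")
```

Because one image can fall into several categories, the percentages can add up to more than 100%, just as in the 8%, 43%, and 61% figures above.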
8:12
The conclusion of this process gives you an estimate of how worthwhile it might be to work on each of these different categories of errors. For example, clearly in this example a lot of the mistakes were made on blurry images, and quite a lot were made on great cat images. The outcome of this analysis is not that you must work on blurry images; it doesn't give you a rigid mathematical formula that tells you what to do, but it gives you a sense of the best options to pursue. It also tells you, for example, that no matter how much better you do on dog images, or on Instagram images, you'll at most improve performance by maybe 8%, or 12%, in these examples. Whereas if you do better on great cat images, or blurry images, the potential improvement, the ceiling on how much you could improve performance, is much higher. So depending on how many ideas you have for improving performance on great cats or on blurry images, maybe you could pick one of the two, or, if you have enough personnel on your team, maybe you can have two different teams: have one work on improving errors on great cats, and a different team work on improving errors on blurry images.
9:27
But this quick counting procedure, which you can often do in, at most, a small number of hours, can really help you make much better prioritization decisions, and understand how promising different approaches are to work on.
9:40
So to summarize, to carry out error analysis, you should find a set of mislabeled examples in your dev set, look at those mislabeled examples for false positives and false negatives, and just count up the number of errors that fall into various different categories. During this process, you might be inspired to generate new categories of errors, like we saw. If you're looking through the examples and you say, gee, there are a lot of Instagram filters or Snapchat filters that are messing up my classifier, you can create new categories during that process. But by counting up the fraction of examples that are mislabeled in different ways, often this will help you prioritize, or give you inspiration for new directions to go in. Now, as you're doing error analysis, sometimes you notice that some of the examples in your dev set are mislabeled. So what do you do about that? Let's discuss that in the next video.
Cleaning up incorrectly labeled data - 13m
0:00
The data for your supervised learning problem comprises input X and output labels Y. What if, going through your data, you find that some of these output labels Y are incorrect, that you have data which is incorrectly labeled? Is it worth your while to go in and fix up some of these labels? Let's take a look. In the cat classification problem, Y equals one for cats and zero for non-cats. So, let's say you're looking through some data: that's a cat, that's not a cat, that's a cat, that's a cat, that's not a cat, that's a cat. No, wait, that's actually not a cat. So this is an example with an incorrect label. I've used the term mislabeled examples to refer to cases where your learning algorithm outputs the wrong value of Y. But I'm going to say incorrectly labeled examples to refer to cases where, in the data set you have, in the training set or the dev set or the test set, the label for Y, whatever a human labeler assigned to this piece of data, is actually incorrect. That's actually a dog, so the Y really should have been zero, but maybe the labeler got that one wrong. So if you find that your data has some incorrectly labeled examples, what should you do? Well, first, let's consider the training set. It turns out that deep learning algorithms are quite robust to random errors in the training set. So long as your errors, your incorrectly labeled examples, are not too far from random, maybe the labeler just wasn't paying attention or accidentally hit the wrong key on the keyboard, then it's probably okay to just leave the errors as they are and not spend too much time fixing them. There's certainly no harm in going into your training set, examining the labels and fixing them; sometimes that is worth doing. But you might be okay even if you don't, so long as the total data set size is big enough and the actual percentage of errors is not too high. I see a lot of machine learning systems trained even when we know there are a few mistakes in the training set labels, and they usually work okay. There is one caveat to this, which is that deep learning algorithms are robust to random errors but less robust to systematic errors. So for example, if your labeler consistently labels white dogs as cats, then that is a problem, because your classifier will learn to classify all white-colored dogs as cats. But random errors or near-random errors are usually not too bad for most deep learning algorithms. Now, this discussion has focused on what to do about incorrectly labeled examples in your training set. How about incorrectly labeled examples in your dev set or test set? If you're worried about the impact of incorrectly labeled examples on your dev set or test set, what I recommend you do is, during error analysis, add one extra column so that you can also count up the number of examples where the label Y was incorrect. So for example, maybe when you count up the impact on 100 mislabeled dev set examples, you're going to find 100 examples where your classifier's output disagrees with the label in your dev set, and sometimes, for a few of those examples, your classifier disagrees with the label because the label was wrong, rather than because your classifier was wrong. So maybe in this example, you find that the labeler missed a cat in the background, so put a check mark there to signify that example 98 had an incorrect label.
And maybe for this one, the picture is actually a drawing of a cat rather than a real cat, and maybe you want the labeler to have labeled it Y equals zero rather than Y equals one, so put another check mark there. And just as you count up the percentage of errors due to other categories, like we saw in the previous video, you'd also count up the fraction or percentage of errors due to incorrect labels, where the Y value in your dev set was wrong, and that accounted for why your learning algorithm made a prediction that differed from the label in your data. So the question now is, is it worthwhile going in to try to fix up this 6% of incorrectly labeled examples? My advice is, if it makes a significant difference to your ability to evaluate algorithms on your dev set, then go ahead and spend the time to fix the incorrect labels. But if it doesn't make a significant difference to your ability to use the dev set to evaluate classifiers, then it might not be the best use of your time. Let me show you an example that illustrates what I mean by this. So, three numbers I recommend you look at, to try to decide if it's worth going in and reducing the number of incorrectly labeled examples, are the following. I recommend you look at the overall dev set error. In the example we had from the previous video, we said that maybe our system has 90% overall accuracy, so 10% error. Then you should look at the number of errors, or the percentage of errors, that are due to incorrect labels. So it looks like in this case, 6% of the errors are due to incorrect labels, and 6% of 10% is 0.6%. And then you should look at the errors due to all other causes. So if you made 10% error on your dev set and 0.6% of that is because the labels are wrong, then the remainder, 9.4%, is due to other causes such as misrecognizing dogs as cats, or great cats, or blurry images. So in this case, I would say there's 9.4% worth of error that you could focus on fixing, whereas the errors due to incorrect labels are a relatively small fraction of the overall set of errors. By all means, go in and fix these incorrect labels if you want, but it's maybe not the most important thing to do right now. Now, let's take another example. Suppose you've made a lot more progress on your learning problem, so instead of 10% error, let's say you've brought the error down to 2%, but still 0.6% of your overall error is due to incorrect labels. So now, if you examine the set of mislabeled dev set examples, the set that comes from the 2% of dev set data your system is getting wrong, then a very large fraction of them, 0.6% divided by 2%, which is 30% rather than 6%, of your errors are actually due to incorrectly labeled examples, and the errors due to other causes are now 1.4%. When such a high fraction of your mistakes, as measured on your dev set, are due to incorrect labels, then it seems much more worthwhile to fix up the incorrect labels in your dev set. And if you remember the goal of the dev set, the main purpose of the dev set is that you want to use it to help you select between two classifiers A and B. So if you're trying out two classifiers A and B, and one has 2.1% error and the other has 1.9% error on your dev set, but you don't trust your dev set anymore to correctly tell you whether one classifier is actually better than the other, because 0.6% of these errors are due to incorrect labels, then there's a good reason to go in and fix the incorrect labels in your dev set. In the example on the right, the incorrect labels are having a very large impact on the overall assessment of the errors of the algorithm, whereas in the example on the left, the percentage impact on your assessment of the algorithm is much smaller.
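Here is a minimal sketch of that bookkeeping (my own illustration with hypothetical names; the numbers are the ones used in the two scenarios just described):

```python
def split_dev_error(overall_dev_error, fraction_of_errors_from_bad_labels):
    """Split overall dev set error into 'due to incorrect labels' vs 'all other causes'."""
    due_to_labels = overall_dev_error * fraction_of_errors_from_bad_labels
    other_causes = overall_dev_error - due_to_labels
    return round(due_to_labels, 4), round(other_causes, 4)

# Scenario 1: 10% dev error, 6% of examined errors caused by bad labels.
print(split_dev_error(0.10, 0.06))   # (0.006, 0.094) -> fixing labels is a minor win
# Scenario 2: 2% dev error, same 0.6% absolute label noise, i.e. 30% of errors.
print(split_dev_error(0.02, 0.30))   # (0.006, 0.014) -> label noise now dominates comparisons
```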
Now, if you decide to go into your dev set and manually re-examine the labels and try to fix up some of them, here are a few additional guidelines or principles to consider. First, I would encourage you to apply whatever process you use to both your dev and test sets at the same time. We've talked previously about why you want the dev and test sets to come from the same distribution; the dev set tells you where to aim your target, and when you hit it, you want that to generalize to the test set, so your team really works more efficiently when the dev and test sets come from the same distribution. So if you're going in to fix something in the dev set, I would apply the same process to the test set to make sure that they continue to come from the same distribution. If you hire someone to examine the labels more carefully, have them do that for both your dev and test sets. Second, I would urge you to consider examining examples your algorithm got right, as well as ones it got wrong. It is easy to look at the examples your algorithm got wrong and just see if any of those need to be fixed, but it's possible that there are some examples that it got right that should also be fixed. And if you only fix the ones your algorithm got wrong, you end up with a more biased estimate of the error of your algorithm; it gives your algorithm a little bit of an unfair advantage, because you double-check what it got wrong but you don't also double-check what it got right, and it might have gotten something right just by luck, where fixing the label would cause it to go from being right to being wrong on that example. This second bullet isn't always easy to do, so it's not always done. The reason is that if your classifier is very accurate, then it's getting far fewer things wrong than right. If your classifier has 98% accuracy, then it's getting 2% of things wrong and 98% of things right, so it's much easier to examine and validate the labels on 2% of the data, and it takes much longer to validate labels on 98% of the data; so this isn't always done. It's just something to consider. Finally, if you go into your dev and test data to correct some of the labels there, you may or may not decide to also apply the same process to the training set. Remember, we said earlier in this video that it's actually less important to correct the labels in your training set, and it's quite possible you decide to just correct the labels in your dev and test set, which are also often smaller than the training set, and not invest all the extra effort needed to correct the labels in a much larger training set. This is actually okay. We'll talk later this week about some processes for handling the case where your training data comes from a different distribution than your dev and test data; learning algorithms are quite robust to that. It's super important that your dev and test sets come from the same distribution, but if your training set comes from a slightly different distribution, often that's a pretty reasonable thing to do. I will talk more about how to handle this later this week. So I'd like to wrap up with just a couple of pieces of advice. First, deep learning researchers sometimes like to say things like, "I just fed the data to the algorithm, I trained it, and it worked."
There is a lot of truth to that in the deep learning era. There is more feeding of data to an algorithm and just training it, with less hand engineering and less use of human insight. But I think that in building practical systems, often there's also more manual error analysis and more human insight that goes into the systems than deep learning researchers sometimes like to acknowledge. Second, I've seen some engineers and researchers be reluctant to manually look at the examples. Maybe it's not the most interesting thing to do, to sit down and look at a hundred or a couple hundred examples and count the number of errors. But this is something that I do myself. When I'm leading a machine learning team and I want to understand what mistakes it is making, I would actually go in and look at the data myself and try to count the fraction of errors. And these minutes, or maybe small number of hours, of counting data can really help you prioritize where to go next. I find this a very good use of your time, and I urge you to consider doing it if you're building a machine learning system and you're trying to decide what ideas or what directions to prioritize. So that's it for the error analysis process. In the next video, I want to share with you some thoughts on how error analysis fits into how you might go about starting out on a new machine learning project.
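Before moving on, here is a minimal sketch of the bookkeeping from this video, assuming you have tallied check marks from a manual pass over roughly 100 mislabeled dev examples; the category names and counts below are made up for illustration.

# Error-analysis bookkeeping sketch; all numbers are hypothetical.
dev_error = 0.10        # overall dev set error (10% in the first example)
n_examined = 100        # mislabeled dev examples you looked at by hand

# Check-mark tallies from the manual pass (an example may get several tags).
tags = {"dog": 8, "great_cat": 43, "blurry": 61, "incorrect_label": 6}

frac_incorrect = tags["incorrect_label"] / n_examined   # 6% of the errors
errors_from_labels = dev_error * frac_incorrect         # 0.6% absolute
errors_from_other = dev_error - errors_from_labels      # 9.4% absolute

print(f"errors due to incorrect labels: {errors_from_labels:.1%}")
print(f"errors due to all other causes: {errors_from_other:.1%}")

Re-running the same arithmetic with dev_error = 0.02 gives 0.6% from labels against 1.4% from everything else, which is the roughly 30% share that makes label cleanup worthwhile.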
Build your first system quickly, then iterate - 6m
0:00
If you're working on a brand new machine learning application, one piece of advice I often give people is that I think you should build your first system quickly and then iterate. Let me show you what I mean. I've worked on speech recognition for many years. And if you're thinking of building a new speech recognition system, there are actually a lot of directions you could go and a lot of things you could prioritize. For example, there are specific techniques for making speech recognition systems more robust to noisy backgrounds. And noisy background could mean cafe noise, like a lot of people talking in the background, or car noise, the sounds of cars and highways, or other types of noise. There are ways to make a speech recognition system more robust to accented speech. There are specific problems associated with speakers that are far from the microphone; this is called far-field speech recognition. Young children's speech poses special challenges, both in terms of how they pronounce individual words as well as their choice of words and the vocabulary they tend to use. And if the speaker sometimes stutters, or uses filler sounds like "oh", "ah", "um", there are different choices and different techniques for making the transcript that you output still read fluently. So, there are these and many other things you could do to improve a speech recognition system. And more generally, for almost any machine learning application, there could be 50 different directions you could go in, and each of these directions is reasonable and would make your system better. But the challenge is, how do you pick which of these to focus on? And even though I've worked in speech recognition for many years, if I'm building a new system for a new application domain, I would still find it maybe a little bit difficult to pick without spending some time thinking about the problem. So what we recommend you do, if you're starting on building a brand new machine learning application, is to build your first system quickly and then iterate. What I mean by that is I recommend that you first quickly set up a dev/test set and metric. So this is really deciding where to place your target. And if you get it wrong, you can always move it later, but just set up a target somewhere. And then I recommend you build an initial machine learning system quickly. Find a training set, train it, and start to see and understand how well you're doing against your dev/test set and your evaluation metric. When you build your initial system, you will then be able to use bias/variance analysis, which we talked about earlier, as well as error analysis, which we talked about in the last several videos, to prioritize the next steps. In particular, if error analysis causes you to realize that a lot of the errors are from the speaker being very far from the microphone, which causes special challenges for speech recognition, then that will give you a good reason to focus on techniques to address this, called far-field speech recognition, which basically means handling when the speaker is very far from the microphone.
All the value of building this initial system, which can be a quick and dirty implementation, you know, don't overthink it, is that having some learned system, having some trained system, allows you to carry out bias/variance analysis to prioritize what to do next, and allows you to do error analysis, to look at some mistakes and figure out, of all the different directions you could go in, which ones are actually the most worthwhile. So to recap, what I recommend you do is build your first system quickly, then iterate. This advice applies less strongly if you're working on an application area in which you have significant prior experience. It also applies less strongly if there's a significant body of academic literature that you can draw on for pretty much the exact same problem you're building. So, for example, there's a large academic literature on face recognition. And if you're trying to build a face recognizer, it might be okay to build a more complex system from the get-go by building on this large body of academic literature. But if you are tackling a new problem for the first time, then I would encourage you to really not overthink it or make your first system too complicated. Just build something quick and dirty and then use that to help you prioritize how to improve your system. I've seen a lot of machine learning projects, and I've seen some teams overthink the solution and build something too complicated, and I've also seen some teams underthink and build something maybe too simple. But on average, I've seen a lot more teams overthink and build something too complicated than build something too simple. So I hope this helps. If you are applying machine learning algorithms to a new application, and your main goal is to build something that works, as opposed to inventing a new machine learning algorithm, which is a different goal, then I'd encourage you to build something quick and dirty. Use that to do bias/variance analysis, use that to do error analysis, and use the results of those analyses to help you prioritize where to go next.
Training and testing on different distributions - 10m
0:00
Deep learning algorithms have a huge hunger for training data. They just often work best when you can find enough labeled training data to put into the training set. This has resulted in many teams sometimes taking whatever data they can find and just shoving it into the training set, just to get more training data, even if some of this data, or even maybe a lot of this data, doesn't come from the same distribution as their dev and test data. So in the deep learning era, more and more teams are now training on data that comes from a different distribution than their dev and test sets. And there are some subtleties and some best practices for dealing with the case where your training and dev/test distributions differ from each other. Let's take a look. Let's say that you're building a mobile app where users will upload pictures taken from their cell phones, and you want to recognize whether the pictures your users upload from the mobile app are cats or not. So you can now get two sources of data. One is the distribution of data you really care about, the data from the mobile app like that on the right, which tends to be less professionally shot, less well framed, maybe even blurrier, because it's shot by amateur users. The other source of data is that you can crawl the web and, for the sake of this example, let's say you can download a lot of very professionally framed, high resolution, professionally taken images of cats. And let's say you don't have a lot of users yet for your mobile app. So maybe you've gotten 10,000 pictures uploaded from the mobile app. But by crawling the web you can download huge numbers of cat pictures, and maybe you have 200,000 pictures of cats downloaded off the Internet.
1:48
So what you really care about is that your final system does well on the mobile app distribution of images, right? Because in the end, your users will be uploading pictures like those on the right, and you need your classifier to do well on that. But you now have a bit of a dilemma, because you have a relatively small dataset, just 10,000 examples drawn from that distribution, and a much bigger dataset that's drawn from a different distribution, with a different appearance of images than the ones you actually want. So you don't want to use just those 10,000 images, because it ends up giving you a relatively small training set.
2:28
And using those 200,000 images seems helpful, but the dilemma is this 200,000 images isn‘t from exactly the distribution you want. So what can you do? Well, here‘s one option. One thing you can do is put both of these data sets together so you now have 210,000 images. And you can then take the 210,000 images and randomly shuffle them into a train, dev, and test set. And let‘s say for the sake of argument that you‘ve decided that your dev and test sets will be 2,500 examples each. So your training set will be 205,000 examples.
3:17
Now, setting up your data this way has some advantages but also disadvantages. The advantage is that your training, dev and test sets will all come from the same distribution, so that makes it easier to manage. But the disadvantage, and this is a huge disadvantage, is that if you look at your dev set, of these 2,500 examples, a lot of them will come from the web page distribution of images, rather than what you actually care about, which is the mobile app distribution of images.
3:48
So it turns out that of your total amount of data, 200,000, so I'll just abbreviate that 200k, out of 210,000, we'll write that as 210k, comes from web pages. So of these 2,500 dev examples, on expectation, about 2,381 of them will come from web pages. This is on expectation; the exact number will vary depending on how the random shuffle operation went. But on average, only 119 will come from mobile app uploads.
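As a quick sanity check on those two numbers, here is the expectation calculation written out in a couple of lines of Python, using the counts from this example.

# Expected composition of a 2,500-example dev set drawn by randomly shuffling
# the combined data (200k web images + 10k mobile app images).
n_web, n_mobile, dev_size = 200_000, 10_000, 2_500

frac_web = n_web / (n_web + n_mobile)
expected_web = dev_size * frac_web            # about 2,381 web images
expected_mobile = dev_size - expected_web     # about 119 mobile app images

print(f"expected web images in the dev set:    {expected_web:.0f}")
print(f"expected mobile images in the dev set: {expected_mobile:.0f}")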
4:27
So remember that setting up your dev set is telling your team where to aim the target. And with this way of aiming the target, you're telling your team to spend most of its time optimizing for the web page distribution of images, which is really not what you want.
4:42
So I would recommend against option one, because this is setting up the dev set to tell your team to optimize for a different distribution of data than what you actually care about.
4:54
So instead of doing this, I would recommend that you take another option, which is the following. The training set, let's say it's still 205,000 images; I would have the training set contain all 200,000 images from the web, and then you can, if you want, add in 5,000 images from the mobile app. And then for your dev and test sets, I guess my dataset sizes aren't drawn to scale here, your dev and test sets would be all mobile app images.
5:38
So the training set will include 200,000 images from the web and 5,000 from the mobile app. The dev set will be 2,500 images from the mobile app, and the test set will be 2,500 images also from the mobile app. The advantage of this way of splitting up your data into train, dev, and test, is that you‘re now aiming the target where you want it to be. You‘re telling your team, my dev set has data uploaded from the mobile app and that‘s the distribution of images you really care about, so let‘s try to build a machine learning system that does really well on the mobile app distribution of images. The disadvantage, of course, is that now your training distribution is different from your dev and test set distributions. But it turns out that this split of your data into train, dev and test will get you better performance over the long term. And we‘ll discuss later some specific techniques for dealing with your training sets coming from different distribution than your dev and test sets. Let‘s look at another example. Let‘s say you‘re building a brand new product, a speech activated rearview mirror for a car. So this is a real product in China. It‘s making its way into other countries but you can build a rearview mirror to replace this little thing there, so that you can now talk to the rearview mirror and basically say, dear rearview mirror, please help me find navigational directions to the nearest gas station and it‘ll deal with it.
7:19
So this is actually a real product, and let‘s say you‘re trying to build this for your own country.
7:27
So how can you get data to train up a speech recognition system for this product? Well, maybe you've worked on speech recognition for a long time, so you have a lot of data from other speech recognition applications, just not from a speech activated rearview mirror. Here's how you could split up your training and your dev and test sets. So for your training set, you can take all the speech data you have that you've accumulated from working on other speech problems, such as data you purchased over the years from various speech recognition data vendors. And today you can actually buy data from vendors of x, y pairs, where x is an audio clip and y is a transcript. Or maybe you've worked on smart speakers, smart voice activated speakers, so you have some data from that. Maybe you've worked on voice activated keyboards and so on. And for the sake of argument, maybe you have 500,000 utterances from all of these sources. And for your dev and test set, maybe you have a much smaller data set that actually came from a speech activated rearview mirror.
8:34
Because users are asking for navigational queries or trying to find directions to various places. This data set will maybe have a lot more street addresses, right? Please help me navigate to this street address, or please help me navigate to this gas station. So this distribution of data will be very different than these on the left.
8:58
But this is really the data you care about, because this is what you need your product to do well on, so this is what you set your dev and test sets to be. So what you do in this example is set your training set to be the 500,000 utterances on the left, and then your dev and test sets, which I'll abbreviate D and T, could be maybe 10,000 utterances each, drawn from the actual speech activated rearview mirror. Or alternatively, if you think you don't need to put all 20,000 examples from your speech activated rearview mirror into the dev and test sets, maybe you can take half of that and put it in the training set.
9:43
So then the training set could be 510,000 utterances, including all 500,000 from the other sources and 10,000 from the rearview mirror.
9:58
And then the dev and test sets could maybe be 5,000 utterances each. So of the 20,000 utterances, maybe 10k goes into the training set and 5k into the dev set and 5,000 into the test set. So this would be another reasonable way of splitting your data into train, dev, and test. And this gives you a much bigger training set, over 500,000 utterances, than if you were to only use speech activated rearview mirror data for your training set. So in this video, you‘ve seen a couple examples of when allowing your training set data to come from a different distribution than your dev and test set allows you to have much more training data. And in these examples, it will cause your learning algorithm to perform better. Now one question you might ask is, should you always use all the data you have? The answer is subtle, it is not always yes. Let‘s look at a counter-example in the next video.
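Here is a minimal sketch of that second way of splitting the rearview-mirror data; the two input lists are hypothetical stand-ins for your actual utterances, and the 10k/5k/5k sizes follow the example above.

import random

def split_speech_data(other_speech, mirror_speech, seed=0):
    """Sketch of the split above: all generic speech data goes into training,
    and the rearview-mirror data is divided between train, dev, and test."""
    rng = random.Random(seed)
    mirror = list(mirror_speech)
    rng.shuffle(mirror)

    half = len(mirror) // 2                     # e.g. 10k of 20k utterances
    quarter = half // 2                         # e.g. 5k utterances
    train = list(other_speech) + mirror[:half]  # ~510k utterances in total
    dev = mirror[half:half + quarter]           # 5k, from the target distribution
    test = mirror[half + quarter:]              # 5k, same distribution as dev
    return train, dev, test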
Bias and Variance with mismatched data distributions - 18m
0:00
Estimating the bias and variance of your learning algorithm really helps you prioritize what to work on next. But the way you analyze bias and variance changes when your training set comes from a different distribution than your dev and test sets. Let‘s see how.
0:16
Let‘s keep using our cat classification example and let‘s say humans get near perfect performance on this. So, Bayes error, or Bayes optimal error, we know is nearly 0% on this problem. So, to carry out error analysis you usually look at the training error and also look at the error on the dev set. So let‘s say, in this example that your training error is 1%, and your dev error is 10%. If your dev data came from the same distribution as your training set, you would say that here you have a large variance problem, that your algorithm‘s just not generalizing well from the training set which it‘s doing well on to the dev set, which it‘s suddenly doing much worse on. But in the setting where your training data and your dev data comes from a different distribution, you can no longer safely draw this conclusion. In particular, maybe it‘s doing just fine on the dev set, it‘s just that the training set was really easy because it was high res, very clear images, and maybe the dev set is just much harder.
1:23
So maybe there isn‘t a variance problem and this just reflects that the dev set contains images that are much more difficult to classify accurately.
1:33
So the problem with this analysis is that when you went from the training error to the dev error, two things changed at the same time. One is that the algorithm saw data in the training set but not in the dev set. Two, the distribution of data in the dev set is different. And because you changed two things at the same time, it's difficult to know, of this 9% increase in error, how much is because the algorithm didn't see the data in the dev set, which is the variance part of the problem, and how much is because the dev set data is just different.
2:09
So, in order to tease out these two effects, and if you didn't totally follow what these two different effects are, don't worry, we will go over it again in a second. But in order to tease out these two effects, it will be useful to define a new piece of data which we'll call the training-dev set. So this is a new subset of data, which we carve out, that should have the same distribution as your training set, but you don't explicitly train your network on it. So here's what I mean.
2:40
Previously we had set up some training sets and some dev sets and some test sets as follows. And the dev and test sets have the same distribution, but the training set has a different distribution. What we're going to do is randomly shuffle the training set and then carve out just a piece of it to be the training-dev set. So just as the dev and test set have the same distribution, the training set and the training-dev set also have the same distribution.
3:21
But the difference is that now you train your neural network just on the training set proper. You won't run backpropagation on the training-dev portion of this data. To carry out error analysis, what you should do is now look at the error of your classifier on the training set, on the training-dev set, as well as on the dev set.
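As a small sketch of how you might carve out such a training-dev set, assuming your training examples live in an ordinary Python list (the 5% fraction is just an illustrative choice):

import random

def make_training_dev_split(train_examples, train_dev_fraction=0.05, seed=0):
    """Carve out a training-dev set: same distribution as the training set,
    but never used for backpropagation, only for evaluation."""
    rng = random.Random(seed)
    shuffled = list(train_examples)
    rng.shuffle(shuffled)

    n_train_dev = int(len(shuffled) * train_dev_fraction)
    training_dev_set = shuffled[:n_train_dev]   # evaluate on this
    training_set = shuffled[n_train_dev:]       # train only on this
    return training_set, training_dev_set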
3:44
So let‘s say in this example that your training error is 1%.
3:53
And let‘s say the error on the training-dev set is 9%, and the error on the dev set is 10%, same as before.
4:08
What you can conclude from this is that when you went from the training data to the training-dev data, the error really went up a lot. And the only difference between the training data and the training-dev data is that your neural network got to see the first part of this data; it was trained explicitly on that, but it wasn't trained explicitly on the training-dev data. So this tells you that you have a variance problem.
4:40
Because the training-dev error was measured on data that comes from the same distribution as your training set. So you know that even though your neural network does well on the training set, it's just not generalizing well to data from that same distribution that it hadn't seen before.
5:04
So in this example we have really a variance problem.
5:09
Let's look at a different example. Let's say the training error is 1%, and the training-dev error is 1.5%, but when you go to the dev set your error is 10%. So now you actually have a pretty low variance problem, because when you went from training data that you've seen to the training-dev data that the neural network has not seen, the error increases only a little bit, but then it really jumps when you go to the dev set. So this is a data mismatch problem.
5:44
So this is a data mismatch problem, because your learning algorithm was not trained explicitly on data from training-dev or dev, but these two data sets come from different distributions. But whatever algorithm it‘s learning, it works great on training-dev but it doesn‘t work well on dev. So somehow your algorithm has learned to do well on a different distribution than what you really care about, so we call that a data mismatch problem.
6:17
Let‘s just look at a few more examples. I‘ll write this on the next row since I‘m running out of space on top. So Training error, Training-Dev error, and Dev error.
6:33
Let‘s say that training error is 10%, training-dev error is 11%, and dev error is 12%. Remember that human level proxy for Bayes error is roughly 0%. So if you have this type of performance, then you really have a bias, an avoidable bias problem, because you‘re doing much worse than human level. So this is really a high bias setting.
7:07
And one last example. If your training error is 10%, your training-dev error is 11% and your dev error is 20%, then it looks like this actually has two issues. One, the avoidable bias is quite high, because you're not even doing that well on the training set. Humans get nearly 0% error, but you're getting 10% error on your training set. The variance here seems quite small,
7:38
but this data mismatch is quite large. So for this example I would say you have a large bias, or avoidable bias, problem as well as a data mismatch problem.
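Putting the numbers from these examples side by side, here is a tiny helper that computes the gaps just described; it is only a sketch, and in practice you would eyeball the gaps rather than rely on any hard-coded rule.

def diagnose(human_error, train_error, train_dev_error, dev_error, test_error=None):
    """Return the gaps discussed above; all inputs are error rates in [0, 1]."""
    gaps = {
        "avoidable bias": train_error - human_error,
        "variance": train_dev_error - train_error,
        "data mismatch": dev_error - train_dev_error,
    }
    if test_error is not None:
        gaps["degree of overfitting to dev set"] = test_error - dev_error
    return gaps

# The last example: human roughly 0%, train 10%, train-dev 11%, dev 20%.
print(diagnose(0.00, 0.10, 0.11, 0.20))
# roughly: avoidable bias 0.10, variance 0.01, data mismatch 0.09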
7:56
So let‘s take what we‘ve done on this slide and write out the general principles.
8:02
The key quantities I would look at are human level error, your training set error, your training-dev set error.
8:21
So that‘s the same distribution as the training set, but you didn‘t train explicitly on it. Your dev set error, and depending on the differences between these errors, you can get a sense of how big is the avoidable bias, the variance, the data mismatch problems.
8:38
So let‘s say that human level error is 4%. Your training error is 7%. And your training-dev error is 10%. And the dev error is 12%. So this gives you a sense of the avoidable bias.
8:55
because you know, you‘d like your algorithm to do at least as well or approach human level performance maybe on the training set. This is a sense of the variance. So how well do you generalize from the training set to the training-dev set?
9:10
This gives you a sense of how much of a data mismatch problem you have. And technically you could also add one more thing, which is the test set performance, and we'll write test error. You shouldn't be doing development on your test set, because you don't want to overfit your test set. But if you also look at this, then this gap here tells you the degree of overfitting to the dev set. So if there's a huge gap between your dev set performance and your test set performance, it means you maybe overtuned to the dev set. And so maybe you need to find a bigger dev set, right? So remember that your dev set and your test set come from the same distribution. So the only way for there to be a huge gap here, for it to do much better on the dev set than the test set, is if you somehow managed to overfit the dev set. And if that's the case, what you might consider doing is going back and just getting more dev set data. Now, the numbers I've written here keep going up as you go down the list. Here's one example where the numbers don't always go up: maybe human level performance is 4%, training error is 7%, training-dev error is 10%, but when you go to the dev set you find that you actually, surprisingly, do much better, maybe 6%, and 6% on the test set as well.
10:36
So I have seen effects like this, working for example on a speech recognition task, where the training data turned out to be much harder than the dev set and test set. So these two numbers were evaluated on your training set distribution and these two were evaluated on your dev/test set distribution. So sometimes, if your dev/test set distribution is much easier for whatever application you're working on, then these numbers can actually go down. So if you see funny things like this, there's an even more general formulation of this analysis that might be helpful. Let me quickly explain that on the next slide.
11:17
So, let me motivate this using the speech activated rear-view mirror example. It turns out that the numbers we‘ve been writing down can be placed into a table where on the horizontal axis, I‘m going to place different data sets. So for example, you might have data from your general speech recognition task.
11:43
So you might have a bunch of data that you just collected from a lot of speech recognition problems you worked on, from smart speakers, data you have purchased, and so on. And then you also have the rearview mirror specific speech data, recorded inside the car.
12:04
So on this x axis on the table, I‘m going to vary the data set. On this other axis, I‘m going to label different ways or algorithms for examining the data. So first, there‘s human level performance, which is how accurate are humans on each of these data sets?
12:27
Then there is the error on the examples that your neural network has trained on.
12:38
And then finally there‘s error on the examples that your neural network has not trained on.
12:50
So it turns out that what we were calling human level error on the previous slide is the number that goes in this box, which is how well humans do on this category of data, say data from all sorts of speech recognition tasks, the 500,000 utterances that you put into your training set. And in the example on the previous slide this was 4%. This number here was maybe the training error.
13:23
Which in the example in the previous slide was 7%
13:29
Right, if your learning algorithm has seen this example, performed gradient descent on this example, and this example came from your training set distribution, or some general speech recognition distribution, how well does your algorithm do on the examples it has trained on?
13:45
Then here is the training-dev set error. It‘s usually a bit higher, which is for data from this distribution, from general speech recognition, if your algorithm did not train explicitly on some examples from this distribution, how well does it do? And that‘s what we call the training dev error.
14:10
And then if you move over to the right, this box here is the dev set error, or maybe also the test set error.
14:20
Which was 6% in the example just now. And dev and test error, it‘s actually technically two numbers, but either one could go into this box here.
14:32
And this is if you have data from your rearview mirror, from actually recorded in the car from the rearview mirror application, but your neural network did not perform back propagation on this example, what is the error?
14:46
So what we‘re doing in the analysis in the previous slide was look at differences between these two numbers, these two numbers, and these two numbers.
14:57
And this gap here is a measure of avoidable bias.
15:03
This gap here is a measure of variance, and this gap here was a measure of data mismatch.
15:13
And it turns out that it could be useful to also throw in the remaining two entries in this table.
15:21
And so if this turns out to be also 6%, and the way you get this number is you ask some humans to label the rearview mirror speech data and just measure how good humans are at this task, and maybe this turns out also to be 6%, and the way you get that is you take some rearview mirror speech data, put it in the training set so the neural network learns on it as well, and then you measure the error on that subset of the data. But if this is what you get, then, well, it turns out that you're actually already performing at the level of humans on this rearview mirror speech data, so maybe you're actually doing quite well on that distribution of data. When you do this more complete analysis, it doesn't always give you one clear path forward, but sometimes it just gives you additional insights as well. So for example, comparing these two numbers in this case tells us that for humans, the rearview mirror speech data is actually harder than general speech recognition, because humans get 6% error rather than 4% error. But then looking at these differences as well may help you understand bias and variance and data mismatch problems to different degrees. So this more general formulation is something I've used a few times. I've not needed it that often, but for a lot of problems you find that examining this subset of entries, kind of looking at this difference and this difference and this difference, is enough to point you in a pretty promising direction. But sometimes filling out this whole table can give you additional insights.
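For reference, the filled-in table from this example can be written out as a small nested dictionary; the numbers are the ones quoted above, and the row and column names are just descriptive labels.

# Rows: how the error was measured.  Columns: which distribution the data is from.
table = {
    "human level error":               {"general speech": 0.04, "rearview mirror": 0.06},
    "error on examples trained on":    {"general speech": 0.07, "rearview mirror": 0.06},
    "error on examples not trained on": {"general speech": 0.10, "rearview mirror": 0.06},
}

for row_name, cells in table.items():
    formatted = "  ".join(f"{col}: {err:.0%}" for col, err in cells.items())
    print(f"{row_name:<36}{formatted}")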
16:55
Finally, we've previously talked a lot about ideas for addressing bias and techniques for addressing variance, but how do you address data mismatch? What we've seen is that training on data that comes from a different distribution than your dev and test set can get you a lot more data and really help your learning algorithm's performance. But rather than just bias and variance problems, you now have this new potential problem of data mismatch. So what if you perform error analysis and conclude that data mismatch is a huge source of error, how do you go about addressing that? I'll be honest and say that unfortunately there aren't great, or at least not very systematic, ways to address data mismatch, but there are a few things you can try that could help. Let's take a look at them in the next video.
Addressing data mismatch - 10m
0:00
If your training set comes from a different distribution than your dev and test set, and if error analysis shows you that you have a data mismatch problem, what can you do? There aren't completely systematic solutions to this, but let's look at some things you could try. If I find that I have a large data mismatch problem, what I usually do is carry out manual error analysis and try to understand the differences between the training set and the dev/test sets. To avoid overfitting the test set, technically for error analysis you should manually look only at a dev set and not at the test set. But as a concrete example, if you're building the speech-activated rear-view mirror application, you might look at, or I guess if it's speech, listen to, examples in your dev set to try to figure out how your dev set is different from your training set. So, for example, you might find that a lot of dev set examples are very noisy and there's a lot of car noise. And this is one way that your dev set differs from your training set. And maybe you find other categories of errors. For example, in the speech-activated rear-view mirror in your car, you might find that it's often mis-recognizing street numbers, because there are a lot more navigational queries which have street addresses. So getting street numbers right is really important. When you have insight into the nature of the dev set errors, or you have insight into how the dev set may be different or harder than your training set, what you can do is then try to find ways to make the training data more similar, or, alternatively, try to collect more data similar to your dev and test sets. So, for example, if you find that car noise in the background is a major source of error, one thing you could do is simulate noisy in-car data. I'll say a little bit more about how to do this on the next slide. Or if you find that you're having a hard time recognizing street numbers, maybe you can go and deliberately try to get more data of people speaking out numbers and add that to your training set. Now, I realize that this slide is giving a rough guideline for things you could try. This isn't a systematic process and, I guess, there's no guarantee that you get the insights you need to make progress. But I have found that this manual insight, together with trying to make the data more similar on the dimensions that matter, often helps on a lot of problems. So, if your goal is to make the training data more similar to your dev set, what are some things you can do? One of the techniques you can use is artificial data synthesis, and let's discuss that in the context of addressing the car noise problem. So, to build a speech recognition system, maybe you don't have a lot of audio that was actually recorded inside the car with the background noise of a car, background noise of a highway, and so on. But it turns out there's a way to synthesize it. So, let's say that you've recorded a large amount of clean audio without this car background noise. So, here's an example of a clip you might have in your training set. By the way, this sentence is used a lot in AI for testing because it's a short sentence that contains every letter of the alphabet, so you see this sentence a lot. But, given that recording of "the quick brown fox jumps over the lazy dog," you can then also get a recording of car noise like this. So, that's what the inside of a car sounds like, if you're driving in silence.
And if you take these two audio clips and add them together, you can then synthesize what saying "the quick brown fox jumps over the lazy dog" would sound like if you were saying it in a noisy car. So, it sounds like this. So, this is a relatively simple audio synthesis example. In practice, you might synthesize other audio effects like reverberation, which is the sound of your voice bouncing off the walls of the car, and so on. But through artificial data synthesis, you might be able to quickly create more data that sounds like it was recorded inside the car without needing to go out there and collect tons of data, maybe thousands or tens of thousands of hours of audio in a car that's actually driving along. So, if your error analysis shows you that you should try to make your data sound more like it was recorded inside the car, then this could be a reasonable process for synthesizing that type of data to feed to your learning algorithm. Now, there is one note of caution I want to sound on artificial data synthesis, which is this: let's say you have 10,000 hours of data that was recorded against a quiet background, and let's say that you have just one hour of car noise. So, one thing you could try is take this one hour of car noise and repeat it 10,000 times in order to add it to these 10,000 hours of data recorded against a quiet background. If you do that, the audio will sound perfectly fine to the human ear, but there is a chance, there is a risk, that your learning algorithm will overfit to the one hour of car noise. And, in particular, if this is the set of all audio that you could record in the car, or maybe the set of all car noise backgrounds you can imagine, then with just one hour of car noise you might be simulating just a very small subset of this space. You might be synthesizing from only a very small subset of this space. And to the human ear, all this audio sounds just fine, because one hour of car noise sounds just like any other hour of car noise to the human ear. But it's possible that you're synthesizing data from a very small subset of this space, and the neural network might be overfitting to the one hour of car noise that you have. I don't know whether it would be practically feasible to inexpensively collect 10,000 hours of car noise, so that you don't need to repeat the same one hour of car noise over and over but instead have 10,000 unique hours of car noise to add to 10,000 hours of unique audio recorded against a clean background. But it's possible, no guarantees, that using 10,000 hours of unique car noise rather than just one hour could result in better performance for your learning algorithm. And the challenge with artificial data synthesis is that, as far as your ears can tell, these 10,000 hours all sound the same as that one hour, so you might end up creating a very impoverished synthesized data set from a much smaller subset of the space without actually realizing it. Here's another example of artificial data synthesis. Let's say you're building a self-driving car, and so you want to detect vehicles like this and put a bounding box around them, let's say. So, one idea that a lot of people have discussed is, well, why not use computer graphics to simulate tons of images of cars? And, in fact, here are a couple of pictures of cars that were generated using computer graphics.
And I think these graphics effects are actually pretty good, and I can imagine that by synthesizing pictures like these, you could train a pretty good computer vision system for detecting cars. Unfortunately, the picture that I drew on the previous slide applies again in this setting. Maybe this is the set of all cars and, if you synthesize just a very small subset of these cars, then to the human eye maybe the synthesized images look fine, but you might overfit to the small subset you're synthesizing. In particular, one idea that a lot of people have independently raised is, once you find a video game with good computer graphics of cars, why not just grab images from it and get a huge data set of pictures of cars? It turns out that if you look at a video game, if the video game has just 20 unique cars in it, then the video game looks fine, because you're driving around in the video game, you see these 20 other cars, and it looks like a pretty realistic simulation. But the world has a lot more than 20 unique designs of cars, and if your entire synthesized training set has only 20 distinct cars, then your neural network will probably overfit to these 20 cars. And it's difficult for a person to easily tell that, even though these images look realistic, you're really covering only a tiny subset of the set of all possible cars. So, to summarize, if you think you have a data mismatch problem, I recommend you do error analysis, or look at the training set, or look at the dev set, to try to figure out, to try to gain insight into, how these two distributions of data might differ. And then see if you can find some ways to get more training data that looks a bit more like your dev set. One of the ways we talked about is artificial data synthesis. And artificial data synthesis does work. In speech recognition, I've seen artificial data synthesis significantly boost the performance of what were already very good speech recognition systems. So, it can work very well. But, if you're using artificial data synthesis, just be cautious and bear in mind whether or not you might be accidentally simulating data only from a tiny subset of the space of all possible examples. So, that's it for how to deal with data mismatch. Next, I'd like to share with you some thoughts on how to learn from multiple types of data at the same time.
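Going back to the car-noise example, here is a rough numpy sketch of that additive synthesis, assuming both clips are mono waveforms at the same sample rate and that the noise recording is at least as long as the speech clip; the gain value is just an illustrative knob.

import numpy as np

def add_car_noise(clean_speech, car_noise, noise_gain=0.5, seed=0):
    """Mix a clean speech waveform with background car noise to synthesize
    'recorded in the car' training audio (both inputs are 1-D float arrays)."""
    rng = np.random.default_rng(seed)

    # Take a random chunk of the noise with the same length as the speech,
    # so the same stretch of noise isn't reused identically every time.
    start = rng.integers(0, len(car_noise) - len(clean_speech) + 1)
    noise_chunk = car_noise[start:start + len(clean_speech)]

    noisy = clean_speech + noise_gain * noise_chunk
    return np.clip(noisy, -1.0, 1.0)   # keep samples in a valid audio range

Note that the caveat above still applies: if car_noise is only one hour long, every synthesized clip is drawn from that same small slice of the space of possible car noise.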
Transfer learning - 11m
0:00
One of the most powerful ideas in deep learning is that sometimes you can take knowledge the neural network has learned from one task and apply that knowledge to a separate task. So for example, maybe you could have the neural network learn to recognize objects like cats and then use that knowledge, or part of that knowledge, to help you do a better job reading x-ray scans. This is called transfer learning. Let's take a look. Let's say you've trained your neural network on image recognition. So you first take a neural network and train it on X, Y pairs, where X is an image and Y is some object; the image is a cat or a dog or a bird or something else. If you want to take this neural network and adapt, or we say transfer, what it has learned to a different task, such as radiology diagnosis, meaning really reading x-ray scans, what you can do is take the last output layer of the neural network and just delete it, delete also the weights feeding into that last output layer, create a new set of randomly initialized weights just for the last layer, and have it now output the radiology diagnosis. So to be concrete, during the first phase of training, when you're training on the image recognition task, you train all of the usual parameters of the neural network, all the weights in all the layers, and you have something that now learns to make image recognition predictions. Having trained that neural network, what you then do to implement transfer learning is swap in a new data set X, Y, where now the X's are radiology images and the Y's are the diagnoses you want to predict, and initialize the last layer's weights, let's call them W^[L] and b^[L], randomly. And now, retrain the neural network on this new radiology data set. You have a couple of options for how you retrain the neural network with radiology data. If you have a small radiology dataset, you might want to retrain just the weights of the last layer, just W^[L] and b^[L], and keep the rest of the parameters fixed. If you have enough data, you could also retrain all the layers of the rest of the neural network. And the rule of thumb is, if you have a small data set, then just retrain the last layer at the output, or maybe the last one or two layers. But if you have a lot of data, then maybe you can retrain all the parameters in the network. And if you retrain all the parameters in the neural network, then this initial phase of training on image recognition is sometimes called pre-training, because you're using image recognition data to pre-initialize, or really pre-train, the weights of the neural network. And if you then update all the weights afterwards, training on the radiology data is sometimes called fine-tuning. So if you hear the words pre-training and fine-tuning in a deep learning context, this is what they mean when they refer to pre-training and fine-tuning weights in a transfer learning setting. And what you've done in this example is take knowledge learned from image recognition and apply it, or transfer it, to radiology diagnosis. And the reason this can be helpful is that a lot of the low level features, such as detecting edges, detecting curves, detecting parts of objects, learned from a very large image recognition data set, might help your learning algorithm do better in radiology diagnosis. It has just learned a lot about the structure and the nature of what images look like, and some of that knowledge will be useful.
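As a minimal sketch of those mechanics in tf.keras (chop off the old output layer, attach a new randomly initialized one, optionally freeze everything else), under the assumption that pretrained is an already-trained Keras model; this is an illustration, not the course's own code.

import tensorflow as tf

def transfer_model(pretrained, n_new_outputs, freeze_base=True):
    """Replace the output layer of `pretrained` with a new, randomly
    initialized layer (a fresh W^[L], b^[L]) for the task being transferred to."""
    if freeze_base:
        # Small target dataset: keep the earlier layers fixed and retrain
        # only the new output layer.
        for layer in pretrained.layers[:-1]:
            layer.trainable = False

    penultimate = pretrained.layers[-2].output       # drop the old output layer
    new_output = tf.keras.layers.Dense(
        n_new_outputs, activation="sigmoid", name="new_head")(penultimate)
    new_model = tf.keras.Model(inputs=pretrained.input, outputs=new_output)
    new_model.compile(optimizer="adam", loss="binary_crossentropy")
    return new_model

With plenty of data for the new task you could instead pass freeze_base=False, which corresponds to the pre-training plus fine-tuning setup described above.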
So having learned to recognize images, it might have learned enough about, you know, just what parts of different images look like, that that knowledge about lines, dots, curves, and so on, maybe small parts of objects, could help your radiology diagnosis network learn a bit faster or learn with less data. Here's another example. Let's say that you've trained a speech recognition system, so now X is an input of audio or audio snippets, and Y is the transcript. So you've trained a speech recognition system to output transcripts. And let's say that you now want to build a "wake words" or "trigger words" detection system. So, recall that a wake word or trigger word is the word we say in order to wake up speech controlled devices in our houses, such as saying "Alexa" to wake up an Amazon Echo, or "OK Google" to wake up a Google device, or "hey Siri" to wake up an Apple device, or saying "Ni hao Baidu" to wake up a Baidu device. So in order to do this, you might take out the last layer of the neural network again and create a new output node. But sometimes another thing you could do is create not just a single new output, but several new layers in your neural network, to try to predict the labels Y for your wake word detection problem. Then again, depending on how much data you have, you might just retrain the new layers of the network, or maybe you could retrain even more layers of this neural network. So, when does transfer learning make sense? Transfer learning makes sense when you have a lot of data for the problem you're transferring from and usually relatively less data for the problem you're transferring to. So for example, let's say you have a million examples for the image recognition task. So that's a lot of data to learn a lot of low level features, or to learn a lot of useful features, in the earlier layers of the neural network. But for the radiology task, maybe you have only a hundred examples. So you have very little data for the radiology diagnosis problem, maybe only 100 x-ray scans. So a lot of the knowledge you learn from image recognition can be transferred and can really help you get going with radiology recognition, even if you don't have a lot of data for radiology. For speech recognition, maybe you've trained a speech recognition system on 10,000 hours of data. So you've learned a lot about what human voices sound like from that 10,000 hours of data, which really is a lot. But for your trigger word detection, maybe you have only one hour of data. So that's not a lot of data to fit a lot of parameters. So in this case, a lot of what you learn about what human voices sound like, what the components of human speech are, and so on, can be really helpful for building a good wake word detector, even though you have a relatively small dataset, or at least a much smaller dataset, for the wake word detection task. So in both of these cases, you're transferring from a problem with a lot of data to a problem with relatively little data. One case where transfer learning would not make sense is if the opposite were true. So, if you had a hundred images for image recognition and you had 100 images for radiology diagnosis, or even a thousand images for radiology diagnosis, one way to think about it is that to do well on radiology diagnosis, assuming what you really want is to do well on radiology diagnosis, having radiology images is much more valuable than having cat and dog and so on images.
So each example here is much more valuable than each example there, at least for the purpose of building a good radiology system. So, if you already have more data for radiology, it's not that likely that having 100 images of random objects, of cats and dogs and cars and so on, will be that helpful, because the value of one example from your image recognition task of cats and dogs is just less than the value of one x-ray image for the task of building a good radiology system. So this would be one example where transfer learning, well, it might not hurt, but I wouldn't expect it to give you any meaningful gain either. And similarly, if you'd built a speech recognition system on 10 hours of data and you actually have 10 hours, or maybe even more, say 50 hours, of data for wake word detection, it may or may not hurt to include that 10 hours of data in your transfer learning, but you just wouldn't expect to get a meaningful gain. So to summarize, when does transfer learning make sense? If you're trying to learn from some Task A and transfer some of the knowledge to some Task B, then transfer learning makes sense when Task A and B have the same input X. In the first example, A and B both have images as input. In the second example, both have audio clips as input. It also tends to make sense when you have a lot more data for Task A than for Task B. All this is under the assumption that what you really want to do well on is Task B. And because data for Task B is more valuable for Task B, you usually need a lot more data for Task A, because each example from Task A is just less valuable for Task B than each example from Task B. And then finally, transfer learning will tend to make more sense if you suspect that low level features from Task A could be helpful for learning Task B. And in both of the earlier examples, maybe learning image recognition teaches you enough about images to do radiology diagnosis, and maybe learning speech recognition teaches you enough about human speech to help you with trigger word or wake word detection. So to summarize, transfer learning has been most useful if you're trying to do well on some Task B, usually a problem where you have relatively little data. So for example, in radiology, you know it's difficult to get that many x-ray scans to build a good radiology diagnosis system. So in that case, you might find a related but different task, such as image recognition, where you can get maybe a million images and learn a lot of low-level features from that, so that you can then try to do well on Task B, your radiology task, despite not having that much data for it. When transfer learning makes sense, it can help the performance of your learning task significantly. But I've also sometimes seen transfer learning applied in settings where Task A actually has less data than Task B, and in those cases you kind of don't expect to see much of a gain. So, that's it for transfer learning, where you learn from one task and try to transfer to a different task. There's another version of learning from multiple tasks, which is called multi-task learning, which is when you try to learn from multiple tasks at the same time, rather than learning from one and then sequentially, or after that, trying to transfer to a different task. So in the next video, let's discuss multi-task learning.
Multi-task learning - 12m
0:00
So whereas in transfer learning you have a sequential process, where you learn from task A and then transfer that to task B, in multi-task learning you start off simultaneously trying to have one neural network do several things at the same time. And then each of these tasks hopefully helps all of the other tasks. Let's look at an example.
0:20
Let‘s say you‘re building an autonomous vehicle, building a self driving car. Then your self driving car would need to detect several different things such as pedestrians, detect other cars, detect stop signs.
0:37
And also detect traffic lights and also other things.
0:43
So for example, in this example on the left, there is a stop sign in this image and there is a car in this image, but there aren't any pedestrians or traffic lights. So if this image is an input for an example, x(i), then instead of having one label y(i), you would actually have four labels. In this example, there are no pedestrians, there is a car, there is a stop sign and there are no traffic lights. And if you try to detect other things, then y(i) may have even more dimensions, but for now let's stick with these four. So y(i) is a 4 by 1 vector. And if you look at the training set labels as a whole, then similar to before, we'll stack the training data's labels horizontally as follows, y(1) up to y(m), except that now each y(i) is a 4 by 1 vector, so each of these is a tall column vector. And so this matrix Y is now a 4 by m matrix, whereas previously, when y was a single real number, this would have been a 1 by m matrix. So what you can do is now train a neural network to predict these values of y. So you can have a neural network that inputs x and outputs a four dimensional value for y. Notice here for the output I've drawn four nodes. And so the first node tries to predict whether there is a pedestrian in this picture, the second output will predict whether there is a car, the third whether there is a stop sign, and the last one whether there is a traffic light.
2:20
So y hat here is four dimensional.
2:26
So to train this neural network, you now need to define the loss for the neural network. So given a predicted output ŷ(i), which is 4 by 1 dimensional, the cost averaged over your entire training set would be (1/m) times the sum over i from 1 to m of the sum over j from 1 to 4 of L(ŷ_j^(i), y_j^(i)), the losses of the individual predictions.
2:59
So it's just summing over the four components: pedestrian, car, stop sign, and traffic light. And this script L is the usual logistic loss.
3:14
So just to write this out, this is L(ŷ_j^(i), y_j^(i)) = -y_j^(i) log ŷ_j^(i) - (1 - y_j^(i)) log(1 - ŷ_j^(i)).
3:31
And the main difference compared to the earlier binary classification examples is that you're now summing over j equals 1 through 4.
3:40
And the main difference between this and softmax regression is that unlike softmax regression, which assigns a single label to a single example, here one image can have multiple labels.
3:55
So you‘re not saying that each image is either a picture of a pedestrian, or a picture of car, a picture of a stop sign, picture of a traffic light. You‘re asking for each picture, does it have a pedestrian, or a car a stop sign or traffic light, and multiple objects could appear in the same image. In fact, in the example on the previous slide, we had both a car and a stop sign in that image, but no pedestrians and traffic lights. So you‘re not assigning a single label to an image, you‘re going through the different classes and asking for each of the classes does that class, does that type of object appear in the image?
4:31
So that‘s why I‘m saying that with this setting, one image can have multiple labels. If you train a neural network to minimize this cost function, you are carrying out multi-task learning. Because what you‘re doing is building a single neural network that is looking at each image and basically solving four problems. It‘s trying to tell you does each image have each of these four objects in it.
5:00
And one other thing you could have done is just train four separate neural networks, instead of train one network to do four things. But if some of the earlier features in neural network can be shared between these different types of objects, then you find that training one neural network to do four things results in better performance than training four completely separate neural networks to do the four tasks separately.
5:23
So that‘s the power of multi-task learning.
5:26
And one other detail: so far I've described this algorithm as if every image had every single label. It turns out that multi-task learning also works even if some of the images are labeled with only some of the objects. So for the first training example, let's say your labeler told you there's a pedestrian and there's no car, but they didn't bother to label whether or not there's a stop sign or whether or not there's a traffic light. And maybe for the second example, there is a pedestrian and there is a car, but again the labeler, when they looked at that image, didn't label whether it had a stop sign or whether it had a traffic light, and so on. And maybe some examples are fully labeled, and maybe for some examples they were just labeling for the presence and absence of cars, so there are some question marks, and so on. So with a data set like this, you can still train your learning algorithm to do four tasks at the same time, even when some images have only a subset of the labels and others are sort of question marks or don't cares. And the way you train your algorithm, even when some of these labels are question marks or really unlabeled, is that in this sum over j from 1 to 4, you sum only over the values of j with a 0 or 1 label.
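Here is a small numpy sketch of that loss with the "question mark" entries skipped; representing the unlabeled entries as NaN is just a convention chosen for this sketch.

import numpy as np

def multitask_loss(y_hat, y):
    """Multi-task logistic loss averaged over m examples.
    y_hat: (4, m) predicted probabilities for pedestrian/car/stop sign/light.
    y:     (4, m) labels that are 0, 1, or NaN where the labeler didn't say."""
    eps = 1e-12                       # avoid log(0)
    labeled = ~np.isnan(y)            # sum only over entries with a 0/1 label
    y_filled = np.where(labeled, y, 0.0)

    losses = -(y_filled * np.log(y_hat + eps)
               + (1 - y_filled) * np.log(1 - y_hat + eps))
    return np.sum(losses[labeled]) / y.shape[1]

# Two images; the second is missing the stop-sign and traffic-light labels.
y = np.array([[1, 1], [0, 1], [1, np.nan], [0, np.nan]], dtype=float)
y_hat = np.array([[0.9, 0.8], [0.1, 0.7], [0.8, 0.5], [0.2, 0.4]])
print(multitask_loss(y_hat, y))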
6:41
So whenever there's a question mark, you just omit that term from the summation and sum only over the values where there is a label. And so that allows you to use data sets like this as well. So when does multi-task learning make sense? I'll say it makes sense usually when three things are true. One is if you're training on a set of tasks that could benefit from having shared low-level features. So for the autonomous driving example, it makes sense that recognizing traffic lights and cars and pedestrians, those should have similar features that could also help you recognize stop signs, because these are all features of roads.
7:23
Second, and this is less of a hard and fast rule, so it isn't always true, but what I see from a lot of successful multi-task learning settings is that the amount of data you have for each task is quite similar. So if you recall, in transfer learning you learn from some task A and transfer it to some task B. So if you have a million examples for task A and 1,000 examples for task B, then all the knowledge you learned from that million examples could really help augment the much smaller data set you have for task B. Well, how about multi-task learning? In multi-task learning you usually have a lot more tasks than just two. So maybe previously we had 4 tasks, but let's say you have 100 tasks. And you're going to do multi-task learning to try to recognize 100 different types of objects at the same time. So what you may find is that you may have 1,000 examples per task, and if you focus on the performance of just one task, let's focus on the performance of the 100th task, which you can call A100. If you were trying to do this final task in isolation, you would have had just a thousand examples to train this one of the 100 tasks, whereas by training on the 99 other tasks, which in aggregate have 99,000 training examples, you could get a big boost, a lot of knowledge to augment this otherwise relatively small 1,000-example training set that you have for task A100. And symmetrically, every one of the other 99 tasks can provide some data or some knowledge that helps every one of the other tasks in this list of 100 tasks.
9:02
So the second bullet isn't a hard and fast rule, but what I tend to look at is this: if you focus on any one task, then for that task to get a big boost from multi-task learning, the other tasks in aggregate need to have quite a lot more data than that one task. And one way to satisfy that is if you have a lot of tasks, as we do in this example on the right, and the amount of data you have for each task is quite similar. But the key really is that if you already have 1,000 examples for one task, then for all of the other tasks you'd better have a lot more than 1,000 examples in aggregate if those other tasks are meant to help you do better on this final task. And finally, multi-task learning tends to make more sense when you can train a big enough neural network to do well on all the tasks. So the alternative to multi-task learning would be to train a separate neural network for each task. So rather than training one neural network for pedestrian, car, stop sign, and traffic light detection, you could have trained one neural network for pedestrian detection, one neural network for car detection, one neural network for stop sign detection, and one neural network for traffic light detection.
10:06
So what a researcher, Rich Caruana, found many years ago was that the only time multi-task learning hurts performance compared to training separate neural networks is if your neural network isn't big enough. But if you can train a big enough neural network, then multi-task learning certainly should not, or should very rarely, hurt performance. And hopefully it will actually help performance compared to training neural networks to do these different tasks in isolation. So that's it for multi-task learning. In practice, multi-task learning is used much less often than transfer learning. I see a lot of applications of transfer learning where you have a problem you want to solve with a small amount of data, so you find a related problem with a lot of data, learn something there, and transfer it to this new problem. But with multi-task learning it's just rarer that you have a huge set of tasks you want to do well on and can train on all of those tasks at the same time. Maybe the one exception is computer vision. In object detection I see more applications of multi-task learning, where one neural network trying to detect a whole bunch of objects at the same time works better than different neural networks trained separately to detect objects. But I would say that on average transfer learning is used much more today than multi-task learning, but both are useful tools to have in your arsenal. So to summarize, multi-task learning enables you to train one neural network to do many tasks, and this can give you better performance than if you were to do the tasks in isolation. Now, one note of caution: in practice I see that transfer learning is used much more often than multi-task learning. So I do see a lot of tasks where, if you want to solve a machine learning problem but you have a relatively small data set, then transfer learning can really help. That is, if you find a related problem where you have a much bigger data set, you can train your neural network there and then transfer it to the problem where you have very little data. So transfer learning is used a lot today. There are some applications of multi-task learning as well, but multi-task learning I think is used much less often than transfer learning. And maybe the one exception is computer vision object detection, where I do see a lot of applications of training a neural network to detect lots of different objects, and that works better than training separate neural networks to detect the individual objects. But on average I think that even though transfer learning and multi-task learning are often presented in a similar way, in practice I've seen a lot more applications of transfer learning than of multi-task learning. I think that's because it's often just difficult to set up or to find so many different tasks that you would actually want to train a single neural network for, again with computer vision object detection examples being the most notable exception. So that's it for multi-task learning. Multi-task learning and transfer learning are both important tools to have in your tool bag. And finally, I'd like to move on to discuss end-to-end deep learning. So let's go on to the next video to discuss end-to-end learning.
What is end-to-end deep learning? - 11m
0:00
One of the most exciting recent developments in deep learning has been the rise of end-to-end deep learning. So what is end-to-end learning? Briefly, there have been some data processing systems, or learning systems, that require multiple stages of processing. And what end-to-end deep learning does is it can take all those multiple stages and replace them, usually, with just a single neural network. Let's look at some examples. Take speech recognition as an example, where your goal is to take an input X, such as an audio clip, and map it to an output Y, which is a transcript of the audio clip. So traditionally, speech recognition required many stages of processing. First, you would extract some features, some hand-designed features of the audio. So if you've heard of MFCC, that's an algorithm for extracting a certain set of hand-designed features for audio. And then having extracted some low-level features, you might apply a machine learning algorithm to find the phonemes in the audio clip. So phonemes are the basic units of sound. So for example, the word cat is made out of three sounds: the Cu-, Ah-, and Tu-. So you extract those. And then you string together phonemes to form individual words. And then you string those together to form the transcript of the audio clip. So, in contrast to this pipeline with a lot of stages, what end-to-end deep learning does is you can train a huge neural network to just input the audio clip and have it directly output the transcript. One interesting sociological effect in AI is that as end-to-end deep learning started to work better, there were some researchers that had, for example, spent many years of their career designing individual steps of the pipeline. So there were some researchers in different disciplines, not just in speech recognition, maybe in computer vision and other areas as well, that had spent a lot of time, you know, written multiple papers, maybe even built a large part of their career engineering features or engineering other pieces of the pipeline. And when end-to-end deep learning just took the training set and learned the function mapping from x to y directly, really bypassing a lot of these intermediate steps, it was challenging for some disciplines to come around to accepting this alternative way of building AI systems. Because it really obsoleted, in some cases, many years of research on some of the intermediate components. It turns out that one of the challenges of end-to-end deep learning is that you might need a lot of data before it works well. So for example, if you're training on 3,000 hours of data to build a speech recognition system, then the traditional pipeline, the full traditional pipeline, works really well. It's only when you have a very large data set, say 10,000 hours of data, anything going up to maybe 100,000 hours of data, that the end-to-end approach then suddenly starts to work really well. So when you have a smaller data set, the more traditional pipeline approach actually works just as well. Often it works even better. And you need a large data set before the end-to-end approach really shines. And if you have a medium amount of data, then there are also intermediate approaches where maybe you input audio and bypass the features and just have the neural network learn to output the phonemes, and then have some other stages as well. So this would be a step toward end-to-end learning, but not all the way there.
So this is a picture of a face recognition turnstile built by a researcher, Yuanqing Lin at Baidu, where this is a camera and it looks at the person approaching the gate, and if it recognizes the person then, you know, the turnstile automatically lets them through. So rather than needing to swipe an RFID badge to enter this facility, in increasingly many offices in China, and hopefully more and more in other countries as well, you can just approach the turnstile, and if it recognizes your face it just lets you through without needing you to carry an RFID badge. So, how do you build a system like this? Well, one thing you could do is just look at the image that the camera is capturing. Right? So, I guess this is my bad drawing, but maybe this is a camera image, and you have someone approaching the turnstile. So this might be the image X that your camera is capturing. And one thing you could do is try to learn a function mapping directly from the image X to the identity of the person Y. It turns out this is not the best approach. And one of the problems is that the person approaching the turnstile can approach from lots of different directions. So they could be in the green position, they could be in the blue position. Sometimes they're closer to the camera, so their face appears much bigger in the image. So what has actually been done to build these turnstiles is not to just take the raw image and feed it to a neural net to try to figure out a person's identity. Instead, the best approach to date seems to be a multi-step approach, where first you run one piece of software to detect the person's face. So this first detector figures out where the person's face is. Having detected the person's face, you then zoom in to that part of the image and crop that image so that the person's face is centered. Then it is this picture, that I guess I drew here in red, that is fed to the neural network to then try to learn, or estimate, the person's identity. And what researchers have found is that instead of trying to learn everything in one step, by breaking this problem down into two simpler steps, first figure out where the face is, and second, look at the face and figure out who this actually is, this second approach allows the learning algorithm, or really two learning algorithms, to solve two much simpler tasks and results in overall better performance. By the way, if you want to know how step two here actually works, I've actually simplified the description a bit. The way the second step is actually trained is you train a neural network that takes as input two images, and what your network does is it tells you whether these two images are of the same person or not. So if you then have, say, 10,000 employee IDs on file, you can then take this image in red and quickly compare it against maybe all 10,000 employee IDs on file to try to figure out if this picture in red is indeed one of your 10,000 employees whom you should allow into this facility or into your office building. This is a turnstile that is giving employees access to a workplace. So why is it that the two-step approach works better? There are actually two reasons for that. One is that each of the two problems you're solving is actually much simpler.
But second is that you have a lot of data for each of the two sub-tasks. In particular, there is a lot of data you can obtain for face detection, for task one over here, where the task is to look at an image and figure out where the person's face is in the image. So there is a lot of data. There is a lot of labeled data (X, Y) where X is a picture and Y shows the position of the person's face. So you could build a neural network to do task one quite well. And then separately, there's a lot of data for task two as well. Today, leading companies have, let's say, hundreds of millions of pictures of people's faces. So given a closely cropped image, like this red image or this one down here, today leading face recognition teams have at least hundreds of millions of images that they could use to look at two images and try to figure out the identity, or to figure out if it's the same person or not. So there's also a lot of data for task two. But in contrast, if you were to try to learn everything at the same time, there is much less data of the form (X, Y), where X is an image like this taken from the turnstile, and Y is the identity of the person. So because you don't have enough data to solve this end-to-end learning problem, but you do have enough data to solve sub-problems one and two, in practice, breaking this down into two sub-problems results in better performance than a pure end-to-end deep learning approach. Although if you had enough data for the end-to-end approach, maybe the end-to-end approach would work better, that's not actually what works best in practice today. Let's look at a few more examples. Take machine translation. Traditionally, machine translation systems also had a long, complicated pipeline, where you first take, say, English text and then do text analysis, basically extracting a bunch of features off the text, and so on. And after many, many steps you'd end up with, say, a translation of the English text into French. Because, for machine translation, you do have a lot of pairs of English and French sentences, end-to-end deep learning works quite well for machine translation. And that's because today it is possible to gather large data sets of X-Y pairs, where X is the English sentence and Y is the corresponding French translation. So in this example, end-to-end deep learning works well. One last example: let's say that you want to look at an X-ray picture of a hand of a child and estimate the age of the child. You know, when I first heard about this problem, I thought this is a very cool crime scene investigation task where you find, maybe tragically, the skeleton of a child, and you want to figure out how old the child was. It turns out that the typical application of this problem, estimating the age of a child from an X-ray, is less dramatic than the crime scene investigation I was picturing. It turns out that pediatricians use this to estimate whether or not a child is growing or developing normally. But a non-end-to-end approach to this would be: you look at the image and then you segment out or recognize the bones. So you just try to figure out, where is that bone segment? Where is that bone segment? Where is that bone segment? And so on. And then, knowing the lengths of the different bones, you can go to a lookup table showing the average bone lengths in a child's hand and use that to estimate the child's age. And so this approach actually works pretty well.
In contrast, if you were to go straight from the image to the child's age, then you would need a lot of data to do that directly, and as far as I know, this approach does not work as well today, just because there isn't enough data to train this task in an end-to-end fashion. Whereas in contrast, you can imagine that by breaking down this problem into two steps, step one is a relatively simple problem. Maybe you don't need that much data. Maybe you don't need that many X-ray images to segment out the bones. And for task two, by collecting statistics of a number of children's hands, you can also get decent estimates of that without too much data. So this multi-step approach seems promising, maybe more promising than the end-to-end approach, at least until you can get more data for the end-to-end learning approach. So end-to-end deep learning can work really well and it can really simplify the system, not requiring you to build so many hand-designed individual components. But it's also not a panacea; it doesn't always work. In the next video, I want to share with you a more systematic description of when you should, and maybe when you shouldn't, use end-to-end deep learning and how to piece together these complex machine learning systems.
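Going back to the turnstile example above, here is a rough, purely hypothetical sketch of the two-step approach; every function name, the bounding-box format, and the similarity threshold are my own assumptions, with placeholder return values standing in for the trained detector and verification networks.

```python
import numpy as np

def detect_face(image):
    """Step 1: a face detector returns the bounding box (x1, y1, x2, y2) of the face."""
    return (40, 60, 120, 140)                      # placeholder for a trained detector

def crop_face(image, box):
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]                     # centered face crop

def same_person_score(face_a, face_b):
    """Step 2: a verification network scores whether two face crops show the same person."""
    return 0.97                                    # placeholder for a trained network

def admit(camera_image, employee_faces, threshold=0.9):
    face = crop_face(camera_image, detect_face(camera_image))
    # Compare the crop against every employee face on file; open the gate on any match
    return any(same_person_score(face, emp) > threshold for emp in employee_faces)

# Toy usage with dummy grayscale arrays standing in for real images
camera_image = np.zeros((200, 200))
print(admit(camera_image, employee_faces=[np.zeros((80, 80))]))
```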
Whether to use end-to-end deep learning - 10m
0:00
Let's say in building a machine learning system you're trying to decide whether or not to use an end-to-end approach. Let's take a look at some of the pros and cons of end-to-end deep learning so that you can come away with some guidelines on whether or not an end-to-end approach seems promising for your application. Here are some of the benefits of applying end-to-end learning. First, end-to-end learning really just lets the data speak. So if you have enough (X, Y) data, then whatever is the most appropriate function mapping from X to Y, if you train a big enough neural network, hopefully the neural network will figure it out. And by having a pure machine learning approach, your neural network learning the mapping from X to Y may be more able to capture whatever statistics are in the data, rather than being forced to reflect human preconceptions. So for example, in the case of speech recognition, earlier speech systems had this notion of a phoneme, which was a basic unit of sound, like the C, A, and T for the word cat. And I think that phonemes are an artifact created by human linguists. I actually think that phonemes are a fantasy of linguists that is a reasonable description of language, but it's not obvious that you want to force your learning algorithm to think in phonemes. And if you let your learning algorithm learn whatever representation it wants to learn, rather than forcing your learning algorithm to use phonemes as a representation, then its overall performance might end up being better. The second benefit of end-to-end deep learning is that there's less hand-designing of components needed. And so this could also simplify your design workflow; you just don't need to spend a lot of time hand-designing features or hand-designing these intermediate representations. How about the disadvantages? Here are some of the cons. First, it may need a large amount of data. So to learn this X to Y mapping directly, you might need a lot of (X, Y) data, and we were seeing in the previous video some examples of where you could obtain a lot of data for subtasks. Such as for face recognition, we could find a lot of data for finding a face in the image, as well as identifying the face once you've found it, but there was just less data available for the entire end-to-end task. So X is the input end of the end-to-end learning and Y is the output end, and you need data with both the input end and the output end in order to train these systems. This is why we call it end-to-end learning, because you're learning a direct mapping from one end of the system all the way to the other end of the system. The other disadvantage is that it excludes potentially useful hand-designed components. So machine learning researchers tend to speak disparagingly of hand-designing things. But if you don't have a lot of data, then your learning algorithm doesn't have that much insight it can gain from your data if your training set is small. And so hand-designing a component can really be a way for you to inject manual knowledge into the algorithm, and that's not always a bad thing. I think of a learning algorithm as having two main sources of knowledge. One is the data and the other is whatever you hand-design, be it components, or features, or other things.
And so when you have a ton of data, it's less important to hand-design things, but when you don't have much data, then having a carefully hand-designed system can actually allow humans to inject a lot of knowledge about the problem into the algorithm, and that can be very helpful. So one of the downsides of end-to-end deep learning is that it excludes potentially useful hand-designed components. And hand-designed components could be very helpful if well designed. They could also be harmful if they really limit your performance, such as if you force an algorithm to think in phonemes when maybe it could have discovered a better representation by itself. So it's kind of a double-edged sword that could hurt or help, but it does tend to help more, hand-designed components tend to help more, when you're training on a small training set. So if you're building a new machine learning system and you're trying to decide whether or not to use end-to-end deep learning, I think the key question is: do you have sufficient data to learn a function of the complexity needed to map from X to Y? I don't have a formal definition of this phrase, complexity needed, but intuitively, if you're trying to learn a function from X to Y, that is, looking at an image like this and recognizing the positions of the bones in the image, then maybe this seems like a relatively simple problem, to identify the bones in the image, and maybe you don't need that much data for that task. Or given a picture of a person, maybe finding the face of that person in the image doesn't seem like that hard a problem, so maybe you don't need too much data to find the face of a person, or at least maybe you can find enough data to solve that task. Whereas in contrast, the function needed to look at the hand and map that directly to the age of the child, that seems like a much more complex problem, where intuitively you'd need more data to learn it if you were to apply a pure end-to-end deep learning approach. So let me finish this video with a more complex example. You may know that I've been spending time helping out an autonomous driving company, Drive.ai. So I'm actually very excited about autonomous driving. So how do you build a car that drives itself? Well, here's one thing you could do, and this is not an end-to-end deep learning approach. You can take as input an image of what's in front of your car, maybe radar, lidar, and other sensor readings as well, but to simplify the description, let's just say you take a picture of what's in front of or around your car. And then to drive your car safely, you need to detect other cars and you also need to detect pedestrians. You need to detect other things as well, of course, but we'll just present a simplified example here. Having figured out where the other cars and pedestrians are, you then need to plan your own route. So in other words, if you see where the other cars are, where the pedestrians are, you need to decide how to steer your own car, what path to steer your own car along for the next several seconds. And having decided that you're going to drive a certain path, maybe this is a top-down view of a road and that's your car, maybe you've decided to drive that path, that's what a route is, then you need to execute this by generating the appropriate steering, as well as acceleration and braking commands.
So in going from your image or your sensor inputs to detecting cars and pedestrians, that can be done pretty well using deep learning. But then, having figured out where the other cars and pedestrians are, going from that to selecting the route, to exactly how you want to move your car, usually that's not done with deep learning. Instead that's done with a piece of software called motion planning. And if you ever take a course in robotics, you'll learn about motion planning. And then, having decided what path you want to steer your car through, there'll be some other algorithm, let's say a control algorithm, that then decides exactly how much to turn the steering wheel and how much to step on the accelerator or step on the brake. So I think what this example illustrates is that you want to use machine learning or deep learning to learn some individual components, and when applying supervised learning, you should carefully choose what types of X to Y mappings you want to learn, depending on what tasks you can get data for. And in contrast, it is exciting to talk about a pure end-to-end deep learning approach where you input the image and directly output a steering command. But given data availability and the types of things we can learn with neural networks today, this is actually not the most promising approach, or this is not an approach that I think teams have gotten to work best. And I think this pure end-to-end deep learning approach is actually less promising than more sophisticated approaches like this one, given the availability of data and our ability to train neural networks today. So that's it for end-to-end deep learning. It can sometimes work really well, but you also have to be mindful of where you apply end-to-end deep learning. Finally, thank you and congrats on making it this far with me. If you've finished last week's videos and this week's videos, then I think you will already be much smarter and much more strategic and much more able to make good prioritization decisions in terms of how to move forward on your machine learning project, even compared to a lot of machine learning engineers and researchers that I see here in Silicon Valley. So congrats on all that you've learned so far, and I hope you now also take a look at this week's homework problems, which should give you another opportunity to practice these ideas and make sure that you're mastering them.
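As a purely illustrative sketch of the modular pipeline just described, here is some hypothetical Python with placeholder implementations; the function names, data types, and return values are all my own assumptions, standing in for a learned detector plus classical motion planning and control.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    kind: str                         # "car" or "pedestrian"
    position: Tuple[float, float]     # position relative to the vehicle (toy units)

def detect_objects(image) -> List[Detection]:
    # In practice: a deep-learning detector trained with supervised learning
    return [Detection("car", (12.0, 3.0)), Detection("pedestrian", (5.0, -1.0))]

def plan_route(detections: List[Detection]) -> List[Tuple[float, float]]:
    # In practice: motion planning (typically not deep learning) picks a path
    # for the next several seconds, avoiding the detected objects
    return [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0)]

def control(path: List[Tuple[float, float]]) -> dict:
    # In practice: a control algorithm turns the planned path into
    # steering, acceleration, and braking commands
    return {"steering": 0.05, "accelerate": 0.2, "brake": 0.0}

def drive_step(image):
    detections = detect_objects(image)   # learned component
    path = plan_route(detections)        # hand-engineered component
    return control(path)                 # hand-engineered component

print(drive_step(image=None))
```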
(Optional) Heroes of Deep Learning - Ruslan Salakhutdinov interview - 17m
0:03
Welcome, Rus, I'm really glad you could join us here today. >> Thank you, thank you Andrew. >> So today you're the director of research at Apple, and you also hold a faculty position as a professor at Carnegie Mellon University. So I'd love to hear a bit about your personal story. How did you end up doing this deep learning work that you do? >> Yeah, it's actually, to some extent, I started in deep learning by luck. I did my master's degree at Toronto, and then I took a year off. I was actually working in the financial sector. It's a little bit surprising. And at that time, I wasn't quite sure whether I wanted to go for my PhD or not. And then something happened, something surprising happened. I was going to work one morning, and I bumped into Geoff Hinton. And Geoff told me, hey, I have this terrific idea. Come to my office, I'll show you. And so we basically walked together, and he started telling me about these Boltzmann Machines and contrastive divergence, and some of the tricks, which at that time I didn't quite understand.
1:10
But that was really exciting, and it really excited me. And then, basically within three months, I started my PhD with Geoff.
1:21
So that was kind of the beginning, because that was back in 2005, 2006. And this is when some of the original deep learning algorithms, using Restricted Boltzmann Machines and unsupervised pre-training, were kind of popping up. And so that's how I started, really. That one particular morning when I bumped into Geoff completely changed my future career moving forward. >> And then in fact you were a co-author on one of the very early papers on Restricted Boltzmann Machines that really helped with this resurgence of neural networks and deep learning. Tell me a bit more about what it was like working on that seminal work. >> Yeah, this was actually really exciting. It was my first year as a PhD student, and Geoff and I were trying to explore these ideas of using Restricted Boltzmann Machines and using pre-training tricks to train multiple layers. And specifically we were trying to focus on auto-encoders: how do we effectively do a non-linear extension of PCA? And it was very exciting, because we got these systems to work, which was exciting, but then the next step for us was to really see whether we could extend these models to dealing with faces. I remember we had this Olivetti faces dataset. And then we started looking at, can we do compression for documents? And we started looking at all these different kinds of data: real-valued, count, binary. And throughout that year, I was a first-year PhD student, so it was a big learning experience for me. But really, within six or seven months, we were able to get really interesting results, I mean really good results. I think that we were able to train these very deep auto-encoders. This is something that you couldn't do at that time using traditional optimization techniques. And it turned into a really, really exciting period for us.
3:27
That was super exciting, yeah, because it was a lot of learning for me, but at the same time, the results turned out to be really, really impressive for what we were trying to do.
3:42
So in the early days of this research on deep learning, a lot of the activity was centered on Restricted Boltzmann Machines and then Deep Boltzmann Machines. There's still a lot of exciting research being done there, including some in your group, but what's happening with Boltzmann Machines and Restricted Boltzmann Machines? >> Yeah, that's a very good question. I think that in the early days, the way that we were using Restricted Boltzmann Machines is that you can imagine training a stack of these Restricted Boltzmann Machines, which would allow you to learn effectively one layer at a time. And there's a good theory behind it that when you add a particular layer, it can improve the variational bound, and so forth, under certain conditions. So there was a theoretical justification, and these models were working quite well in terms of being able to pre-train these systems. And then around 2009, 2010, once the compute started showing up, GPUs, a lot of us started realizing that directly optimizing these deep neural networks was giving similar results or even better results. >> So just standard backprop without the pre-training or the Restricted Boltzmann Machine. >> That's right, that's right. And that happened over three or four years, and it was exciting for the whole community, because people felt that, wow, you can actually train these deep models using these pre-training mechanisms. And then, with more compute, people started realizing that you can just basically do standard backpropagation, something that we couldn't do
5:11
back in 2005 or 2004, because it would take us months to do it on CPUs. And so that was a big change. The other thing is that we haven't really figured out what to do with Boltzmann Machines and Deep Boltzmann Machines. I believe they're very powerful models, because you can think of them as generative models; they're trying to model the joint distribution of the data. But when we start looking at learning algorithms, the learning algorithms right now require using Markov Chain Monte Carlo and variational learning and such, which is not as scalable as backpropagation algorithms. So we have yet to figure out more efficient ways of training these models, and also how to use convolution; it's something that's fairly difficult to integrate into these models. I remember some of your work on using probabilistic max pooling to build these generative models of different objects, and using these ideas of convolution was also very, very exciting, but at the same time, it's still extremely hard to train these models. >> Hard to get to work? >> Yes, hard to get to work, right? And so we still have to figure that out.
6:27
On the other hand, some of the recent work using variational auto-encoders, for example, which could be viewed as directed versions of Boltzmann Machines, we have figured out ways of training these models, through work by Max Welling and Diederik Kingma on using reparameterization tricks. And now we can use the backpropagation algorithm within a stochastic system, which is driving a lot of progress right now. But we haven't quite figured out how to do that in the case of Boltzmann Machines. >> So that's actually a very interesting perspective I wasn't aware of, which is that in an early era where computers were slower, the RBM pre-training was really important, and it was only faster computation that drove the switch to standard backprop. In terms of the evolution of the community's thinking in deep learning and other topics, I know you spend a lot of time thinking about this, the generative, unsupervised versus supervised approaches. Could you share a bit about how your thinking about that has evolved over time? >> Yeah, I feel like it's a very important topic, particularly if we think about unsupervised, semi-supervised, or generative models, because to some extent a lot of the successes that we've seen recently are due to supervised learning. And back in the early days, unsupervised learning was primarily viewed as unsupervised pre-training, because we didn't know how to train these multi-layer systems. And even today, if you're working in settings where you have lots and lots of unlabeled data and a small fraction of labeled examples, these unsupervised pre-training models, building these generative models, can help the supervised task. So I think that for a lot of us in the community, that kind of was the belief. When I started doing my PhD, it was all about generative models and trying to learn these stacks of models, because that was the only way for us to train these systems. Today, there is a lot of work in generative modeling. If you look at Generative Adversarial Networks, if you look at variational auto-encoders, deep energy models is something that my lab is working on right now as well. I think it's very exciting research, but perhaps we haven't quite figured it out. Again, for many of you who are thinking about getting into the deep learning field, this is one area where I think we'll make a lot of progress, hopefully in the near future. >> So, unsupervised learning. >> Unsupervised learning, right. Or maybe you can think of it as unsupervised learning or semi-supervised learning, where I give you some hints or some examples of what different things mean and then throw lots and lots of unlabeled data at you. >> So that was actually a very important insight, that in an earlier era of deep learning where computers were just slower, the Restricted Boltzmann Machine and Deep Boltzmann Machine were needed for initializing the neural network weights, but as computers got faster, straight backprop then started to work much better. So one other topic that I know you spend a lot of time thinking about is the supervised learning versus generative models, unsupervised learning approaches. So tell me a bit about how your thinking on that debate has evolved over time. >> I think that we all believe that we should be able to make progress there. It's just all the work on Boltzmann Machines, variational auto-encoders, GANs.
You can think of a lot of these models as generative models, but we haven't quite figured out how to really make them work and how you can make use of large amounts of unlabeled data. And even in the IT sector, I see that companies have lots and lots of data, lots of unlabeled data, and lots of effort goes into annotation, because that's the only way for us to make progress right now. And it seems like we should be able to make use of unlabeled data, because there's just an abundance of it. And we haven't quite figured out how to do that yet.
10:44
So you mentioned that for people wanting to enter deep learning research, unsupervised learning is an exciting area. Today there are a lot of people wanting to enter deep learning, either research or applied work, so for this global community, either research or applied work, what advice would you have? >> Yes, I think that one of the key pieces of advice I'd give to people entering the field is to just try different things and not be afraid to try new things, and not be afraid to try to innovate. I can give you one example. When I was a graduate student, we were looking at neural nets, and these are highly non-convex systems that are hard to optimize. And I remember talking to my friends within the optimization community. And the feedback was always that, well, there's no way you can solve these problems because these are non-convex, we don't understand optimization, how could you ever even do that compared to doing convex optimization? And it was surprising, because in our lab we never really cared that much about those specific problems. We were thinking about how we can optimize and whether we can get interesting results. And that effectively was driving the community, so we weren't scared, maybe to some extent because we were actually lacking the theory behind optimization. But I would encourage people to just try, and not be afraid to try to tackle hard problems. >> Yeah, and I remember you once said, don't just learn to code in high-level deep learning frameworks, but actually understand deep learning. >> Yes, that's right. I think that's one of the things that I try to do when I teach a deep learning class. In one of the homeworks, I ask people to actually code the backpropagation algorithm for convolutional neural networks. And it's painful, but at the same time, if you do it once, you'll really understand how these systems operate and how they work.
12:49
And how you can efficiently implement them on a GPU, and I think it's important that, when you go into research or industry, you have a really good understanding of what these systems are doing. So it's important, I think. >> Since you have both academic experience as a professor and corporate experience, I'm curious, if someone wants to enter deep learning, what are the pros and cons of doing a PhD versus joining a company? >> Yeah, I think that's actually a very good question.
13:22
In my particular lab, I have a mix of students. Some students want to go and take an academic route. Some students want to go and take an industry route. And it's becoming a very hard choice, because you can do amazing research in industry, and you can also do amazing research in academia. But in terms of pros and cons, in academia, I feel like you have more freedom to work on long-term problems, or if you think about some crazy problem, you can work on it. So you have a little bit more freedom. At the same time, the research that you're doing in industry is also very exciting, because in many cases with your research you can impact millions of users if you develop a core AI technology. And obviously, within industry you have many more resources in terms of compute and are able to do really amazing things. So there are pluses and minuses; it really depends on what you want to do. And right now it's a very interesting environment where academics move to industry, and folks from industry also move to academia, but not as much. And so these are very exciting times. >> It sounds like academic machine learning is great and corporate machine learning is great, and the most important thing is just jump in, right? Either one, just jump in. >> It really depends on your preferences, because you can do amazing research in either place. >> So you've mentioned unsupervised learning as one exciting frontier for research. Are there other areas that you consider exciting frontiers for research? >> Yeah, absolutely. I think that what I see now in the community, particularly in the deep learning community, is that there are a few trends.
15:17
One particular area I think is really exciting is the area of deep reinforcement learning.
15:24
Because we were able to figure out how we can train agents in virtual worlds. And this is something where, in just the last couple of years, you've seen a lot of progress: how can we scale these systems, how can we develop new algorithms, how can we get agents to communicate with each other. And I think that area, and in general settings where you're interacting with an environment, is super exciting. The other area that I think is really exciting as well is the area of reasoning and natural language understanding. So can we build dialogue-based systems? Can we build systems that can reason, that can read text and be able to answer questions intelligently? I think this is something that a lot of research is focusing on right now. And then there's another sub-area: being able to learn from few examples. So typically people think of it as one-shot learning or transfer learning, a setting where you learn something about the world, and then I throw a new task at you and you can solve this task very quickly, much like humans do, without requiring lots and lots of labeled examples. And so this is something that a lot of us in the community are trying to figure out: how we can do that and how we can come closer to human-like learning abilities. >> Thank you, Rus, for sharing all the comments and insights. It was interesting to hear the story of your early days doing this as well. >> [LAUGH] Thanks, Andrew, yeah.
17:07
Thanks for having me.