Recently, Professor Pedro Domingos, one of the best-known machine learning researchers in the world, wrote an article entitled "A Few Useful Things to Know about Machine Learning". In it, he covers the important points one would want to know about this domain, and it is a great read for anyone comfortable with the technical terminology.
In this article, however, I have tried to summarize a few of his insights in simpler language, and possibly with an added flavour of humour.
How do Machine Learning Algorithms work?
When we speak of ML, the term "learning model" appears often (you might want to remember it); a learning model is nothing but a combination of a generic machine learning algorithm and a learning component. Learning models are of two types: supervised and unsupervised. Here, I will talk about supervised learning models.
In supervised machine learning, the output can be predicted from a given set of inputs: for inputs (x1, x2, x3, ..., xn) the outputs would be (y1, y2, y3, ..., yn). The model is trained until it reaches the desired level of accuracy. Now, there can be many candidate models which perform the task above, and one has to choose the most accurate one. So, how does one choose between the models?
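As a tiny illustration (my own made-up sketch, not from Domingos's article), here is a supervised model in plain Python: the inputs and outputs are generated from a hypothetical rule y = 2x + 1 plus a little noise, and "training" means fitting a straight line by least squares until it recovers that rule:

```python
import random

# A hypothetical toy dataset: inputs x1..xn with outputs y1..yn,
# where the (unknown to the model) true rule is y = 2x + 1 plus noise.
random.seed(0)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + random.gauss(0, 0.1) for x in xs]

# "Train" a line y = a*x + b by ordinary least squares (closed form).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(a, b)  # close to the true slope 2 and intercept 1
```

Once fitted, the pair (a, b) is the learned model: feed it any new x and it predicts the corresponding y.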
Deciding on the model basically consists of three steps:
Collecting a set of candidate models to apply the algorithm on.
Finding a method to determine which model is good.
Minimizing the number of tests required to evaluate each model.
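The three steps above can be sketched in code. In this hypothetical Python example there are two candidate models (a constant predictor and a straight line), the scoring method is mean squared error on held-out validation data, and each model gets exactly one cheap test:

```python
import random

random.seed(1)
# Hypothetical data: y depends linearly on x, with some noise.
data = [(x / 10, 2 * (x / 10) + 1 + random.gauss(0, 0.2)) for x in range(60)]
random.shuffle(data)
train, valid = data[:40], data[40:]

def fit_constant(points):
    """Candidate 1: always predict the mean of the training outputs."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y

def fit_line(points):
    """Candidate 2: a least-squares line y = a*x + b."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) * (y - my) for x, y in points) / \
        sum((x - mx) ** 2 for x, _ in points)
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, points):
    """Step 2: the scoring method -- mean squared error on held-out data."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

# Step 1: the set of candidates; step 3: one cheap test per candidate.
candidates = {"constant": fit_constant(train), "line": fit_line(train)}
scores = {name: mse(m, valid) for name, m in candidates.items()}
best = min(scores, key=scores.get)
print(best)
```

Since the data really is linear, the line wins the validation test; the point is that the winner was chosen by a score, not by exhaustively trying every possible model.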
Let me explain this with the help of an analogy. It has been a tiring week and you want a nice restaurant to celebrate your Friday night. But, then again, there are so many restaurants around your neighbourhood to choose from! To be absolutely sure, you might have to try them all!
Now, it is not possible for you to visit each and every restaurant in order to judge its quality. How do you find out then? Well, there are various tricks, like asking friends who live nearby, or using Zomato for a quick review. There are plenty of ways to find a good restaurant without trying all of them, and with luck you will land in one of the best places.
In machine learning, the problem is even bigger, as the number of possible models is infinite, and choosing from an infinite set is definitely not easy. A good model would be one which makes proper predictions even on data outside the actual data set.
So, once you discover a good model, you can hope that the model also works with new data points which are not a part of the original data set. In this way, you can use a "good" model to predict the output from a given set of inputs.
Now that we have an idea of supervised learning models, let's talk about a few problems which hamper predictive models.
Overfitting
The efficiency of a model should be judged not only by how it performs on the training data but also by how it performs on unknown data; the latter is, after all, the whole point. Overfitting is a phenomenon which ruins this prediction. It occurs in overly complex models: such a model reacts strongly even to minor fluctuations in the training data.
To avoid this, it is always advised to test your classifier on data it has never seen during training. Even if it passes one test, split and test again. For example, given data on monthly expenditure, train your model on all months but one, and then test its efficiency on the month it has never seen.
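That month-by-month split can be sketched as follows (the expenditure figures are made up, and the "model" is deliberately trivial: it just predicts the average of the months it was trained on):

```python
import random

random.seed(2)
# Hypothetical monthly expenditure data (one figure per month).
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
spend = {m: 1000 + random.gauss(0, 50) for m in months}

# Leave one month out: train on eleven months, test on the twelfth,
# and rotate so every month gets a turn as the unseen data.
errors = []
for held_out in months:
    train = [spend[m] for m in months if m != held_out]
    prediction = sum(train) / len(train)  # the trivially simple "model"
    errors.append(abs(prediction - spend[held_out]))

avg_error = sum(errors) / len(errors)
print(round(avg_error, 1))
```

The average error over all twelve rotations is a far more honest estimate of how the model will do on a genuinely new month than any score computed on the training months themselves.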
Domingos approaches the problem of analyzing the errors in your model through a method called bias-variance decomposition. It frames the bias-variance trade-off, in which the two sources of error are balanced against each other, helping supervised learning algorithms make correct predictions on unknown data.
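The decomposition itself can be seen in a small simulation (my own sketch, not an example from Domingos's article): two estimators of an unknown mean, one unbiased and one deliberately shrunk toward zero. Over many simulated training sets, each estimator's mean squared error splits exactly into bias squared plus variance:

```python
import random

random.seed(3)
TRUE_MEAN, TRIALS, N = 5.0, 20000, 10

def decompose(estimator):
    """Run the estimator on many fresh training sets and split its error."""
    estimates = []
    for _ in range(TRIALS):
        sample = [random.gauss(TRUE_MEAN, 1.0) for _ in range(N)]
        estimates.append(estimator(sample))
    avg = sum(estimates) / TRIALS
    bias_sq = (avg - TRUE_MEAN) ** 2
    variance = sum((e - avg) ** 2 for e in estimates) / TRIALS
    mse = sum((e - TRUE_MEAN) ** 2 for e in estimates) / TRIALS
    return bias_sq, variance, mse

# Estimator 1: the plain sample mean (unbiased, higher variance).
# Estimator 2: the mean shrunk halfway to zero (biased, lower variance).
results = [decompose(est) for est in
           (lambda s: sum(s) / len(s), lambda s: 0.5 * sum(s) / len(s))]
for bias_sq, variance, mse in results:
    print(f"bias^2={bias_sq:.3f}  variance={variance:.3f}  mse={mse:.3f}")
```

The shrunken estimator trades bias for variance; which one has the lower total error depends on how the two terms balance, and that balancing act is exactly the trade-off.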
Curse of dimensionality
The curse of dimensionality is another problem, one that arises when organizing data in high-dimensional spaces and does not occur in low-dimensional ones. The scenario is this: as the dimension of the space increases, its volume increases so fast that the available data falls short.
Let us consider a problem: you have a model with a few input and output fields, and you are unhappy with it as it is not performing as expected. What would you do? The first thing that would come to your mind is to increase the number of input fields. Your logic being: the more input fields, the more the model can learn from. Are you sure your logic is correct?
It all depends. If a new input field is genuinely informative about the objective, it helps. But more often than not, the extra fields are either uninformative or carry information already present in other fields, and in that case the model will be worse than the original. This is because the more input fields there are, the more likely the model is to derive redundant relationships between them which do not serve the objective.
Moreover, the more input fields you add, the more training data you have to provide to make up for the extra space created by the additional inputs. So, adding irrelevant input fields is definitely not going to help.
Thus, the 'curse of dimensionality' can be mitigated by selecting only those inputs which are relevant to making the prediction.
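You can see the curse directly with a little simulation (again, my own made-up sketch): scatter random points in a unit cube and compare the farthest pair with the nearest pair. In low dimensions, some points are close neighbours and some are far apart; as the dimension grows, every point ends up roughly the same distance from every other, so the "nearby examples" a model would learn from stop existing:

```python
import math
import random

random.seed(4)

def distance_spread(dims, n_points=100):
    """Farthest pairwise distance divided by the nearest, for random
    points in a unit cube. A ratio near 1 means every point looks
    about equally far from every other point."""
    pts = [[random.random() for _ in range(dims)] for _ in range(n_points)]
    dists = [math.dist(p, q)
             for i, p in enumerate(pts) for q in pts[i + 1:]]
    return max(dists) / min(dists)

spreads = {d: distance_spread(d) for d in (2, 10, 100, 1000)}
for d, s in spreads.items():
    print(f"{d:>4} dimensions: spread {s:.1f}")
```

The spread shrinks steadily as the dimension grows, which is exactly why piling on input fields without more data makes "similar examples" meaningless.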
Let’s stop here today! In my next post, I will discuss other pitfalls to avoid while modelling data.
Inspired by the article by Charles Parker.