What is Machine Learning?

HH Tu

When a programmer wants to solve a word-parsing problem, they write a program for it. The program reads a file, runs hand-written instructions to parse it, collects the useful information, and outputs the result. This approach is simple, but unfortunately it cannot solve every problem in the world. A human can easily tell whether an e-mail is spam or ham (legitimate mail), yet it is hard to write down an algorithm that does the same.

Spam comes in many forms, which makes it hard to identify. Even the human brain cannot remember or recognize every possible case. It would be great if computers could collect data for us, automatically extract the useful information, rank the results the way we want, and even learn by themselves to give us predictions! The point is that we do not have a direct algorithm, but we do have data.

Assume we have thousands of clients around the world and receive tens of millions of e-mails every day. To decide whether an e-mail is spam or ham, we could look at previous mails and make an approximate prediction. Doing this by hand with traditional statistical analysis would cost a lot of time and money. Furthermore, spam behavior changes over time, and the kinds of mail differ from one part of the world to another. If we just follow a fixed set of rules, they will fail some day. From another perspective, if we already knew that an e-mail was sent by a spammer to broadcast advertisements, or by a general manager to issue an order, we could handle it easily: a few lines of code would filter it out or let it through. The hard part is getting that information in the first place.
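To make the "fixed rule" idea concrete, here is a minimal sketch of a hand-written filter. The sender address, the keyword list, and the function name `is_spam_by_rule` are all invented for illustration; they are assumptions, not part of any real mail system.

```python
# Hypothetical hand-written rules; every address and keyword here is made up.
BLOCKED_SENDERS = {"promo@ads.example"}
SPAM_KEYWORDS = ("free money", "act now")


def is_spam_by_rule(sender: str, body: str) -> bool:
    """Return True when one of the fixed, hand-written rules matches."""
    if sender.lower() in BLOCKED_SENDERS:
        return True
    text = body.lower()
    return any(keyword in text for keyword in SPAM_KEYWORDS)


# A known spammer is caught, and a normal mail passes.
print(is_spam_by_rule("promo@ads.example", "Hello"))
print(is_spam_by_rule("boss@corp.example", "Please send the report"))
# A new sender with slightly altered wording matches neither rule.
print(is_spam_by_rule("new@ads.example", "FR3E M0NEY inside"))
```

The last call is the interesting one: as soon as the spammer changes the address and the spelling, the rule no longer fires, and somebody has to notice and write a new rule.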

We still have hope! Spam is not totally random. If we collect and collate enough information, and make reasonable assumptions when analysing it, we can expect to find a prediction that closely approximates the real answer. A prediction that is always exactly right is not possible (unless you are God), but we can rely on computers to automatically collect data from existing e-mail sources and output useful information. This procedure is the value of machine learning.

Machine learning is currently applied to network traffic identification, credit scoring for bank lending, stock prediction, clinical data analysis, the study of biological nervous systems, and even space programs. A well-known case is a computer playing chess against a human! With all of these examples, we can say that machine learning is already part of our lives. The question is, how do we let the computer learn? We start from the way people think.

How do people identify an e-mail as spam or ham? Can you say why? You might answer that it is spam because it does not look like normal mail: spam has weird contents, an unfamiliar sender, and so on. None of these are precise criteria. Another example is face recognition. Can you explain how you recognize your father's or your mother's face? It is because you have seen them since birth. They are not strangers, and you see them every day, so they are familiar to you, but this is still not a precise rule. In both examples we are actually building up a set of characteristics: a person's outline and the first impression of an e-mail; the symmetry of a face and the words in an e-mail; a person has eyes, a nose, and a mouth, and an e-mail has a recipient, a sender, and attachments. We do this all the time in our lives, but we rely on so many rules at once that it is not easy to write them down as an algorithm. This is where machine learning comes in: it can collect data, analyse the attributes, and find out which attributes are useful for our purpose (e.g. mail prediction).
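The idea of turning an e-mail into attributes can be sketched as follows. This is only an illustration: the `Email` class, the `KNOWN_SENDERS` address book, and the four chosen attributes are assumptions made up for this example, and a real system would use many more.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Email:
    """Hypothetical e-mail record with the parts mentioned in the text."""
    sender: str
    recipient: str
    body: str
    attachments: List[str] = field(default_factory=list)


# Made-up address book standing in for "senders we are familiar with".
KNOWN_SENDERS = {"mom@family.example"}


def extract_features(mail: Email) -> Dict[str, float]:
    """Turn one e-mail into a few numeric attributes a learner could use."""
    words = mail.body.lower().split()
    return {
        "sender_known": 1.0 if mail.sender.lower() in KNOWN_SENDERS else 0.0,
        "num_attachments": float(len(mail.attachments)),
        "num_words": float(len(words)),
        "exclamation_marks": float(mail.body.count("!")),
    }


mail = Email("stranger@x.example", "me@y.example", "WIN big now!!!", ["prize.exe"])
print(extract_features(mail))
```

None of these attributes decides anything alone. The learning step is what works out which of them actually help to separate spam from ham.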

Modern machine learning relies heavily on statistics and calculus, because we have to find correlations in the data and use optimization to reach the goal. Machine learning can be divided into two parts. The first part is learning: we go through a large amount of data and optimize a model until it represents that data well. The second part is prediction: we feed future input to the learned model, and it estimates the result and gives us a useful prediction. In reality, continuous learning is another important issue. Things change over time, so the model has to keep adapting as new data arrives.
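The two parts, plus the adaptive step, can be sketched with a very small word-count model. This is a simplified naive Bayes classifier with add-one smoothing, chosen here only because it is short; the class name `TinySpamModel`, its methods, and the training sentences are all invented for this example.

```python
import math
from collections import Counter


class TinySpamModel:
    """Word-count naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self) -> None:
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def learn(self, text: str, label: str) -> None:
        """Learning part: update the counts from one labelled e-mail."""
        self.mail_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def predict(self, text: str) -> str:
        """Prediction part: return the label with the higher log score."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_mails = sum(self.mail_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Smoothed log prior, so an unseen class never produces log(0).
            score = math.log((self.mail_counts[label] + 1) / (total_mails + 2))
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # The extra +1 in the denominator reserves room for unseen words.
                score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
            scores[label] = score
        return max(scores, key=scores.get)


model = TinySpamModel()
model.learn("win free money now", "spam")
model.learn("free prize claim now", "spam")
model.learn("meeting agenda for monday", "ham")
model.learn("please review the monday report", "ham")

print(model.predict("free money"))
print(model.predict("monday meeting"))

# Adaptive learning: spam wording changes, so we keep feeding new examples.
before = model.predict("review crypto offer")
model.learn("crypto offer inside", "spam")
model.learn("crypto offer today", "spam")
after = model.predict("review crypto offer")
print(before, after)
```

Because `learn` only updates counts, the same method serves both the first training pass and later updates. That is a convenience of this particular model, not something every learning method offers.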