Bayesian statistics has influenced the way we use the internet, and it will soon influence how we navigate online. Currently everything from spam filters to spelling correction to video recommendations is based on some form of Bayesian probability. In the past, humans did the editing. Now it is done automatically using statistics.
The use of Bayesian mathematics for online applications follows a common progression that parallels the way Bayesian math is taught in colleges. Bayesian formulas are not intuitive: often, after learning one for the first time, you quickly forget it. I, like many other students, had an Aha! moment where all of a sudden I understood the power of this tool and started looking at the world in a different light. Paul Graham, in his article “A Plan for Spam”, describes it perfectly: “I spent about six months writing software that looked for individual spam features before I tried the statistical approach. What I found was that recognizing that last few percent of spams got very hard, and that as I made the filters stricter I got more false positives.” Although computers are not able to classify data as well as humans (for now at least), they do see relationships among data which we miss. Not only that, but computers are able to see many more relationships than we ourselves can. Many spam filters today build on the approach in Paul Graham's essay.
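To make the statistical approach concrete, here is a minimal sketch of a word-based Bayesian spam filter. It is a plain naive Bayes classifier with add-one smoothing, not Paul Graham's exact scoring scheme, and the function names, the whitespace tokenizer, and the four training messages are all made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def train(messages):
    """messages: list of (text, is_spam) pairs. Returns per-class word and message counts."""
    word_counts = {True: Counter(), False: Counter()}
    msg_counts = {True: 0, False: 0}
    for text, is_spam in messages:
        msg_counts[is_spam] += 1
        word_counts[is_spam].update(tokenize(text))
    return word_counts, msg_counts


def spam_probability(text, word_counts, msg_counts):
    """P(spam | words) via Bayes' rule, assuming words are independent given the class.

    Assumes the training data contained at least one spam and one non-spam message.
    """
    vocab = set(word_counts[True]) | set(word_counts[False])
    total_msgs = msg_counts[True] + msg_counts[False]
    log_score = {}
    for label in (True, False):
        # Prior: fraction of training messages with this label.
        score = math.log(msg_counts[label] / total_msgs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so an unseen word does not zero out the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab) + 1))
        log_score[label] = score
    # Turn the two log scores into a probability without underflow.
    top = max(log_score.values())
    spam = math.exp(log_score[True] - top)
    ham = math.exp(log_score[False] - top)
    return spam / (spam + ham)


# Hypothetical toy training set, purely for illustration.
TRAINING = [
    ("win cash now", True),
    ("cheap pills win prize", True),
    ("meeting notes for monday", False),
    ("lunch on monday", False),
]

word_counts, msg_counts = train(TRAINING)
print(spam_probability("win cash prize", word_counts, msg_counts))
print(spam_probability("monday meeting notes", word_counts, msg_counts))
```

Nobody writes a rule saying "win" is a spam word. The filter picks that up from the counts, which is why it keeps working on the last few percent of spam where hand-written rules start producing false positives.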
Bayesian mathematics has already changed the way we browse the internet and look for information online. Before Google came along, the way to find what you were looking for was through a directory. I just went to yahoo.com and couldn't find the directory anywhere on the homepage; like point-based spam systems, it has proven obsolete compared to statistics. In the past, everything had to be classified by people into directories and subdirectories. This is very similar to the point approach, where each rule says something like "if the website is a travel website, then it goes in the travel directory." As you create more and more rules or subdirectories, you find that you get a lot more false positives.
Statistics takes a completely different approach by looking through the entire website and matching it with your search term. So far, it has proven to be a much more effective way of locating information online.
By using statistics we are able to start personalizing the user experience and to classify things automatically. There are lots of machine learning systems out there. However, the jump from a good system to a great system is a lot smaller than the jump from no system to a good system. It is the same in the field of finance. Initially, with the use of systems such as an option pricing formula, people were able to see in a market which had been blind before. Although we all know the limitations of this simple formula, it is still pretty good. Currently we are seeing trading desks come up with more and more accurate models (and price discrepancies get smaller and smaller), but none of those systems has made as much of a difference as the initial one. I think we will see the same in search engines: if someone creates search results that are slightly better than Google's, it will not be enough. In fact, there are now companies (such as ask.com) which already claim to have done just that, but the payoff for getting the results perfect is not as big as in finance, because people can just look at the next result. One area where we should see increasingly better algorithms is spam filtering. There are already techniques (such as spam bombing, etc.) to get past Bayesian filters. The future of this area will certainly be very interesting.
A Bayesian approach can go two ways. Up until now we have seen a lot of the first one: the ability to classify results. The next change that we will see with machine learning is the ability to classify people. Techniques such as AJAX made websites easier to surf (when used correctly). Machine learning techniques have the ability to present us with the information that we want to see. Currently, most websites follow a directory-based approach: first you pick your vertical. For example, I choose world or business news, or I add my RSS feeds to my favorite reader, and then I browse through them one by one. This should be replaced (just like the Yahoo directories) with a one-page solution. There is no reason you should not be able to go to your homepage and see exactly what you want, with the website learning what you're interested in as you surf.
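As a rough sketch of what "the website learning as you surf" could look like, the snippet below keeps a click and skip count per topic and ranks topics by the estimated chance that the reader clicks. It uses a simple Beta-Bernoulli estimate. The class name, the topic labels, and the uniform prior are all assumptions made for illustration, not a description of how any real site works.

```python
from collections import defaultdict


class InterestModel:
    """Tracks, per topic, how often a reader clicked items that were shown."""

    def __init__(self, prior_clicks=1.0, prior_skips=1.0):
        # Beta(1, 1) prior: before any data, every topic gets a 50% click estimate.
        self.prior_clicks = prior_clicks
        self.prior_skips = prior_skips
        self.clicks = defaultdict(int)
        self.skips = defaultdict(int)

    def record(self, topic, clicked):
        """Update the counts after the reader saw an item on this topic."""
        if clicked:
            self.clicks[topic] += 1
        else:
            self.skips[topic] += 1

    def interest(self, topic):
        """Posterior mean probability that the reader clicks an item on this topic."""
        a = self.prior_clicks + self.clicks[topic]
        b = self.prior_skips + self.skips[topic]
        return a / (a + b)

    def rank(self, topics):
        """Order topics from most to least interesting for this reader."""
        return sorted(topics, key=self.interest, reverse=True)


# Hypothetical browsing history: three business clicks, three skipped world stories.
model = InterestModel()
for _ in range(3):
    model.record("business", clicked=True)
    model.record("world", clicked=False)

print(model.rank(["world", "sports", "business"]))
```

A topic the reader has never seen sits in the middle of the ranking until there is evidence either way, so the homepage can reorder itself without the reader ever picking a vertical.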
Often the simpler the system is, the better it performs. The first jump from no system to machine learning is often the biggest because it changes the way you behave; everything after that is just there to squeeze out accuracy and performance. There are many techniques that I am currently using for work, as well as learning: specifically, neural networks (a whole other beast) and, more recently, Bayesian networks. A word of caution though: do not expect machine learning to be perfect. “I think it’s possible to stop spam, and that content-based filters are the way to do it,” Paul Graham said. We still have not stopped spam, and I don’t think we will do so any time soon.
-Greg
