Machine Learning (Theory)

7/4/2008

More Presentation Preparation

Filed under: Machine Learning, ResearchJohn Langford @ 8:01 am

We’ve discussed presentation preparation before, but I have one more thing to add: transitioning. For a research presentation, it is substantially helpful for the audience if transitions are clear. A common outline for a research presentation in machine leanring is:

  1. The problem. Presentations which don’t describe the problem almost immediately lose people, because the context is missing to understand the detail.
  2. Prior relevant work. In many cases, a paper builds on some previous bit of work which must be understood in order to understand what the paper does. A common failure mode seems to be spending too much time on prior work. Discuss just the relevant aspects of prior work in the language of your work. Sometimes this is missing when unneeded.
  3. What we did. For theory papers in particular, it is often not possible to really cover the details. Prioritizing what you present can be very important.
  4. How it worked. Many papers in Machine Learning have some sort of experimental test of the algorithm. Sometimes this is missing when the work is theoretical.

What seems to often happen, is that there is no transitioning in the presentation. This can happen in one of two ways:

  1. Content Confusion. Sometimes the problem description is merged into (2), and (3). Sometimes (2) and (3) are merged. When this happens, it can be very difficult to follow. The solution is to rewrite to isolate the presentation components.
  2. Untransition. Sometimes the presentation does have a reasonable structure as above, but there are just no transitions in the delivery, creating apparent content confusion. This is easy to fix. An approach I often use is to just have an outline slide with the next subject highlighted between pieces of the transition. The delivery of the presentation can also handle this well. For example, have an extra long pause after stating the problem and check to see if the audience has questions.

7/2/2008

Proprietary Data in Academic Research?

Filed under: Machine Learning, ResearchJohn Langford @ 10:36 am

Should results of experiments on proprietary datasets be in the academic research literature?

The arguments I can imagine in the “against” column are:

  1. Experiments are not repeatable. Repeatability in experiments is essential to science because it allows others to compare new methods with old and discover which is better.
  2. It’s unfair. Academics who don’t have insider access to proprietary data are at a substantial disadvantage when competing with others who do.

I’m unsympathetic to argument (2). To me, it looks like their are simply some resource constraints, and these should not prevent research progress. For example, we wouldn’t prevent publishing about particle accelerator experiments by physicists at CERN because physicists at CMU couldn’t run their own experiments.

Argument (1) seems like a real issue.

The argument for is:

  1. Yes, they are another form of evidence that an algorithm is good. The degree to which they are evidence is less than for publicly repeatable experiments, but greater than nothing.
  2. What if research can only be done in a proprietary setting? It has to be good for society at large to know what works.
  3. Consider the game theory perspective. For example, suppose ICML decides to reject all papers with experiments on proprietary datasets. And suppose KDD decides to consider them as weak evidence. The long term result may be that beginning research on new topics which is only really doable in companies starts and then grows at KDD.

I consider the arguments for to be stronger than the arguments against, but I’m aware others have other beliefs. I think it would be good to have a policy statement from machine learning conferences in their call for papers, as trends suggest this becoming a more serious problem in the mid-term future.

6/30/2008

ICML has a comment system

Filed under: Conferences, Machine LearningJohn Langford @ 5:54 am

Mark Reid has stepped up and created a comment system for ICML papers which Greger Linden has tightly integrated.

My understanding is that Mark spent quite a bit of time on the details, and there are some cool features like working latex math mode. This is an excellent chance for the ICML community to experiment with making ICML year-round, so I hope it works out. Please do consider experimenting with it.

6/27/2008

Reviewing Horror Stories

Filed under: Conferences, Research, Reviewing John Langford @ 1:18 pm

Essentially everyone who writes research papers suffers rejections. They always sting immediately, but upon further reflection many of these rejections come to seem reasonable. Maybe the equations had too many typos or maybe the topic just isn’t as important as was originally thought. A few rejections do not come to seem acceptable, and these form the basis of reviewing horror stories, a great material for conversations. I’ve decided to share three of mine, now all safely a bit distant in the past.

  1. Prediction Theory for Classification Tutorial. This is a tutorial about tight sample complexity bounds for classification that I submitted to JMLR. The first decision I heard was a reject which appeared quite unjust to me—for example one of the reviewers appeared to claim that all the content was in standard statistics books. Upon further inquiry, several citations were given, none of which actually covered the content. Later, I was shocked to hear the paper was accepted. Apparently, the paper accidentally went to two different action editors, who each chose distinct reviewers.
  2. Cover Tree. This paper was the first one to give a datastructure for nearest neighbor search for an arbitrary metric which both (a) took logarithmic time under dimensionality constraint and (b) always required space competitive with brute force nearest neighbor search. Previous papers had done (a) or (b), but not both, and achieving both appears key to a practical algorithm, which we backed up with experiments and code.

    The cover tree paper suffered a triple rejection, the last one of which seems particularly poor to me. We submitted the draft to SODA, and got back 3 reviews. The first was blank. The second was a paragraph of positive but otherwise uninformative text. The third was blank. The decision was reject. We were rather confused, so we emailed the program chair asking if the decision was right and if so whether there was any more information we could get. We got back only a form letter providing no further information. Since then, the paper was accepted at ICML.

  3. Ranking Reduction. This paper shows that learning how to predict which of a pair of items is better strongly transfers to optimizing a ranking loss, in contrast to (for example) simply predicting a score and ordering according to predicted score.

    We submitted this paper to NIPS and it had the highest average review of any learning theory paper. The decision was to reject. Based upon what we could make out from a statement by the program committee, the logic of this decision is mostly kindly describable as badly flawed—somehow they confused the algorithm, the problem, and the analysis into a mess. Later it was accepted at COLT. (A bit of disclosure: I was on the program committee at NIPS that year, although obviously not involved in the decision on this paper.)

In all cases where a rejection occurs, the default presumption is that the correct decision was made because most of the time a good (or at least reasonable) decision was made. Consequently, it seems important to point out that there are some objective signs each of the above cases involved poor decisions.

  1. The tutorial paper is fairly widely cited (Google scholar places it 8th amongst my papers), and I continue to find it useful material for a lecture when teaching a class.
  2. The cover tree is also fairly widely cited, and I know from various emails and download counts that it is used by several people. It also won an award from IBM. To this day, it seems odd that an algorithms paper was only publishable at a machine learning conference.
  3. It’s really too soon to tell with the ranking paper, but it was one of the few COLT papers invited to a journal special issue, and there has since been substantial additional work by Mehryar Mohri and Nir Ailon which broadens the claim to other ranking metrics and makes it more computationally tractable.

One of the reasons you hear for why a paper was rejected and then accepted is that the paper improved in the meantime. That’s often true, but in each of the above cases I don’t believe there were any substantial changes between submissions (and for the tutorial it was a perfect accidental experiment).

Normally reviewing horror stories are the academic equivalent of warstories, but these ones have slightly more point. They have each informed my thinking about how reviewing should be done. Relating these stories might make this thinking a bit more understandable.

  1. Reviewer Choice. The tutorial case brings home the impact of how reviewers are chosen. If a paper is to have 3 reviews, it seems like a good idea to choose the reviewers in diverse ways, rather than one way. For example, at a conference, one reviewer by bidding preference, one reviewer by area chair, and one reviewer by another area chair or the program chair’s choice might reduce variance.
  2. Uniform Author feedback. The standard at NIPS was to have author feedback when the ranking paper was submitted. In effect, the standard was not followed for the ranking paper, and it’s easy to imagine this making a substantial difference given how badly flawed the basis of rejection was. It is also easy to imagine that author feedback might have made a difference in the tutorial rejection, as the reviewer was wrong (author feedback was not the standard then).
  3. Decision Basis. It’s helpful to relate the basis of decision by the program committee, especially when it is not summarized in the reviews. The cover tree case was one of the things which led me to add summaries to some of the NIPS papers when I was on the program committee, and I am committed to doing the same for SODA papers I’m reviewing this year. Not having a summary saves the program committee the embarassment of accidentally admitting mistakes, but it is badly disrepectful of the authors and generally promotes misunderstanding.
  4. Fast decisions are bad. It’s not possible to reliably make good decisions about technical matters quickly. I suspect that the time crunch of the NIPS program committee meeting was a contributing factor in the ranking paper case.

As anyone educated in machine learning or statistics understands, drawing 4 conclusions from 3 datapoints is problematic, so the above should be understood as suggestions subject to further evidence.

6/9/2008

The Minimum Sample Complexity of Importance Weighting

Filed under: Machine LearningJohn Langford @ 4:47 pm

This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time.

The basic trick has to do with importance weighting for monte carlo integration. Consider the problem of finding:
N = Ex ~ D f(x)
given samples from D and knowledge of f.

Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q. In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula:
Ex ~ Q f(x) D(x)/Q(x)

A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions.
Nevertheless, there is one special value Q(x) = f(x) D(x) / N where the sample complexity turns out to be 1, which is typically substantially better than the sample complexity of the original problem.

This observation underlies the motivation for voluntary importance weighting algorithms. Even under pretty terrible approximations, the logic of “Q(x) is something like f(x) D(x)” often yields substantial improvements over sampling directly from D(x).

5/25/2008

Inappropriate Mathematics for Machine Learning

Filed under: Machine Learning, TheoryJohn Langford @ 8:33 am

Reviewers and students are sometimes greatly concerned by the distinction between:

  1. An open set and a closed set.
  2. A Supremum and a Maximum.
  3. An event which happens with probability 1 and an event that always happens.

I don’t appreciate this distinction in machine learning & learning theory. All machine learning takes place (by definition) on a machine where every parameter has finite precision. Consequently, every set is closed, a maximal element always exists, and probability 1 events always happen.

The fundamental issue here is that substantial parts of mathematics don’t appear well-matched to computation in the physical world, because the mathematics has concerns which are unphysical. This mismatched mathematics makes irrelevant distinctions. We can ask “what mathematics is appropriate to computation?” Andrej has convinced me that a pretty good answer to this question is constructive mathematics.

So, here’s a basic challenge: Can anyone name a situation where any of the distinctions above (or similar distinctions) matter in machine learning?

5/23/2008

Three levels of addressing the Netflix Prize

Filed under: Competitions, ProblemsYehuda Koren @ 10:03 am

In October 2006, the online movie renter, Netflix, announced the Netflix Prize contest. They published a comprehensive dataset including more than 100 million movie ratings, which were performed by about 480,000 real customers on 17,770 movies.  Competitors in the challenge are required to estimate a few million ratings.  To win the “grand prize,” they need to deliver a 10% improvement in the prediction error compared with the results of Cinematch, Netflix’s proprietary recommender system. Best current results deliver 9.12% improvement, which is quite close to the 10% goal, yet painfully distant.

 The Netflix Prize breathed new life and excitement into recommender systems research. The competition allowed the wide research community to access a large scale, real life dataset. Beyond this, the competition changed the rules of the game. Claiming that your nice idea could outperform some mediocre algorithms on some toy dataset is no longer acceptable. Now researchers should face a new golden standard, and check how their seemingly elegant ideas are measured against best known results on an objective yardstick. I believe that this is a blessed change, which can help in shifting the focus to the few really useful ideas, rather than flooding us with a myriad of papers with questionable practical contributions. Well, days will tell…

 So where to start a truly meaningful research? What can really make a difference in perfecting a recommender system? I do not pretend have a real answer, but I will try to give some personal impressions. While working on the Netflix Prize, sifting through many ideas, implementing maybe a hundred different algorithms, we have come to recognize the few things that really matter. I will concentrate here on high level lessons that will hopefully help other practitioners in coming up with developments of a true practical value.

 I would like to characterize algorithms at three different levels. The first level answers the “what?” question - What do we want to model? Here we decide which features of the data to address. Do we want to model the numerical value of ratings, or maybe which movies people rate (regardless of rating value)? Do we want to address the date-dependent dynamics of users’ behavior? Some will want to model certain pieces of metadata associated with the movies, such as interactions with actors, directors, etc. Or, maybe, we would like to analyze the demographics of the users?

 The next level, the second one, answers the “which?” question - Which model are we going to pick? Will we model ratings through a neighborhood model or through a latent factor model? Within a neighborhood model, should we look at relationships between users, between movies, or maybe both? Within latent factor models we also have plenty of further choices - should we stick with the good old SVD, or move to fancier probabilistic models (e.g., pLSA, LDA)? Or maybe, we should jump to neural networks such as RBMs?

 Finally the last level answers the “how?” question - How are we going to implement the chosen model? Even after choosing a model, we have much flexibility in deciding how to optimize it. For example, nearest neighbor models can vary from quite simplistic correlation based models, to more sophisticated models that try to derive parameters directly from the data. Likewise, there are many ways to fit an SVD model, ranging from gradient descent and alternating least squares to deeper formulations such as EM, MAP, MCMC, Gibbs sampling and more.

 When designing an algorithm, one should go through the three levels, likely, but not necessarily, in the order I listed them. A major question is where most efforts should be invested? Which level has most influence on the quality of the outcome?

 My impression is that quite often most effort is allocated in the wrong direction. Most papers appear to concentrate on the third level, designing the best techniques for optimizing a single model or a particular cost function on which they are fixated. This is not very surprising, because the third level is the most technical one and offers the most flexibility. In particular, it allows researchers to express their prowess. Here, we can find papers with mathematical breakthroughs allowing squeezing some extra points from a model, getting us closer to the optimum, in a shorter time and with less overfitting. Well, no doubt, that’s wonderful… However, the practical value of these developments is quite limited, especially when using an ensemble of various models, where squeezing the best out of a single model is not really delivered to the bottom line.

 Concentrating efforts on the second level is more fruitful. Not all models are built equal for the task at hand. For example, user-based neighborhood models were found to be vastly inferior to item (movie) -based ones. Moreover, latent factor models were proven to be more accurate than the neighborhood ones (considering that you use the right latent factor model, which happens to be SVD). Most importantly, the design of a good ensemble blending complementing predictors should be mostly done at this level. It is very beneficial to blend SVD with a neighborhood technique and with an RBM. A simple mixture like this, involving quick and straightforward implementations, would probably vastly outperform some very well tuned and elaborated individual models. So this level is certainly important, receives quite a bit of attention at the literature, but not nearly as important as the first level.

 The first level, which decides the aspects of the data to be modeled, is where most pivotal choices are taken. Selecting the right features will make a huge impact on the quality of the results. For example, going beyond the numerical values of the ratings to analyzing which movies are chosen to be rated has a tremendous effect on prediction accuracy. On the other hand, modeling metadata associated with movies, such as identity of actors, or associated keywords, is not a prudent choice regarding the Netflix data. Similarly, modeling the date-dependent dynamics of users’ behavior is very useful. This first level receives less attention in the literature. Perhaps, because it is somewhat application dependant and harder to generalize. However, I can’t emphasize enough its importance.

 In practice, the borders between the three levels that I describe may be quite fuzzy. Moreover, these three levels can be sometimes strongly interlaced with each other, as at the end, a single implementation should fulfill all three levels. However, these days, whatever I think or hear about the Netflix data, I immediately try to relate to those three levels. The more it relates to the first level, the more interested I become, whereas I tend to almost completely ignore improvements related to the third level (well, that’s after exploring that level enough in the past). Just my 2 cents…

4/30/2008

Concerns about the Large Scale Learning Challenge

Filed under: CompetitionsJohn Langford @ 8:45 pm

The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured.

From the instructions page, several issues come up:

  1. Large Definition My personal definition of dataset size is:
    1. small A dataset is small if a human could look at the dataset and plausibly find a good solution.
    2. medium A dataset is mediumsize if it fits in the RAM of a reasonably priced computer.
    3. large A large dataset does not fit in the RAM of a reasonably priced computer.

    By this definition, all of the datasets are medium sized. This might sound like a pissing match over dataset size, but I believe it is more than that.

    The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require random access to examples.

  2. No Loading Time The medium scale nature of the datasets is tacitly acknowledged in the rules which exclude data loading time. My experience is that parsing and loading large datasets is often the computational bottleneck. For example when comparing Vowpal Wabbit to SGD I used wall-clock time which makes SGD look a factor of 40 or so worse than Leon’s numbers only using training time after loading. This timing difference is entirely due to the overhead of parsing, even though the format parsed is a carefully optimized binary language. (No ‘excluding loading time’ number can be found for VW, of course, because loading and learning are intertwined.)
  3. Optimal Parameter Time The rules specify that the algorithm should be timed with optimal parameters. It’s very common for learning algorithms to have a few parameters controlling learning rate or regularization. However, no constraints are placed on the number or meaning of these parameters. As an extreme form of abuse, for example, your initial classifier could be declared a parameter. With an appropriate choice of this initial parameter (which you can freely optimize on the data), training time is zero.
  4. Parallelism One approach to dealing with large amounts of data is to add computers that operate in parallel. This is very natural (the brain is vastly parallel at the neuron level), and there are substantial research questions in parallel machine learning. Nevertheless it doesn’t appear to be supported by the contest. There are good reasons for this: parallel architectures aren’t very standard yet, and buying multiple computers is still substantially more expensive than buying the RAM to fit the dataset sizes. Nevertheless, it’s disappointing to exclude such a natural avenue. The rules even appear unclear on whether or not the final test run is on an SMP machine.

As a consequence of this design, the contest prefers algorithms that load all data into memory then operate on it. It also essentially excludes parallel algorithms. These design decisions discourage large scale algorithms (where large is as defined above) in favor of medium scale learning algorithms. The design also favors highly parameterized learning algorithms over less parameterized algorithms, which is the opposite of my personal preference for research direction.

Many of these issues are eliminatable or at least partially addressable. Limiting the parameter size to ‘20 characters on the commandline’ or in some other reasonable way seems essential. It’s probably too late to get large datasets, but using wall-clock time would at least avoid bias against large scale algorithms. If the final evaluation is going to take place on an SMP machine, at least detailing that would be helpful.

Despite these concerns, it’s important to be clear that this is an interesting contest. Even without any rule changes, it’s outcome tells us something about which sorts of algorithms work at a medium scale. That’s good information to know if you are interested in tackling larger scale algorithms. The datasets are also large enough to break every Theta(m2) algorithm. We should also respect the organizers: setting up any contest of this sort is quite a bit of work that’s difficult to nail down perfectly in advance.

update: Soeren has helped setup an SMP parallel track which address some of the concerns above. See the site for details, and see you there.

4/27/2008

Watchword: Supervised Learning

Filed under: Definitions, SupervisedJohn Langford @ 7:40 pm

I recently discovered that supervised learning is a controversial term. The two definitions are:

  1. Known Loss Supervised learning corresponds to the situation where you have unlabeled examples plus knowledge of the loss of each possible predicted choice. This is the definition I’m familiar and comfortable with. One reason to prefer this definition is that the analysis of sample complexity for this class of learning problems are all pretty similar.
  2. Any kind of signal Supervised learning corresponds to the situation where you have unlabeled examples plus any source of side information about what the right choice is. This notion of supervised learning seems to subsume reinforcement learning, which makes me uncomfortable, because it means there are two words for the same class. This also means there isn’t a convenient word to describe the first definition.

Reviews suggest there are people who are dedicated to the second definition out there, so it can be important to discriminate which you mean.

4/26/2008

Eliminating the Birthday Paradox for Universal Features

Filed under: OnlineJohn Langford @ 11:45 am

I want to expand on this post which describes one of the core tricks for making Vowpal Wabbit fast and easy to use when learning from text.

The central trick is converting a word (or any other parseable quantity) into a number via a hash function. Kishore tells me this is a relatively old trick in NLP land, but it has some added advantages when doing online learning, because you can learn directly from the existing data without preprocessing the data to create features (destroying the online property) or using an expensive hashtable lookup (slowing things down).

A central concern for this approach is collisions, which create a loss of information. If you use m features in an index space of size n the birthday paradox suggests a collision if m > n0.5, essentially because there are m2 pairs. This is pretty bad, because it says that with a vocabulary of 105 features, you might need to have 1010 entries in your table.

It turns out that redundancy is great for dealing with collisions. Alex and I worked out a couple cases, the most extreme example of which is when you simply duplicate the base word and add a symbol before hashing, creating two entries in your weight array corresponding to the same word. We can ask: what is the probability(*) that there exists a word where both entries collide with an entry for some other word? Answer: about 4m3/n2. Plugging in numbers, we see that this implies perhaps only n=108 entries are required to avoid a collision. This number can be further reduced to 107 by increasing the degree of duplication to 4 or more.

The above is an analysis of explicit duplication. In a real world dataset with naturally redundant features, you can have the same effect implicitly, allowing for tolerance of a large number of collisions.

This argument is information theoretic, so it’s possible that rates of convergence to optimal predictors are slowed by collision, even if the optimal predictor is unchanged. To think about this possibility, analysis particular to specific learning algorithms is necessary. It turns out that many learning algorithms are inherently tolerant of a small fraction of collisions, including large margin algorithms.

(*) As in almost all hash function analysis, the randomization is over the choice of (random) hash function.

4/22/2008

Taking the next step

Filed under: Conferences, ResearchJohn Langford @ 9:16 pm

At the last ICML, Tom Dietterich asked me to look into systems for commenting on papers. I’ve been slow getting to this, but it’s relevant now.

The essential observation is that we now have many tools for online collaboration, but they are not yet much used in academic research. If we can find the right way to use them, then perhaps great things might happen, with extra kudos to the first conference that manages to really create an online community. Various conferences have been poking at this. For example, UAI has setup a wiki, COLT has started using Joomla, with some dynamic content, and AAAI has been setting up a “student blog“. Similarly, Dinoj Surendran setup a twiki for the Chicago Machine Learning Summer School, which was quite useful for coordinating events and other things.

I believe the most important thing is a willingness to experiment. A good place to start seems to be enhancing existing conference websites. For example, the ICML 2007 papers page is basically only useful via grep. A much more human-readable version of the page would organize the papers by topic. If the page wiki-editable, this would almost happen automatically. Adding the ability for people to comment on the papers might make the website more useful beyond the time of the conference itself.

There are several aspects of an experiment which seem intuitively important to me. I found the wikipatterns site a helpful distillation of many of these intuitions. Here are various concerns I have:

  1. Mandate An official mandate is a must-have. Any such enhancement needs to be an official part of the website, or the hesitation to participate will probably be too much.
  2. Permissive Comments Allowing anyone to comment on a website is somewhat scary to academics, because we are used to peer-reviewing papers before publishing. Nevertheless, it seems important to not strongly filter comments, because:
    1. The added (human) work of filtering is burdensome.
    2. The delay introduced acts as a barrier to participation.

    The policy I’ve followed on hunch.net is allowing comments from anyone exhibiting evidence of intelligence—i.e. filtering essentially only robots. This worked as well I hoped, and not as badly as I feared.

  3. Spam Spam is a serious issue for dynamic websites, because it adds substantially to the maintenance load. There are basically two tacks to take here:
    1. Issue a userid/passwd to every conference registrant (and maybe others that request it), the just allow comments from them.
    2. Allow comments from anyone, but use automated filters. I’ve been using Akismet, but recaptcha is also cool.

    I favor the second approach, because it’s more permissive, and it makes participation easier. However, it may increase the maintenance workload.

  4. Someone Someone to shepard the experiment is needed. I’m personally overloaded with other things at the moment (witness the slow post rate), so I don’t have significant time to devote. Nevertheless, I’m sure there are many people in the community with as good a familiarity with the internet and web applications as myself.
  5. Software Choice I don’t have strong preferences for the precise choice of software, but some guidelines seem good.
    1. Open Source I have a strong preference for open source solutions, of which there appear to be several reasonable choices. The reason is that open source applications leave you free (or at least freer) to switch and change things, which seems essential when experimenting.
    2. Large User base When going with an open source solution, something with a large user base is likely to have fewer rough edges.

    I have some preference for systems using flat files for datastorage rather than a database because they are easier to maintain or (if necessary) operate on. This is partly due to a bad experience I had with the twiki setup for MLSS—basically an attempt to transfer data to an upgraded mysql failed because of schema issues I failed to resolve.

    I’m sure there are many with more experience using wiki and comment systems—perhaps they can comment on exact software choices. Wikimatrix seems to provide frighteningly detailed comparisons of different wiki software.

4/21/2008

The Science 2.0 article

Filed under: ResearchJohn Langford @ 9:26 pm

I found the article about science using modern tools interesting, especially the part about ‘blogophobia’, which in my experience is often a substantial issue: many potential guest posters aren’t quite ready, because of the fear of a permanent public mistake, because it is particularly hard to write about the unknown (the essence of research), and because the system for public credit doesn’t yet really handle blog posts.

So far, science has been relatively resistant to discussing research on blogs. Some things need to change to get there. Public tolerance of the occasional mistake is essential, as is a willingness to cite (and credit) blogs as freely as papers.

I’ve often run into another reason for holding back myself: I don’t want to overtalk my own research. Nevertheless, I’m slowly changing to the opinion that I’m holding back too much: the real power of a blog in research is that it can be used to confer with many people, and that just makes research work better.

4/12/2008

Blog compromised

Filed under: MetaJohn Langford @ 10:40 am

Iain noticed that hunch.net had zero width divs hiding spammy URLs. Some investigation reveals that the wordpress version being used (2.0.3) had security flaws. I’ve upgraded to the latest, rotated passwords, and removed the spammy URLs. I don’t believe any content was lost. You can check your own and other sites for a similar problem by greping for “width:0″ or “width: 0″ in the delivered html source.

It Doesn’t Stop

Filed under: AI, ResearchJohn Langford @ 5:08 am

I’ve enjoyed the Terminator movies and show. Neglecting the whacky aspects (time travel and associated paradoxes), there is an enduring topic of discussion: how do people deal with intelligent machines (and vice versa)?

In Terminator-land, the primary method for dealing with intelligent machines is to prevent them from being made. This approach works pretty badly, because a new angle on building an intelligent machine keeps coming up. This is partly a ploy for writer’s to avoid writing themselves out of a job, but there is a fundamental truth to it as well: preventing progress in research is hard.

The United States, has been experimenting with trying to stop research on stem cells. It hasn’t worked very well—the net effect has been retarding research programs a bit, and exporting some research to other countries. Another less recent example was encryption technology, for which the United States generally did not encourage early public research and even discouraged as a munition. This slowed the development of encryption tools, but I now routinely use tools such as ssh and GPG.

Although the strategy of preventing research may be doomed, it does bring up a Bill Joy type of question: should we actively chose to do research in a field where knowledge can be used to great harm? As an example, the Terminator series illustrates the dark fears of AI gone bad. Many researchers avoid this question by not thinking about it, but this is a substantial question of concern to society at large, and whether or not society supports a direction of research.

My answer is “yes, we should do research”. The reason is simple: I believe that good AI is the best chance of the survival of civilization. This might seem like a leap, but considering the following.

  1. Civilization is not stable. Anyone who believes otherwise needs to try to smell the 1908. Just a lifetime ago, humans could barely fly and computers were people. These radical changes in the abilities of a civilization are strong evidence against stability. Further evidence of instabilities come from long term world changing trends such as greenhouse gas accumulation and population graphs.
  2. Instability is bad in the long run. There are quite a number of doomsday-for-civilization scenarios kicking around—nuclear, plague, grey goo, black holes, etc… Many people find doomsday scenarios triggered by malevolence or accident to be unconvincing, since doomsday claims are so commonly debunked (remember the Y2K computer bug armageddon?). I am naturally skeptical myself, but it only takes one. In the next 10000 years, the odds of something going wrong seem fair.
  3. … for a closed system. There is one really good caveat to instability, which is redundancy. Perhaps if we Earthlings screwup, our descendendents on Alpha Centauri can come pick up the pieces. The fundamental driver here is light speed latency: if it takes years for two groups to communicate, then it is unlikely they’ll manage to coordinate (with malevolence or accident) a simultaneous doomsday.
  4. But real space travel requires AI. Getting from one star system to another with known physics turns out to be very hard. The best approaches I know involve giant lasers and multiple solar sails or fusion powered rockets, taking many years. Merely getting there, of course, is not enough—we need to get there with a kernel of civilization, capable of growing anew in the new system. Any adjacent star system may not have an earth-like planet implying the need to support a space-based civilization. Since travel between star systems is so prohibitively difficult, a basic question is: how small can we make a kernel of civilization? Many science fiction writers and readers think of generation ships, which would necessarily be enormous to support the air, food, and water requirements of a self-sustaining human population. A much simpler and easier solution comes with AI. A good design might mass 103 kilograms or so and be designed to first land on an asteroid, then mine it, first creating a large solar cell array, and replicas to seed other asteroids. Eventually, hallowed out asteroids could support human life if the requisite materials (Oxygen, Carbon, Hydrogen, etc..) are found. The fundamental observation here is that intelligence and knowledge require very little mass.

I hope we manage to crack AI, opening the door to real space travel, so that civilization doesn’t stop.

3/23/2008

Interactive Machine Learning

A new direction of research seems to be arising in machine learning: Interactive Machine Learning. This isn’t a familiar term, although it does include some familiar subjects.

What is Interactive Machine Learning? The fundamental requirement is (a) learning algorithms which interact with the world and (b) learn.

For our purposes, let’s define learning as efficiently competing with a large set of possible predictors. Examples include:

  1. Online learning against an adversary (Avrim’s Notes). The interaction is almost trivial: the learning algorithm makes a prediction and then receives feedback. The learning is choosing based upon the advice of many experts.
  2. Active Learning. In active learning, the interaction is choosing which examples to label, and the learning is choosing from amongst a large set of hypotheses.
  3. Contextual Bandits. The interaction is choosing one of several actions and learning only the value of the chosen action (weaker than active learning feedback).

More forms of interaction will doubtless be noted and tackled as time progresses. I created a webpage for my own research on interactive learning which helps define the above subjects a bit more.

What isn’t Interactive Machine Learning?
There are several learning settings which fail either the interaction or the learning test.

  1. Supervised Learning doesn’t fit. The basic paradigm in supervised learning is that you ask experts to label examples, and then you learn a predictor based upon the predictions of these experts. This approach has essentially no interaction.
  2. Semisupervised Learning doesn’t fit. Semisupervised learning is almost the same as supervised learning, except that you also throw in many unlabeled examples.
  3. Bandit algorithms don’t fit. They have the interaction, but not much learning happens because the sample complexity results only allow you to choose from amongst a small set of strategies. (One exception is EXP4 (page 66), which can operate in the contextual bandit setting.)
  4. MDP learning doesn’t fit. The interaction is there, but the set of policies learned over is still too limited—essentially the policies just memorize what to do in each state.
  5. Reinforcement learning may or may not fit, depending on whether you think of it as MDP learning or in a much broader sense.

All of these not-quite-interactive-learning topics are of course very useful background information for interactive machine learning.

Why now? Because it’s time, of course.

  1. We know from other fields and various examples that interaction is very powerful.
    1. From online learning against an adversary, we know that independence of samples is unnecessary in an interactive setting—in fact you can even function against an adversary.
    2. From active learning, we know that interaction sometimes allows us to use exponentially fewer labeled samples than in supervised learning.
    3. From context bandits, we gain the ability to learn in settings where traditional supervised learning just doesn’t apply.
    4. From complexity theory we have “IP=PSPACE” roughly: interactive proofs are as powerful as polynomial space algorithms, which is a strong statement about the power of interaction.
  2. We know that this analysis is often tractable. For example, since Sanjoy’s post on Active Learning, much progress has been made. Several other variations of interactive settings have been proposed and analyzed. The older online learning against an adversary work is essentially completely worked out for the simpler cases (except for computational issues).
  3. Real world problems are driving it. One of my favorite problems at the moment is the ad display problem—How do you learn which ad is most likely to be of interest? The contextual bandit problem is a big piece of this problem.
  4. It’s more fun. Interactive learning is essentially a wide-open area of research. There are plenty of kinds of natural interaction which haven’t been formalized or analyzed. This is great for beginnners, because it means the problems are simple, and their solution does not require huge prerequisites.
  5. It’s a step closer to AI. Many people doing machine learning want to reach AI, and it seems clear that any AI must engage in interactive learning. Mastering this problem is a next step.

Basic Questions

  1. For natural interaction form [insert yours here], how do you learn? Some of the techniques for other methods of interactive learning may be helpful.
  2. How do we blend interactive and noninteractive learning? In many applications, there is already a pool of supervised examples around.
  3. Are there general methods for reducing interactive learning problems to supervised learning problems (which we know better)?

3/15/2008

COLT Open Problems

COLT has a call for open problems due March 21. I encourage anyone with a specifiable open problem to write it down and send it in. Just the effort of specifying an open problem precisely and concisely has been very helpful for my own solutions, and there is a substantial chance others will solve it. To increase the chance someone will take it up, you can even put a bounty on the solution. (Perhaps I should raise the $500 bounty on the K-fold cross-validation problem as it hasn’t yet been solved).

3/7/2008

Spock Challenge Winners

Filed under: Competitions, Machine LearningJohn Langford @ 9:19 pm

The spock challenge for named entity recognition was won by Berno Stein, Sven Eissen, Tino Rub, Hagen Tonnies, Christof Braeutigam, and Martin Potthast.

2/27/2008

The Stats Handicap

Filed under: Funding, Machine Learning, StatisticsJohn Langford @ 1:35 pm

Graduating students in Statistics appear to be at a substantial handicap compared to graduating students in Machine Learning, despite being in substantially overlapping subjects.

The problem seems to be cultural. Statistics comes from a mathematics background which emphasizes large publications slowly published under review at journals. Machine Learning comes from a Computer Science background which emphasizes quick publishing at reviewed conferences. This has a number of implications:

  1. Graduating statistics PhDs often have 0-2 publications while graduating machine learning PhDs might have 5-15.
  2. Graduating ML students have had a chance for others to build on their work. Stats students have had no such chance.
  3. Graduating ML students have attended a number of conferences and presented their work, giving them a chance to meet people. Stats students have had fewer chances of this sort.

In short, Stats students have had relatively few chances to distinguish themselves and are heavily reliant on their advisors for jobs afterwards. This is a poor situation, because advisors have a strong incentive to place students well, implying that recommendation letters must always be considered with a grain of salt.

This problem is more or less prevalent depending on which Stats department students go to. In some places the difference is substantial, and in other places not.

One practical implication of this, is that when considering graduating stats PhDs for hire, some amount of affirmative action is in order. At a minimum, this implies spending extra time getting to know the candidate and what the candidate can do is in order.

2/17/2008

The Meaning of Confidence

Filed under: Definitions, Machine LearningJohn Langford @ 10:36 am

In many machine learning papers experiments are done and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused.

  1. Confidence = Probability. For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much more information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5, but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain.
  2. Classical Confidence Intervals. These are common in learning theory. The essential idea is that world has some true-but-hidden value, such as the error rate of a classifier. Given observations from the world (such as err-or-not on examples), an interval is constructed around the hidden value. The semantics of the classical confidence interval is: the (random) interval contains the (determistic but unknown) value, with high probability. Classical confidence intervals (as applied in machine learning) typically require that observations are independent. They have some drawbacks discussed previously. One drawback of concern is that classical confidence intervals breakdown rapidly when conditioning on information.
  3. Bayesian Confidence Intervals. These are common in several machine learning applications. If you have a prior distribution over the way the world creates observations, then you can use Bayes law to construct a posterior distribution over the way the world creates observations. With respect to this posterior distribution, you construct an interval containing the truth with high probability. The semantics of a Bayesian confidence interval is “If the world is drawn from the prior the interval contains the truth with high probability”. No assumption of independent samples is required. Unlike classical confidence intervals, it’s easy to have a statement conditioned on features. For example, “the probability of disease given the observations is in [0.8,1]”. My principal source of uneasiness with respect to Bayesian confidence intervals is the “If the world is drawn from the prior” clause—I believe it is difficult to know and specify a correct prior distribution. Many Bayesians aren’t bothered by this, but the meaning of a Bayesian confidence interval becomes unclear if you work with an incorrect (or subjective) prior.
  4. Asymptotic Intervals. This is also common in applied machine learning, which I strongly dislike. The basic line of reasoning seems to be: “Someone once told me that if observations are IID, then their average converges to a normal distribution, so let’s use an unbiased estimate of the mean and variance, assume convergence, and then construct a confidence interval for the mean of a gaussian”. Asymptotic intervals are asymptotically equivalent to classical confidence intervals, but they can differ spectacularly with finite sample sizes. The simplest example of this is when a classifier has zero error rate on a test set. A classical confidence interval for the error rate is [0,log(1/d)/n] where n is the size of the test set and d is the probability that the interval contains the truth. For asymptotic intervals you get [0,0] which is bogus in all applications I’ve encountered.
  5. Internal Confidence Intervals. This is not used much, except in agnostic active learning analysis. The essential idea, is that we cease to make intervals about the world, and instead make intervals around our predictions of the world. The real world might assign label 0 or label 1 given a particular context x, and we could only discover the world’s truth by actually observing x,y labeled examples. Yet, it turns out to sometimes be easy to infer “our learning algorithm will definitely predict label 1 given features x“. This allowed dependence on x means we can efficiently guide exploration. A basic question is: can this notion of internal confidence guide other forms of exploration?
  6. Gamesman intervals. Vovk and Shafer have been working on new foundations of probability, where everything is stated in terms of games. In this setting, a confidence interval is (roughly) a set of predictions output by an adaptive rule with the property that it contains the true observation a large fraction of the time. This approach has yet to catch on, but it is interesting because it provides a feature dependent confidence interval without making strong assumptions about the world.

2/10/2008

Complexity Illness

Filed under: Conferences, Machine Learning, ResearchJohn Langford @ 6:34 pm

One of the enduring stereotypes of academia is that people spend a great deal of intelligence, time, and effort finding complexity rather than simplicity. This is at least anecdotally true in my experience.

  1. Math++ Several people have found that adding useless math makes their paper more publishable as evidenced by a reject-add-accept sequence.
  2. 8 page minimum Who submitted a paper to ICML violating the 8 page minimum? Every author fears that the reviewers won’t take their work seriously unless the allowed length is fully used. The best minimum violation I know is Adam’s paper at SODA on generating random factored numbers, but this is deeply exceptional. It’s a fair bet that 90% of papers submitted are exactly at the page limit. We could imagine that this is because papers naturally take more space, but few people seem to be clamoring for more space.
  3. Journalong Has anyone been asked to review a 100 page journal paper? I have. Journal papers can be nice, because they give an author the opportunity to write without sharp deadlines or page limit constraints, but this can and does go awry.

Complexity illness is a burden on the community. It means authors spend more time filling out papers, reviewers spend more time reviewing, and (most importantly) effort is misplaced on complex solutions over simple solutions, ultimately slowing (sometimes crippling) the long term impact of an academic community.

It’s difficult to imagine an author-driven solution to complexity illness, because the incentives are simply wrong. Reviewing based on solution value rather than complexity is a good way for individual people to reduce the problem. More generally, it would be great to have a system which explicitly encourages research without excessive complexity. The best example of this seems to be education—it’s the great decomplexifier. The process of teaching something greatly encourages teaching the simple solution, because that is what can be understood. This seems to be true both of traditional education and less conventional means such as wikipedia articles. I’m not sure exactly how to use this observation—Is there some way we can shift conference formats towards the process of creating teachable material?

Next Page »