Machine Learning (Theory)

8/24/2010

Alex Smola starts a blog

Tags: Announcements, Machine Learning jl@ 5:44 pm

Adventures in Data Land.

8/23/2010

Boosted Decision Trees for Deep Learning

Tags: Deep, Machine Learning, Supervised jl@ 11:18 am

About 4 years ago, I speculated that decision trees qualify as a deep learning algorithm because they can make decisions which are substantially nonlinear in the input representation. Ping Li has proved this correct, empirically at UAI by showing that boosted decision trees can beat deep belief networks on versions of Mnist which are artificially hardened so as to make them solvable only by deep learning algorithms.

This is an important point, because the ability to solve these sorts of problems is probably the best objective definition of a deep learning algorithm we have. I’m not that surprised. In my experience, if you can accept the computational drawbacks of a boosted decision tree, they can achieve pretty good performance.

Geoff Hinton once told me that the great thing about deep belief networks is that they work. I understand that Ping had very substantial difficulty in getting this published, so I hope some reviewers step up to the standard of valuing what works.

8/22/2010

KDD 2010

Tags: Conferences, Machine Learning jl@ 6:39 pm

There were several papers that seemed fairly interesting at KDD this year. The ones that caught my attention are:

  1. Xin Jin, Mingyang Zhang, Nan Zhang, and Gautam Das, Versatile Publishing For Privacy Preservation. This paper provides a conservative method for safely determining which data is publishable from any complete source of information (for example, a hospital) such that it does not violate privacy rules in a natural language. It is not differentially private, so no external sources of join information can exist. However, it is a mechanism for publishing data rather than (say) the output of a learning algorithm.
  2. Arik Friedman Assaf Schuster, Data Mining with Differential Privacy. This paper shows how to create effective differentially private decision trees. Progress in differentially private datamining is pretty impressive, as it was defined in 2006.
  3. David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, Diane Lambert, Evaluating Online Ad Campaigns in a Pipeline: Causal Models At Scale This paper is about automated estimation of ad campaign effectiveness. The double robust estimation technique seems intuitively appealing and plausibly greatly enhances effectiveness.
  4. Naoki Abe et al. Optimizing Debt Collections Using Constrained Reinforcement Learning This is an application paper about optimizing the New York State income tax collection agency. As you might expect, there are several cludgy aspects due to working within legal and organizational constraints. They deal with them, and expect to end up making NY state around $108/year. Too bad I live in NY :)
  5. Vikas C Raykar, Balaji Krishnapuram, and Shinpeng Yu Designing Efficient Cascaded Classifiers: Tradeoff between Accuracy and Cost This paper is about a continuization based solution to designing a cost-efficient yet accurate classifier cascade. It’s a step beyond the Viola Jones style boosting with cutouts, but I suspect not yet a final solution.
  6. D. Sculley, Combined Regression and Ranking. There are lots of applications where you want both a correct ordering and an estimated value of each item. This paper shows a simple combined-loss approach to getting both which empirically improves on either metric.

In addition, I enjoyed Konrad Feldman’s invited talk on Quantcast’s data and learning systems which sounded pretty slick.

In general, it seems like KDD is substantially maturing as a conference. The work on empirically effective privacy-preserving algorithms and some of the stats-work is ahead of what I’ve seen at other machine learning conferences. Presumably this is due to KDD being closer to the business side of machine learning and hence more aware of what are real problems there. An annoying aspect of KDD as a publishing venue is that they don’t put the papers on the conference website, due to ACM constraints. A substantial compensation is that all talks are scheduled to appear on videolectures.net and, as you can see, most papers can be found on author webpages.

KDD also experimented with crowdvine again this year so people could announce which talks they were interested in and setup meetings. My impression was that it worked a bit less well than last year, partly because it wasn’t pushed as much by the conference organizers. Small changes in the interface might make a big difference—for example, just providing a ranking of papers by interest might make it pretty compelling.

8/21/2010

Rob Schapire at NYC ML Meetup

Tags: Announcements, Machine Learning jl@ 8:10 pm

I’ve been wanting to attend the NYC ML Meetup for some time and hope to make it next week on the 25th. Rob Schapire is talking about “Playing Repeated Games”, which in my experience is far more relevant to machine learning than the title might indicate.

8/20/2010

The Workshop on Cores, Clusters, and Clouds

Tags: Announcements, Workshop jl@ 8:47 am

Alekh, John, Ofer, and I are organizing a workshop at NIPS this year on learning in parallel and distributed environments. The general interest level in parallel learning seems to be growing rapidly, so I expect quite a bit of attendance. Please join us if you are parallel-interested.

And, if you are working in the area of parallel learning, please consider submitting an abstract due Oct. 17 for presentation at the workshop.

7/18/2010

ICML & COLT 2010

The papers which interested me most at ICML and COLT 2010 were:

  1. Thomas Walsh, Kaushik Subramanian, Michael Littman and Carlos Diuk Generalizing Apprenticeship Learning across Hypothesis Classes. This paper formalizes and provides algorithms with guarantees for mixed-mode apprenticeship and traditional reinforcement learning algorithms, allowing RL algorithms that perform better than for either setting alone.
  2. István Szita and Csaba Szepesvári Model-based reinforcement learning with nearly tight exploration complexity bounds. This paper and anotherrepresent the frontier of best-known algorithm for Reinforcement Learning in a Markov Decision Process.
  3. James Martens Deep learning via Hessian-free optimization. About a new not-quite-online second order gradient algorithm for learning deep functional structures. Potentially this is very powerful because while people have often talked about end-to-end learning, it has rarely worked in practice.
  4. Chrisoph Sawade, Niels Landwehr, Steffen Bickel. and Tobias Scheffer Active Risk Estimation. When a test set is not known in advance, the model can be used to safely aid test set evaluation using importance weighting techniques. Relative to the paper, placing a lower bound on p(y|x) is probably important in practice.
  5. H. Brendan McMahan and Matthew Streeter Adaptive Bound Optimization for Online Convex Optimization and the almost-same paper John Duchi, Elad Hazan, and Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. These papers provide tractable online algorithms with regret guarantees over a family of metrics rather than just euclidean metrics. They look pretty useful in practice.
  6. Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella, Active Learning on Trees and Graphs Various subsets of these authors have other papers about actively learning graph-obeying functions which in total provide a good basis for understanding what’s possible and how to learn.

The program chairs for ICML did a wide-ranging survey over participants. The results seem to suggest that participants generally agree with the current ICML process. I expect there is some amount of anchoring effect going on where participants have an apparent preference for the known status quo, although it’s difficult to judge the degree of that. Some survey results which aren’t of that sort are:

  1. 7.7% of reviewers say author feedback changed their mind. It would be interesting to know for which fraction of accepted papers reviewers had their mind changed, but that isn’t there.
  2. 85.4% of authors don’t know if the reviewers read their response, believe they read and ignored it, or believe they didn’t read it. Authors clearly don’t feel like they are communicating with reviewers.
  3. 58.6% support growing the conference with the largest fraction suggesting poster-only papers.
  4. Other conferences attended by the ICML community in order are NIPS, ECML/PKDD, AAAI, IJCAI, AIStats, UAI, KDD, ICDM, COLT, SIGIR, ECAI, EMNLP, CoNLL. This is pretty different from the standard colocation list for ICML. Many possibilities are precluded by scheduling, but AAAI, IJCAI, UAI, KDD, COLT, SIGIR are all serious possibilities some of which haven’t been used much in the past.

My experience with Mark’s new paper discussion site is generally positive—having comments emailed to interested parties really helps the discussion. There are a few comments that authors haven’t responded to, so if you are an author you might want to sign up to receive comments.

In addition, I was the workshop chair for ICML&COLT this year. My overall impression was that things went reasonably well, with the exception of internet connectivity at Dan Panorama which was a minidisaster courtesy of a broken per-machine authentication system. One of the things I’m particularly happy about was the Learning to Rank Challenge workshop. I think it would be great if ICML can continue to attract new challenge workshops in the future. If anyone else has comments about the workshops, I’d love to hear them.

7/2/2010

MetaOptimize

Tags: Announcements, Machine Learning jl@ 12:39 am

Joseph Turian creates MetaOptimize for discussion of NLP and ML on big datasets. This includes a blog, but perhaps more importantly a question and answer section. I’m hopeful it will take off.

6/20/2010

2010 ICML discussion site

A substantial difficulty with the 2009 and 2008 ICML discussion system was a communication vacuum, where authors were not informed of comments, and commenters were not informed of responses to their comments without explicit monitoring. Mark Reid has setup a new discussion system for 2010 with the goal of addressing this.

Mark didn’t want to make it to intrusive, so you must opt-in. As an author, find your paper and “Subscribe by email” to the comments. As a commenter, you have the option of providing an email for follow-up notification.

6/13/2010

The Good News on Exploration and Learning

Consider the contextual bandit setting where, repeatedly:

  1. A context x is observed.
  2. An action a is taken given the context x.
  3. A reward r is observed, dependent on x and a.

Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward.

This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions.

A decade ago the best machine learning techniques for this setting where implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and butter supervised learning problems. Just as for supervised learning, we know how to create and reuse datasets, how to benchmark algorithms, how to reuse existing supervised learning algorithms in this setting, and how to achieve optimal rates of learning quantitatively similar to supervised learning.

This is also one of the times when understanding the basic theory can make a huge difference in your success. There are many wrong ways to attack contextual bandit problems or prepare datasets, and taking a wrong turn can easily mean the difference between failure and success. Understanding how contextual bandit problems differ from basic supervised learning problems is critical to routine success here.

All of the above is not meant to claim that everything is done research-wise here so we’ll try to outline where the current boundary of research lies as best we can. However, we are surely at a point both in terms of application demand (especially for internet applications of search, advertising, page optimization, but also medical applications and surely others) and methodology supply (with basic reliable techniques now easily available or created) where these techniques are shifting from theory esoterica to required education.

Given the above, Alina and I decided to prepare a tutorial to be given at Yahoo! Labs summer school (my first India trip!), ICML, KDD, and hopefully videolectures.net. Please join us. The subjects we plan to cover are essentially the keys to the kingdom of solving shallow interactive learning problems.

5/20/2010

Google Predict

Tags: Machine Learning jl@ 9:25 am

Slashdot points out Google Predict. I’m not privy to the details, but this has the potential to be extremely useful, as in many applications simply having an easy mechanism to apply existing learning algorithms can be extremely helpful. This differs goalwise from MLcomp—instead of public comparisons for research purposes, it’s about private utilization of good existing algorithms. It also differs infrastructurally, since a system designed to do this is much less awkward than using Amazon’s cloud computing. The latter implies that datasets several order of magnitude larger can be handled up to limits imposed by network and storage.

5/10/2010

Aggregation of estimators, sparsity in high dimension and computational feasibility

Tags: Machine Learning, Statistics jl@ 2:04 pm

(I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.)

Since Nemirovski’s Saint Flour lecture notes, numerous researchers have studied the following problem in least squares regression: predict as well as
(MS) the best of d given functions (like in prediction with expert advice; model = finite set of d functions)
(C) the best convex combination of these functions (i.e., model = convex hull of the d functions)
(L) the best linear combination of these functions (i.e., model = linear span of the d functions)
It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved since there exist estimators having an excess risk of order (log d)/n for (MS), min( sqrt((log d)/n), d/n ) for (C) and d/n for (L), where n is the training set size. Here, “risk” is amount of extra loss per example which may be suffered due to the choice of random sample.

The practical use of these results seems rather limited to trivial statements like: do not use the OLS estimator when the dimension d of the input vector is larger than n (here the d functions are the projections on each of the d components). Nevertheless, it provides a rather easy way to prove that there exists a learning algorithm having an excess risk of order s (log d)/n, with respect to the best linear combination of s of the d functions (s-sparse linear model). Indeed, it suffices to consider the algorithm which

  1. cuts the training set into two parts, say of equal size for simplicity,
  2. uses the first part to train linear estimators corresponding to every possible subset of s features. Here you can use your favorite linear estimator (the empirical risk minimizer on a compact set or robust but more involved ones are possible rather than the OLS), as long as it solves (L) with minimal excess risk.
  3. uses the second part to predict as well as the “d choose s” linear estimators built on the first part. Here you choose your favorite aggregate solving (MS). The one I prefer is described in p.5 of my NIPS’07 paper, but you might prefer the progressive mixture rule or the algorithm of Guillaume Lecué and Shahar Mendelson. Note that empirical risk minimization and cross-validation completely fail for this task with excess risk of order sqrt((log d)/n) instead of (log d)/n.

It is an easy exercise to combine the different excess risk bounds and obtain that the above procedure achieves an excess risk of s (log d)/n. The nice thing compared to works on Lasso, Dantzig selectors and their variants is that you do not need all these assumptions saying that your features should be “not too much” correlated. Naturally, the important limitation of the above procedure, which is often encountered when using classical model selection approach, is its computational intractability. So this leaves open the following fundamental problem:
is it possible to design a computationally efficient algorithm with the s (log d)/n guarantee without assuming low correlation between the explanatory variables?

5/2/2010

What’s the difference between gambling and rewarding good prediction?

Tags: Machine Learning jl@ 11:07 pm

After a major financial crisis, there is much discussion about how finance has become a casino gambling with other’s money, keeping the winnings, and walking away when the money is lost.

When thinking about financial reform, all the many losers in the above scenario are apt to take the view that this activity should be completely, or nearly completely curtailed. But, a more thoughtful view is that sometimes there is a real sense in which there are right and wrong decisions, and we as a society would really prefer that the people most likely to make right decisions are making them. A crucial question then is: “What is the difference between gambling and rewarding good prediction?”

We discussed this before the financial crisis. The cheat-sheet sketch is that the online learning against an adversary problem, algorithm, and theorems, provide a good mathematical model for thinking about this question. What I would like to do here is map this onto various types of financial transactions. The basic mapping is between “wealth” and “weight”, with the essential idea that you can think of wealth as either money or degree of control over decision making. The core algorithms start with a “wealth” spread over many experts, each of which makes predictions and then has it’s wealth updated according to a soft exponential of the value of it’s prediction.

  1. Going Long. The basic strategy here is to buy low and sell high. This strategy is not inherently sound from a learning theory point of view, because a single purchased item can sometimes drop to zero value. Similarly, a single purchased item can sometimes grow radically in value. Neither of these properties are desirable from the viewpoint of a learning algorithm. In the zero value case, a good decision maker can be wiped out by one decision, while in the large value case, a lucky decision maker can randomly achieve overwhelming credit. Nevertheless, there is a sense in which this strategy is compatible. If each item purchased either doubles or halves in value, the fluctuation in the wealth of a decision maker is analogous to the fluctuation in the relative weight of on an expert in the online learning framework.
  2. … with diversification. Going long with diversification implies purchasing several items and selling them later. Adding diversification to the “Long” strategy helps it align substantially better with an optimal learning theory strategy. Single points of failure are avoided, while random fluctuations up in wealth are reduced.
  3. Going Short. The short strategy is borrowing an item (typically a stock), selling it high, then buying it back low to cover the debt. It’s technique used to make money when a stock decreases in value. This technique was banned for a time during the crisis. From the perspective of learning theory, short selling is more dangerous than long, because it’s possible to end up with negative wealth when a stock is sold short, and then it increases in value. To avoid this, it’s necessary to have sufficient collateral to cover the short at all times. If this collateral is at least twice the value when shorting occurs, it’s hard for participants to become wealthy by luck, because wealth at most doubles. Diversification is also a potentially useful helper strategy.
  4. Insurance. Credit Default Swaps are effectively a form of insurance where one party pays another small amounts unless something bad happens, in which case large amounts of money go the other direction. In the financial crisis, credit default swaps made the crisis viral, as the “pay up” clauses triggered, particularly wiping out AIG. Insurance has the same general problem as short selling—it can result in negative wealth unless there is sufficient collateral. It also has the same solution.
  5. Clawback. The basic idea of a “clawback” is that when someone fouls up really badly, you extract it from their past paychecks. As far as I can tell, this sort of clause exists in nearly no contracts, but it’s a popular proposal in retrospect, particularly for certain AIG employees who destroyed their company. The driving problem here is that the actual value of a decision is not known for some time, and it’s misestimated in the short term. Learning theory suggests that you should apply updates to estimated value as soon as possible to adjust wealth, which would correspond to a potential 100% clawback clause.

Two things strike me in considering the above.

The first is that for normal people interacting with the financial system a set of financial rules + good sense have developed such that wealth tends to grow and shrink in a manner similar to what learning theory would suggest is near optimal. For example, most people use the going long strategy by default and most diversify. Most don’t use the short strategy, but those that do must have sufficient collateral. Normal people don’t have access to credit default swaps, and normal insurance has real collateral requirements. Clawbacks are automatic, as normal people bet with their own money and take their own losses.

The second is that larger actors have become quite skillful at avoiding the rules, with unsecured credit default swaps, unsecured shorts, and no clawback rules. But, learning theory is math, so it can’t really be avoided—instead what happens is inefficient decision making via inefficient learning algorithms on a societal scale.

My belief is effective financial reform will impose limits on agents just as learning theory implies. This is also the answer to the title question—it’s gambling if the corresponding learning algorithm has high regret, and it’s rewarding good prediction if the corresponding learning algorithm has low regret. Since this is already done effectively for normal people, shifting all agents towards the limits imposed in that direction works. This means lower bounds on collateral (or equivalently upper bounds on leverage), and standardized markets where all agents can interact on an equal basis. Adding in automatic clawback provisions for all performance-based pay would also probably be very effective.

A full dose of this medicine may upset many people directly affected by such legislation, as it limits their actions and imposes downsides. But this needn’t be so, because the math is straightforward, very robust, and designed precisely to pick out the good decision makers giving them wealth as rapidly as responsibly possible to make and control bigger decisions. If you are a good decision maker, then you should want this.

On the research front, there are substantial improvements we could hope for. Some basic questions are: How can we better structure marketplaces to allocate wealth according to the dynamics of an online learning algorithm? And what are the holes in the mapping between online learning and markets that need repair? And how do you repair them? And how do the repairs effect learning algorithms when backported? Good answer to this question could be radically valuable. Yiling and Jenn have a paper mapping out connections between prediction markets and online learning this year at EC, which is of interest for this direction of research.

4/28/2010

CI Fellows program renewed

Tags: CS, Funding jl@ 4:01 pm

Lev Reyzin points out the CI Fellows program is renewed. CI Fellows are essentially NSF funded computer science postdocs for universities and industry research labs. I’ve been lucky and happy to have Lev visit me for a year under last year’s program, so I strongly recommend participating if it suits you.

As with last year, the application timeline is very short, with everything due by May 23.

4/26/2010

Compassionate Reviewing

Most long conversations between academics seem to converge on the topic of reviewing where almost no one is happy. A basic question is: Should most people be happy?

The case against is straightforward. Anyone who watches the flow of papers realizes that most papers amount to little in the longer term. By it’s nature research is brutal, where the second-best method is worthless, and the second person to discover things typically gets no credit. If you think about this for a moment, it’s very different from most other human endeavors. The second best migrant laborer, construction worker, manager, conductor, quarterback, etc… all can manage quite well. If a reviewer has even a vaguely predictive sense of what’s important in the longer term, then most people submitting papers will be unhappy.

But this argument unravels, in my experience. Perhaps half of reviews are thoughtless or simply wrong with a small part being simply malicious. And yet, I’m sure that most reviewers genuinely believe they can predict what will and will not be useful in the longer term. This disparity is a lack of communication. When academics have conversations about reviewing, the presumption of participants in each conversation is that they all share about the same beliefs about what will be useful, and what will take off. Such conversations rarely go into specifics, because the specifics are boring in particular, technical, and because their is a real chance of disagreement on the specifics themselves.

When double blind reviewing was first being considered for ICML, I remember speaking about the experience in the Crypto community, where in my estimate the reviewing was both fairer and less happy. Many conferences in machine learning have shifted to doubleblind reviewing, and I think we have seen this come to pass here as well. Without double blind reviewing, it is common to have an “in” crowd who everyone respects and whose papers are virtually always accepted. These people are happy, and the rest have little voice. With double blind reviewing, everyone suffers substantial rejections.

We might say “fine, at least it’s fair”, but in my experience there is a real problem. From a viewpoint external to the community, when the reviewing is poor and the viewpoint of people in the community highly contradictory, nothing good happens. Outsiders (i.e. most people) viewing the acrimony choose some other way to solve problems, proposals don’t get funded, and the community itself tends to fracture. For example, in cryptography, TCC (not double blind) has started, presumably because the top theory people got tired of having their papers rejected at Crypto (double blind). From a process-of-research standpoint, this seems suboptimal, as different groups using different methods to solve similar problems are particularly the people who you would prefer talking to each other.

What seems to be lost with double blind reviewing is some amount of compassion, unfairly allocated. In a double blind system, any given paper is plausibly from someone you don’t know, and since most papers go nowhere, plausibly not going anywhere. Consequently, the bias starts “against” for all work, a disadvantage which can be quite difficult to overcome. Some time ago, I discussed how I thought motivation should be the responsibility of the reviewer. Aaron Hertzman strongly disagreed on the grounds that this belief could dead end your career as an author. I’ve come to appreciate his viewpoint to an extent. But, it misses the point slightly—the question of “What is good for the community?” differs from “What is good for the author?” In a healthy community, reviewers will actively understand why a piece of work is or is not important, filling in and extending the motivation as they consider the problem.

So, a question is: How can we get compassionate reviewing? (And in a fair way?) It might help somewhat for reviewers to actively consider, as part of their review, the level and mechanism of impact that a paper may have. Reducing reviewing load is certainly helpful, but it is not sufficient alone, because many people naturally interpret a reduced reviewing load as time to work on other things. And, some mechanisms seem to even harm. For example, the two-phase reviewing process that ICML currently uses might save 0.5 reviews/paper, while guaranteeing that for half of the papers, the deciding review is done hastily with no author feedback, a recipe for mistakes.

What creates a great deal of compassion? Public responsibility helps (witness workshops more interesting than conferences). A natural conversation helps (the current method of single round response tends to be very stilted). And time, of course, helps. What else?

4/24/2010

COLT Treasurer is now Phil Long

Tags: Conferences, Funding jl@ 2:14 pm

For about 5 years, I’ve been the treasurer of the Association for Computational Learning, otherwise known as COLT, taking over from John Case before me. A transfer of duties to Phil Long is now about complete. This probably matters to almost no one, but I wanted to describe things a bit for those interested.

The immediate impetus for this decision was unhappiness over reviewing decisions at COLT 2009, one as an author and several as a member of the program committee. I seem to have disagreements fairly often about what is important work, partly because I’m focused on learning theory with practical implications, partly because I define learning theory more broadly than is typical amongst COLT members, and partly because COLT suffers a bit from insider-clique issues. The degree to which these issues come up varies substantially each year so last year is not predictive of this one. And, it’s important to understand that COLT remains healthy with these issues not nearly so bad as they were. Nevertheless, I would like to see them taken more actively into account than I’ve been able to persuade people so far.

After thinking about it for a few days before acting, I decided to go ahead with the transfer for another reason: I’ve been suffering from multitask poisoning. Partly this is Ada, but partly it’s many other things, each of which takes a small bit of my time, in aggregate leaving me disappointing people, myself in particular. The effect of this has been quite obvious in terms of the posting rate on hunch.net.

Fortunately, Phil Long was ready to take up the duties, and he’s well positioned to do so.

Despite the above, I found being treasurer not particularly difficult. The functions of the treasury part of ACL have been

  1. Self-insurance for the conference each year. Prior to the formation of ACL-the-nonprofit (which Bob was instrumental in), COLT used to buy insurance against the possibility that some disaster would strike canceling the conference while leaving the local organizer on the hook for substantial expenses. When I came in, the treasury was a little bit low for this function, and when I left, somewhat too high.
  2. Budget fragmentation avoidance. Local organizers typically have a local account from which they spend for expenses and collect registration fees. Without the ACL, dealing with net positive or negative local accounts from year to year was awkward. With the ACL, it’s easy to square things up at the end of each year.
  3. A stable point of contact for funding related things. COLT is partly sponsored by several big CS-related companies including IBM, Microsoft, and Google. Providing a stable point of contact definitely helps ease this process. This also helps on the publishing side, where Omnipress is the current publisher of proceedings.
  4. Budget advice for local organizers. Somewhat to my surprise, the proper role of the treasurer was typically asking the local organizer to reduce registration fees rather than increase. The essential observation is that local organizers, because they operate out of a local account, tend to be a bit conservative in budget estimates. On the other hand, because ACL has an adequate interest bearing account, we should expect and desire to spend the interest in each typical year. In effect, ACL is naturally in a position to sponsor COLT to a small but nontrivial degree.

After having been treasurer for a little while, I’m convinced that having a nonprofit to back a conference is a good idea easing many difficulties with relatively small effort.

4/14/2010

MLcomp: a website for objectively comparing ML algorithms

Tags: Machine Learning jakester@ 8:37 pm
Much of the success and popularity of machine learning has been driven by its practical impact. Of course, the evaluation of empirical work is an integral part of the field. But are the existing mechanisms for evaluating algorithms and comparing results good enough? We (Percy and Jake) believe there are currently a number of shortcomings:

  1. Incomplete Disclosure: You read a paper that proposes Algorithm A which is shown to outperform SVMs on two datasets.  Great.  But what about on other datasets?  How sensitive is this result?   What about compute time – does the algorithm take two seconds on a laptop or two weeks on a 100-node cluster?
  2. Lack of Standardization: Algorithm A beats Algorithm B on one version of a dataset.  Algorithm B beats Algorithm A on another version yet uses slightly different preprocessing.  Though doing a head-on comparison would be ideal, it would be tedious since the programs probably use different dataset formats and have a large array of options.  And what if we wanted to compare on more than just one dataset and two algorithms?
  3. Incomplete View of State-of-the-Art: Basic question: What’s the best algorithm for your favorite dataset?  To find out, you could simply plow through fifty papers, get code from any author willing to reply, and reimplement the rest. Easy right? Well maybe not…

We’ve thought a lot about how to solve these problems. Today, we’re launching a new website, MLcomp.org, which we think is a good first step.

What is MLcomp? In short, it’s a collaborative website for objectively comparing machine learning programs across various datasets.  On the website, a user can do any combination of the following:

  1. Upload a program to our online repository.
  2. Upload a dataset.
  3. Run any user’s program on any user’s dataset.  (MLcomp provides the computation for free using Amazon’s EC2.)
  4. For any executed run, view the results (various error metrics and time/memory usage statistics).
  5. Download any dataset, program, or run for further use.

An important aspect of the site is that it’s collaborative: by uploading just one program or dataset, a user taps into the entire network of existing programs and datasets for comparison.  While data and code repositories do exist (e.g., UCI, mloss.org), MLcomp is unique in that data and code interact to produce analyzable results.

MLcomp is under active development.  Currently, seven machine learn task types (classification, regression, collaborative filtering, sequence tagging, etc.) are supported, with hundreds of standard programs and datasets already online.  We encourage you to browse the site and hopefully contribute more!  Please send comments and feedback to mlcomp.support (AT) gmail.com.

3/26/2010

A Variance only Deviation Bound

At the PAC-Bayes workshop earlier this week, Olivier Catoni described a result that I hadn’t believed was possible: a deviation bound depending only on the variance of a random variable.

For people not familiar with deviation bounds, this may be hard to appreciate. Deviation bounds, are one of the core components for the foundations of machine learning theory, so developments here have a potential to alter our understanding of how to learn and what is learnable. My understanding is that the basic proof techniques started with Bernstein and have evolved into several variants specialized for various applications. All of the variants I knew had a dependence on the range, with some also having a dependence on the variance of an IID or martingale random variable. This one is the first I know of with a dependence on only the variance.

The basic idea is to use a biased estimator of the mean which is not influenced much by outliers. Then, a deviation bound can be proved by using the exponential moment method, with the sum of the bias and the deviation bounded. The use of a biased estimator is clearly necessary, because an unbiased empirical average is inherently unstable—which was precisely the reason I didn’t think this was possible.

Precisely how this is useful for machine learning isn’t clear yet, but it opens up possibilities. For example, it’s common to suffer from large ranges in exploration settings, such as contextual bandits or active learning.

3/15/2010

The Efficient Robust Conditional Probability Estimation Problem

I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be :-)

The Problem

The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions.

In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.

A learning reduction consists of two algorithms R and R-1 which transform examples from the original input problem into examples for the oracle and then transform the oracle’s predictions into a prediction for the original problem.

The algorithm R takes as input a single example (x,y) where x is an feature vector and y is a discrete variable taking values in {1,…,k}. R then specifies a training example (x’,y’) for the oracle B. R can then create another training example for B based on all available information. This process repeats some finite number of times before halting without returning information.

A basic observation is that for any oracle algorithm, a distribution D(x,y) over multiclass examples and a reduction R induces a distribution over a sequence (x’,y’)* of oracle examples. We collapse this into a distribution D’(x’,y’) over oracle examples by drawing uniformly from the sequence.

The algorithm R-1 takes as input a single example (x,y) and returns a value in [0,1] after using (only) the testing interface of B zero or more times.

We measure the power of an oracle and a reduction according to squared-loss regret. In particular we have:


reg(D,R-1)=E(x,y)~ D[(R-1(x,y)-D(y|x))2]

and similarly letting mx’=E(x’,y’)~ D’[y'].

reg(D’,B)=E(x’,y’)~ D’(B(x’) – mx’)2

The open problem is to specify R and R-1 satisfying the following theorem:

For all multiclass distributions D(x,y), for all binary oracles B: The computational complexity of R and R-1 are O(log k)
and


reg(D,R-1) < = C reg(D’,B)

where C is a universal constant.

Alternatively, this open problem is satisfied by proving there exists no deterministic algorithms R,R-1 satisfying the above theorem statement.

Motivation

The problem of conditional probability estimation is endemic to machine learning applications. In fact, in some branches of machine learning, this is simply considered “the problem”. Typically conditional probability estimation is done in situations where the conditional probability of only one bit is required, however there are a growing number of applications where a well-estimated conditional probability over a more complex object is required. For example, all known methods for solving general contextual bandit problems require knowledge of or good estimation of P(a | x) where a is an action.

There is a second intrinsic motivation which is matching the lower bound. No method faster than O(log k) can be imagined because the label y requires log2 k bits to specify and hence read. Similarly it’s easy to prove no learning reduction can provide a regret ratio with C<1.

The motivation for using the learning reduction framework to specify this problem is a combination of generality and the empirical effectiveness in application of learning reductions. Any solution to this will be general because any oracle B can be plugged in, even ones which use many strange kinds of prior information, features, and active multitask hierachical (insert your favorite adjective here) structure.

Related Results

The state of the art is summarized here which shows it’s possible to have a learning reduction satisfying the above theorem with either:

  1. C replaced by (log2 k)2 (using a binary tree structure)
  2. or the computational time increased to O(k) (using an error correcting code structure).

Hence, answering this open problem in the negative shows that there is an inherent computation vs. robustness tradeoff.

There are two other closely related problems, where similar analysis can be done.

  1. For multiclass classification, where the goal is predicting the most likely class, a result analogous to the open problem is provable using error correcting tournaments.
  2. For multiclass classification in a partial label setting, no learning reduction can provide a constant regret guarantee.

Silly tricks that don’t work

Because Learning reductions are not familiar to everyone, It’s helpful to note certain tricks which do not work here to prevent false leads and provide some intuition.

Ignore B’s predictions and use your favorite learning algorithm instead.

This doesn’t work, because the quantification is for all D. Any specified learning algorithm will have some D on which it has nonzero regret. On the other hand, because R calls the oracle at least once, there is a defined induced distribution D’. Since the theorem must hold for all D and B, it must hold for a D your specified learning algorithm fails on and for a B for which reg(D’,B)=0 implying the theorem is not satisfied.

Feed random examples into B and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant.

This doesn’t work because the theorem is stated in terms of squared loss regret rather than squared loss. In particular, if the oracle is given examples of the form (x’,y’) where y’ is uniformly at random either 0 or 1, any oracle specifying B(x’)=0.5 has zero regret.

Feed pseudorandom examples into B and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant.

This doesn’t work, because the quantification is “for all binary oracles B”, and there exists one which, knowing the pseudorandom seed, can achieve zero loss (and hence zero regret).

Just use Boosting to drive the LHS to zero.

Boosting theorems require a stronger oracle—one which provides an edge over some constant baseline for each invocation. The oracle here is not limited in this fashion since it could completely err for a small fraction of invocations.

Take an existing structure, parameterize it, randomize over the parameterization, and then average over the random elements.

Employing this approach is not straightforward, because the average in D’ is over an increased number of oracle examples. Hence, at a fixed expected (over oracle examples) regret, the number of examples allowed to have a large regret is increased.

3/12/2010

Netflix Challenge 2 Canceled

Tags: Announcements, Competitions jl@ 6:33 pm

The second Netflix prize is canceled due to privacy problems. I continue to believe my original assessment of this paper, that the privacy break was somewhat overstated. I still haven’t seen any serious privacy failures on the scale of the AOL search log release.

I expect privacy concerns to continue to be a big issue when dealing with data releases by companies or governments. The theory of maintaining privacy while using data is improving, but it is not yet in a state where the limits of what’s possible are clear let alone how to achieve these limits in a manner friendly to a prediction competition.

2/26/2010

Yahoo! ML events

Yahoo! is sponsoring two machine learning events that might interest people.

  1. The Key Scientific Challenges program (due March 5) for Machine Learning and Statistics offers $5K (plus bonuses) for graduate students working on a core problem of interest to Y! If you are already working on one of these problems, there is no reason not to submit, and if you aren’t you might want to think about it for next year, as I am confident they all press the boundary of the possible in Machine Learning. There are 7 days left.
  2. The Learning to Rank challenge (due May 31) offers an $8K first prize for the best ranking algorithm on a real (and really used) dataset for search ranking, with presentations at an ICML workshop. Unlike the Netflix competition, there are prizes for 2nd, 3rd, and 4th place, perhaps avoiding the heartbreak the ensemble encountered. If you think you know how to rank, you should give it a try, and we might all learn something. There are 3 months left.
Older Posts »

Powered by WordPress