[224000060010] |
Under the NLP Christmas Tree
[224000060020] |Lance Fortnow eloquently states that all he wants for Christmas is a proof that P != NP.
[224000060030] |This got me to thinking about what I, as an NLPer, would most like for Christmas.
[224000060040] |Unfortunately, things in AI aren't as clear-cut as things in theory because it's hard to tell if we've succeeded.
[224000060050] |That said, what one might call a Halbert's Program for NLP is a (set of) techniques for super-human aggregation of information from multiple sources, media and domains specific to a given user.
[224000060060] |There are several "enabling" technologies that (I believe) are necessary to achieve this goal (those that I actively work on are starred):
[224000060070] |Human-level machine translation models (data in, system out)
[224000060080] |Turn-key structured prediction techniques with lots of features, complex loss functions and hidden variables*
[224000060090] |(Weak?) semantic entailment on raw text
[224000060100] |Personalization that doesn't obviate privacy
[224000060110] |Domain adaptation of arbitrary statistical models with little new annotated data*
[224000060120] |Automatic speech recognition
[224000060130] |Graph, equation and table "understanding"
[224000060140] |Weakly supervised summarization, more complex than extraction*
[224000060150] |Database querying from natural language queries
[224000060160] |Social network analysis
[224000060170] |Plan recognition, opinion and point-of-view analysis
[224000060180] |There are doubtless other interesting things and, of course, each of these might have its own set of enabling technologies (and there are also some cross dependencies).
[224000060190] |One perhaps controversial aspect is that I consider MT to be an enabling technology; this is because I don't necessarily care about reading Chinese documents -- I care about the information in the Chinese documents.
[224000060200] |To me, everything is about getting information, regardless of the source.
[224000070010] |Structured Prediction 1: What's Out There
[224000070020] |Structured prediction (SP) -- the machine learning task of generating outputs with complex internal structure -- is near and dear to my heart.
[224000070030] |It is also a generalization of many NLP problems (summarization, MT, IE, coref, parsing, tagging, etc.).
[224000070040] |This is the first part of a multi-part post on structured prediction (first: what's out there; second: what I do; third: where we should go).
[224000070050] |The most important aspect of the SP problem (aside from the data) is the loss function: l(y,y') gives us how much it costs us to produce y' when the correct answer is y.
[224000070060] |Our goal, as always, is to minimize the expected loss over the distribution that generated our training data.
[224000070070] |The common way of doing this is to posit a large set Y of possible outputs and to try to find a function f(x,y) on an input x and output y that ranks outputs: f(x,y) > f(x,z) if y is a better output than z.
[224000070080] |Once we've learned such a function, the only thing left is to find a way to search over all y to find the one with highest score.
[224000070090] |Aside from the parameterization of f (which is almost always taken to be linear in the feature set), the only thing left that matters is how we measure one f against another.
[224000070100] |For notational convenience, let g(x,y) = exp(-f(x,y)).
[224000070110] |Here are some common ways to score f:
[224000070120] |Generative: f is good if g(x,y) / sum_{x',y'} g(x',y') is high for all (x,y) in the training data
[224000070130] |Conditional: f is good if g(x,y) / sum_{y'} g(x,y') is high for all (x,y) in the training data
[224000070140] |Discriminative-M: f is good if f(x,y) > f(x,y') + l(y,y') for all (x,y) in the training data and all y', where errors are punished equally
[224000070150] |Discriminative-S: f is good if f(x,y) > f(x,y') for all (x,y) in the training data and all y', where errors are punished proportional to l(y,y').
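To make the four criteria concrete, here is a rough sketch in Python for a toy problem where the output set Y (and, for the generative case, the input set X) is small enough to enumerate. The function names, the unit margin in the slack-rescaled hinge, and the enumerable toy sets are my own illustrative choices, not anything prescribed above.

```python
import math

def generative_score(f, X, Y, x, y):
    # g(x,y) / sum_{x',y'} g(x',y'), with g = exp(-f) as defined above;
    # only computable here because the toy X and Y are enumerable.
    g = lambda a, b: math.exp(-f(a, b))
    return g(x, y) / sum(g(xp, yp) for xp in X for yp in Y)

def conditional_score(f, Y, x, y):
    # g(x,y) / sum_{y'} g(x,y')
    g = lambda a, b: math.exp(-f(a, b))
    return g(x, y) / sum(g(x, yp) for yp in Y)

def dm_violation(f, Y, loss, x, y):
    # Discriminative-M (margin rescaling): how badly is f(x,y) > f(x,y') + l(y,y') violated?
    return max((max(0.0, f(x, yp) + loss(y, yp) - f(x, y)) for yp in Y if yp != y),
               default=0.0)

def ds_violation(f, Y, loss, x, y):
    # Discriminative-S (slack rescaling): violations of f(x,y) > f(x,y') are
    # punished in proportion to l(y,y'); the unit margin is one common choice.
    return max((loss(y, yp) * max(0.0, 1.0 + f(x, yp) - f(x, y)) for yp in Y if yp != y),
               default=0.0)
```

Lower violation (or higher score) on the training data is what each criterion asks for; the difference between D-M and D-S is exactly where the loss enters the hinge.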
[224000070160] |Assuming all optimizations are equally difficult (or easy), these techniques should be ranked according to how closely they match our true desire: low expected loss.
[224000070170] |To me, the Discriminative-S method is the closest.
[224000070180] |To see the distinction between D-S and D-M, consider the case where we can easily find a function that ranks the correct outputs higher than the incorrect outputs, but cannot distinguish between the incorrect outputs.
[224000070190] |From the perspective of expected loss, this is sufficient.
[224000070200] |From the perspective of D-S, this is also sufficient.
[224000070210] |But it's not good enough for D-M.
[224000070220] |D-M requires that if some really bad y' exists, then f(x,y) has to be much larger (by l(y,y')) than f(x,y').
[224000070230] |I don't see a reason why we should care about this.
[224000070240] |Another way of looking at these methods is how we can optimize them.
[224000070250] |Let's suppose f is just w*h(x,y), where h(x,y) = sum_{r in y} h(x,r) is a decomposition over regions (adjacent tags in sequences, for instance), w is a weight vector and h provides a feature vector.
[224000070260] |We can compute the gradient of the optimization criterion (using exponentiated gradients for the discriminative methods) for each of the methods on a given input (x,y) as:
[224000070270] |sum_{r in y} h(x,r) - sum_{y'} sum_{r' in y'} X h(x,r')
[224000070280] |This is just the difference between the true features h(x,r) and the guessed features h(x,r'), where the guesses are weighted by X. The only difference between the above techniques is how we define X.
[224000070290] |Conditional: X = (1/Z) exp [ sum_{r' in y'} w*h(x,r') ]
[224000070300] |Discriminative-M: X = (1/Z) exp [ sum_{r' in y'} e C [ (t-1) l(y,r') + w*h(x,r') ] ]
[224000070310] |Discriminative-S: X = (1/Z) exp [ sum_{r' in y'} e C l(y,r')^(t-1) w*h(x,r') ]
[224000070320] |In all cases, Z is defined so that the sum is unity.
[224000070330] |In the latter two, e is a learning rate, C is the SVM slack hyperparameter and t is the iteration number.
[224000070340] |The conditional X can be interpreted as the posterior probability of this (incorrect) region under the current model.
[224000070350] |Going from conditional to D-M involves the insertion of the loss function, leading to a loss-augmented posterior.
[224000070360] |The (t-1) term essentially means that as more iterations are performed, the loss term becomes increasingly important.
[224000070370] |In the D-S version, we multiply by the loss rather than add it.
[224000070380] |What does this tell us?
[224000070390] |Well, it tells us exactly what we saw before: the gradient on the D-M model is high when the loss is high, even if the (non-loss-augmented) posterior is zero.
[224000070400] |This means that even if our model would never produce a wrong region, but that region has high loss, the gradient is still high at that point.
[224000070410] |This contrasts with the D-S gradient, which, once the posterior disappears, doesn't care about the loss anymore.
[224000070420] |The only tricky thing in any of these models is to compute the sum over y' efficiently (or, alternatively, approximate it by computing the max over y').
[224000070430] |In well formed cases, like sequence labeling or parsing (under strict assumptions), this is possible using standard dynamic programming (or sometimes, max-flow) methods.
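To ground what that sum over y' looks like, here is a brute-force sketch of the conditional gradient for a tiny sequence-labeling problem; the label set, the hashed feature map and the dimensionality are hypothetical, and real implementations replace the explicit enumeration of Y with the dynamic programming just mentioned.

```python
import itertools
import numpy as np

LABELS = ["DET", "NOUN", "VERB"]

def region_features(x, pos, prev_label, label, dim=64):
    """Hypothetical feature vector h(x, r) for one region (an adjacent-label pair), hashed into dim buckets."""
    v = np.zeros(dim)
    v[hash((x[pos], label)) % dim] += 1.0      # emission-style feature
    v[hash((prev_label, label)) % dim] += 1.0  # Markov-style feature
    return v

def sequence_features(x, y, dim=64):
    """h(x, y) = sum over regions r in y of h(x, r)."""
    prev, total = "<s>", np.zeros(dim)
    for pos, label in enumerate(y):
        total += region_features(x, pos, prev, label, dim)
        prev = label
    return total

def conditional_gradient(w, x, y_true, dim=64):
    """Gradient of the conditional criterion: true features minus expected features,
    with X = (1/Z) exp[w * h(x, y')] computed by enumerating all y' (exponential in len(x))."""
    scores, feats = [], []
    for y in itertools.product(LABELS, repeat=len(x)):
        f = sequence_features(x, y, dim)
        feats.append(f)
        scores.append(w @ f)
    scores = np.array(scores)
    post = np.exp(scores - scores.max())
    post /= post.sum()
    expected = sum(p * f for p, f in zip(post, feats))
    return sequence_features(x, y_true, dim) - expected
```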
[224000070440] |Incidentally, the "conditional" model refers to CRFs, the "discriminative-s" model refers to SVMISOs and the "discriminative-m" model refers to M3Ns.
[224000080010] |The Mad Paper Rush
[224000080020] |The HLT/NAACL deadline passed last month and the ACL deadline is next month.
[224000080030] |The cluster here seems more swamped than usual, as people rush to get experiments done and papers written.
[224000080040] |It seems that most research gets done two months prior to the due dates and papers are rarely, if ever, submitted early.
[224000080050] |(Actually, I saw a statistic from ICML that most early-submitted papers actually got rejected in the end!)
[224000080060] |As someone who is trying to organize a ski trip to Mammoth two weeks before the ACL deadline and having difficulty getting people to come, I wonder why there's always this last minute rush.
[224000080070] |The ACL deadline was published at least four months ago, and even if not, it's actually later this year than usual.
[224000080080] |The only explanation that makes sense to me is that the rush is to run a handful of well-controlled experiments based on continuous tweaking of a system.
[224000080090] |The further you can delay running these experiments, the more you can tweak.
[224000080100] |Importantly, assuming you're a good researcher, the experiments are not to convince yourself that your approach is viable.
[224000080110] |The experiments are to convince your potential reviewers that your approach is viable.
[224000080120] |Based on this, there seem to be two ways to cut down on this rush (assuming people don't like it).
[224000080130] |First is to reduce the amount of tweaking done.
[224000080140] |Second is to reduce the number of well-controlled experiments run.
[224000080150] |Unfortunately, both of these will probably result in lower chances of your paper being accepted (confirming the above-cited ICML statistic).
[224000080160] |But this is bad.
[224000080170] |Tweaking will improve your scores (hopefully), but will rarely be mentioned in the paper, leading to irreproducible results.
[224000080180] |Running too many experiments can cloud the point of your paper and significantly cut down on your time to work on real things.
[224000080190] |Despite this, people continue to tweak and run too many experiments (myself included).
[224000080200] |This seems to be because the cost of failure is too high: if your paper is rejected, you basically have to wait another year to resubmit, so you want to cover all bases.
[224000080210] |Two solutions come to mind.
[224000080220] |First, we could spread our conferences further apart, meaning that you have only to wait 6 months.
[224000080230] |Second, we could try to ensure that reviewers know that with a limit of 8 pages, one cannot run all possible experiments, and that they can suggest alternative contrastive experiments to be run, but if the experiments back up the main point(s) of the paper, this should be sufficient.
[224000080240] |I don't think the "review/response" approach will fix this because running a new experiment is not what author responses are for, and this is just delaying the problem, rather than fixing it (though I do advocate the response phase).
[224000090010] |Long Distance Collaboration
[224000090020] |For those who have engaged in such an affair, I'm curious how you make it work well.
[224000090030] |I've been working with John Langford recently, drawing out and exploiting connections between structured prediction and reinforcement learning.
[224000090040] |I'm visiting family in Chicago now and will be going to TTI tomorrow to visit.
[224000090050] |Up until this visit, we've talked mostly over email, less over phone and even less in person (some at ICML and NIPS 2005).
[224000090060] |But the rate of information transfer in such collaboration is significantly less than if John had an office down the hall from me.
[224000090070] |How do people make this work?
[224000090080] |Site visits.
[224000090090] |This is/was the de facto standard in business for ages, before the internet came along.
[224000090100] |People also tended to collaborate less, and in business there is perhaps less of a push for collaboration than for subcontracting.
[224000090110] |While undoubtedly the best method for interacting, it carries significant overhead in terms of cost, travel time, etc., especially across continents.
[224000090120] |Telephone.
[224000090130] |Another good business standard; unfortunately, I'm really a white-board kind of guy, and it's really hard to draw on someone else's white board.
[224000090140] |Talking about math and referencing papers is also difficult over the phone.
[224000090150] |Large time differences can also be a nuisance (I recall talking to Yee Whye at about midnight my time because of the time difference to Singapore).
[224000090160] |E-mail.
[224000090170] |This seems to be what is used most: it is cheap and doesn't suffer from the time difference, though replies are delayed.
[224000090180] |It enables you to think more about a topic before replying, but cuts down on often-constructive back-and-forth discussion (since that would take too long).
[224000090190] |The medium is also limited: typing math into email is ugly for anything complex, and you still can't draw on someone else's white board.
[224000090200] |Twice in my life I've attached .eps figures, but this is highly ineffective.
[224000090210] |So, what do other people do?
[224000090220] |Is there any hope?
[224000090230] |It seems that email could theoretically be retrofitted to fix a lot of the problems I have with it (multiple media, drawing, math, etc.), but does such technology exist currently?
[224000100010] |NLP as Glorified Memorization
[224000100020] |The view of NLP as essentially a hunt-and-return technology has been gathering momentum since the burgeoning of the web.
[224000100030] |Example-based MT takes this view to machine translation, and phrase-based statistical MT is essentially EBMT done with statistics.
[224000100040] |In question answering (factoid style), the situation is even more dramatic.
[224000100050] |Deepak's thesis was essentially devoted to the idea that the answer to any question can be found in huge corpora by relatively simple pattern-matching.
[224000100060] |To a somewhat lesser degree, information extraction technology is something like smoothed (or backed-off) memorization, and performance is largely driven by one's ability to obtain gazetteers relevant to one's task.
[224000100070] |Pushing such memorization technology further will doubtless lead to continued success, and there are many open research questions here.
[224000100080] |I would love others to answer these questions, but I have little interest in answering them myself.
[224000100090] |Fortunately, I think there are many interesting real-world problems for which simple memorization techniques will not work, and deeper "analysis" or "understanding" is required.
[224000100100] |Any QA/summarization task that focuses on something other than "general world knowledge" fits into this category.
[224000100110] |I might want to ask questions to my email client about past emails I've received.
[224000100120] |The answer will likely exist only once, and likely not in the form I ask the question.
[224000100130] |I might want to ask questions about scientific research, either from PubMed or REXA.
[224000100140] |I might want to ask about the issues involved in the election of the Canadian PM (something I know nothing about) or the confirmation hearings of Samuel Alito (something I know comparatively more about).
[224000100150] |And I would want the answers tailored to me.
[224000100160] |If I owned a large corporation or were running a campaign, I would want to know what my supporters and detractors were saying about me, and who was listening to whom.
[224000100170] |I could be proven wrong: maybe memorization techniques can solve some/all of these problems, but I doubt it.
[224000100180] |What other problems are people interested in that may not be solvable with memorization?
[224000110010] |The MT Blues: Would a program get its facts right?
[224000110020] |I just came across an article on Slate entitled The Translator's Blues: Will I get replaced by a computer program?, written by Jesse Browner, a human translator.
[224000110030] |I have no idea how widely read Slate is, and while I think Browner is correct that his job is not (yet) in jeopardy, there are several common misconceptions in the article that really shouldn't be there.
[224000110040] |First, the results of his example sentence are atrocious.
[224000110050] |I see no reason why even the rule-based systems should get this wrong (though I suspect his poor results had to do largely with the difficulty of inputting accented characters).
[224000110060] |Language Weaver stood out, which is good, because it's a good system.
[224000110070] |In fact, in all the news texts he tried, it is reported to fare well.
[224000110080] |And he's right that the technology has limits: it could not translate the first sentence of Don Quixote well.
[224000110090] |This is the domain adaptation problem: Language Weaver's system was trained on news data and (surprise surprise) it performs well on news and not on fiction.
[224000110100] |I think the important point is that the problem is domain.
[224000110110] |IMO, it has nothing to do with context, which is what Browner attributes the problem to.
[224000110120] |The con/pen example is something that I believe a decent statistical MT system should and would get right; if not now, in a few years once language models are better.
[224000110130] |It's unfortunate that Browner doesn't say that Language Weaver is (largely) a statistical system, because he seems to suggest that statistical systems suck, yet LW was by far the best system he tried.
[224000110140] |My final comment is in reference to the attribution to Mike Collins, who clearly knows much more about this field than many.
[224000110150] |Most likely Mike was quoted out of context, but Language Weaver's system -- and, in fact, most good research systems out there -- are based on machine learning.
[224000110160] |That's what "statistical" means.
[224000110170] |(Incidentally, for those less literary minded, the first sentence of Don Quixote should be translated something like "In a village of La Mancha, the name of which I have no desire to call to mind, there lived not long since one of those gentlemen that keep a lance in the lance-rack, an old buckler, a lean hack, and a greyhound for coursing."
[224000110180] |I'm not convinced that a professional translator who worked for the UN would find this easy to translate either.)
[224000120010] |Structured Prediction 2: My Perspective
[224000120020] |In Structured Prediction 1: What's out there?, I discussed current SP technology: Markov fields, CRFs, M3Ns and SVMISOs.
[224000120030] |Here I'll discuss my work briefly.
[224000120040] |I'll come back to SP once more to discuss what seems to be on the horizon.
[224000120050] |With so many options, why bother with a new one?
[224000120060] |There are problems that are not solved by the standard approaches.
[224000120070] |The biggest problem is the "argmax assumption," which states that you can solve the following problem:
[224000120080] |arg max_{y in Y} f(x,y; w) = arg max_{y in Y} w*Phi(x,y)
[224000120090] |Where the equality is under the standard linear predictor assumption.
[224000120100] |This maximization requires that, for an input x, we are able to compute the highest scoring possible output (efficiently).
[224000120110] |The problem is that for many (most?) NLP problems, without making completely unreasonable assumptions, this maximization is intractable.
[224000120120] |The standard assumptions that make this operation polynomial time are the Markov assumption for sequences and the context-free assumption for trees.
[224000120130] |But we know that both of these assumptions are invalid, and a few papers have tried to deal with this by using sampling, reranking or approximate inference techniques.
[224000120140] |On the other hand, if you look at what people in NLP actually do to solve problems, they use approximate search (MT is a good example).
[224000120150] |In this way, the search space can be tuned to precisely capture the functional (feature) dependencies needed.
[224000120160] |My perspective is that if we're going to use search anyway, we should take this into account while learning.
[224000120170] |There is no point to learning a model that ranks the best output highest if we cannot actually find this output.
[224000120180] |And once we start using approximate search, any performance guarantees related to learning pretty much go out the window.
[224000120190] |The really cool thing is that a completely different brand of machine learning people have considered learning in the context of search, but they call it reinforcement learning.
[224000120200] |The key difference between SP and RL, as I see it, is that in RL you can affect the world whereas in SP you cannot.
[224000120210] |In the standard "action/state/observation" terminology, in SP you only get one observation at the very beginning (the input x), while in RL the observations keep coming in as you take more actions.
[224000120220] |This seems to be the fundamental difference between the two problems.
[224000120230] |So, by treating SP in a search framework, what we basically want is a function that takes our input x, a partial output represented by a state in our search space s, and predicts the best action to take: the best thing to add onto the partial output.
[224000120240] |For instance, in sequence labeling with left-to-right decoding, we might predict the label of the next word.
[224000120250] |In MT, we might predict a French phrase to translate and its English translation, and append this to a translation prefix.
[224000120260] |The action set is potentially large, but not impossibly so (if it were, the problem would be intractable anyway).
[224000120270] |The great thing is that this enables us to formally cast structured prediction as a cost-sensitive multiclass classification problem.
[224000120280] |Lots of algorithms have been developed for the multiclass problem, and we can seek to apply these to the SP problem through this reduction.
[224000120290] |This gives us simple, efficient algorithms with good practical performance and strong theoretical guarantees.
[224000120300] |The key requirement that makes all this work (essentially my version of the arg max assumption) is that for a given input x, true output y and state s in the search space, we can compute the action a that is optimal to execute from s.
[224000120310] |This is the "optimal policy assumption."
[224000120320] |(We can actually weaken this assumption to require only a bound on the optimal policy, which is necessary for optimizing things like BLEU or ROUGE, at least under interesting models.)
[224000120330] |It is provably a strictly weaker assumption than the assumptions made by CRFs and M3Ns, essentially because it doesn't care about the feature space, which means we can have whatever the heck features we want (a good thing for NLP).
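To give a flavor of this reduction (and only a flavor; this is not the full algorithm), here is a sketch for left-to-right sequence labeling under Hamming loss, where the optimal policy from any state is simply the true next label. The feature function and classifier interface are hypothetical, and generating training examples only along the true prefix is a simplification.

```python
def make_cost_sensitive_examples(x, y_true, labels, features):
    """Each search state (a partial output) becomes one cost-sensitive multiclass example.
    Under Hamming loss the optimal policy from any state is the true next label, so the
    cost of action a at position n is 0 if a == y_true[n] and 1 otherwise."""
    examples = []
    for n in range(len(x)):
        prefix = list(y_true[:n])            # state: partial output (here, the true prefix)
        phi = features(x, n, prefix)         # features of (input, state); hypothetical interface
        costs = {a: (0.0 if a == y_true[n] else 1.0) for a in labels}
        examples.append((phi, costs))
    return examples

def decode(x, labels, features, classifier):
    """At test time, greedily take the classifier's predicted action from each state."""
    prefix = []
    for n in range(len(x)):
        phi = features(x, n, prefix)
        prefix.append(classifier.predict(phi))   # classifier: any cost-sensitive multiclass learner
    return prefix
```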
[224000130010] |Most Influential NLP Papers
[224000130020] |I conducted a mini survey recently, asking people I knew what they thought were the most influential papers in NLP from the past two decades.
[224000130030] |Here are the wholly unscientific results, sorted from most votes and subsorted by author.
[224000130040] |Note that I only got responses from 7 people.
[224000130050] |I've not listed papers that got only one vote and have not included my personal votes.
[224000130060] |(7 votes): Brown et al., 1993; The Mathematics of Statistical Machine Translation
(5 votes): Collins, 1997; Three Generative, Lexicalised Models for Statistical Parsing
(4 votes): Marcus, 1993; Building a large annotated corpus of English: the Penn Treebank
(3 votes): Berger et al., 1996; A maximum entropy approach to natural language processing
(2 votes): Bikel et al., 1997; An Algorithm that Learns What's in a Name
(2 votes): Collins, 2002; Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms
(2 votes): Lafferty et al., 2001; Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
(2 votes): Och, 2003; Minimum Error Rate Training for Statistical Machine Translation
(2 votes): Papineni et al., 2001; Bleu: a method for automatic evaluation of machine translation
(2 votes): Ratnaparkhi, 1999; Learning to Parse Natural Language with Maximum Entropy Models
(2 votes): Yarowsky, 1995; Unsupervised Word Sense Disambiguation Rivaling Supervised Methods
[224000130070] |This seems to be dominated by parsing and techology/machine learning papers, with a smattering of MT thrown in for good measure.
[224000130080] |I feel that the former represent a sort of "lowest common denominator."
[224000130090] |There were several singleton votes for various papers in summarization, language modeling, IR, and other fields, but, with the exception of the MT papers, getting >1 vote meant that you had to be useful to more than just your one community.
[224000130100] |It's also a bit surprising that Collins' parser is so high, but no other parsers got any votes (Charniak, Lin, etc.).
[224000130110] |Feel free to say what you think is missing!
[224000140010] |Domain Adaptation
[224000140020] |We NLPers face this problem all the time: we have training data from one domain/genre but really want to work in another.
[224000140030] |Eg., the treebank is WSJ text but we care about email or web pages or whatever.
[224000140040] |We'd like to be able to intelligently use our annotated WSJ text to get a good statistical model for a different domain.
[224000140050] |I've been working on this problem for a while now and have a partial solution.
[224000140060] |I'm most interested in the case that we have lots of annotated "out of domain" (OOD) data and a little annotated "in domain" (ID) data.
[224000140070] |Domain Adaptation for Statistical Classifiers is a paper that's been accepted to JAIR that presents one way to model this problem.
[224000140080] |The key idea is to model the OOD/ID data distributions as mixtures.
[224000140090] |There are three mixture components: a "truly ID" distribution, a "truly OOD" distribution and a "general" distribution.
[224000140100] |We say the OOD data comes from a mixture of "truly OOD" and "general," while the ID data comes from a mixture of "truly ID" and "general."
[224000140110] |The learning task is to tear apart our data sets to figure out what "general" (and hence relevant to the ID task) information there is in the OOD data.
[224000140120] |The framework, when applied to maximum entropy models, gives relatively simple update equations for model parameters.
[224000140130] |The derivations take a bit of thought, but are not insane.
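I won't reproduce the maximum entropy derivations here, but the three-component idea is easy to see in a toy setting. The sketch below uses one-dimensional Gaussian components with fixed variance purely to illustrate the E and M steps of tearing the data apart; the Gaussians, the variable names and the initialization are my own illustrative assumptions, not the model in the paper.

```python
import numpy as np
from scipy.stats import norm

def em_three_component(ood, ind, n_iter=100):
    """Toy EM: OOD points come from a mix of 'truly OOD' and 'general' Gaussians,
    ID points from a mix of 'truly ID' and 'general'; the 'general' component is shared."""
    mu = {"ood": ood.mean(), "id": ind.mean(), "gen": np.concatenate([ood, ind]).mean()}
    pi_o, pi_i = 0.5, 0.5   # P(truly-OOD | OOD example), P(truly-ID | ID example)
    for _ in range(n_iter):
        # E-step: responsibility of the domain-specific component for each point
        r_o = pi_o * norm.pdf(ood, mu["ood"], 1.0)
        r_o = r_o / (r_o + (1 - pi_o) * norm.pdf(ood, mu["gen"], 1.0))
        r_i = pi_i * norm.pdf(ind, mu["id"], 1.0)
        r_i = r_i / (r_i + (1 - pi_i) * norm.pdf(ind, mu["gen"], 1.0))
        # M-step: re-estimate component means and mixing weights
        mu["ood"] = (r_o * ood).sum() / r_o.sum()
        mu["id"] = (r_i * ind).sum() / r_i.sum()
        gen_w = np.concatenate([1 - r_o, 1 - r_i])
        mu["gen"] = (gen_w * np.concatenate([ood, ind])).sum() / gen_w.sum()
        pi_o, pi_i = r_o.mean(), r_i.mean()
    return mu, pi_o, pi_i
```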
[224000140140] |The approach is a partial solution because it's limited to maximum entropy models.
[224000140150] |I think the problem should be amenable to a more learning theoretic analysis, but haven't had time to make much headway here.
[224000140160] |I'm a bit surprised this problem hasn't gotten more attention in the NLP community (a similar problem -- speaker adaptation -- exists in the speech community).
[224000140170] |Or perhaps I've missed it in NLP.
[224000140180] |It seems like we, as NLPers, should really care about this issue.
[224000150010] |Predictive Models
[224000150020] |There are at least two reasonable goals for NLP research: (1) produce predictive systems that mimic humans; (2) produce explanatory models.
[224000150030] |(2) gets more into CL and CogSci and I may discuss it at a later date.
[224000150040] |This post is about (1), at least the machine learning approach to (1).
[224000150050] |Our goal is to mimic a human on some task.
[224000150060] |This means we will, at the end of the day, want to measure our performance against human performance.
[224000150070] |This measurement is our loss function, and is, in my mind, the only thing that really matters.
[224000150080] |Coming up with a good loss function is really really hard and I will certainly discuss this issue in another post.
[224000150090] |But suppose we can come up with one: l(t,h) = how much it hurts to produce a hypothesis h when the true answer was t.
[224000150100] |The subsequent question is: given some collected data, how can I minimize l.
[224000150110] |The dominant approach seems to be: approximate l by some well-known loss (binary loss, squared error, absolute loss, etc.) and then use one of the many good classifiers/regressors to solve our original problem.
[224000150120] |This step irks me greatly.
[224000150130] |Not because this approximation is always bad, but because it is never (rarely) stated.
[224000150140] |Many papers say something like "Our task in XYZ is to minimize misclassification error of problems created by ABC."
[224000150150] |This is rarely actually true.
[224000150160] |I would be much happier with the following explanation: "Our task in XYZ is to minimize TUV loss, but we don't know how to do that, so we'll instead minimize misclassification error."
[224000150170] |This issue manifests itself commonly in another way: putting the cart before the horse.
[224000150180] |We often gather a data set, label it, solve the labeling problem (eg., by multiclass classification) and then go back and define the problem in such a way that this is an appropriate solution.
[224000150190] |This is harder to detect, but equally disingenuous.
[224000150200] |So let's say we've done things right: we've defined our loss function and we've gotten some labeled data.
[224000150210] |Now we have to minimize our loss on this data.
[224000150220] |If we can do this directly, great.
[224000150230] |If not, we can say so and that we'll approximate it by some other loss l'.
[224000150240] |Now we can start looking at features.
[224000150250] |We have an l' that we can minimize and we want to come up with features that are useful for this minimization.
[224000150260] |In particular, if ours is a structured problem, and l' is invariant to some change in structure, we needn't reflect this aspect of the structure in our feature functions (the prototypical example here being Markov features in sequence labeling for Hamming [per-label] loss, though this issue itself could take a whole other post).
[224000150270] |We should set aside part of the training data for manual inspection to do feature engineering.
[224000150280] |Experimentation is the last real step.
[224000150290] |We've split off a training set, development set and test set (*).
[224000150300] |We train a model on the training set, evaluate it on the development set.
[224000150310] |We look at the dev predictions and see what went wrong.
[224000150320] |We reengineer some features to fix these problems and repeat.
[224000150330] |Only at the end do we run on the test set and report our results in a paper.
[224000150340] |(*) I'm not a fan of cross-validation in such a cyclic process.
[224000150350] |The features will be engineered based on the training and development data sets, and to then fold all this back into the bag and do cross-validation will give us unrealistically good results.
[224000150360] |And as a very final step, in the case that our optimized l' is not the same as l, I would like to see some results that show that they at least correlate reasonably well.
[224000150370] |An easy way to do this is to run several versions of the model and to report both l and l' scores and let us see that improving l' does indeed improve l.
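A minimal sketch of the kind of report I mean (the model names and the scores are placeholders, not real results):

```python
from scipy.stats import pearsonr, spearmanr

# For several versions of the model, record both the surrogate loss l' we optimized
# and the true loss l we actually care about (all numbers below are hypothetical).
runs = {
    "baseline":       {"l_prime": 0.41, "l": 0.38},
    "+markov":        {"l_prime": 0.35, "l": 0.33},
    "+more features": {"l_prime": 0.31, "l": 0.30},
    "+tuned":         {"l_prime": 0.28, "l": 0.29},
}
lp = [r["l_prime"] for r in runs.values()]
l = [r["l"] for r in runs.values()]
r_pearson, _ = pearsonr(lp, l)
rho, _ = spearmanr(lp, l)
print("Pearson r:", r_pearson, "Spearman rho:", rho)
```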
[224000170010] |The Art of Loss Functions
[224000170020] |There are roughly four types of loss functions that are used in NLP research.
[224000170030] |The real loss function given to us by the world.
[224000170040] |Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc.
[224000170050] |We rarely have any access to this function.
[224000170060] |The human-evaluation function.
[224000170070] |Typical examples are fluency/adequacy judgments, relevance assessments, etc.
[224000170080] |We can perform these evaluations, but they are slow and costly.
[224000170090] |They require humans in the loop.
[224000170100] |Automatic correlation-driving functions.
[224000170110] |Typical examples are Bleu, Rouge, word error rate, mean-average-precision.
[224000170120] |These require humans at the front of the loop, but after that are cheap and quick.
[224000170130] |Typically some effort has been put into showing correlation between these and something higher up.
[224000170140] |Automatic intuition-driven functions.
[224000170150] |Typical examples are accuracy (for anything), f-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling).
[224000170160] |These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.
[224000170170] |Note that the difference between (3) and (4) changes over time: there are many (4)s that could easily become (3)s given a few rounds of experimentation.
[224000170180] |(I apologize if I called something a (4) that should be a (3).)
[224000170190] |It is important to always keep in mind that our goal is to improve (1).
[224000170200] |The standard line is: I can't compute (1) so I approximate it with (2). I can't optimize (2) so I approximate it with (3). (Or, I don't know how to define (2) so I approximate it with (4).)
[224000170210] |I strongly feel that the higher up on this list one is, the harder it is to actually define the evaluation metric.
[224000170220] |It is very easy to define things at the (4) level because there are essentially no requirements on such a loss function other than (A) hard to game and (B) intuitively reasonable.
[224000170230] |It is a bit harder to define things at the (3) level because an additional criterion is added: (C) must be able to convince people that this new loss function approximates an established (1) or (2). Defining something at the (2) level is shockingly difficult and (1) is virtually impossible.
[224000170240] |There are two questions that are crucial to progress.
[224000170250] |The first is: what is the minimum acceptable evaluation.
[224000170260] |We cannot reasonably require (1) in all cases.
[224000170270] |In many cases, I think it absolutely reasonable to require (2); eg. journal papers for which a working definition of (2) is known.
[224000170280] |Conference papers can usually get by with (3) or (4). I would never accept (4) when a (3) is known.
[224000170290] |But when not (consider named entity recognition), it's all we have.
[224000170300] |It's unclear if this is a good breakdown or not.
[224000170310] |I could argue never to accept (4): that you must show your results improve something humans care about, at least approximately.
[224000170320] |But this seems limiting.
[224000170330] |First, it means that if we want to solve a subproblem, we have to formally show that solving it will be useful before we actually solve it.
[224000170340] |This is a good thing to do a bit of internally, but to do a sophisticated analysis to the point of publishable results to move something from (4) to (3) is a lot of effort for (often) little gain.
[224000170350] |Especially in the context of a larger research agenda.
[224000170360] |I could also argue that for many results, (4) is always sufficient.
[224000170370] |The argument here is that, from a machine learning perspective, if I can optimize (4), I can optimize anything else you give me.
[224000170380] |While I've made this argument personally, it's just not true.
[224000170390] |A more complex loss function might be much harder to optimize (consider optimizing Bleu-1 without brevity penalty versus full Bleu-4).
[224000170400] |Moreover, it may turn out that whatever biases my model has are good for one (4) and not for another. This is an okay (but not great) argument in an ML conference, but I wouldn't buy it in NLP.
[224000170410] |So how do other people rank these things?
[224000170420] |How much pressure is there to bring a (4) to a (3)? And how often should we do (2)?
[224000180010] |Structured Prediction 2.5: Features
[224000180020] |I decided to add an additional post on structured prediction before the final one.
[224000180030] |This is also highly related to the post on joint inference.
[224000180040] |One can claim that when doing structured prediction, the features X never have longer range than the loss function, where X is either "need" or "should."
[224000180050] |To this day I struggle to understand this issue: I don't believe it is as simple as one would like.
[224000180060] |Consider our old favorite: sequence labeling under Hamming (per label) loss.
[224000180070] |The argument essentially goes: we don't care (i.e., our loss function doesn't care) about how well our labels fit together.
[224000180080] |All it cares about is getting each right individually.
[224000180090] |By essentially the same argument as the joint inference post, one can then see that one should train a single classifier that does not use any "Markov-style" features.
[224000180100] |On the other hand, I know that I have always seen Markov-style features to help, and I believe that most people who work on SP would agree.
[224000180110] |Just as in the joint inference case, I think that if you want to be guaranteed the same information to train a classifier on, you need to move to a cross-product feature space.
[224000180120] |Unlike John, I do fear this.
[224000180130] |I fear this for several reasons.
[224000180140] |First, we already have millions of features.
[224000180150] |Going to 10^12 features is just not practical (blah blah kernels blah blah).
[224000180160] |This can also lead to severe generalization issues, and I don't think the state of the art in machine learning is capable of dealing with this.
[224000180170] |It's tricky, though, because I largely believe the joint inference argument, and largely don't believe this argument.
[224000180180] |Yet they are essentially the same.
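The arithmetic behind that fear, under one reading of "cross-product" (crossing the existing observation features with those of the neighboring position), is simple; the counts below are purely illustrative.

```python
# With on the order of a million base features per position, pairing them with the
# features of the adjacent position squares the size of the feature space.
base_features = 10**6
cross_product_features = base_features * base_features
print(f"{cross_product_features:.1e}")   # 1.0e+12
```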
[224000180190] |But I think there might be a more fundamental issue here, beyond cross-product feature spaces, generalization bounds and so on.
[224000180200] |This is the issue of: if we know a problem has structure, even if the loss function (even the true one) doesn't reflect this, should we make use of this structure.
[224000180210] |The answer to "Why?" is that it might make it easier on our models to learn.
[224000180220] |The answer to "Why not?" is that by doing so we require more "human in the loop" than otherwise.
[224000190010] |Tutorial: Bayesian Techniques for NLP
[224000190020] |I'm giving a tutorial on Bayesian methods for NLP at HLT-NAACL 2006.
[224000190030] |I gave a similar tutorial about a year ago here at ISI.
[224000190040] |This gave me a pretty good idea of what I want to keep in and what I want to cut out.
[224000190050] |The topics I intend to cover are, roughly:
[224000190060] |Bayesian paradigm: priors, posteriors, normalization, etc.
[224000190070] |Graphical models, expectation maximization, non-bayesian inference techniques
[224000190080] |Common statistical distributions: uniform, binomial/multinomial, beta/dirichlet
[224000190090] |Simple inference: integration, summing, Monte Carlo
[224000190100] |Advanced inference: MCMC, Laplace, Variational
[224000190110] |Survey of popular models: LDA, Topics and Syntax, Words and Pictures
[224000190120] |Pointers to literature
[224000190130] |All of this is, of course, cast in the context of NLP problems: all discrete distributions, language applications, etc., that hopefully both NLP and IR people will find interesting (maybe even some speech people, too).
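As a taste of the "common distributions" and "simple inference" items in NLP terms, here is the sort of fully discrete worked example I have in mind: a symmetric Dirichlet prior over a unigram distribution, updated by conjugacy (the toy corpus and the hyperparameter value are made up for illustration).

```python
from collections import Counter

# Dirichlet(alpha) prior over a unigram multinomial; conjugacy means the posterior is
# another Dirichlet whose parameters are the prior pseudo-counts plus observed counts.
corpus = "the cat sat on the mat the end".split()   # toy data
vocab = sorted(set(corpus))
alpha = 0.5                                          # symmetric hyperparameter (illustrative)
counts = Counter(corpus)
posterior = {w: alpha + counts[w] for w in vocab}
total = sum(posterior.values())                      # = alpha * |V| + N
# Posterior mean of P(w): (alpha + count(w)) / (alpha * |V| + N)
post_mean = {w: posterior[w] / total for w in vocab}
print(post_mean)
```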
[224000190140] |Does anyone have anything they'd really like to hear that's not on the list?
[224000190150] |Or anything that's on the list that they don't care about?
[224000190160] |Keep in mind several constraints: 3 hours (minus coffee time), generally accessible, focused on NLP applications, and something I know something about.
[224000190170] |(For instance, I covered expectation propagation in the tutorial last year, but decided to cut it for this to give more time to other issues.)
[224000190180] |Note that I am also preparing a written tutorial that covers roughly the same material.
[224000200010] |Structured Prediction 3: What's to Come
[224000200020] |Different researchers have different priorities, and not all those listed here are things I personally plan to work on.
[224000200030] |There are many open questions about structured prediction that I'd like an answer to.
[224000200040] |Hopefully we'll make some headway at the atomic learning workshop, and I'll of course post after that.
[224000200050] |1. Move to more complex, "intractable" structures.
[224000200060] |2. More with less: improving with only weak feedback.
[224000200070] |3. Learning structure: the analogue of learning graphical models.
[224000200080] |4. Learning from heterogeneous data.
[224000200090] |5. Measuring sample complexity.
[224000200100] |6. Trading off expressivity with tractability.
[224000200110] |7. Integrated prediction and data mining.
[224000200120] |8. Computational complexity versus sample complexity.
[224000200130] |My short term goal is [1] and [6] (and I've worked a bit on [4], but not for SP).
[224000200140] |Andrew McCallum seems to be working hard on [1] (though using different machinery) and [7].
[224000200150] |I think Ben Taskar has some new work on [1] and is probably closer to [5] than any of us.
[224000200160] |Thorsten Joachims has done interesting work in [1] and [2] (the latter not for SP though), but I think (hope) he might bring some of that across. I don't really know who (if anyone!) is working on the other topics.
[224000200170] |The question that drives me is: to what extent can we let the NLPer forget about the fact that there's machine learning going on in the background.
[224000200180] |I think that, as of now, at least from my perspective, the answer is that the NLPer must be able to structure a search space for his problem, and devise a way for turning the "true output" into an "optimal policy" by breaking it down into regions.
[224000200190] |The rest is purely trivial crank turning (no math required).
[224000200200] |The answers to the questions above will help in this endeavor.
[224000200210] |A very basic question, hinted on before, is: when do we even need to consider structure?
[224000200220] |If we don't, life is much easier for the NLPer.
[224000200230] |But we need to be able to say when we do and when we don't need structure.
[224000200240] |And when we do, we want it to be as painless as possible.
[224000210010] |Evaluation Criterea: Correlation versus Generalization
[224000210020] |When developing a "third level" loss function (i.e., an automatic metric), one often shows that the ranking or the scores of the automatic function correlates well with the ranking or scores of the human-level function.
[224000210030] |I'm not sure this is the best thing to do.
[224000210040] |The problem is that what correlation tells us is just that: on the data sample we have, our automatic evaluation function (AE) correlates well with the human evaluation function (HE). But we don't actually care (directly) about the correlation on past data.
[224000210050] |We care about the ability to make future predictions of quality.
[224000210060] |Namely, we often want to know:
[224000210070] |If I improve my system by 1 point AE, will this improve my HE score?
[224000210080] |This is a generalization question, not a correlation question.
[224000210090] |But we know how to analyze generalization questions: Machine learning theorists do it all the time!
[224000210100] |So, what do we know: if we want AE to predict HE well in the generalization sense, then it is (with high probability) sufficient that AE predicts HE well on our observed (training) data, and that AE is not "too complex."
[224000210110] |Usually this latter holds.
[224000210120] |AE is often chosen to be as simple as possible, otherwise people probably won't buy it.
[224000210130] |Usually it's chosen from a small finite set, which means we can do really simple PAC-style analyses.
[224000210140] |So this should be enough, right: we have good prediction of the training data (the correlation) and a small hypothesis class, so we'll get good generalization.
[224000210150] |Thus, correlation is enough.
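To be concrete about the "really simple PAC-style analysis" I have in mind: suppose the candidate AEs form a finite set A, let d(AE,HE) in [0,1] be some discrepancy between the two scores, and suppose we observe m i.i.d. (system, score) samples (this notation is mine). Then Hoeffding's inequality plus a union bound gives, with probability at least 1 - delta,

\[
\forall\, \mathrm{AE} \in \mathcal{A}:\quad
\left| \mathbb{E}\big[d(\mathrm{AE},\mathrm{HE})\big] \;-\; \frac{1}{m}\sum_{j=1}^{m} d_j(\mathrm{AE},\mathrm{HE}) \right|
\;\le\; \sqrt{\frac{\ln\big(2|\mathcal{A}|/\delta\big)}{2m}}
\]

where d_j is the observed discrepancy on the j-th sample. So good observed agreement plus a small |A| does seem to buy generalization.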
[224000210160] |Well, not quite.
[224000210170] |Correlation+small hypothesis class is enough only when the "training data" is i.i.d.
[224000210180] |And our problem assumptions kill both of the "i"s. First, our data is not independent.
[224000210190] |In summarization (and to a perhaps slightly lesser extent in MT), the systems are very very similar to each other.
[224000210200] |In fact, when people submit multiple runs, they are hugely similar.
[224000210210] |Second, our data is not identically distributed to the test data.
[224000210220] |The whole point of this exercise was to answer the question: if I improve my AE, will my HE improve?
[224000210230] |But, at least if I have the top system, if I improve my AE too much, my system is no longer from the same distribution of systems that generated the training data.
[224000210240] |So the data is hardly i.i.d. and easy generalization bounds cannot be had.
[224000210250] |So what can we do about this?
[224000210260] |Well, recognizing it is a start.
[224000210270] |More practically, it might be worth subsampling from systems when deriving correlation results.
[224000210280] |The systems should be chosen to be (a) as diverse as possible and (b) as high performance as possible.
[224000210290] |The diversity helps with independence.
[224000210300] |The high performance means that we're not getting superb correlation on crummy systems and no correlation on the great systems, leading to poor generalization as systems get better.
[224000210310] |This doesn't fix the problem, but it might help.
[224000210320] |And, of course, whenever possible we should actually run HE.
[224000220010] |HLT-NAACL Program Up
[224000220020] |The program for HLT-NAACL 2006 has been posted (I didn't submit anything).
[224000220030] |I'm particularly looking forward to:
[224000220040] |A fast finite-state relaxation method (Tromble + Eisner)
[224000220050] |Name matching in English and Arabic (Freeman, Condon + Ackerman)
[224000220060] |Do we need phrases? (Quirk + Menezes)
[224000220080] |Using syntax for false entailment (Snow, Vanderwende + Menezes)
[224000220090] |Grammatical machine translation (Riezler + Maxwell)
[224000220100] |SRL, WordNet and Wikipedia for Coreference (Ponzetto + Strube)
[224000220110] |I'm sure others will be good too.
[224000220120] |Anyone see anything else especially intriguing?
[224000230010] |Is X useful for Y?
[224000230020] |This post has two parts, both addressing the issue of showing that one task (X) is useful for another task (Y); eg., syntax is useful for MT, or WSD is useful for IR, or ....
[224000230030] |The first question is: on whom is the onus of making such an argument.
[224000230040] |(I'm presupposing that it makes sense to make such an argument.)
[224000230050] |There are three options: (1) a person who does X; (2) a person who does Y; (3) a third party.
[224000230060] |Arguments in favor of 1: if I work on a task, I am likely to want to justify its importance.
[224000230070] |For 2: I know the most about Y, so if X is useful, I can probably get it to work; also, other (2)s are more likely to buy my results.
[224000230080] |For 3: ???.
[224000230090] |One could argue (3) to be (more) unbiased, but I'll make an idealized assumption that (1) and (2) care about furthering science, not some personal research program.
[224000230100] |Given that, I think the best case scenario is to have a joint authored paper between a (2) and a (1); that failing, I'd probably prefer a paper by a (2).
[224000230110] |I'd prefer it not from a potential bias perspective, but because a (2) is more likely to produce a Y-system that's state-of-the-art, so showing that X improves on this is going to be more convincing.
[224000230120] |Of course if a (1) can do that, that's fine too.
[224000230130] |The second issue that I see is that in a lot of cases it's not cut and dry.
[224000230140] |Take for example a recent paper that showed that syntax is useful for EDT.
[224000230150] |I believe this paper is probably right, given the maturity of the system into which syntax was added.
[224000230160] |However, take my EDT system.
[224000230170] |I worked on this for about 15 months not using syntax.
[224000230180] |I used all sorts of crazy features.
[224000230190] |If I add syntax (which I've done), it improves things a little.
[224000230200] |Not a lot.
[224000230210] |And only for coref, not for mention tagging.
[224000230220] |Why?
[224000230230] |Likely, I've engineered around having syntax.
[224000230240] |If I had added syntax at the beginning, maybe I could have done away with many of the other features I have.
[224000230250] |We see the same thing in IR: if some complex NLP technique seems to improve IR systems but is expensive to run, then often people will keep tearing it apart until they find some tiny little feature that's doing all the important work.
[224000230260] |If so, can we still say that the original task helped?
[224000230270] |I think, given this, there is a better way to measure the usefulness of X to Y other than: X improved Y's performance.
[224000230280] |Consider someone setting out to build an EDT system.
[224000230290] |They want to know whether they should include syntax or not.
[224000230300] |The real question is: assuming I get comparable performance, is it easier to include syntax or to engineer around it.
[224000230310] |I don't know how to measure this automatically (lines of code is perhaps a reasonable surrogate, assuming the same coder writes both), but it seems like such a measure would be much more useful and telling than whether or not performance on an arbitrary system goes up.
[224000240010] |Snowbird Accepted Papers Up
[224000240020] |The list of accepted papers for the Snowbird Learning Workshop is up.
[224000240030] |I'm interested to see:
[224000240040] |Training CRFs with SMD (Vishwanathan, Schraudolph, Schmidt + Murphy)
[224000240050] |Sub-gradient method structured learning (Bagnell, Ratliff + Zinkevich)
[224000240060] |AdaBoost is MaxEnt over the error distribution (Haffner)
[224000240070] |Structural correspondences for transfer learning (Blitzer, Crammer, McDonald + Pereira)
[224000240080] |Rule-extraction from SVMs (Braun + Buhmann)
[224000240090] |Prior beliefs in SSL (Luxburg + Ben-David)
[224000240100] |Model compression (Caruana, Bucila + Niculescu-Mizil)
[224000250010] |Viterbi Search is Not Magic
[224000250020] |This post is about the use of Viterbi search in NLP applications.
[224000250030] |Skip the next paragraph if you recall readily how VS works.
[224000250040] |The easiest formulation of VS is for a sequence labeling problem: given an input x, find a label vector y1,...,yN that maximizes the score [ f(x,y0,y1) + f(x,y1,y2) + ... + f(x,y{N-1},yN) ].
[224000250050] |(Here, y0 is some special symbol.)
[224000250060] |When there are K possible values for each label, VS solves this by filling in a dynamic programming matrix a of size (N+1)*K, where a(0,k) is zero if k is the special symbol y0 and negative infinity otherwise.
[224000250070] |Recursively, a(n+1,k') is [ max_k [ a(n,k) + f(x,k,k') ] ].
[224000250080] |The largest a(N, . ) gives the highest score; by storing backpointers (which k was the largest) in another matrix, we can recover the whole sequence once we reach the end.
[224000250090] |This runs in time O(NK^2), as opposed to O(N^K) as a naive implementation would run.
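Here is a minimal Python sketch of the array-filling version just described; I make the position index explicit in the score function f(x, n, k, k'), and the start symbol name is my own placeholder.

```python
import math

def viterbi(x, labels, f, start="<s>"):
    """a[n][k] = best score of any label sequence y1..yn with yn = k.
    Runs in O(N K^2) time; backpointers recover the argmax sequence."""
    N = len(x)
    a = [{k: -math.inf for k in labels} for _ in range(N + 1)]
    back = [{} for _ in range(N + 1)]
    a[0] = {start: 0.0}                 # a(0, k) is 0 only for the special start symbol
    prev_labels = [start]
    for n in range(1, N + 1):
        for kp in labels:
            best_k, best_score = None, -math.inf
            for k in prev_labels:
                s = a[n - 1][k] + f(x, n, k, kp)    # a(n-1, k) + f(x, k, k')
                if s > best_score:
                    best_k, best_score = k, s
            a[n][kp] = best_score
            back[n][kp] = best_k
        prev_labels = labels
    # read off the best final label, then follow backpointers to recover the sequence
    y_last = max(labels, key=lambda k: a[N][k])
    y = [y_last]
    for n in range(N, 1, -1):
        y.append(back[n][y[-1]])
    return list(reversed(y)), a[N][y_last]
```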
[224000250100] |I am becoming somewhat notorious for not using Viterbi search, even for sequence labeling problems (ironic, given that I am part of the Viterbi School of Engineering).
[224000250110] |The purpose of this post is to show that this is not unreasonable.
[224000250120] |My evolving feeling on this has two angles:
[224000250130] |Viterbi is good for something like an HMM where we can't model the whole input as features and so we need to talk back-and-forth via Viterbi.
[224000250140] |But when we get to use any feature at any time (as in any modern model), this advantage goes away.
[224000250150] |I think people ascribe some "magic" to Viterbi.
[224000250160] |At least in NLP, the Viterbi algorithm and its generalizations (inside-outside, etc.) are treated as stand alone units.
[224000250170] |They're not.
[224000250180] |They're simply based on the observation that under certain constraints, when executing search, one can do efficient memoization.
[224000250190] |If you throw away the "magic" of Viterbi and treat it as just another search algorithm, you see that there's nothing special there.
[224000250200] |I have a hard time convincing people of 2.
[224000250210] |I'm not quite sure why.
[224000250220] |I almost liken it to trying to convince a classically trained logician the ways of constructive logic are right (being one of the former, I'm not yet convinced).
[224000250230] |The problem is that we're so used to seeing VS as this very special array-filling operation that we forget what's really going on.
[224000250240] |We could just as easily do search and store stuff in a hash table.
[224000250250] |It wouldn't be quite as efficient, but complexity wise (assuming O(1) access to the table), it would be the same.
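To make that concrete, the same quantity can be computed as plain recursive search memoized in a hash table; this sketch returns only the best score (backpointers could be added the same way), and f has the same hypothetical signature as in the sketch above.

```python
from functools import lru_cache

def viterbi_score_by_search(x, labels, f, start="<s>"):
    """best(n, k) = best score of any labeling of the first n words ending in label k.
    Memoizing this recursion in a hash table gives the same complexity as array filling."""
    labels = tuple(labels)

    @lru_cache(maxsize=None)            # the hash table
    def best(n, k):
        if n == 0:
            return 0.0 if k == start else float("-inf")
        prev = (start,) if n == 1 else labels
        return max(best(n - 1, kp) + f(x, n, kp, k) for kp in prev)

    return max(best(len(x), k) for k in labels)
```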
[224000250260] |The "magic" of Viterbi is that it efficiently integrates all information from the different sources assuming that the conditional independences hold and assuming that all conditional probabilities are correctly estimated.
[224000250270] |But we know this is not the case.
[224000250280] |That said, I think there is something to be said for later decisions being able to influence earlier ones.
[224000250290] |But I don't take this as given.
[224000250300] |And I don't think it's always the case.
[224000250310] |This is a big open question for me.
[224000250320] |By using a more complex search algorithm, you're forcing more decisions to be made but are (we hope) making these decisions easier.
[224000250330] |This is a trade-off.
[224000250340] |One angle is not a priori better than the other.
[224000260010] |Yesterday was Structured Learning Day
[224000260020] |I'm out in Chicago at the atomic learning workshop and yesterday was structured prediction day.
[224000260030] |There were a lot of cool talks on pretty much all the major SP techniques: Charles talked about efficient training of CRFs, Yasemin talked about semi-supervised learning in M3Ns and SVMISOs, Ben talked about efficient training of M3Ns for combinatorial-style problems (like word alignment), Drew talked about planning as a SP problem, Vasin talked about the tradeoffs of local vs. global inference in SP, and I of course talked about search.
[224000260040] |I really enjoyed Drew's talk.
[224000260050] |He is considering the problem of robot navigation, which is typically handled using A* search and some very crafty hand-written heuristic rules for generating the cost map over which A* search runs.
[224000260060] |He wants to get rid of the crafty part and replace it with learning.
[224000260070] |The idea is simple and cute: use observed features the robot receives to learn to produce a cost map so that when A* runs it finds the right path.
[224000260080] |Here, he defines the right path by having a human remote control the robot and drive it to the correct place.
[224000260090] |He casts this as a max-margin problem and solves it using very fast subgradient methods.
[224000260100] |There's no paper on this yet, but if all goes well there will be soon.
[224000260110] |I think there's a lot of stuff here that could strongly influence how we think about SP.
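I won't try to reconstruct Drew's formulation here, but the generic subgradient step for a max-margin structured objective is worth writing down because it is so simple; everything problem-specific hides inside the loss-augmented argmax (for the navigation setting, presumably something like an A* call over the predicted cost map). The names and the L2 regularizer below are my own choices.

```python
import numpy as np

def subgradient_step(w, x, y_true, phi, loss_augmented_argmax, eta=0.1, reg=0.01):
    """One subgradient step on the structured hinge loss
        max_y [ w . phi(x, y) + l(y_true, y) ] - w . phi(x, y_true)   (plus L2 regularization).
    `phi` maps (x, y) to a feature vector; `loss_augmented_argmax` solves the inner max,
    e.g. by search or dynamic programming, depending on the structure."""
    y_hat = loss_augmented_argmax(w, x, y_true)
    grad = reg * w + (phi(x, y_hat) - phi(x, y_true))
    return w - eta * grad
```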
[224000260120] |There were many questions raised at the workshop that deserve significant thought.
[224000260130] |These include:
[224000260140] |What is structured learning?
[224000260150] |How does structured learning related to multi-task learning (ala Rich Caruana and others)?
[224000260160] |What sort of techniques are there for dealing with un- or partially-labeled data for structured learning?
[224000260170] |What can be done about featuritis (i.e., throwing in all possible features)?
[224000260180] |Also: Is this a legitimate concern?
[224000260190] |How important is it what loss function we optimize?
[224000260200] |When is local learning enough (ala Dan Roth) and when must you do "global learning?"
[224000260210] |What training techniques scale sufficiently?
[224000260220] |Almost all of these questions deserve separate posts, and of course they are non-independent.
[224000260230] |Modulo feedback, I'll probably start at the top and work my way down.
[224000260240] |(Of course, if John beats me to it, I'll just post on his blog and cross-link here.)
[224000270010] |Heuristics
[224000270020] |Deepak brings up a great discussion topic that really needs its own post.
[224000270030] |Hopefully I'll catch this here before anyone continues to reply there (don't).
[224000270040] |The basic question is this:
[224000270050] |Why can't people simply use heuristicy and hacky approaches that have been proven over the years to work well?
[224000270060] |Deepak's point, which I think is well taken and should not be forgotten, is that a simple "hacky" thing (I don't intend "hacky" to be condescending...Deepak used it first!) often only does at most epsilon worse than a mathematically compelling technique, and sometimes even better.
[224000270070] |I think you can make a stronger argument.
[224000270080] |Hacky things allow us to essentially encode any information we want into a system.
[224000270090] |We don't have to find a mathematically convenient way of doing so (though things are better now that we don't use generative models for everything).
[224000270100] |I think there are several responses to these arguments, but I don't think anything topples the basic point.
[224000270110] |But before I go into those, I want to say that I think there's a big difference between simple techniques and heuristicy techniques.
[224000270120] |I am of course in favor of the former.
[224000270130] |The question is: why shouldn't I just use heuristics?
[224000270140] |Here are some responses.
[224000270150] |That's just how I am.
[224000270160] |I studied math for too long.
[224000270170] |It's purely a personal choice.
[224000270180] |This is not compelling on a grand scale, but at a personal level, it is unlikely one will do good research if one is not interested in the topic and approach.
[224000270190] |We don't want to just solve one problem.
[224000270200] |Heuristic techniques are fantastic for solving a very particular problem.
[224000270210] |But they don't generalize to other similar problems.
[224000270220] |The extent to which they don't generalize (or the difficulty to force generalization) of course varies.
[224000270230] |Similar to #2, I want a technique that can be described more briefly than simply the source code.
[224000270240] |Heuristic techniques by definition cannot.
[224000270250] |Why?
[224000270260] |I'd like to know what's really going on.
[224000270270] |Maybe there's some underlying principle that can be applied to other problems.
[224000270280] |Lasting impact.
[224000270290] |Heuristics rarely have lasting (scientific) impact because they're virtually impossible to reproduce.
[224000270300] |Of course, a lasting impact for something mathematically clever but worthless is worse than worthless.
[224000270310] |There are probably others that other people can come up with.
[224000270320] |That's my general perspective.
[224000270330] |In response to specific issues Deepak brought up:
[224000270340] |...a more complicated model gives very little improvements and generally never scales easily.
[224000270350] |I think it's worthwhile separating the problem from the solution.
[224000270360] |The issue with a lot of techniques not scaling is very true.
[224000270370] |This is why I don't want to use them :).
[224000270380] |I want to make my own techniques that do scale and that make as few assumptions as possible (regarding features, loss, data, etc.).
[224000270390] |I think we're on our way to this.
[224000270400] |Together with John and Daniel, we have some very promising results.
[224000270410] |...working on such (SP) problems means getting married more to principles of Machine Learning than trying to make any progress towards real world problems.
[224000270420] |I, personally, want to solve real-world problems.
[224000270430] |I cannot speak for others.
[224000270440] |...a lot of lot of smart (young?) people in the NLP/ML community simply cannot admit the fact that simple techniques go a long way...
[224000270450] |This is very unfortunate if true.
[224000270460] |I, for one, believe that simple techniques do go a long way.
[224000270470] |And I'm not at all in favor of using a steamroller to kill a fly.
[224000270480] |But I just don't feel like they can or will go all the way in any scalable manner (scalable in O(person time) not O(computation time)).
[224000270490] |I would never (anymore!) build a system for answering factoid questions like "how tall is the empire state building?" that is not simple.
[224000270500] |But what about the next level "how much taller is the empire state building than the washington monument?"
[224000270510] |Okay, now things are interesting, but this is still a trivial question to answer.
[224000270520] |Google identifies this as the most relevant page.
[224000270530] |And it does contain the necessary information.
[224000270540] |And I can imagine building heuristics that could answer "how much Xer is X than Y" based on asking two separate questions.
[224000270550] |But where does this process end?
[224000270560] |I will work forever on this problem.
[224000270570] |(Obviously this is just a simple stupid example which means you can find holes in it, but I still believe the general point.)
[224000280010] |What is Structured Prediction?
[224000280020] |It is surprisingly difficult to concretely define structured prediction.
[224000280030] |The root of the question is: what does it mean for a problem to be structured.
[224000280040] |A very reasonable condition seems to be the following:
[224000280050] |Condition 1: Output elements decompose into variable length vectors over a finite set.
[224000280060] |This notion of decomposition is fairly ubiquitous (it subsumes things like sequence labeling, parsing, MT, protein folding, etc.).
[224000280070] |Unfortunately, it does not seem sufficient.
[224000280080] |Requiring only C1 means that problems like binary classification, multitask learning and other clearly non-structured problems fall under the heading of SP.
[224000280090] |It also doesn't stress the fact that there's some relation between the elements of these vectors.
[224000280100] |There are, it seems, essentially two places to introduce dependence between elements in the output vectors: the features or the loss.
[224000280110] |This leads to two new conditions:
[224000280120] |Condition 2-Feat: The feature function does not decompose over any vector representation of the output.
[224000280130] |Condition 2-Loss: The loss function does not decompose over any vector representation of the output.
[224000280140] |(For both of these, by "does not decompose" I mean that it is not invariant under permutations of the output vector.
[224000280150] |For example, in sequence labeling world, Hamming loss does decompose.)
[224000280160] |In some sense, C2-Loss subsumes C2-Feat.
[224000280170] |This is because it is reasonable to believe that if the loss function doesn't decompose, then we will need/want to introduce features that span at least as much of the "nondecomposition" of the output vector as the loss.
[224000280180] |For instance, in sequence labeling under a second-order Hamming loss (accuracy on adjacent labels, which does not decompose), one would presumably want to use bigram-style features.
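As a tiny concrete check (these are my toy functions, nothing standard): Hamming loss is unchanged if we jointly permute the positions of the truth and the prediction, while the second-order loss just described is not.

    def hamming_loss(y, y_hat):
        return sum(a != b for a, b in zip(y, y_hat))

    def adjacent_pair_loss(y, y_hat):
        # how many adjacent label bigrams are wrong (the "second-order" loss above)
        return sum((y[i], y[i + 1]) != (y_hat[i], y_hat[i + 1]) for i in range(len(y) - 1))

    y     = ["DET", "NOUN", "VERB", "DET", "NOUN"]
    y_hat = ["DET", "VERB", "VERB", "DET", "NOUN"]

    perm = [1, 2, 0, 3, 4]                       # jointly permute both vectors
    yp, y_hatp = [y[i] for i in perm], [y_hat[i] for i in perm]

    print(hamming_loss(y, y_hat), hamming_loss(yp, y_hatp))              # 1 1  -- decomposes
    print(adjacent_pair_loss(y, y_hat), adjacent_pair_loss(yp, y_hatp))  # 2 1  -- does not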
[224000280190] |By mixing and matching these conditions, we obtain several different notions of structured prediction.
[224000280200] |Condition 1 alone leads to fully decomposable problems and for these we can use independent classifiers ala Dan Roth and others.
[224000280210] |Condition 1 and Condition 2-Feat is the standard interpretation of structured prediction and leads to, among other things, max-margin Markov nets.
[224000280220] |Condition 1 and Condition 2-Loss is my preferred interpretation and leads to, as far as I know, no published work (though we have a paper at Snowbird on it, and something conditionally accepted at ICML).
[224000280230] |Given that 1 and 2-Feat is standard, I have some desire to defend my interpretation.
[224000280240] |To me, the 1+2-Feat interpretation is specifying SP as a solution, not a problem.
[224000280250] |That is, the world doesn't force us to use structure: we simply do so because we believe structured features will be useful.
[224000280260] |On the other hand, when the loss function doesn't decompose, the world is essentially telling us that we had really better pay attention to the structure.
[224000280270] |We can't reasonably assume to do well otherwise.
[224000280280] |I like to define problems in terms of what the world gives us, not what we create for ourselves.
[224000280290] |My interpretation successfully excludes problems like multiclass classification and multitask learning from the realm of structured prediction.
[224000280300] |Interestingly, it also excludes sequence labeling under Hamming loss and collective classification (ala statistical relational learning).
[224000280310] |I don't see this as necessarily a bad thing.
[224000280320] |Especially since this definition now includes problems I care about like MT, summarization, ASR, etc. under any reasonable loss function.
[224000280330] |(Any reasonable loss function will not decompose.)
[224000300010] |SIGIR Program Up
[224000300020] |The SIGIR program has been posted.
[224000300030] |I'm looking forward to:
[224000300040] |Learning User Interaction Models (Agichtein, Brill, Dumais, Rango)
[224000300050] |Improving the Estimation of Relevance Models (Diaz, Metzler)
[224000300060] |LDA-based Document Models (Wei, Croft)
[224000300070] |Tackling Concept Drift by Temporal Inductive Transfer (Forman)
[224000300080] |Large Scale Semi-supervised Linear SVMs (Sindhwani, Keerthi)
[224000300090] |One thing that surprises me is the distribution of industry labs' papers.
[224000300100] |There are 12 papers with at least one Microsoft author, four for IBM, but only one each for Yahoo and Google (and the Google one was work done while the relevant author was at IBM, I believe).
[224000300110] |Does MSR just encourage publication more?
[224000300120] |Or is it more closely tied to SIGIR (SIGIR is in Seattle this year)?
[224000300130] |Or did Yahoo and Google just get unlucky with reviews?
[224000310010] |Rexa is Live
[224000310020] |http://rexa.info
[224000310030] |(More info here, including a post by Andrew McCallum, the man behind the machine!)
[224000320010] |Unsupervised Learning: Why?
[224000320020] |Unsupervised learning is very popular for NLP problems, including MT, summarization, IE, etc.
[224000320030] |I've done work in this area, as have many other people.
[224000320040] |Beginning about 6 months ago, I have been asking myself: should we ever really do unsupervised learning?
[224000320050] |This question may seem silly, but there is a fairly compelling argument.
[224000320060] |The argument is that, regardless of whether we are in academia or industry, we will have to convince someone else that our system is doing a good job.
[224000320070] |In order to do this, we need evaluation data.
[224000320080] |Which means we need to annotate.
[224000320090] |But once we've annotated, at the very least we should do semi-supervised learning, not unsupervised learning.
[224000320100] |The problem with this argument is that it presupposes the existence of an automatic evaluation metric.
[224000320110] |My newly formed perspective is the following.
[224000320120] |We should only do unsupervised learning if we do not have a trustworthy automatic evaluation metric (i.e., a Type III metric).
[224000320130] |I cannot currently think of a compelling argument against this.
[224000330010] |Please Pass the Pragmatics
[224000330020] |I've spent the last few weeks (lots of plane trips) reading Pragmatics by Levinson. Pragmatics is a difficult-to-define field (like many :P), but from a linguistic perspective, it is roughly "the study of the ability of language users to pair sentences with contexts in which they would be appropriate." (pg 24)
[224000330030] |The book talks about five aspects of pragmatics:
[224000330040] |Deixis: methods of directly encoding context into language; "Meet me here a week from now with a stick about this big." (pg 55) Without knowing the context, we cannot know the meanings of me, now (and hence a week from now), this big.
[224000330050] |Deictic pronouns are pronouns that reference speaker- and hearer-known context.
[224000330060] |Implicature: Grice's theory says that a speaker should: be cooperative, be truthful (quality), be informative (quantity), be relevant, and be perspicuous (manner: avoid obscurity and ambiguity, be brief and orderly). (pgs 101-102) This theory is insufficient, but useful.
[224000330070] |Implicature also includes notions of quantification (perhaps, some, many, etc.) and metaphors.
[224000330080] |Presupposition: roughly describes that which is immediately inferrable but not the new information in an utterance.
[224000330090] |Eg., "Sue cried before she finished her thesis" presupposes that Sue finished her thesis, but this is not the new information (which is that she cried). (pg 187) There are both semantic and pragmatic presuppositions; the latter involve shared knowledge between speaker and hearer.
[224000330100] |(For fun, compare "Sue died before she finished her thesis.")
[224000330110] |Speech Acts: a speech act is a statement that does something rather than just one that says something (eg., "I declare war on Zanzibar." (pg 228)).
[224000330120] |The basic question of speech acts seems to be that of identifying them and understanding how they differ from normal statements.
[224000330130] |Conversational Structure: analysis of the sequential (and anti-sequential) nature of conversations, interruptions, etc.
[224000330140] |I find implicature and presupposition fascinating, because this is essentially where hidden meaning comes out in text.
[224000330150] |(See, here I've presupposed that hidden meaning is fascinating!)
[224000330160] |I've really seen very few papers that look at computational aspects of these problems (speech acts and conversational structure may be exceptions).
[224000330170] |The textual entailment work, as far as I can tell, focuses on semantic implication rather than pragmatic implication, though of course this is related (but much easier, I would imagine!).
[224000330180] |One immediate question is whether there is an application that demands pragmatic understanding (or, say, understanding of pragmatic implicature and presupposition).
[224000330190] |This, I'm not sure.
[224000330200] |I'm curious how divergent pragmatic issues are crosslingually.
[224000330210] |My hunch is "not much" which implies that this is not necessary for translation purposes (though Czech morphologically encodes for the new/old distinction).
[224000330220] |IR and IE also seem impervious to the pragmatic hammer.
[224000330230] |I can come up with artificial situations that suggest QA might benefit, but I feel these are too artificial (i.e., "Did Sue finish her thesis?").
[224000330240] |Even summarization seems fairly robust to a lack of pragmatic understanding, although the "new/old" issue is important here.
[224000330250] |Perhaps what is old in the real discourse is not new for the reader of the summary.
[224000330260] |But if it just comes through as presupposition, it's unclear that anything is lost.
[224000330270] |I'm at something of a loss here, because it seems like human conversation has pragmatic influences so deeply embedded that it is surprising that I feel we can do without them for NLP problems.
[224000340010] |Google Beta SMT System
[224000340020] |Is online, with a description here.
[224000340030] |I can only judge fluency/reasonableness, not faithfulness to the input, but of the three AJ pages I tried, one came out reasonably (I could get the gist, sort of) and one came out quite poorly.
[224000340040] |Better, I'd say, on average, than Babelfish's Asian-anything-to-English, but probably worse than Babelfish's French-to-English (which has been cooking for ages).
[224000340050] |It's unfortunate that there aren't Bleu scores posted for the standard test sets, though it wouldn't be too hard to run these, if someone has the test data.
[224000340060] |(The results Franz linked to are probably not the same system, since the system they submitted to the eval last year took forty machine hours to translate a sentence.
[224000340070] |Google engineers are good, but...)
[224000340080] |If someone would run the test set, I'd be eternally grateful (and very curious).
[224000340090] |Given that this is free, NIST may run it as a baseline for the NIST MT eval and the GALE eval, which could have substantial impact on the community, if it is any good.
[224000350010] |Getting Started in NLP
[224000350020] |Since starting the blog, a few people have asked me how one can get started in NLP, while residing in a department lacking NLP researchers.
[224000350030] |This is a difficult question: I fell into NLP quite naturally when I was at CMU and made an easy transition to grad school at USC, both of which have awesome NLP groups.
[224000350040] |Lacking such internal support, one has to be much more ambitious to get to the point where one could do real research in the field.
[224000350050] |The obvious avenues for support are: reading books (which ones?) and papers (from where and by whom?), going to nearby conferences (which ones?) and experimentation (on what?).
[224000350060] |(New option: read and post to this blog!)
[224000350070] |The four standard books in the field are Statistical NLP (Manning + Schutze), Speech and Language Processing (Jurafsky + Martin), Statistical Language Learning (Charniak) and Natural Language Understanding (Allen).
[224000350080] |The latter two are much older, though some people prefer Charniak to Manning + Schutze.
[224000350090] |I would probably pick up Manning + Schutze if I could only buy one.
[224000350100] |From this book, I think that skimming Chapters 1, 4, 6 and 13 should give a reasonable (but not uniformly sampled) representation of background knowledge everyone should know.
[224000350110] |Unfortunately, this misses many topics: information extraction, summarization, question answering, dialog systems, discourse, morphology, ontologies, pragmatics, semantics, sentiment analysis and textual entailment.
[224000350120] |Finding good papers for beginners is hard.
[224000350130] |Without guidance, skimming titles and abstracts of papers published in ACL, NAACL, HLT or COLING since 2002 or 2003 should enable someone to find out what looks interesting to them.
[224000350140] |I know many advisors take this approach with new students.
[224000350150] |The ACL anthology is great for finding old papers.
[224000350160] |I'll probably post at a later date about what are the "must reads" for the areas I know best.
[224000350170] |Once you've found a few papers you like, I'd check out the respective author's web pages and see if they have any related work (best bet is probably to look at the advisor's page: often s/he will have multiple students working on similar topics).
[224000350180] |Also, advisors often have course material and slides from tutorials: these are great places to get introductory-level material.
[224000350190] |If you happen to get lucky (I never have) and one of the above conferences is located nearby, I'd just go.
[224000350200] |Presentations of papers (if they're good) are often better from the perspective of getting the high-level overview than the papers themselves, since papers have to be technically complete.
[224000350210] |I'm perhaps overcommitting myself, given my promise to talk more about structured prediction, but over the next few weeks/months, I'll work on a "Getting Starting in X" series.
[224000350220] |X will likely range over the set { summarization, sequence labeling, information extraction, machine translation, language modeling }.
[224000350230] |Requests for other topics will be heard, keeping in mind I'm not an expert in many areas.
[224000360010] |Supervised Hidden Variable Models
[224000360020] |Hidden variable models have been extremely successful in unsupervised NLP problems, such as the famous alignment models for machine translation.
[224000360030] |Recently, two parsing papers and a textual entailment paper have applied hidden variable techniques to supervised problems.
[224000360040] |I find this approach appealing because it allows one to learn hidden variable models that are (hopefully) optimal for the end goal.
[224000360050] |This contrasts with, for example, unsupervised word alignment, where it is unclear if the same alignment is good for two different tasks, for instance translation or projection.
[224000360060] |The question that interests me is "what exactly are hidden variables in these models?"
[224000360070] |As a simple example, let's take the following problem: we have an English sentence and an Arabic sentence and we want to know if they're aligned.
[224000360080] |Dragos has a paper that shows that we can do a good job of making this determination if we do the following: construct an alignment (using, eg., GIZA++) and then compute features over the alignment (length of aligned subphrases, number of unaligned words, etc.).
[224000360090] |One can imagine similar models for textual entailment, paraphrase identification, etc.
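For concreteness, here is roughly what the "align, then featurize" pipeline looks like in stripped-down form; the alignment is just given as index pairs (standing in for GIZA++ output) and the features are simplistic stand-ins of my own, not the ones from the paper.

    def alignment_features(src_len, tgt_len, alignment):
        # alignment is a list of (source index, target index) links
        aligned_src = {i for i, _ in alignment}
        aligned_tgt = {j for _, j in alignment}
        return {
            "frac_src_aligned": len(aligned_src) / src_len,
            "frac_tgt_aligned": len(aligned_tgt) / tgt_len,
            "length_ratio": src_len / tgt_len,
            "num_links": len(alignment),
        }

    # e.g., a 6-word English sentence, a 5-word Arabic sentence, 4 alignment links
    print(alignment_features(6, 5, [(0, 0), (1, 2), (2, 1), (4, 4)]))

These features would then be handed to any off-the-shelf binary classifier.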
[224000360100] |I would ideally like to learn the alignments as part of training the classifier, treating them essentially as hidden variables in the supervised problem.
[224000360110] |So what happens in such a model?
[224000360120] |We learn to produce alignments and then define features over the alignments.
[224000360130] |From an information theoretic perspective, we needn't bother with the alignments: we can just define features on the original sentence pair.
[224000360140] |Perhaps these features will be somewhat unnatural, but I don't think so.
[224000360150] |To build the alignments, we'll have features like "the word 'man' is aligned to the word 'XXX'" (where 'XXX' is Arabic for 'man').
[224000360160] |Then on top of this we'd have features that count how many words are successfully aligned.
[224000360170] |It's somewhat unclear if this is much more than just including all word pair features for a pure classification technique.
[224000360180] |The key difference appears to be in the fact that we impose a strong constraint on the alignments.
[224000360190] |For instance, that they are one-to-one or one-to-many or something similar.
[224000360200] |This significantly reduces the number of features that are fed into the classifier.
[224000360210] |At the same time, the hidden variable models look a lot like introducing a hidden layer into a neural network!
[224000360220] |Consider a drastic simplification of the parallel sentence detection problem (which is, in the end, just a binary classification problem).
[224000360230] |Namely, consider the XOR problem.
[224000360240] |As we all know, we can't solve this with a linear classifier.
[224000360250] |But if we add a hidden variable to the mix, we can easily find a solution (eg., the hidden variable indicates left-versus-right and then conditional on this, we learn two separate classifiers, one pointing up and one pointing down).
[224000360260] |This is exactly what a two layer neural net (i.e., one hidden layer) would do!
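Here is the XOR point spelled out with hand-set weights (purely illustrative, not learned): one hidden layer of two threshold units suffices, even though no single linear threshold does.

    import numpy as np

    def step(z):
        return (z > 0).astype(int)

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    # hidden units: h1 = "x1 and not x2", h2 = "x2 and not x1"
    W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
    b1 = np.array([-0.5, -0.5])
    # output unit: h1 or h2
    w2 = np.array([1.0, 1.0])
    b2 = -0.5

    h = step(X @ W1 + b1)
    print(step(h @ w2 + b2))   # [0 1 1 0], i.e., XOR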
[224000360270] |From this perspective, it seems one can argue that hidden variable models are simply increasing the model complexity of the underlying learner.
[224000360280] |This comes at a cost: increased model complexity means we either need more data or stronger prior information.
[224000360290] |Hidden variable models --- more precisely, the constraints present in hidden variable models --- seem to provide this prior information.
[224000380010] |Making Humans Agree
[224000380020] |The question of human agreement came up recently.
[224000380030] |As background, the common state of affairs in NLP when someone has decided to work on a new problem is to ask two (or more) humans to solve the problem and then find some way to measure the inter-annotator agreement.
[224000380040] |The idea is that if humans can't agree on a task, then it is not worth pursuing a computational solution.
[224000380050] |Typically this is an iterative process.
[224000380060] |First some guidelines are written, then some annotation is done, then more guidelines are written, etc.
[224000380070] |The hope is that this process converges after a small number of iterations.
[224000380080] |LDC has made a science out of this.
[224000380090] |At the very least, inter-annotator agreement serves as a useful upper bound on system performance (in theory a system could do better; in practice, it does not).
[224000380100] |There are two stances one can take on the annotator agreement problem, which are diametrically opposed.
[224000380110] |I will attempt to argue for both.
[224000380120] |The first stance is: If humans do not agree, the problem is not worth working on.
[224000380130] |(I originally wrote "solving" instead of "working on," but I think the weakened form is more appropriate.)
[224000380140] |The claim here is that if a problem is so ill-specified that two humans perform it so differently, then when a system does it, we will have no way of knowing if it is doing a good job.
[224000380150] |If it mimics one of the humans, that's probably okay.
[224000380160] |But if it does something else, we have no idea if human N+1 might also do this, so we don't know if its good or bad.
[224000380170] |Moreover, it will be very difficult to learn, since the target concept is annotator-specific.
[224000380180] |We want to learn general rules, not rules specific to annotators.
[224000380190] |The second stance is: If humans do agree, the problem is not worth working on.
[224000380200] |This is the more controversial statement: I have never heard anyone else ever make it.
[224000380210] |The argument here is that any problem that humans agree on is either uninteresting or has gone through so many iterations of definitions and redefinitions as to render it uninteresting.
[224000380220] |In this case, the annotation guidelines are essentially an algorithm that the annotator is executing.
[224000380230] |Making a machine execute a similar algorithm is not interesting because what it is mimicking is not a natural process, but something created by the designers.
[224000380240] |Even without multiple iterations of guidelines, high human agreement implies that the task being solved is too artificial: personalization is simply too important of an issue when it comes to language problems.
[224000380250] |My beliefs are somewhere in the middle.
[224000380260] |If humans cannot agree at all, perhaps the problem is too hard.
[224000380270] |But if they agree too much, or if the guidelines are overly strict, then the problem may have been sapped of its reality.
[224000380280] |One should also keep in mind that personalization issues can render agreement low; we see this all the time in summarization.
[224000380290] |I do not think this necessarily means that task should not be worked on: it just means if worked on, it should be a personalized task.
[224000380300] |Measuring agreement in such cases is something I know nothing about.
[224000390010] |ICML Accepted Papers Up
[224000390020] |See here.
[224000390030] |I'm excited to see:
[224000390040] |Bayesian Multi-Population Haplotype Inference (Xing, Sohn, Jordan + Teh)
[224000390050] |Constructing Informative Priors using Transfer Learning (Raina, Ng + Koller)
[224000390060] |Disciminative Unsupervised Learning of Structured Predictors (Xu, Wilkinson, Southey + Schuurmans)
[224000390070] |Maximum Margin Planning (Ratliff, Bagnell + Zinkevich)
[224000390080] |Pachinko Allocation (Li + McCallum)
[224000390090] |Semi-supervised Learning for Structured Output Variables (Brefeld + Scheffer)
[224000390100] |Topic Modeling: Beyond Bag-of-Words (Wallach)
[224000400010] |NLP is an Engineering Discipline?
[224000400020] |At my graduation ceremony this past week, Dean Yannis Yortsos gave a short speech, in which he quoted someone famous (I've already forgotten who it was) at the Engineering graduation segment, saying something akin to "Science is the study of what the world is; Engineering is making something the world has never seen before."
[224000400030] |This comment had a fairly significant effect on me.
[224000400040] |However, when thinking about it in the context of NLP, it seems that NLP is an Engineering discipline, not a scientific one, at least according to the above (implied) definition of the terms.
[224000400050] |NLP lives somewhere within the realm of language, statistics, machine learning, and cognitive science (you can include signal processing if you want to account for speech).
[224000400060] |Let's suppose that NLP is a scientific discipline: then, what aspect of the world does it study?
[224000400070] |It cannot be language, otherwise we'd be linguists.
[224000400080] |It cannot be statistics or learning, lest we call ourselves mathematicians.
[224000400090] |Cognitive science is out or we'd be...well, cognitive scientists.
[224000400100] |In fact, at least as I've described it in my job talks, NLP is the field that deals with building systems to solve problems that have a language component.
[224000400110] |This seems to render NLP to fall exclusively within the realm of Engineering.
[224000400120] |This observation seems to have several repercussions.
[224000400130] |It means, among other things, that we should not attempt to seek theoretical results, unless we are wearing another hat (I often wear the hat of a machine learning person; other people play linguists occasionally).
[224000400140] |It also means that perhaps we shouldn't complain as much that so much of NLP is system building.
[224000400150] |It seems that almost by definition, NLP must be system building: we are out to create things that never existed before.
[224000400160] |If one, like me, is interested in science---that is, the study of natural things in the world---it also means that we had better find some related topic (machine learning, cognitive science or linguistics) to study simultaneously.
[224000410010] |Structured Prediction versus Multitask Learning
[224000410020] |For a formal comparison, I will briefly define both SP and MTL:
[224000410030] |A structured prediction problem is a distribution D over pairs (x,y), with x in X and y in Y, together with a loss function l(y,y').
[224000410040] |Typically Y is large, and elements y have complex interdependent structure, which can be interpreted in various ways.
[224000410050] |The goal is a function f : X -> Y that minimizes the expected loss: E_{x,y ~ D}[l(y,f(x))].
[224000410060] |A multitask learning problem is a distribution D over tuples (x,y,y1,...,yK), with x in X and (for simplicity) y,y1,...,yK all in 2={0,1}.
[224000410070] |The goal is a function f : X -> 2 that minimizes the expected loss: E_{x,y,y1,...,yK ~ D}[1(y != f(x))].
[224000410080] |Importantly the loss doesn't care about y1...yK.
[224000410090] |However, we typically assume that y1...yK are related to the x -> y mapping.
[224000410100] |Without assumptions on the SP loss or output space Y, SP subsumes MTL.
[224000410110] |For instance, in sequence labeling, we can think of a multitask learning problem as a sequence labeling problem where we only care about the label of the first word.
[224000410120] |This reminds me of some results that suggest that it's really not any easier to predict the first word in a translation than to solve the entire translation problem.
[224000410130] |This means that we could try to solve an MTL problem by applying, say, a CRF.
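In code, the reduction in the previous paragraph amounts to nothing more than a degenerate structured loss (my notation, nothing standard): the auxiliary labels ride along in the output vector, but the loss never looks at them.

    def mtl_loss(y, y_pred):
        # multitask learning: only the primary task enters the loss
        return int(y != y_pred)

    def sp_loss_first_position_only(y_seq, y_pred_seq):
        # the same thing viewed as a structured loss over a whole label sequence
        return int(y_seq[0] != y_pred_seq[0])

    y_seq      = [1, 0, 1, 1]   # [y, y1, y2, y3]
    y_pred_seq = [1, 1, 0, 0]
    print(mtl_loss(y_seq[0], y_pred_seq[0]), sp_loss_first_position_only(y_seq, y_pred_seq))   # 0 0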
[224000410140] |One key difference that seems quite important is that results from MTL have suggested that it is easy to get an MTL model that performs better on the first task, but it's hard to get a single MTL model that performs well on all tasks simultaneously.
[224000410150] |For instance, in neural networks, one might want to run 1000 training epochs to do well on task 1, but only 10 or as many as 10000 to do well on one of the other tasks.
[224000410160] |This implies that it's probably not a good idea to naively apply something like a CRF to an MTL problem.
[224000410170] |For this reason (and the nature of the loss functions), I feel that SP and MTL are complementary problems.
[224000410180] |The application of MTL to time-series data has been reportedly very successful.
[224000410190] |Time-series data is nice because it naturally provides additional tasks: in addition to predicting the value at time t, also predict at times t+5, t+10 and t+20.
[224000410200] |During training we have access to these values, but we don't during testing, so we don't have the option of using them as features.
[224000410210] |One especially cool thing about this observation is that it applies directly to search-based structured prediction!
[224000410220] |That is, when attempting to make a prediction about some search step, we can consider future search steps to be related tasks and apply MTL techniques.
[224000410230] |For instance, when trying to predict the first word in a translation, we use the second and third words as related tasks.
[224000410240] |I, personally, think this would be quite a cool research project.
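A sketch of the data construction, under my own naming, might look like the following: each training example carries auxiliary targets drawn from future time steps (or future search steps), which are available at training time but never at test time.

    def make_mtl_examples(series, context=3, offsets=(5, 10, 20)):
        examples = []
        horizon = max(offsets)
        for t in range(context, len(series) - horizon):
            x = series[t - context:t]                   # features: recent history
            y = series[t]                               # primary prediction task
            aux = [series[t + k] for k in offsets]      # related tasks, train-time only
            examples.append((x, y, aux))
        return examples

    series = [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 0.8, 0.7, 0.9, 1.0] * 5
    print(make_mtl_examples(series)[0])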
[224000410250] |So my tentative conclusions are: (1) SP can solve MTL, but this is probably not a good idea.
[224000410260] |(2) MTL can be applied to search-based SP and this probably is a good idea.
[224000410270] |(3) SP and MTL should not be considered the same problem: the primary difference is the loss function, which is all important to SP, but not so important to MTL.
[224000410280] |That said, SP algorithms probably do get a bit of gain from the "good internal representation" learned in an MTL setting, but by trying to learn this representation directly, we could probably do much better.
[224000420010] |Getting Started In: Summarization
[224000420020] |I'll kick off the "Getting Started In" series with summarization, since it is near and dear to my heart.
[224000420030] |Warning: I should point out at the get-go that these posts are not intended to be comprehensive or present a representative sample of work in an area.
[224000420040] |These are blog posts, not survey papers, tutorials or anything of that sort.
[224000420050] |If I don't cite a paper, it's not because I don't think it's any good.
[224000420060] |It's also probably not a good idea to base a career move on what I say.
[224000420070] |Summarization is the task of taking a long utterance (or a collection of long utterances) and producing a short utterance.
[224000420080] |The most common is when utterance=document, but there has also been work on speech summarization too.
[224000420090] |There are roughly three popular types of summarization: headline generation, sentence extraction and sentence compression (or document compression).
[224000420100] |In sentence extraction, a summary is created by selecting a bunch of sentences from the original document(s) and gluing them together.
[224000420110] |In headline generation, a very short summary (10 words or so) is created by selecting a bunch of words from the original document(s) and gluing them together.
[224000420120] |In sentence (document) compression a summary is created by dropping words and phrases from a sentence (document).
[224000420130] |One of the big problems in summarization is that if you give two humans the same set of documents and ask them to write a summary, they will do wildly different things.
[224000420140] |This happens because (a) it's unclear what information is important and (b) each human has different background knowledge.
[224000420150] |This is partially alleviated by moving to a task-specific setting, like the query-focused summarization model, which has grown increasingly popular in the past few years.
[224000420160] |In the query-focused setting, a document (or document collection) is provided along with a user query that serves to focus the summary on a particular topic.
[224000420170] |This doesn't fix problem (b), but goes a long way to fixing (a), and human agreement goes up dramatically.
[224000420180] |Another advantage to the query-focused setting is that, at least with news documents, the most important information is usually presented first (this is actually stipulated by many newspaper editors' guidelines).
[224000420190] |This means that producing a summary by just taking leading sentences often does incredibly well.
[224000420200] |A related problem is that of evaluation.
[224000420210] |The best option is to do a human evaluation, preferably in some simulated real world setting.
[224000420220] |A reasonable alternative is to ask humans to write reference summaries and compare system output to these.
[224000420230] |The collection of Rouge metrics has been designed to automate this comparison (Rouge is essentially a collection of similarity metrics for matching human to system summaries).
[224000420240] |Overall, however, evaluation is a long-standing and not-well-solved problem in summarization.
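To give a flavor of what these metrics compute, here is a bare-bones unigram-recall measure in the spirit of Rouge-1; the real Rouge package additionally handles multiple references, higher-order n-grams, stemming, and so on.

    from collections import Counter

    def rouge1_recall(system_summary, reference_summary):
        sys_counts = Counter(system_summary.lower().split())
        ref_counts = Counter(reference_summary.lower().split())
        overlap = sum(min(count, sys_counts[word]) for word, count in ref_counts.items())
        return overlap / sum(ref_counts.values())

    print(rouge1_recall("the senate passed the budget bill",
                        "senate passes budget bill after long debate"))   # 3/7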
[224000420250] |Techniques for summarization vary by summary type (extraction, headline, compression).
[224000420260] |The standard recipe for sentence extraction works as follows.
[224000420270] |A summary is created by first extracting the "best" sentence according to a scoring module.
[224000420280] |The second sentence is selected by finding the next-best sentence according to the scoring module, minus some redundancy penalty (we don't want to extract the same information over and over again).
[224000420290] |This process ends when the summary is sufficiently long.
[224000420300] |The scoring component assigns to each sentence in the original document collection a score that says how important it is (in query-focused summarization, for instance, the word overlap between the sentence and the query would be a start).
[224000420310] |The redundancy component typically computes the similarity between a sentence and the previously extracted sentences.
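The whole recipe fits in a few lines; the sketch below uses deliberately naive stand-ins (query word overlap for scoring, maximum overlap with already-chosen sentences for redundancy), so treat it as the shape of the loop, not a real system.

    def word_overlap(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / (len(wa | wb) or 1)

    def extract_summary(sentences, query, max_sentences=3, redundancy_weight=0.5):
        summary, candidates = [], list(sentences)
        while candidates and len(summary) < max_sentences:
            def score(s):
                relevance = word_overlap(s, query)                                # scoring component
                redundancy = max((word_overlap(s, t) for t in summary), default=0.0)  # redundancy component
                return relevance - redundancy_weight * redundancy
            best = max(candidates, key=score)
            summary.append(best)
            candidates.remove(best)
        return summary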
[224000420320] |Headline generation and sentence compression have not yet reached a point of stability in the same way that sentence extraction has.
[224000420330] |A very popular and successful approach to headline generation is to train a hidden Markov model something like what you find in statistical machine translation (similar to IBM model 1, for those familiar with it).
[224000420340] |For sentence compression, one typically parses a sentence and then attempts to summarize it by dropping words and phrases (phrases = whole constituents).
[224000420350] |Summarization has close ties to question answering and information retrieval; in fact, some limited background in standard IR techniques (tf-idf, vector space model, etc.) is pretty much necessary in order to understand what goes on in summarization.
[224000420360] |Here are some papers/tutorials/slides worth reading to get one started (I'd also recommend Inderjeet Mani's book, Automatic Summarization if you don't mind spending a little money):
[224000420370] |Background IR material
[224000420380] |Sentence extraction
[224000420390] |Marcu's ACL tutorial
[224000420400] |Multi-Document Summarization By Sentence Extraction
[224000420410] |Automated Multi-document Summarization in NeATS
[224000420420] |A Trainable Document Summarizer
[224000420430] |Headline generation: Automatic Headline Generation for Newspaper Stories
[224000420440] |Sentence compression: Statistics-Based Summarization --- Step One: Sentence Compression
[224000420450] |Other:
[224000420460] |Discussion: What might be in a summary?
[224000420470] |Automatic evaluation: ROUGE: a Package for Automatic Evaluation of Summaries
[224000420480] |Recent fun stuff: Cut and paste based text summarization, Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment and A Statistical Approach for Automatic Speech Summarization
[224000430010] |Penn Discourse TreeBank
[224000430020] |I visited Penn last November and while talking with Joshi, I found out about a recent annotation effort from Penn/LDC: the Penn Discourse Treebank.
[224000430030] |(There is also an upcoming tutorial at ACL about the effort).
[224000430040] |The PDTB is quite different from discourse resources I'm familiar with (eg., the RST treebank).
[224000430050] |The interesting aspect of the PDTB is that discourse connectives (either those that appear explicitly or those that are implicit) are treated as a sort of predicate between two arguments, much like one would expect for verbs.
[224000430060] |As someone who is interested in problems that break the sentence boundary, I find much discourse-related resources and work interesting.
[224000430070] |One nice thing about the PDTB is that, since not every sentence is required to relate to another (relation only occurs when it is clear), human agreement is high and many unnatural decisions made in the RST treebank are avoided (eg., forcing things to be a tree when they shouldn't be).
[224000430080] |One not-so-nice thing about the PDTB is that not every sentence is required to relate to another.
[224000430090] |In a theory of discourse, I am looking for something that tells me why this blog post is a text and why a string of largely unrelated sentences is not.
[224000430100] |The PDTB-style annotation does not give us this.
[224000430110] |Because of this, it is hard for me to imagine many applications for which this is incredibly useful.
[224000430120] |In case anyone more familiar with the PDTB group reads this blog, I'm curious if anyone has proposed baseline models and evaluation criteria for this problem.
[224000430130] |I have some ideas if there isn't anything out there, but I don't want to step on toes...
[224000440010] |Here's a Loss, There's a Loss
[224000440020] |There are several different notions of "loss" that one encounters in machine learning literature.
[224000440030] |The definition I stick with is: the end criteria on which we will be evaluated.
[224000440040] |(The other notions often have to deal with how one approximates this loss for learning: for instance the standard loss for classification is 0/1, but this is approximated by hinge-loss or log-loss or...)
[224000440050] |For a large number of problems, there are quite a few "reasonable" losses.
[224000440060] |For chunking problems, there is 0/1 loss (did you get everything correct), Hamming loss (how many individual predictions did you get correct) and F-score (over chunks).
[224000440070] |For MT, there is WER, PER, HED, NIST, BLEU, etc.
[224000440080] |For summarization there are all the Rouges.
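To see how easily these can disagree on the same prediction, here is a toy BIO-tagged example (the helper function is mine, not any standard scorer): 0/1, Hamming, and chunk F-score each tell a different story.

    def chunks(tags):
        # extract (start, end, type) spans from BIO tags
        spans, start = set(), None
        for i, tag in enumerate(tags + ["O"]):
            if start is not None and not tag.startswith("I-"):
                spans.add((start, i, tags[start][2:]))
                start = None
            if tag.startswith("B-"):
                start = i
        return spans

    gold = ["B-NP", "I-NP", "O", "B-NP", "I-NP"]
    pred = ["B-NP", "I-NP", "O", "O",    "O"]

    zero_one = int(gold != pred)                          # 1: the sequence is not exactly right
    hamming  = sum(g != p for g, p in zip(gold, pred))    # 2: two tokens are wrong
    g, p = chunks(gold), chunks(pred)
    precision, recall = len(g & p) / len(p), len(g & p) / len(g)
    f_score = 2 * precision * recall / (precision + recall)
    print(zero_one, hamming, round(f_score, 3))           # 1 2 0.667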
[224000440090] |Some of these losses are easier to optimize than others (optimize from the learning perspective).
[224000440100] |The question that arises is: if I can optimize one, should I worry about any of the others?
[224000440110] |A positive answer would make our lives much easier, because we could ignore some of the complexities.
[224000440120] |A negative answer would mean that we, as machine learning people, would have to keep "chasing" the people inventing the loss functions.
[224000440130] |It seems the answer is "yes and no."
[224000440140] |Or, more specifically, if the problem is sufficiently easy, then the answer is no.
[224000440150] |In all other cases, the answer is yes.
[224000440160] |Arguments in favor of "no":
[224000440170] |All automatically evaluated loss functions are by definition approximations to the true loss.
[224000440180] |What does one more step of approximation hurt?
[224000440190] |The loss functions are rarely that different: I can often closely approximate loss B with a weighted version of loss A (and weighting is much easier than changing the loss entirely).
[224000440200] |Empirical evidence in sequence labeling/chunking problems and parsing show that optimizing 0/1 or Hamming or accuracy or F-score can all do very well.
[224000440210] |Arguments in favor of "yes":
[224000440220] |If not "yes" then reranking would not be such a popular technique (though it has other advantages in features).
[224000440230] |Empirical evidence in MT and summarization suggests that optimizing different measures produces markedly different results (though how this translates into human evaluation is questionable).
[224000440240] |If effort is put in to finding an automatic criteria that closely correlates with human judgment, we should take advantage of it if we want to do well on human judgements.
[224000440250] |There is a fourth argument for "yes" which comes from the binary classification literature.
[224000440260] |If we solve hundreds of different classification problems, some of which are harder than others, we can compare directly how well one loss looks like another.
[224000440270] |The standard picture looks like that below:
[224000440280] |On the x-axis is one loss function (say, hinge-loss) and on the y-axis is another (say, squared loss).
[224000440290] |Each dot corresponds to a different learning problem: those in the bottom left are easy, those in the top right are hard.
[224000440300] |What we commonly observe is that for very easy problems, it doesn't matter what loss you use (getting low error on one directly implies low error on another).
[224000440310] |However, as the problem gets harder, it makes a bigger and bigger difference which we use.
[224000440320] |A confounding factor which even further argues for "yes" is that in the binary classification example, the losses used are ones for which zero loss in one always implies zero loss for the other.
[224000440330] |This is very much not the case for the sorts of losses we encounter in NLP.
[224000440340] |So, my personal belief is that if we can optimize the more complicated loss, we always should, unless the problem is so easy that it is unlikely to matter (eg, part of speech tagging).
[224000440350] |But for any sufficiently hard problem, it is likely to make a big difference.
[224000440360] |I would love, though, to see some results that repeat the experiment Franz did in his MT paper, but with human evaluations added.
[224000450010] |ACL papers up
[224000450020] |Available here.
[224000450030] |Some that look especially interesting:
[224000450040] |A Discriminative Global Training Algorithm for Statistical MT (Tillmann and Zhang)
[224000450050] |Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling (Jiao, Wang, Lee, Greiner and Schuurmans)
[224000450060] |An Iterative Implicit Feedback Approach to Personalized Search (Lv, Sun, Zhang, Nie, Chen and Zhang)
[224000450070] |Training Conditional Random Fields with Multivariate Evaluation Measures (Suzuki, McDermott and Isozaki)
[224000450080] |Integrating Syntactic Priming into an Incremental Probabilistic Parser, with an Application to Psycholinguistic Modeling (Dubey, Keller and Sturt)
[224000450090] |Robust PCFG-Based Generation using Automatically Acquired LFG Approximations (Cahill and van Genabith)
[224000450100] |Exploiting Syntactic Patterns as Clues in Zero-Anaphora Resolution (Iida)
[224000450110] |An End-to-End Discriminative Approach to Machine Translation (Liang, Bouchard-Cote, Taskar and Klein)
[224000460010] |NAACL: The post-hoc short-list
[224000460020] |NAACL is over, and while the majority of the conference took place in hallways for me this time around (lots of old and new friends made their way to NYC), I did get a chance to see a few papers that I enjoyed.
[224000460030] |I'm short-listing some of them here (including workshop papers), though I encourage others to list those that they liked, since one person can have at most a 33% recall.
[224000460040] |Effective Self-Training for Parsing (McClosky, Charniak + Johnson).
[224000460050] |Train a parser A, run A on 2m sentences, use A's output to train a new parser B and B does better on test data.
[224000460060] |This is a somewhat surprising story, and one that has plenty of counterexamples in the literature.
[224000460070] |These guys got it to work.
[224000460080] |The key was to use the Charniak + Johnson reranking parser as parser A, rather than just the Charniak parser.
[224000460090] |This begs the question: why does it work with the reranking parser when it fails with the vanilla one?
[224000460100] |Unfortunately, the paper doesn't answer this question.
[224000460110] |Though talking to Eugene, they are interested in ideas why.
[224000460120] |One obvious idea is that the epsilon difference in performance for reranking and not reranking happens to occur at a phase transition.
[224000460130] |I find this unlikely.
[224000460140] |My guess (which I've suggested and they seem interested in exploring) is the following.
[224000460150] |The unreranked parser produces systematic errors, and so self-training just reinforces these errors leading to worse performance.
[224000460160] |In contrast, the reranked parser has a high variance of errors (this is intuitively reasonable if you think about how reranking works) and therefore these come across as "noise" when you retrain on the large corpus.
[224000460170] |It's easy to verify this claim: simply remove enough data from the reranked parser so that its performance is comparable to the vanilla parser and see if you still get self-training improvements.
[224000460180] |If so, I may be right.
[224000460190] |One can then do a more exhaustive error analysis.
[224000460200] |Other people have suggested other reasons which I think the Brown folk will also look into.
[224000460210] |I'm very curious.
[224000460220] |Prototype-driven Learning for Sequence Models (Haghighi + Klein).
[224000460230] |This received the best student paper award (congrats Aria!).
[224000460240] |The idea in this paper is that we don't want to have to label entire sequences to do normal supervised learning.
[224000460250] |Instead, what we do is, for each possible label, we list a small number of prototypical words from that label.
[224000460260] |They then use a constrained MRF, together with distributional similarity features to attempt to learn a sequence labeling model.
[224000460270] |I think this idea is very interesting, and I am in general very interested in ways of using "not quite right" data to learn to solve problems.
[224000460280] |My only concern with this approach is that it is quite strict to require that all occurrences of prototypical words get their associated tag.
[224000460290] |This is perhaps okay for something like POS tagging, but in general problems, I feel that words are too ambiguous for this to work.
[224000460300] |(And, for completeness, since I asked this question at the talk, I would have liked to see at least one baseline model that made use of the prototypes, even in a naive way.)
[224000460310] |Quasi-Synchronous Grammars: Alignment by Soft Projection of Syntactic Dependencies (D Smith + Eisner, SMT Workshop).
[224000460320] |This paper makes the argument that the raw synchronicity assumptions made by SCFGs for either projection or translation are too strong (others, like Hwa, have argued the same).
[224000460330] |They analyze the sorts of transformations required to explain real parallel data (does a parent/child end up as a parent/child after translation, or is sisterhood maintained).
[224000460340] |They propose a translation model that can account for much more varied transformations than standard SCFGs.
[224000460350] |It's a workshop paper, and there's no decoder yet, but talking to David afterwards, they are actively working on this and it may actually be safe to hold one's breath.
[224000460360] |Computational Challenges in Parsing by Classification (Turian + Melamed, CHPJISLP Workshop).
[224000460370] |This is a history-based parsing-as-classification model, where one essentially parses in a bottom-up, sequential fashion.
[224000460380] |A lot of this paper is dedicated to talking about how to scale up to being able to train decision trees as classifiers (laziness, parallelization and sampling provide a 100 fold speedup), but something I found extremely interesting (though not shocking once you hear it) is that right-to-left search outperformed left-to-right.
[224000460390] |The obvious explanation is that English is largely right-branching, and by searching right-to-left, most decisions remain strongly local.
[224000460400] |They achieve quite impressive results, and there's some obvious way in which this algorithm could be Searnified.
[224000460410] |So that's my short-short list.
[224000460420] |I'd love to hear what papers other people liked (so I know what to read!).
[224000460430] |That said, I heard many rumors of lots of really cool sounding ACL and EMNLP papers, so I'm really looking forward to Sydney!
[224000470010] |Incremental Improvements
[224000470020] |I usually think of incremental improvements as the result of taking an existing result X and twiddling X a little to yield Y. This can happen in system development (adding syntax to my MT system helped), theory (Y is a straightforward corollary of X), etc.
[224000470030] |Incremental improvements are often uninteresting, unless your research area happens to be exactly that of X (i.e., if I work on MT, maybe I will try adding syntax now).
[224000470040] |But in a broader sense, everything is an incremental improvement.
[224000470050] |It is vanishingly unlikely that a paper will come along that is not incremental in some sense.
[224000470060] |I think what often ends up making a paper more or less successful is how many incremental improvements it contains.
[224000470070] |For instance, the original CRF paper was clearly successful.
[224000470080] |Yet, mathematically, CRFs are only incrementally different from MEMMs.
[224000470090] |Experimentally, they only perform slightly better.
[224000470100] |Algorithmically, the optimization is only slightly more clever than inference in HMMs.
[224000470110] |Theoretically, they can solve only a slightly broader family of problems.
[224000470120] |But importantly, all of these advantages come in a bundle.
[224000470130] |The CRF is a success largely because a single, relatively simple formalism made (incremental) improvements in at least four areas.
[224000470140] |I think you can make similar arguments about other "must reads" in NLP, such as Collins' original parsing paper.
[224000470150] |If true, this might lead one to make research and accept/reject decisions on the basis of the number of areas in which improvements are made.
[224000470160] |This would certainly cut down on the number of papers published, but I also feel that many purely incremental papers do end up being useful, if only as steps in a path.
[224000470170] |For instance, MEMMs themselves are much more incremental upon HMMs, maximum entropy models and some older work by both Brill and Roth.
[224000470180] |The important question is whether CRFs would have been discovered, had MEMMs not been used and exhibited problems (eg., the label-bias problem).
[224000470190] |Of course, such observations are easy in retrospect: I don't know how to identify them without seeing the goal.
[224000470200] |(Incremental improvements also serve a second role: they are advertisements for the original technique, for those who missed it the first time around.)
[224000490010] |Having an Impact
[224000490020] |Having just completed my thesis, I've been thinking a lot about what directions I should pursue, post-graduation.
[224000490030] |While I'll probably stick with some of the stuff I've been working on in the past, this is also an opportunity to diversify.
[224000490040] |This opens the question: what are promising areas to work on?
[224000490050] |One important qualification (though by no means the only one) is impact: I want to work on things that are important, either in the sense that they become tools used by others for a diverse set of problems, or change the way people think about or look at a particular area.
[224000490060] |Rather than applying my own model for what work has been important, one can look back at what other people think have been the most influential papers in NLP, as well as what papers have won best paper awards recently (with the caveat that the latter is probably less indicative).
[224000490070] |The story seems to be that the most impactful papers, at least as measured by this criteria, are either MT papers, parsing papers or machine learning papers.
[224000490080] |This makes some historical sense: MT is sort of the quintessential NLP application, while parsing is essentially the language-for-language's sake task.
[224000490090] |This analysis, unfortunately, puts me in a sticky situation.
[224000490100] |I've already been doing machine learning stuff, which I hope will turn out to be useful for other people.
[224000490110] |I'd ideally like to get back into really languagey stuff in the future, and this leaves parsing and MT.
[224000490120] |I have little interest in working on parsing (nothing against parsing: it's just not my thing, and there are lots of people who can probably do it better than me), so now I'm stuck with MT.
[224000490130] |MT is, without a doubt, an interesting topic.
[224000490140] |But the field is so crowded right now, it seems difficult to find a sufficiently large and interesting niche to carve out.
[224000490150] |I personally think out-of-English translation is promising, especially conjoined with morphological research (pretty much required when going into a highly inflected language), but there are a large number of barriers to entry to this problem (evaluation, my American-esque inability to speak any foreign language sufficiently reliably to do error analysis (except Japanese, but there's so little data there), etc.).
[224000490160] |But regardless of my personal interests and limitations, the fact that the field is so dominated by two problems seems unfortunate.
[224000490170] |There are 44 bins that ACL allowed you to select for your topic this year and while many of them largely overlap, there are many that do not.
[224000490180] |Why, for instance, does research in discourse, inference, phonology, question answering, semantics, and word sense disambiguation have comparatively little weight in the field as a whole?
[224000490190] |Why are there six sessions on MT (plus one multilingual), six on parsing (plus two on grammars), but only two on semantics and discourse and one on language modeling at ACL this year?
[224000490200] |And beyond that, there seem to be lots of interesting problems that no one is working on (or at least that no one publishes at ACL).
[224000490210] |I do not doubt that these are interesting problems, but it seems that there is a severe skew that may not be healthy to the community.