[224000500010] |
Beating an N-Gram
[224000500020] |Our entry into the NIST MT eval this year has a recapitalization component, currently being done by a large language model (500mb, gzipped) together with SRILM's "disambig" tool.
[224000500030] |I was recently conscripted to try to improve this.
[224000500040] |After about a week's worth of work, using a bunch of tools and rules and a bunch of other stuff (though not 100% tuned), I've eked out a 0.25 increase in BLEU scores on the 2003 and 2004 test data for three of our MT systems (phrase-based, syntax-based and Hiero).
[224000500050] |This could easily be upped to maybe 0.5 with a few more hours of work (there are tokenization differences between my training and test data).
[224000500060] |The story that it's hard to beat an n-gram language model is fairly ubiquitous.
[224000500070] |In fact, my most useful features for classification are based on the n-best outputs of disambig using the large LM.
[224000500080] |There are older results that suggest that once you have enough data, using stupid techniques is sufficient.
[224000500090] |This seems to be more true for NLP than for many other areas I know about, though perhaps this is because there exist NLP problems for which it is even possible to get a "sufficient" amount of data.
[224000500100] |While this story is true for English, I'm not sure that it would hold for other languages.
[224000500110] |My guess is that n-gram models work significantly worse in languages with more free word order and (though this usually comes as a package deal) stronger morphology.
[224000500120] |As a counter-argument, though, some of my friends who have contact with people who do commercial speech recognition in languages other than English do actually use vanilla n-gram models in those other languages.
[224000500130] |I don't know whether this is for lack of an alternative or just because they remain so good even outside of English.
[224000500140] |I don't know the secret sauce to beating an n-gram.
[224000500150] |They appear to work so well because language (where, by "language" I mean "English") is really not that ambiguous, in a Shannon-experiment sense.
[224000500160] |Moreover, raw English words seem to really be just about right when it comes to information content.
[224000500170] |Sure, we can get some mileage by looking at suffixes or bigrams, but, comparing to, say, German (where one word can correspond to 2-3 English words), English really seems to strike the right balance of amount of information in a word to sparsity (where "right" = "right for NLP").
[224000500180] |I'm sure other languages fall fairly close by and I recall seeing a study comparing Shannon-style tests in multiple languages (anyone know a pointer?), but, if pressed, this is my guess as to why n-grams work so well.
[224000500190] |To beat them, it seems we are almost forced to look at problems with less data, or problems which are significantly noisier.
[224000510010] |Thesis is Done!
[224000510020] |I'm happy to announce that my thesis, Practical Structured Learning Techniques for Natural Language Processing, is now done (which is good, because it needs to be handed in by 11:59pm on July 5th).
[224000510030] |Many thanks to everyone who supported me through this (long) process; I really appreciate it.
[224000520010] |My Thesis Path
[224000520020] |(Minor note: the latest version of the EDT section didn't get folded in properly to the version of the thesis that went up yesterday; this is fixed now.)
[224000520030] |Many people have asked me how I settled on a thesis topic, I think largely because they are trying to find their own paths.
[224000520040] |My story is (at least to me) a bit odd and circuitous, but I'll tell it anyway.
[224000520050] |When I came to USC, I knew I wanted to do NLP and, more specifically, I wanted to do summarization.
[224000520060] |Working with Daniel was a natural choice.
[224000520070] |I was particularly interested in coming up with good models for getting around the sentence extraction paradigm that dominates the field.
[224000520080] |My first work was on extending Kevin and Daniel's sentence compression work to the document level by using discourse information.
[224000520090] |This worked reasonably well.
[224000520100] |My next milestone was to try to leverage alignment information of document/abstract pairs in order to learn complex transformations.
[224000520110] |This led to the segment HMM model for doc/abs alignment, which met with reasonable success (considering how darned hard this problem is).
[224000520120] |At that point, it became clear that trying to do a full abstractive system just wasn't going to work.
[224000520130] |So I started looking at interesting subproblems, for instance sentence fusion.
[224000520140] |Unfortunately, humans cannot reliably do this task, so I threw that out, along with a few other ideas.
[224000520150] |Around the same time, I began noticing that to have any sort of reasonably interesting model that did more than sentence extraction, I really was going to need to run a coreference resolution system.
[224000520160] |So I went to the web (this was back in 2003) and found one to download.
[224000520170] |Except I didn't because there weren't any publicly available.
[224000520180] |So I went to build my own.
[224000520190] |I wanted to do it "right" in the sense that I wanted a principled, joint system that could be trained and run efficiently.
[224000520200] |I read a lot of literature and didn't find anything.
[224000520210] |So then I read a lot of machine learning literature (I was beginning to get into ML fairly hardcore at this point) to find a framework that I could apply.
[224000520220] |I couldn't find anything.
[224000520230] |So I decided to build my own thing and came up with LaSO, which is essentially a formalization and tweak on Mike Collins and Brian Roark's incremental perceptron.
[224000520240] |My thesis proposal used LaSO to solve the EDT problem, and I presented LaSO at ICML 2005.
[224000520250] |Also at ICML 2005 I met John Langford, having previously noticed that what I was doing with LaSO looked something like some recent work he had done on reinforcement learning.
[224000520260] |We had a long dinner and conversation and, after a few visits to Chicago, many emails, and lots of phone calls, came up with Searn.
[224000520270] |With both LaSO and Searn, I always had in the back of my mind that I didn't want to make any assumptions that would render it inapplicable to MT, since everyone else at ISI only does MT :).
[224000520280] |At about the same time, I started thinking about how to do a better job of sentence compression and came up with the vine-growth model that eventually made it into my thesis.
[224000520290] |This was really the first point that I started thinking about summarization again, and, in the evolution of LaSO to Searn, it now became possible to do learning in this model.
[224000520300] |So, in a sense, I had come full circle.
[224000520310] |I started with summarization, lost hope and interest, began to get more interested in EDT and machine learning, and then finally returned to summarization, this time with a new hammer.
[224000530010] |What is Natural Language Understanding?
[224000530020] |I've read lots of papers and seen lots of talks that justify a task as being useful to doing "natural language understanding."
[224000530030] |The problem I have here is that I really have no idea what NLU actually is.
[224000530040] |I think the standard line is that NLU is the process of transforming natural language text into "some representation" that is "easy" for a computer to "manipulate."
[224000530050] |Of course, the quoted words are so vacuous in meaning that I can easily imagine reasonable definitions for them that make this task either trivial or essentially impossible.
[224000530060] |What I find unfortunate here is that, without trying hard to pin down definitions, this seems like a completely reasonable goal for NLP technology; in fact, my impression is that up until 10 or 20 years ago, this actually was one of the major goals of the field.
[224000530070] |According to the anthology, roughly 25% of the papers in 1979 contained the terms "NLU" or "language understanding", compared to 20% in the 1980s, 10% in the 90s and 7% in the 2000s.
[224000530080] |One possible explanation of the dwindling of this topic is that publication has become increasingly driven by experimental results, and if one cannot pin down a definition of a task, one cannot reliably compare results across multiple papers.
[224000530090] |The recent push on textual entailment bears some semblance to NLU, but is simultaneously better defined and more restricted (though I admit to having heard some grumbling that some more work needs to be done to get textual entailment to be a very well defined task).
[224000530100] |There also continues to be a modicum of work on database filling and natural language database queries, though certainly not at the same rate as before.
[224000540010] |Conference Formats
[224000540020] |There are two conference formats I'm familiar with.
[224000540030] |One is typified by ACL and ICML; the other by NIPS.
[224000540040] |The ACL format is a roughly 3 day schedule, with three parallel sessions, which gives time for somewhere around 80 presentations, each 20 mins + 5 for questions.
[224000540050] |ACL has smallish poster sessions that are typically associated with "short papers."
[224000540060] |My impression is that many people do not attend the poster sessions.
[224000540070] |In contrast, NIPS typically accepts many more papers (120ish per year), but the vast majority are presented as posters.
[224000540080] |There is only one track at the conference, with several invited talks and a few paper presentations (maybe 15 total), each lasting 20 minutes.
[224000540090] |They also have "spotlight presentations", which are 2 slide, 2 minute talks designed to draw attention to particularly interesting posters.
[224000540100] |The poster sessions run from (roughly) 6 until 10 every night and are attended by everyone.
[224000540110] |My feeling is that both formats have advantages and disadvantages.
[224000540120] |I personally like the NIPS format, essentially because I feel that at the end of the conference, I have seen more of what is there (because I spend so much time at posters).
[224000540130] |This also means that I'm completely exhausted at the end of NIPS.
[224000540140] |The major disadvantage to the NIPS format seems to be that I'm used to using post-conference time as dinner socialization time, and this seems to happen somewhat less at NIPS (this is confirmed by a friend who is more in the "in crowd" at NIPS than I am).
[224000540150] |I think that it might be a little intimidating for junior researchers to go up to a "famous" poster, but on the other hand, this forces you to immediately get to know everyone.
[224000540160] |The "lots of posters" format also means that the decision about the number of papers to accept at NIPS is essentially limited by physical space, rather than length of the conference.
[224000540170] |Ken Church seems to think we should be accepting more papers, anyway.
[224000540180] |Are there other major disadvantages to having a single-track conference, with the majority of papers presented as posters?
[224000540190] |I don't expect to see a shift in how our conferences are held, but I'd like to understand all the pros and cons.
[224000550010] |Visa Problems entering USA?
[224000550020] |Some people are trying to do something about it (contacting congressmen, etc.).
[224000550030] |If you've had problems entering the US for conferences, you should contact them.
[224000560010] |Where are Interesting Learning Problems in NLP?
[224000560020] |I just spent a few days visiting Yee Whye and NUS (photos to prove it!).
[224000560030] |YW and I talked about many things, but one that stood out as a ripper is attempting to answer the question: as a machine learning person (i.e., YW), what problems are there in NLP that are likely to be amenable to interesting machine learning techniques (e.g., Bayesian methods, non-parametric methods, etc.).
[224000560040] |This was a question we tried to answer at the workshop last year, but I don't think we reached a conclusion.
[224000560050] |At first, we talked about some potential areas, mostly focusing around problems for which one really needs to perform some sort of inference at a global scale, rather than just locally.
[224000560060] |I think that this answer is something of a pearler, but not the one I want to dwell on.
[224000560070] |Another potential partial answer arose, which I think bears consideration: it will not be on any problem that is popular right now.
[224000560080] |Why?
[224000560090] |Well, what we as NLPers usually do these days is use simple (often heuristic) techniques to solve problems.
[224000560100] |And we're doing a sick job at it, for the well studied tasks (translation, factoid QA, ad hoc search, parsing, tagging, etc.).
[224000560110] |The hunch is that one of the reasons such problems are so popular these days is because such techniques work so bloody well.
[224000560120] |Given this, you'd have to be a flamin' galah to try to apply something really fancy to solve one of these tasks.
[224000560130] |This answer isn't incompatible with the original answer (globalness).
[224000560140] |After all, most current techniques only use local information.
[224000560150] |There is a push toward "joint inference" problems and reducing our use of pipelining, but this tends to be at a fairly weak level.
[224000560160] |This is not to say that Bayesian techniques (or fancy machine learning more generally) are not applicable to problems with only local information, but there seems to be little need to move toward integrating in large amounts of global uncertainty.
[224000560170] |Of course, you may disagree and if you do, no wuckers.
[224000560180] |p.s., I'm in Australia for ACL, so I'm trying to practice my Aussie English.
[224000570010] |Preprocessing vs. Model
[224000570020] |Back from Sydney now, and despite being a bit jetlagged, I thought I'd begin with a COLING/ACL-related post.
[224000570030] |There was a paper presented by Ben Wellington that discusses whether current syntax-driven models for translation can capture the sorts of reorderings/alignments observed in hand-aligned data.
[224000570040] |I'm not looking at the paper, but what I remember from the talk is that something like 5-10% of sentences are missed assuming an ITG-like model, without gapping.
[224000570050] |At the end of the talk, Dekai Wu commented that they are aware of this for ITG and that an analysis of where it happens shows that it is essentially "weird" extrapositions that occur in English, such as wh-movement, topical preposing, adverbial movements, etc.
[224000570060] |Apparently such things occur in roughly 5-10% of sentences (from whatever corpora were being used).
[224000570070] |This talk was near the end of the conference, so my buffer was near full, but my recollection is that the suggestion (implied or explicit) was that English could (should?) be "repaired" by unextraposing such examples, at which time a model like ITG could be directly applied to nearly all the data.
[224000570080] |To me this seems like a strange suggestion.
[224000570090] |ITG is a very simple, very easy to understand model of translation; more natural, in many regards, than, say, the IBM models.
[224000570100] |(Okay, maybe in all regards.)
[224000570110] |One nice thing about the model is that it is largely un-hackish.
[224000570120] |Given this, the last thing I'd really want to do is to fix English by applying some preprocessing step to it, so that the output could then be fed into this nice pretty model.
[224000570130] |There's an obvious generalization to this question, hinted at in the title of this post.
[224000570140] |It seems that in many problems (translation with ITG being but one example), we can either choose a theoretically clean but maybe not completely appropriate model plus some preprocessing, or a somewhat messier but maybe 99.9% recall model with no preprocessing.
[224000570150] |I suppose one could argue that this is the long sought after integration of rule-based and statistically-based NLP, but this seems uncompelling, because it's not really an integration.
[224000570160] |I suppose that I'd almost always prefer the latter sort of solution, but that's not to say the first isn't also useful.
[224000570170] |For instance, in the ITG case, after identifying cases where the ITG assumption fails, we could try to minimally enhance the ITG model to be able to capture these corner cases.
[224000580010] |Loss versus Conditional Probability
[224000580020] |There was a talk in the session I chaired at ACL about directly optimizing CRFs to produce high F-scores for problems like NE tagging and chunking.
[224000580030] |The technique is fairly clever and is based on the observation that you can use very similar dynamic programming techniques to do max F-score as to do max log-probability in CRFs.
[224000580040] |The details are not particularly important, but during the question phase, Chris Manning asked the following question: Given that F-score is not really motivated (i.e., is a type 4 loss), should we really be trying to optimize it?
[224000580050] |Conditional probability seems like a completely reasonable thing to want to maximize, given that we don't know how the tags will be used down the pipeline.
[224000580060] |(It seems Chris is also somewhat biased by a paper he subsequently had at EMNLP talking about sampling in pipelines.)
[224000580070] |I think Chris' point is well taken.
[224000580080] |Absent any other information, conditional probability seems like a quite plausible thing to want to optimize, since given the true conditional probabilities, we can plug in any loss function at test time and do minimum Bayes risk (in theory, at least).
[224000580090] |On the other hand, there is an interesting subtlety here.
[224000580100] |Conditional probability of what?
[224000580110] |The standard CRF optimization tries to maximize conditional probability of the entire sequence.
[224000580120] |The "alternative objective" of Kakade, Teh and Roweis optimizes the sum of the conditional probabilities of each label.
[224000580130] |These are two quite different criteria, and which one should we choose?
[224000580140] |In fact, neither really seems appropriate.
[224000580150] |Conditional probability of the sequence doesn't make sense because it would rather improve a bad label from probability 0.01 to 0.1 than improve a bad label from 0.4 to 0.6 and thus get it right.
[224000580160] |But summed conditional probability of labels doesn't make sense in NE tagging tasks because always assigning probability 0.9 to "not an entity" will do quite well.
[224000580170] |This is essentially the "accuracy versus f-score" problem, where, when few elements are actually "on," accuracy is a pretty terrible metric.
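To put made-up numbers on the first objection: since improving a single label's probability changes the sequence log-likelihood by (roughly) the difference of its log-probabilities, the hopeless fix is "worth" far more to the objective than the decisive one.

    # Illustrative arithmetic only; 0.01->0.1 and 0.4->0.6 are the numbers from the example above.
    import math
    print(math.log(0.1) - math.log(0.01))   # ~2.30: big gain, label is still wrong
    print(math.log(0.6) - math.log(0.4))    # ~0.41: small gain, label is now right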
[224000580180] |If we take Chris' advice and desire a conditional probability, it seems what we really want is direct conditional probability over the chunks!
[224000580190] |But how do we formulate this and how do we optimize it?
[224000580200] |My impression is that a direct modification of the paper Chris was asking about would actually enable us to do exactly that.
[224000580210] |So, while the authors of this paper were focusing on optimizing F-score, I think they've also given us a way to optimize conditional chunk probabilities (actually this should be easier than F-score because there are fewer forward/backward dependencies), similar to what Kakade et al. did for conditional label probabilities.
[224000590010] |Future DUC Tasks
[224000590020] |The Document Understanding Conference features a yearly summarization competition.
[224000590030] |For the past few years, the task has been query-focused summarization of clusters of (essentially entirely) news documents.
[224000590040] |There will be a pilot task next year and based on comments made during DUC 2006, it appears it will be one of the following:
[224000590050] |Multidocument, (probably) query-focused summarization of blog posts.
[224000590060] |Multidocument summarization of news, with respect to known information.
[224000590070] |The idea in (1) is that there are several "novel" aspects one has to deal with.
[224000590080] |First, blog posts are out of domain for most parsers, etc., which means we'll get noisy input but not as noisy as speech.
[224000590090] |Second, although the blog posts (the blogs would be from the TREC blog collection) will essentially all focus on news topics (sadly, NLPers is not in the corpus), they are almost certainly more emotionally fueled than vanilla news.
[224000590100] |The identification of sentiment and opinion, which are both in vogue these days, will potentially become more useful.
[224000590110] |The idea in (2) is that in most real world situations, the user who desires the summary has some background information on the topic.
[224000590120] |The idea is that the summarization engine would be handed a collection of 5-10 documents that the user has presumably read, then 5-10 new documents to be summarized.
[224000590130] |The novel aspect of this task is, essentially, detecting novelty.
[224000590140] |Personally, I think both are potentially interesting, though not without their drawbacks.
[224000590150] |The biggest potential problem I see with the blogs idea is that I think we're reentering the phase of not being able to achieve any sort of human agreement without fairly strict guidelines.
[224000590160] |It's unclear if, say, two viewpoints are expressed, how a summary should reflect these.
[224000590170] |The biggest problem I see with idea (2) is that it is very reminiscent of some TREC-style tasks, like TDT, and I'm not sure that it calls for anything more than normal query-focused summarization with an MMR-style term to account for "known information."
[224000590180] |That's not to say these aren't worth exploring -- I think both are quite interesting -- but, as always, we should be careful.
[224000600010] |Human in the Loop Learning
[224000600020] |What we, as NLPers, typically do vis-a-vis machine learning is what might be called "Human in the Loop" machine learning.
[224000600030] |Why?
[224000600040] |Typically, we develop a model and set of features, train it on some training data and evaluate it on some development data.
[224000600050] |We then augment/change the model and features, retrain and retest on the dev data.
[224000600060] |Only once we are satisfied with our dev results do we run it on the test data and report scores in a paper.
[224000600070] |This is "human in the loop" because in each iteration we're using our human knowledge to, say, add some additional features that will hopefully correct for errors on the development data.
[224000600080] |To many machine learning people, this is not ideal: the whole point of machine learning is that the human is not in the loop.
[224000600090] |One can liken the "adding more features" to something akin to adding new sensors to a robot: when it is deployed on Mars, we cannot easily send a person to duct-tape on a new sensor.
[224000600100] |But I'm not sure that this argument really applies here.
[224000600110] |We are, after all, trying to either learn new things about language or build systems that work.
[224000600120] |For both of these goals, retwiddling models and adding features is the right thing to do.
[224000600130] |If we could have an automated system to help us with this, that would be great.
[224000600140] |But I don't think we should shy away from keeping ourselves in the loop.
[224000600150] |Perhaps another reason why machine learning tends to shy away from HITLL is that it goes against the long-standing models of supervised learning, which typically begin by saying something like "Let D be a distribution over pairs of inputs x and outputs y and let L be a labeled training set drawn iid from D."
[224000600160] |Here, the inputs (and hence the distribution) specify exactly what information we have access to in terms of features.
[224000600170] |This model essentially takes features for granted, as simply part of the learning problem.
[224000600180] |This is quite reasonable from the ML perspective, but is somehow lacking if one wants to talk about feature engineering, etc.
[224000600190] |There may be another, more apt model out there, but everything I can think of reduces trivially to the standard supervised model, unless inappropriate assumptions are made.
[224000610010] |Change of Notation
[224000610020] |Ed Hovy has a particular label he assigns to a large variety of work in NLP: it performs a change of notation.
[224000610030] |The canonical example is POS tagging, where we are changing from one notation (words) to another notation (POS tags).
[224000610040] |He also applies this notion to most forms of parsing (though this is a bit more like adding notation), machine translation (one notation is Arabic, the other English), etc.
[224000610050] |My impression (though it's sometimes hard to tell where Ed really stands on such things) is that solving a task with a change-of-notation technique is suboptimal.
[224000610060] |I've never fully grasped what Ed was trying to say specifically until recently (and I make no promises I didn't get it wrong, but this is my interpretation).
[224000610070] |I think what Ed means by change of notation is that what is being done is pure symbol manipulation.
[224000610080] |That is, we have a bunch of symbols (typically words) and we just push them around and change them, but the symbols don't actually "mean" anything.
[224000610090] |There's no sense that the symbol "chair" is grounded to an actual chair.
[224000610100] |Or, as a logician might say, the manipulation is purely syntactic, not semantic.
[224000610110] |There is no model in which statements are evaluated.
[224000610120] |My interpretation is that by "notation" Ed simply means "syntax" (the mathematical notion of syntax, not the linguistic one).
[224000610130] |On the other hand, it's hard not to say that everything we do, by definition, must be symbol manipulation.
[224000610140] |Computer science is all about playing with symbols.
[224000610150] |There may be a sense in which the word "chair" is grounded to an actual chair, but this actual chair, once inside the computer, will again be a symbol!
[224000610160] |Despite this, I think that what one can think about is something like "how large is the connection between my model and the real world?"
[224000610170] |For most applications, the connection is probably limited to the specific form of training data.
[224000610180] |This is probably as close to the smallest connection possible.
[224000610190] |Sometimes people attempt to integrate ontological information, either handcrafted (via something like WordNet) or automatically derived.
[224000610200] |This is, in some sense, attempting to open the "program to world" pipe a bit more.
[224000610210] |And there is some work that explicitly focuses on learning grounding, but I'm not aware of many actual uses for this yet.
[224000610220] |Additionally, some applications cannot help but to be tied more closely to the real world (eg., robot navigation).
[224000610230] |But for the majority of tasks, the connection is quite small.
[224000620010] |Machine learning has failed[?.]
[224000620020] |When I was visiting Singapore, Lan Man gave a talk at NUS right before mine (for extra fun, try to find her web site without using my link!).
[224000620030] |The work she presented is, for the purposes of this discussion, essentially a feature weighting technique for document classification.
[224000620040] |More specifically, let's consider the standard bag of words model for text classification.
[224000620050] |Here, we have a feature set that is exactly the size of the vocabulary.
[224000620060] |The easiest thing to do is to use binary features (word present or not), but other things seem to work better.
[224000620070] |For instance, the number of times the word appears (term frequency, or tf) often works better.
[224000620080] |This is not so surprising.
[224000620090] |The problem with just using tf is that words like "the" appear a lot and they're somehow less "important."
[224000620100] |So instead, what we often do is tf*idf, where idf is the "inverse document frequency."
[224000620110] |There are a few variants of idf (that use logs, etc.) but the basic idea is to count in how many documents each word appears at least once.
[224000620120] |Then, we divide the number of documents by this count.
[224000620130] |So the idf value is always at least one, and is one only for words that appear in all documents (eg., "the").
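(As a toy sketch of the three weighting schemes under discussion, using exactly the no-logs, no-smoothing version of idf just described; the documents here are invented:)

    from collections import Counter

    docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
    N = len(docs)
    df = Counter(w for d in docs for w in set(d))     # in how many documents each word appears
    idf = {w: N / df[w] for w in df}                  # >= 1; equals 1 only for words in every doc

    def features(doc, scheme="tfidf"):
        tf = Counter(doc)
        if scheme == "binary":
            return {w: 1.0 for w in tf}
        if scheme == "tf":
            return dict(tf)
        return {w: tf[w] * idf[w] for w in tf}        # tf*idf

    print(features(docs[0]))   # {'the': 1.0, 'cat': 1.5, 'sat': 1.5}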
[224000620140] |What Lan Man was proposing was essentially a different value to use instead of idf (I don't recall the details, but it was similarly an easy-to-compute function of the corpus).
[224000620150] |The result is essentially that binary does worst, tf does next best, followed by tf*idf and then tf*her_proposal worked best (for most data sets).
[224000620160] |The importance of using tf*idf arose in IR, where people started computing cosine similarities between word vectors.
[224000620170] |Here, downweighting things like "the" was of utmost importance, due to how cosine works.
[224000620180] |But it makes much less sense in supervised learning.
[224000620190] |That is, for linear classifiers (which are the norm, and which she used for many of her experiments), it really shouldn't matter if we apply a constant scaling factor to each feature.
[224000620200] |The learned feature weight itself is what is supposed to determine the importance of a feature.
[224000620210] |If we learn weights for tf*idf, we could easily convert them into just-as-good weights for tf by multiplying each weight by the corresponding idf value.
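Here is a quick numerical check of that equivalence, with made-up numbers and a plain linear scoring function (no learning involved):

    import numpy as np

    rng = np.random.default_rng(0)
    tf = rng.integers(0, 5, size=(6, 10)).astype(float)   # 6 toy docs, 10 word types
    idf = rng.uniform(1.0, 4.0, size=10)                  # some per-word idf values
    w_tfidf = rng.normal(size=10)                         # weights "learned" on tf*idf features

    scores_tfidf = (tf * idf) @ w_tfidf                   # scores under the tf*idf model
    scores_tf = tf @ (w_tfidf * idf)                      # identical scores from a tf model
    print(np.allclose(scores_tfidf, scores_tf))           # True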
[224000620220] |I can think of a few explanations of this phenomenon, but none is at all satisfying (e.g., by changing the scaling, we may be increasing the size of the margin while reducing the maximum norm of the data vectors, which, for most generalization bounds, would be a good thing.
[224000620230] |Another answer is an interplay with regularization -- it becomes less "expensive" to have an overall high effective weight -- learned weight times idf -- for infrequent informative words than for irrelevant words.)
[224000620240] |The problem is that this is a problem!
[224000620250] |It means that even for applications as simple as text categorization, not only is feature engineering an important issue, but feature weighting seems also to be.
[224000620260] |(Incidentally, I'm curious if other fields -- eg., vision -- have similar techniques to "idf."
[224000620270] |For instance, one could imagine weighting pixel values by "idf"s of how often a pixel is on in a dataset.
[224000620280] |I wonder if this helps.
[224000620290] |If not, it may be because for smallish images, the number of features is small already.
[224000620300] |I wonder if there is an algorithm design for machine learning that would get around this problem.)
[224000630010] |Doing Named Entity Recognition? Don't optimize for F1
[224000630020] |(Guest post by Chris Manning.
[224000630030] |Thanks Chris!)
[224000630040] |Among ML-oriented nlpers, using a simple F1 of precision and recall is the standard way to evaluate Named Entity Recognition.
[224000630050] |Using F1 seems familiar and comfortable, but I think most nlpers haven't actually thought through the rather different character that the F1 measure takes on when applied to evaluating sequence models.
[224000630060] |It's not just that it's a type 4 loss (a simple, intuition-driven measure like accuracy): In most cases such measures are reasonable enough for what they are, but using F1 for NER has an under-appreciated dysfunctional character.
[224000630070] |You wouldn't want to optimize for it!
[224000630080] |This post explains what I was really thinking about when I made the comment that Hal referred to previously (fortunately, I didn't try to say all this after the talk!).
[224000630090] |I agree with Hal: the paper was a very nice piece of work technically.
[224000630100] |I just think that the authors, Jun Suzuki et al., chose a bad peak to climb.
[224000630110] |Everyone is familiar with the F1 measure for simple classification decisions.
[224000630120] |You draw a 2x2 contingency table of whether something should be yes/no, and whether the system guessed yes/no, and then calculate the harmonic mean of precision and recall.
[224000630130] |But now think about Named Entity Recognition.
[224000630140] |You're chugging through text, and every now-and-again there is an entity, which your system recognizes or doesn't or fantasizes.
[224000630150] |I will use the notation word/GOLD/GUESS throughout, with O denoting the special background class of not-an-entity.
[224000630160] |So there are stretches of plain text (drove/O/O along/O/O a/O/O narrow/O/O road/O/O).
[224000630170] |These are the non-coding regions of NER.
[224000630180] |Then there are sequences (of one or more tokens) where there was an entity and the system guessed it right (in/O/O Palo/LOC/LOC Alto/LOC/LOC ./O/O), where there was an entity but the system missed it (in/O/O Palo/LOC/O Alto/LOC/O ./O/O), and where there wasn't an entity but the system hypothesized one (an/O/O Awful/O/ORG Headache/O/ORG ./O/O).
[224000630190] |Things look good up until here: those events map naturally on to the true negatives (tn), true positives (tp), false negatives (fn), and false positives (fp) of the simple classification case.
[224000630200] |The problem is that there are other events that can happen.
[224000630210] |A system can notice that there is an entity but give it the wrong label (I/O/O live/O/O in/O/O Palo/LOC/ORG Alto/LOC/ORG ./O/O).
[224000630220] |A system can notice that there is an entity but get its boundaries wrong (Unless/O/PERS Karl/PERS/PERS Smith/PERS/PERS resigns/O/O).
[224000630230] |Or it can make both mistakes at once (Unless/O/ORG Karl/PERS/ORG Smith/PERS/ORG resigns/O/O).
[224000630240] |I'll call these events a labeling error (le), a boundary error (be), and a label-boundary error (lbe).
[224000630250] |I started thinking along these lines just as an intuitive, natural way to characterize happenings in NER output, where entities are sparse occurrences in stretches of background text.
[224000630260] |But you can make it formal (I wrote a Perl script!).
[224000630270] |Moving along the sequence, the subsequence boundaries are: (i) at start and end of document, (ii) anywhere there is a change to or from a word/O/O token from or to a token where either guess or gold is not O, and (iii) anywhere that both systems change their class assignment simultaneously, regardless of whether they agree.
[224000630280] |If you chop into subsequences like that, each can be assigned to one of the above seven classes.
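To make the taxonomy concrete, here is a rough Python reconstruction of the classification step; it assumes the subsequences have already been cut by the boundary rules above, and it is my approximation of the idea, not Chris's actual Perl script.

    def classify_event(gold, guess):
        # gold, guess: equal-length label lists for one subsequence; "O" = background.
        gold_ent = any(g != "O" for g in gold)
        guess_ent = any(g != "O" for g in guess)
        if not gold_ent and not guess_ent:
            return "tn"
        if gold_ent and not guess_ent:
            return "fn"                      # missed the entity entirely
        if not gold_ent and guess_ent:
            return "fp"                      # fantasized an entity
        same_spans = [g != "O" for g in gold] == [g != "O" for g in guess]
        same_labels = {g for g in gold if g != "O"} == {g for g in guess if g != "O"}
        if same_spans and same_labels:
            return "tp"
        if same_spans:
            return "le"                      # right boundaries, wrong label
        if same_labels:
            return "be"                      # right label, wrong boundaries
        return "lbe"                         # wrong on both counts

    # the boundary-error example from the text:
    print(classify_event(["O", "PERS", "PERS", "O"], ["PERS", "PERS", "PERS", "O"]))   # be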
[224000630290] |Now, the thing to notice is that for the first 4 event types, you are either correct or you get 1 demerit, assessed to either precision or recall.
[224000630300] |In the simple classification case, that's the end of the story and the F1 measure is sensible.
[224000630310] |But when doing precision and recall over subsequences, there are these other three event types.
[224000630320] |Each of them is assessed a minimum of 2 demerits, with both precision and recall being hit.
[224000630330] |Therefore, it is fairly clear that optimizing for F1 in this context will encourage a system to do the following: if I'm moderately uncertain of either the class label or the boundaries of the entity, because a mistake would cost me a minimum of 2 demerits, I'm better off proposing no entity, which will cost me only 1 demerit.
[224000630340] |(Two notes:
[224000630350] |(i) As I've defined events, the possible demerits for an event in the last three classes are unbounded, though in practice 2 is the most common case.
[224000630360] |For example, this lbe event would be assessed 4 demerits (3 to precision, and 1 to recall): Smith/ORG/PERS and/ORG/O Newcomb/ORG/PERS and/ORG/O Co./ORG/ORG.
[224000630370] |(ii) Despite my title, the problem here isn't with the F measure per se, as Bob Moore emphasized to me at a coffee break during ACL 2006 (thanks!).
[224000630380] |The problem would occur with any measure that combines precision and recall and which is increasing in both arguments, such as the simple arithmetic mean of precision and recall.)
[224000630390] |Observe that this behavior is the opposite of the way things were meant to work: people adopted F1 in IR rather than using accuracy because accuracy gives high scores to a system that returns no documents, which obviously isn't useful.
[224000630400] |But, here, optimizing for F1 is encouraging a system to not mark entities.
[224000630410] |Now let's look at some data.
[224000630420] |I used this event classification system on the output of my NER system on the CoNLL 2003 shared task English testa data.
[224000630430] |Here is how many events of each type there were:
[224000630440] |tn: 5583, tp: 4792, fn: 118, fp: 120, le: 472, be: 102, lbe: 75
[224000630450] |Note in particular that over 2/3 of the errors are in those 3 extra categories that are multiply penalized.
[224000630460] |The ratios of classes vary with the task.
[224000630470] |For example, in biological NER, you tend to get many more boundary errors.
[224000630480] |But in my experience it is always the case that lots of the errors are in the last 3 classes.
[224000630490] |Moreover, some of the errors in the le and be classes are not that bad, and sometimes even reflect subtle judgement calls and human annotator inconsistency in the gold standard.
[224000630500] |For instance, in the GENIA data you can find both regulation/O of/O human/DNA interleukin-2/DNA gene/DNA expression and transduction/O to/O the/O human/O IL-2/DNA gene/DNA, where it is unclear whether to include human in the name of the gene.
[224000630510] |Or in a newswire phrase like the Leeds stadium, it's not always very clear whether Leeds should be tagged ORG as a reference to the football team or LOC as a reference to the city.
[224000630520] |In almost any imaginable task, you would prefer systems that made these errors to ones that missed such entities entirely.
[224000630530] |In other words, the F1 measure is punishing more severely mistakes that should be punished less according to reasonable intuitions of task utility.
[224000630540] |Has this been noticed before?
[224000630550] |I think so.
[224000630560] |The ACE program has a system for giving partial credit.
[224000630570] |But most ML people react very negatively to a scoring system that you couldn't possibly write on a napkin and which involves various very arbitrary-looking constants....
[224000630580] |Do these observations undermine the last decade of work in NER?
[224000630590] |I don't think so.
[224000630600] |It turns out that there are lots of measures that are pretty okay providing you do not specifically optimize for them, but are dysfunctional if you do.
[224000630610] |A well-known example is traditional readability measures.
[224000630620] |p.s.
[224000630630] |As I finish writing this guest post, it's occurred to me that I think this is the first nlpers post with some actual natural language examples in it.
[224000630640] |If you're reading this post, I guess that at least shows that such content isn't actively filtered out!
[224000640010] |NLG on NPR
[224000640020] |I heard a story on NPR while driving today that referenced a recent Financial Times story about the use of natural language generation technology for producing financial stories.
[224000640030] |The financial company, Thomson, is now generating some of its stories using NLG.
[224000640040] |The NLG technology, based on the little bit of information I gathered from the NPR story and reading the associated article, seems to be straightforward template filling.
[224000640050] |This is not terribly surprising, given the stagnant nature of financial articles (grab any such article from WSJ and you see things like "The DOW went up ____ percent (___ points) because of ___.").
[224000640060] |Most of this can be easily gathered from databases.
[224000640070] |One might suspect that the final "___" (the reason) would be hard to come up with automatically, but apparently this is highly standardized as well.
[224000640080] |According to the NPR story following the one on the NLG technology, the number of reasons that fill that blank is nearly closed class ("profits" or "middle east" or "terrorism" or "bin Laden videotape" or "strong trading" or ...).
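For what it's worth, that style of generation probably amounts to little more than the following; the template, field names and numbers here are all invented for illustration.

    template = "The Dow went {direction} {pct:.1f} percent ({points:.0f} points) because of {reason}."
    record = {"direction": "up", "pct": 0.8, "points": 95, "reason": "strong trading"}
    print(template.format(**record))
    # The Dow went up 0.8 percent (95 points) because of strong trading.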
[224000640090] |And, based on some hearsay evidence, the reasons are largely bogus anyway (this is not surprising: summarizing one billion trades as being driven by a bin Laden videotape is highly suspect).
[224000640100] |An amusing comment was made when the journalist at NPR asked the Thomson rep if his job was in danger.
[224000640110] |The response was that it was, only if he believed that his abilities topped out at filling out templates.
[224000640120] |Anyway, if anyone knows more about the technology, and if there's anything interesting in it, it would be fun to know.
[224000640130] |Beyond that, I just thought it was fun that NLP technology made a spot on NPR.
[224000650010] |Multilingual = Not Lingual at All?
[224000650020] |There has been a trend for quite some time now toward developing algorithms and techniques to be applicable to a wide range of languages.
[224000650030] |Examples include parsing (witness the recent CoNLL challenge), machine translation, named entity recognition, etc.
[224000650040] |I know that in at least one or two of my own papers, I have claimed (without any experimental substantiation, of course :P) that there is no reason why the exact same system could not be run on languages other than English, provided a sufficient amount of labeled training data (and a native speaker who can deal with the annoying tokenization/normalization issues in the non-English language).
[224000650050] |I get the feeling that a large part of the surge is blowback against older NLP systems, for which hundreds or thousands of human hours were put into writing language-specific grammars and rules and lexicons.
[224000650060] |The replacement idea is to spend those hundreds or thousands of hours annotating data, and then to repeatedly reuse this data to solve different problems (or to try to come up with better solutions to an existing problem, despite the associated fears in doing so).
[224000650070] |I think that, overall, this is a good trend.
[224000650080] |The problem that I see is that it is potentially limiting.
[224000650090] |In order to develop a system that could plausibly be applied to (nearly) any language, one has to resort to features that are universal across all languages.
[224000650100] |This is fine, but for the most part the only universal features we know of that are reasonably computable are things like "language is made up of words and words are sort of semanticy units on their own" (of course, this misses a lot of compounds in German and is hard to do in Chinese without spaces) and "words sometimes have prefixes and suffixes and these are syntactically useful" (oops, Arabic has infixes) and "capitalization is often a good indicator of something proper-noun-like" (except for German where many common nouns are capitalized or Japanese where there isn't case marking).
[224000650110] |These are sometimes compounded with things like "adjacent words carry semantic meaning."
[224000650120] |But all in all, these features are relatively weak from the perspective of "language understanding."
[224000650130] |This distinction seems analogous to the "domain independent" versus "domain specific" one that we've also seen.
[224000650140] |If you are willing to limit yourself to a specific domain (eg., counter-terrorism), you can probably do a pretty good job doing reasonably deep understanding.
[224000650150] |On the other hand, if you want to work at the other end---applicable across all domains---there's little you can do because you're better off going for shallow with complete coverage rather than deep but sparse.
[224000650160] |Where I think that the domain specific people have it right is that they actually do take advantage of being in a specific domain.
[224000650170] |I know that when I work on a problem that's language specific (eg., summarization or coreference), I've only seldom taken advantage of the fact that the language is English.
[224000650180] |Sure, for summarization I've occasionally made use of an English parser and for coreference I've made use of mined data that's specific to English, but overall, I treat it as pretty much "any old language."
[224000650190] |This would probably be fine if I then ran my system on Arabic and Chinese and Portuguese and showed that it worked.
[224000650200] |But I don't.
[224000650210] |This seems to tell me that I'm missing something: that I have not been clear about my goal.
[224000650220] |Maybe I should take a hint from the domain specific people and decide which side of the language independent camp I want to be on.
[224000650230] |(The one counterargument that I will use to save face is that applying to other languages is often a lot of relatively needless work...you often have a pretty good idea of what's going to happen and I'd be surprised if people have strongly believed they've built something that's reasonably language independent and it turns out not to be.)
[224000680010] |Doing DP Clustering? Don't Sample!
[224000680020] |Dirichlet process techniques are increasingly popular tools for Bayesian analysis.
[224000680030] |There's not enough space here to describe how they work, so I'll assume you know.
[224000680040] |With the exception of the Variational DP techniques that Dave Blei and Michael Jordan developed, one typically uses MCMC techniques to perform sampling from the posterior distribution and then uses these samples to compute properties of interest.
[224000680050] |In many cases, the properties of interest are simply the cluster assignments.
[224000680060] |Since it's unclear how to use multiple samples over the cluster assignments to generate a single one (except, perhaps, by some MBR method), one typically just chooses the single cluster assignment from the sample that has maximal marginal likelihood.
[224000680070] |This is, of course, not really Bayesian, but it still seems a reasonable thing to do.
[224000680080] |For this post, I'm going to consider a model in which we have a likelihood term F(x | theta) and a mean prior G0, where we first draw G from DP(G0,alpha) and then theta from G and then x from theta.
[224000680090] |In particular, I will assume G0 and F are conjugate, which means that in the Gibbs sampler, we can analytically integrate out both G and theta and draw only for the cluster assignments (this is Neal's algorithm 3).
[224000680100] |This algorithm is nice because it converges much faster than ones that also have to sample over theta.
[224000680110] |(I know there are other good algorithms...MH works, as do split/merge proposals.)
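For concreteness, here is a minimal (and deliberately inefficient) sketch of one sweep of that collapsed sampler for the multinomial/Dirichlet case I use below; a real implementation would cache the per-cluster counts rather than recomputing them each time.

    import numpy as np
    from scipy.special import gammaln

    def log_pred(x, counts, b):
        # log p(x | cluster counts) under a Dirichlet(b)-multinomial posterior,
        # dropping the multinomial coefficient (it is constant across clusters).
        a = counts + b
        return (gammaln(a.sum()) - gammaln(a.sum() + x.sum())
                + (gammaln(a + x) - gammaln(a)).sum())

    def gibbs_sweep(X, z, alpha, beta, rng):
        # X: (N x V) word counts; z: list of cluster ids; alpha: DP scale; beta: Dirichlet parameter.
        N, V = X.shape
        b = beta * np.ones(V)
        for i in range(N):
            others = [j for j in range(N) if j != i]
            clusters = sorted({z[j] for j in others})
            logp = []
            for k in clusters:                           # join an existing cluster
                members = [j for j in others if z[j] == k]
                logp.append(np.log(len(members)) + log_pred(X[i], X[members].sum(axis=0), b))
            logp.append(np.log(alpha) + log_pred(X[i], np.zeros(V), b))    # open a new cluster
            p = np.exp(np.array(logp) - max(logp)); p /= p.sum()
            choice = rng.choice(len(p), p=p)
            z[i] = clusters[choice] if choice < len(clusters) else max(z) + 1
        return z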
[224000680120] |The potential worry is that if all you want is the cluster assignment that maximizes the marginal likelihood, is a Gibbs sampler a good search algorithm?
[224000680130] |The answer is no.
[224000680140] |Let's consider another way to search.
[224000680150] |We arbitrarily order the observed data, then label left-to-right over this ordering.
[224000680160] |When we get to x_n, we'll have already created k clusters, and then there are k+1 possible choices.
[224000680170] |It is straightforward to compute a partial marginal likelihood for each of these possibilities, which leads to a direct implementation for breadth-first search.
[224000680180] |But we can (often) do better.
[224000680190] |If F is in the exponential family and G0 is its natural prior, we can construct a reasonably good admissible heuristic and apply A* search (I'm working on writing up the details, and I'll make the code available shortly...it's quite simple to implement, but proving that the heuristic is admissible is a bit involved and I'm not 100% sure it's true for all exponential family members or just specific ones).
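In the meantime, here is roughly the shape of the search as a plain beam-search sketch (the admissible heuristic is omitted, since I haven't described it); hypotheses are scored by the CRP prior plus the partial marginal likelihood, using the same predictive score as in the sampler sketch above.

    import numpy as np
    from scipy.special import gammaln

    def log_pred(x, counts, b):                      # same predictive score as in the Gibbs sketch
        a = counts + b
        return (gammaln(a.sum()) - gammaln(a.sum() + x.sum())
                + (gammaln(a + x) - gammaln(a)).sum())

    def beam_cluster(X, alpha, beta, beam=5):
        N, V = X.shape
        b = beta * np.ones(V)
        hyps = [(0.0, [], [])]                       # (score, assignment so far, per-cluster counts)
        for n in range(N):
            x = X[n].astype(float)
            new_hyps = []
            for score, assign, counts in hyps:
                for k, c in enumerate(counts):       # put x_n in existing cluster k
                    s = (score + np.log(assign.count(k)) - np.log(n + alpha)
                         + log_pred(x, c, b))
                    new_hyps.append((s, assign + [k], counts[:k] + [c + x] + counts[k+1:]))
                s = score + np.log(alpha) - np.log(n + alpha) + log_pred(x, np.zeros(V), b)
                new_hyps.append((s, assign + [len(counts)], counts + [x]))   # start a new cluster
            hyps = sorted(new_hyps, key=lambda h: -h[0])[:beam]              # keep the best few
        return hyps[0][1]                            # best assignment found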
[224000680200] |Here are some artificial experiments.
[224000680210] |I generated ten documents over a vocabulary of 40 words, based on three clusters drawn from a symmetric Dirichlet with parameter alpha=4, approximately 40 words per document (distributed Poisson). I then use a DP with G0=Dirichlet(2) and F=Multinomial, with the scale parameter on the DP=2. I ran two Gibbs samplers (one initializing with a single large cluster, one initializing with many singleton clusters), and three search algorithms on this data.
[224000680220] |The first search was full A*.
[224000680230] |The second was beamed A* with a beam of 5.
[224000680240] |The last was A* with an inadmissible, but in some sense tighter, heuristic that's even easier to compute.
[224000680250] |The results are in the following graph:
[224000680260] |The y axis is negative log probability (lower is better) and the x axis is time in seconds.
[224000680270] |This is all in matlab and I didn't do anything fancy to make either algorithm especially fast.
[224000680280] |The horizontal lines are, top to bottom, heuristic A*, beam A* and full A*.
[224000680290] |The timings are <0.1s, 1.6s and 2.1s, respectively (variances over 5 runs with different permutations of the input are shown in parens). So the search does significantly better (attains a higher marginal likelihood than the sampler) in very little time (even after 30 seconds, the Gibbs sampler is still really far from getting down to even the performance of heuristic A*).
[224000680300] |So that's for small data.
[224000680310] |It turns out that the heuristic isn't good enough to work on huge data sets, unless you use a really small beam, which hampers performance (I'm investigating this).
[224000680320] |But if we use the inadmissible heuristic, we can handle large-ish data sets fairly easily.
[224000680330] |I ran the same experiment, but with 1000 docs over a vocabulary of 400 words, with 20 clusters, 1000 words per document and a symmetric Dirichlet prior with alpha=4.
[224000680340] |The Gibbs here actually sucks.
[224000680350] |Within about ten iterations, it gets stuck with a neg log lik of 724842 and 16 clusters (about 50 seconds per Gibbs iteration).
[224000680360] |The heuristic A* takes about 2300 seconds total and ends with a neg log lik of 707020 (and 20 clusters), quite significantly better.
[224000680370] |Combining heuristic A* with beam search (beam=100) leads to a neg log lik of 707260 (slightly worse, but nowhere near as bad as Gibbs) in only 1800 seconds.
[224000680380] |(Incidentally, the Gibbs gets stuck because the documents are so long, that the marginal posterior likelihoods completely dwarf the vanilla marginals, so it essentially never moves out of a local maximum.
[224000680390] |With shorter documents, this doesn't happen as much.)
[224000680400] |I'm still working on this a bit...I'm using DP clustering enough that this result makes a huge difference for me.
[224000680410] |I think there's a lot of room for improvement, even over what I have so far.
[224000680420] |I'm also applying it to real data to see if it still helps.
[224000680430] |But overall, it looks like this is a fairly promising direction (which is actually quite surprising, because clustering is not something that we would typically attack in a "left-to-right" fashion).
[224000690010] |I'll Take Movie Recommendations for $1m, Alex
[224000690020] |If you feel like you have the world's greatest recommender system, you should enter the NetFlix challenge for improving their movie recs.
[224000690030] |In addition to the possibility of winning a lot of money and achieving fame, you also get an order-of-magnitude larger data set for this task than has been available to date.
[224000690040] |(Note that in order to win, you have to improve performance over their system by 10%, which is a steep requirement.)
[224000690050] |I'll offer an additional reward: if you do this using NLP technology (by analysing movie information, rather than just the review matrix), I'll sweeten the pot by $10.
[224000710010] |NIPS papers up
[224000710020] |http://nips.cc/Conferences/2006/Program/schedule.php
[224000720010] |Scaling and Data
[224000720020] |In NLP, we often live in the idealized learning world where we have more data than we really know what to do with.
[224000720030] |The oft-cited Banko + Brill results are perhaps extreme in this regard (in the sense that we rarely have quite that much data), but we certainly have far more than most fields.
[224000720040] |The great thing about having lots of data is that large data sets support complex statistical analysis.
[224000720050] |As a stupid example, consider estimating a Gaussian.
[224000720060] |We estimate the mean and covariance (or generate a posterior over these quantities, if you prefer to be Bayesian).
[224000720070] |In a small data setting, we'd almost always approximate the Gaussian by either a diagonal, or constant diagonal covariance matrix.
[224000720080] |Especially if the number of data points is less than the number of dimensions (true Bayesians might not do this, but this is probably tangential).
[224000720090] |But if we have billions of data points, there's likely enough information in there to reliably estimate quite a few parameters (or approximate their posteriors) and we can do the full covariance matrix estimation.
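In code, the three levels of estimation are just the following (toy data):

    import numpy as np

    X = np.random.default_rng(0).normal(size=(1000, 5))     # N points in D dimensions

    mu = X.mean(axis=0)
    full_cov = np.cov(X, rowvar=False)                       # full D x D matrix: data-hungry when D is big
    diag_cov = np.diag(X.var(axis=0, ddof=1))                # diagonal approximation
    iso_cov = X.var(axis=0, ddof=1).mean() * np.eye(5)       # constant ("spherical") diagonal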
[224000720100] |The problem is that the full covariance estimation is computationally really expensive.
[224000720110] |Not only do we have to play with O(D^2) parameters (D is the dimensionality), but we also have to perform complex operations on the data that typically scale at least as O(N^2) (N is the number of data points).
[224000720120] |This is incredibly frustrating.
[224000720130] |We have the data to support a complex statistical analysis, but we don't have the computation time to actually perform the analysis.
[224000720140] |So we either throw out data to get the computation time down and do something more complex (which may now not be supported by the data) or, more often than not, do something simple on the large data set.
[224000720150] |Now, there is often nothing wrong with doing something simple, but if we cannot even try to do things that are more complex, then it's hard to say for sure whether simple is enough.
[224000720160] |So then the question is: how can we scale.
[224000720170] |I only know a handful of answers to this question, but maybe other people can contribute some.
[224000720180] |Get a job at Google and just use a billion machines (and/or some really clever Google engineers, ala the Google SMT system).
[224000720190] |This is obviously not a very satisfying option for everyone.
[224000720200] |Subsample the data.
[224000720210] |This is also not very satisfying (and, perhaps, even worse than the first option).
[224000720220] |Use a randomized algorithm, such as what Deepak did in his thesis.
[224000720230] |The message here is that if your complexity hinges on pairwise computations that look something like distance metrics, you can introduce randomization and do this in something like O(N) rather than O(N^2) time.
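For the record, the flavor of randomization I have in mind is locality sensitive hashing with random projections; this is the general trick, not necessarily exactly what Deepak did.

    import numpy as np
    from collections import defaultdict

    def lsh_buckets(X, n_bits=16, seed=0):
        # Group points by the sign pattern of a few random projections; points that
        # share a bucket are likely close in cosine distance, so we only compare within buckets.
        rng = np.random.default_rng(seed)
        planes = rng.normal(size=(X.shape[1], n_bits))    # random hyperplanes
        codes = (X @ planes) > 0                          # boolean hash codes
        buckets = defaultdict(list)
        for i, code in enumerate(codes):
            buckets[code.tobytes()].append(i)
        return buckets                                    # candidate neighbors share a bucket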
[224000720240] |Use smart data structures.
[224000720250] |Things like kd-trees are becoming increasingly popular in the ML community for solving pairwise problems.
[224000720260] |The idea is to recursively divide your data space (in an intelligent fashion) so that you can store sufficient statistics about what's under a node at that node itself.
[224000720270] |(I think one reason these haven't taken off in NLP is that they appear at first glance to be much better suited to real-valued mid-dimensional data, rather than sparse, discrete, super-high-dimensional data...is there an alternative for us?)
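As a quick example of what such structures buy you (a sketch, and only for the dense, mid-dimensional case), scipy's kd-tree answers all-nearest-neighbor queries without ever materializing the full N-by-N distance matrix:

    import numpy as np
    from scipy.spatial import cKDTree

    X = np.random.default_rng(0).normal(size=(100000, 8))   # toy dense, mid-dimensional data
    tree = cKDTree(X)
    dists, idx = tree.query(X, k=5)                         # 5 nearest neighbors of every point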
[224000720280] |There may be other general solutions, but I'm not aware of them.
[224000720290] |As it stands, with the exception of Deepak and few others, the solution appears to be basically to hire a bunch of smart people (and/or grad students) to do lots of engineering.
[224000720300] |But I'd prefer general principles.
[224000730010] |Two More Competitions
[224000730020] |Busy week this is!
[224000730030] |Here are two more pointers.
[224000730040] |ClearForest semantic web service: they want you to do a mashup using their IE technology and potentially win some money.
[224000730050] |Agnostic Learning vs. Prior Knowledge challenge: can you do better machine learning if you know something about your data?
[224000730060] |There's at least one data set in there that's text related.
[224000730070] |There are workshops at NIPS and IJCNN.
[224000730080] |Enjoy!
[224000740010] |The Shared Task Effect
[224000740020] |Shared tasks have been increasing in popularity over the past half decade.
[224000740030] |These are effectively competitions (though perhaps that word is rightfully disdained) for building systems that perform well on a given task, for a specific data set.
[224000740040] |Typically a lot of stuff is given to you for free: the data, various preprocessing steps, evaluation scripts, etc.
[224000740050] |Anywhere from a handful of people to dozens enter these shared tasks.
[224000740060] |Probably the most well known are the CoNLL shared tasks, but they have also taken place in other workshops (eg., the two SMT workshops and many others).
[224000740070] |Government-run competitions (eg., GALE, ACE, DUC (to some degree) and others) are somewhat similar, with the added bonus that money is often contingent on performance, but for the most part, I'll be talking about the community-driven shared tasks.
[224000740080] |(I'll note that shared tasks exist in other communities, but not to the extent that they exist in NLP, to my knowledge.)
[224000740090] |I think there are both good and bad things about having these shared tasks, and a lot depends on how they are run.
[224000740100] |Perhaps some analysis (and discussion?) can serve to help future shared task organizers make decisions about how to run these things.
[224000740110] |Many pros of shared tasks are perhaps obvious:
[224000740120] |Increases community attention to the task.
[224000740130] |Often leads to development or convergence of techniques by getting lots of people together to talk about the same problem.
[224000740140] |Significantly reduces the barrier of entry to the task (via the freely available, preprocessed data and evaluation scripts).
[224000740150] |(Potentially) enables us to learn what works and what doesn't work for the task.
[224000740160] |Makes a standardized benchmark against which future algorithms can be compared.
[224000740170] |Many of these are quite compelling.
[224000740180] |I think (3) and (5) are the biggest wins (with the caveat that it's dangerous to test against the same data set for an extended period of time).
[224000740190] |My impression (which may be dead wrong) is that, with respect to (1), there has been a huge surge of interest in semantic role labeling due to the CoNLL shared task.
[224000740200] |I can't comment on how useful (2) is, though it seems that there is at least quite a bit of potential there.
[224000740210] |I know there have been at least a handful of shared task papers that I've read that gave me an idea along the lines of "I should try that feature."
[224000740220] |In my opinion, (4) seems like it should be the real reason to do these things.
[224000740230] |I think the reason why people don't tend to learn as much as might be possible about what does and does not work is that there's very little systematization in the shared tasks.
[224000740240] |At the very least, almost everyone will use (A) a different learning algorithm and (B) a different feature set.
[224000740250] |This means that it's often very hard to tell -- when someone does well -- whether it was the learning or the features.
[224000740260] |Unfortunately (were it not the case!) there are some cons associated with shared tasks, generally closely tied to corresponding pros.
[224000740270] |May artificially bloat the attention given to one particular task.
[224000740280] |Usefulness of results is sometimes obscured by multiple dimensions of variability.
[224000740290] |Standardization can lead to inapplicability of certain options that might otherwise work well.
[224000740300] |Leads to repeated testing on the same data.
[224000740310] |Many of these are personal taste issues, but I think some argument can be made for them all.
[224000740320] |For (1), it is certainly true that having a shared task on X increases the amount of time the collective research community spends on X. If X is chosen well, this is often fine.
[224000740330] |But, in general, there are lots of really interesting problems to work on, and this increased focus might lead to narrowing.
[224000740340] |There's recently been something of a narrowing in our field, and there is certainly a correlation (though I make no claim of causation) with increased shared tasks.
[224000740350] |(2) and (3) are, unfortunately, almost opposed.
[224000740360] |You can, for instance, fix the feature set and only allow people to vary the learning.
[224000740370] |Then we can see who does learning best.
[224000740380] |Aside from the obvious problem here, there's an additional problem that another learning algorithm might do better, if it had different features.
[224000740390] |Alternatively, you could fix the learning and let people do feature engineering.
[224000740400] |I think this would actually be quite interesting.
[224000740410] |I've thought for a while about putting out a version of Searn for a particular task and just charge people with coming up with better features.
[224000740420] |This might be especially interesting if we did it for, say, both Searn and Mallet (the UMass CRF implementation) so we can get a few more points of comparison.
[224000740430] |To be more concrete about (3), a simple example is in machine translation.
[224000740440] |The sort of preprocessing (eg., tokenization) that is good for one MT (eg., a phrase-based system) may be very different from the preprocessing that is good for another (eg., syntax-based).
[224000740450] |One solution here is to give multiple versions of the data (raw, preprocessed, etc.), but then this makes the (2) situation worse: how can we tell who is doing best, and is it just because they have a darn good tokenizer (don't under-estimate the importance of this!)?
[224000740460] |(4) doesn't really need any extra discussion.
[224000740470] |My personal take-away from putting some extra thought into this is that it can be very beneficial to have shared tasks, if we decide at the beginning what the goals are.
[224000740480] |If our goal is to understand what features are important, maybe we should consider fixing the learning to a small set of algorithms.
[224000740490] |If our goal is learning, do the opposite.
[224000740500] |If we want both, maybe ask people to do feature ablation and/or try with a few different learning techniques (this is perhaps too much burden, though).
[224000740510] |I think we should definitely keep the (3) of low barrier of entry: to me, this is one of the biggest pros.
[224000740520] |I think the SMT workshops did a phenomenal job here, for a task as complex as MT.
[224000740530] |And, of course, we should choose the tasks carefully.
[224000750010] |Saving Read Papers
[224000750020] |I'm going to have a go at doing a mini-poll.
[224000750030] |Basically, I'm interested in whether or not papers you have read (and, presumably, find interesting) find their way into a permanent spot on your machine or your physical space.
[224000750040] |Please vote :).
[224000760010] |Saving Read Papers, Revisited
[224000760020] |So, why am I interested in how you save read papers?
[224000760030] |Well, I don't want to ruin the surprise yet.
[224000760040] |First, let's take a look at the (still incoming) results.
[224000760050] |The most popular method (roughly 60% of the population surveyed) is to save them locally.
[224000760060] |People have also pointed to some tools for archiving, though my guess is that these are probably underutilized.
[224000760070] |I'm actually a bit surprised more people don't use delicious, though I do not, so perhaps I shouldn't be surprised.
[224000760080] |(Incidentally, I fall into the majority class.)
[224000760090] |The reason I'm curious is that I spend a nontrivial amount of time browsing people's web pages to see what papers they put up.
[224000760100] |Some of this has to do with the fact that I only follow about a dozen conferences with regularity, which means that something that appears in, say, CIKM, often falls off my radar.
[224000760110] |Moreover, it seems to be increasingly popular to simply put papers up on web pages before they are published formally.
[224000760120] |Whether this is good or not is a whole separate debate, but it is happening more and more.
[224000760130] |And I strongly believe that it will continue to increase in popularity.
[224000760140] |So, I have a dozen or so researchers whose web pages I visit once a month or so to see if they have any new papers out that I care about.
[224000760150] |And, just like the dozen conferences I follow, there are lots that fall off my radar here.
[224000760160] |But this is (almost) exactly the sort of research problem I like to solve: we have too much information and we need it fed to us.
[224000760170] |I've recently been making a fairly obvious extension to my Bayesian query-focused summarization system that enables one to also account for "prior knowledge" (i.e., I've read such and such news stories -- give me a summary that updates me).
[224000760180] |I've been thinking about whether to try such a thing out on research articles.
[224000760190] |The basic idea would be to feed it your directory containing the papers you've read, and then it would routinely go around and find new papers that you should find interesting.
[224000760200] |Such a thing could probably be hooked into something like delicious, though given the rather sparse showing here, it's unclear that would be worthwhile.
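As a toy version of the idea (nowhere near the Bayesian model above), one could rank new papers by TF-IDF similarity against the directory of already-read papers; the file layout here is entirely hypothetical:

```python
import glob
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def load(pattern):
    files = sorted(glob.glob(pattern))           # one plain-text file per paper
    return files, [open(f, errors="ignore").read() for f in files]

_, read_docs = load("read_papers/*.txt")         # papers I've already read
new_files, new_docs = load("new_papers/*.txt")   # candidates crawled from web pages

vec = TfidfVectorizer(stop_words="english")
profile = np.asarray(vec.fit_transform(read_docs).mean(axis=0))  # crude interest profile
scores = cosine_similarity(vec.transform(new_docs), profile).ravel()

for score, fname in sorted(zip(scores, new_files), reverse=True):
    print(f"{score:.3f}  {fname}")
```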
[224000760210] |Of course, it's a nontrivial undertaking to get such a thing actually running beyond my controlled research environment (my desktop), so I wanted to get a sense of whether anyone might actually be interested.
[224000760220] |Ross's comment actually really got my attention because it would be probably easier technologically if everything could be done online (so one wouldn't have to worry about cross-platform, etc.).
[224000760230] |Anyway, this is something I've been thinking about for a while and it seems like a lot of the tools exist out there already.
[224000770010] |NIPS and NAACL workshops
[224000770020] |Just a few pointers for things that seem at least somewhat related.
[224000770030] |At NIPS this year, we have:
[224000770040] |Learning to compare examples
[224000770050] |Machine Learning for Multilingual Information Access
[224000770060] |Novel Applications of Dimensionality Reduction
[224000770070] |At NAACL, we have:
[224000770080] |Document Understanding Conference (DUC) 2007
[224000770090] |Computational Approaches to Figurative Language
[224000770100] |Confidence Estimation for Natural Language Processing
[224000770110] |TextGraphs-2: Graph-based Methods for Natural Language Processing
[224000770120] |Bridging the Gap: Academic and Industrial Research in Dialog Technologies
[224000770130] |Syntax and Structure in Statistical Translation (SSST)
[224000780010] |The ACL Universe is now a Wiki
[224000780020] |See here.
[224000790010] |Getting Started In: Sequence Labeling
[224000790020] |Okay, at long last it's time for another edition of "Getting Started In" (p.s., if you want to write one of these for your area of expertise, I'm very open to it!).
[224000790030] |So, what is sequence labeling?
[224000790040] |It's the meta problem that we face all the time in NLP tasks where we want to assign a single label to each element in a sequence.
[224000790050] |For us, a sequence is usually a sentence and a word is an element.
[224000790060] |The elements we are trying to assign are usually things like parts of speech, syntactic chunk labels (is this part of a noun phrase, verb phrase, etc.), named entity labels (is this a person?) and so on.
[224000790070] |Information extraction systems (i.e., extracting meeting times and locations from emails) can also be treated as sequence labeling problems.
[224000790080] |There are roughly two varieties of sequence labeling: (1) raw labeling and (2) joint segmentation and labeling. The latter is (IMO) more common.
[224000790090] |Raw labeling is something like POS tagging where each element gets a single tag.
[224000790100] |Joint segmentation and labeling is where whole segments get the same label.
[224000790110] |For instance, in named entity recognition, a sentence like "Yesterday , George Bush gave a speech ." contains one named entity ("George Bush").
[224000790120] |Here, we want to assign the label "PERSON" to the entire phrase "George Bush", not to individual words (for instance, if two names abut, we need to know where they separate).
[224000790130] |The easiest way to deal with segmentation problems is to transform them into raw labeling problems.
[224000790140] |The standard way to do this is the "BIO" encoding, where we label each word by "B-X", "I-X" or "O".
[224000790150] |Here, "B-X" means "begin a phrase of type X", "I-X" means "continue a phrase of type X" and "O" means "not in a phrase."
[224000790160] |For instance, the Bush sentence would be labeled as: "Yesterday/O ,/O George/B-PER Bush/I-PER gave/O a/O speech/O ./O" One can now treat this as a raw labeling problem, perhaps being careful to avoid producing impossible sequences at test time (eg., an "I-X" can only follow a "B-X" or another "I-X").
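Here is a minimal sketch of producing BIO tags from labeled spans, using the example above (span indices are token offsets, end-exclusive):

```python
def spans_to_bio(tokens, spans):
    """spans: list of (start, end, label) with end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = "Yesterday , George Bush gave a speech .".split()
print(spans_to_bio(tokens, [(2, 4, "PER")]))
# ['O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O']
```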
[224000790170] |Other encodings are also possible.
[224000790180] |So for now, I'll concentrate on raw labeling and revisit the true joint segmentation problem at the end.
[224000790190] |In raw labeling, we're faced with the problem of assigning a single label to each word in a sequence.
[224000790200] |Perhaps the easiest approach is to predict each label independently.
[224000790210] |Then, we just have a collection of multiclass classification tasks (each label is a different class).
[224000790220] |We can use whatever classifier we want for this.
[224000790230] |Despite its simplicity, this approach is actually quite effective for many problems.
[224000790240] |Intuitively, we usually believe that the labels in these problems are not independent.
[224000790250] |For instance, in POS tagging, it's basically impossible to have a verb immediately following a determiner.
[224000790260] |We would therefore like to use some local sequence information to improve our performance.
[224000790270] |The "old school" approach to doing this is to use a hidden Markov model (HMM).
[224000790280] |I'll refer you to Manning+Schutze for details here (keeping in mind that really what we're using is just a Markov model...in these problems, there is nothing that's hidden).
[224000790290] |But the basic idea here is that we have two probability distributions: a transition distribution (how likely is it that a verb will follow a determiner) and an emission distribution (how likely is it that we'll see the word "the" given that we know this word should be a determiner).
[224000790300] |If our transition probabilities are local, then the Viterbi algorithm will run efficiently at test time.
[224000790310] |This locality is referred to as the "Markov assumption."
[224000790320] |Specifically, a first-order Markov assumption says that the probability of label at time t+2 is independent of label at time t given the label at time t+1.
[224000790330] |The higher Markov order you use, the harder it is to decode (complexity-wise).
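For concreteness, here's a minimal numpy sketch of first-order Viterbi decoding; the inputs are log-probabilities (or, in the MEMM/CRF case, arbitrary local scores):

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """First-order Viterbi over K tags and T positions.
    log_trans[i, j] = log p(tag j | tag i), log_emit[t, j] = log p(word t | tag j),
    log_init[j] = log p(tag j at t=0)."""
    T, K = log_emit.shape
    delta = np.empty((T, K))             # best log score of any path ending in tag j at time t
    back = np.zeros((T, K), dtype=int)   # backpointers
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans + log_emit[t]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    tags = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        tags.append(int(back[t, tags[-1]]))
    return tags[::-1]
```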
[224000790340] |The potential problem with using HMMs is that the emission probabilities are of the form p(word | tag), where we'd really like to model p(tag | word).
[224000790350] |The latter is preferable because it's easier to include lots of overlapping features (capitalization, word identity, prefixes, suffixes, stems, etc.).
[224000790360] |The partial solution is to use a maximum entropy Markov model (MEMM), where we model p(tag | word) using a maximum entropy model, but keep everything else as in an HMM.
[224000790370] |MEMMs are only slightly more complex to train than HMMs, but work a whole lot better.
[224000790380] |At training time, we essentially include features having to do with the previous labels, but otherwise this is just as in the independent classifiers approach.
[224000790390] |Viterbi search still runs at test time, so we're limited to the same Markov order constraints as HMMs.
[224000790400] |The potential problem with MEMMs (noticing a trend here? :P) is that when the models are trained, they are trained against CORRECT previous labels.
[224000790410] |That is, when we create a classification example corresponding to the label at time t+1, we include features that depend on the label at time t.
[224000790420] |These will always be correct at training time, but can be wrong at test time.
[224000790430] |This leads to the infamous "label bias" problem.
[224000790440] |The conditional random field (CRF) is essentially an answer to this problem.
[224000790450] |Instead of training to predict each label independently, but then running Viterbi at test time, we train to get the whole sequence right.
[224000790460] |(This is a good instance of having identical training and testing situations.)
[224000790470] |Training CRFs is a bit of a bore, since each iteration of training requires one to run the forward-backward algorithm over each training instance.
[224000790480] |But CRFs do often perform better than plain MEMMs.
[224000790490] |The UMass group has been nice enough to release their CRF implementation.
[224000790500] |Once we get to CRFs, we can play a bunch of other games.
[224000790510] |For instance, we can essentially come up with a margin-based version of the CRF.
[224000790520] |This is roughly like moving from maxent to SVMs.
[224000790530] |The result is either max-margin Markov networks or SVMstruct, depending on exactly how you formulate the problem.
[224000790540] |The latter has a nice implementation available to play with.
[224000790550] |So that's a whole lot of learning, but the action really is in the features.
[224000790560] |For me, there's essentially a standard feature set I always use for these problems, and then add and subtract as I see fit.
[224000790570] |The features I usually use are the following (with examples based on the word "George"): the exact word ("George"), the stem ("george"), prefixes of length 1-3 ("G", "Ge" and "Geo"), suffixes of length 1-3 ("rge", "ge" and "e") and some regexps.
[224000790580] |The most useful one transforms all capital letters to "A", all lower-case letters to "a", all numbers to "0", and all punctuation to ".".
[224000790590] |Then, we collapse identical adjacent letters.
[224000790600] |So George -> Aaaaaa -> Aa.
[224000790610] |Finally, I use lists of people, places and organizations collected from various gazetteers and check whether each word falls in any of these lists.
[224000790620] |(Incidentally, these are available as part of my TagChunk program for solving joint tagging/chunking problems.)
[224000790630] |All of these features are typically applied in a window of +/- 1,2 or 3 words around the given word.
[224000790640] |You can see exactly what I calculate in this perl script.
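For those who don't read perl, here is a rough Python approximation of the core features (it is a sketch, not the script itself); the lowercased word stands in for the stem, and the gazetteer lookups are omitted:

```python
import re

def shape(word):
    # The regexp feature described above: A/a/0/. classes, then collapse runs.
    s = re.sub(r"[A-Z]", "A", word)
    s = re.sub(r"[a-z]", "a", s)
    s = re.sub(r"[0-9]", "0", s)
    s = re.sub(r"[^Aa0]", ".", s)
    return re.sub(r"(.)\1+", r"\1", s)   # George -> Aaaaaa -> Aa

def token_features(word):
    feats = {f"w={word}", f"low={word.lower()}", f"shape={shape(word)}"}
    feats |= {f"pre{k}={word[:k]}" for k in (1, 2, 3)}
    feats |= {f"suf{k}={word[-k:]}" for k in (1, 2, 3)}
    return feats

def window_features(tokens, i, window=2):
    # Apply the per-token features in a +/- window around position i.
    feats = set()
    for off in range(-window, window + 1):
        if 0 <= i + off < len(tokens):
            feats |= {f"{off}:{f}" for f in token_features(tokens[i + off])}
    return feats

print(shape("George"))   # Aa
```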
[224000790650] |The final issue in features is what to do with the transition features.
[224000790660] |In particular, one can think of putting the lexical features on "nodes" or "edges".
[224000790670] |By putting them on nodes, I mean that you have features like "previous-tag-is-DT current-word-is-George current-case-is-Aa".
[224000790680] |But you can also do a conjunction thing and say "previous-tag-is-DT current-word-is-George-and-previous-tag-is-DT current-case-is-Aa-and-previous-tag-is-DT".
[224000790690] |This obviously blows up your feature space, but if you have enough data, it's almost always worth doing.
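In feature-set terms, the "edge" version just conjoins every node feature with the previous tag; a tiny sketch of that conjunction:

```python
def add_edge_conjunctions(node_feats, prev_tag):
    """Keep the 'node' features and add each one conjoined with the
    previous tag -- the blown-up 'edge' version described above."""
    return node_feats | {f"{f}&prev-tag-is-{prev_tag}" for f in node_feats}

print(add_edge_conjunctions({"current-word-is-George", "current-case-is-Aa"}, "DT"))
```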
[224000790700] |Now, getting back to the segmentation issue.
[224000790710] |The fact that there are multiple valid encodings for mapping segmentation -> raw labeling is probably a bad thing.
[224000790720] |It seems that, ideally, we'd like to learn to do the segmentation directly.
[224000790730] |This is really not that hard, and it comes with a bunch of benefits, especially in what features you can employ.
[224000790740] |For instance, you can check if the whole string "George Bush" falls in a list of names.
[224000790750] |Or you can notice that both words are capitalized.
[224000790760] |These are both great "phrase-based" features.
[224000790770] |More evidence has been published that treating these problems as segmentation problems directly is beneficial.
[224000800010] |Why Speech Summarization?
[224000800020] |The vast majority of work on summarization is at the text level.
[224000800030] |A document (or collection of documents) comes in as text, and a summary goes out as text.
[224000800040] |Although I'm certainly not one of the ones pushing summarization of speech, I think it's quite an interesting task.
[224000800050] |(Here, by speech summarization, I pretty much mean speech in, speech out ... though similar arguments could be made for text in, speech out applications.)
[224000800060] |A lot of my conclusion comes from personal experience, but I also had a really great conversation with Alan Black (who works on speech synthesis problems).
[224000800070] |Let me motivate this by a personal anecdote.
[224000800080] |My mom tends to leave really long voicemail messages.
[224000800090] |Like, really long.
[224000800100] |Usually she overruns the allotted time and has to call back.
[224000800110] |It's not uncommon for her messages to span three mailbox slots.
[224000800120] |(Sorry, mom, if you're reading this!)
[224000800130] |The most interesting thing about these messages is that they really don't contain that much information.
[224000800140] |In fact, there's usually only one or two sentences in the whole thing that really matter (at least from an "information" sense).
[224000800150] |The problem is that you can't "skim" voicemail.
[224000800160] |If I get an equally long email (say, one that would take me 3 minutes to read), I can fairly easily skim it to get the gist.
[224000800170] |Especially with some tailored highlighting, I'm able to absorb much more text per second than speech per second.
[224000800180] |Word on the street (or at least around the croissant table at ACL) is that studies have been done that show that people can get at information from text just as fast if it's presented in a summary as if it's not summarized at all, essentially because we're really good at skimming (I knew high school English taught me something!).
[224000800190] |(Incidentally, if anyone has a reference for this, let me know.
[224000800200] |I believe I saw it ages ago, but I can't remember where and I'd love to have it.)
[224000800210] |So one solution to the "Mom's message" problem is to run automatic speech recognition and turn it into an email.
[224000800220] |This has the added advantages that I check my email much more frequently than I check my voicemail, that it gives me a record of the message after I've deleted the speech signal to save space, and that it allows for indexing.
[224000800230] |But there are also times when I don't have easy access to a computer and I really do want listen to the voicemail.
[224000800240] |This is where speech summarization becomes interesting.
[224000800250] |Anyway, if you're now convinced that speech summarization is interesting, well, I don't really know what to tell you since I really haven't looked at this problem.
[224000800260] |The work I am aware of (and this is a very biased sample...I apologize if I missed something) is:
[224000800270] |Summarizing Speech Without Text Using Hidden Markov Models by Sameer Maskey and Julia Hirshberg
[224000800280] |Sentence-extractive automatic speech summarization and evaluation techniques by Makoto Hirohata, Yosuke Shinnaka, Koji Iwano and Sadaoki Furui (Furui has been working in this area for a while...I can't find this exact paper, but his lab has others)
[224000800290] |Evaluation of Sentence Selection for Speech Summarization by Xiaodan Zhu and Gerald Penn
[224000800300] |Additionally, a quick search turned up a recent workshop at Interspeech on Speech Summarization, which probably has more links and information than I can easily produce myself.
[224000810010] |Any Stem Cell Research?
[224000810020] |I was listening to NPR a few weeks ago while pre-election news was still accosting my ears.
[224000810030] |Though not in my state (shockingly!), there was apparently a measure on the ballots in a lot of states referencing stem cell research.
[224000810040] |The title of the measure was, apparently, "Should any stem cell research be allowed?"
[224000810050] |Okay, quick quiz: how did you first interpret that sentence?
[224000810060] |If you were in favor of certain varieties, but not all varieties of stem cell research, would you vote for it?
[224000810070] |In other words, does "any" mean "at least one type of" or "all types of"?
[224000810080] |To me, and at least a few other people I've talked to, the preferred interpretation is "all"...it's also much easier to reach this interpretation if you put a bit of stress on "any."
[224000810090] |But it's also really easy to get the other interpretation by unstressing "any" or by putting another sentence in front of it (eg., "Stem cell research in all forms is purely evil." :P).
[224000810100] |The strange thing is that I'm really having a hard time coming up with other examples where the preference for "all" is so strong (even with the stress).
[224000810110] |It seems that the fact that stem cell research is a class of things, rather than a single thing is important.
[224000810120] |It also seems that the presence of "should" is pretty much necessary.
[224000810130] |But even so, I really can't get anything else to come out with such a strong preference for "all".
[224000810140] |So what does this have to do with NLP?
[224000810150] |Well, not too much other than language is hard.
[224000810160] |Something like this would probably kill a textual entailment system, but given that it's somewhat ambiguous even to people (the degree of ambiguity is person relative though: a Brit here tells me that he has a really hard time getting the "all" interpretation at all), maybe there's just nothing that can be done about it.
[224000820010] |ICML 2007 webpage up
[224000820020] |ICML 2007 will be held in Oregon this year from June 20-24; more information is on the newly published webpage.
[224000820030] |(Sadly, this overlaps with ACL in Prague, with a non-trivial flight in between.)
[224000830010] |Feature engineering (with a coreference focus)
[224000830020] |I gave a talk today at our small group meeting about feature engineering for coreference resolution problems.
[224000830030] |I think it is oft underappreciated by people (myself occasionally included) who spend a lot of time working on fancy machine learning algorithms that the difference between success and failure often hinges more on clever features than on fancy learning.
[224000830040] |For those who want to look at the slides, you can take an OpenOffice or PDF version for your perusal.
[224000830050] |I forewarn that it may be a bit hard to follow without the soundtrack, but it's sort of an "untold story" version of the coref paper we had at HLT/EMNLP 2005.
[224000830060] |So what did we do that's clever (or, at least, as far clever as I knew at the time and know now)?
[224000830070] |Well, a few things.
[224000830080] |First, I tried extending the notion of Hobbs' distance to discourse trees.
[224000830090] |This actually works surprisingly well.
[224000830100] |The true referent is the first element according to syntactic Hobbs' distance in about 70-80% of the cases and we (almost) see the same thing at the discourse level.
[224000830110] |The only oddity is that you have to flip the nuclearity of the attribution relation.
[224000830120] |Then you get about 80% holding, assuming perfect discourse trees.
[224000830130] |Of course, we never have perfect discourse trees and the noise introduced by poor parsing is enough to make this feature more or less useless.
[224000830140] |The second thing we tried was to use this name/instance data that Mike Fleischman gathered at ISI before going off to MIT.
[224000830150] |This data was gathered in the context of Q/A to answer questions like "Who is the president of the U.S.?" by using lookup-tables.
[224000830160] |But it's great data for coref!
[224000830170] |It includes about 2 million examples of names paired with their occupations.
[224000830180] |Maybe you can find this data interesting too.
[224000830190] |Mike describes it more in his ACL 2003 paper.
[224000830200] |These guys make a huge difference in coref performance.
[224000830210] |The next thing that helped a lot was a clever, non-standard use of gazetteers.
[224000830220] |What we do is take the ID of the gazetteer to which the anaphor belongs and pair it with the lexical item of the antecedent.
[224000830230] |Why?
[224000830240] |Well, suppose we have a gazetteer that contains words like "California" and "Utah" and "New York."
[224000830250] |By pairing this list id (eg., "list-5") with antecedents, we get features like "list-5 *AND* state" or "list-5 and country" etc.
[224000830260] |We'll probably learn that the former is a good feature and the latter is not.
[224000830270] |We also have a feature that checks to see if the two referents are on the same gazetteer but are not the same word.
[224000830280] |Why?
[224000830290] |This is a negative feature.
[224000830300] |"California" and "Utah" appear on the same list (states) but are not identical (string-wise).
[224000830310] |So they're probably not coreferent.
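A minimal sketch of these two gazetteer features (the list contents and ids here are made up):

```python
def gazetteer_features(anaphor, antecedent, gazetteers):
    """gazetteers: dict mapping a list id to a set of lowercased entries."""
    feats = set()
    ana, ante = anaphor.lower(), antecedent.lower()
    for list_id, entries in gazetteers.items():
        if ana in entries:
            feats.add(f"{list_id}&ante={ante}")  # list id paired with antecedent word
            if ante in entries and ana != ante:
                feats.add(f"{list_id}&same-list-but-different-word")  # negative cue
    return feats

gaz = {"list-5": {"california", "utah", "new york"}}
print(gazetteer_features("Utah", "state", gaz))
print(gazetteer_features("Utah", "California", gaz))
```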
[224000830320] |We tried pretty hard to get WordNet to work, but it just kept not helping.
[224000830330] |WordNet distance is a terrible measure (eg., "Japan" and "Russia" have distance 2 -- they are both under "country").
[224000830340] |On the other hand, like in the gazetteer case, "nearby in WordNet but not the same word" is a reasonably good feature, though it turned out not to help much.
[224000830350] |Hypernym and hyponym were the most promising, but we never actually got any benefit from them.
[224000830360] |The rest of the stuff in the talk made it fairly close to the front of the paper.
[224000830370] |The most surprising result was that count-based features help a lot!
[224000830380] |These features look at things like the entity to mention ratio, the number of entities detected so far, etc.
[224000830390] |These can't be included in a simple coref model that makes pairwise decisions, but for us it's easy.
[224000830400] |The end result is that using just string-based features (standard "substring" and edit distance stuff), lexical features (word pairs), count-based features and Mike's data, we can get an ACE score of 87.6 using LaSO for doing the learning (this goes up about 0.8 if you use Searn in most cases, but I don't have all the numbers).
[224000830410] |Adding in some inference to predict things like number and gender gets you up to 88.3.
[224000830420] |Gazetteers get you up to 88.7.
[224000830430] |After that, the rest of the features (regular expression patterns, word clusters, WordNet, discourse, and syntax) don't help...all together, they get you up to 89.1, but that's not much bang for the buck.
[224000830440] |My conclusion is basically that while learning makes a difference (Searn is a bit better than LaSO), the feature engineering makes a bigger one[*].
[224000830450] |Many of the features we found useful were a priori not really that obvious.
[224000830460] |I spent a month or two of my life trying to come up with ways of being clever to help improve performance (playing on dev data so as to be fair).
[224000830470] |It's definitely a non-trivial enterprise, but it often pays off.
[224000830480] |[*] This is assuming that your learning is robust enough to handle the features you want: for instance, the standard pairwise models could not handle the count-based features that helped us a lot.
[224000840010] |Back from NIPS
[224000840020] |The holidays are upon us (hence the lack of posts), but I wanted to give a brief nod to NIPS, which I thought went pretty well this year.
[224000840030] |The highlights for me (note that I didn't attend everything, so the usual qualifications apply) were:
[224000840040] |The Netflix Prize talk: Bennett talked about the origins of the prize and how well people are doing.
[224000840050] |They seem to firmly believe that the goal will be met. There's some indication it actually won't take all that long.
[224000840060] |Currently leading the pack at the time of the talk was wxyz, who is one of my old CMU buddies, Yi Zhang, now at UCSC.
[224000840070] |Nearly tied for first was Geoff Hinton's group up at Toronto.
[224000840080] |There weren't any details given about what people are doing to do so well, but it seems reasonable to assume that Toronto is doing something with deep belief nets (more later).
[224000840090] |Free Lunches (Invited talk by Dan Ariely).
[224000840100] |This was probably my favorite talk of the whole conference.
[224000840110] |The gist of the talk is that it's really easy to get humans to behave in ways that defy the standard "cost/benefit analysis" setting.
[224000840120] |A few fun examples.
[224000840130] |Split a bunch of people into two groups.
[224000840140] |Have one group write down 3 things they like about their significant other; have the other group write 10 things.
[224000840150] |Then ask them how much they love their SO.
[224000840160] |The result is that the "3 things" group loves them much more (presumably because no one can actually list 10 things).
[224000840170] |There were also a lot of examples about how often people cheat; the basic result is that it seems that knowing you cannot be caught does not necessarily make you cheat more.
[224000840180] |Also, if you prime people by having them sign an honor code, they cheat less.
[224000840190] |There were many more examples.
[224000840200] |I'm not quite positive what the take-home message was, but it was a very interesting talk.
[224000840210] |Analysis of Representations for Domain Adaptation (Blitzer et al).
[224000840220] |This is the first compelling analysis I've seen of the domain adaptation problem.
[224000840230] |The basic idea is to bound generalization error based on the distance between the source and target distributions.
[224000840240] |Quite clever and John says that they may even try to develop algorithms that explicitly minimize this bound.
[224000840250] |Boosting Structured Prediction for Imitation Learning (Ratliff et al).
[224000840260] |A cute feature-boosting algorithm for SP problems, used for the "following a map" problem.
[224000840270] |Large Margin Gaussian MMs for ASR (Sha, Saul).
[224000840280] |Do standard Gaussian mixture modeling for ASR, but put a large margin constraint on the Gaussians.
[224000840290] |Do a hard-EM-like thing to get fixed cluster assignments, and you can write this as a semi-definite program.
[224000840300] |Very cute.
[224000840310] |Greedy Layer-wise Training of Deep Networks (Bengio et al).
[224000840320] |Deep belief nets can do a really good job at some vision problems, but they're quite hard to train (symmetries, etc.).
[224000840330] |The basic idea here is to initialize the training by a simple layer-at-a-time method.
[224000840340] |The thing I found especially interesting is that their initialization looks a lot like a "predict yourself" sort of strategy, which seems to be a recurring theme in a lot of unsupervised/semi-supervised learning problem.
[224000840350] |There were also a lot of good workshops.
[224000840360] |I most enjoyed the one on covariate shift (i.e., domain adaptation).
[224000840370] |I enjoyed it so much, I plan to post separately on it in a few days (holiday schedule permitting).
[224000850010] |Learning when test and train inputs have different distributions -- NIPS workshop
[224000850020] |I spent the second day of workshops at NIPS (while not skiing) attending the Learning when test and train inputs have different distributions workshop.
[224000850030] |This is closely related (or really just a different name) for the domain adaptation problem I've been interested in for quite some time.
[224000850040] |Unfortunately, I can't easily come across the list of papers (there were some good ones!) which means that my memory may be lacking at some parts.
[224000850050] |Here are some points I took away.
[224000850060] |Statisticians have worked on this problem for a long time.
[224000850070] |If you provide insurance, you're going to want a predictor to say whether to give a new person a policy or not.
[224000850080] |You have lots of data on people and whether they made any claims.
[224000850090] |Unfortunately, the training data (people you have information on) is limited to those to whom you gave policies.
[224000850100] |So the test distribution (the entire population) differs from the training distribution (people to whom you gave policies).
[224000850110] |Some guy (can't remember his name right now) actually won a Nobel prize in Economics for a partial solution to this problem, which was termed "covariate shift" (because statisticians call our "inputs" "covariates" and they are changing).
[224000850120] |There seems to be a strong desire to specify models which (though see the comment below) can be characterized as "train p(y|x) and test p(y|x) are the same, but train p(x) and test p(x) differ."
[224000850130] |In other words, the "labeling function" is the same, but the distribution over inputs is different.
[224000850140] |This is probably the clearest way to differentiate domain adaptation from multitask learning (for the latter, we typically assume p(x) stays the same but p(y|x) changes).
[224000850150] |I'm not sure that this is really a hugely important distinction.
[224000850160] |It may be to obtain interesting theoretical results, but my sense is that in the problems I encounter, both p(x) and p(y|x) are changing, but hopefully not by "too much."
[224000850170] |An interesting point made along these lines by Shai Ben David that I spent a bunch of time thinking about several years ago was that from a theoretical perspective, assuming p(y|x) is the same is a vacuous assumption, because you can always take two radically different p(x) and q(x), add a feature that indicates which (p vs. q) the data point came from, and call this the "global p(x)".
[224000850180] |In fact, in some sense, this is all you need to do to solve multitask learning, or domain adaptation: just add a feature saying which distribution the input is from, and learn a single model.
[224000850190] |I've been doing some experiments recently and, while you can do better than this in practice with standard learning models, it's not such a bad approach.
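A minimal sketch of that baseline, with made-up feature sets: mark every example with its domain and train one model on the pooled data.

```python
def with_domain_feature(features, domain):
    """Add a feature saying which distribution the example came from."""
    return features | {f"domain={domain}"}

# Hypothetical examples: (feature set, label) pairs from two domains.
source_data = [({"w=the", "suf=he"}, "DT"), ({"w=bank", "suf=nk"}, "NN")]
target_data = [({"w=bank", "suf=nk"}, "NN")]

pooled = (
    [(with_domain_feature(f, "source"), y) for f, y in source_data] +
    [(with_domain_feature(f, "target"), y) for f, y in target_data]
)
# `pooled` then goes to whatever single-model learner you already use.
```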
[224000850200] |There were several other talks I liked.
[224000850210] |It seemed that the results (theoretically) were of the form "if p(y|x) is the same and the train and test p(x) differ only by a 'little', then doing naive things for learning can do nicely."
[224000850220] |My favorite formalization of the two distributions differing a little was the Shai Ben David/John Blitzer approach of saying that they differ slightly if there is a single hyperplane that does well (has low error) on both problems.
[224000850230] |The restriction to hyperplanes is convenient for what they do later, but in general it seems that having a single hypothesis from some class that will do well on both problems is the general sense of what "p(y|x) is the same" is really supposed to mean.
[224000850240] |I also enjoyed a talk by Alex Smola on essentially learning the differences between the input distributions and using this to your advantage.
[224000850250] |In general, my sense was that people are really starting to understand this problem theoretically, but I really didn't see any practical results that convinced me at all.
[224000850260] |Most practical results (modulo the Ben David/Blitzer, which essentially cites John's old work) were very NIPSish, in the sense that they were on unrealistic datasets (sorry, but it's true).
[224000850270] |I wholeheartedly acknowledge that its somewhat difficult to get your hands on good data for this problem, but there is data out there.
[224000850280] |And it's plentiful enough that it should no longer be necessary to make artificial or semi-artificial data for this problem.
[224000850290] |(After all, if there weren't real data out there, we wouldn't be working on this problem...or at least we shouldn't :P.)
[224000860010] |NAACL Accepted Papers
[224000860020] |See here.
[224000860030] |I probably won't be going to NAACL, so if afterwards someone wants to volunteer to post a few papers they especially liked, I'd appreciate it!
[224000880010] |Survey of Parser Usage
[224000880020] |I know people use parsers a fair amount; I'm curious which parsers people use.
[224000880030] |Thus, a new poll :).
[224000880040] |What I'm interested in, in particular, is if you use a parser essentially as a preprocessing step before doing something "higher level."
[224000880050] |In other words, if you are Mark Johnson and you use the Charniak parser to build a better parser, this doesn't count (sorry, Mark!).
[224000880060] |I want to know if you use one to do, eg., summarization or MT or IE or something other than parsing...
[224000880070] |(Also, I don't count shallow parsing in this context.)
[224000880080] |I look forward to responses!
[224000890010] |Comments on: Mark-up Barking Up the Wrong Tree
[224000890020] |The "Last Words" article in the Dec 2006 issue of Computational Linguistics is by Annie Zaenen from PARC.
[224000890030] |(I hope that everyone can access this freely, but I sadly suspect it is not so...
[224000890040] |I'm half tempted to reproduce it, since I think it's really worth reading, but I don't want to piss off the Gods at MIT press too much.)
[224000890050] |The main point I got from the article is that we really need to pay attention to how annotation is done.
[224000890060] |A lot of our exuberance for annotating is due to the success of machine learning approaches on the Treebank, so we have since gone out and annotated probably hundreds of corpora for dozens of other tasks.
[224000890070] |The article focuses on coreference, but I think most of the claims apply broadly.
[224000890080] |The first point made is that the Treebank annotation was controlled, and done by experts (linguists).
[224000890090] |Many other annotations are not done this way: they are done without real standards and without deep analysis of the task.
[224000890100] |The immediate problem, then, is that a learning algorithm that "succeeds" on the annotated data is not necessarily solving the right task.
[224000890110] |There was a similar story that my ex-office-mate Alex Fraser ran across in machine translation; specifically, with evaluating alignments for machine translation.
[224000890120] |The basic problem was two-fold.
[224000890130] |First, the dataset that everyone used (the French-English data from Aachen) was essentially broken, due largely to its distinction between "sure" and "possible" links -- almost every word pair was possibly linked.
[224000890140] |This, together with the broken evaluation metric (alignment error rate --- or AER) made results on this dataset virtually useless.
[224000890150] |The conclusion is essentially: don't use the Aachen data and don't use AER.
[224000890160] |That is, don't use them if you want improved MT performance, i.e., if you expect higher alignment performance to imply higher MT performance.
[224000890170] |(If this sounds familiar, it's perhaps because I mentioned it before.)
[224000890180] |I should say I largely agree with the article.
[224000890190] |Where I differ (perhaps only by epsilon) is that the article seems to pick on annotation for machine learning, but I really don't see any reason why the fact that we're using machine learning matters.
[224000890200] |The issue is really one of evaluation: we need to know that at the end of the day, when we compute a number, that number matters.
[224000890210] |We can compute a number intrinsically or extrinsically.
[224000890220] |In the extrinsic case, we are golden, assuming the extrinsic task is real (turtles upon turtles).
[224000890230] |In the intrinsic case, the situation is fishy.
[224000890240] |We have to make sure that both our annotations mean something and our method of computing error rate means something (ala the error metric types and the use of F for named entities).
[224000890250] |While I've argued on this blog that the error metric is important, the CL article argues that the annotation is important.
[224000890260] |I think that as someone who is on the machine learning side, this is easy to forget.
[224000900010] |Error Analysis
[224000900020] |I was recently asked if I thought that it would be a good idea if our conferences were to explicitly require an error analysis to be performed and reported in papers.
[224000900030] |While this is perhaps a bit extreme (more on this later), there are at least two reasons why this would be desirable.
[224000900040] |When multiple techniques exist for solving the same problem, and they get reasonably close scores, is this because they are making the same sort of errors or different sorts?
[224000900050] |If someone were to build on your paper and try to improve it, where should they look?
[224000900060] |There's an additional aspect that comes up, especially once you're in a sort of supervisory role.
[224000900070] |It's often hard to get students to actually look at outputs and forcing this as part of the game early on is a good idea.
[224000900080] |I was the same as a student (and continue to be the same now) -- only two or three out of a dozen or so papers of mine contain an error analysis.
[224000900090] |This situation reminds me a bit of an excellent talk I saw a few years ago (at ACL or EMNLP in Barcelona, I think) by Mitch Marcus talking about some parsing stuff.
[224000900100] |I don't really remember much of his talk, except that he kept flashing a single slide that read "Look at the data, stupid."
[224000900110] |His argument was essentially that we're not going to be able to model what we want to model unless we really understand what's going on in the data representing the phenomena we're trying to study.
[224000900120] |An exercise that's also good from this perspective is to do some data annotation yourself.
[224000900130] |This is perhaps even more painful than doing an error analysis, but it really drives home the difficulties in the task.
[224000900140] |Getting back to the point at hand, I don't think it's feasible or even necessarily advisable to require all papers to include an error analysis.
[224000900150] |But I also think that more papers should contain error analyses than actually do (including some of my own).
[224000900160] |In the universal struggle to fit papers within an 8 page limit, things have to get cut.
[224000900170] |It seems that the error analysis is the first thing to get cut (in that it gets cut before the paper is even written -- typically by not being performed).
[224000900180] |But, at least for me, when I read a paper, I want to know after the fact what I have learned.
[224000900190] |Occasionally it's a new learning technique.
[224000900200] |Or occasionally it's some new useful features.
[224000900210] |Or sometimes it's a new problem.
[224000900220] |But if you were to take the most popular problems out there that I don't work on (MT, parsing, language modeling, ASR, etc.), I really have no idea what problems are still out there.
[224000900230] |I can guess (I think names in MT are hard, as is ordering; I think probably attachment and conjunctions in parsing; I have little idea in LM and ASR), but I'm sure that people who work on these problems (and I really mean work: like, you care about getting better systems, not just getting papers) know.
[224000900240] |So it would be great to see it in papers.
[224000910010] |Good News on ACL Reviews
[224000910020] |I'm reviewing for ACL again this year (in the machine learning subcomponent).
[224000910030] |A couple of days ago, I received my notice to start bidding on papers (more on bidding below).
[224000910040] |The email came with the following note:
[224000910050] |Naturally, reviewers have been chosen to assess papers based on their own expertise and outlook.
[224000910060] |Having said this, we are aware that ACL has sometimes been perceived, especially in recent years, as overemphasizing the pursuit of small incremental improvements of existing methods, perhaps at the expense of exciting new developments.
[224000910070] |(ACL is solid but boring, is what some people would say.)
[224000910080] |While we believe that it would be counterproductive to change course radically -- We certainly would not want to sacrifice solidity! -- we would like to encourage you, as a reviewer, to look out particularly for what's novel and interesting, even if this means accepting a paper that has one or two flaws, for example because it has not been evaluated as rigourously as you would like.
[224000910090] |(It is for you to judge when a flaw becomes a genuine problem.)
[224000910100] |I think this is fantastic!
[224000910110] |(Would someone who is reviewing---i.e., on the PC---for another area confirm or deny that all areas got such a message, or was it just ML?)
[224000910120] |One difficulty I always have as a reviewer is that I assign scores to different categories (originality, interest, citations, etc.) and then am asked to come up with a meta-score that summarizes all these scores.
[224000910130] |But I'm not given any instruction on how to weigh the different components.
[224000910140] |What this note seems to be doing is saying "weigh interest higher than you usually would."
[224000910150] |In the past two years or so, I've been trying to do this.
[224000910160] |I think that when you start out reviewing, it's tempting to pick apart little details on a paper, rather than focusing on the big picture.
[224000910170] |It's been a conscious (and sometimes difficult) process for me to get over this.
[224000910180] |This explicit note is nice to see because it is essentially saying that my own internal process is a good one (or, at least, whomever wrote it thinks it's a good one).
[224000910190] |I also think---in comparison to other conferences I've PCed for or reviewed for---that ACL does a really good job of moderating the bidding process.
[224000910200] |(For those unfamiliar with bidding... when a paper gets submitted, some area chair picks it up.
[224000910210] |All papers under an area chair are shown---title plus abstract---to the reviewers in that area.
[224000910220] |Reviewers can bid "I want to review this," "I don't want to review this," "I am qualified to review this," or "Conflict of interest."
[224000910230] |There is then some optimization strategy to satisfy reviewers preferences/constraints.)
[224000910240] |In comparison to ECML and NIPS in the past, the ACL strategy of dividing into area chairs seems to be a good thing.
[224000910250] |For ECML, I got a list of about 500 papers to select from, and I had to rank them 1-5 (or 1-10, I don't remember).
[224000910260] |This was a huge hassle.
[224000910270] |It seems that of most of the conferences that I'm familiar with, ACL has a pretty decent policy.
[224000910280] |While I would be thrilled to see them introduce an "author feedback" step, everything else seems to work pretty well.
[224000910290] |In the past, I've only once gotten in to a real argument over a paper with other reviewers --- most of the time, all the reviewer scores have tended to be +/- 1 or 2 (out of ten) of each other.
[224000910300] |And for the times when there is an initial disagreement, it is usually resolved quickly (eg., one reviewer points out some major accomplishment, or major flaw, in the paper that another reviewer missed).
[224000930010] |Bag of Words citation
[224000930020] |I was recently asked by a colleague if I knew what the first paper was that used the bag of words model.
[224000930030] |I'm pretty certain it would be an IR paper, but have no idea what it would be.
[224000930040] |Manning+Schutze and Jurafsky+Martin don't have it.
[224000930050] |I know tf-idf is due to Sparck-Jones, but presumably BOW existed before that.
[224000930060] |The vector space model is often credited to Salton, which is probably the earliest thing I know of, but my guess is that BOW predated even that.
[224000930070] |Anyone know a citation?
[224000940010] |AI-Stats schedule/papers up
[224000940020] |The papers weren't listed in a single location, so I assembled them in one page.
[224000940030] |Looks like it should be fun (and sunny)!
[224000950010] |I Don't Think We Understand Structured Prediction
[224000950020] |I've posted on structured prediction a few times before and then some.
[224000950030] |This is another post, but with a different perspective.
[224000950040] |Besides, the most recent time was last May, so it's been a while.
[224000950050] |This post is prompted by a recent OpEd by John Lafferty and Larry Wasserman on Challenges in Machine Learning in Statistica Sinica, where they mention SP as one of three major open questions in machine learning (the other two are semi-sup and sparse learning in high dimensions).
[224000950060] |I've been trying to think of a good way of ontologizing SP methods recently, primarily because I want to understand them better.
[224000950070] |The more I think about it, the more I feel like we really don't understand the important issues.
[224000950080] |This is partially evidenced by an observation that in the past six years (essentially, since the first CRF paper), there have been like 30 different algorithms proposed for this problem.
[224000950090] |The fact that there are so many proposed algorithms tells me that we don't really understand what's going on, otherwise we wouldn't need to keep coming up with new techniques.
[224000950100] |(I'm not innocent.)
[224000950110] |To me, the strongest division between techniques is between techniques that require tractable inference in the underlying problem (eg, CRFs, M3Ns, SVM-struct, structured perceptron, MIRA, MEMMs, HMMs, etc.) and those that don't (eg, incremental perceptron, LaSO, Searn, the Liang/Lacoste-Julien/Klein/Taskar MT algorithm, local predictors ala Roth and colleagues, etc.).
[224000950120] |Whether this truly is the most important distinction is unclear, but to me it is.
[224000950130] |I think of the former set as a sort of "top down" or "technology-push" techniques, while the latter are a sort of "bottom up" or "application-pull" techniques.
[224000950140] |Both are entirely reasonable and good ways of going at looking at the problem, but as of now the connections between the two types are only weakly understood.
[224000950150] |An alternative division is between generative techniques (HMMs), conditional methods (CRFs) and margin-based techniques (everything else, minus Searn which is sort of non-committal with respect to this issue).
[224000950160] |I don't really think this is an important distinction, because with the exception of generative being quite different, conditional methods and margin-based methods are essentially the same in my mind.
[224000950170] |(Yes, I understand there are important differences, but it seems that in the grand scheme of things, this is not such a relevant distinction.)
[224000950180] |A related issue is whether the algorithms admit a kernelized version.
[224000950190] |Pretty much all do, and even if not, I also don't see this as a very important issue.
[224000950200] |There are other issues, some of which are mentioned in the Lafferty/Wasserman paper.
[224000950210] |One is consistency.
[224000950220] |(For those unfamiliar with the term, a consistent algorithm is essentially one that is guaranteed to find the optimal solution if given infinite amounts of data.)
[224000950230] |CRFs are consistent.
[224000950240] |M3Ns are not.
[224000950250] |Searn is not, even if you use a consistent classifier as the base learning algorithm.
[224000950260] |My sense is that to statisticians, things like consistency matter a lot.
[224000950270] |In practice, my opinion is that they're less important because we never have that much data.
[224000950280] |Even within the context of techniques that don't require tractability, there is a great deal of variation.
[224000950290] |To my knowledge, this family of techniques was essentially spawned by Collins and Roark with the incremental perceptron.
[224000950300] |My LaSO thing was basically just a minor tweak on the IP.
[224000950310] |Searn is rather different, but most other published methods are largely minor variants on the IP and/or LaSO, depending on where you measure from.
[224000950320] |And I truly mean minor tweaks: the essential difference between LaSO and IP is whether you restart your search at the truth when you err.
[224000950330] |Doing restarts tends to help, and it lets you prove a theorem.
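Here is a deliberately simplified, greedy (beam-of-one) caricature of that loop, just to show where the one-line difference sits; the helper names are hypothetical and neither branch is the exact published algorithm.

    # Caricature of one training example for an IP/LaSO-style learner.
    # The only difference between the two variants is what happens after
    # an update: the incremental perceptron (IP) abandons the example,
    # while LaSO restarts the search from the gold partial output.
    def search_train_step(w, x, y_true, expand, features, restart_at_truth):
        """expand(x, state)  -> list of candidate next partial outputs
        features(x, state)   -> dict of feature counts for a partial output
        restart_at_truth     -> False ~ incremental perceptron, True ~ LaSO
        """
        def score(state):
            return sum(w.get(f, 0.0) * v for f, v in features(x, state).items())

        state = ()                             # empty partial output
        for t in range(len(y_true)):
            best = max(expand(x, state), key=score)
            gold = state + (y_true[t],)        # the correct next partial output
            if best != gold:                   # search error: perceptron update
                for f, v in features(x, gold).items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in features(x, best).items():
                    w[f] = w.get(f, 0.0) - v
                if not restart_at_truth:
                    return w                   # IP-style: give up on this example
                state = gold                   # LaSO-style: restart at the truth
            else:
                state = best
        return w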
[224000950340] |We later had a NIPS workshop paper that restarted, but allowed the algorithm to take a few steps off the right path before doing so.
[224000950350] |This helped experimentally and also admitted another theorem.
[224000950360] |I've seen other work that does essentially the same thing, but tweaks how updates are done exactly and/or how restarts happen.
[224000950370] |The fact that we're essentially trying all permutations tells me that many things are reasonable, and we're currently in the "gather data" part of trying to figure out the best way to do things.
[224000960010] |Language and X
[224000960020] |I feel it is increasingly interesting to look at problems that have both a linguistic and a non-linguistic signal.
[224000960030] |The simplest example I can think of here is recent work in looking at images and their captions (eg., the matching words and pictures approach, though this is not the only such example).
[224000960040] |The idea in many of these techniques is that we have a single data set that contains two types of data.
[224000960050] |One is language (eg., the captions) and the other is something else (eg., the images).
[224000960060] |There are many other potentially interesting sources of data like this that I'll discuss after saying why I think these problems are interesting.
[224000960070] |I think there are two basic reasons why such "dual" problems are interesting:
[224000960080] |By looking at multiple modalities, we can get cross-modality reinforcement of what we're learning.
[224000960090] |I think a great example of that is this paper, which uses face recognition together with textual coreference in order to do unsupervised coref in captions.
[224000960100] |The idea is that if you can tell that two different pictures both contain the entity we call "Bill Clinton", then the names in their captions are more likely to corefer (a toy version of this idea is sketched below).
[224000960110] |This gives us a greater ability to share data across a single data set.
[224000960120] |When language is treated in the context of some other source of information, we get a sort of grounding effect.
[224000960130] |That is, continuing the coref example from #1, we have--in some sense--grounded the string "Bill Clinton" to a pictorial representation of a person.
[224000960140] |Sure, this representation is still just a bunch of symbols, but they're markedly different symbols from the ones we use to represent the entity in purely linguistic terms.
[224000960150] |Perhaps hard-core cognitive scientists wouldn't call this grounding, but it's getting there.
[224000960160] |(Also along these lines, one of my favorite papers from a few years ago was my ex-ISI friend Mike doing grounding by looking at language and action in video games.)
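As mentioned above, here is a toy sketch of that kind of cross-modal reinforcement (my own illustration, not the model from the cited paper); the similarity functions and the weights are placeholders.

    # Toy cross-modal coreference score for two caption mentions.
    # If the faces in the two images look like the same person, we boost
    # the belief that the names in the captions corefer.
    def coref_score(mention_a, mention_b, face_a, face_b,
                    name_similarity, face_similarity,
                    text_weight=0.7, image_weight=0.3):
        """name_similarity(m1, m2) -> [0, 1], e.g. string overlap of names
        face_similarity(f1, f2)    -> [0, 1], e.g. from a face recognizer
        The weights are made up purely for illustration.
        """
        text_signal = name_similarity(mention_a, mention_b)
        image_signal = face_similarity(face_a, face_b)
        return text_weight * text_signal + image_weight * image_signal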
[224000960170] |There are probably other reasons why such tasks are interesting, but these are the two that appeal to me most strongly.
[224000960180] |It seems that vision+language is the hot topic, or at least the warmest of the bunch.
[224000960190] |This is probably because vision people and language people tend to overlap and meet at machine learning conferences.
[224000960200] |It's probably also because the sorts of techniques used by the two communities are perhaps the most similar.
[224000960210] |But I think there's lots of room for looking at other "X"s.
[224000960220] |For instance, biology.
[224000960230] |There are lots of data sets (eg., GEO) that contain textual information ("these are cell lines depicting ovarian cancer") plus the actual cell lines.
[224000960240] |Heck, many papers in PubMed contain such information, albeit in figures rather than matrices.
[224000960250] |Robotics is another option.
[224000960260] |Ray Mooney's group down in Texas has worked at understanding orders given to robocup competitors based on language information (eg., this paper).
[224000960270] |Perhaps the oldest that actually lives within the NLP community is NLP + databases, which we really haven't seen much of in the past 5-10 years.
[224000960280] |I think this is an interesting and fruitful area of future research and is one that I'll probably be exploring myself (but I won't give away which "X"s!).
[224000970010] |Loss Functions for Chunking Tasks
[224000970020] |I feel like this is starting to be a bit of a dead horse, but I wanted to follow up a bit on previous posts talking about f-score versus accuracy for chunking problems.
[224000970030] |An easy observation is that if you have a chunking problem for which the majority of the chunks are multi-word tokens, then it is possible to get a model that achieves quite good accuracy, but abysmal f-score.
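To see how extreme this can get, consider a toy example (mine, not taken from the experiments below): one five-word chunk whose prediction is split in two by a single wrong tag. Token accuracy is 80%, but chunk F-score is zero, and with longer chunks the accuracy climbs toward 100% while F stays at zero. A minimal sketch using plain B/I/O tags:

    # Toy illustration of token accuracy versus exact-match chunk F-score.
    def chunks(tags):
        """Extract (start, end) spans from a B/I/O tag sequence."""
        spans, start = [], None
        for i, t in enumerate(tags):
            if t == "B":
                if start is not None:
                    spans.append((start, i))
                start = i
            elif t == "O":
                if start is not None:
                    spans.append((start, i))
                start = None
        if start is not None:
            spans.append((start, len(tags)))
        return set(spans)

    def accuracy(gold, pred):
        return sum(g == p for g, p in zip(gold, pred)) / len(gold)

    def f_score(gold, pred):
        g, p = chunks(gold), chunks(pred)
        correct = len(g & p)
        if correct == 0:
            return 0.0
        prec, rec = correct / len(p), correct / len(g)
        return 2 * prec * rec / (prec + rec)

    gold = ["B", "I", "I", "I", "I"]   # one five-word chunk
    pred = ["B", "I", "I", "B", "I"]   # one wrong tag splits it into two chunks
    print(accuracy(gold, pred))        # 0.8 -- token accuracy looks decent
    print(f_score(gold, pred))         # 0.0 -- no chunk matches exactly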
[224000970040] |Of course, real world taggers may not actually do this.
[224000970050] |What I wanted to know was the extent to which accuracy, f-score and ACE score are not correlated when used with a real world tagger.
[224000970060] |Here's the experiment.
[224000970070] |I use TagChunk for all the experiments.
[224000970080] |I do experiments on both syntactic chunking (CoNLL data) and NER (also CoNLL data), both in English. In the experiments, I vary (A) the amount of training data used, (B) the size of the beam used by the model, and (C) the number of training iterations run.
[224000970090] |When (A) is varied, I run five sets of training data, each randomly selected.
[224000970100] |The data set sizes I used are: 8, 16, 32, 64, 125, 250, 500, 1000, 2000, 4000 and 8000.
[224000970110] |(These are numbers of sentences, not words.
[224000970120] |For the NER data, I remove the "DOCSTART" sentences.)
[224000970130] |The beam sizes I use are 1, 5 and 10.
[224000970140] |The number of iterations is from 1 to 10.
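In code, the sweep looks roughly like this (a sketch only: "train_and_eval" is a hypothetical wrapper around TagChunk, and the loaded sentence list is assumed):

    # Sketch of the experimental grid; train_and_eval is a hypothetical
    # wrapper that trains TagChunk and returns (accuracy, f_score) on test.
    import random

    DATA_SIZES = [8, 16, 32, 64, 125, 250, 500, 1000, 2000, 4000, 8000]  # sentences
    BEAM_SIZES = [1, 5, 10]
    MAX_ITERATIONS = 10

    def run_grid(all_sentences, train_and_eval, seed=0):
        rng = random.Random(seed)
        results = []
        for size in DATA_SIZES:
            for trial in range(5):                 # five random subsamples per size
                train = rng.sample(all_sentences, size)
                for beam in BEAM_SIZES:
                    for iters in range(1, MAX_ITERATIONS + 1):
                        acc, f = train_and_eval(train, beam, iters)
                        results.append((size, trial, beam, iters, acc, f))
        return results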
[224000970150] |For the chunking problem, I tracked Hamming accuracy and f-score (on test) in each of these settings.
[224000970160] |These are drawn below (click on the image for a bigger version):
[224000970170] |As we can see, the relationship is ridiculously strongly linear.
[224000970180] |Basically once we've gotten above an accuracy of 80%, it would be really really hard to improve accuracy and not improve F-score.
[224000970190] |The correlation coefficient for this data is 0.9979 and Kendall's tau is 0.9795, both indicating incredibly strong correlation (formal caveat: these samples are not independent).
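For reference, the two statistics can be computed from the paired measurements along these lines (the numbers below are placeholders, not the actual experimental values):

    # Sketch of computing the correlation statistics from paired
    # (accuracy, f_score) measurements; the data here are made up.
    from scipy.stats import pearsonr, kendalltau

    accuracies = [0.81, 0.86, 0.90, 0.93, 0.95, 0.96]   # placeholder values
    f_scores   = [0.55, 0.68, 0.78, 0.85, 0.89, 0.91]

    corr, _ = pearsonr(accuracies, f_scores)    # linear correlation coefficient
    tau, _  = kendalltau(accuracies, f_scores)  # rank correlation (Kendall's tau)
    print("corr = %.4f, tau = %.4f" % (corr, tau))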
[224000970200] |For the NER task, I do the same experiments, but this time I keep track of accuracy, F-score and (a slightly simplified version of) the ACE metric.
[224000970210] |The results are below (again, click for a bigger version):
[224000970220] |The left-most image is accuracy-versus-F, the middle is accuracy-versus-ACE and the right is F-versus-ACE.
[224000970230] |The ACE metric seems to be the outlier: it produces the least correlation.
[224000970240] |As before, with accuracy-to-F, we get a ridiculously high correlation coefficient (0.9959) and tau (0.9761).
[224000970250] |This drops somewhat when going to accuracy-to-ACE (corr=0.9596 and tau=0.9286) or to F-to-ACE (corr=0.9715 and tau=0.9253).
[224000970260] |Nevertheless, the majority of the non-linearity occurs in the "very low accuracy" region.
[224000970270] |Here, that region is in the 0.8-0.9 range, not the 0.1-0.5 range as in chunking.
[224000970280] |This is because in chunking, almost every word is in a chunk, whereas in NER there are a ton of "out of chunk" words.
[224000970290] |The take-away from these experiments is that it seems like, so long as you have a reasonably good model (i.e., once you're getting accuracies that are sufficiently high), it doesn't really matter what you optimize.
[224000970300] |If your model is terrible or if you don't have much data, then it does.
[224000970310] |It also seems to make a much bigger difference whether the end metric is F or ACE.
[224000970320] |For F, it's pretty much always okay to just optimize accuracy.
[224000970330] |For ACE, it's not so much, particularly if you don't have sufficient data.
[224000990010] |Poker Face
[224000990020] |Apologies for the long silence -- life got in the way.
[224000990030] |Part of life was AIstats, which I'll blog about shortly, but I'm waiting for them to get the proceedings online so I can (easily) link to papers.
[224000990040] |A common (often heated) argument in the scientific community is the "blind" versus "not-blind" review process -- i.e., can the reviewers see the identity of the authors.
[224000990050] |Many other bloggers have talked about this in the past (here and here, for instance).
[224000990060] |I don't want to talk about this issue right now, but rather about one specific consequence of it that I don't think actually needs to be a consequence.
[224000990070] |But it often is.
[224000990080] |And I think that more than anything else, it does serve to hurt research.
[224000990090] |Recent anecdote: Less than a month ago, I was talking with my friend John Blitzer at UPenn over email about domain adaptation stuff.
[224000990100] |We both care about this problem and have worked on different angles of it.
[224000990110] |Initially we were talking about his NIPS paper, but the discussion diffused into more general aspects of the problem.
[224000990120] |I had been at the time working furiously on a somewhat clever but overall incredibly simple approach to domain adaptation (managed to make it into ACL this year -- draft here).
[224000990130] |The interesting thing about this approach was that it completely went against the theoretical bounds in his paper (essentially because those are crafted to be worst case).
[224000990140] |It doesn't contradict the bounds, of course, but it shows that good adaptation is possible in many cases where the bounds give no information.
[224000990150] |Of course I told him about this, right?
[224000990160] |Well, actually, no.
[224000990170] |I didn't.
[224000990180] |In retrospect this was stupid and a mistake, but it's a stupid mistake I've made over and over again.
[224000990190] |Why did I do this?
[224000990200] |Because at the time I knew that the paper was on its way to ACL -- perhaps I had even submitted at that point; I cannot remember.
[224000990210] |And if he happened to be one of the reviewers on it, then he would obviously know it was me and then the "double blind" aspect of ACL would be obviated.
[224000990220] |Or---perhaps worse---he would have marked it as a conflict of interest (though indeed I don't think it was) because he knew my identity.
[224000990230] |And then I would have lost the opinion of a pretty smart guy.
[224000990240] |All this is to say: I think there's often a temptation not to talk about ongoing research because of the double-blind rule.
[224000990250] |But this is ridiculous because the people who are close to you in research area are exactly those to whom you should talk!
[224000990260] |I don't think the solution has anything to do with reversing double-blind (though that would solve it, too).
[224000990270] |I think the solution is just to realize that in many cases we will know the authorship of a paper we review, and we shouldn't try so hard to hide this.
[224000990280] |Hiding it only hinders progress.
[224000990290] |We should talk to whomever we want about whatever we want, regardless of whether this person may or may not review a paper on the topic later.
[224000990300] |(As a reviewer, their identity is hidden anyway, so who cares!)
[224000990310] |(Briefly about conflict of interest.
[224000990320] |I used to be somewhat liberal in saying I had a COI for a paper if I knew who the author was.
[224000990330] |This is a bad definition of COI, since it means that I have a COI with nearly every paper in my area(s).
[224000990340] |A true COI should be when I have something to gain by this paper being published.
[224000990350] |E.g., it is written by my advisor, student, or very very very close colleague -- i.e., one with whom I publish regularly, though even that seems a bit of a stretch.)