[222004370010] |
The Y-2011 bug
[222004370020] |Because I accidentally typed "2010" as the year on a few posts that were written in advance, today's post and a few others won't show up correctly in Reader or Twitter.
[222004370030] |Some people will have seen these posts, but some not.
[222004370040] |Today's post, I think, is particularly worth reading, so check it out here.
[222004380010] |Findings: Which of my posts do you like best?
[222004380020] |It will surprise nobody that I like data.
[222004380030] |By extension, it should surprise nobody that what I like about blogging is getting instant feedback on whether people found a post interesting and relevant or not.
[222004380040] |This is in contrast to writing a journal article, where you will wait at least a year or two before anyone starts citing you (if they ever do).
[222004380050] |How I feel about data.
[222004380060] |Sometimes the results are surprising.
[222004380070] |I expected my posts on the suspicious data underlying recent graduate school rankings to make a splash, but the two posts together got a grand total of 2 comments and 16 tweets (some of which are automatically generated by FieldofScience).
[222004380080] |I didn't expect posts on my recent findings regarding pronoun processing to generate that much interest, but they got 6 comments and 26 tweets, putting them among the most popular, at least as far as Twitter is concerned.
[222004380090] |To get a sense of which topics you, dear readers, find the most interesting, I compiled the statistics from all my posts from the fall semester and tabulated those data according to the posts' tags.
[222004380100] |Tags are imperfect, as they reflect only how I decided to categorize the post, but they're a good starting point.
[222004380110] |Here are the results, sorted by average number of retweets:
[222004380120] |Since we all know correlation = causation, if I want to make a really popular post, I should label it "findings, publication, peer review".
[222004380130] |If I want to ensure it is ignored, I shouldn't give it a label at all.
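(For the curious, the tabulation itself is nothing fancy. Below is a rough sketch in Python of how one might do it; the numbers, titles and column names are made up for illustration and are not my actual data.)

```python
# A minimal sketch of tabulating post statistics by tag (made-up data).
import pandas as pd

# Each row is one post; 'tags' holds a list of tag strings.
posts = pd.DataFrame({
    "title":    ["Post A", "Post B", "Post C"],
    "tags":     [["findings", "peer review"], ["findings"], []],
    "tweets":   [26, 16, 2],
    "comments": [6, 2, 0],
})

# Give untagged posts a placeholder so they show up in the tabulation too.
posts["tags"] = posts["tags"].apply(lambda ts: ts if ts else ["(no tag)"])

# One row per (post, tag) pair, then average the stats within each tag,
# sorted by average number of (re)tweets.
by_tag = (posts.explode("tags")
               .groupby("tags")[["tweets", "comments"]]
               .mean()
               .sort_values("tweets", ascending=False))
print(by_tag)
```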
[222004380140] |At this point, I'd like to turn it over to the crowd.
[222004380150] |Are these the posts you want to see?
[222004380160] |If not, what do you want to read more about?
[222004380170] |Or if you think about your favorite blogs, what topics do you enjoy seeing on those blogs?
[222004390010] |Mendeley -- Not quite ready for prime time
[222004390020] |Prompted by Prodigal Academic, I decided to give Mendeley a shot.
[222004390030] |That is, instead of working on a long-overdue draft of a paper.
[222004390040] |Mendeley is two things.
[222004390050] |First, it is a PDF library/reader.
[222004390060] |Second, it is a citation manager.
[222004390070] |Currently, I use Papers for the first and Endnote for the second.
[222004390080] |Both work well enough -- if not perfectly -- but it is a pain that I have to enter every paper I want to cite into two different programs.
[222004390090] |(Don't tell me I could export my Papers citations library to Endnote.
[222004390100] |First, I'd have to do that every time I update my library, which is annoying.
[222004390110] |Second, Papers was created by someone who clearly never cites books, book chapters, conference proceedings, etc.
[222004390120] |So I'd have to fix all of those in Endnote ... every time I export.)
[222004390130] |(Also, don't tell me about Zotero.
[222004390140] |Maybe it's gotten better in the last year since I tried it, but it was seriously feature-deficient and buggy beyond all belief.)
[222004390150] |First glance
[222004390160] |At first, I was pleasantly surprised.
[222004390170] |Unlike Papers, Mendeley is free so long as you don't want to use their Cloud functionality much (I don't).
[222004390180] |Papers is convinced there are people named Marc Hauser, Marc D Hauser, M D Hauser, and M Hauser.
[222004390190] |Mendeley can be led astray but has some nice options to allow you to collapse two different author records -- or two different keywords.
[222004390200] |(On that note, my Papers library has implicit causality, Implicit causality and Implicit Causality all as different keywords.
[222004390210] |Once Papers has decided the keyword for a paper is, say, Implicit Causality, nothing on G-d's green Earth will convince it to switch to implicit causality.
[222004390220] |And its searches are case sensitive.
[222004390230] |Mendeley has none of these "features.")
[222004390240] |Also, Mendeley will let you annotate PDFs and export the PDFs with your annotations in a format readable by other PDF viewers (if, for instance, you wanted to share your annotated PDF with someone).
[222004390250] |That's a nice feature.
[222004390260] |These would all be nice additional features if the core functionality of Mendeley were there.
[222004390270] |I'm sorry to say that the product just doesn't seem to be ready for prime time.
[222004390280] |I typed "prime time" into Flickr, and this is what it gave me.
[222004390290] |Not sure why. Photo credit here.
[222004390300] |Second glance
[222004390310] |The first disappointment is that Mendeley does not have smart collections.
[222004390320] |Like smart playlists in iTunes, smart collections are collections of papers defined by various search terms.
[222004390330] |If you have a smart collection that indexes all articles with the keywords "implicit causality," "psych verbs" and "to read", then whenever you add a new paper with those keywords, it automatically goes into the smart collection.
[222004390340] |This is very handy, and it's an excellent feature of Papers (except that, as mentioned above, my smart folder for implicit causality searches for the keywords "implicit causality," "Implicit causality" OR "Implicit Causality").
[222004390350] |I suspect Mendeley doesn't have smart collections because it doesn't have a serious search function.
[222004390360] |You can search for papers written by a given author or with a given keyword, but if you want to search for papers written by the conjunction of two authors or any paper on "implicit causality" written by Roger Brown, you're out of luck.
[222004390370] |Rather, it'll perform the search.
[222004390380] |It just won't find the right papers.
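(To be concrete about what I mean by a smart collection and a conjunctive search, here is a toy sketch in Python. The Paper records and the little library are invented for illustration; this is not how Mendeley or Papers are actually implemented.)

```python
# Toy illustration of a "smart collection": a saved, case-insensitive filter
# over a paper library, plus a conjunctive author search.
# The data model is invented; it is not Mendeley's or Papers' internal format.
from dataclasses import dataclass, field

@dataclass
class Paper:
    title: str
    authors: list
    keywords: set = field(default_factory=set)

library = [
    Paper("Psych verbs and implicit causality", ["Brown", "Fish"],
          {"Implicit Causality", "psych verbs", "to read"}),
    Paper("Pronouns in discourse", ["Brown"], {"pronouns"}),
]

def smart_collection(library, required_keywords):
    """A smart collection is just a stored filter: it stays current because it
    is re-evaluated over the whole library every time you open it."""
    required = {k.lower() for k in required_keywords}
    return [p for p in library if required <= {k.lower() for k in p.keywords}]

def by_both_authors(library, a1, a2):
    """Conjunctive search: papers co-authored by both a1 and a2."""
    return [p for p in library if a1 in p.authors and a2 in p.authors]

print([p.title for p in smart_collection(library, ["implicit causality", "psych verbs", "to read"])])
print([p.title for p in by_both_authors(library, "Brown", "Fish")])
```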
[222004390390] |Third glance
[222004390400] |That might be forgivable if the citation function in Mendeley was usable.
[222004390410] |The idea is that as you write a manuscript, when you want to cite, say, my paper on over-regularization (18 citations and counting!), you would click on a little button that takes you to Mendeley.
[222004390420] |You find my paper in your PDF library, click another button, and (Hartshorne & Ullman, 2006) appears in your Word document (or NeoOffice or whatever) and the full bibliographic reference appears in your manuscript's bibliography.
[222004390430] |You can even choose what citation style you're using (e.g., APA).
[222004390440] |Sort of. Let's say you want to cite two different papers by Roger Brown and Deborah Fish, both published in 1983 (which, in fact, I did want to do).
[222004390450] |Here's what it looks like:
[222004390460] |Implicit causality effects are found in both English (Brown & Fish, 1983) and Mandarin (Brown & Fish, 1983).
[222004390470] |At least in APA style, those two papers should be listed as (Brown & Fish, 1983a) and (Brown & Fish, 1983b), because obviously otherwise nobody has any idea which paper you are citing.
[222004390480] |This gets worse.
[222004390490] |Suppose you wrote:
[222004390500] |Implicit causality effects have been found in multiple languages (Brown & Fish, 1983; Brown & Fish, 1983).
[222004390510] |Correct APA 5th Ed. style is, I believe, (Brown & Fish, 1983a, 1983b).
[222004390520] |Actually, I'm not sure what exactly the correct style is, because Endnote always takes care of it for me.
[222004390530] |There are other issues.
[222004390540] |Mendeley doesn't have a mechanism for suppressing the author.
[222004390550] |So you end up with:
[222004390560] |As reported by Brown and Fish (Brown & Fish, 1983; Brown & Fish, 1983), verbs have causality implicit in their meaning.
[222004390570] |instead of
[222004390580] |As reported by Brown and Fish (1983a, 1983b), verbs have causality implicit in their meaning.
[222004390590] |Nor does Mendeley know about et al:
[222004390600] |Hauser, Chomsky and Fitch (Hauser, Chomsky & Fitch, 2001) put forward a new proposal....blah blah blah...as has been reported several times in the literature (Hauser, Chomsky & Fitch, 2001; Brown & Fish, 1983; Brown & Fish, 1983).
[222004390610] |That is, the second time you cite a paper with more than 2 authors, it doesn't contract to (Hauser et al. 2001).
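(The bookkeeping a citation manager has to do here is not exotic. Here is a toy sketch in Python of the two behaviors I'm complaining about -- year suffixes for same-authors/same-year papers, and et al. contraction on repeat citations. It illustrates the convention as I understand it, not any real citation manager's code.)

```python
# Toy in-text citation formatter illustrating (1) 1983a/1983b year suffixes and
# (2) "et al." contraction on repeat citations. Illustrative only; not real APA/CSL code.
from collections import defaultdict

def label_years(papers):
    """papers: list of (authors_tuple, year). Returns a label per paper index,
    adding 'a', 'b', ... whenever the same authors share a year."""
    groups = defaultdict(list)
    for i, (authors, year) in enumerate(papers):
        groups[(authors, year)].append(i)
    labels = {}
    for (authors, year), idxs in groups.items():
        for n, i in enumerate(idxs):
            suffix = "abcdefghijklmnopqrstuvwxyz"[n] if len(idxs) > 1 else ""
            labels[i] = f"{year}{suffix}"
    return labels

def cite(i, papers, labels, seen):
    authors, _ = papers[i]
    if len(authors) > 2 and i in seen:
        names = f"{authors[0]} et al."                           # repeat citation of a 3+ author paper
    elif len(authors) == 1:
        names = authors[0]
    elif len(authors) == 2:
        names = f"{authors[0]} & {authors[1]}"
    else:
        names = ", ".join(authors[:-1]) + f", & {authors[-1]}"   # first citation: full author list
    seen.add(i)
    return f"({names}, {labels[i]})"

papers = [(("Brown", "Fish"), 1983),
          (("Brown", "Fish"), 1983),
          (("Hauser", "Chomsky", "Fitch"), 2001)]
labels = label_years(papers)
seen = set()
print(cite(0, papers, labels, seen), cite(1, papers, labels, seen))  # (Brown & Fish, 1983a) (Brown & Fish, 1983b)
print(cite(2, papers, labels, seen), cite(2, papers, labels, seen))  # full list first, then (Hauser et al., 2001)
```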
[222004390620] |Unfortunately, there is no work-around for any of these problems.
[222004390630] |In theory, you can edit the citations to make them match APA style.
[222004390640] |Within a few seconds, a friendly dialog box pops up and asks you if you really want to keep your edited citation.
[222004390650] |You can click "OK" or click "cancel," but either way it just changes your carefully-edited citation back to its default -- at least it does on my Mac (the forums suggest that this works for some people).
[222004390660] |It's possible that people who don't use APA won't have as many of these problems.
[222004390670] |Numbered citations, for instance, probably work fine.
[222004390680] |I've never submitted a paper anywhere that used numbered citations, though.
[222004390690] |So I either need to switch professions or continue using Endnote to write my papers.
[222004390700] |Hopefully
[222004390710] |One can hope that Mendeley will solve some of these issues.
[222004390720] |I found discussions on their "suggested features" forum going back many months for each of the problems discussed above, which suggests I may be waiting a while for these fixes.
[222004390730] |I do understand that Mendeley is technically in beta testing.
[222004390740] |But it's been in beta testing for over two years, so that's not really an excuse at this point.
[222004390750] |Alternatively, maybe Papers will add a good citation feature (and discover books).
[222004390760] |Or maybe Zotero will confront its own demons.
[222004390770] |I'm going to have to wait and see.
[222004390780] |It makes one appreciate Endnote.
[222004390790] |Yes, it's a dinosaur.
[222004390800] |No, it hasn't added any really usable features since I started using it in 2000.
[222004390810] |But it worked then, and it still works now.
[222004390820] |There's something to be said for that.
[222004400010] |Learning the passive
[222004400020] |If Microsoft Word had its way, passive verbs would be excised from the language.
[222004400030] |That would spare children some problems, because passive verbs are more difficult to learn than one might think: not all verbs passivize.
[222004400040] |Consider:
[222004400050] |*The bicycle was resembled by John. *Three bicycles are had by John. *Many people are escaped by the argument.
[222004400060] |The bicycle was resembled by John: A how-to guide.
[222004400070] |So children must learn which verbs have passives and which don't.
[222004400080] |I recently sat down to read Pinker, Lebeaux and Frost (1987), a landmark study of how children learn to passivize verbs.
[222004400090] |This is not a work undertaken lightly.
[222004400100] |At 73 pages, Pinker et al. (1987) is not Steve Pinker's longest paper -- that honor goes to his 120-page take-down of Connectionist theories of language, Pinker and Prince (1988) -- but it is long, even for psycholinguistics.
[222004400110] |It's worth the read, both for the data and because it lays out the core of what became Learnability and Cognition, one of the books that has had the most influence on my own work and thinking.
[222004400120] |The Data
[222004400130] |The authors were primarily interested in testing the following claim: that children are conservative learners and only passivize verbs that they have previously heard in the passive.
[222004400140] |This would prevent them from over-generating passives that don't exist in the adult language.
[222004400150] |First, the authors looked at a database of transcriptions of child speech.
[222004400160] |A large percentage of the passive verbs they found were passives the children couldn't possibly have heard before because they aren't legal passives in the adult language:
[222004400170] |It's broked? (i.e., is it broken?)
[222004400180] |When I get hurts, I put dose one of does bandage on.
[222004400190] |He all tieded up, Mommy.
[222004400200] |Of course, when we say that the child couldn't have heard such passives before, we can't really be sure what the child heard.
[222004400210] |It just seems unlikely.
[222004400220] |To more carefully control what the child had heard, the authors taught children of various ages (the youngest group was 4 years old) made-up verbs.
[222004400230] |For instance, they might demonstrate a stuffed frog jumping on top of a stuffed elephant and say, "Look, the frog gorped the elephant."
[222004400240] |Then they would show the elephant jumping on top of a mouse and ask the child, "What happened to the mouse?"
[222004400250] |If you think "gorp" has a passive form, the natural thing to do would be to say "The mouse was gorped by the elephant."
[222004400260] |But a child who only uses passive verbs she has heard before would refuse to utter such a sentence.
[222004400270] |However, across a range of different made-up verbs and across four different experiments, the authors found that children were willing -- at least some of the time -- to produce these new passive verbs.
[222004400280] |(In addition to production tests, there were also comprehension tests where the children had to interpret a passivization of an already-learned verb.)
[222004400290] |Some Considerations
[222004400300] |These data conclusively proved that children are not completely conservative, at least not by 4 years of age (there has been a lot of debate more recently about younger children).
[222004400310] |With what we know now, we know that the conservative child theory had to be wrong -- again, at least for 4 yos -- but it's worth remembering that at the time, this was a serious hypothesis.
[222004400320] |There are a lot of other data in the paper.
[222004400330] |Children are more likely to produce new passive forms as they get older (higher rates for 5 year-olds than 4 year-olds).
[222004400340] |They taught children verbs where the agent is the object and the patient is the subject (that is, where The frog gorped the elephant means "the elephant jumped on top of the frog").
[222004400350] |Children had more difficulty passivizing those verbs.
[222004400360] |However, a lot of these additional analyses are difficult to interpret because of the small sample sizes (16 children and only a handful of verbs per experiment or sub-experiment).
[222004400370] |Theory
[222004400380] |Fair warning: the rest of this post is pretty technical.
[222004400390] |What excites me about this paper is the theoretical work.
[222004400400] |For instance, the authors propose a theory of linking rules that have strong innate constraints and yet still some language-by-language variation.
[222004400410] |The linkages between individual thematic roles in thematic cores and individual grammatical functions in predicate-argument structures is in turn mediated by a set of unmarked universal linking rules: agents are mapped onto subjects; patients are mapped onto objects; locations and paths are mapped onto oblique objects.
[222004400420] |Themes are mapped onto any unique grammatical function but can be expressed as oblique, object or subject; specifically, as the 'highest' function on that list that has not already been claimed by some other argument of the verb.
[222004400430] |With respect to passivization, what is important is that only verbs which have agents as subjects are going to be easily passivized.
[222004400440] |The trick is that what counts as an 'agent' can vary from language to language.
[222004400450] |It is common for languages to restrict passivized subjects to patients affected by an action ...
[222004400460] |The English verbal passive, of course, is far more permissive; most classes of transitive verbs, even those that do not involve physical actions, have the privilege of passivizability assigned to them.
[222004400470] |We suggest this latitude is possible because what counts as the patient of an action is not self-evident ...
[222004400480] |Languages have the option of defining classes in which thematic labels are assigned to arguments whose roles abstractly resemble those of physical thematic relations...
[222004400490] |This last passage sets up the core of the theory to be developed in Learnability and Cognition.
[222004400500] |Children are born knowing that certain canonical verbs -- ones that very clearly have agents and patients, like break -- must passivize, and that a much larger group of verbs in theory might passivize, because they could be conceived of as metaphorically having agents and patients.
[222004400510] |What they have to learn is which verbs from that broader set actually do passivize.
[222004400520] |Importantly, verbs come in classes of verbs with similar meanings.
[222004400530] |If any verb from that set passivizes, they all will.
[222004400540] |This last prediction is the one I am particularly interested in.
[222004400550] |A later paper (Gropen, Pinker, Hollander, Goldberg & Wilson, 1989) explored this hypothesis with regard to the dative alternation, but I don't know of much other work.
[222004400560] |In general, Learnability and Cognition got less attention than it should have, perhaps because by the time it was published, the Great Past Tense Debate had already begun.
[222004400570] |I've often thought of continuing this work, but teaching novel verbs to children in the course of an experiment is damn hard.
[222004400580] |Ben Ambridge has recently run a number of great studies on the acquisition of verb alternations (like the passive), so perhaps he will eventually tackle this hypothesis directly.
[222004400590] |---- Pinker S, Lebeaux DS, and Frost LA (1987).
[222004400600] |Productivity and constraints in the acquisition of the passive.
[222004400610] |Cognition, 26 (3), 195-267. PMID: 3677572
[222004410010] |New Experiment: EmotionSense
[222004410020] |I just posted a new experiment on the website: EmotionSense.
[222004410030] |I have lately gotten very interested in verb-learning, specifically how we decide which of the participants in an event should be the grammatical subject, which the grammatical object, etc. (see this post and this one).
[222004410040] |In this experiment, you'll answer some questions about different types of emotions.
[222004410050] |I'll use this information to help design some upcoming verb-learning experiments.
[222004410060] |As usual, the experiment is short and should take 3-5 minutes.
[222004410070] |[snappy caption goes here]
[222004410080] |-----photo credit here
[222004420010] |Learning What Not to Say
[222004420020] |A troubling fact about language is that words can be used in more than one way.
[222004420030] |For instance, I can throw a ball, I can throw a party, and I can throw a party that is also a ball.
[222004420040] |These cats are having a ball.
[222004420050] |The Causative Alternation
[222004420060] |Sometimes the relationship between different uses of a word is completely arbitrary.
[222004420070] |If there's any relationship between the different meanings of ball, most people don't know it.
[222004420080] |But sometimes there are straightforward, predictable relationships.
[222004420090] |For instance, consider:
[222004420100] |John broke the vase.
[222004420110] |The vase broke.
[222004420120] |Mary rolled the ball.
[222004420130] |The ball rolled.
[222004420140] |This is the famous causative alternation.
[222004420150] |Some verbs can be used with only a subject (The vase broke. The ball rolled) or with a subject and an object (John broke the vase. Mary rolled the ball).
[222004420160] |The relationship is highly systematic.
[222004420170] |When there is both a subject and an object, the subject has done something that changed the object.
[222004420180] |When there is only a subject, it is the subject that undergoes the change.
[222004420190] |Not all verbs work this way:
[222004420200] |Sally ate some soup.
[222004420210] |Some soup ate.
[222004420220] |Notice that Some soup ate doesn't mean that some soup was eaten, but rather has to mean nonsensically that it was the soup doing the eating.
[222004420230] |Some verbs simply have no meaning at all without an object:
[222004420240] |Bill threw the ball. *The ball threw.
[222004420250] |In this case, The ball threw doesn't appear to mean anything, nonsensical or otherwise (signified by the *).
[222004420260] |Try:
[222004420270] |*John laughed Bill.
[222004420280] |Bill laughed.
[222004420290] |Here, laughed can only appear with a subject and no object.
[222004420300] |The dative alternation
[222004420310] |Another famous alternation is the dative alternation:
[222004420320] |John gave a book to Mary.
[222004420330] |John gave Mary a book.
[222004420340] |Mary rolled the ball to John.
[222004420350] |Mary rolled John the ball.
[222004420360] |Once again, not all verbs allow this alternation:
[222004420370] |John donated a book to the library. *John donated the library a book.
[222004420380] |(Some people actually think John donated the library a book sounds OK.
[222004420390] |That's all right.
[222004420400] |There is dialectical variation.
[222004420410] |But for everyone there are verbs that won't alternate.)
[222004420420] |The developmental problem
[222004420430] |These alternations present a problem for theory: how do children learn which verbs can be used in which forms?
[222004420440] |A kid who learns that all verbs that appear with both subjects and objects can appear with only subjects is going to sound funny.
[222004420450] |But so is the kid who thinks verbs can only take one form.
[222004420460] |The trick is learning what not to say. One naive theory is that kids are very conservative.
[222004420470] |They only use verbs in constructions that they've heard.
[222004420480] |So until they hear "The vase broke," they don't think that break can appear in that construction.
[222004420490] |The problem with this theory is that lots of verbs are so rare that it's possible that (a) the verb can be used in both constructions, but (b) you'll never hear it used in both.
[222004420500] |Another possibility is that kids are wildly optimistic about verb alternations and assume any verb can appear in any form unless told otherwise.
[222004420510] |There are two problems with this.
[222004420520] |The first is that kids are rarely corrected when they say something wrong.
[222004420530] |But perhaps you could just assume that, after a certain amount of time, if you haven't heard, e.g., The ball threw, then threw can't be used without an object.
[222004420550] |The problem with that is, again, that some verbs are so rare that you'll only hear them a few times in your life.
[222004420560] |By the time you've heard that verb enough to know for sure it doesn't appear in a particular construction, you'll be dead.
[222004420570] |The verb class hypothesis
[222004420580] |In the late 1980s, building on previous work, Steven Pinker suggested a solution to this problem.
[222004420590] |Essentially, there are certain types of verbs which, in theory, could participate in a given alternation.
[222004420600] |Verbs involving caused changes (break, eat, laugh) in theory can participate in the causative alternation, and verbs involving transfer of possession (roll, donate) in theory can participate in the dative alternation, and this knowledge is probably innate.
[222004420610] |What a child has to learn is which verbs do participate in the dative alternation.
[222004420620] |For reasons described above, this can't be done one verb at a time.
[222004420630] |And this is where the exciting part of the theory comes in. Pinker (building very heavily on work by Ray Jackendoff and others) argues that verbs have core aspects of their meaning and some extra stuff.
[222004420640] |For instance, break, crack, crash, rend, shatter, smash, splinter and tear all describe something being caused to fall to pieces.
[222004420650] |What varies between the verbs is the exact manner in which this happens.
[222004420660] |Jackendoff and others argue that the shared meaning is what is important to grammar, whereas the manner of falling to pieces is extra information which, while important, is not grammatically central.
[222004420670] |Pinker's hypothesis was that verb alternations make use of this core meaning, not the "extra" meaning.
[222004420680] |From the perspective of the alternation, then, break, crack, crash, rend, shatter, smash, splinter and tear are all the same verb.
[222004420690] |So children are not learning whether break alternates; they learn whether the whole class of verbs alternates.
[222004420700] |Since there are many fewer classes than there are verbs (my favorite compendium VerbNet has only about 270), the fact that some verbs are very rare isn't that important.
[222004420710] |If you know what class it belongs to, as long as the class itself is common enough, you're golden.
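(Here is a tiny sketch of that learning logic in Python. The verb-to-class assignments and the "observations" are toy data meant only to illustrate class-level generalization; this is not an implementation of Pinker's model or of VerbNet.)

```python
# Toy sketch of class-level generalization: evidence that one verb in a class
# alternates is treated as evidence that the whole class does.
# The class assignments here are invented shorthand, not VerbNet's.
VERB_CLASS = {
    "break": "fall-to-pieces", "shatter": "fall-to-pieces", "smash": "fall-to-pieces",
    "roll": "manner-of-motion", "bounce": "manner-of-motion",
    "laugh": "emotional-expression", "giggle": "emotional-expression",
}

alternating_classes = set()

def observe_causative(verb):
    """Hearing a verb used both transitively and intransitively licenses its whole class."""
    alternating_classes.add(VERB_CLASS[verb])

def predicts_alternation(verb):
    return VERB_CLASS[verb] in alternating_classes

observe_causative("break")               # child hears "John broke the vase" / "The vase broke"
print(predicts_alternation("shatter"))   # True: same class as "break", even if never heard intransitively
print(predicts_alternation("giggle"))    # False: no member of its class has been heard alternating
```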
[222004420720] |Testing the theory
[222004420730] |This particular theory has not been tested as much as one might expect, partly because it is hard to test.
[222004420740] |It is rather trivial to show that verbs do or don't participate in alternations as a class, partly because that's how verb classes are often defined (that's how VerbNet does it).
[222004420750] |Moreover, various folks (like Stefanowitsch, 2008) argue that although speakers might notice the verb classes, that doesn't prove that people actually do use those verb classes to learn which verbs alternate and which do not.
[222004420760] |The best test, then, is to teach people -- particularly young children -- new verbs that either belong to a class that does alternate or to a class that does not and see if they think those new verbs should or should not alternate.
[222004420770] |Very few such studies have been done.
[222004420780] |Around the same time that Pinker's seminal Learnability and Cognition, which outlines the theory I described above, came out in 1989, a research team led by his student Jess Gropen (Gropen, Pinker, Hollander, Goldberg and Wilson, 1989) published a study of the dative alternation.
[222004420790] |They taught children new verbs of transfer (such as "moop," which meant to move an object to someone using a scoop), which in theory could undergo the dative alternation.
[222004420800] |The question they asked was whether kids would be more likely to use those verbs in the alternation if the verbs were monosyllabic (moop) or bisyllabic (orgulate).
[222004420810] |They were more likely to do so for the monosyllabic verbs, and in fact in English monosyllabic verbs are more likely to alternate.
[222004420820] |This issue of how many syllables the verb has did come up in Learnability and Cognition, but it wasn't -- at least to me -- the most compelling part of the story (which is why I left it out of the discussion so far!).
[222004420830] |Ambridge, Pine and Rowland (2011)
[222004420840] |Ben Ambridge, Julian Pine and Caroline Rowland of the University of Liverpool have a new study in press which is the only study to have directly tested whether verb meaning really does guide which constructions a child thinks a given verb can be used in, at least to the best of my knowledge -- and apparently to theirs, since they don't cite anyone else.
[222004420850] |(I've since learned that Brooks and Tomasello, 1999, might be relevant, but the details are sufficiently complicated and the paper sufficiently long that I'm not yet sure.)
[222004420860] |They taught children two novel verbs, one of which should belong to a verb class that participates in the causative alternation (a manner of motion verb: bounce, move, twist, rotate, float) and one of which should not (an emotional expression: smile, laugh, giggle).
[222004420870] |Just to prove to you that these classes exist, compare:
[222004420880] |John bounced/moved/twisted/rotated/floated the ball.
[222004420890] |The ball bounced/moved/twisted/rotated/floated.
[222004420900] |*John smiled/laughed/giggled Sally.
[222004420910] |Sally smiled/laughed/giggled.
[222004420920] |Two groups of children (5-6 years old and 9-10 years old) were taught both types of verbs with subjects only.
[222004420930] |After a lot of training, they were shown new sentences with the verbs and asked to rate how good the sentences were.
[222004420940] |In the case of the manner of motion verb, they liked the sentences whether the verb had a subject and an object or only a subject.
[222004420950] |That is, they thought the verb participated in the causative alternation.
[222004420960] |For the emotion expression verb, however, they thought it sounded good with a subject only; when it had both a subject and an object, they thought it did not sound good.
[222004420970] |This was true both for the older kids and the younger kids.
[222004420980] |This is, I think, a pretty nice confirmation of Pinker's theory.
[222004420990] |Interestingly, Ambridge and colleagues think that Pinker is nonetheless wrong, but based on other considerations.
[222004421000] |Partly, our difference of opinion comes from the fact that we interpret Pinker's theory differently.
[222004421010] |I think I'm right, but that's a topic for another post.
[222004421020] |Also, there is some disagreement about a related phenomenon (entrenchment), but that, too, is a long post, and the present post is long enough.
[222004421030] |____ Gropen, J., Pinker, S., Hollander, M., Goldberg, R., and Wilson, R. (1989).
[222004421040] |The Learnability and Acquisition of the Dative Alternation in English. Language, 65 (2). DOI: 10.2307/415332
[222004421050] |Ben Ambridge, Julian M. Pine, and Caroline F. Rowland (2011).
[222004421060] |Children use verb semantics to retreat from overgeneralization errors. Cognitive Linguistics.
[222004421070] |For picture credits, look here and here.
[222004430010] |New tags
[222004430020] |Rather than write a new blog post (or my nearly-due BUCLD proceedings paper), I decided to revamp the post tags on this blog.
[222004430030] |Their usage has been inconsistent, which is making it harder and harder to find old blog posts that I want to link to.
[222004430040] |Hopefully the new and improved tags will also be useful for you, dear reader.
[222004430050] |Now if you want to find any of my articles on the academic career path, on animal language or on universal grammar -- just to give a few examples -- they are only a mouse click away.
[222004430060] |In addition to standard tags, there is also a series of tags beginning with the preposition "on".
[222004430070] |These appear on most posts now and are more meta-thematic than the others.
[222004440010] |New York Times, You Can't Handle the Truth.
[222004440020] |Earlier today I wrote about the research behind an opinion article at the New York Times.
[222004440030] |When I looked at the sources cited, I was unable to find any information supporting the claims made in the article.
[222004440040] |In fact, what I found directly contradicted those claims.
[222004440050] |I finished by saying that while I was willing to believe these claims, I'd like to know what data support them.
[222004440060] |In passing, I mentioned that I had submitted an abbreviated version of this analysis as a comment on the Times website.
[222004440070] |That comment was not published.
[222004440080] |I figured maybe there had been a computer error, so I submitted another one later in the day.
[222004440090] |That one was also not published.
[222004440100] |Finally, at 6:13pm, I submitted an innocuous and useless comment under an assumed name:
[222004440110] |I agree with Pat N.
[222004440120] |It's nice to hear from someone who has some optimism (@ Dr. Q).
[222004440130] |This comment was published almost immediately.
[222004440140] |The Times states that "comments are moderated and generally will be posted if they are on-topic and not abusive." Since the moderators didn't publish the comment, we can conclude one of two things:
[222004440150] |1) Discussion of the empirical claims made in a New York Times article is not "on topic."
[222004440160] |2) Pointing out a mistake made in a New York Times article is a kind of abuse.
[222004450010] |Do students at selective schools really study less?
[222004450020] |*Updated with More Analysis*
[222004450030] |So says Philip Babcock in today's New York Times.
[222004450040] |He claims:
[222004450050] |Full-time college students in the 1960s studied 24 hours per week, on average, and their counterparts today study 14 hours per week.
[222004450060] |The 10-hour decline is visible for students from all demographic groups and of all cognitive abilities, in every major and at every type of college.
[222004450070] |The claim that this is true for "every type of college" is important because he wants to conclude that schools have lowered their standards.
[222004450080] |The alternative is that there are more low-quality schools now, or that some schools have massively lowered their standards.
[222004450090] |These are both potentially problems -- and are probably real -- but are not quite the same problem as all schools everywhere lowering their standards.
[222004450100] |So it's important to show that individual schools have lowered their standards, and that this is true for the selective schools as well as the not-selective schools.
[222004450110] |The article links to this study by Babcock.
[222004450120] |This study analyzes a series of surveys of student study habits from the 1960s to the 2000s, and thus seems to be the basis of his argument, and in fact the introduction contains almost the identical statement that I have quoted above.
[222004450130] |Nonetheless, despite these strong conclusions, the data that would support them appear to be missing.
[222004450140] |SAT scores and size are not available in the early years, so study time by college selectivity is not reported.
[222004450150] |He goes on to say that he can look at selectivity in the more recent surveys: specifically matched 1988-2003 surveys.
[222004450160] |These do show a decrease in study time on the order of 1-2 hours for high-, medium- and low-selectivity schools (I cannot find how selectivity was defined).
[222004450170] |Whether this is even statistically significant is unclear, as he does not report any statistics or confidence intervals.
[222004450180] |In any case, it is not a 10 hour difference.
[222004450190] |What Babcock might have meant, and more problems with the data
[222004450200] |It is possible that when Babcock was saying that the decrease in study time was true of all types of schools, he meant that when you look at all types of schools in 2003/4, students at all levels report studying less than the average student reported in 1961.
[222004450210] |The problem is that, for all we know, the schools in his sample were more selective in 1961 than they were in 2003/4.
[222004450220] |Moreover, there is something worrisome about his selectivity data.
[222004450230] |Whenever analyzing data, many researchers like to do what is called a "sanity check": they make sure that the data contain results that are known to be true.
[222004450240] |If you were looking at a study of different types of athletes, you might make sure that the jockeys are shorter than the basketball players, lighter than the football players and chew less tobacco than the baseball players.
[222004450250] |If you find any of these things do not hold, you might go back and make sure there isn't a typo somewhere in your data-entry process.
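(In code, a sanity check is just a handful of assertions over facts you already know to be true. A minimal sketch, with made-up athlete numbers standing in for the survey data:)

```python
# Minimal sanity-check sketch: assert facts that are known to be true of the
# domain before trusting anything else in the data. The numbers are made up.
import pandas as pd

athletes = pd.DataFrame({
    "sport":  ["jockey", "basketball", "football"],
    "height": [62, 79, 75],     # inches
    "weight": [115, 220, 245],  # pounds
})

def mean_of(sport, col):
    return athletes.loc[athletes["sport"] == sport, col].mean()

# If any of these fail, go hunting for a data-entry problem before analyzing further.
assert mean_of("jockey", "height") < mean_of("basketball", "height")
assert mean_of("jockey", "weight") < mean_of("football", "weight")
print("sanity checks passed")
```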
[222004450260] |I worry that Babcock's data fail the sanity check.
[222004450270] |Specifically, look at the number of hours studied according to selectivity of school in 2003:
[222004450280] |highly selective: 13.47 hours; middle: 14.68 hours; non-selective: 16.49 hours
[222004450290] |Note that this effect is larger than the decline in number of hours studied between 1988 and 2003, so in terms of this dataset, this is a large effect (again, I cannot tell if it is significant, because the relevant statistical information is not provided) and it's not in the direction one would think.
[222004450300] |I will admit that it is possible that students at highly selective schools really do study less than the folks at JuCo, but that conflicts heavily with my pretty extensive anecdotal database.
[222004450310] |So either a) the world is very different from how I thought it was -- in which case, I want more evidence than just this survey -- b) Babcock has defined selectivity incorrectly, or c) there is something wrong with these data.
[222004450320] |One last worrisome fact
[222004450330] |I considered the possibility that the data Babcock was quoting were in a different paper.
[222004450340] |The only other paper on Babcock's website that looked promising was this American Enterprise Institute report.
[222004450350] |This is not a research paper, but rather summarizes research.
[222004450360] |Specifically, according to footnote #2, it summarizes the research in the paper I just discussed.
[222004450370] |Strangely, this paper does have a graph (Figure 4) breaking down study habits of students in the 1960s based on selectivity of the school they are attending: the very data he states do not exist in the later paper.
[222004450380] |I'm not really sure what to make of that, and have nothing further to say on the topic.
[222004450390] |At the very least, I would be hesitant to use those graphs as evidence to support the general claim that study habits have changed even at the selective schools, since it's unclear where the data came from, or if in fact they even exist (to be clear: it's Babcock who says they don't exist, not me).
[222004450400] |Conclusion
[222004450410] |To summarize, there seems to be very little evidence to support Babcock's conclusion that study time has decreased even at selective schools by 10 hours from the 1960s to modern day.
[222004450420] |That is, he has a survey from 1961 in which students studied 25 hrs/week, two surveys in the 1980s in which students studied 17 hours/week, and two surveys in the 2000s in which students studied 14-15 hrs/week, but these surveys are all based on different types of students at different schools, so it's hard to make any strong conclusions.
[222004450430] |If I compared the weight of football players from Oberlin in 1930 and Ohio State in 2005, I'd find a great increase in weight, but in fact the weight of football players at Oberlin probably has not increased much over that time period.
[222004450440] |Moreover, there are aspects of these data that deserve some skepticism.
[222004450450] |When I report to people who went to selective schools that these data suggest students at such schools study 13 hrs/week, the response is usually something like, "Do you mean per day?"
[222004450460] |Finally, since no statistics were run, it's quite possible that none of the results in this study are significant.
[222004450470] |I want to be clear that I'm not saying that Babcock's claims aren't true.
[222004450480] |I'm just saying that it's not clear he has any evidence to support them (which is not to say I think it's a bad study: it was a good study to have done and clearly took a lot of work, but I find it at best suggestive of future avenues of research and certainly not conclusive).
[222004460010] |Talking about Love
[222004460020] |Much of my work is on verbs that describe emotion, called "psych verbs."
[222004460030] |The curious thing about psych verbs is that they come in two varieties, those that put the experiencer of the emotion in subject position (Mary likes/hates/fears John) and those that put the experiencer of the emotion in object position (Mary delights/angers/frightens John).
[222004460040] |These verbs have caused a four-decades-long headache for theorists trying to explain how people know what should be the subject and what should be the object of a given verb.
[222004460050] |Many theorists would like to posit theories on which you put the "do-er" in subject position and the one "done to" in object position.
[222004460060] |But some psych verbs seem to go one way and some the other.
[222004460070] |There are basically only three theoretical possibilities:
[222004460080] |a) There's no general rule that will tell you whether the experiencer of an emotion should be the subject or object of a given verb.
[222004460090] |b) There's a general rule that tells you the experiencer should be the subject (or, on other theories, the object), and then there are some exceptions.
[222004460100] |c) There are no exceptions.
[222004460110] |There are two kinds of psych verbs that actually mean very different things.
[222004460120] |Each group follows a particular rule: one sends the experiencer to subject; the other, to object.
[222004460130] |I started out as a fan of theory (b).
[222004460140] |The results of my own work have pushed me in the direction of (c).
[222004460150] |The only theory that I'm pretty sure is wrong is (a). There are a lot of reasons I think (a) is wrong.
[222004460160] |One has to do with Broca's aphasia.
[222004460170] |Broca's aphasia
[222004460180] |People with Broca's aphasia -- typically caused by a stroke or brain injury -- have difficulty with grammar but are relatively good at remembering what individual words mean.
[222004460190] |Classically, Broca's aphasia was thought to result from damage to Broca's area, though I've heard that association is not as solid as once believed.
[222004460200] |Some well-known language-related areas of the brain.
[222004460210] |Either way, Maria Mercedes Pinango published a study in 2000 looking at how well Broca's aphasics understand psych verbs.
[222004460220] |She found that they had particular trouble with experiencer-object verbs (delights/angers/frightens) ... unless the verbs were in passive form (Mary is delighted/angered/frightened by John), in which case they had more trouble with the experiencer-subject verbs.
[222004460230] |There are a lot of reasons this could be.
[222004460240] |The main aspect of the finding that interests me here is that this is *not* what you'd expect on theory (a), since on that theory, all psych verbs are more or less the same and there's no particular reason Broca's aphasia or anything else should impact one more than the other.
[222004460250] |One worry one might have about this study was that it was published as a book chapter and not in a journal, and book chapters don't (usually) undergo the same review process.
[222004460260] |I don't personally know that much about aphasia or how one goes about testing aphasics, so it's hard for me to review Pinango's methods.
[222004460270] |More importantly, there weren't many participants in the study (these participants are not easy to find), so one would like replication.
[222004460280] |Replication
[222004460290] |As it happens, Cynthia Thompson and Miseon Lee recently published just such a replication (well, they published it in 2009, but one doesn't always hear about papers right away).
[222004460300] |It's a nice study with 5 Broca's aphasics, published in the Journal of Neurolinguistics.
[222004460310] |They tested both sentence comprehension and sentence production, finding that while passive sentences were harder overall, experiencer-subject verbs (like/hate/fear) were easier in the active form and experiencer-object verbs (delight/anger/frighten) were easier in the passive form.
[222004460320] |This effect was much more pronounced in sentence production than comprehension (in the latter case, it was not strictly significant), most likely because comprehension is easier.
[222004460330] |Again, these are not the results you expect if the rules that tell you who should be a subject and who should be an object are verb-by-verb, since then there's no reason brain damage should affect one class of verbs as opposed to another (since there are no verb classes).* What exactly it does mean is much trickier.
[222004460340] |Give me another 20-30 years, and hopefully I'll have an answer.
[222004460350] |*Actually, I can come up with a just-so story that saves theory (a).
[222004460360] |But it's certainly not what you would expect, and I believe there are a lot of other data from other paradigms that speak against theory (a).
[222004460370] |_________
[222004460380] |Thompson CK, and Lee M (2009).
[222004460390] |Psych verb production and comprehension in agrammatic Broca's aphasia.
[222004460400] |Journal of Neurolinguistics, 22 (4), 354-369. PMID: 20174592
[222004470010] |Love, Marriage & Race
[222004470020] |People who have been following this blog know that birth order affects who you are friends with and who you marry.
[222004470030] |Here's some comprehensive evidence on race.
[222004470040] |It probably won't come as a surprise, but it's nice to have numbers.
[222004480010] |New Experiment: Drama Queen
[222004480020] |The latest experiment in my quest to understand how people use emotion verbs is now posted.
[222004480030] |You will be introduced to a character who is, as the name of the game implies, a drama queen.
[222004480040] |She has many fraught relationships with her friends.
[222004480050] |You will be introduced to a number of friends, how Susan feels about each friend, and a new verb that you will try to use to describe that relationship.
[222004480060] |Enjoy.
[222004490010] |Missing Words
[222004490020] |My dictionary lists several Chinese words for disdain, but none for discourage.
[222004490030] |The government in Orwell's 1984 would have loved this, as they -- along with many contemporary writers (I'm talking about you, Bill Bryson) -- believed that if you don't have a word for something, you can't think about it.
[222004490040] |I guess China has no need for the motivational speaker industry.
[222004490050] |You can't be discouraged if you don't have a word for it.
[222004490060] |Unfortunately for the government of Oceania, there's very little evidence this is true.
[222004490070] |The availability of certain words in a language may have effects on memory or speeded recognition, but probably does nothing so drastic as making certain thoughts inaccessible.
[222004490080] |I think examples like the one above make it clear just how unlikely the hypothesis was to be true to begin with. ----- Photo credit here.
[222004500010] |New experiment: Mind Reading Quotient
[222004500020] |Language requires a lot of inference.
[222004500030] |Consider the following three conversations:
[222004500040] |A: Are there lots of people at the party?
[222004500050] |B: Well, most people have left already.
[222004500060] |A: How long has the party been going on?
[222004500070] |B: Well, most people have left already.
[222004500080] |A: Is it a good party?
[222004500090] |B: Well, most people have left already.
[222004500100] |In each of these cases, B's statement literally means the same thing, but the interpretation is different.
[222004500110] |Explaining (a) why this should be the case, and (b) how people figure out the implicit meanings is a very active area of research in modern linguistics and psycholinguistics.
[222004500120] |The Mind Reading Quotient
[222004500130] |Basically, understanding conversations like the ones above seems to require a certain amount of "mind reading" -- that is, guessing what the speaker (B, in this case) means to say.
[222004500140] |If you've ever wondered "what did she mean by that?" you were engaged in this kind of mind reading.
[222004500150] |I just posted a new experiment -- the Mind Reading Quotient -- which consists of several short tests of this kind of mind reading ability.
[222004500160] |A couple of the tests look specifically at trying to work out what somebody is saying.
[222004500170] |A couple of the tests look at similar skills in the non-linguistic domain.
[222004500180] |My favorite of the non-linguistic tasks is a coordination game.
[222004500190] |Thomas Schelling won a Nobel Prize in part for pioneering work on the topic.
[222004500200] |He found that people are very good at guessing what another person is thinking under certain conditions.
[222004500210] |For instance, if you tell two people they must meet up in New York City -- but without communicating with each other in any way -- they are actually fairly likely to succeed.
[222004500220] |Most likely, they would both show up on the corner of Times Square (or in one of a very small number of likely locations).
[222004500230] |The Mind Reading Quotient includes several such problems.
[222004500240] |The goal of this study in part is to get a sense of how good people are at such tasks.
[222004500250] |There are a lot of thought experiments out there, but not nearly enough data.
[222004500260] |I will also be looking to see if people who are better at one of these tasks are also better at the others -- that is, is there a single underlying "mind reading ability," or does each task require a separate set of skills?
[222004500270] |Reports so far are that the experiment runs 20-25 minutes.
[222004500280] |Because this is broken up into 7 separate activities, it should seem faster than that.
[222004500290] |And a lot of the tasks are fun (at least, I think so).
[222004500300] |Plus, at the end of the experiment, you'll be able to see your scores on many of the different sub-tasks.
[222004500310] |In two cases (a vocabulary test and an empathy test), I also have percentile scores already worked out, so you can see how you compare to average.
[222004500320] |Follow this link to the study.
[222004500330] |--- For previous posts about pragmatics and other linguistic inferences, check out this one, this one and this one.
[222004500340] |image CC by Ignacio Conejo.
[222004510010] |Overheard: Converting common knowledge into scientific knowledge
[222004510020] |Because they are so familiar, it is easy to assume that category labels drawn from everyday language are self-evidently the correct way to describe emotion.
[222004510030] |However, transforming everyday categorical descriptions into an effective research tool is at the least a challenge.
[222004510040] |Cowie & Cornelius (2003). Describing the emotional states that are expressed in speech.
[222004510050] |Speech Communication 40, 5-32.
[222004520010] |Blogger Spam Filter: Not Totally Useless
[222004520020] |For the first time ever, Google/Blogger's spam filter actually caught a spam comment.
[222004520030] |Usually, it lets the spam go right through unmolested and only traps legitimate comments.
[222004520040] |We can hope this is the start of a trend.
[222004530010] |Another problem with statistical translation
[222004530020] |In the process of writing my latest article for Scientific American Mind, I spent a lot of time testing out automatic translators like Google Translate.
[222004530030] |As I discuss in the article, these programs have gotten a lot better in recent years, but on the whole they are still not very good.
[222004530040] |I was curious what the Italian name of one of my favorite arias meant.
[222004530050] |So I typed O Soave Fanciulla into Google Translate.
[222004530060] |Programs like Google Translate are trained by comparing bilingual documents and noting, for a given word in one language, what word typically appears in the other language in the same place.
[222004530070] |Not surprisingly, Google Translate translated O Soave Fanciulla as O Soave Fanciulla -- no doubt because, in the bilingual corpora GT was trained on, sentences with the phrase o soave fanciulla in Italian had o soave fanciulla in English.
[222004530080] |I was reduced to translating the words one at a time: soave -> sweet, fanciulla -> girl.
[222004530090] |GT thinks o means or, but I expect that's the wrong reading in this context ("or sweet girl"?).
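(To give a flavor of what "noting which word typically appears in the other language" means, here is a deliberately crude co-occurrence counter over toy sentence pairs. Real systems -- including whatever Google Translate actually uses -- work over phrases and are vastly more sophisticated, so treat this purely as an illustration of why a word or phrase that only ever appears untranslated in the training data comes back out untranslated.)

```python
# Crude toy: for each Italian word, count which English words co-occur with it in
# aligned sentence pairs, then "translate" by picking the most frequent partner.
# This is only an illustration of the general idea, not how Google Translate works.
from collections import Counter, defaultdict

pairs = [
    ("o soave fanciulla", "o soave fanciulla"),   # the aria title left untranslated in the corpus
    ("che soave zeffiretto", "what a sweet breeze"),
    ("soave sia il vento", "may the wind be sweet"),
    ("una fanciulla dolce", "a sweet girl"),
    ("quella fanciulla canta", "that girl sings"),
]

cooc = defaultdict(Counter)
for italian, english in pairs:
    for it_word in italian.split():
        cooc[it_word].update(english.split())

def translate_word(it_word):
    # Words seen only in untranslated contexts "translate" to themselves.
    return cooc[it_word].most_common(1)[0][0] if cooc[it_word] else it_word

print([translate_word(w) for w in "o soave fanciulla".split()])  # ['o', 'sweet', 'girl']
```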
[222004540010] |Feds to College Students: "We don't want your professors to know how to teach"
[222004540020] |The National Science Foundation just changed the rules for their 3-year graduate fellowships: no teaching is allowed.
[222004540030] |Ostensibly, this is to ensure that fellows are spending their time doing research.
[222004540040] |This is different from the National Defense Science & Engineering graduate fellowship Vow of Poverty: you can teach as much as you want, so long as you don't earn money from it.*
[222004540050] |Consider that, ideally, PhD programs take 5 years, and the final year is spent on (a) writing the dissertation, and (b) applying for jobs.
[222004540060] |This means that NSF graduate fellows may have as little as one year in which to get some teaching experience.
[222004540070] |Presumably, NSF was thinking one of three things:
[222004540080] |1) They're trying to make it harder for their fellows to get jobs at universities that care about teaching.
[222004540090] |2) They honestly don't believe teaching experience is important.
[222004540100] |3) They weren't thinking at all.
[222004540110] |I'm curious what will happen at universities that require all students to teach, regardless of whether they have outside fellowships or not.
[222004540120] |Will they change that rule, or will they forbid students to have NSF fellowships?
[222004540130] |Given the current financial situation, I'm guessing they'll go with the former, but it's hard to say.
[222004540140] |*The exact NDSEG rule is that your total income for any year should be no more than $5,000 in addition to the fellowship itself.
[222004540150] |Depending on the university, this can be less than what one would get paid for teaching a single class.
[222004550010] |Annoyed about taxes
[222004550020] |It's not that I mind paying taxes per se.
[222004550030] |In fact, I consider it everyone's patriotic duty to pay taxes.
[222004550040] |I just wish it wasn't so damn complicated.
[222004550050] |The primary confusion I have to deal with every year is that Harvard provides a series of mini-grants for graduate students, which they issue as scholarships.
[222004550060] |Scholarships are taxable as income. Scholarships that are used to pay for tuition or required course supplies are not taxable, however.
[222004550070] |I'm a graduate student, which means that the four courses I take every semester are "independent research," and obviously doing research is required.
[222004550080] |On the other hand, the IRS regulations specifically state that any scholarships used to pay for research are taxable.
[222004550090] |So if I use the mini-grant to pay for my research, is it taxable or not?
[222004550100] |I actually asked an IRS representative a few years ago, and she replied that something counts as "required for coursework" only if everyone else taking that course has to buy it.
[222004550110] |If "everyone else" includes everyone else in the department doing independent research, then it's trivially the case that they are not required to do my research (though that would be really nice!), nor are they actually required to spend anything at all (some people's research costs more than others).
[222004550120] |If "everyone else" is only me -- this is independent research after all -- then the mini-grant is not taxable.
[222004550130] |This of course all hinges on whether or not "independent research" is actually a class.
[222004550140] |My understanding is that the federal government periodically brings action against Harvard, claiming that independent research is not a class.
[222004550150] |Some people occasionally deduct the mini-grant expenditures as business expenses.
[222004550160] |This is not correct.
[222004550170] |According to the IRS, graduate students are not employees and have no business, and thus we have no business expenses (this reasoning also helps prevent graduate student unions -- you can't form a union if you aren't employed).
[222004550180] |And in any case, as I mentioned, we are specifically forbidden to write off the cost of doing research.
[222004550190] |It's not just that the rules are confusing; they don't make sense.
[222004550200] |Why does the government want to tax students for the right to do research?
[222004550210] |How is that a good idea?
[222004550220] |Research benefits the public at large, and comes at a high opportunity cost for the researcher already (one could make more doing just about anything else).
[222004550230] |Why make us pay for it?
[222004550240] |(It probably should be pointed out that Harvard could cough up the taxes itself, or they could administer the mini-grants as grants rather than as scholarships, though that would cost them more in terms of administrative overhead.
[222004550250] |Instead, Harvard specifically forbids using any portion of the mini-grant to pay for the incurred taxes.
[222004550260] |Though since they don't ask for any accounting, it's quite possible nobody pays any attention to that rule.)
[222004560010] |Graduate School Rankings
[222004560020] |There have been a number of interesting posts in the last few days about getting tenure (1, 2, 3).
[222004560030] |One thing that popped out at me was the use of the National Research Council graduate school rankings in this post.
[222004560040] |I am surprised that these continue to be cited, due to the deep flaws in the numbers.
[222004560050] |Notice I said "numbers", not "methodology".
[222004560060] |I actually kind of like their methodology.
[222004560070] |Unfortunately, the raw numbers that they use to determine rankings are so error-ridden as to make the rankings useless.
[222004560080] |For those who didn't see my original posts cataloging the errors, see here and here.
[222004570010] |Arcadia
[222004570020] |The super-lame New Yorker review of the recent Broadway revival of Stoppard's "Arcadia" moved me to do a rare thing: write a letter to the editor.
[222004570030] |They didn't publish it, despite the fact -- and I think I'm being objective here -- that my letter was considerably better than the review.
[222004570040] |Reviews are no longer free on the New Yorker website (you can see a synopsis here), but I think my letter covers the main points.
[222004570050] |Here it is below:
[222004570060] |Hilton Als ("Brainstorm", Mar 28) writes about the recent revival of "Arcadia" that Stoppard's "aim is not to show us people but to talk about ideas."
[222004570070] |Elsewhere, Als calls the show unmoving and writes that Stoppard does better with tragicomedies.
[222004570080] |"Arcadia" is not a show about ideas.
[222004570090] |It is about the relationship people have with ideas, particularly their discovery.
[222004570100] |Anyone who has spent any amount of time around academics would instantly recognize the characters as people, lovingly and realistically depicted.
[222004570110] |(Als singles out Billy Crudup's "amped-up characterization of the British historian Bernard Nightingale" as particularly mysterious.
[222004570120] |As Ben Brantley wrote in the New York Times review, "If you've spent any time on a college campus of late, you've met this [man].")
[222004570130] |As an academic, the production was for me a mirror on my own life and the people around me.
[222004570140] |Not everyone will have that experience.
[222004570150] |The beauty of theater (and literature) is that it gives us a peek into the inner lives of folk very different from ourselves.
[222004570160] |It is a shame Als was unable to take advantage of this opportunity.
[222004570170] |Where the play focuses most closely on ideas is the theme of an idea (Thomasina's) stillborn before its time.
[222004570180] |If one feels no pathos for an idea that came too soon, translate "idea" into "art" and "scientist" into "artist" and consider the tragedies of artists unappreciated in their time and quickly forgotten.
[222004570190] |Even a theater critic can find the tragedy in that.
[223000020010] |SAXWriter updated for XHTML subset of XML
[223000020020] |At Alias-i recently, we’ve been writing lots of XHTML.
[223000020030] |You may have noticed the new LingPipe site, for example.
[223000020040] |Now we’re trying to generate XHTML automatically.
[223000020050] |The XML written by com.aliasi.xml.SAXWriter, although valid XML, does not comply with XHTML’s requirements.
[223000020060] |XHTML puts two additional restrictions on XML elements: (1) elements with no content and no attributes get an additional space before the closing slash, as in <br />, and (2) elements with attributes and no content must be closed with a separate tag, as in <a name="top"></a>.
[223000020070] |I extended LingPipe’s SAXWriter class to support this.
[223000020080] |The old constructors still provide the old behavior, which writes elements as compactly as possible.
[223000020090] |Two new constructors allow a flag to specify the more verbose output required by XHTML.
[223000020100] |This wasn’t too hard; in fact, the unit tests were harder than adding the new formatting code.
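To make the two restrictions concrete, here’s a rough sketch of how an empty element might be rendered under each rule. This is illustration only, not the actual SAXWriter code; the class and method names are made up.

    // Sketch of the two XHTML-motivated formatting rules described above.
    import org.xml.sax.Attributes;

    public class XhtmlElementFormatter {

        // Writes an element with no content:
        //   no attributes -> "<name />"  (note the extra space before the slash)
        //   attributes    -> "<name att="val"></name>"  (separate closing tag)
        // Real code would also escape reserved characters in attribute values.
        public static String formatEmptyElement(String qName, Attributes atts) {
            if (atts == null || atts.getLength() == 0)
                return "<" + qName + " />";
            StringBuilder sb = new StringBuilder();
            sb.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); ++i)
                sb.append(' ')
                  .append(atts.getQName(i))
                  .append("=\"")
                  .append(atts.getValue(i))
                  .append('"');
            sb.append("></").append(qName).append('>');
            return sb.toString();
        }
    }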
[223000020110] |The original motivation for the SAXWriter was that the Xerces example writer demo was so minimal.
[223000020120] |The new Xerces 2 J sax.Writer is much slicker.
[223000020130] |Even so, it still suffers from allowing you to change the character set on the output writer without declaring it in the XML declaration, which always prints UTF-8.
[223000020140] |Very odd, given the care they’ve now taken with the rest of the program.
[223000020150] |It’s worth studying, especially for tips on XML 1.1.
[223000020160] |Warning: The feature described here will be in the 2.2.1 release, which is not yet scheduled.
[223000030010] |LingPipe Site Colophon
[223000030020] |A colophon is a description of the publication of a book.
[223000030030] |Nice examples can be found at the end of the Addison-Wesley Java series books and the back of most O’Reilly books.
[223000030040] |They tell you about the software, fonts, production and design of the book.
[223000030050] |I love typesetting and read colophons.
[223000030060] |For LingPipe, we wanted something crisp, readable, navigable and extensible.
[223000030070] |We did not want it to be busy-looking like a direct mail solicitation or barren looking like a high-end clothing store.
[223000030080] |Luckily, LingPipe’s not Amazon or IBM, so we thought we could tackle it on our own.
[223000030090] |To make a long story short, it’s harder than you might think to do it right.
[223000030100] |Even with most of the content in place, it took me roughly a week and a half to finish the site redesign.
[223000030110] |We also wanted something that would proportionally resize.
[223000030120] |Most pages don’t resize their content area’s widths when resizing (a good example is this blog format from WordPress), so eventually at large font sizes, things get pushed out.
[223000030130] |Also, most sites don’t let you make them too small without having to scroll.
[223000030140] |We wanted what’s known as a fluid layout.
[223000030150] |I also thought this’d be a good opportunity to switch from our table-based layouts, which are impossible to maintain, to something a little more modern, like cascading style sheets (CSS).
[223000030160] |En route, we learned that CSS is more like troff than TeX.
[223000030170] |I’d had previous experience typesetting books with TeX.
[223000030180] |The kind-hearted copy-editors and designers at Cambridge University Press first sent me down the book-design road by sending me typesetting corrections for my first book (now in its second printing!).
[223000030190] |It took a week of my time, but boy did it look beautiful in the end, and about 75 pages shorter.
[223000030200] |Lesson one: don’t waste whitespace.
[223000030210] |For basic design, we had the official Alias-i colors from our logo, a dark blue and a yellow.
[223000030220] |Together with a lighter blue for links, and black white and gray, this makes up the entire color range of the site.
[223000030230] |For basic fonts, we wanted something readable.
[223000030240] |On paper, that means serif.
[223000030250] |On the web, that means sans-serif.
[223000030260] |We went with Verdana, which is a font designed for the web that is available on all browsers.
[223000030270] |We paired that with Lucida Console, which is a beautiful code font, though not quite contrasty enough against Verdana.
[223000030280] |And it needs to be sized a bit differently to get them to match in terms of x-height (how high the top of the letter ‘x’ is) and in terms of weight (how dark a page of the text looks).
[223000030290] |Courier looked terribly fat and out of place and there are no other choices of fixed-width fonts for code.
[223000030300] |Note that all of our running content links are underlined and rendered in blue with a light mouseover effect.
[223000030310] |Overall, the fonts are way smaller than the default values, but not as small as professional designers tend to choose.
[223000030320] |Try cranking down the size one notch in the browser —that’s how it’d look to a designer.
[223000030330] |For basic layout, we chose a simple header, left navigation bar and running content area.
[223000030340] |They’re all spaced proportionally, so they resize as the web browser resizes and/or fonts are resized.
[223000030350] |Sounds easy, but CSS is a nightmare for this compared to tables.
[223000030360] |We wanted to refrain from the world of CSS hacks, which are typically of the two-wrongs-make-a-right variety, such as exploiting an MS IE bug to resize something that IE gets wrong because of its double margin bug.
[223000030370] |Why does CSS have no vertical alignment?
[223000030380] |I couldn’t figure out any way using just CSS to get our logo to stay nestled on top of the content.
[223000030390] |A higher-level goal is that it should look OK with all the styling turned off.
[223000030400] |I haven’t gotten around to a printer style yet, but it looks OK without any styles at all.
[223000030410] |I can vaguely recommend the book I used to get up to speed, Stylin’ with CSS, but I would’ve dearly liked something aimed at programmers (O’Reilly’s offering is no better in this regard).
[223000030420] |I found the book Great Web Typography a bit dated, but the most useful of anything out there for the actual design issues.
[223000030430] |I can also highly recommend the W3C’s pages, including their validator.
[223000030440] |We followed the basic rules of design: contrast, repetition, alignment and proximity (aka CRAP), as explained in Robin Williams’s indispensable The Non-Designers Design Book.
[223000030450] |Contrast: You want the important things to stand out.
[223000030460] |In page design, you pretty much get contrast for that.
[223000030470] |So you see the major heads are set bolder and also set on solid backgrounds a la Apache.
[223000030480] |You see that the current page is highlighted in yellow, allowing a kind of breadcrumb-like behavior for the nav bar.
[223000030490] |You see the navigation all on a blue background and the text clearly set out on a white background.
[223000030500] |You’ll also see the code on a light grey background with some slightly darker grey borders.
[223000030510] |Ditto for the sidebars.
[223000030520] |This makes them stand out and separates them from the surrounding material.
[223000030530] |It makes it easy to scan the sidebars, and also easy to scan the code.
[223000030540] |All titles are in bolder and/or bigger fonts so they stand out.
[223000030550] |Tables have slightly darker titles and a bit lighter headers.
[223000030560] |Alignment: This is tricky.
[223000030570] |But check out how everything lines up.
[223000030580] |When it doesn’t, pages look vaguely disturbing.
[223000030590] |Here, we’ve lined up the nav bar with the title, all the titles to the left, the sidebars run just as far as the running titles, etc.
[223000030600] |Proximity: What goes together conceptually should be close physically.
[223000030610] |You see the navbar items under a top-level header are closer to the title than the separators.
[223000030620] |The search button is right next to the search bar.
[223000030630] |The titles are closer to the first paragraph of following text than the previous text.
[223000030640] |Paragraphs are just widely enough separated to be noticeable.
[223000030650] |Everything is also given a bit of space from its surrounding elements for readability.
[223000030660] |Repetition: We tended to reuse the same elements rather than making everything up new.
[223000030670] |Thus the navigation is all in the nav bar all the time, and it always looks the same.
[223000030680] |The title’s always at the top.
[223000030690] |We try to start pages with an explanation of what the page is about.
[223000030700] |But this also involves repeating the same fonts and colors again and again.
[223000030710] |It gives the whole design coherence.
[223000030720] |That’s about it.
[223000030730] |I learned enough that I can diagnose major flaws in web pages.
[223000030740] |This is not always a good thing.
[223000030750] |I’ve always hated badly set books after understanding how to do it right, and now more and more web sites are bugging me when I see them use too many fonts, text jammed up against borders, lack of interline spacing, overfull lines, etc. etc.
[223000030760] |Having said that, I have a long way to go, and would love to get feedback about our design.
[223000040010] |Blame Canada for more n-gram applications
[223000040020] |The Canadians are as psyched as we are about character n-grams and have applied them to a host of new problems: (1) Alzheimer’s type classification from transcripts, (2) signature-based virus detection from executables, (3) author gender attribution, (4) document clustering, (5) Spam Filtering, and even (6) genome sequence clustering and classification.
[223000040030] |Check it out from Vlado Keselj’s List of Publications.
[223000040040] |Vlado, who’s now at Dalhousie after a Ph.D. at Waterloo, seems to have taken the torch from Fuchun Peng, who recently graduated from Waterloo and moved to UMass.
[223000040050] |Fuchun’s dissertation is well worth reading for the wide range of character n-gram classification evaluations.
[223000040060] |Anyone game to recreate any of this work in LingPipe?
[223000050010] |Postponed: Character LMs with a Dozen Apps –Columbia Uni Talk
[223000050020] |I’m giving a talk some time this semester (originally scheduled for Thursday 23 March 2006) as part of Columbia University’s series http://www1.cs.columbia.edu/nlp/otslac.html (OTSLAC).
[223000050030] |It’ll be in the CS Seminar Room in the MUDD Building on Columbia’s main campus.
[223000050040] |Here’s the abstract:
[223000070010] |LingPipe 2.2.1 released
[223000070020] |We released LingPipe 2.2.1.
[223000070030] |This is a maintenance release that patches a major bug in the new fast cache and a minor bug in spelling token sensitivity.
[223000070040] |The first fix enables the new util.FastCache to handle negative hash codes.
[223000070050] |I was reminded that not only may hash codes be negative, but also that the remainder operator (%) does not always return a positive value.
[223000070060] |The first unit tests must’ve coincidentally used strings with only positive hash codes.
[223000070070] |This has been fixed and tried in a large scale setting now.
[223000070080] |Why do I say reminded?
[223000070090] |Well, it’s literally Puzzle Number 1 in Joshua Bloch and Neal Gafter’s mind-bending Java Puzzlers.
[223000070100] |Many of the other puzzles deal with overflow of integer values.
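To make the pitfall concrete, here’s a tiny self-contained demo (not the actual FastCache code) of how % misbehaves on negative hash codes, plus one safe way to map them to bucket indices.

    // Java's remainder operator (%) takes the sign of its left operand, so a
    // negative hash code yields a negative remainder and an invalid bucket
    // index. Note also that Math.abs(Integer.MIN_VALUE) is still negative,
    // so abs() is not a fix.
    public class BucketIndexDemo {

        // One safe way to map an arbitrary hash code to a bucket index.
        static int bucket(int hashCode, int numBuckets) {
            int remainder = hashCode % numBuckets;
            return remainder < 0 ? remainder + numBuckets : remainder;
        }

        public static void main(String[] args) {
            int hash = Integer.MIN_VALUE;         // some strings really do hash to this
            System.out.println(hash % 10);        // prints -8
            System.out.println(bucket(hash, 10)); // prints 2
        }
    }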
[223000070110] |The second fix takes care of a subtle bug in spelling correction token sensitivity for when the best hypothesis splits a token into two tokens.
[223000070120] |I was forgetting that the second token had been edited implicitly.
[223000070130] |There are also two new spelling features: one lets you set a minimum on the size of tokens edited and the second lets you specify a set of do-not-edit tokens.
[223000070140] |If you’re not using spelling or the new cache, the new release won’t affect you at all.
[223000080010] |Phrase Extraction: Binomial Hypothesis Testing vs. Coding Loss
[223000080020] |LingPipe provides two API methods in the language model package that allow either collocations or new phrases to be extracted.
[223000080030] |Both are implemented by classical hypothesis testing.
[223000080040] |There’s an Interesting Phrases Tutorial on our web site.
[223000080050] |Collocation extraction works over a single corpus (text collection).
[223000080060] |It tests the independence of a phrase.
[223000080070] |For a bigram phrase like “language model” it uses the classic four-way comparison of bigram counts: count(+language,+model), count(-language,+model), count(+language,-model), count(-language,-model) in a chi-squared test of independence.
[223000080080] |Basically, the question is: does the phrase “language model” appear more than you would expect it to by chance based on the probability of “language” and “model” occurring independently?
[223000080090] |New terminology extraction works over two corpora.
[223000080100] |The goal is to find things that are mentioned significantly more in the second corpus than the first.
[223000080110] |Here we use a standard binomial significance test, measuring if the probability in the foreground corpus is “significantly” higher than the probability in the background corpus.
[223000080120] |This test is based on language models.
[223000080130] |The standard binomial (large sample) test computes a z-score, which essentially measures the number of standard deviations above or below the number that were expected.
[223000080140] |This is a beautiful formula:
    z(x) = (countFG(x) - totalCountFG * probBG(x)) / sqrt(totalCountFG * probBG(x) * (1 - probBG(x)))
[223000080150] |Here the totalCountFG is the total count of phrases in the foreground corpus, and probBG is the estimate in the background model of the probability of phrase x.
[223000080160] |High positive values of z means that the phrase x occurs many more times than expected; low negative values means many times less than expected.
[223000080170] |Recall that sqrt(probBG(x)*(1-probBG(x))) is the deviation of the background Bernoulli distribution (which is iterated a number of times equal to the total count for the binomial model).
[223000080180] |This is just the standard hypothesis test for large sample binomial distributions using a normal approximation of the binomial variance, the error of which is bounded by the central limit theorem (isn’t everything in stats?).
[223000080190] |The actual scaling taking place w.r.t. corpus counts makes more sense if we replace countFG(x) with an estimate based on a foreground language model, because our best estimate of the foreground count is totalCountFG * pFG(x), where pFG(x) is a foreground model probability.
[223000080200] |This gives us:
    z(x) = sqrt(totalCountFG) * (pFG(x) - probBG(x)) / sqrt(probBG(x) * (1 - probBG(x)))
[223000080210] |Here we see that our confidence in the significance of a (deviation scaled) difference goes up with the square root of the total count (as we’d expect, again from the central limit theorem).
[223000080220] |In ordering these results for terminology extraction over a single corpus, the total counts are ignored.
[223000080230] |They are useful if you want to make comparisons across corpora.
[223000080240] |You can also get rid of the square roots by squaring everything; much faster to compute that way.
[223000080250] |So the final ordering (as implemented inside LingPipe) is:
    score(x) = (pFG(x) - probBG(x))^2 / (probBG(x) * (1 - probBG(x)))
[223000080260] |(Curse this too-smart-for-its-own-good blog software —it's insisting on em-dashes here.)
[223000080270] |Thus we’re ordering by probability differences scaled by deviation.
[223000080280] |But note that 1-probBG(x) is so close to 1 when probBG(x) is small (as it usually is here), that it can be safely dropped from the denominator of the above equation (as suggested in Manning and Schuetze’s book).
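As a sanity check on the algebra, here is the ordering score written out as a couple of throwaway Java methods; this is a re-derivation for illustration, not the LingPipe source.

    // Squared z-score ordering for new-phrase extraction, as described above.
    // pFG: phrase estimate from the foreground model; pBG: from the background model.
    public class PhraseScore {

        // Full form: squared difference scaled by the background Bernoulli variance.
        static double score(double pFG, double pBG) {
            return (pFG - pBG) * (pFG - pBG) / (pBG * (1.0 - pBG));
        }

        // Simplified form: for small pBG, (1 - pBG) is close to 1 and may be dropped.
        static double scoreSimplified(double pFG, double pBG) {
            return (pFG - pBG) * (pFG - pBG) / pBG;
        }

        public static void main(String[] args) {
            System.out.println(score(1e-4, 1e-6));            // frequent in FG, rare in BG
            System.out.println(scoreSimplified(1e-4, 1e-6));  // nearly identical value
        }
    }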
[223000080290] |Now if the background model is just a unigram (or other lower order model), you can use this second setup to test independence; things that are significantly more likely in the higher-order model than the lower-order model are more phrase-like.
[223000080300] |Turns out I’m not the only one who thought of doing things this way.
[223000080310] |Yesterday, I read an interesting paper by Takashi Tomokiyo and Matthew Hurst, A language model approach to keyphrase extraction, from a 2003 ACL workshop.
[223000080320] |They distinguish two issues.
[223000080330] |The first is phrasiness (our notion of collocation), which they measure over a single corpus by taking the foreground model to be an n-gram and the background model to be a unigram.
[223000080340] |The second is informativeness, which we’ve called newness or significance.
[223000080350] |They measure informativeness by comparing a foreground model to a background model of the same order.
[223000080360] |What’s interesting is that they also consider crossing these measures, with the best results coming from a linear interpolation of phrasiness and newness.
[223000080370] |We should try this, too.
[223000080380] |Here’s where things get interesting.
[223000080390] |Instead of using classical hypothesis testing for significance of differences, they use compression loss, which totally makes sense.
[223000080400] |They wind up with a formula:
    score(x) = totalCountFG * pFG(x) * (log pFG(x) - log probBG(x))
[223000080410] |Note that the log difference is the loss (in bits if logs are base 2) in compression for coding a single instance (that is, it’s the cross-entropy), and the whole thing is scaled by an estimate of the number of instances that need to be coded.
[223000080420] |It has all the same parts as our estimate, but counter-intuitively, they’re in different polarities.
[223000080430] |Although the difference of logs is going in the same direction as our squared difference, Tomokiyo and Hurst multiply by foreground probability rather than (approximately) dividing by background probability.
[223000080440] |My guess is that this’ll skew their results toward less significant differences (in the statistical sense) that are more likely.
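To see the skew concretely, here is a toy comparison of the two orderings, with the coding-loss score taken in the form sketched above and the probabilities completely made up.

    // Toy comparison of the two orderings, ignoring the shared totalCountFG factor.
    public class ScoreComparison {

        // Hypothesis-test style: squared difference divided by background probability.
        static double zStyle(double pFG, double pBG) {
            return (pFG - pBG) * (pFG - pBG) / pBG;
        }

        // Coding-loss style: pFG * (log2 pFG - log2 pBG), per the description above.
        static double codingLoss(double pFG, double pBG) {
            return pFG * (Math.log(pFG / pBG) / Math.log(2.0));
        }

        public static void main(String[] args) {
            double pfgA = 1e-4, pbgA = 1e-8; // phrase A: rare, but 10,000x more frequent in FG
            double pfgB = 2e-2, pbgB = 1e-2; // phrase B: common, but only 2x more frequent in FG
            System.out.println("z-style:     A=" + zStyle(pfgA, pbgA) + "  B=" + zStyle(pfgB, pbgB));
            System.out.println("coding loss: A=" + codingLoss(pfgA, pbgA) + "  B=" + codingLoss(pfgB, pbgB));
            // The hypothesis-test ordering puts A first; the coding-loss ordering puts B first.
        }
    }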
[223000080450] |I’ll have to check out the combination approach and see what it looks like.
[223000080460] |I can also compare to what they did, because with the visitor pattern I used to encode terminology extraction, this’ll be easy to code.
[223000090010] |SIGHan ‘06: Chinese Word Segmentation and Named Entity Recognition
[223000090020] |LingPipe 2.3 (not out yet) includes a rescoring n-best named-entity adapter and an implementation involving longer-distance character language models.
[223000090030] |We used it for the named-entity portion of the 3rd International Chinese Language Processing Bakeoff.
[223000090040] |We used the same exact implementation of word segmentation as can be found in our Chinese Word Segmentation Tutorial.
[223000090050] |There were four word-segmentation corpora and two named-entity corpora (well, three, but we ignored the LDC’s as it was in its own format and required even more licenses for download).
[223000090060] |LingPipe fared pretty well, winding up near the median in most evaluations (which also put it near the top).
[223000090070] |Our best segmentation performance was .972 F-measure, and our best named-entity recognition (person/location/organization) performance was .855 F-measure.
[223000090080] |These were 1.1% and 3.5% off the best closed (no external resources) scores in the bakeoff, respectively.
[223000090090] |Full details are available from our system write-up:
[223000090100] |Bob Carpenter. 2006.
[223000090110] |Character Language Models for Chinese Word Segmentation and Named Entity Recognition.
[223000090120] |Proceedings of the 5th International Chinese Language Processing Workshop (SIGHAN).
[223000090130] |Sydney.
[223000090140] |We hope to release LingPipe 2.3 in the next month or two; we’re pretty busy with commercial applications work right now.
[223000100010] |Finding Text with One Language Model
[223000100020] |At our Java SIG presentation this past Tuesday, we had several interesting questions about classifiers.
[223000100030] |One of the things people wanted to be able to do was pull the natural language text out of a document (as compared to, say, ASCII art, tables, lists of menu titles derived from HTML to text conversion, etc.).
[223000100040] |That’s a two-class classification problem, where a sequence of characters is classified as either “GOOD” or “BAD”.
[223000100050] |Not a problem, was our answer, but as usual, the devil’s in the details.
[223000100060] |One person framed the question beautifully:
[223000100070] |Spot on.
[223000100080] |The really cool part is how that’s actually implemented with our classification interface.
[223000100090] |A simple two-class classifier will operate by classifying to the category with the highest probability given the text being classified, P(cat|text).
[223000100100] |That’s what we’re calling “confidence” —the probability of the category given the text.
[223000100110] |As usual, it’s much easier to build a joint model P(cat,text) of the probability of the category AND the text.
[223000100120] |Then we just note that:
    P(cat|text) = P(cat,text) / P(text)
[223000100130] |where we’ve applied the usual Bayesian inversion to compute the marginal as a sum of joint estimates:
    P(text) = SUM_cat' P(cat',text)
[223000100140] |If we’re just choosing the first-best category, then we don’t even need to compute the conditional, because:
    ARGMAX_cat P(cat|text) = ARGMAX_cat P(cat,text) / P(text) = ARGMAX_cat P(cat,text)
[223000100150] |Now here’s the question: what if we only have training examples of one category, say that of natural language text in English?
[223000100160] |As a good statistician, all I want is a good estimate of the probability of the text given the category.
[223000100170] |The best approach is to get some data.
[223000100180] |The second best approach is to take a uniform estimate of P(text|cat) [which will depend on the length of the text if it's to be properly normalized].
[223000100190] |The first-best approach is our normal language model-based classifier.
[223000100200] |The second-best approach is implemented as the classify.BinaryLMClassifier.
[223000100210] |The constructor takes a “cross-entropy threshold” and what happens in first-best classification is that the input is classified as being of the category if its entropy-rate is lower than the threshold.
[223000100220] |But how do we do this by inheritance?
[223000100230] |By making the competing model P(text|cat2) a constant character language model with a per-character estimate of the cross-entropy rate.
[223000100240] |Voila.
[223000100250] |Let’s see an example.
[223000100260] |I build a language model P(text|GOOD) using data about what’s “good” text to process.
[223000100270] |This can probably just be run-of-the-mill training data you get from anywhere, but the closer it matches your application’s good data, the better.
[223000100280] |Then I look at its cross-entropy rate (log prob / length) on unseen text.
[223000100290] |I can actually go so far as to compute a variance, as I did in the paper at the ACL software workshop last year.
[223000100300] |This’ll let me estimate where I need to set the threshold to get a given recall value (which will then determine precision).
[223000100310] |In essence, anything that looks too “far out”, that is, too high entropy against the training model, will be classified as non-text.
[223000100320] |That’s how you “recognize the feeling of unfamiliarity”.
[223000100330] |Unfamiliarity is just high cross-entropy in a statistical sense.
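Here’s roughly what that looks like in code; the CharLm interface below is a stand-in for whatever character language model you’ve trained on good text, not LingPipe’s actual API.

    // Minimal sketch of the one-model text/non-text decision described above.
    public class TextDetector {

        interface CharLm {
            double log2Estimate(CharSequence text); // log (base 2) probability of the text
        }

        private final CharLm goodModel;
        private final double maxCrossEntropyRate; // threshold in bits per character

        TextDetector(CharLm goodModel, double maxCrossEntropyRate) {
            this.goodModel = goodModel;
            this.maxCrossEntropyRate = maxCrossEntropyRate;
        }

        // Cross-entropy rate = negative log probability divided by length.
        double crossEntropyRate(CharSequence text) {
            return -goodModel.log2Estimate(text) / text.length();
        }

        // Familiar-looking text has a low cross-entropy rate against the model.
        boolean isText(CharSequence text) {
            return crossEntropyRate(text) <= maxCrossEntropyRate;
        }
    }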
[223000110010] |Presentation at NY JavaSIG
[223000110020] |Last Tuesday (May 23) I gave a 40 min how-to talk on uses of linguistics in the “application stack” featuring source-level details of how to get LingPipe jumping through hoops like “did you mean?” style query spell checking, sentence classification and finally sentence detection, entity detection up to within-document coref.
[223000110030] |Slides at NYJava Sig Past Presentations
[223000110040] |How did it go?
[223000110050] |Otis came up to me gesturing the universal sign of head-about-to-explode (hands pressing sides of skull in) and stuttered that half of the talk would have been more than enough.
[223000110060] |One person hated it in the NY Sig forum (“Should anyone waste their time listening to things like Linguistics+Software”), but that prompted a bunch of useful responses.
[223000110070] |Some quotes:
[223000110080] |“Yes, I really enjoyed the speakers….While the Ling Pipe speaker was, in my mind, a quirky academic, I wouldn’t expect much less from a guy that has spent many, many hours thinking about how language is constructed and used and subsequently trying to encode that knowledge in software.”
[223000110090] |Computational linguistics has definitely rotted my mind–quantifier scope ambiguities really warped reality.
[223000110100] |“I thought the linguistics toolkit –LingPipe –was very interesting.
[223000110110] |The topic is quite academic but the use of linguistics is growing in everyday applications with the increase of unstructured or semi-structured data in the form of email, chats and all types of electronic documents.
[223000110120] |However, you are right that this is definitely not a mainstream concern in the world of java these days.”
[223000110130] |That is pretty much the problem we are trying to solve: make linguistics more of a mainstream concern, part of a competent developer’s toolkit like a DB would be.
[223000110140] |“Personally I thought the LingPipe talk was fascinating.
[223000110150] |Imo, they need another layer over what Breck was describing, which was a sort of ‘system call’ layer.
[223000110160] |A set of higher level components for various use cases would go a long way.
[223000110170] |Seems like a very powerful library to extract semantics out of text.”
[223000110180] |This comment had Bob and I talking for a while.
[223000110190] |Perhaps a single class that has precompiled models with methods for the “standard” things is called for.
[223000110200] |Maybe it will make it into the 2.5 release.
[223000110210] |Giving the talk was a good experience, I spent a week writing it up and will spend more time tuning.
[223000110220] |I am looking for other venues to give it so get in touch if you have ideas. breck
[223000120010] |NAACL/HLT in Brooklyn this week
[223000120020] |NYU is hosting HLT-NAACL this year (thanks especially to Satoshi Sekine for local organizer duties).
[223000120030] |The opening reception is tonight (Sunday, 4 June) in the neighborhood I share with NYU (Greenwich Village).
[223000120040] |After tutorials today, the conference starts in earnest tomorrow.
[223000120050] |The main conference is in Brooklyn, home of Alias-i.
[223000120060] |We’ll be there with hot-out-of-the-burner LingPipe 2.2.2 CDs to pass out.
[223000120070] |As usual, the workshops look more interesting than the main conference.
[223000120080] |I have to agree with Ken Church’s “editorial” in the Computational Linguistics journal —the acceptance rates are ridiculously low.
[223000120090] |I also took his point about the burden it puts on reviewers.
[223000120100] |I declined all invitations to review for the main conference.
[223000120110] |For better or worse, the workshops are looking more and more like annual organized group meetings than ad-hoc workshops.
[223000120120] |Standing conferences posing as workshops include CoNLL-X, DUC, SIGPHON, BioNLP and ScaNaLU —about half of the total, with others announced as “First International”.
[223000120130] |We really wish Martin Jansche’s software workshop from last year was being held again this year.
[223000120140] |I guess my mom and dad in the audience didn’t tip the balance in Ann Arbor last year.
[223000120150] |In any case, we’re really looking forward to BioNLP ’06.
[223000120160] |Both Breck and I were on the program committee and there are a lot of very interesting looking papers.
[223000120170] |The resources available are staggering, and best yet, mostly free (as in beer and as in speech).
[223000120180] |The workshop on joint inference should also be interesting —we saw a preview of Heng Ji, Cynthia Rudin and Ralph Grishman’s paper Re-ranking algorithms for name tagging at the NYU seminar on Friday, which is some of the niftiest work on information extraction that we’ve seen.
[223000150010] |Spring Cleaning LingPipe with PMD and fixcrlf and emacs
[223000150020] |I’m so brain dead today that the only activity that seemed possible was cleaning out existing code in some mindless fashion.
[223000150030] |So, I pulled out a handy mail tip from Otis Gospodnetić and fired up a bunch of tools I could run from Ant.
[223000150040] |First off, the core ant task fixcrlf, which allows you to set all end-of-line markers to unix-style line feeds, as well as replacing tabs with a specified number of spaces (4, of course).
[223000150050] |Second, time to cleanup emacs so it always does the right thing instead of that pesky tab default.
[223000150060] |Everyone knows the magic incantation:
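    (setq-default indent-tabs-mode nil)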
[223000150070] |which I always have to track down off the web.
[223000150080] |This works in xemacs, anyway.
[223000150090] |Next up, the real killer Ant task, PMD, which has all kinds of rules for checking code.
[223000150100] |I’m using the basic rules (unused variables, empty catch blocks, empty whiles, etc.), the unused code rules, and the imports rules, which find all the places I over-eagerly imported and then never erased.
[223000150110] |Just download the bin version, unpack it, and copy its lib over to your ant lib and you’re ready to go (you probably won’t need the Xerces and XML-APIs if you’re already running OK with LingPipe).
[223000150120] |I can highly recommend all these tools.
[223000150130] |I hope the 2.4 release of LingPipe’s cleaner than 2.3 in all of the above ways.
[223000160010] |A Nobel in Computational Linguistics?
[223000160020] |How is that?
[223000160030] |Amongst the research geeks we see a BIG opportunity in squeezing more information out of written human knowledge.
[223000160040] |Generally that means the research literature but it can extend to databases and other “encodings” of information about the world.
[223000160050] |The squeezing involves the transition from word-based search to concept-based search.
[223000160060] |It’s a big deal and one that you personally care about if you think you will be needing the serious attention of a doctor 15 years from now.
[223000160070] |Making the leap will uncover a new world of therapies, treatments and scientific understanding–at least that is the idea and it is well worth exploring.
[223000160080] |It is a cure-for-cancer level achievement.
[223000160090] |As they say at the Indy 500: “Researchers, start your graduate students”.
[223000160100] |A few details, but I am going to keep this sketchy.
[223000160110] |Mr. Search Engine does a pretty good job finding words in documents, but words are a long way from finding every document in MEDLINE that mentions the gene id 12, official name Serpina3.
[223000160120] |Why?
[223000160130] |Not enough found: The concept for Serpina3 is expressed in documents as ‘ACT’, ‘GIG24’, ‘AACT’ amongst others and Mr. Search Engine misses these entirely.
[223000160140] |Attempts to help Mr. Search Engine have pretty much failed up to now.
[223000160150] |Too much found: The alias ‘ACT’ is highly ambiguous amongst genes as well as the word ‘act’ in more common use.
[223000160160] |It is like finding John Smith on the web–Mr. Search Engine doesn’t even get the fact that there are lots of different things in the world mentioned the same way.
[223000160170] |What is the payoff?
[223000160180] |Once you get concept indexing sorted out, then you can start playing games very effectively with some old ideas floated by Don Swanson in ’88 originally about migraines and dietary magnesium*.
[223000160190] |The approach there tries to find disease A with underlying causes B, and then find treatments C which apply to B but are not known to apply to disease A yet.
[223000160200] |Nice idea–the problem is that it is pretty seriously limited if the A, B and C’s are limited to keyword lookup.
[223000160210] |Make those concept lookups and Dr. Swanson’s approach will gain some serious traction.
[223000160220] |Once that happens I see Dr. Swanson and the folks who solve the concept indexing problem enjoying some quality time in Stockholm.
[223000160230] |I hope they invite me along for the celebration dinner.
[223000160240] |–breck
[223000160250] |*-There are tons of other interesting ideas that would gain traction with concept search as well.
[223000160260] |Swanson however is the first person I know of who actually did something with it.
[223000160270] |Cite:
[223000160280] |SWANSON, D. R. (1988), Migraine and magnesium: eleven neglected connections, Perspectives in Biology and Medicine, 31 : 526–557.
[223000170010] |A day teaching at the Johns Hopkins Summer Institute
[223000170020] |I went down to the Johns Hopkins to teach information extraction for a day at the NAACL summer school.
[223000170030] |It was 28 students with an hour morning presentation and a 3.5 hour lab in the afternoon.
[223000170040] |The only constraint was that I was to return them in good condition and preferably a bit more learned in the ways of LingPipe and information extraction.
[223000170050] |The students ranged in experience from undergraduates to senior graduate students.
[223000170060] |I decided that a good lab project would be for them to reprocess the results returned by a search engine.
[223000170070] |So I loaded the excellent open source search engine Lucene with 1300 FBIS articles from back in the TIDES days and set the problem of helping intelligence analysts sort through a day’s worth of fresh intelligence about Iraq.
[223000170080] |Their task was to find better ways of presenting returned results.
[223000170090] |In the morning presentation I covered the basic input/output setup I was giving them and the source for using LingPipe to do sentence ranking with language models and extraction of named entities up to the level of coreference.
[223000170100] |After the morning presentation, they broke up into 6 groups of on average 4 people and hatched a plan over lunch.
[223000170110] |At 1:30 we started the lab, and Bob showed up to lend a helping hand.
[223000170120] |All the groups briefed Bob or me on what they were doing and we helped them get started.
[223000170130] |Lots of interesting ideas were floated and a steady hum built in the lab as we got working.
[223000170140] |Once they got going I slipped out and procured a bottle of Moet Champagne (the real stuff–none of this California malarky) as 1st prize.
[223000170150] |Bob noted that I was perhaps as interested in teaching them about quality wine as linguistics….
[223000170160] |The whole session was a blur, but in the end we saw interesting applications using entity detection for node/link visualization, a few efforts linking locations to google maps (not very detailed in Iraq), an effort to recognize sentences of future intent using tense.
[223000170170] |We ended with votes after brief presentations and a group of students slipped out in search of an ice bucket.
[223000170180] |Lessons learned: 3.5 hours is not much time, we should have structured things more perhaps–the project would have been much better set as a week-long effort.
[223000170190] |It is really fun to work with smart motivated students.
[223000170200] |The task’s limitations had more to do with project management than coding skills.
[223000170210] |Thanks to Roy Tromble, our TA, and to Jason Eisner and David Yarowsky, who invited us.
[223000170220] |breck
[223000180010] |Props to Mark Davis for Unicode and ICU
[223000180020] |Three cheers for Mark Davis, IBM tech staffer extraordinaire.
[223000180030] |Not only is he the co-author of unicode, he somehow found time to co-write an amazing Java package that does all the unicode munging and normalization you could dream of.
[223000180040] |And it supplies a whole bunch of charset encoders and decoders that don’t ship with Sun’s JVM.
[223000180050] |It’s all up to date with the latest unicode and plays nicely with both the 1.4 JDK and the 1.5 JDK.
[223000180060] |We’ve been working on international search, specifically in Chinese, Japanese, Korean, Arabic.
[223000180070] |We needed to be able to do unicode normalization.
[223000180080] |For instance, the composite Japanese character ぺ (code point 0x307A) is functionally equivalent to the pair of characters へ and ゚ (code points for the base character 0x3078 and combining mark 0x309A, the latter of which probably looks like garbage even if you have Japanese fonts installed).
[223000180090] |See the Unicode chart for Hiragana for a nice PDF rendering.
[223000180100] |But it turns out this problem is widespread.
[223000180110] |There are single unicode characters for typesetting ligatures like “ffi” in English.
[223000180120] |And characters like Ö (letter o with umlaut) may be written as two characters, with the diaeresis (two-dot diacritic) supplied as a combining character.
[223000180130] |There are also full-width and half-width versions of Latin1 characters which should be normalized.
[223000180140] |And font variants with their own characters (e.g. cursive vs. non-cursive in many languages).
[223000180150] |Oh, and fractions, superscripts, subscripts and scientific symbols.
[223000180160] |It even normalizes rotated geometric shapes.
[223000180170] |The rules for Hangul (Korean) are too complex to go into here, but suffice it to say there’s an algorithm behind it and a single "character" may be composed of three unicode characters.
[223000180180] |The normalization we’re using is NFKC, which involves aggressive decomposition (called “compatibility”) and standard recomposition (called “canonical”).
[223000180190] |Read the spec.
[223000180200] |Mark Davis, co-author of the normalization spec, also co-authored a Java package to deal with the vagaries of unicode programatically.
[223000180210] |What we’ve been using is the normalization.
[223000180220] |If you use character data at all, drop what you’re doing (unless it’s reading the spec, in which case you may finish) and check out IBM’s open source Components for Unicode, which is available in C and Java versions.
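For example, here’s NFKC normalization on the hiragana example above; the post is about ICU, but this sketch uses java.text.Normalizer (JDK 6+) just to stay self-contained.

    import java.text.Normalizer;

    public class NfkcDemo {
        public static void main(String[] args) {
            String composed = "\u307A";          // single composite hiragana character
            String decomposed = "\u3078\u309A";  // base character + combining mark

            // NFKC maps both spellings to the same canonical composed form.
            String n1 = Normalizer.normalize(composed, Normalizer.Form.NFKC);
            String n2 = Normalizer.normalize(decomposed, Normalizer.Form.NFKC);
            System.out.println(n1.equals(n2));   // true

            // Compatibility decomposition also folds things like full-width Latin
            // characters down to their ordinary equivalents.
            System.out.println(Normalizer.normalize("\uFF21", Normalizer.Form.NFKC)); // prints "A"
        }
    }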
[223000190010] |At NLPers: Scoring Named Entity Recognition
[223000190020] |Check out Chris Manning’s post on scoring NE recognition for applications and my and Hal’s responses at:
[223000190030] |http://nlpers.blogspot.com/2006/08/doing-named-entity-recognition-dont.html
[223000190040] |The gist of Chris’s response is that F1 “double”-penalizes, so it’s the wrong metric to optimize for applications.
[223000190050] |I respond that the problem is in trying to do first-best only scoring.
[223000200010] |Bayesian Language Models: Latent Dirichlet and Pitman-Yor
[223000200020] |I’ve recently run across not one, but two recent papers on Bayesian language modeling.
[223000200030] |I haven’t been living under a rock, but it has taken me a while to get my head around Bayesian modeling.
[223000200040] |The first of these is a very flexible hierarchical model:
[223000200050] |David M. Blei, Andrew Y. Ng and Michael I. Jordan. 2003.
[223000200060] |Latent Dirichlet Allocation.
[223000200070] |JMLR.
[223000200080] |Their approach is so by-the-book that it would have made a nice example in either the hierarchical models chapter or the posterior mode chapter of Gelman et al.’s Bayesian Data Analysis.
[223000200090] |This model generates a document by first selecting a multinomial topic distribution, then for each word in the paper, selecting a topic, then for each topic, generating a word based on a conditional probability estimate of the word given the topic.
[223000200100] |The Dirichlet distribution in the title is a conjugate prior for multinomials and is used to represent the distribution from which the topic distributions are selected; the posterior distribution (after EM-style training) is also a Dirichlet.
[223000200110] |The topics are latent in that they are bootstrapped using EM rather than trained.
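For the curious, here’s a toy simulation of that generative story with two made-up topics and a symmetric Dirichlet prior with alpha=1 (which keeps the sampling trivial); it’s purely illustrative.

    import java.util.Random;

    // Toy sketch of the LDA generative story: draw a topic distribution for the
    // document, then for each word position draw a topic and then a word.
    public class LdaGenerativeSketch {

        public static void main(String[] args) {
            Random rnd = new Random(0);
            String[][] topicWords = {
                { "gene", "protein", "expression" },    // a "biology" topic
                { "model", "probability", "estimate" }  // a "statistics" topic
            };
            double[][] wordProbs = {
                { 0.5, 0.3, 0.2 },
                { 0.4, 0.4, 0.2 }
            };

            // 1. Draw theta ~ Dirichlet(1,...,1): Gamma(1,1) draws, normalized.
            double[] theta = new double[topicWords.length];
            double sum = 0.0;
            for (int k = 0; k < theta.length; ++k) {
                theta[k] = -Math.log(rnd.nextDouble());
                sum += theta[k];
            }
            for (int k = 0; k < theta.length; ++k)
                theta[k] /= sum;

            // 2. For each word position, draw a topic, then a word given the topic.
            for (int i = 0; i < 10; ++i) {
                int topic = sample(theta, rnd);
                int word = sample(wordProbs[topic], rnd);
                System.out.print(topicWords[topic][word] + " ");
            }
        }

        // Draw an index from a discrete distribution.
        static int sample(double[] probs, Random rnd) {
            double u = rnd.nextDouble();
            double cumulative = 0.0;
            for (int i = 0; i < probs.length; ++i) {
                cumulative += probs[i];
                if (u < cumulative) return i;
            }
            return probs.length - 1;
        }
    }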
[223000200120] |The obvious applications are in multi-class clustering, where an optimal mixture of topics may be inferred for each document.
[223000200130] |It’s also being used for dimensionality reduction in the same way as latent semantic indexing (LSI).
[223000200140] |I’m going to use a simpler latent Dirichlet model for inferring paradigms for morphological stemming from unlabeled data.
[223000200150] |The inference is a bit simple as there’s only one paradigm chosen per word, in contrast with the multiple topics chosen per document.
[223000200160] |With EM-style training, this will amount to what’s often called "model-based clustering", or "soft clustering", or just "EM clustering".
[223000200170] |My models aren’t simple conditionals, but will be a combination of length model (for stems and suffixes) and a character language model for generating the actual stems.
[223000200180] |Note that this proposal is not at all what Erwin Chan did recently in his 2006 SIGPHON paper Learning probabilistic paradigms for morphology in a latent class model; he applied LDA to perform a divisive clustering of stems and suffixes, with the goal of retrieving linguistically sensible hard clusters for stems and suffixes.
[223000200190] |Another paper in this paradigm brings in correlation between topics, Lafferty and Blei’s 2005 NIPS paper, Correlated Topic Models.
[223000200200] |The basic Dirichlet/multinomial model has been used for IR (Lafferty and Zhai), as has the LDA (Zhai).
[223000200210] |The second of these papers solves a problem I’ve been mulling over for years, which is how to model the Zipf-like (power law) distribution of language with a reasonable prior and inference scheme.
[223000200220] |For the answer, look no further than:
[223000200230] |Yee Whye Teh. 2006.
[223000200240] |A hierarchical Bayesian LM based on Pitman-Yor Processes.
[223000200250] |ACL.
[223000200260] |The Pitman-Yor process is also known as a Chinese restaurant process.
[223000200270] |The standard presentation is that there’s a restaurant with infinitely many tables of infinite capacity and a sequence of customers showing up.
[223000200280] |Each customer either sits at an existing table with probability proportional to how many people are already sitting there, or sits at a new table with a small probability.
[223000200290] |This leads to a natural power-law distribution.
[223000200300] |What’s really nice is that simulated draws can be made pretty much the way the model’s described.
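Here’s a quick simulation of draws from the process exactly as described; the concentration parameter and random seed are arbitrary, and the Pitman-Yor discount parameter is omitted to keep the sketch minimal.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // A new customer joins an occupied table with probability proportional to
    // its occupancy, or starts a new table with probability proportional to alpha.
    public class RestaurantProcessDemo {

        public static void main(String[] args) {
            double alpha = 1.0;           // concentration: controls the new-table rate
            int numCustomers = 10000;
            List<Integer> tableCounts = new ArrayList<Integer>();
            Random random = new Random(42);

            for (int n = 0; n < numCustomers; ++n) {
                double u = random.nextDouble() * (n + alpha);
                if (u < alpha) {
                    tableCounts.add(1);   // sit at a new table
                } else {
                    // walk the tables; each has probability count/(n + alpha)
                    double cumulative = alpha;
                    for (int t = 0; t < tableCounts.size(); ++t) {
                        cumulative += tableCounts.get(t);
                        if (u < cumulative) {
                            tableCounts.set(t, tableCounts.get(t) + 1);
                            break;
                        }
                    }
                }
            }
            // Table sizes come out power-law-like: a few huge tables and a long
            // tail of singletons.
            System.out.println("number of tables: " + tableCounts.size());
            System.out.println("largest table:    " + java.util.Collections.max(tableCounts));
        }
    }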
[223000200310] |And what’s really neat is that the empirical Bayes priors look just like Kneser-Ney smoothing.
[223000200320] |The language model performs about as well as Kneser-Ney smoothing (which is itself just what is known as "update exclusion" in the prediction-by-partial matching compression literature) with token trigrams on 14M tokens of English news training data.
[223000200330] |LingPipe’s Witten-Bell smoothing for character n-grams performed indistinguishably from Kneser-Ney smoothing.
[223000200340] |In any case, Pitman-Yor’s far too costly for our uses, at least with the sampling Teh describes, but I’m curious as to how low we can bound the entropy rate of a source like MEDLINE.
[223000200350] |I think the analogy could be improved.
[223000200360] |Instead of tables, imagine a Chinese restaurant that can make infinitely many different dishes.
[223000200370] |Each customer comes in and says “I’ll have what he’s having” with probability proportional to the number of people already having a dish, and says “I’ll have something no one else is having” with some small probability.
[223000200380] |The obvious step is to combine these two ideas to get a very tight model.
[223000200390] |Use LDA to do what Chen and Goodman did with mixing topics, and use Pitman-Yor to do the estimation of the models (replacing the simple conditional estimates used in LDA).
[223000200400] |I wouldn’t be surprised if someone does this for ACL.
[223000210010] |LingPipe Talk at Columbia Uni, 26 Oct 2006
[223000210020] |I’ll be giving the following talk at Columbia next Thursday.
[223000210030] |4:15-5:15 PM CS Conference Room Mudd Building
[223000210040] |Character Language Modeling for Word Segmentation and Entity Detection
[223000210050] |Bob Carpenter Alias-i, Inc.
[223000210060] |I’ll discuss the application of LingPipe’s character language models to the two problems in Chinese language processing: word segmentation and named entity extraction.
[223000210070] |For word segmentation, we use the same noisy channel model as we use for spelling correction.
[223000210080] |The source model is a character language model trained on word segmented Chinese data.
[223000210090] |The channel model is weighted edit distance; for word segmentation, this is merely deterministic space deletion.
[223000210100] |There are no Chinese-specific features at all in the models.
[223000210110] |The bakeoff F1 measure for our segmenter was .961; the winning F1 was .972.
[223000210120] |For named entity extraction, we use a two stage process.
[223000210130] |The first stage is an HMM with character language model emissions.
[223000210140] |For Chinese, where we consider each character a token, this reduces to the more usual multinomial emission HMM.
[223000210150] |We code entity-extraction as a tagging problem using fine-grained states to effectively encode a higher-order HMM.
[223000210160] |For rescoring, we use a pure character language model approach that allows longer distance dependencies, encoding chunking information as characters within the models.
[223000210170] |As with word segmentation, there are no Chinese-specific features.
[223000210180] |The bakeoff F1 for our entity extractor was .855; the winning F1 was .890.
[223000210190] |Time permitting, I’ll discuss our confidence ranking entity and part-of-speech taggers and show some output from MEDLINE POS tagging and gene mention extraction.
[223000210200] |The LingPipe web site provides tutorials on both word segmentation and entity extraction.
[223000210210] |There are also web demos for both applications.
[223000210220] |The sandbox contains the complete code used to generate entries for the SIGHAN bakeoff; the data is available from SIGHAN.
[223000210230] |Two papers covering roughly the same material as the talk are available at:
[223000210240] |Char LM Paper
[223000210250] |Word Segmentation and Entity Paper
[223000220010] |Biocreative Encore: High Precision and High Recall Entity Extraction
[223000220020] |I verified that the Biocreative scoring script could handle inputs with overlaps.
[223000220030] |It can!
[223000220040] |This is great.
[223000220050] |All the scoring scripts for named entity detection should work like this one.
[223000220060] |Even better, Biocreative allows 3 submissions, so we could get in our first-best rescoring model for the big F measure, but still have one submission aiming for high precision and one for high recall.
[223000220070] |We’re really excited about being able to tune these values, and we finally have a public evaluation that will allow us to submit relevant results.
[223000220080] |So with a couple hours still to go before a party tonight, I decided it’d be worth writing a confidence-based entry.
[223000220090] |It simply uses the CharLmHmmChunker as a ConfidenceChunker and sets a confidence threshold.
[223000220100] |I put in one entry with threshold 0.90 and one with 0.0001.
[223000220110] |The high precision setting (0.90 probability estimate or better) returned only 1/5 as many entity mentions as the first-best entry.
[223000220120] |The high recall setting (0.0001 probability estimate or better) returned about 5 times as many mentions as the first-best entry.
[223000220130] |Given our previous experiments on NCBI’s GeneTag corpus, as outlined in our named entity tutorial, these should result in relatively high precision and recall respectively compared to the first-best entry.
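The thresholding itself is about as simple as it sounds; here’s a sketch against simplified stand-in types (not necessarily LingPipe’s exact API).

    import java.util.ArrayList;
    import java.util.List;

    // Confidence thresholding for entity mentions: keep only mentions whose
    // probability estimate meets the threshold (0.90 for the high-precision run,
    // 0.0001 for the high-recall run).
    public class ConfidenceFilter {

        interface Chunk {
            int start();
            int end();
            String type();
            double confidence(); // probability estimate that the chunk is correct
        }

        interface ConfidenceChunker {
            List<Chunk> nBestChunks(CharSequence text, int maxNBest);
        }

        static List<Chunk> chunksAtOrAbove(ConfidenceChunker chunker,
                                           CharSequence text,
                                           double threshold,
                                           int maxNBest) {
            List<Chunk> result = new ArrayList<Chunk>();
            for (Chunk chunk : chunker.nBestChunks(text, maxNBest))
                if (chunk.confidence() >= threshold)
                    result.add(chunk);
            return result;
        }
    }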
[223000220140] |All in, it took about another hour and a half or so, including digging up final result email addresses, submitting our results, and writing this blog entry.
[223000220150] |Everything’s in the CVS sandbox module biocreative2006.
[223000220160] |See the last blog entry for anonymous checkout details.
[223000230010] |LingPipe Biocreative Entry: Process Details
[223000230020] |We decided that BioCreative was a fairly low priority given the number of commercial projects we have going on and the number of research collaborations and grant applications we’re putting in.
[223000230030] |But I figured it wouldn’t take that long, so I started after lunch the Saturday afternoon before it was due.
[223000230040] |Let’s see how it goes…
[223000230050] |Saturday, 1:00 PM —create sandbox project in CVS and add ant build file copied over from SIGHan.
[223000230060] |You can check it out from our anonymous CVS sandbox:
[223000230070] |See LingPipe Sandbox for more information on checking projects out of the sandbox.
[223000230080] |Saturday, 1:10 PM —Found data.
[223000230090] |Downloaded into dist directory in project.
[223000230100] |Saturday, 1:30 PM —Found our team ID; I need better mail search or organization (as do other people, judging by the list traffic).
[223000230110] |Found the data file submission format, but not the actual format.
[223000230120] |Saturday, 1:35 PM —Completed system description for submission.
[223000230130] |Saturday, 1:40 PM —Ant task for unpacking data done.
[223000230140] |Saturday, 1:45 PM —Found data format.
[223000230150] |It’s different than Biocreative 1, and it’s different than the GeneTag format on NCBI, though pretty close.
[223000230160] |I forgot they don’t count spaces.
[223000230170] |I have the parser for that already in LingPipe, but not the generator required for the bakeoff.
[223000230180] |More munging code.
[223000230190] |Saturday, 1:55 PM —Rewrote GeneTagChunkParser as Biocreative2006ChunkParser.
[223000230200] |Saturday, 2:00 PM —Wrote top level run1 ant task.
[223000230210] |Saturday, 2:20 PM —Found a bug in training data.
[223000230220] |Or at least something I didn’t expect —overlapping entities:
[223000230230] |P01406630A0965|12 14|p50
P01406630A0965|18 20|p65
P01406630A0965|18 40|p65-selected kappa B motif
P01406630A0965|139 167|heterodimeric NF-kappa B complex
[223000230240] |For now, I’ll just catch an exception and see how many there are.
[223000230250] |Saturday, 2:25 PM —Four, it turns out.
[223000230260] |I’m going to leave the code as is.
[223000230270] |Saturday, 2:30 PM —Sent the list email about the four overlapping cases.
[223000230280] |Saturday, 2:55 PM —I always forget how to test overlap, and then I had a problem with the scope of an accumulator, so it took a while to get rid of overlaps.
[223000230290] |I just keep the first one rather than the longest.
[223000230300] |Saturday, 3:00 PM —Test run seems to work.
[223000230310] |Taking a short break.
[223000230320] |Saturday, 3:15 PM —Back from short break.
[223000230330] |Saturday, 3:25 PM —Finished test data parser.
[223000230340] |Saturday, 3:40 PM —Wow, only took about 15 minutes to get the output parser working right.
[223000230350] |It sure helps having done all this offset stuff about a gazillion times before.
[223000230360] |I was tripped up on the reverse end-of-line computation and by the fact that it’s [start,end] closed-closed notation and not [start,end) half-open interval notation.
[223000230370] |The half-open notation is what we use in LingPipe’s chunkers and what Java uses for String, CharSequence and Array operations.
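For the record, the conversion is a one-liner each way; a minimal sketch:

    // LingPipe and Java's String/CharSequence operations use half-open [start, end)
    // spans, while the bakeoff format uses closed [start, end] spans, so the
    // conversion is end - 1 on output and end + 1 when reading the data back in.
    // (This ignores the separate "don't count spaces" wrinkle mentioned above.)
    public class SpanConversion {

        static int toClosedEnd(int halfOpenEnd) {
            return halfOpenEnd - 1;
        }

        static int toHalfOpenEnd(int closedEnd) {
            return closedEnd + 1;
        }

        public static void main(String[] args) {
            String text = "NF-kappa B complex";
            int start = 0;
            int halfOpenEnd = 8;                                      // spans "NF-kappa"
            System.out.println(text.substring(start, halfOpenEnd));   // prints "NF-kappa"
            System.out.println(start + " " + toClosedEnd(halfOpenEnd)); // "0 7" in closed notation
        }
    }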
[223000230380] |I’m running the training data through the system.
[223000230390] |The first lines look OK.
[223000230400] |If the scoring works, I’ll crank the n-gram length up to 8 and let it rip on the official file.
[223000230410] |Saturday, 4:05 PM —Verified against the perl eval.
[223000230420] |I had to create a small test set manually.
[223000230430] |The perl script wouldn’t work in dos, though it took a while to occur to me that maybe I should try it in Cygwin.
[223000230440] |Don’t know what’s wrong here —probably some crazy perl/windows thing I don’t want to know about.
[223000230450] |F=.968 with 5-gram models against the training data.
[223000230460] |It takes about a minute to train and compile the model.
[223000230470] |Saturday, 4:30 PM —Took a break for lunch.
[223000230480] |Saturday, 4:35 PM —Started run1.
[223000230490] |The program involves four hyperparameters: n-gram length=8, interpolation ratio=8, number of characters=128, number of chunkings rescored=1024.
[223000230500] |Saturday, 4:45 PM —That was pretty slow.
[223000230510] |I should’ve used estimate caching in the output run, but wanted to keep the code very simple.
[223000230520] |The whole train and run program’s only 89 lines, including blank ones and per-class imports; the only other program is the data format parser, which is about 117 lines long because of the complexity of the data.
[223000230530] |Total time: 3 hours, 45 minutes
Break time: 45 minutes
Net project time: 3 hours
[223000230540] |We’ll have to wait for the evaluation results, which I’ll post as another blog entry.