[222000140010] |Update on cat cognition
[222000140020] |For those who were interested in how cats see the world, check out this new study.
[222000140030] |I'm not sure it isn't just saying that cats have motor memory (as do humans), but it's interesting nonetheless.
[222000300010] |Your brain knows when you should be afraid, even if you don't
[222000300020] |I just got back to my desk after an excellent talk by Paul Whalen of Dartmouth College.
[222000300030] |Whalen studies the amygdala, an almond-shaped region buried deep in the brain.
[222000300040] |Scientists have long known that the amygdala is involved in emotional processing.
[222000300050] |For instance, when you look at a person whose facial expression is fearful, your amygdala gets activated.
[222000300060] |People with damage to their amygdalas have difficulty telling if a given facial expression is "fear" as opposed to just "neutral."
[222000300070] |It was an action-packed talk, and I recommend that anybody interested in the topic visit his website and read his latest work.
[222000300080] |What I'm going to write about here are some of his recent results -- some of which I don't think have been published yet -- investigating whether you have to be consciously aware of seeing a fearful face in order for your amygdala to become activated.
[222000300090] |The short answer is "no."
[222000300100] |What Whalen and his colleagues did was use an old trick called "masking."
[222000300110] |If you present one stimulus (say, a fearful face) very briefly (say, for 1/20 of a second) and then immediately present another stimulus (say, a neutral face), the viewer typically reports having seen only the second stimulus.
[222000300120] |Whalen used fMRI to scan the brains of people while they viewed emotional faces (fearful or happy) that were masked by neutral faces.
[222000300130] |The participants said they only saw neutral faces, but the brain scans showed that their amygdalas knew different.
[222000300140] |One question that has been on researchers' minds for a while is: what information does the amygdala care about?
[222000300150] |Is it the whole face?
[222000300160] |The color of the face?
[222000300170] |The eyes?
[222000300180] |Whalen ran a second experiment which was almost exactly the same, but he erased everything from the emotional faces except the eyes.
[222000300190] |The amygdala could still tell the fearful faces from the happy faces.
[222000300200] |You could be wondering, "Does it even matter if the amygdala can recognize happy and fearful eyes or faces that the person doesn't remember seeing?
[222000300210] |If the person didn't see the face, what effect can it have?"
[222000300220] |Quite possibly plenty.
[222000300230] |In one experiment, the participants were told about the masking and asked to guess whether they were seeing fearful or happy eyes.
[222000300240] |Note that the participants still claimed to be unable to see the emotional eyes.
[222000300250] |Still, they were able to guess correctly -- not often, but more often than if they were guessing randomly.
[222000300260] |So the information must be available on some level.
[222000300270] |There are several ways this might be possible.
[222000300280] |In ongoing research in Whalen's lab, he has found that people who view fearful faces are more alert and more able to remember what they see than people who view happy faces.
[222000300290] |Experiments in animals show that when you stimulate the amygdala, various things happen to your body such as your eyes dilating.
[222000300300] |Whalen interprets this in the following way: when you see somebody being fearful, it's probably a clue that there is something dangerous in the area, so you better pay attention and look around.
[222000300310] |It's possible that subjects who guessed correctly [this is my hypothesis, not his] were tapping into the physiological changes in their bodies in order to make these guesses.
[222000300320] |"I feel a little fearful.
[222000300330] |Maybe I just saw a fearful face."
[222000300340] |For previous posts about the dissociation between what you are consciously aware of and what your brain is aware of, click here, here and here.
[222000690010] |Language as a spherical cow
[222000690020] |Part of Noam Chomsky's famous revolution in linguistics (and cognitive science more broadly) was to focus on linguistic competency rather than performance.
[222000690030] |People stutter, use the wrong word, forget what they planned to say, change ideas mid-sentence and occasionally make grammatical errors.
[222000690040] |Chomsky focused not on what people do say, but on what they would say without any such slip-ups.*
[222000690050] |This certainly simplified the study of language, but one has to wonder what this spherical cow leaves out.
[222000690060] |Economists similarly made great strides by assuming all people are perfectly rational, think on the margin, and have full access to all necessary information free of cost.
[222000690070] |However, any theory based on these clearly false premises is limited in its explanatory power.
[222000690080] |Speech errors carry information.
[222000690090] |This was brought home to me by a recent email I received which began, "Er, yes."
[222000690100] |If filler words carried no information, there would be no reason to transcribe them.
[222000690110] |(Lancelot once asked a similar question.)
[222000690120] |However, people clearly do.
[222000690130] |A quick Google search found over seven million hits for "uhhh" and over twenty-one million hits for "ummm."
[222000690140] |These include quotes like "Ummm...
[222000690150] |Go Twins?" and "Uhhh...
[222000690160] |What did she just say?"
[222000690170] |These two quotes are suggestive, but I don't know if all transcription of filler words and other speech errors can be explained as a single phenomenon.
[222000690180] |I did hear of one study showing that when a speaker pauses and appears to have difficulty finding a particular word, listeners normally assume the word is low-frequency.
[222000690190] |However, listeners drop this assumption if they believe the speaker has a neurological impairment that affects speech.
[222000690200] |I expect that many phenomena dismissed as "performance" rather than "competence" are in fact important in communication.
[222000690210] |Whether one believes that communication should be part of any theory of language is debated (Chomsky seems to think language has nothing to do with communication).
[222000690220] |*This part of linguistics is still very influential in psychology.
[222000690230] |I'm not sufficiently current in linguistics to say whether most linguists still do research this way.
[222000850010] |Anonymice run wild through science
[222000850020] |I recently mentioned Jack Shafer's long-standing irritation at the over-use of anonymous sources in journalism.
[222000850030] |Sometimes the irritation is at using anonymous sources to report banalities.
[222000850040] |In my favorite column in that series (which has unfortunately been moribund for the last year or two), Shafer calls out anonymous sources whose identities are transparent.
[222000850050] |Why pretend to be anonymous when a simple Google search will identify you?
[222000850060] |I had a similar question recently when reading the method section of a psychology research paper.
[222000850070] |Here is the first paragraph from the method section:
[222000850080] |Sixteen 4-year-olds (age: M = 4,7; range = 4,1-4,11), and 25 college students (age: M = 18,10; range = 18,4-19,6) participated in this study.
[222000850090] |All participants were residents of a small midwestern city.
[222000850100] |Children were recruited from university-run preschools and adults were undergraduate students at a large public university.
[222000850110] |Small midwestern city?
[222000850120] |Large public university?
[222000850130] |I could Google the two authors, but luckily the paper already states helpfully on the front page that both authors work at the University of Michigan, located in Ann Arbor (a small midwestern city).
[222000850140] |Maybe the subject recruitment and testing was done in some other university town, but that's unlikely.
[222000850150] |This false anonymity is common -- though not universal -- in psychology papers.
[222000850160] |I'm picking on this one not because I have any particular beef with these authors (which is why I'm not naming names), but simply because I happened to be reading their paper today.
[222000850170] |This brings up the larger issue of the code of ethics under which research is done (here are the regulations at Harvard).
[222000850180] |After some notable ethical lapses in the early days of human research (for instance, Dr. Waterhouse trying out the smallpox vaccine on children), it became clear that regulations were needed.
[222000850190] |As with any regulations, however, form often wins over substance.
[222000850200] |A lab I used to work at had a very short consent form that said something to the effect that in the experiment, you'll read words, speak out loud, and it won't hurt.
[222000850210] |This was later replaced with a multi-page consent form, probably at the request of our university ethics board, but I'm not sure.
[222000850220] |The effect was that our participants stopped reading the consent form before signing it.
[222000850230] |This was entirely predictable, and I think it is an example of valuing the form -- in particular, having participants sign a form -- over substance -- protecting research participants.
[222000850240] |Since most of the research in a psychology department is less dangerous than filling out a Cosmo quiz, this doesn't really keep me up at night.
[222000850250] |However, I think it's worth periodically rethinking our regulations in light of their purpose.
[222002120010] |Even Experts Don't Know what Brain Scans Mean
[222002120020] |For some reason, many people find neuroscience more compelling than psychology.
[222002120030] |That is, if you tell them that men seem to like video games more than women, they are unconvinced, but if you say that brain scans of men and women playing video games show that the pleasure centers of their brains respond to video games, suddenly it all seems more compelling.
[222002120040] |More flavors is more fun, and the world can accept variation in what types of evidence people find compelling -- and we're probably the better for it.
[222002120050] |In this case, though, there is a problem in that neuroscientific data is very hard to interpret.
[222002120060] |Jerome Kagan said it perfectly in his latest book, so I'll leave it to him:
[222002120070] |A more persuasive example is seen in the reactions to pictures that are symbolic of unpleasant (snakes, bloodied bodies), pleasant (children playing, couples kissing), or neutral (tables, chairs) emotional situations. The unpleasant scenes typically induce the largest eyeblink startle response to a loud sound due to recruitment of the amygdala.
[222002120080] |However, there is greater blood flow to temporal and parietal areas to the pleasant than to the unpleasant pictures, and, making matters more ambiguous, the amplitudes of the event-related waveform eight-tenths of a second after the appearance of the photographs are equivalent to the pleasant and unpleasant scenes.
[222002120090] |A scientist who wanted to know whether unpleasant or pleasant scenes were more arousing could arrive at three different conclusions depending on the evidence selected.
[222002120100] |Daniel Engber in Slate has more excellent discussion of this problem.
[222002120110] |Similarly, many posts ago, I noted that another Harvard psychologist, Dan Gilbert, prefers simply asking people whether they are happy rather than using a physiological measure, because the only reason we think a particular physiological measure indicates happiness is that it correlates with people's self-reports of being happy.
[222002120120] |In other words, using any physiological measure (including brain scans) as indication of a mental state is circular.
[222002120130] |---- Kagan (2007) What Is Emotion, pp. 81-82.
[222002120140] |---- PS Since I've been writing about Russian lately, I wanted to mention an English-language Russian news aggregator that I came across.
[222002120150] |This site is from the writer behind the well-known Siberian Light Russia blog.
[222002140010] |What is the Longest Sentence in English?
[222002140020] |Writers periodically compete to see who can write the longest sentence in literature.
[222002140030] |James Joyce long held the English record with a 4,391 word sentence in Ulysses.
[222002140040] |Jonathan Coe one-upped him in 2001 with a 13,955 word sentence in The Rotters' Club.
[222002140050] |More recently, a single-sentence, 469,375 word novel appeared.
[222002140060] |Will they ever run out of words?
[222002140070] |No.
[222002140080] |It's easy to come up with a long sentence if you want to, though typing it out may be a chore.
[222002140090] |Here's a simple recipe:
[222002140100] |1. Pick a sentence you like (e.g., "'Twas brillig and the slithy toves did gyre and gimble in the wabe.")
[222002140110] |2. Add "Mary said that" to the beginning of your sentence (e.g., "Mary said that 'twas brillig and the slithy toves did gyre and gimble in the wabe.")
[222002140120] |3. Add "John said that" to the beginning of your new sentence (e.g., "John said that Mary said that 'twas brillig and the slithy toves did gyre and gimble in the wabe.")
[222002140130] |4. Go back to step #2 and repeat.
[222002140140] |If you keep this up long enough, you'll have the longest sentence in English or any other language.
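If you want to automate a record attempt, the recipe is easy to mechanize. Here is a throwaway Java sketch (my own illustration; the starting sentence and the names are just the ones from the example above):

public class LongestSentence {
    public static void main(String[] args) {
        // Step 1: pick a sentence you like.
        String base = "'twas brillig and the slithy toves did gyre and gimble in the wabe.";
        String[] speakers = { "Mary", "John" };
        int rounds = 200000;  // enough "said that" clauses to beat the 469,375-word novel
        StringBuilder sb = new StringBuilder();
        // Steps 2-4: each pass wraps the sentence in one more "X said that" clause.
        for (int i = 0; i < rounds; ++i)
            sb.append(speakers[i % speakers.length]).append(" said that ");
        sb.append(base);
        System.out.println("word count: " + sb.toString().split("\\s+").length);
    }
}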
[222002140150] |Why this matters.
[222002140160] |There are reasons to care about this other than immortalizing your name.
[222002140170] |This formula is a proof by demonstration that language learning is not simply a matter of copying what you have heard others say.
[222002140180] |If language learning were just copying, nobody could ever produce a sentence longer than the longest one they had ever heard.
[222002140190] |However, making longer sentences is not simply a matter of stringing words together.
[222002140200] |You can't break the longest-sentence record by stringing together the names "John" and "Mary" 469,376 times.
[222002140210] |That wouldn't be a sentence.
[222002140220] |This exercise is one of the most famous proofs that language has structure, and speakers of a language have an intuitive understanding of that structure (the other famous proof arguably being the sentence Colorless green ideas sleep furiously.).
[222002170010] |Why People Don't Panic During a Plane Crash
[222002170020] |A lot has been made of the crew and passengers of United Flight 1549 and their failure to panic when their plane landed in the Hudson.
[222002170030] |For instance, here is the Well blog at the New York Times:
[222002170040] |Amanda Ripley, author of the book “The Unthinkable: Who Survives When Disaster Strikes —and Why” (Crown, 2008), notes that in this plane crash, like other major disasters, people tend to stay calm, quiet and helpful to others.
[222002170050] |“We’ve heard from people on the plane that once it crashed people were calm —the pervading sound was not screaming but silence, which is very typical ...
[222002170060] |The fear response is so evolved, it’s really going to take over in a situation like that.
[222002170070] |And it’s not in your interests to get hysterical.
[222002170080] |There’s some amount of reassurance in that I think.’’
[222002170090] |On a different topic, but along the same lines, the paper's Week in Review section discusses the fact that most people are coping with the recent economic collapse reasonably well, all things considered:
[222002170100] |Yet experts say that the recent spate of suicides, while undeniably sad, amounts to no more than anecdotal, personal tragedy.
[222002170110] |The vast majority of people can and sometimes do weather stinging humiliation and loss without suffering any psychological wounds, and they do it by drawing on resources which they barely know they have.
[222002170120] |Should we be surprised?
[222002170130] |This topic has come up here before.
[222002170140] |People are remarkably bad at predicting what will make them happy or sad.
[222002170150] |Evidence shows that while many people think having children will make them happy, most people's level of happiness actually drops significantly after having children and never fully recovers even after the kids grow up.
[222002170160] |On the other end of the scale, the Week in Review article notes that
[222002170170] |In a recently completed study of 16,000 people, tracked for much of their lives, Dr. Bonanno, along with Anthony Mancini of Columbia and Andrew Clark of the Paris School of Economics, found that some 60 percent of people whose spouse died showed no change in self-reported well-being.
[222002170180] |Among people who’d been divorced, more than 70 percent showed no change in mental health.
[222002170190] |This makes a certain amount of sense.
[222002170200] |Suppose the mafia threatens to burn down your shop if you don't pay protection money, and suppose you don't pay.
[222002170210] |They actually have very little incentive to follow through on the threat, since they don't actually want to burn down your shop -- what they want is the money.
[222002170220] |(This, according to psychologist Steve Pinker, is one of the reasons people issue threats obliquely -- "That's a nice shop you have here.
[222002170230] |It'd be a shame if anything happened to it." -- so that they don't have to follow through in order to save face.)
[222002170240] |Similarly, biology requires that we think we'll like having children in order to motivate us to have them.
[222002170250] |Biology also requires that we think our spouse dying would ruin our lives, in order to motivate us to take care of our spouse.
[222002170260] |But once we have children or our spouse dies, there is very little evolutionary benefit accrued by carrying through on the threat.
[222002170270] |Finding the idea of a plane crash very scary: useful.
[222002170280] |Mass panic and commotion during a crash: not so much.
[222002210010] |Getting in to Graduate School
[222002210020] |It's standard dogma that when the economy is bad, people go back to school.
[222002210030] |Although it doesn't appear to be major news yet, a number of schools are reporting an increase in applications (here and here, but see also here).
[222002210040] |Despite an increase in applications, it is very possible fewer people will actually go to graduate school.
[222002210050] |This recession may be unique.
[222002210060] |There are two problems.
[222002210070] |First, masters, MD and JD programs are very expensive, and students typically require loans.
[222002210080] |I shouldn't have to elaborate on why this might present a difficulty for the prospective graduate student right now.
[222002210090] |Second, universities are cutting the number of students they are admitting.
[222002210100] |I don't have systematic numbers, but I know that the Harvard Graduate School of Arts and Sciences is reducing the number of students admitted for PhD programs.
[222002210110] |If the richest university in the country is slashing enrollment, I don't think I'm going out on too far a limb in assuming others are as well.
[222002210120] |Large private universities are depending on their endowments (i.e., the stock market) to cover operating expenses, and students are expensive.
[222002210130] |State schools are dependent on government financing, which is also drying up.
[222002210140] |It is obvious why PhD students at a school like Harvard are an expense: instead of paying tuition, they are paid a salary by the school.
[222002210150] |I don't know if the enrollment cut will hit the professional schools.
[222002210160] |It is well-known that undergraduate programs are typically run at a short-term loss (tuition does not cover expenses), with the school figuring they'll make up the difference in future alumni donations.
[222002210170] |I do not know, but suspect, that the same is true for the professional schools.
[222002210180] |That said, the only schools at Harvard right now that don't seem to have a hiring freeze are the Law and Medical schools.
[222002210190] |As I said, this is not being widely reported, and I do not have numbers for the industry as a whole.
[222002210200] |Hopefully I am wrong, because such a trend would be bad.
[222002210210] |During a recession, more people suddenly have time for school.
[222002210220] |When the recovery comes, it meets a better-educated and more capable workforce, (presumably) further fueling the recovery.
[222002210230] |This time, the opposite may happen.
[222002740010] |Why do so many homophones have two pronunciations?
[222002740020] |An interest in puns has led me to start reading the literature on homophones.
[222002740030] |Interestingly, it appears that in the scientific literature "homophone" and "homograph" mean the same thing, which explains why there are so many papers about mispronouncing homophones.
[222002740040] |Here's a representative quote:
[222002740050] |"...reports a failure to use context in reading, by people with autism, such that homophones are mispronounced (eg: 'there was a tear in her eye' might be misread so as to sound like 'there was a tear in her dress')."
[222002740060] |Sticklers will note that "tear in her eye" actually does involve a homophone (tier), but I don't think that's what the authors meant.
[222002740070] |Readers of this blog know that I'm not a prescriptivist -- that is, I believe words mean whatever most speakers of a language think the words mean.
[222002740080] |So I'm not going to claim that these authors are misusing the word, since there seem to be so many of them.
[222002740090] |That said, it would be convenient to have a term for two words that have the same pronunciation which is distinct from the term for two words with distinct pronunciations but are written in the same way.
[222002800010] |Obama & I
[222002800020] |Geoff Nunberg has a fantastic Fresh Air commentary posted on his website about the political misuse of linguistic information.
[222002800030] |Pundits frequently use statistical information about language -- the frequency of the word I in a politician's speeches, for instance -- to editorialize about the politician's outlook or personality.
[222002800040] |That is to say, pundits frequently misuse statistical information.
[222002800050] |Most of what they say on the topic is nonsense.
[222002800060] |Nunberg has the details, so I won't repeat them here.
[222002800070] |There is one segment worth quoting in full, though:
[222002800080] |To Liberman, those misperceptions suggest that Will and Fish are suffering from what psychologists call confirmation bias.
[222002800090] |If you're convinced that Obama is uppity or arrogant, you're going to fix on every pronoun that seems to confirm that opinion.
[222003090010] |Briefings: New Science Budget
[222003090020] |Details on Obama's 2011 science budget are now available.
[222003090030] |The last issue of Nature has a run-down.
[222003090040] |The news is better than it could have been -- and certainly better than the disastrous Bush years.
[222003090050] |Cancellation of the Constellation program (the replacement for the Shuttle) and the moon mission made the headlines, but despite that, NASA's budget will increase slightly.
[222003090060] |The end of the Constellation project would have seriously increased the amount available for science, but in fact a lot of the money budgeted for that will be spent stimulating the development of commercial rockets.
[222003090070] |NIH is getting a $1 billion increase -- which only amounts to 3.2% because NIH is the biggest of the US science programs.
[222003090080] |Because the NIH received $10.4 billion in stimulus funds, the number of grants they will be able to give out in 2011 will fall considerably.
[222003090090] |One nice piece of news is that stipends for many NIH-supported doctoral students and post-doctoral fellows will rise, showing the administration's continued focus on supporting young scientists.
[222003090100] |The DoE's budget is getting a significant boost of 7% to $28.4 billion, with money going to energy research and development, nuclear weapons and physical sciences.
[222003090110] |The NSF -- the smallest of the non-defense research programs but the one that funds me and most psycholinguists -- is getting a small hike up from $6.9 billion to $7.4 billion.
[222003090120] |Most of what I've seen in the science press has been relative contentment with the budget, given that many other programs are being cut.
[222003090130] |That said, it's worth keeping in mind that the last decade saw the US losing steady ground to the rest of the world in science and technology; whether small increases will help remains to be seen.
[222003520010] |Academics on the Web
[222003520020] |The Prodigal Academic, in discussing "Things I Wish I Knew Before Starting on the [Tenure Track]", writes:
[222003520030] |Actually spend time on my group website.
[222003520040] |This is a great recruiting tool!
[222003520050] |Students look at the departmental website before they arrive on campus to plan out their potential advisors.
[222003520060] |As someone closer to the applying-to-school phase than TPA, I admit that there are schools I probably did not consider as carefully as I should have because their websites were skimpy and I had difficulty finding much information.
[222003520070] |In fact, even though our department has a relatively good website, I very nearly didn't come to Harvard because I couldn't find the information I needed.
[222003520080] |I came from an adult psycholinguistics background, so I had never read any of my eventual advisor's (developmental) papers.
[222003520090] |We went to different conferences.
[222003520100] |Harvard's departmental website is set up around five research areas: Cognition, Brain & Behavior; Developmental; Clinical; Social; Organizational Behavior.
[222003520110] |Since I was cognitive, I checked the cognitive section for psycholinguists and didn't see her.
[222003520120] |I only found out about her because I ended up working at Harvard for a year as a research assistant in a different lab.
[222003520130] |Again, I actually like our department's website.
[222003520140] |This is just a story about how the organization of websites can ultimately have an important effect.
[222003520150] |Websites are also important for disseminating research.
[222003520160] |When I come across an interesting paper by a researcher I don't know, I almost always check their website.
[222003520170] |If they have papers there for download, I read any that look interesting.
[222003520180] |I've discovered some very important work this way.
[222003520190] |But if they don't have papers available (or, as sometimes happens, don't even have a website), that's often the end of the journey.
[222003760010] |Help! I need data!
[222003760020] |Data collection keeps plugging along at GamesWithWords.org. Unfortunately, as usual, it's not the experiments for which I most need data that get the most traffic.
[222003760030] |Puntastic had around 200 participants in the last month.
[222003760040] |I'd like to get more than that, and I'd like to get more than that in all my experiments.
[222003760050] |But if I had to choose one to get 200 participants, it would be The Video Test, which only got 17.
[222003760060] |The Video Test is the final experiment in a series that goes back to 2006.
[222003760070] |We submitted a paper in 2007, which was rejected.
[222003760080] |We did some follow-up experiments and resubmitted.
[222003760090] |More than once.
[222003760100] |Personally, I think we've simply had bad luck with reviewers, since the data are pretty compelling.
[222003760110] |Anyway, we're running one last monster experiment, replicating all our previous conditions every which way.
[222003760120] |It needs about 400 participants, though for really beautiful data I'd like about 800.
[222003760130] |We've got 140.
[222003760140] |As I said, recruitment has been slow for this experiment.
[222003760150] |So... if you have never done this experiment before (it involves watching a video and taking a memory test), please do.
[222003760160] |I'd love to get this project off my plate.
[222003860010] |Slate's Report on Hauser Borders on Fraud
[222003860020] |Love, turned sour, is every bit as fierce.
[222003860030] |I haven't written about the Hauser saga for a number of reasons.
[222003860040] |I know and like the guy, and I find nothing but sadness in the whole situation.
[222003860050] |Nonetheless, I've of course been following the reports, and I wondered why my once-favorite magazine had so long been silent.
[222003860060] |Enjoying my fastest Wi-Fi connection in weeks here at the Heathrow Yotel, I finally found Slate's take on the scandal, subtitled What went wrong with Marc Hauser's search for moral foundations.
[222003860070] |The article has a nice historical overview of Hauser's work, in context, and neatly describes several experiments.
[222003860080] |The article is cagey, but you could be excused for believing that (a) Hauser has done a lot of moral cognition research with monkeys, and (b) that work was fraudulent.
[222003860090] |The only problem is that nobody, to my knowledge, has called Hauser's moral cognition research into question -- in fact, most people have gone out of their way to say that that work (done nearly exclusively with humans) replicates very nicely.
[222003860100] |There was some concern about some work on intention-understanding in monkeys, which is probably a prerequisite for some types of moral cognition, but that's not the work one thinks of when talking about Hauser's Moral Grammar hypothesis.
[222003860110] |I can't tell if this was deliberately misleading or just bad reporting, and I'm not sure which is more disturbing.
[222003860120] |Slate's science reporting has always been weak (see here, here, here, and especially here), and the entire magazine has been on a steady decline for several years.
[222003860130] |Sigh.
[222003860140] |I need a new magazine.
[223001140010] |Blog’s Myers-Briggs Type INTJ, says Typealyzer
[223001140020] |Typealyzer.com, from the folks at prfekt.se, determines the “personality type” of a blog.
[223001140030] |Specifically they’re using the ubiquitous Myers-Briggs Type Indicator, of a so-called “Jungian personality type”.
[223001140040] |Typealyzer’s web-based system is backed by UClassify.com‘s classifier-as-a-service implementation.
[223001140050] |I don’t know what they used for training data, but given that, it’d be easy to use LingPipe to do this kind of thing (see any one of our classifier tutorials).
[223001140060] |According to Typealyzer, the LingPipe blog has personality type INTJ, which they’ve dubbed “The Scientists”.
[223001140070] |Fair enough, but if you look at the “Analysis”, a star plot centered on a brain for left/right-brain graphical impact, you’ll see that we’re off the charts (well, off the brain anyway) on the Thinking (T) scale, pretty high on the Intuition (N) scale, and a bit over the edge on the Sensing (S) dimension.
[223001140080] |My (Bob’s) Myers-Briggs type is ENTP (E:21, N:45, T:39, P:39) as of last testing.
[223001140090] |My type hasn’t changed at all since high school.
[223001140100] |My mom tested me in high school because she was practicing giving the test for her doctoral dissertation on approaches to writing and personality type.
[223001140110] |If you read the Myers-Briggs.org descriptions, an ENTP sounds like a fickle INTJ.
[223001140120] |I’m thinking most blogs are going to be type E (extroverted) by the very nature of blogging.
[223001500010] |How Breck approaches new projects in natural language processing
[223001500020] |A skill developers typically don’t get in school is how to frame problems in terms of the messy, approximate world of heuristic and machine learning driven natural language processing.
[223001500030] |This blog entry should help shed some light on what remains a mostly self-taught black art.
[223001500040] |This is not the only way to do things, just my preferred way.
[223001500050] |At the top level I seek three things:
[223001500060] |1. Human annotated data that directly encodes the intended output of your NLP program.
[223001500070] |2. A brain dead, completely simple instance of a program that connects all inputs to the intended output.
[223001500080] |3. An evaluation setup that takes 1) and 2) and produces a score for how good a job the system did.
[223001500090] |That score should map to a management approved objective.
[223001500100] |Once I have the above I can then turn my attention to improving a score without worrying about whether I am solving the right problem (1 and 2 handle this) and whether I have sorted out access to the raw data and have a rough architecture that makes sense.
[223001500110] |Some more details on each point:
[223001500120] |Human Annotated Data
[223001500130] |If a human cannot carry out the task you expect the computer to do (given that we are doing NLP), then the project is extremely likely to fail.
[223001500140] |Humans are the best NLP systems in the world.
[223001500150] |Humans are just amazing at it and humans fail to appreciate the sophistication of what they do with zero effort.
[223001500160] |I almost always ask customers to provide annotated data before accepting work.
[223001500170] |What does this provide?
[223001500180] |Disambiguation: Annotated data forces a decision on what the NLP system is supposed to do and it communicates it clearly to all involved parties.
[223001500190] |It also keeps the project from morphing away from what is being developed without an explicit negotiation over the annotation.
[223001500200] |Buy in by relevant parties: It is amazing what happens when you sit management, UI developers, business development folks in a room and force them to take a text document and annotate it together.
[223001500210] |Disagreements that would be at the end of a project surface immediately, people know what they are buying and they get a sense that it might be hard.
[223001500220] |The majority of the hand waving “Oh, just have the NLP do the right thing” goes away.
[223001500230] |Bonus points if you have multiple people annotate the same document independently and compare them.
[223001500240] |If the agreement rate is low, how can you expect a piece of software to do it? (A quick way to check agreement is sketched at the end of this section.)
[223001500250] |Evaluation: The annotated data is a starting place for evaluation to take place.
[223001500260] |You have gold standard data to compare to.
[223001500270] |Without it you are flying blind.
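To make the agreement-rate point concrete, here's a small, self-contained Java sketch (not tied to any particular annotation tool) that computes raw agreement and chance-corrected agreement (Cohen's kappa) for two annotators labeling the same items:

import java.util.HashMap;
import java.util.Map;

public class AgreementCheck {
    public static void main(String[] args) {
        // Two annotators' labels over the same six documents (toy data).
        String[] a1 = { "POS", "NEG", "POS", "POS", "NEG", "POS" };
        String[] a2 = { "POS", "NEG", "NEG", "POS", "NEG", "POS" };
        int n = a1.length;

        // Raw agreement: fraction of items where the two labels match.
        int agree = 0;
        for (int i = 0; i < n; ++i)
            if (a1[i].equals(a2[i])) ++agree;
        double pObserved = agree / (double) n;

        // Chance agreement from each annotator's own label distribution.
        Map<String,Integer> c1 = counts(a1), c2 = counts(a2);
        double pChance = 0.0;
        for (String label : c1.keySet())
            pChance += (c1.get(label) / (double) n) * (c2.getOrDefault(label, 0) / (double) n);

        // Cohen's kappa: agreement above chance, scaled so 1.0 is perfect agreement.
        double kappa = (pObserved - pChance) / (1.0 - pChance);
        System.out.printf("raw agreement = %.2f, kappa = %.2f%n", pObserved, kappa);
    }

    static Map<String,Integer> counts(String[] labels) {
        Map<String,Integer> m = new HashMap<String,Integer>();
        for (String s : labels) m.merge(s, 1, Integer::sum);
        return m;
    }
}

If the kappa on a pilot annotation round is low, that's the time to tighten the annotation guidelines, before any NLP development starts.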
[223001500280] |Simple implementation that connects the bits
[223001500290] |I am what Bob calls a "thin wire developer" because I really prefer to reduce project risk by being sure all the bits of software/information can talk to each other.
[223001500300] |I have been amazed at how difficult access to data/logs/programs can be in enterprise setups.
[223001500310] |Some judgement is required here: I want to hit the places where there are likely blocks that may force completely different approaches (e.g., access to search engine logs for dynamic updates, or lists of names that should be tracked in data).
[223001500320] |Once again this forces decisions early in development rather than later.
[223001500330] |Unfortunately it takes experience to know what bits are likely to be difficult to get and valuable in the end system.
[223001500340] |Evaluation
[223001500350] |An evaluation setup will truly save the day.
[223001500360] |It is very frustrating to build a system where the evaluation consists of "eyeballing data by hand" (I actually said this at my PhD defense to the teasing delight of Michael Niv, a fellow graduate student, who to this day ponders my ocularly enhanced appendages).
[223001500370] |Some of the benefits are:
[223001500380] |Developers like a goal and like to see performance improve.
[223001500390] |It gets addictive and can be quite fun.
[223001500400] |You will get a better system as a result.
[223001500410] |If the evaluation numbers map well to the business objective then the NLP efforts are well aligned with what the business wants.
[223001500420] |(For graduate students, the business objective can be winning an academic bake-off.)
[223001500430] |Looks great to management.
[223001500440] |Tuning systems for better performance can be a long and opaque process to management.
[223001500450] |I got some good advice to always link the quality of the GUI (Graphical User Interface) to the quality of the underlying software to communicate transparently the state of the project.
[223001500460] |An evaluation score that is better than last month communicates the same thing especially if management helped design the evaluation metric.
[223001500470] |I will likely continue this blog thread picking up in greater detail the above points.
[223001500480] |Perhaps some use cases would be informative.
[223001500490] |Breck
[223001850010] |Google Squared: General Entity and Relation Extraction at a Web Scale
[223001850020] |Wow!
[223001850030] |When I came into the office yesterday, Mike was all revved up about Google Squared.
[223001850040] |If you’re not a natural language processing or search geek, you might not realize just how amazing this application is.
[223001850050] |What is Google Squared?
[223001850060] |Google Squared does arbitrary entity extraction and classification, relation extraction, relation clustering and labeling, all in a very nice spread-sheet like AJAX interface.
[223001850070] |It starts with the same old empty query box.
[223001850080] |Only now you type in a class of entities, such as
.
[223001850090] |Then what you get is something that looks like this:
[223001850100] |Each row represents an instance of the entity type, here a U.S. national park.
[223001850110] |The parks include Yellowstone, Yosemite and the Grand Canyon, so the most famous ones definitely show up in the list, along with oddball parks like the Channel Islands.
[223001850120] |The first column is the name.
[223001850130] |Just extracting a list of names of national parks given the entity type is a very hard problem to do for arbitrary inputs at a web scale.
[223001850140] |The other columns are information about the entity in question, or at least that’s the idea.
[223001850150] |First, there’s a picture, which shows up in every “square” I’ve seen.
[223001850160] |For national parks, Google extracted three columns, “Nearest City”, “Established” and “Rooms”.
[223001850170] |So even for a softball query like this one, they’re pulling out oddball results like number of rooms.
[223001850180] |The nearest cities are good, though they indicate a huge problem with this technology: granularity.
[223001850190] |What we really want is the “closest city we’ve heard of”, not the town of 10 people right on its doorstep.
[223001850200] |You can also see the lack of uniformity of results, with some cities being listed as just city names (e.g. “Santa Barbara”), and some with states (e.g. “San Francisco, California”).
[223001850210] |One of the first things I asked is “what if there are multiple fillers?” (we were looking at baseball players and a column corresponding to their teams).
[223001850220] |Google’s got you covered, allowing multiple fillers (e.g. “Fredonia, Arizona (North Rim) and Grand Canyon, Arizona”).
[223001850230] |If you’re just threshold here, it’s easy to look stupid, such as pulling out two names for the same team.
[223001850240] |The “established” column makes sense, and it pulls out dates.
[223001850250] |I’m guessing there are some canonical entity types, such as like dates and locations, that it’s looking for in a more structured way (rather than just pulling out arbitrary relations).
[223001850260] |Given their existing technology, they can geolocate a term and then use nearness on a map to pull out things like nearest cities.
[223001850270] |But how do they title it?
[223001850280] |The column for “rooms” is clearly an error.
[223001850290] |They’re confusing hotels with the park itself, which is again, frighteningly easy to do on this kind of task.
[223001850300] |The amazing thing is that the whole chart’s not junk, not that there are some errors.
[223001850310] |Expand the Chart
[223001850320] |After it suggests columns for relations, you can delete them and add your own.
[223001850330] |So I tried “elevation”, which got mixed results, ranging from “a steep 1,100 foot climb” for Yellowstone to a number “10500” with no units for Yosemite to the answer “No” for Golden Gate Recreation Area, which has a zen-like ring of truth (it's on the coast).
[223001850340] |The column “attraction” gets no useful results (I was hoping for “Old Faithful”, etc.).
[223001850350] |If I add “state”, it totally nails it.
[223001850360] |“Season” gets one useful result, and “Open” some times.
[223001850370] |But it’s hardly something I could use for trip planning.
[223001850380] |The table also provides suggested further columns, here “Children”, “Latitude”, “Longitude”, and “Local Climate”.
[223001850390] |Hmm.
[223001850400] |The column “Children” contained odd results like “0 1 2 3” and “0, 1, 2, 3, 4”.
[223001850410] |You can also add more rows.
[223001850420] |So I tried “St. John” (U.S. Virgin Islands, with one of my fave parks), but wound up getting Acadia National Park (another incredible Rockefeller-bequest).
[223001850430] |So I binged the real name using query
and found that its name is “Virgin Islands National Park”.
[223001850440] |I start typing “Vir..” and it autocompletes for me.
[223001850450] |Did I say this app was super-duper cool?
[223001850460] |I don’t know about “Charlotte Amilie” —it’s the capital of the Virgin Islands, but a quick bing for (
) returns a top-hit with snippet beginning with “Saint John: Largest city: Cruz Bay (2,743)…” Did I mention that Bing does a good job with those snippets?
[223001850470] |Yes, but Does it Work for Genes?
[223001850480] |Whew.
[223001850490] |At least we’re not out of a job (yet).
[223001850500] |If I type in “gene” as the type, I indeed get some genes, but what’s that “MHC class I” doing in row two?
[223001850510] |Google Squared introduces columns for “OMIM” (Online Mendelian Inheritance in Man database, which lists genes, known mutations, and phenotypes), “Uniprot” (the modern union of Swiss-Prot and TREMBL, listing known and hypothesized proteins), and “Symbol”.
[223001850520] |The symbols were only of mediocre quality if what they wanted was the Entrez official symbol for a gene.
[223001850530] |I’m not sure why it proposed the column “Uniprot”, because it didn’t find a single value.
[223001850540] |It’s easy to add columns for your favorite gene (aka YFG, the geneticists name for a dummy variable, which an English speaking computer scientist would call “foo”).
[223001850550] |Sheesh —typing “YFG” just for laughs gives me an image of a topless woman tuning a TV?
[223001850560] |If I use “your favorite gene”, I get references to search engines that’ll search for it for you.
[223001850570] |It does a good job pulling back genes by official name.
[223001850580] |But it’s confused as we are with families (e.g. it pulls back “auts-2″ just fine, but is confused by the family “auts” —these are autism-related genes).
[223001850590] |What’s interesting here is that it’s pulling information from all sorts of sources ranging from Wikipedia to GeneCards.
[223001850600] |So the $1M question for us is whether it can list the relevant facets of a gene.
[223001850610] |For instance, I want to know the proteins it produces.
[223001850620] |No luck with “protein” or “product”.
[223001850630] |What about its position?
[223001850640] |Nope, adding columns like "position" or "location" returns amusing fillers like "Jerusalem", and their suggested "start" also provides meaningless fillers.
[223001850650] |The column “function” is much better, but it’s hardly like using Entrez.
[223001850660] |You get “tumor suppressor” for p53, and “signal transduction” for insulin.
[223001850670] |I had no luck at all trying to find interactions (“interaction”, “regulation”, “methylation”, “binds”, etc.), which are all very nicely faceted in the Entrez database.
[223001850680] |Some Nerve
[223001850690] |It takes some nerve to roll out a technology as brittle as this.
[223001850700] |The sheer size and difficulty of the task are awe-inspiring.
[223001850710] |The labbers did a great job implementing this.
[223001850720] |The real question is: will the civilians be as impressed as me, and more importantly, will they find use cases for it?
[223001850730] |Given the quality, I still can’t think of anywhere I’d use this over plain-old search.
[223001850740] |I can imagine some application where I need to discover members of some class I don’t already know.
[223001850750] |Ironically, I think the real competition here is Wikipedia.
[223001850760] |It’s the old manual-labor versus automation battle, but with crowdsourcing on the manual side versus natural language processing for automation.
[223001850770] |For instance, check out Wikipedia’s National Park Service entry.
[223002680010] |Evaluating with Unbalanced Training Data
[223002680020] |For simplicity, we’ll only consider binary classifiers, so the population consists of positive and negative instances, such as relevant and irrelevant documents for a search query, or patients who have or do not have a disease.
[223002680030] |Quite often, the available training data for a task is not balanced with the same ratio of positive and negative instances as the population of interest.
[223002680040] |For instance, in many information retrieval evaluations, the training data is positively biased because it is annotated from the top results of many search engines.
[223002680050] |In many epidemiological applications, there is also positive bias because a study selects from a high risk subpopulation of the actual population.
[223002680060] |Even with unbalanced training data, we might still want to be able to calculate precision, recall, and similar statistics for the actual population.
[223002680070] |It’s easy if we know (or can estimate) the true percentage of positive instances in the population.
[223002680080] |Specificity and Sensitivity vs. Precision and Recall
[223002680090] |Using the usual notation TP, FP, TN, and FN for true and false positive and negative counts,
[223002680100] |sensitivity = TP / (TP + FN), and
[223002680110] |specificity = TN / (TN + FP).
[223002680120] |Sensitivity and specificity are accuracies on positive cases (TP + FN) and negative cases (TN + FP), respectively.
[223002680130] |Prevalence for a sample may be calculated from the true and false positive and negative counts, by
[223002680140] |prevalence = (TP + FN) / (TP + FP + TN + FN).
[223002680150] |Recall is just sensitivity, but precision is the percentage of positive responses that are correct, namely
[223002680160] |precision = TP / (TP + FP).
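In code these definitions are just a few divisions; here's a minimal Java sketch with made-up counts:

public class ConfusionStats {
    public static void main(String[] args) {
        // True/false positive/negative counts from some evaluation (made-up numbers).
        double tp = 80, fn = 20, tn = 150, fp = 50;

        double sensitivity = tp / (tp + fn);                 // same thing as recall
        double specificity = tn / (tn + fp);
        double prevalence  = (tp + fn) / (tp + fn + tn + fp);
        double precision   = tp / (tp + fp);

        System.out.printf("sens = %.3f, spec = %.3f, prev = %.3f, prec = %.3f%n",
                          sensitivity, specificity, prevalence, precision);
    }
}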
[223002680170] |Prevalence Adjusted Contingency Matrix
[223002680180] |Suppose we know the test population prevalence prev, the probability of a random population member being a positive instance.
[223002680190] |Then we can adjust evaluation statistics over a training population with any ratio of positive to negative examples by recomputing expected true and false positive and negative values.
[223002680200] |The expected true and false positive and negative counts in test data with prevalence prev over n samples, given test sensitivity sens and specificity spec, are
[223002680210] |TP = n * prev * sens,
[223002680220] |FN = n * prev * (1 - sens),
[223002680230] |TN = n * (1 - prev) * spec, and
[223002680240] |FP = n * (1 - prev) * (1 - spec).
[223002680250] |Sensitivity and Specificity are Invariant to Prevalence
[223002680260] |Although the adjusted contingency matrix counts (true and false positive and negatives) vary based on prevalence, sensitivity and specificity derived from them do not.
[223002680270] |That’s because specificity is accuracy on negative cases and sensitivity the accuracy on positive cases.
[223002680280] |Plug-In Population Statistic Estimates
[223002680290] |Precision, on the other hand, is not invariant to prevalence.
[223002680300] |But we can compute its expected value in a population with known prevalence using the adjusted contingency matrix counts,
[223002680310] |precision = TP / (TP + FP) = (prev * sens) / (prev * sens + (1 - prev) * (1 - spec)).
[223002680320] |The counts all cancel; we can really work with true and false positive and negative rates instead of counts.
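Here's the adjustment spelled out as a small Java sketch with illustrative numbers: it recomputes the expected contingency counts at the population prevalence, reads precision off them, and checks the answer against the rate-only form in which the sample size cancels.

public class PrevalenceAdjustedPrecision {
    public static void main(String[] args) {
        // Sensitivity and specificity measured on the (unbalanced) training data.
        double sens = 0.80, spec = 0.75;
        // Known or estimated prevalence in the population of interest.
        double prev = 0.05;
        double n = 10000.0;  // arbitrary sample size; it cancels out of the rates

        // Expected contingency counts at the population prevalence.
        double tp = n * prev * sens;
        double fn = n * prev * (1.0 - sens);
        double tn = n * (1.0 - prev) * spec;
        double fp = n * (1.0 - prev) * (1.0 - spec);

        // Sensitivity and specificity are unchanged; precision is not.
        double precision = tp / (tp + fp);

        // Equivalent rate-only form: n drops out entirely.
        double precisionFromRates
            = (prev * sens) / (prev * sens + (1.0 - prev) * (1.0 - spec));

        System.out.printf("adjusted precision = %.3f (check: %.3f)%n",
                          precision, precisionFromRates);
    }
}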
[223002680330] |Related Work
[223002680340] |This is related to my earlier blog post on estimating population prevalence using a classifier with known bias and an unbiased population sample:
[223002680350] |LingPipe Blog: Predicting category prevalence by adjusting for biased classifiers.
[223002680360] |The approach described in the linked blog post may be used to estimate the population prevalence given only an arbitrarily biased labeled training set and an unlabeled and unbiased sample from the population.
[223003040010] |Big Bit-Packed Array Abstraction (for Java, C, etc.)
[223003040020] |One limitation of Java is that arrays can only be up to Integer.MAX_VALUE entries long.
[223003040030] |That means you can only get 2 billion entries into an array.
[223003040040] |That’s only 2GB if the arrays contain bytes, and we have way more memory than that these days.
[223003040050] |That’s not enough.
[223003040060] |A second limitation, shared by C, is that each entry is a round number of bits, usually a power-of-two number of bytes (8, 16, 32, or 64 bits).
[223003040070] |Often 32 is too little and 64 too much.
[223003040080] |I’ve been spending some time lately working on bioinformatics applications.
[223003040090] |The size and shape of the data’s a bit unusual for me, and very awkwardly sized for Java or even C-based arrays.
[223003040100] |Packing the Human Genome
[223003040110] |For instance, if you want to represent the bases in both strands of the human genome, it’s a sequence 6 billion bases long with each base being 1 of 4 values (A, C, G, or T).
[223003040120] |OK, no big, just pack four values into each byte and you can squeeze under the limit with a 1.5 billion long array of bytes.
[223003040130] |It’s even pretty easy to write the bit fiddling because everything aligns on byte boundaries.
[223003040140] |Indexing the Human Genome
[223003040150] |Now let’s suppose we want to index the genome so we can find occurrences of sequences of arbitrary length in time proportional to their length.
[223003040160] |Easy.
[223003040170] |Just implement a suffix array over the positions.
[223003040180] |(A suffix array is a packed form of a suffix tree, which is a trie structure that represents every suffix in a string.
[223003040190] |Technically, it’s an array consisting of all the positions, sorted based on the suffix defined from the position to the end of the string.)
[223003040200] |Because there are 6G positions, the positions themselves won't fit into an int, even using an unsigned representation.
[223003040210] |There are only 32 bits in an int, and thus a max of about 4 billion values.
[223003040220] |But a long is overkill.
[223003040230] |We need 33 bits, not 64.
[223003040240] |Using 8 bytes per entry, the suffix array would fill 48GB.
[223003040250] |We really only need a little over half that amount of space if we use 33 bits per pointer.
[223003040260] |But now things aren’t so easy.
[223003040270] |First, the bit fiddling’s more of a pain because 33-bit values can’t be both packed and byte aligned.
[223003040280] |Second, you can’t create an array that’ll hold 24GB of values, even if it’s a maximally sized array of longs (that’ll get you to 16GB).
[223003040290] |The Abstraction
[223003040300] |What I’ve defined in the package I’m working on with Mitzi is a BigArray
interface.
[223003040310] |The interface is super simple.
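Roughly, it boils down to something like this (my paraphrase of the idea; the exact names in the package may differ):

public interface BigArray {
    // Number of entries; may exceed Integer.MAX_VALUE.
    long length();

    // Value at the given index, returned in the low-order bits of the long.
    long get(long index);

    // Store a value at the given index.
    void set(long index, long value);
}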
[223003040320] |That’s the basics.
[223003040330] |We might want to make the set operations optional so we could have immutable big arrays.
[223003040340] |As is, there’s no way to define immutable arrays in Java or C (though Java provides immutable wrappers for lists in the collections framework and C probably has something similar in its standard template lib).
[223003040350] |I followed the idiom of Java's Number interface and define extra getters for bytes, shorts and integers for convenience, but don't show them here.
[223003040360] |On top of a big array, we could implement a variety of the new I/O buffer classes, using something like the random access file repositioning pattern, but it’d be clunky.
[223003040370] |It’s really a shame the buffers are also sized to integers.
[223003040380] |Implementations
[223003040390] |What’s groovy about interfaces is that we can implement them any way we want.
[223003040400] |The simplest implementation would just be backed by an array of long values.
[223003040410] |An implementation specifically for two-bit values might be used for the genome, with an efficient byte-aligned implementation.
[223003040420] |What I did was create an implementation that allows you to specify the array length and bits per value.
[223003040430] |I then pack the bits into an array of longs and fiddle them out for getting and setting.
[223003040440] |This is fine for the genome sequence, because it’ll fit, although it’s not as efficient as the byte-aligned implementation, so I’ll probably add extra special-case implementations.
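A stripped-down sketch of that packing (not the actual implementation, skipping bounds checks, and only valid while the packed bits fit into a single Java long array) looks like this; the only fiddly part is a value that straddles two longs:

public class PackedLongArray {
    private final long[] mData;
    private final int mBitsPerValue;
    private final long mValueMask;

    public PackedLongArray(long length, int bitsPerValue) {
        mBitsPerValue = bitsPerValue;
        mValueMask = (bitsPerValue == 64) ? -1L : (1L << bitsPerValue) - 1L;
        long totalBits = length * bitsPerValue;
        mData = new long[(int) ((totalBits + 63) / 64)];
    }

    public long get(long index) {
        long bitPos = index * mBitsPerValue;
        int word = (int) (bitPos >>> 6);    // which long holds the low-order bits
        int offset = (int) (bitPos & 63);   // bit offset within that long
        long value = mData[word] >>> offset;
        int spill = offset + mBitsPerValue - 64;
        if (spill > 0)                      // value straddles into the next long
            value |= mData[word + 1] << (mBitsPerValue - spill);
        return value & mValueMask;
    }

    public void set(long index, long value) {
        long bitPos = index * mBitsPerValue;
        int word = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        mData[word] = (mData[word] & ~(mValueMask << offset))
            | ((value & mValueMask) << offset);
        int spill = offset + mBitsPerValue - 64;
        if (spill > 0) {
            long highMask = (1L << spill) - 1L;
            mData[word + 1] = (mData[word + 1] & ~highMask)
                | ((value & mValueMask) >>> (mBitsPerValue - spill));
        }
    }
}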
[223003040450] |But it’s still not enough for the suffix array.
[223003040460] |I can’t fit all the bits into an array of longs.
[223003040470] |Although I haven’t implemented it yet, the obvious thing to do here is to build up the big array out of an array of arrays.
[223003040480] |I’d only need two arrays of longs for the human genome suffix array.
[223003040490] |This is pretty easy to implement hierarchically with an array of smaller big array implementations.
[223003040500] |First figure out which subarray, then do the same thing as before.
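The chunking arithmetic itself is simple. Here's a sketch using plain long values; in practice the bit-packing above would sit on top of something like this:

public class ChunkedBigArray {
    // Each chunk holds up to 2^30 longs (8GB); an index splits into chunk + offset.
    private static final int CHUNK_BITS = 30;
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final long[][] mChunks;

    public ChunkedBigArray(long length) {
        int numChunks = (int) ((length + CHUNK_SIZE - 1) >>> CHUNK_BITS);
        mChunks = new long[numChunks][];
        long remaining = length;
        for (int i = 0; i < numChunks; ++i) {
            mChunks[i] = new long[(int) Math.min(CHUNK_SIZE, remaining)];
            remaining -= mChunks[i].length;
        }
    }

    public long get(long index) {
        return mChunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    public void set(long index, long value) {
        mChunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }
}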
[223003040510] |Yes, I Know about Burrows-Wheeler Indexing
[223003040520] |If you’re just interested in the suffix array packing problem, the Burrows-Wheeler compression algorithm may be combined with indexing to get the entire genome suffix array into 4GB of memory in a way that’s still usable at run time.
[223003040530] |This was first explored by Ferragina and Manzini in their 2000 paper, Opportunistic data structures with applications.
[223003040540] |Very very cool stuff.
[223003040550] |And it works even better in the genomics setting than for text documents because of the small alphabet size (just four bases for DNA/RNA or 20 amino acids for proteins).
[223003040560] |Several projects have taken this approach (e.g. BWA and Bowtie).
[223003170010] |LingPipe Book, Draft 2
[223003170020] |I’m not quite releasing drafts as frequently as I’d like, though I do have the procedure automated now.
[223003170030] |Where to Get the Latest Draft
[223003170040] |You can download the book and the code tarball from:
[223003170050] |LingPipe Book Home Page
[223003170060] |What’s in It (So Far)
[223003170070] |Draft 2 is up to 350 or so pages, and most of what I’ve added since the first draft is LingPipe-related.
[223003170080] |There’s no more general background in the works, but at this pace, it’ll be another year and another 1000 pages before it’s done.
[223003170090] |We’ll almost certainly have to break it into two volumes if we want to print it.
[223003170100] |The current draft has chapters on getting started with Java and LingPipe, including an overview of tools we use.
[223003170110] |The second chapter’s on character encodings and how to use them in Java.
[223003170120] |The third chapter covers regexes, including all the quantifiers, and again focusing on how to get the most out of Unicode.
[223003170130] |The fourth chapter covers I/O, including files, readers, writers and streams, compressed archives like GZIP, ZIP and Tar, resources on the classpath, URIs, URLs, standard I/O, object I/O and serialization, and LingPipe’s I/O utilities.
[223003170140] |The fifth chapter gets more into LingPipe proper, covering the handler, parser, and corpora abstractions in package com.aliasi.corpus, as well as support for cross-validation.
[223003170150] |The sixth chapter is on classifier evaluations, including K-ary classifiers, reductions to binary classifiers, all the usual statistics, and how to use them in LingPipe.
[223003170160] |There’s also an extensive section on scored/ranked evaluations.
[223003170170] |I’ll probably rearrange and move tokenization before classifier evals, but it’s currently after.
[223003170180] |I cover just about all aspects of tokenization, including stemming/lemmatization, soundex, character normalization with ICU, and so on.
[223003170190] |There’s complete coverage of LingPipe’s tokenizers and factories, and complete tokenization abstraction.
[223003170200] |I also detail interoperability with Lucene’s Analyzer class, with examples in Arabic.
[223003170210] |Chapter 9, which will also move earlier, is on symbol tables.
[223003170220] |Chapter 11 is a fairly complete overview of latent Dirichlet allocation (LDA) and LingPipe’s implementations.
[223003170230] |There’s currently almost 100 pages of appendices, including basic math, stats, information theory, an overview of corpora, and an overview of the relevant data types in Java.
[223003170240] |Appendix E is about a 20-page intro to Lucene 3.0, going over all you need to know to get search up and running.
[223003170250] |What’s Next
[223003170260] |The next thing I’ll address will be chapter 7, on naive Bayes classifiers.
[223003170270] |Then I’ll turn to logistic regression classifiers, which will require an auxiliary chapter on feature extraction and another on vectors.
[223003170280] |I may also write chapters on KNN, perceptrons, and our language-model-based classifiers, though the latter depend on a chapter on character language models.
[223003170290] |After that, I’ll probably turn to tagging and chunking, though we’ll have to see.
[223003170300] |That’ll require sentence detection, as well as some more stats and interfaces.
[223003170310] |Comments Welcome
[223003170320] |So far, no one’s sent any comments on the first draft.
[223003170330] |I’d love to hear what you think, be it in the form of comments, corrections, suggestions, or whatever.
[224000670010] |Humor is Hard
[224000670020] |Several months ago I became temporarily interested in trying to automatically identify if entries in online discussions are informative, interesting, humorous, etc.
[224000670030] |(This was somewhat in the context of a summarization sort of system, but the problem seems more generic.)
[224000670040] |It turns out that in the comments section of slashdot, people manually tag comments into such categories.
[224000670050] |I spent a few weeks crawling slashdot (eventually getting my IP banned because this is apparently not allowed) and grabbed a few thousand stories and associated comments.
[224000670060] |I spent a few hours building a straightforward classifier based on the comment labels.
[224000670070] |It turns out one of the hardest sorts of comments to classify correctly are the funny ones.
[224000670080] |In general, I think identifying humor (or attempted humor) is a very hard problem.
[224000670090] |It seems to almost require a substantial amount of world knowledge and inference capabilities, since humorous comments are rarely signalled by straightforward lexical cues (though having three exclamation points or a smiley is a good indicator, these actually occur surprisingly rarely).
[224000670100] |To get a sense of why this is so hard, let's look at some examples.
[224000670110] |These are grabbed from slashdot two days ago (the 21st).
[224000670120] |In one article titled Motorola Unveils Phone Vending Machines (which talks about how you can buy cell phones from vending machines and that they are delivered by robotic arm rather than dropping a la sodas), we have the following comments marked humorous: "can i use the cell phones I want to purchases to purchases the cell phone I am purchasing?" and "I have a hard enough time trying to pull a big old stuffed animal out with those robotic arms much less a tiny tiny phone.
[224000670130] |At 50 bucks a pop rather than 50 cents, I'm going to waste a lot of money."
[224000670140] |In another article about Googling for ATM Master Passwords, we have the following comments.
[224000670150] |"[Subj: The default password is...]
[224000670160] |I thought it was up, up, down, down, left, right, left, right, B, A, Start ..." (for those not of my generation, this is the cheat code for the NES game Contra and several other Konami games).
[224000670170] |Additionally, in response to "Whoever makes these ATMs deserves all the bad publicity that they get." someone comments "Might it be Diebold, by any chance?"
[224000670180] |Finally, in commenting about the article Fish Work as Anti-terror Agents (which discusses how fish like the bluegill help detect poisonous substances in water supplies), we get comments like "In Australia, we have stingrays guarding us from pests." and "How do we know this isn't a red herring by some terroist group?" and finally "Does this mean we can carry water bottles on planes again -- if they have bluefish swimming in them?"
[224000670190] |You may take issue with the degree to which these comments are funny, but regardless of whether they actually are funny, they certainly were intended to be funny.
[224000670200] |What I find fascinating about all these examples is that they're essentially playing the game of drawing surprising comparisons between the article at hand and other common knowledge.
[224000670210] |For instance, the "robotic arms" comment is based on our shared experience of failing at fairs to get stuffed animals.
[224000670220] |The stingray comment is in regards to Steve Irwin's recent death, and the waterbottle joke is in reference to the new airline policies.
[224000670230] |While some (eg., the waterbottle joke) are perhaps easy to identify because they seem "off topic" somehow, other ones (like the Diebold comment or the stingray comment) really are on topic for the article, but just play against some alternative story that we're all expected to know.
[224000670240] |I'm not sure what my conclusion is, but if you're out there looking for a really hard text classification problem for which it at least seems that a lot of knowledge and inference is required, you may find humor detection fun.
[224000700010] |Resources for NLP
[224000700020] |Just a quick pointer that was referred to me.
[224000700030] |In addition to the well known Stanford StatNLP link list, Francois-Régis Chaumartin also maintains a list of NLP resources and tools at proxem.com.
[224000700040] |Any other lists people find especially useful (I suppose this would lead to a meta-list :P)?
[224000870010] |What Irks Me about E-mail Customer Service
[224000870020] |I hate dealing with customer service for large corporations, and it has little to do with outsourcing.
[224000870030] |I hate it because in the past few months, I have sent out maybe three or four emails to customer service peeps, at places like BofA, Chase, Comcast, Ebay, etc.
[224000870040] |Having worked in a form of customer service previously (I worked at the computer services help desk as an undergrad at CMU to earn some extra money), I completely understand what's going on.
[224000870050] |But "understand" does not imply "accept."
[224000870060] |I post this here not as a rant, but because I think there are some interesting NLP problems under the hood.
[224000870070] |So what's the problem?
[224000870080] |What has happened in all these cases is that I have some problem that I want to solve, can't find information about it in the FAQ or help pages on the web site, and so I email customer service with a question.
[224000870090] |As an example, I wanted to contest an Ebay charge but was two days past the 60 day cutoff (this was over Thanksgiving).
[224000870100] |So I asked customer service if, given the holiday, they could waive the cutoff.
[224000870110] |As a reply I get a form email, clearly copied directly out of the FAQ page, saying that there is a 60 day cutoff for filing contests to charges.
[224000870120] |Well no shit.
[224000870130] |So here's my experience from working at the help desk.
[224000870140] |When we got emails, we had the option of either replying by crafting an email, or replying by selecting a prewritten document from a database.
[224000870150] |This database was pretty huge -- many thousands of problems, neatly categorized and searchable.
[224000870160] |For emails for which the answer existed in the database, it took maybe 10 seconds to send the reply out.
[224000870170] |What seems to be happening nowadays is that this is being taken to the extreme.
[224000870180] |A prewritten form letter is always used, regardless of whether it is appropriate or not.
[224000870190] |If it is a person doing this bad routing, that's a waste of 10 seconds of person time (probably more for these large companies).
[224000870200] |If it's a machine, it's no big deal from their perspective, but it makes me immediately hate the company with my whole heart.
[224000870210] |But this seems to be a really interesting text categorization/routing problem.
[224000870220] |Basically, you have lots of normal classes (the prewritten letters) plus a "needs human attention" class.
[224000870230] |There's a natural precision/recall/$$$ trade-off, which is somewhat different and more complex than is standardly considered.
[224000870240] |But it's clearly an NLP/text categorization problem, and clearly one that should be worked on.
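To make the setup concrete, here's the routing rule I have in mind as a sketch (the scores can come from whatever text classifier you like); moving the threshold is exactly the precision/recall/$$$ knob:

    // Sketch: route an email to the best canned-response class, or to a human if
    // nothing scores above a threshold. Returning -1 means "needs human attention."
    static int route(double[] scores, double threshold) {
        int best = 0;
        for (int k = 1; k < scores.length; k++)
            if (scores[k] > scores[best]) best = k;
        // Raising the threshold trades fewer inappropriate form letters for more
        // (expensive) human attention.
        return scores[best] >= threshold ? best : -1;
    }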
[224000870250] |I know from my friends at AT&T that they have something similar for routing calls, but my understanding is that this is quite different. Their routing happens based on short bits of information into a small number of categories. The customer service routing problem would presumably be based on lots of information in a large number of categories.
[224000870260] |Another interesting aspect of the problem is how the system appears to the user.
[224000870270] |If the user thinks he is writing an email to a person, he will write a good email with full sentences and lots of information.
[224000870280] |If he's just "searching" then he'll only write a few keywords.
[224000920010] |To err is human, but what about researchers?
[224000920020] |Errors happen and sometimes get into papers.
[224000920030] |A recent example is the JAIR paper I had with Daniel on Domain Adaptation last year.
[224000920040] |I actually didn't catch the error myself -- it was caught by someone who was reimplementing the technique.
[224000920050] |And it's a totally not-insignificant error: essentially, the update equation for the generative parameters is completely botched.
[224000920060] |If you look through the derivation in the Appendix, it's clear where the error crept in.
[224000920070] |Thankfully, this sort of error is essentially a typo.
[224000920080] |That is, the error was introduced when I was typing up the paper, not when I was doing the research.
[224000920090] |Why this is important is that it means that the implementation reflects the correct updates: only the paper has the mistake.
[224000920100] |This means that the experimental results from the paper are valid, contingent on the fact that you rederive the updates yourself, or just ask me what they should be.
[224000920110] |I'm writing this post because it's somewhat unclear what to do when such a thing arises.
[224000920120] |One temptation is to do nothing.
[224000920130] |I have to admit that I was completely embarrassed when this was pointed out to me.
[224000920140] |There was a part of me that wanted to ignore it.
[224000920150] |It seems that this is the wrong approach for a variety of reasons, not the least of which is to make sure that correct information does get out.
[224000920160] |The question, to some degree, is exactly how to do this.
[224000920170] |I have a blog, which means I can write an entry like this.
[224000920180] |I can also put an errata on my web page that points out the errors (I'm writing this up as we "speak").
[224000920190] |Given that this is a pub in an online journal, I believe I am able to submit updates, or at least additional appendices, which means that the "official version" can probably be remedied.
[224000920200] |But what about conference pubs?
[224000920210] |If this had appeared in ACL and I didn't have a blog, the situation would be something different (ironically, an earlier version with the correct updates had been rejected from ACL because the derivations were omitted for space and two reviewers couldn't verify them).
[224000920220] |Also, what if someone hadn't pointed it out to me?
[224000920230] |I certainly wouldn't have noticed -- that paper was behind me.
[224000920240] |But then anyone who noticed the errors might dismiss the results on the grounds that they could assume that the implementation was also incorrect (it's not inconceivable that an erroneous implementation can still get good results).
[224000920250] |This would also not be good because the idea in the paper (any paper with such errors) might actually be interesting.
[224000920260] |False things are published all the time.
[224000920270] |The STOC/FOCS community (i.e., theory community) has a handful of examples...for them, errors are easy to identify because you can prove the opposite of any theorem.
[224000920280] |I recall hearing of a sequence of several papers that incrementally used results from a previous one, but the first was in error, putting the rest in error (I also recall hearing that many of the subsequent results could be salvaged, despite the ancestral mistake).
[224000920290] |I don't know if there's a good solution, given our publication mechanisms (essentially, publish-once-then-appear-in-the-anthology).
[224000920300] |But I'm pretty sure mine is not the first paper with such errors.
[224000920310] |At least I hope not :).
[224001220010] |Multiclass learning as multitask learning
[224001220020] |It's bugged me for a little while that when learning multiclass classifiers, the prior on weights is uniform (even when they're learned directly and not through some one-versus-rest or all-versus-all reduction).
[224001220030] |Why does this bug me?
[224001220040] |Consider our favorite task: shallow parsing (aka syntactic chunking).
[224001220050] |We want to be able to identify base phrases such as NPs in running text.
[224001220060] |The standard way to do this is to do an encoding of phrase labels into word labels and apply some sequence labeling algorithm.
[224001220070] |The standard encoding is BIO.
[224001220080] |A sentence like "The man ate a sandwich ." would appear as "B-NP I-NP B-VP B-NP I-NP O" with "B-X" indicating the beginning of a phrase of type X, and "I-X" indicating being "inside" such a phrase ("O", assigned to "." is "outside" of a chunk).
[224001220090] |If we train, eg., a CRF to recognize this, then (typically) it considers B-NP to be completely independent of I-NP; just as independent as it is of "O".
[224001220100] |Clearly this is a priori a wrong assumption.
[224001220110] |One way I have gotten around this problem is to actually explicitly parameterize my models with per-class features.
[224001220120] |That is, rather than having a feature like "current word is 'the'" and making K copies of this feature (one per output label); I would have explicitly conjoined features such as "current word is 'the' and label is 'B-NP'".
[224001220130] |This enables me to have features like "word=the and label is B-?" or "word=the and label is ?-NP", which would get shared across different labels.
[224001220140] |(Shameless plug: megam can do this effortlessly.)
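Concretely, the feature generation is something like the following sketch (not megam's actual feature syntax, just the idea of conjoining an input feature with the whole label and with each piece of a BIO-style label):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: conjoin one input feature with the full label and with its B/I part
    // and its phrase-type part, so weights can be shared across related labels.
    static List<String> conjoin(String feature, String label) {
        List<String> out = new ArrayList<String>();
        out.add(feature + "&label=" + label);                                 // word=the&label=B-NP
        int dash = label.indexOf('-');
        if (dash >= 0) {
            out.add(feature + "&label=" + label.substring(0, dash) + "-?");   // word=the&label=B-?
            out.add(feature + "&label=?-" + label.substring(dash + 1));       // word=the&label=?-NP
        }
        return out;
    }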
[224001220150] |But I would rather not have to do this.
[224001220160] |One thing I could do is to make 2^K versions of each feature (K is still the number of labels), where each encodes some subset of active features.
[224001220170] |But for large-K problems, this could get a bit unwieldy.
[224001220180] |Pairwise features would be tolerable, but then you couldn't get the "B-?" sort of features I want.
[224001220190] |There's also no obvious kernel solution here, because these are functions of the output label, not the input.
[224001220200] |It seems like the right place for this to happen is in the prior (or the regularizer, if you're anti-probabilistic models).
[224001220210] |Let's say we have F features and K classes.
[224001220220] |In a linear model, we'll learn F*K weights (okay, really F*(K-1) for identifiability, but it's easier to think in terms of F*K).
[224001220230] |Let's say that a priori we know that classes j and k are related.
[224001220240] |Then we want the prior to favor w(:,j) to be similar to w(:,k).
[224001220250] |There are a variety of ways to accomplish this: I think that something along the lines of a recent NIPS paper on multitask feature learning is a reasonable way to approach this.
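The crudest instantiation (not the multitask feature learning formulation, just a pairwise penalty between classes declared related a priori) would be a regularizer something like this sketch:

    // Sketch: regularizer over weights w[k][f] (K classes by F features); "related"
    // lists pairs of class indices we believe a priori should have similar weights.
    static double penalty(double[][] w, int[][] related, double lambda, double mu) {
        double r = 0.0;
        for (double[] wk : w)                       // the usual squared-norm term
            for (double v : wk) r += lambda * v * v;
        for (int[] pair : related) {                // pull related classes toward each other
            double[] wj = w[pair[0]], wk = w[pair[1]];
            for (int f = 0; f < wj.length; f++) {
                double d = wj[f] - wk[f];
                r += mu * d * d;
            }
        }
        return r;
    }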
[224001220260] |What this approach lacks in general is the notion that if classes j and k "share" some features (i.e., they have similar weights), then they're more likely to "share" other features.
[224001220270] |You could do something like task clustering to achieve this, but that seems unideal since I'd really like to see a single unified algorithm.
[224001220280] |Unfortunately, all attempts (on my part) to come up with a convex regularizer that shares these properties have failed.
[224001220290] |I actually now think that it is probably impossible.
[224001220300] |The problem is essentially that there is bound to be some threshold controlling whether the model thinks classes j and k are similar: below this threshold the regularizer will prefer w(:,j) and w(:,k) independently close to zero; above this threshold, it will prefer w(:,j) and w(:,k) to be close to each other (and also close to zero).
[224001220310] |This is essentially the root of the non-convexity.
[224001570010] |Teaching machine translation
[224001570020] |Last Fall (2007), I taught an Applications of NLP course to a 50/50 mix of grads and senior undergrads.
[224001570030] |It was modeled partially after a course that I took from Kevin Knight while a grad student.
[224001570040] |It was essentially 1/3 on finite state methods for things like NER and tagging, then 1/3 on machine translation, then 1/3 on question answering and summarization.
[224001570050] |Overall, the course went over fairly well.
[224001570060] |I had a significant problem, however, teaching machine translation.
[224001570070] |Here's the problem.
[224001570080] |Students knew all about FSTs because we used them to do all the named-entity stuff in the first third of class.
[224001570090] |This enabled us to talk about things like IBM model 1 and the HMM model.
[224001570100] |(There's a technical difficulty here, namely dealing with incomplete data, so we talk about EM a little bit.)
[224001570110] |We discuss, but they don't actually make use of, higher order MT models.
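For readers who haven't seen it, the Model 1 EM updates mentioned above really are compact; here's a toy sketch with a hard-coded two-sentence corpus and a NULL source word (just the idea, not the code from the course):

    import java.util.HashMap;
    import java.util.Map;

    // Toy sketch of EM for IBM Model 1 word-translation probabilities t(f|e).
    public class Model1Sketch {
        public static void main(String[] args) {
            String[][] eSents = {{"NULL", "the", "house"}, {"NULL", "the", "book"}};
            String[][] fSents = {{"la", "maison"}, {"le", "livre"}};

            // initialize t(f|e) uniformly over co-occurring word pairs
            Map<String, Map<String, Double>> t = new HashMap<>();
            for (int s = 0; s < eSents.length; s++)
                for (String e : eSents[s])
                    for (String f : fSents[s])
                        t.computeIfAbsent(e, k -> new HashMap<>()).put(f, 1.0);
            for (Map<String, Double> row : t.values()) {
                double z = row.size();
                row.replaceAll((f, v) -> v / z);
            }

            for (int iter = 0; iter < 10; iter++) {
                // E-step: expected counts, treating the word alignment as hidden
                Map<String, Map<String, Double>> count = new HashMap<>();
                Map<String, Double> total = new HashMap<>();
                for (int s = 0; s < eSents.length; s++)
                    for (String f : fSents[s]) {
                        double z = 0.0;
                        for (String e : eSents[s]) z += t.get(e).get(f);
                        for (String e : eSents[s]) {
                            double p = t.get(e).get(f) / z;   // posterior that f aligns to e
                            count.computeIfAbsent(e, k -> new HashMap<>()).merge(f, p, Double::sum);
                            total.merge(e, p, Double::sum);
                        }
                    }
                // M-step: renormalize expected counts into probabilities
                for (Map.Entry<String, Map<String, Double>> ce : count.entrySet())
                    for (Map.Entry<String, Double> cf : ce.getValue().entrySet())
                        t.get(ce.getKey()).put(cf.getKey(), cf.getValue() / total.get(ce.getKey()));
            }
            System.out.println("t(maison|house) = " + t.get("house").get("maison"));
        }
    }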
[224001570120] |Now, we all know that there's a lot more to MT than model 4 (even limiting oneself to statistical translation techniques).
[224001570130] |Namely, there are phrase-based models and syntactic models.
[224001570140] |We had a very brief (one lecture) overview of syntactic models at the end.
[224001570150] |My beef is with phrase-based models.
[224001570160] |The problem is that we've gone through all this prettiness to develop these word-based models, and then I have to teach them grow-diag-final, phrase extraction and phrase scoring.
[224001570170] |I almost felt embarrassed doing so.
[224001570180] |The problem is that these things are obviously so heuristic that throwing them on top of this really pretty word-for-word translation model just kills me.
[224001570190] |And it's not just me: the students were visibly upset by the lack of real modeling behind these techniques.
[224001570200] |One option would be just not to teach this stuff.
[224001570210] |I don't really think that it sheds much light on the translation process.
[224001570220] |The reason I don't like this solution is because it's nice to be able to say that they will have a handle on a not-too-difficult to understand/implement method for doing real-world MT.
[224001570230] |Instead, I could just spend that time on syntactic models.
[224001570240] |The situation there is better (you can talk about the hierarchy of tree transducers, etc.), but not perfect (eg., all the work that goes in to rule extraction is not too dissimilar from all the work that goes into phrase extraction).
[224001570250] |I suppose that this is just the defacto problem with a relatively immature field: there hasn't been enough time for us to really tease apart what's actually going on in these models and try to come up with some coherent story.
[224001570260] |I'd love a story that doesn't involve first doing word alignment and is, in some sense, integrated.
[224002050010] |ACL and EMNLP retrospective, many days late
[224002050020] |Well, ACL and EMNLP are long gone.
[224002050030] |And sadly I missed one day of each due either to travel or illness, so most of my comments are limited to Mon/Tue/Fri.
[224002050040] |C'est la vie.
[224002050050] |At any rate, here are the papers I saw or read that I really liked.
[224002050060] |P09-1010 [bib]: S.R.K. Branavan; Harr Chen; Luke Zettlemoyer; Regina Barzilay. Reinforcement Learning for Mapping Instructions to Actions
[224002050070] |and
[224002050080] |P09-1011 [bib]: Percy Liang; Michael Jordan; Dan Klein. Learning Semantic Correspondences with Less Supervision
[224002050090] |these papers both address what might roughly be called the grounding problem, or at least trying to learn something about semantics by looking at data.
[224002050100] |I really really like this direction of research, and both of these papers were really interesting.
[224002050110] |Since I really liked both, and since I think the directions are great, I'll take this opportunity to say what I felt was a bit lacking in each.
[224002050120] |In the Branavan paper, the particular choice of reward was both clever and a bit of a kludge.
[224002050130] |I can easily imagine that it wouldn't generalize to other domains: thank goodness those Microsoft UI designers happened to call the Start Button something like UI_STARTBUTTON.
[224002050140] |In the Liang paper, I worry that it relies too heavily on things like lexical match and other very domain specific properties.
[224002050150] |They also should have cited Fleischman and Roy, which Branavan et al did, but which many people in this area seem to miss out on -- in fact, I feel like the Liang paper is in many ways a cleaner and more sophisticated version of the Fleischman paper.
[224002050160] |P09-1054 [bib]: Yoshimasa Tsuruoka; Jun’ichi Tsujii; Sophia Ananiadou. Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
[224002050170] |This paper is kind of an extension of the truncated gradient approach to learning l1-regularized models that John, Lihong and Tong had last year at NIPS.
[224002050180] |The paper did a great job of motivating why L1 penalties are hard to optimize.
[224002050190] |The first observation is that L1 regularizers optimized by gradient steps like to "step over zero."
[224002050200] |This is essentially the observation in truncated gradient and frankly kind of an obvious one (I always thought this is how everyone optimized these models, though of course John, Lihong and Tong actually proved something about it).
[224002050210] |The second observation, which goes into this current paper, is that you often end up with a lot of non-zeros simply because you haven't run enough gradient steps since the last increase.
[224002050220] |They have a clever way of accumulating these penalties lazily and applying them at the end.
[224002050230] |It seems to do very well, is easy to implement, etc.
[224002050240] |But they can't (or haven't) proved anything about it.
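My reading of the cumulative penalty trick, as a sketch (see the paper for the actual algorithm and the learning rate details):

    // Sketch of the cumulative L1 penalty idea: only features active in the current
    // example are touched, but each one gets charged for all the penalty it has
    // "missed" since it was last updated, clipped so it never crosses zero.
    public class LazyL1 {
        double[] w;      // weights
        double[] q;      // total penalty actually applied to each weight so far
        double u = 0.0;  // total penalty each weight should have received so far
        LazyL1(int dim) { w = new double[dim]; q = new double[dim]; }

        // one SGD step on a sparse example: active feature indices and values,
        // the loss gradient with respect to the score, learning rate, L1 strength
        void update(int[] idx, double[] val, double gradLoss, double eta, double lambda) {
            u += eta * lambda;
            for (int j = 0; j < idx.length; j++) {
                int i = idx[j];
                w[i] -= eta * gradLoss * val[j];   // ordinary gradient step
                double z = w[i];
                if (z > 0) w[i] = Math.max(0.0, z - (u + q[i]));      // clip toward zero
                else if (z < 0) w[i] = Math.min(0.0, z + (u - q[i]));
                q[i] += w[i] - z;                  // record how much penalty was applied
            }
        }
    }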
[224002050250] |P09-1057 [bib]: Sujith Ravi; Kevin Knight. Minimized Models for Unsupervised Part-of-Speech Tagging
[224002050260] |I didn't actually see this paper (I think I was chairing a session at the time), but I know about it from talking to Sujith.
[224002050270] |Anyone who considers themselves a Bayesian in the sense of "let me put a prior on that and it will solve all your ills" should read this paper.
[224002050280] |Basically they show that sparse priors don't give you things that are sparse enough, and that by doing some ILP stuff to minimize dictionary size, you can get tiny POS tagger models that do very well.
[224002050290] |D09-1006: [bib] Omar F. Zaidan; Chris Callison-Burch. Feasibility of Human-in-the-loop Minimum Error Rate Training
[224002050300] |Chris told me about this stuff back in March when I visited JHU and I have to say I was totally intrigued.
[224002050310] |Adam already discussed this paper in an earlier post, so I won't go into more details, but it's definitely a fun paper.
[224002050320] |D09-1011: [bib] Markus Dreyer; Jason Eisner. Graphical Models over Multiple Strings
[224002050330] |This paper is just fun from a technological perspective.
[224002050340] |The idea is to have graphical models, but where nodes are distributions over strings represented as finite state automata.
[224002050350] |You do message passing, where your messages are now automata and you get to do all your favorite operations (or at least all of Jason's favorite operations) like intersection, composition, etc. to compute beliefs.
[224002050360] |Very cool results.
[224002050370] |D09-1024: [bib] Ulf Hermjakob. Improved Word Alignment with Statistics and Linguistic Heuristics
[224002050380] |Like the Haghighi coreference paper below, here we see how to do word alignment without fancy math!
[224002050390] |D09-1120: [bib] Aria Haghighi; Dan Klein. Simple Coreference Resolution with Rich Syntactic and Semantic Features
[224002050400] |How to do coreference without math!
[224002050410] |I didn't know you could still get papers accepted if they didn't have equations in them!
[224002050420] |In general, here's a trend I've seen in both ACL and EMNLP this year.
[224002050430] |It's the "I find a new data source and write a paper about it" trend.
[224002050440] |I don't think this trend is either good or bad: it simply is.
[224002050450] |A lot of these data sources are essentially Web 2.0 sources, though some are not.
[224002050460] |Some are Mechanical Turk'd sources.
[224002050470] |Some are the Penn Discourse Treebank (about which there were a ridiculous number of papers: it's totally unclear to me why everyone all of a sudden thinks discourse is cool just because there's a new data set -- what was wrong with the RST treebank that it turned everyone off from discourse for ten years?!
[224002050480] |Okay, that's being judgmental and I don't totally feel that way.
[224002050490] |But I partially feel that way.)
[224002230010] |How I teach machine learning
[224002230020] |I've had discussions about this with tons of people, and it seems like my approach is fairly odd.
[224002230030] |So I thought I'd blog about it because I've put a lot of thought into it over the past four offerings of the machine learning course here at Utah.
[224002230040] |At a high level, if there is one thing I want them to remember after the semester is over it's the idea of generalization and how it relates to function complexity.
[224002230050] |That's it.
[224002230060] |Now, more operationally, I'd like them to learn SVMs (and kernels) and EM for generative models.
[224002230070] |In my opinion, the whole tenor of the class is set by how it starts.
[224002230080] |Here's how I start.
[224002230090] |Decision trees.
[224002230100] |No entropy.
[224002230110] |No mutual information.
[224002230120] |Just decision trees based on classification accuracy.
[224002230130] |Why?
[224002230140] |Because the point isn't to teach them decision trees.
[224002230150] |The point is to get as quickly as possible to the point where we can talk about things like generalization and function complexity.
[224002230160] |Why decision trees?
[224002230170] |Because EVERYONE gets them.
[224002230180] |They're so intuitive.
[224002230190] |And analogies to 20 questions abound.
[224002230200] |We also talk about the whole notion of data being drawn from a distribution and what it means to predict well in the future.
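By "decision trees based on classification accuracy" I just mean choosing splits the obvious way; a sketch of picking a single binary split:

    // Sketch: pick the binary feature whose split maximizes training accuracy,
    // where each side of the split predicts its majority label.
    static int bestSplit(boolean[][] x, boolean[] y) {
        int best = -1, bestCorrect = -1;
        for (int f = 0; f < x[0].length; f++) {
            int posT = 0, negT = 0, posF = 0, negF = 0;
            for (int i = 0; i < y.length; i++) {
                if (x[i][f]) { if (y[i]) posT++; else negT++; }
                else         { if (y[i]) posF++; else negF++; }
            }
            int correct = Math.max(posT, negT) + Math.max(posF, negF);
            if (correct > bestCorrect) { bestCorrect = correct; best = f; }
        }
        return best;
    }

Recursing on each side (and stopping early) gives the trees, and the same accuracy story sets up the generalization discussion.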
[224002230210] |Nearest neighbor classifiers.
[224002230220] |No radial basis functions, no locally weighted methods, etc.
[224002230230] |Why?
[224002230240] |Because I want to introduce the idea of thinking of data as points in high dimensional space.
[224002230250] |This is a big step for a lot of people, and one that takes some getting used to.
[224002230260] |We then do k-nearest neighbor and relate it to generalization, overfitting, etc.
[224002230270] |The punch line of this section is the idea of a decision boundary and the complexity of decision boundaries.
[224002230280] |Linear algebra and calculus review.
[224002230290] |At this point, they're ready to see why these things matter.
[224002230300] |We've already hinted at learning as some sort of optimization (via decision trees) and data in high dimensions, hence calculus and linear algebra.
[224002230310] |Note: no real probability here.
[224002230320] |Linear classifiers as methods for directly optimizing a decision boundary.
[224002230330] |We start with 0-1 loss and then move to perceptron.
[224002230340] |Students love perceptron because it's so procedural.
[224002230350] |The rest follows mostly as almost every other machine learning course out there.
[224002230360] |But IMO these first four days are crucial.
[224002230370] |I've tried (in the past) starting with linear regression or linear classification and it's just a disaster.
[224002230380] |You spend too much time talking about unimportant stuff.
[224002230390] |The intro with error-based decision trees moving to kNN is amazingly useful.
[224002230400] |The sad thing is that there are basically no books that follow any order even remotely like this.
[224002230410] |Except...drum roll... it's actually not far from what Mitchell's book does.
[224002230420] |Except he does kNN much later.
[224002230430] |It's really depressing how bad most machine learning books are from a pedagogical perspective... you'd think that in 12 years someone would have written something that works better.
[224002230440] |On top of that, the most recent time I taught ML, I structured everything around recommender systems.
[224002230450] |You can actually make it all work, and it's a lot of fun.
[224002230460] |We actually did recommender systems for classes here at the U (I had about 90-odd students from AI the previous semester fill out ratings on classes they'd taken in the past).
[224002230470] |The data was a bit sparse, but I think it was a lot of fun.
[224002230480] |The other thing I change most recently that I'm very happy with is that I have a full project on feature engineering.
[224002230490] |(It ties in to the course recommender system idea.)
[224002230500] |Why?
[224002230510] |Because most people who take ML, if they ever use it at all, will need to do this.
[224002230520] |It's maybe one of the most important things that they'll have to learn.
[224002230530] |We should try to teach it.
[224002230540] |Again, something that no one ever talks about in books.
[224002230550] |Anyway, that's my set of tricks.
[224002230560] |If you have some that you particularly like, feel free to share!
[225000230010] |More on POS
[225000230020] |Upon reflection, I realize I may have misinterpreted Hal's point about POS tags.
[225000230030] |What he seems to be referring to is the lack of explicitly available POS data, not the internal mental events of humans in the act of processing language.
[225000230040] |Nonetheless, it remains an interesting direction to follow-up on: what, if any, POS tagging do humans do naturally?
[225000670010] |The Perils of Orthography or Words Without Vowels
[225000670020] |I just signed up for the new online linguistics magazine Cambridge Extra as advertised on The Linguist List.
[225000670030] |The magazine is apparently going to run little Q&A competitions every issue – “In each issue you will have the chance to win different prizes from Cambridge University Press.”
[225000670040] |The inaugural question is this
[225000670050] |What is the longest word in the English Language without a vowel in its spelling?
[225000670060] |Now, the key here is “in its spelling”.
[225000670070] |When I first read the question, I missed that part and thought real hard about this.
[225000670080] |Hmmmm, I thought, is there a word in English that has no vowel when pronounced?
[225000670090] |It’s true that there are expressions that we utter that are voweless, like “Hmmmm” above, or answering a question like my mom with “MmmHmm”.
[225000670100] |But it’s real easy to get tricked by orthography when analyzing a “word” like nth as in ‘the nth iteration”.
[225000670110] |While spelled with no vowel, it is pronounced with an initial vowel, something like /ε/, an open, mid, front vowel.
[225000670120] |I Googled the question and discovered quite a range of attempts at answering the question, almost all of them consistently mistook orthography for phonology.
[225000670130] |This is like frikkin crack to a linguistics blogger!
[225000670140] |I found this juicy, but representative answer on Yahoo! Answers, posted by “Mrs. C”:
[225000670150] |Is there any word without a Vowel?
[225000670160] |Best Answer - Chosen by Voters: sky, rhythm
[225000670170] |PS: To the people screaming 'Y is a vowel' ... er, no it's not!
[225000670180] |A E I O U are the only 5 vowels.
[225000670190] |Y SOUNDS like a vowel in certain words, but it doesn't 'become' a vowel just because it sounds like one!
[225000670200] |Even my 8 year old students can tell you this!
[225000670210] |Although this answer is consistent with the intent of the spelling constraint, I love this part: “Y SOUNDS like a vowel in certain words, but it doesn't 'become' a vowel just because it sounds like one!”
[225000670220] |Heehee.
[225000670230] |With an exclamation point for emphasis too!!!
[225000670240] |Has there ever been a more convoluted blurring of the difference between letters and phonemes?
[225000670250] |To put it simply, Yes!
[225000670260] |If something sounds like a vowel, it does indeed become one!
[225000670270] |Regardless of what orthographic representation it may take.
[225000670280] |Although phonology and writing systems were never my interests in linguistics, I am quite certain that orthographies are never more than convenient hacks engineered to approximate the phonology and phonotactics of a language.
[225000670290] |They are always imperfect.
[225000750010] |War on Colbert!!!
[225000750020] |I hereby declare war on Stephen Colbert!
[225000750030] |During a The Word segment called College Credit, that rat bastard Stephen had the cheekiness, the impudence, the audacity, the temerity, the mendacity, to lump linguistics in with classics and comparative literature in the lowest tier of his new pricing system for college majors.
[225000750040] |His new three tiered system includes the following:
[225000750050] |Marketable, Non-marketable, and You know this is killing your parents.
[225000750060] |Here is my transcription of his profane comment:
[225000750070] |...and the lowest tier which includes classics, comparative literature, linguistics.
[225000750080] |Basically, anything taught by someone who says he lives to teach. (starts at the 1.08 minute mark)
[225000750090] |Now, I’m on board with Stephen’s plan to charge for individual facts, this makes perfect sense (that’ll cost you $1).
[225000750100] |But I’ll be damned if I’ll let some plastic-haired, thin-lipped, southern-fried ninny sully the name of my chosen profession.
[225000750110] |This bozo couldn't even get on the ballot in South Carolina years after porn stars, muscle men, and Gary Coleman broke the ballot-box glass ceiling in the California gubernatorial election.
[225000750120] |So, beware!
[225000750130] |Stephen Colbert (hey, that rhymes), I was a wrestler for 13 years.
[225000750140] |I know from pain.
[225000850010] |Causative Productivity
[225000850020] |Andrew Sullivan used the phrase “has decided me” and I thought it was odd to use decided as a causative (it sounded like a child’s error) but I found numerous examples by Googling “decided me”, including some by highly respected authors:
[225000850030] |Some examples
[225000850040] |Andrew Sullivan: I was undecided up to now, but forty seconds of YouTube has decided me:
[225000850050] |Booker T. Washington (1903): The course of events has decided me.
[225000850060] |I have determined to go South to take one of the numerous positions awaiting my acceptance.
[225000850070] |John Austin, Lectures on Jurisprudence, Or, the Philosophy of Positive Law: 'It will be a great and difficult labour; but if you do not do it, it will never be done.'
[225000850080] |This decided me.
[225000850090] |It’s been a while since I studied the syntax of causation.
[225000850100] |There must be a name for this phenomenon, right?
[225000850110] |I mean, other than causative productivity which I invented as a title for this post.
[225000890010] |My Sweeney Todd Review
[225000890020] |Well, I saw Sweeney Todd yesterday afternoon as promised.
[225000890030] |Sigh, I was yet again underwhelmed by Hollywood.
[225000890040] |I have enough affection for the play to have basically enjoyed the movie and it’s certainly worth any Broadway fan’s time.
[225000890050] |But there is nothing particularly special about this movie.
[225000890060] |It is a competent adaptation of the play.
[225000890070] |But one shouldn’t strive to be competent, should one?
[225000890080] |Director Tim Burton has a reputation for visual splendor; but his skill is almost strictly static.
[225000890090] |He can create beautiful looking things, but he has no particular gift for interesting interaction.
[225000890100] |There were few moments of interesting choreography between character movements or scene juxtaposition.
[225000890110] |It also lacks interesting camera angles.
[225000890120] |We spend virtually the entire movie at eye level and at a medium distance from the characters.
[225000890130] |This is classic mediocre filmmaking.
[225000890140] |The first minute of this YouTube clip of the play shows the sheer genius of Broadway artists.
[225000890150] |They have created a center stage round-about that acts as Mrs. Lovett’s pie shop and Sweeney Todd’s murderous barber shop, as well as other settings.
[225000890160] |It is constructed to allow multiple scenes to unfold simultaneously, one right on top of the other, each playing off the others and it’s visually the stuff of genius.
[225000890170] |I could watch the video of the play a dozen times and still want to see it again.
[225000890180] |Sadly, I’m done with the movie, forever.
[225000940010] |The Destruction of Turkey by Chomsky
[225000940020] |It appears something strange and vaguely troubling has been going on in Turkey these last few days.
[225000940030] |After reviewing my Sitemeter data, it seems I have had no fewer than eight hits in two days from Turks Googling "innateness hypothesis".
[225000940040] |Is there a conference going on, or are Turks just wild about dated linguistic assumptions?
[225000940050] |My thoughts on the topic can be found here.
[225000940060] |I'm not exactly sure how accurate Sitemeter's location information is, but I see 5 different Turkish locations, some with multiple hits.
[225000940070] |Mustafa, Hatay; Trk, Burdur; Izmir; Bilgi, Van; Mersin, Icel
[225000940080] |BTW: You REALLY gotta be a linguist to get my post title, don't you?
[225000940090] |For the interested observer, one could do worse than read this (PDF).
[225001400010] |Brain Sex!
[225001400020] |[Picture Courtesy of 3DScience.com]
[225001400030] |Admit it, my post title got ya, didn't it?
[225001400040] |Anyhoo, Zwicky mentioned this series of articles at Slate.com: The Sex Difference Evangelists (posted July 1) by Amanda Schaffer and Emily Bazelon.
[225001400050] |The six part series takes a critical look at the claims about neurological sex-differences made primarily by two people:
[225001400060] |1. Louann Brizendine, a psychiatrist at U.C.-San Francisco.
[225001400070] |2. Canadian psychologist Susan Pinker.
[225001400080] |I read the first two articles and I'm reasonably impressed with the reporting (something I rarely say about Slate).
[225001400090] |The authors are generally even-handed (though, it's not at all clear to me why the authors point out Pinker's Canadian nationality as if it meaningfully modifies her profession).
[225001400100] |Anyhoo, this quote stood out for me: "All told, what's striking about the evidence on language is not so much a profound gap between the sexes, but the large gaps in our understanding of the brain."
[225001400110] |All psycholinguists repeat after me now: And miles to go before I sleep, And miles to go before I sleep.
[225002100010] |K is for Kanye
[225002100020] |(image from Lemur King's Folly)
[225002100030] |Wired magazine has a cute article on geek neologisms, 11 Ways Geeks Measure the World (HT kottke).
[225002100040] |Personal favs:
[225002100050] |Warhols (fame duration): 1 Warhol equals 15 minutes of fame. So if you’ve been famous for three years, that’s just over 105 kilowarhols.
[225002100060] |I’m going to go out on a limb and say that there’s a critical point — varying from celebrity to celebrity — where that person has outstayed their welcome, and uh … becomes synonymous with a feminine hygiene product (and the bag it came in).
[225002100070] |In keeping with nuclear physics, I’m happy for this to remain as k=1 (where ‘k’ is for ‘Kanye’).
[225002100080] |Frinks (geekiness): I’m sure I’ll take a lot of flak for this, but take it as a suggestion, at least — a standard unit of geekiness called the frink, and that it be measured on the ‘Hoyvin-Glayvin’ scale.
[225002100090] |Simpsons fans won’t need to ask why.
[225002100100] |To figure out where you fall on the Hoyvin-Glayvin Scale, I’ve compiled a handy reference:
[225002100110] |0 Frinks – thought the JockDad April Fool’s Prank was a good direction for this blog.
[225002100120] |10 Frinks – believes Greedo fired first.
[225002100130] |20 Frinks – you’re the family friend who “knows” computers.
[225002100140] |30 Frinks – on Twitter, but only following Ashton and Oprah.
[225002100150] |40 Frinks – you don’t hate sci-fi, but don’t have an opinion on things like Kirk vs. Picard either.
[225002100160] |50 Frinks – You’re the family friend who actually does know computers.
[225002100170] |You probably watch the Battlestar Galactica reruns, too.
[225002100180] |60 Frinks – Solidly geeky.
[225002100190] |Almost stereotypically so.
[225002100200] |70 Frinks – Geeky enough to know geeks don’t like fitting into stereotypes.
[225002100210] |80 Frinks – You’ve probably attended several cons, contemplated which dice to bring to the game, and own at least one Starfleet/Colonial Fleet/Galactic Empire uniform.
[225002100220] |90 Frinks – It’s been a long time since you told a joke that didn’t reference C#, Linux or the Dune saga.
[225002100230] |100 Frinks – Aren’t you Dr. Sheldon Cooper?
[225002500010] |Talking Brains
[225002500020] |This is truly awesome!
[225002500030] |(HT Research Blogging). gfish at World of Weird Things blogs about a voice synthesizer that literally turns thought into speech!
[225002500040] |This 21st Century is going to be amazing.
[225002500050] |Money Quote:
[225002500060] |By matching the frequencies being generated in the cortex, the software tries to predict the phrases that the patients wants to say and via a synthesizer, says them out loud.
[225002500070] |The process can take as little as 50 milliseconds, about the same amount of time it takes an average person to do exactly the same thing with his or her mouth.
[225002540010] |Twitter Project: Medieval and Renaissance Minds
[225002540020] |In an attempt to be connected to the 21st century world, I have begun a brief Twitter project wherein I will be tweeting HERE (#awlobf) one sentence for every page of the bestseller A World Lit Only by Fire: The Medieval Mind and the Renaissance: Portrait of an Age by historian William Manchester from Thursday, January 7 at 9am through Monday, January 11 at 7pm.
[225002540030] |Each Tweet is intended to be a pithy gloss of the take-away point of that page of the book.
[225002540040] |The tweets will be in order and each will begin with the page number it is associated with (1-296).
[225002540050] |The tweets will be published periodically between 9am and 7pm each day from Thursday January 7th though Monday January 11 (the date of the book club meeting).
[225002540060] |I will use SocialOomph to automate the tweets (HT HATProject).
[225002540070] |For more, see the tweets here and a brief explanation here.
[225002770010] |Neuropuns
[225002770020] |I'm not normally much of a pun guy, but this one got me giggling.
[225002770030] |Speaking about the much discussed Belgian patient in a vegetative state who recently showed surprising brain activity, Dr. Allan H. Ropper, a neurologist at Brigham and Women’s Hospital in Boston, similarly warned against equating neural activity and identity.
[225002770040] |“Physicians and society are not ready for ‘I have brain activation, therefore I am,’” Dr. Ropper wrote.
[225002770050] |“That would seriously put Descartes before the horse” (original here).
[225002770060] |UPDATE (02/14/2010): hehe, still makes me giggle 10 days later...
[225002770070] |(HT FARK)
[225003050010] |Oldest Example of Written English Discovered
[225003050020] |No, not quite.
[225003050030] |The title of this post comes from a Digg link which linked to this article.
[225003050040] |The writing is dated at around 500 years old, which couldn't possibly be the "oldest example of written English", could it?
[225003050050] |The Huntington Library has the Ellesmere Chaucer, a manuscript c. 1405, so that's got it beat by a 100 years already and I haven't even bothered to look for Old English manuscripts.
[225003050060] |The claim in the title is quite different from the claim in the original article which begins with this:
[225003050070] |What is believed to be the first ever example of English written in a British church has been discovered.
[225003050080] |Problem is, no-one can read it.
[225003050090] |This just means there's a lot of Latin written in English churches.
[225003050100] |The cool part is that they're crowdsourcing the interpretation.
[225003050110] |If anyone thinks they can identify any further letters from the enhanced photographs, please contact us via the Salisbury Cathedral website. The basic questions of what exactly the words are and why the text was written on the cathedral wall remain unanswered.
[225003050120] |It would be wonderful for us to solve the mystery (link added).
[225003050130] |Go on, give it a shot.
[225003050140] |Looks like the original lyrics to Judas Priest's Better by You Better Than Me to me.
[225003180010] |Teaching Phonetics
[225003180020] |(screen grab from U. Iowa) A collaboration of several departments at The University of Iowa (but not the Linguistics department, wtf!) has put up a really nice interactive/animated tool for demonstrating the articulatory anatomy of speech sounds in English, German and Spanish.
[225003180030] |(HT srabivens on Twitter via #linguistics)
[225003230010] |Tweeting Kluges
[225003230020] |The Twitter hashtag #linguistics is ablaze with links to this Scientific American article about Gary Marcus' claim that language is far from "optimal."
[225003230030] |It's a pretty short and simple article, not much meat, but it has a lot of links (maybe too many?).
[225003230040] |Money quote:
[225003230050] |Visual abilities have been developing in animal predecessors for hundreds of millions of years.
[225003230060] |Language, on the other hand, has had only a few hundred thousand years to eke out a place in our primate brain, he noted.
[225003230070] |What our species has come up with is a "kluge," Marcus said, a term he borrows from engineering that means a solution that is "clumsy and inelegant, but it gets the job done."
[225004370010] |the death of writing!
[225004370020] |This is pure wild speculation: I can imagine writing becoming obsolete within maybe 200 years.
[225004370030] |My logic is thus:
[225004370040] |Writing as a technology has only been with us for a small time (6000 years or so, compared to the 100,000 years or so of homo sapien evolution (ohhh, let's not get into how to date homo sapiens)), and it has only been utilized by a large number of people for an even smaller time (maybe 200 years or so, before that most people were illiterate (probably still are)).
[225004370050] |Hence, writing is an unnecessary and cumbersome luxury we can happily live without should something better come along.
[225004370060] |Now imagine that the computational linguists finally get off their lazy arses and give me a computer I can frikkin talk to and can talk back to me*.
[225004370070] |If I can talk to my computer like a human being, poof goes the keyboard, right?
[225004370080] |I give a 90% chance of having this as a viable alternative within 50 years**.
[225004370090] |Once I'm unburdened of the clunky inefficiency of a keyboard, and once I can preserve and share ideas without a writing system (think bloggingheads), why oh why would I bother with the ridiculous tedium of representing my words in an altogether unnecessary form?
[225004370100] |But, you say, writing provides the best way to preserve and share ideas*** because it lets us organize and review and get all meta.
[225004370110] |No.
[225004370120] |It does none of those things.
[225004370130] |We do that.
[225004370140] |We're just stuck with this third party representation in which to do those things.
[225004370150] |But how would academics write papers and get tenure?
[225004370160] |Good question.
[225004370170] |First, tenure will die long before writing systems (I give it maybe 75 years).
[225004370180] |Second, imagine that instead of writing a paper, I can create a virtual me, encode it with a set of arguments about a topic then instead of reading my paper, you engage in a Socratic give-and-take with this virtual me on the topic.
[225004370190] |Call it iSocrates****
[225004370200] |Thus endeth the prophesy.
[225004370210] |*I want everything Kirk had.
[225004370220] |I got the cell phone.
[225004370230] |The Terrapins are working on a transporter, now give me a frikkin computer I can talk to!
[225004370240] |**I just made those numbers up.
[225004370250] |People seem to like fake numbers, so okay, there you are.
[225004370260] |***You didn't really say that.
[225004370270] |I made that up.
[225004370280] |This is what's known as a straw man argument.
[225004370290] |It makes this kind of bullshitting easier.
[225004370300] |****Dear gawd that's a horrible name.
[225004370310] |Let's hope this whole iXXX trend goes the way of eXXX, eXtreme, XXXtech, XXXsoft, etc.
[225004550010] |purple pain and a gene called 'straightjacket'
[225004550020] |Dr. Kevin Mitchell, a neuroscientist at Smurfit Institute of Genetics, University of Dublin, posted at his excellent blog Wiring the Brain about a weird, interesting study* that points to a possible genetic explanation of synaesthesia** (e.g., hearing a word and experiencing the color red).
[225004550030] |The authors were studying pain mechanisms in fruit flies (turns out the mechanisms are similar to us mammals, whuddathunk?).
[225004550040] |Once they identified a particular gene they dubbed straightjacket*** which is "involved in modulating neurotransmission," they systematically deleted it in test flies and discovered that the test subjects**** no longer processed the pain stimuli, even though the pain stimuli were following the pathway.
[225004550050] |In Mitchell's words:
[225004550060] |Somehow, deletion of CACNA2D3 alters connectivity within the thalamus or from thalamus to cortex in a way that precludes transmission of the signal to the pain matrix areas.
[225004550070] |This is where the story really gets interesting.
[225004550080] |While they did not observe responses of the pain matrix areas in response to painful stimuli, they did observe something very unexpected – responses of the visual and auditory areas of the cortex!
[225004550090] |What’s more, they observed similar responses to tactile stimuli administered to the whiskers.
[225004550100] |Whatever is going on clearly affects more than just the pain circuitry (emphasis added).
[225004550110] |So, if I understand this, they turned off the ability to recognize pain, but when they administered painful stimuli (heat), the test subjects had visual, auditory, and tactile experiences.
[225004550120] |Imagine putting a flame to your hand and seeing purple.
[225004550130] |Pretty frikkin awesome.
[225004550140] |Dr. Mitchell's post does more justice to this complex study, I just thought it was awesome.
[225004550150] |*Geez!
[225004550160] |Take a look at the author list of the publication.
[225004550170] |Do you have a place for 12th author on YOUR CV?
[225004550180] |**FYI: Synaesthesia is NOT the same thing as sound symbolism, necessarily.
[225004550190] |True synaesthesia is a rare phenomenon that appears to have biophysical roots.
[225004550200] |Sound symbolism is mostly hippie-dippy bullshit exploited by marketing professionals to sell stuff.
[225004550210] |***I have no clue why they called it this, but it's a hell of a lot more awesome than CACNA2D3.
[225004550220] |****There were multiple studies referenced, some involving fruit flies, some involving mice, and it wasn't clear to me which evidence came from which studies, so I have chosen to use the cover term "test subjects."
[225004550230] |Neely GG, Hess A, Costigan M, Keene AC, Goulas S, Langeslag M, Griffin RS, Belfer I, Dai F, Smith SB, Diatchenko L, Gupta V, Xia CP, Amann S, Kreitz S, Heindl-Erdmann C, Wolz S, Ly CV, Arora S, Sarangi R, Dan D, Novatchkova M, Rosenzweig M, Gibson DG, Truong D, Schramek D, Zoranovic T, Cronin SJ, Angjeli B, Brune K, Dietzl G, Maixner W, Meixner A, Thomas W, Pospisilik JA, Alenius M, Kress M, Subramaniam S, Garrity PA, Bellen HJ, Woolf CJ, &Penninger JM (2010).
[225004550240] |A Genome-wide Drosophila Screen for Heat Nociception Identifies α2δ3 as an Evolutionarily Conserved Pain Gene.
[225004550250] |Cell, 143 (4), 628-38 PMID: 21074052
[225004580010] |the baffling linguistics of job postings
[225004580020] |While Googling around for other things, I caught this odd fish contained within a job posting for an Account Manager:
[225004580030] |DISCLAIMER: ...
[225004580040] |Linguistics used herein may use First Person Singular and First Person Plural grammatical person construction for and with the meaning of Third Person Singular and Third Person Plural references.
[225004580050] |We reserves the right to amend and change responsibilities to meet business and organizational needs as necessary (emphasis added).
[225004580060] |If I understand this correctly, the bold faced passage says that the authors are allowing themselves to use constructions like "we walks..." and "we talks..."
[225004580070] |But, if you look at the uses of "we" within the text of the actual job posting, nowhere do they actually do this, EXCEPT in the disclaimer itself.
[225004580080] |I find this baffling.
[225004580090] |What is the purpose of this?
[225004580100] |Simply to allow them to write "We reserves..."
[225004580110] |I Googled the sentence and found it popping up in all kinds of job postings and the same thing is true.
[225004580120] |The only time a posting invokes its self-appointed right to this grammatical modification, is within the disclaimer.
[225004580130] |It appears to be boiler-plate job-speak of some kind.
[225004580140] |I'm remarkably freaked out by this.
[225004810010] |history of writing tech
[225004810020] |American Scientist has a review of a new book on the history of writing technologies with a focus on how computers fit in. A BETTER PENCIL: Readers, Writers, and the Digital Revolution by Dennis Baron.
[225004810030] |Money quote:
[225004810040] |Will this shift in the technology of writing and reading be a positive development in human culture?
[225004810050] |Will it promote literacy, or impair it?
[225004810060] |Baron takes a moderate position on these questions.
[225004810070] |On the one hand, he acknowledges that the computer offers remarkable opportunities for self-expression and communication (at least for those of us in the wealthier parts of the world).
[225004810080] |Suddenly, we can all be published authors, and we all have access to the writings—or if nothing else the Twitterings—of millions of other authors.
[225004810090] |On the other hand, much of what these new channels of communication bring us is mere noise and distraction, and we may lose touch with more serious kinds of reading and writing.
[225004810100] |(Another recent book—The Shallows, by Nicholas Carr—argues this point strenuously.)
[225004810110] |Baron remarks: “That position incorrectly assumes that when we’re not online we throw ourselves into high-culture mode, reading Tolstoi spelled with an i and writing sestinas and villanelles instead of shopping lists.”
[225004900010] |soccer vs. football
[225004900020] |Too late for The World Cup, but thanks to Stan Carey at Sentence First, I only just now discovered that we Yanks are not the only English speakers who use soccer to refer to, ya know, that game where you can't touch the ball with your hands (tennis? no... the one that Ronaldo plays).
[225004900030] |In fact, there are about 74 million OTHER English speakers in this world who use soccer to refer to Ronaldo's game too.
[225004900040] |Add the USA's 308 million, and it is almost certainly the case that more English speakers use soccer than football.
[225004900050] |With that, I say thppppt to the English...
[225004900060] |(image from Wikipedia)
[225004900070] |UPDATE [3:38PM eastern]: reader vp points out the following passage from the same Wikipedia article the image came from: several official publications of the English Football Association have the word "soccer" in the title.
[225004900080] |Simon Kuper and Stefan Szymanski write that soccer was the most common name for the game in Britain from the 1890s until the 1970s, and suggest that the decline of the word soccer in the UK, and the incorrect perception that it is an Americanism, were linked to awareness of the North American Soccer League in the 1970s.
[225004910010] |the evolution of journalistic quotes
[225004910020] |They're getting shorter:
[225004910030] |According to a new article in the academic journal Journalism Studies by David M. Ryfe and Markus Kemmelmeier, both professors at the University of Nevada, newspaper quotations evolved in much the same way as TV sound bites.
[225004910040] |By 1916, they found, the average political quotation in a newspaper story had fallen to about half the length of the average quotation in 1892.
[225004910050] |(HT Daily Dish)
[225005240010] |why we need good tools...
[225005240020] |Because we're not all interested in being R experts.
[225005240030] |By far, the single most frustrating part of my own graduate linguistics experience was the fact that in order to study the kinds of linguistic phenomena I wanted to, I had to spend most of my time learning tools that I didn't actually care about, like Tgrep2, Perl, Python*, R, etc.
[225005240040] |As a linguist, I don't really give a damn about any of those things.
[225005240050] |They were all obstacles in my way.
[225005240060] |The more time I spent learning tools, the less interested in linguistics I became.
[225005240070] |I respect the hell out of engineers who build great tools that are valuable to linguists, but if those tools are not user friendly, I might as well scream into the darkness.
[225005240080] |Which is why I am impressed with The Stanford Visualization Group's recent Visualization Tool for Cleaning Up Data:
[225005240090] |Another thing I often hear is that a large fraction of the time spent by analysts -- some say the majority of time -- involves data preparation and cleaning: transforming formats, rearranging nesting structures, removing outliers, and so on.
[225005240100] |(If you think this is easy, you've never had a stack of ad hoc Excel spreadsheets to load into a stat package or database!).
[225005240110] |Yes, more help please.
[225005240120] |HT LingFan1
[225005240130] |*Mad props to the NLTK!