Do men and women use words in different ways? A group of Israeli artificial-intelligence experts think so. They crunched a bunch of English texts by men and women, both fiction and nonfiction, and looked for interesting patterns. The results? In this paper, they argue that it’s possible to figure out the gender of an author merely by paying attention to a few everyday words — and their guesses are accurate 80 per cent of the time, or higher.
For example, they discovered that in fiction, men are more likely than women to use the words a, the, and as; meanwhile, women are more likely than men to use the words she, for, with, and not. In nonfiction, men are more likely than women to use that and one. Women, however, are more likely than men to use for, with, not, and, and in.
Here’s another weird data point: Men use the pronoun he with roughly the same frequency as women, but women use the total set of all other pronouns (he, she, they, etc.) more often than men.
Interestingly, there are also some differences between the way everyone uses language in fiction and nonfiction. All authors — both male and female — used pronouns and negation more in fiction than nonfiction.
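If you want to poke at this idea yourself, here's a minimal sketch in Python. To be clear, this is not the researchers' actual method (the post doesn't describe their algorithm in detail). It's a toy that counts how often a handful of marker words show up per 1,000 words. The two word lists are lifted from the fiction examples above, and the names `rates_per_thousand` and `leaning_score` are made up for illustration.

```python
import re
from collections import Counter

# Illustrative marker lists, taken from the fiction examples in this post.
# A real classifier would learn its features and weights from training data.
MALE_LEANING = ["a", "the", "as"]
FEMALE_LEANING = ["she", "for", "with", "not"]


def rates_per_thousand(text, words):
    """Return how often each word in `words` occurs per 1,000 tokens of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return {w: 1000 * counts[w] / total for w in words}


def leaning_score(text):
    """Toy score: positive means more 'male-leaning' markers, negative more 'female-leaning'."""
    male = sum(rates_per_thousand(text, MALE_LEANING).values())
    female = sum(rates_per_thousand(text, FEMALE_LEANING).values())
    return male - female


sample = "She went with him for a walk, not a run."
print(rates_per_thousand(sample, MALE_LEANING))
print(leaning_score(sample))
```

A single score like this would be far too crude to hit 80 per cent accuracy on real books. The point is only that the raw ingredients are dead simple: word counts, divided by text length.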
Did this technique make any mistakes? Yep. The professors crunched 920 English-language texts, and misclassified 12 texts, which were:
Fiction
Possession, by A. S. Byatt
The Remains of the Day, by Kazuo Ishiguro
Now We Are Thirty-Somethings, by Charles Jennings
Now Then Davos, by Martin Wiley, David Harmer, and Ian McMillan
The Siege of Krishnapur, by J. G. Farrell
A Landing on the Sun, by Michael Frayn
Nonfiction
Thank you for having me, by Maureen Lipman
A Crowd is not Company, by Robert Kee
T.S. Eliot: A Friendship, by Frederick Tomlin
Walking on Water, by Andy Martin
Unpublished Letters and manuscripts, by an Unlisted Female Author
Falling for Love: How Teenaged Mothers Talk, by Sue Sharp
As the scientists note, of the six misclassified nonfiction documents, all are biographical or diary-like. That’s intriguing, insofar as one might expect that people would write most “like” their gender when they’re writing about personal experience. Meanwhile, of the six misclassified fiction documents, all are by men, except for Possession. What’s up with that? Are these men writing “like” women? (Heh. Maybe this is a subterranean reason why Jonathan Franzen freaked out so badly when Oprah picked The Corrections for her book club.) On the other hand, decades of gender theory have ably pointed out that gender is an insanely slippery thing: Men can so often act “like” women, and vice versa, that the whole idea of drawing hard lines around what’s male and what’s female is sort of bonkers. It’d be interesting to replicate this study with texts solely by gay men, lesbians, or transgender people (the folks who often mess directly with society’s concepts of male and female roles) to see if it generates any different results.
The scientists don’t offer any theories as to why these differences exist. But for me, what’s most interesting is that the words they’re focussing on, the ones that create the “fingerprint” identifying the document, are very common, throwaway words like at, she, but, or that. You wouldn’t expect such simple words to be so important in determining meaning.
Actually, almost all artificial-intelligence research into language backs this up. A decade ago, Thomas Landauer pioneered Latent Semantic Analysis — a way of automatically figuring out the “content” of a piece of writing by looking at a fingerprint of its words. Again, you’d expect that the most “important” words in a document, in terms of identifying what it’s about, would be the ones most individually freighted with meaning. For example, if you looked at this blog entry, you might think the words artificial, intelligence, gender, fiction, nonfiction, men and women would be significant. But what Landauer found is that you could strip out those big-meaning words, leaving all the other stuff behind — the buts, ands, ors, whiches, etc. — and you could still figure out what the document was about. Spooky, eh?
It’s also like the epiphany of Donald Foster — the professor who analyzes word occurrence to determine the author of texts that have been left anonymous by history. He’s the one, you may recall, who figured out that Joe Klein wrote the book Primary Colors. As he noted in his book on the subject, the words that are most revealing of one’s identity are not the high-meaning words — because those are the ones we pay attention to, and sculpt like clay. The ones that reveal our identity are the low-meaning ones — the ifs, the ands, the buts — because we use them unconsciously. They aren’t as subject to our will, and thus are a lot harder to obfuscate.
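Here's a rough Python sketch of that "fingerprint" idea. It is not Foster's actual technique, which this post doesn't spell out. It's just one simple way to compare texts by their low-meaning words: build a frequency profile over a small list of function words, then pick the known author whose profile has the highest cosine similarity to the mystery text. The word list and the function names (`profile`, `cosine_similarity`, `closest_author`) are assumptions for illustration.

```python
import math
import re
from collections import Counter

# Illustrative list of low-meaning words; real stylometry uses far longer lists.
FUNCTION_WORDS = ["if", "and", "but", "or", "the", "that", "which"]


def profile(text):
    """Relative frequency of each function word in `text`, in a fixed order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(p, q):
    """Cosine of the angle between two profiles; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def closest_author(unknown_text, known_samples):
    """Return the key in `known_samples` (author -> sample text) most similar to `unknown_text`."""
    target = profile(unknown_text)
    return max(
        known_samples,
        key=lambda name: cosine_similarity(target, profile(known_samples[name])),
    )


samples = {
    "Author A": "if and but if and but the cat",
    "Author B": "which that which that or the dog",
}
print(closest_author("but if and the bird sang and if", samples))
```

With samples this tiny the result means nothing, of course. You'd want thousands of words per author before the unconscious habits start to show.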
Maybe I should just stop writing blog entries in full sentences. I’ll just use pronouns and conjunctions.
“I in and the but the they or and.”
(Thanks to Rachel for pointing out this study to me!)
I'm Clive Thompson, the author of Smarter Than You Think: How Technology is Changing Our Minds for the Better (Penguin Press). You can order the book now at Amazon, Barnes and Noble, Powells, Indiebound, or through your local bookstore! I'm also a contributing writer for the New York Times Magazine and a columnist for Wired magazine. Email is here or ping me via the antiquated form of AOL IM (pomeranian99).