Tossing Sabots into the Automated Essay Grading Machine

“There’s a time when the operation of the machine becomes so odious – makes you so sick at heart – that you can’t take part. You can’t even passively take part. And you’ve got to put your bodies upon the gears and upon the wheels, upon the levers, upon all the apparatus, and you’ve got to make it stop. And you’ve got to indicate to the people who run it, to the people who own it that unless you’re free, the machine will be prevented from working at all.” – Mario Savio


Robot essay graders – they grade just the same as human ones. That’s the conclusion of a study conducted by Mark Shermis, dean of the University of Akron’s College of Education, and Kaggle data scientist Ben Hamner. The researchers examined some 22,000 essays written by junior high and high school students as part of their states’ standardized testing process, comparing the grades given by human graders with those given by automated grading software. They found that “overall, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items with equal performance for both source-based and traditional writing genre” (PDF).

“The demonstration showed conclusively that automated essay scoring systems are fast, accurate, and cost effective,” says Tom Vander Ark, managing partner at the investment firm Learn Capital, in a press release touting the study’s results.

The study coincides with a competition hosted on Kaggle and sponsored by the Hewlett Foundation, in which data scientists are challenged to develop the best algorithm for automatically grading student essays. “Better tests support better learning,” according to the foundation’s Education Program Director Barbara Chow. “This demonstration of rapid and accurate automated essay scoring will encourage states to include more writing in their state assessments. And, the more we can use essays to assess what students have learned, the greater the likelihood they’ll master important academic content, critical thinking, and effective communication.”


I taught College Composition at the University of Oregon for two years.

Two terms of College Composition are required of all undergraduates at all the public universities in the state: WR121 and WR122. It is possible to “test out” of the former with a high enough score on the SAT or AP exam, but that hardly means that students who go straight into WR122 are skilled or sophisticated writers.

At the UO, composition classes are taught primarily by Graduate Teaching Fellows, although the university does employ some writing instructors (full- and part-time) as well. WR121 is often the first teaching gig that English and Comparative Literature graduate students are given by the university. To be eligible for this teaching assignment, one must complete a yearlong series of classes that offer training in teaching composition.

But after a year or two of teaching composition, as graduate students complete their coursework and pass their qualifying exams, they move on to teach literature classes.

“Move on” and “move up.” In the English Department hierarchy, composition sits at the very bottom, below even the Film Studies and Folklore classes. Writing underlies all that happens in a literature department, of course (in all disciplines, really), but the labor associated with teaching composition has never been highly valued.

And oh, the labor. Grading essays – whether in comp or in other writing-heavy classes – is incredibly time-consuming. That makes the move to robot graders one of efficiency, time- and cost-savings. Grading essays is also incredibly grueling intellectually, I’d argue, as giving meaningful feedback on student writing (and by extension on student thinking) is hard work.

Robot graders don’t give feedback; they simply score. Do they score papers better than graduate students, writing instructors, and those individuals hired by testing companies to pore over the written portions of standardized exams do? “The reality is, humans are not very good at doing this,” Steve Graham, a Vanderbilt University professor who has researched essay grading techniques, tells Reuters. “It’s inevitable,” he said, that robo-graders will soon take over.


Via Justin Reich, “Grading Automated Essay Scoring Programs, Part I”:

AES programs are bundles of hundreds of algorithms. For instance, one algorithm might be as follows:

1.  Take all the words in the essays and stem them, so that “shooter,” “shooting,” and “shoot” are all the same word

2.  Measure the frequency of co-location of all two-word pairs in the essay; in other words, generate a giant list of every stemmed word that appears adjacent to another stemmed word and the frequency of those pairings

3.  For each new essay, compare the frequency of stemmed word pairings to the frequencies found in the training set.

Does this sound ridiculous? Weird? Doesn’t matter. It doesn’t matter that the computer reads completely differently from a human, because it’s not trying to “read” the essay. It’s trying to compare the essay to other essays which have been scored by humans and faithfully replicate the scores. If a weird comparison improves the prediction of human grades, then it’s helping the machine faithfully replicate the scores that humans would have provided, even if it does so entirely differently from how humans would generate the scores.
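
To make Reich's three steps a little more concrete, here is a minimal Python sketch of that one hypothetical algorithm. It is a toy, not how any actual AES product works. The function names, the tiny suffix list standing in for a real stemmer, and the use of cosine similarity as the "comparison" in step 3 are all my own assumptions; Reich's description doesn't specify any of them.

```python
import math
import re
from collections import Counter

# Assumption: a toy suffix list. A real system would use a proper stemmer.
SUFFIXES = ("ing", "er", "ed", "s")


def stem(word):
    """Step 1: crudely strip one common suffix so related words collapse."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bigram_counts(text):
    """Step 2: count every pair of adjacent stemmed words in an essay."""
    stems = [stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    return Counter(zip(stems, stems[1:]))


def cosine(a, b):
    """Similarity between two pair-frequency tables (0.0 = nothing shared)."""
    dot = sum(count * b[pair] for pair, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_by_score(essay, training_set):
    """Step 3: compare a new essay's pairings to those of human-scored essays.

    training_set is a list of (essay_text, human_score) tuples. Returns a
    dict mapping each human score to how closely the new essay's word
    pairings resemble the essays that earned that score.
    """
    profiles = {}
    for text, score in training_set:
        profiles.setdefault(score, Counter()).update(bigram_counts(text))
    new_essay = bigram_counts(essay)
    return {score: cosine(new_essay, profile) for score, profile in profiles.items()}
```

Notice that nothing in here knows what a thesis is, or whether a claim is true. The output is just one number per score level, which a larger system would presumably feed, along with hundreds of other such numbers, into a model that predicts the human grade.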


Via Les Perelman (director of the Writing Across the Curriculum program at the Massachusetts Institute of Technology and long-time critic of robot essay-graders), in a comment on Inside Higher Ed’s coverage of the Shermis and Hamner study:

Here is an absurd essay that received a score of six. The feedback for the six follows.

In today’s society, college is ambiguous. We need it to live, but we also need it to love. Moreover, without college most of the world’s learning would be egregious. College, however, has myriad costs. One of the most important issues facing the world is how to reduce college costs. Some have argued that college costs are due to the luxuries students now expect. Others have argued that the costs are a result of athletics. In reality, high college costs are the result of excessive pay for teaching assistants. …

In conclusion, as Oscar Wilde said, “I can resist everything except temptation.” Luxury dorms are not the problem. The problem is greedy teaching assistants. It gives me an organizational scheme that looks like an essay, it limits my focus to one topic and three subtopics so I don’t wander about thinking irrelevant thoughts, and it will be useful for whatever writing I do in any subject. I don’t know why some teachers seem to dislike it so much. They must have a different idea about education than I do.

Your essay:
* States or implies clearly your thesis or position on this topic
* Organizes and develops ideas logically with clear, insightful connections among them
* Uses particularly well-chosen evidence (reasons, examples, or details) to support your ideas
* Conveys your meaning in an interesting, imaginative, or particularly effective style
* Demonstrates sentence structure and variety that enhances your emphasis
* Displays facility and clarity in choice of language
* Uses grammar and mechanics correctly (virtually free of errors) and demonstrates understanding of correct usage


Machine learning. Machine reading. Machine assessment. Machine writing.

While the Hewlett-sponsored Kaggle competition seeks to improve the algorithms of robot essay graders, elsewhere we’re seeing the rise of robot essay writers. A Chicago-based startup called Narrative Science “transforms data into stories and insights.” Its clients include Forbes magazine, which uses the technology to create “computer-generated company earnings previews.” That is, data is taken from the stock market, Tweets, headlines, and industry reports in order to write a story profiling a particular company. The Big Ten Network uses Narrative Science for its sports stories, which are similarly formulated from scores and stats.

Writing in The Atlantic, Joe Fassler asks if computers can replace writers:

As a journalist and fiction writer, it of course struck me to think about the relevance of all of this to what I do. I arrived at the Chicago office prepared to have my own biases confirmed – that the human mind is a sacred mystery, that our relationship to words is unique and profound, that no automaton could ever replicate the writerly experience. But speaking with Hammond, I realized how much of the writing process – what I tend to think of as unpredictable, even baffling – can be quantified and modeled. When I write a short story, I’m doing exactly what the authoring platform does – using a wealth of data (my life experiences) to make inferences about the world, providing those inferences with an angle (or theme), then creating a suitable structure (based on possible outcomes I’ve internalized from reading and observing and taking creative writing classes). It’s possible to give a machine a literary cadence, too: choose strong verbs, specific nouns, stay away from adverbs, and so on. I’m sure some expert grammarian could map out all the many different ways to make a sentence pleasing (certainly, the classical orators did, with their chiasmus and epanalepsis, anaphora and antistrophe).

Hammond tells me it’s theoretically possible for the platform to author short stories, even a statistically “perfect” piece that uses all our critical knowledge about language and literary narrative. Such attempts have been made before—Russian musicians once wrote the “best” and “worst” songs ever, based on survey data. But I suspect that a computer’s understanding of art will never quite match our own, no matter how specific our guidelines become. Malcolm Gladwell writes about this effect in Blink, noting how, for reasons that sometimes confound us, supposedly market-perfect media creations routinely tank.

Besides, the best journalism is always about people in the end—remarkable individuals and their ideas and ideals, our ongoing, ever-changing human experience. In this, Frankel agrees.

“If a story can be written by a machine from data, it’s going to be. It’s really just a matter of time at this point,” he said. “But there are so many stories to be told that are not data-driven. That’s what journalists should focus on, right?”

And we will, we’ll have to, because even our simplest moments are awash in data that machines will never quantify—the way it feels to take a breath, a step, the way the sun cuts through the trees. How, then, could any machine begin to understand the ways we love and hunger and hurt? The net contributions of science and art, history and philosophy, can’t parse the full complexity of a human instant, let alone a life. For as long as this is true, we’ll still have a role in writing.


While some writing classes teach the “modes of writing” – description, narration, persuasion, exposition – at the University of Oregon, teaching College Composition means teaching argumentative writing. It means teaching logic and rhetoric. And thanks to UO Professor John Gage and his required composition text The Shape of Reason, it means teaching “the enthymeme.”

And all that means breaking students of their habits of writing what we not-so-fondly call “the five paragraph special” – the way they were often taught writing in middle and high school that goes something like this:

1.  Introductory paragraph with catchy opening sentence, a thesis statement (hopefully) tucked somewhere in there, and often a quick sketch of what the essay will cover.

2.  Three paragraphs – often disconnected logically – that make three distinct points that ostensibly “support” that thesis.

3.  Concluding paragraph that restates the introduction with slightly different wording.

As I’m sure you can imagine, telling a bunch of college students that they need to unlearn much of what they think they know about essay-writing – particularly if those students have been the ones earning “high marks” with that comfortably banal five-paragraph formula – doesn’t go over well. It means that often, composition class is full of skeptical – if not hostile – students who must tackle the reasoning behind, not just the mechanics of, their essay writing. They must learn again to write. They must learn – often for the very first time – to really think.

College-level writing requires critical reading. It requires critical thinking. It requires tackling a “question at issue” for the community you’re a part of – whether that’s the class or the university at large. It requires thinking about assumptions – of the writer and the reader. It requires taking the reader carefully through the logic of the argument, building connections throughout the essay so that a writer’s claims are supported by reasoning throughout.  It means revising.  It means rethinking.  It means rewriting.  It is a process, not a product.


Grading essays takes time. It isn’t just the reading – although that alone takes time even before you think of how you’ll assess it. It isn’t just marking an A or a B or a C at the top or noting “Excellent” or “Good” or “Needs work” at the bottom – although, yes, that’s time-consuming too.

Grading essays requires responding to the content. (Did the student demonstrate an understanding of the material?) Grading essays requires responding to the form. (Did the student make a compelling and well-supported argument?) Grading essays requires responding to the mechanics. (Did the student make grammatical errors that got in the way of expressing the ideas?) Done thoroughly, these responses are peppered throughout the essay – comments in the margins; awkward phrases circled; question marks next to whole paragraphs, sentences, or individual words. (“How are these ideas connected?” “I don’t follow you here.” “Clarify.” “Say more.” “Cite your sources.”)

Done thoroughly, and done right. This feedback – on drafts as well as on final versions – is crucial. It’s grueling. It’s time-consuming. It’s frustrating. But the hope, as you do so, is that you’re offering individualized feedback that will help a student learn, help the student be a better writer and a better thinker.

Robots can give a grade. That grade, according to the study by Shermis and Hamner, approximates that of a human. But a human can offer so much more than just a grade. And without feedback – substantive, smart, caring, thoughtful, difficult feedback – what’s the point?


What do we want from students’ essay writing? Why do we assign essays? What are we assessing? Critical thinking? Grammar? Persuasiveness? Vocabulary? Content knowledge? Comma placement?

According to Steve Kolowich’s Inside Higher Ed story, Shermis “acknowledges that AES software has not yet been able to replicate human intuition when it comes to identifying creativity. But while fostering original, nuanced expression is a good goal for a creative writing instructor, many instructors might settle for an easier way to make sure their students know how to write direct, effective sentences and paragraphs. ‘If you go to a business school or an engineering school, they’re not looking for creative writers,’ Shermis says. ‘They’re looking for people who can communicate ideas. And that’s what the technology is best at’ evaluating.”

Why are nuance and originality just the purview of the creative writing department? Why are those things seen here as indirect or ineffective? Why do we think creativity is opposed to communication?  Is writing then just regurgitation?

What sorts of essays gain high marks among the SAT graders – human now or robot in the future? Are these the sorts of essays that students will be expected to write in college? Is this the sort of writing that a citizen/worker/grown-up will be expected to produce?  Or, for the sake of speed and cost effectiveness, in Vander Ark’s formulation, are we promoting one mode of writing for standardized assessments at the K–12 level, only to realize when students get to college and to the job market that, alas, they still don’t know how to write?

How can we get students to write more? How can we help them find their voice and hone their craft? How do we create authentic writing assignments and experiences – ones that appeal to real issues and real discourse communities, not just to robot graders? How do we encourage students to find something to say and to write that something well?  Is that by telling them that their work will be assessed by an automaton?

How do we support the instructors who have to read student papers and offer them thinking and writing guidance? When we talk about saving time and money here, whose bottom line are we really looking out for?  Who's really interested in this robot grader technology?  And why?

Image credits: Charlie Chaplin, Modern Times
