Student Data is the New Oil: MOOCs, Metaphor, and Money

Below are the notes and slides from my talk yesterday at Columbia University. The talk was part of the university's Conversations About Online Learning series, and my trip was sponsored by the Office of the Provost and the Columbia Center for New Media Teaching and Learning. A big thanks in particular to Alex Gil for facilitating my invitation to speak there.

I am incredibly honored to be here to speak to you today. Columbia University has a very special place in my heart. I never attended, but my Uncle Jim did. He graduated in 1960 and still lives in the same little rent-controlled apartment just a few blocks from here. He became an entertainment journalist after he graduated, writing for LIFE and for People magazine.

I grew up in Wyoming, and my Uncle Jim would only return once a year -- for like 48 hours -- for Christmas. He would return with the best stories and, often, with the best movie swag.

To me, he represented escape. Escape from Wyoming. Escape to New York. He represented all things Big Apple -- he knew movie stars, he was a Democrat. None of those can be found much in Wyoming -- except maybe in Jackson Hole.

And he was a writer; I wanted to be like my Uncle Jim. I wanted to reject the conservative constraints of "home." I wanted to be a writer, and I was both thrilled and panic-stricken when an aptitude test I took in junior high came back with only one job recommendation for me: freelance writer.

Finally, after years of dropping in and out of school, out of college, and out of a PhD program and disappointing my Uncle Jim plenty along the way, that's what I do: I write.

I was always intrigued by my uncle's profession and his politics, which I somehow credited in part to this university. I was particularly fascinated, although the Vietnam War most affected my dad's generation and although my Uncle Jim attended Columbia almost a decade earlier, by the anti-Vietnam War efforts of students here, by the takeover of parts of the campus.

Much of my work as an academic -- I'm a recovering academic now, for what that's worth -- dealt with student protest. But before I started my scholarship on student activism, I'll confess to you this: as a teenager, I had a weird sort of crush on Columbia student and SDS leader Mark Rudd.

In 2003, shortly after the documentary about the Weather Underground was released, I had the chance to hear Mark Rudd speak. He's a math instructor now at a community college in New Mexico. But he came to the University of Oregon where I was a graduate student -- working on a dissertation on the European avant-garde and student protest theatre, believe it or not. And Eugene, Oregon at the time, much like Columbia University in the late 1960s, was a "hotbed" of radicalism.

After the screening of the Weather Underground film, someone asked Rudd what he thought about the government's surveillance capabilities today. ("Today" being a decade ago now, I should make really clear.) Rudd and the other members of the Weathermen were surveilled as part of the FBI's COINTELPRO project -- a series of covert and often illegal efforts to investigate and discredit the left in the 1960s.

And although not explicitly stated, the question from the audience was certainly related to the FBI's investigation and infiltration of radical environmental groups in the Eugene area in the late 1990s and early 2000s.

But Rudd scoffed at the question. Even with the enhanced capabilities of machines, he insisted, human investigators just aren't that good at sifting through the data. And it was likely, he argued, that computer-based surveillance would continue to be more about the politics of "police work."

Fast forward from that Q&A session to the summer of 2013, dominated by the news leaked by Edward Snowden of the NSA surveillance program, of massive data harvesting and data warehousing by the federal government. Cell phone records. Email. Social networks. Metadata.

This is, I realize, a very long-winded and circuitous introduction to what I want to talk about here today: data-mining and education.

But I want to set the stage with these interrelated stories. I want you to keep in mind, as I talk specifically about learning analytics, big data, and education algorithms, more generally our histories of overreaching government surveillance, our histories of student radicalism, our personal histories of likes and dislikes and dreams, and our histories -- personal and institutional -- of standardized testing and aptitude testing. How do these shape who students become? Will data analytics change this?

Of course, the emerging fields of education data mining and learning analytics rarely if ever frame themselves in terms of surveillance and policing. Not explicitly at least.

The promise of learning analytics, so we're told by education researchers and -- with much more certainty and marketing finesse -- by education companies, is that all this data that students create, that software can track, and that engineers and educators and administrators can analyze will bring about a more "personalized," a more responsive, a more efficient school system. Better outcomes (which we can translate cynically as: higher test scores, higher class and college completion rates).

The claims about big data and education are incredibly bold, and as of yet, mostly unproven.

Take Knewton, for example, one of the corporate leaders in the sector. Once in the highly lucrative business of test prep, Knewton has rebranded to offer "adaptive learning" -- partnering with publishers to create content delivery and assessment software that moves students through educational materials "at their own pace."

That's the PR spin, at least: with big data, Knewton engineers can now precisely identify a student's strengths and weaknesses and "learning styles" (ignoring, I should add here, the evidence that learning styles do not exist) and guide students through the next-best content nugget so as to learn with maximum efficiency. As a recent story about the company put it, "If you learn concept No. 513 best in the morning between 8:20 and 9:35 with 80 percent text and 20 percent rich media and no more than 32 minutes at a time, well, then the odds are you’re going to learn every one of 12 highly correlated concepts best that same way.”

Knewton boasts that over a million data points are created by students on its platform every day, all of which feed its algorithms and its recommendation engine, all in turn in the service of delivering lessons that it claims are perfectly and personally adapted to each and every student.

A million data points a day. A slightly different stat that executives from the company also bandy about: "a million data points for each student who’s taking a Knewton platform course for a semester." Regardless. That is a lot of data about students.

It's still just a tiny drop in the quintillions of bytes of data that are created every day. And the million data points a day generated on the Knewton platform are just a tiny drop of the data that students are creating -- in "formal" and "informal" settings, via "sanctioned" and "required" and DIY ed-tech tools, intentionally and unintentionally.

So what data are students creating? What data are schools and software companies gathering?

Traditionally, we've thought about "student data" in terms of what’s on the transcript — that is, demographics, major, and final course grades. Student data includes test scores. Individual assignments. Attendance. Add to that perhaps, behavior and disciplinary records. "This will go down on your permanent record!" as many of us have been informed.

But schools and their administrative and instructional technologies track so much more these days: library check-outs. Gym visits. Intramural sports participation. Cafeteria and bookstore purchases. Minutes from student meetings. Times in and out of the dormitory.

Much of this data exists in software silos that are disconnected. But more and more, companies are starting to push for the aggregation of student data into analytics tools that can be sold in turn back to the school. Learning management system log-ins and duration of their LMS sessions. Blog and forum comment history. Internet usage while on campus. Emails sent and received via university email accounts. The pages students read in digital textbooks. The passages they highlight.

There's more too, that gray area of students' computer usage, via software that isn't necessarily administered by the school: students’ search engine history. Their social media profiles. Time spent on Facebook while in class. Videos watched on Coursera or Khan Academy or Udacity, along with if and where they paused the videos. Exercises completed on any of these platforms. Their Wikipedia visits. Their downloads. Their uploads. Their levels on Grand Theft Auto V.

All their keystrokes and mouse clicks, logged.

Those last items are, along with biometric data, how Coursera says it plans to confirm students’ identities for its "signature track" MOOCs, that is those courses from which students can pay for an official certificate. These are also the data that Coursera says will give it incredible insights into course design.

"By collecting every click, homework submission, quiz and forum note from tens of thousands of students," TED describes Coursera founder Daphne Koller's talk, MOOCs have become "a data mine that offers a new way to study learning."

Actually Coursera has some 4.3 million users, not simply "tens of thousands." So when we think of MOOCs -- and education technology more broadly -- as "a data mine," it is indeed a massive one.

Now although I write about education and technology for a living, my formal academic training is in neither area. I'm a literature and language person, and so when I hear "data mine," my first thoughts aren't about statistics and algorithms and mathematical models. I think about metaphor. I think about cultural history.

The phrase "data-mining" is quite new -- less than 25 or so years old -- although yes certainly, statistics and algorithms and pattern recognition and the analyses based upon these have a much, much longer history. In the 1960s, statisticians referred to the poring through of data without an a priori hypothesis as "data-dredging," a practice that carried a negative connotation, as "dredging" would, I guess.

It's worth noting too that the concerns about data and its misuse -- particularly with regards to violations of privacy -- extended to the public-at-large at that time. Indeed, as banking, healthcare, and government services were becoming increasingly computerized in post-war America, the 1960s and 1970s saw the passing of several laws -- many still on the books -- addressing the collection, storage, and sale of people's data, including FERPA, the Family Educational Rights and Privacy Act, the law that governs the privacy of students' education record.

Data-dredging. Data-mining. Technology processes, sure. But also really interesting metaphors.

Dredging data conjures up the image of searching or excavating a large, fluid pool of information. Dredging up information from the bottom of the pool, information that's been buried, that's otherwise inaccessible. And, to continue the metaphor, despite the importance or value in what's being harvested, dredging in the physical world is largely recognized to disturb the ecosystem and to leave behind toxic chemicals.

Mining data might suggest a more targeted resource extraction. It certainly suggests a more lucrative one. And ideally, I'd argue, concerns about damage to the ecosystem persist. And if we're reluctant to talk about the environmental impact of minerals mining, we're ignoring that impact altogether in data-mining.

"Data is the new oil," headlines proclaim. “Data is just like crude," says the market analyst. "It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc., to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.”

"Data is the new oil," says the investor, urging startups to locate and mine resources currently untapped.

"Data is the new oil," says The World Economic Forum. "In practical terms, a person’s data could be equivalent to their ‘money'".

"Data is the new oil," say tech bloggers and journalists alike.

Among the earliest references to this phrase that I can find, for what it's worth, is this one from 1999: "Information is the new oil. Drilling new sources of information."

Now again, let me remind you. I'm from Wyoming -- the land of Dick Cheney. 48% of my home state is public land, but the mineral rights are leased or sold to private companies for mining (coal, uranium, natural gas, and yes oil). (The tension between public and private when it comes to education data is particularly fraught.)

Wyoming is also the site of the Teapot Dome scandal, which before Watergate, was the biggest scandal in the history of the government. In case you easterners didn't pay attention during that unit of history class, Teapot Dome was when the Harding administration gave private companies opportunities to mine public lands in Wyoming (and in California) without going through the proper bidding process.

When I hear "data is the new oil," I think about these histories, the sorts of relationships that have long been forged between government and corporate entities.

I get it: to call data "the new oil" is particularly resonant in our energy-hungry and fossil-fuel-reliant economy. And for what it's worth, some data scientists have pushed back on the "oil" metaphor -- or at least on some of the uncritical glee surrounding its usage simply to talk about the potential for profits. Jer Thorp, an educator and the former data artist in residence at The New York Times, argued that the "data is the new oil" metaphor is deeply flawed. Data isn't something that lies beneath the surface, just waiting to be extracted. Thorp writes -- and apologies for quoting him at length here -- that,

"Perhaps the “data as oil” idea can foster some much-needed criticality. Our experience with oil has been fraught; fortunes made have been balanced with dwindling resources, bloody mercenary conflicts, and a terrifying climate crisis. If we are indeed making the first steps into economic terrain that will be as transformative (and possibly as risky) as that of the petroleum industry, foresight will be key. We have already seen “data spills” happen (when large amounts of personal data are inadvertently leaked). Will it be much longer until we see dangerous data drilling practices? Or until we start to see long term effects from “data pollution”?

One of the places where we’ll have to tread most carefully — another place where our data/oil model can be useful — is in the realm of personal data. A great deal of the profit that is being made right now in the data world is being made through the use of human-generated information. Our browsing habits, our conversations with friends, our movements and location — all of these things are being monetized. This is deeply human data, though very often it is not treated as such. Here, perhaps we can invoke a comparison to fossil fuel in a useful way: where oil is composed of the compressed bodies of long-dead micro-organisms, this personal data is made from the compressed fragments of our personal lives. It is a dense condensate of our human experience."

If we are to embrace the "the new oil" metaphor, Thorp insists that we do so critically, thinking through all the implications, and not merely those implications that have the "mining" executives rubbing their hands together in glee, anticipating the profits to be made.

The metaphor remains the dominant way we talk about data. In addition to "data as oil" and "data-mining," there's the phrase "data exhaust" too, one that I've heard used with increasing frequency. Again, these terms really do matter. "Data exhaust" -- as though even the scraps or waste from what we do online, with our various computing devices, with our phones is valuable -- data exhaust is metadata.

There's a sense with that phrase, "data exhaust," that we're able to make some sort of ecological use of materials that would otherwise simply be waste. Like recycling materials, something that would be tossed aside by its creator or user becomes valuable to someone else. The exhaust is open to being "dredged," sucked up and filtered through by companies and, now we know, by our federal government.

The promise of big data is that mining will uncover something of great value, something that we haven't been able to unearth until now -- thanks to the massive quantity of data, thanks to the massive computer processing power: the cure for cancer, perhaps, or for any number of medical illnesses; in our case here, a cognitive roadmap for how each of us learns. That's certainly something we hear from those in education technology -- that big data and education analytics can crack open the "black box" of learning.

The value uncovered by data mining can be, in other words, a scientific breakthrough -- "an answer." And/or, the value can be money. Indeed, it's helpful to think about the uses of big data and analytics we've already seen: high-speed trading, Internet advertising, e-commerce recommendations.

These, I think, might provide hints of why we see much of the excitement and investment in technology companies right now: data.

The recent explosion of technology startups -- whether mobile tech, social networking sites, or ed-tech -- reflects this. Many of these new companies charge nothing for their product, yet they're raising millions of dollars in venture capital. The business model might develop eventually, sure. These companies might go on to be "the next Google." Even if revenue never materializes, there might be some interesting technology under the hood that makes them a target for acquisition. But if nothing else, there'll be data -- lots and lots of user-generated data -- that someone believes can eventually be monetized. Think Twitter. Think LinkedIn. Think Facebook.

Think too of the many, many education technology startups that are free to use:

Remind101, a company that lets teachers and students communicate "safely" via text-message, has raised $3.5 million.
Codecademy, a learn-to-code website, has raised $12.5 million.
Udacity, one of the MOOC startups, has raised $20 million.
Edmodo, a social network for schools, has raised $40 million.
Coursera, another one of the MOOC startups, has raised $65 million.

Despite all these free tools, there is a huge market for ed-tech products. Schools, students, parents do and will pay. And no doubt, technology does have incredible capacity to improve and scale our access to education.

But that should be what we cultivate ("to cultivate" -- a different verb, different metaphorical language, no doubt, than "to mine").

Tech should not be used simply, as London Knowledge Lab professor Diana Laurillard has said, "to access masses of data from desperate would-be students." She was referring to the question of scale and access and technology and MOOCs and specifically responding to a round of funding that Coursera had just raised. "Venture capitalists," she said, "are not spending $22 million to nurture students."

Indeed. It seems likely they are spending it because "data is the new oil."

And if educational data is indeed "the new oil," how do we make sure that education technology isn't poised simply to extract value from students? How can it deliver value -- cultivate value, cultivate minds?

We must ask, for starters, "who owns students' data?" Because even though laws like FERPA purport to protect the privacy of students' education record, there isn't a clear provision that states the student record belongs to the student. And as I noted earlier, that student record has expanded to include much much more "data" than when the law passed in the early 1970s. Furthermore, recent revisions to FERPA make it easier for schools, at best the guardian of that record, to release student data to companies that provide educational services and software. The learning management system, for example, the adaptive learning software, the digital textbook publisher, the online course provider.

Of course, this is complicated by the fact that FERPA's protections -- as out-of-date and as frustrating as they have become in a digital world -- only cover students in formal academic settings. These protections only cover students enrolled in programs which receive federal funds under certain Department of Education programs. That leaves open a whole swath of companies -- many new for-profits in the education technology sector -- that need make no pretense of protecting student data under this regulation.

Here is part of the Terms of Service from the World Education University, which, incidentally, offers online courses for free:

“WEU may share your information with any third party outside of our organization as necessary to underwrite free educational offerings. WEU may contact you about specials, new products or services, or changes to this privacy policy."

We can make a joke that "no one reads the Terms of Service." But frankly, I'd wager that no students -- or very, very few -- think through the ways in which their personal data will be used by any products and services associated with their schooling -- whether they are in an ad-supported educational program like WEU or whether they're in a VC-supported educational program like Coursera or whether they're in an endowment-rich school like Columbia or whether they're paying their own way through their local community college.

Technology writer Douglas Rushkoff has long argued, "if you're not paying for the product, you are the product." But I think it's actually far more complicated than that. When it comes to our metadata, it seems we're becoming the product either way. And when it comes to schooling, we're already well accustomed to talking about students as "the product" of the system: heads to fill with information; lives to shape; and now metadata to mine.

So what will be the role of the algorithm in education? How will big data and learning analytics shape the decisions we make -- in the classroom, in institutions; as students, as professors, as administrators, as parents?

Degree Compass was a startup founded at Austin Peay State University in Tennessee. The point was to take students' performance in certain classes, along with enrollment data and the patterns of "similar students," to point students to subsequent classes that "the algorithm" said they'd do well in.

From the Austin Peay website:

This system, in contrast to systems that recommend movies or books, does not depend on which classes are liked more than others. Instead it uses predictive analytics techniques based on grade and enrollment data to rank courses according to factors that measure how well each course might help the student progress through their program. From the courses that apply directly to the student’s program of study, the system selects those courses that fit best with the sequence of courses in their degree and are the most central to the university curriculum as whole. That ranking is then overlaid with a model that predicts which courses the student will achieve their best grades. In this way the system most strongly recommends a course which is necessary for a student to graduate, core to the university curriculum and their major, and in which the student is expected to succeed academically.
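To make that description a little more concrete, here is a minimal sketch of the kind of ranking it seems to describe. This is not Austin Peay's or Degree Compass's actual model: the `Course` fields, the `centrality` measure, the `grade_weight` parameter, and the sample numbers are all hypothetical, and the predicted grade is simply assumed to come from some other model.

```python
from dataclasses import dataclass


@dataclass
class Course:
    name: str
    applies_to_program: bool  # counts toward the student's degree (assumed known)
    centrality: float         # 0..1, how core the course is to the curriculum (hypothetical measure)
    predicted_grade: float    # 0..4, output of some grade-prediction model (assumed given)


def rank_courses(courses, grade_weight=0.5):
    """Keep only courses that apply to the program, then rank them by
    curriculum centrality overlaid with the predicted grade."""
    eligible = [c for c in courses if c.applies_to_program]

    def score(c):
        # Blend centrality with the predicted grade scaled to 0..1.
        return (1 - grade_weight) * c.centrality + grade_weight * (c.predicted_grade / 4.0)

    return sorted(eligible, key=score, reverse=True)


# Made-up sample data for illustration only.
courses = [
    Course("Statistics I", True, 0.9, 3.2),
    Course("Art History", False, 0.4, 3.9),
    Course("Organic Chemistry", True, 0.7, 2.1),
]

for c in rank_courses(courses):
    print(c.name)
```

Note what even this toy version does: the course with the highest predicted grade never appears in the recommendations, because it doesn't "apply directly to the student's program of study." What counts as relevant is decided before any ranking happens.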

Degree Compass was acquired by the learning management system Desire2Learn in early 2013 for an undisclosed amount of money.

It's certainly a more sophisticated way of suggesting courses -- which is, in some ways, about suggesting career opportunities -- than the aptitude test that I took back in the 1980s, the one that told me the only job it recommended was "freelance writer." That test asked me questions like: "Do you like lifting heavy objects?" Um. No. "Do you like working outside?" Not really. No. "Do you like to read?" Yes! "Do you like to write?" Yes! "Do you like following orders?" Hell no. "Do you like being in charge of people?" I'm 14. What do I know?! But that test, the Strong Interest Inventory, was based on data. It was based on questionnaires and research, and it's still touted for its predictive capabilities, even though its earliest version was created in the late 1920s.

And so: imagine, some might say, what we could do with more data. Imagine what we could do with better data. Imagine the predictions we could make.

We could identify which students are likely to be the most successful academically. (In a future where graduation rates are tied to financial aid availability, imagine how this might shape schools' enrollment policies.)

We could identify which students are likely to be dropouts. (Of course, then we have to ask: what do we do then? When do we intervene? How do we intervene? To what end?)

We could investigate how students' performance in Algebra I is tied to their credit score later in life. We could investigate how their performance in The History of Western Civilization is tied to their voting patterns.

We could identify the lessons and the lectures and the assessments that "don't work." (You hear this a lot from the folks at Coursera, who argue that MOOCs allow them to "fail fast" in this respect. Or, as my friend Mike Caulfield argues in response, you could actually hire good instructional designers and produce quality work from the start.)

We could identify which students are likely to make great biochemistry majors. (You have to wonder here, if the folks who write these algorithms would even suggest people become art history or creative writing majors.)

We could identify which students are likely to be successful entrepreneurs. We could identify which students are likely to become wealthy alumni and reliable donors back to the school.

And to bring things full circle, we could identify which students are likely to become radicals and dissidents. We could share that data with administrators and/or with authorities.

Data, we're told, will allow us to address our most pressing questions in education. But as the uses I've just detailed suggest, it matters who asks those questions, what constitutes those questions. These questions are what shape the algorithms that we build to answer them. And I'll add too that the metaphors we use shape the models we build as well. What does it mean if we decide student data is "the new oil"? What does it mean if we view students (and their data) as a resource to be mined and extracted? What's gained? What's lost? What's depleted? Who profits? Who benefits?

I'm just not confident that it'll be those who are having their metadata mined.

Written by Audrey Watters

Credits

Hack Education

The History of the Future of Education Technology