Friday, July 29, 2011

Grading with flair - Part 1 - 15 is the minimum

Today's blog entry is the first in a series of at least two where I will discuss my philosophy of evaluating student work and the assigning of grades.  As Tolkien would say, it is a tale that grew in the telling.  Thus, I've decided to allow myself multiple writings as I refine the Sinkhorn Rubric.

One of my all-time favorite quips goes a little something like "If you think your teachers are tough, wait until you get a boss."  Too true.  You see, all good teachers need to be patient with learners.  As I mentioned in my last post, Yoda was patient with Luke Skywalker.  I purposely did not go into great detail as to why great teachers are patient with their students, because I did not wish to steal any thunder from this blog entry.

The link here between Yoda and evaluation of students is that honest, constructive assessment of student work requires many things: among them are clear expectations, acceptance of creativity, and patience.  And the most important of these is patience.  Students, like children and employees, benefit from high expectations.  But it is counter-productive to insist that an instructor's evaluation of a student does not include at least a certain level of opinion.

In this first post, I discuss the biggest concern I have with evaluating student performance.  I will beg your indulgence, since I take a longer time than usual to get to the point.  So please, be patient.

But first, we need to talk about your flair

If you have never seen Mike Judge's live-action feature directorial debut, you should stop reading this and watch Office Space as soon as safety and decency allow.  In the movie, Jennifer Aniston's character, Joanna, is a waitress working in one of those restaurant chains.  You've been there.  It's the place where members of the waitstaff are evaluated solely on their willingness to behave in the most obnoxious and ingratiatingly sycophantic manner allowed by law.


The emblem of this behavior is flair.  In the Mike Judge world of a Chotchkies restaurant (Judge himself plays Joanna's supervisor), servers are required to wear at least 15 employee-supplied buttons-- the so-called flair-- on their uniform suspenders. "People can get a burger anywhere.  They come to Chotchkies for the atmosphere."  Suspender flair is presumably a big part of this atmosphere.

Joanna thinks, correctly, that the "atmosphere" of Chotchkies both debases its employees and insults the intelligence of its customers.  Stan, the supervisor character, picks the wrong day to criticize Joanna for wearing only the minimum 15 pieces of flair, and her response is honest, priceless, and a textbook example of a person railing against a divisive and poorly designed system of assessment.
Joanna - "You know what, Stan?  If you want me to wear 37 pieces of flair, like your pretty boy, Brian, over there, why don't you just make the minimum 37 pieces of flair?"
Stan - "Well, I thought I remembered you saying you wanted to express yourself."
Joanna - "Yeah.  You know what?  Yeah, I do.  I do want to express myself, okay.  And I don't need 37 pieces of flair to do it."
There may be children who read my blog, so I won't include a video of what happens next, but picture Jennifer Aniston dressing up as Stone Cold Steve Austin for Halloween.

Suffice it to say, Stan is not a born leader.  Mike Judge's character is a person who apparently got ahead by scoring high on the poorly designed Chotchkies rubric of evaluating employees.  The Chotchkies rubric does not identify and reward leadership, since leadership is apparently not something valued in the mid-level management at this particular establishment.

Guerrilla Grading

In our accreditation-driven education system, exhaustively detailed grading rubrics have become a holy grail of sorts.  In my opinion, it is extremely important that grading be as fair and as objective as possible.  Thus, a rubric is a natural means of establishing a standard against which students can compare their work as part of their own self-assessment.

This is a good thing for students and instructors.  But I like to remember that professional educators are, well, professionals.  And one of the most important things that all professionals share is the need to operate with some degree of autonomy.  Simply put, educators earn a living at least in part because they know a good paper when they see one.

For me, the difficulty in grading student work is not determining right answers from wrong answers-- or even flawless logic from an argument that needs to be refined. My difficulty is in assigning points.  More properly, my difficulty is in deciding how many points to subtract for a specific error or omission.  Is a sign error a one point mistake or a three point mistake?  The answer to that depends on the problem. Suppose I ask for the roots of x^2+4, a sum of perfect squares.

This is what I'm looking for.


If I instead get the following, the difference is merely a sign error.  But it is exactly what I don't want.


A question of this nature is specifically designed to test if a student can recognize that a certain quadratic equation has no real roots. And I want the student to do this without having been provided the graph.  Students can discover this algebraically by solving the equation, visually by plotting the graph themselves, or intuitively by recognizing that x^2+4 is a vertical translation of x^2.
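For readers who want the algebra spelled out, here is a short worked version.  It is a sketch that assumes the question is to solve x^2+4 = 0 and that the sign error in question is treating the expression as x^2-4.

```latex
\begin{align*}
x^2 + 4 = 0 &\implies x^2 = -4 \implies x = \pm 2i
  && \text{(no real roots)} \\
x^2 - 4 = 0 &\implies x^2 = 4 \implies x = \pm 2
  && \text{(the sign-error version)} \\
b^2 - 4ac &= 0^2 - 4(1)(4) = -16 < 0
  && \text{(discriminant of } x^2 + 4 \text{)}
\end{align*}
```

The negative discriminant is the algebraic signal that the parabola never touches the x-axis, which agrees with the picture of x^2 shifted up by four units.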


The Sinkhorn Rubric[TM] - Version 1.0

As it turns out, life is a pass/fail course.  According to Jorge Cham, this is also a reasonable characterization of life in graduate school.


The Sinkhorn Rubric is based on the pass/fail system.  A Jorge Cham pass/fail system is kind of like getting either a C or an F.  For me, an A is good, a C is good enough, and an F is not good enough.  The first and most basic version of the Sinkhorn Rubric is as follows.
A - The student successfully completed the assignment with a substantive addition of desirable elements including, but not limited to, at least one of the following: clarity, brevity, insight, professionalism, innovation, or creativity.
C - The student successfully completed the assignment with no catastrophic errors or significant omissions.
F - The student did not complete the assignment, failed to complete a significant portion of the assignment, or did not provide significant justification for his or her conclusions.
At Chotchkies, Version 1.0 would look like this.
A - A full 37 pieces of flair.  And a terrific smile.
C - The minimum 15 pieces of flair.
F - Flipped off the boss.  And a line cook who just happened to be standing there.
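For the programmers in the audience, Version 1.0 can be sketched as a tiny decision function.  This is only an illustration: the function name and its three yes/no inputs are my own hypothetical simplification, and each input still has to come from an instructor's professional judgment.

```python
def sinkhorn_v1(completed, justified, has_flair):
    """Return 'A', 'C', or 'F' under Version 1.0 of the rubric.

    All three parameters are hypothetical yes/no judgments made by the grader:
    completed -- done with no catastrophic errors or significant omissions
    justified -- conclusions come with significant justification
    has_flair -- at least one desirable addition (clarity, brevity, insight, ...)
    """
    if not (completed and justified):
        return "F"
    return "A" if has_flair else "C"


print(sinkhorn_v1(completed=True, justified=True, has_flair=False))
```

Notice that no points are added or subtracted anywhere.  The only questions are whether the work is good enough and whether it goes beyond good enough.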
In my studies of student success in courses like college algebra and developmental math, the DFW rate is considered one of the most important performance measures.  The DFW rate is the fraction of students attempting a course who do not complete the course with at least a C.  That is, they earn a D, an F, or withdraw.  Consequently, I submit for your consideration that there is a peer-reviewed precedent of sorts for Version 1.0.
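Computing the DFW rate from a grade roster takes only a few lines.  The sketch below assumes grades are recorded as single letters with "W" for a withdrawal, and the ten-student section is invented purely for illustration.

```python
from collections import Counter


def dfw_rate(grades):
    """Fraction of attempts that end in a D, an F, or a withdrawal (W)."""
    if not grades:
        raise ValueError("need at least one grade")
    counts = Counter(grades)
    return (counts["D"] + counts["F"] + counts["W"]) / len(grades)


# Hypothetical section of ten students, made up for this example.
section = ["A", "B", "C", "C", "D", "F", "W", "B", "C", "F"]
print(f"DFW rate: {dfw_rate(section):.0%}")
```

In effect, the DFW rate collapses the whole grade scale into two bins: C or better, and everything else.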

Beer, exams, and beauty contests

In a previous life, I was a blue-ribbon homebrewer.  My summer wheat took first in the specialty category at the Kentucky State Fair many moons ago.  I was also a beer judge.  Judging homebrew is kind of like reviewing wine for a magazine, but homebrewers will drink pretty much anything that is both non-toxic and made with malted barley.

http://www.homebrewersassociation.org

In the Homebrew division of the Kentucky State Fair, there are three rounds.  In round one, each beer is judged against a style recognized by the American Homebrewers Association (AHA).  Points are awarded based on how well the entry adheres to the style guidelines in categories such as flavor, bouquet, and appearance.  The top X beers, ranked by score, move on to round two.

Round two is the medal (ribbon, actually) round.  Judges in each category decide on their three favorite beers, then arrange the top three in order of first, second, and third.  In round three, the blue ribbon winners in each category compete for the coveted Best in Show award.

You see, round one is like grading exams.  You have to decide what to take points off for, and you also decide how many points to take off for each mistake.  Even though the AHA makes every effort to keep the process as objective as possible, it is still highly subjective.  In fact, all judges in each category of round one are required to make certain that their scores fall within a certain tolerance of each other.  And just like grading exams, judging beer for long periods of time has a tendency to dull the senses.  For some strange reason...

Rounds two and three are like judging a beauty contest.  In round two, all judges in, for example, the light ale category get together and decide which of the beers in their category are the three best.  These are the ribbon winners.  Then the judges place the ribbon winners in order of preference-- first, second, and third place.

Now let me ask you, my gentle readers.  Would you rather judge round one or round two?  That is, would you rather grade papers or judge a beauty contest?

AC/DC and fully ordered sets

From a mathematical perspective, the major difficulty associated with schemes for scoring homebrew entries is that the scores are generally whole numbers.  As a subset of the real numbers, the whole numbers form what is called a fully ordered set (more commonly known as a totally ordered set).  That means that whenever two numbers in a fully ordered set are compared, either the numbers are equal or one is larger than the other.  So, whenever two distinct (i.e., non-equal) whole numbers are considered, one of them must be larger than the other.

The simplicity of a fully ordered set is quite beautiful to a decision maker.  For example, consider that you wish to buy a 2010 Buick LaCrosse.  If Buick A costs $19,000 and Buick B costs $19,500, then Buick A is less expensive than Buick B. You will never need an expert to tell you that a $19,500 car is more costly than a $19,000 car.

And that's the beauty of a fully ordered set.  But as soon as there is a matter of opinion, things get a little more dicey.  We can easily look up the numbers to see that Australian rock band AC/DC's 1981 album For Those About to Rock reached a higher maximum position on the Billboard album charts than Back in Black.  For Those About to Rock reached number 1.  Back in Black did not reach number 1 despite becoming the second best-selling album in history.


But if you sit down and talk to any serious (or even a casual) AC/DC fan, you will be in for a long discussion of which of these is the better album.  An album grading rubric might help, especially if all manner of AC/DC experts were involved in the creation of this rubric.  But the determination of which is the better album ultimately must be, by nature, a matter of opinion.  That's why music departments teach Music Appreciation instead of Music Scoring and Ranking.

Partially ordered sets and a cheap, used Buick

When the rubber hits the road, any kind of sophisticated decision comes down to a matter of opinion.  A critical and informed expert opinion is preferable to a monkey throwing a dart, but it is still an opinion.  It's also not easy, which is why systems analysts get away with charging exorbitant consulting fees.

The big issue that I've spent so much time introducing is the concept of a partially ordered set.  In a partially ordered set, two non-equal elements can't always be neatly compared.  Let's go back to the Buick example. Buick A costs $19,000 and Buick B costs $19,500.  A is cheaper, so we should buy Buick A if the only thing we care about is the cost of the car.  This is a fully ordered set, but how realistic is it?

http://www.autoinsane.com/2009/07/28/reviews/first-drive/first-drive-2010-buick-lacrosse/

A businessman once told me that the smartest thing you can do for a customer who cares only about cost is to refer him immediately and enthusiastically to your closest competitor.  The implication is that cost is only one factor to consider when purchasing an automated material handling system.  Just like a material handling system, the cost of your family car is important.  But a cheap car isn't all that great if the wheels fall off as soon as you drive off the lot.  So let's consider the mileage of the vehicles.

Suppose that Buick A has 50,000 miles on the odometer to go with the $19,000 price tag.  If Buick B costs $19,500 and sports 5,000 miles, it's quite likely that you would happily pay an extra $500 to get a vehicle with fewer miles.  But what if Buick B costs $19,500 with 45,000 miles on the odometer?  Now you have a decision to make, because the Buick problem of cost and mileage involves two distinct elements of a partially ordered set that cannot be neatly ranked against each other.
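The Buick problem can be sketched in a few lines of code.  The comparison rule here is an assumption on my part: one car counts as the better buy only if it is no worse on both price and mileage, which is the usual componentwise ordering on pairs.  Buick C is a hypothetical third car added just to show a case where the comparison does succeed.

```python
from typing import NamedTuple


class Car(NamedTuple):
    name: str
    price: int  # dollars
    miles: int  # odometer reading


def no_worse(a, b):
    """True if a is no worse than b on both price and mileage (lower is better)."""
    return a.price <= b.price and a.miles <= b.miles


def compare(a, b):
    """Compare two cars under the componentwise (partial) order."""
    if no_worse(a, b) and no_worse(b, a):
        return "tie"
    if no_worse(a, b):
        return f"{a.name} is the better buy"
    if no_worse(b, a):
        return f"{b.name} is the better buy"
    return "incomparable: a judgment call"


buick_a = Car("Buick A", 19_000, 50_000)
buick_b = Car("Buick B", 19_500, 45_000)
buick_c = Car("Buick C", 19_500, 60_000)  # hypothetical, for illustration only

print(compare(buick_a, buick_b))  # incomparable: a judgment call
print(compare(buick_a, buick_c))  # Buick A is the better buy
```

Strictly speaking, even the 5,000-mile version of Buick B is incomparable with Buick A under this rule.  Happily paying the extra $500 is already an opinion, just one that nearly everyone happens to share.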

The simplicity of Version 1.0 is a great strength.  A novice instructor could easily assign any sample of student work into one of the three categories in this first incarnation of the Sinkhorn Rubric.  The Version 1.0 rubric does not imply that two students who earn a C have turned in identical work.  What the rubric implies is that each student who earns a C did well enough to pass but not well enough to earn an A.

For next time

I doubt any dean or chief academic officer would be fond of a grading scale with only A, C, and F, so I will expand the Sinkhorn Rubric to allow for a B and a D.  If you have something to add to the discussion, feel free to use the comments below.  I will decide on the numbering scheme for my versions and get back to you next time.

In the meantime, let me ask you a question.  What is the difference between a B and a B+?  I don't want to know the numerical difference.  What I'm getting at is the qualitative difference between B and B+.  What does a student have to do to get a B+ that is significantly better than a B but not worthy of an A or an A-?  Would an employer who hired a B+ student of accounting expect significantly more from that hire than from a B student?
