Student Assessment:
What Constitutes Fairness in Grading?
Line Harder Clemmensen, DTU Compute
[email protected]
Mary Kathryn Thompson, DTU Mekanik
[email protected]
Educational Assessment
• Process of measuring and documenting educational gains in
knowledge, skills, and understanding
• 3 Types:
– Initial Assessment: Establish a baseline
– Formative Assessment: Ongoing feedback to improve learning
– Summative Assessment: Evaluation of learning
Grading
• Grading is part of summative assessment
• The grading process is affected by:
– What is measured (subject, learning objectives, etc.)
– How it is measured (exam, project, etc.)
– How the results are interpreted
• And therefore also by:
– Whose learning / performance is being measured
– Who is doing the measurement / interpretation
What is Measured
Affects Fairness in Grading
• Should measure the knowledge, skills, and understanding that were
developed, not the students’ ability to use the measurement tool
– Example: A test should measure a student’s knowledge, not their
ability to take tests
– Example: If learning is measured via a paper, the grade should
reflect the paper’s content, not the student’s ability to write a paper
• If there are multiple learning objectives (e.g. how to write a paper and
the paper’s content), each should be evaluated independently
Example: Rubrics to Evaluate
Paper Content (Left) and Writing Ability (Right)
(Rubric images removed)
How it is Measured Affects Fairness in
Grading
• Students often have different preferences for measurement
instruments
– For example: Some are better at (and prefer) exams
– For example: Others are better at (and prefer) projects or papers
How the Results are Interpreted
Affects Fairness in Grading
• How the results are interpreted depends on the subject being
graded
– Some learning can be measured using closed-ended questions
with right/wrong answers
– Some learning must be measured using open-ended
questions or assignments
• (More) Interpretation is needed for open-ended questions and
assignments
How the Results are Interpreted
Affects Fairness in Grading
• How the results are interpreted depends on who is doing the
interpretation
• Interpretation
– Introduces the possibility for personal preference and other biases
– Introduces the possibility for human error
• Need for (more) interpretation increases concerns about fairness
in grading
How the Results are Interpreted
Affects Fairness in Grading
• How the results are interpreted depends on how many people are
doing the interpretation
• The number of people doing the interpretation:
– Depends on the class size
• Larger classes may have multiple sections / graders
– Depends on what is being evaluated
• Some subjects / theses require multiple examiners
• More graders introduce the possibility of inconsistencies (and
therefore unfairness) in grading
When is Fairness in Grading the Biggest
Concern?
• In large courses with multiple sections / instructors
• Where multiple graders are used or needed
• Where grading requires substantial interpretation
– Where assessment instruments are open-ended (projects,
papers, etc.)
– Where the topic is subjective and depends on personal
appreciation (art, design, etc.)
• Where there is no direct comparison between student
performances
– For example, they write papers on different topics or they work
on projects that are different in nature
How Do We Improve Fairness in Grading?
• Use multiple means of assessment
• Assess multiple times
• Have the work be assessed by multiple graders
• Use rubrics to clearly define the assessment criteria and improve
consistency between graders
• To some extent all of these options (except for the use of
rubrics) are inherent in the Danish grading system
Grading Rubrics
Grading Rubrics
• Scoring tools that “explicitly represent the performance
expectations for an assignment or piece of work”
• They divide “the assigned work into component parts” and
provide “clear descriptions of the characteristics of the work
associated with each component, at varying levels of mastery”
• Usually “reflect the weighted importance of the objectives of the
assignment”
http://www.cmu.edu/teaching/designteach/teach/rubrics.html
Sample Rubric for Capstone Design
Traditional Rubrics
• Each rubric is organized into blocks of criteria
• Each criterion has 3 or 4 levels of achievement
• Each level of achievement has a verbal description
• Overall performance is judged based on the evaluation of all the
criteria
Advantages of Rubrics
• For Students:
– Grading criteria are explicit
– They know how and where to focus their efforts
– Allow self-assessment before assignment is due
• For Faculty:
– Reduce grading time
– Increase consistency in a single grader over time
– Increase consistency between multiple graders
– Can help identify weaknesses in the course
Disadvantages of Traditional Rubrics
This is a lot of text to read and memorize…
… especially if English is not your first language
Disadvantages of Traditional Rubrics
Guidelines, not a grading sheet
Nowhere to place marks or notes for each student / assignment
Sample Grading Rubric from KAIST ED100
ED100 Grading Rubrics
• Each rubric is organized into blocks of criteria
• Each criterion has a qualitative value
• Each block of criteria has a quantitative point value
• Tries to make grading fast and easy
• Tries to limit penalties for one type of mistake
• Tries to strike a balance between guiding the grader and giving them the
freedom to evaluate the deliverables as they see fit
Poster Formatting and Style: ( _____ / 15 points)
– The poster was easy to read: 0 / - / ✓ / +
– The poster was attractive: 0 / - / ✓ / +
– The poster distributed graphic/blank space/text effectively: 0 / - / ✓ / +
– The poster made effective use of visual aids: 0 / - / ✓ / +
ED100 Grading Rubrics
• Please note: ED100 grading rubrics were developed for first year
students taking classes and presenting their work using English
as a second language
• Some of the criteria in the rubrics were included to address
common problems that were specific to this class and to these
students
• The rubrics may contain criteria that are not appropriate for
generic courses
• A total of 6 sets of rubrics were used in the course. Criteria that
seem to be obviously ‘missing’ from the sample rubrics may be
present in another rubric
How Does the Rating Scale Affect
Grading Rubrics?
Rating Scales
• A rating scale is the scale used to measure (quantify) qualitative
information
– 1 2 3 4 5 (5-point Likert scale)
– 0 2 4 6 8 10
– 0 / - / ✓ / +
– A B C D F
– ___ / 10
– etc.
• Originally from psychological measurement
• Can be used in grading rubrics in place of verbal anchors
• It is well known that the choice of rating scale affects the measurement
Grading Scales Experiment
• 21 experienced graders
• Each graded the same 5 posters from KAIST ED100
• Using one of 4 grading scales
Thompson, M. K., Clemmensen, L. H., Ahn, B.-U., “Effect of Rubric Rating Scale
on the Assessment of Engineering Design Projects.” International Journal of
Engineering Education Vol. 29, No. 6, pp. 1490–1502, 2013.
Grading Scale A
Grading Scale B
Grading Scale C
Grading Scale D
Results
• All scales produce valid results but
– Scale A tended to underestimate the ‘true’ score
– Scale D tended to overestimate the ‘true’ score
– Scale C had an unusually high standard deviation
• Scale B had the best overall performance in the experiment
• Scale B has had excellent performance in practice
• Scale B is the scale we use and recommend
Poster Formatting and Style: ( _____ / 15 points)
– The poster was easy to read: 0 / - / ✓ / +
– The poster was attractive: 0 / - / ✓ / +
– The poster distributed graphic/blank space/text effectively: 0 / - / ✓ / +
– The poster made effective use of visual aids: 0 / - / ✓ / +
Results
• More opportunities for point deductions lead to lower grades (Scale A)
• A finer rating scale reduces this tendency (Scale B vs. C)
• Letter grade rating scales should be avoided (Scale D)
(Everyone interprets letter grade scales differently)
• Internally weighted rating scales should be avoided (Scale D)
(Difficult to get the weighting right)
• Important to balance rater responsibility and comfort
– A scale that places more responsibility on the grader increases
reflection and scale validity at the cost of grader satisfaction
– Grader discomfort can lead to intentional misuse of the grading scale
– Intentional misuse of the grading scale leads to reduced validity and
an increased number of outliers
Multiple Graders and Grading Juries
Multiple Graders and Grading Juries
• Many subjects and assignments are open-ended and subjective
– No “right” answers
– Many good answers, some excellent
– Everyone has different expectations
• Can lead to problems with fairness and consistency in grading
– Especially in large courses where there are many teams or
sections
• Common solution is to employ multiple raters (i.e. a Jury)
– Helps to balance the differences of opinion that naturally
occur during subjective evaluations
– Increases the objectivity of the evaluation
Multiple Graders and Grading Juries
• Multiple graders introduce the opportunity for disagreement
• In the absence of perfect agreement, how do you determine the
final score?
• Three basic modes of operation:
– Discussion continues until consensus is reached
• Slow and cannot be done asynchronously
• Common in architecture
– Jury members rate independently, scores are averaged
• Multiple methods available for adjudication
– Jury members discuss and then rate independently
• Adjudication may still be necessary
Prior Art: Adjudication
• Martial Arts
– 4 judges, 1 referee: Majority rules but referee can overturn
– (Requires an expert rater on each jury)
• Olympic Figure Skating
– 9 raters, remove two lowest and two highest and average
– (Requires larger number of raters)
• Student Peer Reviews
– Weighted average of raters (Carlson et al., 2005)
• High Stakes Assessment (Johnson et al.)
– Average of two raters, if they disagree bring in an expert
– Can average the three scores
– Expert can replace one score / both scores
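As a rough illustration, here is a minimal sketch of two of the adjudication schemes above. The function names and the disagreement threshold are assumptions for this example, not the cited authors' implementations:

```python
from statistics import mean

def skating_adjudication(scores):
    """Figure-skating style: drop the two lowest and two highest scores,
    then average the rest (requires a larger number of raters)."""
    return mean(sorted(scores)[2:-2])

def high_stakes_adjudication(rater_a, rater_b, expert_score, max_gap=10):
    """Simplified Johnson-et-al style: average two raters; if they
    disagree by more than an (assumed) threshold, bring in an expert
    and average all three scores."""
    if abs(rater_a - rater_b) <= max_gap:
        return mean([rater_a, rater_b])
    return mean([rater_a, rater_b, expert_score])

print(skating_adjudication([70, 75, 80, 82, 84, 85, 88, 90, 99]))  # 83.8
print(high_stakes_adjudication(85, 60, expert_score=70))           # ≈ 71.67
```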
Adjudication in ED100
• KAIST ED100: Introduction to Design and Communication
– 6 x 100 deliverables per semester to grade
– Each deliverable was evaluated by 5–6 graders
• All sets of scores were analyzed to identify statistical outliers that
potentially represented invalid scores
• Outliers were hand checked by an expert grader
• Scores deemed to be invalid were removed from the data set
• All other scores were averaged to determine the students’ grades
Thompson, M. K., Clemmensen, L.H., Rosas, H., “Statistical Outlier Detection
for Jury Based Grading Systems.” Proceedings of the 120th ASEE Annual
Conference and Exposition, June 23-26, 2013, Atlanta, GA.
Outlier Detection in ED100
• Base rule for flagging potential outliers:
– If µ - 1.5σ < Score < µ + 1.5σ, the score belongs to the group
• All flagged scores are hand-checked by an expert
• There are no outliers if:
– σ < 5% (good jury agreement)
– |µ_new − µ_old| < 2.25% (outlier removal doesn't matter)
– |σ_new − σ_old| < 2% (jury agreement unchanged)
• Scores are hand-checked anyway if:
– σ >> 9–15% (poor jury agreement)
• All final scores can be challenged by students, faculty, or staff members
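A minimal sketch of the base flagging rule, assuming scores are percentages out of 100 (so a σ threshold like "5%" corresponds to 5 score points) and using the sample standard deviation:

```python
from statistics import mean, stdev

def flag_outliers(scores, k=1.5):
    """Return the scores outside the mu +/- k*sigma band. These are only
    candidates -- in ED100 an expert hand-checks every flagged score."""
    mu, sigma = mean(scores), stdev(scores)
    return [s for s in scores if not (mu - k * sigma < s < mu + k * sigma)]

# Jury 3 from the example below: the score of 48 is flagged
print(flag_outliers([76, 81, 48, 87, 87, 88]))  # [48]
```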
Example Jury Results (Fall 2011 Posters)

Raw jury scores (Graders 1–3: professors; Graders 4–6: teaching assistants):

                  Grader 1  Grader 2  Grader 3  Grader 4  Grader 5  Grader 6
Jury 1 / Team 1      86        92        87        82        95
Jury 2 / Team 2      85        90        82        85        83        76
Jury 3 / Team 3      76        81        48        87        87        88

Before outlier removal:

                     µ        σ      +1.5σ    −1.5σ
Jury 1 / Team 1    88.40     5.13    96.09    80.71   No outliers (σ < 5%)
Jury 2 / Team 2    83.50     4.59                     No outliers (σ < 5%)
Jury 3 / Team 3    77.83    15.33   100.83    54.84   1 outlier detected

After removing the detected outlier (Jury 3, the score of 48):

                     µ        σ      +1.5σ    −1.5σ
Jury 3 / Team 3    83.80     5.17    91.55    76.05   1 outlier removed

(The Jury 1 and Jury 2 rows are unchanged.)
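Recomputing the Jury 3 row after the flagged score is removed reproduces the after-removal values above; a quick check (sample standard deviation assumed):

```python
from statistics import mean, stdev

kept = [76, 81, 87, 87, 88]          # Jury 3 scores with the 48 removed
mu, sigma = mean(kept), stdev(kept)
print(round(mu, 2), round(sigma, 2))                           # 83.8 5.17
print(round(mu + 1.5 * sigma, 2), round(mu - 1.5 * sigma, 2))  # 91.55 76.05
```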
Outlier Detection in ED100
• Generally successful and conservative:
– Overall detection rate: 4.8%
– Overall removal rate: 2.7%
– False positive rate: 44.3%
– False negative rate: < 0.1%
Ongoing and Future Work
• Investigating 3 approaches to outlier detection:
– Mean / standard deviation for normal distributions
– Interscore gap (see the sketch below)
– Score density
• Using the data from the Fall 2010 and 2011 FDC grading data
sets, we are trying to:
– Formalize each approach (develop an algorithm)
– Characterize each approach
– Optimize each algorithm
• Compare the three methods for grading outlier detection
• Choose one or more methods to use for future research
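The slides name the interscore-gap approach without defining it. One plausible reading, offered here purely as an assumption, is to sort a jury's scores and treat a large gap as separating a suspect minority from the main group:

```python
def gap_outliers(scores, max_gap=10):
    """Assumed reading of 'interscore gap' detection: split the sorted
    scores at the first gap wider than max_gap (an illustrative
    threshold) and flag the smaller side as suspect."""
    s = sorted(scores)
    for i in range(len(s) - 1):
        if s[i + 1] - s[i] > max_gap:
            low, high = s[:i + 1], s[i + 1:]
            return low if len(low) <= len(high) else high
    return []

print(gap_outliers([76, 81, 48, 87, 87, 88]))  # [48]
```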
Ongoing and Future Work
• Studying the failure modes of the outlier detection algorithm
– Classify the types of false positives
– Determine the frequency of each type
– Identify additional conditions to eliminate pathological cases
– Refine the algorithm
• Trying to relate outlier detection to the nature and behavior of
design juries and jury agreement
– Are there patterns to who generates the outliers?
– Are there patterns to when outliers appear?
– Does this teach us anything about how to formulate design
juries?
Do Grading Juries and Grading
Rubrics Really Work?
How Do You Know If a Grading Jury Isn’t
Working?
• If the jurors don’t agree
– High number of flagged outliers
– High number of true outliers (removed from data set)
– Large jury standard deviations
– Low overall grades
Well-Functioning Juries in ED100
True outliers produced by grading juries (figures removed)
In contrast, a system that produces 15–20% outliers has totally
broken down
How Do You Know If a Rubric Isn’t Working?
• Qualitatively:
– As a grader, you’re confused
– As a grader, you find yourself ignoring the rubric
– As an administrator, if the grades “don’t seem right”
– As an administrator, if complaints and requests for re-grades
are unusually high
• Quantitatively:
– Grades are unusually low
– Jury standard deviations are unusually high
– Grade distributions are non-normal
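For the non-normality check, one conventional option (not prescribed in these slides) is a Shapiro-Wilk test; the sample grades and the 0.05 threshold below are illustrative:

```python
from scipy import stats

def looks_non_normal(grades, alpha=0.05):
    """Shapiro-Wilk test: a small p-value is evidence that the grade
    distribution departs from normality."""
    statistic, p_value = stats.shapiro(grades)
    return p_value < alpha

print(looks_non_normal([88, 85, 90, 79, 83, 86, 91, 84, 87, 82]))
```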
Visible in Grading Jury Standard Deviations
• A poor grading rubric will lead to poor agreement among the
members of the grading jury
• Some will try to follow the rubric as best they can even if it seems to
give “bad” grades
• Some will abandon the rubric in an attempt to give grades that seem
to be more appropriate
(Figure removed; all final deliverables normalized to 100 points)
Visible in Grade Means and Distributions
• A poor grading rubric will also lead to unusually low grades
and an unusually large spread (standard deviation) in deliverable
grades
(Figures removed)
Summary and Conclusions
Fairness in Grading
• The grading process is affected by:
– What is measured (subject, learning objectives, etc.)
– How it is measured (exam, project, etc.)
– How the results are interpreted
– Whose learning / performance is being measured
– Who is doing the measurement / interpretation
• It can be improved by:
– Using multiple means of assessment
– Assessing multiple times
– Having the work be assessed by multiple graders
– Using rubrics to clearly define the assessment criteria and
improve consistency between graders
Grading Rubrics
• Scoring tools that “explicitly represent the performance
expectations for an assignment or piece of work”
• Advantages:
– Make grading criteria explicit
– Guide student study efforts
– Allow self-assessment before assignment is due
– Reduce grading time
– Increase consistency in a single grader over time
– Increase consistency between multiple graders
– Can help identify weaknesses in a course
Grading Rubrics
• Rubrics can be verbally anchored or use rating scales
• Choice of rating scale can affect the reliability and validity of the
instrument
• We suggest:
– Organizing the rubric into groups of criteria
– Each criterion should be coarsely evaluated
– Each group of criteria should be given a point value
– The range of final point values for the assignment should be
large (__ / 100)
• If the learning cannot be measured independently from
supporting skills (e.g. paper content vs. paper writing), then the
learning and the skills should be evaluated separately using
different rubrics
Grading Juries
• Use of multiple graders can increase the objectivity and overall
fairness of the grading process
• Multiple graders introduce the need for consensus
• Consensus can be reached through discussion or by using an
adjudication system
• Adjudication in a large course requires an automated outlier
detection system
Success and Failure of Grading Rubrics and
Grading Juries
• Failure of a grading rubric or a grading jury system can be seen in:
– High number of flagged outliers
– High number of true outliers (removed from data set)
– Large jury standard deviations
– Low overall grades
– Non-normal grading distributions
– Faculty / juror dissatisfaction with grading process
– Faculty / juror dissatisfaction with final grades
– Student dissatisfaction with final grades
Thank You!