Student Assessment: What Constitutes Fairness in Grading?
Line Harder Clemmensen, DTU Compute, [email protected]
Mary Kathryn Thompson, DTU Mekanik, [email protected]

Educational Assessment
• The process of measuring and documenting educational gains in knowledge, skills, and understanding
• 3 types:
  – Initial assessment: establishes a baseline
  – Formative assessment: ongoing feedback to improve learning
  – Summative assessment: evaluation of learning

Grading
• Grading is part of summative assessment
• The grading process is affected by:
  – What is measured (subject, learning objectives, etc.)
  – How it is measured (exam, project, etc.)
  – How the results are interpreted
• And therefore also by:
  – Whose learning / performance is being measured
  – Who is doing the measurement / interpretation

What is Measured Affects Fairness in Grading
• Assessment should measure the knowledge, skills, and understanding that were developed, not the students' ability to use the measurement tool
  – Example: A test should measure a student's knowledge, not their ability to take tests
  – Example: If learning is measured via a paper, the grade should reflect the paper's content, not the student's ability to write a paper
• If there are multiple learning objectives (e.g. how to write a paper and the paper's content), each should be evaluated independently

Example: Rubrics to Evaluate Paper Content (Left) and Writing Ability (Right)
(Sample rubrics removed)

How it is Measured Affects Fairness in Grading
• Students often have different preferences for measurement instruments
  – For example: some are better at (and prefer) exams
  – For example: others are better at (and prefer) projects or papers

How the Results are Interpreted Affects Fairness in Grading
• How the results are interpreted depends on the subject being graded
  – Some learning can be measured using closed-ended questions with right/wrong answers
  – Some learning must be measured using open-ended questions or assignments
• (More) interpretation is needed for open-ended questions and assignments

How the Results are Interpreted Affects Fairness in Grading
• How the results are interpreted depends on who is doing the interpretation
• Interpretation:
  – Introduces the possibility of personal preference and other biases
  – Introduces the possibility of human error
• The need for (more) interpretation increases concerns about fairness in grading

How the Results are Interpreted Affects Fairness in Grading
• How the results are interpreted depends on how many people are doing the interpretation
• How many people are doing the interpretation:
  – Depends on the class size (larger classes may have multiple sections / graders)
  – Depends on what is being evaluated (some subjects / theses require multiple examiners)
• More graders introduce the possibility of inconsistencies (and therefore unfairness) in grading

When is Fairness in Grading the Biggest Concern?
• In large courses with multiple sections / instructors
• Where multiple graders are used or needed
• Where grading requires substantial interpretation
  – Where assessment instruments are open-ended (projects, papers, etc.)
  – Where the topic is subjective and depends on personal appreciation (art, design, etc.)
• Where there is no direct comparison between student performances
  – For example, students write papers on different topics, or work on projects that are different in nature

How Do We Improve Fairness in Grading?
• Use multiple means of assessment
• Assess multiple times
• Have the work assessed by multiple graders
• Use rubrics to clearly define the assessment criteria and improve consistency between graders
• To some extent, all of these options (except for the use of rubrics) are inherent in the Danish grading system

Grading Rubrics

Grading Rubrics
• Scoring tools that "explicitly represent the performance expectations for an assignment or piece of work"
• Each rubric divides "the assigned work into component parts and provides clear descriptions of the characteristics of the work associated with each component, at varying levels of mastery"
• Usually "reflect the weighted importance of the objectives of the assignment"
http://www.cmu.edu/teaching/designteach/teach/rubrics.html

Sample Rubric for Capstone Design
(Sample rubric removed)

Traditional Rubrics
• Each rubric is organized into blocks of criteria
• Each criterion has 3 or 4 levels of achievement
• Each level of achievement has a verbal description
• Overall performance is judged based on the evaluation of all the criteria

Advantages of Rubrics
• For students:
  – Grading criteria are explicit
  – They know how and where to focus their efforts
  – Allow self-assessment before the assignment is due
• For faculty:
  – Reduce grading time
  – Increase consistency in a single grader over time
  – Increase consistency between multiple graders
  – Can help identify weaknesses in the course

Disadvantages of Traditional Rubrics
• This is a lot of text to read and memorize…
• … especially if English is not your first language

Disadvantages of Traditional Rubrics
• Guidelines, not a grading sheet
• Nowhere to place marks or notes for each student / assignment

Sample Grading Rubric from KAIST ED100

ED100 Grading Rubrics
• Each rubric is organized into blocks of criteria
• Each criterion has a qualitative value
• Each block of criteria has a quantitative point value
• Tries to make grading fast and easy
• Tries to limit penalties for one type of mistake
• Tries to strike a balance between guiding the grader and giving them the freedom to evaluate the deliverables as they see fit

Poster Formatting and Style: ( _____ / 15 points)
  The poster was easy to read                                      0 / − / ✓ / +
  The poster was attractive                                        0 / − / ✓ / +
  The poster distributed graphic / blank space / text effectively  0 / − / ✓ / +
  The poster made effective use of visual aids                     0 / − / ✓ / +

ED100 Grading Rubrics
• Please note: the ED100 grading rubrics were developed for first-year students taking classes and presenting their work using English as a second language
• Some of the criteria in the rubrics were included to address common problems that were specific to this class and to these students
• The rubrics may contain criteria that are not appropriate for generic courses
• A total of 6 sets of rubrics were used in the course. Criteria that seem to be obviously 'missing' from the sample rubrics may be present in another rubric

How Does the Rating Scale Affect Grading Rubrics?

Rating Scales
• A rating scale is the scale used to measure (quantify) qualitative information, for example:
  – 1 2 3 4 5 (5-point Likert scale)
  – 0 2 4 6 8 10
  – 0 / − / ✓ / +
  – A B C D F
  – ___ / 10
  – Etc.
• Originally from psychological measurement
• Can be used in grading rubrics in place of verbal anchors
• It is well known that the choice of rating scale affects the measurement
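To make the block structure concrete, here is a minimal Python sketch of how one block on the 0 / − / ✓ / + scale might be converted to points. Two caveats: the check mark is our reading of a glyph lost in extraction, and the symbol-to-fraction mapping is an illustrative assumption, not the official ED100 conversion.

```python
# Minimal sketch: converting one rubric block on the 0 / - / check / + scale
# into points. The fraction assigned to each mark is assumed for illustration;
# the deck does not publish the official conversion.

MARK_VALUES = {"0": 0.0, "-": 0.5, "✓": 0.85, "+": 1.0}  # hypothetical mapping

def score_block(marks: list[str], block_points: float) -> float:
    """Average the per-criterion marks and scale to the block's point value."""
    fractions = [MARK_VALUES[m] for m in marks]
    return block_points * sum(fractions) / len(fractions)

# The "Poster Formatting and Style" block above: 4 criteria, 15 points.
print(score_block(["✓", "+", "-", "✓"], 15))  # 12.0
```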
Grading Scales Experiment
• 21 experienced graders
• Each graded the same 5 posters from KAIST ED100
• Using one of 4 grading scales

Thompson, M. K., Clemmensen, L. H., Ahn, B.-U., "Effect of Rubric Rating Scale on the Assessment of Engineering Design Projects." International Journal of Engineering Education, Vol. 29, No. 6, pp. 1490–1502, 2013.

Grading Scales A, B, C, and D (scale illustrations not shown)

Results
• All scales produce valid results, but:
  – Scale A tended to underestimate the 'true' score
  – Scale D tended to overestimate the 'true' score
  – Scale C had an unusually high standard deviation
• Scale B had the best overall performance in the experiment
• Scale B has had excellent performance in practice
• Scale B is the scale we use and recommend (see the Poster Formatting and Style block above)

Results
• More opportunities for point deductions lead to lower grades (Scale A)
• A finer rating scale reduces this tendency (Scale B vs. C)
• Letter grade rating scales should be avoided (Scale D): everyone interprets letter grade scales differently
• Internally weighted rating scales should be avoided (Scale D): it is difficult to get the weighting right
• It is important to balance rater responsibility and comfort:
  – A scale that places more responsibility on the grader increases reflection and scale validity at the cost of grader satisfaction
  – Grader discomfort can lead to intentional misuse of the grading scale
  – Intentional misuse of the grading scale leads to reduced validity and an increased number of outliers

Multiple Graders and Grading Juries

Multiple Graders and Grading Juries
• Many subjects and assignments are open-ended and subjective
  – No "right" answers
  – Many good answers, some excellent
  – Everyone has different expectations
• This can lead to problems with fairness and consistency in grading
  – Especially in large courses where there are many teams or sections
• A common solution is to employ multiple raters (i.e. a jury)
  – Helps to balance the differences of opinion that naturally occur during subjective evaluations
  – Increases the objectivity of the evaluation

Multiple Graders and Grading Juries
• Multiple graders introduce the opportunity for disagreement
• In the absence of perfect agreement, how do you determine the final score?
• Three basic modes of operation:
  – Discussion continues until consensus is reached
    • Slow and cannot be done asynchronously
    • Common in architecture
  – Jury members rate independently and the scores are averaged
    • Multiple methods available for adjudication
  – Jury members discuss and then rate independently
    • Adjudication may still be necessary

Prior Art: Adjudication
• Martial arts
  – 4 judges, 1 referee: majority rules, but the referee can overturn
  – (Requires an expert rater on each jury)
• Olympic figure skating
  – 9 raters; remove the two lowest and the two highest scores and average the rest
  – (Requires a larger number of raters)
• Student peer reviews
  – Weighted average of raters (Carlson et al. 2005)
• High-stakes assessment (Johnson et al.)
  – Average of two raters; if they disagree, bring in an expert
  – The three scores can be averaged
  – The expert can replace one score / both scores
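As a rough illustration of independent rating followed by adjudication, the sketch below implements two of the rules named on the slide: a plain jury average and a figure-skating-style trimmed mean. It is a generic sketch of those prior-art rules, not the ED100 procedure (which is described next); the function names are ours.

```python
# Sketch of two adjudication rules from the "Prior Art" slide: a plain jury
# average, and the Olympic-figure-skating rule (drop the two lowest and the
# two highest scores, then average the rest). Generic illustration only.

def jury_average(scores: list[float]) -> float:
    """Every rater counts equally."""
    return sum(scores) / len(scores)

def trimmed_mean(scores: list[float], trim: int = 2) -> float:
    """Drop the `trim` lowest and `trim` highest scores, then average.
    Needs more than 2 * trim raters (figure skating uses 9)."""
    if len(scores) <= 2 * trim:
        raise ValueError("not enough raters to trim")
    kept = sorted(scores)[trim:-trim]
    return sum(kept) / len(kept)

scores = [86, 92, 87, 82, 95, 88, 90, 84, 91]  # hypothetical 9-rater jury
print(jury_average(scores))                    # 88.33...
print(trimmed_mean(scores))                    # 88.4 (middle five scores only)
```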
Adjudication in ED100
• KAIST ED100: Introduction to Design and Communication
  – 6 × 100 deliverables per semester to grade
  – Each deliverable was evaluated by 5–6 graders
• All sets of scores were analyzed to identify statistical outliers that potentially represented invalid scores
• Outliers were hand-checked by an expert grader
• Scores deemed to be invalid were removed from the data set
• All other scores were averaged to determine the students' grades

Thompson, M. K., Clemmensen, L. H., Rosas, H., "Statistical Outlier Detection for Jury Based Grading Systems." Proceedings of the 120th ASEE Annual Conference and Exposition, June 23–26, 2013, Atlanta, GA.

Outlier Detection in ED100
• Base rule for flagging potential outliers:
  – A score belongs to the group if µ − 1.5σ < Score < µ + 1.5σ; otherwise it is flagged
• All flagged scores are hand-checked by an expert
• There are no outliers if:
  – σ < 5%                   (good jury agreement)
  – |µ_new − µ_old| < 2.25%  (outlier removal doesn't matter)
  – |σ_new − σ_old| < 2%     (jury agreement unchanged)
• Scores are hand-checked anyway if σ >> 9–15% (poor jury agreement)
• All final scores can be challenged by students, faculty, or staff members

Example Jury Results (Fall 2011 Posters)

Raw scores (juries included both professors and teaching assistants; Team 1 received five scores):

                   Grader 1  Grader 2  Grader 3  Grader 4  Grader 5  Grader 6
  Jury 1 / Team 1     86        92        87        82        95        --
  Jury 2 / Team 2     85        90        82        85        83        76
  Jury 3 / Team 3     76        81        48        87        87        88

Before outlier removal:

                      µ        σ     µ + 1.5σ  µ − 1.5σ
  Jury 1 / Team 1   88.40     5.13    96.09     80.71    No outliers (σ < 5%)
  Jury 2 / Team 2   83.50     4.59                       No outliers (σ < 5%)
  Jury 3 / Team 3   77.83    15.33   100.83     54.84    1 outlier detected (the score of 48)

After removing the outlier from Jury 3:

                      µ        σ     µ + 1.5σ  µ − 1.5σ
  Jury 3 / Team 3   83.80     5.17    91.55     76.05    1 outlier removed
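The flagging rule and the Jury 3 example above can be reproduced in a few lines of Python. This is a minimal sketch, assuming scores on a 0–100 scale (so the percentage thresholds become point thresholds) and the sample standard deviation, which matches the table values; the structure and names are ours, and the published algorithm is in the ASEE paper cited above.

```python
# Sketch of the ED100 base rule and "no outliers" checks, assuming 0-100
# scores and the sample standard deviation (reproduces the Jury 3 numbers).
from statistics import mean, stdev

def flag_outliers(scores: list[float]) -> list[float]:
    """Scores outside the band mu - 1.5*sigma < s < mu + 1.5*sigma."""
    mu, sigma = mean(scores), stdev(scores)
    return [s for s in scores if not (mu - 1.5 * sigma < s < mu + 1.5 * sigma)]

def no_outliers(scores: list[float]) -> bool:
    """The slide's three conditions, checked after removing flagged scores."""
    flagged = set(flag_outliers(scores))
    if not flagged:
        return True
    kept = [s for s in scores if s not in flagged]
    mu_old, sigma_old = mean(scores), stdev(scores)
    mu_new, sigma_new = mean(kept), stdev(kept)
    return (sigma_old < 5                        # good jury agreement
            and abs(mu_new - mu_old) < 2.25      # removal doesn't matter
            and abs(sigma_new - sigma_old) < 2)  # agreement unchanged

team3 = [76, 81, 48, 87, 87, 88]
print(flag_outliers(team3))  # [48]: below mu - 1.5*sigma = 54.84
print(no_outliers(team3))    # False: sigma = 15.33, so the 48 is hand-checked
```

In ED100, a flagged score like Team 3's 48 would then be hand-checked by the expert grader before being removed.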
Outlier Detection in ED100
• Generally successful and conservative:
  – Overall detection rate: 4.8%
  – Overall removal rate: 2.7%
  – False positive rate: 44.3%
  – False negative rate: < 0.1%

Ongoing and Future Work
• Investigating 3 approaches to outlier detection:
  – Mean / standard deviation for normal distributions
  – Interscore gap
  – Score density
• Using the data from the Fall 2010 and 2011 FDC grading data sets, we are trying to:
  – Formalize each approach (develop an algorithm)
  – Characterize each approach
  – Optimize each algorithm
• Compare the three methods for grading outlier detection
• Choose one or more methods to use for future research

Ongoing and Future Work
• Studying the failure modes of the outlier detection algorithm
  – Classify the types of false positives
  – Determine the frequency of each type
  – Identify additional conditions to eliminate pathological cases
  – Refine the algorithm
• Trying to relate outlier detection to the nature and behavior of design juries and jury agreement
  – Are there patterns to who generates the outliers?
  – Are there patterns to when outliers appear?
  – Does this teach us anything about how to formulate design juries?

Do Grading Juries and Grading Rubrics Really Work?

How Do You Know If a Grading Jury Isn't Working?
• The jurors don't agree:
  – High number of flagged outliers
  – High number of true outliers (removed from the data set)
  – Large jury standard deviations
  – Low overall grades

Well Functioning Juries in ED100
(Plots of true outliers produced by grading juries removed)
• In contrast, a system that produces 15–20% outliers has totally broken down

How Do You Know If a Rubric Isn't Working?
• Qualitatively:
  – As a grader, you're confused
  – As a grader, you find yourself ignoring the rubric
  – As an administrator, the grades "don't seem right"
  – As an administrator, complaints and requests for re-grades are unusually high
• Quantitatively:
  – Grades are unusually low
  – Jury standard deviations are unusually high
  – Grade distributions are non-normal

Visible in Grading Jury Standard Deviations
• A poor grading rubric will lead to poor agreement among the members of the grading jury
• Some graders will try to follow the rubric as best they can, even if it seems to give "bad" grades
• Some graders will abandon the rubric in an attempt to give grades that seem more appropriate
(Plot removed; all final deliverables normalized to 100 points)

Visible in Grade Means and Distributions
• A poor grading rubric will also lead to unusually low grades and an unusually large spread (standard deviation) in deliverable grades
(Plots removed)
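The quantitative symptoms above can be monitored automatically. Below is a minimal sketch of such a health check over a table of jury scores (one list per deliverable). The 15–20% breakdown threshold is from the slide; the choice of a Shapiro-Wilk test for non-normality is our assumption, since the deck only states that grade distributions should be normal.

```python
# Sketch of automated jury / rubric health checks. Assumes 0-100 scores;
# the Shapiro-Wilk normality test is an illustrative choice, not from the deck.
from statistics import mean, stdev
from scipy.stats import shapiro

def flagged(row: list[float]) -> int:
    """Count scores at least 1.5 standard deviations from the jury mean."""
    mu, sigma = mean(row), stdev(row)
    return sum(abs(s - mu) >= 1.5 * sigma for s in row)

def jury_health(score_table: list[list[float]]) -> None:
    sigmas = [stdev(row) for row in score_table]
    rate = sum(flagged(r) for r in score_table) / sum(len(r) for r in score_table)
    grades = [mean(row) for row in score_table]
    print(f"mean jury sigma:   {mean(sigmas):5.2f}")  # unusually high => trouble
    print(f"flagged rate:      {rate:6.1%}", "(broken down)" if rate > 0.15 else "(ok)")
    print(f"normality p-value: {shapiro(grades).pvalue:5.3f}")  # low => non-normal

# The Fall 2011 example juries from above:
jury_health([[86, 92, 87, 82, 95],
             [85, 90, 82, 85, 83, 76],
             [76, 81, 48, 87, 87, 88]])
```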
Summary and Conclusions

Fairness in Grading
• The grading process is affected by:
  – What is measured (subject, learning objectives, etc.)
  – How it is measured (exam, project, etc.)
  – How the results are interpreted
  – Whose learning / performance is being measured
  – Who is doing the measurement / interpretation
• It can be improved by:
  – Using multiple means of assessment
  – Assessing multiple times
  – Having the work assessed by multiple graders
  – Using rubrics to clearly define the assessment criteria and improve consistency between graders

Grading Rubrics
• Scoring tools that "explicitly represent the performance expectations for an assignment or piece of work"
• Advantages:
  – Make grading criteria explicit
  – Guide student study efforts
  – Allow self-assessment before the assignment is due
  – Reduce grading time
  – Increase consistency in a single grader over time
  – Increase consistency between multiple graders
  – Can help identify weaknesses in a course

Grading Rubrics
• Rubrics can be verbally anchored or use rating scales
• The choice of rating scale can affect the reliability and validity of the instrument
• We suggest:
  – Organizing the rubric into groups of criteria
  – Each criterion should be coarsely evaluated
  – Each group of criteria should be given a point value
  – The range of final point values for the assignment should be large (__ / 100)
• If the learning cannot be measured independently from supporting skills (e.g. paper content vs. paper writing), then the learning and the skills should be evaluated separately using different rubrics

Grading Juries
• The use of multiple graders can increase the objectivity and overall fairness of the grading process
• Multiple graders introduce the need for consensus
• Consensus can be reached through discussion or by using an adjudication system
• Adjudication in a large course requires an automated outlier detection system

Success and Failure of Grading Rubrics and Grading Juries
• Failure of a grading rubric or a grading jury system can be seen in:
  – High number of flagged outliers
  – High number of true outliers (removed from the data set)
  – Large jury standard deviations
  – Low overall grades
  – Non-normal grading distributions
  – Faculty / juror dissatisfaction with the grading process
  – Faculty / juror dissatisfaction with final grades
  – Student dissatisfaction with final grades

Thank You!