The Rasch wars: The emergence of Rasch measurement in language testing

Tim McNamara and Ute Knoch, The University of Melbourne, Australia

Language Testing 29(4) 555–576. © The Author(s) 2012. Reprints and permission: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/0265532211430367. ltj.sagepub.com

Abstract

This paper examines the uptake of Rasch measurement in language testing through a consideration of research published in language testing research journals in the period 1984 to 2009. Following the publication of the first papers on this topic, exploring the potential of the simple Rasch model for the analysis of dichotomous language test data, a debate ensued as to the assumptions of the theory, and the place of the model both within Item Response Theory (IRT) more generally and as appropriate for the analysis of language test data in particular. It seemed for some time that the reservations expressed about the use of the Rasch model might prevail. Gradually, however, the relevance of the analyses made possible by multi-faceted Rasch measurement to address validity issues within performance-based communicative language assessments overcame language testing researchers’ initial resistance. The paper outlines three periods in the uptake of Rasch measurement in the field, and discusses the research which characterized each period.

Keywords

FACETS, history of language testing, item response theory, multi-faceted Rasch measurement, Rasch measurement

Corresponding author: Ute Knoch, School of Languages and Linguistics, Babel Level 5, The University of Melbourne, 3010, Australia. Email: [email protected]

The paper examines the history of the take-up of Rasch measurement within second and foreign language testing, in research on test development, delivery and validation. In the first part of the paper, two characteristics of the field of language testing research which influenced the reception of Rasch measurement are outlined: the role of psychometrics within differing regional research traditions, and the professional background and training of those working within language testing. In the second part of the paper, the results are presented of a survey of studies involving Rasch measurement published in dedicated language testing journals (Language Testing from 1984, Language Assessment Quarterly from 2004, Melbourne Papers in Language Testing from 1992, and Assessing Writing from 1994). The research is summarized over three periods (the 1980s, the 1990s and the 2000s) and tracks the change from the initial claims for and resistance to Rasch measurement, particularly the simple Rasch model – ‘the Rasch wars’ – to its ultimate wide acceptance by 2000.

Regional differentiation and professional training in language testing research

In order to understand the context into which Rasch measurement was received within language testing research, it is necessary to recognize certain distinctive features of the field. The first is that, despite the shared concerns of language testing researchers worldwide (i.e. the development and validation of language tests), regional traditions of language testing differ in significant ways, reflecting to some extent the differing cultural values of their societies. For example, the British and American traditions of language testing are acknowledged to draw on rather different theoretical and practical traditions (Alderson, 1987; Alderson, Clapham, & Wall, 1995).
The British tradition of language examinations goes back to the beginning of the 20th century and earlier and was relatively slow to feel the full influence of psychometric theory as it developed, particularly in the United States, and instead placed greater emphasis on the importance of test content and its relationship to teaching. Alderson (1987, p. 3) in a commentary on the work of UK testing and examination boards noted ‘the lack of emphasis by exam boards on the need for empirical rather than judgemental validation’ which meant that ‘examination boards do not see the need to pretest and validate their instruments, nor conduct post hoc analyses of their tests’ performance’. On the other hand he emphasized that

many of [the] tests are highly innovative in content and format … other tests could benefit greatly … by attention to the content validation procedures they use. It is also true to say that many tests would benefit from greater attention to the relationship between testing and teaching, for which the British exam boards are particularly noted. (Alderson, 1987, p. 4)

The British tradition was quick to respond to developments in language teaching, particularly the rapid spread of communicative language teaching and the emergence of the specialist fields of English for academic and other specific purposes in the 1970s and 1980s (Davies, 2008). The clearest example is perhaps the appearance of the ELTS test of English for Academic Purposes in the 1980s, succeeded by IELTS from 1989, which reflected in varying degrees the communicative demands of the academic setting facing international students wishing to study at English-medium universities in Britain, Australia, Canada and New Zealand, which the test targeted. It is interesting in this regard to note the strongly critical comments by the American Grant Henning on the psychometric shortcomings of the analysis of trial data in the ELTS Validation Project (Henning, 1988a). IELTS still arguably prioritizes communicative relevance and impact on teaching and learning over psychometric rigour, for example in its commitment to a face-to-face test of speaking where a single rating determines the candidate’s score, with the inevitable compromise on reliability that results. This is preferred over potentially more reliable tests of spoken language, for example tests administered in non-interactive settings which prioritize technological sophistication in delivery or analysis of performance, but which fail to fully reflect the construct of speaking which necessarily involves face-to-face interaction.

The United States tradition by comparison has tended to emphasize psychometric considerations more strongly. The overriding concern for demonstrating satisfactory psychometric properties in tests meant that for many years language tests there proved less responsive to changes in language teaching, particularly the communicative movement, and the accompanying demand for language tests to reflect this. For example, it proved very difficult to change psychometrically sound tests such as the traditional paper and pencil version of the Test of English as a Foreign Language (TOEFL), or the Test of English for International Communication (TOEIC), long after critiques of the constructs underlying such tests had emerged and there was a demand for more communicative tests from teachers and from receiving institutions alike.
Now, the two traditions have come together, so that IELTS and the new TOEFL iBT are both communicatively oriented tests: TOEFL iBT is more closely targeted to the academic setting, and is arguably more reliable, but its speaking assessment lacks the face-to-face interaction that is the distinctive feature of the IELTS interview. Over 20 years ago, Alderson (1987, p. 4) suggested that ‘some combination of British judgemental validation and American empirical validation seems required’, and to a large extent this now seems to have occurred.

In Australia, research in language testing first seriously emerged in the context of the Adult Migrant Education Program (AMEP), a national program for teaching English as a second language to adult immigrants, which accompanied the introduction of large scale immigration to Australia following World War II. Language assessment has been influenced by both the British and American traditions. Australia played a major collaborative role with British researchers in the development of IELTS, and workplace tests such as the Occupational English Test (McNamara, 1996) drew directly on the British tradition of ESP testing. On the other hand, American influence was felt in the area of oral proficiency testing, where the instrument used for 20 years from the late 1970s for assessing migrants in the Adult Migrant English Program, the Australian Second Language Proficiency Ratings (ASLPR: Ingram & Wylie, 1979), was derived from the American Foreign Service Institute oral proficiency interview and scale (Clark & Clifford, 1988). In fact, however, language assessment in Australia was ultimately to cut a more independent path, particularly when it encountered Australian work on Rasch measurement, as we shall see.

The second important contextual feature to consider is the typical professional background and research training of language testing researchers. Language testing is a hybrid field, with roots in applied linguistics and in measurement. Researchers (at least in the English-speaking world) frequently enter work in language testing following initial training and careers in language teaching rather than in statistics or psychometrics. Their introduction to language testing is in specialist courses within graduate study in applied linguistics, or through practical exposure in their professional teaching careers, and they are likely to lack a strong background in mathematics or statistics in their prior education, typically being languages, linguistics or humanities and social science majors in their undergraduate degrees. They may even have consciously avoided numbers and feel uncomfortable with them.

Graduate training in language testing research shows considerable variability across national contexts, reflecting what we have seen of the differing regional research traditions. Thus, in very general terms, those applied linguistics graduate students entering language testing in the best American centres of graduate training will then be initiated into the American tradition of language testing research and exposed to extensive training in psychometrics. Arguably the most significant American graduate program in language testing in the last 25 years, that at UCLA under Lyle Bachman, has established a tradition of rigorous psychometric training, and its graduates are maintaining that tradition in their own education of the next generation of language testing researchers.
The requirement within American doctoral degrees to undertake extensive coursework, not matched until very recently in British and Australian universities, also gives scope for psychometric training outside applied linguistics programs which is less readily available to those studying in the UK or Australia. The British and Australian tradition of training has tended to emphasize psychometric training less, though more now is incorporated, and has continued to highlight strongly the need to engage with the relationship between test content and task design and theories of language learning and teaching. However, the practice at the annual research conference of the International Language Testing Association (ILTA) of holding pre-conference workshops which emphasize psychometric training now provides opportunities for people from all research traditions to develop their knowledge and skills in this area.

Overall, the fact that the professional background of many language testers lies outside measurement has advantages and disadvantages. The advantage is that issues of construct interpretation tend to be at the fore, and language testers readily engage in debate over substantive matters of this kind. The disadvantage is that where language testing researchers come to psychometrics late, a lack of depth of training in this area can be a handicap.

The contextual background of the culture of language testing, both in its differing broad national traditions, and the professional background and training of language testing researchers, constitutes the setting in which the influence of Rasch measurement began to be felt from the early 1980s onwards. We now consider the history of the uptake of Rasch models in the almost three decades since the publication of the first papers on the topic in the field of language testing.

Enter Rasch: The 1970s and 1980s

Awareness of Rasch measurement in language testing occurred as a rather belated response to the growing interest in what was known as latent trait theory1 more generally throughout the 1960s and 1970s in educational measurement.2 Ben Wright, the outstanding American advocate of Rasch methods, started corresponding with Georg Rasch, invited him to Chicago and visited him in Denmark in the 1960s. Wright commenced annual courses on the theory and practice of Rasch measurement for Education and Psychology students in 1964, and Rasch spoke at the Educational Testing Service Invitational Test Conference in October 1967. The awareness among those involved more exclusively in language testing occurred first in those centres in which sophisticated psychometric expertise was available: for example the Central Institute for Test Development (CITO) in the Netherlands, Educational Testing Service (ETS) in the United States, the National Foundation for Educational Research (NFER) in the United Kingdom and the Australian Council for Educational Research (ACER) in Australia. It was not long before this began to be felt in language testing. The context of reception and the initial reaction differed in different countries. We will consider here in turn the Netherlands, the United States, the United Kingdom, and Australia.
In the early 1980s in the Netherlands psychometricians at the educational research centre CITO became interested in Rasch, and the proceedings of a meeting at CITO in 1982 (van Weeren, 1983) contained four papers on the application of the Rasch model in language testing, mostly in the testing of foreign languages in school examinations. One of these was by a test constructor at CITO, John De Jong, who assessed the validity of a test using the Rasch model (De Jong, 1983). De Jong (personal communication, 26 July, 2008) writes of that time:

In 1982 I was not yet aware of the NFER and ACER work, but after joining the ‘Rasch club’, a kind of inter-university discussion group, meeting 3 or 4 times a year, I got into contact with more people and in 1985 I went to a Rasch meeting organized by Ben Wright in conjunction with the AERA3 in Chicago. There I met Geoff Masters and David Andrich and also was staying in the same student-type accommodation as a number of Dutch professors in psychometrics (Wim van der Linden, Ivo Moolenaar, Don Mellenbergh, Eddie Roskam), who expressed that my work could easily be the basis of a PhD.

De Jong subsequently completed his PhD on applying Rasch measurement to issues of test development, test equating, international standards, and educational reform, using examples from a variety of language tests representing all four skills (De Jong, 1991).

Given the strong psychometric tradition in the United States, and the central role of the program at Chicago under Ben Wright in promulgating Rasch’s ideas in the educational field (Wright & Andrich, 1987), it was inevitable that Rasch measurement would soon come to the attention of language testing researchers. An influential early figure was Grant Henning, who had attended a workshop on Rasch with Ben Wright, and became an advocate of Rasch. Teaching at the American University in Cairo in the early 1980s he inspired Kyle Perkins, there on a sabbatical year from Southern Illinois University Carbondale, to become familiar with Rasch measurement. Others who were with Henning in Cairo were Thom Hudson, who became his student at UCLA, and Dorry Kenyon, a student of Henning’s in Cairo, who was an early adopter of Rasch when he later began working with Charles Stansfield at the Center for Applied Linguistics (CAL) in Washington, DC. Subsequently, at UCLA, Henning taught Rasch measurement and worked on language test data analysis with his own graduate students, particularly Fred Davidson, Brian Lynch, Thom Hudson, Jean Turner and somewhat later Antony Kunnan. Henning’s influence is reflected in the publication of a number of papers by himself and his students in the journal Language Testing, established in 1984. These papers4 set out to introduce the main features of the Basic Rasch model and its earliest extensions to a language testing readership, using data sets to demonstrate its potential in exploring test quality.

An important step in the dissemination of knowledge about Item Response Theory (IRT) and Rasch in particular among language testers occurred in 1985. Charles Stansfield, who was exposed to IRT while working at ETS from 1981 to 1986, chaired the 1985 Language Testing Research Colloquium (LTRC), held at ETS, the theme of which was ‘Technology and Language Testing’, and organized a two-day pre-conference seminar on IRT.
In Stansfield’s view, the conference and workshop were significant in the following way:

[The] workshop gave formal training to everyone and the following year and for years afterward, there were many papers involving IRT at LTRC. I wouldn’t say the conference introduced IRT to language testers, although for most it did. However, I can say that the conference rapidly moved language testers to this new and promising approach to item analysis, building a scale, calibrating items, etc. After the conference, it was shared knowledge. (Charles Stansfield, personal communication, 22 May 2011)

Six out of the 10 papers from the 1985 LTRC published in the proceedings under the title Technology and Language Testing (Stansfield, 1986) dealt with IRT; the authors included Henning, De Jong, and Harold Madsen and Jerry Larson from Brigham Young University, who had used the Basic Rasch model to develop a computer adaptive placement test. Stansfield continued to play a role in promoting Rasch following his appointment as Director of the Center for Applied Linguistics in 1986. He writes:

I knew I would be working with small data samples for the tests we developed, so Rasch seemed to be what was needed. Ben Wright had created a Rasch SIG within AERA, so I started attending their meetings. They also had a preconference meeting which was a series of papers on the subject. He usually had comments about each paper, and each comment was positive … I was one of about 6 language testers who went to Ben Wright’s office at the University of Chicago after an LTRC,5 for a full day of lecture and question-answering by Ben Wright. He was a most impressive man and an excellent communicator. (Charles Stansfield, personal communication, 22 May 2011)

In the United Kingdom, Rasch measurement entered via work on the testing of reading in the mother tongue in schools, and then spread to second and foreign language testing. Rasch measurement had influenced the work of the National Foundation for Educational Research (NFER), which focused on school educational contexts, and there it had been used in the development of reading tests in English as a mother tongue, work which in turn encountered critique on the grounds of the assumptions of the Basic Rasch model (Goldstein, 1979; see more below). In 1985, the British Council in London organized introductory seminars on Item Response Theory for language testing specialists.6 Alastair Pollitt, a psychometrician based in Edinburgh, who had an interest in school-based L1 writing, began using the Rating Scale and Partial Credit models to explore school students’ writing in English as a mother tongue (Pollitt & Hutchinson, 1987; Pollitt, Hutchinson, Entwistle, & DeLuca, 1985). Within second language testing, Rosemary Baker wrote a PhD about Rasch at Edinburgh (Baker, 1987; subsequently published as Baker, 1997), but soon moved out of the field proper. Neil Jones had become aware of the potential of the Basic Rasch model when he happened upon the discussion in Henning (1987), developing his own program for the analysis of language test data, which he demonstrated at a series of workshops (Jones, 1991). One of these workshops was attended by Brian North, who some years later used Rasch measurement in the calibration of descriptors of what became the Common European Framework of Reference (North, 1993, 1995). Jones subsequently wrote a PhD at Edinburgh on the use of Rasch in item banking, supervised by Pollitt (Jones, 1992).
There was also some interest in Rasch measurement at Lancaster, a strong centre for language testing research, particularly in the development of computer adaptive tests (Alderson, 1986). In Australia, in the broader field of educational measurement, there was a uniquely strong connection with Rasch measurement. A succession of Australians studied with Ben Wright in Chicago and contributed significantly to the development of Rasch modelling itself – initially David Andrich who developed the Rating Scale model (Andrich, 1978) and Geoff Masters who developed the Partial Credit model (Masters, 1982), and later Ray Adams, who with Khoo Siek-Toon developed the Quest program (Adams & Siek Toon, 1993), Mark Wilson who with Ray Adams and Margaret Wu developed Conquest (Wu, Adams, & Wilson, 1998), and others, particularly Patrick Griffin, who, while he was working in Hong Kong, began working on an ESL test (Griffin, 1985). It would not be long before language testers in Australia would encounter this remarkable intellectual resource, though, given their lack of opportunity for training in psychometrics, it would happen by chance, as a personal account of the exposure to Rasch of one Australian researcher (McNamara) will demonstrate. The happenstance of his entry into language testing has parallels in the careers of many other researchers in language testing, given the context of their background and training outlined above. McNamara worked for 13 years teaching English as a foreign language to adults in private language schools in London and in the Australian equivalent of community colleges in Melbourne. He had had some very introductory training in quantitative methods (basic univariate statistics) during his MA in Applied Linguistics in London, and took one short course on language testing as part of that degree. On his return to Australia, opportunities came up to do consultancies in language testing, the principal one being to develop a workplace-related performance assessment of the ability of immigrant health professionals to carry out the communicative tasks of the workplace. (He had been involved in setting up and teaching ESP courses for such health professionals.) The resulting Occupational English Test (OET) (McNamara, 1996) was strongly influenced by British work on the testing of English for academic purposes, and was within that tradition of communicative language testing: it emphasized real-world contexts for communication, and the productive skills of speaking and writing within profession-specific contexts, as well as shared assessments of reading and listening. McNamara’s work on the OET subsequently formed the basis for a PhD (McNamara, 1990a). It was in the context of carrying out this study that the purely chance encounter with the Rasch tradition happened. At Melbourne, where he had a temporary position in applied linguistics, the head of the Horwood Language Centre, Dr Terry Quinn, an applied linguist with an interest in the policy context of assessment but also with little background in psychometrics, helped him to find a co-supervisor for his thesis. McNamara remembers Quinn taking out a list of academic staff at the University and running down the list, looking for someone whose interests lay in assessment. ‘Ah, here’s one’ he said. ‘Masters – I don’t know him but he’s in the Faculty of Education and it says he’s interested in assessment. 
Go and see him and see if he will be a co-supervisor with me.’ McNamara went to see Geoff Masters, who in fact resigned from the University within six months to take up another position, but it was long enough for McNamara to be introduced to the Partial Credit Model, which he then used in his thesis in the analysis of data from the performance-based tests of speaking and writing in the OET. Many of the earlier papers on Rasch in language testing had been on objectively scored data; the combination of the new Rasch measurement and the communicative tradition in language testing, while not original, was still relatively novel (papers by Davidson and Henning (1985) and Henning and Davidson (1987)7 had used Rasch in the analysis of data from self-assessments and from a writing test; see also Davidson, 1991).

In summary, then, when Rasch measurement first began to be known within language testing research, it entered regional contexts which differed significantly in their language testing practice and research cultures, and which had differing attitudes to training in psychometrics. This had implications for the initial and subsequent reception of Rasch measurement in the field, and was a significant factor in ‘the Rasch wars’, as we shall now see.

The Rasch wars: Controversies over the assumptions of Rasch measurement in the 1980s

The advent of Rasch within language testing was not viewed in a positive light by everyone. The Rasch wars, already well under way within educational measurement more generally, had begun. These wars were fought on a number of fronts: the use of Rasch measurement in language testing was exposed to different kinds of attack, some on psychometric grounds, the others on applied linguistic ones.

From the psychometric side, the question of the desirability of the use of Rasch measurement in language testing reflected debates and disputes in the broader measurement field, principally about the analysis of dichotomous items using the Basic Rasch model. There were in fact two kinds of dispute here: first, what were the advantages of latent trait theory, including the simple Rasch model and other IRT models for dichotomous items, over classical test theory? Second, what were the relative merits of the Rasch model versus the other IRT models? In those settings, such as the UK and Australia, where the Rasch model was the vehicle for debates about the advantages of latent trait models in general, and the particularities of Rasch measurement were at issue, these two areas of dispute became fused. For example, while the first item response models were developed in the work of Lawley (1943) and Thomson (see Bartholomew, Deary, & Lawn, 2009) in work on intelligence testing in the United Kingdom in the 1940s and 1950s, most of the subsequent interest in IRT was in the United States, and it was forgotten in Britain; when latent trait theory returned to Britain it was only in the form of the Basic Rasch model, which then bore the brunt of the attack. Within latent trait modelling itself, there were fierce arguments between advocates of the Basic Rasch model and those who preferred 2- and 3-parameter IRT models (Hambleton, Swaminathan, & Rogers, 1991). For example, Stansfield (personal communication, 22 May 2011) reports that at ETS ‘the one parameter [i.e. Basic Rasch] model was viewed as simplistic and inadequate to display the properties of an item’.
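To make the terms of this dispute concrete, the competing models for a dichotomous item can be written as follows (standard IRT notation; a sketch for orientation rather than a formulation taken from any of the papers discussed here). For person n with ability θ_n attempting item i with difficulty b_i:

\[ P(X_{ni}=1) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)} \qquad \text{(Basic Rasch, or one-parameter, model)} \]

\[ P(X_{ni}=1) = \frac{\exp\{a_i(\theta_n - b_i)\}}{1 + \exp\{a_i(\theta_n - b_i)\}} \qquad \text{(two-parameter model, adding an item discrimination } a_i\text{)} \]

\[ P(X_{ni}=1) = c_i + (1 - c_i)\,\frac{\exp\{a_i(\theta_n - b_i)\}}{1 + \exp\{a_i(\theta_n - b_i)\}} \qquad \text{(three-parameter model, adding a lower asymptote } c_i\text{)} \]

In the Basic Rasch model all items are assumed to discriminate equally (in effect, a_i is fixed at 1 for every item); this is precisely the assumption at issue in the attacks described next.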
One target of attack from proponents of more complex IRT models was the Rasch assumption of equal discrimination of items, which was easy to disprove as a matter of fact for any particular data set. The nature and advantage of such an assumption – a deliberate simplification – was too little appreciated, although other fields of applied linguistics have readily grasped this point. For example, Widdowson (1996) has used the example of the simplifying aspects of the model of the London Underground present in London Transport maps; it would be misleading, for example, for anyone looking at the map to assume that the distances between stations on the map were true to the actual relative distances between stations. The map uses a deliberate simplification; in Rasch, the deliberate simplification of the assumption of equal discrimination of items permits exploitation of the property of specific objectivity in Rasch models, which means that the relationship of ability and difficulty remains the same for any part of the ability or difficulty continuum (see McNamara, 1996 for an explanation of this point). This property made it easy for items to be banked, essential for applications in computer adaptive testing, and for tests to be vertically equated, allowing mapping of growth over time using linked tests. Supporters of the Rasch model argued that the fact that Rasch has tests of the failure of its own assumption – that is, the capacity to flag when item discriminations for individual items depart sufficiently far from the assumed discrimination to jeopardize the measurement process – means that this assumption is not used recklessly.
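The point about specific objectivity can be illustrated with a minimal numerical sketch (our own illustration, with invented difficulty and discrimination values, not data from any of the studies discussed): under the Rasch model the log-odds difference between any two items is the same at every ability level, whereas under a two-parameter model with unequal discriminations it is not.

def log_odds(theta, b, a=1.0):
    """Log-odds of success for ability theta on an item of difficulty b
    with discrimination a; a = 1.0 gives the Basic Rasch model."""
    return a * (theta - b)

b1, b2 = -0.5, 1.0  # two hypothetical item difficulties (in logits)
for theta in (-2.0, 0.0, 2.0):  # three ability levels
    rasch_gap = log_odds(theta, b1) - log_odds(theta, b2)
    twopl_gap = log_odds(theta, b1, a=0.6) - log_odds(theta, b2, a=1.8)
    print(f"theta={theta:+.1f}  Rasch item gap={rasch_gap:.2f}  2PL item gap={twopl_gap:.2f}")

# Rasch: the gap is b2 - b1 = 1.50 at every theta, so the comparison of the two
# items (and, symmetrically, of any two persons) does not depend on who took
# them - the property exploited in item banking and vertical equating.
# 2PL: the gap is 4.50, 2.10 and -0.30 at the three ability levels, so the
# relative standing of the items changes with ability.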
From applied linguists, the assumption of unidimensionality in Rasch was seen as having deeply problematic implications for test constructs. (As this assumption was shared at that time by all the then-current IRT models, and indeed by Classical Test Theory, it was less of an issue for psychometricians.) In an early example of this sort of critique, Michael Canale argued:

Perhaps the main weakness of this version of CAT8 is that the construct to be measured must, according to item response theory, be unidimensional – i.e. largely involve only one factor. Not only is it difficult to maintain that reading comprehension is a unidimensional construct (for example, to ignore the influence of world knowledge), but it is also difficult to understand how CAT could serve useful diagnostic and achievement purposes if reading comprehension is assumed to be unidimensional and, hence, neither decomposable into meaningful subparts nor influenced by instruction. (Canale, 1986, p. 30)

Similar objections were vigorously voiced elsewhere. The assumption that a single underlying measurement dimension is reflected in the data was deemed to make Rasch modelling an inappropriate tool for the analysis of language test data (Buck, 1994; Hamp-Lyons, 1989), given the obvious complexity of the construct of language proficiency. Skehan (1989, p. 4) expressed further reservations about the appropriateness of Rasch analysis in the context of ESP testing, given what we know about ‘the dimensions of proficiency or enabling skills in ESP’. McNamara (1990a) argued strongly against this view, using data from the OET trials. He was able to show that Rasch analyses could be used to confirm the unidimensionality of a listening test which might at first sight be thought to be testing different dimensions of listening (McNamara, 1991) and demonstrated that Rasch could be used as a powerful means of examining underlying construct issues in communicative language tests, particularly the role of judgements of grammatical accuracy in the assessment of speaking and writing (McNamara, 1990b). Henning (1992) also defended Rasch against what he argued were misunderstandings of the notion of unidimensionality in Rasch measurement.

The controversies that Rasch measurement could trigger were illustrated in the Australian context9 by the development and reception of the Interview Test of English as a Second Language (ITESL) (Adams, Griffin, & Martin, 1987; Griffin, Adams, Martin, & Tomlinson, 1988). This test was being developed for the purpose of placing elementary and intermediate level students in general language courses for adult immigrants to Australia within the Adult Migrant English Program (AMEP). The teachers in the program did not like it. The grounds for their rejection of the test were complex. First, they shared the general suspicion of teachers in that program of formal testing instruments. Secondly, instead of the communicative focus of the curriculum of the AMEP, ITESL focused on accuracy in the production of individual grammatical structures in the context of very controlled, short spoken tasks. As we have seen, assessment in the program to that time involved the interview-based Australian Second Language Proficiency Ratings (ASLPR) (Ingram & Wylie, 1979), underpinned by a holistic, integrated, performance-based view of proficiency. ITESL was a radical departure from existing practice and caused a storm of controversy among the teachers who were meant to use it, as it went against their ideas of language and language learning. Thirdly, the test had been developed not by teachers, but by psychometricians working in the measurement division of the Education Ministry in Melbourne, Patrick Griffin and Ray Adams, using the Basic Rasch model with its (for the uninitiated) daunting conceptual and statistical complexity. Trialling and subsequent Rasch analysis indicated that the test items formed a continuum of difficulty. In order to understand better the construct of the test, Griffin and Adams had consulted with some ESL teachers in the AMEP whose views on language teaching were felt by others to be rather conservative, emphasizing mastery of grammatical forms. Griffin et al. (1988) went so far as to argue that their research confirmed the existence of a hypothesized ‘developmental dimension of grammatical competence ... in English S[econd] L[anguage] A[cquisition]’ (p. 12). This was a red rag to a bull for researchers working in second language acquisition in Australia, especially those whose work was principally in support of the AMEP, for example Geoff Brindley in Sydney and David Nunan, the senior figure in the research centre which supported the AMEP in Adelaide. Nunan (1987) published a strong attack on the test, arguing that the psychometric analysis underlying the development of the test was so powerful that it obscured substantive construct issues. This objection fell on fertile ground among the teachers, who were suspicious of the psychometrics in the test, and Rasch became something of a bogey word among them.
Summary of exploration of Rasch in second language testing in the 1980s

The situation at the beginning of the 1990s then was as follows. There were several places in the world where second language testing had begun to engage with Rasch measurement:

1. In the Netherlands, John de Jong and others at CITO had begun an exploration of the use of Rasch modelling in data from performance on foreign language tests (De Jong, 1983, 1991; De Jong & Glas, 1987).

2. In the United States, the work of Grant Henning and those he had influenced focused on exploring the potential of the Basic Rasch model to illuminate dichotomously scored data from the English as a Second Language Placement Examination (ESLPE), a test for students entering UCLA (Chen & Henning, 1985; Henning, 1984, 1988b; Henning, Hudson, & Turner, 1985; Lynch, Davidson, & Henning, 1988); the Partial Credit model had also been used with writing (Henning & Davidson, 1987) and self-assessment data (Davidson & Henning, 1985). Familiarity among language testers with IRT in general, and with Rasch in particular, was helped by the exposure to it at the 1985 and 1988 LTRCs and associated workshops. The use of Rasch was encountering criticism on both psychometric and applied linguistic grounds.

3. In the United Kingdom, the Rasch model had been used in the development and validation of L1 tests of reading and writing, although this was the subject of intense criticism (Goldstein, 1979). Interest in the applications of Rasch measurement to second language testing was slowly growing, particularly in Edinburgh, but this too was subject to sharp critique, for example in a book written by Robert Wood, a colleague of Goldstein, which reviewed psychometric issues in second language testing on behalf of those responsible for the Cambridge language tests (Wood, 1991).

4. In Australia, Griffin et al. (1988) had developed a controversial test of oral proficiency in ESL using the Partial Credit model, and McNamara (1990a, 1990b) had used the Partial Credit model in the development of a specific purpose communicative test of ESL for health professionals.

Language testing research papers involving Rasch, 1984–1989

The discussion of the context of the 1980s in which Rasch measurement first appeared in the language testing literature helps us to understand the picture of published research emerging from the 15 or so papers on language testing using Rasch measurement published in the journal Language Testing10 in the period 1984–1989. They were mostly introductions and explorations of the potential of this new tool, contrasting it with classical measurement theory, and defending the model against theoretical objections. Most studies used the Basic Rasch model, and hence were limited to dichotomously scored data, which meant that tests of grammar and vocabulary, and to a lesser extent of listening and reading, predominated, though there were some studies of judgements of writing (Pollitt & Hutchinson, 1987) and of self-assessment using rating scale data (Davidson & Henning, 1985). There was little exploration as yet of the potential of the model to examine more substantive issues of validity (with the exception of the validity study in the paper by De Jong, 1983). Table 1 presents the results of a survey of papers published in the journal in this period.
The table presents information on which Rasch model was used in the analysis, what kinds of language tests were involved (that is, focusing on which language skills), where the research was carried out, whether the article focused exclusively on Rasch or whether it was one of several psychometric procedures used, and the focus of the paper in relation to Rasch – whether Rasch was simply used as part of a larger argument in the paper, or whether it was itself the focus of the paper’s discussion. This framework will also be used in analysis of publications from subsequent periods, to permit comparison across the periods.

Table 1. Published language testing journal research using Rasch, 1984–1989

Model used: Basic (8); Rating scale/Partial credit (3); Unknown (4). N = 15
Skills in test: Speaking, Writing (2); Reading, Listening (3); Other or more than one (9). N = 14a
Author affiliation: Australia, NZ (3); USA (6); UK, Europe (6). N = 15
Role of Rasch: Primary (9); One of several (3); Marginal (3). N = 15
Function of Rasch: Discuss (5); Use (5); Both (5). N = 15

Note: a The reason that the total number of articles in this row is lower than in other rows is that in some of the discussion articles, no language skill is specifically targeted. This is also true in tables below (Table 3 and Table 4) summarizing papers from later periods.

Developments in the 1990s: Enter FACETS

Late in 1990 there was a significant development. Ray Adams had just come back to Melbourne from completing his PhD in Chicago (Adams, 1989) and was aware of the significance of Mike Linacre’s work on multi-faceted Rasch measurement, implemented through the FACETS program (Linacre, 2009), the result of Linacre’s PhD (Linacre, 1989) which he had completed while Adams was in Chicago. Adams helped McNamara understand the radical nature of the development, and its potential for illuminating fundamental issues in performance assessment in second languages became clear. They wrote a paper (McNamara & Adams, 1994) demonstrating FACETS analysis using data from the IELTS writing test, and presented it at the 1991 Language Testing Research Colloquium at ETS; it was the first paper in language testing to use FACETS. The paper proved controversial, as what it showed about IELTS was that with a single rater, the chances of a borderline candidate gaining a different score for the same performance with a different rater were unacceptably high, and so the IELTS practice of a single rating of speaking and writing was hard to defend, a point made by John de Jong who was a discussant to the paper. In fact, FACETS had revealed a problem that was common to all performance tests, the vulnerability of the score to rater effects, a point that, as Linacre had pointed out in his PhD thesis, had been recognized for over 100 years but conveniently forgotten.

Language testing researchers in the United States, especially the advocates of earlier Rasch models, quickly saw the potential of multi-faceted Rasch measurement. For example, Stansfield and Kenyon used FACETS to scale speaking tasks based on the ACTFL Guidelines and the ILR scale.11 The scaling was found to fit the ACTFL/ILR model and the authors argued that it constituted a validation of the ACTFL/ILR scale.
Others remained cautious about Rasch because of their acceptance of the critiques within psychometrics of Rasch assumptions, and their preference for the so-called two- and three-parameter IRT models; Bachman and Pollitt aired the opposing sides on this issue in the late 1980s at meetings of the advisory group for the Cambridge-TOEFL comparability study (Bachman, Davidson, Ryan, & Choi, 1995). A further opportunity for dialogue around these issues arose in 1992, when McNamara, on an extended visit to UCLA, was invited by Bachman to co-teach courses on IRT in which Rasch methods, including multi-faceted Rasch measurement, were taught in contrast with two-parameter IRT models for dichotomous data and Generalizability Theory for judge-mediated data. This resulted in the book Measuring Second Language Performance (McNamara, 1996) which aimed to introduce Rasch methods, including multi-faceted Rasch measurement, to a general language testing readership, emphasizing its potential for exploring data from communicative assessments.

In the United Kingdom, the most influential language testing agency, the University of Cambridge Local Examinations Syndicate (UCLES), despite on the one hand the psychometric caution represented by the review by Wood (1991), and on the other a certain suspicion of psychometric methods of the type used by ETS as too ‘American’, began to use Rasch measurement. Alastair Pollitt joined UCLES in 1990, and initiated a project to use Rasch methods to calibrate and equate the first versions of the recently introduced IELTS; the calibration and item-banking system now used in Cambridge was developed soon afterwards. Neil Jones joined Cambridge in 1992, having completed his PhD at Edinburgh, and soon became responsible for this work.

In Australia, the establishment of what was to become the Language Testing Research Centre (LTRC) at Melbourne as a result of the adoption in 1987 by the Australian Government of the National Policy on Languages (Lo Bianco, 1987) led to the creation of another centre involved in the use of Rasch models (McNamara, 2001). Usually, academics working in language testing in university contexts tend to be solitary figures; located as they are in applied linguistics programs for the most part, it is unlikely that any program will be large enough to have more than a single member of staff with language testing as a speciality. Here, several language testing researchers were working together carrying out externally funded research, a highly unusual situation. The establishment of the LTRC met a real need for expertise in language assessment in Australia, and a number of projects were begun involving performance assessments which lent themselves to analysis using Rasch methods, particularly multi-faceted Rasch measurement. Melbourne became a centre for Rasch-based language testing research, as figures on publications to be presented below confirm.

The entry of multi-faceted Rasch measurement had a strong influence on the Rasch wars. By the early part of the 1990s, the growing communicative movement in language teaching meant that tests of the ability to speak and write in simulated real-world contexts became more and more central to language testing, with a resultant change in focus away from dichotomously scored tests towards performance assessments.
Multi-faceted Rasch measurement provided a powerful tool to examine rater characteristics at the level of individual raters, something which Generalizability Theory could in part match, but less flexibly (Lynch & McNamara, 1998). Rater characteristics which were now open for detailed research using Rasch methods included relative severity or leniency; degree of consistency in scoring; the influence of rater training; the influence of professional background; and consistency over time. Other aspects of the rating situation such as the effects on scores of task and mode of delivery (face-to-face vs. technologically mediated) could also be explored, as well as the interaction of these facets with aspects of rater characteristics. It is as if researchers in this field had been handed a very powerful microscope to examine the complexity of the rating process.

Despite this, it was not clear even by the mid-1990s that multi-faceted Rasch measurement was going to be taken up more widely. It was not as if the psychometric argument had been resolved: while it is true that in the Partial Credit Model some differences in ‘classical discrimination’ between items are accounted for by the variation in thresholds between items, this is not true for the Rating Scale Model, where there is no variation in thresholds between items, and for dichotomously scored items, the assumption of equal discrimination still holds. By the late 1990s, however, the appeal of multi-faceted Rasch measurement for understanding issues in communicative language testing proved irresistible, and there was a steady uptake in many world centres, including the United States (Table 2).12

Table 2. Changing attitudes to Rasch in the 1990s

                        For Rasch                                                    Against Rasch
Period       Australia (LT)   Australia (MPLT)   USA   Rest of the World    USA   Rest of the World
1990–1994    5                6                  1     0                    5     1
1995–1997    4                3                  1     2                    0     0
1998–1999    1                2                  2     2                    0     1
Total        10               11                 4     4                    5     2

The table divides the papers into two basic categories: those assuming or supporting the use of Rasch modelling; and those arguing against its assumptions. The papers are further classified to indicate the geographical affiliation of the authors. Australian research supportive of Rasch modelling is prominent, particularly in the early period; much of this appeared in the house journal of the Language Testing Research Centre in Melbourne, Melbourne Papers in Language Testing (MPLT). We can see that in the early 1990s, papers published by researchers working outside Australia were for the most part questioning the use of Rasch models; by the end of the decade that was no longer the case.

Some of the papers that appeared in that time give a feeling for these developments. Buck’s objections to Rasch on the grounds of the unidimensionality assumption in the context of the testing of listening (Buck, 1994) and Henning’s defence of Rasch (Henning, 1992) are typical of the debates at the main international conference of language testing researchers, the Language Testing Research Colloquium, at that time. The cautious interest in multi-faceted Rasch measurement in the United States is demonstrated by the papers of Bachman and some of his students using FACETS: a jointly authored paper (Bachman, Lynch, & Mason, 1995) compared analyses using FACETS and G-Theory on a data set from a test of speaking; Weigle (1994) used FACETS in her PhD thesis to investigate the effect of a program of rater training on the measurement qualities of individual raters.
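The kind of model estimated in these FACETS studies can be sketched as follows (the many-facet rating scale formulation; the notation is ours rather than that of any of the studies cited). For candidate n rated by rater j on task or criterion i, the probability of receiving score category k rather than k − 1 is modelled as

\[ \log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \alpha_j - \tau_k \]

where θ_n is the candidate’s ability, δ_i the difficulty of the task or criterion, α_j the severity of the rater, and τ_k the difficulty of the step from category k − 1 to category k (in a partial-credit variant the thresholds τ_{ik} vary across tasks or criteria). Further facets, such as mode of delivery, and interaction terms can be added in the same way; it is terms of this kind that underlie the analyses of rater bias and of interactions between facets and rater characteristics mentioned above.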
In Australia, Brown (1995) demonstrated the potential of FACETS to explore substantial issues of validity in second language performance assessments in her investigation of the effect of rater background on a workplace related test of spoken Japanese for tour guides. Raters with tour guiding experience were shown to be oriented to the assessment criteria differently from those without such experience, particularly on a task involving handling a problematic relationship with a client (a tourist). The paper raises the question of whose criteria should prevail in language tests contextualized within workplace tasks – linguistic or real-world criteria, reflecting a distinction proposed between ‘strong’ and ‘weak’ second language performance assessments (McNamara, 1996).

Table 3 summarizes the language testing research featuring Rasch measurement appearing in the 1990s. The major trends it reveals are: (a) a much greater use of multi-faceted Rasch measurement; (b) in the context of more research on the assessment of speaking and writing; (c) it is now much more often one of several statistical techniques used; (d) Rasch measurement is now mostly a statistical methodology to be used rather than simply discussed; and (e) the over-representation of Australian research.

Table 3. Published journal research using Rasch, 1990–1999

Model used: Basic (11); Rating scale/Partial credit (7); MFRM (17). N = 35
Skills in test: Speaking, Writing (22); Reading, Listening (8); Other (6). N = 36
Author affiliation: Australia, NZ (22); USA (6); Europe (3); RoW (5). N = 36
Role of Rasch: Primary (13); One of several (17); Marginal (6). N = 36
Function of Rasch: Discuss (2); Use (23); Both (11). N = 36
Program used: FACETS (12); (Con)Quest (4); Others (8). N = 24a

Note: a The total in this cell is lower because some authors did not specify the program used.

Character of papers 2000–2009

By about 2000, then, the Rasch wars were essentially over. The acceptance of Rasch measurement, particularly multi-faceted Rasch measurement, as a useful tool in the armory of language testing researchers, especially in performance assessments, is reflected in the summary provided in Bachman’s influential survey of the state of the art of language testing at the turn of the century (Bachman, 2000). Bachman begins by noting the growing use of Rasch measurement:

IRT has also become a widely used tool in language testing research … the Rasch model, in its various forms, is still the most widely used in language testing … More recently, the Rasch multi-facet model has been applied to investigate the effects of multiple measurement facets, typically raters and tasks, in language performance assessments. (Bachman, 2000, pp. 5–6)

But more significant in this context are his comments on the state of the debate over the appropriateness of its use:

The abstract technical debate about dimensionality and the appropriateness of different IRT models has been replaced by a much more pragmatic focus on practical applications, particularly with respect to performance assessments that involve raters and computer-based tests. (Bachman, 2000, p. 22)

This is confirmed by the following summary of the publications in the first decade of the current century (Table 4).

Table 4. Published journal research using Rasch, 2000–2009

Model used: Basic (12); Rating scale/Partial credit (4); MFRM (29). N = 45a
Skills in test: Speaking, Writing (26); Reading, Listening (6); Other (12). N = 44
Author affiliation: Australia, NZ (15); USA (11); UK, Europe (8); RoW (16). N = 50
Role of Rasch: Primary (15); One of several (26); Marginal (6). N = 47
Function of Rasch: Discuss (5); Use (39); Both (3). N = 47

Note: a The totals differ as some authors did not specify the model they used, some discussion papers did not target a specific language skill and some papers were authored by multiple authors from different parts of the world. In total, 47 papers were included in the sample.
As can be seen from the table, the use of Rasch measurement in language testing research appears to have become universally uncontroversial and routine. No longer is its use restricted to one or two centres; it is used by researchers in many different countries. Most typically, multi-faceted Rasch measurement is used with judge-mediated data from communicative language tests, often simply in order to establish the psychometric qualities of such tests, but also, and more interestingly, to address substantive validity issues. Another feature of the current scene is that Rasch is just one of a battery of psychometric tools used in the research, and increasingly, qualitative methods (especially introspection) are used in order to support or interrogate the quantitative findings. Some examples of these validity studies will give a feel for the current situation.

Bonk and Ockey (2003) used multi-faceted Rasch measurement in their study of a group oral assessment in English language at a Japanese university, in which groups of three or four students were assessed in conversation by two raters. The study demonstrated the relatively high degree of variability among the raters, confirming that such assessments are of modest quality in terms of reliability. The study also addressed the question of rater change over time, and found that the raters tended to become harsher with experience. Nevertheless they conclude that the group oral test of the type they studied is, despite its shortcomings, useful as a general measure of oral proficiency, especially in contexts where oral skills would not otherwise be assessed.

A paper by Elder, Knoch, Barkhuizen, and von Randow (2005) returned to the unresolved issue of whether the information on the quality of ratings of individual raters available from multi-faceted Rasch measurement analyses of rater harshness and consistency, and patterns of biased ratings involving particular criteria or particular tasks, could form the basis of feedback to raters which would improve their subsequent performance. Earlier studies (e.g. Lunt, Morton, & Wigglesworth, 1994; Wigglesworth, 1993) of the effectiveness of such feedback to raters had proved inconclusive. The study was carried out using data from ratings of ESL writing in a university screening test in New Zealand. In this case, the feedback given using Rasch-based estimates of relative leniency, consistency and bias was complemented by qualitative feedback, and the paper reports on participants’ perceptions of the usefulness of the feedback. The overall finding was that the feedback was helpful in many but not in all cases.

Brindley and Slatyer (2002) carried out a study using Rasch analysis of factors affecting task difficulty in listening assessments. The context of the study was the testing of competencies of immigrant adult ESL learners in Australia.
The study involved a comparison of different versions of similar listening content, constructed through altering macro-variables thought likely to affect the difficulty of the listening task such as speech rate, whether the material was listened to once or twice, item type (short answer questions, completion of a table, and sentence completion items), and speech genre (conversational vs. formal). Rasch calibration to compare the difficulty of different versions was complemented by a qualitative analysis of individual items in terms of the kind of information necessary to answer the question, the characteristics of the surrounding text and the characteristics of the stem. The study found a complex interaction of the effects of the variables and the item characteristics, which made the effect of each variable on its own hard to generalize. The study has important implications for our understanding of the second language listening process and of problems in designing listening tests.

Conclusion

Many challenges face language assessment research at the current time. A range of new topics has emerged. These include, for example, the problems arising in the assessments of combined cohorts of non-native speakers and native speakers in globalized work and study environments; the introduction of the assessment of different language skills (e.g. listening and writing) using integrated test tasks; and the automatic scoring of speech and writing. In this, the availability of new and more complex Rasch-based analytic tools – the development of multidimensional Rasch models and the programs (e.g. Conquest) to implement them – provides opportunities and challenges. The increasing complexity of the models and programs creates opportunities to explore their potential for application in language testing contexts; but it also raises the question of the accessibility of these tools to individuals lacking extensive psychometric training, which will include many working in language testing research.

This history of the uptake of Rasch measurement within language testing research has implications for other fields of educational measurement and for measurement more generally. It demonstrates the potential for Rasch measurement to deal with more than scaling issues: in language testing research Rasch measurement has been used to address a range of substantive validity questions. This history also raises the complex issues involved in joint work between construct specialists and psychometricians. Increasingly, in educational assessment, there is a need for psychometricians to draw more extensively on the expertise of subject specialists. Similarly, subject specialists seeking to develop assessments in their subject areas require the skills of psychometricians in order to develop meaningful assessments. Often the cooperation between individuals with different kinds of training may lead to problematic results, as the controversy over the ITESL test developed by Griffin et al. (1988), discussed above, has shown. Ideally, the expertise is combined in a single person, and that is the language testing tradition; but it is a demanding task to gain the expertise required. The necessary complementarity of applied linguistics and measurement expertise in language test development and language testing research is what characterizes the field; the history of the uptake of Rasch measurement within language testing demonstrates its consequences and difficulties.
Acknowledgements

An earlier version of this paper was given as a plenary address at the Pacific Rim Objective Measurement Symposium (PROMS 2008), Ochanomizu University, Tokyo, August 2008. We are grateful to Mark Wilson, Ray Adams, John De Jong, Charles Stansfield, Dorry Kenyon, Nick Saville, Neil Jones, Lyle Perkins, Thom Hudson, Fred Davidson and Brian Lynch for advice on historical and technical points, and to the reviewers for the helpful insights they gave.

Notes

1. This term is no longer used 'because it confounds this area with the "State-Trait" issue in psychology' (Mark Wilson, personal communication, 31 May 2011).
2. While the community of 'professional' language testers came to Rasch models late, early work on tests of reading preceded this: Georg Rasch himself developed the basic Rasch model as part of the construction of a series of vertically equated reading tests (Rasch, 1960/1980), and Rentz and Bashaw (1977) used Rasch methods to develop a reading scale.
3. Conference of the American Educational Research Association.
4. As well as a paper co-written by Perkins and Miller (1984) comparing Rasch and Classical Test Theory, which had appeared in the first issue of Language Testing.
5. This was the 1988 LTRC, held at the University of Illinois, Urbana-Champaign.
6. Similar seminars were organized in the United States by the Educational Testing Service in Princeton.
7. This paper was presented at a conference and subsequently published in a book of proceedings.
8. Computer Adaptive Testing.
9. Rasch models were by far the most prominent representative of Item Response Theory models in the Australian educational measurement context at that time.
10. This was the sole journal dedicated to research on language testing in that period. It commenced publication in 1984. A number of papers on Rasch appeared in other collections and journals, but the tables in the present paper are restricted to surveying publications in dedicated language testing journals, as explained in the Introduction; certain other important papers, particularly in the early period, are mentioned in the body of the article.
11. The study was done in 1990 and presented at AERA in 1991, and at the LTRC in 1992; it was published as Stansfield and Kenyon (1995).
12. This transition to greater acceptance of Rasch was reflected in the educational measurement community in the United States more generally, as Stansfield recollects (personal communication, 22 March 2011): 'The Rasch SIG [Special Interest Group] was not viewed favorably at AERA. In fact, its members were viewed as lightweight and somewhat renegade … Suffice it to say that professors of educational measurement typically said it was not worth studying or using. However, it prevailed and grew due to many factors, including the wonderful yet strong personality of Ben Wright, the work of Mike Linacre, the practicality and conceptual simplicity of the method, the high cost of 2 and 3 parameter test analysis software and the very large N required for using that software. As a result of all of the above, the use of Rasch analysis in research and testing in school districts throughout the US became commonplace. All of a sudden, you turned around and everybody was using Winsteps or FACETS.' Nevertheless, ETS still has a preference for non-Rasch IRT models, and in general, 'the controversy has not faded' (Mark Wilson, personal communication, 31 May 2011).

References

Adams, R. (1989). Measurement error and its effects upon statistical analysis. Unpublished doctoral dissertation, University of Chicago.
Adams, R., Griffin, P., & Martin, L. (1987). A latent trait method for measuring a dimension in second language proficiency. Language Testing, 4(1), 9–27.
Adams, R., & Siek Toon, K. (1993). Quest: The interactive test analysis system. Melbourne: Australian Council for Educational Research.
Alderson, C. (1986). Innovations in language testing? In M. Portal (Ed.), Innovations in language testing (pp. 93–105). Windsor, Berks: NFER-Nelson.
Alderson, C. (1987). An overview of ESL/EFL testing in Britain. In C. Alderson, K. Krahnke, & C. Stansfield (Eds.), Reviews of English language proficiency tests (pp. 3–4). Washington, DC: Teachers of English to Speakers of Other Languages.
Alderson, C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.
Andrich, D. A. (1978). A rating scale formulation for ordered response categories. Psychometrika, 43, 561–573.
Bachman, L. (2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17(1), 1–42.
Bachman, L., Davidson, F., Ryan, K., & Choi, I.-C. (1995). An investigation into the comparability of two tests of English as a foreign language: The Cambridge TOEFL comparability study. Cambridge: Cambridge University Press.
Bachman, L., Lynch, B., & Mason, M. (1995). Investigating variability in tasks and rater judgements in a performance test of foreign language speaking. Language Testing, 12(2), 238–257.
Baker, R. (1987). An investigation of the Rasch model in its application to foreign language proficiency testing. Unpublished PhD thesis, University of Edinburgh.
Baker, R. (1997). Classical test theory and item response theory in test analysis. LTU Special Report No. 2. Lancaster: Centre for Research in Language Education.
Bartholomew, D. J., Deary, I. J., & Lawn, M. (2009). Sir Godfrey Thomson: A statistical pioneer. Journal of the Royal Statistical Society. Series A (Statistics in Society), 172(2), 467–482.
Bonk, W., & Ockey, G. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89–110.
Brindley, G., & Slatyer, H. (2002). Exploring task difficulty in ESL listening assessment. Language Testing, 19(4), 369–394.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12, 1–15.
Buck, G. (1994). The appropriacy of psychometric measurement models for testing second language listening comprehension. Language Testing, 11(2), 145–170.
Canale, M. (1986). Theoretical bases of communicative approaches to second-language teaching and testing. Applied Linguistics, 1, 1–47.
Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency tests. Language Testing, 2(2), 155–163.
Clark, J. L. D., & Clifford, R. T. (1988). The FSI/ILR/ACTFL proficiency scales and testing techniques: Development, current status and needed research. Studies in Second Language Acquisition, 10(2), 129–147.
Davidson, F. (1991). Statistical support for training in ESL composition rating. In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 155–164). Norwood, NJ: Ablex.
Davidson, F., & Henning, G. (1985). A self-rating scale of English difficulty: Rasch scalar analysis of items and rating categories. Language Testing, 2(2), 164–179.
Davies, A. (2008). Assessing academic English: Testing English proficiency 1950–1989 – the IELTS solution. Cambridge: Cambridge University Press.
De Jong, J. H. A. L. (1983). Focusing in on a latent trait: An attempt at construct validation using the Rasch model. In J. van Weeren (Ed.), Practice and problems in language testing 5. Papers presented at the International Language Testing Symposium (Arnhem, Netherlands, March 25–26, 1982) (pp. 11–35). Arnhem: Cito.
De Jong, J. H. A. L. (1991). Defining a variable of foreign language ability: An application of item response theory. Unpublished PhD thesis, Twente University, The Netherlands.
De Jong, J. H. A. L., & Glas, C. A. W. (1987). Validation of listening comprehension tests using item response theory. Language Testing, 4(2), 170–194.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175–196.
Goldstein, H. (1979). Consequences of using the Rasch Model for educational assessment. British Educational Research Journal, 5, 211–220.
Griffin, P. (1985). The use of latent trait models in the calibration of tests of spoken language in large-scale selection-placement programs. In Y. P. Lee, A. C. Y. Fok, R. Lord, & G. Low (Eds.), New directions in language testing (pp. 149–161). Oxford: Pergamon.
Griffin, P., Adams, R., Martin, L., & Tomlinson, B. (1988). An algorithmic approach to prescriptive assessment in English as a second language. Language Testing, 5(1), 1–18.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Hamp-Lyons, L. (1989). Applying the partial credit model of Rasch analysis: Language testing and accountability. Language Testing, 6(1), 109–118.
Henning, G. (1984). Advantages of latent trait measurement in language testing. Language Testing, 1(2), 123–133.
Henning, G. (1987). A guide to language testing: Development, evaluation, research. Cambridge, MA: Newbury House.
Henning, G. (1988a). An American view on ELTS. In A. Hughes, D. Porter, & C. Weir (Eds.), ELTS Validation Project: Proceedings of a conference held to consider the ELTS Validation Project Report. English Language Testing Service Research Report 1 (pp. 84–92). London: British Council/University of Cambridge Local Examinations Syndicate.
Henning, G. (1988b). The influence of test and sample dimensionality on latent trait person ability and item difficulty calibration. Language Testing, 5(1), 83–99.
Henning, G. (1992). Dimensionality and construct validity of language tests. Language Testing, 9(1), 1–11.
Henning, G., & Davidson, F. (1987). Scalar analysis of composition ratings. In K. Bailey, T. Dale, & R. Clifford (Eds.), Language testing research: Selected papers from the 1986 Colloquium. Monterey, CA: Defense Language Institute.
Henning, G., Hudson, T., & Turner, J. (1985). Item response theory and the assumption of unidimensionality for language tests. Language Testing, 2(2), 141–154.
Ingram, D. E., & Wylie, E. (1979). Australian Second Language Proficiency Ratings (ASLPR). In Adult Migrant Education Program Teachers Manual. Canberra: Department of Immigration and Ethnic Affairs.
Jones, N. (1991). Test item banker: An item bank for a very small micro. In C. Alderson & B. North (Eds.), Language testing in the 1990s (pp. 247–254). London: Modern English Publications/British Council/Macmillan.
Jones, N. (1992). An item bank for testing English language proficiency: Using the Rasch model to construct an objective measure. Unpublished PhD thesis, University of Edinburgh.
Lawley, D. N. (1943). On problems connected with item selection and test construction. Proceedings of the Royal Society of Edinburgh, 61, 273–287.
Linacre, J. M. (1989). Many faceted Rasch measurement. Unpublished doctoral dissertation, University of Chicago.
Linacre, J. M. (2009). Facets Rasch measurement computer program. Chicago: Winsteps.com.
Lo Bianco, J. (1987). National policy on languages. Canberra: Australian Government Publishing Service.
Lunt, H., Morton, J., & Wigglesworth, G. (1994). Rater behaviour in performance testing: Evaluating the effect of bias feedback. Paper presented at the 19th annual congress of the Applied Linguistics Association of Australia, July 1994, University of Melbourne.
Lynch, B., Davidson, F., & Henning, G. (1988). Person dimensionality in language test validation. Language Testing, 5(2), 206–219.
Lynch, B., & McNamara, T. (1998). Using G-theory and Many-facet Rasch measurement in the development of performance assessments of the ESL speaking skills of immigrants. Language Testing, 15(2), 158–180.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
McNamara, T. (1990a). Assessing the second language proficiency of health professionals. Unpublished PhD thesis, The University of Melbourne.
McNamara, T. (1990b). Item response theory and the validation of an ESP test for health professionals. Language Testing, 7(1), 52–75.
McNamara, T. (1991). Test dimensionality: IRT analysis of an ESP listening test. Language Testing, 8(2), 139–159.
McNamara, T. (1996). Measuring second language performance. London & New York: Longman.
McNamara, T. (2001). Ten years of the Language Testing Research Centre. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita, T. Lumley, T. McNamara, & K. O'Loughlin (Eds.), Experimenting with uncertainty: Essays in Honor of Alan Davies (pp. 5–10). Cambridge: Cambridge University Press.
McNamara, T., & Adams, R. (1994). Exploring rater behaviour with Rasch techniques. Selected Papers of the 13th Annual Language Testing Research Colloquium (LTRC). Princeton, NJ: Educational Testing Service, International Testing and Training Program Office (also available as ERIC Document Reproduction Service No. ED 345 498).
North, B. (1993). The development of descriptors on scales of proficiency: Perspectives, problems, and a possible methodology. NFLC Occasional Paper. Washington, DC: National Foreign Language Center.
North, B. (1995). The development of a common framework scale of descriptors of language proficiency based on a theory of measurement. Unpublished PhD thesis, Thames Valley University.
Nunan, D. (1987). Methodological issues in research. In D. Nunan (Ed.), Applying second language acquisition research (pp. 143–171). Adelaide: National Curriculum Resource Centre.
Perkins, K., & Miller, L. D. (1984). Comparative analyses of English as a Second Language reading comprehension data: Classical test theory and latent trait measurement. Language Testing, 1(1), 21–32.
Pollitt, A., & Hutchinson, C. (1987). Calibrating graded assessments: Rasch partial credit analysis of performance in writing. Language Testing, 4(1), 72–92.
Pollitt, A., Hutchinson, C., Entwistle, N., & De Luca, C. (1985). What makes exam questions difficult? An analysis of 'O' grade questions and answers (Research Report for Teachers No. 2). Edinburgh: Scottish Academic Press.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research; expanded edition (1980), with foreword and afterword by B. D. Wright. Chicago: University of Chicago Press.
Rentz, R. R., & Bashaw, W. L. (1977). The National Reference Scale for Reading: An application of the Rasch Model. Journal of Educational Measurement, 14(2), 161–179.
Skehan, P. (1989). Language testing. Part II. Language Teaching, 22(1), 1–13.
Stansfield, C. W. (Ed.). (1986). Technology and language testing: A collection of papers from the Seventh Annual Language Testing Research Colloquium. Washington, DC: Teachers of English to Speakers of Other Languages.
Stansfield, C. W., & Kenyon, D. M. (1995). Comparing the scaling of speaking tasks by language teachers and by the ACTFL guidelines. In A. Cumming & R. Berwick (Eds.), The concept of validation in language testing (pp. 124–153). Clevedon, Avon: Multilingual Matters.
van Weeren, J. (1983). Practice and problems in language testing 5. Non-classical test theory; Final examinations in secondary schools. Papers presented at the International Language Testing Symposium (Arnhem, Netherlands, 25–26 March 1982).
Weigle, S. C. (1994). Effects of training on raters of English as a second language compositions: Quantitative and qualitative approaches. Unpublished doctoral dissertation, University of California, Los Angeles.
Widdowson, H. G. (1996). Linguistics. Oxford: Oxford University Press.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in assessing oral interaction. Language Testing, 10(3), 305–323.
Wood, R. (1991). Assessment and testing. Cambridge: Cambridge University Press.
Wright, B. D., & Andrich, D. A. (1987). Rasch and Wright: The early years. Rasch Measurement Transactions, Pre-History, pp. 1–4.
Wu, M. L., Adams, R., & Wilson, M. (1998). ACER Conquest: Generalised item response modelling software. Melbourne: ACER Press.