The Rasch wars: The emergence of Rasch measurement in language testing
Language Testing
29(4) 555–576
© The Author(s) 2012
Reprints and permission:
sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0265532211430367
ltj.sagepub.com
Tim McNamara and Ute Knoch
The University of Melbourne, Australia
Abstract
This paper examines the uptake of Rasch measurement in language testing through a consideration
of research published in language testing research journals in the period 1984 to 2009. Following
the publication of the first papers on this topic, exploring the potential of the simple Rasch model
for the analysis of dichotomous language test data, a debate ensued as to the assumptions of the
theory, and the place of the model both within Item Response Theory (IRT) more generally and
as appropriate for the analysis of language test data in particular. It seemed for some time that
the reservations expressed about the use of the Rasch model might prevail. Gradually, however,
the relevance of the analyses made possible by multi-faceted Rasch measurement to address
validity issues within performance-based communicative language assessments overcame language
testing researchers’ initial resistance. The paper outlines three periods in the uptake of Rasch
measurement in the field, and discusses the research which characterized each period.
Keywords
FACETS, history of language testing, item response theory, multi-faceted Rasch measurement,
Rasch measurement
The paper examines the history of the take-up of Rasch measurement within second and
foreign language testing, in research on test development, delivery and validation. In the
first part of the paper, two characteristics of the field of language testing research which
influenced its reception are outlined: the role of psychometrics within differing regional
research traditions, and the professional background and training of those working within
language testing. In the second part of the paper, the results are presented of a survey of
studies involving Rasch measurement published in dedicated language testing journals
(Language Testing from 1984, Language Assessment Quarterly from 2004, Melbourne
Papers in Language Testing from 1992, and Assessing Writing from 1994). The research
Corresponding author:
Ute Knoch, School of Languages and Linguistics, Babel Level 5, The University of Melbourne, 3010, Australia.
Email: [email protected]
is summarized over three periods (the 1980s, the 1990s and the 2000s) and tracks the
change from the initial claims for and resistance to Rasch measurement, particularly the
simple Rasch model – ‘the Rasch wars’ – to its ultimate wide acceptance by 2000.
Regional differentiation and professional training in
language testing research
In order to understand the context into which Rasch measurement was received within language testing research, it is necessary to recognize certain distinctive features of the field.
The first is that, despite the shared concerns of language testing researchers worldwide (i.e. the development and validation of language tests), regional traditions of language testing differ in significant ways, reflecting to some extent the differing cultural
values of their societies. For example, the British and American traditions of language
testing are acknowledged to draw on rather different theoretical and practical traditions
(Alderson, 1987; Alderson, Clapham, & Wall, 1995). The British tradition of language
examinations goes back to the beginning of the 20th century and earlier and was relatively slow to feel the full influence of psychometric theory as it developed, particularly
in the United States, and instead placed greater emphasis on the importance of test content and its relationship to teaching. Alderson (1987, p. 3) in a commentary on the work
of UK testing and examination boards noted ‘the lack of emphasis by exam boards on the
need for empirical rather than judgemental validation’ which meant that ‘examination
boards do not see the need to pretest and validate their instruments, nor conduct post hoc
analyses of their tests’ performance’. On the other hand he emphasized that
many of [the] tests are highly innovative in content and format … other tests could benefit
greatly … by attention to the content validation procedures they use. It is also true to say that
many tests would benefit from greater attention to the relationship between testing and teaching,
for which the British exam boards are particularly noted. (Alderson, 1987, p. 4)
The British tradition was quick to respond to developments in language teaching,
particularly the rapid spread of communicative language teaching and the emergence of
the specialist fields of English for academic and other specific purposes in the 1970s
and 1980s (Davies, 2008). The clearest example is perhaps the appearance of the ELTS
test of English for Academic Purposes in the 1980s, succeeded by IELTS from 1989,
which reflected in varying degrees the communicative demands of the academic setting
facing international students wishing to study at English-medium universities in Britain,
Australia, Canada and New Zealand, which the test targeted. It is interesting in this
regard to note the strongly critical comments by the American Grant Henning on the
psychometric shortcomings of the analysis of trial data in the ELTS Validation Project
(Henning, 1988a). IELTS still arguably prioritizes communicative relevance and impact
on teaching and learning over psychometric rigour, for example in its commitment to a
face-to-face test of speaking where a single rating determines the candidate’s score,
with the inevitable compromise on reliability that results. This is preferred over potentially more reliable tests of spoken language, for example tests administered in non-interactive settings which prioritize technological sophistication in delivery or analysis of performance, but which fail to fully reflect the construct of speaking, which necessarily involves face-to-face interaction.
The United States tradition by comparison has tended to emphasize psychometric
considerations more strongly. The overriding concern for demonstrating satisfactory
psychometric properties in tests meant that for many years language tests there proved
less responsive to changes in language teaching, particularly the communicative movement, and the accompanying demand for language tests to reflect this. For example, it
proved very difficult to change psychometrically sound tests such as the traditional paper
and pencil version of the Test of English as a Foreign Language (TOEFL), or the Test of
English for International Communication (TOEIC), long after critiques of the constructs
underlying such tests had emerged and there was a demand for more communicative
tests from teachers and from receiving institutions alike. Now, the two traditions have
come together, so that IELTS and the new TOEFL iBT are both communicatively oriented tests: TOEFL iBT is more closely targeted to the academic setting, and is arguably
more reliable, but its speaking assessment lacks the face-to-face interaction that is the
distinctive feature of the IELTS interview. Over 20 years ago, Alderson (1987, p. 4) suggested that ‘some combination of British judgemental validation and American empirical
validation seems required’, and to a large extent this now seems to have occurred.
In Australia, research in language testing first seriously emerged in the context of the
Adult Migrant Education Program (AMEP), a national program for teaching English as
a second language to adult immigrants, which accompanied the introduction of large
scale immigration to Australia following World War II. Language assessment has been
influenced by both the British and American traditions. Australia played a major collaborative role with British researchers in the development of IELTS, and workplace tests
such as the Occupational English Test (McNamara, 1996) drew directly on the British
tradition of ESP testing. On the other hand, American influence was felt in the area of
oral proficiency testing, where the instrument used for 20 years from the late 1970s for
assessing migrants in the Adult Migrant English Program, the Australian Second
Language Proficiency Ratings (ASLPR: Ingram & Wylie, 1979) was derived from the
American Foreign Service Institute oral proficiency interview and scale (Clark &
Clifford, 1988). In fact, however, language assessment in Australia was ultimately to cut
a more independent path, particularly when it encountered Australian work on Rasch
measurement, as we shall see.
The second important contextual feature to consider is the typical professional background and research training of language testing researchers. Language testing is a hybrid
field, with roots in applied linguistics and in measurement. Researchers (at least in the
English-speaking world) frequently enter work in language testing following initial training and careers in language teaching rather than in statistics or psychometrics. Their introduction to language testing is in specialist courses within graduate study in applied
linguistics, or through practical exposure in their professional teaching careers, and they
are likely to lack a strong background in mathematics or statistics in their prior education,
typically being languages, linguistics or humanities and social science majors in their
undergraduate degrees. They may even have consciously avoided and feel uncomfortable
with numbers. Graduate training in language testing research shows considerable variability across national contexts, reflecting what we have seen of the differing regional research
traditions. Thus, in very general terms, those applied linguistics graduate students entering
language testing in the best American centres of graduate training will then be initiated
into the American tradition of language testing research and exposed to extensive training
in psychometrics. Arguably the most significant American graduate program in language
testing in the last 25 years, that at UCLA under Lyle Bachman, has established a tradition
of rigorous psychometric training, and its graduates are maintaining that tradition in their
own education of the next generation of language testing researchers. The requirement
within American doctoral degrees to undertake extensive coursework, not matched until
very recently in British and Australian universities, also gives scope for psychometric
training outside applied linguistics programs which is less readily available to those studying in the UK or Australia. The British and Australian tradition of training has tended to
emphasize psychometric training less, though more now is incorporated, and has continued to highlight strongly the need to engage with the relationship between test content and
task design and theories of language learning and teaching. However, the practice at the
annual research conference of the International Language Testing Association (ILTA) of
holding pre-conference workshops which emphasize psychometric training now provides
opportunities for people from all research traditions to develop their knowledge and skills
in this area.
Overall, the fact that the professional background of many language testers lies outside measurement has advantages and disadvantages. The advantage is that issues of
construct interpretation tend to be at the fore, and language testers readily engage in
debate over substantive matters of this kind. The disadvantage is that where language
testing researchers come to psychometrics late, a lack of depth of training in this area can
be a handicap.
The contextual background of the culture of language testing, both in its differing
broad national traditions, and the professional background and training of language testing researchers, constitutes the setting in which the influence of Rasch measurement
began to be felt from the early 1980s onwards. We now consider the history of the uptake
of Rasch models in the almost three decades since the publication of the first papers on
the topic in the field of language testing.
Enter Rasch: The 1970s and 1980s
Awareness of Rasch measurement in language testing occurred as a rather belated
response to the growing interest in what was known as latent trait theory1 more generally
throughout the 1960s and 1970s in educational measurement.2 Ben Wright, the outstanding American advocate of Rasch methods, started corresponding with Georg Rasch,
invited him to Chicago and visited him in Denmark in the 1960s. Wright commenced
annual courses on the theory and practice of Rasch measurement for Education and
Psychology students in 1964 and Rasch spoke at the Educational Testing Service
Invitational Test Conference in October 1967. The awareness among those involved
more exclusively in language testing occurred first in those centres in which sophisticated psychometric expertise was available: for example the Central Institute for Test
Development (CITO) in the Netherlands, Educational Testing Service (ETS) in the
United States, the National Foundation for Educational Research (NFER) in the United
Kingdom and the Australian Council for Educational Research (ACER) in Australia. It
was not long before this began to be felt in language testing. The context of reception and
the initial reaction differed in different countries. We will consider here in turn the
Netherlands, the United States, the United Kingdom, and Australia.
In the early 1980s in the Netherlands psychometricians at the educational research
centre CITO became interested in Rasch, and the proceedings of a meeting at CITO in
1982 (van Weeren, 1983) contained four papers on the application of the Rasch model in
language testing, mostly in the testing of foreign languages in school examinations. One
of these was by a test constructor at CITO, John De Jong, who assessed the validity of a
test using the Rasch model (De Jong, 1983). De Jong (personal communication, 26 July,
2008) writes of that time:
In 1982 I was not yet aware of the NFER and ACER work, but after joining the ‘Rasch club’, a
kind of inter-university discussion group, meeting 3 or 4 times a year, I got into contact with
more people and in 1985 I went to a Rasch meeting organized by Ben Wright in conjunction
with the AERA3 in Chicago. There I met Geoff Masters and David Andrich and also was staying
in the same student-type accommodation as a number of Dutch professors in psychometrics
(Wim van der Linden, Ivo Moolenaar, Don Mellenbergh, Eddie Roskam), who expressed that
my work could easily be the basis of a PhD.
De Jong subsequently completed his PhD on applying Rasch measurement to issues of
test development, test equating, international standards, and educational reform, using
examples from a variety of language tests representing all four skills (De Jong, 1991).
Given the strong psychometric tradition in the United States, and the central role of
the program at Chicago under Ben Wright in promulgating Rasch’s ideas in the educational field (Wright & Andrich, 1987), it was inevitable that Rasch measurement would
soon come to the attention of language testing researchers. An influential early figure
was Grant Henning, who had attended a workshop on Rasch with Ben Wright, and
became an advocate of Rasch. Teaching at the American University in Cairo in the early
1980s he inspired Kyle Perkins, there on a sabbatical year from Southern Illinois
University Carbondale, to become familiar with Rasch measurement. Others who were
with Henning in Cairo were Thom Hudson, who became his student at UCLA, and Dorry
Kenyon, a student of Henning’s in Cairo, who was an early adopter of Rasch when he
later began working with Charles Stansfield at the Center for Applied Linguistics (CAL)
in Washington, DC. Subsequently, at UCLA, Henning taught Rasch measurement and
worked on language test data analysis with his own graduate students, particularly Fred
Davidson, Brian Lynch, Thom Hudson, Jean Turner and somewhat later Antony Kunnan.
Henning’s influence is reflected in the publication of a number of papers by himself and
his students in the journal Language Testing, established in 1984. These papers4 set out
to introduce the main features of the Basic Rasch model and its earliest extensions to a
language testing readership, using data sets to demonstrate its potential in exploring test
quality. An important step in the dissemination of knowledge about Item Response
Theory (IRT) and Rasch in particular among language testers occurred in 1985. Charles
Stansfield, who was exposed to IRT while working at ETS from 1981 to 1986, chaired
the 1985 Language Testing Research Colloquium (LTRC), held at ETS, the theme of
which was ‘Technology and Language Testing’, and organized a two-day pre-conference
seminar on IRT. In Stansfield’s view, the conference and workshop were significant in
the following way:
[The] workshop gave formal training to everyone and the following year and for years
afterward, there were many papers involving IRT at LTRC. I wouldn’t say the conference
introduced IRT to language testers, although for most it did. However, I can say that the
conference rapidly moved language testers to this new and promising approach to item analysis,
building a scale, calibrating items, etc. After the conference, it was shared knowledge. (Charles
Stansfield, personal communication, 22 May 2011)
Six out of the 10 papers from the 1985 LTRC published in the proceedings under the title
Technology and Language Testing (Stansfield, 1986) dealt with IRT; the authors included
Henning, De Jong, and Harold Madsen and Jerry Larson from Brigham Young University,
who had used the Basic Rasch model to develop a computer adaptive placement test.
Stansfield continued to play a role in promoting Rasch following his appointment as
Director of the Center for Applied Linguistics in 1986. He writes:
I knew I would be working with small data samples for the tests we developed, so Rasch
seemed to be what was needed. Ben Wright had created a Rasch SIG within AERA, so I started
attending their meetings. They also had a preconference meeting which was a series of papers
on the subject. He usually had comments about each paper, and each comment was positive …
I was one of about 6 language testers who went to Ben Wright’s office at the University of
Chicago after an LTRC,5 for a full day of lecture and question-answering by Ben Wright. He
was a most impressive man and an excellent communicator. (Charles Stansfield, personal
communication, 22 May 2011)
In the United Kingdom, Rasch measurement entered via work on the testing of reading in the mother tongue in schools, and then spread to second and foreign language
testing. Rasch measurement had influenced the work of the National Foundation for
Educational Research (NFER), which focused on school educational contexts, and there
it had been used in the development of reading tests in English as a mother tongue, work
which in turn encountered critique on the grounds of the assumptions of the Basic Rasch
model (Goldstein, 1979; see more below). In 1985, the British Council in London organized introductory seminars on Item Response Theory for language testing specialists.6
Alastair Pollitt, a psychometrician based in Edinburgh, who had an interest in school-based L1 writing, began using the Rating Scale and Partial Credit models to explore
school students’ writing in English as a mother tongue (Pollitt & Hutchinson, 1987;
Pollitt, Hutchinson, Entwistle, & DeLuca, 1985). Within second language testing,
Rosemary Baker wrote a PhD about Rasch at Edinburgh (Baker, 1987; subsequently
published as Baker, 1997), but soon moved out of the field proper. Neil Jones had become
aware of the potential of the Basic Rasch model when he happened upon the discussion
in Henning (1987), developing his own program for the analysis of language test data,
which he demonstrated at a series of workshops (Jones, 1991). One of these workshops
was attended by Brian North, who some years later used Rasch measurement in the calibration of descriptors of what became the Common European Framework of Reference
(North, 1993, 1995). Jones subsequently wrote a PhD at Edinburgh on the use of Rasch
in item banking, supervised by Pollitt (Jones, 1992). There was also some interest in
Rasch measurement at Lancaster, a strong centre for language testing research, particularly in the development of computer adaptive tests (Alderson, 1986).
In Australia, in the broader field of educational measurement, there was a uniquely
strong connection with Rasch measurement. A succession of Australians studied with
Ben Wright in Chicago and contributed significantly to the development of Rasch modelling itself – initially David Andrich who developed the Rating Scale model (Andrich,
1978) and Geoff Masters who developed the Partial Credit model (Masters, 1982), and
later Ray Adams, who with Khoo Siek-Toon developed the Quest program (Adams &
Siek Toon, 1993), Mark Wilson who with Ray Adams and Margaret Wu developed
Conquest (Wu, Adams, & Wilson, 1998), and others, particularly Patrick Griffin, who,
while he was working in Hong Kong, began working on an ESL test (Griffin, 1985). It
would not be long before language testers in Australia would encounter this remarkable
intellectual resource, though, given their lack of opportunity for training in psychometrics, it would happen by chance, as a personal account of the exposure to Rasch of one
Australian researcher (McNamara) will demonstrate. The happenstance of his entry into
language testing has parallels in the careers of many other researchers in language testing, given the context of their background and training outlined above.
McNamara worked for 13 years teaching English as a foreign language to adults in
private language schools in London and in the Australian equivalent of community colleges in Melbourne. He had had some very introductory training in quantitative methods
(basic univariate statistics) during his MA in Applied Linguistics in London, and took
one short course on language testing as part of that degree. On his return to Australia,
opportunities came up to do consultancies in language testing, the principal one being to
develop a workplace-related performance assessment of the ability of immigrant health
professionals to carry out the communicative tasks of the workplace. (He had been
involved in setting up and teaching ESP courses for such health professionals.) The
resulting Occupational English Test (OET) (McNamara, 1996) was strongly influenced
by British work on the testing of English for academic purposes, and was within that
tradition of communicative language testing: it emphasized real-world contexts for communication, and the productive skills of speaking and writing within profession-specific
contexts, as well as shared assessments of reading and listening.
McNamara’s work on the OET subsequently formed the basis for a PhD (McNamara,
1990a). It was in the context of carrying out this study that the purely chance encounter
with the Rasch tradition happened. At Melbourne, where he had a temporary position in
applied linguistics, the head of the Horwood Language Centre, Dr Terry Quinn, an
applied linguist with an interest in the policy context of assessment but also with little
background in psychometrics, helped him to find a co-supervisor for his thesis.
McNamara remembers Quinn taking out a list of academic staff at the University and
running down the list, looking for someone whose interests lay in assessment. ‘Ah, here’s
one’ he said. ‘Masters – I don’t know him but he’s in the Faculty of Education and it says
he’s interested in assessment. Go and see him and see if he will be a co-supervisor with
me.’ McNamara went to see Geoff Masters, who in fact resigned from the University
within six months to take up another position, but it was long enough for McNamara to
be introduced to the Partial Credit Model, which he then used in his thesis in the analysis
of data from the performance-based tests of speaking and writing in the OET. Many of
the earlier papers on Rasch in language testing had been on objectively scored data; the
combination of the new Rasch measurement and the communicative tradition in language testing, while not original, was still relatively novel (papers by Davidson and
Henning (1985) and Henning and Davidson (1987)7 had used Rasch in the analysis of
data from self-assessments and from a writing test; see also Davidson, 1991).
In summary, then, when Rasch measurement first began to be known within language
testing research, it entered regional contexts which differed significantly in their language testing practice and research cultures, and which had differing attitudes to training
in psychometrics. This had implications for the initial and subsequent reception of Rasch
measurement in the field, and was a significant factor in ‘the Rasch wars’, as we shall
now see.
The Rasch wars: Controversies over the assumptions of
Rasch measurement in the 1980s
The advent of Rasch within language testing was not viewed in a positive light by everyone. The Rasch wars, already well under way within educational measurement more
generally, had begun. These wars were fought on a number of fronts: the use of Rasch
measurement in language testing was exposed to different kinds of attack, some on psychometric grounds, the others on applied linguistic ones.
From the psychometric side, the question of the desirability of the use of Rasch measurement in language testing reflected debates and disputes in the broader measurement
field, principally about the analysis of dichotomous items using the Basic Rasch model.
There were in fact two kinds of dispute here: first, what were the advantages of latent
trait theory, including the simple Rasch model and other IRT models for dichotomous
items, over classical test theory? Second, what were the relative merits of the Rasch
model versus the other IRT models? In those settings, such as the UK and Australia,
where the Rasch model was the vehicle for debates about the advantages of latent trait
models in general, and the particularities of Rasch measurement were at issue, these two
areas of dispute became fused. For example, while the first item response models were
developed in the work of Lawley (1943) and Thomson (see Bartholomew, Deary, &
Lawn, 2009) in work on intelligence testing in the United Kingdom in the 1940s and
1950s, most of the subsequent interest in IRT was in the United States, and it was forgotten in Britain; when latent trait theory returned to Britain it was only in the form of the
Basic Rasch model, which then bore the brunt of the attack.
Within latent trait modelling itself, there were fierce arguments between advocates of
the Basic Rasch model and those who preferred 2- and 3-parameter IRT models
(Hambleton, Swaminathan, & Rogers, 1991). For example, Stansfield (personal communication, 22 May 2011) reports that at ETS ‘the one parameter [i.e. Basic Rasch]
model was viewed as simplistic and inadequate to display the properties of an item’. One
target of attack from proponents of more complex IRT models was the Rasch assumption
of equal discrimination of items, which was easy to disprove as a matter of fact for any
particular data set. The nature and advantage of such an assumption – a deliberate
simplification – was too little appreciated, although other fields of applied linguistics
have readily grasped this point. For example, Widdowson (1996) has used the example
of the simplifying aspects of the model of the London Underground present in London
Transport maps; it would be misleading, for example, for anyone looking at the map to
assume that the distances between stations on the map were true to the actual relative
distances between stations. The map uses a deliberate simplification; in Rasch, the deliberate simplification of the assumption of equal discrimination of items permits exploitation of the property of specific objectivity in Rasch models, which means that the
relationship of ability and difficulty remains the same for any part of the ability or difficulty continuum (see McNamara, 1996 for an explanation of this point). This property
made it easy for items to be banked, essential for applications in computer adaptive testing, and for tests to be vertically equated, allowing mapping of growth over time using
linked tests. Supporters of the Rasch model argued that the fact that Rasch has tests of
the failure of its own assumption – that is, the capacity to flag when item discriminations
for individual items depart sufficiently far from the assumed discrimination to jeopardize
the measurement process – means that this assumption is not used recklessly.
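In symbols (a standard textbook statement of the two models, not a formula taken from the papers under discussion), the contrast at issue is the following. The Basic Rasch model and the two-parameter IRT model give the probability that person n succeeds on dichotomous item i as

P(X_{ni} = 1) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)} \quad \text{(Rasch)}, \qquad P(X_{ni} = 1) = \frac{\exp\{a_i(\theta_n - b_i)\}}{1 + \exp\{a_i(\theta_n - b_i)\}} \quad \text{(two-parameter model)},

where \theta_n is person ability, b_i is item difficulty, and a_i is item discrimination, which the Rasch model fixes at a common value for all items. Under the Rasch model the log-odds of success reduce to \theta_n - b_i, so the comparison of any two items, (\theta_n - b_i) - (\theta_n - b_j) = b_j - b_i, is the same at every point of the ability continuum; this is the property of specific objectivity referred to above, and it is what underwrites item banking and vertical equating. Allowing a_i to vary buys closer fit to a particular data set at the cost of this invariance.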
From applied linguists, the assumption of unidimensionality in Rasch was seen as
having deeply problematic implications for test constructs. (As this assumption was
shared at that time by all the then-current IRT models, and indeed by Classical Test
Theory, it was less of an issue for psychometricians.) In an early example of this sort of
critique, Michael Canale argued:
Perhaps the main weakness of this version of CAT8 is that the construct to be measured must,
according to item response theory, be unidimensional – i.e. largely involve only one factor. Not
only is it difficult to maintain that reading comprehension is a unidimensional construct (for
example, to ignore the influence of world knowledge), but it is also difficult to understand how
CAT could serve useful diagnostic and achievement purposes if reading comprehension is
assumed to be unidimensional and, hence, neither decomposable into meaningful subparts nor
influenced by instruction. (Canale, 1986, p. 30)
Similar objections were vigorously voiced elsewhere. The assumption that a single
underlying measurement dimension is reflected in the data was deemed to make Rasch
modelling an inappropriate tool for the analysis of language test data (Buck, 1994;
Hamp-Lyons, 1989), given the obvious complexity of the construct of language proficiency. Skehan (1989, p. 4) expressed further reservations about the appropriateness of
Rasch analysis in the context of ESP testing, given what we know about ‘the dimensions
of proficiency or enabling skills in ESP’. McNamara (1990a) argued strongly against this
view, using data from the OET trials. He was able to show that Rasch analyses could be
used to confirm the unidimensionality of a listening test which might at first sight be
thought to be testing different dimensions of listening (McNamara, 1991) and demonstrated that Rasch could be used as a powerful means of examining underlying construct
issues in communicative language tests, particularly the role of judgements of grammatical accuracy in the assessment of speaking and writing (McNamara, 1990b). Henning
(1992) also defended Rasch against what he argued were misunderstandings of the
notion of unidimensionality in Rasch measurement.
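The ‘tests of the failure of its own assumption’ mentioned above, and appealed to in these defences, are usually operationalized as residual-based fit statistics. What follows is the standard formulation found in the Rasch literature rather than anything reproduced from the papers cited. For each observed response x_{ni} a standardized residual is formed,

z_{ni} = \frac{x_{ni} - E[x_{ni}]}{\sqrt{\mathrm{Var}(x_{ni})}},

with the expectation and variance taken under the estimated model, and the unweighted (‘outfit’) and information-weighted (‘infit’) mean squares for item i are

\mathrm{outfit}_i = \frac{1}{N} \sum_{n=1}^{N} z_{ni}^{2}, \qquad \mathrm{infit}_i = \frac{\sum_{n} \mathrm{Var}(x_{ni})\, z_{ni}^{2}}{\sum_{n} \mathrm{Var}(x_{ni})}.

Both have an expected value near 1; values well above 1 flag items that discriminate less than the model assumes, or that appear to respond to something other than the modelled dimension, while values well below 1 flag over-predictable items. Claims about the unidimensionality of a test, of the kind debated here, are typically supported by showing that fit statistics of this sort remain within acceptable bounds.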
The controversies that Rasch measurement could trigger were illustrated in the
Australian context9 by the development and reception of the Interview Test of English as a
Second Language (ITESL) (Adams, Griffin, & Martin, 1987; Griffin, Adams, Martin, &
Tomlinson, 1988). This test was being developed for the purpose of placing elementary and
intermediate level students in general language courses for adult immigrants to Australia
within the Adult Migrant English Program (AMEP). The teachers in the program did not
like it. The grounds for their rejection of the test were complex. First, they shared the general suspicion of formal testing instruments prevalent among teachers in that program. Secondly, instead
of the communicative focus of the curriculum of the AMEP, ITESL focused on accuracy in
the production of individual grammatical structures in the context of very controlled, short
spoken tasks. As we have seen, assessment in the program to that time involved the interview-based Australian Second Language Proficiency Ratings (ASLPR) (Ingram & Wylie,
1979), underpinned by a holistic, integrated, performance-based view of proficiency.
ITESL was a radical departure from existing practice and caused a storm of controversy
among the teachers who were meant to use it, as it went against their ideas of language and
language learning. Thirdly, the test had been developed not by teachers, but by psychometricians working in the measurement division of the Education Ministry in Melbourne,
Patrick Griffin and Ray Adams, using the Basic Rasch model with its (for the uninitiated)
daunting conceptual and statistical complexity. Trialling and subsequent Rasch analysis
indicated that the test items formed a continuum of difficulty. In order to understand better
the construct of the test, Griffin and Adams had consulted with some ESL teachers in the
AMEP whose views on language teaching were felt by others to be rather conservative,
emphasizing mastery of grammatical forms. Griffin et al. (1988) went so far as to argue that
their research confirmed the existence of a hypothesized ‘developmental dimension of
grammatical competence ... in English S[econd] L[anguage] A[cquisition]’ (p.12). This
was a red rag to a bull for researchers working in second language acquisition in Australia,
especially those whose work was principally in support of the AMEP, for example Geoff
Brindley in Sydney and David Nunan, the senior figure in the research centre which supported the AMEP in Adelaide. Nunan (1987) published a strong attack on the test, arguing
that the psychometric analysis underlying the development of the test was so powerful that
it obscured substantive construct issues. This objection fell on fertile ground among the
teachers, who were suspicious of the psychometrics in the test, and Rasch became something of a bogey word among them.
Summary of exploration of Rasch in second language
testing in the 1980s
The situation at the beginning of the 1990s then was as follows. There were several places
in the world where second language testing had begun to engage with Rasch measurement:
1. In the Netherlands, John de Jong and others at CITO had begun an exploration of the use of Rasch modelling in data from performance on foreign language tests (De Jong, 1983, 1991; De Jong & Glas, 1987).
2. In the United States, the work of Grant Henning and those he had influenced focused on exploring the potential of the Basic Rasch model to illuminate dichotomously scored data from the English as a Second Language Placement Examination (ESLPE), a test for students entering UCLA (Chen & Henning, 1985; Henning, 1984, 1988b; Henning, Hudson, & Turner, 1985; Lynch, Davidson, & Henning, 1988); the Partial Credit model had also been used with writing (Henning & Davidson, 1987) and self-assessment data (Davidson & Henning, 1985). Familiarity among language testers with IRT in general, and with Rasch in particular, was helped by the exposure to it at the 1985 and 1988 LTRCs and associated workshops. The use of Rasch was encountering criticism on both psychometric and applied linguistic grounds.
3. In the United Kingdom, the Rasch model had been used in the development and validation of L1 tests of reading and writing, although this was the subject of intense criticism (Goldstein, 1979). Interest in the applications of Rasch measurement to second language testing was slowly growing, particularly in Edinburgh, but this too was subject to sharp critique, for example in a book written by Robert Wood, a colleague of Goldstein, which reviewed psychometric issues in second language testing on behalf of those responsible for the Cambridge language tests (Wood, 1991).
4. In Australia, Griffin et al. (1988) had developed a controversial test of oral proficiency in ESL using the Partial Credit model, and McNamara (1990a, 1990b) had used the Partial Credit model in the development of a specific purpose communicative test of ESL for health professionals.
Language testing research papers involving Rasch, 1984–1989
The discussion of the context of the 1980s in which Rasch measurement first appeared in
the language testing literature helps us to understand the picture of published research
emerging from the 15 or so papers on language testing using Rasch measurement published in the journal Language Testing10 in the period 1984–1989. They were mostly
introductions and explorations of the potential of this new tool, contrasting it with classical measurement theory, and defending the model against theoretical objections. Most
studies used the Basic Rasch model, and hence were limited to dichotomously scored
data, which meant that tests of grammar and vocabulary, and to a lesser extent of listening and reading, predominated, though there were some studies of judgements of writing
(Pollitt & Hutchinson, 1987) and of self-assessment using rating scale data (Davidson &
Henning, 1985). There was little exploration as yet of the potential of the model to examine more substantive issues of validity (with the exception of the validity study in the
paper by De Jong, 1983). Table 1 presents the results of a survey of papers published in
the journal in this period. The table presents information on which Rasch model was used
in the analysis, what kinds of language tests were involved (that is, focusing on which
language skills), where the research was carried out, whether the article focused exclusively on Rasch or whether it was one of several psychometric procedures used, and the
focus of the paper in relation to Rasch – whether Rasch was simply used as part of a
larger argument in the paper, or whether it was itself the focus of the paper’s discussion.
This framework will also be used in analysis of publications from subsequent periods, to
permit comparison across the periods.
Table 1. Published language testing journal research using Rasch, 1984–1989

Model used: Basic (8); Rating scale/Partial credit (3); Unknown (4). N=15
Skills in test: Speaking, Writing (2); Reading, Listening (3); Other or more than one (9). N=14a
Author affiliation: Australia, NZ (3); USA (6); UK, Europe (6). N=15
Role of Rasch: Primary (9); One of several (3); Marginal (3). N=15
Function of Rasch: Discuss (5); Use (5); Both (5). N=15

Note: a The reason that the total number of articles in this row is lower than in other rows is that in some of the discussion articles, no language skill is specifically targeted. This is also true in tables below (Table 3 and Table 4) summarizing papers from later periods.
Developments in the 1990s: Enter FACETS
Late in 1990 there was a significant development. Ray Adams had just come back to
Melbourne from completing his PhD in Chicago (Adams, 1989) and was aware of the
significance of Mike Linacre’s work on multi-faceted Rasch measurement, implemented
through the FACETS program (Linacre, 2009), the result of Linacre’s PhD (Linacre,
1989) which he had completed while Adams was in Chicago. Adams helped McNamara
understand the radical nature of the development, and its potential for illuminating fundamental issues in performance assessment in second languages became clear. They
wrote a paper (McNamara & Adams, 1994) demonstrating FACETS analysis using data
from the IELTS writing test, and presented it at the 1991 Language Testing Research
Colloquium at ETS; it was the first paper in language testing to use FACETS. The paper
proved controversial, as what it showed about IELTS was that with a single rater, the
chances of a borderline candidate gaining a different score for the same performance
with a different rater were unacceptably high, and so the IELTS practice of a single rating
of speaking and writing was hard to defend, a point made by John de Jong who was a
discussant to the paper. In fact, FACETS had revealed a problem that was common to all
performance tests, the vulnerability of the score to rater effects, a point that as Linacre
had pointed out in his PhD thesis had been recognized for over 100 years, but conveniently forgotten.
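The model implemented in FACETS can be sketched compactly. What follows is a standard statement of Linacre’s many-facet rating scale formulation, given here for orientation rather than quoted from the McNamara and Adams paper. For person n rated by rater j on item or task i, the probability of receiving category k rather than category k-1 is modelled as

\log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k,

where \theta_n is the candidate’s ability, \delta_i the difficulty of the item or task, \alpha_j the severity of the rater, and \tau_k the threshold for moving from category k-1 to category k on the rating scale. In this rating scale formulation the thresholds \tau_k are shared across items; a partial credit formulation replaces them with item-specific thresholds \tau_{ik}. Because rater severity enters the model as a parameter in its own right, the analysis can estimate how far a single harsh or lenient rater shifts the expected score of a borderline candidate, which is exactly the issue raised by the single-rating design of IELTS.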
Language testing researchers in the United States, especially the advocates of earlier
Rasch models, quickly saw the potential of multi-faceted Rasch measurement. For example, Stansfield and Kenyon used FACETS to scale speaking tasks based on the ACTFL
Guidelines and the ILR scale.11 The scaling was found to fit the ACTFL/ILR model and
the authors argued that it constituted a validation of the ACTFL/ILR scale. Others
remained cautious about Rasch because of their acceptance of the critiques within psychometrics of Rasch assumptions, and their preference for the so-called two- and three-parameter IRT models; Bachman and Pollitt aired the opposing sides on this issue in the
late 1980s at meetings of the advisory group for the Cambridge-TOEFL comparability
study (Bachman, Davidson, Ryan, & Choi, 1995). A further opportunity for dialogue
around these issues arose in 1992, when McNamara, on an extended visit to UCLA, was
invited by Bachman to co-teach courses on IRT in which Rasch methods, including
multi-faceted Rasch measurement, were taught in contrast with two-parameter IRT
models for dichotomous data and Generalizability Theory for judge-mediated data. This
resulted in the book Measuring Second Language Performance (McNamara, 1996)
which aimed to introduce Rasch methods, including multi-faceted Rasch measurement,
to a general language testing readership, emphasizing its potential for exploring data
from communicative assessments.
In the United Kingdom, the most influential language testing agency, the University
of Cambridge Local Examinations Syndicate (UCLES), despite on the one hand the psychometric caution represented by the review by Wood (1991), and on the other a certain
suspicion of psychometric methods of the type used by ETS as too ‘American’, began to
use Rasch measurement. Alastair Pollitt joined UCLES in 1990, and initiated a project to
use Rasch methods to calibrate and equate the first versions of the recently introduced
IELTS; the calibration and item-banking system now used in Cambridge was developed
soon afterwards. Neil Jones joined Cambridge in 1992, having completed his PhD at
Edinburgh, and soon became responsible for this work.
In Australia, the establishment of what was to become the Language Testing Research
Centre (LTRC) at Melbourne as a result of the adoption in 1987 by the Australian
Government of the National Policy on Languages (Lo Bianco, 1987) led to the creation
of another centre involved in the use of Rasch models (McNamara, 2001). Academics working in language testing in university contexts tend to be solitary figures;
located as they are in applied linguistics programs for the most part, it is unlikely that any
program will be large enough to have more than a single member of staff with language
testing as a speciality. Here, several language testing researchers were working together
carrying out externally funded research, a highly unusual situation. The establishment of
the LTRC met a real need for expertise in language assessment in Australia, and a number of projects were begun involving performance assessments which lent themselves to
analysis using Rasch methods, particularly multi-faceted Rasch measurement. Melbourne
became a centre for Rasch-based language testing research, as figures on publications to
be presented below confirm.
The entry of multi-faceted Rasch measurement had a strong influence on the Rasch
wars. By the early part of the 1990s, the growing communicative movement in language
teaching meant that tests of the ability to speak and write in simulated real-world contexts became more and more central to language testing, with a resultant change in focus
away from dichotomously scored tests towards performance assessments. Multi-faceted
Rasch measurement provided a powerful tool to examine rater characteristics at the level
of individual raters, something which Generalizability Theory could in part match, but
less flexibly (Lynch & McNamara, 1998). Rater characteristics which were now open for
detailed research using Rasch methods included relative severity or leniency; degree of
consistency in scoring; the influence of rater training; the influence of professional background; and consistency over time. Other aspects of the rating situation such as the
effects on scores of task and mode of delivery (face-to-face vs. technologically mediated) could also be explored, as well as the interaction of these facets with aspects of
rater characteristics. It is as if researchers in this field had been handed a very powerful
microscope to examine the complexity of the rating process. Despite this, it was not clear
even by the mid-1990s that multi-faceted Rasch measurement was going to be taken up
more widely. It was not as if the psychometric argument had been resolved: while it is true that in the Partial Credit Model some differences in ‘classical discrimination’ between items are accounted for by the variation in thresholds between items, this is not true for the Rating Scale Model, where there is no variation in thresholds between items, and for dichotomously scored items, the assumption of equal discrimination still holds. By the late 1990s, however, the appeal of multi-faceted Rasch measurement for understanding issues in communicative language testing proved irresistible, and there was a steady uptake in many world centres, including the United States (Table 2).12

Table 2. Changing attitudes to Rasch in the 1990s

             For Rasch                                                 Against Rasch
Period       Australia (LT)   Australia (MPLT)   USA   Rest of the World   USA   Rest of the World
1990–1994    5                6                  1     0                   5     1
1995–1997    4                3                  1     2                   0     0
1998–1999    1                2                  2     2                   0     1
Total        10               11                 4     4                   5     2
The table divides the papers into two basic categories: those assuming or supporting
the use of Rasch modelling; and those arguing against its assumptions. The papers are
further classified to indicate the geographical affiliation of the authors. Australian research
supportive of Rasch modelling is prominent, particularly in the early period; much of this
appeared in the house journal of the Language Testing Research Centre in Melbourne,
Melbourne Papers in Language Testing (MPLT). We can see that in the early 1990s,
research published by researchers working outside Australia was for the most part questioning the use of Rasch models; by the end of the decade that was no longer the case.
Some of the papers that appeared in that time give a feeling for these developments.
Buck’s objections to Rasch on the grounds of the unidimensionality assumption in the
context of the testing of listening (Buck, 1994) and Henning’s defence of Rasch (Henning,
1992) are typical of the debates at the main international conference of language testing
researchers, the Language Testing Research Colloquium, at that time. The cautious interest in multi-faceted Rasch measurement in the United States is demonstrated by the
papers of Bachman and some of his students using FACETS: a jointly authored paper
(Bachman, Lynch, & Mason, 1995) compared analyses using FACETS and G-Theory on
a data set from a test of speaking; Weigle (1994) used FACETS in her PhD thesis to
investigate the effect on the measurement qualities of individual raters who had taken
part in a program of rater training. In Australia, Brown (1995) demonstrated the potential
of FACETS to explore substantial issues of validity in second language performance
assessments in her investigation of the effect of rater background on a workplace related
test of spoken Japanese for tour guides. Raters with tour guiding experience were shown
to be oriented to the assessment criteria differently from those without such experience,
particularly on a task involving handling a problematic relationship with a client (a tourist). The paper raises the question of whose criteria should prevail in language tests
contextualized within workplace tasks – linguistic or real-world criteria, reflecting a
distinction proposed between ‘strong’ and ‘weak’ second language performance assessments (McNamara, 1996).

Table 3 summarizes the language testing research featuring Rasch measurement appearing in the 1990s. The major trends it reveals are: (a) a much greater use of multi-faceted Rasch measurement; (b) in the context of more research on the assessment of speaking and writing; (c) it is now much more often one of several statistical techniques used; (d) Rasch measurement is now mostly a statistical methodology to be used rather than simply discussed; and (e) the over-representation of Australian research.

Table 3. Published journal research using Rasch, 1990–1999

Model used: Basic (11); Rating scale/Partial credit (7); MFRM (17). N=35
Skills in test: Speaking, Writing (22); Reading, Listening (8); Other (6). N=36
Author affiliation: Australia, NZ (22); USA (6); Europe (3); RoW (5). N=36
Role of Rasch: Primary (13); One of several (17); Marginal (6). N=36
Function of Rasch: Discuss (2); Use (23); Both (11). N=36
Program used: FACETS (12); (Con)Quest (4); Others (8). N=24a

Note: a The total for Program used is lower because some authors did not specify the program used.
Character of papers 2000–2009
By about 2000, then, the Rasch wars were essentially over. The acceptance of Rasch
measurement, particularly multi-faceted Rasch measurement, as a useful tool in the
armory of language testing researchers, especially in performance assessments, is
reflected in the summary provided in Bachman’s influential survey of the state of the art
of language testing at the turn of the century (Bachman, 2000). Bachman begins by noting the growing use of Rasch measurement:
IRT has also become a widely used tool in language testing research … the Rasch model, in its
various forms, is still the most widely used in language testing … More recently, the Rasch
multi-facet model has been applied to investigate the effects of multiple measurement facets,
typically raters and tasks, in language performance assessments. (Bachman, 2000, pp. 5–6)
But more significant in this context are his comments on the state of the debate over the
appropriateness of its use:
The abstract technical debate about dimensionality and the appropriateness of different IRT
models has been replaced by a much more pragmatic focus on practical applications, particularly
with respect to performance assessments that involve raters and computer-based tests.
(Bachman, 2000, p. 22)
This is confirmed by the following summary of the publications in the first decade of the current century (Table 4).

Table 4. Published journal research using Rasch, 2000–2009

Model used: Basic (12); Rating scale/Partial credit (4); MFRM (29). N=45a
Skills in test: Speaking, Writing (26); Reading, Listening (6); Other (12). N=44
Author affiliation: Australia, NZ (15); USA (11); UK, Europe (8); RoW (16). N=50
Role of Rasch: Primary (15); One of several (26); Marginal (6). N=47
Function of Rasch: Discuss (5); Use (39); Both (3). N=47

Note: a The totals differ as some authors did not specify the model they used, some discussion papers did not target a specific language skill and some papers were authored by multiple authors from different parts of the world. In total, 47 papers were included in the sample.

As can be seen from the table, the use of Rasch measurement in language testing research appears to have become universally uncontroversial
and routine. No longer is its use restricted to one or two centres; it is used by researchers
in many different countries. Most typically, multi-faceted Rasch measurement is used
with judge-mediated data from communicative language tests, often simply in order to
establish the psychometric qualities of such tests, but also, and more interestingly, to
address substantive validity issues.
Another feature of the current scene is that Rasch is just one of a battery of psychometric tools used in the research, and increasingly, qualitative methods (especially introspection) are used in order to support or interrogate the quantitative findings.
Some examples of these validity studies will give a feel for the current situation. Bonk
and Ockey (2003) used multi-faceted Rasch measurement in their study of a group oral
assessment in English language at a Japanese university, in which groups of three or four
students were assessed in conversation by two raters. The study demonstrated the relatively high degree of variability among the raters, confirming that such assessments are
of modest quality in terms of reliability. The study also addressed the question of rater
change over time, and found that the raters tended to become harsher with experience.
Nevertheless they conclude that the group oral test of the type they studied is, despite its
shortcomings, useful as a general measure of oral proficiency, especially in contexts
where oral skills would not otherwise be assessed.
A paper by Elder, Knoch, Barkhuizen, and von Randow (2005) returned to the unresolved issue of whether the information on the quality of ratings of individual raters
available from multi-faceted Rasch measurement analyses of rater harshness and consistency, and patterns of biased ratings involving particular criteria or particular tasks,
could form the basis of feedback to raters which would improve their subsequent performance. Earlier studies (e.g. Lunt, Morton, & Wigglesworth, 1994; Wigglesworth, 1993)
of the effectiveness of such feedback to raters had proved inconclusive. The study was
carried out using data from ratings of ESL writing in a university screening test in New
Zealand. In this case, the feedback given using Rasch-based estimates of relative leniency, consistency and bias was complemented by qualitative feedback, and the paper
reports on participants’ perceptions of the usefulness of the feedback. The overall finding
was that the feedback was helpful in many but not in all cases.
Brindley and Slatyer (2002) carried out a study using Rasch analysis of factors
affecting task difficulty in listening assessments. The context of the study was the testing of competencies of immigrant adult ESL learners in Australia. The study involved a
comparison of different versions of similar listening content, constructed through
altering macro-variables thought likely to affect the difficulty of the listening task such
as speech rate, whether the material was listened to once or twice, item type (short
answer questions, completion of a table, and sentence completion items), and speech
genre (conversational vs. formal). Rasch calibration to compare the difficulty of different versions was complemented by a qualitative analysis of individual items in terms of
the kind of information necessary to answer the question, the characteristics of the surrounding text and the characteristics of the stem. The study found a complex interaction
of the effects of the variables and the item characteristics, which made the effect of each
variable on its own hard to generalize. The study has important implications for our
understanding of the second language listening process and of problems in designing
listening tests.
Conclusion
Many challenges face language assessment research at the current time. A range of new
topics has emerged. These include, for example, the problems arising in the assessments
of combined cohorts of non-native speakers and native speakers in globalized work and
study environments; the introduction of the assessment of different language skills (e.g.
listening and writing) using integrated test tasks; and the automatic scoring of speech and
writing. In this, the availability of new and more complex Rasch-based analytic tools –
the development of multidimensional Rasch models and the programs (e.g. Conquest) to
implement them – provides opportunities and challenges. The increasing complexity of
the models and programs creates opportunities to explore their potential for application
in language testing contexts; but it also raises the question of the accessibility of these
tools to individuals lacking extensive psychometric training, which will include many
working in language testing research.
This history of the uptake of Rasch measurement within language testing research
has implications for other fields of educational measurement and for measurement
more generally. It demonstrates the potential for Rasch measurement to deal with more
than scaling issues: in language testing research Rasch measurement has been used to
address a range of substantive validity questions. This history also raises the complex
issues involved in joint work between construct specialists and psychometricians.
Increasingly, in educational assessment, there is a need for psychometricians to draw
more extensively on the expertise of subject specialists. Similarly, subject specialists
seeking to develop assessments in their own areas require the skills of psychometricians if those assessments are to be meaningful. Cooperation between individuals with different kinds of training can, however, lead to problematic results, as the controversy over the ITESL test developed by Griffin et al. (1988), discussed above, has shown. Ideally, both kinds of expertise are combined in a single person, as has been the tradition in language testing; but acquiring the necessary expertise is a demanding task.
The necessary complementarity of applied linguistics and measurement expertise in
language test development and language testing research is what characterizes the
field; the history of the uptake of Rasch measurement within language testing demonstrates both the consequences and the difficulties of this complementarity.
Acknowledgements
An earlier version of this paper was given as a plenary address at the Pacific Rim Objective
Measurement Symposium (PROMS 2008), Ochanomizu University, Tokyo, August 2008. We are
grateful to Mark Wilson, Ray Adams, John De Jong, Charles Stansfield, Dorry Kenyon, Nick
Saville, Neil Jones, Kyle Perkins, Thom Hudson, Fred Davidson and Brian Lynch for advice on
historical and technical points, and to the reviewers for the helpful insights they gave.
Notes
1. This term is no longer used ‘because it confounds this area with the “State-Trait” issue in
psychology’ (Mark Wilson, personal communication, 31 May 2011).
2. While the community of ‘professional’ language testers came to Rasch models late, early work
on tests of reading preceded this: Georg Rasch himself developed the Basic Rasch model as
part of the construction of a series of vertically equated reading tests (Rasch, 1960/1980), and
Rentz and Bashaw (1977) used Rasch methods to develop a reading scale.
3. Conference of the American Educational Research Association.
4. As well as a paper co-written by Perkins and Miller (1984) comparing Rasch and Classical
Test Theory which had appeared in the first issue of Language Testing.
5. This was the 1988 LTRC, held at the University of Illinois, Urbana-Champaign.
6. Similar seminars were organized in the United States by the Educational Testing Service in
Princeton.
7. This paper was presented at a conference and subsequently published in a book of proceedings.
8. Computer Adaptive Testing.
9. Rasch models were by far the most prominent representative of Item Response Theory models in the Australian educational measurement context at that time.
10. This was the sole journal dedicated to research on language testing in that period. It commenced publication in 1984. A number of papers on Rasch appeared in other collections
and journals, but the tables in the present paper are restricted to surveying publications in
dedicated language testing journals, as explained in the Introduction; certain other important
papers, particularly in the early period, are mentioned in the body of the article.
11. The study was done in 1990 and presented at AERA in 1991, and at the LTRC in 1992; it was
published as Stansfield and Kenyon (1995).
12. This transition to greater acceptance of Rasch was reflected in the educational measurement
community in the United States more generally, as Stansfield recollects (personal communication, 22 March 2011): ‘The Rasch SIG [Special Interest Group] was not viewed favorably at
AERA. In fact, its members were viewed as lightweight and somewhat renegade … Suffice it
to say that professors of educational measurement typically said it was not worth studying or
using. However, it prevailed and grew due to many factors, including the wonderful yet strong
personality of Ben Wright, the work of Mike Linacre, the practicality and conceptual simplicity of the method, the high cost of 2 and 3 parameter test analysis software and the very large
N required for using that software. As a result of all of the above, the use of Rasch analysis
in research and testing in school districts throughout the US became commonplace. All of a
sudden, you turned around and everybody was using Winsteps or FACETS.’ Nevertheless
ETS still has a preference for non-Rasch IRT models, and in general, ‘the controversy has not
faded’ (Mark Wilson, personal communication, 31 May 2011).
References
Adams, R. (1989). Measurement error and its effects upon statistical analysis. Unpublished doctoral dissertation, University of Chicago.
Adams, R., Griffin, P., & Martin, L. (1987). A latent trait method for measuring a dimension in
second language proficiency. Language Testing, 4(1), 9–27.
Adams, R., & Siek Toon, K. (1993). Quest: The interactive test analysis system. Melbourne: Australian Council for Educational Research.
Alderson, C. (1986). Innovations in language testing? In M. Portal (Ed.), Innovations in language
testing (pp. 93–105). Windsor, Berks: NFER-Nelson.
Alderson, C. (1987). An overview of ESL/EFL testing in Britain. In C. Alderson, K. Krahnke, & C.
Stansfield (Eds.), Reviews of English language proficiency tests (pp. 3–4). Washington, DC:
Teachers of English to Speakers of Other Languages.
Alderson, C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge: Cambridge University Press.
Andrich, D. A. (1978). A rating scale formulation for ordered response categories. Psychometrika,
43, 561–573.
Bachman, L. (2000). Modern language testing at the turn of the century: Assuring that what we
count counts. Language Testing, 17(1), 1–42.
Bachman, L., Davidson, F., Ryan, K., & Choi, I.-C. (1995). An investigation into the comparability of two tests of English as a foreign language: The Cambridge-TOEFL comparability study.
Cambridge: Cambridge University Press.
Bachman, L., Lynch, B., & Mason, M. (1995). Investigating variability in tasks and rater judgements in a performance test of foreign language speaking. Language Testing, 12(2), 238–257.
Baker, R. (1987). An investigation of the Rasch model in its application to foreign language proficiency testing. Unpublished PhD thesis. University of Edinburgh, Edinburgh.
Baker, R. (1997). Classical test theory and item response theory in test analysis. LTU Special
Report No. 2. Lancaster: Centre for Research in Language Education.
Bartholomew, D. J., Deary, I. J., & Lawn, M. (2009). Sir Godfrey Thomson: A statistical pioneer.
Journal of the Royal Statistical Society. Series A (Statistics in Society), 172(2), 467–482.
Bonk, W., & Ockey, G. (2003). A many-facet Rasch analysis of the second language group oral
discussion task. Language Testing, 20(1), 89–110.
Brindley, G., & Slatyer, H. (2002). Exploring task difficulty in ESL listening assessment. Language Testing, 19(4), 369–394.
Brown, A. (1995). The effect of rater variables in the development of an occupation-specific
language performance test. Language Testing, 12, 1–15.
Buck, G. (1994). The appropriacy of psychometric measurement models for testing second language listening comprehension. Language Testing, 11(2), 145–170.
Canale, M. (1986). Theoretical bases of communicative approaches to second-language teaching
and testing. Applied Linguistics, 1, 1–47.
Chen, Z., & Henning, G. (1985). Linguistic and cultural bias in language proficiency tests. Language Testing, 2(2), 155–163.
Clark, J. L. D., & Clifford, R. T. (1988). The FSI/ILR/ACTFL proficiency scales and testing
techniques: Development, current status and needed research. Studies in Second Language
Acquisition, 10(2), 129–147.
Davidson, F. (1991). Statistical support for training in ESL composition rating. In L. Hamp-Lyons (Ed.),
Assessing second language writing in academic contexts (pp. 155–164). Norwood, NJ: Ablex.
Davidson, F., & Henning, G. (1985). A self-rating scale of English difficulty: Rasch scalar analysis
of items and rating categories. Language Testing, 2(2), 164–179.
Davies, A. (2008). Assessing academic English: Testing English proficiency 1950–1989 – the
IELTS solution. Cambridge: Cambridge University Press.
De Jong, J. H. A. L. (1983). Focusing in on a latent trait: An attempt at construct validation using
the Rasch model. In J. Van Weeren (Ed.), Practice and problems in language testing 5. Papers
presented at the International Language Testing Symposium (Arnhem, Netherlands, March
25–26, 1982) (pp. 11–35). Arnhem: Cito.
De Jong, J. H. A. L. (1991). Defining a variable of foreign language ability: An application of item
response theory. Unpublished PhD thesis, Twente University, The Netherlands.
De Jong, J. H. A. L., & Glas, C. A. W. (1987). Validation of listening comprehension tests using
item response theory. Language Testing, 4(2), 170–194.
Elder, C., Knoch, U., Barkhuizen, G., & von Randow, J. (2005). Individual feedback to enhance
rater training: Does it work? Language Assessment Quarterly, 2(3), 175–196.
Goldstein, H. (1979). Consequences of using the Rasch Model for educational assessment. British
Educational Research Journal, 5, 211–220.
Griffin, P. (1985). The use of latent trait models in the calibration of tests of spoken language in
large-scale selection-placement programs. In Y. P. Lee, A. C. Y. Fok, R. Lord, & G. Low
(Eds.), New directions in language testing (pp. 149–161). Oxford: Pergamon.
Griffin, P., Adams, R., Martin, L., & Tomlinson, B. (1988). An algorithmic approach to prescriptive assessment in English as a second language. Language Testing, 5(1), 1–18.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response
theory. Newbury Park, CA: Sage.
Hamp-Lyons, L. (1989). Applying the partial credit model of Rasch analysis: Language testing
and accountability. Language Testing, 6(1), 109–118.
Henning, G. (1984). Advantages of latent trait measurement in language testing. Language Testing, 1(2), 123–133.
Henning, G. (1987). A guide to language testing. Development, evaluation, research. Cambridge,
MA: Newbury House.
Henning, G. (1988a). An American view on ELTS. In A. Hughes, D. Porter, & C. Weir (Eds.),
ELTS Validation Project: Proceedings of a conference held to consider the ELTS Validation
Project Report. English Language Testing Service Research Report 1 (pp. 84–92). London:
British Council/University of Cambridge Local Examinations Syndicate.
Henning, G. (1988b). The influence of test and sample dimensionality on latent trait person ability
and item difficulty calibration. Language Testing, 5(1), 83–99.
Henning, G. (1992). Dimensionality and construct validity of language tests. Language Testing,
9(1), 1–11.
Henning, G., & Davidson, F. (1987). Scalar analysis of composition ratings. In K. Bailey, T. Dale,
& R. Clifford (Eds.), Language testing research: Selected papers from the 1986 Colloquium.
Monterey, CA: Defense Language Institute.
Henning, G., Hudson, T., & Turner, J. (1985). Item response theory and the assumption of unidimensionality for language tests. Language Testing, 2(2), 141–154.
Ingram, D. E., & Wylie, E. (1979). Australian Second Language Proficiency Ratings (ASLPR). In
Adult Migrant Education Program Teachers Manual. Canberra: Department of Immigration
and Ethnic Affairs.
Jones, N. (1991). Test item banker: An item bank for a very small micro. In C. Alderson & B.
North (Eds.), Language testing in the 1990s (pp. 247–254). London: Modern English Publications/British Council/Macmillan.
Jones, N. (1992). An item bank for testing English language proficiency: Using the Rasch model to
construct an objective measure. Unpublished PhD thesis, University of Edinburgh.
Lawley, D. N. (1943). On problems connected with item selection and test construction. Proceedings of the Royal Society of Edinburgh, 61, 273–287.
Linacre, J. M. (1989). Many faceted Rasch measurement. Unpublished doctoral dissertation, University of Chicago.
Linacre, J. M. (2009). Facets Rasch measurement computer program. Chicago: Winsteps.com.
Lo Bianco, J. (1987). National policy on languages. Canberra: Australian Government Publishing
Service.
Lunt, H., Morton, J., & Wigglesworth, G. (1994). Rater behaviour in performance testing: Evaluating the effect of bias feedback. Paper presented at the 19th annual congress of the Applied Linguistics Association of Australia, July 1994, University of Melbourne.
Lynch, B., Davidson, F., & Henning, G. (1988). Person dimensionality in language test validation.
Language Testing, 5(2), 206–219.
Lynch, B., & McNamara, T. (1998). Using G-theory and Many-facet Rasch measurement in the
development of performance assessments of the ESL speaking skills of immigrants. Language
Testing, 15(2), 158–180.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
McNamara, T. (1990a). Assessing the second language proficiency of health professionals.
Unpublished PhD thesis, The University of Melbourne.
McNamara, T. (1990b). Item response theory and the validation of an ESP test for health professionals. Language Testing, 7(1), 52–75.
McNamara, T. (1991). Test dimensionality: IRT analysis of an ESP listening test. Language Testing, 8(2), 139–159.
McNamara, T. (1996). Measuring second language performance. London & New York: Longman.
McNamara, T. (2001). Ten years of the Language Testing Research Centre. In C. Elder, A. Brown,
E. Grove, K. Hill, N. Iwashita, T. Lumley, T. McNamara, & K. O’Loughlin (Eds.), Experimenting with uncertainty: Essays in Honor of Alan Davies (pp. 5–10). Cambridge: Cambridge
University Press.
McNamara, T., & Adams, R. (1994). Exploring rater behaviour with Rasch techniques. Selected
Papers of the 13th Annual Language Testing Research Colloquium (LTRC). Princeton, NJ:
Educational Testing Service, International Testing and Training Program Office (also available as ERIC Document Reproduction Service No. ED 345 498).
North, B. (1993). The development of descriptors on scales of proficiency: Perspectives, problems,
and a possible methodology. NFLC Occasional Paper. Washington, DC: National Foreign
Language Center.
North, B. (1995). The development of a common framework scale of descriptors of language proficiency based on a theory of measurement. Unpublished PhD thesis, Thames Valley University.
Nunan, D. (1987). Methodological issues in research. In D. Nunan (Ed.), Applying second language acquisition research (pp. 143–171). Adelaide: National Curriculum Resource Centre.
Perkins, K., & Miller, L. D. (1984). Comparative analyses of English as a Second Language reading comprehension data: Classical test theory and latent trait measurement. Language Testing,
1(1), 21–32.
Pollitt, A., & Hutchinson, C. (1987). Calibrating graded assessments: Rasch partial credit analysis
of performance in writing. Language Testing, 4(1), 72–92.
Pollitt, A., Hutchinson, C., Entwistle, N., & De Luca, C. (1985). What makes exam questions difficult? An analysis of ‘O’ grade questions and answers (Research Report for Teachers No. 2).
Edinburgh: Scottish Academic Press.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and
afterword by B.D. Wright. Chicago: University of Chicago Press.
Rentz, R. R., & Bashaw, W. L. (1977). The National Reference Scale for Reading: An application
of the Rasch Model. Journal of Educational Measurement, 14(2), 161–179.
Skehan, P. (1989). Language testing. Part II. Language Teaching, 22(1), 1–13.
Stansfield, C. W. (Ed.). (1986). Technology and language testing: A collection of papers from
the Seventh Annual Language Testing Research Colloquium. Washington, DC: Teachers of
English to Speakers of Other Languages.
Stansfield, C. W., & Kenyon, D. M. (1995). Comparing the scaling of speaking tasks by language
teachers and by the ACTFL guidelines. In A. Cumming & R. Berwick (Eds.), The concept of
validation in language testing (pp. 124–153). Clevedon, Avon: Multilingual Matters.
van Weeren, J. (1983). Practice and problems in language testing 5. Non-classical test theory;
Final examinations in secondary schools. Papers presented at the International Language
Testing Symposium (Arnhem, Netherlands, 25–26 March 1982).
Weigle, S. C. (1994). Effects of training on raters of English as a second language compositions:
Quantitative and qualitative approaches. Unpublished PhD thesis, University of California, Los
Angeles.
Widdowson, H. G. (1996). Linguistics. Oxford: Oxford University Press.
Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in
assessing oral interaction. Language Testing, 10(3), 305–323.
Wood, R. (1991). Assessment and testing. Cambridge: Cambridge University Press.
Wright, B. D., & Andrich, D. A. (1987). Rasch and Wright: The early years. Rasch Measurement
Transactions, Pre-History, pp. 1–4.
Wu, M. L., Adams, R., & Wilson, M. (1998). ACER Conquest: Generalised item response modelling software. Melbourne: ACER Press.