Machine translation evaluation and language learner evaluation have been associated for many years, for example [5, 7]. One attractive aspect of language learner evaluation which recommends it to machine translation evaluation is the expectation that the produced language is not perfect, well-formed language. Language learner evaluation systems are geared towards determining the specific kinds of errors that language learners make. Additionally, language learner evaluation, more than many MT evaluations, seeks to build models of language acquisition which could parallel (but not correspond directly to) the development of MT systems. These models are frequently feature-based and may provide informative metrics for diagnostic evaluation for system designers and users.

In a recent experiment along these lines, Jones and Rusk [2] present a reasonable idea for measuring intelligibility: trying to score the English output of translation systems using a wide variety of metrics. In essence, they are looking at the degree to which a given output is English and comparing this to human-produced English. Their goal was to find a scoring function for the quality of English that can enable the learning of a good translation grammar. Their method is to run existing natural language processing applications on the translated data and use their outputs to produce a numeric value indicating the degree of "Englishness". The measures they utilized included syntactic indicators such as word n-grams, the number of edges in the parse (both the Collins and Apple Pie parsers were used), the log probability of the parse, execution of the parse, the overall score of the parse, and so on. Semantic criteria were based primarily on WordNet and incorporated the average minimum hyponym path length, the path-found ratio, and the percentage of words with a sense in WordNet. Other semantic criteria utilized mutual information measures.

Two problems can be found with their approach. The first is that the data was drawn from dictionaries. Usage examples in dictionaries, while they provide great information, are not necessarily representative of typical language use. In fact, they tend to highlight unusual usage patterns or cases. Second, and more relevant to our purposes, is that they were looking at the glass as half full instead of half empty. We believe that our results will show that measuring intelligibility is not nearly as useful as finding a lack of intelligibility. This is not new in MT evaluation, as numerous approaches have been suggested to identify translation errors, such as [1, 6]. In this instance, however, we are not counting errors to come up with an intelligibility score so much as finding out how quickly the intelligibility can be measured. Additionally, we are looking to a field where the essence of scoring is looking at error cases: that of language learning.

The basic part of scoring learner language (particularly second language acquisition and English as a second language) consists of identifying likely errors and understanding their causes. From these, diagnostic models of language learning can be built and used to effectively remediate learner errors; [3] provide an excellent example of this. Furthermore, language learner testing seeks to measure the student's ability to produce language which is fluent (intelligible) and correct (adequate or informative).
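To make the flavor of such feature-based "Englishness" scoring concrete, the following minimal sketch computes two WordNet-based indicators of the kind Jones and Rusk describe: the percentage of words with a sense in WordNet and an approximation of the average minimum path length through the hypernym hierarchy. It assumes NLTK with its WordNet and tokenizer data installed; the feature names and the exact formulation are illustrative, not the original implementation.

# Sketch of two WordNet-based "Englishness" features in the spirit of Jones and
# Rusk: the fraction of words with a WordNet sense, and the average minimum
# depth of those words in the hypernym hierarchy (an approximation of the
# "average minimum hyponym path length"). Requires the NLTK 'wordnet' and
# 'punkt' data packages; names and values here are illustrative only.
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

def wordnet_features(text):
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    if not tokens:
        return {"pct_with_sense": 0.0, "avg_min_depth": 0.0}
    with_sense = [t for t in tokens if wn.synsets(t)]
    pct_with_sense = len(with_sense) / len(tokens)
    # Synset.min_depth() is the length of the shortest path to the root.
    depths = [min(s.min_depth() for s in wn.synsets(t)) for t in with_sense]
    avg_min_depth = sum(depths) / len(depths) if depths else 0.0
    return {"pct_with_sense": pct_with_sense, "avg_min_depth": avg_min_depth}

print(wordnet_features("The committee approved the proposal after a short debate."))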
These are the same criteria typically used to measure MT system capability (footnote 1). In looking at different second language acquisition (SLA) testing paradigms, one experiment stands out as a useful starting point for our purposes and serves as the model for this investigation. In their test of language teachers, Meara and Babi [3] looked at assessors making a native speaker (L1) / language learner (L2) distinction in written essays (footnote 2). They showed the assessors essays one word at a time and counted the number of words it took to make the distinction. They found that assessors could accurately attribute L1 texts 83.9% of the time and L2 texts 87.2% of the time, across 180 texts and 18 assessors. Additionally, they found that assessors could make the L1/L2 distinction in less than 100 words. They also learned that it took longer to confirm that an essay was a native speaker's than a language learner's: on average, 53.9 words to recognize an L1 text and only 36.7 words to accurately distinguish an L2 text.

While their purpose was to rate the language assessment process, the results are intriguing from an MT perspective. They attribute the fact that L2 texts took fewer words to identify to the fact that L1 writing "can only be identified negatively by the absence of errors, or the absence of awkward writing." While they could not readily select features, lexical or syntactic, on which evaluators consistently made their evaluation, they hypothesize that there is a "tolerance threshold" for low-quality writing. In essence, once the pain threshold has been reached through errors, missteps or inconsistencies, the assessor can confidently make the assessment.

It is this finding that we use to disagree with the basic premise of Jones and Rusk [2]. Instead of looking for what the MT system got right, it is more fruitful to analyze what the MT system failed to capture, from an intelligibility standpoint. This kind of diagnostic is more difficult, as we will discuss later. We take this as the starting point for assessing the intelligibility of MT output. The question to be answered is: does this apply to distinguishing between expert translation and MT output? This paper reports on an experiment to answer this question. We believe that human assessors key off of specific error types and that an analysis of the results of the experiment will enable us to build a program which automatically detects them.

We started with publicly available data which was developed during the 1994 DARPA Machine Translation Evaluations [8], focusing on the Spanish language evaluation first. The data may be obtained at http://ursula.georgetown.edu (footnote 3). We selected the first 50 translations from each system and from the reference translation. We extracted the first portion of each translation (from 98 to 140 words, as determined by sentence boundaries). In addition, we removed headlines, as we felt these served as distracters. Participants were recruited through the author's workplace, through the author's neighborhood and a nearby daycare center. Most were computer professionals and some were familiar with MT development or use. Each subject was given a set of six extracts, a mix of different machine and human translations. The participants were told to read line by line until they were able to make a distinction between the possible authors of the text: a human translator or a machine translation system.
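The "tolerance threshold" hypothesis suggests a simple reading model: scan the text word by word, accumulate a penalty each time an error cue fires, and decide "machine" as soon as the running total crosses a threshold. The sketch below is only an illustration of that idea under assumed cues, weights, and threshold; it is not the assessment procedure used by Meara and Babi or in our experiment.

# Illustrative word-by-word decision model for the "tolerance threshold" idea.
# The cues, penalty weights, and threshold are hypothetical.
from typing import Callable, List, Tuple

Cue = Tuple[Callable[[str], bool], float]  # (predicate on a word, penalty)

def words_to_decision(tokens: List[str], cues: List[Cue],
                      threshold: float = 2.0) -> Tuple[str, int]:
    """Return ("machine", i) if the threshold is crossed at word i,
    otherwise ("human", len(tokens)) after reading the whole extract."""
    score = 0.0
    for i, tok in enumerate(tokens, start=1):
        score += sum(penalty for fires, penalty in cues if fires(tok.lower()))
        if score >= threshold:
            return "machine", i
    return "human", len(tokens)

# Hypothetical cues: an untranslated source-language word is a strong clue;
# a stray non-ASCII character is a weaker one.
UNTRANSLATED = {"golpistas"}
cues: List[Cue] = [
    (lambda w: w in UNTRANSLATED, 2.0),
    (lambda w: not w.isascii(), 1.0),
]

sample = "to compel to the golpistas to abandon the power".split()
print(words_to_decision(sample, cues))  # decision reached at the untranslated word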
The first twenty-five test subjects were given no information about the expertise of the human translator. The second twenty-five test subjects were told that the human translator was an expert. They were given up to three minutes per text, although they frequently required much less time. Finally, they were asked to circle the word at which they made their distinction. Figure 1 shows a sample text:

The general secretary of the UN, Butros Butros-Ghali, was pronounced on Wednesday in favor of a solution "more properly Haitian" resulting of a "commitment" negotiated between the parts, if the international sanctions against Haiti continue being ineffectual to restore the democracy in that country. While United States multiplied the last days the threats of an intervention to fight to compel to the golpistas to abandon the power, Butros Ghali estimated in a directed report on Wednesday to the general Assembly of the UN that a solution of the Haitian crisis only it will be able be obtained "with a commitment, based on constructive and consented grants" by the parts.

Our first question is: does this kind of test apply to distinguishing between expert translation and MT output? The answer is yes. Subjects were able to distinguish MT output from human translations 88.4% of the time overall. This determination is more straightforward for readers than the native/non-native speaker distinction. There was a degree of variation on a per-system basis, as captured in Table 1. Additionally, as presented in Table 2, the number of words needed to determine that a text was human was nearly twice that of the closest system (footnote 4).

The second question is: does this ability correlate with the intelligibility scores applied by human raters? One way to look at this is that the more intelligible a system's output, the harder it is to distinguish from human output; so, systems with lower scores in this human judgment task should have higher intelligibility scores. Table 3 presents these scores alongside the fluency scores as judged by human assessors. Indeed, the systems with the lowest fluency scores were most easily attributed, and the system with the best fluency score was also the one most often confused with the human translation. Individual articles in the test sample will need to be evaluated statistically before a definite correlation can be determined, but the results are encouraging.

The final question is: are there characteristics of the MT output which enable the decision to be made quickly? The initial results lead us to believe so. Untranslated words (other than proper nouns) were generally immediate clues that a system produced the results. Other factors included incorrect pronoun translation, incorrect preposition translation, and incorrect punctuation. A more detailed breakdown of the selection criteria and the errors occurring before the selected word is currently in process. An area for further analysis is the detail of the post-test interviews. These have consistently shown that the deciders utilized error spotting, although the types and sensitivities of the errors differed from subject to subject. Some errors were serious enough to make the choice obvious, while others had to occur more than once to push the decision above a threshold.
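The per-system figures summarized in Tables 1 and 2 reduce to two statistics over the raw judgments: the proportion of correct attributions and the average number of words read before a decision, where, per footnote 4, texts with no circled word are counted at their full length. The sketch below shows one way to compute them; the record format and the sample values are hypothetical, not the data from our evaluation.

# Sketch of the per-system statistics behind Tables 1 and 2: attribution
# accuracy and average words to decision. Texts where no word was circled are
# counted at their full length (footnote 4). Record layout is hypothetical.
from collections import defaultdict

judgments = [
    # system id, whether the attribution was correct, circled word index
    # (None if no word was marked), and extract length in words
    {"system": "human", "correct": True,  "decision_word": None, "text_len": 120},
    {"system": "sys_a", "correct": True,  "decision_word": 23,   "text_len": 110},
    {"system": "sys_a", "correct": False, "decision_word": 74,   "text_len": 130},
]

def per_system_stats(judgments):
    by_system = defaultdict(list)
    for j in judgments:
        by_system[j["system"]].append(j)
    stats = {}
    for system, items in by_system.items():
        accuracy = sum(j["correct"] for j in items) / len(items)
        words = [j["decision_word"] if j["decision_word"] is not None
                 else j["text_len"] for j in items]
        stats[system] = {"accuracy": accuracy,
                         "avg_words_to_decision": sum(words) / len(words)}
    return stats

print(per_system_stats(judgments))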
Extending this to a new language pair is also desirable, as a language more divergent from English than Spanish might give different (and possibly even stronger) results. Finally, we are working on constructing a program, using principles from Computer Assisted Language Learning (CALL) program design, which aims to duplicate the human ability to distinguish system texts from human translations.

My thanks go to all test subjects and to Ken Samuel for review.

Footnotes:
1. The discussion of whether or not MT output should be compared to human translation output is grist for other papers and other forums.
2. In their experiment, they were examining students learning Spanish as a second language.
3. The data has since been moved to a new location.
4. For those texts where the participants failed to mark a specific spot, the length of the text was included in the average.