Study: Collecting error correction data from Wikipedia

In connection with something else I was working on, I came across a paper: Pfeil, Ulrike, Panayiotis Zaphiris, and Chee Siang Ang. “Cultural differences in collaborative authoring of Wikipedia.” Journal of Computer-Mediated Communication 12.1 (2006): 88-113. The main part of the article is not terribly interesting or particularly convincing, but their data gave me a good idea:

[Figure: wiki_comparative — revision-type data per language Wikipedia from Pfeil et al.]

Naturally, I looked mostly at the linguistic categories, grammar and spelling. Here I have in mind a table from Seymour, Aro, and Erskine. “Foundation literacy acquisition in European orthographies.” British Journal of Psychology 94 (2003): 143-174.

[Figure: foundation lit — table from Foundation literacy acquisition in European orthographies]

Of the four languages they chose, French is the hardest according to the table, so one would expect the most spelling corrections there, followed by Dutch, and finally German. That is exactly the order seen in the data from their little Wikipedia study. Japanese also has a low spelling correction percentage, so the obvious question is: does that fit with how regular Japanese orthography is? I looked around on Wikipedia but could not immediately find anything about the regularity of Japanese spelling. It is well known, though, that Japanese has a cumbersome written language with three different systems of characters.

The next stop was Google Scholar, where one can find all sorts of good things.

Hino, Yasushi, et al. “The effects of polysemy for Japanese katakana words.” Reading and Writing 10.3 (1998): 395-424.
Because each katakana character corresponds to a single syllable (mora), katakana is considered to be a shallow orthography which has virtually no spelling-to-sound irregularities. Thus, in terms of the dual-route model, the nonlexical route would be able to produce correct phonological codes for all these words. As noted, according to Balota and Chumbley’s arguments, if both word frequency and polysemy effects are due to lexical selection, both of these effects should not vary in size across tasks in which lexical selection is fully involved. Our use of a completely regular orthography, however, changes those predictions. First of all, with respect to word frequency effects, the cross-task equivalence would not be expected because, as noted, a dual-route analysis suggests that low frequency words often do not require lexical selection. Thus, the expectation is that there would be a smaller frequency effect in naming than in lexical decision. More importantly, with respect to polysemy effects, the cross-task equivalence should hold for high frequency words because, for these words, the lexical route generates phonological codes much faster than the nonlexical route, meaning that the contribution of the nonlexical route to performance in the naming task would be minimal. For low frequency katakana words, however, the expectation would be that the polysemy effect should be smaller in naming than in lexical decision because of the large contribution of the nonlexical route in naming. [my emphasis]
Delattre, Marie, Patrick Bonin, and Christopher Barry. “Written spelling to dictation: Sound-to-spelling regularity affects both writing latencies and durations.” Journal of Experimental Psychology: Learning, Memory, and Cognition 32.6 (2006): 1330.
The dual-route model of spelling production (e.g., Ellis, 1982) proposes that two processing systems operate in parallel: a lexical route that retrieves spellings of known words from a memory store of word-specific knowledge and a nonlexical (or assembled) route that generates spellings using a process of sublexical sound-to-spelling conversion. The assembled spelling route would be efficient in languages whose orthographies have predictable or consistent orthographic-to-phonological correspondences (such as Turkish, Italian, and Japanese kana) but would be considerably less effective for English and French, whose orthographies are characterized by highly inconsistent relationships (e.g., the vowel /i:/ is spelled in many different ways in English words, as in eel, tea, theme, thief, Keith, people, me, key, quay, ski, etc.). There are many irregular and some almost arbitrarily spelled words in English (e.g., pint, yacht) and French (e.g., fraise, monsieur). The lexical route would work for all known words (irrespective of regularity) but could not provide spellings for new words or nonwords. The assembled route would work for nonwords but would often produce phonologically plausible errors (PPEs), particularly to irregular words, such as yacht (YOT) and monsieur (MESSIEU). [my emphasis]
Buchanan, Lori, and Derek Besner. “Reading aloud: Evidence for the use of a whole word nonsemantic pathway.” Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale 47.2 (1993): 133.
Reading Japanese: Written Japanese consists of three distinct scripts. The logographic Kanji represents content words while the syllabic Kana scripts consist of Hiragana which represents grammatical morphemes and Katakana which represents borrowed words such as television and computer. Both Hiragana and Katakana have very consistent spelling-sound correspondences (i.e., they are shallow scripts). Transcribing a word which normally appears in one Kana script into the other Kana script produces a pseudohomophone, a word that is orthographically unfamiliar at the whole word level but retains its original pronunciation. Since readers must rely entirely on the assembled routine to read such character strings aloud, any evidence of priming for these words is evidence that the use of assembled routine can result in priming. [my emphasis]

So, without knowing anything more about Japanese kana (katakana + hiragana), it appears to be very regular, and one would therefore expect relatively few spelling corrections. It is a shame that Japanese also uses kanji; otherwise its orthography would be almost completely consistent, which would lead one to predict that they correct fewer errors than the Germans do. As it is, their correction percentages were roughly the same.

The next step must be to look at more Wikipedia pages to confirm the pattern, and to look at other languages (especially Danish and English). Their method requires quite a bit of manual work, though, so it is not optimal. Perhaps a keyword-based approach would work? When you make an edit on Wikipedia, you can write a few words about what the edit does, and many of those who fix spelling errors probably note it there. That makes it possible to estimate the number of spelling corrections from keywords in the edit summaries, as sketched below. It still requires knowing the languages under study, though; otherwise you do not know which keywords to search for.
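A minimal sketch of the keyword idea, using the public MediaWiki API's recentchanges list, which every language edition exposes. The keyword lists below are my own guesses and would have to be compiled and validated by speakers of each language (as noted above), and the helper name spelling_fix_fraction is purely illustrative:

```python
import requests

# Guessed keyword lists per language edition -- a speaker of each
# language would need to compile and validate these.
SPELLING_KEYWORDS = {
    "en": ["typo", "spelling"],
    "da": ["stavefejl", "slåfejl"],
    "de": ["rechtschreibung", "tippfehler"],
    "fr": ["orthographe", "faute"],
}

def spelling_fix_fraction(lang, limit=500):
    """Return the fraction of the latest `limit` edits on the given
    language Wikipedia whose edit summary mentions a spelling keyword."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "list": "recentchanges",
            "rcprop": "comment",
            "rctype": "edit",   # skip log entries and page creations
            "rclimit": limit,   # 500 is the maximum for anonymous clients
            "format": "json",
        },
    )
    changes = resp.json()["query"]["recentchanges"]
    comments = [c.get("comment", "").lower() for c in changes]
    hits = sum(any(kw in c for kw in SPELLING_KEYWORDS[lang]) for c in comments)
    return hits / len(comments) if comments else 0.0

if __name__ == "__main__":
    for lang in SPELLING_KEYWORDS:
        print(lang, round(spelling_fix_fraction(lang), 3))
```

Edit summaries are free text, so a count like this will both undercount (edits with empty summaries) and overcount (keywords used in other contexts); a manually labeled sample per language would be needed to calibrate it against the categories in the original study.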

Doing the above would, however, mean that the data would no longer be comparable with theirs. Their data collection method is poor (it takes too much time), but the idea is good: use data from Wikipedia to estimate how often people correct errors.