
English speakers who are 18 or under use the word ‘like’ in conversation over five times as often as speakers who are over 70; ‘because’ is the most misspelled English word globally; the word ‘love’ is said and written over six times more frequently than the word ‘hate’. We know all of this because of a multibillion-word database called the Cambridge English Corpus.
For learners of English to become proficient, subtle differences can be extremely important.
Claire Dembry
If the Cambridge English Corpus, created by Cambridge University Press, were to be printed on single-sided A4 paper and stacked into a tower, it would stand 600 m high, almost twice the height of the tallest building in the UK. If it were read aloud at an average reading speed, it would take 88,766 hours to read; working 7 hours a day, 5 days a week, that’s 49 years.
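The reading-time figure can be checked with a few lines of arithmetic. The sketch below assumes a 52-week working year with no holidays, which the article does not state; the function name is illustrative only.

```python
# Figures quoted in the article.
HOURS_TO_READ = 88_766
HOURS_PER_DAY = 7
DAYS_PER_WEEK = 5

# Assumption (not in the article): 52 working weeks per year, no holidays.
WEEKS_PER_YEAR = 52


def working_years(total_hours, hours_per_day, days_per_week, weeks_per_year):
    """Convert a total number of hours into years of working time."""
    hours_per_year = hours_per_day * days_per_week * weeks_per_year
    return total_hours / hours_per_year


years = working_years(HOURS_TO_READ, HOURS_PER_DAY, DAYS_PER_WEEK, WEEKS_PER_YEAR)
print(round(years))  # 49
```

Under that assumption the result is just under 49 years, which matches the rounded figure quoted above.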
The multibillion-word Cambridge English Corpus is a constantly updated record of how English is being used today in all its forms: spoken, written, business, academic, learner and e-language. Amassed over two decades, the electronic database draws on sources that range from the more expected (books, newspapers, journals, radio, television) to the more surprising (song lyrics, junk mail, voicemail messages and recordings from flight control).
Cambridge University Press researchers use the Corpus to investigate the most common words, phrases and grammatical patterns in English, and then use the results to improve English language teaching books.
“Context in English is important,” explained Dr Claire Dembry, Language Research Manager. “We analyse patterns in language and how English changes depending on context and circumstances. For learners of English to become proficient, these sorts of subtle differences can be extremely important, and it is only by amassing a vast number of examples that our writers, lexicographers and researchers can determine how best to describe the patterns of English in our learning materials.”
It all began in the 1990s, when a few CDs of American newspapers in electronic form were loaded into a database that both stored the data and ‘queried’ it, working out the relationships between words. Gradually, the embryo corpus was extended with further material and, today, almost any conceivable form of English can be found in the database.
At an early stage, Cambridge University Press realised that knowing which features of English learners find difficult is just as important as knowing how English is being used. “This decision, which led to the Cambridge Learner Corpus, had far-reaching effects and has become probably the single most important unique selling point for the Press’s English Language Teaching publishing,” said Ann Fiddes, Global Language Research Manager.
It turns out that because (misspelled as becouse), which (wich), accommodation (accomodation), advertisement (advertisment) and beautiful (beatiful) are the five words most commonly misspelled by learners globally.
Arriving at conclusions like this has taken years of painstaking work in the Cambridge Learner Corpus: identifying misspellings and grammatical errors made in Cambridge English Language Assessment examinations and tagging them with computer-readable codes.
Comprehensive information about the learners who originally wrote the exam scripts (first language, nationality, age, gender, scores, and so on) is stored. These data, along with the ‘error tagging’, have enabled Cambridge University Press to publish materials that directly address the different types of errors made in individual markets and by individual language groups.
“This is hugely important for the Press and has meant that we have, for example, been able to publish the successful English for Spanish Speakers editions of global products, and become the market leader in Corpus-based publishing,” explained Fiddes.
Now, Cambridge University Press and Cambridge English Language Assessment have joined forces and set their sights on academic English.
The Cambridge English Corpus already contains over 400 million words of academic English, the largest and most extensive collection of its kind. It takes as its source written and spoken academic language at undergraduate, postgraduate and professional level from a range of academic disciplines and worldwide institutions. New research is pulling in data from sixth-form students as well as other academic levels, covering a much wider range of disciplines, genres and language backgrounds.
“Some interesting patterns have already emerged,” said Fiddes. “In our collection of academic English samples, the size adjectives significant, considerable, substantial and serious are much more frequent than big, massive, enormous and tremendous. In spoken English, however, big tops the list. We also found that in academic English, verbs such as solve, pose, face, resolve, tackle and circumvent frequently occur with the noun problem. These kinds of insights help us to develop a better understanding of the language skills needed by students at English-speaking universities.”
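As a rough illustration of the kind of collocation pattern Fiddes describes, the sketch below counts how often each of the verbs she lists appears shortly before the noun ‘problem’. It is not the Press's method: the sample sentences are invented for illustration, the function name and the four-word window are assumptions, and matching verbs by prefix is a crude stand-in for the grammatical tagging a real corpus tool would use.

```python
import re
from collections import Counter

# Hypothetical sample sentences, invented for illustration; not Corpus data.
SENTENCES = [
    "Researchers must solve the problem before publication.",
    "This approach poses a problem for existing theories.",
    "The team tried to solve this problem with a new model.",
    "Governments face a serious problem in funding research.",
]

# Verbs named in the article as frequent partners of 'problem'.
CANDIDATE_VERBS = ["solve", "pose", "face", "resolve", "tackle", "circumvent"]


def collocate_counts(sentences, noun, candidates, window=4):
    """Count candidate verbs found within `window` words before `noun`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token != noun:
                continue
            # Look only at the few words to the left of the noun.
            for left in tokens[max(0, i - window):i]:
                for verb in candidates:
                    # Crude prefix match so that 'poses' counts as 'pose'.
                    if left.startswith(verb):
                        counts[verb] += 1
    return counts


counts = collocate_counts(SENTENCES, "problem", CANDIDATE_VERBS)
print(counts.most_common())
```

On a real corpus the same counts would be compared across registers, for example academic against spoken English, to see which pairings are characteristic of each.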
As part of its current research, the team welcomes contributions of academic English to the corpus, and invites anyone interested in participating to contact them for more information.
“Corpus work is very closely linked with advances in technology and we are investigating automating many of our manual systems, such as error tagging and speech transcription,” added Fiddes. “Our research has already allowed us to partially automate the mark-up of errors in learner writing.
“These technologies will increase the speed at which we can maintain our grasp on what English is now, and what it might be in the future.”
For more information about the Cambridge English Corpus, please visit