Corpus General

In the eighteenth century Dr Samuel Johnson spent many long years compiling "A Dictionary of the English Language", published by Longman in 1755. He based his dictionary on quotations from famous authors, and these were copied onto slips of paper and became one small part of a huge filing system. Over 250 years later, the Longman Corpus Network makes this work much easier for lexicographers (people who write dictionaries) and other writers.

 

The Longman Corpus Network is a diverse, far-reaching group of databases consisting of many millions of words.  It provides Longman lexicographers and coursebook writers with a large amount of information about words, usage, language trends and grammatical patterns.

 

Computer technology has put a wealth of linguistic information at our fingertips, so that we now have hard facts about the language on screen to support and help us in our work. All Longman learner dictionaries are compiled using the Longman Corpus Network to ensure students get all the help they need. We analyze typical learners’ mistakes and include notes on how to avoid them.

 

The Longman Corpus Network consists of five highly sophisticated language databases:

  • the Longman Learners' Corpus, the first corpus to record and monitor the written output of students of English, which enables us to pinpoint their specific needs;
  • the Longman Written American Corpus, comprising 100 million words of American newspaper and book text;
  • the Longman Spoken American Corpus, a unique resource of 5 million words of everyday American speech;
  • the Spoken British Corpus (part of the British National Corpus), which gives us objective information for the first time on what spoken English is really like and how it differs from written British English;
  • the Longman/Lancaster Corpus, which covers an extensive range of written texts, from literature to bus timetables, in over 30 million words.

These five combine forces to form the Longman Corpus Network. Together with the British National Corpus, the network provides the wealth of information needed to write coursebooks and dictionaries that both accurately represent the English language and satisfy students' needs at every level.

The Longman Corpora

What is the Longman Learners' Corpus?
Students and teachers throughout the world send in essays and exam scripts to help us create the Longman Learners' Corpus, a 10-million-word computerized database made up entirely of language written by students of English. Every nationality and every language level is represented in the corpus, and this provides an unprecedented insight into the English language learner.

 

What does it tell us?
Each student essay is coded by nationality and language level, amongst other things (fig. 1): TUR, IN for a Turkish intermediate student; SSP, EL for a Spanish elementary student (from Spain as opposed to South America). It is then entered onto the computer to form part of the corpus. We are able to focus on a selected group of students, for example French advanced students, and understand the problem areas specific to that group. Alternatively, we are able to focus on a word or a phrase and view the errors made by the entire gamut of students (fig. 2).

 

FIG 1.

TUR, IN    pap which is in Leyden stowe to having nice
TUR, IN    e to having nice drink there. They had nice
SSP, EL    attic and loft space. The house is so nice

 

FIG 2.

  how the life should be  nice  and beautiful in peace. The
  ned there. There was a  nice  and big place that Korosh a
   well and which are so  nice  and charmy. We might be a
   from school and it is  nice  and clean with a quit dista
 your drink. Sunday is a  nice  and easy-going day. There a
people that I've met are  nice  and friendly people and lik
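
The kind of lookup shown in figs. 1 and 2 can be sketched in a few lines of Python. This is only an illustration of the general idea (scripts tagged with a nationality and level code, then searched for a keyword in context), not a description of the software Longman uses. The names Script and kwic, the field names, and the two sample sentences are all assumptions made up for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Script:
    """One student script with its codes (hypothetical structure)."""
    nationality: str  # e.g. "TUR" or "SSP", as in fig. 1
    level: str        # e.g. "IN" (intermediate) or "EL" (elementary)
    text: str


def kwic(scripts, keyword, width=25, nationality=None, level=None):
    """Return (code, left context, keyword, right context) for each hit.

    Scripts can optionally be restricted to one nationality and/or level.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for s in scripts:
        if nationality is not None and s.nationality != nationality:
            continue
        if level is not None and s.level != level:
            continue
        for m in pattern.finditer(s.text):
            left = s.text[max(0, m.start() - width):m.start()]
            right = s.text[m.end():m.end() + width]
            rows.append((f"{s.nationality}, {s.level}", left, m.group(0), right))
    return rows


# Invented sample data, for illustration only.
scripts = [
    Script("TUR", "IN", "They had nice drink there."),
    Script("SSP", "EL", "The house is so nice and clean."),
]

# All students, keyword aligned in the middle as in fig. 2.
for code, left, kw, right in kwic(scripts, "nice"):
    print(f"{code:8} {left:>25}  {kw}  {right}")
```

Calling kwic(scripts, "nice", nationality="SSP") instead would restrict the concordance to one group of students, which corresponds to the first kind of query described above.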

 

How do we use this information?
The Longman Learners' Corpus offers so much invaluable information about the mistakes students make and what they already know that it is the perfect resource for lexicographers and materials writers who want to produce dictionaries and coursebooks that address students' specific needs. The Longman Learners' Corpus was used to write the Usage Notes in our dictionaries. For example, the results from the corpus showed that there was 100% error in the meaning and the use of the word cloth: "My cloths and shoes were wet", "We have very good cloth stores", etc. Having pinpointed such a problem, our lexicographers were then able to write a corresponding Usage Note (fig. 3).

 

FIG 3.

USAGE Do not use cloth or cloths to mean "the things that people wear". Instead use clothes: a clothes shop | The guests all wore casual clothes.

 

 

Become Part of The Learners' Corpus
In order to grow the Longman Learners' Corpus we need more and more student scripts. To get involved in this important research project, please send in your students' papers at the end of a session.  For every 100 submissions, Longman will send you a copy of a Longman Dictionary for your classroom.  For more information on this offer contact Tania Saiz-Sousa.

 

The Longman Spoken American Corpus

The Longman Spoken American Corpus comprises 5 million words of text. The recordings were gathered for Longman by the University of California at Santa Barbara. The corpus represents the everyday conversations of more than 1000 Americans of various age groups, levels of education, and ethnicity, and includes speakers from over 30 US states. The recorded speech is transcribed onto a computer database and analyzed by our lexicographers to determine frequency of use, precise meanings and typical phrases that students need to study.

 

The Longman Written American Corpus
The Longman Written American Corpus is a dynamic database of 100 million words of running text from newspapers, journals, magazines, best-selling novels, technical and scientific writing, and coffee-table books. The corpus is constantly being refined and new material added. The design of the Longman Written American Corpus is based on the general design principles of the Longman Lancaster English Language Corpus and the written component of the British National Corpus. As with other corpora in the Longman Corpus Network, words can be concordanced, wordlists created, and statistical features analyzed, allowing lexicographers to compare and contrast usage in British and American English.
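
As a rough illustration of what creating a wordlist and comparing usage can involve, the Python sketch below counts word frequencies in two small texts and expresses a word's frequency per million words, so that corpora of different sizes can be compared. It is a minimal sketch under stated assumptions, not the tooling used for the Longman Corpus Network: the function names wordlist and per_million, the simple tokenizing rule, and both sample sentences are invented for the example.

```python
import re
from collections import Counter


def wordlist(text):
    """Count each word form in a text (lower-cased, very simple tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def per_million(counts, word):
    """Frequency of a word per million running words in the wordlist."""
    total = sum(counts.values())
    return 1_000_000 * counts[word] / total if total else 0.0


# Invented sample sentences standing in for two much larger corpora.
american = wordlist("I took the elevator to my apartment. The elevator was slow.")
british = wordlist("I took the lift to my flat. The lift was slow.")

for word in ("elevator", "lift"):
    print(word,
          round(per_million(american, word)),
          round(per_million(british, word)))
```

Normalizing to a rate per million words is a common way to compare corpora of unequal size, such as a 5-million-word spoken corpus and a 100-million-word written one.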