Testing and assessing vocabulary
It is almost a truism that vocabulary is tested directly or indirectly in all tests of a learner's language ability. It is difficult to conceive of any type of test which is lexis free. Even when a test item looks like this, for example:
Select the correct answer:
John couldn't get in because he ____________ left his keys at
the office.
a) was leaving
b) had left
c) would leave
d) has left
and is presumably designed to test the subject's knowledge of
past-tense forms in English, it requires the test taker to
understand the connection between keys and the ability to
get in somewhere, the meaning of the modal auxiliary verb, the meaning of the
adverb in get in and the logical connection implied by the
word because.
Without that lexical knowledge, it's hard to demonstrate grammatical
knowledge and get the right answer.
So, if vocabulary knowledge is routinely a part of testing all the
other sorts of language ability ...
... why test vocabulary separately?
This is not the place to set out the different purposes that
tests fulfil, whether they are achievement, diagnostic, proficiency
or progress tests. Nor is this the place to discuss the motivating
factors that tests sometimes enhance. We are concerned here
with testing vocabulary in particular, not testing in general.
Guides to general areas of testing and lexis are linked in the list
of related guides at the end.
There are a number of good reasons for testing vocabulary discretely from other skills and abilities.
- Backwash: Explicitly testing vocabulary often results in teachers paying more attention to its teaching and being more consistent and discerning about what items they focus on. Backwash may also have an effect on the learners. If they know that vocabulary is going to be tested discretely, they may well be motivated to review what they have encountered and consigned to notebooks, probably in no particular order. They may even be persuaded to revisit and reorganise their vocabulary notebooks.
- As a measure of overall ability: Vocabulary knowledge has been shown to be a good indicator of a learner's overall ability in a language so, for diagnostic and placement purposes, vocabulary testing is a useful tool.
- Face validity: Some learners make very great efforts to acquire vocabulary because they recognise, quite rightly, that although it is difficult to communicate without grammar, it is impossible without words. If we do not test vocabulary in an identifiably discrete way, learners may not feel that their abilities are being fairly assessed.
- Depth vs. breadth: Testing vocabulary incidentally, in a mix of other test types, may give us some measure of the breadth of learners' lexical knowledge (i.e., the size of their lexicons in crude terms) but is unlikely to provide anything like the precision we require if we want to measure the depth of their knowledge of lexis. This means testing vocabulary separately so we can get some estimation of how well items are known, not just how many are recognised.
- Learning is remembering: Vocabulary learning is not subject to a rule-based approach in the same way that learning grammar rules and applying them can be. There are distinct patterns, of course, such as collocational aspects, affixation, multi-word verbs, synonymy, homonymy and so on but, essentially, learning vocabulary means remembering vocabulary, and testing it is a strong motivating factor in encouraging learners to review and recycle what they have encountered.
- Revision and review: Vocabulary is an area where it has been shown that multiple exposures to lexemes in context are required before items can be said to have been acquired. Vocabulary testing provides an opportunity, when giving feedback, to review, recycle and extend learners' knowledge.
- Spacing: It has been shown that it is better to space out vocabulary learning and recycling rather than concentrate it in blocks of intense effort. In the trade, this is known as distributed practice. Such practice, it is argued, allows short- and long-term memories to integrate. Testing at regular intervals allows the teacher to space out the learning and recycling at gradually increasing intervals.
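The idea of retesting at gradually increasing intervals can be sketched programmatically. This is a minimal illustration only; the interval values (1, 3, 7, 14 and 30 days) are assumptions for the sake of the example, not a recommended scheme:

```python
from datetime import date, timedelta

def review_schedule(start, intervals=(1, 3, 7, 14, 30)):
    """Return the dates on which a vocabulary set could be retested,
    spacing the reviews at gradually increasing gaps (in days).
    The gap values are illustrative, not prescriptive."""
    day = start
    schedule = []
    for gap in intervals:
        day = day + timedelta(days=gap)
        schedule.append(day)
    return schedule

# A set first taught on 1 March 2024 would be retested on:
for d in review_schedule(date(2024, 3, 1)):
    print(d.isoformat())
```

A teacher (or a simple flashcard tool) could lengthen or shorten the gaps depending on how well the class performs at each retest.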
What to test: depth vs. breadth
Hughes asserts that for the purposes of a placement test, i.e., "in essence a proficiency test" (Hughes, 1989:147):
All we would be looking for is some general indication of the adequacy of the student's vocabulary.
Ibid.
If that were all there was to it, we would simply need to focus our
test on the vocabulary items we consider most frequent and useful
for our learners, perhaps drawing on something like the General
Service List or an academic word list, and design a test to see
if our learners can accurately understand (by some kind of matching
task) and use (through a form of gap-fill testing) the items we have
targeted.
That will give us a rough-and-ready indication of the breadth
of their passive and active vocabulary.
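If the test samples randomly from a defined word list, that rough indication of breadth can be extrapolated from the score. A minimal sketch; the figures are invented for illustration:

```python
def estimate_breadth(known_in_sample, sample_size, list_size):
    """Extrapolate vocabulary size from a random sample of a word list.
    A rough-and-ready point estimate, not a statistically rigorous one."""
    return round(known_in_sample / sample_size * list_size)

# A learner who knows 62 of 100 items sampled from a 2000-word list
print(estimate_breadth(62, 100, 2000))  # 1240
```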
Another possibility, of course, lies in asking how well learners
know vocabulary items, not how many they know. This will mean
focusing on phenomena such as pronunciation, collocation,
colligation, word formation (morphology), word grammar
(transitivity, countability etc.) and perhaps some other factors
concerning hyponymy, synonymy, simile, metaphor, style, register and
idiomaticity. This is what is meant by focusing on depth of
understanding as well as on breadth of knowledge.
For this to work, we need to be a little more imaginative in how
we construct test items, as we shall see.
What to test: targeting the test
What you test is dependent on why you test, i.e., what the test is designed to tell you.
- Achievement and progress tests will focus, if they are to be fair, only on the items taught or encountered on the course.
They are either (or both):
  - Formative and frequently carried out to identify what needs to be recycled and reviewed.
  - Summative and carried out at the end of a course to see how well the items have been acquired as a way of evaluating the success of the programme.
- Proficiency, placement and diagnostic tests will focus on getting an estimate of the size and depth of learners' knowledge and will depend, usually, on some kind of sampling.
If, for example, we have identified 2000 words that a learner at A2 level should know, a test of 100 randomly selected words from the list will represent a sample of 5%, which is actually rather good statistically, although a sample of 200 words would, of course, be twice as good and twice as time-consuming.
When learners are at C1 level, however, they are expected to know around 4000-5000 words and random sampling becomes much more difficult because to attain a 10% sample rate, we would need a 400-item test which would take at least 3 hours to complete if one allows 30 seconds per item.
That is impractical in most settings and explains why vocabulary testing in public examinations is often integrated into testing other aspects of language knowledge and ability.
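The sampling arithmetic above can be checked with a few lines of code. The 30-seconds-per-item figure is the one assumed in the text:

```python
def sampling_plan(list_size, sample_size, seconds_per_item=30):
    """Work out the sampling rate and rough test duration for a
    vocabulary test drawn at random from a target word list."""
    rate = sample_size / list_size            # proportion of the list tested
    minutes = sample_size * seconds_per_item / 60
    return rate, minutes

# The A2 example from the text: 100 items from a 2000-word list
rate, minutes = sampling_plan(2000, 100)
print(f"{rate:.0%} sample, about {minutes:.0f} minutes")   # 5% sample, about 50 minutes

# The C1 example: a 10% sample of a 4000-word list
rate, minutes = sampling_plan(4000, 400)
print(f"{rate:.0%} sample, about {minutes:.0f} minutes")   # 10% sample, about 200 minutes
```

The 400-item test comes out at around 200 minutes of pure answering time, which is why, once administration is added, it exceeds three hours and becomes impractical.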
Measuring breadth: vocabulary size
This is the attempt to discover how wide the test taker's vocabulary is in terms of understanding and using lexemes. It is not a particularly sophisticated measure of language competence but there is evidence that breadth of vocabulary is a good indicator of general language proficiency. However, there are provisos:
- Cross-language facilitation and interference:
This is not an issue when all the learners share a common first language and no other apart from the target language.
However, in groups where the learners come from a variety of language backgrounds and/or in which some members of the group may have learned other additional languages, the issue of cross-language influences begins to be felt.
When learning English, for example, it is unlikely that learners from a Romance-language background, or who have learned a Romance language, whatever their first language, will have much difficulty understanding a word such as consolidation, because a word which looks similar and carries the same meaning exists in most of these languages (French, Italian, Spanish, Romanian, Portuguese etc., in all of which the word begins consolida-, with only the ending differing slightly, if at all, from English).
Speakers of Slavic languages will be slightly more challenged but a cognate word exists in most of them which can be identified with a little effort.
However, learners from other language backgrounds, especially non-Indo-European ones will have no such support from their first languages and will need to have learnt the word from scratch. Even in German, where a similar word exists, a more natural translation might be Vertiefung which bears no superficial relationship to the English word. In other languages, the form of the word also bears no relationship to the English word at all:
samstæðu (Icelandic)
укрепление (ukrepleniye) (Russian)
sağlamlaştırma (Turkish)
ukuhlanganiswa (Zulu)
vakauttaminen (Finnish)
pagpapatatag (Filipino / Tagalog)
fanamafisana (Malagasy)
etc.
- Register:
Learners who have certain interests and/or professions may find that some items are well known to them which are obscure to people with other backgrounds. For example, a learner, whatever his or her first language, who happens to be a chemist will have little difficulty understanding a word like sulphate (or sulfate if you prefer AmE) which is similar if not the same in a very wide range of languages (but not all). Other learners will be more challenged.
Equally, a learner with a background in banking might well understand the terms direct debit, exchange rate, deposit etc., whatever her or his first language, where other learners will struggle.
A learner who is particularly interested in motor racing will probably be familiar with bend, pit-stop, chicane, chequered flag and a number of other terms which are obscure to those without any specialist knowledge.
The moral is to try, as far as possible, in the selection of items to test, to avoid bias of this sort and also, where appropriate, to focus only on the items which have been taught, or at least encountered, on the course. That is feasible if one is designing a progress or achievement test, less so in designing placement, diagnostic or proficiency tests, as we saw.
Ways of measuring vocabulary size
The following techniques are not confined to measuring vocabulary size because they can also be used to measure other aspects of lexical knowledge (if they are designed somewhat differently). Here are some examples and some comments:
Synonym tests are simple to design and administer. For example:
Choose the item which is closest in meaning to fire:
- blaze
- combustion
- ignition
- eruption
This is a test which depends on knowing all the words and being able to match meanings. A more searching test, sometimes, is to choose words and distractors which are close in form but not meaning. For example:
Choose the item which is closest in meaning to search:
- seek
- clench
- reach
- trench
and a test like that can also be used to test whether learners can distinguish between homophones, like this:
Choose the item which is closest in meaning to feet:
- pause
- pores
- pours
- paws
The problem, of course, with test items like these is that there is no context so distractors must be very carefully chosen to eliminate any possibility that, in certain circumstances and with certain meanings, more than one correct answer may exist.
Definition tests are easy to design if you have a learners' dictionary to hand. Examples are:
Which word means extremely frightened?
- afraid
- horrified
- petrified
- scared
Which definition of frown is correct?
- an expression showing anger or disapproval
- a gesture showing dislike
- raising your eyebrows to show surprise
- stretching your mouth to show dislike
There are problems with both of these test types because the first depends on understanding that the words are adjectives not verbs and the second, of course, depends on understanding the words in the definitions such as disapproval, gesture (vs. expression), eyebrows etc.
Gap-fill tests can get around the issue of a lack of co-text. For example:
Fill the gap with the correct word from the
list of four:
The computer program __________ much faster processing of
information.
- empowered
- enabled
- let
- qualified
The drawback with this kind of test is that, although the distractors should not in theory contain words the test taker does not know, it is often very difficult to identify distractors which are conceivable (but wrong) rather than wrong (and obviously so). Another issue is that the co-text should also not contain unknown or ambiguous items.
Gap-fill tests in which no alternatives are given may also be a way of measuring productive ability rather than recognition. For example:
Fill the gap with one word only:
Mary lost her key so she __________ mine to get into the flat.
The obvious drawback with this sort of test of vocabulary is that
it is very hard to write a series of items in which only one possible
word is allowable. In the gap in this example, borrowed,
took, stole, appropriated, nicked and a range of other possibilities are
allowable and that complicates marking by introducing an element of
judgement of appropriacy. Would you allow purloined,
for example?
A way around this is to redesign the task like this to give the
first letter and an indication of how many letters the word
contains:
Fill the gap with one word only:
Mary lost her key so she b _ _ _ _ _ _ _ mine to get into the flat.
but that, naturally, makes it easier.
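Where open gap-fills like this are marked with a computer, one pragmatic approach is to keep a set of clearly acceptable answers and a separate set of 'doubtful' ones to flag for the marker's judgement of appropriacy. A sketch only; the word sets below are illustrative, not a definitive key:

```python
def mark_gap_fill(response, acceptable, doubtful=()):
    """Mark an open gap-fill answer against a set of clearly acceptable
    words and a set of 'judgement call' words a human marker may credit.
    Returns 'correct', 'check' or 'wrong'."""
    word = response.strip().lower()
    if word in {w.lower() for w in acceptable}:
        return "correct"
    if word in {w.lower() for w in doubtful}:
        return "check"   # flag for the marker's judgement of appropriacy
    return "wrong"

# Illustrative key for the Mary-and-the-key item in the text
acceptable = {"borrowed", "took", "used"}
doubtful = {"stole", "nicked", "appropriated", "purloined"}
print(mark_gap_fill("borrowed", acceptable, doubtful))   # correct
print(mark_gap_fill("purloined", acceptable, doubtful))  # check
```

The design choice here is simply to make the element of judgement explicit rather than pretending the key is exhaustive.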
Using pictures to elicit productive vocabulary is a technique commonly deployed. For example:
Write the correct words for the sports next to the pictures:
[three pictures of different sports, each with a line for the answer]
Unfortunately, this only works for lexical items that can be unambiguously identified from pictures and even then, some items may be representable by more than one word and that complicates marking.
Definition tests, too, can be used to measure productive ability, like this:
Fill the gap with one word only:
Electronic devices which are connected to an amplifier and fit over
both ears to play sounds are called:
____________________ .
The drawback is that there are very few words which can be completely unambiguously defined in this way.
Measuring depth: vocabulary use
The first decision concerns the selection of the aspects of a word that you want to test. In the general guide to teaching lexis, the following were identified as what it may be necessary to know in order to 'know' a word:
- what a word means – what it denotes and what it connotes (if appropriate)
- how it is connected to other words which mean similar things (e.g., buy, sell, bargain, discount etc.)
- what words it commonly goes with (collocation) so we know we can't have a high tree but prefer tall as the adjective, for example
- what other meanings it can have (e.g., shop, bank etc. can have different meanings and fall into different word classes)
- how the word changes depending on its grammar (e.g., shop, shops, shopping, shopped etc.)
- what grammar the word uses (e.g., does it take a direct object, an indirect object, both, or a preposition? does it have an odd plural or another irregularity?)
- how to pronounce the word.
- what kind of situations the word is used in and who might use it. Is it, for example, typical of a certain register?
Depth of meaning also concerns passive and active vocabulary, of course. Here are some example test items with a commentary:
Item 1, word knowledge:
Use the word complain in a sentence
of a minimum of 8 words. Your sentence must contain a subject
and an object.
____________________________________________________________________________________________
Clearly, this sort of test requires subjective marking, although the marker will only be looking for accuracy concerning the target item and will ignore the rest. It nevertheless tests a wide range of knowledge because the test taker needs to be able to:
- recognise the word class
- understand the meaning of the verb
- know that it is a prepositional verb usually combined with about or of
- use the verb with an appropriate prepositional object
For small test samples, this kind of item can be revealing. The test can also be done orally and that will include a check on whether it can be pronounced adequately.
Item 2, collocation:
Mark with a ✓ or a ✗ which words on the left can be used with the words at the top.
The first one is an example.
| | rain | snow | wind | sunshine |
| --- | --- | --- | --- | --- |
| heavy | ✓ | ✓ | ✗ | ✗ |
| pouring | | | | |
| strong | | | | |
| powerful | | | | |
| drifting | | | | |
| blowing | | | | |
| blazing | | | | |
| swirling | | | | |
| forceful | | | | |
For variety and a little more precision, test takers can also be invited to put a ? by any item they consider doubtful.
Collocation can also be tested on a scale of naturalness so we
could have:
Item 3:
Mark these phrases with a 1, 2 or 3.
1 means it is the most natural
2 means it is possible but unnatural
3 means it is very unlikely or impossible
You can use each number as many times as you like.
| | 1 | 2 | 3 |
| --- | --- | --- | --- |
| weighty issue | | | |
| heavy issue | | | |
| bulky issue | | | |
| lions groan | | | |
| lions rumble | | | |
| lions roar | | | |
| out of control | | | |
| beyond control | | | |
| on control | | | |
Collocations of many sorts can be tested this way because there is a cline from wholly unnatural to slightly and fully natural.
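If responses to cline-based items like these are scored automatically, giving partial credit for an adjacent rating reflects the fact that naturalness is a cline rather than a binary. A sketch only; the marking scheme and key below are invented for illustration:

```python
def score_ratings(responses, key):
    """Score 1-3 naturalness ratings against a marker's key:
    2 points for an exact match, 1 for an adjacent rating
    (to reflect that naturalness is a cline), 0 otherwise."""
    total = 0
    for item, rating in responses.items():
        diff = abs(rating - key[item])
        total += 2 if diff == 0 else (1 if diff == 1 else 0)
    return total

# Illustrative key for three of the items above
key = {"weighty issue": 1, "heavy issue": 2, "bulky issue": 3}
responses = {"weighty issue": 1, "heavy issue": 3, "bulky issue": 3}
print(score_ratings(responses, key))  # 2 + 1 + 2 = 5
```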
Item 4, formality:
Formality sensitivity can be tested in the same way:
Style:
Mark these sentences with a 1, 2 or 3.
1 means it is formal
2 means it is neutral
3 means it is informal
You can use each number as many times as you like.
| | 1 | 2 | 3 |
| --- | --- | --- | --- |
| please pass the salt | | | |
| give me the salt | | | |
| would you hand me the salt, please | | | |
| they tend to be annoying | | | |
| they are a pain | | | |
| they are irritating | | | |
| I'm averse to swimming | | | |
| I am disinclined to swim | | | |
| I don't like swimming | | | |
Item 5, register:
Register sensitivity can be addressed in the same way:
Mark with a ✓ or a ✗ which words on the left you would expect to hear in the settings at the top.
The first one is an example.
| | IT | business | football | theatre |
| --- | --- | --- | --- | --- |
| spreadsheet | ✓ | ✓ | ✗ | ✗ |
| transfer | | | | |
| performance | | | | |
| applause | | | | |
| critic | | | | |
| shoot | | | | |
| substitute | | | | |
| understudy | | | | |
| replacement | | | | |
Item 6, paradigmatic and syntagmatic relationships:
Mark with a ✓ or a ✗ which words on the left you can associate with the words at the top.
The first one is an example.
| | delayed | alteration | computer | light |
| --- | --- | --- | --- | --- |
| late | ✓ | ✗ | ✗ | ✗ |
| change | | | | |
| machine | | | | |
| electronic | | | | |
| train | | | | |
| minor | | | | |
| bright | | | | |
| program | | | | |
| operation | | | | |
This is not an easy test to understand in terms of what the test taker has to do, so learners need a little training to look for the two types of relationship at which it is aimed.
Again, for variety and a little more precision, test takers can also be invited to put a ? by any item they consider doubtful.
The test encourages the learner to try to recognise words of a
similar nature and word class (paradigmatic relationships) as well
as those likely to co-occur syntactically (syntagmatic
relationships).
Item 7, colligation:
It is possible to test learners' understanding of word
grammar in a number of different ways. For example:
Mark with a ✓ or a ✗ which phrases are correct.
Then, if necessary, write the correct form in the box on the right.
| | | ✓ or ✗ | Correction |
| --- | --- | --- | --- |
| 1 | I am sorry for late | | |
| 2 | I allowed him to come | | |
| 3 | She let him to stay | | |
| 4 | I concealed it under the table | | |
| 5 | I concealed behind the curtain | | |
| 6 | They arrived the hotel | | |
| 7 | He donated them the money | | |
| 8 | We handed over the doorman the tickets | | |
| 9 | We expected him to arrive late | | |
| 10 | We hoped her to come early | | |
| 11 | We can probable come | | |
| 12 | It's difficult but please try | | |
| 13 | It's hard but please attempt | | |
| 14 | She's an unwell child | | |
| 15 | I very almost was late | | |
Only four of the above are correct (2, 4, 9 and 12) and the others target specific aspects of colligation which are exemplified in the guide to the area, linked below in the list of related guides.
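Where an item bank like this is marked programmatically, the key might be stored along the following lines. This is a sketch under assumptions: the item numbering and the four correct items come from the text, but the suggested corrections are one possibility among several, not the only acceptable forms:

```python
# Illustrative answer key for the fifteen colligation items above.
# True = the phrase is correct as given; for incorrect phrases one
# possible corrected form is suggested (not the only acceptable one).
KEY = {
    1: (False, "I am sorry for being late"),
    2: (True, None),
    3: (False, "She let him stay"),
    4: (True, None),
    5: (False, "I concealed it behind the curtain"),
    6: (False, "They arrived at the hotel"),
    7: (False, "He donated the money to them"),
    8: (False, "We handed over the tickets to the doorman"),
    9: (True, None),
    10: (False, "We hoped she would come early"),
    11: (False, "We can probably come"),
    12: (True, None),
    13: (False, "It's hard but please attempt it"),
    14: (False, "She's a sick child"),
    15: (False, "I was very nearly late"),
}

# The four items that are correct as given
correct_items = sorted(n for n, (ok, _) in KEY.items() if ok)
print(correct_items)  # [2, 4, 9, 12]
```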
Item 8, word class:
This is a simple test of what is known about a word:
Mark with a ✓ or a ✗ which words on the left belong in the word classes at the top.
The first one is an example.
| | noun | transitive verb | intransitive verb |
| --- | --- | --- | --- |
| bank | ✓ | ✓ | ✓ |
| sugar | | | |
| strength | | | |
| dig | | | |
| drift | | | |
| blow out | | | |
| haste | | | |
| music | | | |
| pause | | | |
Item 9, lexical sets:
Sensitivity to sense relations can be tested in the same way:
Word sets:
Mark with a ✗ the odd ones which are not in the same word set as the first word.
The first one is an example.
| | | | | |
| --- | --- | --- | --- | --- |
| late | delayed | overtime ✗ | behind | overdue |
| change | alter | modify | cancel | postpone |
| machine | device | tool | utensil | gadget |
| light | fire | ignite | illuminate | show |
| taxi | mini-cab | rickshaw | train | rental car |
| minor | trivial | small | minimum | important |
Item 10, testing hyponymy:
This is a key set of relationships to test. It can be done, for example, like this:
Mark with a ✓ which word includes the meaning of the four words on the left.
The first one is an example.

| | | | |
| --- | --- | --- | --- |
| bank, post office, health centre, town hall | facility ✓ | shop | building |
| tin, box, can, carton | container | holder | vessel |
| university, college, school, academy | education | building | institution |
Item 11, synecdoche, simile, metaphor and other matters:
We can also test more sophisticated and difficult areas of lexical relationships, like this:
Which words can best replace the underlined words in:
The White House has decided to
impose tariffs on steel.
- the US senate
- the US President
- the President's office
Which words can best replace the underlined words in:
He became an actor.
- He went on the stage
- He studied acting
- He went into the film industry
Complete the similes:
- He's like a fish out of __________
- It's as fast as __________
- She's as thick as __________
- I'm as deaf as __________
- It went like a __________
- They purred like __________
I have a lot on my plate this week means:
- I eat too much
- I'm very busy
- I am worried by many things
Item 12, word formation:
The understanding of affixation can be tested both receptively and productively, like this:
Mark these words as correct (✓) or incorrect (✗). If a word is incorrect, put the correct form on the right.
| | ✓ or ✗ | Correction |
| --- | --- | --- |
| advertisement | | |
| hopeability | | |
| understandingness | | |
| annoyment | | |
| painfulness | | |
| treatable | | |
| hatefulness | | |
| walkable | | |
| capableness | | |
Productive ability can be tested this way, too. For example:
Fill the gaps with the correct form of the base word. Put a ✗ where it is not possible to make a word.
The first one is an example.
| | noun | transitive verb | intransitive verb | adjective |
| --- | --- | --- | --- | --- |
| snow | snow | ✗ | snow | snowy |
| sweet | | | | |
| add | | | | |
| dig | | | | |
| love | | | | |
| rain | | | | |
| hurry | | | | |
| contain | | | | |
| old | | | | |
As you can see from the example of love here, items need to be carefully
chosen because a range of derived words may be formable (lovable,
lovely, loving, loved etc.).
Another way to do this is to populate a grid with some of the target stems or derivatives and get the learners to complete it with a word or a ✗, like this:
Fill the gaps with the correct form of the words. Put a ✗ where it is not possible to make a word.
The first one is an example.
| noun | verb | adverb | adjective |
| --- | --- | --- | --- |
| snow | snow | ✗ | snowy |
| hate | | | |
| | | hurriedly | |
| advertisement | | | |
| | | | hot |
| | please | | |
| | | sideways | |
| thought | | | |
| | | | cheerful |
A simpler way is something like:
Select the correct word:
- unpossible
- inpossible
- impossible
Select the correct word:
- dirtity
- dirtiness
- dirtfulness
Item 13, pronunciation:
Although pronunciation is probably best tested orally for obvious
reasons, that is not always practical especially if the test setter
and the test taker are not in the same place. It is possible
to test it in writing, however. For example:
Which word rhymes with hoped?
- dropped
- shopped
- soaped
- locked
- adopt
Which word contains the same sound as the 's' in sugar (/ʃ/)?
- sword
- leisure
- school
- shame
- measure
- muscle
One can design items of this sort in which the test taker needs to select multiple possibilities and, if the test taker is familiar with the phonemic script, it makes life considerably easier so we can have, e.g.:
Which words contain the sound /uː/?
- sword
- foot
- lost
- loose
- sure
- should
- goose
- cruise
and one can add, "as in choose" to the rubric to make it clearer.
This may not be an ideal way of testing pronunciation and it is unlikely that one can focus on anything more than vowel and consonant pronunciation this way but it may be the only way in some settings. Trying to test features of connected speech, with the possible exception of the weak-form schwa (/ə/), is very difficult.
If you follow some of the guides linked below, you may discover other phenomena concerning lexis which, with a little imagination, you can assess in ways similar to those exemplified above.
Related guides:

| | |
| --- | --- |
| idiomaticity | which considers levels of transparency, strong collocation, binomials and so on |
| collocation | a guide to a key area to see what you might be testing |
| colligation | a guide with examples of colligation types that you may consider testing |
| synonymy | which includes explanations of metonymy, synecdoche, simile, metaphor and hyponymy, all of which can be tested |
| lexical relationships | for an overview of synonymy, hyponymy and other terms |
| testing and assessment | a general guide to testing, assessment and evaluation with some key terms explained |
| the lexis index | for a list of other guides in this area |
References:
Hughes, A, 1989, Testing for Language Teachers, Cambridge: Cambridge University Press
Schmitt, N, 2000, Vocabulary in Language Teaching, Cambridge: Cambridge University Press