Language evolution

This is about language, not languages.
The ways in which individual languages change and develop are interesting and relevant to anyone who works with language, but they concern cultural change, not evolution in the sense in which we will be using the word.
We are concerned here to outline how and in what circumstances humans may have evolved the ability to use language (not a language – language).
The answer, as we shall shortly see, is that nobody really knows, but there are some plausible ideas out there that deserve attention.
(In what follows, to avoid too much clumsiness, when the word animal is used, we are referring to non-human animals, of course.)

Language vs. communication

The first thing we need to do is define our terms and the first of those terms is what we mean when we use the word language.

Firstly, we are not concerned to define it solely as a means of communication (although it is, quite obviously) because it is more than that.

Here are some examples of communication which, for our purposes, fall outside the definition of language although they are, in one way or another, forms of communication.

Alarm calls

Lots of studies have shown that many animal species can not only signal danger (or an alarm) but can also signal the type of danger and, in some cases, its location. Species which exhibit various types of alarm calls have been studied extensively and include macaques, West African green monkeys, capuchin monkeys, meerkats, elephants, prairie dogs, parrots, blackbirds and more.
It has been clearly shown that vervet monkeys, to take one example, can produce alarm calls that distinguish between three types of predator (pythons, eagles and leopards) (Seyfarth et al, 1980).
It has also been reported that one species of New World monkeys can not only signal the type of predator in question but also its location (Cäsar et al, 2013). If that is the case, it would imply an ability, albeit rather limited, to use some kind of syntax to combine ideas. As we shall see, that may be significant.

Bird song

The most obvious message communicated by bird song is "Come and mate with me!" and that is one reason why song styles are species specific, of course. Robins do not want to attract blackbirds but they do want to attract other robins.
Other meanings which have been attributed to various bird songs and calls include the common one of territory marking and defence but have also been posited as a way to identify specific individuals of the same species.
It has also been noted that to some extent at least the songs of birds (but not their calls) are learned and that there are dialects within species depending on the songs of others in the area around them.
If that is true, there are consequences because it would be evidence of cultural transmission rather than a genetically encoded ability to perform certain songs.
Parrots and many other bird species are also good mimics of other birds and animals, of human speech and of ambient sounds such as certain types of machinery. This is not, however, any form of communication because a parrot does not intend to communicate anything in particular.

Whale song

As with bird song, the only message that whales are certainly communicating is an invitation to mate, although other functions have been proposed, including the organisation of communal hunting and the echo-location of prey.
Whale song, too, shows some evidence of cultural transmission with whales of the same species in certain ocean areas singing very differently from those elsewhere. There is also evidence of variation over time which may suggest that the songs develop.
The songs are species specific and whales do not show evidence of any cross species development of songs. The songs have been noted as distinct in fin whales, orcas, blue whales and many other species.
Dolphins and porpoises, too, are sometimes very vocal, producing a range of whistles and clicks which have been shown to communicate some basic messages.

Bee dancing

The dancing of bees has long been observed but it is only comparatively recently that the message communicated by the form of the dance has been unravelled (von Frisch, 1967). Briefly, bees dance with a significant waggle which transmits two pieces of information about a food source:

the direction is indicated by the direction of the dance and is orientated by the position of the sun (so it has to be done almost immediately before the earth moves too far in relation to the sun)
the distance to the food source is indicated by the duration of the dance. The longer it goes on, the further away is the food source.

Bee dancing is limited to giving information about a food source but the interesting thing about it is that it refers to displacement from now and displacement from here. In other words, the bee is communicating about something that is not here and not now, and that is not something any of the other forms of animal communication have been shown to do.
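
The two-part message described above (direction read against the sun, distance read from duration) can be sketched as a tiny decoder. This is an illustration only: the function and field names are invented for the example, the dance direction is simply treated as an angle measured from the sun's bearing, and the calibration constant is an assumption, not a figure taken from von Frisch.

```python
from dataclasses import dataclass

# Hypothetical calibration: metres of flight signalled by one second of waggling.
# This figure is an assumption for illustration, not a measured value.
METRES_PER_SECOND_OF_WAGGLE = 1000.0


@dataclass(frozen=True)
class FoodSource:
    bearing_degrees: float  # compass bearing, clockwise from north
    distance_metres: float


def decode_dance(dance_angle_degrees, waggle_seconds, sun_azimuth_degrees):
    """Turn the two signals in a dance into a bearing and a distance.

    dance_angle_degrees: direction of the dance relative to the sun's position
    waggle_seconds: how long the waggle lasts
    sun_azimuth_degrees: where the sun currently is (compass bearing)
    """
    bearing = (sun_azimuth_degrees + dance_angle_degrees) % 360
    distance = waggle_seconds * METRES_PER_SECOND_OF_WAGGLE
    return FoodSource(bearing, distance)


# The same dance, read at two different positions of the sun.
early = decode_dance(40, 1.5, sun_azimuth_degrees=180)
late = decode_dance(40, 1.5, sun_azimuth_degrees=195)
print(early)
print(late)
```

In this sketch the same dance decoded against a later position of the sun gives a different bearing, which is one way to see why the dance has to be performed almost immediately.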

Animals that communicate with humans

While we are, of course, very sceptical when a pet owner declares that his dog, cat or ferret is able to understand what he says and that he, in turn, understands the pet, there is no doubt that animals that have lived for many generations among humans have come to rely on them for food and water and have, accordingly, developed ways of telling us humans that they need food, need a walk or need attention in some other way. This is undoubtedly a form of communication but it isn't language as we shall see.
As has often been pointed out, your cat or dog may be able to tell you that she's hungry but not that she's worried about her weight or that she intends to enjoy a nap in the sunshine later on.

Other forms of animal communication

There are some familiar forms of communication not mentioned above and these usually concern threats, mating rituals or territory marking. Some examples will do because this guide is not centrally concerned with a review of animal behaviours.
Dogs bark and growl to warn strangers and wag to show pleasure, cats hiss and purr, squirrels rattle, hiss and screech, snakes can do the same and many mammals mark territory with urine or special scent glands. Even some plants have been shown to communicate threats from parasites chemically.

Human language

In order to identify what it is about human language that makes it unique, we need to set out its defining characteristics and then see how any of the animal communication devices we have outlined stand up to the tests for language.
There are lots of ways to define the nature of true language but we will focus here on those that seem most important (there are others, including spontaneity and turn taking).

Arbitrariness

The sound symbols we use in human language are almost fully arbitrary. While there is, perhaps, some sense in which a word like squish represents in its sound something of its meaning, almost all the words in a language, any language, do not have any connection with what they represent. Even the so-called onomatopoeic words which some aver sound like their referents turn out to be variable across languages although the sound they purport to represent is not. Dogs in English go woof, woof and in Greek go yav, yav, for example; in French that's ouaf, ouaf, in Spanish guau guau, in German wau-wau, in Czech haf haf, and so on through a range of languages, few of which use the same formulation for the same noise.
There is no recognisable connection for example between the flying dinosaur and the word bird (or Vogel, oiseau, pájaro, vták, txori and so on for 7000-odd languages).
(That is not to say that all words in all languages are arbitrarily formed. Many languages, especially Japanese, some African languages and South East Asian languages make extensive use of what are called ideophones which allegedly conjure up the meanings they have and the relationships go beyond sounds, including verbs of movement, textures, smells and tastes. An example in English is zigzag in which the form of the written word represents its meaning (to some extent) although whether someone unacquainted with the word would guess that without context is arguable).
On the other hand, in animal communication, there is often a clear lack of arbitrariness in the signals that are sent. A dog baring its teeth and growling is leaving you in little doubt about its intentions and it might be said that loud screeching to signal danger is also not arbitrary in the sense that the words eagle, python and leopard are arbitrary symbols.
Other forms of animal communication, such as bird and whale song are more easily described as arbitrary.

Cultural transmission

While there is some evidence for cultural transmission in the form of bird and whale song, other instances of animal communication seem to be genetically inherited and not learned. Bees do not learn to dance; they know how to dance. Even when bird songs can be shown to be culturally transmitted, bird calls are not. All members of a single species will call in a particular manner, making identification of an unseen individual straightforward to those in the know.
While humans may have a genetically endowed capacity for learning language (an assertion concerning which the jury is still out, incidentally) we do not have a genetically endowed ability to speak any single language and nor will we find some languages easier to acquire than others.

Semanticity

Words have discrete meanings. The meaning they carry is understood within the speech community in which they occur by mutual agreement. Words are not merely the signs which represent concepts; they are symbols of meaning.
Within animal communication, it is hard to find any single item that one could call a word or symbol with a specific meaning. A dolphin's click may signal I am here (or it may simply be a navigational aid) but there is no way to break down even a series of clicks and whistles into discrete and meaningful units. In other words, the clicks do not mean I am here but they may mean iamhere.

Verbal

Language is primarily oral-aural and not visually representational (as bee dances are) or accessed through any other sense (as scent markings are, for example). Although non-verbal systems of human language exist (essentially the various sign languages, braille and the written systems) primacy is given to the oral-aural medium.
Animal communication, too, is often in the medium of audible signals because that happens to be a very efficient medium but need not be.

Innovativeness, creativity and productivity

True language is infinitely innovative. All users of any language can invent sentences and utterances never previously seen or heard which will be immediately understood by any other speaker of the language. New words arise in all languages to describe new events and entities or old words are given new meanings. Words die and are born.
Although some animal communication systems are flexible and show some development over time and space, it is not apparent that the users can innovate to any degree.
It is possible that something like
The table is in the corner
has been said and written many times (perhaps thousands of times) but
The dogfish's wallet is in the food mixer
has probably never been said and this may well be the only time it has been or ever will be written.
The communication systems used by animals are fixed and consist of a small number of unchanging messages. Human language is virtually infinite.

Displacement

All human languages can refer to things which are not here, things and events which are not current and to things which are purely imaginary and unreal. We can, for example, say:
John went there yesterday
and efficiently refer to someone who is neither the speaker nor the hearer, who need not be present, who did something not here and not now.
Bee dances certainly do refer to the not here (in saying in which direction and at what distance a food resource may be found) but no other animal communication systems have been shown to do that. Even bees, of course, are referring to a present food source and not, say, speculating about a future source or ruing the absence now of a past source.

Pattern, structural dependence and recursiveness

Language is patterned. There are patterns of syntax and word combination which are almost infinitely flexible but which are based on a rule-bound system of syntactical relationships including, e.g., noun phrases, verb phrases and so on. These syntagmatic and paradigmatic relationships allow an almost infinite variation in the message which is sent and received. It also means that language is dependent for its meaning on its structure.
For example, if we take five words – saw, a, woman, unicorn, the – there are, mathematically, 120 different ways to arrange them (5!, in the mathematicians' notation) but only four of those will result in a well-formed and acceptable English sentence.
Syntax in language restricts in this sense but also allows almost infinite innovation because any of the words in that set can be replaced by others in the same word class to make completely new sentences such as
    The bus sold a banana
    A cod guaranteed the bungee jump
which clearly make little sense and have probably never been said or written in English before now. They are, however, comprehensible to a user of English in a way that, e.g.:
    the jump cod bungee a guaranteed
is not.
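
The arithmetic in the unicorn example can be checked mechanically. The sketch below assumes a deliberately crude grammar (a sentence is determiner, noun, verb, determiner, noun) and a hand-made table of word classes. Both are simplifications introduced for the illustration, not a model of English.

```python
from itertools import permutations

# Hand-made word classes for the five words (an assumption for the example).
WORD_CLASS = {
    "the": "DET",
    "a": "DET",
    "woman": "NOUN",
    "unicorn": "NOUN",
    "saw": "VERB",
}

# A crude stand-in for English syntax: only this one pattern is accepted.
SENTENCE_PATTERN = ("DET", "NOUN", "VERB", "DET", "NOUN")


def is_well_formed(words):
    """Return True if the sequence of word classes matches the pattern."""
    return tuple(WORD_CLASS[w] for w in words) == SENTENCE_PATTERN


all_orders = list(permutations(WORD_CLASS))
well_formed = [" ".join(order) for order in all_orders if is_well_formed(order)]

print(len(all_orders))   # 120, i.e. 5!
print(len(well_formed))  # 4
for sentence in sorted(well_formed):
    print(sentence)
```

Adding further nouns or verbs to the table (bus, banana, sold) multiplies the number of acceptable sentences without changing the pattern, which is the sense in which syntax both restricts and permits innovation.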

Human language is also almost infinitely recursive. We can, for example, embed a noun phrase within another noun phrase as in:
    The headmaster's wife's sister's children's toys
or develop endless series of modifications within modifications as in, e.g.:
    The house which he sold to his brother who was happy to move in on the day when it suited the removal people's staff who were ...
or we can embed verb phrases endlessly as in, e.g.:
    John thinks the shop is open
    I know John thinks the shop is open
    You know I know John thinks the shop is open
and so on, literally ad infinitum.
No animal communications system has been shown to do anything like that.
There is nothing in animal communication in which any pattern of grammatical or structural rule can be discerned.
Syntax is, in other words, absent from animal communication or so minimal that it is as good as absent.
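
The embedding of verb phrases illustrated earlier in this section is one operation applied again and again to its own output, which is what recursive means here. A minimal sketch, in which the function name and the list of (subject, verb) pairs are invented for the illustration:

```python
def embed(clause, frames):
    """Wrap a clause in each (subject, verb) frame, innermost frame first."""
    if not frames:
        return clause
    subject, verb = frames[0]
    # The result of one embedding becomes the input to the next.
    return embed(f"{subject} {verb} {clause}", frames[1:])


base = "the shop is open"
frames = [("John", "thinks"), ("I", "know"), ("You", "know")]

for depth in range(1, len(frames) + 1):
    print(embed(base, frames[:depth]))
```

Nothing in the rule itself limits how many frames can be added. The limits are practical ones, such as memory and patience.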

Primate language

Humans are primates, of course, but what we are interested in here is whether other primates, specifically chimpanzees, are capable of real, human-like language.
For that we turn to the two most famous (or infamous) studies, both conducted in the USA concerning Nim and Washoe.
The earlier of these was Washoe, a chimpanzee brought up as a human child would be and taught around 350 signs for various entities which, it was claimed, she could combine syntactically to make new meanings.
The failed experiment was with Nim, brought up in a laboratory (and later cruelly abandoned to an animal experimental laboratory where he died prematurely). Nim learned around 125 signs but showed little if any ability to string ideas together in novel ways, using his ability purely functionally to get whatever he desired. Great claims were made that Nim was ordering words syntactically and collections of his signs were published at length. However, as Aitchison (2008:42) concludes:

It would require a considerable amount of imagination and wishful thinking to detect a coherent structure in such a collection.

Aitchison also concludes (op cit.:47) that while some primate language seems to exhibit arbitrariness, semanticity and to some extent creativity and displacement, that's as far as it goes. None exhibits patterning and structural dependency and that's a key issue. Additionally, for physiological rather than psychological reasons no primate language can be verbal rather than visual.

The results of both these studies, and of many others which have been undertaken with gorillas and orangutans as well as chimpanzees, remain controversial, as do the ways in which both chimpanzees were abandoned as soon as their usefulness was at an end.
There are some who claim that one or both animals learned enough syntax to cast doubt on the assertion that true language is a human-specific ability. Others see the results as a triumph for self-deception on the part of the investigators and point to an almost complete lack of evidence that any non-human primate has been successfully taught any form of language which could not be taught to a rat or pigeon with adequate amounts of operant conditioning.

The truth may lie somewhere in the middle but even if it were shown that some form of what we will call true language is teachable to other primates, it does not come close to the ways in which human children develop linguistic abilities. Neither Washoe nor Nim showed any understanding of turn taking and spontaneous speech designed to communicate an idea that was not already present.

That non-human primates can be taught some rudimentary language should not blind us to the fact that they do not seem, as humans are, to be predisposed to acquire language. Other primates can be taught language but at the cost of hundreds of hours of intense training whereas human children learn language merely by exposure to it and with almost no formal training at all.

Language evolution

While it may be argued that some forms of animal communication show one or sometimes even two of the seven characteristics listed above, none can be shown to have all of them and many have none at all.
It is also true that there remains a case to be made concerning whether Washoe's and Nim's abilities come close to exhibiting the seven characteristics we have identified here.

We are dealing, then, with a phenomenon which is truly sui generis, having no known parallel (at least on our planet) and existing in a class by itself.
Figuring out how language evolved has been frequently described as one of the hardest problems in science so if you have come here looking for the right answer, you will be disappointed.
What we can do here, however, is map out some of the ideas that have been suggested and what evidence has been assembled to support them and see how they measure up.

The first thing to do is pose answerable questions. It is of little use simply asking, "How did language evolve?" if we aren't fully sure what it is that evolved. We will, therefore, refine the question in the light of the seven characteristics of true language that we have so far identified and ask, instead:

How did an arbitrary system of meaningful symbolic units arise?
How did a patterned system of syntax emerge to combine those units in productive ways allowing for innovation, the expression of displacement and recursiveness?

Simple logic would lead us to a first conclusion: symbolic units must have developed before a combinatory system into which they can be organised. If you have no symbols (i.e., words) to employ, syntax has no purpose and nothing to work on. That does not mean that they arose simultaneously or in partnership because the evolutionary history of each may be different.
It has been suggested (Fisher and Marcus, 2006) that at present there is no way to validate the core assumption that lexicons evolved before grammar. That may be the case, but the assumption really does not need validating, for simple logical reasons. It is not possible to see how grammar could have evolved without a lexicon but it is easy to see how it could happen the other way around. Assuming the reverse is akin to assuming that an eye could evolve in total darkness.
There is, by analogy, an obvious and probably demonstrable route by which a scent organ can evolve but that is not dependent on or connected to an explanation of the evolution of bipedal locomotion. The two happened to evolve according to the laws of evolution but they are otherwise unconnected. The ability to detect smells is valuable whether or not one has a bipedal locomotion system and bipedalism is, presumably, as useful a method of moving around whether one has a sense of smell or not.
Here we must take a short detour to explain what these laws of evolution are so that we can use them to assess the value of our theories.

Three bases for evolution

For evolution to occur naturally, three things are needed:

Individuals must vary slightly with respect to their physiology and behaviour. That is called variation in the phenotype.
Regarding language, this means that some individual hominids somewhere must have had a variation in brain structure, the structure of the larynx or other variations that made them even if only very slightly better able to produce, decode and use speech to express meaning.
Variation must be passed down the generations. That is, the characteristics must be heritable.
For language, that means that the abilities to handle speech and symbolic systems as well as process syntax must be reflected in the animal's genome and be subject, as all other genetically determined characteristics are, to heritability. It is no good, in other words, learning to be better at language if you die without passing on the genetic information that allows the next generation to be better at language.
It is also no good simply learning more words or developing the muscles to vary your speech because these learned or developed traits are not reflected in your genome and die with you.
Different characteristics of individuals must confer different rates of survival and reproduction. That is differential fitness.
Regarding language, this means that the ability to be slightly (only very slightly) better at manipulating language units must have endowed its possessors with a small but visible (to evolution) advantage allowing better health, greater attractiveness, longer lives or a combination of those which led to greater reproductive success and hence the spread of the genes which determine the behaviour.
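
How small a "small but visible" advantage can be is easier to appreciate with numbers. The sketch below uses a simple deterministic selection recurrence for a heritable variant. The starting share, the 1% advantage and the function names are all assumptions chosen for illustration, not estimates for any real trait, and the model ignores chance, population structure and much else that matters in practice.

```python
def next_frequency(p, advantage):
    """One generation of selection on a heritable variant.

    p: current share of the population carrying the variant (0..1)
    advantage: relative reproductive advantage of carriers (0.01 = 1%)
    """
    return p * (1 + advantage) / (1 + p * advantage)


def generations_until(target, start, advantage, limit=100_000):
    """Count generations until the variant's share reaches the target."""
    p = start
    for generation in range(limit):
        if p >= target:
            return generation
        p = next_frequency(p, advantage)
    return None  # never reached within the limit


# Assumed figures: 1 carrier in 1000, with a 1% reproductive advantage.
to_majority = generations_until(0.5, start=0.001, advantage=0.01)
no_advantage = generations_until(0.5, start=0.001, advantage=0.0, limit=10_000)

print(to_majority)
print(no_advantage)  # None: without differential fitness nothing spreads
```

The three conditions map onto the toy model directly: variation is the non-zero starting share, heritability is the fact that each generation's share is carried into the next, and differential fitness is the advantage. Set any one of them to zero and the variant does not spread.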

Gene or meme?

First, the definitions of what we mean here:

Non-technically defined, a gene is a biological unit of heredity which has a physical existence. Slightly more technically, it is a sequence of nucleotides forming part of a chromosome. We can see it, analyse it and even manipulate it.
Genes are, by definition, passed on to later generations of the species so, for example, a gene for eye colour, song form, bark volume or flower shape will be passed down the generations of animals and plants from parent to offspring. Providing only that the gene's actions are beneficial or neutral in effect, the gene will continue to pass down the generations. Any mutation (change to the genome) which is detrimental in terms of reproductive success will be eliminated from the genome as the generations pass.
Genes can encode for behaviour or physical characteristics but may not do so directly. They may function to affect the actions of other genes (and are called regulatory genes when they do that).
A meme, on the other hand, is a unit of cultural transmission and is not a physical entity. It is an element of culture which is passed from person to person non-genetically.
So, for example, the habit of wearing a baseball cap the wrong way around or using LOL to mean laughing out loud is not a behaviour which is inherited; it is one which is passed indiscriminately from person to person whether they are related or not. Again, memes which confer some benefit, real or perceived, will succeed in reproducing and spreading more widely. Those which do not, or which lose their novelty or other benefit, will die out.
By this definition, all lexemes are memes.

Memes behave in a superficially similar way to genes: they are passed to others, they compete and they are identifiable as being successful when they have greater reproductive capacity. Some memes, like some genes, are more successful at reproduction than others and are, therefore, more widely distributed. Others may die out when they lose the ability to reproduce – almost nobody talks about hearing something on the wireless any longer.
Because a language is a cultural artifact, its development and content may be seen as memetic rather than genetic.
Language, by contrast, rather than a language, is less easily described as memetic because, although individual words and expressions may be transmitted culturally, the ability to manipulate them is not.
We need to explain how the ability to process complex syntax and a huge range of symbols for ideas, events and entities emerged in humans and in no other life forms.

Question begging

All research and speculation concerning the evolution of language assumes that there is something that can be explained by an appeal to a biologically controlled process. If language is simply a culturally transmitted ability, then no genetic basis for it needs to be assumed and no thought needs to be given to how it has been shaped by millions of years of evolution.
The distinction that is being made here, and needs another short detour to explain, is that the acquisition of a language is a memetically determined phenomenon but the acquisition of language, rather than a language, is a genetically determined one.

If you have followed the guide on this site to first- and second-language acquisition, you will be familiar with Aitchison's six defining characteristics of a biologically rather than culturally determined ability. Each of the six characteristics is set out here, followed by some explanation. For more, see the source text (Aitchison, 2008:71, drawing on Lenneberg, 1967):

The behaviour emerges before it is necessary.
Children are fed, clothed and looked after well into life (sometimes until well after puberty). Children do not need language to survive. Walking and upright posture come into the same category. Those, too, are behaviours which emerge before they are needed.
Its appearance is not the result of a conscious decision.
A child does not suddenly decide to learn a language. You may decide to become a concert pianist at a very early age but the decision is a conscious one which means putting in a good deal of practice. You did not, however, ever decide to learn your language. It simply happened.
Its emergence is not triggered by external events (though the surrounding environment must be sufficiently ‘rich’ for it to develop adequately).
Children begin to talk even when their immediate environment is unchanging. They live in the same place with the same people, eating the same food and doing much the same things.
Direct teaching and intensive practice have relatively little effect.
If you are determined to become a concert pianist, it is quite likely that the amount and quality of teaching and practice you get will be directly related to your eventual skills level.
Not so with language. Although carers often make explicit efforts to correct children's language production, the evidence is that it has almost no measurable effect.
There is a regular sequence of ‘milestones’ as the behaviour develops, and these can usually be correlated with age and other aspects of development.
All children develop speech in the same way, reaching certain milestones at approximately the same age.
For example, at 12 to 18 months, children communicate in single or double words and set phrases but by 18 to 24 months, they exhibit increasing vocabulary and rudimentary grammar. This happens everywhere with every language with all children.
There may be a ‘critical period’ for the acquisition of the behaviour.
There is a good deal of evidence, drawn from cases of feral children denied access to language data or of children who are brain damaged or otherwise hindered in their ability to access language data, that the critical period is between 2 and 13 years. If language is not acquired then, it will be only partially acquired in later life, if at all, although there are exceptions and the data are limited, thankfully.

There is more on this in the guide to first- and second-language acquisition, linked below.

If it can be shown that language acquisition is not culturally determined but the result of internal genetically-programmed changes, just like the ability to crawl and walk, then we need to explain how the ability arose and for that we have to turn to evolutionary science as the only plausible explanatory mechanism.
We must not, however, necessarily accept the ways in which the demonstrably biologically determined nature of language acquisition has been extended by some, notably Chomsky, to include a pre-programmed Universal Grammar of all languages whose nature is hard-wired into the brain's structure. That does not necessarily follow.
Evans and Levinson are, for example, sceptical of the claims of innate grammatical structures and they conclude:

the great variability in how languages organize their word-classes dilutes the plausibility of the innatist UG position
Evans & Levinson, 2009:14

because, as they and others have pointed out, human languages are, in fact, much more variable than has been properly recognised by those working within a limited range of, usually, Indo-European languages (and often only one of them).
For more, see the guide to Chomsky, linked below.

It is now generally accepted, however, that the ability to learn one's first language is a genetically controlled one but the nature of the language (its words, forms, structures, phonology and so on which are unique to it) is either culturally transmitted or, if one accepts the notion of a Universal Grammar, one which is only variable within pre-determined limits.

Now, finally, we can start to consider our two questions.
We will look first at question 1:
How did an arbitrary system of meaningful symbolic units arise?

Calls and words

One theory, usually not held by linguists, is that words emerged from animal calls because evolution does not invent things unnecessarily but tinkers with what already exists. If it can be shown that animal calls have some meaning, the theory goes, then arbitrary meaning can arise from that. This has been called the theory of genre continuism (Bickerton, 2005:513).
This means, for example, that the word snake (which bears no relationship to the animal to which it refers) arose from the alarm call that was in use to alert other members of a tribe to the presence of a dangerous predator.
There are, however, some very distinct differences between human-language words and animal calls, and the issues are set out by Bickerton (op cit.):

Genetic determination
1. Calls are genetically determined. All monkey alarm calls, for example, within species are identical and monkeys do not learn what a call means or how to produce it by imitation or education.
2. Words, by contrast, are culturally transmitted. They are memes, in fact. The word snake has a significance only to those within the culture that uses it. Other languages do not recognise it: serpent (French), Schlange (German), had (Slovak), serpiente (Spanish), φίδι [feedi] (Greek) etc.
Propositional value
1. Calls are propositions so the alarm call means
      Look out there's a snake
  the mating call means
      Come mate with me
  the territory call means
      Go away or I'll attack
  and so on.
2. Words alone are not propositions; they are made part of propositions by syntax. The word happy, for example, certainly carries meaning for speakers of English but is not in and of itself a proposition. It can be made part of a proposition but that requires syntax and other words as in, e.g.:
  She was not happy
Symbolism
1. Calls are not symbolic; they refer to the here and now and are meaningless in the absence of what they are referring to. The alarm call cannot be used to refer to the danger posed by a snake which is not here and now.
  A call can mean:
      There's an eagle overhead
  but it cannot mean
      There was an eagle overhead
  or
      Have there recently been any eagles around here?
  and however complex the call is (and they are usually not complex at all) it cannot be broken down into the element that means eagle and the element that means overhead.
2. Words, on the other hand, are symbolic and can refer to something displaced in time and space from the word's use.
  Even a one-year-old's single utterance of, for example:
      Bear
  can, depending on context, mean:
      I want my bear
      Bring me my bear
      My bear is here
  and so on.
  Words take meanings from context; calls are context independent.

These are serious problems because the differences between calls and words are not ones of degree: the two are qualitatively different. To assume that words (symbols) arose by slow evolution from calls means that you have to demonstrate how these qualitative differences came about. In other words, for example, you have to show which part of a territory-determining call means I live here and which part means No trespassing.

Social intelligence

The idea that an ability to use language was selected for by the demands of group interaction has been around for some time in various forms. The forms it takes are:

Language began as an accompaniment to or replacement for social grooming in primates
How are you today?
Gossip has also been suggested as a way that language evolved in this setting.
Language began as a way of facilitating the ability to hunt cooperatively
You go round the back and drive the antelopes towards us
kind of thing.
Language began as a way of training in tool making
You hit the flint just here with the horn and it flakes this way, you see?
Language arose out of the development of some kind of primitive theory of mind in which we are aware that others can be fooled or manipulated for us to get our own way
No, I'm not interested at all in that piece of food. Gosh, look over there!
Language evolved when human (or hominid) group sizes became too large for primate grooming to play its usual social cohesion roles
I love you, too.

There are problems with most of these theories.

Many animals interact in a way that involves social grooming (ponies, horses, cats, baboons, cattle, monkeys, bats, lions and even some insects) but have not developed language as an accompaniment to it.
Many other animals hunt cooperatively (jackals, wolves, falcons, dolphins, chimpanzees, lions, crocodiles and even some insects) without having developed language to help the process.
Tool making skills are culturally inherited and passed down through demonstration and imitation and do not need language to facilitate the process.
Some tool making in animals may be genetically determined behaviour and some may be learned later, but none needs language for its transmission from generation to generation. Elephants, for example, may modify branches to use as fly swats, dolphins learn to use sponges to help with catching prey, sea otters use rocks to crack open shellfish, crows and gulls drop shelled prey onto hard surfaces to break it open, and orangutans use modified branches for many different purposes, including swatting, scooping and probing. The list can be greatly extended. However, none of these behaviours, either learned or inherited, requires the use of language for its transmission.
Most apes (and especially chimpanzees) can be shown to possess a theory of mind which allows for the concept that others may not be thinking as we are but have still not developed language to facilitate the conceptual process involved. Successful deception requires, to some extent, a theory of mind. Animals may take on the appearance of other, more dangerous or poisonous ones or may feign death or injury to deceive a predator; ravens will cache food in secret; many apes will do the same and may also use bogus alarm calls to deceive or distract others. None of this requires language per se.
There is no evidence that hominid groups actually did grow very large at any time and, in any case, some primate groups are very large without having a concomitant ability to use language instead of physical grooming. Baboons and some monkeys, for example, may live in large multi-level societies in which there are families within clans, within bands, within larger troops (up to 250 individuals in baboons) but, again, no language has needed to evolve to manage the socially complex relations.

Nevertheless, we know from our theory of evolution that there must have been some selective pressure which would lead to greater and greater symbolic vocabularies (a lexicon of sorts) which would have conferred some advantage on its possessors.
Because the acquisition of more and more symbols does not require any fundamental rewiring of the brain, there seems no reason to suppose that a slow, incremental process of enlarging the vocabulary of language symbols for events and entities could not have arisen. Once you have one symbolic unit (aka lexeme) there is no impediment to increasing the number you can possess.
It is straightforward enough to imagine that the acquisition of a greater vocabulary, without any syntactical framework, would have given its possessors a distinct selective advantage. If, for example, one member of a group has discovered a source of food (fruit, a carcass etc.) the ability to say food and gesture in its general direction will be an advantage but the ability to say what sort of food at what distance away requiring what resources for its exploitation would confer even greater advantages.
The assumption is, therefore, that while syntax does require some clever rewiring of the human neural system, the acquisition of a greater and greater range of symbolic signs did not.

What we have here is the development of what is now a generally accepted concept of a ...

Proto-language

The human proto-language is assumed to have some identifiable characteristics which can be seen in the sorts of language that children use around the age of two and also in some pidgins and other primitive languages which manage without anything we would call syntax.
It consists of a wide range of symbols with nothing that we can term structure in which to embed the ideas. Nevertheless a proto-language would have some characteristics of what we have defined above as true language:

Symbols would be arbitrary and have specific referents so noun phrases would form the backbone of the language. There would, the theory goes, be specific ways to refer to things of general and specific interest to the group which would be accepted by all speakers of the proto-language (which would not be the same as another group's language). So for example, the language would be able to express:
    a lion
    a fruit (and name types thereof)
    a close relative
    a more distant relative
    fire
    warmth
    danger
and so on.
There would have to be some symbols that refer not to physical entities but to actions (what we now call verbs) such as
    kill
    go
    share
    cut
    gather
etc.
There may also have been some symbols referring to abstract entities such as times and directions such as:
    in that direction
    an hour's walk away
    over that hill
etc.

It is not surprising that a proto-language should focus mostly on things and events because that is how the universe is ordered.
The physical world, as Newton knew, can be described in terms of objects and events. In other words, things do stuff, have stuff done to them or stand in particular relationships to other things. Our pre-Newton ancestors could not have explained the processes but could hardly have failed to notice that some things are heavy, some light, some burnable, others not, some edible, some poisonous and so on. They will also have noticed and taken note that some things are found in particular relationships to other things: fish in water, plants in earth, certain animals in groups and so on.
In fact, people who rely for their continued existence on successful hunting and gathering usually have a very well developed inventory of objects and relationships. Such knowledge does not help those who rely on working and shopping for their existence and the knowledge has decayed, having no further adaptive function. We know, however, even in a supermarket we have never entered before, that tea is likely to be found in the same area as coffee and sugar will probably not be far away.

The really critical issue is whether a proto-language would be, in the manner of animal calls, propositions which were embodied in a single expression or whether there would be separate units which could be combined to make meaning.
A single utterance can, for example, mean:
There is danger from a snake
or it can be a set of symbols each carrying its own meaning along the lines of:
Danger | exists | in this direction | from a snake
in which the utterance is synthetic, being made up of four discrete units (the second of which is redundant) rather than representing a single idea.

Here, at last, we have an inkling of how syntax may have developed.
While it is possible to have single utterances which signify danger from a snake, danger from an alligator, food for scavenging, food for collection, place of shelter from predators, place of shelter from rain and so on and on, it is much more efficient to combine units so the term for shelter or danger, for example, remains constant while the phenomenon from which the shelter is available or the danger emanates can be attached to it. So we get paradigmatic relationships forming such as:
    snake danger
    lion danger
    fruit food
    meat food
and so on. This still isn't syntax but the way ahead for it to become syntax is clear.
Just as modern languages can posit a noun collectively to represent all instances of something, so we can attach modification to make it clear which instance we are referring to. Therefore, modern languages do not need a separate symbol to represent each and every instance of a type of entity but can use a generic term for the entity and modify it to identify this instance of its existence. Thus, we can distinguish between
    car
and
    my car
or
    yellow car
etc.
The selective advantage of such a system is clear because it multiplies the ability to refer to things without multiplying the memory resources needed to store separate lexemes (or symbols).
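The arithmetic behind that advantage can be sketched in a few lines of Python. This is only an illustration: the word lists and the function names are invented for the example and are not drawn from any study of proto-language.

```python
from itertools import product

# Hypothetical inventories, invented purely for illustration.
MODIFIERS = ["my", "yellow", "big", "that"]
NOUNS = ["car", "bear", "fire"]


def holistic_inventory(modifiers, nouns):
    """One unanalysable symbol per message: every pairing is stored separately."""
    return {f"{m}{n}" for m, n in product(modifiers, nouns)}


def combinable_inventory(modifiers, nouns):
    """Reusable units: each modifier and each noun is stored only once."""
    return set(modifiers) | set(nouns)


def expressible_messages(modifiers, nouns):
    """Messages that the stored units can express when combined in pairs."""
    return {f"{m} {n}" for m, n in product(modifiers, nouns)}


holistic = holistic_inventory(MODIFIERS, NOUNS)
units = combinable_inventory(MODIFIERS, NOUNS)
messages = expressible_messages(MODIFIERS, NOUNS)

print(len(holistic))  # symbols to memorise if every message is a single call
print(len(units))     # symbols to memorise if units can be combined
print(len(messages))  # messages that either system can express
```

With four modifiers and three nouns, the holistic system needs twelve stored symbols against seven for the combinable one, and both express the same twelve messages. The gap widens as either list grows because one figure is a product and the other a sum.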
What we do not yet have is a fully synthetic true language but we do have the makings of one because we have:

A wide and widening vocabulary of symbols representing both nouns (overwhelmingly) with some verbs and (possibly) directions.
A primitive but functionally useful way of combining sounds (symbols) to make differences in meaning so we can distinguish between:
Dangerfromasnake
and
Danger | from a snake | yellow | in a tree | above you

Jackendoff (1999) identifies what he refers to as fossil remains of proto-language in the language of infants and in adult language items such as No!, a generalised prohibition, or Shh!, a specific prohibition.
These single word (or even single sound) utterances are not combinatorial and do not depend on any form of syntax for their comprehension.
Infant language (pre-syntax) is often identified as akin to a proto-language, a theory that mirrors the idea that ontogeny recapitulates phylogeny (i.e., that the development of the embryo mirrors the stages of the evolution of the adult animal, an idea usually attributed to Ernst Haeckel (1834–1919), and now almost fully discredited). The application to language evolution is that the child's language begins with a set of symbols (a lexicon) and only later does the ability to set the symbols in syntactical units (a grammar) arise.

Jackendoff also sees, among much else, some logical emergence of what he refers to as agent-first syntax, averring that proto-syntax would naturally have assigned the first position in an utterance to the agent of any event so we have:
He made fire
not
Fire he made
etc.
An immediate objection, of course, is that some human languages do not put the agent first.

He goes further and identifies phenomena in modern languages which are syntactically quite promiscuous, in particular, disjunct expressions such as in my opinion or unfortunately which can occur virtually anywhere in a clause as in, e.g.:
    She, unfortunately, lost the money
    Unfortunately, she lost the money
    She lost the money unfortunately
    In my opinion, he is unhelpful
    He is, in my opinion, unhelpful
    She is unhelpful, in my opinion
He does not, however, in that paper, distinguish between disjunct and adjunct use and adjunct use in most languages is far more strictly constrained in terms of syntax.
He concludes (op cit.:279):

I have tried to show that (1) there are indeed many special aspects of language, but (2) that they could have evolved incrementally, not unlike the eye and the parts of the brain that the eye serves. Having less than the whole system would still have been useful. What is also new here is the hypothesis that certain design features of modern language might be ‘fossils’ of earlier evolutionary stages.

What is as yet unclear is how the giant steps from a simple call with a single here-and-now meaning to a large vocabulary of arbitrary symbols to a combining mechanism into which to insert the symbols were made.
It may be the case that other developments in human neurophysiology were recruited to be used for the processing of syntax or that the selective advantages of having the ability to use simple syntax, however nascent, were great enough in themselves to force evolution down the road of creating the mental wiring to allow it to happen.
Nobody really knows (yet).

The emergence of syntax

We can now attack the second question we began with:

How did a patterned system of syntax emerge to combine those units in productive ways allowing for the expression of displacement and recursiveness?

Here we are on thin ice.
How you explain the emergence of fully formed syntax depends, of course, on what you think it is. At its simplest level, syntax may be described as a set of paradigmatic relationships which determine the role of each constituent of a clause or utterance. So, for example:
    Share that fruit
has three components which can each be replaced but only by equivalent concepts so we could also have:
    Make new fire
    Help her carry
    Avoid dangerous snakes
and so on.
This is deceptively simple because each unit is, in fact, performing a different syntactical function. Syntactical rules, in English in this example, do not allow:
    Fire make new
    Carry help her
    Snakes avoid dangerous
etc.
but there is no a priori reason why one type of ordering should be preferred over another, providing only that the message is clear and the conventional arrangement is shared by those who need to use it.
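The notion of slots which accept only equivalent items can be sketched in Python. The category labels, the tiny lexicon and the single ordering convention below are all assumptions made up for illustration; they are not a claim about how any real grammar is represented. Help her carry is left out because, as noted above, its units perform different functions.

```python
# A toy lexicon assigning each symbol to a hypothetical category.
LEXICON = {
    "share": "ACTION", "make": "ACTION", "avoid": "ACTION",
    "that": "MODIFIER", "new": "MODIFIER", "dangerous": "MODIFIER",
    "fruit": "THING", "fire": "THING", "snakes": "THING",
}

# One conventional ordering of slots, assumed to be shared by the group.
PATTERN = ["ACTION", "MODIFIER", "THING"]


def fits_pattern(utterance, pattern=PATTERN, lexicon=LEXICON):
    """Return True if each word fills the slot that the pattern expects."""
    words = utterance.lower().split()
    if len(words) != len(pattern):
        return False
    return all(lexicon.get(word) == slot for word, slot in zip(words, pattern))


print(fits_pattern("Share that fruit"))
print(fits_pattern("Avoid dangerous snakes"))
print(fits_pattern("Snakes avoid dangerous"))

# A group with a different shared convention would accept a different order.
OTHER_PATTERN = ["THING", "ACTION", "MODIFIER"]
print(fits_pattern("Snakes avoid dangerous", pattern=OTHER_PATTERN))
```

Under the first convention the two conventional utterances fit and the scrambled one does not. Under the second, invented convention the scrambled one fits, which reflects the point that no ordering is preferred a priori so long as it is shared.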

The selective advantage of having even a basic syntax is quite obvious because it allows the language user to set the lexicon in a context / co-text and that allows a much greater range of messages to be sent and understood. While there may well be an evolutionary advantage to being able to say Food there is a much greater advantage in being able to say what sort of food, where and at what distance. Even greater advantages are conferred by being able to say whose food it is and what we should do with it.

Although it is a relatively easy matter to point out how better syntax (i.e., more flexible and productive syntax) will have been selected for because it confers obvious benefits on its possessors, it is far less easy to say how this ability evolved in the first place.
It may be the case that for many millennia, a proto-language consisting of large numbers of arbitrary symbols for things and events was all that humans possessed and that syntax evolved very much later and quite suddenly. That is Bickerton's view.
An alternative idea is that syntax itself began as a rudimentary ordering of symbols and gradually grew in complexity and subtlety over 200,000 to 300,000 years of human evolution. That is the view of Pinker and Jackendoff.
As yet, the jury is out because there simply isn't enough evidence to decide one way or the other.

As we saw above, this all begs the question that there is something unusual to be explained, of course, and there are some who would aver that language is memetic not genetic so its acquisition can be explained by recourse to social factors alone.
This is not a widely held view and most people in the field allow that a child's phenomenal ability to acquire a language with almost complete success in a matter of a few years does have to be explained and that the explanation lies in positing some kind of genetic programming which allows it. In other words, they posit something akin to a mental module specifically devoted to language acquisition and use.
The details are debated and debatable but that there is something unusual to be explained is not now generally debated.
Elsewhere on this site there are guides to Chomsky and to the various theories underlying first- and second-language acquisition, linked in the list at the end to which you can refer for more.
From here on the assumption is that the ability to learn language (any language) is something with which humans are uniquely genetically endowed.

Big brain theories

The simplest and most naïve way of expressing the idea of a big brain theory is to state:

People are intelligent because they have big brains and big brains give us the ability to learn language

It follows that because we are clever animals, we have the ability to invent and learn language.
While simple to state, this does not carry the sort of explanatory power that we need. Intelligence may indeed be required to learn a language (any language) and bigger brains have more of it but the direction of causality remains obscure. It is not clear whether the need or advantage of language learning led to larger brains or the other way around.

Other ideas include the co-opting of already sophisticated neural connections, such as those used for motor-sensory functions, to the use of language. This is the idea that language is not a phenomenon confined solely to certain areas of the brain (the famous two being Wernicke's and Broca's areas) but is handled in a cortex-wide fashion, employing whatever complex neural connections are already present.
This is a view of the human brain as a kind of general-purpose machine for the ordering and combining of information.

A related concept concerns what is called exaptation, the phenomenon that some evolutionary changes, adapted for a particular function, can be co-opted for a completely different function. An example might be the process by which feathers may have been an adaptation allowing an animal to keep warm but which were later co-opted into part of a mechanism allowing flight. (One old objection to the whole theory of evolution was to ask, for example, "What's the use of half a wing?". Exaptation has been proposed as an answer.)
In a similar way, adaptations in the human brain which evolved under the pressure to fine-tune motor skills in the making of artefacts might have been co-opted to be used to process syntax. There are compelling similarities, for example, between the ability to imagine a finished artefact and then decide on the precise ordering of steps which must be taken in manufacturing it from the raw material to its completion and the ability to conceive of a message to communicate and then decide on the language items that are required to communicate it and then to arrange the items in such a way that the message is unambiguously sent.
This is a view rejected by many who support the ideas of a universal grammar and a language acquisition device which have to be embedded in their own specialised language module in the brain.

Another theory in this general domain is that syntax arose from phonology, specifically the structure of the syllable, so a Subject-Verb-Object ordering grew out of the syllable structure of Onset-Nucleus-Coda. This means that complex phonology preceded syntax (a sensible but unproven idea) and that the ability to structure a syllable was simply carried over to the ability to construct syntax. However, the theory fails to account very well for languages which do not have SVO ordering but one of the other five possibilities, in particular the very common SOV ordering.
It also leaves out other canonical word orderings concerning adverbials, genitives and more.
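The figure of five other possibilities follows from simple counting: three constituents can be arranged in six ways. A minimal Python sketch, using the conventional one-letter labels, enumerates them.

```python
from itertools import permutations

# Subject, Verb and Object, using the conventional abbreviations.
CONSTITUENTS = ("S", "V", "O")

# Every possible ordering of the three constituents.
orders = ["".join(p) for p in permutations(CONSTITUENTS)]

# The orderings that a syllable-based account of SVO leaves unexplained.
others = [order for order in orders if order != "SVO"]

print(orders)
print(len(others))
```

The list contains SVO, SOV, VSO, VOS, OSV and OVS, so any account which derives SVO alone from syllable structure still has five orderings to explain.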

Sudden or gradual?

The problem facing those who support a Chomskyan view that the language-learning ability cannot be explained by simply co-opting other elements of cognition is to explain how and when such a large evolutionary step was taken to allow the sudden appearance of a fully-formed universal grammar. On the whole, evolution does not appear to move in sudden steps but is a very gradual process.
There are those who suggest that the very sudden flowering of culture and representative art around 40,000 years ago after hundreds of thousands of years of virtual stasis was brought about by the abrupt development of a syntactical system of language which allowed for much greater subtlety and imagination.
The hidden assumption here is that one needs complex language to have complex thoughts and that is by no means an uncontroversial idea. The guide on this site to language and thought, linked below, considers this in greater detail.
An additional problem is trying to explain how something as complex as universal grammar or the language acquisition device which accompanies it could have been the result of a single genetic mutation (or, at most, a small number of connected mutations).

On the side of gradualism, however, there are also serious problems to overcome.
There is a need to explain what exactly were the intermediate steps between proto-language and a fully modern syntactical system. After all, if syntax emerged in small steps it ought to be possible at least to suggest what those steps might be between no syntax and fully recursive syntax of the modern sort.
There is also, of course, the need to explain what the first step was and why it happened.

The big question for proponents of both gradualism and suddenness used to be to explain why only one species seems to have acquired language but recent research has supported the idea that other species of hominid, particularly Neanderthals, also possessed language.

FOXP2

This is a gene that, when mutated or damaged in humans, results in an inability to handle syntax. In mice and songbirds, too, changes to the gene result in serious effects on the nature of vocalisations and the ability to learn sequences of sounds.
The protein produced by the gene has remained, according to research cited in Reich (2018: 8), unchanged for the 100 million years which separate the lineages of chimpanzees and mice. However, in both humans and Neanderthals, the gene has evolved much more rapidly, twice in 50,000 years and once again much more recently.
We should not, of course, run away with the idea that FOXP2 is in any sense 'the language gene' (because, for one thing, it does not affect the organism directly: it encodes a transcription factor which turns other genes on or off).

The existence of FOXP2 and its obvious connection to syntactical ability is, however, significant and may go some way to resolving the gradualist-suddenness debate.

It is tempting, too, to see mutations in this gene as giving rise to the ability, in humans and closely related species, to handle complex syntax and that might also explain how it is that no other extant species shares the skill.

Related guides
first- and second-language acquisition	for a guide to some current theories and how they may be relevant to teaching languages
second-language acquisition	for an overlapping guide to some current theories
Chomsky	such an influential figure that he gets a guide to himself
language, thought and culture	for an overview of theories linking language and thought and whether one determines the other or vice versa
How to speak to an alien	this is a speculative article which tries to imagine what an alien language might be like and it draws on some of the ideas above
the roots of English	this is a guide about a language, not language
types of languages	for a guide relevant to Universal Grammar

References:
Aitchison, J, 2008, The Articulate Mammal, 5th Edition, Oxford: Routledge
Bickerton, D, 2005, Language evolution: A brief guide for linguists, University of Hawaii, Lingua 117 (2007) 510–526, available online at www.sciencedirect.com
Cäsar, C, Zuberbühler, K, Young, RJ and Byrne, RW, 2013, Titi monkey call sequences vary with predator location and type, https://royalsocietypublishing.org/doi/10.1098/rsbl.2013.0535
Evans, N & Levinson, S, 2009, The Myth of Language Universals: Language diversity and its importance for cognitive science, in Behavioral and Brain Sciences, Cambridge: Cambridge University Press
Fisher, SE and Marcus, GF, 2006, The eloquent ape: genes, brains and the evolution of language, Nature Reviews | Genetics Vol.7
Jackendoff, R, 1999, Possible stages in the evolution of the language capacity, Trends in Cognitive Sciences – Vol. 3, No. 7, July 1999
Pinker, S and Jackendoff, R, 2004, The faculty of language: what’s special about it?, Cognition 95 (2005) 201–236 available from sciencedirect.com
Seyfarth, RM, Cheney, DL, Marler, P, 1980, Monkey responses to three different alarm calls: evidence of predator classification and semantic communication, Science 210, 801–803.
Reich, D, 2018, Who We Are and How We Got Here, Ancient DNA and the New Science of the Human Past, New York: Pantheon Books
von Frisch, K, 1967, The Dance Language and Orientation of Bees, Cambridge, Mass.: The Belknap Press of Harvard University Press.
(For more on FOXP2 and its role in language development, try the eminently accessible ScienceDirect article at https://www.sciencedirect.com/science/article/pii/S0002929707629024)