Ever since the 1950s, when research and engineering in the USA and, to a lesser extent, in the UK, started to expand dramatically, English has been the lingua franca of the scientific community (Garfield, 1998). As a consequence, many scientists all over the world are now obliged to describe their research and discuss their results in a language that is not their mother tongue. This clearly affects the communication of science in the worldwide academic community, because the way a researcher writes in English depends largely on his or her familiarity with the language.
For the sake of communicating science, the scientific community has to allow certain unavoidable differences in style, provided they are within the bounds of English grammar. But a scientist is not expected to be either a professional writer or a translator. Furthermore, there is no standard scientific English against which to compare a text, so it is difficult to evaluate the style of a scientific publication. In fact, there is not even a standard for the English language itself, as various countries, such as Canada, the Caribbean, India, the Philippines, New Zealand and the USA, have developed varieties of English that are as distinct from British English as they are from each other (Ritter, 2002).
Although it is not possible to define a common standard for written English in scientific communication, it is valuable to identify local peculiarities and differences in writing by authors from various countries. These clearly prevail in some journals more than others, depending on the level of copy‐editing of the final text by editors and publishers. Such variations in the use of English, due to the authors' native language and cultural background, can not only make a text more difficult to understand and distract the reader from the content, but also carry the risk that the meaning of a sentence is diluted or misinterpreted by a reader with another language background. Thus, locally favoured words and phrases should be recognized, and eventually avoided, to increase the clarity of scientific communication.
To determine such variations in the scientific literature, we examined the MEDLINE database of biomedical articles (www3.ncbi.nlm.nih.gov/Entrez/index.html). This database contains more than 11 million references to biomedical articles, including the address of the main author, the country of the publisher and often an abstract of the publication. To associate abstracts with nationalities, we first extracted the name of the country from the affiliation field (Perez‐Iratxeta & Andrade, 2002). We eventually restricted the study to the 50 countries with the greatest numbers of abstracts in the MEDLINE database (Table 1). Almost half of the publications selected were from a country where English is not the official language or where less than 10% of the population speak English as their first language. The grammatical analysis of the text was performed using TreeTagger, a freely available program developed at the University of Stuttgart, Germany (www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger), that assigns a part‐of‐speech tag to each word in a text (see sidebars Box 1, Box 2 on page 448). We chose several parameters to illustrate the language variation observed for different countries.
Box 1: Use of TreeTagger I
Box 2: Use of TreeTagger II
First, we computed the average number of words and verbs per sentence and found that although these parameters vary greatly between countries, there are some correlations with the native language of a country (Fig. 1A). Anglo‐Saxon scientists write longer sentences—an average of 27 words and 3.8 verbs per sentence for the UK—as would be expected from their familiarity with English. Another remarkable difference is seen in the implied involvement of the author in his or her research. This personal involvement can be diminished by the use of the passive voice, which is discouraged in writing in general (Strunk & White, 1979), and in particular for technical writing (Day, 1994; Brown, 2000), but which nevertheless often pervades scientific articles (Möhn & Pelka, 1984). We distinguished passive sentences as those containing any form of ‘be’ followed by a verb in the past participle, allowing one adverb in between, such as “were significantly associated”. Another indicator of personal involvement is the use of the first‐person pronouns ‘I’ and ‘we’. Fig. 1B plots these parameters, and shows a significant difference between the USA and the UK, with the USA standing out from the bulk of the Germanic countries in the top‐left corner. Writers from Slavic countries occupy the opposite corner. Such an effect might also be related to the different role of the passive voice in some languages, for example Japanese and Russian, compared with English.
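The passive-voice rule described above (a form of ‘be’ followed by a past participle, with at most one adverb in between) can be sketched as follows. This is only an illustrative sketch, not the code used in the study. It assumes each sentence is already available as a list of (word, tag) pairs, and that past participles carry a tag that starts with "V" and ends with "N" and adverbs a tag that starts with "RB", as in Penn Treebank style tagsets. The function names are hypothetical.

```python
# Illustrative sketch: flag a sentence as passive if any form of 'be' is
# followed by a past participle, allowing one adverb in between.
# Assumption: tags follow a Penn Treebank style convention.

BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
FIRST_PERSON = {"i", "we"}


def is_past_participle(tag):
    # Assumed convention: VBN, or VVN/VHN in finer-grained tagsets.
    return tag.startswith("V") and tag.endswith("N")


def is_adverb(tag):
    # Assumed convention: RB, RBR, RBS.
    return tag.startswith("RB")


def is_passive(tagged_sentence):
    """Return True if the sentence matches the 'be' + participle pattern."""
    for i, (word, _tag) in enumerate(tagged_sentence):
        if word.lower() not in BE_FORMS:
            continue
        j = i + 1
        # Skip at most one adverb, as in "were significantly associated".
        if j < len(tagged_sentence) and is_adverb(tagged_sentence[j][1]):
            j += 1
        if j < len(tagged_sentence) and is_past_participle(tagged_sentence[j][1]):
            return True
    return False


def uses_first_person(tagged_sentence):
    """Return True if the sentence contains the pronoun 'I' or 'we'."""
    return any(word.lower() in FIRST_PERSON for word, _tag in tagged_sentence)


example = [("Levels", "NNS"), ("were", "VBD"), ("significantly", "RB"),
           ("associated", "VBN"), ("with", "IN"), ("age", "NN")]
print(is_passive(example), uses_first_person(example))
```

Matching forms of ‘be’ by word rather than by tag keeps the sketch independent of how a particular tagset labels that verb. The fraction of passive sentences per country would then be the number of flagged sentences divided by the total number of sentences.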
The use of prepositions and adverbs also differs according to the local language (Fig. 1C). Writers from German‐speaking countries, for instance, use many adverbs compared with Spanish speakers; indeed, the two languages differ considerably in the way they form adverbs and use them in a sentence. An example is the expression “sorgfältig statistisch ausgewertet” in German, meaning “carefully statistically evaluated”. The literal Spanish version “cuidadosamente estadísticamente evaluado” sounds odd, and Spanish speakers would rather write “evaluado con un método estadístico de manera cuidadosa”, which literally translates to “evaluated with a statistical method in a careful way”. This replaces the adverbs with equivalent noun–adjective pairs. Scientists from Slavic countries stand out as using many prepositions, which is in contrast to writers from several Asian countries.
Scientific language should be clear, conclusive and unequivocal. However, scientists often use words that imply uncertainty, such as the modal verbs ‘would’, ‘could’, ‘should’, ‘may’ or ‘might’, or adverbs such as ‘likely’, ‘possibly’ or ‘probably’. Anglo‐Saxon countries are prominent in this respect (Fig. 1D), whereas Chinese, Altaic and German‐speaking countries tend to avoid such adverbs and modal verbs. There is also a country‐specific difference in the use of nouns (Fig. 1E). A noun can be replaced by a personal pronoun (for example, “It was isolated from kidney”) that refers back to the noun in a previous sentence (such as “Protein X has a low molecular weight”). This back‐referencing is more common among authors from Romanic countries, particularly those that are Spanish‐speaking, who use the most personal pronouns per total number of nouns in a text. Another common way to replace nouns is by abbreviation, which is more prevalent among scientists from those Asian countries with ideographic writing, who tend to formulate shorter representations of many words.
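As a rough illustration of how the frequency of such uncertainty words could be measured, the sketch below counts the listed modal verbs and adverbs as a fraction of all words in a text. It is a simplification and not the method of the study: it matches plain word forms without part‐of‐speech tags, so it would also count the month ‘May’ or the adjectival use of ‘likely’. The function name `hedge_rate` is hypothetical.

```python
import re

# Words implying uncertainty, as listed in the text above.
HEDGES = {"would", "could", "should", "may", "might",
          "likely", "possibly", "probably"}


def hedge_rate(text):
    """Return the fraction of words in `text` that are hedging words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(word in HEDGES for word in words) / len(words)


print(hedge_rate("These results may possibly indicate a role for protein X."))
```

Averaging this rate over all abstracts from one country would give a single value per country that could be compared across countries.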
Another good marker for local peculiarities is the choice between words that can be used interchangeably. In our analysis, we chose the pairs ‘may/might’ and ‘though/although’ (Fig. 1F). Papers by Anglo‐Saxon writers show the highest prevalence of ‘although’ and ‘may’. By contrast, scientists from India are fond of using ‘though’, which is another example of how a country develops its own norms in the use of English. Finally, we analysed which words are specifically used in the scientific literature from these 50 countries (Table 2). Some of these words indicate a focus on certain research fields in a country, but others indicate language usage or even social differences between countries.
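One simple way to express the preference within such a word pair is the share of one word among all occurrences of the pair, computed per country. The sketch below assumes the input is a list of (country, abstract text) pairs. How the values in Fig. 1F were actually computed is not stated here, so the measure, the function name `pair_preference` and the sample data are all illustrative assumptions.

```python
import re
from collections import Counter, defaultdict


def pair_preference(abstracts, first, second):
    """Per country, return the share of `first` among uses of the pair."""
    counts = defaultdict(Counter)
    for country, text in abstracts:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in (first, second):
                counts[country][word] += 1
    # A country appears in `counts` only if it used at least one of the
    # two words, so the denominator is never zero.
    return {country: c[first] / (c[first] + c[second])
            for country, c in counts.items()}


# Made-up sample sentences, for illustration only.
sample = [
    ("UK", "Although it may work, it may fail."),
    ("India", "Though it might work, although slowly."),
]
print(pair_preference(sample, "although", "though"))
print(pair_preference(sample, "may", "might"))
```

A value near 1 means that a country's abstracts favour the first word of the pair, and a value near 0 means they favour the second.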
Clearly, this study has its limitations, as it takes raw data from MEDLINE abstracts that represent only the biomedical literature. Also, the authors’ affiliations do not necessarily indicate the real distribution of a publication's authors, as exemplified by this article, which has been written by German and Spanish scientists from German institutions, communicating with each other in a kind of English. Nevertheless, there are detectable differences in the use of English in the publications from the countries that we analysed. The most obvious factor is, of course, the local language in a country, as indicated by the clustering of countries using the same language or languages of the same family in Fig. 1. But these groupings are not perfect, and the use of English varies greatly depending on the parameters used in the study.
In addition, other cultural and geographical factors have a role in the variability of scientific English. One such factor is the mobility of the scientific community, which puts scientists of different countries in contact. In general, scientific communication, recently made easier by the worldwide web, email and electronic journals, could contribute to a convergence towards a global consensus for the English language. But such a consensus may not be ‘proper’ English as defined by a British or US dictionary. To illustrate this point, there are many words that are already more broadly accepted in the non‐native English‐speaking community (Fig. 2). We think that these words are preferred by non‐native speakers because they are simpler and easier to interpret, whereas a native speaker would either find these words too colloquial or choose a synonym from a wider range of words with more particular gradations of meaning. This situation is not limited to the international scientific community but takes place in other settings as well.
It is not yet clear whether this situation constitutes an impoverishment or an improvement of the English language. What is clear is that current atypical word usage by various countries can make communication more difficult. An example is the use of the term ‘subvention’, which is used for ‘grant’ or ‘subsidy’ in the Brussels administration of the European Union, but which is not a common term for native English speakers. Other examples are the German bastardized term ‘handy’ for a mobile phone or the product name ‘Bitter Sin’ in Spain, a drink that is bitter and non‐alcoholic—‘sin alcohol’. But, as we pointed out earlier, there is no norm for the English language, so such developments are not necessarily bad, provided that they conform to syntax rules. Nevertheless, for the sake of clarity of communication, divergence in scientific writing should be minimized or at least slowed down, so that deleterious innovations can be recognized and weeded out, and scientists will be able to understand each other better.
To keep divergence at bay, teaching of the English language is probably not sufficient, as local teachers may further spread particular local biases and variations. Much more important is regular contact between scientists from various countries, particularly with native English speakers, which would help all concerned to adhere to a standard form of scientific English. This does not necessarily mean face‐to‐face communication, but could also occur through reading of scientific literature published in English. In this respect, the editorial control of published material has an increasingly important function. The evolution of scientific English as a variant form of English should be seen as a healthy development and may improve communication in due course. Human languages have changed over centuries, and English itself was enriched by both Roman and Norman invasions. We should therefore not fear for the English language when it is again invaded by hordes of scientists from all over the world, albeit much more peacefully.
We thank H. Schmid for providing TreeTagger and E. Minch for comments on the manuscript. We are grateful to the National Library of Medicine for licensing MEDLINE to us. We thank the members of our international group, who are expert users of many languages, for discussions: F. Ciccarelli (Italian); M. Suyama (Japanese); S. Schmidt, J. Korbel and C. von Mering (German); Y. Zdobnov (Russian); I. Letunic (Croatian); D. Torrents (Catalan, Spanish); W.C. Lathe III (US English, Korean); Y.P. Yuan (Chinese‐Mandarin); P. Shah (Hindi, Gujarati, English); and D. Jaeggi (British English).
- Copyright © 2003 European Molecular Biology Organization