In conclusion, we should point out that the rationalist method suggested by Chomsky also comes with non-negligible biases and limitations, which can be corrected by the use of empirical methods. In particular, this method leaves a large space for the subjectivity of linguists while overestimating the linguistic skills of speakers. Indeed, the use of grammaticality judgments presupposes that all speakers have a definite and consistent intuition about every sentence in their mother tongue. However, this is not the case. While all English speakers agree that a sentence like “Mary dog her walks” is incorrect in English, whereas the sentence “Mary walks her dog” is correct, judgments are far less unanimous for complex sentences such as the one mentioned above: “When do you think he will prepare which cake?”. These divergences become problematic as soon as such judgments are used to build a linguistic theory. What is more, while many English speakers would probably reject a sentence such as “He does be working” as grammatically incorrect, in certain areas of the English-speaking world (such as Ireland) this sentence is acceptable. By drawing on many different speakers and building reference corpora that include speakers from different geographical areas, corpus linguistics makes it possible to address this problem in a much more satisfactory way.
What is more, in many areas of linguistics, such as lexicology, language acquisition and sociolinguistics, the idea of relying on the internal judgments of linguists is simply not conceivable. No one can study children’s language by remembering how he or she spoke as a child, or make assumptions about language differences between men and women by imagining how he or she would speak as a member of the other group. In all these fields, the use of text corpora has long been self-evident, and it was never interrupted as a result of Chomsky’s work. The paradigm shift of recent decades has taken place in areas where a purely rationalist methodology is conceivable, such as syntax.
Finally, it is important to remember that linguistic theory and the intuition of researchers are not absent from most corpus studies. Indeed, a majority of linguists regard corpora as a tool for validating or invalidating hypotheses about language that were formulated in advance, on the basis of the scientific literature and their own linguistic intuitions. We will see many examples of this approach (empirical validation) throughout this book. This corpus-based approach is opposed to one which considers corpus data as the only point of reference, both theoretically and methodologically. In the latter, called the corpus-driven approach, linguists begin their research without a priori assumptions and simply let hypotheses emerge from the corpus data. This approach is far from unanimously accepted among linguists working with an empirical methodology. On this point, we agree with Chomsky’s opinion, expressed metaphorically, that working in this way would be the equivalent, for physicists, of hoping to discover the physical laws of the universe by looking out of their window. Observing data without a hypothesis often leads to not being able to make sense of the data. It is for this reason that the approach we will adopt in this book is the corpus-based one, which treats corpora as tools enabling linguists to test their hypotheses.
1.4. Corpus linguistics and computer tools
As we have seen above, corpus linguistics as practiced nowadays cannot do without computers. Even though precursors of corpus linguistics have long existed (such as the indexing of the Bible by theologians, or the file-based compilation of dictionaries by scholars like Antoine Furetière for French or Samuel Johnson for English), the discipline could only properly take off with the arrival of computing.
Corpus linguistics depends on computer science for various reasons. The first, already mentioned above, is the need for computerized texts in order to carry out truly quantitative research. Nevertheless, searching a corpus, even a computerized one, with a simple word processing tool is rather inconvenient. Going back to the example of the search for terms related to love in Flaubert, which we discussed earlier, we find that the search function of a typical word processor quickly reaches its limits. First of all, in order to verify that all occurrences found when looking for the verb to love correspond to expressions of love as a feeling rather than to modal uses, as in the phrase “I would love it if you kept quiet”, it is necessary to examine each occurrence and thus browse the entire text. Second, to find all the occurrences of the verb to love, it is necessary to perform a different search for each verbal form, for example love, loved, etc. It is for this reason that other computing tools, specifically devoted to corpus linguistics, have been developed.
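The limitation described above can be illustrated with a short script: a plain string search retrieves only one inflected form at a time, whereas a single regular expression can gather every form of the verb in one query. The tiny corpus and the list of verb forms below are invented for illustration.

```python
import re

# A toy corpus (invented for illustration).
text = ("Emma loved the idea of passion. She loves novels. "
        "I would love it if you kept quiet. They had loved before.")

# A plain search, like a word processor's, finds only one form at a time.
plain_hits = text.count("loved")

# One regular expression gathers all inflected forms of "to love".
pattern = re.compile(r"\blov(?:e|es|ed|ing)\b", re.IGNORECASE)
all_forms = pattern.findall(text)

print(plain_hits)      # occurrences of "loved" only -> 2
print(len(all_forms))  # occurrences of any form of "to love" -> 4
```

Note that even this improved search cannot separate the modal use ("I would love it if...") from love as a feeling; that distinction still requires examining each occurrence in context, which is precisely what concordancers facilitate.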
In particular, concordancers are useful for retrieving all the occurrences of a word together with their context of use, and for displaying the results line by line in a single query. These tools also make it possible to establish the list of words contained in the corpus, together with their frequency, and to generate a list of keywords characterizing the content of a corpus. In the case of corpora containing texts as well as their translation, certain tools called aligners make it possible to align the content of the corpus sentence by sentence. Once that is done, bilingual concordancers can search directly for the occurrences of a word in one of the two languages of the corpus, and simultaneously extract the matching sentence in the other language. We will learn how to use these tools in Chapter 5, which is devoted to the presentation of the main French corpora, as well as the tools for analyzing them.
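The two core functions of a concordancer mentioned above, keyword-in-context display and frequency lists, can be sketched in a few lines. This is a minimal toy implementation, not any particular tool's algorithm; the sample sentence is invented.

```python
from collections import Counter

def kwic(tokens, keyword, width=3):
    """Keyword-in-context: one line per occurrence, with `width`
    tokens of context on each side of the keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

tokens = ("Mary walks her dog . Her dog likes Mary "
          "and Mary likes her dog .").split()

for line in kwic(tokens, "dog"):
    print(line)

# A frequency list of the corpus words, as a concordancer would produce.
freq = Counter(t.lower() for t in tokens if t.isalpha())
print(freq.most_common(3))
```

A real concordancer adds lemmatization, sorting of the context columns, and keyword extraction by comparison with a reference corpus, but the line-by-line display rests on this simple windowing idea.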
Then, in Chapter 7, we will also see that in order to answer certain research questions, it is necessary to annotate the content of a corpus. For example, let us imagine that we wish to study the different contexts in which the causal adverb since can be used. If we simply look up the word since in the corpus, we will also find occurrences which do not correspond to the use of this word as a causal adverb, but to its use as a preposition, for example in “I haven’t seen Mary since Christmas”. So, to correctly retrieve the uses of since we are interested in, we should keep only those which are adverbs and exclude the prepositions. This search is greatly simplified if the corpus has been annotated by determining, for each word, its grammatical category. This operation, called part-of-speech tagging, can be performed automatically by certain software.
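Once a corpus has been part-of-speech tagged, the filtering step becomes trivial. The sketch below assumes a toy corpus already annotated in a word/TAG format; the tag labels (ADV, PREP, etc.) and the tagged sentences are invented for illustration and do not come from a real tagger.

```python
# A toy pre-tagged corpus in word/TAG format (tags invented for
# illustration; a real tagger would produce its own tagset).
tagged = ("I/PRON have/AUX not/ADV seen/VERB Mary/NOUN since/PREP "
          "Christmas/NOUN ./PUNCT Since/ADV you/PRON insist/VERB ,/PUNCT "
          "I/PRON will/AUX stay/VERB ./PUNCT").split()

# Split each token into its word and its tag.
pairs = [tok.rsplit("/", 1) for tok in tagged]

# Keep only the occurrences of "since" tagged as an adverb,
# excluding the prepositional uses.
adverbial = [w for w, tag in pairs if w.lower() == "since" and tag == "ADV"]
prepositional = [w for w, tag in pairs if w.lower() == "since" and tag == "PREP"]

print(len(adverbial))      # -> 1 (the causal use)
print(len(prepositional))  # -> 1 ("since Christmas", excluded)
```

The whole burden of the task thus shifts to the quality of the automatic tagging, which is the topic of Chapter 7.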
Another problem arises if we decide to study the use of relative clauses such as “the girl who is intelligent” or “the violin which was left on the bus”. For this study, a good starting point would be to look for relative pronouns such as who or which in order to find occurrences of relative clauses in the corpus. The problem is that these pronouns are also used in interrogative sentences such as “Who do you prefer?” or “Which hat is yours?”. In this case, looking up the grammatical category of the word will not solve the problem, because in both cases they are pronouns. In order to find only the occurrences of who and which as relative pronouns, we need a corpus in which the syntactic structure of each sentence has been analyzed, in such a way that a grammatical function can be assigned to each word and words can be grouped into syntactic constituents. Tools for analyzing the syntactic structure of sentences have also been developed in the context of work on automatic language processing. These automatic analyses still require human verification to correct errors, but their performance is continually improving. The arrival of these tools has greatly accelerated research in corpus linguistics. We will discuss this issue in Chapter 7, which is devoted to annotations.
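The contrast between category and function can be made concrete with a minimal sketch. Here each token carries an invented (word, category, function) triple standing in for the output of a syntactic parser; the function labels "rel" (relative pronoun) and "int" (interrogative pronoun) are hypothetical, not a real parser's label set.

```python
# Toy syntactic annotation: (word, category, function) triples,
# invented for illustration in place of real parser output.
annotated = [
    ("the", "DET", "det"), ("girl", "NOUN", "subj"),
    ("who", "PRON", "rel"), ("is", "VERB", "root"),
    ("intelligent", "ADJ", "attr"),
    ("Who", "PRON", "int"), ("do", "AUX", "aux"),
    ("you", "PRON", "subj"), ("prefer", "VERB", "root"),
]

# The grammatical category alone cannot separate the two uses:
# the relative and interrogative words are all tagged PRON.
pronouns = [w for w, cat, fn in annotated if cat == "PRON"]

# The syntactic function can: keep only the relative pronouns.
relatives = [w for w, cat, fn in annotated if cat == "PRON" and fn == "rel"]

print(pronouns)   # -> ['who', 'Who', 'you']
print(relatives)  # -> ['who']
```

This is exactly why part-of-speech tagging alone is insufficient for this research question, and why syntactically parsed corpora are needed.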