This evidence for multimodal influences on phonetic convergence is consistent with neurophysiological research showing visual speech modulation of speech motor areas. As has been shown for auditory speech, visual speech can induce speech motor system (cortical) activity during lip‐reading of syllables, words, and sentences (e.g. Callan et al., 2003, 2004; Hall, Fussell, & Summerfield, 2005; Nishitani & Hari, 2002; Olson, Gatenby, & Gore, 2002; Paulesu et al., 2003). This motor system activity also occurs when a subject is attending to another task and passively perceives visual speech (Turner et al., 2009). Other research shows an increase in motor system activity when visual information is added to auditory speech (e.g. Callan, Jones, & Callan, 2014; Irwin et al., 2011; Miller & D’Esposito, 2005; Swaminathan et al., 2013; Skipper, Nusbaum, & Small, 2005; Skipper et al., 2007; Uno et al., 2015; Venezia, Fillmore, et al., 2016; but see Matchin, Groulx, & Hickok, 2014). This increase is proportionate to the relative visibility of the particular segments present in the stimuli (Skipper, Nusbaum, & Small, 2005). Relatedly, with McGurk‐effect types of stimuli (audio /pa/ + video /ka/), segment‐specific reactivity in the motor cortex follows the integrated perceived syllable (/ta/; Skipper et al., 2007). This finding is consistent with other research showing that with transcranial magnetic stimulation (TMS) priming of the motor cortex, electromyographic (EMG) activity in the articulatory muscles follow the integrated segment (Sundara, Namasivayam, & Chen, 2001; but see Sato et al., 2010). These findings are also consistent with our own evidence that phonetic convergence in production responses is based on the integration of audio and visual channels (Sanchez, Miller, & Rosenblum, 2010).
There is currently a debate on whether the involvement of motor areas is necessary for audiovisual integration and for speech perception, in general (for a review, see Rosenblum, Dorsi, & Dias, 2016). But it is clear that the speech system treats auditory and visual speech information similarly for priming phonetic convergence in production responses. Thus, phonetic convergence joins the characteristics of critical time‐varying and indexical dimensions as an example of general informational commonality across audio and video streams. In this sense, the recent phonetic convergence research supports a supramodal perspective.
Research on multisensory speech has flourished since 2005. This research has spearheaded a revolution in our understanding of the perceptual brain. The brain is now thought to be largely designed around multisensory input, with most major sensory areas showing crossmodal modulation. Behaviorally, research has shown that even our seemingly unimodal experiences are continuously influenced by crossmodal input, and that the senses have a surprising degree of parity and flexibility across multiple perceptual tasks. As we have argued, research on multisensory speech has provided seminal neurophysiological, behavioral, and phenomenological demonstrations of these principles.
Arguably, as this research has grown, it has continued to support claims made in the first version of this chapter. There is now more evidence that multisensory speech perception is ubiquitous and (largely) automatic. This ubiquity is demonstrated in the new research showing that tactile and kinesthetic speech information can be used, and can readily integrate, with heard speech. Next, the majority of the new research continues to reveal a function for which the streams are integrated at the earliest stages of the speech function. Much of this research comes from neurophysiological research showing that auditory brainstem and even cochlear functioning is modulated by visual speech information. Finally, evidence continues to accumulate for the salience of a supramodal form of information. This evidence now includes findings that, like auditory speech, visual speech can act to influence an alignment response, and can modulate motor‐cortex activity for that purpose. Other support shows that the speech and talker experience gained through one modality can be shared with another modality, suggesting a mechanism sensitive to the supramodal articulatory dimensions of the stimulus: the supramodal learning hypothesis.
There is also recent evidence that can be interpreted as unsupportive of a supramodal approach. Because the supramodal approach claims that “integration” is a consequence of the informational form across modalities, evidence should show that the function is early, impenetrable, and complete. As stated, however, there are findings that have been interpreted as showing that integration can be delayed until after some lexical analysis is conducted on unimodal input (e.g. Ostrand et al., 2016). There is also evidence interpreted as showing that integration is not impenetrable but is susceptible to outside influences including lexical status and attention (e.g. Brancazio, 2004; Alsius et al., 2005). Finally, there is evidence that has been interpreted to demonstrate that integration is not complete. For example, when subjects are asked to shadow a McGurk‐effect stimulus (e.g. responding ada when presented audio /aba/ and visual /aga/), their ada shadowed response will show articulatory remnants of the individual audio (/aba/) and video (/aga/) components (Gentilucci & Cattaneo, 2005).
In principle, all of these findings are inconsistent with a supramodal account. While we have provided alternative interpretations of these findings both in the current paper and elsewhere (e.g. Rosenblum, 2019), it is clear that more research is needed to test the viability of the supramodal account.
1 Alsius, A., Navarra, J., Campbell, R., & Soto‐Faraco, S. (2005). Audiovisual integration of speech falters under high attention demands. Current Biology, 15(9), 839–843.
2 Alsius, A., Navarra, J., & Soto‐Faraco, S. (2007). Attention to touch weakens audiovisual speech integration. Experimental Brain Research, 183(3), 399–404.
3 Alsius, A., Paré, M., & Munhall, K. G. (2017). Forty years after Hearing lips and seeing voices: The McGurk effect revisited. Multisensory Research, 31(1–2), 111–144.
4 Altieri, N., Pisoni, D. B., & Townsend, J. T. (2011). Some behavioral and neurobiological constraints on theories of audiovisual speech integration: A review and suggestions for new directions. Seeing and Perceiving, 24(6), 513–539.
5 Andersen, T. S., Tiippana, K., Laarni, J., et al. (2009). The role of visual spatial attention in audiovisual speech perception. Speech Communication, 29, 184–193.
6 Arnal, L. H., Morillon, B., Kell, C. A., & Giraud, A. L. (2009). Dual neural routing of visual facilitation in speech processing. Journal of Neuroscience, 29(43), 13445–13453.
7 Arnold, P., & Hill, F. (2001). Bisensory augmentation: A speechreading advantage when speech is clearly audible and intact. British Journal of Psychology, 92(2), 339–355.
8 Auer, E. T., Bernstein, L. E., Sungkarat, W., & Singh, M. (2007). Vibrotactile activation of the auditory cortices in deaf versus hearing adults. Neuroreport, 18(7), 645–648.
9 Barker, J. P., & Berthommier, F. (1999). Evidence of correlation between acoustic and visual features of speech. In J. J. Ohala, Y. Hasegawa, M. Ohala, et al. (Eds), Proceedings of the XIVth International Congress of Phonetic Sciences (pp. 5–9). Berkeley: University of California.
10 Barutchu, A., Crewther, S. G., Kiely, P., et al. (2008). When/b/ill with/g/ill becomes/d/ill: Evidence for a lexical effect in audiovisual speech perception. European Journal of Cognitive Psychology, 20(1), 1–11.
11 Baum, S., Martin, R. C., Hamilton, A. C., & Beauchamp, M. S. (2012). Multisensory speech perception without the left superior temporal sulcus. Neuroimage, 62(3), 1825–1832.
Читать дальше