Workshop: How do we chunk up speech in real time – and how consistent are our perceptions?

Anna Mauranen, University of Helsinki

An intriguing question in human language processing is how we manage to make sense of the continuous, rapid flow of speech that we hear in real time, despite the limitations of working memory (Christiansen & Chater, 2016). It is reasonably well established that many kinds of complex stimuli are processed by segmenting them into smaller chunks, which are then integrated into an emerging representation of a larger whole. This has been found, for example, with complex static objects (e.g., Biederman, 1987; Kimchi, 2015), visual events (Zacks & Tversky, 2001; Radvansky & Zacks, 2014), and music (Sridharan et al., 2007). Chunking continuous stimuli can thus be conceived of as a domain-general phenomenon (e.g., Blumenthal-Dramé et al., 2017), but for language such research is largely missing. There is, of course, a considerable body of experimental research on the segmentation of relatively low-level phenomena (phonology, morphology, syntax, lexical semantics), based on contrived examples. On the other hand, there is rich corpus-based evidence of repeated multi-word expressions (also known as formulaic expressions, constructions, or fixed expressions, among other terms), which many scholars (e.g., Bybee, 2003) assume, but have not shown, to also be units of processing. These traditions do not meet, and very little research exists on the segmentation of continuous, naturally occurring speech. In this theme session, we address chunking as it is performed intuitively by linguistically naïve listeners on extracts of continuous, spontaneous speech. We hypothesise, in line with Sinclair & Mauranen (2006), that fluent speakers of a language chunk up the language they hear in largely convergent ways.

The session is based on two research projects at the University of Helsinki, which investigate chunking in naturally occurring continuous speech. The experimental methods include a behavioural component (Vetchinnikova et al., 2017; Vetchinnikova et al., under revision) and a brain-imaging component (Anurova et al., under revision). The findings support the hypothesis that listeners’ intuitive marking of chunk boundaries is highly convergent, which raises the question of what linguistic cues affect the perception of boundaries. Our presentations look at chunk perception from various interrelated angles: different languages, native and non-native speakers, and chunking under different experimental conditions. Analytically, we combine quantitative and qualitative approaches. The focal issue throughout is which linguistic cues best explain the placement of chunk boundaries: prosody, syntax, meaning, discourse structure, or combinations of these.

 

Estimating chunking ability of L2 listeners

Svetlana Vetchinnikova, University of Helsinki

Linguists and cognitive scientists believe that humans understand speech by chunking it up into smaller units (Sinclair & Mauranen, 2006; Christiansen & Chater, 2016; Henke & Meyer, 2021). Vetchinnikova et al. (under review) proposed a distinction between such perceptual chunking and usage-based chunking, which has received far more attention in the literature (Bybee, 2010; Ellis, 2017; McCauley & Christiansen, 2019). Perceptual chunking provides a temporal window for further processing, while usage-based chunking gives rise to multi-word units and to more complex structure in language. This paper probes the hypothesis that perceptual chunking is related to comprehension.

Fifty participants from English as a lingua franca backgrounds listened to 97 extracts of natural speech and simultaneously marked chunk boundaries in the transcripts using a purpose-built web-based application, ChunkitApp (Vetchinnikova et al., 2017). After listening to each extract, they answered either a true–false comprehension question or a self-evaluation question (“Do you understand what the speaker was saying?”) with three response options: yes/roughly/no. The participants’ language proficiency was tested with an elicited imitation task. Earlier research showed that the extracts varied in how easy or difficult they were to chunk (Vetchinnikova et al., under review). This paper uses Rasch analysis to estimate the chunking ability of the participants, and then relates chunking ability to their comprehension of the extracts and to their language proficiency. It is expected that listeners who found the extracts more difficult to understand were also worse at chunking them. It is also possible that chunking ability can predict language proficiency.
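For readers unfamiliar with Rasch analysis, the dichotomous Rasch model can be sketched as follows. The abstract does not specify how chunking responses are scored into binary outcomes, so the notation is purely illustrative: theta_p is the chunking ability of participant p and delta_i the difficulty of extract i, both placed on a common logit scale.

\[
P(X_{pi} = 1 \mid \theta_p, \delta_i) = \frac{\exp(\theta_p - \delta_i)}{1 + \exp(\theta_p - \delta_i)}
\]

Placing ability and difficulty on the same scale is what allows the estimated chunking abilities to be related afterwards to comprehension scores and proficiency measures.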

 

References

Bybee, J. L. (2010). Language, usage and cognition. Cambridge University Press.

Christiansen, M. H., & Chater, N. (2016). The now-or-never bottleneck: A fundamental constraint on language. Behavioral and Brain Sciences, 39, e62. https://doi.org/10.1017/S0140525X1500031X

Ellis, N. C. (2017). Chunking in Language Usage, Learning and Change: I Don’t Know. In M. Hundt, S. Mollin, & S. E. Pfenninger (Eds.), The Changing English Language (pp. 113–147). Cambridge University Press. https://doi.org/10.1017/9781316091746.006

Henke, L., & Meyer, L. (2021). Endogenous Oscillations Time-Constrain Linguistic Segmentation: Cycling the Garden Path. Cerebral Cortex, 31(9), 4289–4299. https://doi.org/10.1093/cercor/bhab086

McCauley, S. M., & Christiansen, M. H. (2019). Language learning as language use: A cross-linguistic model of child language development. Psychological Review, 126(1), 1–51. https://doi.org/10.1037/rev0000126

Sinclair, J., & Mauranen, A. (2006). Linear unit grammar. John Benjamins.

 

Chunking-in-noise: high-level segmentation of spontaneous speech in different listening conditions

Alena Konina, University of Helsinki

Speech perception requires segmentation of the input to make sense of what is being said. Multiple linguistic cues contribute to the perception of a boundary. Mattys et al. (2005) suggest that speech segmentation happens under the influence of sentential, lexical and sublexical cues, with the higher-level cues taking precedence. All cues interact during speech segmentation, with lower-level tiers (acoustic and prosodic) becoming more relevant when higher-order information (syntactic and semantic) is unavailable, for example when the signal is degraded.

The present study asks whether the cue hierarchy introduced by Mattys and colleagues also applies to speech segmentation beyond syllables and words, at the level of multi-word units, or chunks.

We conducted an experiment using short extracts of spontaneous speech as stimuli across two listening conditions: ‘in quiet’ (without interference) and ‘in noise’ (with the signal degraded by a babble noise mask). The extracts were randomly selected from the ELFA corpus (Mauranen, 2008) and MICASE (Simpson et al., 2002) and voiced over to improve the sound quality. For loudness normalisation, all the extracts were scaled to the average root-mean-square (RMS) amplitude of the set.
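As an illustration, RMS scaling of this kind takes only a few lines. The sketch below is ours, not the study’s actual pipeline; it assumes mono WAV files and the Python soundfile library, neither of which is specified in the abstract.

import numpy as np
import soundfile as sf  # assumed I/O library; any WAV reader would do

def rms(x):
    # Root-mean-square amplitude of a mono signal
    return np.sqrt(np.mean(x ** 2))

def normalise_to_mean_rms(paths):
    # Scale every extract so its RMS equals the mean RMS of the whole set
    clips = [sf.read(p) for p in paths]           # (samples, rate) pairs
    target = np.mean([rms(x) for x, _ in clips])  # shared target level
    for (x, sr), p in zip(clips, paths):
        sf.write(p.replace('.wav', '_norm.wav'), x * (target / rms(x)), sr)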

Two groups of native English speakers (N = 29, 17 female, mean age 35.4; and N = 27, 15 female, mean age 32), with no history of hearing disorders, were recruited online. The experiment was conducted through a custom tablet application, ChunkitApp (Vetchinnikova et al., 2017), which plays each extract through the participant’s headphones while displaying its transcript on the screen. In both conditions, participants were instructed to follow their intuition and tap the screen whenever they felt that one chunk ended and another began.

The extracts were annotated for syntactic boundaries (manually) and prosodic boundaries (via continuous wavelet transform; Suni et al., 2017). We fitted a logistic mixed-effects model with participant-marked chunk boundaries as the response variable; listening condition, syntactic and prosodic boundaries, and their interactions as fixed effects; and random intercepts for participant and extract. Our analysis shows that in both listening conditions the co-occurrence of cues increases the likelihood of segmentation, and that prosodic cues carry more weight than syntactic ones. When the signal is degraded, however, only the presence of both reliably yields segmentation. Our results thus lend support to the cue hierarchy proposed by Mattys and colleagues. In high-level segmentation, prosodic cues appear more robust than syntactic cues in this particular speech-in-noise setup; other noise gradations are needed to tease out more fine-grained differences.
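In notation, the model described above can be sketched as follows (variable names are ours): y_pet indicates whether participant p marked a boundary at potential boundary site t in extract e, Cond is the listening condition, and Syn and Pros code the presence of syntactic and prosodic boundaries at that site.

\[
\operatorname{logit} P(y_{pet} = 1) = \beta_0 + \beta_1\,\mathrm{Cond} + \beta_2\,\mathrm{Syn}_t + \beta_3\,\mathrm{Pros}_t + (\text{interaction terms}) + u_p + v_e,
\]
\[
u_p \sim \mathcal{N}(0, \sigma_u^2), \qquad v_e \sim \mathcal{N}(0, \sigma_v^2)
\]

The random intercepts u_p and v_e absorb baseline differences in how often individual participants tap and in how segmentable individual extracts are.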

 

Making sense of natural speech: prosodic and syntactic cues in L2 speech segmentation

Aleksandra Dobrego, University of Helsinki

Language arranges itself along a continuous line, either in time (speech) or in space (text). As working memory is presumably limited to about four units (Cowan, 2001), it goes largely uncontested that language processing must proceed in chunks (Christiansen & Chater, 2016). We report two experiments on the segmentation of natural, spontaneous speech, investigating how language experience affects segmentation and how segmentation patterns may be reflected in the brain.

In the first experiment, we tested intuitive chunking in L1 and L2 speakers of English, on the assumption that L1 speakers have more extensive experience of English than L2 speakers. We asked participants to listen to extracts, follow the transcript on an iPad, mark boundaries by tapping the screen, and answer a comprehension question after each extract (adapted from Vetchinnikova et al., 2017). The stretches between perceived boundaries constituted ‘chunks’. We assessed the participants’ agreement and segmentation strategies and found that prosody is what both groups rely on most, with L1 users relying on it slightly more. Moreover, the two groups performed alike both in the degree to which they converged on boundaries and in how successfully they answered the comprehension questions, suggesting that language experience has a slight effect on cue utilisation but does not affect the ultimate outcome of natural speech segmentation.

In the second experiment, we investigated the roles of prosody and syntax in L2 speech segmentation using combined MEG and EEG (MEEG) recordings in healthy adults. The objective was to test how chunk boundaries and segmentation cues might be reflected in brain activity. Participants listened to extracts of natural speech from the same database as in Experiment 1, again followed by comprehension questions. We inserted 2-second gaps into each extract, some at chunk boundaries obtained from Experiment 1, others within chunks, and recorded brain activity during these two contrasting types of pause. Pauses at chunk boundaries elicited a closure positive shift (CPS) with sources over bilateral auditory cortices. By contrast, pauses within a chunk were perceived as interruptions and elicited a biphasic emitted potential with sources in the bilateral primary and non-primary auditory areas, with right-hemispheric dominance. Chunk boundaries and non-boundaries thus elicit distinct evoked activity in the brain. Moreover, chunk boundaries were influenced by both prosody and syntactic structure, whereas chunk interruptions were influenced by prosody only, suggesting that the integrity of the intonation contour may be an essential property of the perceived chunk.

 

What causes the perception of boundaries in Finnish – prosodic and syntactic-semantic features examined

Tiia Winther-Jensen, University of Helsinki

The starting point for speech segmentation research has mostly been the needs of linguists analysing speech; little attention has been paid to how untrained, “ordinary” language users process linguistic input (Barnwell, 2013). Using data collected in a listening experiment with linguistically untrained native speakers of Finnish, I investigate possible causes of the perception of chunk boundaries.

This paper deals with the prosodic-phonetic and syntactic-semantic characteristics of the chunk boundaries that Finnish speakers perceive in spontaneous speech. Firstly, I show how the perceived boundaries match prosodic boundaries, both those detected automatically using a continuous wavelet transform technique and those analysed manually in Praat, and I demonstrate which acoustic features predict the perception of a boundary in Finnish spontaneous speech. On the basis of this analysis, I question the role of pauses as “punctuation marks of spoken language”: more reliable acoustic correlates of boundaries are changes in fundamental frequency and speech tempo.
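The abstract does not detail the wavelet pipeline; purely as an illustration of the underlying idea (and not the toolkit actually used in the study), candidate prosodic boundaries can be read off as coarse-scale valleys of an interpolated f0 contour:

import numpy as np
from scipy import signal

def candidate_boundaries(f0, widths=np.arange(5, 60, 5)):
    # Illustrative sketch only: prosodic boundaries tend to surface as
    # valleys of the interpolated, zero-meaned f0 contour at coarse scales.
    coeffs = signal.cwt(f0 - f0.mean(), signal.ricker, widths)
    coarse = coeffs[-1]                      # coefficients at the coarsest scale
    valleys, _ = signal.find_peaks(-coarse)  # local minima = boundary candidates
    return valleys                           # frame indices of candidates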

Secondly, a syntactic analysis of the data shows that certain conjunctions almost always trigger the perception of a boundary. In this presentation, I look into the semantics of these conjunctions, as well as the semantic features found at other boundary sites.

Finally, I suggest it may be worth viewing segment boundaries not as strict dividing lines between orthographic words, an end and a beginning, but as a feature of the words themselves, possibly even of clusters of words. This view emphasises the role of beginnings in defining chunk boundaries, and in this respect it resembles the cesura approach of Barth-Weingarten (2016).

 

References

Barnwell, B. (2013). Perception of prosodic boundaries by untrained listeners. In B. Szczepek Reed & G. Raymond (Eds.), Units of talk – units of action. John Benjamins.

Barth-Weingarten, D. (2016). Intonation units revisited: Cesuras in talk-in-interaction. John Benjamins.

 

What happens if a chunk is interrupted?

Anna Mauranen, University of Helsinki

There is reason to assume that language is processed like many other complex stimuli, for example visual events (e.g., Radvansky & Zacks, 2014): by segmenting it into smaller chunks that get integrated into larger representations of meaningful wholes. Chunking releases listeners from the impasse of the ‘now-or-never bottleneck’ (Christiansen & Chater, 2016), which arises from the constant inflow of input under the limitations of working memory, and enables them to make sense of that input.

When listeners chunk up ongoing speech, they tend to do so convergently, as posited by Sinclair & Mauranen (2006). We found support for this in a behavioural study that invited linguistically naïve participants to chunk up continuous extracts of authentic speech (Vetchinnikova et al., under revision). We then went on to use the same extracts with similar participants, but with silent 2-second pauses inserted as triggers, in an MEEG experiment. The trigger insertion criteria were based on timing, not linguistic properties. Responses to triggers at chunk boundaries with high levels of convergence differed significantly from responses to triggers at non-boundaries, where the pauses interrupted the flow of speech (Anurova et al., under revision).

Perceived interruptions apparently obstruct predictions already formed and pre-activated meanings. These may be based on preceding syntactic, prosodic, semantic, or discourse cues, of which the last two are hard to capture without close qualitative analysis. This paper applies a qualitative analysis to 50 of the 10–45-second extracts, up to the triggers, to assess how severely the interruptions disrupt the construction of a meaningful representation of the extract and to explore potential reasons for this disruptiveness. The perspective is, admittedly, the analyst’s post hoc view, but close reading can bring to light what escapes quantitative analyses.

The analysis suggests that different levels of language, discourse (beyond the sentence) and syntax (sentence, clause), are relevant in different ways, both in their overall effect and at different points in the speech flow. At times discourse-level cues seem to exert strong prediction and pre-activation of upcoming meanings; at other points the very local level, i.e. the clause or phrase, takes priority. Moreover, there seem to be, overall, more and less semantically intense episodes: some passages in the speech flow provide more ingredients for constructing semantic representations and pre-activations than others. The less intense stretches relate to organising language, with dysfluencies and restarts getting absorbed into more meaning-constructing stretches, which together make up perceived chunks.