
A Prosodic View of Word Form Encoding for Speech Production*

Patricia Keating, Phonetics Laboratory, UCLA
Stefanie Shattuck-Hufnagel, Speech Group, Research Laboratory of Electronics, MIT

Note: This paper grew out of an invitation to Pat Keating to summarize and critique the model of phonological encoding presented in Levelt, Roelofs and Meyer (1999) at the Labphon7 meeting in Nijmegen, the Netherlands, in 2000. As a result, and because this model specifically addresses the problems of phonological and phonetic processing which concern us here, we focus largely on it, leaving little opportunity to discuss the advances in other models of the speech production planning process, as developed by e.g. Butterworth (1989), Crompton (1982), Dell (1986), Dell et al. (1997), Ferreira (1993), Fromkin (1971, 1973), Fujimura (1993), Garrett (1975, 1976, 1984), MacKay (1972), Pierrehumbert (in press), Shattuck-Hufnagel (1979, 1992), Stemberger (1985), Vousden et al. (2000), and others.

1. Introduction

In our view (shared with Levelt), one of the most interesting questions in the study of language is how a speaker crosses the boundary between abstract symbolic representations and concrete motor control programs. Given this assumption, we can ask what processes take as input a sentence with its words, and provide as output a quantitative representation which can guide the articulatory motor system in producing a particular utterance of that sentence. An articulated utterance must by definition have concrete patterns of timing and amplitude, as well as an F0 contour and a rhythmic structure. Because these patterns are influenced by, but not fully specified by, the underlying morphosyntactic representation, generating a sentence with its words does not suffice to prepare the utterance for articulation; a further process is required to generate the structures which will permit computation of the appropriate articulatory patterns. The process of generating the representation which will support an articulatory plan for an utterance, on the basis of the morphosyntactic specification of its underlying sentence and other kinds of information, has come to be called Phonological Encoding (Levelt 1989).

The meaning of the term Phonological Encoding is not entirely obvious to non-psycholinguists, so the first task is to highlight how phonology is part of a code. The term 'code' is used in different subfields with different meanings, and the usages that are most likely familiar to phoneticians and phonologists do not correspond to the one intended by psycholinguists modeling speech production. This Phonological Encoding is not precisely equivalent to the Speech Code of Liberman et al. (1967) (an encoding of phonology by speech), nor to the encoding of phonologies by orthographies, nor to the encoding of speech into phonology during speech processing or acquisition (as discussed by Dell 2000 or Plaut & Kello 1998), nor even to the encoding of higher level linguistic structures by phonology (a recent example being Steriade 2000 on paradigm uniformity effects). The psycholinguistic usage is quite subtle, yet simple, compared to these others, if one distinguishes between two uses of the term.
The first (the usage in Levelt (1989)) is the larger process of word form encoding, including various aspects we will describe below, while the second (the usage in Levelt, Roelofs and Meyer (1999)) is the process by which an incomplete phonological representation is completed. This second meaning includes the set of processes that take as input an abstract and somewhat skeletal lexical representation of a word's phonology, and generate a representation which is phonologically complete. Later steps in the planning process which specify the surface phonetic shape that this phonological representation will take have been called Phonetic Encoding. The result of these two processing steps is that the information stored in the lexicon about the forms of the words in a sentence is not the same as the information in the planned sound structure for a specific utterance of that sentence.

* We would like to thank Bruce Hayes for helpful discussion, and the organizers and audience of LabPhon7.

Over the past few decades, it has become increasingly clear that these processes must refer to prosody. That is, the timing and amplitude patterns of articulatory gestures in speech are systematically governed not only by contrastive representations of segments and features (at the level of the individual word), but also by suprasegmental aspects of the speaker's plan, which are collectively referred to as sentence-level, utterance-level or phrase-level prosody. Some of the evidence for this view will be discussed below. As we have come to understand more about how phrase-level prosodic constituent structure and prosodic prominence influence the phonetic implementation of utterances, the need to integrate word-level and phrase-level planning in models of speech production has become more urgent.

This need was addressed in Levelt's comprehensive 1989 book on the speech production process by focusing one chapter on the generation of phonetic plans for single words, and another on the generation of such plans for connected speech. The conceptual division between single-word and connected-speech processing is further reflected in the 1999 BBS target paper by Levelt, Roelofs and Meyer (LRM99). LRM99 develops the model of single-word planning in greater detail, describing its computer implementation (Roelofs 1992, 1997) along with extensive experimental evaluation. The experiments cited in support of their model use psycholinguistic methods to measure psycholinguistic aspects of language behavior, such as reaction times in primed naming tasks or lexical decision tasks. In addition, the computer model can simulate the effects of experimental manipulations such as priming, and these simulations have provided supporting results that parallel and extend the human behavioral findings. The model provides a detailed account of how speakers might cross the rift between abstract underlying word specifications and a phonetically-specified plan that can guide the articulation of a single word-level element. However, discussion of how this word-level planning process is integrated with prosodic planning for an utterance as a whole is limited to the generation of the word-level prosodic constituent called the Prosodic or Phonological Word (PWd)1 (Hayes 1989, Nespor and Vogel 1986, Selkirk 1995, Peperkamp 1997, Hall and Kleinhenz 1997). Issues relating to the hierarchy of prosodic structures involving e.g.
Phonological Phrases and Intonational Phrases and their concomitants, such as rhythm and intonational contours, are handled by referring to the discussion of planning for connected speech in Levelt's 1989 book (L89); in chapter 10 of that volume we find a description of how higher-level prosodic structure can be built from PWd elements.

The model of single-word planning in LRM99 is considerably more detailed than the L89 version in some respects and more limited in scope in others. It provides an opportunity to evaluate the Max Planck Institute group's approach to single-word production planning in the light of new information that has emerged in the intervening years from prosodic theory, from acoustic and articulatory analysis of surface phonetic variation, and from behavioral experimentation, and also in light of the view of connected speech planning laid out in L89. The …

1 The abbreviation PWd, which we will use throughout, has been used in the literature to refer to both the term Prosodic Word and the term Phonological Word; it appears that the content of these two terms is similar.

[…]

• Evidence for order of activation: Further evidence for the separation of phonological and syntactic processing of words comes from experiments showing that lemmas are activated earlier than word forms. For example, studies of event-related patterns of electrical activity in the brain show that Dutch speakers can access the gender of a noun and then stop before accessing its phonological form, but not vice versa (van Turennout et al. 1997, 1998).

• Evidence for constraints on activation: The failure of activated semantic associates of a word to show phonological activation (e.g. presentation of the word pen might activate its semantic associate paper, but there is no sign that paper's phonological form is activated) indicates that a word can be activated without its full phonological form (Levelt et al. 1991). Again, this supports the hypothesis that the two types of information may be stored separately.

Despite this range of supporting evidence, the distinction between lemmas (with their morphosyntactic information) and phonological forms (with their metrical and segmental information) is not uncontroversial, and it is even more controversial whether these two kinds of information can be accessed completely independently of one another (see the commentaries after LRM99).

We turn now to another claim in the LRM99 model: that successive levels of processing can overlap during production planning. As the surface syntactic structure of a sentence is being developed for production and individual words are being selected to fit its lexical concepts, the process of generating an articulatory plan can begin, because this processing is incremental. That is, as soon as enough information is available from one type of processing for an initial portion of the utterance, the next level of processing can begin for that chunk. In the spirit of Fry (1969), and as proposed by Kempen & Hoenkamp (1987), once the surface syntax has been developed for an initial fragment of syntax, word-form encoding can begin for this element while syntactic encoding proceeds for the next section of the message. As a result, the articulatory plan for the beginning of an utterance can be generated and implemented before the entire syntactic plan for the utterance is in place.
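To make the notion of incrementality concrete, here is a minimal Python sketch of a pipelined planning cascade. It is our own illustration, not LRM99's or Roelofs' implementation: all function names and the toy "encoding" operations are invented. The point is only that each level consumes fragments from the level above as soon as they are ready.

```python
# Each planning level is modeled as a generator, so processing of an
# early fragment can begin before later fragments exist. The string
# manipulations are stand-ins for real encoding operations.

def syntactic_encoding(message_chunks):
    """Yield surface-syntax fragments one at a time (stand-in)."""
    for chunk in message_chunks:
        yield f"[{chunk}]"          # pretend this is a syntactic fragment

def word_form_encoding(fragments):
    """Begin encoding each fragment as soon as it arrives."""
    for fragment in fragments:
        yield fragment.lower()      # stand-in for phonological encoding

def articulate(plans):
    for plan in plans:
        print("articulating:", plan)

# Output for the first fragment is produced before the later
# fragments have even been syntactically encoded.
articulate(word_form_encoding(syntactic_encoding(["The dog", "chased", "the cat"])))
```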
Such an incremental mechanism is consistent both with the intuition that a speaker doesn't always know how the end of an utterance will go before beginning to utter it, and with some experimental results reported in the literature. It will run into difficulties, however, to the extent that aspects of later portions of an utterance have an effect on the articulation of its beginning. We will return to this issue below.

As noted above, in LRM99's view the generation of an articulatory plan (i.e. the encoding of word form) involves three main components: Morphological, Phonological and Phonetic Encoding. The paper provides an example, shown here as Figure 2, illustrating the steps before and during Phonological Encoding. Correspondences between the strata in this figure and the three encoding processes are somewhat complex, and will be described below.

Figure 2. LRM99 Fig. 2, example of encoding for escorting.

2.2 Morphological Encoding

The initial steps toward Phonological Encoding are ones that most phoneticians and phonologists don't think about—how, from a lexical concept (Conceptual Stratum in Figure 2), a lemma is found (Lemma Stratum in the figure), and how, from the lemma, its phonological word form is found (Form Stratum in the figure). Each lemma corresponds to a word, at least in languages like English, and specifies (among other things) what kinds of morphemes may be required to create a well-formed version of the word in various contexts. Thus, depending on the lemma's morphosyntactic specifications, at least one morpheme and perhaps more may be required to make the desired form of the word for a particular utterance. For example, in English a verb may require a past or present or present progressive tense marker, a noun may require a plural marker, etc. Retrieving the required morpheme(s) is Morphological Encoding (shown in the top row of the Form Stratum in Figure 2, with nodes between angled brackets). One reason that the step of retrieving the morphemes is necessary before Phonological Encoding can take place is that the specification of morphosyntactic categories in the lemma (such as aspect and tense, shown at the Lemma Stratum in the figure) must be combined to select, for example, the appropriate bound morpheme <ing>.

It appears that the morpheme nodes shown between the lemma node and the segmental nodes in the figure represent a stored network of morphemes, separate from the lemma and segment networks. If this is the case, it may have interesting consequences for predictions about error patterns, since misactivation of nodes during retrieval is a prime source of selection errors in such spreading activation models. We will not pursue these implications here, but point out that the ability to generate such predictions is one of the strengths of the model's explicitness4.

In the example shown in Figure 2, activation of the lexical concept for the action ESCORT has caused competition among various possible lemmas (not shown here), but this competition has ended in the successful selection of the single lemma escort. Since escort is marked as present-progressive for this utterance, its activation results in the retrieval of two morphemes, <escort> and <ing>. That is, since escort is present progressive here, it needs not only the morpheme <escort> but also the morpheme <ing> to be activated in order to complete Morphological Encoding.
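As a rough illustration of this retrieval step, the following Python sketch maps a selected lemma plus its morphosyntactic features onto the morphemes to be activated. The suffix table and function name are our own invention, chosen only to mirror the escorting example; LRM99's actual implementation is a spreading-activation network, not table lookup.

```python
# Illustrative sketch of Morphological Encoding: a selected lemma plus
# its morphosyntactic specification determines which stored morphemes
# must be activated. Contents are invented for illustration.

SUFFIXES = {
    ("verb", "present-progressive"): "<ing>",
    ("verb", "past"): "<ed>",
    ("noun", "plural"): "<s>",
}

def morphological_encoding(lemma, category, features):
    """Return the morphemes needed for this form of the word."""
    morphemes = [f"<{lemma}>"]                      # the stem morpheme
    for feature in features:
        suffix = SUFFIXES.get((category, feature))
        if suffix:
            morphemes.append(suffix)
    return morphemes

# escort, marked present-progressive, activates <escort> and <ing>:
print(morphological_encoding("escort", "verb", ["present-progressive"]))
# ['<escort>', '<ing>']
```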
2.3 Phonological Encoding

The subprocess of Phonological Encoding (the rest of the Form Stratum in Figure 2) is rather complex in this model, incorporating two kinds of retrieval (metrical and segmental) and a syllabification mechanism. For expository purposes, we have expanded LRM99's figure as Fig. 2a, to show just the portions we discuss below, concerning the Form Stratum.

Figure 2a. Expansion of LRM99's figure from above which exemplifies the encoding for escorting. This expansion shows only the Form Stratum, with additional labeling: morphemes and metrical frame; segments; possible syllabifications; chosen syllabifications.

4 One of the claims of LRM99 is that for each lexical concept, several lemmas compete for final selection, and that the phonological forms of the competing but unselected lemmas are not phonologically activated; only the word form of the actually-selected (winning) lemma reaches activated status. If two lemmas are somehow both selected, a speech error incorporating phonological elements of the word form information associated with both lemmas can result, such as symblem for symbol + emblem.

[…]

Given this assumption, along with the combined metrical frame and the combined string of segments of a PWd, the string must be syllabified for articulation by fitting each segment into a syllable. Although the details of their implementation are not laid out in this paper, one way of interpreting Figure 2a is that the speaker considers all possible syllables and then selects the optimal path through this set according to syllabification principles. This will ready the string for the selection of syllable-sized articulatory plans from a stored set. Recall that the segments are not syllabified in the lexicon; the metrical frame (with its syllables) is completely separate from the segment information there. That is, the metrical frame specifies the number of syllables, but not the segments which belong in them; the segments are stored separately, and their syllabification is not provided. Only now, in building the PWd for a particular utterance, are segments syllabified. In the Form Stratum of the lexicon, each segment is annotated with its possible positions in the set of possible (i.e. legal) syllables. The choice of actual syllabification from among these possible ones follows general phonological principles (e.g. maximize onsets), as well as language-specific ones (such as ambisyllabicity in English). Crucially, the selection of syllable structures must be made separately for each new utterance, since the choice will vary with the nature of the following material, i.e. whether the next syllable is in the same PWd, and begins with a vowel which could accept a word-final consonant from the preceding word as its onset.

Figure 2a shows LRM99's illustration of this with escorting. Theoretically, there are several possible three-syllable structures within this word. Under one, the consonant /t/ serves as coda in the second syllable; under another, /t/ serves as the onset of the third syllable (British pronunciation). The syllabification that is activated, on the basis of the structure of the entire PWd including possible inflections and clitics, is shown by the arrows added by us to the bottom of Figure 2a; this syllabification maximizes the onset of the third syllable. This means that the final /t/ of escort is now in a syllable associated largely with a different morpheme, -ing.
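The onset-maximizing selection step can be illustrated with a toy syllabifier. This is our own sketch, not LRM99's implementation: the segment symbols and the tiny set of legal onsets are invented stand-ins for a real phonotactic grammar.

```python
# A toy onset-maximizing syllabifier in the spirit of the selection
# step described above. A real implementation would consult the
# language's full phonotactics rather than this small onset set.

LEGAL_ONSETS = {"s", "k", "t", "n", "sk", "st", "tr"}   # illustrative
VOWELS = set("aeiou") | {"E", "O", "I"}                 # crude inventory

def syllabify(segments):
    """Group a flat segment list into syllables, maximizing onsets."""
    nuclei = [i for i, seg in enumerate(segments) if seg in VOWELS]
    syllables, start = [], 0
    for n, nucleus in enumerate(nuclei):
        if n + 1 < len(nuclei):
            next_nucleus = nuclei[n + 1]
            boundary = next_nucleus
            # grow the next syllable's onset leftward while it stays legal
            while boundary > nucleus + 1 and \
                  "".join(segments[boundary - 1:next_nucleus]) in LEGAL_ONSETS:
                boundary -= 1
            syllables.append(segments[start:boundary])
            start = boundary
        else:
            syllables.append(segments[start:])   # last syllable takes the rest
    return syllables

# 'escorting' with a rhotic /r/: the /t/ is claimed as the onset of
# the -ing syllable, as in the chosen syllabification above.
print(syllabify(list("EskOrtIN")))
# [['E'], ['s', 'k', 'O', 'r'], ['t', 'I', 'N']]
```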
For escort us, the syllabification process is similar, so that the /t/ syllabifies into a different lexical item for this PWd as well. (This could be called resyllabification except that the /t/ had not previously been syllabified; therefore it is referred to as "resyllabification", in quotes.) LRM99 use this "resyllabification" to illustrate why they postulate a process of Phonological Encoding in the first place: the stored information about form is not the same as the planned form that governs the actual pronunciation of the utterance. For example, phonological segments from two different stored lexical entries can find themselves in the same syllable in the PWd structure for a particular utterance. Note that even if lexical words such as escorting are precompiled, combinations such as escort us surely must be built on-line. In general, a word's pronounced form in an utterance depends on its phrasal context, and thus the planning of its surface phonological form requires an encoding process, since it cannot result from the simple retrieval of the stored phonological form which defines the word and contrasts it with other word forms.6 At this point, Phonological Encoding is complete. The phonological segments of the PWd are syllabified into a legal metrical frame, which is now ready for Phonetic Encoding (not shown in Figure 2a).

6 Another account of surface phonetic variation has recently emerged in the form of episodic theories of the lexicon, which postulate that a speaker stores the range of forms heard and produced for each word, and retrieves the one that is appropriate to the current phrasal context (Pierrehumbert 2001). This approach moves the issue that is described here as an encoding problem into the domain of storage and retrieval.

2.4. Phonetic Encoding

Phonetic Encoding in this model begins with the phonological syllables that emerge from Phonological Encoding. Given a string of phonological syllables, Phonetic Encoding consists of finding their counterparts in a syllabary of stored syllabic gestural scores, or of constructing such scores for syllables which are not stored. The hypothesis is that gestural scores (as proposed by Browman and Goldstein 1990) of high-frequency syllables are stored in a syllabary, while rarer syllables' gestural scores must be constructed on-line. Phonetic Encoding of a whole word then consists of retrieving the gestural scores for the individual syllables of the word and combining them. In this respect it is rather like syllable-based concatenative speech synthesis, and thus faces similar issues of contextually adjusting gestural scores (see discussion in section 4). These concatenated and contextually-adjusted gestural scores can then be articulated in order to pronounce the word. This approach neatly solves the problem of whether syllables are frames for holding segments, or organized chunks of segmental material. In this model, both claims are true: the phonological syllables are frames, while the phonetic syllables are chunks.

2.5. Relevance for phonologists and phoneticians

In sum, the issues addressed by Phonological Encoding models are essentially the same as for traditional phonology. All such models try to determine:

• the units of representation
• how much of the representation is already in the lexicon
• how contextual variants are computed

The LRM99 model embodies such answers as:

• Units of representation are (sometimes underspecified) phonological segments.
• Word forms in the lexicon contain no syllables or moras, but include sparse metrical frames which specify number of syllables and location of main stress, for non-default forms only.
• Contextual variants are captured by stored syllable templates.

When LRM99 commit to these choices in their model, it is on the basis of their interpretations of the psycholinguistic literature – sometimes existing studies, sometimes studies carried out specifically to settle such questions, but always some kind of experimental evidence. Phonologists and phoneticians should therefore be especially interested in why these choices have been made. Segments are the basic unit of sound representation in this model because that is LRM99's interpretation of the speech error evidence (Fromkin 1971). The decision in favor of underspecified segments also comes from an interpretation of speech error evidence (Stemberger 1991). The reason for positing that information about the number of syllables in a word is stored in the lexicon is that experiments show implicit priming for this aspect of word form. That is, when the speaker knows that all of the words in the target set to be spoken have the same number of syllables, he or she can initiate the utterance faster than if the words in the target set differ in their number of syllables (cited in LRM99 as Meyer et al. in prep). On the assumption that speed of initiation of an utterance reflects the difficulty of lexical retrieval, implicit priming by shared number of syllables suggests facilitation of lexical access by this shared aspect of word form. This interpretation predicts that words without stored metrical forms, i.e. metrically regular words, won't show this effect7. Similarly, the location of non-default main stress can be primed, suggesting it is available in the lexicon. Finally, evidence that syllable-internal structure (such as moras or onset-nucleus-coda organization) is not in the lexicon comes from LRM99's interpretation of both kinds of evidence: speech errors do not unambiguously reflect syllable-internal structure (Shattuck-Hufnagel 1992), and there is only limited evidence that syllable structure can be primed (Meijer 1994, Baumann 1995, Schiller 1997, 1998, though see also Meijer 1996 and Sevald et al. 1995). Thus the empirical support which underlies the proposals for processing steps and representations in this model recommends it to the attention of phonologists and phoneticians grappling with the same issues.

Another aspect of the model that was determined by experimental evidence is of practical importance to phoneticians and phonologists. In the model, homophones are different lemmas with the same form, i.e. they share a single word form in the lexicon. This is illustrated in Levelt et al.'s example of MORE and MOOR (for a dialect in which these two words are pronounced the same) (see Figure 4). These words have very different meanings and therefore separate lemmas, but since they are pronounced the same, the question arises as to whether they have identical but separate word forms, or instead share a single word form. The answer in the model comes from experiments comparing how quickly different words are accessed. In general, low-frequency words are accessed more slowly than high-frequency words. However, low-frequency homophones are an exception to this generalization, in that low-frequency homophones like moor with high-frequency partners like more are accessed as fast as the high-frequency partners are.
That is, it seems that the low-frequency homophone inherits the advantage of its high-frequency partner. This result can be captured in the model if (given some assumptions about how access works) the two words share a single word form, and thus it is the same word form that is accessed in the two cases. The lesson from this example concerns the study of frequency-based phonetic differences. If one wants to study whether higher frequency words are pronounced differently from lower frequency words when they occur in the same context, then homophone pairs seem to be an appealing object of study because they are segmentally matched. However, if such pairs share a single word form, then there is no point in studying their pronunciation differences; there could be no phonetic differences between them, except those induced by the different contexts in which they appear (see Jurafsky, in press).

Figure 4. Example of homophones sharing a word form in the lexicon (from Levelt et al. 1999). Note that <mɔr> is not a traditional morpheme but rather the first level of representation of word form information.

7 However, there is also another interpretation of the priming result: speakers may be able to generate a series of identical metrical frames without retrieving them from the form entry in the lexicon.

[…]

…with the result that vowels may appear reduced while consonants do not.) The output of Phonological Encoding, the phonetic plan, is then the input into the Articulator (not shown in Figure 5), which executes the plan. In order to construct the two outputs shown in the figure, the Prosody Generator constructs other, intermediate, outputs. Most notably, prosodic (called "metrical") structures – not only PWds and utterances but also PPs and IPs – are constructed by the Prosody Generator, but are not seen as outputs in the figure. Instead these structures influence aspects of the ultimate outputs. Intermediate outputs are constructed from different combinations of the input information. For example, information from the citation metrical spellouts of the words (which includes pitch accent information in this model) and from the surface structure are combined to give a metrical grid for the sentence. The grid is the basis for computing values along some of the prosodic parameters, such as durations and pauses. The metrical grid information in turn combines with citation segments to allow PWd construction. Only after the segments are put into these PWds, and then adjusted depending on their contexts, does the final segmental output ("segmental spellout for phonological words") obtain. Thus the Prosody Generator affects segmental spellout, because it provides the phrasal/prosodic conditions that affect it. Two examples of this kind of effect, in addition to the syllabification effects treated in LRM99, are discussed in L89, but they both are thought to involve syntactic rather than prosodic structure: French liaison as described by Kaisse (1985), and contraction of auxiliaries/clitics. Therefore, the work done by the Prosody Generator with respect to segmental spellout is fairly limited: it uses information about syntactic phrasing that is already in the input, rather than prosodic phrasing computed later. Another case discussed in L89 is assimilation across PWd boundaries, as in ten books when the /n/ becomes /m/-like because of the following /b/. This case is special because it operates beyond the domain of a single PWd, and is discussed in the next section.
As another example, surface structure underlies building the grid's constituents, which are phonological phrases (the core metrical unit in this model) and intonational phrases. Metrical information combines with intonational meaning to give rise to the utterance's intonational melody, which consists of a nuclear tone and prenuclear tune drawn from an inventory in the style of Halliday (1967), as described by Crystal (1969). The intonation is still not itself an output, but instead underlies the values assigned to each syllable for the F0 parameter, which is an output. Thus it is clear that the box in Figure 5 labeled "Prosody Generator" contains a number of interesting components which do a lot of work.

3.4. Incremental processing with little lookahead

The L89 model of Phonological Encoding, like other components of the model, operates incrementally and thus ideally requires no lookahead (L89:373 says "very little lookahead"). Levelt takes great pains to show how operations that might seem to require lookahead can be accommodated in an incremental model, thus providing a set of novel and interesting proposals; he also acknowledges some challenges.

An important example is the Rhythm Rule or Early Accent (here, Beat Movement). A word with late prominence, such as Japanese, creates a clash with an early prominence on the following word, as in Japanese Institute; this clash is resolved by removing the prominence from the offending syllable of the first word, and sometimes making an earlier syllable of that word prominent instead (Horne 1990, Liberman 1975, Selkirk 1984, Shattuck-Hufnagel et al. 1994, Monaghan 1990, Beckman et al. 1990). In order to know whether a clash has occurred or not, the speaker must know the prominence pattern of the upcoming word. The speaker must also know whether the two words are in the same phrase, given that Hayes (1989) and others have claimed that the Rhythm Rule is bounded by the Phonological Phrase. In L89, a syntactic phrase boundary after the first word blocks Beat Movement, but otherwise the Prosody Generator looks ahead to the metrical frame of the next word, to see if Beat Movement should occur. Information about phrase boundaries is given by boundary symbols after phrase-final words, and, conceived of as a property of the word itself, thus hardly seems to count as lookahead. More challenging are cases of iterative Beat Movement, such as sixteen Japanese Institutes, because the lookahead encompasses more than the next word. Levelt suggests that such cases are difficult to produce and rare, restricted to more formal speaking styles where speakers plan more carefully. Nonetheless they are a challenge, and the suggested approach to a solution is to posit a larger buffer (and thus longer lookahead) in more careful speech (p. 385). Other cases of some lookahead include building Phonological Phrases (dealing with final fragments and nonlexical heads), cross-PWd assimilation as in ten books, and aspects of intonation (to get the best location for nuclear tone, to incline up to the nucleus, to conjoin a prenuclear pitch accent with the next in a hat pattern). Again some appeal is made to careful speech planning: allowing lookahead in careful speech allows "a more euphonious output, more rhythmic phrasing, and larger melodic lines" (p. 405).
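The one-word-lookahead logic of Beat Movement can be sketched as follows. This is a simplification under our own assumptions: stress patterns are rendered as 0/1 lists, and retraction always lands on the first syllable, whereas the text above notes that speakers only sometimes promote an earlier syllable.

```python
# A hedged sketch of Beat Movement with one word of lookahead: a clash
# between a word-final prominence and an immediately following
# word-initial prominence, within the same phrase, is resolved by
# retracting the first word's prominence. All names are illustrative.

def beat_movement(stress, next_stress, boundary_between):
    """stress, next_stress: per-syllable 0/1 lists; returns adjusted stress."""
    if boundary_between or next_stress is None:
        return stress                    # a phrase boundary blocks the rule
    if stress[-1] == 1 and next_stress[0] == 1 and len(stress) > 1:
        retracted = [0] * len(stress)    # clash: retract prominence
        retracted[0] = 1                 # (simplified: always to syllable 1)
        return retracted
    return stress

# 'Japanese Institute': [0,0,1] + [1,0,0] -> clash -> [1,0,0] (JApanese)
print(beat_movement([0, 0, 1], [1, 0, 0], boundary_between=False))
# with a phrase boundary between the words, no movement: [0, 0, 1]
print(beat_movement([0, 0, 1], [1, 0, 0], boundary_between=True))
```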
In contrast, progressive final lengthening, which can extend over several syllables and would seem to require looking ahead to anticipate an upcoming phrasal boundary, is treated entirely incrementally as a left-to-right process (p. 390): each word is longer than the one before, up until a break at the end of an Intonational Phrase.

4. Discussion of these models

As we have seen, the LRM99 model represents a significant advance over earlier, less explicit models of Word Form Encoding. The model is particularly well worked out for the higher levels of the encoding process, i.e. for Morphological and Phonological Encoding, where the supporting evidence from psycholinguistic experimentation is strong. A key aspect of this model is the integration of the grammatical and lexical information about a sentence into a different kind of structure, i.e. the prosodic structure for the particular utterance of that sentence which the speaker is planning to utter. We are persuaded that this general view is correct. Moreover, the model provides a number of specific advances in the understanding of speech production processes. For example, the fact that the output constituents of Phonological Encoding are prosodic rather than morphosyntactic elements (i.e. PWds rather than lexical words) provides a natural connection to the process of building higher levels of prosodic structure for the utterance. Other strengths of the model, whether new or drawn from the L89 version, include its provision for computing phrase-level prosodic phenomena (such as pitch accent, rhythmic patterns, and intonation contours), its proposal that syllable programs may be specified in terms of articulatory gestures (which appear to be the most appropriate vocabulary for the task of describing systematic phonetic modification and reduction in natural connected speech), and its provision for the computation of phonetic plans from constituents larger or smaller than the syllable. These features highlight the importance of the next step of elaborating precisely how the model accomplishes these tasks. The model also addresses data from priming experiments which shed light on the time course of different aspects of the Word Form Encoding process, it separates semantic, syntactic and phonological information in the lexicon in accord with the empirical evidence, and it introduces the concept of the PWd as the constituent which both forms the planning frame for phonological encoding and begins the process of generating the prosodic structure of the utterance. In short, LRM99 builds on and significantly extends the L89 model of single word processing.

However, the L89 model of phrase-level planning has not yet been revisited in the same way, and the task of integrating the LRM99 model of single word processing with a model of connected speech processing has been largely left for future work. LRM99 envisioned that using the PWd as their basic unit would make the rest of this task straightforward; however, even in L89 it was not easy to see precisely how this integration could be accomplished, and the greater explicitness of LRM99 has made this difficulty even clearer. In our view the problem lies in the idea of encoding the PWd first, and then doing the prosody later. That is, the problem arises from the fundamental conception, shared by L89 and LRM99, that Word Form Encoding is completely separable from phrase-level processing: that these are two separate things that need to be planned separately and then brought together.
This conception is of course a very traditional one, according to which segmental and suprasegmental characteristics can be separated because they control different aspects of the speech signal. However, we believe instead that segmental and suprasegmental characteristics need to be integrated throughout production planning, because prosody is entwined with segmental phonetics. To address these issues, in section 5 we propose a prosody-first model, in which the higher-level prosodic structure of an utterance becomes available as needed during the Word Form Encoding process.

The specific design principles that LRM99 and L89 rely on are not the only principles that are compatible with their general approach of encoding word form by integrating morphosyntactic and prosodic information. In fact, we believe that accumulating evidence about the prosodic factors that influence phonetic variation in word form suggests the value of exploring the different tack which we embark on. We are influenced in this direction by the degree of context-related phonetic variation in word form, the role of higher-level prosodic structure in governing this phonetic variation, and the fact that this variation involves not just the specification of values for traditional prosodic parameters like F0, duration and amplitude of the syllable, but also specification of traditionally segmental parameters such as the timing of complex articulatory gestures within a syllable or even a segment.

We organize our presentation of the evidence supporting our claim according to two questions about how Word Form Encoding can be dependent on context beyond the PWd. The first is whether the speaker needs to look ahead to later portions of the utterance, and the second is whether the speaker needs to look up into the higher levels of prosodic constituent structure. The LRM99 answer to both of these questions (with some hedging) is no, following from the basic tenet of incrementality, which postulates that, just as the speaker generates the syntactic structure of a sentence in left-to-right increments, and can begin the further processing of an early increment before later increments are complete, so the speaker also generates the word forms and prosodic structure of an utterance of that sentence incrementally, one PWd at a time, building higher levels of prosodic structure on the PWds as they emerge from the Word Form Encoding process. In contrast, our answer to both of these questions is yes. We also ask a third question: whether the speaker needs to look inside the syllable in order to compute the differential effects of context on subsyllabic constituents. Again our answer is yes; the rest of this section explains why we take this view.

[…]

…from studies showing that the length of the sentence influences its prosodic constituent structure. For example, Gee and Grosjean (1983) inferred a tendency for phrases (perhaps intonational phrases) to have equal lengths, that is, for the preferred locations for large breaks to come near the halfway points of their sentences. Watson (2002) showed that the length of upcoming portions of the utterance (as well as of earlier parts) influences the speaker's choice of location for Intonational Phrase boundaries in American English. In this spirit, Jun (1993) showed that the number of syllables influences the location of Accentual Phrase boundaries in Korean.
Even if this kind of result is limited to read speech, in which speakers are given the overall length of the utterance in advance, it shows that the speech production mechanism is set up to make use of such information. As with the previous cases of stress clash resolution and eurhythmy, the camel's nose is already under the tent.

4.2. Evidence that speakers look up to higher levels of structure to do Word Form Encoding

Over the past decade or so, increasing evidence has emerged to show that the surface phonetic form of a word in a particular utterance is significantly influenced by the prosodic context in which it occurs, including both prosodic prominence and multiple levels of prosodic constituent structure. As a result, it is reasonable to postulate that the process of Word Form Encoding requires the speaker to have access to this information. We will discuss a sample of the evidence that supports this claim under two headings: evidence for the phonetic effect of prosodic constituent edges, and of prosodic prominences.

4.2.1. Edge effects. Edge effects show that prosodic structure plays an active role in Word Form Encoding, and also reflect the hierarchy of prosodic constituents. The phonetic shape of an individual segment depends not only on its position in its syllable, but also on its position in foot, PWd and larger phrasal constituents. Fougeron (1999) provides a thorough review of studies of the effects of position in prosodic domains, both initial and final, on the realization of individual segments or features. The cases involving initial positions have been called domain-initial strengthening. For example, at LabPhon2 Pierrehumbert & Talkin (1992) showed that /h/ is more consonant-like when it is phrase-initial than when it is phrase-medial, that word-initial vowels are more likely to be laryngealized when they are Intonational-Phrase-initial and/or pitch-accented, and that the Voice Onset Time (VOT) of /t/ is longer phrase-initially. Similarly, at LabPhon6, Keating et al. (in press) presented results from several languages showing that the articulation of a stop consonant's oral constriction is progressively stronger as the consonant occupies the initial position in progressively higher-level prosodic domains; this strengthening does not affect the entire syllable, but only its first portion. A sample of Keating et al.'s data is shown in Figure 6. This figure, obtained using dynamic electropalatography, shows the pattern of maximum contact between the tongue and the surface of the palate during the articulation of a Korean stop /n/ in different phrasal positions – in Korean, the (post-pausal) Utterance, the Intonational Phrase, the Accentual Phrase, and the Word. As can be seen, the contact is greater when the stop is initial in larger phrasal domains. Such findings provide support for the view that speakers need to know something about the prosodic organization of the utterance above the level of the PWd, in order to complete Word Form Encoding. Reduction processes that operate differently at the edges of prosodic constituents also support this claim. For example, a particle can be reduced when it occurs in the middle of a phrase, as for up in e.g. Look up the word, but not when it occurs at the end, as in Look the word up (Selkirk 1995), even when unaccented.
This means that the speaker must know whether there is a boundary after this word, in order to determine the quality of its vowel; it is unlikely that this determination can be made by allomorph selection, in contrast to the case of contracted auxiliaries analyzed in L89:375-380.

Figure 6. Sample linguopalatal contact data for Korean /n/ in Word-initial (Wi), Utterance-initial (Ui), Intonational-Phrase-initial (IPi) and Accentual-Phrase-initial (APi) position: each point is an electrode on the surface of the speaker's palate, with larger points indicating contact with the tongue (from Keating et al. in press).

Final lengthening (e.g. Fougeron 1999 review; Cambier-Langeveld 2000) is perhaps the best-known edge effect. It is challenging to any incremental phonetic model in that it involves not only what appears to be lookahead (because final lengthening may begin before the final word), but also syllable-internal effects (because final lengthening of some segments is greater than that of others). As mentioned in section 3, L89:389-390 suggests an account of this apparent lookahead in terms of progressive (left-to-right) lengthening over the phrase. Yet this mechanism seems implausible in that it predicts that lengthening begins with the second word of a phrase, and also that syllables at the ends of longer phrases should be longer than syllables at the ends of shorter ones. Contra the first prediction, Wightman et al. (1992) found that lengthening did not extend even as far back as the vowel preceding the final syllable.

A final example of edge effects is the pattern of constituent-edge glottalization in American English. For left edges, Dilley et al. (1996) report that reduced word-initial vowels are significantly more likely to be glottalized at the onset of a Full Intonational Phrase than an Intermediate Intonational Phrase. For right edges, Epstein (2002) showed that non-modal phonation is associated with Low boundary tones, but not with low phonetic pitch in general. Redi and Shattuck-Hufnagel (2001) report that final creak is more likely at the end of a Full Intonational Phrase than at the end of an Intermediate Intonational Phrase, a constituent which is lower on the hierarchy. If the creak is a planned variation in voice quality, rather than an automatic consequence of another planned event such as lowered F0 or reduced air flow in these locations, then this pattern provides another piece of evidence for the speaker's need to know something about the higher-level prosodic structure of the utterance in order to carry out Word Form Encoding.

4.2.2 Prosodic prominence effects. Speakers also need to know something about the phrase-level prominences or pitch accents in an utterance in order to complete Word Form Encoding. Since, as we have seen, pitch accent patterns are not solely determined at the PWd level, this means that some aspects of phrase-level prosody, namely prominence patterns, must already be determined when PWd encoding takes place. The effects of prominence on word form have been clearly established; Fougeron (1999) provides a thorough review of such studies. For example, de Jong (1995) suggested that under prominence, segments have more extreme oral articulations that enhance their distinctive characteristics. Edwards et al. (1991) report phonetic differences between syllables which are lengthened due to nuclear pitch accent prominence, and those which are lengthened due to a following prosodic boundary. More recently, Cho (2001) compared the articulation of segments at edges of prosodic domains vs. under prominence.
Under accent, segment articulations are more extreme, longer, and faster; at boundaries, segment articulations are more open, longer, and slower. Thus these effects are not the same, which indicates that speakers take into account both aspects of prosodic structure, prominences and edges, in phonetic encoding. Voice quality is similarly affected by prominences as well as boundaries. Pierrehumbert and Talkin (1992) showed that a word-initial vowel is more likely to be glottalized if it is in a pitch accented syllable. Dilley et al. (1996) showed a similar pattern for a wider variety of pitch accent types, and the wide variation in the rate of word-initial vowel glottalization among their five radio news speakers suggests that this parameter may be under voluntary control. Epstein (2002) showed that accented words, regardless of pitch accent type, are like phrase-initial words in that both have a tenser voice quality, and that this relation is not due to positional effects on F0. Like prosodic constituent edge effects, these effects of prosodic prominence on word form, combined with the dependence of prosodic prominence on overall phrasal prosody, show that speakers must know about prosodic aspects of upcoming portions of an utterance in order to complete the Word Form Encoding of the current PWd.

4.3. Evidence that speakers look up and/or ahead to do Word Form Encoding

The distinction between looking up and looking ahead is not always clear, and indeed many, perhaps most, processes arguably involve both. In this section we consider several processes that might involve both looking up and looking ahead. How individual cases might best be classified into our categories is of course not the point here, since our categories are for expositional convenience only; the crucial point is that of looking beyond the current PWd in some way.

4.3.1. Phonological processes dependent on prosodic structure beyond the PWd. There are many cases in the phonological literature suggesting that speakers make use of information about later PWds, as well as about their prosodic positions, during the Word Form Encoding of earlier words. The thrust of this literature on 'phrasal phonology', which goes back at least to Selkirk (1984) and Nespor and Vogel (1986), is that segmental phonological rules can be conditioned by prosodic domains, in the sense that they apply only within particular domains, or only across particular boundaries. Post-lexical, phrasal phonological (i.e. not phonetic)8 operations mentioned by LRM99 and/or L89 include syllabification of coda consonants (discussed extensively in LRM99), French liaison (L89:367), and English r-insertion (L89:302); Hayes and Lahiri (1991) describe two assimilatory rules of Bengali that apply within Phonological Phrases; Selkirk (1986) provides other examples of phonological rules that are part of the "sentence phonology". Here we mention two further examples of phrasal (across-PWd) operations. The first is Korean nasal assimilation as described by Jun (1993). In Korean, certain final consonants become nasals when before nasals in the same Intonational Phrase. Jun carried out a phonetic experiment that established that the derived nasals are phonetically just like underlying nasals, that is, the neutralization is complete.
Thus, on our view as well as Jun's, …

8 We take phonological rules to be those which change the value of a distinctive feature, adjust stress or insert/delete a segment, resulting in a different word form; in contrast, processes which change the timing or shape of articulatory configurations that implement the distinctive features of phonological segments we view as phonetic.

[…]

A final example of departure from the principle of phonetic encoding via syllable score lookup is L89's treatment of phrase-final lengthening, as noted in section 4.2.1. In a good-faith effort to reconcile this well-known effect with incremental processing, a mechanism is proposed for beginning the lengthening process after the first word, and increasingly lengthening each word until the Intonational Phrase is completed. Although this proposal avoids the necessity for lookahead, it seems to make a number of counterintuitive predictions, e.g. that all non-initial words in a long phrase will be somewhat lengthened, and that the final word or words of a longer phrase will be more lengthened than the same words in a shorter phrase.

In our view, a wide variety of phonetic phenomena which appear to require manipulation of elements smaller than the syllable also suggest that when the syllable-sized articulatory scores are retrieved, the work of phonetic encoding has just begun. For example, Berkovits (1993) has shown (for Hebrew) that in carrying out utterance-final lengthening, speakers lengthen the onset consonant of the final syllable less than the nuclear vowel, which is in turn lengthened less than the coda consonant. Similar effects have been reported for English by Wightman et al. (1992) and Turk (1999). Even the individual gestures of a complex phonological segment have been shown to be independently timed. For example, Sproat and Fujimura (1993) studied the timing of the tongue-tip gesture of a constituent-final English /l/ with respect to the tongue-body gesture, and reported that the tip gesture was delayed in proportion to the duration of the preceding vowel, which presumably reflects the level of the constituent in the hierarchy. Gick (1999) has reported similar effects for the gestures of English /r/. The L89 model includes a post-syllable-selection phonetic adjustment procedure which presumably provides for the contextual adjustment of timing and amplitude of the articulatory gestures of the syllable scores, so this kind of sub-syllabic processing is not incompatible with the spirit of the model. However, L89 is at some pains to describe the operation of even this post-prosody-generation procedure in incremental terms which do not involve lookahead, and as we have seen there are reasons to wonder whether this can work. Moreover, a number of phenomena that have been documented since 1989 expand the work that must be done by a post-prosodic sub-syllabic mechanism for phonetic adjustment, i.e. a mechanism that cannot easily operate on the syllable as a whole, but appears to require access to individual gestural components. For example, left edge effects of the sort reviewed in section 4.2.1 appear to affect only the first segment of a French syllable, as for example the /k/ in a /kl/ cluster, or the vowel in a vowel-initial word (Fougeron 1998). As the operation of this aspect of Word Form Encoding is described more explicitly in future work, we suspect that it will come to resemble the sort of 'prosody first' approach described below in Section 5.
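A toy example of why syllable-sized scores are not enough: phrase-final lengthening of the Berkovits/Wightman sort must scale subsyllabic positions differently. The specific scale factors below are invented for illustration; only the ordering onset < nucleus < coda reflects the findings discussed above.

```python
# A sketch of final lengthening applied differentially to subsyllabic
# positions: the onset stretches least and the coda most, so the
# syllable cannot be scaled as a single whole.

FINAL_LENGTHENING = {"onset": 1.25, "nucleus": 1.5, "coda": 1.75}  # illustrative

def lengthen_final_syllable(syllable):
    """syllable: list of (segment, position, duration_ms) triples."""
    return [(seg, pos, dur * FINAL_LENGTHENING[pos])
            for seg, pos, dur in syllable]

# a hypothetical final syllable, e.g. /t I N/ of 'escorting':
final = [("t", "onset", 80), ("I", "nucleus", 100), ("N", "coda", 90)]
print(lengthen_final_syllable(final))
# [('t', 'onset', 100.0), ('I', 'nucleus', 150.0), ('N', 'coda', 157.5)]
```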
Another set of Word Form Encoding phenomena which the model does not yet address involves suprasyllabic phonetic effects, i.e. effects on whole words. For example, Wright (in press) reports the hyperarticulation of 'hard words', e.g. those which are low in frequency and/or in a high-density lexical neighborhood. Similarly, Jurafsky et al. (in press) and others have described phonetic effects of word frequency and predictability. The current L89/LRM99 model does not explicitly address these recently-documented effects, but again it may be possible to do so by marking encoded PWds for later adjustment in a post-syllable-retrieval process.

In sum, it appears that the integration of prosodic parameter values with the syllable-sized gestural scores will require not only look-ahead and look-up, but also the manipulation of subsyllabic structures and a mechanism for computing phonetic effects on larger constituents such as whole words. While the addition of a processing stage that can accomplish these things is not incompatible with the L89/LRM99 model, and in fact is hinted at by some of their discussion, it expands the scope of Word Form Encoding to include substantial post-prosodic adjustments of the forms of PWds in light of their larger structural and segmental context. It remains to be seen how this requirement can be reconciled with the incremental approach which is central to the single-word-utterance-based view. The point we wish to emphasize is that, even after the pre-compiled gestural scores for the syllables of a PWd have been retrieved, the speaker is still quite far from articulatory implementation—i.e. a good deal more processing is required to specify the systematic differences that we observe both with adjacent context and with higher-level structure, as well as with non-grammatical factors such as speaking rate and style or register.

The lines of evidence summarized above show that, in carrying out Word Form Encoding, particularly its phonetic aspects but also its phonological processes, speakers make use of information about later constituents and higher-level constituents than the current PWd. How can the look-ahead and look-up requirements be satisfied, while preserving the substantial insights embodied in the LRM99 model? The following section sketches one view of how this might be accomplished.

5. Building prosody before Phonological Encoding

An alternative to the local, incremental construction of prosodic specifications upwards from the information available at each PWd is the initial construction of a higher-level prosodic representation of the entire utterance, with subsequent integration of word-form information into this larger structure. That is, the speaker may construct at least some higher-level prosodic representation without reference to word forms, providing a locus for those aspects of prosodic processing that do not require word-form information. As more information about both word forms and non-grammatical factors becomes available, restructuring of this initial default representation may then occur. Many different types of information about words are needed for planning the production of an utterance, including such things as their number and serial order, whether or not a word will receive phonological content in this utterance, metrical information such as the number of syllables and stress pattern, the number and serial order of contrastive phonemic segments within a word, and the articulatory instructions for realizing these contrasts.
Our hypothesis is that not all of this information about the words of an utterance is required at the same time; as a result, various kinds of information can be retrieved separately, as needed for further processing. This view of 'build prosody first, then retrieve segments' contrasts with the 'retrieve segments first, then organize them into prosodic constituents' view.

5.1. Motivation and general principles

What is the motivation for computing higher levels of prosodic structure for an utterance early on in the production planning process, before Word Form Encoding instead of after? The previous section presented compelling evidence that all aspects of Word Form Encoding, including Phonetic Encoding, must refer to prosodic structure. This prosodic structure therefore must have already been built. But in order to build this prosodic structure, we need to take account of a number of factors, of which morphosyntax is just one. Our hypothesis is that the initial prosodic structure is derived directly from the syntax, and then restructured on the basis of non-syntactic information. Some aspects of this prosodic restructuring can be carried out in the absence of word form information; others require at least some information about word form, such as number of syllables and stress pattern; and still others require full knowledge of the phonological segments of the words.

Our general approach, then, is to break down the process of computing the prosodic structure for an utterance into two stages. The first pass creates default prosodic constituents based on the syntax/semantics (that is, on non-word-form information). The second pass is influenced by word-form and prosodic information. The approach we sketch here is based on three assumptions about the time course of prosodic planning and retrieval of this word form information: a) different aspects of prosodic restructuring require different kinds of word form information; b) word form information becomes available in stages, as LRM99 suggest; and c) those aspects of prosodic restructuring which can be carried out with minimal word information are carried out early in the planning process, before complete word-form information is available; restructuring into the final prosodic representation, which requires complete word form information, is carried out later.

By a prosodic representation, we mean a hierarchical grouping of words into higher level constituents, such as phrases of different levels, along with indications of the relative prominence of different constituents, intonational accents marking prominences, and any tonal markings of constituent boundaries. Without committing ourselves to any particular view of what the constituents must be, or whether they can be recursively nested, we will use here, for illustration, a prosodic hierarchy with multiple levels: an Utterance consists of a string of one or more Intonational Phrases (IP), which in turn consist of a string of one or more Phonological/Intermediate (Intonational) Phrases (PP/IntermIP), which are made up of a string of one or more PWds; a minimal rendering of such a structure is sketched below. We adopt this simplified hierarchy here for the purposes of illustration; for example, we do not include a separate constituent to account for the difference in phonetic behavior between PWds like editing and larger groupings like edit it in American, as discussed in Hayes (1989) and Shattuck-Hufnagel (forthcoming).
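Here is the minimal rendering referred to above, expressing the illustrative hierarchy as a Python data structure. It is our own rendering, not a committed theory: it encodes constituency only, and prominence markings and boundary tones would be additional fields.

```python
# Utterance > IP > PP/IntermIP > PWd, as dataclasses.

from dataclasses import dataclass, field
from typing import List

@dataclass
class PWd:
    text: str                       # the lexical material, once encoded

@dataclass
class PP:                           # Phonological/Intermediate Phrase
    pwds: List[PWd] = field(default_factory=list)

@dataclass
class IP:                           # Intonational Phrase
    pps: List[PP] = field(default_factory=list)

@dataclass
class Utterance:
    ips: List[IP] = field(default_factory=list)

# one IP containing two PPs, e.g. subject and predicate:
utt = Utterance(ips=[IP(pps=[
    PP(pwds=[PWd("the"), PWd("puppies")]),
    PP(pwds=[PWd("chased"), PWd("those"), PWd("hippopotamuses")]),
])])
```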
What is the relation between such a prosodic representation of an utterance and the surface syntactic structure of the underlying sentence? Syntax is clearly a significant factor in determining the prosody, but it is not the only factor. Like others working in the prosodic hierarchy tradition, we envision that the prosodic representation is the locus for integrating morphosyntactic influences with non-grammatical influences such as speaking rate, information structure, affect, etc. On this view, these disparate factors exercise their influence on surface phonetic form by influencing the prosodic structure of an utterance. In this discussion, however, we will focus on the grammatical factors, i.e. on the interaction of the evolving prosodic representation with two aspects of morphosyntax: surface syntax, and word form. We will have less to say about how the effects of non-grammatical factors are incorporated into the planning process.

Our basic assumption is this: during Phonological Encoding, the process of constructing the prosodic representation takes advantage of information in stages. Default PPs are formed by a rule of the kind described by e.g. Selkirk 1986, Nespor and Vogel 1986, Hayes 1989: form a PP from the left edge of XP up to X, optionally adding a non-branching complement to the right. In our sentence, this will give two PPs, corresponding to subject and predicate. Finally, a default PWd structure is copied from the lemmas at the bottom of the surface syntactic tree, assuming one PWd for each syntactic word that will have phonological content (which is all of them in this utterance). For the minimal prosodic structure we are employing here, this completes construction of the initial syntax-based prosodic representation of the utterance, shown in Figure 8.

Figure 8. Default prosodic structure for The puppies chased those hippopotamuses, derived from the syntactic structure shown in the previous figure.
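A minimal sketch of this first pass, assuming the surface syntax has already been reduced to a list of XPs and leaving prominence marking aside, might look as follows; the function name and the nested-list representation are our own stand-ins for the tree of Figure 8:

```python
# A sketch of the first pass, assuming the surface syntax has been
# flattened into a list of XPs (each a list of lemmas). Nested lists
# stand in for the tree of Figure 8; the function name is ours.
def default_prosody(xps):
    """One PP per XP (a crude stand-in for the end-based rule 'left
    edge of XP up to X'), one PWd per lemma (all of which have
    phonological content here), and a single default IP spanning
    the whole utterance."""
    pps = [[{"lemma": lemma} for lemma in xp] for xp in xps]
    return [pps]   # a list of IPs, here just one

utterance = default_prosody([
    ["the", "puppies"],                      # subject XP
    ["chased", "those", "hippopotamuses"],   # predicate XP
])
# -> one IP containing two PPs, each a list of still form-less PWds
```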
In such a syntax-derived default prosodic structure, we have less word form information than is assumed in LRM99. There it is assumed that all the phonological form information about a word that is stored in the lexicon, including metrical frames and segments, is available for the construction of a PWd, and presumably it remains available after the PWd is constructed. (L89's Ch. 10 Prosody Generator works this way: higher levels of prosodic structure are built on PWds, and PWds are built with complete word form information.) In contrast, the representation in our Figure 8 contains no word form information, because at this point none has yet been retrieved. Our approach will be to restructure the default prosody on the basis of additional information, information about the forms of the words. To what extent can we proceed incrementally in accessing form information from the lexicon, accessing it only as we have exhausted the possibilities for building prosodic structure, and even then, accessing only enough to proceed a bit further? In other words, how much prosodic processing can be accomplished with how little word form information?

5.2.2. Restructuring the default prosody on the basis of word form information. As we noted above, the syntactic tree tells us roughly how many words there are in the sentence because the terminal nodes of the tree are word lemmas. The LRM99 model incorporates the notion of a default metrical frame for English, the trochee. Logically speaking, we could construct a default metrical structure for the sentence as a whole by simply plugging in a trochaic metrical frame for each (non-empty) terminal node in the tree, and we could then use this default metrical structure as the basis for a first pass of form-based restructuring. Although we considered this possibility, we cannot identify any work that such a move would do for us, and so we leave it aside here. Instead, we consider restructuring based on information taken from the lexicon.

Although much research remains to be done in this area, it seems clear that some aspects of prosodic phrasing depend on information that is not available in the prominence-marked syntactic tree in Figure 7. L89 (section 10.2.3) discusses additional factors influencing IP breaks, such as desire for intelligibility, rate, length, and prominence relations. Nespor and Vogel (1986: 7.2) cite length, rate, and style, in addition to prominence, as factors influencing their restructuring process. Jun (1993: Ch. 5) cites rate, focus, phonological weight (word length), and semantic weight. It seems generally agreed that length is a particularly important factor. Jun (1993: Ch. 5) showed clearly that length matters for the Korean Accentual Phrase: in an experiment where the number of words potentially forming an Accentual Phrase was controlled, but the number of syllables in those words was varied, Accentual Phrases preferentially had five or fewer syllables. If combining words into one phrase would result in more syllables, two phrases were preferred. Thus she showed not only that length matters, but that at least in this particular case, length is counted in terms of number of syllables, not number of words. However, given a lack of other such studies, we do not know whether the relevant metric of length is the same across prosodic domains, or across languages. Therefore we will consider the length metric separately in each section that follows.

5.2.2.1. Intonational Phrase. Nespor and Vogel (1986) proposed that the length metric for IPs is essentially phonetic duration, e.g. the combined effect of length and speaking rate. We will assume, for simplicity in the current speculation, that restructuring of IPs proceeds on the basis of the number of words, but not of their lengths in syllables. Although it is not clear that number of words is the length factor that is important in determining IP structure, information about the approximate number of words is already available from the surface syntactic tree, and thus provides a coarse-grained metric of phrase length that requires no word form information. It is thus like rate, style, or other non-grammatical, but non-word-form, factors influencing phrasing; but at the same time it could be regarded as an aspect of form – phrase form. Taking number of words as our metric, then, if the number of words is too great, or if the speaker desires shorter IPs for some other reason, the default IP can be broken into two or more IPs: [The mayor of Chicago] [won their support] (from Selkirk 1984). Similarly, if the speaker desires longer IPs, or if the number of words in two successive IPs is very small, they may be combined into one IP: [Who me?]. Finally, the speaker may choose among several possibilities for location of IP boundaries, as in [Sesame Street is brought to you by] [the Children's Television Network], vs. [Sesame Street is brought to you] [by the Children's Television Network] (from Ray Jackendoff, p.c.).
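The following sketch illustrates such length-driven IP restructuring under our word-count assumption; the threshold, the choice of split point, and the representation of an IP as a list of PPs are all illustrative placeholders rather than empirical claims:

```python
# A sketch of length-driven IP restructuring under the word-count
# assumption. The threshold and the split point are placeholders for
# whatever syntactic and communicative constraints actually apply;
# an IP is represented as a list of PPs, each a list of words.
MAX_WORDS_PER_IP = 6   # hypothetical threshold

def restructure_ips(ips):
    out = []
    for ip in ips:
        n_words = sum(len(pp) for pp in ip)
        if n_words > MAX_WORDS_PER_IP and len(ip) > 1:
            mid = len(ip) // 2             # split at a PP boundary
            out.extend([ip[:mid], ip[mid:]])
        else:
            out.append(ip)
    return out

default_ip = [[["the", "mayor", "of", "Chicago"], ["won", "their", "support"]]]
print(restructure_ips(default_ip))
# -> [The mayor of Chicago] [won their support], now two IPs
```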
The likely locations for these restructured phrase boundaries are constrained by the surface syntax, but the speaker may choose to overcome these constraints.10 In other words, the default IP structure can be tested against any form-based constraints, such as length, and can also be revised according to other, non-grammatical requirements of the speaker's communicative and expressive intent. In the case of the IP, we posit no word-form constraints. We assume that an utterance with five words, as in our example utterance, can happily comprise a single IP. Restructuring is optional for our example; we assume it does not apply. Thus the IP determined by default from the syntax has satisfied the form-based requirements for a fully prosodic constituent and is now a Full Intonational Phrase (FullIP) in the ToBI transcription system.

10 It may be necessary to provide a mechanism for additional IP restructuring after the metrical frames of the words have been retrieved, as when the single lexical word fantastic is produced as two separate intonational phrases, Fan-Tastic!

At this point, with the restructuring of IPs complete (vacuously for our example), it is possible to assign the tonal elements associated with the boundaries of FullIPs, i.e. the Boundary Tones, to the edges of the FullIPs. The information required to select among these tones is not fully understood; however, it seems likely that these decisions are based on dialogue structure and information structure constraints, rather than on word form information, which is not yet available. Therefore we posit that FullIP Boundary Tones are assigned at this point (just one for our single FullIP). Figure 9 shows the result of processing of the IP, with L% representing the Boundary Tone.

Figure 9. Interim prosodic representation of The puppies chased those hippopotamuses, with default prosodic structure (shown in Figure 8) restructured at the IP level. L% is a low Boundary Tone.

5.2.2.2. Phonological Phrase. The first-pass syntax-based construction for our utterance resulted in two PPs. On the second pass, these two phrases are checked against the factors that influence phrasing at this level, of which we will discuss only length. One assumption in the literature has been that length constraints operate in terms of the number of immediate constituents of the PP, e.g. the number of Clitic Groups (Nespor and Vogel 1986) or of PWds (Inkelas and Zec 1995). However, as noted above, Jun (1993) showed that the small phrase of Korean, the Accentual Phrase, is sensitive to the number of syllables, rather than the number of words. Let us then assume that, to evaluate the need for PP restructuring, we need to know the number of syllables associated with each lemma. Therefore we now retrieve, from the lexicon, the metrical frame for each morpheme/word corresponding to each lemma. The number of syllables is all we need at this point, but since we adopt LRM99's proposal that the syllable count is bound up in the metrical frame with the stress pattern, we will retrieve the stress pattern too (indicated in the frames by a slash through a syllable branch). If a word has a default stress pattern, and consequently no stored metrical frame, then we could specify a trochaic metrical frame. We plug the metrical frames into their locations, i.e. into the default PWds which were provided by the original lemmas, which have now been organized into IP and PP structures.
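As a sketch of this second pass, suppose a hypothetical lexicon pairs each lemma with a syllable count and a stress index; a PP can then be tested against a syllable-count limit, here borrowing Jun's five-syllable figure purely for illustration:

```python
# A sketch of the second pass at the PP level: retrieve each word's
# metrical frame and test the PP against a syllable-count limit. The
# frames below are hypothetical lexicon entries; the five-syllable
# limit is borrowed from Jun's Korean finding purely for illustration,
# since the right metric and threshold for English are open questions.
FRAMES = {   # lemma -> (syllable count, 0-based stress index or None)
    "the": (1, None), "puppies": (2, 0), "chased": (1, 0),
    "those": (1, None), "hippopotamuses": (6, 2),
}

def pp_too_long(pp, max_syllables=5):
    n_syllables = sum(FRAMES[w][0] for w in pp)
    return n_syllables > max_syllables     # True would trigger restructuring

print(pp_too_long(["the", "puppies"]))                      # False
print(pp_too_long(["chased", "those", "hippopotamuses"]))   # True
```

That this toy limit flags the predicate phrase, which we nonetheless leave intact in our example, underscores that the appropriate metric and threshold for English PPs remain to be established empirically.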
Figure 10. Interim prosodic representation of The puppies chased those hippopotamuses, with metrical frames added to lemmas, and with default prosodic structure (shown in Figure 8) restructured through the PP level. L- is a low Phrase Accent.

Figure 11. Interim prosodic representation of The puppies chased those hippopotamuses, with default prosodic structure (shown in Figure 8) restructured through the PWd level.

5.2.3. Phonological Encoding. With the restructuring of prosodic constituents and prominences complete, it is now possible to carry out the processes of Phonological Encoding, i.e. of serially ordering the sound segments of the words, and adapting their phonological shape to the prosodic context. The segments of a word are retrieved from the lexicon, and mapped into the metrical frames of the words in the utterance, which were retrieved earlier. (For simplicity, in our example we follow LRM99 in assigning consonants to syllable onsets wherever possible, even in post-stress positions where they are arguably ambisyllabic or syllable-final (e.g. the second /p/ in puppies).) It is during this process that segment-level sound ordering errors, such as exchanges, may occur. Imagine, for example, that having built the metrical frame for the entire utterance, the speaker then retrieves the segments for all of the lexical items at once, but maps them into the frame left-to-right. This provides a mechanism for (a) the availability of a downstream segment to be mis-mapped into earlier locations, (b) a frame to maintain the location where the downstream segment should have occurred, so that it can receive the displaced segment from the earlier location, and (c) compatibility with the evidence for left-to-right phonological mapping extensively cited in LRM99.

With word-final segments in place, we can determine the forms of inflections like past (e.g. /t/ in chased) and plural (e.g. /z/ in puppies), and obtain the final surface syllable structure of the utterance (e.g. that the plural adds a syllable in hippopotamuses but not in puppies). It is likely that additional restructuring processes take place once the syllable structure of each PWd and the syllable status of each affix is known. For example, we envision that if a canonical syllable structure has been constructed, then empty slots in this syllable will be deleted, making resyllabification of a final consonant possible in some circumstances if the following syllable begins with a vowel. It also may be at this point that phonological consequences of combining clitics with their hosts are computed.

Once the segments are in place, they can be adjusted as appropriate according to their positions in not only the word-sized metrical frames, but in every domain above them. Crucially, the position of every segment in every prosodic constituent can be determined locally by scanning vertically up the tree. As each segment is inserted into the prosodic structure, its encoding (i.e. syllabification, determination of prosodic-context-dependent phonological variation, and possibly restructuring below the level of the PWd) will depend on its position as determined by this vertical scan. Indeed, as described in section 4, much phonological processing could not happen before this point, as it is only now that all the prosodic domains are known.
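The vertical scan is straightforward to picture. The sketch below computes, for a segment at a known position in its PWd, the set of domains in which it is initial or final; the nested-list utterance and the index arguments are again our own illustration:

```python
# A sketch of the vertical scan: for a segment at a known position in
# its PWd, compute the domains in which it is initial or final by
# checking edges at each successive level. The nested-list utterance
# (IPs > PPs > PWds > segments) and the indexing are our illustration.
def edge_status(utt, i, j, k, s):
    """Domains in which segment s of PWd k, in PP j of IP i, is at an edge."""
    pwd = utt[i][j][k]
    initial, final = [], []
    if s == 0:
        initial.append("PWd")
        if k == 0:
            initial.append("PP")
            if j == 0:
                initial.append("IP")
                if i == 0:
                    initial.append("Utterance")
    if s == len(pwd) - 1:
        final.append("PWd")
        if k == len(utt[i][j]) - 1:
            final.append("PP")
            if j == len(utt[i]) - 1:
                final.append("IP")
                if i == len(utt) - 1:
                    final.append("Utterance")
    return initial, final

# One IP; the host-clitic PWd 'the puppies' forms the first PP:
utt = [[
    [["ð", "ə", "p", "ʌ", "p", "i", "z"]],
    [["tʃ", "eɪ", "s", "t"], ["ð", "oʊ", "z"],
     ["h", "ɪ", "p", "ə", "p", "ɑ", "t", "ə", "m", "ə", "s", "ə", "z"]],
]]
print(edge_status(utt, 0, 0, 0, 0))   # /ð/: initial in PWd, PP, IP, Utterance
```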
Our example sentence offers no obvious phonological processing of this sort, but, for example, if there is a redundant phonological feature [spread glottis] indicating aspiration, this would be the point at which it would be assigned to the pre-stress /p/ in puppies and hippopotamuses. Finally, because all the form information is at hand, we can carry out those aspects of form-based restructuring, such as the structure- and rhythm-governed prominence restructuring described in section 4. We presume that any such prominence restructuring operations can be carried out in the absence of information about the phonological segments of the words, and as such should be able to occur before segmental phonological encoding. However, we do need to know which syllables carry lexical stress and therefore are potential docking sites for pitch accents, and which syllables are not. This is not known until the PWds themselves, or at least their metrical frames, are built during phonological encoding. Speakers realize prominences (a property of words) by first attaching pitch accents to the main stressed syllables of those words; then any necessary restructuring takes place. Thus we propose that prominence restructuring occurs here, as a relatively late operation. In our example, there is no stress clash to resolve, and the tendency for an early and a late pitch accent is already satisfied. Perhaps a pitch accent could be added to break up the long span of unaccented syllables, but the intermediate location is on the function word those, which is awkward to accent; and while chased is accentable, its location would not be eurhythmic. Therefore we assume no prominence restructuring in this example. Thus in our example, there are two pitch accents, on the first syllable of puppies and the third syllable of hippopotamuses. Figure 12 shows the results of Phonological Encoding.

Figure 12. Final (though partial – some domains are omitted) prosodic representation of The puppies chased those hippopotamuses, with segments added to metrical frames, and pitch accents associated with stressed syllables. H* is a H-tone pitch accent.

5.2.4. Phonetic encoding. As we have stressed in our discussion so far, the result of Phonological Encoding is still rather far from speech. There is still much work to do, and prosodic structure will play an important role in that work, just as it did for Phonological Encoding. Recall that Phonetic Encoding in LRM99 is usually the retrieval of syllable-sized motor plans from a stored set of gestural scores, and in L89 it is also their realization with the traditional prosodic parameters of duration, F0 change, and acoustic amplitude, appropriate for the phrase. In contrast, as earlier sections have made clear, we think of Phonetic Encoding not only as these traditional computations, but also as the shaping of each phonetic feature (or gesture) by its context, both segmental and prosodic.

Consider, as a partial example, the articulatory planning for the puppies from our sample sentence, which following LRM99 we have given as a single PWd. Because this PWd begins the utterance, its initial consonant is initial in an Utterance and in every smaller prosodic domain; in contrast, its final syllable is final in its PWd but in no higher domain; in addition, puppies bears a lexical stress and a pitch accent on its first syllable. As a result of being in a single PWd, there will be substantial articulatory overlap between the clitic and the host, giving the clitic a short vowel, and vowel-to-vowel interactions should be strong. As a result of being in utterance-initial position, the manner of the /ð/ in the is likely to be strengthened to a stop articulation. As a result of being at the end of the PWd, the last syllable of puppies will be lengthened somewhat. As a result of the pitch accent on puppies, the stop /p/ in the first syllable will be more closely articulated, and the vowel following it will be more open and have a tenser voice quality. As a result of the lexical stress and the pitch accent, that same /p/ will also have a larger glottal opening and thus more aspiration. On the other hand, the word puppies is probably not low-frequency or difficult enough to require any special hyperarticulation to enhance its intelligibility, though the sentence as a whole is odd enough that it might.
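The following sketch shows how such prosodically conditioned adjustments might be computed from the positional facts and prominence marks; the parameter names and multipliers are invented for illustration, with only the directions of the effects motivated by the studies cited above:

```python
# A sketch of prosodically conditioned phonetic adjustment: positional
# facts and prominence marks scale abstract phonetic parameters. The
# parameter names and multipliers are invented for illustration; only
# the directions of the effects are motivated by the studies cited.
def phonetic_adjustments(initial, final, stressed, accented):
    adj = {"duration": 1.0, "glottal_opening": 1.0, "constriction": 1.0}
    if "Utterance" in initial or "IP" in initial:
        adj["constriction"] *= 1.2     # domain-initial articulatory strengthening
    if "Utterance" in final or "IP" in final:
        adj["duration"] *= 1.4         # strong final lengthening at big edges
    elif "PWd" in final:
        adj["duration"] *= 1.1         # weaker lengthening at smaller edges
    if stressed and accented:
        adj["glottal_opening"] *= 1.3  # more aspiration under stress plus accent
        adj["constriction"] *= 1.15    # closer articulation under accent
    return adj

# The accented, stressed first /p/ of 'puppies' (at no domain edge of its own):
print(phonetic_adjustments(initial=[], final=[], stressed=True, accented=True))
```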
… flapping into the clitic (e.g. in visit it). He attributes this difference to contrasting prosodic structures for the two types of syntactic structures. Shattuck-Hufnagel (to appear) reports acoustic evidence to support this observation, at least for some speakers. Eventually, it will be important to model the ways in which the prosodic structures used by speakers of different languages and dialects may result in different effects on the surface phonetic forms of words, just as we need to address how different prosodic structures for different utterances of the same sentence in a single language may result in different surface phonetic word forms. This kind of cross-language processing distinction has not yet been addressed in detail by any model, so it is not yet possible to evaluate how well it can be dealt with by one approach vs. another. Like LRM99, we will leave this issue for future work.

Finally, it must be noted that the detailed LRM99 model of single word Phonological Encoding, along with its extensive integration with experimental data and its sophisticated handling of time course information, is the result of a mind-boggling amount of effort at the Nijmegen laboratory over the past decade to understand the complex workings of the human speech production system. The explicitness and comprehensiveness of this model have inspired, and continue to inspire, a generation of cognitive scientists and linguists to tease out and test its many implications, and to develop alternatives. Any proposal for a 'prosody first, segments later' model will have to account for the large body of relevant experimental data from this and other laboratories, from studies of language breakdown in aphasia and of language acquisition, as well as for the insights that have come from the development of formal phonological theories. The less-detailed L89 model of connected speech planning, in turn, tackles the even more challenging problem of understanding how speakers generate phrase-level prosodic structure, and how they specify the potentially extreme but highly systematic phonetic variation that has long been observed in continuous speech, particularly conversational speech. The work of the next decade will be to bring speech planning models to the admirable level of completeness exhibited in LRM99, without giving up the ambitious goals of L89. Such an accomplishment would have implications for many fields other than cognitive modeling of speech production, including automatic speech synthesis and, eventually, remediation of speech pathologies and the teaching of second languages. We look forward to these developments with eager expectation.

References

Baumann, M. (1995) The production of syllables in connected speech. Nijmegen U. dissertation.
Beckman, M. E., Swora, M. G., Rauschenberg, J. & de Jong, K. (1990) Stress shift, stress clash and polysyllabic shortening in a prosodically annotated discourse. In Proceedings of the 1990 International Conference on Spoken Language Processing, Kobe, Japan, 1: 5-8.
Beckman, M. & Pierrehumbert, J. (1986) Intonational structure in Japanese and English. Phonology Yearbook 3: 255-309.
Berkovits, R. (1993) Progressive utterance-final lengthening in syllables with final fricatives. Language and Speech 36: 89-98.
Bolinger, D. (1958) A theory of pitch accent in English. Word 14: 109-149.
Booij, G. (1985) Coordination reduction in complex words: a case for prosodic phonology. In H. van der Hulst and N. Smith (eds.), Advances in Nonlinear Phonology, Dordrecht: Foris, pp. 143-160.
Browman, C. P. & Goldstein, L. (1988) Some notes on syllable structure in articulatory phonology. Phonetica 45: 140-155.
Browman, C. P. & Goldstein, L. (1990) Representation and reality: physical systems and phonological structure. J Phonetics 18: 411-424.
Browman, C. P. & Goldstein, L. (1992) Articulatory Phonology: An overview. Phonetica 49: 155-180.
Butterworth, B. (1989) Lexical access in speech production. In W. Marslen-Wilson (ed), Lexical representation and processes. Cambridge, Mass: MIT Press.
Byrd, D. & Saltzman, E. (2002) The elastic phrase: Dynamics of boundary-adjacent lengthening. USC and Boston U. ms.
Byrd, D., Kaun, A., Narayanan, S., and Saltzman, E. (2000) Phrasal signatures in articulation. In M. B. Broe and J. B. Pierrehumbert (eds.), Papers in Laboratory Phonology V: Acquisition and the Lexicon, pp. 70-87. Cambridge University Press.
Cambier-Langeveld, T. (2000) Temporal Marking of Accents and Boundaries. U. Amsterdam dissertation (LOT dissertation 32, HIL 50).
Cho, T. (2001) Effects of prosody on articulation in English. UCLA dissertation.
Cohn, A. (1990) Phonetic and Phonological Rules of Nasalization. UCLA Working Papers in Phonetics 76.
Crompton, A. (1982) Syllables and segments in speech production. In A. Cutler (ed), Slips of the Tongue and Language Production. Berlin: Mouton.
Crystal, D. (1969) Prosodic systems and intonation in English. London: Cambridge U. Press.
de Jong, K. (1995) The supraglottal articulation of prominence in English: Linguistic stress as localized hyperarticulation. JASA 97: 491-504.
Dell, G. (1986) A spreading activation model of retrieval in language production. Psychological Review 93: 283-321.
Dell, G. (2000) Commentary: Counting, connectionism, and lexical representation. In M. B. Broe and J. B. Pierrehumbert (eds.), Papers in Laboratory Phonology V: Acquisition and the Lexicon, 335-348. Cambridge University Press.
Dell, G., Burger, L. K. and Svec, W. R. (1997) Language production and serial order: A functional analysis and a model. Psychological Review 104: 123-147.
Dilley, L., Shattuck-Hufnagel, S. & Ostendorf, M. (1996) Glottalization of word-initial vowels as a function of prosodic structure. J Phonetics 24: 423-444.
Edwards, J., Beckman, M. E. & Fletcher, J. (1991) The articulatory kinematics of final lengthening. JASA 89: 369-82.
Epstein, M. (2002) Voice Quality and Prosody in English. UCLA dissertation.
Ferreira, F. (1993) Creation of prosody during sentence production. Psychological Review 100: 233-253.
Fougeron, C. (1998) Variations Articulatoires en Début de Constituants Prosodiques de Différents Niveaux en Français. U. Paris III – Sorbonne Nouvelle dissertation.
Fougeron, C. (1999) Prosodically conditioned articulatory variations: a review. UCLA Working Papers in Phonetics 97: 1-74.
Fromkin, V. A. (1971) The non-anomalous nature of anomalous utterances. Language 47: 27-52.
Fromkin, V. A. (1973) Introduction. In Fromkin, V. A. (ed), Speech Errors as Linguistic Evidence, 11-45. The Hague: Mouton.
Fry, D. (1969) The linguistic evidence of speech errors. Brno Studies in English 8: 69-74.
Fujimura, O. (1993) C/D Model: a computational model of phonetic implementation. Cognitive Science Technical Report #5, Center for Cognitive Science, Ohio State University.
Fujimura, O. (2000) The C/D model and prosodic control of articulatory behavior. Phonetica 57: 128-138.
Garrett, M. F. (1975) The analysis of sentence production. In Bower, G. H. (ed), The Psychology of Learning and Motivation, 133-177. NY: Academic Press.
Garrett, M. F. (1976) Syntactic processes in sentence production. In Wales, R. J. and Walker, E. (eds), New approaches to language mechanisms, 231-256. Amsterdam: North Holland.
Garrett, M. F. (1984) The organization of processing structure for language production: Applications to aphasic speech. In D. Caplan and A. R. Lecours (eds), Biological perspectives on language, 172-193. Cambridge, Mass: MIT Press.
Gee, J. P. and Grosjean, F. (1983) Performance Structures: A Psycholinguistic and Linguistic Appraisal. Cognitive Psychology 15: 411-458.
Gick, B. (1999) The Articulatory Basis of Syllable Structure: A Study of English Glides and Liquids. Yale U. dissertation.
Gow, D. (in press) Assimilation and anticipation in continuous spoken word recognition. J Memory and Language.
Guenther, F. H. (1995) Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production. Psychological Review 102: 594-621.
Hall, T. A. and Kleinhenz, U. (eds) (1997) Studies on the Phonological Word. Amsterdam and Philadelphia: John Benjamins.
Halliday, M. A. K. (1967) Intonation and grammar in British English. The Hague: Mouton.
Hayes, B. (1984) The phonology of rhythm in English. Linguistic Inquiry 15: 33-74.
Hayes, B. (1989) The prosodic hierarchy in meter. In Kiparsky, P. and Youmans, G. (eds), Rhythm and Meter, Phonetics and Phonology 1, Academic Press.
Hayes, B. and Lahiri, A. (1991) Bengali intonational phonology. Natural Language and Linguistic Theory 9: 47-96.
Horne, M. (1990) Empirical evidence for a deletion formulation of the rhythm rule in English. Linguistics 28: 959-981.
Inkelas, S. and Zec, D. (eds) (1990) The Phonology-Syntax Connection. Chicago and London: U. of Chicago Press.
Inkelas, S. and Zec, D. (1995) Syntax-phonology interface. In Goldsmith, J. (ed.), The Handbook of Phonological Theory. Oxford: Blackwell, pp. 535-549.
Jun, S. (1993) The phonetics and phonology of Korean prosody. OSU dissertation.
Jurafsky, D., Bell, A., and Girand, C. (in press) The Role of the Lemma in Form Variation. To appear in Warner, N. and Gussenhoven, C. (eds.), Papers in Laboratory Phonology VII.
Kaisse, E. (1985) Connected speech: the interaction of syntax and phonology. Academic Press.
Keating, P. (1984) Phonetic and phonological representation of stop consonant voicing. Language 60: 286-319.
Keating, P. (1990a) The window model of coarticulation: articulatory evidence. In J. Kingston & M. Beckman (eds.), Papers in Laboratory Phonology I, Cambridge University Press, pp. 451-470.
Keating, P. (1990b) Phonetic representations in a generative grammar. J Phonetics 18: 321-334.
Keating, P. (1996) The Phonology-Phonetics Interface. In U. Kleinhenz (ed.), Interfaces in Phonology, Studia grammatica 41, Akademie Verlag, Berlin, pp. 262-278.