The results of this study, stated broadly, suggest that it is feasible to extend speech-act analysis to meta-locutions, that humans use feedback to maintain coherence, and that irregularities in speech are clues to understanding rather than noise to be cleaned away. A computational model of meta-locutionary acts was developed and validated through both a protocol study and a rule-based simulation.
This dissertation began by posing a question which I called the integration problem: how can the speech-act theory of conversational action be reconciled with the sociological view of the irregularities of actual interactive discourse? In response, I suggested that conversants maintain a shared model of their conversation; that is, they are jointly creating a single conversation and each conversant has a model of the conversation they believe themselves to be creating. In light, then, of the coöperative, feedback-infused, locally-controlled nature of conversation, I suggested that the conversational repairs which maintain the shared model can be seen as acts akin to speech acts, except that they specifically affect the process of the conversation itself rather than more general aspects of the world related, for example, to conversants’ underlying needs.
The view of conversational coherence as being produced by meta-acts of the conversants establishes the goal of the research reported in this dissertation: to identify and to model computationally a set of acts relative to these sub-sentential levels of conversation. I then described a set of conversational effects, as reported in the sociological literature, which formed a basis for ascribing meta-acts to the conversants. I presented an experimental technique to observe these acts.
I applied the theory of meta-locution to the protocols which I had obtained. That is, given the theory, did it account for the interaction observed in actual conversations? The theory suggested that meta-locutionary acts (acts by a conversant which control the process of the conversation) could be modeled at several different conversational levels, ranging from domain actions to turn-taking. In the case of turn-taking specifically, for example, the set of directive acts included hold-turn, give-turn, take-turn, and accede(turn). The interaction observed between the conversants could be explained computationally in terms of this set of meta-locutionary acts. Examples of operators were presented which, given the state of a conversant's model of the conversation, would account for the observed acts. In three specific cases, I used the meta-locutionary analysis to explain otherwise anomalous utterances. The protocol analysis also confirmed the observation that humans use feedback to maintain coherence. Indeed, most of the acts accounted for in the protocol are not domain-level acts but rather concern and affect the control and comprehension of the conversation itself. It is also true that the conversation obtained in the protocol contains all kinds of interruptions and sentence fragments. The meta-locutionary analysis of the protocol evidence is consistent with a view that each utterance, however fragmentary, has meaning in the conversational context. In understanding human communication, the individual fragments of the conversation are valuable as clues to coherent interpretation. I note that in the observed interaction, the production of utterances is inherently tied to understanding of the other conversant’s utterances. Moreover, a number of the utterances are contemporaneous, suggesting that timing of utterance production is a mutually determined process, as might be expected in a coherent mixed-initiative dialogue.
This suggests that in analyzing interactive discourse, language production and language understanding cannot be separated.
The computational model developed through the protocol analysis was tested through a rule-based simulation. Note that the simulation did not constitute the theory; rather, the theory was partially implemented as a rule-based system to evaluate its adequacy for computer-based interaction. The simulation presented in Chapter VI showed that meta-locutionary operators can control mixed-initiative discourse in a manner that resembles the kind of interaction observed in natural conversation. Meta-locutionary acts specifically modeled in the simulation included the requesting, taking, giving, and holding of conversational turns. The simulation also showed situated action producing a form of global coherency through the use of local, context-driven operators.
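The idea of local, context-driven operators can be illustrated with a minimal sketch in Python. This is not the simulation of Chapter VI. The state fields (`holder`, `requested_by`), the `has_more_to_say` flag, the rule ordering, and all function names are assumptions made for illustration; only the four kinds of turn act (requesting, taking, giving, and holding) come from the text above.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TurnState:
    """A conversant's model of the turn. Both fields are hypothetical."""
    holder: Optional[str] = None        # who currently holds the turn
    requested_by: Optional[str] = None  # who has asked for the turn, if anyone


def select_act(me: str, state: TurnState, has_more_to_say: bool) -> Optional[str]:
    """Choose a turn-level act from local context only; no global plan is consulted."""
    if state.holder == me:
        return "hold-turn" if has_more_to_say else "give-turn"
    if state.holder is None and has_more_to_say:
        return "take-turn"
    if has_more_to_say and state.requested_by is None:
        return "request-turn"
    return None  # nothing to do at the turn level


def apply_act(me: str, act: str, state: TurnState) -> TurnState:
    """Return the turn state after `me` performs `act` (assumed effects)."""
    if act == "take-turn":
        return replace(state, holder=me)
    if act == "request-turn":
        return replace(state, requested_by=me)
    if act == "give-turn":
        # The turn passes to whoever requested it, or becomes free.
        return TurnState(holder=state.requested_by, requested_by=None)
    return state  # hold-turn leaves the turn state unchanged


def simulate(rounds: int = 6) -> List[Tuple[str, str]]:
    """Run two hypothetical conversants, A and B, and record their acts."""
    state = TurnState(holder="A")
    remaining = {"A": 2, "B": 1}  # units each conversant still wants to say
    trace: List[Tuple[str, str]] = []
    for _ in range(rounds):
        for me in ("A", "B"):
            act = select_act(me, state, remaining[me] > 0)
            if act is None:
                continue
            if act == "hold-turn":
                remaining[me] -= 1  # holding the turn lets the speaker say one unit
            state = apply_act(me, act, state)
            trace.append((me, act))
    return trace
```

Calling `simulate()` yields a trace in which B requests the turn while A is still speaking, A gives it up once finished, and B then speaks and releases it. No operator refers to the conversation as a whole; the ordering emerges from each conversant's local rules.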
The research presented in this dissertation represents a set of first steps in understanding the control of mixed-initiative discourse. The central thesis of accounting for coherence through meta-locutionary acts appears well-founded. That the particular acts proposed here represent the ultimate expression of the theory seems subject to doubt. The simulation study made clear the need for specialized operators (and presumably acts) dealing with correction and repair. Other classes of acts will probably emerge. Aside, though, from the obvious future work in this area involving the expansion and refinement of the acts, and aside from the design issues discussed with respect to the implementation of the simulation, there are other issues to address. Some of these issues are representational, and have to do with understanding what we are observing in conversations recorded in protocol studies. Other issues concern the other kinds of human-human interaction to which this methodology might be applied.
With respect to the process of transcribing and encoding the protocols, there remain aspects of the conversations which are as yet unrepresented. First, pauses, while implicitly represented in the time-line transcript, are not specifically noted as either things or acts. It is possible that pauses may, in and of themselves, be acts. The interpretation of pauses as possible acts seems subjective in the extreme, however. Chafe (1986) suggested, in the context of generation, that pauses correspond to the process of memorial retrieval; that is, concepts have activation levels which determine the retrieval costs for related concepts. According to Chafe, patterns in discourse are shaped by the speaker's costs of retrieval corresponding to active, semi-active, and inactive concepts. In any event, the problem of accounting for pauses remains unsatisfactorily addressed by the meta-locutionary model.
Another problem arises out of the subjective nature of the coding of acts from the utterances, which is necessarily a process of interpretation. In a qualitative study in which rules for coding acts from lexical representations have not been expressed, it falls upon the coder to make a subjective interpretation of the act in the context of the conversation. In the case of the protocol analyzed in this dissertation, this process has been an iterative one, with successive refinements and alterations of the encodings of the acts and the states in order to produce a rational account of the observed conversation. Does not this process result in an encoding for the conversation which must work, under the circumstances? The answer to this question is in three parts. First, although the acts are motivated by accounts in the socio-linguistic literature, the development of the specific acts and their computational representations is the express goal of this research. That is, this process is the one which Cohen (1984) proposes. Second, the set of acts, their representations, and the operators, are validated by application in the simulation. Third, there is a need for future work in applying the acts developed on the basis of the protocol reported here to other protocols. If effectiveness outside of the context of their development can be observed, then the acts may reasonably be considered valid.
Other Voices, Other Rooms
The theory may also have application for other sources of human interaction and other contexts. Clearly, the theory of meta-locutionary acts could be applied in other domains. If the theory is true, then meta-locutionary acts should be consistent among different speakers of a given language; otherwise we would find it hard to converse at all. Moreover, the acts should be fairly domain-independent. The dissertation research looked at two different pairs of speakers, both in the sequence-recollection domain. The main obstacle to application of the theory to other domains is the difficulty of representing the domain knowledge with sufficient exactness to permit the simulation to run. The complexity of even the sequence-recollection domain, especially as compared to the meta-locutionary knowledge, turned out to be daunting. The problem is not that it is hard to find some representation which would permit a computer program to perform the domain task in the abstract. Rather, the problem is in linking the representation of the domain to mental states which can form the basis for linguistic action (Stucky, 1988).
Speech act theory has also been applied to communication within organizations (Winograd & Flores, 1986). There may be meta-locutionary analogs for communicative control and coherence in this interaction. This may actually be a fruitful domain because of the relatively formalized and statically represented nature of intra-organizational communication.
The protocol analysis and simulation suggest two areas of inquiry as to the role of cognitive constraints in conversation. Understanding these constraints would aid explication of the observed discourse and would immediately serve to improve communication through computer-human interfaces.
One of the issues to which the simulation forced attention was that of the size of the units of communication. How much can people understand at once? For computers this is generally not a problem; if the agents were simply Prolog programs they could have just transferred their sequences to each other and merged them. Obviously, most people cannot do this; the protocol subjects certainly didn’t. The human cognitive constraints which determine how much we can absorb at a time thus have consequences for the design of mixed-initiative interfaces. The computer program must accommodate the human’s limited capacity. Conversely, there may be constraints on the human capacity for linguistic production. Systems which are receiving information from humans may be better equipped to understand the interaction if they can relate the size of the units of language produced to factors associated with the production process.
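The contrast between a wholesale transfer and a capacity-limited exchange can be sketched in Python. The `capacity` parameter, the function names, and the idea of counting one exchange per unit are illustrative assumptions; they are neither measurements of human capacity nor parts of the simulation.

```python
from typing import Iterator, List, Sequence, Tuple


def chunks(items: Sequence[str], capacity: int) -> Iterator[List[str]]:
    """Split a sequence into units no larger than the receiver's assumed capacity."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    for start in range(0, len(items), capacity):
        yield list(items[start:start + capacity])


def transfer(items: Sequence[str], capacity: int) -> Tuple[List[str], int]:
    """Deliver `items` unit by unit.

    Returns the receiver's copy and the number of exchanges needed. In a
    conversation, each exchange would be an occasion for feedback.
    """
    received: List[str] = []
    exchanges = 0
    for unit in chunks(items, capacity):
        received.extend(unit)
        exchanges += 1
    return received, exchanges
```

With a seven-item sequence, a receiver able to absorb everything at once (`capacity=7`) needs a single exchange, whereas `capacity=3` requires three. The smaller the unit, the more points at which acknowledgement or repair can occur.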
The variation in human skills for interaction is large (see e.g., Argyle & Cook, 1976). Presumably conversants take account of and adapt for these differences, which inject a correspondingly large measure of uncertainty into the interaction process: Is the other conversant really attending? Was that a nod? Variation in the size of the units of language which conversants can produce or understand creates uncertainty as to (1) the degree of understanding and (2) the degree to which conversants believe that they have been understood. These factors, then, suggest reasons why feedback is such an important part of conversation. Conversants are continually faced with issues of understanding which are not ordinarily resolvable through one-sided inference. Thus greater understanding of the limits of human cognitive capacity would better allow modelers of interaction to base operators on the underlying reasons for the use of meta-locutionary acts.
Tolerance of Uncertainty
Having observed that conversation is replete with sources of uncertainty, I now turn to some of the issues of how conversants manage the uncertainty they encounter. The central question I want to ask is: How tolerant of uncertainty are conversants? The answer to this question has important implications for problems such as deciding when to initiate conversational repairs. In the model and simulation, uncertainty was reduced to qualitative simplifications such as true and mutually_known_true. Yet conversants apparently proceed with imperfect knowledge of the state of the conversation or, when faced with obvious differences in their conversational models, will continue without repair if the differences are not too great. How much imperfection is too much? How great a difference is too great?
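One way to make these questions concrete is to replace the qualitative values with a graded measure of difference and a tolerance threshold. The Python sketch below is purely hypothetical: the divergence measure, the `tolerance` parameter, and the function names do not come from the model or the simulation, and nothing in this dissertation supplies an empirical value for the threshold.

```python
from typing import Dict


def divergence(mine: Dict[str, bool], believed_theirs: Dict[str, bool]) -> float:
    """Fraction of propositions on which two conversational models differ.

    A proposition present in only one model counts as a difference.
    """
    keys = set(mine) | set(believed_theirs)
    if not keys:
        return 0.0
    differing = sum(1 for k in keys if mine.get(k) != believed_theirs.get(k))
    return differing / len(keys)


def should_repair(mine: Dict[str, bool],
                  believed_theirs: Dict[str, bool],
                  tolerance: float) -> bool:
    """Initiate repair only when the difference exceeds the assumed tolerance."""
    return divergence(mine, believed_theirs) > tolerance
```

For two models that disagree on one proposition out of four, a conversant with `tolerance=0.3` would continue without repair, while one with `tolerance=0.1` would initiate it. The open empirical question is what determines that value and how it varies with context.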
In addition to uncertainty with respect to mutuality of knowledge, there is uncertainty as to whether purely mental perlocutionary effects have been achieved. If a tutor asserts some fact to a student, is the tutor justified in assuming that the student now knows the fact? How well is well enough? These are issues which might be addressed by psycho-linguistic experimentation. Conversations could be induced in which responses indicating various degrees of comprehension are returned to the subjects.
Finally, if conversants are not certain of the other’s knowledge, they are also uncertain as to their own knowledge--or at least their understanding of their knowledge. As they listen, conversants seem to assume that if they are momentarily off track they will eventually recover using later information. Thus they may continue to give affirming signals to the speaker despite imperfect understanding of what is being said. Again, this phenomenon seems to be a matter of gradation. The level of tolerable uncertainty probably varies with context. To what extent, then, are conversants apt to defer repair? In interacting with computers, the perceived (or real) bother of repair may lead people to defer it to the point of irreparability; the program’s limited ability to track the conversant’s knowledge may have been quickly exhausted.
The bandwidth limitations of human-computer interaction appear to limit the richness and ease of meta-locutionary action. Differences in modality cause changes in interaction patterns (see e.g., Argyle & Cook, 1976; Chapanis, et al., 1972, 1977; Grosz, 1982; Cohen, 1984). The changes are not always straightforward:
Under telephone or no-vision conditions, utterances are often shorter, some studies find more pauses, and less mutual influence, but there are fewer interruptions. There is some evidence of poorer synchronizing (more pauses), and for transfer of function to other signals (more attention signals). There is a reversal of a clear-cut expectation about interruptions, suggesting either that people learn to use different cues over the telephone, or that the shift of gaze cue is not very helpful. Evidently people either do not or cannot interrupt each other if they cannot see each other. (Argyle & Cook, 1976, pp. 163-164)
This evidence suggests that with present technology the human-computer interface is at an inherent disadvantage for mixed-initiative interaction. There are a number of approaches for possible solutions to this problem. The first approach would involve alternate lexicalization of meta-locutionary acts which are expressed in the physical channel. The idea here is to provide, through design, a set of verbal lexemes that correspond to physical kinemes. Unfortunately, the history of artificial languages in human use is dim. The languages which thrive are those created through conventional acceptance of negotiated units and meanings. This suggests a second but more difficult approach: provide the computer program with (1) the skills to negotiate the use of meta-locutionary interaction and (2) connections to the community of programs in which the language is being negotiated. Human beings would have, I surmise, great difficulty in developing language skills (and the language itself) if isolated from the general community of language users. If anything, computer programs are more susceptible to failure on this account because their language-learning skills are at best rudimentary.
Beyond alternate lexicalization, it may be that different modalities lead to the use of different meta-locutionary acts altogether. That is, conversants do not simply express the same acts in different ways; rather, they actually interact differently. This is, I think, a fair inference from the results described by Argyle and Cook (1976) and Grosz (1982). Some of the observed differences may not arise directly out of the limitations on the kinds of communication permitted. Instead, there may be second-order effects from, for example, declines in the speed of feedback. That is, feedback would still be possible through the available channels, but the process of conversion would be too slow. Such effects certainly might lead to significant changes in the naturalistic feel of the interaction. Furthermore, it may be the case that the feedback-based production processes are, through nature or through entrenched habit, dependent on prompt interaction for their effectiveness. In this case, substitute acts for feedback would have to account for speed of interaction. In computer-human interaction through graphically and aurally based interfaces, the ease and speed of feedback is asymmetric. The computer program can provide feedback much more rapidly than it can interpret it. This suggests that if meta-locutionary acts are to be useful for computer interfaces greater efforts should be made in the technology of physical input devices.
Humans and Computers
Over the course of the preceding chapters, I have tried to account for the differences between human-human and human-computer interaction. Many of these differences, I argued, can be explained by understanding mixed-initiative discourse as a multi-level process which uses meta-acts to produce coherence. In reaching these conclusions (and in applying them), though, we again face the limits of our knowledge of fundamental aspects of mind. In the case of intention, for example, the model presented in this dissertation achieves a reasonable level of success by ascribing intentionality to meta-behaviors. The upper levels of the conversational process rely on fairly diffuse intentions which are then translated into smaller and more specific intentions. In this process, planning is not used (although it might be useful in generating the high-level intentions); rather, the intentional acts set up expectations which are taken into account as part of the conversational context for later action. But as I discussed in Chapter III, the limits of this approach are near. It may be that intention itself is not a useful concept for inducing action. Suchman (1987) suggested that plans are a post-hoc rationalization of the organization of behavior rather than the mechanism which produces it. How can we exclude, then, the possibility that intention itself is simply a post-hoc rationalization of the causes of behavior rather than the mechanism which leads to it? The problem, though, is that at this point we simply do not have an alternative theory of action. If not for intention, how else--or why else--would people do things?
Another fundamental limit arises out of what we perceive to be mixed-initiative interaction. In this dissertation, I have attempted to explicate mixed-initiative discourse through meta-acts. But if these acts are all co-temporally related behaviors at different conversational levels, the problem of mixing initiative also may arise at the meta-locutionary levels. I have suggested that the “lower” levels of conversational interaction represent a set of base cases for this recursion, but this cannot be concluded with confidence absent further research.
Yet even if we are reaching limits of knowledge, I hope that these limits are better defined and better illuminated as the result of this work. In the issues which I discussed in this work, the fundamentals come down to the fact that research into human-computer interaction must tell us more about what it means to be a human being, what it means to be a computer, and what it means to interact.