A TESTING METHODOLOGY AND ARCHITECTURE FOR COMPUTER SUPPORTED COOPERATIVE WORK SOFTWARE - By Robert Francis Dugan Jr.
A thesis submitted to the graduate faculty of Rensselaer Polytechnic Institute in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Major Subject: Computer Science

May 26, 2000 (for Graduation August 2000)

Approved by:
Professor Ephraim P. Glinert, Computer Science, Chairperson of Supervisory Committee
Professor Edwin H. Rogers, Computer Science, Member
Professor Mark K. Goldberg, Computer Science, Member
Professor Mark Embrechts, Decision Sciences and Engineering Systems, Member

Rensselaer Polytechnic Institute
Troy, New York

Rensselaer Polytechnic Institute

Abstract

A TESTING METHODOLOGY AND ARCHITECTURE FOR COMPUTER SUPPORTED COOPERATIVE WORK SOFTWARE

by Robert Francis Dugan Jr.

Despite enormous potential, CSCW software is still immature. In particular, leading researchers in both the CSCW and testing fields have noted that CSCW testing tools are nonexistent. This thesis contributes a methodology and architecture for execution based testing of CSCW software.

The CSCW Application MEthodoLOgy for Testing (CAMELOT) provides an organized set of specific techniques that can be used for technological evaluation. The evaluation is organized into two phases: single user and multi-user. Single user evaluation is subdivided further into general computing and human computer interaction. General computing examines software components that provide basic application capabilities. Human computer interaction focuses on the interface between the user and the software application. Multi-user evaluation examines distributed computing and human-human interaction. Distributed computing scrutinizes components responsible for multitasking and multiprocessing in the application at the thread, process, processor and machine level. Human-human interaction focuses on how the software facilitates interaction between users during application use.

Rebecca, our testing architecture, contributes to both general and multiuser testing systems. In the area of general testing Rebecca:
- Provides an extensible component and event model that allows the record/playback of non-GUI events
- Allows selective event recording through record filtration
- Promotes the integration of the test system into the development environment
- Outputs test scripts in the developer’s native language
- Reduces re-recording using component-centric events and runtime component resolution
- Simplifies the test process using a simple VCR-like interface

In the area of multiuser testing Rebecca:
- Integrates live users into a test session with triggers that play back virtual user behavior based on user interface, state change, timer, or user customized events
- Provides runtime configuration of triggers via the threshold models
- Simplifies virtual user synchronization with deadlock detection and recovery
- Simplifies multiuser script editing via a global clipboard
- Maintains IPC independence, but allows IPC to be recorded
- Scales well with a resource conserving architecture

Our architecture was implemented in Java as a working system called Rebecca-J. The methodology, architecture, and working system were evaluated by testing a mature CSCW application. The evaluation uncovered several dozen problems with the CSCW system. In addition to validating our approach, the evaluation prompted immediate improvements to the architecture and implementation, and provided important ideas for future enhancements.

TABLE OF CONTENTS

1 Introduction
  1.1 Problem Overview and Motivation
  1.2 The Contributions of Our Research
    1.2.1 CSCW Application Methodology for Testing
    1.2.2 Rebecca: An Architecture for Execution Based Testing of CSCW Software
    1.2.3 Evaluation
  1.3 Overview of this Document
2 A Survey of Computer Supported Cooperative Work
  2.1 Groupware Applications
  2.2 Groupware Toolkits
3 A Preliminary Experiment
  3.1 Architecture
  3.2 Experimental Method
  3.3 Task Overview
  3.4 Evaluation, Results, and Analysis of Team Performance
  3.5 Lessons Learned from the Development of CollabBillboard
4 Survey of Prior Work in Testing Systems
  4.1 Goals of Testing
  4.2 Research Testing Systems
    4.2.1 Requirements
    4.2.2 Specification
    4.2.3 Design
    4.2.4 Implementation
    4.2.5 Integration
    4.2.6 System Testing
  4.3 Human Computer Interaction Testing
    4.3.1 Testing Architectures
    4.3.2 Usability Testing
  4.4 Commercial Test Systems
    4.4.1 Test Planning
    4.4.2 Test Management
    4.4.3 Test Development
    4.4.4 Test Execution
    4.4.5 Test Analysis
    4.4.6 Test Measurement
    4.4.7 Multiuser Testing
5 A CSCW Application Methodology for Testing
  5.1 Related Work
    5.1.1 Taxonomy of Evaluation Methodologies
    5.1.2 CSCW Evaluation Methodologies
  5.2 A Technology Focused Methodology
  5.3 Single User Evaluation
    5.3.1 General Computing
    5.3.2 Human Computer Interaction
  5.4 Multi-user Evaluation
    5.4.1 Distributed Computing
    5.4.2 Human-Human Interaction
  5.5 Conclusion
    5.5.1 Ordering an Evaluation
    5.5.2 Comparison to Existing Methodologies
    5.5.3 Part of a Complete Evaluation
6 Rebecca: An Architecture for Testing CSCW Applications
  6.1 General Architecture
    6.1.1 Registration Management
    6.1.2 Event List Management
    6.1.3 Component Management
    6.1.4 Playback Management
    6.1.5 State Management
    6.1.6 Trigger Management
  6.2 General Infrastructure
    6.2.1 IDE Integration
    6.2.2 User Interface Independence
    6.2.3 Extensible Component and Event Models
    6.2.4 Record Filtration
    6.2.5 Script Simplification
    6.2.6 Playback Control and Feedback
    6.2.7 Native Language Recordings
  6.3 Multiuser Support
    6.3.1 Interprocess Communication Independence
    6.3.2 Playback Orchestration
    6.3.3 Triggers
    6.3.4 Threshold Model
    6.3.5 Global Clipboard
    6.3.6 Scalability
    6.3.7 Application Independence
7 Evaluation
  7.1 The Reconfigurable Collaboration Network
  7.2 Evaluation Phase I: Converting Rebecca to Java 1.2
  7.3 Evaluation Phase II: Getting Rebecca to work with RCN
    7.3.1 Component Detection
    7.3.2 Component Naming
    7.3.3 Component Existence
    7.3.4 Modal Dialogs
    7.3.5 Menu Bars
    7.3.6 Synchronization Feedback
  7.4 Evaluation Phase III: Evaluating RCN
    7.4.1 Single User Tests
    7.4.2 Multiuser Tests
  7.5 Discussion
8 Conclusion and Future Work
  8.1 CSCW Application Methodology for Testing
  8.2 Rebecca: An Architecture for Execution Based Testing of CSCW Applications
  8.3 Evaluation
  8.4 Future Work
    8.4.1 The Future of CSCW Evaluation
    8.4.2 Multiuser Recording
    8.4.3 User Swapping
    8.4.4 Remote Windowing
A Appendix: RCN Bugs Discovered During Evaluation
  A.1 Error message displayed when starting up RCNPublicServer in Win95/98
  A.2 Configuration of PATH shell variable necessary for NativeLibrary.dll for RCNPublicServer in Win95/98
  A.3 ISServer does not always flush terminated RCNPublicServer
  A.4 Documentation Errors
  A.5 Inconsistent use of Quit, Exit, Leave, Cancel
  A.6 “Pick a IS” is grammatically incorrect
  A.7 No version number displayed in RCNPublicServer, rcnClient, ISServer
  A.8 User Preference Dialog Displays Invalid Colors
  A.9 Preference Dialog Displays Too Many Colors
  A.10 Preference Dialog Allows Same Color for Two Users in Same Session
  A.11 No lock mechanism for simultaneous edits of Team Information
  A.12 Race Condition Joining a Session
  A.13 Ghost Cursor Hidden By New Applications
  A.14 Sticky Mouse Buttons
  A.15 Multiple Client Control of Public Machine
  A.16 Incorrectly Translated Keys
  A.17 Sticky Shift, Alt, and Ctrl Keys
  A.18 Race Condition in rcnClient’s User Interface
  A.19 Race Conditions Joining Sessions, Users, Teams, Publics
  A.20 Inconsistent use of OK, Okay
  A.21 Flickering Ghost Cursor
  A.22 Confusing Display of Session Clients
  A.23 Memory Leaks in Public and Client When Ghosting
  A.24 Can’t play Indiana Jones from rcnClient
  A.25 Correspondence from RCN Development Team
B Rebecca-J Information
References

LIST OF FIGURES

Figure 1: Rapid prototyping model of the software life cycle [10]
Figure 2: Time/Space Taxonomy of Groupware [5]
Figure 3: CollabBillboard socket shadow
Figure 4: Sketch of experimental design
Figure 5: Selecting a billboard site in the city
Figure 6: Control window for assembling billboard. Both users see the same window, view the entire billboard frame and move pieces
Figure 7: Assigned roles "view billboard" window. This user has a zoomed out view of the billboard frame but cannot move any pieces
Figure 8: Assigned roles "place billboard" window. This user has a zoomed in view of the billboard frame and can move pieces.
Figure 9: Z Language schema for CollabBillboard
Figure 10: GIL: Specification for queueRemotePieceUpdate$n
Figure 11: GIL specification for drawRemotePieceUpdate$n
Figure 12: Control flow graph for loop with five possible logic paths
Figure 13: Code fragment from CollabBillboard
Figure 14: Usability guidelines from [87]
Figure 15: Final Exam C/S Test Multiuser Architecture
Figure 16: Taxonomy of Evaluation Methodologies [122]
Figure 17: Intersecting Technologies of a CSCW Application
Figure 18: CAMELOT’s Single/Multiuser Stages
Figure 19: Technology and Social Aspects of CSCW [122]
Figure 20: General architecture diagram for Rebecca
Figure 21: Registration management architecture diagram for Rebecca.
Figure 22: High level view of event list model/view/controller architecture
Figure 23: Detailed view of event list model/view/controller architecture.
Figure 24: Component management architecture diagram for Rebecca
Figure 25: Playback management architecture diagram for Rebecca
Figure 26: Algorithm for event list replay.
Figure 27: Algorithm for native language replay
Figure 28: State management architecture diagram for Rebecca
Figure 29: Trigger management architecture diagram for Rebecca
Figure 30: Connecting to Rebecca-J using IBM’s Visual Age Visual Composition Editor
Figure 31: Connecting to Rebecca-J using inline code
Figure 32: Recording is played back correctly even though UI components have moved.
Figure 33: UI Components translated to Rebecca’s Component Hierarchy
Figure 34: Creation and initialization of PropertyChangeComponentInt in AgentTester
Figure 35: AgentTester’s modified setter for monitoring state change to integer count
Figure 36: Implementation of dispatchEvent() for PropertyChangeEventRecord
Figure 37: Implementation of playbackEvent() for AgentTester
Figure 38: Selective recording with Rebecca-J
Figure 39: Recorder turned on and recording of plus push button press made
Figure 40: Push button press events copied and pasted back into the event list
Figure 41: Result of replay
Figure 42: Feedback for synchronization state in Rebecca-J
Figure 43: Implementation of MouseEventRecord’s toJavaString() Method
Figure 44: Sample output from MouseEventRecord’s toJavaString() Method
Figure 45: Sample implementation of executeEventRecordList()
Figure 46: Recording customized with a for loop
Figure 47: Implementation of setCount()
Figure 48: Implementation of remoteSetCount()
Figure 49: Record filtration to remove redundant events while recording IPC.
Figure 50: Original Playback Orchestration Proposal
Figure 51: An O(V+E) algorithm to determine cycles in a graph
Figure 52: Resource graph (left) with deadlock cycle detected (right)
Figure 53: Reworked playback orchestration
Figure 54: Algorithm to Process Synchronization Events
Figure 55: Algorithm for the removal of a synchronization event from a script.
Figure 56: Determining synchronization points for SecondWind’s recording.
Figure 57: Synchronization Dialog for SecondWind’s Recording
Figure 58: Synchronization event inserted just before mouse press on slider bar in SecondWind’s recording
Figure 59: Timer trigger and virtual user script to support the metronome in Rebecca-J
Figure 60: Deadlocked scripts.
Figure 61: Deadlock dialog.
Figure 62: User interface for triggers in Rebecca-J
Figure 63: A threshold editor is necessary for a simple event type threshold model.
Figure 64: Rebecca-J’s editor for the propertyChangeInt threshold model.
Figure 65: The mouseRegion threshold model editor
Figure 66: Timer browser in Rebecca-J
Figure 67: Configuring a timer trigger for a single virtual user.
Figure 68: Ordering recording players in Rebecca-J
Figure 69: Adding a customized threshold model to ThresholdList’s initialize() method
Figure 70: Implementation of the compare() method for low level key event threshold models.
Figure 71: Implementation of compare() method for keySequence threshold model.
Figure 72: Constructor for mouseRegion threshold model
Figure 73: Implementation of event sequencing threshold model in Rebecca-J.
Figure 74: A shared drawing/chat application
Figure 75: An example of trigger chaining
Figure 76: Trigger chaining extends shared drawing area test
Figure 77: Trigger state chaining example
Figure 78: Derivation of unique name from root component

LIST OF TABLES

Table 1: SCR Table for CollabBillboard
Table 2: Equivalence classes for cos
Table 3: Final Exam C/S-Test™ TML Script Commands for Multiuser Script Synchronization
Table 4: Session Control window from SQA Suite™
Table 5: General Computing Techniques from ¹[14] and ²[10]
Table 6: General Computing ∩ Human Computer Interaction Techniques from ¹[125], ²[40]
Table 7: Usability Techniques from [40]
Table 8: Distributed Computing Techniques
Table 9: General Computing ∩ Distributed Computing Techniques
Table 10: Human Computer Interaction ∩Distributed Computing Techniques .......................79 Table 11: Human-Human Interaction Techniques ............................................................................82 Table 12: Human-Human Techniques Organized by CAMELOT Code .....................................84 Table 13: Rebecca’s Remote Objects....................................................................................................93 Table 14: Threshold Models implemented in Rebecca-J.................................................................139 Table 15: Bugs discovered in RCN using CAMELOT and Rebecca-J ........................................177 Table 16: RCN's shared objects classified by coupling and architecture......................................189 Table 17: Results of RCN Ghost Scalability Testing .......................................................................190 xi ACKNOWLEDGMENTS My six-year doctoral journey was filled with detours. Some needed to be explored, some should have been left alone, and some thank goodness, I managed to avoid. I suspect the twenty other students that entered the program with me in 1994 began a similar journey. Thirteen made it through the qualification exam. Four passed the candidacy exam. Three are completing the degree. I wanted to thank the people who helped me make this journey a success. Thank you, Mr. Brown, my eighth grade science teacher at East Lyme Junior High School. Your guidance and confidence in my abilities awakened a source of inner strength and drive that changed my life forever. You are a wonderful teacher. Thank you, Mom and Dad for watching over me when I was young and being supportive while letting me find my own way as an adult. Dad, you taught me the value of hard work and persistence. Mom, you taught me an artist’s creativity. Thank you Mike, Tim, Kathleen, and Grandmommie. 
During the frustrations and doubts that appeared along the way, your faith and confidence that I was doing the right thing reaffirmed my own.

Thank you cousins: Jenny, Katie, Lizzie, Byron, Brendan, and Aeron. You were my home away from home during my tenure here at Rensselaer. Your questions over the years about what grade I was in were fun to answer and a pressing reminder to finish.

Thank you, friends in the Computer Science department: Jeff Neshewait, Patrick Fry, Amir and Amanda Sehic, Dr. Stephen Blythe, Rick Klein, Gregg Steuben, Louis Ziantl, Terry Hayden, Pam Paslow, Darren Lim, Quincy Stokes, and Lina Guzman. You made me feel welcome, gave me great advice, and showed me how to have fun in Troy. The time has passed too quickly.

Thank you, friends in the Literature, Language, and Communications department: Lynne Cooke, Anne Navin, Dr. Joe Downing, Dr. Lee Honeycutt and Carolyn Honeycutt. You kept the other side of my brain from atrophying while I was immersed in geekdom.

Thank you, Bill Oldfield and Paula Paul. You two had a profound impact on my professional career. You gave me responsibility and challenging projects, and your confidence in me brought a dawning realization that I could succeed at anything I set my mind to.

Thank you, Dr. Steven Howes. You’ve been my best friend for over twenty years. You blazed the Ph.D. path before me, showing me that it was attainable by mere mortals. Your advice and support as a veteran of the process was invaluable.

Thank you, WPI professors Nabil Hachem, Matthew Ward, E. Malcom Parkinson, Michael Gennert, and Stanley Selkow. I clearly remember the look on Matt’s face one night as he described the sabbatical to Australia he was about to take. Your encouragement and enthusiasm when I was deciding whether to pursue a doctorate tipped the scale.

Thank you, Professor Ephraim Glinert. You took me on as an advisee and gave me freedom to pursue my own curiosity.
Your wise counsel kept me from straying down too many dead ends. Your willingness to fund my research on and off campus, and to grant a brief leave of absence gave me the flexibility I needed to get the degree completed.

Thank you, Professor Edwin Rogers. Your advice and involvement as a member of my committee was invaluable. You prodded me to produce a formal testing methodology that has become an important part of the thesis. Your research group’s application, RCN, was exactly the kind of collaborative system I needed for an evaluation of the thesis. Finally, our many conversations about sailing helped keep me sane.

Thank you, Rensselaer computer science senior J.J. Johns. You led the RCN development team and helped me a great deal during the evaluation phase of my thesis. I appreciated your willingness to accommodate my needs while taking a full course load and continuing your regular RCN duties.

A special thank you to my wife, Becky Dugan. I’m so grateful for your support this past year. You’ve taken care of a lot of the details of daily life so I could focus on finishing this dissertation. You’ve also been a great editor, therapist, and friend.

“The journey is the reward,” to quote a Tao saying, and I couldn’t agree more. These past six years have been the most incredible of my adult life. I’ve honed my skills as a computer scientist, taught classes, worked for an Internet startup company, and conducted serious academic research. To top it all, in a Wilderness First Aid class on campus I met the best thing that ever happened to me: my wife Becky. Thank you Rensselaer!

1 Introduction

Human beings are social animals [1]. Many of the developments that are the hallmarks of human society can be traced to the need to interact and cooperate. Language allows more efficient and expressive communication. Money is used to acquire goods and services from others.
Organizations such as the family, university, workplace, government, and law that preserve, protect, and advance humanity rely on complex interplay between individuals [2].

Technology has also played an important role in the evolution of humanity. Tools of the mind - for gathering, processing, and distributing information - have had the greatest impact in the twentieth century [3]. Among these tools is the computer, arguably the most powerful tool ever developed. This power comes from the computer’s ability to deal with information management in a generalized fashion [4].

"A computer-based system that supports groups of people engaged in a common task (or goal) and provides an interface to a shared environment" is called groupware [5]. Douglas Engelbart published a visionary paper describing a groupware system called NLS in 1968. NLS contained many of the basic functions that can be found in modern groupware systems including e-mail, shared annotations, shared screens, shared pointers, and audio/video conferencing [6]. During the 1970s, e-mail and threaded text conversations (e.g. conferencing systems and bulletin boards) became commonplace.

The need to interact and socialize combined with the technological progress of the past several decades has led to the development of a branch of study known as Computer Supported Cooperative Work (CSCW). Research expertise in the CSCW field covers a wide range of disciplines including computer science, psychology, anthropology, and education. Applications that fall under the CSCW umbrella are diverse: electronic mail, newsgroups, chat, multi-user editors, meeting support, videoconferencing, shared simulations, and workflow are some examples.

Despite enormous potential, CSCW applications are still immature. Four software technology components must be successfully integrated in order to create a useful system:

General computing provides basic application functionality found in any software system.
Determining that a software program, even a simple one, functions correctly has been the subject of decades of research.

Human-computer interaction technology supports the interaction between a user and the application. All of the difficulties inherent in developing the interface for a single user system apply including: iterative design, reactive programming, multithreading, undo/redo, and real-time programming.

Distributed systems cover software that supports the execution of the application on multiple computers. Classic problems of multiprocessing that have to be confronted include: inter-process communication, process synchronization, session management, and fault tolerance.

Human-human interaction deals with functionality supporting interaction among several users. Issues include: coordination, coupling, privacy and user awareness.

Creating, testing, and maintaining a program that uses any one of these software technologies is difficult. The effort involved in a system that combines all these technologies is truly daunting. To attack these difficulties, the CSCW research community has tried to simplify the creation of groupware through the development of toolkits. These toolkits address four important areas: run-time architecture, programming abstractions, groupware widgets, and session management. The run-time architecture aids the programmer with process management and inter-process communication. Programming abstractions simplify synchronization of distributed events and data. Groupware widgets provide the programmer with GUI components for multiuser applications. Session management allows the programmer to customize how users create, join, leave, and manage a multiuser application.

1.1 Problem Overview and Motivation

Researchers concede that there is room for improvement of groupware toolkits [7]. For example, little work has been done to integrate audio and video into CSCW applications [8].
By examining the software lifecycle, other areas for CSCW application improvement can be discovered (see Figure 1).

The difficulties encountered in creating CSCW applications also apply to their verification. For example, security and privacy need to be validated before users can feel confident that their private work is protected. The development process that user interface intensive CSCW applications go through requires constant reevaluation. Undo/Redo scenarios can get extremely complicated in multiuser settings, and require thorough verification. Distributed systems like CSCW software are “notoriously difficult to write, test, and debug” [9]. Leading researchers in both the CSCW and testing fields note “CSCW testing tools are non-existent” [8].

To date CSCW evaluation efforts have been broad based, advocating the examination of both the social and technological aspects of an application. These broad based approaches combined with the research community’s preference for social evaluation have created a lack of specific techniques for the technological evaluation of CSCW software. A methodology is sorely needed given the complexity of the testing task.

Figure 1: Rapid prototyping model of the software life cycle [10] (stages: Requirements, Specifications, Rapid Prototyping, Design, Implementation, Integration, Maintenance, and Changed Requirements, with Verify and Test steps)

In addition to the lack of techniques for testing, there is also the logistical problem of finding “real” users to exercise the software [11]. A usability test requires users to exercise the application. Typically, the first user to exercise a compiled and linked program is the developer. When the developer is satisfied that the application is operating properly, a second stage begins when real users are brought in for further study. It is relatively easy for a developer to play the role of a user in a single user application, but in a CSCW setting this becomes more challenging.
“Because we need at least two or more people for each observation scenario, we spend more time scheduling subjects and setting up equipment to observe each subject” [12]. It is hard enough to get one user to commit to a block of test time. It is even more difficult to get two or more users to agree to the same block of time. Higher costs in terms of time and money are incurred during CSCW testing because of this scheduling problem and the greater number of users needed for testing. A common sense approach runs both users’ portions of the program on a single machine. Input and output are straightforward since both users have the same keyboard, mouse, and display. It is possible for the developer to see the immediate effect that one user’s action has on the other because all output goes to the same display. From a cost standpoint, this method is attractive because it only requires a single machine. There are, however, significant drawbacks to this approach. Concurrency is severely restricted because only one user can have the input focus on the machine. Network performance is inaccurately represented because communication between users never leaves the local machine. General system performance is also misrepresented. In a heavily graphical application, for example, the performance when multiple users run on the same machine may be unacceptable due to intense image manipulation. Screen real estate can also be a problem. Since many CSCW applications are designed for one user per display, it may be difficult to view both users’ output simultaneously. Other one-per-machine resources may not be shared properly. For example, there is only one system cursor per machine. It may be impossible to test a system cursor remote control function until the application runs on two machines. Multiuser audio output is also difficult to test on a single machine. 
Distributing a two-user application across two machines eliminates most of the single machine problems and more accurately represents how the system will behave. However, a single developer trying to exercise the application on two machines requires a great deal of dexterity and agility. Two displays provide an overwhelming amount of screen surface to observe during simultaneous visual updates. Multiple keyboards and mice allow concurrent input, but require dexterous skills for anything beyond a simple key-press or button-click. Imagine a single developer trying to type two sentences on two keyboards at the same time! Sophisticated simultaneous mouse manipulation is also difficult. Audio also presents a problem. It can be difficult, for example, to isolate which machine is producing audio output during execution. Headphones offer an option with multiple testers, but this isn’t possible with a single developer. The difficulty of usability testing a CSCW application increases when three, four, or more users are added to the system.

We acquired firsthand experience with the difficulties of developing and testing a CSCW application during the creation of CollabBillboard [13]. CollabBillboard is a multiuser simulation developed to test the theory that explicit user roles can induce greater collaboration. Although our evaluation of the application supported the hypothesis, we found the entire process frustrating. The biggest problem was how much we underestimated the amount of time needed to complete the application. It took almost three times longer than we expected! A major contributor to the delay was difficulty in finding subjects to help test the application. For the reasons discussed above, a single user, the developer, was not sufficient to thoroughly exercise the program. It was often necessary to comb the halls for volunteers, and as the months went by, they became increasingly reluctant.
We began an investigation of testing systems to determine if any of them could have helped during the development of CollabBillboard. The research community has focused primarily on efforts that automate verification early in the software life cycle. The earlier a software error can be detected in the life cycle, the less costly it is to fix [14]. Even with black box and white box testing (see Section 4.2.4), which appear late in the cycle, automatic testing techniques are used. Work in early life testing has proven impractical for large complex applications. Late cycle techniques like black and white box testing are intractable for all but the simplest of programs. The research community has almost completely ignored the system test stage.

The commercial world, on the other hand, takes a less formal, execution-based approach to verification. The tester is responsible for manually creating test cases to be executed against the system, with little guidance from the testing tool. The test cases are executed against the application during the implementation, integration, and system test phases.

Fixed test cases are insufficient for the verification of a CSCW application. There is no opportunity for the CSCW tester to participate in the test. The tester cannot change the direction of a test case on the fly. As a passive observer, the tester cannot view the actions of a user effectively because the automated test executes too quickly. Finally, the tester lacks fine-grained control over the virtual users participating in the test.

1.2 The Contributions of Our Research

Our research has focused on improvements to execution based testing of CSCW software. We have developed CAMELOT, a CSCW Application MEthodoLOgy for Testing. Developers and quality assurance personnel can use CAMELOT to evaluate software technology that comprises a CSCW application.
We devised Rebecca, an architecture for an execution based test system, motivated by the desire to support live user participation in a CSCW test. In addition, the architecture makes important contributions to general execution based testing systems. To determine the efficacy of our work, CAMELOT and a Java based implementation of Rebecca were used to evaluate a mature CSCW application: Rensselaer Collaborative Network (RCN). The evaluation uncovered over twenty bugs in RCN, flaws in Rebecca and the implementation, and provided valuable feedback for future work.

1.2.1 CSCW Application Methodology for Testing

Existing methodologies take a broad based approach to the evaluation of a CSCW application. While acknowledging that technology plays a role in a CSCW system, these methods give few details on how its evaluation should proceed. The CSCW Application Methodology for Testing (CAMELOT) provides an organized set of specific techniques that can be used for technological evaluation. The methodology breaks the testing process into two stages: single user and multi-user. In the single user stage, General Computing and Human Computer Interaction features are examined. During the multi-user stage, Distributed Computing and Human-Human Interaction aspects are investigated. A unique code is associated with each technique. The code provides a classification scheme for the tests used and problems uncovered during application evaluation. We believe CAMELOT’s techniques are inclusive of most of the technology tests an evaluator would want to perform on a CSCW application.

1.2.2 Rebecca: An Architecture for Execution Based Testing of CSCW Software

A critical component is missing from multiuser CSCW application development that is taken for granted in single user applications: support for live user testing. Anytime someone wants to test a single user application, they can pose as the user and run the application.
As explained above, it is very difficult for a single person to perform a live user test when multiple users are required. State of the art commercial and research testing systems do not provide adequate guidance or support for a single person to perform live multiuser verification.

Our approach to integrating a live user into an execution based testing architecture focuses on the shortcomings of traditional execution based test systems. Rebecca makes significant contributions to the general infrastructure of execution based testing systems:

- The record/playback process is improved beyond the user interface with extensible component and event models. Any application activity can be replayed if the source is defined as a component, and the activity is defined as an event.
- A record filtration system is defined that allows the user to filter events by selecting which components participate in a recording. In past systems, the only filtration options were manually intensive intermittent recording or editing of the recording.
- Unlike traditional testing systems that view testing as a separate task from development, the architecture seamlessly integrates into existing integrated development tools such as IBM's Visual Age.
- For sophisticated data structures and control flow in a test script, Rebecca describes a blueprint for exporting recordings in a familiar format: the IDE's native programming language. This contrasts with traditional test systems which require the user to learn a proprietary scripting language.
- Re-recording of scripts after application changes have been made is reduced using runtime resolution of components and component-centric events.
- Recording script management is simplified with a VCR-like metaphor for creating, editing and executing tests. This allows the user to create and run a test in seconds.
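As a rough illustration of the record filtration idea in the list above, the following Java sketch keeps only those events whose source component has been selected to participate in a recording. All names here (FilteredRecorder, RecordedEvent, include, exclude, offer) are hypothetical and are not taken from Rebecca or Rebecca-J; the sketch assumes only that every application activity is reported as an event tagged with the name of the component that produced it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not Rebecca's actual API.
final class FilteredRecorder {

    /** One application activity: the component that produced it and the kind of event. */
    static final class RecordedEvent {
        final String componentName;
        final String eventType;

        RecordedEvent(String componentName, String eventType) {
            this.componentName = componentName;
            this.eventType = eventType;
        }
    }

    // Names of the components the tester has chosen to record.
    private final Set<String> participating = new HashSet<>();
    // Events accepted so far, in arrival order.
    private final List<RecordedEvent> recording = new ArrayList<>();

    /** Adds a component to the set that participates in the recording. */
    void include(String componentName) {
        participating.add(componentName);
    }

    /** Removes a component from the set that participates in the recording. */
    void exclude(String componentName) {
        participating.remove(componentName);
    }

    /** Offers an event to the recorder; returns true only if it was recorded. */
    boolean offer(RecordedEvent event) {
        if (!participating.contains(event.componentName)) {
            return false; // filtered out: source component was not selected
        }
        recording.add(event);
        return true;
    }

    /** Returns a read-only snapshot of the events recorded so far. */
    List<RecordedEvent> recording() {
        return Collections.unmodifiableList(new ArrayList<>(recording));
    }
}
```

With a filter of this kind, a tester could record activity from, say, a chat input field while ignoring status bar updates, rather than pausing the recording by hand or editing the recording afterwards.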
Rebecca also breaks new ground in the area of multiuser execution based testing:

- The ability to incorporate live and virtual users into a single test session using distributed triggers. With triggers, virtual users react to events generated by other users (live or virtual). Existing test systems completely prescribe a test session which precludes meaningful live user participation.
- Virtual users can react to four classes of events using triggers: user interface, state change, timer, and customized. This allows the virtual user to respond to virtually any application activity, much like a live user.
- Threshold models are provided which allow the tester to specify the characteristics of an event or sequence of events that will fire a trigger. A threshold model has a user interface component, which allows runtime specification of firing conditions. An extensible object oriented framework for complete customization is also included.
- Improvements to synchronization during multiuser playback including an orchestration metaphor, simplified synchronization mechanisms, deadlock detection, and deadlock recovery.
- A global recording clipboard, which simplifies the process of sharing some or all of a recording between virtual users.
- Ability to record, playback, and monitor application communication while maintaining independence from the communication mechanism. Existing test systems do not provide the ability to monitor application communication. The few academic systems that do provide this ability are mechanism specific.
- A resource conserving architecture. This allows the system to run in tandem with an IDE, and improves scalability as the number of users participating in a test increases.

It is expected that Rebecca will impact the development of future execution based testing systems and collaborative software. Rebecca promotes the integration of testing early in the software life cycle.
This is critical because studies have shown that the earlier a bug is discovered the less expensive it is to correct. The architecture also provides guidance for the development of future multiuser testing. This guidance includes independence from the application's communication infrastructure, improvements to multiuser synchronization, triggers, and a scalable design. Finally, Rebecca-J, a Java-based implementation of the test system architecture, is available for immediate use for the development and testing of Java-based collaborative software. In addition to improvements in multiuser testing, this should immediately benefit the research community by alleviating the need for live users during a multiuser test.

1.2.3 Evaluation

We believe the evaluation of our methodology and testing architecture was a success. Unsolicited correspondence from the RCN team (see Section A.25) showed gratitude for the problems uncovered by the CAMELOT and Rebecca approach. Two dozen bugs were discovered in this mature CSCW application. Some of the problems were cosmetic. However, some of them were serious and are being corrected to make RCN a robust application. Rebecca was also significantly improved. Flaws in the component management architecture were uncovered and corrected. Problems with modal dialogs were also fixed. Finally, several ideas for enhancements to Rebecca were formulated.

1.3 Overview of this Document

The rest of the document is broken up into several chapters. Chapter 2 gives the reader an understanding of the scope of CSCW and describes some of the major groupware toolkits. Chapter 3 describes the CSCW application CollabBillboard and the lessons we learned from creating the software. One of the biggest problems we had developing CollabBillboard was testing the system between versions. Chapter 4 looks at state of the art academic and commercial contributions to the field of software testing.
Chapter 5 describes CAMELOT, the CSCW Application Methodology for Testing, a set of techniques we have developed specifically for testing collaborative software. Chapter 6 describes Rebecca, an architecture we have created for a collaborative software testing system. In Chapter 7, CAMELOT and Rebecca are evaluated by using them to test the Reconfigurable Collaboration Network. Chapter 8 concludes the thesis with some thoughts on future work.

2 A Survey of Computer Supported Cooperative Work

Groupware applications can be classified in several ways. One common method of categorization looks at how an application deals with issues of time and space. When multiple users are using a groupware application, when and where interaction occurs helps to define the system’s capabilities. Temporally, users can interact at the same time or at different times. Spatially, users can interact in the same place or from different places. Figure 2 illustrates these possibilities.

Figure 2: Time/Space Taxonomy of Groupware [5]

An example of same time/same place groupware is Rensselaer’s Design Conference Room Collaboration Network. This software was designed for face-to-face design meetings. Participants each have access to a private workstation and use a floor control policy to control access to a shared public workstation [15]. Chat programs like Internet Relay Chat (IRC) are examples of same time/different place groupware. Users communicate with each other via shared text windows where messages are typed and responses are viewed in real-time [16]. Groupware that facilitates communication between users at the same time, regardless of spatial location, is known as synchronous groupware. E-mail is an example of different time/different place groupware. The final category, different time/same place groupware, has no known applications. A bulletin board where people could leave messages for each other demonstrates this type of collaboration [5].
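The time/space classification described above can be written down as a small lookup. The following Java sketch is only an illustration; the class, enum, and method names (TimeSpaceTaxonomy, Time, Place, classify) are hypothetical, and the four category labels are the ones that appear in Figure 2.

```java
// Hypothetical sketch of the time/space taxonomy of groupware.
final class TimeSpaceTaxonomy {

    /** Whether users interact at the same time or at different times. */
    enum Time { SAME, DIFFERENT }

    /** Whether users interact in the same place or from different places. */
    enum Place { SAME, DIFFERENT }

    /** Maps a (time, place) pair to the interaction category shown in Figure 2. */
    static String classify(Time time, Place place) {
        if (time == Time.SAME) {
            return place == Place.SAME
                    ? "face-to-face interaction"
                    : "synchronous distributed interaction";
        }
        return place == Place.SAME
                ? "asynchronous interaction"
                : "asynchronous distributed interaction";
    }
}
```

Under this sketch, a chat program such as IRC would be classified with Time.SAME and Place.DIFFERENT, and e-mail with Time.DIFFERENT and Place.DIFFERENT.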
                   Same Place                  Different Place
Same Time          face-to-face interaction    synchronous distributed interaction
Different Times    asynchronous interaction    asynchronous distributed interaction

2.1 Groupware Applications

This section presents an overview of the major classes of CSCW applications and some of the important software systems that have been developed to implement them.

Some Common CSCW Applications:

A number of common computer applications fall under the domain of CSCW. Electronic mail consists of the asynchronous exchange of information between a sender and one or more recipients [17].

Newsgroups operate in a manner similar to electronic mail. Information is exchanged asynchronously between a sender and a newsgroup covering a specific topic of interest through an activity known as “posting”. Users interested in a particular newsgroup then download and read postings using a newsreader. Newsgroups are a more public form of expression than electronic mail, which directs messages to a limited group of recipients [18].

Chat allows two or more users to communicate synchronously using text. Users can add messages to the shared text window by typing in a private compose area of the client chat program and selecting a “send text” option. Within seconds, the text message will appear in the shared text window of all room occupants. Chat is more conversational than electronic mail or newsgroups, because of the real-time communication [19].

Videoconferencing is a method of synchronous distance communication between participants using live audio and video. There is strong interest in this technology because of the time and money involved in attending face-to-face meetings. Despite obvious benefits, videoconferencing has not replaced face-to-face meetings because of issues like lack of support for eye contact, difficulty integrating remote users from multiple sites, and insufficient network bandwidth [20].
Workflow software is asynchronous groupware that helps improve the process of performing multi-person tasks in the workplace. Some examples of improvement include reduced lag time, because manual task routing is eliminated, and better feedback about the state of the tasks that comprise the business process [21].

Shared Windows:

A shared window system allows synchronous collaboration through a logical window physically replicated on the screens of participating users. A user provides input and views output in this window in exactly the same manner as other windows on the display. However, any action in a shared window is immediately reflected on the displays of other participating users [22]. Single user applications running inside the shared window are collaborative with no modifications. This straightforward approach to collaboration has drawbacks, however. WYSIWIS (What You See Is What I See) is the only viewing option, and conflict resolution is limited to a generalized floor control policy. Social awareness is supported to a limited degree in some systems through telepointers and a shared transparent layer where users can make graphical and text annotations [8].

Shared windows are used in areas including classroom/meeting support, where users can share the same view of an application relevant to the discussion, and technical support, where a technician can walk a user through a software problem. VConf was one of the earliest shared window systems developed [23]. Rensselaer’s Design Conference Room Collaboration Network takes the idea of shared windows to an extreme by sharing windows, display, and an entire workstation between users [15]. Farallon’s Timbuktu and Traveling Software’s LapLink are examples of commercial systems.

Multiuser Editing:

Multiuser editing can be asynchronous or synchronous. Within these divisions, further specialization occurs based on the type of information (e.g. text, graphics).
Multiuser asynchronous text editors allow multiple users to edit the same document over time. At any specific moment, only one user can be editing the document. Synchronization of the document among users, distributed access, and control are essential requirements for this type of editing system. Commercial word processors like Microsoft Word provide primitive support through file locking, which prevents simultaneous editing, and file sharing, which provides multiuser access and control. Despite limited capabilities, the basic asynchronous multiuser editor mirrors how collaborative documents are produced and has been readily adopted by the business community.

More advanced asynchronous editors support a variety of collaboration styles. This is important because it has been observed that collaboration needs change during a document’s evolution [24]. The PREP asynchronous editor breaks a document up into layers called columns. The main text, co-author’s notations, and comment request/responses are examples of columns. A column is composed of chunks that correspond to a logical unit of information (e.g. paragraph, request/response pair). Each user receives a copy of the document. Updates to the local copy are received from other users on a periodic basis. Specialized software helps the user visualize and integrate remote updates into the local copy. How these updates are sent and received is controlled by user configurable parameters of interaction. Grain size controls the size (column, chunk, keystroke) of a document update. Flow determines when an update occurs (automatic, upon request). Transmission speed controls how fast information must flow from one site to another via the network. PREP also allows users to manage the task of multiuser editing. Users are able to negotiate interaction parameters, set document access control, and make commitments to deadlines [25]. Quilt [26] is an example of another asynchronous editor.
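The file locking approach to asynchronous editing mentioned at the start of this discussion, in which only one user may edit the document at any moment, can be sketched in a few lines of Java. This is a minimal in-memory illustration under stated assumptions, not a description of how any particular word processor implements locking; the class and method names (DocumentLock, tryAcquire, release, currentEditor) are hypothetical.

```java
import java.util.Optional;

// Hypothetical sketch of an exclusive edit lock on a shared document.
final class DocumentLock {

    // Name of the user currently editing, or null when the document is free.
    private String holder;

    /** Grants the lock if the document is free or already held by this user. */
    synchronized boolean tryAcquire(String user) {
        if (holder == null || holder.equals(user)) {
            holder = user;
            return true;
        }
        return false; // someone else is editing; simultaneous editing is refused
    }

    /** Releases the lock, but only when called by the user who holds it. */
    synchronized boolean release(String user) {
        if (holder != null && holder.equals(user)) {
            holder = null;
            return true;
        }
        return false;
    }

    /** Reports who is editing right now, if anyone. */
    synchronized Optional<String> currentEditor() {
        return Optional.ofNullable(holder);
    }
}
```

A second user who calls tryAcquire while the lock is held is simply refused, which is the behavior that prevents simultaneous editing; richer editors such as PREP replace this all-or-nothing rule with configurable update parameters.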
Multiuser synchronous text editors allow multiple users to simultaneously edit the same text document. Changes made to the document by one user are immediately seen by all other users. Users may be allowed separate, independent views of the document (What You See Is Not What I See - WYSINWIS) as in the GroupKit Fish-Eye editor [27], or the views may be linked (What You See Is What I See - WYSIWIS) as with XEROX PARC’s Cognoter [28].
Conflict invariably arises during multiuser editing sessions. Rapport [29] uses a floor control mechanism in which users request permission to modify sections of the document. GROVE [5] relies on simple voice communication to resolve differences. Cognoter [9] uses access control to prevent other users from modifying an area that is already being changed. Public and private access to sections of the document is a desirable capability. GROVE supports the ability to limit a section’s read/write access to one or more users.
The editing experience is enhanced through social awareness: allowing a user to know who else is modifying the document, and where the changes are being made. GroupKit’s Fish-Eye editor uses icons to represent each user in the editing session, and a graphical representation of the entire document with fish-eye lenses over sections that users are currently editing. Many editors also support telepointers (also known as ghost cursors, or remote pointers) to indicate the location of a remote user’s pointer.
Multiuser synchronous editing of complex information requires functionality similar to the synchronous text editors mentioned above. Shared drawing systems are an example of this kind of editor. Users share a drawing area where text, 2-dimensional graphics, and images can be manipulated [30]. Any change made to the drawing area by one user is immediately seen by all other users. Microsoft NetMeeting Whiteboard is a commercial example of such a system.
NetMeeting Whiteboard supports WYSINWIS by dividing the shared drawing area into sheets, with a set of horizontal and vertical scrollbars for navigation within a single sheet. Users are allowed to lock sheets to prevent other users from making changes. There is no support for a private drawing area. Social awareness is limited to telepointers. One unique feature of the system is the ability to cut and paste any visible window or window portion onto the drawing surface [31].
Meeting Support: Meeting Support consists of technological and physical environment additions to a conference room. Interest in this type of technology is widespread because statistics have shown that workers spend between 30 and 70 percent of their time in meetings [9].
Technologically, networked computers, whiteboards, shared views, and a Group Decision Support System (GDSS) are the major components. Most CSCW meeting rooms allocate one networked computer per attendee, and network these computers to other devices in the room. The stand-alone whiteboard, an important focus of attention in the regular conference room, is integrated electronically. Rensselaer’s Design Conference Room (DCR) [15] includes a Softboard™ whose software records activity as the user writes with a magic marker on the whiteboard surface. In addition to saving the final board image, the software can play back the strokes that created the image. XEROX PARC’s DOLPHIN System [32] and Berkeley’s Colab System [9] take whiteboards to a new level with liveboards, which are essentially large touch sensitive computer displays. Handwriting recognition, sketching, and gesturing capabilities facilitate interaction with the device.
The issue of public and private information is an important one during meetings. Sometimes users may wish to share information displayed on a private display, while at other times there is a desire for privacy. Colab allows a single window to be shared among meeting members.
This window is usually displayed on the liveboard at all times. Anything a user wants to share with the rest of the group must be pasted into this shared window. DOLPHIN uses a sophisticated shared hypermedia document model where artifacts generated privately can be shared between users and the liveboard. The DCR allows sharing through a public computer and display. Users can access the public computer/display through their private computer’s keyboard and mouse.
Many CSCW meeting rooms include special GDSS software to facilitate the meeting process. The DCR provides a set of flexible, unstructured tools including floor control for controlling the public display, anonymous chat for brainstorming, and private chat for side conversations. Colab provides two applications: Cognoter and Argnoter. Cognoter is used for group creation of presentations. Software guides the participants through three stages: brainstorming, organizing, and evaluation. Argnoter is used for group decisions on competing proposals. The program brings participants through three different stages: proposing, arguing, and evaluating. GroupSystems [33] provides applications to support brainstorming, commenting on a specific topic, and idea organization.
The physical design of the conference room is very important. Colab and DOLPHIN accommodate six participants around a U-shaped table. The liveboard is placed at the top of the “U”. GroupSystems accommodates 24 participants with two concentric tiered rows of seats centered around a large shared display. Each participant has access to a computer and display, which is slightly recessed to allow greater visual contact with other users. The DCR uses a hexagonal table that accommodates six. Each participant has a private computer and public access to a shared computer and display. One unique property of the DCR is that all display devices are completely recessed within the table.
This affords users total use of the conference room table surface, removes visual obstructions completely, and helps to make the technology less obtrusive.
Simulations: Simulations involving multiple participants have become commonplace with the ubiquity of networked computers in diverse application domains including defense, aeronautics, and entertainment.
The U.S. Department of Defense has been actively developing networked simulators over the past decade. The result of this effort is the Distributed Interactive Simulation (DIS), a set of protocols that allow network connected simulators to participate in synchronous combat operations using a shared electronic terrain [34]. Advantages of DIS over single user simulators include group instead of individual training, support for user participation anywhere on earth, time sensitive challenges that demand immediate responses from the users, creation of new tasks based on the actions of the users, and rich interaction possibilities due to the large number of entities (user and computer controlled) simultaneously supported [35].
In the entertainment arena, multiuser games enhance the recreational experience because they allow cooperation/competition with live users. Presumably, a live user will offer more interesting challenges than a computer generated opponent. A synchronous simulated automotive race is much more interesting if the car being challenged belongs to a friend down the hall (or in the next state!) [36]. Communication between users, if supported at all, is limited to a shared chat window. Game servers are appearing on the internet that allow users to join in games with other users anywhere in the world, at any time of the day or night (e.g. Microsoft’s Internet Gaming Zone, Blizzard’s battle.net, Mplayer, Iron Wolf) [37]. Sample games include public domain systems like Xpilot and Netrek and commercial systems like Warcraft, Quake II, and Jedi Knight.
Computer Supported Collaborative Learning (CSCL): CSCL applications occupy an entire sub-discipline within CSCW. Any application that facilitates both cooperation and learning falls under the CSCL umbrella. Some important areas of research include distance learning, teaching rooms, knowledge construction, and shared reality.
Distance Learning is playing an increasingly important role at the college level. A distance learning student is usually a full-time professional taking classes part-time. Most courses are viewed as lectures broadcast live (or tape delayed). Interaction with the lecturer and on-campus class is limited to the telephone and asynchronous text exchanges (e-mail, newsgroups, or the web) [38]. A lack of real-time interaction inhibits the kind of exchange seen in the regular classroom and in face-to-face collaboration [39]. Desktop videoconferencing technology may help to solve this problem; however, it has its own challenges. It is difficult for an instructor to maintain an awareness of remote students (i.e. gestures, gaze direction, body language) simultaneously at multiple sites. Turn taking is also a problem [40].
Teaching Rooms are classrooms that incorporate computing technology to facilitate synchronous, face-to-face cooperative learning. Each student usually has access to a network connected computer. The computer display can be recessed to give the student a line of sight to the lecturer. The lecturer may have the ability to display information on his computer on a large screen visible to the entire class. The instructor may also have the ability to project any student’s display onto the large screen [40].
Rensselaer’s Collaborative Classroom (CC) [41] has made a number of improvements to the basic teaching room. The CC provides seating for teams of two to six students per table. Embedded in the table is a networked Windows workstation.
Students share control of this workstation using specialized software that runs on their private laptops, or with shared keyboards and mice provided with the table. Any computer in the room can view the display of, or take control of, any other computer in the room. This allows a variety of interaction styles including instructor demonstration, peer learning, team meetings, instructor consultation, client consultation, and class-wide presentation and critique.
Research has shown that teaching rooms can create experiences that are more interesting for students than the traditional classroom. The teaching room is not a panacea, and has had mixed responses from faculty. Some refuse to return to an ordinary classroom. Others apply newly discovered teaching techniques to the regular classroom. Still others find the changes in teaching style too radical, and decide to return to a more traditional lecture format [40].
Knowledge Construction: Knowledge Construction focuses on collective building of domain understanding. A newsgroup is a basic form of group knowledge construction. The Computer Supported Intentional Learning Environment (CSILE) system, from the Ontario Institute for Studies in Education, is a community database created by students on networked computers on and off campus [42]. Students can create multimedia notes, comment on other students’ notes (with automatic notification to the original author), and organize notes into different informational structures. The Collaboratory Notebook provides students access to a shared multimedia document modeled after a scientific notebook [43]. A student can create eight kinds of pages: questions, conjectures, evidence for, evidence against, plans, steps in plans, and commentaries. Hyperlinks provide the ability to create non-sequential relationships between the pages. Other systems modeled after the Collaboratory Notebook include CaMILLE for engineering students [44] and CALE for medical students [44].
KMap is a web-based tool for creating and browsing concept maps [45]. A concept map is a visual representation of information and forms of argument. KMap represents pieces of knowledge as text-labeled nodes, with links between the nodes representing knowledge relationships. When the cursor is over a node, the user can select from a list of associated multimedia information. KMap can be used to generate concept maps individually or in a group, and then to place them on the web for a wider audience to comment on and improve.
Some of the advantages of knowledge construction are elimination of turn taking problems, peer commentary, time for reflection, independent thought, and cumulative/progressive results [42].
Shared Reality: Shared Reality refers to computer constructed worlds where students can explore, collaborate, and learn. Examples of shared realities include Multiuser Dungeons (MUDs), microworlds, and collaborative games.
A MUD is a text based shared reality that consists of rooms, exits, objects, and users. A server hosts the MUD, accepts user connections, allows users to manipulate and add to the shared reality, and supports interaction between users. Users communicate synchronously via a chat-like interface. This same interface also reports the results of interactions with objects and rooms. Historically, MUDs have been a form of recreational activity; however, recent applications include MUDs for astrophysicists [46], system administrators [47], and students. For example, MOOSE Crossing is an educational system where children develop social and computer skills by programming rooms and objects for a MUD [48]. MUDs are an effective community for learning because they provide motivation for learning, emotional support, technical support, and an appreciative audience [49].
SharedARK is a system for creating synchronous, shared microworlds [50].
A SharedARK microworld is an infinite, shared, two-dimensional “flatland” of which only a small portion is visible on any one computer display. Users manipulate objects using a hand shaped pointer. The system can operate in both face-to-face and distance modes. When users encounter each other in SharedARK, they can set up audio/video links. A basic model of the physical world is built into the system. Users can experiment and create objects that have mass, density, and momentum. Several applications have been created, including the Puckland [51] simulator for elastic collisions and ARKCola [52], a simulation of a soft drink bottling plant. Experiments with SharedARK systems have shown that students are more engaged and perform deeper evaluations of problem sets than they do when working with paper and pencil [50].
Other examples of shared reality include MacCandy [53] and TurboTurtle [54]. MacCandy simulates a candy factory where candies are packed in rolls of ten and rolls are packed in boxes of ten. The system was designed to help second grade students learn about estimation, symbology, and addition/subtraction. The microworld is the focus of classroom-wide discussion when displayed on the instructor’s screen at the front of the room. TurboTurtle is a system for exploring Newtonian physics, similar to SharedARK. A distinguishing feature of the system is its sophisticated support for awareness of other users, including user lists, telepointers, and shared widget controls.
2.2 Groupware Toolkits
With so many issues to consider, building a groupware application can be a daunting task. Researchers have attempted to reduce the development burden by producing groupware toolkits. Most of this work has been aimed at synchronous groupware. These toolkits contain generic building blocks that can be used to assemble a CSCW application faster than conventional single user development tools.
Typical groupware toolkits address four important areas [8]:
Run-time Architecture: aids the programmer with process management, process interconnection, and inter-process communication.
Programming Abstractions: make it easier for the programmer to synchronize distributed events and data.
Groupware Widgets: provide the programmer with a set of generic groupware GUI tools for synchronous multiuser applications.
Session Managers: allow the programmer to customize how users create, join, leave, and manage participation in a CSCW application.
At last count, more than thirty groupware toolkits have been developed by the research community. Toolkits frequently cited as reference systems include GroupKit [55], Rendezvous [56], and Suite [57].
GroupKit is a Tcl/Tk based toolkit available on Unix, Windows 95, and Macintosh platforms. It uses a replicated architecture, with event broadcasting when local changes need to be sent to remote users. Remote events are processed in a manner similar to local events. A large number of groupware widgets are provided, including social awareness, multiuser toolbars and text widgets, telepointers, and transparent annotation windows. A programmer-configurable session manager is also furnished.
Rendezvous is a LISP/X-Windows based toolkit available on Unix platforms. It is a centralized system based on Smalltalk’s Model-View-Controller (MVC) architecture [58]. Much of the remote event handling and synchronization is abstracted into a programmable constraint system. By specifying constraints between user interface components and the data model, the constraint solver automatically keeps user views and their data synchronized. The toolkit is based on an object-oriented version of LISP that provides over 350 reusable classes. These classes include support for telepointers, floor control, and multiuser text and graphics. Classes are also included for session management.
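The idea of keeping several user views synchronized with one centralized model through declared constraints can be sketched in a few lines. This is not Rendezvous code (Rendezvous is LISP based and its constraint solver is far more general); it is a minimal C++ illustration of a one-way "view equals model" constraint, and SharedModel, UserView, and bindView are hypothetical names invented for the example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical centralized model: one shared value plus the one-way
// constraints that must be re-satisfied whenever the value changes.
class SharedModel {
public:
    using Constraint = std::function<void(const std::string&)>;

    // Declare a constraint and satisfy it immediately.
    void constrain(Constraint keepViewEqualToModel) {
        keepViewEqualToModel(value_);
        constraints_.push_back(std::move(keepViewEqualToModel));
    }

    // Any change to the model re-runs every constraint, so all
    // registered views are updated without explicit event handling.
    void set(const std::string& newValue) {
        value_ = newValue;
        for (const Constraint& c : constraints_) {
            c(value_);
        }
    }

    const std::string& get() const { return value_; }

private:
    std::string value_;
    std::vector<Constraint> constraints_;
};

// Hypothetical per-user view holding what that user currently sees.
struct UserView {
    std::string shown;
};

// Declare the constraint "view.shown == model value" for one user.
// The view must outlive the model's use of the constraint.
void bindView(SharedModel& model, UserView& view) {
    model.constrain([&view](const std::string& v) { view.shown = v; });
}
```

With two views bound to the same model, a single call to set() updates both. The application code states the relationship once and never writes per-user update logic, which is the property the constraint approach is meant to provide.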
Suite is a C-based, user interface independent toolkit available on Unix platforms. It is a centralized system designed around the concept of a multiuser text editor. Applications consist of editable objects, which are made up of publicly accessible shared variables. These shared variables are modified through calls issued from interaction variables associated with a specific local user interface. When an end user interacts with a widget, it modifies the interaction variable that in turn modifies the active variable. Changes to shared variables trigger update callbacks for the interaction variables of other users. End-user coupling configuration is one unique feature of the system. Users are able to specify how frequently their user interface updates, and is updated by, the application’s shared objects. Suite is user interface independent, so there are no groupware widgets. Session management is enabled at a high level by giving the end user the ability to create and modify user groups within an application session. The programmer can add additional functionality like access control using Suite primitives.
3 A Preliminary Experiment
We gained firsthand experience with the difficulties of developing a CSCW application during the creation of CollabBillboard. CollabBillboard grew out of ideas we had been developing about assigned roles in a team [13]. Instead of dividing a task into smaller independent subtasks to be completed in parallel, team members are assigned different but complementary roles for completing a shared task. Our hypothesis was that explicitly assigned roles could induce stronger collaboration among team members. To test the hypothesis we developed this synchronous collaborative simulation.
Although an evaluation of CollabBillboard supported the theory, we found the entire process frustrating. The biggest problem was how much we underestimated the amount of time needed to complete the application. It took almost three times longer than we expected!
One of the major contributors to the delay was finding physical users to help test the application. For reasons discussed in Section 1.1, a single user, the developer, was not sufficient to thoroughly exercise the program. It was often necessary to comb the halls for volunteers, and as the months continued, they became increasingly reluctant.
3.1 Architecture
CollabBillboard is a synchronous face-to-face two-player simulation that attempts to address some shortcomings of previous multiuser simulations through explicitly assigned roles and group evaluation. Assigned roles require each user to take on a specific role during the simulation. These roles are complementary, but non-overlapping. Both users must cooperate within their roles in order to achieve the simulation goal. Group evaluation uses team based, rather than individual, performance criteria.
The CollabBillboard application is designed for networked personal computers running Windows 95 or NT. The development environment, Microsoft Visual C++ (VC++), was augmented with Microsoft Foundation Classes (MFC) for GUI support, DirectX for high performance graphics, and Winsock for communication.
Applications developed with VC++ and MFC have a structure oriented around the user interface. Each dialog is associated with a C++ class. Events generated by widgets in the dialog are converted to messages that invoke class methods. To enable multiuser capabilities, CollabBillboard includes a socket shadow class with each dialog. The socket shadow contains methods for communication setup/takedown, sending special events, and receiving special events. Send event methods report local events and data that are of interest to remote users. The receive event method converts remote user messages to a local event and data format.
Figure 3 depicts the socket shadow class for the initial dialog panel. The member functions OnAccept and OnConnect are invoked during communication setup/takedown.
SendOK is invoked by the dialog class method ButtonOK, which is itself invoked when the user presses the OK button. OnReceive is invoked when a remote message arrives. For this dialog, OnReceive gets remote ButtonOK events and invokes the same local dialog method.
    class CcollabBillBoardDlgSocket : public CollabBillBoardSocket
    {
    private:
        void OnAccept(int theErrorCode);
        void OnConnect(int theErrorCode);
        void OnReceive(int theErrorCode);
    public:
        BOOL InitializeSockets();
        BOOL SendOK();
    };
Figure 3: CollabBillboard socket shadow
The system requires one machine per user. The complete simulation state is replicated on each machine. Participants can be situated at different physical locations. However, the game is designed with activities that require high bandwidth communication between participants. For this reason, a face-to-face experimental setup was used.
3.2 Experimental Method
A study was conducted to evaluate the effect CollabBillboard might have on collaboration between pairs of users. The study used two versions of the program, one with and another without assigned roles. Time to completion, percent of time spent conversing, and accurate billboard placement were among the performance criteria measured. Subjects were then given a paper and pencil collaborative exercise. The results of this exercise were compared against a solution key. Finally, the subjects were given a survey to complete that allowed them to express their subjective feelings about the simulation and about collaborative experiences during the session.
Figure 4: Sketch of experimental design. A long desk with monitors at opposite ends was set up in an office. Users sat on different sides of the desk, each in front of a monitor. The monitors were set up so that each could be seen only by the user in front of it, and were angled so that the users could see each other through a three-foot gap between the monitors on the table; this arrangement afforded line-of-sight viewing for non-verbal communication.
3.3 Task Overview
Research participants worked on one of two versions of the CollabBillboard simulation. One version of the simulation used assigned roles, while the other (the control) did not. Participants were grouped into pairs, with each pair using one version of CollabBillboard. When the simulation was completed, participants worked through a classic paper and pencil collaborative exercise called Lost At Sea [59]. At the end of the experiment, the pair was asked to complete a survey about their experiences.
Pairs of participants were scheduled for a one-hour session. When they arrived, they were introduced to each other, the tasks to be performed were explained, and they were asked to sign a consent form. A tape recorder was started to record the audio exchange during the CollabBillboard portion of the session. Participants started the CollabBillboard application on their respective machines. When network communication was established, one of the users pressed the OK button on the initial dialog window, and both users were presented with a task menu.
Figure 5: Selecting a billboard site in the city.
The session moderator explained that the participants were part of a fictitious advertising company that wanted to place a billboard in the city of Boston. Two major tasks were needed to complete the application: select a site in the city to place the billboard, and assemble the scrambled pieces of the billboard on the site’s billboard frame.
Figure 6: Control window for assembling billboard. Both users see the same window, view the entire billboard frame, and move pieces.
The first task, Site Selection, brought up a shared map of the city of Boston, Massachusetts (see Figure 5). Telepointers were used to indicate remote user focus on the map. As users moved over possible sites, an informational window appeared describing the site. When a site was selected, it was highlighted.
These actions appeared on both participants' maps, with separate colors indicating a local or remote action. Once participants selected a site, they proceeded to the second task.
The second task, Billboard Assembly, involved assembling randomly placed pieces of the billboard in the correct order and properly centering them on a billboard frame. At this point, the assigned roles and control versions of the program diverged. The control version brought up a shared billboard frame that users could add billboard pieces to. Each new piece appeared simultaneously in the same random location on both participants’ screens (see Figure 6). Participants could grab and move any piece of the billboard at any time. The frame contained a green box representing the local user’s position in the frame. A red box represented the remote user’s position. To move a billboard piece, a user placed the green box on a billboard piece, selected the grab button, and then used the directional arrows. A zoom window was included for fine-grained piece movement.
Figure 7: Assigned roles "view billboard" window. This user has a zoomed out view of the billboard frame but cannot move any pieces.
The assigned roles version of the program split the billboard piece assembly into separate subtasks: View Placement and Place Billboard. The View Placement task presented the user with a zoomed out view of the billboard frame. This user could see all billboard pieces and a green box, which represented the Place Billboard user’s view. The View Placement user could add new pieces to the frame, and move the other user’s view. However, the View user could not move a billboard piece even if the Place user was currently grabbing one (see Figure 7). The Place Billboard task presented the user with a zoomed in section of the billboard frame. The Place Billboard user could navigate around the billboard frame using the dialog’s arrow widget. The user could also grab, move, and drop billboard pieces (see Figure 8).
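The division of capabilities between the two assigned roles described above can be summarized as a small permission check. The sketch below is illustrative only and is not taken from the CollabBillboard source; the Role and Action enumerations and the isAllowed function are hypothetical names, and the rule that any action not explicitly granted to a role is denied is an assumption of the sketch.

```cpp
#include <cassert>

// Hypothetical names; CollabBillboard's real classes are not shown here.
enum class Role { ViewPlacement, PlaceBillboard };

enum class Action {
    AddPiece,            // add a new billboard piece to the frame
    MovePlaceUsersView,  // reposition the Place Billboard user's view
    NavigateOwnView,     // move one's own zoomed in view with the arrows
    GrabPiece,
    MovePiece,
    DropPiece
};

// Returns true when the given role may perform the action.
// Assumption: anything not explicitly granted to a role is denied.
bool isAllowed(Role role, Action action) {
    switch (role) {
    case Role::ViewPlacement:
        // Sees the whole frame, adds pieces, steers the other user's view,
        // but can never move a piece.
        return action == Action::AddPiece ||
               action == Action::MovePlaceUsersView;
    case Role::PlaceBillboard:
        // Sees only a zoomed in section, but handles the pieces.
        return action == Action::NavigateOwnView ||
               action == Action::GrabPiece ||
               action == Action::MovePiece ||
               action == Action::DropPiece;
    }
    return false;
}
```

Under these assumptions neither role is granted every action, so the two users have to cooperate to assemble the billboard; this is the complementary, non-overlapping property the assigned roles design relies on.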
Figure 8: Assigned roles "place billboard" window. This user has a zoomed in view of the billboard frame and can move pieces.
Complications arose with assigned roles because neither user could complete the simulation goal independently. The Place Billboard subtask had a view that represented a small portion of the billboard frame (approximately 1/4 of a billboard piece). This view could be very disorienting. The View Placement task had a good view of the frame, but did not allow the user to move billboard pieces. Consequently, both users depended on each other to complete the billboard assembly.
Once the billboard had been assembled in either the control or assigned roles version of the program, the team received a score based on four factors: choice of billboard site, properly assembled billboard, properly centered billboard, and time to completion. A brief discussion about the score with the moderator then ensued. At this point, the tape recorder was turned off.
The second part of the session involved a classic paper and pencil collaborative exercise called Lost At Sea. Participants were told to read a brief scenario where they imagined themselves on a sinking ship. They had to rank 15 items in the order that they would be taken because the ship might sink at any moment. After the task was completed, the moderator discussed the US Merchant Marine’s ranking of the same items.
The final activity of the session was a survey. The survey covered three areas: subjective feelings about CollabBillboard, subjective feelings about collaboration during the session, and personal information. When the survey was completed, the participants were debriefed by the moderator.
3.4 Evaluation, Results, and Analysis of Team Performance
Team performance was determined using measurements depending on the stage of the session.
For the CollabBillboard stage, five team measurements were used: choice of billboard site, properly assembled billboard, properly centered billboard, time to completion, and conversation as a percentage of task completion time. For the Lost at Sea stage, 17 team measurements were made. The first 15 were the absolute values of the difference between the correct ranking for each item and the team’s ranking of the item. Next was a cumulative sum of these deltas. Finally, time to complete the stage was measured. For the exit survey stage, 31 questions were asked to subjectively assess CollabBillboard and collaborative experiences during the session. Most of these questions used a rating scale from one to five, with lower numbers representing a more positive feeling about the question and higher numbers indicating a negative feeling. A “no opinion” option was available for each question.
The complete details of the results and analysis of team performance are available in [13]. The results and analysis of our study support the hypothesis that assigned roles can improve collaboration both during the simulation and in subsequent group activities. Although it took longer for the assigned roles group to complete the simulation, they produced higher quality results, indicating more effective collaboration. Conversation, another measure of collaboration, occurred during 85% of the assembly task for assigned roles and only 44% of the assembly task for the control. On the second collaborative activity, the assigned roles group completed the work in less time with superior results. In every instance that the exit survey had statistically valid mean differences, the responses were more positive about collaboration in the assigned roles group.
3.5 Lessons Learned from the Development of CollabBillboard
The development of CollabBillboard was a lengthier process than we had anticipated.
Our original schedule called for three months to be spent on application development, but in actuality, eight months were needed to complete the system. A number of lessons were learned from reflecting on the experience.
The lack of development tools, in particular a VC++ groupware toolkit, contributed to the delay. Originally, we intended to build only the assigned user roles version of CollabBillboard. Building a second, control version was necessary to evaluate the system. However, the majority of our time was spent developing, testing, and reworking the human-computer and human-human interfaces for the application. These interfaces account for a sizeable portion of the elements that make up a CSCW application. Testing them was a continual problem. Finding subjects to help test the application was also a challenge. It was often necessary to comb the halls for volunteers, and as the months went on, they became increasingly reluctant.
The next several paragraphs present additional problems that we uncovered in the process of developing CollabBillboard that we feel may have been detected earlier and resolved more efficiently with a multiuser testing environment.
Usability testing examines the program's human factors issues [14]. General application issues include: Is the application appropriate to the user's background and experience? Are outputs meaningful and non-offensive? Are error diagnostics meaningful? Are the interfaces consistent throughout the application? Are there too many options? Is the system easy to use?
There is no formula for constructing a CSCW application because it is not always clear how some of the issues discussed in Chapter 1 should be addressed. As with other GUI intensive applications, the correct implementation can require many iterations of a prototype followed by usability testing.
Other issues requiring iterative usability testing include user interaction coordination, user awareness, undo/redo, locking policy, and session management. In our implementation of CollabBillboard, we found that tight coupling of telepointers was visually distracting when users tried to select a site to place the advertising billboard. After several iterations of the program, a looser coupling was implemented where the local user was informed only when the remote user made a site selection [13]. We ran into several shared workspace synchronization problems because local user actions interfered with the processing of a remote user action. For example, in an early prototype of the system one user could rotate the billboard picture. In tests with live users, it was relatively easy for them to create a scenario where their pictures were rotationally unsynchronized. One problem that gave us a lot of difficulty later was correctly reflecting the positions of billboard pieces moved by the remote user. It took about a week of test trials with live users to find and debug the error. A similar kind of problem occurred with enforcing boundary conditions on pieces moved by remote users.

Stress testing subjects the program to heavy loads or stresses. A stress test differs from a load test in that it focuses on data volume over time versus just data volume [14]. Synchronous CSCW applications are particularly susceptible to stress problems because of interactivity requirements. Event processing on both the network and user machines is a common cause of interactivity loss. In CollabBillboard, for example, the control version of the simulation locked the local user out of local mouse events when a remote user flooded the system with billboard piece move events. Several days of investigation uncovered a flaw in the Windows 95 OS design that gave network and DirectX graphics events priority over local mouse events.
To circumvent this design, coupling was loosened by creating a temporal buffer that accumulated remote user draw events until a timer expired.

Compatibility/Conversion testing identifies problems between the new software and preexisting programs and data [14]. Conversion issues revolve around the ability of the new software to support persistent storage data formats from earlier versions or other programs. Conversion may also require the new software to output data in a format readable by preexisting software. The distributed nature of CSCW applications makes them particularly susceptible to compatibility problems when different machines have different versions of the executable. CollabBillboard suffered from several compatibility problems. Version 1.0 of CollabBillboard was made publicly available on the web in September 1997. A second version of the program was made available in April 1998. The event data generated by these versions are incompatible because of a change from reporting relative coordinates to absolute coordinates on the billboard frame. Version 2.0 of CollabBillboard provides two separate applications: user roles and control. Since the communication protocols for both forms are identical syntactically, it is possible to connect a client from one application with a server from the other. This combination results in an unstable environment that causes the application to crash when the users begin the billboard assembly task.

Recovery testing exercises the software's ability to handle programming, hardware, and data errors [14]. To test programming errors, code can be injected with problems (e.g. hard coding an invalid assert). Simulation is a common technique for testing hardware errors (e.g. returning a network message with an incorrect number of bytes). Data errors can be purposely created to analyze the system's reaction (e.g. the user types in "-1" as the number of participants in a CSCW session).
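The data error example above (a participant count of "-1") can be turned into a small recovery test. The following is a minimal sketch, not code from CollabBillboard: the function name parseParticipantCount and the assumed minimum of one participant (kMinParticipants) are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// Assumed lower bound on session size; the thesis does not state one.
const int kMinParticipants = 1;

// Hypothetical helper: parses the text typed by the user as a participant
// count. Returns true and stores the value in `count` only when the whole
// string is a valid integer that is at least kMinParticipants. On any data
// error it returns false so that the caller can prompt for a correction.
bool parseParticipantCount(const std::string& text, int& count) {
    try {
        std::size_t used = 0;
        const int value = std::stoi(text, &used);  // throws on non-numeric or out-of-range input
        if (used != text.size() || value < kMinParticipants) {
            return false;  // trailing characters, or a count such as -1
        }
        count = value;
        return true;
    } catch (const std::exception&) {
        return false;  // std::invalid_argument or std::out_of_range
    }
}
```

A recovery test would feed deliberately bad strings such as "-1", "abc", and "3x" to the routine and check that each is rejected without changing the stored count.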
In addition to general kinds of recovery testing, CSCW applications should also test the effects of unpredictable or hostile remote user actions. Early testing of CollabBillboard discovered a problem when one user quit the session while the other user remained. The remaining user was able to use the application for several minutes until the system hung. The problem turned out to be that the network messaging API buffered messages for the non-existent remote user and when the buffer overflowed, the system froze.

4 Survey of Prior Work in Testing Systems

As discussed in the previous chapter, our preliminary experiment with CollabBillboard provided us with firsthand experience developing a synchronous CSCW application. One of the greatest difficulties we encountered was testing the software. Because it was a multiuser synchronous system, we needed several physical users exercising the application simultaneously. Because it was a GUI application with human-human and human-computer interactions, we went through continual iterations to get the interface correct. Most of the people we asked for testing assistance were willing to help a few times, but we began to try their patience around the fourth or fifth system build.

This chapter presents a survey of the state of the art in testing. The first goal of the survey was to uncover the major contributions made by academia and industry to software testing. The second goal was to understand current testing system shortcomings that prevent CSCW developers from effectively testing an application. The chapter is organized around four main sections. Section 4.1 lists the important goals of testing. Section 4.2 presents the research community's contributions to testing organized by the software life-cycle process. Section 4.3 discusses academic contributions to GUI-based testing. Finally, Section 4.4 analyzes three commercial testing systems.
4.1 Goals of Testing

Testing during the software lifecycle is a process by which the behavioral properties of the software are verified. These properties are correctness, utility, reliability, robustness, and performance.

A program is behaving correctly if it "satisfies its output specifications independent of its use of computing resources when operated under permitted conditions" [10]. Correctness is neither a necessary nor a sufficient condition for an acceptable program. Correctness is not necessary because some kinds of errors can be tolerated. For example, in a graphical editor, a "drag graphical object" command might cause artifacts to appear on the drawing surface along an object's path. This kind of behavior might be considered a bug, but is acceptable if the user is provided with some form of drawing surface refresh command that removes the artifacts. Correctness is also not a sufficient condition for an acceptable program. A program may satisfy its specifications, but the specifications may be incorrect.

The utility of a program is determined by the extent to which it meets user needs. Utility answers questions about things like ease-of-use and cost effectiveness. Typically, a program is utility tested in a friendly environment with only valid input. Utility is extremely important, because if the product does not perform useful functions, then there is no point in further testing. Work done with Rensselaer's DCR illustrates this importance. A great deal of effort was expended developing a floor control policy for shared use of the system's public workstation. The policy was implemented in software as a FIFO queue. Meeting participants taking control of the public workstation had to make a request, which was added to the queue behind other requests. When the participant was at the top of the request queue, s/he was allowed to control the public workstation.
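The FIFO floor control policy just described can be sketched in a few lines. This is only an illustration under stated assumptions: the class name FloorControl and its member functions are hypothetical and are not taken from the DCR implementation.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Hypothetical sketch of a FIFO floor control policy for a shared workstation.
class FloorControl {
public:
    // A participant asks for control; the request joins the back of the queue.
    void request(const std::string& participant) {
        queue_.push_back(participant);
    }

    // The participant at the front of the queue currently controls the workstation.
    bool hasFloor(const std::string& participant) const {
        return !queue_.empty() && queue_.front() == participant;
    }

    // The current holder gives up control, passing it to the next request in line.
    void release() {
        assert(!queue_.empty());
        queue_.pop_front();
    }

private:
    std::deque<std::string> queue_;  // pending requests, oldest first
};
```

With requests from "ann" and then "bob", hasFloor("ann") holds until release() is called, after which control passes to "bob".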
Although the system was straightforward from a programming standpoint, analysis showed that users tended to ignore the floor control policy, opting instead for a simple control interrupt capability added later.

Reliability refers to a program's mean time to failure. Ideally, the program and its supporting infrastructure should never fail, but the cost of verifying this level of reliability can be prohibitively expensive. One area where high reliability is justified is life-critical applications such as aviation software. The Federal Aviation Administration refuses to allow commercial off-the-shelf (COTS) software in any portion of the nation's aviation system, relying instead on thoroughly tested, but expensive, customized software. COTS software, like Windows 95, is notoriously unreliable, and while a simple reboot for a system hang is tolerated by most PC users, it could spell disaster for a busy air traffic control system [60]. For less critical applications a return on investment analysis can determine how much testing will ensure a level of reliability that will keep customers satisfied.

A program is considered robust if it is able to handle different, possibly hostile, operating conditions, input, and users. The application should tolerate a variety of operating environments in its supporting infrastructure. This infrastructure includes hardware and software associated with the network, CPU, disk, and graphics device. A robust CSCW application, for example, should gracefully handle heavy network loads when trying to send and receive events between users. Handling invalid input is also important. If a user types "-1" for the number of participants in a collaborative session, the system should prompt for a correction. Hostile user actions should also be anticipated.
If the user hosting a CSCW session exits before the rest of the team, the application should ensure either that the session artifacts are saved, or that the session continues by using a different host.

Performance is another important criterion that must be verified before the CSCW application is released. Interactive feedback from a user action must occur within approximately 16 milliseconds to avoid a feeling of sluggishness [11]. An additional rule of thumb is that local user performance should always take priority over processing remote user actions. This means that the developer must be careful that tight coupling does not impact local activity. In the CollabBillboard application, movement of billboard pieces during the assembly task was tightly coupled. When live user testing began, it was discovered that piece updates from the remote user created a feedback cycle that excluded local user actions until the stream of remote updates ended. Piece movement coupling had to be loosened to allow local user actions to be processed.

The choice of centralized versus replicated architecture has a big impact on performance. A replicated architecture will usually have better performance, while the centralized architecture will have less complicated synchronization and locking mechanisms. One method for verifying that the system will perform acceptably for its chosen architecture is to observe how it behaves under a scalability test.

Network performance is critical to the overall performance of the CSCW application. The application may consume too much bandwidth when sending messages between machines. This happens when messages occur too frequently, contain too much information, or both. Acceptable bandwidth use can be verified by exercising the application over the network.
The application also needs to be tested under various network conditions including heavy traffic from sources outside the application, increased traffic from scaling the number of users, and message delay over a wide area network.

4.2 Research Testing Systems

This section discusses the contribution that the research community has made to the state of the art in testing. It is organized around a modified version of the phases of the software life cycle: requirements, specifications, design, implementation, integration, and maintenance.

The software life-cycle model describes the process of creating and maintaining a software application. Competing life-cycle models have been developed over the past several decades. These models were created to combat the inefficient process of “build and fix” where developers built some software components, showed the results to the client, and fixed the software based on client feedback. Most popular models in the literature evolved from the Waterfall Model [61]. In the waterfall model, software production is broken down into six stages: requirements, specifications, design, implementation, integration, and maintenance.

Figure 1 depicts the rapid prototyping life cycle model. The goal of this model is to quickly turn around versions of the software for client evaluation. Less intercycle feedback reduces the amount of time it takes to produce a prototype. Rapid prototyping is particularly useful for user interface development. The client can be continually involved in the process of creating a friendly, useful user interface.

Feedback from commercial software development has led to the creation of the incremental model. The incremental model develops a product as a series of progressive builds, with each build adding a new set of functions to the application. Each build creates a completely runnable system with increasingly powerful capabilities [10].
4.2.1 Requirements

The purpose of testing in the requirements phase is to determine if the software team correctly understands the user's requirements. Building a prototype and discussing the program with potential users is an effective way of accomplishing this goal [10].

4.2.2 Specification

The purpose of testing in the specification phase is to determine if the software team has correctly translated the functions required by the user into a software specification. The most common forms of specification testing are walkthroughs and inspections.

A walkthrough consists of periodic meetings by a small team (led by the author) that reviews the specifications document. The team size rarely exceeds five people and the meetings last less than 2 hours. During a meeting, the goal is to discover, but not correct, problems. The author can correct the problems later. Individuals prepare for the meeting by reviewing the specification and requirements documents [10].

An inspection is a more highly structured process consisting of five formalized steps: overview, preparation, inspection, rework, and follow-up. The Institute of Electrical and Electronics Engineers (IEEE) has published an international standard for the inspection process [62]. The overview step is a preliminary meeting where members of the inspection team are assigned roles and given specific tasks to prepare for the inspection. In the preparation step team members examine the specification from the perspective of their assigned roles and prepare checklists for verification during the group inspection. A series of inspection meetings is then held with the team measuring the specification against the checklist. Again, problems are only identified, not solved, during these sessions. The rework step corrects problems discovered in the specification. Follow-up ensures that the rework corrected the problems identified, and didn't introduce any new ones.
Although no formal studies have been done, it is thought that inspections take more time but are more effective than walkthroughs. IBM's cleanroom verification technique uses inspection as the main verification tool throughout the software life cycle [63].

Specifications can be written in a variety of formats from informal prose to a formal algebraic description. The testing research community is interested in formal specifications because of the potential for early, automated debugging, testing, and analysis [64]. Recall that the sooner a problem can be found in the development cycle, the less expensive it is to fix (see Section 4). A formal specification can also be useful in later stages of the software life cycle such as design phase mathematical proofs of correctness (Section 4.2.3), and input selection/output analysis for functional testing (Section 4.2.4). There are two kinds of formal specification: process-based and model-based.

A process-based specification views the program as being composed of subprograms. A critical part of the specification is to formally specify the interfaces between subprograms and abstract data types (ADT). The specification is developed using a top-down process where successive revisions of the specification result in smaller subprograms with greater interface and ADT detail. The finest level of detail is a formal algebraic notation. Reusable generic specifications are one advantage of this technique. For example, instead of describing an integer specific sort routine, a generic sort routine could be specified. This routine is written at a high enough level that it could sort any data type (e.g. integer, real, programmer defined). When a specific kind of sorting is needed, another refinement of the routine is performed with the data type needed [65]. Larch [66] is an example of a system that supports the process-based specification technique.
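The generic sort example has a loose analogue in code. The sketch below is not Larch notation and is not taken from [65] or [66]; it uses a C++ template as a stand-in for a generic specification, with hypothetical names isSorted and insertionSort. The ordering property is stated once for any element type with a less-than operation, and each use with a concrete type plays the role of a refinement.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Property every refinement must satisfy: no element is smaller than its predecessor.
template <typename T>
bool isSorted(const std::vector<T>& items) {
    for (std::size_t i = 1; i < items.size(); ++i) {
        if (items[i] < items[i - 1]) return false;
    }
    return true;
}

// Generic routine, written once for any element type T that supports operator<.
template <typename T>
void insertionSort(std::vector<T>& items) {
    for (std::size_t i = 1; i < items.size(); ++i) {
        T key = std::move(items[i]);
        std::size_t j = i;
        // Shift larger elements one slot to the right to make room for key.
        while (j > 0 && key < items[j - 1]) {
            items[j] = std::move(items[j - 1]);
            --j;
        }
        items[j] = std::move(key);
    }
    assert(isSorted(items));  // postcondition check at run time
}
```

Calling insertionSort on a std::vector<int> or a std::vector<double> instantiates the same generic routine for a specific data type, without restating the ordering property.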
A model-based specification is a formal mathematical model of the entire software system. The specification not only describes the interfaces and data structures of the software system, but also describes state behavior in a formal way. Z [67] and the Vienna Development Method (VDM) [68] are examples of model-based specifications. Z uses a set/relation notation where components that make up the software system are represented as schemas. The following is an example of a Z schema for the CSCW application CollabBillboard:

    MoveBillboardPiece                                  [schema name]
    ---------------------------------------------
    ∆Billboard                                          [indicates schema will cause a state change]
    owner?: OWNER                                       [schema signature]
    pieceID?: PIECEID
    x?: ℕ
    y?: ℕ
    ---------------------------------------------
    owner ∈ (Billboard.pieceList(pieceID?)).owner       [schema predicates]
    0 ≤ x? ≤ XMAX
    0 ≤ y? ≤ YMAX
    Billboard.pieceList(pieceID?).x = x?
    Billboard.pieceList(pieceID?).y = y?

Figure 9: Z Language schema for CollabBillboard

The MoveBillboardPiece function is responsible for updating the x,y location of a billboard piece on CollabBillboard's shared workspace. ∆Billboard at the beginning indicates that the schema will change the system state by altering Billboard. The schema signature describes the input variables and their data types. For example, x?: ℕ indicates that the input variable x? can be any natural number. Schema predicates indicate conditions that must hold for system state and input variables. For example, x? must be a natural number with a value no greater than the width of the workspace (XMAX) for the billboard piece to be displayed properly. Schema predicates can also contain set or relation operations. The partial predicate pieceList(pieceID?) performs a range lookup on the set pieceList. This set represents a total mapping of piece IDs (domain) to actual piece structures (range). The pieceList(pieceID?) predicate returns the piece whose ID is represented by the input variable pieceID?.

The Test Template Framework [64] uses a Z specification to create cases for implementation testing.
An analysis of the schema signature is done to create an input space for each variable. A variable's input space is refined into a valid input space through schema predicate constraints. The valid input space is then grouped into categories using techniques based on the category partition testing method [69]. The result of this processing is a Z language specification for a set of generic test cases. Actual test cases are instantiated by executing the function derived from the specification with data that satisfies the Z specification for the input variables. Analysis of the test case results is performed by comparing output against schema signature and predicate constraints. Other specification based testing work includes Hayes's [70] techniques for constructing input/output constraints from a Z schema specification, and Stanford's Anna [71] system for runtime checking of Ada programs using specification derived constraints.

In addition to general specification systems like Larch and Z, specialized systems have been developed for concurrent and real-time programming. Specialized systems providing verification support include Concurrent Temporal Logic (CTL) [72], an SCR specification system for event-driven applications; Graphical Interval Logic (GIL) [73], a visual temporal specification system; and the constrained expression toolkit [74] for real-time programs.

[Timing diagram; interval label: remoteUpdate$n ^ timerExpired]
Figure 10: GIL specification for queueRemotePieceUpdate$n

GIL allows the temporal properties of a concurrent system to be specified using an annotated graphical timing diagram. GIL developers claim that the graphical notation of timing diagrams is superior to a temporal logic text specification because visualization increases the understanding of relationships between the temporal properties of the system. The semantics underlying GIL allow the diagrams to be converted into propositional temporal logic, which can then be run through a proof checker.
The proof checker is not automatic, and must be told which diagrams should be included in a particular proof. Figures 10 and 11 depict a GIL specification for a portion of the CollabBillboard application.

queueRemotePieceUpdate$n, 0 ≤ n ≤ (Number of Billboard Pieces - 1). A remote update event arrives for a piece of the billboard and the event is queued until a timer expires. Queuing a remote update keeps remote piece movement from interfering with local user performance. The remoteUpdate$n boolean is set to TRUE when a remote update event arrives from another user for Billboard piece n. timerExpired is set to TRUE every 200 milliseconds. The interval depicted by this timing diagram indicates that as long as remote piece updates are being received and the timer hasn't expired, then the condition queueRemotePieceUpdate$n will be TRUE.

drawRemotePieceUpdate$n, 0 ≤ n ≤ (Number of Billboard Pieces - 1). A billboard piece is redrawn if it has been queued as a remote update and the timer has expired.

[Timing diagram; labels: remoteUpdate$n ^ timerExpired, queueForRedrawPiece$n, redraw, remoteUpdate$n]
Figure 11: GIL specification for drawRemotePieceUpdate$n

The conditions that identify this interval are that the piece has had a remoteUpdate$n event associated with it since the last time the timer expired. The implication arrow (→) indicates that if these conditions for the interval are met then the remoteUpdate$n boolean will be set to FALSE for the billboard piece and the piece will be considered queued for local redraw until the actual redraw occurs.
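The behavior that the two GIL intervals specify can be sketched in code. This is an illustration only, not the CollabBillboard implementation: the class RemoteUpdateBuffer and its member names are hypothetical, and the caller is assumed to invoke onTimerExpired() from a timer that fires every 200 milliseconds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the temporal buffer for remote piece updates.
class RemoteUpdateBuffer {
public:
    explicit RemoteUpdateBuffer(std::size_t pieceCount)
        : pending_(pieceCount, false), x_(pieceCount, 0), y_(pieceCount, 0) {}

    // queueRemotePieceUpdate$n: remember only the latest remote position of
    // piece n and mark it as queued (remoteUpdate$n becomes TRUE).
    void onRemoteUpdate(std::size_t n, int x, int y) {
        assert(n < pending_.size());
        pending_[n] = true;
        x_[n] = x;
        y_[n] = y;
    }

    // drawRemotePieceUpdate$n: when the timer expires, every queued piece is
    // handed back for redraw and its remoteUpdate$n flag is reset to FALSE.
    std::vector<std::size_t> onTimerExpired() {
        std::vector<std::size_t> toRedraw;
        for (std::size_t n = 0; n < pending_.size(); ++n) {
            if (pending_[n]) {
                toRedraw.push_back(n);
                pending_[n] = false;
            }
        }
        return toRedraw;
    }

    bool isQueued(std::size_t n) const { assert(n < pending_.size()); return pending_[n]; }
    int x(std::size_t n) const { assert(n < x_.size()); return x_[n]; }
    int y(std::size_t n) const { assert(n < y_.size()); return y_[n]; }

private:
    std::vector<bool> pending_;  // remoteUpdate$n for each piece
    std::vector<int> x_, y_;     // latest remote position of each piece
};
```

Because only the latest position is kept per piece, a burst of remote move events between two timer ticks results in a single redraw of that piece, which is the loosened coupling the text describes.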
The Software Cost Reduction (SCR) method is a formal method for specifying the requirements of real-time systems. The SCR method has been used successfully in a variety of application domains including aviation, telephony, and nuclear power. System behavior is modeled as a relationship between two types of variables: monitored variables, which denote environmental quantities monitored by the system, and controlled variables, which denote environmental quantities the system controls. Conditions, events, and tables provide details on how monitored variables affect controlled variables. A condition is a predicate defined on one or more variables in the specification. When any variable changes value, it is called an event. An SCR table specifies a variable's value based on conditions and events [75].

The following is a mode transition table for the CollabBillboard application's shared map task.

Start Mode | In Site | Remote In Site | Button Down | Remote Button Down | End Mode
Clear Map | F | F | F | F | Clear Map
Clear Map | @T | - | F | F | Site Info
Clear Map | - | @T | F | F | Remote Site Info
Site Info | T | F | F | F | Site Info
Site Info | @F | F | F | F | Clear Map
Site Info | T | @T | F | F | Remote Site Info
Site Info | T | @F | F | F | Site Info
Site Info | T | - | @T | F | Site Selected
Remote Site Info | F | T | F | F | Remote Site Info
Remote Site Info | @T | - | F | F | Site Info
Remote Site Info | F | @F | F | F | Clear Map
Remote Site Info | F | T | F | @T | Remote Site Selected
Site Selected | - | - | F | F | Site Selected
Site Selected | F | - | @T | F | Clear Map
Site Selected | - | F | F | @T | Clear Map
Site Selected | - | T | F | @T | Remote Site Selected
Remote Site Selected | - | - | F | F | Remote Site Selected
Remote Site Selected | F | - | @T | F | Clear Map
Remote Site Selected | - | F | F | @T | Clear Map
Remote Site Selected | T | - | @T | | Site Selected

Table 1: SCR Table for CollabBillboard

The purpose of this type of SCR table is to show how the system state changes because of new input conditions. The left-most and right-most columns represent the current mode and new mode respectively. Clear Map is a mode where the cursor is not on any billboard site and no site has been selected. Site Info and Remote Site Info are modes where the cursor is over one of the billboard sites and an information box appears describing the site.
Site Selected and Remote Site Selected are modes where one of the users has selected a site for billboard placement. In this mode, the selected site is highlighted with a special yellow (local) or gray (remote) box. In Site, Remote In Site, Button Down, and Remote Button Down are condition variables that are monitored by the system. An environmental condition can have four possible values: T (currently TRUE), @T (just turned TRUE), F (currently FALSE), @F (just turned FALSE).

Several verification tools have been developed for SCR specifications. The SCR* system [75] provides a consistency checker to detect syntax errors, incomplete variable definitions, or circular variable definitions. The CTL system [72] converts an SCR specification into a finite state machine and a set of temporal logic propositions. The converted specification is then nondeterministically executed. The execution proceeds in discrete time units, which represent single state transitions. Since a transition is activated every time unit, at least one of the current state's transitions will be enabled at all times. As the machine is executing, the temporal properties that must hold are checked.

The GIL and CTL systems represent two competing verification techniques for real-time/concurrent specification systems: theorem proving and state-based. The problem with theorem proving is that it is difficult to automate. The GIL system, for example, requires the user to indicate by hand the specifications that will be used to verify a particular constraint. State-based systems like CTL suffer from an exponential explosion in the number of states that must be explored to completely verify a system. The constrained expression toolkit [74], [76] provides tractable automated verification using integer programming.

The constrained expression toolkit is used for bounding the time between events in a concurrent real-time system.
It converts an Ada-like specification into a set of finite automata, one for each process in the system. The alphabet of each automaton consists of symbols for computation within the process and for synchronous communication with other processes. A set of transition variables is assigned to each automaton edge. These variables count the number of times the edge is traversed during process execution. Start and halt variables are assigned to each node with an exiting start event edge, and entering halt event edge. If the process does not contain a start event edge, then all nodes are labeled with a start variable. The same technique is used for processes without the halt event edge. Equations are then derived by treating the automata as a network flow where the number of times a state is entered equals the number of times it is exited. Additional equations are added by forcing each automaton to start and halt at exactly one place. A final set of equations can be added by recognizing that transition variables representing communication between processes must sum to the same value. Once the equations have been determined, an integer-programming objective is established, for example:

$\sum_i t_i x_i$

where $t_i$ is the time it takes to move along the edge labeled with transition variable $x_i$, and $x_i$ is the number of times the edge has been traversed. The bounds can be determined by using integer-programming techniques to solve for the minimum and maximum values of the objective. Although in general integer programming is NP-complete, there are special cases that reduce to polynomial time linear programming. The types of equations generated by the constrained expression toolkit generally reduce to one of these special cases.

4.2.3 Design

The essential difference between specification and design is that the specification states what the program is supposed to do, while the design shows how the program will do it.
The purpose of testing in the design phase is to ensure a correct implementation of the specification. Informal techniques like walkthroughs and inspections are commonly used in design verification (see Section 4.2.2). Formal techniques, such as proofs of correctness, are also used. One proof of correctness technique uses mathematical induction on loop invariants. The idea is to identify a set of variables, and characteristics of those variables, that do not change from loop iteration to iteration. A proof by induction is then performed on the variables and their characteristics [77]. An alternative proof technique is Hoare's axiomatic method, which uses deduction on axioms derived from program statements [78], [79].

Formal proofs of correctness have not found widespread acceptance in the software community as a verification tool. This is due to a number of factors, including the mathematical skill needed to manipulate predicate calculus and temporal logic, the immense effort needed to prove even the smallest of designs, and the inability to automate the process because human intervention is needed to determine things like loop invariants. Despite these drawbacks, these formal techniques have been successful in a number of domains, particularly where the cost of verification is negligible compared to the cost of program failure (e.g. NASA space missions) [10]. However, even if the cost of correctness proving could be ignored, it is not a panacea for software verification because "we can never be sure that the specification is correct" and "we can never be certain that the verification system is correct" [80].

4.2.4 Implementation

In the implementation phase, the actual code has been written and the process of verifying the physical program commences. Verification during the implementation phase is also known as unit, functional, or module level testing. Two kinds of testing are performed during this phase: black box and white box.
Black box testing ignores the internals of a routine and uses the specification to determine the expected output given a specific input. Testing is performed by executing the routine with input and analyzing the output for correctness. This form of testing is attractive because the tester does not have to be concerned with the internals of the routine, which allows someone other than the routine author to perform the test. The problem with black box testing is that in order to thoroughly test a routine, all possible inputs must be tried. This results in a combinatorial explosion that causes the test of even a simple routine to be computationally infeasible [14]. Consider the following routine from CollabBillboard that returns the distance between a pair of two-dimensional points:

    float distance(int x1, int y1, int x2, int y2)

In order to thoroughly black box test this routine, the function must be executed and verified for each possible x and y value. Assuming 32 bit integers, this would result in $2^{128}$ test cases, requiring more than $10^{23}$ years to complete on an Intel Pentium 166/MMX machine.

Equivalence partitioning is a method for reducing the number of black box test cases. The idea is to partition the input space into a set of equivalence classes where any input value in a class is equivalent to any other input value in the class. Equivalence partitioning eliminates the need for exhaustive testing because only one representative test needs to be performed for each equivalence class. For example, suppose a routine that calculates cos θ, where θ is the angle in degrees, is to be tested. By examining the behavior of the cosine curve between 0 and 360 degrees a number of equivalence classes emerge (see Table 2). If the equivalence classes are chosen correctly, then a single test with any value from class I (e.g. 5 degrees) should be sufficient for testing the entire range of values from 0 to 89 degrees.
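The idea of one representative test per equivalence class can be sketched as follows. This is a hypothetical illustration, not code from the thesis: cosDegrees stands in for the routine under test, the representative angles are arbitrary picks, and classes V and VI of Table 2 are assumed to mean angles below 0 and angles of 360 or more.

```cpp
#include <cmath>

// Equivalence classes I to VI from Table 2, for an angle in whole degrees.
enum class CosClass { I, II, III, IV, V, VI };

// Assumed reading of Table 2: V is any angle below 0, VI any angle of 360 or more.
CosClass classify(int degrees) {
    if (degrees < 0)    return CosClass::V;
    if (degrees >= 360) return CosClass::VI;
    if (degrees < 90)   return CosClass::I;
    if (degrees < 180)  return CosClass::II;
    if (degrees < 270)  return CosClass::III;
    return CosClass::IV;
}

// Stand-in for the routine under test (hypothetical).
double cosDegrees(int degrees) {
    const double kPi = 3.14159265358979323846;
    return std::cos(degrees * kPi / 180.0);
}

// One representative input per class and the sign the cosine should have there.
struct Representative { int degrees; CosClass cls; int expectedSign; };

const Representative kRepresentatives[] = {
    {5,    CosClass::I,   +1},
    {100,  CosClass::II,  -1},
    {200,  CosClass::III, -1},
    {300,  CosClass::IV,  +1},
    {-355, CosClass::V,   +1},
    {365,  CosClass::VI,  +1},
};

// Runs the six representative tests and returns the number that fail.
int runRepresentativeTests() {
    int failures = 0;
    for (const Representative& r : kRepresentatives) {
        const double value = cosDegrees(r.degrees);
        const bool signOk = (r.expectedSign > 0) ? (value > 0.0) : (value < 0.0);
        if (classify(r.degrees) != r.cls || !signOk) ++failures;
    }
    return failures;
}
```

Six executions replace an exhaustive sweep over every integer angle. The sign check is a deliberately weak oracle; a real test would compare each result against a trusted value for the representative angle.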
The actual input values tested can be selected by hand, or automatically by random sampling [81] from each equivalence class.
    Equivalence Class   Cosine Behavior          Range
    I                   1 → 0                    0 → 89
    II                  0 → -1                   90 → 179
    III                 -1 → 0                   180 → 269
    IV                  0 → 1                    270 → 359
    V                   1 → 0 → -1 → 0 → 1       Negative multiples of 360 - (0 → 359)
    VI                  1 → 0 → -1 → 0 → 1       Positive multiples of 360 + (0 → 359)
Table 2: Equivalence classes for cos θ
Despite some heuristics for performing equivalence partitioning [14], it is essentially a manual process. The process requires a deep understanding of both the input parameters and the purpose of the routine to be tested. There is no way to guarantee that a partitioning scheme is correct, that is, that each value in an equivalence class will exercise the same code in the same manner in a routine. In the cos θ example above, if the developer decided to implement the function using a lookup table, then the equivalence classes in Table 2 would be insufficient. The technique is not foolproof, but attempts to create a manageable number of test cases with maximum impact.
Boundary value analysis is an enhancement to equivalence partitioning. The idea is that test cases that use input values near the boundaries of equivalence classes have greater impact. Input values are generated from below, on, and above the edges of an equivalence class. More formally [10]: for each range (R1, R2) of an equivalence class, five test cases should be created:
    (1) < R1
    (2) = R1
    (3) R1 < value < R2
    (4) = R2
    (5) > R2
In the cos θ example above, the boundary conditions for class I would be the following angles: (-1, 0, 1, 45, 89, 90, 91).
Figure 12: Control flow graph for loop with five possible logic paths
White box testing uses a routine's internal logic to create test cases. Test case output is compared against expected output given input values and the specification. The advantage of white box testing is precise control over the routine logic exercised by each test case.
Unfortunately, testing every possible logic path results in a combinatorial explosion similar to black box testing. Consider the control flow graph shown in Figure 12 of a loop with 5 possible logic paths per iteration. To thoroughly test every logic path for up to 20 iterations of the loop would require 5^20 + 5^19 + … + 5^1 ≈ 10^14 test cases. Assuming an Intel Pentium 166/MMX machine, it would take approximately 21 days to simply execute the test cases. This doesn't count time spent analyzing the results. Another problem with path coverage is that it doesn't guarantee that all states will be exercised in the implemented program. Different input values can cause the program to behave differently even if the same path is executed. Consider the statements in Figure 13. Setting theNumber = -1 and theNumber = 0 will cause the same path to be executed, but in one case the program will print out "MINUS -1" and in the other it will generate a divide by zero fault.
Figure 13: Code fragment from CollabBillboard
Statement coverage reduces the combinatorial explosion of test cases by ensuring every statement in the program is executed correctly at least once. One problem with this approach is that particular test data may give the illusion of statement correctness. Consider the following code sequence from CollabBillboard:
    if (point.x < 0) point.x = 0;
    if (point.x > YMAX) point.x = XMAX;
    if (point.y < 0) point.y = 0;
    if (point.y > YMAX) point.y = YMAX;
Test Cases:
    1 - point.x = -1; point.y = -1;
    2 - point.x = XMAX + 1; point.y = YMAX + 1;
This code fragment ensures the upper left corner of the billboard frame's movable view window stays within bounds of the drawing surface. The problem with the statements above is that the upper bounds for point.x should be XMAX, not YMAX.
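The effect can be reproduced with a small sketch. The Python below transcribes the four statements, including the YMAX slip, next to the intended version. The concrete limits XMAX = 800 and YMAX = 600 are assumptions chosen only so that the surface is wider than it is long, and the function names are hypothetical. Both listed test cases execute every statement and pass against either version, while an x value between YMAX and XMAX separates the two.

```python
# Assumed drawing-surface limits (not taken from CollabBillboard); XMAX > YMAX.
XMAX, YMAX = 800, 600

def clamp_as_written(x, y):
    """Transcription of the fragment, including the YMAX slip in the second test."""
    if x < 0:
        x = 0
    if x > YMAX:  # should compare against XMAX
        x = XMAX
    if y < 0:
        y = 0
    if y > YMAX:
        y = YMAX
    return x, y

def clamp_intended(x, y):
    """What the fragment was meant to do."""
    x = min(max(x, 0), XMAX)
    y = min(max(y, 0), YMAX)
    return x, y

# The two test cases from the text: every statement runs and both versions agree.
statement_coverage_cases = [(-1, -1), (XMAX + 1, YMAX + 1)]
for case in statement_coverage_cases:
    print(case, clamp_as_written(*case), clamp_intended(*case))

# A legal x position between YMAX and XMAX exposes the difference.
revealing_case = (YMAX + 100, 100)
print(revealing_case, clamp_as_written(*revealing_case), clamp_intended(*revealing_case))
```

With these assumed limits the revealing case is moved to the right edge by the fragment as written but left alone by the intended version, which is the kind of input statement coverage alone gives the tester no reason to try.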
The error is difficult to detect because the drawing surface is wider than it is long, so an x-value greater than XMAX will always trigger the upper bounds conditional.
Branch coverage provides statement coverage and additionally ensures every conditional path is executed at least once. Branch coverage will test the conditionals in Figure 13 for both TRUE and FALSE conditions. Data from a test case that should trigger a FALSE path execution for the if (point.x > YMAX)… statement might identify the conditional error. Although an improvement over statement coverage, branch coverage is still very sensitive to test case data selection.
Numerous path coverage techniques have been devised which exercise paths through the code. Combinatorial explosion is avoided by executing paths through the code a non-zero minimum number of times. A common path coverage technique constructs a control flow graph to find paths through the code [82]. Path coverage performance can be improved by discovering the minimum number of paths that have to be traversed to cover all paths [83]. One of the challenges of path coverage is to discover the input values that will cause a particular path to be executed. Using data flow graphs (DFGs) to create def-use paths for variables used in the program [84] makes it easier to discover how a particular input value affects program flow. DFG analysis cannot automatically select input values for test cases, but it can let the tester know what paths still need to be traversed for a particular variable. The DELLA PASTA [85] system extends the def-use technique to parallel programs. The core of the DELLA PASTA system is an algorithm that creates paths for variables defined in one thread and used in another. The system is very limited in that it only works in a shared memory architecture and provides no control over the temporal aspects of execution, which can also influence path coverage.
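The two test-case counts quoted earlier in this section can be checked with a few lines of arithmetic. The sketch below, in Python with function names chosen purely for illustration, computes the size of the exhaustive input space for four 32 bit integer arguments and the number of path sequences through a loop with five logic paths per iteration, run for 1 up to 20 iterations. It does not attempt to reproduce the execution-time estimates, which depend on a per-test cost the text does not state.

```python
def exhaustive_inputs(num_args, bits_per_arg):
    """Number of distinct input tuples for num_args integers of bits_per_arg bits each."""
    return 2 ** (num_args * bits_per_arg)

def loop_path_count(paths_per_iteration, max_iterations):
    """Sum of paths_per_iteration**k for k = 1 .. max_iterations."""
    return sum(paths_per_iteration ** k for k in range(1, max_iterations + 1))

black_box_cases = exhaustive_inputs(4, 32)   # 2**128
white_box_cases = loop_path_count(5, 20)     # 5**1 + 5**2 + ... + 5**20

print(f"{black_box_cases:.2e}")  # 3.40e+38
print(f"{white_box_cases:.2e}")  # 1.19e+14
```

The second figure is the geometric series (5^21 - 5) / 4, which is where the order of magnitude 10^14 comes from.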
4.2.5 Integration
When the program has been implemented and individual functions have been tested, it is time to test the program as a whole. Integration testing approaches revolve around how modules are assembled for verification. Separate integration verifies each module separately, then modules are combined all at once and the entire program is tested. Top-down integration integrates and verifies the highest level modules first with stubs for functions in lower level modules. This technique is excellent for identifying major design flaws early in the software life cycle, but does a poor job of detecting flaws in lower level modules. Bottom-up integration assembles and verifies the lower level modules first and tests the higher level modules later. This technique is excellent for identifying problems with lower level functions, but high-level design flaws are detected late in the life cycle. Sandwich integration divides modules into low level "utility" functions, and high level "glue-like" logic functions. Bottom-up integration is performed on the utility modules and top-down integration is performed on the logic modules [10].
4.2.6 System Testing
When the program has been implemented, and its individual and combined functions have been tested against the specification, there is still verification to perform. System testing refers to verifying the program against the requirements, not the specification [14]. Facility testing verifies that each objective discussed in the requirements is actually met by the program. Volume testing subjects the program to heavy volumes of data. Stress testing subjects the program to heavy loads or stresses. A stress test differs from a volume test in that it focuses on data volume over time versus just data volume. Usability testing examines the program's human factors issues. Security testing tries to subvert the program's security mechanisms.
Security testing is particularly important in CSCW where issues of privacy and user roles arise. Performance testing ensures that the program meets requirements for response times and throughput under various workloads and configurations. Configuration testing examines how the program operates in a variety of hardware and software environments. Memory testing is a specific form of configuration testing that verifies the software's main and secondary storage needs. Compatibility/Conversion testing identifies problems between the new software and preexisting programs and data. Install testing exercises the procedures involved with getting the software installed and running. Reliability testing is performed implicitly throughout the software life cycle (see Section 4.1: Reliability). Recovery testing exercises the software's ability to handle situations when programming, hardware, and data errors occur. Serviceability testing investigates requirements for fixing and maintaining the program. Documentation testing verifies that the user documentation is correct. Some verification techniques include document inspection, and incorporating every example into the test case suite. Procedure testing deals with the verification of procedures that users must follow. Acceptance testing is the final test before the software is formally delivered to the user community.
4.3 Human Computer Interaction Testing
Human Computer Interaction testing research has focused primarily on two areas: testing architectures and usability testing. Testing architecture research has examined the problems encountered when a testing system is used to automate the evaluation of applications with graphical user interfaces. Usability testing has attempted to provide guidelines and evaluation techniques for an application’s user interface.
4.3.1 Testing Architectures
Script reusability has been the major focus of academic testing architectures.
Because of the highly iterative nature of GUI application development, the test scripts recorded with one version of the application quickly become invalid. The bitmap comparison techniques used in early systems were insufficient because of dependencies on precise location and content of the GUI. Advocating a programmatic approach, early researchers argued that test scripts that drive the application by identifying the GUI components programmatically, rather than graphically, have less sensitivity to specific application state [86].
Figure 14: Usability guidelines from [87]
    Use a simple and natural dialog
    Provide an intuitive visual layout
    Minimize a user’s memory load
    Be consistent
    Provide feedback
    Provide clearly marked exits
    Provide shortcuts
    Provide good help
    Allow user customization
    Minimize the use and effectiveness of modes
    Support input device continuity
The Test Development Environment (TDE) addresses this issue with a visual test development system that abstracts low-level GUI events into higher-level operations on specific GUI components [88]. An organizational tool is provided to group operations into scripts and store them in a design library. To create a test case, the tester uses a visual programming environment to select a set of scripts from the library. The visual language includes provisions for if/then and looping control constructs. Data variance using form-based constraints is also included to increase script reusability. Low-level application events are regenerated from the high level operations to exercise the application. When a new version of the application is developed, the TDE examines the GUI components using the components it is aware of from the scripts in the design library. Discrepancies are identified and can be corrected by the tester with the help of mapping wizards included in the TDE. Other techniques attack script reusability by generating test cases automatically.
To thoroughly test the application, however, each GUI action has to be tried in combination with every other GUI action. Like black and white box testing, this creates a combinatorial explosion of test cases. Several approaches have been investigated to reduce this growth. Pair-wise grouping restricts the length of an interaction chain to two. The creators of this approach found a significant reduction in test cases without a corresponding drop in detected bugs [89]. The Latin squares approach arranges n distinct GUI interactions in an n x n grid where every interaction occurs exactly once in each row and once in each column [90]. Test case reduction without significant loss of bugs was also found using this approach. Artificial Intelligence (AI) planning techniques have also been used [91]. One system analyzes the application’s GUI to derive a set of user actions. The test designer manually encodes pre and post conditions for each interaction (e.g. to display panel X the user must press button Y). The designer then defines start and goal states for the application. The system uses an AI planner to find a path from the start state to the goal state using the GUI interactions encoded by the designer. Test case reduction is achieved because only one path is generated for each goal state. Unfortunately, like the techniques in Section 4.2.4, approaches that eliminate test cases can’t guarantee that all problems will be found in an application.
In addition to script reusability, researchers have investigated visual programming, script analysis, and multi-modal scripting. A methodology and architecture have been created for testing visual programs like spreadsheets [92]. The system defines cell relation graphs and constructs compiler-like “definition-use” links between cells that define values and the cells that use those definition cells. The testing system highlights dependent cells that have not been tested.
To exercise the cell, the tester changes the value of one or more definition cells. Highlighting is removed once the code in a dependent cell is executed.
GUITESTER uses script analysis to determine usability problems in an application’s user interface [93]. Scripts of different users performing the same application task are analyzed. The analysis extracts common interaction patterns, mean mouse movement distances, mean interval between user actions, and the proportion of users who were unable to complete each sub-task. This information is used to identify clarity, safety, simplicity, and continuity problems. For example, a long mean distance between mouse clicks in a relatively short interval could mean the user interface suffers from a continuity problem.
Multi-modal scripting integrates additional data into a script recording to improve the richness of script playback. A script can be enhanced with synchronized videotape and voice captured at the time the user exercised the application. Observer text and voice annotations can be added later [94]. MITRE’s Multi-modal Logger allows multiple applications and simultaneous users to be recorded in a single script [95]. The rich information provided by these recording systems adds important context to the application during playback analysis.
4.3.2 Usability Testing
Academia and industry have produced numerous guidelines for user interface design and evaluation [40], [87], [96], [97], [98]. The guidelines range in size from a concise set of one line statements, such as those in Figure 14, to a detailed breakdown and description of every aspect of a graphical user interface. Most researchers agree on the general principles for a good user interface. Research is very active, however, in determining if an application violates these principles. Techniques include empirical evaluation, where users are observed using the application in a usability lab or in the field.
Observers go to great lengths to avoid contact with subjects in order to preserve realistic application use. Empirical evaluation is an excellent tool for real world observation; however, it can be very expensive and time consuming [40]. Another technique, the walkthrough, uses deliberate attempts to expose usability problems in the application. Typically, quality assurance personnel or human factors experts perform the walkthrough, rather than regular users. Walkthroughs provide a cost and time efficient evaluation, but suffer because they lack a real world setting. Karat provides an excellent survey of walkthrough techniques including pluralistic walkthroughs, heuristic evaluations, cognitive walkthroughs, think-aloud evaluations, and scenario-based reviews [87]. More recently, advocates for participatory evaluation have voiced the opinion that having evaluators and possibly developers in the same room with real users offers the benefits of both empirical and walkthrough techniques [99], [100].
4.4 Commercial Test Systems
A survey of testing would be incomplete without a review of modern commercial testing systems. Unfortunately, there appears to be little contact between academia and the commercial testing community. Statements from researchers, such as "testing tools for CSCW applications are non-existent" [8], are simply untrue. At ISSTA '98, the premier annual academic conference on testing, over a dozen well-known researchers were questioned about multi-user testing architectures. None of them, including an individual citing SQA Suite™ in a conference paper, was aware of any multi-user support.
The lack of a rigorous review of commercial testing in the literature necessitated examining a variety of alternative information sources including:
USENET's comp.sys.testing, which provides a regularly updated list of over 200 commercial and public domain testing tools.
Reviews in the Software Testing Online Resources (STORM) web site maintained by Roland Untch at Middle Tennessee State University [101].
Software review articles from commercial magazines [102], [103], [104], [105].
Several discussions with Dr. Anne Ferraro, who performed a review of commercial testing systems for Microstrategies, Inc. [106].
Test software company web sites.
Several criteria were used to determine a system's desirability. First, the system had to run on Windows95/NT platforms. This was necessary because software for the Collaborative Classroom, including CollabBillboard, was developed on these platforms. Second, the system had to support multi-user testing. Determining this capability was challenging because marketing literature uses words like "stress testing", "load testing", and "client/server testing" inconsistently. Sometimes this meant that the product was capable of multi-user testing. Other times this meant that the product could be used to simulate loads or client behavior in a single user environment. Finally, the company had to provide an evaluation copy. Four systems were initially selected: Platinum Technology's Final Exam C/S-Test™ [107], Mercury Interactive's TestSuite™ [108], Rational Software's SQA Suite™ [109], and Segue's Silk Enterprise Edition™. Unfortunately, negotiations with Segue broke down before an evaluation copy of their system was obtained.
The review of commercial systems is organized around a reference testing architecture. A software test environment (STE) can be broken down into six functional categories: test execution, test development, test failure analysis, test measurement, test management, and test planning [110].
4.4.1 Test Planning
Test planning provides the tools necessary for managing the staff, schedules, and resources needed for product testing.
Areas covered by this function include features of software to be tested, detailed test plans, risk assessment, organization training needs, resource needs, staffing needs, staffing roles, staffing responsibilities, and schedule. SQA Suite™ and TestSuite™ provide extensive tools for test planning.
SQA Suite™ defines the testing process as a sequence of six steps: Test Planning → Test Development → Test Results → Defect Tracking → Summary Reporting and Analysis. SQA Manager is provided to define and organize test requirements. Test requirements are defined using a hierarchical folder/document tree. Folders describe testing objectives by level, with higher level objectives appearing closer to the root. The leaves are documents, which describe the detailed low level requirements for a specific test.
TestSuite™ defines the testing process in three steps: Test Planning → Test Execution → Bug Tracking. TestSuite™ merges test planning and development into a single step following IEEE Standard 829 [111]: Define Goals (requirements) → Define Major Capabilities to Test (specification) → Define Tests (design) → Define Steps for each Test (implementation) → Automate Tests (automation). Testing is viewed as a life cycle that parallels software development. Wizards are provided which guide the tester through each step of the planning process. Final Exam C/S-Test™ does not provide any test planning facilities.
4.4.2 Test Management
Test management deals with the storage and maintenance of test artifacts and their interrelationships. A sophisticated storage mechanism, such as a database, is needed to maintain artifact relationships. SQA Suite™ manages the entire testing process through the SQA Manager [109] program. SQA Manager allows the tester to perform test planning, archive developed test cases, archive the results from test execution, and perform analysis on the test case results.
Email and bug tracking support are also provided to tie development, quality assurance, and management into the process. The artifacts of the test process are stored in either a Microsoft Access or Sybase relational database. SQA Manager provides a query mechanism for information in the test repository. Unfortunately, the data model for the repository is not exposed, so there is no way to link the test system into other development tools, such as the code library. Such a link would be useful for synchronizing bug fixes on the development and test sides. A graphing and report writer facility is also included for reviewing and analyzing software defect information (e.g. age and priority of outstanding defects, defect ownership, number of defects over time).
TestSuite™ provides similar management through TestDirector [112]. The repository uses Microsoft Access and exposes some of the data model to external applications. Specifically, read only views are available for test case results. This allows the tester to run standard report writing tools against the results. TestDirector provides excellent support for testing during the iterative GUI development process. When a new build is brought into the test system, the widgets on each dialog are analyzed. If there are differences between the widgets in the new build and previous build (e.g. a widget was deleted), and the archived test cases have dependencies on these differences, then the system will alert the tester and provide a wizard to help modify the test cases. Like SQA Manager, the bug tracking facilities are not integrated with any code library systems. Another shortcoming of TestDirector is a lack of query tools for data archived in the repository. A graphing and report writer is also included for defect analysis. Final Exam C/S-Test™ does not provide any test management facilities.
4.4.3 Test Development
Test development adds the ability to specify test executions.
A test suite is developed for the software under verification. The suite consists of individual test cases. Each test case includes the input required to run the case, adequacy criteria to determine if the case passed or failed, and documentation.
Final Exam C/S-Test™ records user actions performed on the application under test (AUT). Actions are written to a test script, which can then be played back. User actions are divided into two categories. High level actions involve the manipulation of a GUI widget (e.g. pushing a button). Low level actions involve device level manipulation (e.g. mouse click, or keyboard press). The recorder interprets actions at a high level whenever possible. This gives the test script greater flexibility during execution. A test script that records an OK button press is more flexible than one that records the absolute screen coordinates of the mouse click that caused the button press. If the script is run with the AUT at a new position on the display, the high level action will be replayed, while the low level one will cause undesired behavior. The following test script action sets the keyboard and mouse input focus to the specified window:
    setwindow( titlename, internalId, dbKey, dbId, delay);
titlename is the text string name of the GUI window.
internalId is a special C/S-Test™ internal identifier for the window.
dbKey is used to look up information about the window in a special Windows95/NT repository.
dbId identifies the name of the repository.
delay specifies the maximum amount of time the replay system should wait before deciding that the window cannot be found.
Once a set of user actions has been recorded, the script can be enhanced with constructs from the Test Manipulation Language (TML). It is a weakly typed C-like language that includes conditionals, loops, subprograms, and four variable types: string, float, int, and list.
User exit support is provided so the script writer can call on pre-compiled subroutines developed in other languages like C and C++.
TestSuite™ provides a similar recording tool and scripting language [108]. In addition to delaying actions with a timer, the language provides a waitbitmap() function which pauses test script execution until a geometric area of the AUT matches the specified bitmap.
SQA Suite™ provides a recording tool with an extremely small footprint on the screen. This is an important benefit over the other two test systems. One of the problems with recording test cases was that whenever there was a need to interact with the test-recording tool, the actions necessary to get to the recording tool were also recorded in the test script. The small footprint provided by SQA Suite™'s tool meant that the program's interface could be placed in a location next to, but not on top of or underneath, the application. SQA Suite™'s scripting language [113] is a powerful subset of Visual Basic. Support is also included for any program written in Microsoft's Visual Basic if the user doesn't want to perform multiuser tests.
4.4.4 Test Execution
Test execution exercises the software and records the results of the execution. The software exercised may have been specially instrumented for testing. The artifacts of test execution include test system and program output, execution traces, and bookkeeping data (e.g. when the test was run, against what build/configuration, with what test case data, by whom). Systems supporting only test execution were the first kind of STEs developed.
Final Exam C/S-Test™ provides a single system window for test recording, playback, and analysis. To execute a test case, the tester opens a script file and issues the run command through the system window. In order to begin the test, the AUT must be in the same state that it was in when the test script was recorded.
A text window displays the test script, highlighting the line currently being executed. A debugger is provided which allows the tester to single step through the script, set breakpoints, and query the contents of any script variable. The playback command has two speed options: actual and fast. Actual will replay the script actions at the same speed they were recorded. Fast will replay the script actions with smaller default delays. The results of the test case are saved in a log file for later analysis. Test scripts can be run in automatic batch mode by creating a script with a sequence of testExec(fileName) commands (where fileName is the name of a test script file).
TestSuite™ views test execution as a more formal process consisting of test cycles, automated and manual tests, and test result analysis. Four test cycles are identified: sanity, normal, advanced, and regression. A sanity test cycle tests the breadth of the application and consists mostly of tests that should have positive results. Normal and advanced cycles increase the depth of application testing and contain cases that are more destructive. The regression cycle verifies that changes in the AUT didn't cause failures in other areas of the application. In addition to batch mode support for scripts, TestSuite™ supports manual testing within the system. During a manual test, a dialog box is provided which allows the tester to indicate the pass/fail status of the test and make comments. TestSuite™'s debugger is comparable to that of Final Exam C/S-Test™. In addition, it provides a variable watch list that allows the user to view the values of variables and expressions as a test script is executing. Scripts can be played back in three modes: verify, debug, and update. The default mode, verify, executes the script and performs implicit and explicit verification. Debug mode allows the script to be played back with the debugger.
Update sets the reference data used in implicit and explicit verification to be data from the current run. TestSuite™ allows the tester to set a number of execution options beyond the script's playback speed. The min_diff parameter defines the number of pixels that constitute a threshold match for bitmap verification. The delay parameter defines how often a window is checked for stability. A window is sampled at the rate specified by delay until two consecutive passes result in the same display. This ensures the window is stable for verification or synchronization checks.
SQA Suite™ views test execution in two phases: test development and regression testing. Test development is the process of creating, debugging, and baselining test cases for the AUT. Regression testing executes the developed test cases against the AUT's current build. The results of the execution are compared against the case's baseline. Any discrepancies are reported as potential errors. Although SQA Suite™ supports batch mode for scripts, it does not integrate manual testing into the process. The SQA Suite™ script debugger is comparable to TestSuite™'s. Because Visual Basic allows complex data types, the debugger also includes a data structure browser. SQA Suite™ only supports verify and debug execution modes. The baseline for a test case must be collected during recording. Script execution options focus on script playback speed and matching window captions. Caption matching is a particular problem if an application is supported on different versions of the Windows operating system. For example, Windows 3.1 only supports 8 character filenames with 3 character extensions. The tester is also able to set test log options before executing a test script. These options include the level of detail written to the log (all, pass/fail, fail) and whether the results of the test should be written to the test repository. Finally, error recovery options are available.
The user can specify how the playback should proceed if a script command fails, a test case fails, or the AUT crashes.
4.4.5 Test Analysis
Test analysis examines a test case, both during and after execution, to determine pass or failure. Artifacts from failure analysis include test case pass/failure, and a report for each failure. Some STEs with failure analysis capability use a test oracle, a subsystem that automatically analyzes software behavior and output during test execution. All-purpose test oracles do not currently exist, but several domain specific oracles have been developed. Poirot [114] analyzes the execution of parallel programs to determine and isolate performance problems. TAOS/GIL [115] compares a program's temporal specification against the trace of its implementation execution. TAOS/Reactive [116] requires the tester to translate specification locations where certain conditions must hold true to the same location within the implementation. An oracle is then constructed by creating assertions on these conditions in the implementation.
Final Exam C/S-Test™ TML includes six kinds of verification statements that the script recorder can select. Bitmap verification allows the user to identify a GUI widget or a geometric subset for comparison. A graphical snapshot is taken of the area at recording time. When the test script is run, a pixel by pixel comparison is made between the snapshot and the same area of the AUT during playback. GUI object verification saves the state of one or more GUI widgets. During test script playback, a comparison is done between a widget's saved state and actual state on the AUT. Text verification is a special verification tool used for applications that support complex fonts, such as a WYSIWYG editor. Snapshots of the text area are taken and processed using Optical Character Recognition techniques to extract the actual text. Comparisons are made between the text at record and playback time.
File verification performs a byte by byte comparison of files generated at record and playback time. A user exit is provided so that the tester can define application specific verification routines. TestSuite™ and SQA Suite™ provide similar verification tools.

Figure 15: Final Exam C/S Test Multiuser Architecture [diagram labels: Monitor; Server Workstation (x3)]

In Final Exam C/S-Test™, the results of a test execution are written to a log file. The log file contains verification pass/fail statements, test script parse and runtime errors, and user defined messages entered into the log file via the log() script command. A text browser is provided so the user can review the log. There are two viewing options: all and fail. All displays all log file output. Fail displays only test script failures. Both TestSuite™ and SQA Suite™ provide more sophisticated log file analysis tools.

SQA Suite™ provides a special browser called the SQA Test Log Viewer [117]. The Log Viewer displays an abstraction of the log file that initially lists ten different kinds of log events, the date and time the event occurred, and a pass/fail status. Examples of events include start of a test script, call/return from a procedure, general protection fault, and script command failure. The tester can apply a filter to the event log to view only specific event types. The tester can get more detail about certain events in the log by selecting the event. For example, a test case event that has a failure status will display the script command that actually caused the failure. By double clicking on the test case event, the user can jump to the actual command in the test script editor. SQA Suite™ also provides a special comparator application, which allows the tester to compare the results of a test with the original baseline to determine whether the recorded failure is actually a problem. There are comparators for images, GUI objects, and text.
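The pixel by pixel bitmap verification described above, combined with a mismatch threshold in the spirit of TestSuite™'s min_diff option, might be sketched as follows. This is a minimal illustration and not the implementation of any of the three tools: the function names, the max_mismatches parameter, the list-of-rows bitmap representation, and the rule that a comparison passes when the number of differing pixels does not exceed the threshold are all assumptions.

```python
def count_mismatched_pixels(baseline, actual):
    """Count the pixels that differ between two equally sized bitmaps.

    Bitmaps are modeled as lists of rows of pixel values (an assumption
    made for illustration; real tools compare captured screen regions).
    """
    if len(baseline) != len(actual) or any(
        len(row_b) != len(row_a) for row_b, row_a in zip(baseline, actual)
    ):
        raise ValueError("bitmaps differ in size")
    return sum(
        1
        for row_b, row_a in zip(baseline, actual)
        for pixel_b, pixel_a in zip(row_b, row_a)
        if pixel_b != pixel_a
    )


def bitmaps_match(baseline, actual, max_mismatches=0):
    """Pass when at most max_mismatches pixels differ.

    max_mismatches is a hypothetical threshold; the exact semantics of
    min_diff in TestSuite are not specified here.
    """
    return count_mismatched_pixels(baseline, actual) <= max_mismatches
```

With a threshold of zero the check is a strict pixel by pixel comparison; a small positive threshold tolerates minor rendering differences at the cost of possibly masking real defects.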
If a test failure has been determined to be a program defect, the tester can enter a defect into the SQA Repository. The defect number will automatically be assigned to the test case results in the log file. TestSuite™ provides a log file with capabilities similar to SQA Suite™'s, integrated into the WinRunner application.

SQA Suite™ includes a graphing package specifically for performance analysis. The execution times for test scripts and specific start/stop timer script commands are recorded in the log file. The tester can extract the results from the log file and display them on one- and two-dimensional graphs. Several types of graphs are supported:

Elapsed Times - Summary: shows the average elapsed times of repeated executions of a series of test scripts.
Elapsed Times - Chronology: shows changes in elapsed time over the series of test script runs.
Elapsed Times: Avg Min Max: shows the average, minimum, and maximum values of repeated executions of a series of test scripts.
Performance: plots a series of test script runs against the size of data processed.
Errors: plots error frequency by test script.

Neither TestSuite™ nor Final Exam C/S-Test™ provides any performance graphing utilities.

4.4.6 Test Measurement

Test measurement includes test coverage measurement, analysis, and instrumentation for data collection during execution traces. Artifacts include test coverage measures. Section 4.2.4: White box testing discussed test coverage issues. Instrumentation presents a testing challenge because code that has been instrumented behaves differently than the original code [118]. Standard profiling tools like prof exist for single process programs; they provide call graphs, statement and function counts, and timing statistics. For parallel programs, instrumented communication libraries such as the Portable Instrumented Communication Library (PICL), which trace send/receive events and record communication statistics, can be used [119].
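To make the idea of an instrumented communication library concrete, the following sketch wraps a toy in-memory message channel so that every send and receive is recorded as a trace event and summarized as communication statistics. It is only an illustration of the general technique: the TracedChannel class and its methods are hypothetical and do not reflect PICL's actual interface.

```python
import time
from collections import deque


class TracedChannel:
    """Toy in-memory channel that traces send/receive events.

    Hypothetical class for illustration only; not the PICL API.
    """

    def __init__(self):
        self._queue = deque()
        # Each trace entry: (event, src, dst, n_bytes, timestamp)
        self.trace = []

    def send(self, src, dst, payload: bytes):
        """Queue a message and record a 'send' event."""
        self._queue.append((src, dst, payload))
        self.trace.append(("send", src, dst, len(payload), time.perf_counter()))

    def recv(self):
        """Deliver the oldest queued message and record a 'recv' event."""
        src, dst, payload = self._queue.popleft()
        self.trace.append(("recv", src, dst, len(payload), time.perf_counter()))
        return payload

    def stats(self):
        """Summarize message counts and byte totals per event type."""
        summary = {
            "send": {"count": 0, "bytes": 0},
            "recv": {"count": 0, "bytes": 0},
        }
        for event, _src, _dst, n_bytes, _timestamp in self.trace:
            summary[event]["count"] += 1
            summary[event]["bytes"] += n_bytes
        return summary
```

Even in this toy form, the bookkeeping adds work to every send and receive, which is one reason instrumented code does not behave exactly like the original.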
One problem with massively parallel programs is that their size and lengthy execution times can result in extremely large execution traces. Selective instrumentation reduces the amount of data collected by allowing the tester to select when and what parts of the program will be instrumented. Paradyn, for example, allows code to be instrumented and de-instrumented on the fly [120]. Apart from recording test script execution times and providing basic test script start/stop timer commands, none of the test systems has any sophisticated test measurement or instrumentation capabilities.

4.4.7 Multiuser Testing

Final Exam C/S-Test™ uses two kinds of specialized software to conduct multiuser testing. A single copy of the monitor program resides on one of the networked workstations. The monitor provides session control tools to schedule and view the status of test scripts executing on remote workstations. All workstations participating in a multiuser test are controlled with a local server program. The server program identifies the workstation to the monitor as available for testing, and responds to requests from the monitor (e.g. start executing test script). A test script executing during a multiuser test is called a "virtual user". The log files from remote executions are written to a public directory accessible to all test machines. TestSuite™ and SQA Suite™ use a similar architecture.

One area in which TestSuite™ and SQA Suite™ differ from Final Exam C/S-Test™ is the distinction between types of virtual users. In SQA Suite™, a GUI user executes a test script containing interactions with the application's user interface. Only one GUI user is allowed per workstation. The main goal of a GUI user is to perform correctness testing. A virtual user issues http commands against a web server, bypassing the user interface completely. Because of the reduced processing needed by the test system for text commands, there can be many Virtual users on a single workstation.
SQA Suite™ guidelines state that each GUI user requires 20 MB of RAM, while a Virtual user requires just 1.5 MB. The purpose of Virtual users is to perform load and stress testing. TestSuite™ GUI and dB Users perform roles similar to SQA Suite™'s GUI and Virtual users.

Synchronization plays a vital part in coordinating the execution of test scripts on multiple networked machines. Final Exam C/S-Test™ provides support for both synchronous and asynchronous messaging for synchronization (see Table 3):

Script Command                      Description
----------------------------------  ------------------------------------------------------
when ("msgId") enabled { stmts… }   Tells TML to look at each incoming message id. If it
                                    matches "msgId", the statements inside the code block
                                    are executed.
enable "msgId"/disable "msgId"      Enables/disables when blocks.
sendMessage()                       Sends a message to a remote host. Messages contain no
                                    information beyond the message ID. The message is
                                    acknowledged if received.
multiMessage()                      Sends a message to multiple remote hosts. No
                                    acknowledgement is made if received.
waitMessage()                       Waits for any message to enter the message queue.
peekMessage()                       Looks at the message on top of the message queue
                                    without removing it.
sendMessageToTML()                  C function that allows the AUT to send messages to the
                                    local test script.
RemoteCallerName()                  Returns the name of the remote host that caused the
                                    test script to be executed locally.
Run "file" on "hostId"              Runs test script file on the remote host identified by
                                    host id.

Table 3: Final Exam C/S-Test™ TML Script Commands for Multiuser Script Synchronization

TestSuite™ coordinates virtual users with a synchronous messaging technique called "rendezvous". Each virtual user declares a rendezvous using the declare_rendezvous("rzvId") statement. To synchronize across test scripts, the command rendezvous("rzvId") is issued by all virtual users. Execution will not continue until all virtual users have executed the rendezvous() command with the same id. SQA Suite™ has a similar command, SQAVuSyncAndResume(), which provides some additional capabilities.
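The basic rendezvous behavior described above, in which no virtual user proceeds until all of them have reached the same rendezvous id, corresponds to a barrier. The sketch below models virtual users as Python threads and uses the standard library's threading.Barrier. The RendezvousRegistry class and the run_virtual_users function are hypothetical stand-ins for illustration; they are not the TestSuite™ or SQA Suite™ commands, and real virtual users run on separate networked workstations rather than as threads in one process.

```python
import threading


class RendezvousRegistry:
    """Hypothetical stand-in for declare_rendezvous()/rendezvous()."""

    def __init__(self, n_users):
        self._n_users = n_users
        self._barriers = {}
        self._lock = threading.Lock()

    def declare(self, rzv_id):
        """Create the barrier for rzv_id the first time it is declared."""
        with self._lock:
            if rzv_id not in self._barriers:
                self._barriers[rzv_id] = threading.Barrier(self._n_users)

    def rendezvous(self, rzv_id, timeout=None):
        """Block until all n_users have called rendezvous(rzv_id)."""
        self._barriers[rzv_id].wait(timeout)


def run_virtual_users(n_users):
    """Run n_users simulated virtual users and return their event log."""
    registry = RendezvousRegistry(n_users)
    events = []
    events_lock = threading.Lock()

    def virtual_user(uid):
        registry.declare("login_done")
        with events_lock:
            events.append(("before", uid))
        # The timeout is only a guard so the sketch cannot hang forever.
        registry.rendezvous("login_done", timeout=5.0)
        with events_lock:
            events.append(("after", uid))

    threads = [
        threading.Thread(target=virtual_user, args=(uid,))
        for uid in range(n_users)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events
```

Because the barrier releases only after every thread has arrived, every "before" event in the returned log precedes every "after" event, regardless of how the threads are scheduled.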
Through the monitor, the tester can specify a threshold for the number of virtual users that must reach the rendezvous point before execution can continue. The tester is also allowed to explicitly force a virtual user to continue. Finally, a timeout option is provided which allows the virtual user to continue if the rendezvous condition has not been met.

The Final Exam C/S-Test™ session control monitor consists of a status window and a messages window. The status window reports the status of each workstation participating in the test session. Connected indicates that the workstation is ready to run a test script. Running means a test script is executing on the machine. Getline means that the remote test script is waiting for the tester to enter some text at a special monitor command prompt. Waiting indicates the test script is waiting for a test script event (via the waitMessage() command). Error denotes that some kind of error (verification, general protection fault, or script command) occurred. Stop is displayed when the script has successfully executed. Disconnected is displayed when the workstation has been dropped from the test session. The messages window displays any messages transmitted between test scripts via the sendMessage() or multiMessage() command.

Both TestSuite™ and SQA Suite™ offer a more sophisticated session control interface. Besides remote workstation status, these interfaces provide scheduling and limited synchronization capabilities. Table 4 is a slightly modified version of the session control interface for SQA Suite™ [121]. The label field associates a specific workstation and test script with an identifier. Test station identifies the name of a workstation used in the test session. Test entry contains a list of test scripts to be run in sequential order on the workstation specified by test station. The order the scripts appear in the list is the order they will be run unless overridden by a scheduling method.
Status indicates the status of the workstation: editing, connected, not responding, running, run completed. Editing indicates that the tester is modifying the entries for the workstation in the session control window. The other states are self-explanatory. Scheduling method provides the user with some synchronization control. Valid methods are None, Wait, After