Thursday, September 21, 2006

Accessibility Testing

Accessibility Testing Software Compared
Steve Faulkner, Web Accessibility Consultant, Vision Australia Foundation [HREF1], 454 Glenferrie Rd, Kooyong 3144. steven.faulkner@visionaustralia.org.au

Andrew Arch, Manager Online Accessibility Consulting, Vision Australia Foundation [HREF1], 454 Glenferrie Rd, Kooyong 3144. andrew.arch@visionaustralia.org.au

Contents: introduction | conformance | investigations | conclusion | references | detailed results

Abstract
Web accessibility for people with disabilities and other disadvantaged groups is becoming increasingly important for government and educational institutions as they try to meet their obligations under the Disability Discrimination Act (DDA) and various policies and guidelines for online publishing. Businesses are also obligated under the DDA not to discriminate against people with disabilities as a result of their online activities.

A plethora of web accessibility testing tools have been released on to the market over the past eighteen months at prices ranging from a few hundred dollars to many thousands of dollars. This study looks at the ability of four of these testing tools to accurately assess the accessibility issues on a web site against the W3C Web Content Accessibility Guidelines 1.0 Priority 1 checkpoints. We conclude that they all have strengths and weaknesses and that none of them are able to identify all the accessibility issues. At this stage these tools can only aid accessibility testing, not provide a definitive assessment.

Introduction
With the increasing requirement for government, education and business to provide accessible online services in order to provide access to all citizens, including people with disabilities, many web managers are turning to "robotic" tools to spider through their sites and tell them the problems and accessibility 'hot spots'. This has led to a veritable industry of software developers trying to solve this problem for those requiring "accessible" web sites, spurred along by the recently enacted Section 508 [Section 508] law in the United States requiring Federal Government agencies to build accessible web sites and generally to "buy accessible".

However, as we shall demonstrate in this paper, not all accessibility assessment software tools are created equal. Some overstate the problems that exist on a web site while others understate them. And they don't even do this consistently across the sixty-five checkpoints of the Web Content Accessibility Guidelines 1.0 (WCAG) [Chisholm et al].

Web accessibility testing tools are a relatively new software area; most tools have been available for less than two years. Few comparisons of the efficacy of the tools have been undertaken as the market has matured. Graves [2001] in Government Computer News conducted one of the earliest comparisons, between InFocus 508 and PageScreamer 2.3, noting the problems both products had with tables. Harrison [2002] and Harrison and O'Grady [2002] presented a more rigorous comparison of six analysis and repair tools, noting the difficulty in isolating errors identified by some of the tools and the emphasis on US accessibility standards rather than the international WCAG ones.

Our aim has been to investigate the efficacy and accuracy of some of the available software applications used for web accessibility testing, in order to help assess the value of such tools in the broader process of ensuring the accessibility of web sites [Brewer & Letourneau].

How is conformance ascertained?
What is accessibility?
An accessible web is available to people with disabilities, including those with:

Vision impairment (e.g. low vision or colour blindness) or vision loss affecting their ability to discern or see the screen
Physical impairment affecting their ability to use a mouse or keyboard
Hearing impairment or loss affecting their ability to discern or hear online audio
Cognitive impairments (e.g. dyslexia, ADD, learning difficulties, memory impairment) affecting their ability to comprehend or understand your site
Literacy impairments (e.g. low reading skills or English is not their first language) possibly affecting their ability to fully understand your site and its messages
Beneficiaries from an accessible web, however, are a much wider group than just people with disabilities and also include:

People with poor communications infrastructure, especially rural Australians
Older people and new users, often computer illiterate
People with old equipment (not capable of running the latest software)
People with "non-standard" equipment (e.g. WAP phones and PDAs)
People with restricted access environments (e.g. locked-down corporate desktops)
People with temporary impairments or who are coping with environmental distractions
How do we check for an accessible web site?
The Web Accessibility Initiative (WAI) outlines approaches for preliminary and conformance reviews of web sites [Brewer & Letourneau]. Both approaches recommend the use of 'accessibility evaluation tools' to identify some of the issues that occur on a web site. The WAI web site includes a large list of software tools to assist with conformance evaluations [Chisholm & Kasday]. These range from automated spidering tools such as the infamous Bobby [Watchfire, 2003], to tools that assist manual evaluation such as The WAVE [WebAIM], to tools that assist assessment of specific issues such as colour blindness. Some of the automated accessibility assessment software tools also have options for HTML repair.

What role can automated tools play in assessing the accessibility of a web site?
WCAG 1.0 comprises 65 Checkpoints. Some of these are qualified with "Until user agents ..." and with the advances in browsers and assistive technology since 1999, some of these are no longer applicable - leaving us with 61 Checkpoints. Of these only 13 are clearly capable of being tested definitively, with another 27 that can be tested for the presence of the solution or potential problem, but not whether it has definitively been resolved satisfactorily. With intelligent algorithms many of the tools can narrow down the instances of potential issues that need manual checking, e.g. the use of "spacer" as the alt text for spacer.gif used to position elements on the page.
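
The distinction between a definitive test and a test that only narrows down what needs manual checking can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by any of the tools reviewed: the class name, the function name and the list of suspicious alt values are assumptions made for the example. A missing alt attribute is reported as a definite failure, while alt text such as "spacer", or alt text that merely repeats the file name, is only flagged for a human to judge.

```python
from html.parser import HTMLParser

# Hypothetical list of placeholder alt values; not taken from any reviewed tool.
SUSPICIOUS_ALT = {"spacer", "image", "picture", "graphic"}


class AltTextChecker(HTMLParser):
    """Separates definite alt-text failures from items needing manual review."""

    def __init__(self):
        super().__init__()
        self.failures = []       # definite: <img> with no alt attribute at all
        self.manual_checks = []  # potential: alt present but possibly inappropriate

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        line, _ = self.getpos()
        if "alt" not in attributes:
            self.failures.append((line, src))
            return
        alt = (attributes["alt"] or "").strip().lower()
        filename = src.rsplit("/", 1)[-1].lower()
        # An empty alt is left alone: it is a common way to mark decorative images.
        if alt in SUSPICIOUS_ALT or (alt and alt == filename):
            self.manual_checks.append((line, src, alt))


def check_alt_text(html):
    """Return (definite failures, items for manual checking) for an HTML string."""
    checker = AltTextChecker()
    checker.feed(html)
    checker.close()
    return checker.failures, checker.manual_checks


sample = (
    '<p><img src="logo.gif"></p>\n'
    '<img src="spacer.gif" alt="spacer">\n'
    '<img src="photo.jpg" alt="Students on campus">\n'
    '<img src="rule.gif" alt="">'
)
failures, manual_checks = check_alt_text(sample)
print("Definite failures:", failures)
print("Needs manual check:", manual_checks)
```

Even in this small sketch, whether "Students on campus" is an appropriate equivalent for the photograph remains a question only a person can answer.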

These automated tools are very good at identifying pages and lines of code that need to be manually checked for accessibility. Unfortunately, many people misuse these tools and place a "passed" (e.g. XYZ Approved) graphic on their site when the tool cannot identify any specific accessibility issues, even though the site has not been competently manually assessed for issues that are not software checkable.

So, automated software tools can:

check the syntax of the site's code
identify some actual accessibility problems
identify some potential problems
identify pages containing elements that may cause problems
search for known patterns that humans have listed
However, automated software tools cannot:

check for appropriate meaning
check for appropriate rendering (auditory, variety of visual)
The interpretation of the results from the automated tools requires assessors trained in accessibility techniques with an understanding of the technical and usability issues facing people with disabilities. A thorough understanding of accessibility is also required in order to competently assess the checkpoints that the automated tools cannot check such as consistent navigation, and appropriate writing and presentation style.

Investigation of efficacy of accessibility software
Choice of software
The choice of tools to review was based on a number of factors:

The software needed to have the ability to check WCAG 1.0 checkpoints (some tools check only for US Section 508 problems). This decision was based on the applicability of WCAG 1.0 to Australian disability regulations.
The least expensive desktop software products from each software producer. This decision was based upon the hypothesis that the more sophisticated versions of the testing software are built upon the same 'testing engines' and algorithms as the entry level products.
Potential users were considered more likely to purchase the less expensive products as Web Accessibility testing would not be a major priority for many organisations.
The availability of trial versions of the software products to be reviewed and the resources available to conduct the review.
We intend to expand the list of software products reviewed as software and resources become available.

Table 1: Software Reviewed
Software | Vendor | URL | Cost
AccVerify 4.9 | HiSoftware | http://www.hisoftware.com | US $495
Bobby 4.0.1 (1) | Watchfire | http://www.watchfire.com | US $99
InFocus 4.2 | SSB Technologies | http://www.ssbtechnologies.com/ | US $1,795
PageScreamer 4.1 | Crunchy Technologies | http://www.crunchy.com/ | US $1,495

1. A new version of Bobby has been released since the testing was conducted.

It is evident that there is quite a difference in cost across vendors for their entry level products. This difference in cost is partially reflected in the functionality of the software; some tools automatically fix certain problems, but the core functionality, the testing and reporting on WCAG 1.0 issues, is present in all the software reviewed.

Investigation methodology
Site used for testing
The site "The University of Antarctica" used for the review is a demonstration site developed by WebAIM, a non profit organisation whose stated goal "is to improve accessibility to online learning opportunities for all people".

The site contains examples of the many potential barriers to accessibility. The site is hosted by WebAIM at http://www.webaim.org/tutorials/uofa/.

The site consists of:

28 HTML documents
1 CSS file
16 GIF files
24 JPG files
1 SWF file (Flash)
1 Java file
13 AU files (audio)
1 MOV file (multimedia)
3 MPEG files (multimedia)
The site was chosen because it was built to demonstrate accessibility problems. The scope of the site is quite small and therefore instances of problems are easily quantified. Furthermore, the site was constructed using plain HTML files; there are no pages generated "on the fly" from a database, which makes manual checking and quantification of the site a manageable task.

It was also reasoned that the site content and structure are relatively stable and therefore further testing of the site at a later time will still produce accurate comparison results. Further to this, a copy of the site will be stored at http://it-test.com.au/UOFA by Vision Australia Foundation to ensure the site's continuing integrity for testing purposes.

Process followed
Each of the products in the review was set to produce reports detailing issues in reference to the WCAG 1.0 Priority 1 Checkpoints. The reporting options, if present, were set to produce the standard reports. All reports were produced in HTML format.

AccVerify report comprised 60 HTML files: (Accverify report [zip file 1,128kb])

3 summary pages, comprising a listing of Priority 1 checkpoint errors and instances and Priority 1 visual checkpoints (needing human judgment) and instances. Graphical representation of this information is also presented, along with a page listing the files that failed against any of the checkpoints and a link to the associated detailed report and checklist
5 statistical pages listing some of the structural elements of the files tested, e.g. tables, forms, images, with links to the pages containing the elements
26 detailed report pages (1 for each file tested):
  individual page summary and graphs of checkpoint errors and visual checkpoints
  detailing of instances of specific issues with links to the associated checkpoints on the W3C web site
  divided into Priority 1, 2 and 3 issues
26 checkpoint pages (1 for each file tested):
  listing of all WCAG 1.0 checkpoints with indication of whether the page passed/failed, not applicable, or needed visual checking for each checkpoint
  short explanation of each checkpoint, paraphrased from WCAG 1.0 guidelines
  divided into Priority 1, 2 and 3 issues

Bobby report comprised 72 HTML files: (Bobby report [zip file 183kb])

1 summary page consisting of a short description of all (possible) issues found and links to the site files where an occurrence of the issue was found
1 index page consisting of a list of all the files tested with links to their corresponding detailed report
35 detailed report pages (1 for each file tested):
  detailing instances of specific issues with links to (locally stored) explanations of the issues and links to the associated checkpoints on the W3C web site
  divided into Priority 1, 2 and 3 issues
  further divided into issues either needing, or not needing, user checking to confirm their existence
35 (1 for each file tested) 'text only' versions of the files tested

InFocus report comprised 28 HTML files: (Infocus report [zip file 74kb])

3 summary/index pages:
  a summary compliance report page, listing instances of and explaining each violation with links to the reports of "top 5 pages containing this violation"
  a page with links to a detailed report for each page tested, also listing the number of checkpoint 'violations' found on each page
  a page with links to a detailed report for each page tested ordered by (descending) total number of 'violations' found on each page
25 detailed 'compliance report' pages (1 for each file tested):
  listing each violation found, the associated WCAG 1.0 checkpoint and the offending elements within the HTML code of the page

PageScreamer report comprised 66 HTML files: (Pagescreamer report [zip file 256kb])

1 page containing a copy of the WCAG 1.0 guidelines from the W3C web site
1 'detail' page listing every HTML element that triggered a checkpoint violation, the element's URL location and line number
1 verification summary page for the site, containing:
  a graph showing the number of 'tags' in violation of each checkpoint
  a table listing all the checkpoints with links to the related sections of the locally stored copy of the WCAG guidelines
  numbers of compliance violation instances and whether the site passed/failed the checkpoint or the checkpoint needed further verification
1 file containing a text description of the graph
27 verification summary pages (one for each URL found), containing:
  a table listing all the checkpoints with links to the related sections of the locally stored copy of the WCAG guidelines
  numbers of compliance violation instances and whether the site passed/failed the checkpoint or the checkpoint needed further verification
35 pages containing lists of links to URLs (with links to associated report files) containing instances of a violation of a particular checkpoint (1 page per checkpoint)

Overview of results
A significant measure of the software tools and their ability to report on the accessibility problems of a site is the ability to find all the URLs on the target site. Some of the apparent discrepancies in URL count between the tools could be attributed to the software listing only the URLs of files that contain errors. In this case, however, all 28 HTML files contained at least one instance of non-conformance with a WCAG 1.0 checkpoint.

Table 2: HTML files found
AccVerify | Bobby | InFocus | PageScreamer | Manual Check
26 | 20 | 25 | 27 | 28

AccVerify
Failed to find any instances of the MPEG (3) (multimedia) or AU (13) (audio) files
Failed to find files (2) that were the targets of forms
Bobby
Identified instances of audio and multimedia files, although it failed to report some instances.
Failed to find a file linked via meta based redirect
Tested and reported twice on the 'Home' page [index.html]
Failed to find a file linked via an image map based link
Failed to find files (2) that were the targets of forms
InFocus
Failed to find files (3) that were the targets of forms
PageScreamer
Tested and reported twice on the 'Home' page [index.html]
Failed to find files (2) that were the targets of forms
Found instances of the MPEG (3) (multimedia) or AU (13) (audio) files but passed site on checkpoints relating to multimedia without verification.
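
The files the tools missed were typically reachable only through form targets, meta based redirects, image map regions or embedded media. As an illustration of what file discovery has to cover, the following Python sketch collects the URLs referenced by a single page through each of those routes. It is a hypothetical example under stated assumptions, not a reconstruction of how any reviewed product works: the class and function names, the attribute table and the example.org base URL are all invented for the illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that can point at another resource, keyed by tag name.
# This table is an assumption made for the sketch and is not exhaustive.
URL_ATTRIBUTES = {
    "a": "href",
    "area": "href",    # image map regions
    "form": "action",  # form targets
    "frame": "src",
    "iframe": "src",
    "img": "src",
    "embed": "src",    # audio and multimedia
    "object": "data",
}


class ResourceCollector(HTMLParser):
    """Collects every URL a page refers to, including redirects and form targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            if (attributes.get("http-equiv") or "").lower() == "refresh":
                # The content value looks like "5; URL=next.html".
                _, _, target = (attributes.get("content") or "").partition(";")
                key, _, value = target.strip().partition("=")
                value = value.strip().strip("'\"")
                if key.strip().lower() == "url" and value:
                    self.found.add(urljoin(self.base_url, value))
            return
        attribute = URL_ATTRIBUTES.get(tag)
        value = attributes.get(attribute) if attribute else None
        if value:
            self.found.add(urljoin(self.base_url, value))


def collect_resources(html, base_url):
    """Return the set of absolute URLs referenced by one HTML document."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.found


page = (
    '<meta http-equiv="Refresh" content="0; URL=home.html">'
    '<a href="about.html">About</a>'
    '<map name="nav"><area href="campus.html" shape="rect" coords="0,0,9,9"></map>'
    '<form action="results.html"></form>'
    '<embed src="anthem.au">'
)
for url in sorted(collect_resources(page, "http://example.org/uofa/index.html")):
    print(url)
```

A spider built on this would repeat the step for every HTML URL it has not yet visited; a tool that ignores the form, meta or area cases reports on fewer files than a manual check finds.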
Accuracy and Definitive nature of results
Another significant measure of a product's efficacy is its ability to produce both accurate and definitive results without the need for further human interpretation.

Table 3: Status of (15 Priority 1) checkpoints tested
Status | AccVerify | Bobby | InFocus | PageScreamer | Manual Check
Failed | 2 | 2 | 5 | 3 | 11
Passed | 2 | n/a (3) | - | 4 (4) | 4
Human intervention (%) | 11 (73%) | 11 (73%) | 8 (53%) | 6 (40%) | n/a
Not reported | 0 | 2 (1) | 2 (1) | 2 (2) | n/a

1. Bobby and InFocus did not report upon issues that could be checked by the software and found not applicable, e.g. they found no server-side image maps and therefore did not report about issues concerning client side image maps.
2. PageScreamer failed to provide any information in the report about 2 Priority 1 checkpoints (4.1, 14.1).
3. Bobby only reported on those checkpoints that it found the site to be in breach of.
4. PageScreamer incorrectly reported the site having passed 2 checkpoints (1.3, 1.4) that the manual check revealed to be fails.
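
The Table 3 figures can be cross-checked against the per-checkpoint results given in the appendix. The Python sketch below is one way to do that tally; the statuses are transcribed from the appendix table, with each tool's own label for a human check ('visual', 'user check', 'manual check', 'verify') collapsed into a single "check" value, which is a simplification made for this example. The function names are likewise illustrative.

```python
from collections import Counter

TOOLS = ("Manual Check", "AccVerify", "Bobby", "InFocus", "PageScreamer")

# F = fail, P = pass, C = needs a human check, N = not reported.
F, P, C, N = "fail", "pass", "check", "not reported"

# One row per Priority 1 checkpoint tested, in the column order of TOOLS.
APPENDIX = {
    "1.1":  (F, F, F, F, F),
    "2.1":  (F, C, C, C, C),
    "4.1":  (F, C, C, C, N),
    "6.1":  (F, C, C, C, C),
    "6.2":  (F, C, C, C, C),
    "7.1":  (P, C, C, C, C),
    "14.1": (P, C, C, C, N),
    "1.2":  (P, P, N, N, P),
    "9.1":  (P, P, N, N, P),
    "5.1":  (F, C, C, F, C),
    "5.2":  (F, C, C, F, C),
    "12.1": (F, F, F, F, F),
    "6.3":  (F, C, C, C, F),
    "1.3":  (F, C, C, F, P),
    "1.4":  (F, C, C, C, P),
}


def tally(results):
    """Count how often each status occurs in each column."""
    counts = {tool: Counter() for tool in TOOLS}
    for statuses in results.values():
        for tool, status in zip(TOOLS, statuses):
            counts[tool][status] += 1
    return counts


def false_passes(results, tool):
    """Checkpoints a tool passed although the manual check recorded a failure."""
    column = TOOLS.index(tool)
    return sorted(
        checkpoint
        for checkpoint, statuses in results.items()
        if statuses[0] == F and statuses[column] == P
    )


counts = tally(APPENDIX)
for tool in TOOLS[1:]:
    c = counts[tool]
    share = round(100 * c[C] / len(APPENDIX))
    print(f"{tool}: failed {c[F]}, passed {c[P]}, "
          f"human intervention {c[C]} ({share}%), not reported {c[N]}")
print("PageScreamer false passes:", false_passes(APPENDIX, "PageScreamer"))
```

Run against the appendix data, the tally gives 11 checkpoints (73%) needing human intervention for AccVerify and Bobby, 8 (53%) for InFocus and 6 (40%) for PageScreamer, and identifies checkpoints 1.3 and 1.4 as PageScreamer's two false passes.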
Graphical representation of data from Table 3


Table 4: Totals of reported potential/actual failures of (15 Priority 1) checkpoints tested
Status | AccVerify (1) | Bobby (2) | InFocus (3) | PageScreamer (4) | Manual Check
Failed | 2 | 2 | 5 | 3 | 11
Potential failures (human intervention) | 11 | 11 | 8 | 6 | n/a
Totals | 13 | 13 | 13 | 9 | 11

1. AccVerify overstated total potential failures in relation to actual failures.
2. Bobby overstated total potential failures in relation to actual failures.
3. InFocus overstated total potential failures in relation to actual failures.
4. PageScreamer defined the fewest checkpoints as 'potential failures', but understated total potential failures in relation to actual failures.
Graphical representation of data from Table 4.


Over reporting
Over reporting of instances of potential checkpoint failures was a common feature of all the products reviewed.

Table 5: Examples of reported instances of potential/actual checkpoint failures
Checkpoint | AccVerify | Bobby | InFocus | PageScreamer | Manual Check
2.1 - Ensure that all information conveyed with color is also available without color | 26 | 47 | 25 | 220 | 1
5.1 - For data tables, identify row and column headers. | 11 | 12 | 1 | 164 | 1
5.2 - For data tables that have two or more logical levels of row or column headers, use markup to associate data cells and header cells. | 11 | 13 | 4 | 464 | 1
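
The scale of the over reporting is easier to see as a ratio of reported instances to the single instance found by the manual check. A short Python sketch of that arithmetic, using only the counts from Table 5 (the function name is illustrative):

```python
# Instance counts copied from Table 5; the last value in each row is the manual count.
TOOLS = ("AccVerify", "Bobby", "InFocus", "PageScreamer")
TABLE_5 = {
    "2.1": (26, 47, 25, 220, 1),
    "5.1": (11, 12, 1, 164, 1),
    "5.2": (11, 13, 4, 464, 1),
}


def over_reporting_factors(table):
    """Divide each tool's reported instances by the manually confirmed count."""
    factors = {}
    for checkpoint, counts in table.items():
        *reported, actual = counts
        factors[checkpoint] = {
            tool: count / actual for tool, count in zip(TOOLS, reported)
        }
    return factors


for checkpoint, by_tool in over_reporting_factors(TABLE_5).items():
    summary = ", ".join(f"{tool} x{factor:g}" for tool, factor in by_tool.items())
    print(f"{checkpoint}: {summary}")
```

Only one cell in the table, InFocus on checkpoint 5.1, matches the manual count; every other reported figure exceeds it, by a factor of up to 464.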

Discussion
While Bobby was the most successful tool in identifying multimedia and audio files it had the most problems identifying HTML files linked via the META and IMAGE MAP elements as well as those files that were the targets of FORM elements. The inability to identify some of the targets of Forms was a common defect among all the software.

The products' ability to give a quantitative answer as to whether the site passed or failed against the 15 WCAG 1.0 Priority 1 checkpoints varied from 60% for PageScreamer to 28% for AccVerify, though it should be noted that the PageScreamer quantitative results included 2 false passes. The results highlight that the software tools either did not produce a quantitative report or could not produce an accurate report on the status of the site in relation to the majority of the Priority 1 checkpoints tested. For approximately 10 of the 15 Priority 1 checkpoints tested the reports inform us that the checkpoints need a 'visual', 'manual' or 'user' check. Alongside the instructions to do a 'manual' check, the reports detail potential instances of checkpoint violations (Table 5), which should be helpful in tracking down issues. However, comparing the number of potential instances reported against the actual occurrences shows that none of the tools does a very good job of identifying potential errors. All of the products over-reported potential checkpoint errors/violations. Some reports listed a number of checkpoints as potential errors on every page.

All of the products produce a report that upon initial consideration may appear to be a detailed analysis of the accessibility problems found by the product on the site tested. Closer examination reveals that the software tools fail at the initial hurdle of correctly identifying the files to be tested. Furthermore, many of the accessibility issues that may occur on the site are not within the reach of the 'mechanical' rules-based analysis that these products undertake. In an attempt to ensure that issues are not missed by the software, the reports tend to overstate the occurrence of potential problems, up to the point where a potential instance of a checkpoint violation is flagged for every file checked, thus undermining even the heuristic value of the report.

Conclusion
The research into the automated accessibility tools reported here and conducted at Vision Australia Foundation indicates that the rule of "caveat emptor" applies equally to the field of accessibility testing tools as it does to buying a used car.

All of the tools tested have advantages over free online testing tools [Steven Faulkner], the main advantage being their ability to check a whole site rather than a limited number of pages. Also, files do not have to be published to the web before they can be tested; the user has greater control over which rules are applied when testing and over the style and formatting of reports; and in some cases the software will automatically correct problems found.

None of the tools evaluated, as expected, were able to identify all the HTML files and associated multimedia files that needed to be tested for accessibility on the site.

All the software tools evaluated will assist the accessibility quality assurance process; however, none of them will replace evaluation by informed humans. The user needs to be aware of the tools' limitations and needs a strong understanding of accessibility issues and the implications for people with disabilities in order to interpret the reports and the accessibility issues or potential issues flagged by the software tools.

We hope the analysis reported here will aid web site quality assurance managers in their choice of web accessibility testing software, and in their understanding of the limitations that automated tools have in the broader process of ensuring accessible web sites.

References
Chisholm, W. et al (Eds) 1999, Web Content Accessibility Guidelines 1.0, World Wide Web Consortium. [HREF2]

Brewer, J & Letourneau, C. (Eds) 2002, Evaluating Web Sites for Accessibility, World Wide Web Consortium. [HREF3]

Chisholm, W & Kasday, L (Eds) 2002, Evaluation, Repair, and Transformation Tools for Web Content Accessibility, World Wide Web Consortium. [HREF4]

Watchfire, 2003, Welcome to Bobby.[HREF5]

WebAIM, undated, WAVE 3.0 Accessibility Tool.[HREF6]

Section 508: http://www.usdoj.gov/crt/508/508law.html [HREF7]

Graves, Steve 2001, Check sites for 508 with audit-edit tools, Government Computer News [HREF8]

Harrison, Laurie 2002, Web Accessibility Validation and Repair - Which Tool and Why? (introduction), Center On Disabilities Technology And Persons With Disabilities Conference 2002 (CSUN-2002) [HREF9]

Harrison, L & O'Grady, L 2002, Web Accessibility Validation and Repair: Which Tool and Why? (analysis), ATRC, University of Toronto.[HREF10]

Steven Faulkner, Vision Australia Foundation, 2003, Free Web Development and Accessibility Tools. [HREF11]

Hypertext References
HREF1
http://www.visionaustralia.org.au/webaccessibility/
HREF2
http://www.w3.org/TR/WCAG10/
HREF3
http://www.w3.org/WAI/eval/
HREF4
http://www.w3.org/WAI/ER/existingtools.html
HREF5
http://bobby.watchfire.com/bobby/html/en/index.jsp
HREF6
http://www.wave.webaim.org:8081/wave/index.jsp
HREF7
http://www.usdoj.gov/crt/508/508law.html
HREF8
http://www.gcn.com/20_23/reviews/16783-1.html
HREF9
http://www.csun.edu/cod/conf/2002/proceedings/279.htm
HREF10
http://snow.utoronto.ca/access/evaltoolreview/validation.html
HREF11
http://www.visionaustralia.org.au/webaccessibility/workshops/references.html#acheck

Appendix - Table of results
Priority 1 Checkpoints and indication of software report on each checkpoint

In General (Priority 1)
Checkpoint | Manual Check | AccVerify | Bobby | InFocus | PageScreamer
1.1 Provide a text equivalent for every non-text element (e.g., via "alt", "longdesc", or in element content). This includes: images, graphical representations of text (including symbols), image map regions, animations (e.g., animated GIF's), applets and programmatic objects, ascii art, frames, scripts, images used as list bullets, spacers, graphical buttons, sounds (played with or without user interaction), stand-alone audio files, audio tracks of video, and video. | fail | fail | fail | fail | fail
2.1 Ensure that all information conveyed with color is also available without color, for example from context or markup. | fail | visual | user check | manual check | verify
4.1 Clearly identify changes in the natural language of a document's text and any text equivalents (e.g., captions). | fail | visual | user check | manual check | not reported
6.1 Organize documents so they may be read without style sheets. For example, when an HTML document is rendered without associated style sheets, it must still be possible to read the document. | fail | visual | user check | manual check | verify
6.2 Ensure that equivalents for dynamic content are updated when the dynamic content changes. | fail | visual | user check | manual check | verify
7.1 Until user agents allow users to control flickering, avoid causing the screen to flicker. | pass | visual | user check | manual check | verify
14.1 Use the clearest and simplest language appropriate for a site's content. | pass | visual | user check | manual check | not reported

And if you use images and image maps (Priority 1)
Checkpoint | Manual Check | AccVerify | Bobby | InFocus | PageScreamer
1.2 Provide redundant text links for each active region of a server-side image map. | pass | pass | not reported | not reported | pass
9.1 Provide client-side image maps instead of server-side image maps except where the regions cannot be defined with an available geometric shape. | pass | pass | not reported | not reported | pass

And if you use tables (Priority 1)
Checkpoint | Manual Check | AccVerify | Bobby | InFocus | PageScreamer
5.1 For data tables, identify row and column headers. | fail | visual | user check | fail | verify
5.2 For data tables that have two or more logical levels of row or column headers, use markup to associate data cells and header cells. | fail | visual | user check | fail | verify

And if you use frames (Priority 1)
Checkpoint | Manual Check | AccVerify | Bobby | InFocus | PageScreamer
12.1 Title each frame to facilitate frame identification and navigation. | fail | fail | fail | fail | fail

And if you use applets and scripts (Priority 1)
Checkpoint | Manual Check | AccVerify | Bobby | InFocus | PageScreamer
6.3 Ensure that pages are usable when scripts, applets, or other programmatic objects are turned off or not supported. If this is not possible, provide equivalent information on an alternative accessible page. | fail | visual | user check | manual check | fail

And if you use multimedia (Priority 1)
Checkpoint | Manual Check | AccVerify | Bobby | InFocus | PageScreamer
1.3 Until user agents can automatically read aloud the text equivalent of a visual track, provide an auditory description of the important information of the visual track of a multimedia presentation. | fail | visual | user check | fail | pass
1.4 For any time-based multimedia presentation (e.g., a movie or animation), synchronize equivalent alternatives (e.g., captions or auditory descriptions of the visual track) with the presentation. | fail | visual | user check | manual check | pass

And if all else fails (Priority 1)
Checkpoint | Manual Check | AccVerify | Bobby | InFocus | PageScreamer
11.4 If, after best efforts, you cannot create an accessible page, provide a link to an alternative page that uses W3C technologies, is accessible, has equivalent information (or functionality), and is updated as often as the inaccessible (original) page. | not tested

Copyright
Vision Australia Foundation, © 2000. The authors assign to Southern Cross University and other educational and non-profit institutions a non-exclusive licence to use this document for personal use and in courses of instruction provided that the article is used in full and this copyright statement is reproduced. The authors also grant a non-exclusive licence to Southern Cross University to publish this document in full on the World Wide Web and on CD-ROM and in printed form with the conference papers and for the document to be published on mirrors on the World Wide Web.

Tuesday, May 31, 2005

A TESTING METHODOLOGY AND ARCHITECTURE FOR COMPUTER SUPPORTED COOPERATIVE WORK SOFTWARE - By Robert Francis Dugan Jr.


A thesis submitted to the graduate faculty of
Rensselaer Polytechnic Institute in partial fulfillment of
the requirements for the degree of
DOCTOR OF PHILOSOPHY

Major Subject: Computer Science
May 26, 2000 (for Graduation August 2000)
Approved by:
Professor Ephraim P. Glinert, Computer Science, Chairperson of Supervisory Committee
Professor Edwin H. Rogers, Computer Science, Member
Professor Mark K. Goldberg, Computer Science, Member
Professor Mark Embrechts, Decision Sciences and Engineering Systems, Member
Rensselaer Polytechnic Institute
Troy, New York
Abstract
Despite enormous potential, CSCW software is still immature. In particular, leading
researchers in both the CSCW and testing fields have noted CSCW testing tools are nonexistent.
This thesis contributes a methodology and architecture for execution based testing of
CSCW software. The CSCW Application MEthodoLOgy for Testing (CAMELOT) provides
an organized set of specific techniques that can be used for technological evaluation. The
evaluation is organized into two phases: single user and multi-user. Single user evaluation is
subdivided further into general computing and human computer interaction. General
computing examines software components that provide basic application capabilities. Human
computer interaction focuses on the interface between the user and the software application.
Multi-user evaluation examines distributed computing and human-human interaction.
Distributed computing scrutinizes components responsible for multitasking and
multiprocessing in the application at the thread, process, processor and machine level.
Human-human interaction focuses on how the software facilitates interaction between users
during application use.
Rebecca, our testing architecture, contributes to both general and multiuser testing systems. In
the area of general testing Rebecca:
- Provides an extensible component and event model that allows the record/playback of
non-GUI events
- Allows selective event recording through record filtration
- Promotes the integration of the test system into the development environment
- Outputs test scripts in the developer’s native language
- Reduces re-recording using component-centric events and runtime component resolution
- Simplifies the test process using a simple VCR-like interface
In the area of multiuser testing Rebecca:
- Integrates live users into a test session with triggers that playback virtual user behavior
based on user interface, state change, timer, or user customized events
- Provides runtime configuration of triggers via the threshold models
- Simplifies virtual user synchronization with deadlock detection and recovery
- Simplifies multiuser script editing via a global clipboard
- Maintains IPC independence, but allows IPC to be recorded
- Scales well with a resource conserving architecture
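Deadlock detection during virtual user synchronization can be pictured as cycle detection in a "waits for" graph. The sketch below is one possible illustration under assumed names, not the algorithm as implemented in Rebecca-J: `find_cycle` and the script names are hypothetical. Each key is a script that is blocked, and its list names the scripts it is waiting on. A depth-first search that visits every vertex and edge at most once reports a cycle if one exists.

```python
from typing import Dict, List, Optional


def find_cycle(waits_for: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of nodes (first node repeated last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    nodes = set(waits_for) | {v for vs in waits_for.values() for v in vs}
    colour = {n: WHITE for n in nodes}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        colour[node] = GREY
        path.append(node)
        for nxt in waits_for.get(node, []):
            if colour[nxt] == GREY:
                # Back edge to a node on the current path: a cycle exists.
                return path[path.index(nxt):] + [nxt]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for n in sorted(nodes):  # sorted only to make the result deterministic
        if colour[n] == WHITE:
            cycle = visit(n)
            if cycle:
                return cycle
    return None


# Two virtual user scripts that each wait on the other are deadlocked.
deadlocked = find_cycle({"scriptA": ["scriptB"], "scriptB": ["scriptA"]})
# A simple chain of waits has no cycle and therefore no deadlock.
healthy = find_cycle({"scriptA": ["scriptB"], "scriptB": []})
```

A recovery step could then remove or skip one synchronization event on the reported cycle. How Rebecca chooses that event is not stated in this section, so it is left out of the sketch.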
Our architecture was implemented in Java as a working system called Rebecca-J. The
methodology, architecture, and working system were evaluated by testing a mature CSCW
application. The evaluation uncovered several dozen problems with the CSCW system. In
addition to validating our approach, the evaluation prompted immediate improvements to the
architecture and implementation, and provided important ideas for future enhancements.
TABLE OF CONTENTS
1 Introduction
1.1 Problem Overview and Motivation
1.2 The Contributions of Our Research
1.2.1 CSCW Application Methodology for Testing
1.2.2 Rebecca: An Architecture for Execution Based Testing of CSCW Software
1.2.3 Evaluation
1.3 Overview of this Document
2 A Survey of Computer Supported Cooperative Work
2.1 Groupware Applications
2.2 Groupware Toolkits
3 A Preliminary Experiment
3.1 Architecture
3.2 Experimental Method
3.3 Task Overview
3.4 Evaluation, Results, and Analysis of Team Performance
3.5 Lessons Learned from the Development of CollabBillboard
4 Survey of Prior Work in Testing Systems
4.1 Goals of Testing
4.2 Research Testing Systems
4.2.1 Requirements
4.2.2 Specification
4.2.3 Design
4.2.4 Implementation
4.2.5 Integration
4.2.6 System Testing
4.3 Human Computer Interaction Testing
4.3.1 Testing Architectures
4.3.2 Usability Testing
4.4 Commercial Test Systems
4.4.1 Test Planning
4.4.2 Test Management
4.4.3 Test Development
4.4.4 Test Execution
4.4.5 Test Analysis
4.4.6 Test Measurement
4.4.7 Multiuser Testing
5 A CSCW Application Methodology for Testing
5.1 Related Work
5.1.1 Taxonomy of Evaluation Methodologies
5.1.2 CSCW Evaluation Methodologies
5.2 A Technology Focused Methodology
5.3 Single User Evaluation
5.3.1 General Computing
5.3.2 Human Computer Interaction
5.4 Multi-user Evaluation
5.4.1 Distributed Computing
5.4.2 Human-Human Interaction
5.5 Conclusion
5.5.1 Ordering an Evaluation
5.5.2 Comparison to Existing Methodologies
5.5.3 Part of a Complete Evaluation
6 Rebecca: An Architecture for Testing CSCW Applications
6.1 General Architecture
6.1.1 Registration Management
6.1.2 Event List Management
6.1.3 Component Management
6.1.4 Playback Management
6.1.5 State Management
6.1.6 Trigger Management
6.2 General Infrastructure
6.2.1 IDE Integration
6.2.2 User Interface Independence
6.2.3 Extensible Component and Event Models
6.2.4 Record Filtration
6.2.5 Script Simplification
6.2.6 Playback Control and Feedback
6.2.7 Native Language Recordings
6.3 Multiuser Support
6.3.1 Interprocess Communication Independence
6.3.2 Playback Orchestration
6.3.3 Triggers
6.3.4 Threshold Model
6.3.5 Global Clipboard
6.3.6 Scalability
6.3.7 Application Independence
7 Evaluation
7.1 The Reconfigurable Collaboration Network
7.2 Evaluation Phase I: Converting Rebecca to Java 1.2
7.3 Evaluation Phase II: Getting Rebecca to work with RCN
7.3.1 Component Detection
7.3.2 Component Naming
7.3.3 Component Existence
7.3.4 Modal Dialogs
7.3.5 Menu Bars
7.3.6 Synchronization Feedback
7.4 Evaluation Phase III: Evaluating RCN
7.4.1 Single User Tests
7.4.2 Multiuser Tests
7.5 Discussion
8 Conclusion and Future Work
8.1 CSCW Application Methodology for Testing
8.2 Rebecca: An Architecture for Execution Based Testing of CSCW Applications
8.3 Evaluation
8.4 Future Work
8.4.1 The Future of CSCW Evaluation
8.4.2 Multiuser Recording
8.4.3 User Swapping
8.4.4 Remote Windowing
A Appendix: RCN Bugs Discovered During Evaluation
A.1 Error message displayed when starting up RCNPublicServer in Win95/98
A.2 Configuration of PATH shell variable necessary for NativeLibrary.dll for RCNPublicServer in Win95/98
A.3 ISServer does not always flush terminated RCNPublicServer
A.4 Documentation Errors
A.5 Inconsistent use of Quit, Exit, Leave, Cancel
A.6 “Pick a IS” is grammatically incorrect
A.7 No version number displayed in RCNPublicServer, rcnClient, ISServer
A.8 User Preference Dialog Displays Invalid Colors
A.9 Preference Dialog Displays Too Many Colors
A.10 Preference Dialog Allows Same Color for Two Users in Same Session
A.11 No lock mechanism for simultaneous edits of Team Information
A.12 Race Condition Joining a Session
A.13 Ghost Cursor Hidden By New Applications
A.14 Sticky Mouse Buttons
A.15 Multiple Client Control of Public Machine
A.16 Incorrectly Translated Keys
A.17 Sticky Shift, Alt, and Ctrl Keys
A.18 Race Condition in rcnClient’s User Interface
A.19 Race Conditions Joining Sessions, Users, Teams, Publics
A.20 Inconsistent use of OK, Okay
A.21 Flickering Ghost Cursor
A.22 Confusing Display of Session Clients
A.23 Memory Leaks in Public and Client When Ghosting
A.24 Can’t play Indiana Jones from rcnClient
A.25 Correspondence from RCN Development Team
B Rebecca-J Information
References
LIST OF FIGURES
Figure 1: Rapid prototyping model of the software life cycle [10]
Figure 2: Time/Space Taxonomy of Groupware [5]
Figure 3: CollabBillboard socket shadow
Figure 4: Sketch of experimental design
Figure 5: Selecting a billboard site in the city
Figure 6: Control window for assembling billboard. Both users see the same window, view the entire billboard frame and move pieces
Figure 7: Assigned roles "view billboard" window. This user has a zoomed out view of the billboard frame but cannot move any pieces
Figure 8: Assigned roles "place billboard" window. This user has a zoomed in view of the billboard frame and can move pieces.
Figure 9: Z Language schema for CollabBillboard
Figure 10: GIL specification for queueRemotePieceUpdate$n
Figure 11: GIL specification for drawRemotePieceUpdate$n
Figure 12: Control flow graph for loop with five possible logic paths
Figure 13: Code fragment from CollabBillboard
Figure 14: Usability guidelines from [87]
Figure 15: Final Exam C/S Test Multiuser Architecture
Figure 16: Taxonomy of Evaluation Methodologies [122]
Figure 17: Intersecting Technologies of a CSCW Application
Figure 18: CAMELOT’s Single/Multiuser Stages
Figure 19: Technology and Social Aspects of CSCW [122]
Figure 20: General architecture diagram for Rebecca
Figure 21: Registration management architecture diagram for Rebecca.
Figure 22: High level view of event list model/view/controller architecture
Figure 23: Detailed view of event list model/view/controller architecture.
Figure 24: Component management architecture diagram for Rebecca
Figure 25: Playback management architecture diagram for Rebecca
Figure 26: Algorithm for event list replay.
Figure 27: Algorithm for native language replay
Figure 28: State management architecture diagram for Rebecca
Figure 29: Trigger management architecture diagram for Rebecca
Figure 30: Connecting to Rebecca-J using IBM’s Visual Age Visual Composition Editor
Figure 31: Connecting to Rebecca-J using inline code
Figure 32: Recording is played back correctly even though UI components have moved.
Figure 33: UI Components translated to Rebecca’s Component Hierarchy
Figure 34: Creation and initialization of PropertyChangeComponentInt in AgentTester
Figure 35: AgentTester’s modified setter for monitoring state change to integer count
Figure 36: Implementation of dispatchEvent() for PropertyChangeEventRecord
Figure 37: Implementation of playbackEvent() for AgentTester
Figure 38: Selective recording with Rebecca-J
Figure 39: Recorder turned on and recording of plus push button press made
Figure 40: Push button press events copied and pasted back into the event list
Figure 41: Result of replay
Figure 42: Feedback for synchronization state in Rebecca-J
Figure 43: Implementation of MouseEventRecord’s toJavaString() Method
Figure 44: Sample output from MouseEventRecord’s toJavaString() Method
Figure 45: Sample implementation of executeEventRecordList()
Figure 46: Recording customized with a for loop
Figure 47: Implementation of setCount()
Figure 48: Implementation of remoteSetCount()
Figure 49: Record filtration to remove redundant events while recording IPC.
Figure 50: Original Playback Orchestration Proposal
Figure 51: An O(V+E) algorithm to determine cycles in a graph
Figure 52: Resource graph (left) with deadlock cycle detected (right)
Figure 53: Reworked playback orchestration
Figure 54: Algorithm to Process Synchronization Events
Figure 55: Algorithm for the removal of a synchronization event from a script.
Figure 56: Determining synchronization points for SecondWind’s recording.
Figure 57: Synchronization Dialog for SecondWind’s Recording
Figure 58: Synchronization event inserted just before mouse press on slider bar in SecondWind’s recording
Figure 59: Timer trigger and virtual user script to support the metronome in Rebecca-J
Figure 60: Deadlocked scripts.
Figure 61: Deadlock dialog.
Figure 62: User interface for triggers in Rebecca-J
Figure 63: A threshold editor is necessary for a simple event type threshold model.
Figure 64: Rebecca-J’s editor for the propertyChangeInt threshold model.
Figure 65: The mouseRegion threshold model editor
Figure 66: Timer browser in Rebecca-J
Figure 67: Configuring a timer trigger for a single virtual user.
Figure 68: Ordering recording players in Rebecca-J
Figure 69: Adding a customized threshold model to ThresholdList’s initialize() method
Figure 70: Implementation of the compare() method for low level key event threshold models.
Figure 71: Implementation of compare() method for keySequence threshold model.
Figure 72: Constructor for mouseRegion threshold model
Figure 73: Implementation of event sequencing threshold model in Rebecca-J.
Figure 74: A shared drawing/chat application
Figure 75: An example of trigger chaining
Figure 76: Trigger chaining extends shared drawing area test
Figure 77: Trigger state chaining example
Figure 78: Derivation of unique name from root component
LIST OF TABLES
Table 1: SCR Table for CollabBillboard
Table 2: Equivalence classes for cos
Table 3: Final Exam C/S-Test™ TML Script Commands for Multiuser Script Synchronization
Table 4: Session Control window from SQA Suite™
Table 5: General Computing Techniques from 1[14] and 2[10]
Table 6: General Computing ∩ Human Computer Interaction Techniques from 1[125], 2[40]
Table 7: Usability Techniques from [40]
Table 8: Distributed Computing Techniques
Table 9: General Computing ∩ Distributed Computing Techniques
Table 10: Human Computer Interaction ∩ Distributed Computing Techniques
Table 11: Human-Human Interaction Techniques
Table 12: Human-Human Techniques Organized by CAMELOT Code
Table 13: Rebecca’s Remote Objects
Table 14: Threshold Models implemented in Rebecca-J
Table 15: Bugs discovered in RCN using CAMELOT and Rebecca-J
Table 16: RCN's shared objects classified by coupling and architecture
Table 17: Results of RCN Ghost Scalability Testing
ACKNOWLEDGMENTS
My six-year doctoral journey was filled with detours. Some needed to be explored, some
should have been left alone, and some, thank goodness, I managed to avoid. I suspect the
twenty other students who entered the program with me in 1994 began a similar journey.
Thirteen made it through the qualification exam. Four passed the candidacy exam. Three are
completing the degree. I want to thank the people who helped me make this journey a
success.
Thank you, Mr. Brown, my eighth grade science teacher at East Lyme Junior High School.
Your guidance and confidence in my abilities awakened a source of inner strength and drive
that changed my life forever. You are a wonderful teacher.
Thank you, Mom and Dad, for watching over me when I was young and being supportive
while letting me find my own way as an adult. Dad, you taught me the value of hard work and
persistence. Mom, you taught me an artist’s creativity.
Thank you, Mike, Tim, Kathleen, and Grandmommie. During the frustrations and doubts that
appeared along the way, your faith and confidence that I was doing the right thing reaffirmed
my own.
Thank you, cousins: Jenny, Katie, Lizzie, Byron, Brendan, and Aeron. You were my home
away from home during my tenure here at Rensselaer. Your questions over the years about
what grade I was in were fun to answer and a pressing reminder to finish.
Thank you, friends in the Computer Science department: Jeff Neshewait, Patrick Fry, Amir
and Amanda Sehic, Dr. Stephen Blythe, Rick Klein, Gregg Steuben, Louis Ziantl, Terry
Hayden, Pam Paslow, Darren Lim, Quincy Stokes, and Lina Guzman. You made me feel
welcome, gave me great advice, and showed me how to have fun in Troy. The time has passed
too quickly.
Thank you, friends in the Literature, Language, and Communications department: Lynne
Cooke, Anne Navin, Dr. Joe Downing, Dr. Lee Honeycutt and Carolyn Honeycutt. You kept
the other side of my brain from atrophying while I was immersed in geekdom.
Thank you, Bill Oldfield and Paula Paul. You two had a profound impact on my professional
career. You gave me responsibility and challenging projects and your confidence in me
brought a dawning realization that I could succeed at anything I set my mind to.
Thank you, Dr. Steven Howes. You’ve been my best friend for over twenty years. You blazed
the Ph.D. path before me, showing me that it was attainable by mere mortals. Your advice and
support as a veteran of the process were invaluable.
Thank you, WPI professors Nabil Hachem, Matthew Ward, E. Malcom Parkinson, Michael
Gennert, and Stanley Selkow. I clearly remember the look on Matt’s face one night as he
described the sabbatical to Australia he was about to take. Your encouragement and
enthusiasm when I was deciding whether to pursue a doctorate tipped the scale.
Thank you, Professor Ephraim Glinert. You took me on as an advisee and gave me freedom
to pursue my own curiosity. Your wise counsel kept me from straying down too many dead
ends. Your willingness to fund my research on and off campus, and to grant a brief leave of
absence gave me the flexibility I needed to get the degree completed.
Thank you, Professor Edwin Rogers. Your advice and involvement as a member of my
committee were invaluable. You prodded me to produce a formal testing methodology that has
become an important part of the thesis. Your research group’s application, RCN, was exactly
the kind of collaborative system I needed for an evaluation of the thesis. Finally, our many
conversations about sailing helped keep me sane.
Thank you, Rensselaer computer science senior J.J. Johns. You led the RCN development
team and helped me a great deal during the evaluation phase of my thesis. I appreciated your
willingness to accommodate my needs while taking a full course load and continuing your
regular RCN duties.
A special thank you to my wife, Becky Dugan. I’m so grateful for your support this past year.
You’ve taken care of a lot of the details of daily life so I could focus on finishing this
dissertation. You’ve also been a great editor, therapist, and friend.
“The journey is the reward,” to quote a Tao saying, and I couldn’t agree more. These past six
years have been the most incredible of my adult life. I’ve honed my skills as a computer
scientist, taught classes, worked for an Internet startup company, and conducted serious
academic research. To top it all, in a Wilderness First Aid class on campus I met the best thing
that ever happened to me: my wife Becky. Thank you, Rensselaer!
1 Introduction
Human beings are social animals [1]. Many of the developments that are the hallmarks of
human society can be traced to the need to interact and cooperate. Language allows more
efficient and expressive communication. Money is used to acquire goods and services
from others. Organizations such as the family, university, workplace, government, and law
that preserve, protect, and advance humanity rely on complex interplay between
individuals [2].
Technology has also played an important role in the evolution of humanity. Tools of the
mind - for gathering, processing, and distributing information - have had the greatest
impact in the twentieth century [3]. Among these tools is the computer, arguably the most
powerful tool ever developed. This power comes from the computer’s ability to deal with
information management in a generalized fashion [4].
"A computer-based system that supports groups of people engaged in a common task (or
goal) and provides an interface to a shared environment" is called groupware [5]. Douglas
Engelbart published a visionary paper describing a groupware system called NLS in 1968.
NLS contained many of the basic functions that can be found in modern groupware
systems including e-mail, shared annotations, shared screens, shared pointers, and
audio/video conferencing [6]. During the 1970s, e-mail and threaded text conversations
(e.g. conferencing systems and bulletin boards) became commonplace.
The need to interact and socialize combined with the technological progress of the past
several decades has led to the development of a branch of study known as Computer
Supported Cooperative Work (CSCW). Research expertise in the CSCW field covers a
wide range of disciplines including computer science, psychology, anthropology, and
education. Applications that fall under the CSCW umbrella are diverse: electronic mail,
newsgroups, chat, multi-user editors, meeting support, videoconferencing, shared
simulations, and workflow are some examples.
Despite enormous potential, CSCW applications are still immature. Four software
technology components must be successfully integrated in order to create a useful system:
• General computing provides basic application functionality found in any software
system. Determining that a software program, even a simple one, functions
correctly has been the subject of decades of research.
• Human-computer interaction technology supports the interaction between a user and
the application. All of the difficulties inherent in developing the interface for a
single user system apply including: iterative design, reactive programming,
multithreading, undo/redo, and real-time programming.
• Distributed systems cover software that supports the execution of the application on
multiple computers. Classic problems of multiprocessing that have to be
confronted include: inter-process communication, process synchronization,
session management, and fault tolerance.
• Human-human interaction deals with functionality supporting interaction among
several users. Issues include: coordination, coupling, privacy and user awareness.
Creating, testing, and maintaining a program that uses any one of these software
technologies is difficult. The effort involved in a system that combines all these
technologies is truly daunting.
To attack these difficulties, the CSCW research community has tried to simplify the
creation of groupware through the development of toolkits. These toolkits address four
important areas: run-time architecture, programming abstractions, groupware widgets, and
session management. The run-time architecture aids the programmer with process
management and inter-process communication. Programming abstractions simplify
synchronization of distributed events and data. Groupware widgets provide the
programmer with GUI components for multiuser applications. Session management
allows the programmer to customize how users create, join, leave, and manage a multiuser
application.
1.1 Problem Overview and Motivation
Researchers concede that there is room for improvement of groupware toolkits [7]. For
example, little work has been done to integrate audio and video into CSCW applications
[8]. By examining the software lifecycle, other areas for CSCW application improvement
can be discovered (see Figure 1).
The difficulties encountered in creating CSCW applications also apply to their verification.
For example, security and privacy need to be validated before users can feel confident that
their private work is protected. The development process that user interface intensive
CSCW applications go through requires constant reevaluation. Undo/Redo scenarios can
get extremely complicated in multiuser settings, and require thorough verification.
Distributed systems like CSCW software are “notoriously difficult to write, test, and debug”
[9]. Leading researchers in both the CSCW and testing fields note “CSCW testing tools
are non-existent” [8].
To date CSCW evaluation efforts have been broad based, advocating the examination of
both the social and technological aspects of an application. These broad based approaches
combined with the research community’s preference for social evaluation have created a
lack of specific techniques for the technological evaluation of CSCW software. A
methodology is sorely needed given the complexity of the testing task.
Figure 1: Rapid prototyping model of the software life cycle [10]
[Figure 1 labels: Requirements, Rapid Prototyping, Specifications, Design, Implementation, Integration, Maintenance, and Changed Requirements, each followed by a Verify or Test step]
In addition to the lack of techniques for testing, there is also the logistical problem of
finding “real” users to exercise the software [11]. A usability test requires users to exercise
the application. Typically, the first user to exercise a compiled and linked program is the
developer. When the developer is satisfied that the application is operating properly, a
second stage begins when real users are brought in for further study. It is relatively easy
for a developer to play the role of a user in a single user application, but in a CSCW setting
this becomes more challenging. “Because we need at least two or more people for each
observation scenario, we spend more time scheduling subjects and setting up equipment to
observe each subject” [12]. It is hard enough to get one user to commit to a block of test
time. It is even more difficult to get two or more users to agree to the same block of time.
Higher costs in terms of time and money are incurred during CSCW testing because of this
scheduling problem and the greater number of users needed for testing.
A common sense approach runs both users’ portions of the program on a single machine.
Input and output are straightforward since both users have the same keyboard, mouse, and
display. It is possible for the developer to see the immediate effect that one user’s action
has on the other because all output goes to the same display. From a cost standpoint, this
method is attractive because it only requires a single machine. There are, however,
significant drawbacks to this approach. Concurrency is severely restricted because only
one user can have the input focus on the machine. Network performance is inaccurately
represented because communication between users never leaves the local machine.
General system performance is also misrepresented. In a heavily graphical application, for
example, the performance when multiple users run on the same machine may be
unacceptable due to intense image manipulation. Screen real estate can also be a problem.
Since many CSCW applications are designed for one user per display, it may be difficult to
view both users’ output simultaneously. Other one-per-machine resources may not be
shared properly. For example, there is only one system cursor per machine. It may be
impossible to test a system cursor remote control function until the application runs on
two machines. Multiuser audio output is also difficult to test on a single machine.
Distributing a two-user application across two machines eliminates most of the single-machine
problems and more accurately represents how the system will behave. However, a single
developer trying to exercise the application on two machines requires a great deal of
dexterity and agility. Two displays provide an overwhelming amount of screen surface to
observe during simultaneous visual updates. Multiple keyboards and mice allow
concurrent input, but require dexterous skills for anything beyond a simple key-press or
button-click. Imagine a single developer trying to type two sentences on two keyboards at
the same time! Sophisticated simultaneous mouse manipulation is also difficult. Audio
also presents a problem. It can be difficult, for example, to isolate which machine is
producing audio output during execution. Headphones offer an option with multiple
testers, but this isn’t possible with a single developer. The difficulty of usability testing a
CSCW application increases when three, four, or more users are added to the system.
We acquired first hand experience with the difficulties of developing and testing a CSCW
application during the creation of CollabBillboard [13]. CollabBillboard is a multiuser
simulation developed to test the theory that explicit user roles can induce greater
collaboration. Although our evaluation of the application supported the hypothesis, we
found the entire process frustrating. The biggest problem was how much we
underestimated the amount of time needed to complete the application. It took almost
three times longer than we expected! A major contributor to the delay was difficulty in
finding subjects to help test the application. For the reasons discussed above, a single user,
the developer, was not sufficient to thoroughly exercise the program. It was often
necessary to comb the halls for volunteers, and as the months went by, they became
increasingly reluctant.
We began an investigation of testing systems to determine if any of them could have
helped during the development of CollabBillboard. The research community has focused
primarily on efforts that automate verification early in the software life cycle. The earlier a
software error can be detected in the life cycle, the less costly it is to fix [14]. Even with
black box and white box testing (see Section 4.2.4), which appear late in the cycle,
automatic testing techniques are used. Work in early life testing has proven impractical for
large complex applications. Late cycle techniques like black and white box testing are
intractable for all but the simplest of programs. The research community has almost
completely ignored the system test stage.
The commercial world, on the other hand, takes a less formal, execution-based approach
to verification. The tester is responsible for manually creating test cases to be executed
against the system, with little guidance from the testing tool. The test cases are executed
against the application during the implementation, integration, and system test phases.
Fixed test cases are insufficient for the verification of a CSCW application. There is no
opportunity for the CSCW tester to participate in the test. The tester cannot change the
direction of a test case on the fly. As a passive observer, the tester cannot view the actions
of a user effectively because the automated test executes too quickly. Finally, the tester
lacks fine-grained control over the virtual users participating in the test.
1.2 The Contributions of Our Research
Our research has focused on improvements to execution based testing of CSCW software.
We have developed CAMELOT, a CSCW Application MEthodoLOgy for Testing.
Developers and quality assurance personnel can use CAMELOT to evaluate software
technology that comprises a CSCW application. We devised Rebecca, an architecture for
an execution based test system, motivated by the desire to support live user participation in
a CSCW test. In addition, the architecture makes important contributions to general
execution based testing systems. To determine the efficacy of our work, CAMELOT and
a Java based implementation of Rebecca were used to evaluate a mature CSCW
application: Reconfigurable Collaboration Network (RCN). The evaluation uncovered over
twenty bugs in RCN, flaws in Rebecca and the implementation, and provided valuable
feedback for future work.
1.2.1 CSCW Application Methodology for Testing
Existing methodologies take a broad based approach to the evaluation of a CSCW
application. While acknowledging that technology plays a role in a CSCW system, these
methods give few details on how its evaluation should proceed. The CSCW Application
Methodology for Testing (CAMELOT) provides an organized set of specific techniques
that can be used for technological evaluation. The methodology breaks the testing process
into two stages: single user and multi-user. In the single user stage, General Computing
and Human Computer Interaction features are examined. During the multi-user stage,
Distributed Computing and Human-Human Interaction aspects are investigated.
A unique code is associated with each technique. The code provides a classification
scheme for the tests used and problems uncovered during application evaluation. We
believe CAMELOT’s techniques are inclusive of most of the technology tests an evaluator
would want to perform on a CSCW application.
1.2.2 Rebecca: An Architecture for Execution Based Testing of CSCW Software
A critical component is missing from multiuser CSCW application development that is
taken for granted in single user applications: support for live user testing. Anytime someone
wants to test a single user application, they can pose as the user and run the application.
As explained above, it is very difficult for a single person to perform a live user test when
multiple users are required. State of the art commercial and research testing systems do not
provide adequate guidance or support for a single person to perform live multiuser
verification.
Our approach to integrating a live user into an execution based testing architecture focuses
on the shortcomings of traditional execution based test systems. Rebecca makes
significant contributions to the general infrastructure of execution based testing systems:
• The record/playback process is improved beyond the user interface with
extensible component and event models. Any application activity can be replayed
if the source is defined as a component, and the activity is defined as an event.
• A record filtration system is defined that allows the user to filter events by selecting
which components participate in a recording. In past systems, the only filtration
options were manually intensive intermittent recording or editing of the recording.
• Unlike traditional testing systems that view testing as a separate task from
development, the architecture seamlessly integrates into existing integrated
development tools such as IBM's VisualAge.
• For sophisticated data structures and control flow in a test script, Rebecca
describes a blueprint for exporting recordings in a familiar format: the IDE's native
programming language. This contrasts with traditional test systems, which require
the user to learn a proprietary scripting language.
• Re-recording of scripts after application changes have been made is reduced using
runtime resolution of components and component-centric events.
• Recording script management is simplified with a VCR-like metaphor for creating,
editing and executing tests. This allows the user to create and run a test in
seconds.
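To make the component and event models above more concrete, the following is a minimal sketch in Python of record filtration and playback with runtime resolution of components. It is not Rebecca's actual design or API (the implementation described in this thesis is Java-based); the names `Event`, `Recorder`, and `play_back` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class Event:
    """One recorded application activity (hypothetical structure)."""
    component: str      # name of the component that produced the event
    action: str         # e.g. "type" or "click"
    payload: str = ""


@dataclass
class Recorder:
    """Keeps only events whose source component was selected for recording."""
    included_components: Set[str]
    recording: List[Event] = field(default_factory=list)

    def observe(self, event: Event) -> None:
        # Record filtration: events from unselected components are dropped.
        if event.component in self.included_components:
            self.recording.append(event)


def play_back(recording: List[Event],
              components: Dict[str, Callable[[Event], None]]) -> int:
    """Replay events, looking up each component by name at playback time.

    Returns the number of events that were actually replayed.
    """
    replayed = 0
    for event in recording:
        handler = components.get(event.component)
        if handler is None:
            continue  # component is absent in this build; skip the event
        handler(event)
        replayed += 1
    return replayed
```

Because components are looked up by name only when the recording is replayed, a recording made against one build can still run after the user interface has been rearranged, as long as the component names survive. That is one plausible reading of how runtime resolution reduces re-recording.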
Rebecca also breaks new ground in the area of multiuser execution based testing:
• The ability to incorporate live and virtual users into a single test session using
distributed triggers. With triggers, virtual users react to events generated by other
users (live or virtual). Existing test systems completely prescribe a test session,
which precludes meaningful live user participation.
• Virtual users can react to four classes of events using triggers: user interface, state
change, timer, and customized. This allows the virtual user to respond to virtually
any application activity, much like a live user.
• Threshold models are provided which allow the tester to specify the characteristics of
an event or sequence of events that will fire a trigger. A threshold model has a
user interface component, which allows runtime specification of firing conditions.
An extensible object oriented framework for complete customization is also
included.
• Improvements to synchronization during multiuser playback including an
orchestration metaphor, simplified synchronization mechanisms, deadlock
detection, and deadlock recovery.
• A global recording clipboard, which simplifies the process of sharing some or all of
a recording between virtual users.
• Ability to record, playback, and monitor application communication while
maintaining independence from the communication mechanism. Existing test
systems do not provide the ability to monitor application communication. The
few academic systems that do provide this ability are mechanism specific.
• A resource conserving architecture. This allows the system to run in tandem with
an IDE, and improves scalability as the number of users participating in a test
increases.
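The trigger and threshold ideas in the list above can be sketched as follows. This is only an illustration in Python under stated assumptions: the class names `CountThreshold`, `Trigger`, and `VirtualUser` are hypothetical, and the simple "fire after N matching events" rule stands in for whatever threshold models Rebecca actually provides.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    source_user: str
    kind: str    # one of "ui", "state_change", "timer", "custom"
    name: str


@dataclass
class CountThreshold:
    """Hypothetical threshold model: fires after `required` matching events."""
    kind: str
    name: str
    required: int = 1
    seen: int = 0

    def update(self, event: Event) -> bool:
        if event.kind != self.kind or event.name != self.name:
            return False
        self.seen += 1
        if self.seen < self.required:
            return False
        self.seen = 0  # reset so the trigger can fire again later
        return True


@dataclass
class Trigger:
    threshold: CountThreshold
    reaction: Callable[[Event], None]

    def on_event(self, event: Event) -> None:
        if self.threshold.update(event):
            self.reaction(event)


@dataclass
class VirtualUser:
    name: str
    triggers: List[Trigger] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)

    def receive(self, event: Event) -> None:
        if event.source_user == self.name:
            return  # a virtual user does not react to its own events
        for trigger in self.triggers:
            trigger.on_event(event)


# Usage: the virtual user replies after a live user sends two messages.
bot = VirtualUser("bot")
bot.triggers.append(Trigger(
    CountThreshold(kind="ui", name="send_message", required=2),
    lambda e: bot.actions.append(f"reply to {e.source_user}"),
))
for _ in range(2):
    bot.receive(Event("alice", "ui", "send_message"))
```

The point of the sketch is that the virtual user's behavior is driven by what other participants do rather than by a fixed script, which is what leaves room for a live tester to take part in the session.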
It is expected that Rebecca will impact the development of future execution based testing
systems and collaborative software. Rebecca promotes the integration of testing early in
the software life cycle. This is critical because studies have shown that the earlier a bug is
discovered the less expensive it is to correct. The architecture also provides guidance for
the development of future multiuser testing. This guidance includes independence from
the application's communication infrastructure, improvements to multiuser
synchronization, triggers, and a scalable design. Finally, Rebecca-J, a Java-based
implementation of the test system architecture, is available for immediate use for the
development and testing of Java-based collaborative software. In addition to
improvements in multiuser testing, this should immediately benefit the research
community by alleviating the need for live users during a multiuser test.
1.2.3 Evaluation
We believe the evaluation of our methodology and testing architecture was a success.
Unsolicited correspondence from the RCN team (see Section A.25) showed gratitude for
the problems uncovered by the CAMELOT and Rebecca approach. Two dozen bugs
were discovered in this mature CSCW application. Some of the problems were cosmetic.
However, some of them were serious and are being corrected to make RCN a robust
application.
Rebecca was also significantly improved. Flaws in the component management
architecture were uncovered and corrected. Problems with modal dialogs were also fixed.
Finally, several ideas for enhancements to Rebecca were formulated.
1.3 Overview of this Document
The rest of the document is broken up into several chapters. Chapter 2 gives the reader an
understanding of the scope of CSCW and describes some of the major groupware toolkits.
Chapter 3 describes the CSCW application CollabBillboard and the lessons we learned from
creating the software. One of the biggest problems we had developing CollabBillboard
was testing the system between versions. Chapter 4 looks at state of the art academic and
commercial contributions to the field of software testing. Chapter 5 describes
CAMELOT, the CSCW Application Methodology for Testing, a set of techniques we have
developed specifically for testing collaborative software. Chapter 6 describes Rebecca, an
architecture we have created for a collaborative software testing system. In Chapter 7,
CAMELOT and Rebecca are evaluated by using them to test the Reconfigurable
Collaboration Network. Chapter 8 concludes the thesis with some thoughts on future
work.
2 A Survey of Computer Supported Cooperative Work
Groupware applications can be classified in several ways. One common method of
categorization looks at how an application deals with issues of time and space. When
multiple users are using a groupware application, when and where interaction occurs helps to
define the system’s capabilities. Temporally, users can interact at the same time or at
different times. Spatially, users can interact in the same place or from different places.
Figure 2 illustrates these possibilities.
Figure 2: Time/Space Taxonomy of Groupware [5]
An example of same time/same place groupware is Rensselaer’s Design Conference Room
Collaboration Network. This software was designed for face-to-face design meetings.
Participants each have access to a private workstation and use a floor control policy to
control access to a shared public workstation [15]. Chat programs like Internet Relay Chat
(IRC) are examples of same time/different place groupware. Users communicate with
each other via shared text windows where messages are typed and responses are viewed in
real-time [16]. Other groupware that facilitates communication between users at the same
time regardless of spatial location is known as synchronous groupware. E-mail is an example of
different time/different place groupware. The final category, different time/same place
groupware, has no known applications. A bulletin board where people could leave
messages for each other demonstrates this type of collaboration [5].
[Figure 2 cells: same time/same place: face-to-face interaction; different times/same place: asynchronous interaction; same time/different place: synchronous distributed interaction; different times/different place: asynchronous distributed interaction]
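The time/space taxonomy can also be written as a small lookup. The sketch below is in Python, and the function name `classify_groupware` is an assumption made for illustration; the four labels are those of the taxonomy in Figure 2.

```python
def classify_groupware(same_time: bool, same_place: bool) -> str:
    """Return the time/space taxonomy cell for a groupware application."""
    if same_time:
        if same_place:
            return "face-to-face interaction"
        return "synchronous distributed interaction"
    if same_place:
        return "asynchronous interaction"
    return "asynchronous distributed interaction"


# Examples taken from the text: chat is same time/different place,
# e-mail is different time/different place.
chat_cell = classify_groupware(same_time=True, same_place=False)
email_cell = classify_groupware(same_time=False, same_place=False)
```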
2.1 Groupware Applications
This section presents an overview of the major classes of CSCW applications and some of
the important software systems that have been developed to implement them.
Some Common CSCW Applications: A number of common computer applications fall
under the domain of CSCW. Electronic mail consists of the asynchronous exchange of
information between a sender and one or more recipients [17].
Newsgroups operate in a manner similar to electronic mail. Information is exchanged
asynchronously between a sender and a newsgroup covering a specific topic of interest
through an activity known as “posting”. Users interested in a particular newsgroup then
download and read postings using a newsreader. Newsgroups are a more public form of
expression than electronic mail, which directs messages to a limited group of recipients
[18].
Chat allows two or more users to communicate synchronously using text. Users can add
messages to the shared text window by typing in a private compose area of the client chat
program and selecting a “send text” option. Within seconds, the text message will appear
in the shared text window of all room occupants. Chat is more conversational than
electronic mail or newsgroups, because of the real-time communication [19].
Videoconferencing is a method of synchronous distance communication between participants
using live audio and video. There is strong interest in this technology because of the time
and money involved in attending face-to-face meetings. Despite obvious benefits,
videoconferencing has not replaced face-to-face meetings because of issues like lack of
support for eye contact, difficulty integrating remote users from multiple sites, and
insufficient network bandwidth [20].
Workflow software is asynchronous groupware that helps improve the process of
performing multi-person tasks in the workplace. Some examples of improvement include
reduced lag time because manual task routing is eliminated; and better feedback about the
state of the tasks that comprise the business process [21].
Shared Windows: A shared window system allows synchronous collaboration through a
logical window physically replicated on the screens of participating users. A user provides
input and views output in this window in exactly the same manner as other windows on
the display. However, any action in a shared window is immediately reflected on the
displays of other participating users [22]. Single user applications running inside the shared
window are collaborative with no modifications. This straightforward approach to
collaboration has drawbacks, however. WYSIWIS is the only viewing option, and conflict
resolution is limited to a generalized floor control policy. Social awareness is supported to
a limited degree in some systems through telepointers and a shared transparent layer where
users can make graphical and text annotations [8]. Shared windows are used in areas
including:
• classroom/meeting support, where users can share the same view of an application
relevant to the discussion;
• technical support, where a technician can walk a user through a software problem.
VConf was one of the earliest shared window systems developed [23]. Rensselaer’s Design
Conference Room Collaboration Network takes the idea of shared windows to an extreme
by sharing windows, display, and an entire workstation between users [15]. Farallon’s
Timbuktu and Traveling Software’s LapLink are examples of commercial systems.
Multiuser Editing: Multiuser editing can be asynchronous or synchronous. Within these
divisions, further specialization occurs based on the type of information (e.g. text,
graphics). Multiuser asynchronous text editors allow multiple users to edit the same document
over time. At any specific moment, only one user can be editing the document.
Synchronization of the document among users, distributed access, and control are essential
requirements for this type of editing system. Commercial word processors like Microsoft
Wordprovide primitive support through file locking which prevents simultaneous
editing, and file sharing which provides multiuser access and control. Despite limited
capabilities, the basic asynchronous multiuser editor mirrors how collaborative documents
are produced and has been readily adopted by the business community.
More advanced asynchronous editors support a variety of collaboration styles. This is
important because it has been observed that collaboration needs change during a
document’s evolution [24]. The PREP asynchronous editor breaks a document up into
layers called columns. The main text, co-author’s notations, and comment
request/responses are examples of columns. A column is composed of chunks that
correspond to a logical unit of information (e.g. paragraph, request/response pair). Each
user receives a copy of the document. Updates to the local copy are received from other
users on a periodic basis. Specialized software helps the user visualize and integrate
remote updates into the local copy. How these updates are sent and received is controlled
by user configurable parameters of interaction. Grain size controls the size (column, chunk,
keystroke) of a document update. Flow determines when an update occurs (automatic,
upon request). Transmission speed controls how fast information must flow from one site to
another via the network. PREP also allows users to manage the task of multiuser editing.
Users are able to negotiate interaction parameters, set document access control, and make
commitments to deadlines [25]. Quilt [26] is an example of another asynchronous editor.
Multiuser synchronous text editors allow multiple users to simultaneously edit the same text
document. Changes made to the document by one user are immediately seen by all other
users. Users may be allowed separate, independent views of the document (What You See
Is Not What I See - WYSINWIS) as in the GroupKit Fish-Eye editor [27], or the views
may be linked (What You See Is What I See - WYSIWIS) as with XEROX PARC’s Cognoter
[28]. Conflict invariably arises during multiuser editing sessions. Rapport [29] uses a
floor control mechanism in which users request permission to modify sections of the
document. GROVE [5] relies on simple voice communication to resolve differences.
Cognoter [9] uses access control to prevent other users from modifying an area that is
already being changed. Public and private access to sections of the document is a desirable
capability. GROVE supports the ability to limit a section’s read/write access to one or
more users. The editing experience is enhanced through social awareness: allowing a user
to know who else is modifying the document, and where the changes are being made.
GroupKit’s Fish-Eye editor uses icons to represent each user in the editing session, and a
graphical representation of the entire document with fish-eye lenses over sections that
users are currently editing. Many editors also support telepointers (also known as ghost
cursors, or remote pointers) to indicate the location of a remote user’s pointer.
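A floor control policy of the kind mentioned above can be sketched as a single-holder lock with a waiting queue. This is an illustrative Python sketch only; it does not describe how Rapport or any other cited system implements floor control, and the class name `FloorControl` and its methods are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class FloorControl:
    """Only the user holding the floor may edit; others wait in FIFO order."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.waiting: Deque[str] = deque()

    def request(self, user: str) -> bool:
        """Ask for the floor. Returns True if `user` now holds it."""
        if self.holder is None:
            self.holder = user
        elif user != self.holder and user not in self.waiting:
            self.waiting.append(user)
        return self.holder == user

    def release(self, user: str) -> Optional[str]:
        """Give up the floor and pass it to the next waiting user, if any."""
        if user == self.holder:
            self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder

    def may_edit(self, user: str) -> bool:
        return self.holder == user
```

A policy like this trades concurrency for simplicity: conflicts cannot occur because only one user edits at a time, but every other user is blocked until the floor is released.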
Multiuser synchronous editing of complex information requires functionality similar to the
synchronous text editors mentioned above. Shared drawing systems are an example of this
kind of editor. Users share a drawing area where text, 2-dimensional graphics, and images
can be manipulated [30]. Any change made to the drawing area by one user is immediately
seen by all other users. Microsoft NetMeeting Whiteboard is a commercial example of
such a system. NetMeeting Whiteboard supports WYSINWIS by dividing the shared
drawing area into sheets, with a set of horizontal and vertical scrollbars for navigation
within a single sheet. Users are allowed to lock sheets to prevent other users from making
changes. There is no support for a private drawing area. Social awareness is limited to
telepointers. One unique feature of the system is the ability to cut and paste any visible
window or window portion onto the drawing surface [31].
Meeting Support: Meeting Support consists of technological and physical environment
additions to a conference room. Interest in this type of technology is widespread because
statistics have shown that workers spend an average of 30-70 percent of their time in
meetings [9]. Technologically, networked computers, whiteboards, shared views, and a
Group Decision Support System (GDSS) are the major components. Most CSCW
meeting rooms allocate one networked computer per attendee, and network to other
devices in the room. The stand-alone whiteboard, an important focus of attention in the
regular conference room, is integrated electronically. Rensselaer’s Design Conference
Room (DCR) [15] includes a Softboard™ whose software records activity as the user
writes with a magic marker on the whiteboard surface. In addition to saving the final
board image, the software can play back the strokes that created the image. XEROX
PARC’s DOLPHIN System [32] and Berkeley’s Colab System [9] take whiteboards to a new
level with liveboards which are essentially large touch sensitive computer displays.
Handwriting recognition, sketching, and gesturing capabilities facilitate interaction with the
device.
The issue of public and private information is an important one during meetings.
Sometimes, users may wish to share information displayed on a private display, while at
other times there is a desire for privacy. Colab allows a single window to be shared among
meeting members. This window is usually displayed on the liveboard at all times.
Anything a user wants to share with the rest of the group must be pasted into this shared
window. DOLPHIN uses a sophisticated shared hypermedia document model where
artifacts generated privately can be shared between users and the liveboard. The DCR
allows sharing through a public computer and display. Users can access the public
computer/display through their private computer’s keyboard and mouse.
Many CSCW meeting rooms include special GDSS software to facilitate the meeting
process. The DCR provides a set of flexible, unstructured tools including floor control for
controlling the public display, anonymous chat for brainstorming and private chat for side
conversations. Colab provides two applications: Cognoter and Argnoter. Cognoter is
used for group creation of presentations. Software guides the participants through three
stages: brainstorming, organizing, and evaluation. Argnoter is used for group decisions on
competing proposals. The program brings participants through three different stages:
proposing, arguing, and evaluating. GroupSystems [33] provides applications to support
brainstorming, commenting on a specific topic, and idea organization.
The physical design of the conference room is very important. Colab and DOLPHIN
accommodate six participants around a U-shaped table. The liveboard is placed at the top
of the “U”. GroupSystems accommodates 24 participants with two concentric tiered rows
of seats centered around a large shared display. Each participant has access to a computer
and display, which is slightly recessed to allow greater visual contact with other users. The
DCR uses a hexagonal table that accommodates six. Each participant has a private
computer and public access to a shared computer and display. One unique property of the
DCR is that all display devices are completely recessed within the table. This affords users
total use of the conference room table surface, removes visual obstructions completely,
and helps to make the technology less obtrusive.
Simulations: Simulations involving multiple participants have become commonplace
with the ubiquity of networked computers in diverse application domains including
defense, aeronautics, and entertainment. The U.S. Department of Defense has been
actively developing networked simulators over the past decade. The result of this effort is
the Distributed Interactive Simulation (DIS), a set of protocols that allow network
connected simulators to participate in synchronous combat operations using a shared
electronic terrain [34]. Advantages of DIS over single user simulators include group instead
of individual training, support for user participation anywhere on earth, time sensitive
challenges that demand immediate responses from the users, creation of new tasks based
on the actions of the users, and rich interaction possibilities due to the large number of
entities (user and computer controlled) simultaneously supported [35].
In the entertainment arena, multiuser games enhance the recreational experience because
they allow cooperation/competition with live users. Presumably, a live user will offer
more interesting challenges than a computer generated opponent. A synchronous
simulated automotive race is much more interesting if the car being challenged belongs to
a friend down the hall (or in the next state!) [36]. Communication between users, if
supported at all, is limited to a shared chat window. Game servers are appearing on the
internet that allow users to join in games with other users anywhere in the world, anytime
of the day or night (e.g. Microsoft’s Internet Gaming Zone, Blizzard’s battle.net, Mplayer,
Iron Wolf) [37]. Sample games include public domain systems like Xpilot and Netrek and
commercial systems like Warcraft, Quake II, and Jedi Knight.
Computer Supported Collaborative Learning (CSCL): CSCL applications occupy an
entire sub-discipline within CSCW. Any application that facilitates both cooperation and
learning falls under the CSCL umbrella. Some important areas of research include distance
learning, teaching rooms, knowledge construction, and shared reality.
Distance Learning is playing an increasingly important role at the college level. A distance
learning student is usually a full-time professional taking classes part-time. Most courses
are viewed as lectures broadcast live (or tape delayed). Interaction with the lecturer and
on-campus class is limited to the telephone, and asynchronous text exchanges (e-mail,
newsgroups, or the web) [38]. A lack of real-time interaction inhibits the kind of exchange
seen in the regular classroom, and in face-to-face collaboration [39]. Desktop
videoconferencing technology may help to solve this problem; however, it has its own
challenges. It is difficult for an instructor to maintain an awareness of remote students (i.e.
gestures, gaze direction, body language) simultaneously at multiple sites. Turn taking is
also a problem [40].
Teaching Rooms are classrooms that incorporate computing technology to facilitate
synchronous, face-to-face cooperative learning. Each student usually has access to a
network-connected computer. The computer display can be recessed to give the student
a line of sight to the lecturer. The lecturer may have the ability to display information on
his computer on a large screen visible to the entire class. The instructor may also have the
ability to project any student’s display onto the large screen [40].
Rensselaer’s Collaborative Classroom (CC) [41] has made a number of improvements to
the basic teaching room. The CC provides seating for teams of two to six students per
table. Embedded in the table is a networked Windows workstation. Students share
control of this workstation using specialized software that runs on their private laptops, or
with shared keyboards and mice provided with the table. Any computer in the room can
view the display of, or take control of any other computer in the room. This allows a variety
of interaction styles including instructor demonstration, peer learning, team meetings,
instructor consultation, client consultation and class-wide presentation and critique.
Research has shown that teaching rooms can create experiences that are more interesting
for students than the traditional classroom. The teaching room is not a panacea, and has
had mixed responses from faculty. Some refuse to return to an ordinary classroom.
Others apply newly discovered teaching techniques to the regular classroom. Still others
find changes in teaching styles are too radical, and decide to return to a more traditional
lecture format [40].
Knowledge Construction: Knowledge Construction focuses on collective building of domain
understanding. A newsgroup is a basic form of group knowledge construction. The
Computer Supported Intentional Learning Environment (CSILE) system, from the
Ontario Institute for Studies in Education, is a community database created by students on
networked computers on and off campus [42]. Students can create multimedia notes,
comment on other students’ notes (with automatic notification to the original author), and
organize notes into different informational structures. The Collaboratory Notebook
provides students access to a shared multimedia document modeled after a scientific
notebook [43]. A student can create eight kinds of pages: questions, conjectures, evidence
for, evidence against, plans, steps in plans, and commentaries. Hyperlinks provide the
ability to create non-sequential relationships between the pages. Other systems modeled
after the collaborative notebook include CaMILLE for engineering students [44] and
CALE for medical students [44]. KMap is a web-based tool for creating and browsing
concept maps [45]. A concept map is a visual representation of information and forms of
argument. KMap represents pieces of knowledge as text-labeled nodes, with links between
the nodes representing knowledge relationships. When the cursor is over a node, the user
can select from a list of associated multimedia information. KMap can be used to generate
concept maps individually or in a group, then to place them on the web for a wider audience
to comment on and improve. Some of the advantages of knowledge construction are
elimination of turn-taking problems, peer commentary, time for reflection, independent
thought, and cumulative/progressive results [42].
Shared Reality: Shared Reality refers to computer constructed worlds where students can
explore, collaborate, and learn. Examples of shared realities include Multiuser Dungeons
(MUDs), microworlds, and collaborative games. A MUD is a text based shared reality that
consists of rooms, exits, objects, and users. A server hosts the MUD, accepts user
connections, allows users to manipulate and add to the shared reality, and supports
interaction between users. Users communicate synchronously via a chat-like interface.
This same interface also reports the results of interactions with objects and rooms.
Historically, MUDs have been a form of recreational activity; however, recent applications
include MUDs for astrophysicists [46], system administrators [47], and students. For
example, MOOSE Crossing is an educational system where children develop social and
computer skills by programming rooms and objects for a MUD [48]. MUDs are an
effective community for learning because they provide motivation for learning, emotional
support, technical support, and an appreciative audience [49].
SharedARK is a system for creating synchronous, shared microworlds [50]. A SharedARK
microworld is an infinite, shared, two-dimensional “flatland” of which only a small portion
is visible on any one computer display. Users manipulate objects using a hand-shaped
pointer. The system can operate in both face-to-face and distance modes. When users
encounter each other in SharedARK, they can set up audio/video links. A basic model of
the physical world is built into the system. Users can experiment and create objects that
have mass, density, and momentum. Several applications have been created including the
Puckland [51] simulator for elastic collisions and ARKCola [52], a simulation of a soft drink
bottling plant. Experiments with SharedARK systems have shown that students are more
engaged and perform deeper evaluations of problem sets than they do when working with
paper and pencil [50].
Other examples of shared reality include MacCandy [53] and TurboTurtle [54]. MacCandy
simulates a candy factory where candies are packed in rolls of ten and rolls are packed in
boxes of ten. The system was designed to help second grade students learn about
estimation, symbology, and addition/subtraction. The microworld is the focus of
classroom-wide discussion when displayed on the instructor’s screen at the front of the
room. TurboTurtle is a system for exploring Newtonian physics, similar to SharedARK.
A distinguishing feature of the system is its sophisticated support for awareness of other
users including user lists, telepointers, and shared widget controls.
2.2 Groupware Toolkits
With so many issues to consider, building a groupware application can be a daunting task.
Researchers have attempted to reduce the development burden by producing groupware
toolkits. Most of this work has been aimed at synchronous groupware. These toolkits
contain generic building blocks that can be used to assemble a CSCW application faster
than conventional single user development tools. Typical groupware toolkits address
four important areas [8]:
Run-time Architecture – aid the programmer with process management, process
interconnection and inter-process communication
Programming Abstractions – make it easier for the programmer to synchronize
distributed events and data
Groupware Widgets – provide the programmer with a set of generic groupware GUI
tools for synchronous multiuser applications
Session Managers – allow the programmer to customize how users create, join, leave, and
manage participation in a CSCW application.
At last count, more than thirty groupware toolkits have been developed by the research
community. Toolkits frequently cited as reference systems include Groupkit [55],
Rendezvous [56], and Suite [57]. Groupkit is a Tcl/Tk based toolkit available on Unix,
Windows 95, and Macintosh platforms. It uses a replicated architecture, with event
broadcasting when local changes need to be sent to remote users. Remote events are
processed in a manner similar to local events. A large number of groupware widgets are
provided including social awareness, multiuser toolbars and text widgets, telepointers, and
transparent annotation windows. A programmer-configurable session manager is also
furnished.
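The replicated, event-broadcasting style described above can be illustrated with a minimal sketch. It is written in C++ rather than Tcl/Tk, it does not use Groupkit's actual API, and it replaces the network with direct in-process calls. The names Event, Replica, and Session are hypothetical and were chosen only for this illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical event: "set key to value" in the shared application state.
struct Event {
    std::string key;
    int value;
};

// Each user holds a complete copy (replica) of the application state.
class Replica {
public:
    // Local and remote events are applied through the same code path.
    void apply(const Event& e) { state_[e.key] = e.value; }

    // Returns 0 for keys that have never been set.
    int get(const std::string& key) const {
        auto it = state_.find(key);
        return it == state_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, int> state_;
};

// Stand-in for the network: one replica per user, all in one process.
class Session {
public:
    explicit Session(std::size_t users) : replicas_(users) {}

    // A local change is applied at the sender, then broadcast to the others.
    void localChange(std::size_t sender, const Event& e) {
        replicas_[sender].apply(e);
        for (std::size_t i = 0; i < replicas_.size(); ++i) {
            if (i != sender) replicas_[i].apply(e);
        }
    }

    const Replica& replica(std::size_t i) const { return replicas_[i]; }

private:
    std::vector<Replica> replicas_;
};
```

Because every replica applies the same events, the copies stay consistent only as long as events arrive in a compatible order; the sketch sidesteps that issue by delivering events synchronously.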
Rendezvous is a LISP/X-Windows based toolkit available on Unix platforms. It is a
centralized system based on Smalltalk’s Model-View-Controller (MVC) architecture [58].
Much of the remote event handling and synchronization is abstracted into a programmable
constraint system. By specifying constraints between user interface components and the
data model, the constraint solver automatically keeps user views and their data
synchronized. The toolkit is based on an object-oriented version of LISP that provides
over 350 reusable classes. These classes include support for telepointers, floor control, and
multiuser text and graphics. Classes are also included for session management.
Suite is a C-based user interface independent toolkit available on Unix platforms. It is a
centralized system designed around the concept of a multiuser text editor. Applications
consist of editable objects, which are made up of publicly accessible shared variables.
These shared variables are modified through calls issued from interaction variables
associated with a specific local user interface. When an end user interacts with a widget, it
modifies the interaction variable that in turn modifies the active variable. Changes to
shared variables trigger update callbacks for the interaction variables of other users. End-user
coupling configuration is one unique feature of the system. Users are able to specify
how frequently their user interface updates/is updated by the application’s shared objects.
Suite is user interface independent, so there are no groupware widgets. Session
management is enabled at a high level by giving the end user the ability to create and modify
user groups within an application session. The programmer can add additional
functionality like access control using Suite primitives.
3 A Preliminary Experiment
We gained first hand experience with the difficulties of developing a CSCW application
during the creation of CollabBillboard. CollabBillboard grew out of ideas we had been
developing about assigned roles in a team [13]. Instead of dividing a task into smaller
independent subtasks to be completed in parallel, team members are assigned different but
complementary roles for completing a shared task. Our hypothesis was that explicitly
assigned roles could induce stronger collaboration among team members. To test the
hypothesis we developed this synchronous collaborative simulation.
Although an evaluation of CollabBillboard supported the theory, we found the entire
process frustrating. The biggest problem was how much we underestimated the amount
of time needed to complete the application. It took almost three times longer than we
expected! One of the major contributors to the delay was finding physical users to help
test the application. For reasons discussed in Section 1.1, a single user, the developer, was
not sufficient to thoroughly exercise the program. It was often necessary to comb the halls
for volunteers, and as the months continued, they became increasingly reluctant.
3.1 Architecture
CollabBillboard is a synchronous face-to-face two-player simulation that attempts to
address some shortcomings of previous multiuser simulations through explicitly assigned
roles and group evaluation. Assigned roles require each user to take on a specific role
during the simulation. These roles are complementary, but non-overlapping. Both users
must cooperate within their roles in order to achieve the simulation goal. Group
evaluation uses team-based, rather than individual-based, performance criteria.
The CollabBillboard application is designed for networked personal computers running
Windows 95 or NT. The development environment, Microsoft Visual C++ (VC++), was
augmented with Microsoft Foundation Classes (MFC) for GUI support, DirectX for high
performance graphics, and Winsock for communication.
Applications developed with VC++ and MFC have a structure oriented around the user
interface. Each dialog is associated with a C++ class. Events generated by widgets in the
dialog are converted to messages that invoke class methods. To enable multiuser
capabilities, CollabBillboard pairs each dialog with a socket shadow class. The socket
shadow contains methods for communication setup/takedown, sending special events,
and receiving special events. Send event methods report local events and data that are of
interest to remote users. The receive event method converts remote user messages to a
local event and data format. Figure 3 depicts the socket shadow class for the initial dialog
panel. The member functions OnAccept and OnConnect are invoked during
communication setup/takedown. SendOK is invoked by the dialog class method ButtonOK,
which is itself invoked when the user presses the OK button. OnReceive is invoked when a
remote message arrives. For this dialog, OnReceive receives remote ButtonOK events and
invokes the same local dialog method.
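// Socket shadow for the initial dialog panel.
class CcollabBillBoardDlgSocket : public CollabBillBoardSocket
{
private:
    // Invoked during communication setup/takedown.
    void OnAccept(int theErrorCode);
    void OnConnect(int theErrorCode);
    // Invoked when a remote message arrives.
    void OnReceive(int theErrorCode);
public:
    BOOL InitializeSockets();
    // Invoked by the dialog's ButtonOK method to report a local OK press.
    BOOL SendOK();
};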
Figure 3: CollabBillboard socket shadow
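The interaction between a dialog and its socket shadow might be sketched as follows. This is an illustrative reconstruction in plain C++, not the CollabBillboard source: it uses no MFC or Winsock calls, the transport is a caller-supplied function, and the Dialog class, the "OK" message text, and the fromLocalUser flag (used here to keep a replayed remote event from being echoed back) are all assumptions.

```cpp
#include <functional>
#include <string>
#include <utility>

// Plain C++ stand-in for a socket shadow; no MFC or Winsock is involved.
class SocketShadow {
public:
    using SendFn = std::function<void(const std::string&)>;
    using LocalFn = std::function<void()>;

    // `send` transmits a message to the peer (hypothetical transport).
    // `onRemoteOK` replays a remote OK press on the local dialog.
    SocketShadow(SendFn send, LocalFn onRemoteOK)
        : send_(std::move(send)), onRemoteOK_(std::move(onRemoteOK)) {}

    // Report a local OK press to the remote user.
    void SendOK() { send_("OK"); }

    // Convert a remote message back into a local event.
    void OnReceive(const std::string& msg) {
        if (msg == "OK") onRemoteOK_();
    }

private:
    SendFn send_;
    LocalFn onRemoteOK_;
};

// Hypothetical dialog that owns its socket shadow.
class Dialog {
public:
    explicit Dialog(SocketShadow::SendFn send)
        : shadow_(std::move(send), [this] { ButtonOK(false); }) {}

    // The shadow captures `this`, so copying is disabled.
    Dialog(const Dialog&) = delete;
    Dialog& operator=(const Dialog&) = delete;

    // Assumption: only a press by the local user is reported to the peer.
    void ButtonOK(bool fromLocalUser = true) {
        okPressed_ = true;
        if (fromLocalUser) shadow_.SendOK();
    }

    SocketShadow& shadow() { return shadow_; }
    bool okPressed() const { return okPressed_; }

private:
    bool okPressed_ = false;
    SocketShadow shadow_;
};
```

In use, the transport function of one dialog would deliver its message to the OnReceive method of the peer's shadow, so that pressing OK on either machine advances both dialogs.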
The system requires one machine per user. The complete simulation state is replicated on
each machine. Participants can be situated at different physical locations. However, the
game is designed with activities that require high bandwidth communication between
participants. For this reason, a face-to-face experimental setup was used.
3.2 Experimental Method
A study was conducted to evaluate the effect CollabBillboard might have on
collaboration between pairs of users. The study used two versions of the program, one
with and another without assigned roles. Time to completion, percent of time spent
conversing, and accurate billboard placement were some performance criteria measured.
Subjects were then given a paper and pencil collaborative exercise. The results of this
exercise were compared against a solution key. Finally, the subjects were given a survey to
complete that allowed them to express their subjective feelings about the simulation and
about collaborative experiences during the session.
Figure 4: Sketch of experimental design.
A long desk with monitors at opposite ends was set up in an office. Users sat on different
sides of the desk, each in front of a monitor. The monitors were set up so that each could
be seen only by the user in front of it, and were angled so that the users could see each
other through a three-foot gap between the monitors on the table; this arrangement afforded
line-of-sight viewing for non-verbal communication.
3.3 Task Overview
Research participants worked on one of two versions of the CollabBillboard simulation.
One version of the simulation used assigned roles, while the other (the control) did not.
Participants were grouped into pairs, with each pair using one version of CollabBillboard.
When the simulation was completed, participants worked through a classic paper and
pencil collaborative exercise called Lost At Sea [59]. At the end of the experiment, the pair
was asked to complete a survey about their experiences.
Pairs of participants were scheduled for a one-hour session. When they arrived, they were
introduced to each other, the tasks to be performed were explained, and they were asked
to sign a consent form. A tape recorder was started to record the audio exchange during
the CollabBillboard portion of the session. Participants started the CollabBillboard
application on their respective machines. When network communication was established,
one of the users pressed the OK button on the initial dialog window, and both users were
presented with a task menu.
Figure 5: Selecting a billboard site in the city.
The session moderator explained that the participants were part of a fictitious advertising
company that wanted to place a billboard in the city of Boston. Two major tasks were
needed to complete the application: select a site in the city to place the billboard; assemble
the scrambled pieces of the billboard on the site’s billboard frame.
Figure 6: Control window for assembling billboard. Both users see
the same window, view the entire billboard frame and move pieces.
The first task, Site Selection, brought up a shared map of the city of Boston, Massachusetts
(see Figure 5). Telepointers were used to indicate remote user focus on the map. As users
moved over possible sites, an informational window appeared describing the site. When a
site was selected, it was highlighted. These actions appeared on both participants' maps,
with separate colors indicating a local or remote action. Once participants selected a site,
they proceeded to the second task.
The second task, Billboard Assembly, involved assembling randomly placed pieces of the
billboard in the correct order and properly centering them on a billboard frame. At this
point, the assigned roles and control versions of the program diverged. The control
version brought up a shared billboard frame that users could add billboard pieces to. Each
new piece appeared simultaneously in the same random location on both participants’
screens (see Figure 6). Participants could grab and move any piece of the billboard at any
time. The frame contained a green box representing the local user’s position in the frame.
A red box represented the remote user’s position. To move a billboard piece, a user
placed the green box on a billboard piece, selected the grab button, and then used the
directional arrows. A zoom window was included for fine-grained piece movement.
Figure 7: Assigned roles "view billboard" window. This user has a
zoomed out view of the billboard frame but cannot move any pieces
The assigned roles version of the program split the billboard piece assembly into separate
subtasks: View Placement and Place Billboard. The View Placement task presented the
user with a zoomed out view of the billboard frame. This user could see all billboard
pieces and a green box, which represented the Place Billboard user’s view. The View
Placement user could add new pieces to the frame, and move the other user’s view.
However, the View user could not move a billboard piece even if the Place user was
currently grabbing one (see Figure 7).
The Place Billboard task presented the user with a zoomed in section of the billboard
frame. The Place Billboard user could navigate around the billboard frame using the
dialog’s arrow widget. The user could also grab, move, and drop billboard pieces (see
Figure 8).
Figure 8: Assigned roles "place billboard" window. This user has a
zoomed in view of the billboard frame and can move pieces.
Complications arose with assigned roles because neither user could complete the
simulation goal independently. The Place Billboard subtask had a view that represented a
small portion of the billboard frame (approximately 1/4 of a billboard piece). This view
could be very disorienting. The View Placement task had a good view of the frame, but
did not allow the user to move billboard pieces. Consequently, both users depended on
each other to complete the billboard assembly.
Once the Billboard had been assembled in either the control or assigned roles version of
the program, the team received a score based on four factors: choice of billboard site,
properly assembled billboard, properly centered billboard, and time to completion. A brief
discussion about the score with the moderator then ensued. At this point, the tape
recorder was turned off.
The second part of the session involved a classic paper and pencil collaborative exercise
called Lost At Sea. Participants were told to read a brief scenario where they imagined
themselves on a sinking ship. They had to rank 15 items in the order that they would be
taken because the ship might sink at any moment. After the task was completed, the
moderator discussed the US Merchant Marine’s ranking of the same items.
The final activity of the session was a survey. The survey covered three areas: subjective
feelings about CollabBillboard, subjective feelings about collaboration during the session,
and personal information. When the survey was completed, the participants were
debriefed by the moderator.
3.4 Evaluation, Results, and Analysis of Team Performance
Team performance was determined using measurements depending on the stage of the
session. For the CollabBillboard stage, five team measurements were used: choice of
billboard site, properly assembled billboard, properly centered billboard, time to
completion, and conversation as a percentage of task completion time.
For the Lost at Sea stage, 17 team measurements were made. The first 15 were absolute
values of the difference between the correct ranking for each item and the team’s ranking
of the item. Next was a cumulative sum of these deltas. Finally, time to complete the
stage was measured.
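The first 16 of these measurements follow directly from the two rankings. A minimal sketch is shown below; the type and function names (Ranking, rankDeltas, totalDelta) are hypothetical, and each array entry is assumed to hold the rank (1 to 15) assigned to one item.

```cpp
#include <array>
#include <cstddef>
#include <cstdlib>
#include <numeric>

constexpr std::size_t kItems = 15;
using Ranking = std::array<int, kItems>;  // rank (1..15) given to each item

// Per-item absolute difference between the key ranking and the team ranking.
Ranking rankDeltas(const Ranking& key, const Ranking& team) {
    Ranking deltas{};
    for (std::size_t i = 0; i < kItems; ++i) {
        deltas[i] = std::abs(key[i] - team[i]);
    }
    return deltas;
}

// Cumulative sum of the deltas; zero means the team matched the key exactly.
int totalDelta(const Ranking& deltas) {
    return std::accumulate(deltas.begin(), deltas.end(), 0);
}
```

For example, a team that swaps only the first two items relative to the key has deltas of 1 for those two items, 0 for the rest, and a cumulative sum of 2.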
For the exit survey stage, 31 questions were asked to subjectively assess CollabBillboard
and collaborative experiences during the session. Most of these questions used a rating
scale from one to five, with lower numbers representing a more positive feeling about the
question and higher numbers indicating a negative feeling. A “no opinion” option was
available for each question.
The complete details of the results and analysis of team performance are available in [13].
The results and analysis of our study support the hypothesis that assigned roles can
improve collaboration both during the simulation and in subsequent group activities.
Although it took longer for the assigned roles group to complete the simulation, they
produced higher quality results, indicating more effective collaboration.
Conversation, another measure of collaboration, occurred during 85% of the assembly task
for assigned roles and only 44% of the assembly task for the control. On the second
collaborative activity, the assigned roles group completed the work in less time with
superior results. In every instance that the exit survey had statistically valid mean
differences, the responses were more positive about collaboration in the assigned roles
group.
3.5 Lessons Learned from the Development of CollabBillboard
The development of CollabBillboard was a lengthier process than we had anticipated. Our
original schedule called for three months to be spent on application development, but in
actuality, eight months were needed to complete the system. A number of lessons were
learned from reflecting on the experience. The lack of development tools, in particular, a
VC++ groupware toolkit, contributed to the delay. Originally, we intended to build only
the assigned user roles version of CollabBillboard. Building a second, control version was
necessary to evaluate the system.
However, the majority of our time was spent developing, testing, and reworking the
human-computer and human-human interfaces for the application. These interfaces
account for a sizeable portion of the elements that make up a CSCW application. Testing
them was a continual problem. Finding subjects to help test the application was also a
challenge. It was often necessary to comb the halls for volunteers, and as the months went
on, they became increasingly reluctant. The next several paragraphs present additional
problems that we uncovered in the process of developing CollabBillboard that we feel may
have been detected earlier and resolved more efficiently with a multiuser-testing
environment.
Usability testing examines the program's human factors issues [14]. General application issues
include: Is the application appropriate to the user's background and experience? Are outputs
meaningful and non-offensive? Are error diagnostics meaningful? Are the interfaces
consistent throughout the application? Are there too many options? Is the system easy to
use? There is no formula for constructing a CSCW application because it is not always
clear how some of the issues discussed in Chapter 1 should be addressed. As with other
GUI intensive applications, the correct implementation can require many iterations of a
prototype followed by usability testing. Other issues requiring iterative usability testing
include user interaction coordination, user awareness, undo/redo, locking policy, and
session management.
In our implementation of CollabBillboard, we found that tight coupling of telepointers was
visually distracting when users tried to select a site to place the advertising billboard. After
several iterations of the program, a looser coupling was implemented where the local user
was informed only when the remote user made a site selection [13].
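The difference between the two coupling levels can be expressed as a filter on outgoing events. The sketch below is an assumption about how such a filter might look, not the CollabBillboard code; the names MapEvent, MapEventKind, Coupling, and shouldForward are hypothetical.

```cpp
// Hypothetical event kinds on the shared site-selection map.
enum class MapEventKind { PointerMove, SiteSelected };

struct MapEvent {
    MapEventKind kind;
    int siteId;  // meaningful only for SiteSelected
};

enum class Coupling { Tight, Loose };

// Decides whether a local map event should be sent to the remote user.
// Tight coupling forwards everything (continuous telepointer updates);
// loose coupling forwards only completed site selections.
bool shouldForward(const MapEvent& e, Coupling coupling) {
    if (coupling == Coupling::Tight) return true;
    return e.kind == MapEventKind::SiteSelected;
}
```

Keeping the policy in one function of this kind would let a prototype switch between coupling levels during usability iterations without touching the event handlers.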
We ran into several shared workspace synchronization problems because local user actions
interfered with the processing of a remote user action. For example, in an early prototype
of the system one user could rotate the billboard picture. In tests with live users, it was
relatively easy for them to create a scenario where their pictures were rotationally
unsynchronized. One problem that we had a lot of difficulty with later was correctly
reflecting the positions of billboard pieces moved by the remote user. It took about a
week of test trials with live users to find and debug the error. A similar kind of problem
occurred with enforcing boundary conditions on pieces moved by remote users.
Stress testing subjects the program to heavy loads or stresses. A stress test differs from a
load test in that it focuses on data volume over time versus just data volume [14].
Synchronous CSCW applications are particularly susceptible to stress problems because of
interactivity requirements. Event processing on both the network and user machines is
a common cause of interactivity loss. In CollabBillboard, for example, the control version
of the simulation locked out the local user's mouse events when a remote user flooded
the system with billboard piece move events. Several days of investigation uncovered a
flaw in the Windows 95 OS design that gave network and DirectX graphics events priority
over local mouse events. To circumvent this design, coupling was loosened by creating a
temporal buffer that accumulated remote user draw events until a timer expired.
Compatibility/Conversion testing identifies problems between the new software and preexisting
programs and data [14]. Conversion issues revolve around the ability of the new software
to support persistent storage data formats from earlier versions or other programs.
Conversion may also require the new software to output data in a format readable by
preexisting software. The distributed nature of CSCW applications makes them
particularly susceptible to compatibility problems when different machines have different
versions of the executable. CollabBillboard suffered from several compatibility problems.
Version 1.0 of CollabBillboard was made publicly available on the web in September 1997.
A second version of the program was made available in April 1998. The event data
generated by these versions are incompatible because of a change from reporting relative
coordinates to absolute coordinates on the billboard frame. Version 2.0 of
CollabBillboard provides two separate applications: user roles and control. Since the
communication protocols for both forms are identical syntactically, it is possible to
connect a client from one application with a server from the other. This combination
results in an unstable environment that causes the application to crash when the users
begin the billboard assembly task.
Recovery testing exercises the software's ability to handle programming,
hardware, and data errors [14]. To test programming errors, code can be injected with
problems (e.g. hard coding an invalid assert). Simulation is a common technique for
testing hardware errors (e.g. returning a network message with an incorrect number of
bytes). Data errors can be purposely created to analyze the system's reaction (e.g. user
types in "-1" as the number of participants in a CSCW session). In addition to general kinds of
recovery testing, CSCW applications should also test the effects of unpredictable or hostile
remote user actions. Early testing of CollabBillboard discovered a problem when one user
quit the session while the other user remained. The remaining user was able to use the
application for several minutes until the system hung. The problem turned out to be that
the network messaging API buffered messages for the non-existent remote user and when
the buffer overflowed, the system froze.
4 Survey of Prior Work in Testing Systems
As discussed in the previous chapter, our preliminary experiment with CollabBillboard
provided us with first hand experience developing a synchronous CSCW application. One
of the greatest difficulties we encountered was testing the software. Because it was a
multiuser synchronous system, we needed several physical users exercising the application
simultaneously. Because it was a GUI application with human-human and human-computer
interactions, we went through continual iterations to get the interface correct.
Most of the people we asked for testing assistance were willing to help a few times, but we
began to try their patience around the fourth or fifth system build.
This chapter presents a survey of the state of the art in testing. The first goal of the survey
was to uncover the major contributions made by academia and industry to software
testing. The second goal was to understand current testing system shortcomings that
prevent CSCW developers from effectively testing an application. The chapter is
organized around four main sections. Section 4.1 lists the important goals of testing.
Section 4.2 presents the research community's contributions to testing organized by the
software life-cycle process. Section 4.3 discusses academic contributions to GUI-based
testing. Finally, Section 4.4 analyzes three commercial testing systems.
4.1 Goals of Testing
Testing during the software lifecycle is a process by which the behavioral properties of the
software are verified. These properties are correctness, utility, reliability, robustness, and
performance.
A program is behaving correctly if it "satisfies its output specifications independent of its use
of computing resources when operated under permitted conditions" [10]. Correctness is
neither a necessary, nor a sufficient condition for an acceptable program. Correctness is
not necessary because some kinds of errors can be tolerated. For example, in a graphical
editor, a "drag graphical object" command might cause artifacts to appear on the drawing
surface along an object's path. This kind of behavior might be considered a bug, but is
acceptable if the user is provided with some form of drawing surface refresh command
that removes the artifacts. Correctness is also not a sufficient condition for an acceptable
program. A program may satisfy its specifications, but the specifications may be incorrect.
The utility of a program is determined by the extent to which it meets user needs. Utility
answers questions about things like ease-of-use and cost effectiveness. Typically, a
program is utility tested in a friendly environment with only valid input. Utility is
extremely important, because if the product does not perform useful functions, then there
is no point in further testing. Work done with Rensselaer's DCR illustrates this
importance. A great deal of effort was expended developing a floor control policy for
shared use of the system's public workstation. The policy was implemented in software as
a FIFO queue. Meeting participants taking control of the public workstation had to make
a request, which was added to the queue behind other requests. When the participant was
at the top of the request queue, s/he was allowed to control the public workstation.
Although the system was straightforward from a programming standpoint, analysis
showed that users tended to ignore the floor control policy, opting instead for a simple
control interrupt capability added later.
Reliability refers to a program's mean time to failure. Ideally, the program and its
supporting infrastructure should never fail, but verifying this level of reliability
can be prohibitively expensive. One area where high reliability is justified is life-critical
applications such as aviation software. The Federal Aviation Administration refuses to
allow commercial off-the-shelf (COTS) software in any portion of the nation's aviation
system, relying instead on thoroughly tested, but expensive, customized software. COTS
software, like Windows 95, is notoriously unreliable, and while a simple reboot for a
system hang is tolerated by most PC users, it could spell disaster for a busy air traffic
control system [60]. For less critical applications a return on investment analysis can
determine how much testing will ensure a level of reliability that will keep customers
satisfied.
A program is considered robust if it is able to handle different, possibly hostile, operating
conditions, input, and users. The application should tolerate a variety of operating
environments in its supporting infrastructure. This infrastructure includes hardware and
software associated with the network, CPU, disk, and graphics device. A robust CSCW
application, for example, should gracefully handle heavy network loads when trying to
send and receive events between users. Handling invalid input is also important. If a user
types "-1" for the number of participants in a collaborative session, the system should
prompt for a correction. Hostile user actions should also be anticipated. If the user
hosting a CSCW session exits before the rest of the team, the application should ensure
either that the session artifacts are saved, or that the session continues by using a different
host.
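The invalid-input case above can be sketched in a few lines of Python. The function name and the upper limit of 16 participants are hypothetical; the text only states that an entry such as "-1" should lead to a prompt for correction.

```python
def parse_participant_count(text, max_participants=16):
    """Return a valid participant count, or None if the user should be re-prompted.

    max_participants is an assumed limit, not a value taken from CollabBillboard.
    """
    try:
        count = int(text.strip())
    except ValueError:
        return None  # not a number at all
    if 1 <= count <= max_participants:
        return count
    return None  # out of range, e.g. "-1" or "0"
```

A caller would loop on the prompt until the function returns something other than None.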
Performance is another important criterion that must be verified before the CSCW
application is released. Interactive feedback from a user action must be approximately 16
milliseconds to avoid a feeling of sluggishness [11]. An additional rule of thumb is that
local user performance should always take priority over processing remote user actions.
This means that the developer must be careful that tight coupling does not impact local
activity. In the CollabBillboard application, movement of billboard pieces during the
assembly task was tightly coupled. When live user testing began, it was discovered that
piece updates from the remote user created a feedback cycle that excluded local user
actions until the stream of remote updates ended. Piece movement coupling had to be
loosened to allow local user actions to be processed. The choice of centralized versus
replicated architecture has a big impact on performance. A replicated architecture will
usually have better performance, while the centralized architecture will have less
complicated synchronization and locking mechanisms. One method for verifying the
system will perform acceptably for its chosen architecture is to observe how it behaves
under a scalability test. Network performance is critical to the overall performance of the
CSCW application. The application may consume too much bandwidth when sending
messages between machines. This happens when messages occur too frequently, contain
too much information, or both. Acceptable bandwidth use can be verified by exercising
the application over the network. The application also needs to be tested under various
network conditions including heavy traffic from sources outside the application, increased
traffic from scaling the number of users, and message delay over a wide area network.
4.2 Research Testing Systems
This section discusses the contribution that the research community has made to the
state of the art in testing. It is organized around a modified version of the phases of the software life
cycle: requirements, specifications, design, implementation, integration, and maintenance.
The software life-cycle model describes the process of creating and maintaining a software
application. Competing life-cycle models have been developed over the past several
decades. These models were created to combat the inefficient process of “build and fix”
where developers built some software components, showed the results to the client, and
fixed the software based on client feedback. Most popular models in the literature evolved
from the Waterfall Model [61]. In the waterfall model, software production is broken
down into six stages: requirements, specifications, design, implementation, integration,
and maintenance. Figure 1 depicts the rapid prototyping life cycle model. The goal of this
model is to quickly turn around versions of the software for client evaluation. Less intercycle
feedback reduces the amount of time it takes to produce a prototype. Rapid
prototyping is particularly useful for user interface development. The client can be
continually involved in the process of creating a friendly, useful, user interface. Feedback
from commercial software development has led to the creation of the incremental model.
The incremental model develops a product as a series of progressive builds, with each
build adding a new set of functions to the application. Each build creates a completely
runnable system with increasingly powerful capabilities [10].
4.2.1 Requirements
The purpose of testing in the requirements phase is to determine if the software team
correctly understands the user's requirements. Building a prototype and discussing the
program with potential users is an effective way of accomplishing this goal [10].
4.2.2 Specification
The purpose of testing in the specification phase is to determine if the software team has
correctly translated the functions required by the user into a software specification. The
most common forms of specification testing are walkthroughs and inspections. A
walkthrough consists of periodic meetings by a small team (led by the author) that reviews
the specifications document. The team size rarely exceeds five people and the meetings
last less than 2 hours. During a meeting, the goal is to discover, but not correct, problems.
The author can correct the problems later. Individuals prepare for the meeting by
reviewing the specification and requirements documents [10].
An inspection is a more highly structured process consisting of five formalized steps:
overview, preparation, inspection, rework, and follow-up. The Institute of Electrical and
Electronics Engineers (IEEE) has published an international standard for the inspection
process [62]. The overview step is a preliminary meeting where members of the inspection
team are assigned roles and given specific tasks to prepare for the inspection. In the
preparation step team members examine the specification from the perspective of their
assigned roles and prepare checklists for verification during the group inspection. A series
of inspection meetings are then held with the team measuring the specification against the
checklist. Again, problems are only identified, not solved during these sessions. The
rework step corrects problems discovered in the specification. Follow-up ensures that the
rework corrected the problems identified, and didn't introduce any new ones. Although
no formal studies have been done, it is thought that inspections take more time but are
more effective than walkthroughs. IBM's cleanroom verification technique uses inspection
as the main verification tool throughout the software life cycle [63].
Specifications can be written in a variety of formats from informal prose to a formal
algebraic description. The testing research community is interested in formal specifications
because of the potential for early, automated debugging, testing, and analysis [64]. Recall
that the sooner a problem can be found in the development cycle, the less expensive it is
to fix (see Section 4). A formal specification can also be useful in later stages of the
software life cycle such as design phase mathematical proofs of correctness (Section 4.2.3),
and input selection/output analysis for functional testing (Section 4.2.4). There are two
kinds of formal specification: process-based and model-based.
A process-based specification views the program as being comprised of subprograms. A
critical part of the specification is to formally specify the interfaces between subprograms
and abstract data types (ADT). The specification is developed using a top-down process
where successive revisions of the specification result in smaller subprograms with greater
interface and ADT detail. The finest level of detail is a formal algebraic notation.
Reusable generic specifications are one advantage of this technique. For example, instead
of describing an integer specific sort routine, a generic sort routine could be specified.
This routine is written at a high enough level that it could sort any data type (e.g. integer,
real, programmer defined). When a specific kind of sorting is needed, another refinement
of the routine is performed with the data type needed [65]. Larch [66] is an example of a
system that supports the process-based specification technique.
A model-based specification is a formal mathematical model of the entire software system.
The specification not only describes the interfaces and data structures of the software
system, but also describes state behavior in a formal way. Z [67] and the Vienna Development
Method [68] are examples of model-based specifications. Z uses a set/relation notation
where components that make up the software system are represented as schemas. The
following is an example of a Z schema for the CSCW application CollabBillboard:
MoveBillboardPiece
    ∆Billboard
    owner?: OWNER
    pieceID?: PIECEID
    x?: ℕ
    y?: ℕ
    ----------------------------------------
    owner ∈ (Billboard.pieceList(pieceID?)).owner
    0 ≤ x? ≤ XMAX
    0 ≤ y? ≤ YMAX
    Billboard.pieceList(pieceID?).x = x?
    Billboard.pieceList(pieceID?).y = y?
Figure 9: Z Language schema for CollabBillboard
The MoveBillboardPiece function is responsible for updating the x,y location of a
billboard piece on CollabBillboard's shared workspace. ∆Billboard at the beginning
indicates that the schema will change the system state by altering Billboard. The schema
signature describes the input variables and their data types. For example, x?: ℕ indicates
that the input variable x can be any natural number. Schema predicates indicate
conditions that must hold for system state and input variables. For example, x must be a
natural number no greater than the width of the workspace (XMAX) for
the billboard piece to be displayed properly. Schema predicates can also contain set or
relation operations. The partial predicate pieceList(pieceID?) performs a range
lookup on the set pieceList. This set represents a total mapping of piece IDs (domain)
to actual piece structures (range). The pieceList(pieceID?) predicate returns the piece
whose ID is represented by the input variable pieceID?.
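To make the schema predicates concrete, the following Python sketch checks them before applying the state change. It is an illustration only: the class, the function name, and the XMAX/YMAX values are hypothetical, and modeling the piece's owner field as a set is an assumption drawn from the ∈ in the schema.

```python
from dataclasses import dataclass, field

XMAX = 640  # hypothetical workspace width
YMAX = 480  # hypothetical workspace height


@dataclass
class Piece:
    owners: set = field(default_factory=set)  # assumed representation of "owner"
    x: int = 0
    y: int = 0


def move_billboard_piece(piece_list, owner, piece_id, x, y):
    """Move the piece only if every schema predicate holds; return True on success."""
    piece = piece_list.get(piece_id)          # pieceList(pieceID?) lookup
    if piece is None:
        return False
    if owner not in piece.owners:             # owner ∈ ...owner
        return False
    if not (0 <= x <= XMAX and 0 <= y <= YMAX):  # 0 ≤ x? ≤ XMAX, 0 ≤ y? ≤ YMAX
        return False
    piece.x, piece.y = x, y                   # the state change marked by ∆Billboard
    return True
```

A rejected call leaves the piece where it was, which mirrors the schema's requirement that the state changes only when the predicates are satisfied.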
The Test Template Framework [64] uses a Z specification to create cases for
implementation testing. An analysis of the schema signature is done to create an input
space for each variable. A variable's input space is refined into a valid input space through
schema predicate constraints. The valid input space is then grouped into categories using
techniques based on the category partition testing method [69]. The result of this
processing is a Z language specification for a set of generic test cases. Actual test cases are
instantiated by executing the function derived from the specification with data that satisfies
the Z specification for the input variables. Analysis of the test case results is performed by
comparing output against schema signature and predicate constraints. Other specification-based
testing work includes Hayes's [70] techniques for constructing input/output
constraints from a Z schema specification, and Stanford's Anna [71] system for runtime
checking of Ada programs using specification derived constraints.
In addition to general specification systems like Larch and Z, specialized systems have
been developed for concurrent and real-time programming. Specialized systems providing
verification support include Concurrent Temporal Logic (CTL) [72], an SCR specification
system for event-driven applications; Graphical Interval Logic (GIL) [73], a visual temporal
specification system; and the constrained expression toolkit [74] for real-time programs.
[Timing diagram; interval label: remoteUpdate$n ^ timerExpired]
Figure 10: GIL specification for queueRemotePieceUpdate$n
GIL allows the temporal properties of a concurrent system to be specified using an
annotated graphical timing diagram. GIL developers claim that the graphical notation of
timing diagrams is superior to a temporal logic text specification because visualization
increases the understanding of relationships between the temporal properties of the
system. The semantics underlying GIL allow the diagrams to be converted into
propositional temporal logic, which can then be run through a proof checker. The proof
checker is not automatic, and must be told which diagrams should be included in a
particular proof. Figures 10 and 11 depict a GIL specification for a portion
of the CollabBillboard application.
queueRemotePieceUpdate$n, 0 ≤ n ≤ (Number of Billboard Pieces - 1)
A remote update event arrives for a piece of the billboard and the event is queued until a
timer expires.
Queuing a remote update keeps remote piece movement from interfering with local user
performance. The remoteUpdate$n boolean is set to TRUE when a remote update event
arrives from another user for Billboard piece n. timerExpired is set to TRUE every 200
milliseconds. The interval depicted by this timing diagram indicates that as long as remote
piece updates are being received and the timer hasn't expired, then the condition
queueRemotePieceUpdate$n will be TRUE.
drawRemotePieceUpdate$n, 0 ≤ n ≤ (Number of Billboard Pieces - 1)
Figure 11: GIL specification for drawRemotePieceUpdate$n
A billboard piece is redrawn if it has been queued as a remote update and the timer has
expired.
[Timing diagram labels: remoteUpdate$n ^ timerExpired; queueForRedrawPiece$n; redraw; remoteUpdate$n]
The conditions that identify this interval are that the piece has had a remoteUpdate$n
event associated with it since the last time the timer expired. The implication arrow (→)
indicates that if these conditions for the interval are met then the remoteUpdate$n
boolean will be set to FALSE for the billboard piece and the piece will be considered
queued for local redraw until the actual redraw occurs.
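The queue-then-draw behavior described by the two GIL specifications can be sketched as a small buffer class. This is an illustrative Python sketch, not the CollabBillboard implementation: the class and method names are hypothetical, only the 200 millisecond timer interval comes from the text, and keeping just the latest position per piece is an assumption.

```python
class RemoteUpdateBuffer:
    """Holds remote piece updates and releases them when the timer expires."""

    def __init__(self, interval_ms=200):
        self.interval_ms = interval_ms
        self.pending = {}        # piece index n -> latest (x, y) from the remote user
        self.last_flush_ms = 0

    def on_remote_update(self, piece, x, y):
        # Plays the role of remoteUpdate$n becoming TRUE: queue, do not draw yet.
        self.pending[piece] = (x, y)

    def on_tick(self, now_ms):
        """Return the queued updates if the timer has expired, else an empty dict."""
        if now_ms - self.last_flush_ms < self.interval_ms:
            return {}
        self.last_flush_ms = now_ms
        ready, self.pending = self.pending, {}   # clears the remoteUpdate$n flags
        return ready
```

Between timer expirations the local event loop never sees remote draw work, which is the loosened coupling the specification is meant to capture.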
Start Mode            In Site  Remote In Site  Button Down  Remote Button Down  End Mode
Clear Map             F        F               F            F                   Clear Map
                      @T       -               F            F                   Site Info
                      -        @T              F            F                   Remote Site Info
Site Info             T        F               F            F                   Site Info
                      @F       F               F            F                   Clear Map
                      T        @T              F            F                   Remote Site Info
                      T        @F              F            F                   Site Info
                      T        -               @T           F                   Site Selected
Remote Site Info      F        T               F            F                   Remote Site Info
                      @T       -               F            F                   Site Info
                      F        @F              F            F                   Clear Map
                      F        T               F            @T                  Remote Site Selected
Site Selected         -        -               F            F                   Site Selected
                      F        -               @T           F                   Clear Map
                      -        F               F            @T                  Clear Map
                      -        T               F            @T                  Remote Site Selected
Remote Site Selected  -        -               F            F                   Remote Site Selected
                      F        -               @T           F                   Clear Map
                      -        F               F            @T                  Clear Map
                      T        -               @T                               Site Selected
Table 1: SCR Table for CollabBillboard
The Software Cost Reduction (SCR) method is a formal method for specifying the
requirements of real-time systems. The SCR method has been used successfully in a
variety of application domains including aviation, telephony, and nuclear power. System
behavior is modeled as a relationship between two types of variables: monitored variables,
which denote environmental quantities monitored by the system, and controlled variables,
which denote environmental quantities the system controls. Conditions, events, and tables
provide details on how monitored variables affect controlled variables. A condition is a
predicate defined on one or more variables in the specification. When any variable
changes value, it is called an event. An SCR table specifies a variable's value based on
conditions and events [75].
Table 1 is a mode transition table for the CollabBillboard application's shared map
task. The purpose of this type of SCR table is to show how the system state changes
because of new input conditions. The left-most and right-most columns represent the
current mode and new mode respectively. Clear Map is a mode where the cursor is not on
any billboard site and no site has been selected. Site Info and Remote Site Info are modes
where the cursor is over one of the billboard sites and an information box appears
describing the site. Site Selected and Remote Site Selected are modes where one of the
users has selected a site for billboard placement. In these modes, the selected site is
highlighted with a special yellow (local) or gray (remote) box. The In Site, Remote In
Site, Button Down, and Remote Button Down columns represent condition variables that are
monitored by the system. An environmental condition can have four possible values: T
(currently TRUE), @T (just turned TRUE), F (currently FALSE), and @F (just turned FALSE).
A dash (-) means the condition's value does not affect the transition.
Several verification tools have been developed for SCR specifications. The SCR* system
[75] provides a consistency checker to detect syntax errors, incomplete variable definitions,
or circular variable definitions. The CTL system [72] converts an SCR specification into a
finite state machine and a set of temporal logic propositions. The converted specification
is then nondeterministically executed. The execution proceeds in discrete time units,
which represent single state transitions. Since a transition is activated every time unit, at
least one of the current state's transitions will be enabled at all times. As the machine is
executing, the temporal properties that must hold are checked.
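To make the finite state machine reading of an SCR table concrete, the sketch below encodes the Clear Map and Site Info rows of Table 1 and steps through them. It is a hedged Python illustration, not part of any SCR tool: the function names are hypothetical, T and F are treated as tests on the current value only, and staying in the current mode when no row matches is an assumption.

```python
# Condition order: (In Site, Remote In Site, Button Down, Remote Button Down).
TABLE = {
    "Clear Map": [
        (("F", "F", "F", "F"), "Clear Map"),
        (("@T", "-", "F", "F"), "Site Info"),
        (("-", "@T", "F", "F"), "Remote Site Info"),
    ],
    "Site Info": [
        (("T", "F", "F", "F"), "Site Info"),
        (("@F", "F", "F", "F"), "Clear Map"),
        (("T", "@T", "F", "F"), "Remote Site Info"),
        (("T", "@F", "F", "F"), "Site Info"),
        (("T", "-", "@T", "F"), "Site Selected"),
    ],
}


def cell_matches(pattern, prev, cur):
    """Check one table cell against a condition's previous and current value."""
    if pattern == "-":
        return True
    if pattern == "T":
        return cur
    if pattern == "F":
        return not cur
    if pattern == "@T":
        return cur and not prev      # just turned TRUE
    if pattern == "@F":
        return prev and not cur      # just turned FALSE
    raise ValueError(pattern)


def next_mode(mode, prev, cur):
    """Return the end mode of the first matching row, or the same mode if none match."""
    for patterns, end_mode in TABLE[mode]:
        if all(cell_matches(p, a, b) for p, a, b in zip(patterns, prev, cur)):
            return end_mode
    return mode
```

For example, moving the cursor onto a site while in Clear Map makes In Site turn TRUE, which selects the second row and yields Site Info.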
The GIL and CTL systems represent two competing verification techniques for real-time/concurrent
specification systems: theorem proving and state-based verification. The problem
with theorem proving is that it is difficult to automate. The GIL system, for example,
requires the user to indicate by hand the specifications that will be used to verify a
particular constraint. State-based systems like CTL suffer from an exponential explosion
in the number of states that must be explored to completely verify a system. The
constrained expression toolkit [74], [76] provides tractable automated verification using
integer programming.
The constrained expression toolkit is used for bounding the time between events in a
concurrent real-time system. It converts an Ada-like specification into a set of finite
automata, one for each process in the system. The alphabet of each automaton consists of
symbols for computation within the process and for synchronous communication with
other processes. A set of transition variables is assigned to each automaton edge. These
variables count the number of times the edge is traversed during process execution. Start
and halt variables are assigned to each node with an exiting start event edge, and entering
halt event edge. If the process does not contain a start event edge, then all nodes are
labeled with a start variable. The same technique is used for processes without the halt
event edge. Equations are then derived by treating the automata as a network flow
where the number of times a state is entered equals the number of times it is exited.
Additional equations are added by forcing each automaton to start and halt at exactly one
place. A final set of equations can be added by recognizing that transition variables
representing communication between processes must sum to the same value. Once the
equations have been determined, an integer-programming objective is established, for
example:
$\sum_{i} t_i x_i$
where $t_i$ is the time it takes to move along the edge labeled with transition variable $x_i$ and $x_i$
is the number of times the edge has been traversed. The bounds can be determined by
using integer-programming techniques to solve for the minimum and maximum values of
the objective. Although in general integer programming is NP-complete, there are special
cases that reduce to polynomial time linear programming. The types of equations
generated by the constrained expression toolkit generally reduce to one of these special
cases.
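The shape of the resulting optimization problem can be shown on a toy automaton. The Python sketch below is not the toolkit's algorithm: it replaces an integer-programming solver with brute-force enumeration, and the three-edge automaton, the edge times, and the loop bound are all hypothetical.

```python
from itertools import product

# Hypothetical automaton: start -> A (x1), A -> A loop (x2), A -> halt (x3).
TIMES = (2, 3, 1)   # assumed t_i for the three edges
MAX_LOOPS = 4       # assumed bound on how often the loop edge is traversed


def feasible(x):
    """Flow equations for the toy automaton."""
    x1, x2, x3 = x
    starts_once = x1 == 1                # the automaton starts at exactly one place
    halts_once = x3 == 1                 # and halts at exactly one place
    flow_at_a = x1 + x2 == x2 + x3       # entries into A equal exits from A
    return starts_once and halts_once and flow_at_a and x2 <= MAX_LOOPS


def time_bounds():
    """Return (min, max) of sum(t_i * x_i) over all feasible traversal counts."""
    totals = [
        sum(t * n for t, n in zip(TIMES, x))
        for x in product(range(MAX_LOOPS + 2), repeat=3)
        if feasible(x)
    ]
    return min(totals), max(totals)
```

Enumeration is only workable for a toy case like this one; the point is the structure of the constraints and the objective.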
4.2.3 Design
The essential difference between specification and design is that the specification states
what the program is supposed to do, while the design shows how the program will do it.
The purpose of testing in the design phase is to ensure a correct implementation of the
specification. Informal techniques like walkthroughs and inspections are commonly used
in design verification (see Section 4.2.2). Formal techniques, such as proofs of correctness,
are also used.
One proof of correctness technique uses mathematical induction on loop invariants. The
idea is to identify a set of variables and characteristics about those variables that do not change
from loop iteration to iteration. A proof by induction is then performed on the variables
and their characteristics [77]. An alternative proof technique is Hoare's axiomatic method,
which uses deduction on axioms derived from program statements [78], [79]. Formal
proofs of correctness have not found widespread acceptance in the software community as
a verification tool. This is due to a number of factors including the mathematics skill
needed to manipulate predicate calculus and temporal logic, the immense effort needed to
prove even the smallest of designs, and the inability to automate the process because
human intervention is needed to determine things like loop invariants. Despite these
drawbacks, these formal techniques have been successful in a number of domains,
particularly where the cost of verification is negligible compared to the cost of program
failure (e.g. NASA space missions) [10]. However, even if the cost of correctness proving
could be ignored, it is not a panacea for software verification because "we can never be
sure that the specification is correct" and "we can never be certain that the verification
system is correct" [80].
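A loop invariant is easiest to see on a small routine. The Python sketch below checks an invariant at run time with assertions; this illustrates what an inductive proof would establish but is not itself a proof, and the routine is a hypothetical example that assumes n is non-negative.

```python
def sum_of_first(n):
    """Return 0 + 1 + ... + (n - 1) for n >= 0, checking the loop invariant."""
    total = 0
    i = 0
    while i < n:
        # Invariant: total equals the sum of the integers below i.
        assert total == i * (i - 1) // 2
        total += i
        i += 1
    # The invariant together with the exit condition (i == n) gives the postcondition.
    assert total == n * (n - 1) // 2
    return total
```

The base case is the first evaluation of the assertion (i = 0), and the inductive step is that one pass through the loop body preserves it.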
4.2.4 Implementation
In the implementation phase, the actual code has been written and the process of verifying
the physical program commences. Verification during the implementation phase is also
known as unit, functional, or module level testing. Two kinds of testing are performed
during this phase: black box and white box.
Black box testing ignores the internals of a routine and uses the specification to determine
the expected output given a specific input. Testing is performed by executing the routine
with input and analyzing the output for correctness. This form of testing is attractive
because the tester does not have to be concerned with the internals of the routine, which
allows someone other than the routine author to perform the test. The problem with
black box testing is that in order to thoroughly test a routine, all possible inputs must be
tried. This results in a combinatorial explosion that causes the test of even a simple
routine to be computationally infeasible [14]. Consider the following routine from
CollabBillboard that returns the distance between a pair of two-dimensional points:
float distance(int x1, int y1, int x2, int y2);
In order to thoroughly black box test this routine, the function must be executed and
verified for each possible x and y value. Assuming 32-bit integers, this would result in $2^{128}$
test cases, requiring more than $10^{23}$ years to complete on an Intel Pentium 166/MMX
machine.
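The order of magnitude can be checked with a short calculation. In the Python sketch below the rate of 10^8 tests per second is an assumption chosen for illustration, not a measured figure for the Pentium 166/MMX.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_test_years(bits_per_arg=32, num_args=4, tests_per_second=1e8):
    """Years needed to run every input combination at the assumed test rate."""
    cases = 2 ** (bits_per_arg * num_args)   # 2^128 for four 32-bit arguments
    return cases / tests_per_second / SECONDS_PER_YEAR
```

At the assumed rate the result is on the order of 10^23 years; a slower, more realistic rate only makes it larger.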
Equivalence partitioning is a method for reducing the number of black box test cases. The
idea is to partition the input space into a set of equivalence classes where any input value in
a class is equivalent to any other input value in the class. Equivalence partitioning
eliminates the need for exhaustive testing because only one representative test needs to be
performed for each equivalence class. For example, suppose a routine that calculates cos θ,
where θ is the angle in degrees, is to be tested. By examining the behavior of the cosine
curve between 0 and 360 degrees a number of equivalence classes emerge (see Table 2). If
the equivalence classes are chosen correctly, then a single test with any value from class I
(e.g. 5 degrees) should be sufficient for testing the entire range of values from 0 to 89
degrees. The actual input values tested can be selected by hand, or automatically by
random sampling [81] from each equivalence class.
Equivalence Class   Cosine Behavior          Range
I                   1 → 0                    0 → 89
II                  0 → -1                   90 → 179
III                 -1 → 0                   180 → 269
IV                  0 → 1                    270 → 359
V                   1 → 0 → -1 → 0 → 1       Negative multiples of 360 - (0 → 359)
VI                  1 → 0 → -1 → 0 → 1       Positive multiples of 360 + (0 → 359)
Table 2: Equivalence classes for cos
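Random selection of one representative per class can be sketched as follows. The Python code covers only classes I to IV of Table 2; the function names are hypothetical, and the routine under test is stood in for by the standard library cosine.

```python
import math
import random

# Classes I-IV from Table 2: (name, first degree, last degree).
CLASSES = [("I", 0, 89), ("II", 90, 179), ("III", 180, 269), ("IV", 270, 359)]


def pick_representatives(rng):
    """Pick one random angle in degrees from each equivalence class."""
    return {name: rng.randint(lo, hi) for name, lo, hi in CLASSES}


def cos_degrees(theta):
    """Stand-in for the routine under test."""
    return math.cos(math.radians(theta))
```

Passing a seeded `random.Random` instance makes the chosen representatives reproducible between test runs.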
Despite some heuristics for performing equivalence partitioning [14], it is essentially a
manual process. The process requires a deep understanding of both input parameters and
the purpose of the routine to be tested. There is no way to guarantee that a partitioning
scheme is correct; that each value in an equivalence class will exercise the same code in the
same manner in a routine. In the cos θ example above, if the developer decided to
implement the function using a lookup table, then the equivalence classes in Table 2 would
be insufficient. The technique is not foolproof, but attempts to create a manageable
number of test cases with maximum impact.
Boundary value analysis is an enhancement to equivalence partitioning. The idea is that
test cases that use input values near the boundaries of equivalence classes have greater
impact. Input values are generated from below, on, and above the edges of an equivalence
class. More formally [10]:
For each range (R1, R2) of an equivalence class, five test cases should be created:
(1) < R1
(2) = R1
(3) R1 < α < R2
(4) = R2
(5) > R2
In the cosexample above, the boundary conditions for class I would be the following
angles:(-1,0, 1,45,89,90,91).
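The five-case rule can be generated mechanically for an integer range. In this Python sketch the function name is hypothetical and the midpoint is an assumed choice for the interior value; any value strictly between R1 and R2 would satisfy case (3).

```python
def boundary_values(r1, r2):
    """Return five integer test inputs for the range (r1, r2), in rule order."""
    if r2 - r1 < 2:
        raise ValueError("range needs at least one interior integer")
    interior = (r1 + r2) // 2   # assumed choice of an interior value
    return [r1 - 1, r1, interior, r2, r2 + 1]
```

For the range 0 to 89 this yields -1, 0, 44, 89, and 90.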
Figure 12: Control flow graph for loop with five possible
logic paths
White box testing uses a routine's internal logic to create test cases. Test case output is
compared against expected output given input values and the specification. The advantage
of white box testing is precise control over the routine logic exercised by each test case.
Unfortunately, testing every possible logic path results in a combinatorial explosion similar
to black box testing. Consider the control flow graph shown in Figure 12 of a loop with 5
possible logic paths per iteration.
To thoroughly test every logic path for 20 iterations of the loop would require 5^20 + 5^19 +
… + 5^1 ≈ 10^14 test cases. Assuming an Intel Pentium 166/MMX machine, it would take
approximately 21 days to simply execute the test cases. This doesn't count time spent
analyzing the results.
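The size of this explosion is easy to check. The short Python sketch below sums the number of distinct paths for loops that run between 1 and 20 iterations with 5 possible paths per iteration; the function name is an illustrative assumption.

```python
def total_paths(paths_per_iteration, max_iterations):
    """Sum p^1 + p^2 + ... + p^n, the path count for loops of 1..n iterations."""
    return sum(paths_per_iteration ** k for k in range(1, max_iterations + 1))


print(total_paths(5, 20))  # 119209289550780, roughly 1.2 x 10^14
```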
Another problem with path coverage is that it doesn't guarantee that all states will be
exercised in the implemented program. Different input values can cause the program to
behave differently even if the same path is executed. Consider the statements in Figure 13.
Setting theNumber = -1 and theNumber = 0 will cause the same path to be executed, but in
one case the program will print out "MINUS -1" and in the other it will generate a divide by
zero fault.
Figure 13: Code fragment from CollabBillboard
Statement coverage reduces the combinatorial explosion of test cases by ensuring every
statement in the program is executed correctly at least once. One problem with this
approach is that particular test data may give the illusion of statement correctness.
Consider the following code sequence from CollabBillboard:
if (point.x < 0) point.x = 0; if (point.x > YMAX) point.x = XMAX;
if (point.y < 0) point.y = 0; if (point.y > YMAX) point.y = YMAX;
Test Cases:
1 - point.x = -1; point.y = -1;
2 - point.x = XMAX + 1; point.y = YMAX + 1;
This code fragment ensures the upper left corner of the billboard frame's movable view
window stays within bounds of the drawing surface. The problem with the statements
above is that the upper bounds for point.x should be XMAX, not YMAX. The error is difficult
to detect because an x-value greater than XMAX will always trigger the upper bounds
conditional: the drawing surface is wider than it is tall, so XMAX > YMAX.
Branch coverage provides statement coverage and additionally ensures every conditional path
is executed at least once. Branch coverage will test the conditionals in Figure 13 for both
TRUE and FALSE conditions. Data from a test case that should trigger a FALSE path
execution for the if (point.x > YMAX)… statement might identify the conditional error.
Although an improvement over statement coverage, branch coverage is still very sensitive
to test case data selection.
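The sensitivity to test data can be made concrete with a Python transcription of the clamping statements. This is a sketch only: the values chosen for XMAX and YMAX and the function names are hypothetical, and the corrected version is included purely as a reference for comparison.

```python
XMAX, YMAX = 640, 480  # hypothetical drawing surface, wider than it is tall


def clamp_buggy(x, y):
    """Transcription of the faulty statements: x is compared against YMAX."""
    if x < 0:
        x = 0
    if x > YMAX:  # bug: should compare against XMAX
        x = XMAX
    if y < 0:
        y = 0
    if y > YMAX:
        y = YMAX
    return x, y


def clamp_correct(x, y):
    """Reference version with the intended upper bound for x."""
    return min(max(x, 0), XMAX), min(max(y, 0), YMAX)


# The two statement-coverage test cases execute every statement yet hide the bug.
for case in [(-1, -1), (XMAX + 1, YMAX + 1)]:
    assert clamp_buggy(*case) == clamp_correct(*case)

# A case meant to take the FALSE path of the x upper-bound check exposes it.
print(clamp_buggy(500, 100), clamp_correct(500, 100))  # (640, 100) (500, 100)
```

Only an x-value that lies between YMAX and XMAX distinguishes the two versions, which is why branch coverage helps here only if the FALSE-path test data happens to fall in that interval.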
Numerous path coverage techniques have been devised which exercise paths through the
code. Combinatorial explosion is avoided by executing paths through the code a non-zero
minimum number of times. A common path coverage technique constructs a control flow
graph to find paths through the code [82]. Path coverage performance can be improved by
discovering the minimum number of paths that have to be traversed to cover all paths [83].
One of the challenges of path coverage is to discover the input values that will cause a
particular path to be executed. Using data flow graphs to create def-use paths for variables
used in the program [84] makes it easier to discover how a particular input value affects
program flow. DFG analysis cannot automatically select input values for test cases, but it
can let the tester know what paths still need to be traversed for a particular variable. The
DELLA PASTA [85] system extends the def-use technique to parallel programs. The core
of the DELLA PASTA system is an algorithm that creates paths for variables defined in
one thread and used in another. The system is very limited in that it only works in a
shared memory architecture and provides no control over the temporal aspects of
execution which can also influence path coverage.
4.2.5 Integration
When the program has been implemented and individual functions have been tested, it is
time to test the program as a whole. Integration testing approaches revolve around how
modules are assembled for verification. Separate integration verifies each module separately,
then modules are combined all at once and the entire program is tested. Top-down integration
integrates and verifies the highest level modules first with stubs for functions in lower level
modules. This technique is excellent for identifying major design flaws early in the
software life cycle, but does a poor job of detecting flaws in lower level modules. Bottom-up
integration assembles and verifies the lower level modules first and tests the higher level
modules later. This technique is excellent for identifying problems with lower level
functions, but high-level design flaws are detected late in the life cycle. Sandwich integration
divides modules into low level "utility" functions, and high level "glue-like" logic functions.
Bottom-up integration is performed on the utility modules and top-down integration is
performed on the logic modules [10].
4.2.6 System Testing
When the program has been implemented, its individual functions and combined
functions tested against the specification, there is still verification to perform. System
testing refers to verifying the program against the requirements, not the specification [14].
Facility testing verifies that each objective discussed in the requirements is actually met by
the program. Volume testing subjects the program to heavy volumes of data. Stress testing
subjects the program to heavy loads or stresses. A stress test differs from a volume test in
that it focuses on data volume over time versus just data volume. Usability testing examines
the program's human factors issues. Security testing tries to subvert the program's security
mechanisms. Security testing is particularly important in CSCW where issues of privacy
and user roles arise. Performance testing ensures that the program meets requirements for
response times and throughput under various workloads and configurations. Configuration
testing examines how the program operates in a variety of hardware and software
environments. Memory testing is a specific form of configuration testing that verifies the
software's main and secondary storage needs. Compatibility/Conversion testing identifies
problems between the new software and preexisting programs and data. Install testing
exercises the procedures involved with getting the software installed and running.
Reliability testing is performed implicitly throughout the software life cycle (see Section 4.1:
Reliability). Recovery testing exercises the software's ability to handle situations when
programming, hardware, and data errors occur. Serviceability testing investigates
requirements for fixing and maintaining the program. Documentation testing verifies that the
user documentation is correct. Some verification techniques include document inspection,
and incorporating every example into the test case suite. Procedure testing deals with the
verification of procedures that users must follow. Acceptance testing is the final test before
the software is formally delivered to the user community.
4.3 Human Computer Interaction Testing
Human Computer Interaction testing research has focused primarily in two areas: testing
architectures and usability testing. Automated testing research has examined the problems
encountered when a testing system is used to automate the evaluation of applications with
graphical user interfaces. Usability testing has attempted to provide guidelines and
evaluation techniques for an application’s user interface.
4.3.1 Testing Architectures
Script reusability has been the major focus of academic testing architectures. Because of
the highly iterative nature of GUI application development, the test scripts recorded with
one version of the application quickly become invalid. The bitmap comparison techniques
used in early systems were insufficient because of dependencies on precise location and
content of the GUI. Advocating a programmatic approach, early researchers argued that
test scripts that drive the application by identifying the GUI components
programmatically, rather than graphically, have less sensitivity to specific application state
[86].
Use a simple and natural dialog
Provide an intuitive visual layout
Minimize a user’s memory load
Be consistent
Provide feedback
Provide clearly marked exits
Provide shortcuts
Provide good help
Allow user customization
Minimize the use and effectiveness of modes
Support input device continuity
Figure 14: Usability guidelines from [87]
The Test Development Environment (TDE) addresses this issue with a visual test
development system that abstracts low-level GUI events into higher-level operations on
specific GUI components [88]. An organizational tool is provided to group operations
into scripts and store them in a design library. To create a test case, the tester uses a visual
programming environment to select a set of scripts from the library. The visual language
includes provisions for if/then and looping control constructs. Data variance using form-based
constraints is also included to increase script reusability. Low-level application
events are regenerated from the high level operations to exercise the application. When a
new version of the application is developed, the TDE examines the GUI components
using the components it is aware of from the scripts in the design library. Discrepancies
are identified and can be corrected by the tester with the help of mapping wizards included
in the TDE.
Other techniques attack script reusability by generating test cases automatically. To
thoroughly test the application, however, each GUI action has to be tried in combination
with every other GUI action. Like black and white box testing, this creates a combinatorial
explosion of test cases. Several approaches have been investigated to reduce this growth.
Pair-wise grouping restricts the length of an interaction chain to two. The creators of this
approach found a significant reduction in test cases without a corresponding drop in
detected bugs [89]. Latin-squares arranges n distinct GUI interactions in an n x n grid
where every interaction occurs exactly once in each row and once in each column [90].
Test case reduction without significant loss of bugs was also found using this approach.
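A Latin square of GUI interactions can be produced with the standard cyclic construction, sketched below in Python. The construction is a common textbook one and is not necessarily the one used in [90]; the interaction names are hypothetical.

```python
def latin_square(interactions):
    """Build an n x n grid in which each interaction appears once per row and column."""
    n = len(interactions)
    # Row r is the interaction list rotated left by r positions.
    return [[interactions[(row + col) % n] for col in range(n)] for row in range(n)]


# Each row can be read as one test sequence of n interactions.
for sequence in latin_square(["open", "edit", "save", "close"]):
    print(sequence)
```

With n interactions this yields n sequences of length n, compared with the n! orderings that exhaustive sequencing would require.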
Artificial Intelligence (AI) planning techniques have also been used [91]. One system
analyzes the application’s GUI to derive a set of user actions. The test designer manually
encodes pre- and postconditions for each interaction (e.g. to display panel X the user must
press button Y). The designer then defines start and goal states for the application. The
system uses an AI planner to find a path from the start state to the goal state using the
GUI interactions encoded by the designer. Test case reduction is achieved because only
one path is generated for each goal state. Unfortunately, like the techniques in Section
4.2.4, approaches that eliminate test cases can’t guarantee that all problems will be found in
an application.
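The planning idea can be illustrated with a small breadth-first search in Python. This is only a sketch of the general approach and not the planner used in [91]: the action names, the representation of a state as a set of facts, and the use of breadth-first search are all assumptions made for the example.

```python
from collections import deque

# Hypothetical interactions: name -> (preconditions, facts added, facts removed).
ACTIONS = {
    "press_button_Y": (frozenset({"main_window"}), frozenset({"panel_X"}), frozenset()),
    "open_app": (frozenset(), frozenset({"main_window"}), frozenset()),
    "close_panel_X": (frozenset({"panel_X"}), frozenset(), frozenset({"panel_X"})),
}


def plan(start, goal, actions=ACTIONS):
    """Return one shortest action sequence from start to a state containing goal."""
    start, goal = frozenset(start), frozenset(goal)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if goal <= state:
            return path
        for name, (pre, added, removed) in actions.items():
            if pre <= state:  # the action is applicable in this state
                next_state = (state - removed) | added
                if next_state not in seen:
                    seen.add(next_state)
                    queue.append((next_state, path + [name]))
    return None  # the goal is unreachable with the encoded interactions


print(plan(set(), {"panel_X"}))  # ['open_app', 'press_button_Y']
```

Because the search returns a single path per goal state, it generates one test case per goal, which is the source of the test case reduction described above.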
In addition to script reusability, researchers have investigated visual programming, script
analysis, and multi-modal scripting. A methodology and architecture has been created for
testing visual programs like spreadsheets [92]. The system defines cell relation graphs and
constructs compiler-like “definition-use” links between cells that define values and cells
that use definition cells. The testing system highlights dependent cells that have not been
tested. To exercise the cell, the tester changes the value of one or more definition cells.
Highlighting is removed once the code in a dependent cell is executed.
GUITESTER uses script analysis to determine usability problems in an application’s user
interface [93]. Scripts of different users performing the same application task are analyzed.
The analysis extracts common interaction patterns, mean mouse movement distances,
mean interval between user actions, and the proportion of users who were unable to
complete each sub-task. This information is used to identify clarity, safety, simplicity, and
continuity problems. For example, a long mean distance between mouse clicks in a
relatively short interval could mean the user interface suffers from a continuity problem.
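Two of the measures mentioned above, mean mouse movement distance and mean interval between actions, can be computed from a recorded script as in the following Python sketch. The event format, a time-ordered list of (timestamp in seconds, x, y) mouse clicks, is a hypothetical one chosen for illustration and is not GUITESTER's script format.

```python
import math


def click_metrics(events):
    """Return (mean distance, mean interval) between consecutive mouse clicks."""
    if len(events) < 2:
        raise ValueError("need at least two events")
    pairs = list(zip(events, events[1:]))
    distances = [math.dist(a[1:], b[1:]) for a, b in pairs]
    intervals = [b[0] - a[0] for a, b in pairs]
    return sum(distances) / len(pairs), sum(intervals) / len(pairs)


print(click_metrics([(0.0, 0, 0), (1.0, 3, 4), (3.0, 3, 4)]))  # (2.5, 1.5)
```

A large mean distance combined with a small mean interval would be the kind of signal the text associates with a continuity problem.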
Multi-modal scripting integrates additional data into a script recording to improve the
richness of script playback. A script can be enhanced with synchronized videotape and voice
captured at the time the user exercised the application. Observer text and voice
annotations can be added later [94]. MITRE’s Multi-modal Logger allows multiple
applications and simultaneous users to be recorded in a single script [95]. The rich
information provided by these recording systems adds important context to the
application during playback analysis.
4.3.2 Usability Testing
Academia and industry have produced numerous guidelines for user interface design and
evaluation [40], [87], [96], [97], [98]. The guidelines range in size from a concise set of one-line
statements, as in Figure 14, to a detailed breakdown and description of every aspect
of a graphical user interface.
Most researchers agree on the general principles for a good user interface. Research is very
active, however, in determining if an application violates these principles. Techniques
include empirical evaluation, where users are observed using the application in a usability
lab or in the field. Observers go to great lengths to avoid contact with subjects in order to
preserve realistic application use. Empirical evaluation is an excellent tool for real-world
observation; however, it can be very expensive and time consuming [40]. Another
technique, the walkthrough, uses deliberate attempts to expose usability problems in the
application. Typically, quality assurance personnel or human factors experts perform the
walkthrough, rather than regular users. Walkthroughs provide a cost and time efficient
evaluation, but suffer because they lack a real world setting. Karat provides an excellent
survey of walkthrough techniques including pluralistic walkthroughs, heuristic evaluations,
cognitive walkthroughs, think-aloud evaluations, and scenario-based reviews [87]. More
recently, advocates for participatory evaluation have voiced the opinion that having
evaluators and possibly developers in the same room with real users offers the benefits of
both empirical and walkthrough techniques [99], [100].
4.4 Commercial Test Systems
A survey of testing would be incomplete without a review of modern commercial testing
systems. Unfortunately, there appears to be little contact between academia and the
commercial testing community. Statements from researchers, such as "testing tools for
CSCW applications are non-existent"[8] are simply untrue. At ISSTA '98, the premier
annual academic conference on testing, over a dozen well-known researchers were
questioned about multi-user testing architectures. None of them, including an individual
citing SQA Suite™ in a conference paper, was aware of any multi-user support.
The lack of a rigorous review of commercial testing in the literature necessitated examining
a variety of alternative information sources including:
USENET's comp.sys.testing which provides a regularly updated list of over 200
commercial and public domain testing tools.
Reviews in the Software Testing Online Resources (STORM) web site maintained
by Roland Untch at Middle Tennessee State University [101].
Software review articles from commercial magazines [102], [103], [104], [105].
Several discussions with Dr. Anne Ferraro who performed a review of commercial
testing systems for Microstrategies, Inc. [106].
Test software company web sites.
Several criteria were used to determine a system's desirability. First, the system had to run
on Windows95/NT platforms. This was necessary because software for the Collaborative
Classroom, including CollabBillboard, was developed on these platforms. Second, the
system had to support multi-user testing. Determining this capability was challenging
because marketing literature uses words like "stress testing", "load testing", and
"client/server testing" inconsistently. Sometimes this meant that the product was capable
of multi-user testing. Other times this meant that the product could be used to simulate
loads or client behavior in a single user environment. Finally, the company had to provide
an evaluation copy. Four systems were initially selected: Platinum Technology's Final
Exam C/S-Test™ [107], Mercury Interactive's Test Suite™ [108], Rational Software's SQA
Suite™ [109], and Segue's Silk Enterprise Edition™. Unfortunately, negotiations with
Segue broke down before an evaluation copy of their system was obtained.
The review of commercial systems is organized around a reference testing architecture. A
software test environment (STE) can be broken down into six functional categories: test
execution, test development, test failure analysis, test measurement, test management, and
test planning [110].
4.4.1 Test Planning
Test planning provides the tools necessary for managing staff, schedules, and resources
necessary for product testing. Areas covered by this function include features of software
to be tested, detailed test plans, risk assessment, organization training needs, resource
needs, staffing needs, staffing roles, staffing responsibilities, and schedule.
SQA Suite™ and TestSuite™ provide extensive tools for test planning. SQA Suite™
defines the testing process as a sequence of six steps: Test Planning → Test Development →
Test Results → Defect Tracking → Summary Reporting and Analysis. SQA Manager is
provided to define and organize test requirements. Test requirements are defined using a
hierarchical folder/document tree. Folders describe testing objectives, with higher-level
objectives appearing closer to the root. The leaves are documents, which describe the
detailed low level requirements for a specific test.
TestSuite™ defines the testing process in three steps: Test Planning → Test Execution →
Bug Tracking. TestSuite™ merges test planning and development into a single step
following IEEE Standard 829 [111]: Define Goals (requirements) → Define Major
Capabilities to Test (specification) → Define Tests (design) → Define Steps for each Test
(implementation) → Automate Tests (automation). Testing is viewed as a life cycle that
parallels software development. Wizards are provided which guide the tester through each
step of the planning process.
Final Exam C/S-Test™ does not provide any test planning facilities.
4.4.2 Test Management
Test management deals with the storage and maintenance of test artifacts and their
interrelationships. A sophisticated storage mechanism, such as a database, is needed to
maintain artifact relationships.
SQA Suite™ manages the entire testing process through the SQA Manager [109] program.
SQA Manager allows the tester to perform test planning, archive developed test cases,
archive the results from test execution, and perform analysis on the test case results. Email
and bug tracking support is also provided to tie development, quality assurance, and
management into the process. The artifacts of the test process are stored in either a
Microsoft Access or Sybase relational database. SQA Manager provides a query
mechanism for information in the test repository. Unfortunately, the data model for the
repository is not exposed, so there is no way to link the test system into other development
tools, such as the code library. This would be useful for synchronizing the bug fixes on
the development and test side. A graphing and report writer facility is also included for
reviewing and analyzing software defect information (e.g. age and priority of outstanding
defects, defect ownership, number of defects over time).
TestSuite™ provides similar management through TestDirector [112]. The repository uses
Microsoft Access and exposes some of the data model to external applications.
Specifically, read only views are available for test case results. This allows the tester to run
standard report writing tools against the results. TestDirector provides excellent support
for testing during the iterative GUI development process. When a new build is brought
into the test system, the widgets on each dialog are analyzed. If there are differences
between the widgets in the new build and previous build (e.g. a widget was deleted), and
the archived test cases have dependencies on these differences, then the system will alert
the tester and provide a wizard to help modify the test cases. Like SQA Manager, bug-tracking
facilities are not integrated with any code library systems. Another shortcoming
of TestDirector is a lack of query tools for data archived in the repository. A graphing and
report writer is also included for defect analysis.
Final Exam C/S-Test™ does not provide any test management facilities.
4.4.3 Test Development
Test development adds the ability to specify test executions. A test suite is developed for
the software under verification. The suite consists of individual test cases. Each test case
includes the input required to run the case, adequacy criteria to determine if the case
passed or failed, and documentation.
Final Exam C/S-Test™ records user actions performed on the application under test
(AUT). Actions are written to a test script, which can then be played back. User actions
are divided into two categories. High level actions involve the manipulation of a GUI
widget (e.g. pushing a button). Low level actions involve device level manipulation (e.g.
mouse click, or keyboard press). The recorder interprets actions at a high level whenever
possible. This gives the test script greater flexibility during execution. A test script that
records an OK button press is more flexible than one that records the absolute screen
coordinates of the mouse click that caused the button press. If the script is run with the
AUT at a new position on the display, the high level action will be replayed, while the low
level one will cause undesired behavior. The following test script action sets the keyboard
and mouse input focus to the specified window:
setwindow( titlename, internalId, dbKey, dbId, delay);
titlename is the text string name of the GUI window. internalId is a special C/S-Test™
internal identifier for the window. dbKey is used to look up information about the
window in a special Windows95/NT repository. dbId identifies the name of the
repository. delay specifies the maximum amount of time the replay system should wait
before deciding that the window cannot be found.
Once a set of user actions has been recorded, the script can be enhanced with constructs
from the Test Manipulation Language (TML). TML is a weakly typed, C-like language that
includes conditionals, loops, subprograms, and four variable types: string, float, int, and list.
User exit support is provided so the script writer can call on pre-compiled
subroutines developed in other languages like C and C++.
TestSuite™ provides a similar recording tool and scripting language [108]. In addition to
delaying actions with a timer, the language provides a waitbitmap() function which pauses
test script execution until a geometric area of the AUT matches the specified bitmap.
SQA Suite™ provides a recording tool with an extremely small footprint on the screen.
This is an important benefit over the other two test systems. One of the problems with
recording test cases was that whenever there was a need to interact with the test-recording
tool, the actions necessary to get to the recording tool were also recorded in the test script.
The small footprint provided by SQA Suite™'s tool meant that the program's interface
could be placed in a location next to, but not on top of or underneath the application.
SQA Suite™'s scripting language [113] is a powerful subset of Visual Basic. Support is also
included for any program written in Microsoft's Visual Basic if the user doesn't want to
perform multiuser tests.
4.4.4 Test Execution
Test execution exercises the software and records the results of the execution. The
software exercised may have been specially instrumented for testing. The artifacts of
test execution include test system and program output, execution traces, and bookkeeping
data (e.g. when test was run, against what build/configuration, with what test case data, by
whom). Systems supporting only test execution were the first kind of STEs developed.
Final Exam C/S-Test™ provides a single system window for test recording, playback, and
analysis. To execute a test case, the tester opens a script file and issues the run command
through the system window. In order to begin the test, the AUT must be in the same state
that it was when the test script was recorded. A text window displays the test script,
highlighting the line currently being executed. A debugger is provided which allows the
tester to single step through the script, set breakpoints, and query the contents of any
script variable. The playback command has two speed options: actual and fast. Actual will
replay the script actions at the same speed they were recorded. Fast will replay the script
actions with smaller default delays. The results of the test case are saved in a log file for
later analysis. Test scripts can be run in automatic batch mode by creating a script with a
sequence of testExec(fileName) commands (where fileName is the name of a test script
file).
TestSuite™ views test execution as a more formal process consisting of test cycles,
automated and manual tests, and test result analysis. Four test cycles are identified: sanity,
normal, advanced, and regression. A sanity test cycle tests the breadth of the application
and consists mostly of tests that should have positive results. Normal and advanced cycles
increase the depth of application testing and contain cases that are more destructive. The
regression cycle verifies that changes in the AUT didn't cause failures in other areas of the
application. In addition to a batch mode support for scripts, TestSuite™ supports manual
testing within the system. During a manual test, a dialog box is provided which allows the
tester to indicate pass/fail status of the test and make comments.
TestSuite™'s debugger is comparable to Final Exam C/S-Test™. In addition, it provides
a variable watch list that allows the user to view the values of variables and expressions as a
test script is executing. Scripts can be played back in three modes: verify, debug, and
update. The default mode, verify, executes the script and performs implicit and explicit
verification. Debug mode allows the script to be played back with the debugger. Update
sets the reference data used in implicit and explicit verification to be data from the current
run.
TestSuite™ allows the tester to set a number of execution options beyond the script's
playback speed. The min_diff parameter defines the number of pixels that constitute a
threshold match for bitmap verification. delay defines a frequency check for window
stability. A window is sampled at the rate specified by delay until two consecutive passes
result in the same display. This ensures the window is stable for verification or
synchronization checks.
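The stability check described here amounts to polling until two consecutive samples agree. The Python sketch below illustrates that logic under stated assumptions: `capture` is a hypothetical callable returning a comparable snapshot of the window, and the `max_samples` cutoff is added for the example so the loop cannot run forever. It does not reproduce TestSuite™'s actual implementation.

```python
import time


def wait_until_stable(capture, delay, max_samples=100):
    """Sample capture() every `delay` seconds until two consecutive samples match."""
    previous = capture()
    for _ in range(max_samples - 1):
        time.sleep(delay)
        current = capture()
        if current == previous:
            return current  # window contents unchanged across two passes
        previous = current
    raise TimeoutError("window did not stabilize")


# Usage with a fake window that settles on its third frame.
frames = iter(["blank", "half-drawn", "drawn", "drawn", "never-reached"])
print(wait_until_stable(lambda: next(frames), delay=0))  # drawn
```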
SQA Suite™ views test execution in two phases: test development and regression testing.
Test development is the process of creating, debugging, and baselining test cases for the
AUT. Regression testing executes the developed test cases against the AUT's
current build. The results of the execution are compared against the case's baseline. Any
discrepancies are reported as potential errors. Although SQA Suite™ supports batch
mode for scripts, it does not integrate manual testing into the process.
The SQA Suite™ script debugger is comparable to TestSuite™'s. Because Visual Basic
allows complex data types, the debugger also includes a data structure browser. SQA
Suite™ only supports verify and debug execution modes. The baseline for a test case
must be collected during recording. Script execution options focus on script playback
speed, and matching window captions. Caption matching is a particular problem if an
application is supported on different versions of the Windows operating system. For
example, Windows 3.1 only supports 8 character filenames with 3 character extensions.
The tester is also able to set test log options before executing a test script. These options
include the level of detail written to the log (all, pass/fail, fail) and whether the results of
the test should be written to the test repository. Finally, error recovery options are
available. The user can specify how the playback should proceed if a script command fails,
test case fails, or the AUT crashes.
4.4.5 Test Analysis
Test analysis examines a test case, both during and after execution to determine pass or
failure. Artifacts from failure analysis include test case pass/failure, and a report for each
failure. Some STEs with failure analysis capability use a test oracle, a subsystem that
automatically analyzes software behavior and output during test execution. All-purpose
test oracles do not currently exist, but several domain specific oracles have been
developed. Poirot [114] analyzes the execution of parallel programs to determine and
isolate performance problems. TAOS/GIL [115] compares a program's temporal
specification against the trace of its implementation execution. TAOS/Reactive [116]
requires the tester to translate specification locations where certain conditions must hold
true to the same location within the implementation.
An oracle is then constructed by creating assertions on these conditions in the
implementation.
Final Exam C/S-Test™ TML includes six kinds of verification
statements that the script recorder can select. Bitmap verification allows the user to
identify a GUI widget or a geometric subset for comparison. A graphical snapshot is taken
of the area at recording time. When the test script is run, a pixel by pixel comparison is
made between the snapshot and the same area of the AUT during playback. GUI object
verification saves the state of one or more GUI widgets. During test script playback, a
comparison is done between a widget's saved state and actual state on the AUT. Text
verification is a special verification tool used for applications that support complex fonts,
such as a WYSIWYG editor. Snapshots of the text area are taken and processed using
Optical Character Recognition techniques to extract the actual text. Comparisons are
made between the text at record and playback time. File verification performs a byte by
byte comparison of files generated at record and playback time. A user exit is provided
so that the tester can define application specific verification routines. TestSuite™ and
SQA Suite™ provide similar verification tools.
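The bitmap and file verifications described above reduce to element-wise comparisons. The following Python sketch shows the idea under simplifying assumptions: bitmaps are represented as lists of rows of pixel values, files as byte strings already read into memory, and the `max_diff` tolerance is an illustrative parameter. None of the names correspond to TML statements.

```python
def count_pixel_differences(snapshot, playback):
    """Pixel-by-pixel comparison of two equally sized bitmaps (lists of rows)."""
    same_shape = len(snapshot) == len(playback) and all(
        len(a) == len(b) for a, b in zip(snapshot, playback))
    if not same_shape:
        raise ValueError("bitmaps must have the same dimensions")
    return sum(p != q
               for row_a, row_b in zip(snapshot, playback)
               for p, q in zip(row_a, row_b))


def bitmaps_match(snapshot, playback, max_diff=0):
    """Pass when no more than max_diff pixels differ (0 means an exact match)."""
    return count_pixel_differences(snapshot, playback) <= max_diff


def first_mismatch(recorded, replayed):
    """Byte-by-byte file comparison: offset of the first difference, or None."""
    for offset, (a, b) in enumerate(zip(recorded, replayed)):
        if a != b:
            return offset
    if len(recorded) != len(replayed):
        return min(len(recorded), len(replayed))  # one file is a prefix of the other
    return None
```

An exact pixel match is brittle when fonts or colors vary between machines, which is one reason the text distinguishes bitmap verification from GUI object and text verification.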
Figure 15: Final Exam C/S Test Multiuser Architecture
In Final Exam C/S-Test™, the results of a test execution are written to a log file. The log
file contains verification pass/fail statements, test script parse and runtime errors, and user
defined messages entered into the log file via the log() script command. A text browser is
provided so the user can review the log. There are two viewing options: all and fail. All
displays all log file output. Fail displays only test script failures. Both TestSuite™ and
SQA Suite™ provide more sophisticated log file analysis tools.
[Figure 15 component labels: Monitor; Server (x3); Workstation (x3)]
SQA Suite™ provides a special browser called the SQA Test Log Viewer [117]. The Log
Viewer displays an abstraction of the log file that initially lists ten different kinds of log
events, the date and time the event occurred, and a pass/fail status. Examples of events
include start of a test script, call/return from a procedure, general protection fault, and
script command failure. The tester can apply a filter to the event log to view only specific
event types. The tester can get more detail about certain events in the log by selecting the
event. For example, a test case event that has a failure status will display the script
command that actually caused the failure. By double clicking on the test case event, the
user can jump to the actual command in the test script editor. SQA Suite™ also provides
a special comparator application, which allows the tester to compare the results of a test
with the original baseline to determine whether the recorded failure is actually a problem. There
are comparators for images, GUI objects, and text. If a test failure has been determined to
be a program defect, the tester can enter a defect into the SQA Repository. The defect
number will automatically be assigned to the test case results in the log file. TestSuite™
provides a log file viewer with capabilities similar to those of SQA Suite™, integrated in the WinRunner
application.
SQA Suite™ includes a graphing package specifically for performance analysis. The
execution times for test scripts and specific start/stop timer script commands are recorded
in the log file. The tester can extract the results from the log file and display them on one-
and two-dimensional graphs. Several types of graphs are supported. Elapsed Times -
Summary shows the average elapsed times of repeated executions of a series of test
scripts. Elapsed Times - Chronology shows changes in elapsed time over the series
of test script runs. Elapsed Times: Avg Min Max shows the average, minimum, and maximum values
of repeated executions of a series of test scripts. Performance graphs a series of test script
runs against the size of data processed. Errors graphs error frequency by test script. Neither
TestSuite™ nor Final Exam C/S-Test™ provides any performance graphing utilities.
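The data behind an average/minimum/maximum graph of elapsed times can be computed in a few lines. The following Python sketch assumes the elapsed times have already been extracted from the log file into a dictionary keyed by test script name; that structure and the function name are illustrative assumptions, not part of SQA Suite™.

```python
from statistics import mean
from typing import Dict, List, Tuple


def summarize(elapsed: Dict[str, List[float]]) -> Dict[str, Tuple[float, float, float]]:
    """Map each test script to (average, minimum, maximum) elapsed time."""
    return {
        script: (mean(times), min(times), max(times))
        for script, times in elapsed.items()
    }
```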
4.4.6 Test Measurement
Test measurement includes test coverage measurement, analysis, and instrumentation for
data collection during execution traces. Artifacts include test coverage measures. Section
4.2.4: White box testing discussed test coverage issues. Instrumentation presents a testing
challenge because code that has been instrumented behaves differently than the original
code [118]. Standard profiling tools such as prof exist for single-process programs and
provide call graphs, statement and function counts, and timing statistics. For parallel
programs, instrumented communication libraries, such as the Portable Instrumented
Communication Library (PICL), which trace send/receive events and record
communication statistics, can be used [119]. One problem with massively parallel programs
is that their size and lengthy execution times can result in extremely large execution traces.
Selective instrumentation reduces the amount of data collected by allowing the tester to
select when and what parts of the program will be instrumented. Paradyn, for example,
allows code to be instrumented and de-instrumented on the fly [120].
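The idea of selective instrumentation can be sketched in Python with a probe that is switched on and off at run time. This is only an analogy: Paradyn instruments running binaries, whereas the hypothetical Instrumenter class below simply wraps functions and counts calls while enabled.

```python
import functools
from collections import Counter
from typing import Any, Callable


class Instrumenter:
    """Counts calls to probed functions, but only while enabled."""

    def __init__(self) -> None:
        self.enabled = False
        self.calls: Counter = Counter()

    def probe(self, func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            if self.enabled:
                # Data is collected only while the tester has switched it on.
                self.calls[func.__name__] += 1
            return func(*args, **kwargs)
        return wrapper
```

Because nothing is recorded while the flag is off, the trace stays small for the parts of a long run the tester does not care about.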
Apart from recording test script execution times and providing basic test script
start/stop timer commands, none of the test systems has any sophisticated test
measurement or instrumentation capabilities.
4.4.7 Multiuser Testing
Final Exam C/S-Test™ uses two kinds of specialized software to conduct multiuser
testing. A single copy of the monitor program resides on one of the networked
workstations. The monitor provides session control tools to schedule and view the status of
test scripts executing on remote workstations. All workstations participating in a multiuser
test are controlled with a local server program. The server program identifies the
workstation to the monitor as available for testing, and responds to requests from the
monitor (e.g. start executing test script). A test script executing during a multiuser test is
called a "virtual user". The log files from remote executions are written to a public
directory accessible to all test machines. TestSuite™ and SQA Suite™ use a similar
architecture.
One area in which TestSuite™ and SQA Suite™ differ from Final Exam C/S-Test™ is the
distinction between types of virtual users. In SQA Suite™, a GUI user executes a test
script containing interactions with the application's user interface. Only one GUI user is
allowed per workstation. The main goal of a GUI user is to perform correctness testing.
A virtual user issues HTTP commands against a web server, bypassing the user interface
completely. Because of the reduced processing needed by the test system for text
commands, there can be many Virtual users on a single workstation. SQA Suite™
guidelines state that each GUI user requires 20 MB of RAM, while a Virtual user requires
just 1.5 MB. The purpose of Virtual users is to perform load and stress testing.
TestSuite™ GUI and dB Users perform roles similar to SQA Suite™'s GUI and Virtual
users.
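The memory guidelines quoted above translate into a simple capacity estimate. The Python sketch below uses the 20 MB and 1.5 MB figures from the text; the function name is hypothetical, and the calculation deliberately ignores memory used by the operating system and the test system itself, so it gives an upper bound only.

```python
GUI_USER_MB = 20.0      # per GUI user, from the SQA Suite guidelines quoted above
VIRTUAL_USER_MB = 1.5   # per Virtual user, from the same guidelines


def max_virtual_users(ram_mb: float, with_gui_user: bool = False) -> int:
    """Upper bound on Virtual users for one workstation with ram_mb of RAM."""
    # At most one GUI user is allowed per workstation.
    remaining = ram_mb - (GUI_USER_MB if with_gui_user else 0.0)
    if remaining < 0:
        raise ValueError("not enough RAM for a GUI user")
    return int(remaining // VIRTUAL_USER_MB)
```

Under these assumptions a 64 MB workstation could hold at most 42 Virtual users, or 29 if it also hosts a GUI user.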
Synchronization plays a vital part in coordinating the execution of test scripts on multiple
networked machines. Final Exam C/S-Test™ provides support for both synchronous
and asynchronous messaging for synchronization (see Table 3):
Script Command                    Description
when ("msgId") enabled {          Tells TML to look at each incoming message ID. If it
stmts…                            matches "msgId", the statements inside the code
}                                 block are executed.
enable "msgId"/disable "msgId"    Enables/disables when blocks.
sendMessage()                     Sends a message to a remote host. Messages contain no
                                  information beyond the message ID. The message is
                                  acknowledged if received.
multiMessage()                    Sends a message to multiple remote hosts. No
                                  acknowledgement is made if received.
waitMessage()                     Waits for any message to enter the message queue.
peekMessage()                     Looks at the message on top of the message queue
                                  without removing it.
sendMessageToTML()                C function that allows the AUT to send messages to
                                  the local test script.
RemoteCallerName()                Returns the name of the remote host that caused the
                                  test script to be executed locally.
Run "file" on "hostId"            Runs the test script file on the remote host
                                  identified by host id.
Table 3: Final Exam C/S-Test™ TML Script Commands for Multiuser Script Synchronization
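The interplay of when blocks, enable/disable, and the message queue in Table 3 can be modelled with a short Python sketch. It is not TML and makes several assumptions: the ScriptMailbox class and its method names are hypothetical, delivery is single-threaded, and a message with no enabled when block is assumed to stay in the queue.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional, Set


class ScriptMailbox:
    """Hypothetical model of a test script's incoming message handling."""

    def __init__(self) -> None:
        self.queue: Deque[str] = deque()
        self.handlers: Dict[str, Callable[[], None]] = {}
        self.enabled: Set[str] = set()

    def when(self, msg_id: str, handler: Callable[[], None],
             enabled: bool = True) -> None:
        """Register a block to run when msg_id arrives."""
        self.handlers[msg_id] = handler
        if enabled:
            self.enabled.add(msg_id)

    def enable(self, msg_id: str) -> None:
        self.enabled.add(msg_id)

    def disable(self, msg_id: str) -> None:
        self.enabled.discard(msg_id)

    def deliver(self, msg_id: str) -> None:
        """Handle an incoming message ID (messages carry no other data)."""
        if msg_id in self.enabled and msg_id in self.handlers:
            self.handlers[msg_id]()
        else:
            self.queue.append(msg_id)  # assumption: unhandled messages queue up

    def peek(self) -> Optional[str]:
        """Look at the next queued message without removing it."""
        return self.queue[0] if self.queue else None

    def take(self) -> Optional[str]:
        """Remove and return the next queued message, if any."""
        return self.queue.popleft() if self.queue else None
```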
TestSuite™ coordinates virtual users with a synchronous messaging technique called
"rendezvous". Each virtual user declares a rendezvous using the declare_rendezvous("rzvId")
statement. To synchronize across test scripts, the command rendezvous("rzvId") is issued
by all virtual users. Execution will not continue until all virtual
users have executed the rendezvous() command with the same id. SQA Suite™ has a
similar command, SQAVuSyncAndResume(), which provides some additional capabilities.
Through the monitor, the tester can specify a threshold for the number of virtual users
that must reach the rendezvous point before execution can continue. The tester is also
allowed to explicitly force a virtual user to continue. Finally, a timeout option is provided
which allows the virtual user to continue if the rendezvous condition has not been met.
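The rendezvous behaviour, including the timeout option, resembles a barrier in thread programming. The Python sketch below uses the standard library's threading.Barrier as a stand-in; it is not the TestSuite™ or SQA Suite™ API, and the function name, the use of threads in place of networked workstations, and the default timeout are assumptions.

```python
import threading
from typing import List


def run_virtual_users(count: int, timeout_s: float = 5.0) -> List[str]:
    """Run `count` simulated virtual users that meet at one rendezvous point."""
    barrier = threading.Barrier(count)
    events: List[str] = []
    lock = threading.Lock()

    def virtual_user(user_id: int) -> None:
        with lock:
            events.append(f"user {user_id} arrived")
        try:
            # Blocks until all users have arrived, or until the timeout expires.
            barrier.wait(timeout=timeout_s)
        except threading.BrokenBarrierError:
            pass  # timeout: continue even though the condition was not met
        with lock:
            events.append(f"user {user_id} resumed")

    threads = [threading.Thread(target=virtual_user, args=(i,))
               for i in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events
```

As long as the timeout does not expire, every "arrived" event is recorded before any "resumed" event, which is the ordering guarantee a rendezvous is meant to give.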
The Final Exam C/S-Test™ session control monitor consists of a status and message
window. The status window reports the status of each workstation participating in the test
session. Connected indicates that the workstation is ready to run a test script. Running
means a test script is executing on the machine. Getline means that the remote test script is
waiting for the tester to enter some text at a special monitor command prompt. Waiting
indicates the test script is waiting for a test script event (via the waitMessage() command).
Error denotes that some kind of error (verification, general protection fault, or script
command) occurred. Stop is displayed when the script has successfully executed.
Disconnected is displayed when the workstation has been dropped from the test session.
The messages window displays any messages transmitted between test scripts via the
sendMessage() or multiMessage() command.
Both TestSuite™ and SQA Suite™ offer a more sophisticated session control interface.
Besides remote workstation status, these interfaces provide scheduling and limited
synchronization capabilities. Table 4 is a slightly modified version of the session control
interface for SQA Suite™ [121]. The label field associates a specific workstation and test
script with an identifier. Test station identifies the name of a workstation used in the test
session. Test entry contains a list of test scripts to be run in sequential order on the
workstation specified by test station. The order the scripts appear in the list is the order
they will be run unless overridden by a scheduling method. Status indicates the status of
the workstation: editing, connected, not responding, running, run completed. Editing
indicates that the tester is modifying the entries for the workstation in the session control
window. The other states are self-explanatory. Scheduling method provides the user with
some synchronization control. Valid methods are None, Wait, After