As software projects grow larger and more complex and involve more stakeholders across geographical and organizational boundaries, project managers increasingly rely on open discussion forums to elicit requirements and otherwise communicate with other stakeholders. Unfortunately, open forums generally don’t support all aspects of the requirements-elicitation process; for example, in most forums, stakeholders create their own discussion threads, introducing significant redundancy of ideas and possibly causing them to miss important discussions. Here, we describe an automated forum management (AFM) system we designed to organize discussion threads in open forums, exploring it through a series of case studies and experiments using feature requests mined from open source forums.
Using Web-based forums to gather requirements from stakeholders is common in open source projects and increasingly common in enterprise-level projects. Such forums can grow quite large, with many thousands of stakeholders. Most use discussion threads to help focus the conversation; however, user-defined threads tend to produce redundant ideas, while predefined categories can be overly rigid, leading to coarse-grained topics that fail to facilitate focused discussion.
To better understand the problems and challenges of capturing requirements in open forums, we surveyed several popular open source projects, evaluating how the structure of their forums and the organization of their feature requests help stakeholders work collaboratively toward their goals. The survey covered a number of tools and frameworks: a customer relationship management system called SugarCRM11; a UML modeling tool called Poseidon10; an enterprise resource planning tool called Open Bravo10; a groupware tool called Zimbra10; a Web tool for administering the MySQL server called PHP-MyAdmin10; an open .NET architecture for Linux called Mono10; and the large Web-based immersive game world Second Life.9 All exhibited a high percentage of discussion threads consisting of only one or two feature requests. For example, as shown in Figure 1, 59% of Poseidon threads, 70% of SugarCRM threads, 48% of Open Bravo threads, and 42% of Zimbra threads included only one or two requests. The presence of so many small threads suggests either a significant number of distinct discussion topics or that users had created unnecessary new threads. An initial analysis found that the second case, unnecessary new threads, held for all the forums we surveyed; for example, in the SugarCRM forum, stakeholders' requests related to email lists were spread across 20 different clusters, 13 of which included only one or two comments.
We designed an experiment to determine quantitatively if user feature requests and comments (placed in new threads) should instead have been placed in larger preexisting threads. We conducted it using the data from SugarCRM, an open source customer relationship management system supporting campaign management, email marketing, lead management, marketing analysis, forecasting, quote management, and case management. We mined 1,000 feature requests contributed by 523 different stakeholders over a two-year period (2006–2008) distributed across 309 threads from the open discussion forum. Of the requests, 272 had been placed (by users) into discussion threads with no more than two feature requests, 309 had been placed in groups with nine or more requests, and the rest were placed in intermediate-size groups. We clustered all the feature requests using a consensus Spherical K-Means algorithm (described in the following sections), then estimated the degree to which each feature request belonged to a topic by computing the proximity of the request to the centroid, or center, of its assigned cluster.
We hypothesized that feature requests representing unique topics would have low proximity scores, while those fitting well into a topic would have higher scores. Figure 2 shows the results for feature requests assigned to small threads with only one or two feature requests vs. those assigned to larger threads of nine or more requests. A t-test returned a p-value of 0.0005, showing the two distributions differed significantly; nevertheless, the distributions overlapped substantially, suggesting that many of the feature requests assigned to individual or small threads could have been placed with other feature requests in larger threads. It was also reasonable to assume that once stakeholders found a relevant thread and added a feature request to it, their choice of words would have been influenced by words already in the thread, increasing the similarity of feature requests placed in shared threads. We observed that users often responded directly to earlier entries in a thread, possibly accounting for some of the difference in proximity scores between the two groupings. The results from the experiment and our subjective observations suggest that in many cases users created new threads unnecessarily.
Automated Approach
Many such problems can be alleviated through AFM tools that employ data-clustering techniques to create distinct, highly focused discussion threads. However, clustering is hindered by significant background noise in feature requests (such as spelling errors, poor grammar, slang, long-winded comments, nonstandard abbreviations, redundancies, inconsistent use of terms, and off-topic discussions). To be effective, an AFM must deliver high-quality clusters that each represent a focused theme and are distinct from other clusters, minimizing the redundancy of ideas across discussions. Clustering algorithms must also execute quickly, so clustering can occur inconspicuously in the background. Finally, because open discussion forums are characterized by a steady influx of feature requests, clusters must be relatively stable, so requests are not constantly moved from cluster to cluster.
We designed the AFM process to meet these goals. Once an initial set of feature requests is collected, the project manager places the forum under the management of the AFM system, and existing requests are analyzed to identify themes and generate discussion threads. From that point on, newly arriving feature requests are classified into existing threads unless a new topic is detected, in which case a new thread is created. Here, we describe these processes and the underlying algorithms we developed as the result of extensive experimentation with requirements-clustering techniques.6,7
In preparation for clustering, feature requests are preprocessed to remove common words (such as "this" and "because") that are not useful for identifying underlying themes. The remaining terms are then stemmed to their grammatical roots. Each feature request $x$ is represented as a vector of terms $(t_{x,1}, t_{x,2}, \ldots, t_{x,W})$ that is then normalized through a technique called term frequency-inverse document frequency (tf-idf), where tf represents the original frequency of term $t$ in the feature request, and idf represents the inverse document frequency; idf is often computed as $\log_2(N/df_t)$, where $N$ represents the number of feature requests in the entire forum, and $df_t$ represents the number of feature requests containing $t$. The similarity between each normalized request vector $a = (a_1, a_2, \ldots, a_W)$ and each centroid $b = (b_1, b_2, \ldots, b_W)$ is then computed as

$$\mathrm{sim}(a,b) = \frac{\sum_{i=1}^{W} a_i b_i}{\sqrt{\sum_{i=1}^{W} a_i^2}\,\sqrt{\sum_{i=1}^{W} b_i^2}}$$

where $W$ represents the total number of terms in the entire set of feature requests, and $a_i$ is computed as the number of times term $t_i$ occurs in feature request $a$, weighted according to the idf value. Intuitively, this formula returns higher similarity scores between two feature requests if they share a significant number of relatively rare terms. We then determine an appropriate cluster granularity through a technique devised by F. Can and E.A. Ozkarahan2 in 1990 that predicts the ideal number of clusters by considering the degree to which each feature request differentiates itself from other such requests. The ideal number of clusters $K$ is computed as

$$K = \sum_{x} \frac{1}{|x|} \sum_{i} \frac{f_{x,i}^{2}}{N_i}$$

where $f_{x,i}$ is the frequency of term $t_i$ in artifact $x$, $|x|$ is the length of the artifact, and $N_i$ is the total number of occurrences of term $t_i$. This approach has been shown to be effective across multiple data sets2,6 and can be calculated quickly, so the ideal number of clusters can be recomputed frequently.
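These preprocessing and estimation steps are straightforward to sketch in code. The following Python fragment is a minimal illustration, not the implementation used in our experiments; it assumes feature requests arrive as lists of already stopped and stemmed tokens, and the function names are ours.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized feature requests into unit-length tf-idf vectors.

    docs: list of token lists (already stopped and stemmed).
    Returns one {term: weight} dict per feature request.
    """
    n = len(docs)
    df = Counter()                          # df_t: requests containing term t
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: f * math.log2(n / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity; for unit-length vectors this is the dot product."""
    if len(b) < len(a):
        a, b = b, a                         # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def ideal_k(docs):
    """Estimate cluster count with the cover-coefficient formula:
    K = sum over requests x of (1/|x|) * sum_i f_{x,i}^2 / N_i."""
    total = Counter()                       # N_i: corpus-wide count of term i
    for doc in docs:
        total.update(doc)
    k = 0.0
    for doc in docs:
        if not doc:
            continue
        tf = Counter(doc)
        k += sum(f * f / total[t] for t, f in tf.items()) / len(doc)
    return max(1, round(k))
```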
Our approach uses a clustering algorithm called Spherical K-Means (SPK)5 that exhibits fast running times and returns relatively high-quality results.6 As described more formally in Figure 3, the algorithm initially selects a set of K centroids, or seeds, with the objective that they be as dissimilar from one another as possible. The distance from each feature request to each centroid is computed, and each feature request is placed in the cluster associated with its closest centroid. Once all feature requests are assigned to clusters, the centroids are repositioned to increase their average proximity to all feature requests in their clusters. This is followed by a series of incremental optimizations, each attempting to move a randomly selected feature request to the cluster where it maximizes the overall gain in cohesion. The process continues until no further optimization is possible. This simple post-processing step has been shown to significantly improve the quality of clustering results.6
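The following is a compact sketch of this two-stage algorithm in Python with NumPy, again ours rather than the article's code. It assumes the unit-length tf-idf vectors are stacked as rows of a matrix X; the greedy seed selection and the "first variation" objective used in the refinement stage are standard choices the article does not spell out.

```python
import numpy as np

def spherical_kmeans(X, k, batch_iters=20, seed=0):
    """Two-stage spherical k-means sketch.

    X: (n, w) array whose rows are unit-length tf-idf vectors.
    Returns (labels, centroids).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Seed selection: start from a random request, then repeatedly add the
    # request least similar to all seeds chosen so far.
    seeds = [int(rng.integers(n))]
    while len(seeds) < k:
        sims = X @ X[seeds].T
        seeds.append(int(np.argmin(sims.max(axis=1))))
    C = X[seeds].copy()

    for _ in range(batch_iters):             # stage 1: batch assignment
        labels = np.argmax(X @ C.T, axis=1)  # nearest centroid by cosine
        for j in range(k):
            s = X[labels == j].sum(axis=0)
            if np.linalg.norm(s) > 0:
                C[j] = s / np.linalg.norm(s)

    # Stage 2: incremental optimization. Maximizes sum_j ||S_j||, where S_j
    # is the vector sum of cluster j; a request moves only when the move
    # strictly raises that objective.
    S = np.array([X[labels == j].sum(axis=0) for j in range(k)])
    improved = True
    while improved:
        improved = False
        for i in rng.permutation(n):
            a = labels[i]
            cost = np.linalg.norm(S[a]) - np.linalg.norm(S[a] - X[i])
            gain = np.linalg.norm(S + X[i], axis=1) - np.linalg.norm(S, axis=1)
            gain[a] = cost                   # staying put is gain-neutral
            b = int(np.argmax(gain))
            if gain[b] > cost + 1e-12:
                S[a] -= X[i]; S[b] += X[i]; labels[i] = b
                improved = True

    norms = np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1e-12)
    return labels, S / norms
```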
Clustering results are strongly influenced by the initial centroid selection, meaning poor selection can lead to low-quality clusterings. However, this problem can be alleviated through a consensus technique for the initial clustering, an approach that generates multiple individual clusterings, then uses a voting mechanism to create a final result.8 Though it has a much longer running time than SPKMeans, consensus clustering has been shown to consistently deliver clusterings of higher quality than the average standalone SPK clustering.6 In the AFM system, consensus clustering is used only for the initial clustering, in order to create the best possible set of start-up threads.
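The article does not detail the voting mechanism itself, so the sketch below uses one common consensus rule, evidence accumulation: each run votes on whether two requests belong together, and the resulting co-association matrix is itself clustered once more. It reuses spherical_kmeans() from the previous sketch; the run count is an assumed parameter.

```python
import numpy as np

def consensus_cluster(X, k, runs=10):
    """Consensus clustering sketch over multiple spherical k-means runs."""
    n = X.shape[0]
    votes = np.zeros((n, n))
    for r in range(runs):                    # each run starts from new seeds
        labels, _ = spherical_kmeans(X, k, seed=r)
        votes += labels[:, None] == labels[None, :]
    votes /= runs                            # fraction of runs agreeing

    # Treat each request's row of votes as a new feature vector and cluster
    # those rows; requests that consistently co-occur end up together.
    rows = votes / np.maximum(np.linalg.norm(votes, axis=1, keepdims=True),
                              1e-12)
    final_labels, _ = spherical_kmeans(rows, k, seed=runs)
    return final_labels
```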
Following the arrival of each new feature request, the algorithm recomputes the ideal granularity to determine whether a new cluster should be added. To add a new cluster in a way that preserves the stability of existing clusters and minimizes clustering time, the AFM approach identifies the least-cohesive cluster, then bisects it using SPK with K = 2. Feature requests from neighboring clusters are then reevaluated to determine if they exhibit closer proximity to one of the two new centroids than to their currently assigned centroids; if so, they are reassigned to the relevant new cluster. To ensure continued cluster quality, the entire data set is re-clustered periodically, following the arrival of a fixed number of new feature requests. Re-clustering is performed through a modified SPKMeans algorithm we call Stable SPKMeans, designed to minimize the movement of feature requests between clusters by reusing the current set of centroids as seeds for the new clustering.
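A simplified sketch of this maintenance loop follows, again ours: add_cluster() bisects the least-cohesive cluster and lets requests defect only to the two new centroids, and stable_recluster() reuses the current centroids as seeds, mirroring the Stable SPKMeans idea. It assumes spherical_kmeans() from the earlier sketch; a real deployment would call add_cluster() whenever the recomputed ideal K exceeds the current cluster count.

```python
import numpy as np

def least_cohesive(X, labels, C):
    """Index of the cluster whose members are least similar to its centroid."""
    coh = [float((X[labels == j] @ C[j]).mean()) if np.any(labels == j)
           else 1.0 for j in range(len(C))]
    return int(np.argmin(coh))

def add_cluster(X, labels, C):
    """Grow the clustering by one via bisection of the weakest cluster."""
    j = least_cohesive(X, labels, C)
    members = np.where(labels == j)[0]
    _, pair = spherical_kmeans(X[members], 2, seed=1)
    C = np.vstack([C, pair[1]])              # new cluster takes index len(C)-1
    C[j] = pair[0]                           # bisected cluster keeps index j
    # Reassign only requests now closer to one of the two new centroids
    # than to their currently assigned centroid.
    cur = (X * C[labels]).sum(axis=1)        # similarity to own centroid
    to_new = X @ np.vstack([C[j], C[-1]]).T  # similarity to the new pair
    better = to_new.max(axis=1) > cur
    labels = labels.copy()
    labels[better] = np.where(to_new[better, 0] >= to_new[better, 1],
                              j, len(C) - 1)
    return labels, C

def stable_recluster(X, C, batch_iters=20):
    """Periodic full re-clustering seeded with the current centroids."""
    for _ in range(batch_iters):
        labels = np.argmax(X @ C.T, axis=1)
        for j in range(len(C)):
            s = X[labels == j].sum(axis=0)
            if np.linalg.norm(s) > 0:
                C[j] = s / np.linalg.norm(s)
    return labels, C
```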
Cluster quality is also improved through user feedback specifying whether a pair of feature requests belongs together. For example, users unhappy with the quality of a cluster can specify that a given feature request does not fit the cluster. They might also provide additional tags to help place the feature request in a more appropriate cluster. These user constraints, along with the tag information, are then incorporated into the SPK algorithm in future clusterings, which balance the quality of the individual clusters against conformance to the user constraints. Our prior work in this area demonstrated significant improvement in cluster quality when constraints are considered.13
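The exact constrained formulation appears in our earlier work;13 the fragment below shows only one simple way to fold pairwise feedback into the assignment step, with an assumed penalty weight. Must-link pairs are rewarded for sharing a cluster and cannot-link pairs are penalized.

```python
import numpy as np

def constrained_assign(X, C, must_link, cannot_link, penalty=0.5, sweeps=5):
    """Bias cluster assignment with pairwise user constraints.

    must_link / cannot_link: lists of (i, j) feature-request index pairs.
    penalty is an illustrative weight, not a value from the article.
    """
    sims = X @ C.T                           # cosine similarity to centroids
    labels = np.argmax(sims, axis=1)
    for _ in range(sweeps):                  # a few repair sweeps
        score = sims.copy()
        for i, j in must_link:               # reward sharing a cluster
            score[i, labels[j]] += penalty
            score[j, labels[i]] += penalty
        for i, j in cannot_link:             # penalize sharing a cluster
            score[i, labels[j]] -= penalty
            score[j, labels[i]] -= penalty
        new = np.argmax(score, axis=1)
        if np.array_equal(new, labels):
            break                            # assignments have stabilized
        labels = new
    return labels
```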
Evaluating AFM
We conducted a series of experiments designed to evaluate the AFM's ability to quickly deliver cohesive, distinct, and stable clusters. They used three data sets: the first was the SugarCRM data set discussed earlier; the second included 4,205 feature requests we mined from Second Life, an Internet-based virtual-world game in which stakeholders are represented by interacting avatars; and the third described the features of an online Amazon-like portal designed specifically for students. In spring 2008, we asked 36 master's-level students enrolled in two advanced software-engineering classes at DePaul University to consider the needs of a typical student, textbook reseller, textbook author, textbook publisher, instructor, portal administrator, architect, project manager, and developer and to create relevant feature requests. The result was 366 feature requests.
To evaluate cluster quality, we constructed an "ideal" clustering for the SugarCRM data by reviewing and modifying the natural discussion threads created by SugarCRM users. Modifications included merging closely related singleton threads, decomposing large megathreads into smaller, more cohesive ones, and reassigning misfits to new clusters. This answer set enabled us to compare the quality of the generated clusters using a metric called normalized mutual information (NMI), which measures the extent to which knowledge of one clustering reduces uncertainty about the other.9 On a scale of 0 (no similarity) to 1 (identical clusterings), the SugarCRM clusterings scored 0.57, indicating some degree of similarity between the generated clusters and the answer set. In experiments using the answer set to simulate the gathering of nonconflicting user feedback, NMI scores increased to 0.62 after 1,000 pairwise constraints were randomly selected from the pairs of requests exhibiting borderline (neither very strong nor very weak) proximity scores.6 We also created an answer set for the Student Portal data set; initial NMI clustering scores of 0.58 increased to 0.7 following 1,000 pairwise constraints. Figure 4 outlines a cohesive cluster of feature requests for the Student Portal data set generated using 1,000 systemwide constraints. Note that 1,000 constraints represent only 1.5% of possible constraints for the Student Portal data set and only 0.2% for SugarCRM.
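NMI is available off the shelf; the toy example below uses scikit-learn's implementation with made-up labels, simply to show how a generated clustering is scored against an answer set.

```python
from sklearn.metrics import normalized_mutual_info_score

# Flat label arrays indexed by feature request; values are cluster ids.
answer_set = [0, 0, 1, 2, 2, 2, 2]   # hypothetical "ideal" clustering
generated  = [0, 0, 1, 1, 2, 2, 2]   # hypothetical AFM output

score = normalized_mutual_info_score(answer_set, generated)
print(f"NMI = {score:.2f}")  # 1.0 = identical clusterings, 0 = no similarity
```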
These results demonstrate that many of the themes we identified in our two answer sets were also detected by the unsupervised clustering algorithms. However, because more than one valid clustering is possible for a given data set, these metrics alone provide insufficient insight for judging whether the cluster quality is better or worse than that of the user-defined threads. NMI metrics could not be used to evaluate the user-defined threads directly, due to the significant disparity between the number of user-defined threads and the number of threads in the answer set.
We thus conducted a case study using the SugarCRM data set to compare the treatment of four topics in the original user-managed SugarCRM forum versus the AFM approach. We selected the four topics by examining an unstructured list of feature requests: distribution lists, appointment scheduling, language support, and document attachments. For each topic, we first identified the set of related feature requests, then compared the way they had been allocated to user-defined and AFM-generated threads (see Table 1). We did not solicit user feedback for these experiments.
We identified 14 feature requests for the first topic, distribution lists. Human users placed them in a thread called "email management," which included a total of 74 feature requests. The AFM system similarly placed all 14 requests in a thread called "send email to sugar contacts." In this case, both approaches were relatively successful.
For the second topic, appointment scheduling, we identified 32 feature requests. Human stakeholders placed them across six different threads: 15 in a thread called "scheduling," 12 in a thread called "how to improve calendering [sic] and activities in sugar sales," two in a thread called "calendar," and one each in threads related to mass updates, printing, and email management. The AFM placed 18 feature requests in a discussion related to daily appointment scheduling, another 10 in a discussion related to meeting scheduling, and the remaining four in two clusters related to email, Web-enabled calendars, and integration with other tools. In this case, the AFM performed marginally better than the user-defined threads, as the feature requests were slightly less dispersed.
For the third topic, language support, we identified 11 feature requests. Human stakeholders placed them across seven relatively small threads epitomizing the kinds of problems we found in the open source forums we studied. The AFM created a focused discussion forum on the topic of languages in which it placed nine of the 11 requests. The AFM approach excelled here.
The fourth and final topic, document attachments, represents a case in which a fairly dominant concern was dispersed by human users across 13 different threads, while the AFM placed all related requests in a single, highly focused discussion thread.
It was evident that in each of the four cases the AFM approach either matched or improved on the results of the user-created threads.
The AFM was designed to minimize the movement of feature requests among clusters in order to preserve stability and quality. In a series of experiments against the SugarCRM, Second Life, and Student Portal data sets, we found that the Stable SPKMeans clustering algorithm had no significant negative effect on the quality of the final clusters. In fact, the NMI scores for the two data sets for which target clusterings were available (SugarCRM and Student Portal) showed a slight improvement when we used the stable algorithm.
To evaluate the stability of the modified SPKMeans algorithm, we tracked the movement of feature requests between clusters for re-clustering intervals of 25 feature requests. Figure 5a shows the number of moves per feature request, reported as percentages. Approximately 62% to 65% of feature requests were never moved, 20% to 30% were moved once or twice, and only about 5% to 8% were moved more frequently. A significant amount of this movement is accounted for by the arrival of new topics and the subsequent creation of new threads, causing reclassification of some existing feature requests. Figure 5b shows that the percentage of feature requests moved across clusters gradually decreases across subsequent iterations. However, as shown in Figure 5c, the actual number of movements increases in early iterations, then becomes relatively stable. Though not discussed here, we also conducted extensive experiments that found the traditional unmodified SPKMeans algorithm resulted in approximately 1.6 to 2.6 times more volatility of feature requests than our modified version.
We analyzed the total execution time of the Stable SPKMeans algorithm for each of the three data sets using MATLAB on a 2.33GHz machine; Table 2 outlines the total time needed to perform the clusterings at various increments. We excluded initial consensus-clustering times, which, with typical parameters of 50 to 100 requests, could take up to two minutes. As the table shows, the SugarCRM data set took a total of 57.83 seconds to complete all 29 clusterings at increments of 50 feature requests. Individual clusterings are notably fast; for example, the complete Second Life data set, consisting of 4,205 requests, was clustered in 75.18 seconds using standard SPKMeans and in only 1.02 seconds using our stable approach.
The Stable SPKMeans algorithm significantly improves the performance of the AFM, mainly because it takes less time to converge on a solution when quality seeds are passed forward from the previous clustering. Smaller increments require more clusterings, so the overall clustering time increases as the increment size decreases. However, our experiments found that increasing the increment size to 25 or even 50 feature requests has negligible effect on the quality and stability of the clusters.
Conclusion
We have identified some of the problems experienced in organizing discussion threads in open forums. Our summer 2008 survey of several open source forums suggests that expecting users to manually create and manage threads may not be the most effective approach. In contrast, we described an automated technique, embodied in our AFM, for creating stable, high-quality clusters to anchor related discussion groups. Though no automated technique always delivers clusters that are cohesive and distinct from one another, our experiments and case studies demonstrate the advantages of using data-mining techniques to help manage discussion threads in open forums. Our ongoing work aims to improve techniques for incorporating user feedback into the clustering process so clusters that appear ad hoc to users or contain multiple themes can be improved.
These findings are applicable across a range of applications, including those designed to gather comments from a product’s user base, support activities (such as event planning), and capture requirements in large projects when stakeholders are dispersed geographically. Our ongoing work focuses on the use of forums to support the gathering and prioritizing of requirements where automated forum managers improve the allocation of feature requests to threads and use recommender systems to help include stakeholders in relevant discussion groups.3 They also improve the precision of forum search and enhance browsing capabilities by predicting and displaying stakeholders’ interest in a given discussion thread.
From the user’s perspective, AFM facilitates the process of entering feature requests. Enhanced search features help users decide where to place new feature requests more accurately. Underlying data-mining functions then test the validity of the choice and (when placement is deemed incorrect) recommend moving the feature request to another existing discussion group or sometimes to an entirely new thread.
All techniques described here are being implemented in the prototype AFM tool we are developing to test and evaluate the AFM as an integral component of large-scale, distributed-requirements processes.
Acknowledgments
This work was partially funded by National Science Foundation grant CCR-0447594, including a Research Experiences for Undergraduates summer supplement to support the work of Horatiu Dumitru. We would also like to acknowledge Brenton Bade, Phik Shan Foo, and Adam Czauderna for their work developing the prototype.
Figures
Figure 1. Forums are characterized by numerous small threads, many with only one or two feature requests.
Figure 2. The extent to which feature requests assigned to individual vs. larger threads fit into global topics.
Figure 3. Two-stage spherical K-means.
Figure 4. A partial cluster with security-related requirements generated after gathering 1,000 constraints from the Student data set.
Figure 5. Stability of threads in the AFM clustering process.