Popular web sites like Twitter, Blogger, and Epinions let users vent their opinions on just about anything through an ever-increasing number of short messages, blog posts, and reviews. Automated sentiment-analysis techniques can extract traces of people’s sentiment, or attitude toward certain topics, from such texts.6
Key Insights
- Traditional automated sentiment analysis techniques rely mostly on word frequencies and thus fail to capture crucial nuances in text.
- A better understanding of a text’s sentiment can be obtained by guiding the analysis by the text’s rhetorical structure.
- A deep, fine-grain analysis of the rhetorical structure of smaller units of text can highlight crucial sentiment-carrying text segments and help correctly interpret these segments.
This analysis can yield a competitive advantage for businesses,2 as one-fifth of all tweets10 and one-third of all blog posts16 discuss products or brands. Identifying people’s opinions on these topics,11 and identifying the pros and cons of specific products,12 are therefore promising applications for sentiment-analysis tools.
Many commercial sentiment-analysis systems rely mostly on the occurrence of sentiment-carrying words, although more sophistication can be introduced in several ways. Among open research issues identified by Feldman6 is the role of textual structure. Structural aspects may contain valuable information;19,24 sentiment-carrying words in a conclusion may contribute more to a text’s overall sentiment than sentiment-carrying words in, say, background information.
Existing work typically uses structural aspects of text to distinguish important text segments from less-important segments in terms of their contribution to a text’s overall sentiment, subsequently weighting a segment’s conveyed sentiment in accordance with its importance. A segment’s importance is often related to its position in a text,19 yet recently developed methods make coarse-grain distinctions between segments based on their rhetorical roles5,8,24 by applying Rhetorical Structure Theory (RST).13
While the benefit of exploiting isolated rhetorical roles for sentiment analysis has been demonstrated in existing work, the full rhetorical structure in which these roles are defined has thus far been ignored. However, a text segment with a particular rhetorical role can consist of smaller subordinate segments that are rhetorically related to one another, thus forming a hierarchical rhetorical tree structure. Important segments may contain less-important parts. Since accounting for such structural aspects of a text enables better understanding of the text,15 we hypothesize that guiding sentiment analysis through a deep analysis of a text’s rhetorical structure can yield better understanding of conveyed sentiment with respect to an entity of interest.
Our contribution is threefold. First, as an alternative to existing shallow RST-guided sentiment-analysis approaches that typically focus on rhetorical relations in top-level splits of RST trees, we propose to focus on the leaf nodes of RST trees or, alternatively, to account for the full RST trees. Second, we propose a novel RST-based weighting scheme that is more refined than existing weighting schemes.8,24 Third, whereas existing work mostly guides sentiment analysis through sentence-level analyses of rhetorical structure (if at all), we additionally incorporate paragraph-level and document-level analyses of rhetorical structure into the process, as rhetorical relations across sentences and paragraphs can provide useful context for the rhetorical structure of their subordinate text segments.
Sentiment Analysis and Rhetorical Relations
Automated sentiment analysis is related to natural language processing, computational linguistics, and text mining. Typical tasks include distinguishing subjective text segments from objective ones and determining the polarity of words, sentences, text segments, and documents.18 We address the binary document-level polarity classification task, dealing with classifying documents as positive or negative.
The state of the art in sentiment analysis has been reviewed extensively.6,18 Existing methods range from machine-learning methods that exploit patterns in vector representations of text to lexicon-based methods that account for the semantic orientation of individual words by matching them against a sentiment lexicon, which lists words and their associated sentiment. Lexicon-based methods are typically more robust across domains and texts,24 while allowing for intuitive incorporation of deep linguistic analyses.
Deep linguistic analysis is a key success factor for sentiment-analysis systems, as it helps deal with “compositionality,”21 or the way the semantic orientation of text is determined by the combined semantic orientations of its constituent phrases. This compositionality can be captured by accounting for the grammatical20 or the discursive4,5,8,22,24 structure of text.
RST13 is a popular discourse-analysis framework, splitting text into rhetorically related segments that may in turn be split, thus yielding a hierarchical rhetorical structure. Each segment is either a nucleus or a satellite. Nuclei constitute a text’s core and are supported by satellites that are considered less important. In total, 23 types of relations between RST elements have been proposed;13 for example, a satellite may form an elaboration on or a contrast with a nucleus.
Consider the positive sentence “While always complaining that he hates this type of movies, John bitterly confessed that he enjoyed this movie,” which contains mostly negative words. RST can split the sentence into a hierarchical rhetorical structure of text segments (see Figure 1). The top-level nucleus contains the core message “John bitterly confessed he enjoyed this movie,” with a satellite providing background information. This background satellite consists of a nucleus (“he hates this type of movies”) and an attributing satellite (“While always complaining that”). Similarly, the core consists of a nucleus (“he enjoyed this movie”) and an attributing satellite (“John bitterly confessed that”). The sentence conveys a positive overall sentiment toward the movie, due to the way the sentiment-carrying words are used in the sentence; the actual sentiment is conveyed by the nucleus “he enjoyed this movie.”
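To make the hierarchical structure of Figure 1 concrete, a minimal Python sketch of such an RST tree is given below; the RSTNode class, its field names, and the way the relation label is attached to the satellite are illustrative assumptions of this sketch, not the output format of any particular RST parser.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RSTNode:
    """A node in a (simplified) RST tree: an inner node or a leaf carrying text."""
    role: str                       # "nucleus" or "satellite" (or "root")
    relation: Optional[str] = None  # relation to its sibling, e.g., "background"
    text: Optional[str] = None      # only set for leaf segments
    children: List["RSTNode"] = field(default_factory=list)

# The example sentence from Figure 1, split into its rhetorical segments.
sentence = RSTNode(role="root", children=[
    RSTNode(role="satellite", relation="background", children=[
        RSTNode(role="satellite", relation="attribution",
                text="While always complaining that"),
        RSTNode(role="nucleus", text="he hates this type of movies,"),
    ]),
    RSTNode(role="nucleus", children=[
        RSTNode(role="satellite", relation="attribution",
                text="John bitterly confessed that"),
        RSTNode(role="nucleus", text="he enjoyed this movie"),
    ]),
])
```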
Several methods of using rhetorical relations for sentiment analysis are available, with some, including our own methods, relying more strongly on RST5,8,24 than others.4,22 In order to identify rhetorical relations in text, the most successful methods use the Sentence-level PArsing of DiscoursE, or SPADE, tool,23 which creates an RST tree for each sentence. Another parser is the HIgh-Level Discourse Analyzer, or HILDA,9 which parses discourse structure at the document level by means of a greedy bottom-up tree-building method that uses machine-learning classifiers to iteratively assign RST relation labels to those (compound) segments of a document most likely to be rhetorically related. SPADE and HILDA take as input free text (a priori divided into paragraphs and sentences) and produce LISP-like representations of the text and its rhetorical structure.
Document-level polarity classification has been guided successfully by analyses of the most relevant text segments, as identified by differentiating between top-level nuclei and satellites in sentence-level RST trees.24 Another, more elaborate, method of utilizing RST in sentiment analysis accounts for the different types of relations between nuclei and satellites,8 yielding improvements over methods that only distinguish nuclei from satellites.5,8
Polarity Classification Guided by Rhetorical Structure
Existing methods do not use RST to its full extent; they typically focus on top-level splits of sentence-level RST trees and thus employ a rather shallow, coarse-grain analysis. However, in RST, rhetorical relations are defined within a hierarchical structure that could be accounted for. Important nuclei may contain less-important satellites that should be treated accordingly. We therefore propose to guide polarity classification through a deep analysis of a text’s hierarchical rhetorical structure rather than its isolated rhetorical relations. We account for rhetorical relations within and across sentences by allowing for not only sentence-level but also paragraph-level and document-level analyses of RST trees.
Fine-grain analysis. Figure 1 outlines the potential of RST-guided polarity classification. Based on the sentiment-carrying words alone, our example sentence is best classified as negative, as in Figure 1a. Accounting for the rhetorical roles of text segments as identified by the RST tree’s top-level split enables a more elaborate but still coarse-grain analysis of the overall sentiment, as in Figure 1b. The top-level nucleus contains as many positive as negative words and may thus be classified as either positive or negative. The negative words in the top-level satellite trigger a negative classification of this segment, which is a potentially irrelevant segment that should be assigned a lower weight in the analysis.
However, such a coarse-grain analysis does not capture the nuances of lower-level splits of an RST tree; for instance, the top-level nucleus in our example consists of two segments, one of which is the sentence’s actual core (“he enjoyed this movie”), whereas the other is less relevant and should therefore be assigned a lower weight in the analysis. Accounting for the rhetorical roles of the leaf nodes of an RST tree rather than the top-level splits can thus enable a more accurate sentiment analysis, as in Figure 1c.
However, an exclusive focus on leaf nodes of RST trees does not account for the text segments’ rhetorical roles being defined within the context of the rhetorical roles of the segments that embed them; for instance, the second leaf node in our example RST tree is a nucleus of a possibly irrelevant background satellite, whereas the fourth leaf node is the nucleus of the nucleus and constitutes the actual core. The full rhetorical structure must be considered in the analysis in order to account for this circumstance, as in Figure 1d.
We propose a lexicon-based sentiment-analysis framework that can perform an analysis of the rhetorical structure of a piece of text at various levels of granularity and that can subsequently use this information for classifying the text’s overall polarity. Our framework (see Figure 2) takes several steps in order to classify the polarity of a document.
Word-level sentiment scoring. Our method first splits a document into paragraphs, sentences, and words. Then, for each sentence, it automatically determines the part-of-speech (POS) and lemma of each word. It then disambiguates the word sense of each word using an unsupervised algorithm that iteratively selects the word sense with the highest semantic similarity to the word’s context.8 It then retrieves the sentiment associated with each word’s particular combination of POS, lemma, and word sense from a sentiment lexicon like SentiWordNet.1
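The following is a minimal Python sketch of this word-level scoring step (not the Java implementation described later), using NLTK’s interface to SentiWordNet; for brevity it falls back on the most frequent word sense, which is a stand-in for the similarity-based word-sense disambiguation described above.

```python
import nltk
from nltk.corpus import sentiwordnet as swn, wordnet as wn
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
#           nltk.download("wordnet"), nltk.download("sentiwordnet")

def penn_to_wn(tag):
    """Map Penn Treebank POS tags to WordNet POS tags."""
    if tag.startswith("J"): return wn.ADJ
    if tag.startswith("V"): return wn.VERB
    if tag.startswith("N"): return wn.NOUN
    if tag.startswith("R"): return wn.ADV
    return None

def word_scores(sentence):
    """Return (token, sentiment) pairs, with sentiment = positivity - negativity."""
    lemmatizer = WordNetLemmatizer()
    scores = []
    for token, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        wn_pos = penn_to_wn(tag)
        if wn_pos is None:
            scores.append((token, 0.0))
            continue
        lemma = lemmatizer.lemmatize(token.lower(), pos=wn_pos)
        synsets = list(swn.senti_synsets(lemma, wn_pos))
        if not synsets:
            scores.append((token, 0.0))
            continue
        # First (most frequent) sense as a stand-in for proper disambiguation.
        s = synsets[0]
        scores.append((token, s.pos_score() - s.neg_score()))
    return scores

print(word_scores("John bitterly confessed that he enjoyed this movie"))
```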
Rhetorical structure processing. In order to guide the polarity classification of a document $d$ by its rhetorical structure, sentiment scores are computed for each text segment $s_i$. Our framework supports several methods of computing such scores: a baseline method plus several RST-based methods that can be applied to sentence-level, paragraph-level, and document-level RST trees.
Baseline. As a baseline, we consider text segments to be the sentences $S_d$ of document $d$, with the baseline sentiment score of a segment $s_i$ being the sum of the sentiment score $e(t_j)$ of each of its words $t_j$, weighted by a word weight $w_{t_j}$, or

$$e_B(s_i) = \sum_{t_j \in s_i} w_{t_j}\, e(t_j), \qquad s_i \in S_d. \tag{1}$$
Top-level rhetorical structure. Our framework additionally supports the top-level RST-based method applied in existing work, an approach we refer to as T. The sentiment score of a top-level RST segment $s_i$ is defined as the sum of the sentiment associated with each word $t_j$ in segment $s_i$, weighted with a weight $w_{s_i}$ associated with the segment’s rhetorical role, or

$$e_T(s_i) = \sum_{t_j \in s_i} w_{s_i}\, e(t_j), \qquad s_i \in T_d, \tag{2}$$

with $T_d$ representing all top-level RST nodes in the RST trees for document $d$.
Leaf-level rhetorical structure. Another method is our leaf-level RST-based analysis L. The sentiment score of an RST segment $s_i$ from the leaf nodes $L_d$ of an RST tree for document $d$ is computed as the summed sentiment score of its words, weighted for the segment’s rhetorical role, or

$$e_L(s_i) = \sum_{t_j \in s_i} w_{s_i}\, e(t_j), \qquad s_i \in L_d. \tag{3}$$
Hierarchical rhetorical structure. The last supported approach is our method of accounting for the full path from an RST tree root to a leaf node, such that the sentiment conveyed by the latter can be weighted while accounting for its rhetorical context. In our hierarchy-based sentiment-scoring method H, we model the sentiment score of a leaf-level RST segment $s_i$ as a function of the sentiment scores of its words and the weights associated with the rhetorical role of each node $r_n$ from the nodes $P_{s_i}$ on the path from the root to the leaf, or

$$e_H(s_i) = \sum_{t_j \in s_i} e(t_j) \sum_{r_n \in P_{s_i}} \frac{w_{r_n}}{\delta^{\,l(r_n)-1}}, \qquad s_i \in L_d, \tag{4}$$

where $\delta$ represents a diminishing factor and $l(r_n)$ signals the level of node $r_n$ in the RST tree, with the level of the root node being 1. For $\delta > 1$, each subsequent level contributes less than its parent to the segment’s RST-based weight, thus preserving the hierarchy of the relations in the path.
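Continuing the illustrative sketches above (reusing the RSTNode structure, the example sentence, and the word_scores function), the hierarchy-based method H could be computed as follows; the role weights used here are illustrative, and the per-level decay of the role weights along the root-to-leaf path mirrors the diminishing factor δ.

```python
def leaf_weight(path_roles, weights, delta=2.0):
    """RST-based weight of a leaf segment: role weights collected along the
    root-to-leaf path; the node at (0-based) depth k is diminished by delta**k."""
    return sum(weights.get(role, 0.0) / delta ** depth
               for depth, role in enumerate(path_roles))

def hierarchy_scores(node, weights, delta=2.0, path=()):
    """Yield (leaf_text, sentiment) pairs for method H by walking the RST tree."""
    current = path + (node.role,)
    if node.text is not None:  # leaf segment: sum word scores, then weight them
        word_sentiment = sum(score for _, score in word_scores(node.text))
        yield node.text, word_sentiment * leaf_weight(current, weights, delta)
    for child in node.children:
        yield from hierarchy_scores(child, weights, delta, current)

# Illustrative role weights (scheme I: nuclei 1, satellites 0) and delta = 2.
weights_I = {"root": 1.0, "nucleus": 1.0, "satellite": 0.0}
for text, score in hierarchy_scores(sentence, weights_I):
    print(f"{score:+.2f}  {text}")
```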
Classifying document polarity. The segment-level sentiment scores can be aggregated in order to determine the overall polarity of a document. The baseline, top-level, leaf-level, and hierarchy-based sentiment scores $e_B(d)$, $e_T(d)$, $e_L(d)$, and $e_H(d)$ for document $d$ are defined as

$$e_B(d) = \sum_{s_i \in S_d} e_B(s_i), \tag{5}$$
$$e_T(d) = \sum_{s_i \in T_d} e_T(s_i), \tag{6}$$
$$e_L(d) = \sum_{s_i \in L_d} e_L(s_i), \tag{7}$$
$$e_H(d) = \sum_{s_i \in L_d} e_H(s_i). \tag{8}$$
The resulting document-level sentiment score $e(d)$ can be used to classify document $d$’s polarity $c_d$ as negative (−1) or positive (1), following

$$c_d = \begin{cases} 1 & \text{if } e(d) \ge \varepsilon, \\ -1 & \text{otherwise.} \end{cases} \tag{9}$$
Here, $\varepsilon$ represents an offset that corrects a possible bias in the sentiment scores caused by people’s tendency to write negative reviews with rather positive words,24 or

$$\varepsilon = \frac{1}{2}\left(\frac{1}{|P|}\sum_{d \in P} e(d) + \frac{1}{|N|}\sum_{d \in N} e(d)\right), \tag{10}$$

with $P$ and $N$ denoting the respective subsets of positive and negative documents in a training set.
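A minimal sketch of this classification step, assuming the offset is estimated as the midpoint between the mean document scores of the positive and the negative training documents:

```python
def estimate_offset(train_scores, train_labels):
    """Offset epsilon: midpoint between the mean document scores of the
    positive (label 1) and negative (label -1) training documents."""
    pos = [s for s, y in zip(train_scores, train_labels) if y == 1]
    neg = [s for s, y in zip(train_scores, train_labels) if y == -1]
    return 0.5 * (sum(pos) / len(pos) + sum(neg) / len(neg))

def classify(document_score, epsilon):
    """Polarity c_d: positive (1) if the score reaches the offset, else negative (-1)."""
    return 1 if document_score >= epsilon else -1

# Toy usage with made-up document-level sentiment scores.
train_scores = [2.1, 1.4, 0.9, 0.3]
train_labels = [1, 1, -1, -1]
eps = estimate_offset(train_scores, train_labels)
print(eps, classify(1.0, eps))
```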
Weighting schemes. We consider six different weighting schemes. Two serve as baselines and are applicable to the baseline sentiment scoring approach as defined in Equations (1) and (5). The others apply to our three RST-based approaches as defined in (2) and (6), in (3) and (7), and in (4) and (8), respectively.
Our BASELINE scheme serves as an absolute baseline and assigns each word a weight of 1; structural aspects are not accounted for. A second baseline is a position-based scheme. In this POSITION scheme, word weights are uniformly distributed, ranging from 0 for the first word to 1 for the last word of a text, as an author’s views are more likely to be summarized near the end of a text.19
The first RST-based scheme (I) assigns a weight of 1 to nuclei and a weight of 0 to satellites;24 the second RST-based scheme (II) matches the second set of weights for nuclei and satellites used by Taboada et al.,24 or 1.5 and 0.5, respectively. In both schemes I and II, we set the diminishing factor for the H method to 2, such that each level in a tree is at least as important as all its subsequent levels combined, thus enforcing a strict hierarchy.
Another RST-specific scheme is the extended scheme X in which we differentiate satellite weights by their RST relation type.8 Additionally, we propose a novel extension of X: the full weighting scheme F, in which we not only differentiate satellite weights but also nucleus weights by their RST relation type. For both X and F, the weights and the diminishing factor δ can be optimized.
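For concreteness, the fixed schemes can be written down directly; in the sketch below, schemes I and II use the weights given above, whereas the relation-specific weights of X and F are left as placeholders that would be optimized per fold on training data, and the listed relation types are examples only.

```python
# Fixed, role-based schemes (weights as given above); delta = 2 enforces a strict hierarchy.
SCHEME_I  = {"nucleus": 1.0, "satellite": 0.0}
SCHEME_II = {"nucleus": 1.5, "satellite": 0.5}
DELTA_FIXED = 2.0

# Relation-specific schemes: X differentiates only satellite weights by relation
# type, F differentiates nucleus weights as well. The relation types shown are a
# few examples of the 23 RST relations; the None values are free parameters that
# would be optimized (together with delta) on training data.
SCHEME_X_TEMPLATE = {
    "nucleus": None,
    "satellite:elaboration": None,
    "satellite:background": None,
    "satellite:contrast": None,
    "satellite:attribution": None,
}
SCHEME_F_TEMPLATE = {
    "nucleus:elaboration": None, "satellite:elaboration": None,
    "nucleus:background": None,  "satellite:background": None,
    "nucleus:contrast": None,    "satellite:contrast": None,
    "nucleus:attribution": None, "satellite:attribution": None,
}
```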
Evaluation
By means of a set of experiments, we evaluate the variants of our polarity-classification approach guided by structural aspects of text. We focus on a widely used collection of 1,000 positive and 1,000 negative English movie reviews.17
Experimental setup. In the Java implementation of our framework, we detect paragraphs using the <P> and </P> tags in the HTML files. For sentence splitting, we rely on the preprocessed reviews.17 The framework uses the Stanford Tokenizer14 for word segmentation. For POS tagging and lemmatization, our framework uses the OpenNLP3 POS tagger and the JWNL API,25 respectively. We link word senses to WordNet,7 thus enabling retrieval of their sentiment scores from SentiWordNet1 by subtracting the associated negativity scores from the positivity scores. This retrieval method yields real numbers ranging from −1 (negative) to 1 (positive). In the final aggregation of word scores, our framework assigns a weight to each word by means of one of our methods (see Table 1), most of which are RST-based.
We evaluate our methods on the accuracy, precision, recall, and F1-score on positive and negative documents, as well as on the overall accuracy and macro-level F1-score. We assess the statistical significance of performance differences through paired, one-sided t-tests, comparing methods against one another in terms of their mean performance measures over all 10 folds, under the null hypothesis that the mean performance of a method is less than or equal to the mean performance of another method.
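A sketch of this evaluation step is shown below, using scikit-learn for the macro-level F1-score and SciPy for the paired, one-sided t-test; the per-fold scores are placeholder numbers, not results from Table 2.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import f1_score

# Macro-level F1 over the positive (1) and negative (-1) classes for one fold.
y_true = [1, 1, -1, -1, 1, -1]
y_pred = [1, -1, -1, -1, 1, 1]
print(f1_score(y_true, y_pred, labels=[1, -1], average="macro"))

# Hypothetical per-fold macro-F1 scores of two competing methods (10 folds each).
method_a = np.array([0.71, 0.69, 0.73, 0.70, 0.72, 0.68, 0.74, 0.71, 0.70, 0.72])
method_b = np.array([0.75, 0.73, 0.76, 0.74, 0.77, 0.72, 0.78, 0.75, 0.74, 0.76])

# Paired, one-sided t-test; H0: method_b performs no better than method_a.
t_stat, p_value = ttest_rel(method_b, method_a, alternative="greater")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```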
In order to evaluate our methods, we apply tenfold cross-validation. For each fold, we optimize the offsets, weights, and diminishing factor δ for weighting schemes X and F using particle-swarm optimization,5 with particles searching a solution space in which their coordinates correspond to the weights and offsets (between −2 and 2) and δ (between 1 and 2). The fitness of a particle is its macro-level F1-score on a training set.
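This per-fold optimization could be sketched with a bare-bones particle swarm as follows; the swarm size, inertia, and acceleration constants are arbitrary choices of this sketch, and the fitness function is a placeholder standing in for the training-set macro-level F1-score of a candidate weight vector.

```python
import numpy as np

def pso(fitness, lower, upper, n_particles=20, n_iter=50,
        inertia=0.7, c_personal=1.5, c_social=1.5, seed=0):
    """Minimal particle-swarm optimizer maximizing `fitness` within box bounds."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    dim = lower.size
    pos = rng.uniform(lower, upper, size=(n_particles, dim))
    vel = np.zeros((n_particles, dim))
    best_pos = pos.copy()
    best_val = np.array([fitness(p) for p in pos])
    g_best = best_pos[best_val.argmax()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = (inertia * vel
               + c_personal * r1 * (best_pos - pos)
               + c_social * r2 * (g_best - pos))
        pos = np.clip(pos + vel, lower, upper)
        vals = np.array([fitness(p) for p in pos])
        improved = vals > best_val
        best_pos[improved], best_val[improved] = pos[improved], vals[improved]
        g_best = best_pos[best_val.argmax()].copy()
    return g_best, best_val.max()

# Example: two role weights and an offset in [-2, 2], delta in [1, 2], optimized
# against a placeholder fitness standing in for the training macro-F1.
def fitness(params):
    w_nucleus, w_satellite, offset, delta = params
    return -((w_nucleus - 1.0) ** 2 + w_satellite ** 2 + offset ** 2 + (delta - 1.5) ** 2)

best_params, best_fit = pso(fitness, lower=[-2, -2, -2, 1], upper=[2, 2, 2, 2])
print(best_params, best_fit)
```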
Experimental results. Our experimental analysis consists of four steps. First, we compare the performance of our considered methods. We then analyze the optimized weights and diminishing factors. Subsequently, we demonstrate how documents are typically perceived by distinct methods. Last, we discuss several caveats for our findings.
Performance. Table 2 presents our methods’ performance, whereas Figure 3 shows the p-values for the macro-level F1-scores for all combinations of methods, ordered from top to bottom and from left to right in accordance with an initial ranking from worst- to best-performing methods. We sort the methods based on their weighting scheme (BASELINE, POSITION, I, II, X, and F), analysis level (HILDA.D, HILDA.P, HILDA.S, and SPADE.S), and RST analysis method (T, L, and H), respectively. Darker colors indicate lower p-values, with darker rows and columns signaling weak and competitive approaches, respectively. Given a correct ordering, darker colors should appear to the right of the diagonal, toward the upper-right corner of Figure 3.
Three trends can be observed in Table 2 and Figure 3. First, weighting schemes X and F significantly outperform the I, II, BASELINE, and POSITION schemes, with F significantly outperforming X in most cases. Conversely, schemes I, II, BASELINE, and POSITION do not exhibit clearly significant performance differences with respect to one another.
A second trend is that methods guided by document-level RST trees are typically outperformed by comparable methods utilizing paragraph-level RST trees. Sentence-level RST trees yield the best performance. An explanation lies in misclassifications of rhetorical relations being potentially more harmful in larger RST trees; a misclassified relation in one of the top levels of a document-level RST tree can cause a misinterpretation of a large part of the document.
The third trend is that methods applying the hierarchy-based RST analysis method H typically slightly outperform comparable approaches that use the leaf-based analysis L instead, which in turn significantly outperform comparable approaches using the top-level RST analysis method T. The deeper analyses L and H clearly yield a significant advantage over the rather shallow analysis method T.
Some methods stand out in particular. First, HILDA.D.T and HILDA.P.T are relatively weak, especially when using weighting schemes I and II. The top-level RST analysis method T is typically too coarse-grain for larger RST trees, as, for instance, documents are segmented into only two parts when using document-level RST trees. Other weak methods are HILDA.D.H.I and HILDA.P.H.I. The combination of the naïve weighting scheme I with a deep, hierarchy-based analysis of the rhetorical structure of a document or its paragraphs results in a narrow focus on very specific segments.
Approaches that stand out positively are those applying hierarchy-based RST analysis H to sentence-level RST trees, with weighting schemes X and F, or HILDA.S.H.X, HILDA.S.H.F, SPADE.S.H.X, and SPADE.S.H.F. These approaches perform comparably well because they involve detailed analysis of rhetorical structure; the analysis is performed on the smallest considered units (sentences), the hierarchy of RST trees is accounted for, and the weights are differentiated per type of rhetorical relation. These results confirm our hypothesis that the sentiment conveyed by a text can be captured more adequately by incorporating a deep analysis of the text’s rhetorical structure.
Optimized weights. The optimized weights for distinct types of nuclei and satellites, as defined in RST,13 exhibit several patterns. First, nucleus elements are generally rather important in weighting scheme X, with most weights ranging between 0.5 and 1. The sentiment expressed in elaborating satellites is typically assigned a similar importance in weighting scheme X. Furthermore, contrasting satellites mostly receive weights around or below 0. Background satellites are typically assigned relatively low weights as well.
In weighting scheme F, nuclei and satellites in an attributing relation are typically both assigned weights around 0.5. Conversely, for background and contrasting relations, satellites are more clearly distinct from nuclei. Background satellites are typically assigned less importance than their associated nuclei, with respective weights of 0 and 1. For contrasting relations, nuclei are mostly assigned weights between 0.5 and 1, whereas optimized satellite weights are typically negative.
The optimized values for the diminishing factor δ are typically around 1.5 and 1.25 for weighting schemes X and F, respectively. These values result in the first 15 (for X) or 30 (for F) levels of RST trees being accounted for. Interestingly, with some (document-level) RST trees being more than 100 levels deep in our corpus, the optimized diminishing factors effectively disregard the lower, most-fine-grain parts of RST trees, thus realizing a balance between the level of detail and the potential noise in the analysis.
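As a rough check of these effective depths (assuming weights bounded by the optimization range of [−2, 2]), a node at level $l$ contributes at most $2/\delta^{\,l-1}$ to a leaf’s weight, which is roughly 0.007 at level 15 for $\delta = 1.5$ and roughly 0.003 at level 30 for $\delta = 1.25$, so deeper levels are effectively negligible.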
Processing a document. The observed differences in performance of our polarity-classification methods originate in how these methods perceive a document in the sentiment-analysis process. Figure 4 visualizes how the interpretations of a movie review differ across various methods.
Our software assigns positive sentiment to many words in our example, whereas fewer words are identified as carrying negative sentiment. When each part of the review is assigned an equal weight (see Figure 4a), this relative abundance of positive words suggests the review is rather positive, whereas it is in fact negative.
Our best-performing RST-based baseline, SPADE.S.T.X, considers only the top-level splits of sentence-level RST trees and yields a rather coarse-grain segmentation of the sentences, leaving little room for more subtle nuance (see Figure 4b). Nevertheless, SPADE.S.T.X successfully highlights the conclusion and the arguments supporting this conclusion. The polarity of some positive words is even inverted, as the reviewer uses them in a rather negative way.
SPADE.S.H.F, our best-performing approach, yields a very detailed analysis in which subtle distinctions are made between small text segments (see Figure 4c). Such nuance helps bring out the parts of the review that are most relevant with respect to its overall sentiment. SPADE.S.H.F ignores most of the irrelevant background information in the first paragraph and highlights the reviewer’s main concerns in the second and third paragraphs. Moreover, sentiment related to the movie’s good aspects is often inverted and mostly ignored by SPADE.S.H.F. The overall recommendation is emphasized in the last paragraph. All in all, it is the incorporation of the detailed analysis of the review’s rhetorical structure into the sentiment-analysis process that makes it possible for our polarity classifier to better understand the review.
Caveats. Our failure analysis has revealed that some misclassifications of authors’ sentiment are due to misinterpreted sarcasm and an occasional misinterpretation of proverbs. Additionally, not all sentiment-carrying words are identified as such, either because their word sense cannot be successfully disambiguated or because the word cannot be found in our sentiment lexicon, SentiWordNet. Other challenges are related to the information content of documents; for instance, in our corpus, some reviewers tend to mix their opinionated statements with plot details containing irrelevant sentiment-carrying words. Additionally, some reviewers evaluate a movie by comparing it with other movies. Handling such statements requires distinguishing between entities and their associated sentiment, as well as incorporating real-world knowledge into the analysis.
Even though our RST-based polarity-classification methods cannot cope particularly well with the specific phenomena mentioned earlier, they significantly outperform our non-RST baselines. However, these improvements come at the cost of processing times increasing by almost a factor of 10. The bottleneck is the RST parsers, rather than the application of our weighting schemes.
Conclusion
We have demonstrated that sentiment analysis can benefit from a deep analysis of a text’s rhetorical structure, enabling a distinction to be made between important text segments and less-important ones in terms of their contribution to the text’s overall sentiment. This is a significant step forward with respect to existing work, which is limited to guiding sentiment analysis by shallow analyses of rhetorical relations in (mostly sentence-level) rhetorical structure trees.
Our contribution is threefold. First, our novel polarity-classification methods guided by deep leaf-level or hierarchy-based RST analyses significantly outperform existing approaches, which are guided by shallow RST analyses or by no RST-based analyses at all. Second, the novel RST-based weighting scheme in which we differentiate the weights of nuclei and satellites by their RST relation type significantly outperforms existing schemes. And third, we have compared the performance of polarity-classification approaches guided by sentence-level, paragraph-level, and document-level RST trees, thus demonstrating that RST-based polarity classification works best when focusing on RST trees of smaller units of a text (such as sentences).
In future work, we aim to investigate the applicability of other, possibly more scalable methods for exploiting (discourse) structure of text in sentiment analysis. We also plan to validate our findings on other corpora covering other domains and other types of text.
Acknowledgments
Special thanks go to Bas Heerschop and Frank Goossen for their contributions in the early stages of this work. We are partially supported by the Dutch national research program COMMIT (http://www.commit-nl.nl/about-commit).
Figures
Figure 1. Interpretations of a positive RST-structured sentence consisting of nuclei (marked with vertical lines) and satellites.
Figure 2. A schematic overview of our sentiment-analysis framework; solid arrows indicate information flow, and dashed arrows indicate a used-by relationship.
Figure 3. The p-values for the paired, one-sided t-test assessing the null hypothesis of the mean macro-level F1-scores of the methods in the columns being lower than or equal to the mean macro-level F1-scores of the methods in the rows.
Figure 4. Movie review cv817_3675, as processed by various sentiment-analysis methods.
Tables
Table 1. Characteristics of our considered polarity-classification approaches, or our baselines, our SPADE-based sentence-level RST-guided methods, and our HILDA-based sentence-level, paragraph-level, and document-level RST-guided methods.
Table 2. Performance measures of our considered baselines, our SPADE-based sentence-level RST-guided methods, and our HILDA-based sentence-level, paragraph-level, and document-level RST-guided methods, based on tenfold cross-validation on the movie-review dataset. The best performance in each group of methods is in bold for each performance measure.