A Personalized Approach for Recommending Useful Product Reviews Based on Information Gain

  • Received : 2014.11.12
  • Accepted : 2015.04.19
  • Published : 2015.05.31

Abstract

Customer product reviews have become a major influence on purchase decisions. To assist potential customers, online stores provide various ways to sort customer reviews, and different methods have been developed to identify and recommend useful reviews, primarily using customer feedback about the helpfulness of reviews. Most of these methods consider the preferences of all users to determine whether reviews are helpful, so all users receive the same recommendations.


1. Introduction

Thanks to the proliferation of e-commerce and the great influence of customer reviews on purchase decisions, many products are now being sold and purchased online [1]. Customers believe that reviews written by others who have already had an experience with the product offer more objective and reliable information than that provided by sellers [2]. As a result, an increase in the average review rating leads to a growth in product sales [3], which, in turn, can strengthen the product’s price competitiveness [4]. However, if there are too many products and reviews, the advantage of e-commerce can be overshadowed by increasing search costs. Reading all of the reviews to find out the advantages and disadvantages of a particular product can be tedious and exhausting [5,6]. To help users find the most useful information about products without much difficulty, e-commerce companies try to provide various ways for customers to write and rate product reviews. Amazon.com asks customers whether a review on a certain product is helpful, and it places the most helpful favorable and the most helpful critical review at the top of the list of product reviews. Some companies also predict the usefulness of a review based on certain attributes including length, author(s), and the words used, publishing only reviews that are likely to be useful [7, 8].

The methods typically used by e-commerce companies begin from the same assumption, namely that all users share the same concept of helpfulness [7, 9, 10, 11, 12, 13, 14]. In contrast, we assume that every user has his or her own concept of helpfulness. Accordingly, the present study aimed to develop models that recognize individual preferences and to test how well those models predict usefulness and make recommendations that reflect individual differences. To do this, we extended the information gain approach to consider each individual's preferences as well as the votes of all users, and we compared the resulting methods against approaches that rate the usefulness of reviews using the votes of all users.

For this study, we collected data from 172 people who assessed the usefulness of product reviews through online surveys on a website. Using these data, we evaluated several algorithms and compared the performance of the personalized product review recommendation methods.

 

2. Related Studies

2.1 Provisions of Product Reviews

E-commerce companies offer platforms for product reviews in order to provide product information to consumers. Because product reviews play a significant role in purchase decisions, it is important to single out the reviews that provide useful information. Because of the large number of product reviews, retailers also provide ways for customers to sort them. Table 1 summarizes the review-related information provided by the shopping sites among the top 100 sites on the Web as ranked by Alexa.

Table 1. Provisions and Evaluations of Customer Reviews

Most shopping websites provide customer reviews and offer the following information: the average preference rating for a product, the number of customers who have participated in preference voting, and the distribution of preferences. The reviews can be sorted by categories such as “Helpful Reviews,” “Recent Reviews,” and “Preference Score.” Most information on the helpfulness of product reviews is collected through a voting system. Voting systems can be divided into two types: the first asks whether a review is “helpful” or “not helpful,” and the second asks whether it is “useful.” The helpfulness of a product review is expressed as the total number of users who rated the review as helpful or as the ratio of helpful votes to the total number of votes.

2.2 Recommending Useful Product Reviews

A number of studies have been conducted regarding ways to recommend useful reviews or likeable products to customers using product reviews. Cao et al. (2011)[7] demonstrated that the prediction of a review’s helpfulness is more accurate if usefulness is judged by a combination of the review’s basic information, style, and semantic information than when the evaluation is based on a single factor. The basic information includes whether a review describes advantages and disadvantages and the length of time since it was posted; style information includes the word count and the number of sentences; and the semantic information takes into account the meanings of words used in the review. Kim et al. (2006)[11] offered a method to predict the helpfulness of reviews through support vector machine (SVM) regression using structural, semantic, and morphological information about a review. Liu et al. (2008)[12] suggested a model that can predict the helpfulness of a review by considering the reviewers’ experience and writing styles, and the date when the review was written. Ghose and Ipeirotis (2011)[9] predicted the helpfulness of reviews by considering product attributes, review attributes, reviewer attributes, reviewers’ experiences, reviews’ readability, and reviews’ subjectivity. The predictions of usefulness produced by these methods are based on indicators of reviews’ actual helpfulness, and the results are verified by comparing them with the review rankings.

Some studies have been based on the assumption that predicting review helpfulness is a matter of distinguishing useful reviews from non-useful reviews. Zhang and Tran (2011)[13] measured the contribution of particular words to the frequency with which a review was rated as “helpful” or “not helpful.” The helpfulness of a review was predicted, and specific reviews were recommended, based on the sum of the contributions of the words contained in the review.

Some studies have assessed the meanings of words and sentences within a review to determine whether the review is positive or negative in tone [15, 16, 17, 18]. Table 2 summarizes the methods of estimating the helpfulness of a review.

Table 2. Methods of estimating the helpfulness of a product review

Studies that predict the helpfulness of reviews from attributes of the review and the reviewer, together with helpfulness votes, have typically been applied in systems that recommend the same most useful reviews to all users rather than making personalized recommendations.

 

3. Information Gain-based Review Recommendation Algorithm

For a given set of products $P=\{p_1,p_2,p_3,\dots,p_w\}$, customers $C=\{c_1,c_2,c_3,\dots,c_m\}$, and reviews $R=\{r_1,r_2,r_3,\dots,r_p\}$, we define the vote matrix $V$ of the customers' votes $v_{c_k,r_i}$ on product reviews as follows:

$$V = \begin{pmatrix} v_{c_1,r_1} & v_{c_1,r_2} & \cdots & v_{c_1,r_p} \\ v_{c_2,r_1} & v_{c_2,r_2} & \cdots & v_{c_2,r_p} \\ \vdots & \vdots & \ddots & \vdots \\ v_{c_m,r_1} & v_{c_m,r_2} & \cdots & v_{c_m,r_p} \end{pmatrix} \qquad (1)$$

Here, $v_{c_k,r_i}$ takes the value helpful (if $c_k$ voted $r_i$ as helpful), not helpful (if $c_k$ voted $r_i$ as not helpful), or null (if $c_k$ has not voted on $r_i$). Let the set of all “helpful” votes on review $r_i$ be denoted $h_i$, and the set of all “not helpful” votes on review $r_i$ be denoted $\bar{h}_i$. We define the helpfulness value of review $r_i$ using the following equation:

$$helpfulness(r_i) = \frac{|h_i|}{|h_i| + |\bar{h}_i|} \qquad (2)$$

To apply information gain, each review is assigned to either the helpful review group ($s_1$) or the not helpful review group ($s_2$). The average amount of information required to classify a review into one of these groups can be defined using entropy as follows:

$$I(s_1, s_2) = -\sum_{i=1}^{2} \Pr(s_i) \log_2 \Pr(s_i) \qquad (3)$$

Therefore, the average amount of information required to classify a review when the presence or absence of a word or term $t$ is known can be calculated as follows:

$$E(t) = -\Pr(t)\sum_{i=1}^{2} \Pr(s_i \mid t) \log_2 \Pr(s_i \mid t) - \Pr(\bar{t})\sum_{i=1}^{2} \Pr(s_i \mid \bar{t}) \log_2 \Pr(s_i \mid \bar{t}) \qquad (4)$$

In the area of text mining and text classification, information gain is the amount of information provided by a word or term. The information gain of the term $t$ can be calculated as follows [13]:

$$G(t) = I(s_1, s_2) - E(t) \qquad (5)$$

In the above equations:

– $\Pr(s_i)$ is the probability of documents' being included in category $s_i$ among all documents;

– $\Pr(t)$ is the probability of documents containing $t$ among all documents;

– $\Pr(s_i \mid t)$ is the probability of documents' being included in category $s_i$ among all documents containing $t$; and

– $\Pr(s_i \mid \bar{t})$ is the probability of documents' being included in category $s_i$ among all documents that do not contain $t$.
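To make the computation concrete, the following minimal R sketch computes Equations (3) through (5) for a single term; R is used because the study's text processing was carried out in R (Section 4). The function and argument names (entropy, information_gain, has_term, is_helpful) are illustrative, not from the paper.

```r
# Entropy of a class label vector, Equation (3).
entropy <- function(labels) {
  p <- prop.table(table(labels))  # class proportions Pr(s_i)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of a term, Equation (5): prior entropy minus the
# conditional entropy E(t) of Equation (4). `has_term` and `is_helpful`
# are parallel logical vectors, one element per review.
information_gain <- function(has_term, is_helpful) {
  pr_t  <- mean(has_term)                               # Pr(t)
  prior <- entropy(is_helpful)                          # I(s1, s2)
  cond  <- pr_t * entropy(is_helpful[has_term]) +       # Pr(t) term of E(t)
           (1 - pr_t) * entropy(is_helpful[!has_term])  # Pr(t-bar) term of E(t)
  prior - cond
}

# Example: six reviews, four rated helpful; the term appears in the first three.
information_gain(has_term   = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
                 is_helpful = c(TRUE, TRUE, TRUE, TRUE,  FALSE, FALSE))  # ~0.459
```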

3.1 Information gain—Total method

The above algorithm (Equation (5)) was suggested by Zhang and Tran (2011) [13]. In this approach, if the helpfulness value computed from the votes of all users is greater than 0.6, the review is classified into the helpful review group ($s_1$); otherwise, it is classified into the not helpful review group ($s_2$). In the vote matrix $V$ above, let a “helpful” vote be assigned the value 1 and a “not helpful” vote the value 0. Then, if the helpfulness value calculated from Equation (2) is greater than 0.6, review $r_i$ is classified into the helpful review group.

Based on $G(t)$ from Equation (5), the final $Gain(t)$, signed according to the class with which the term is more strongly associated, can be calculated as follows:

$$Gain(t) = \begin{cases} G(t) & \text{if } \Pr(s_1 \mid t) \ge \Pr(s_2 \mid t) \\ -G(t) & \text{otherwise} \end{cases} \qquad (6)$$

By using the information gain $Gain(t)$ of $t$ as calculated above, the predicted helpfulness score of a new review $r_i$ for all customers can be calculated as:

$$Score(r_i) = \sum_{j=1}^{M} Gain(t_j) \, f(r_i, t_j) \qquad (7)$$

where $M$ is the total number of stemmed words in review $r_i$, and $t_j$ is the $j$th stemmed word. If $t_j$ is included in $r_i$, then $f(r_i,t_j)$ is 1; if $t_j$ is not included in $r_i$, then $f(r_i,t_j)$ is 0. Among the reviews that a particular customer has not yet evaluated, that customer receives recommendations for the reviews with the highest predicted usefulness scores as calculated using Equation (7). Because this method reflects the opinions of all users, it recommends the same reviews to all users, provided only that they have not yet voted on those reviews.
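Continuing the sketch above, the scoring step of Equation (7) can be expressed as a sum over the distinct stemmed words present in the new review; `gains` (a named vector of Gain(t) values) and `review_terms` are illustrative names, and the toy gain values below are invented for the example.

```r
# Sketch of Equation (7): sum the gains of the stemmed words that occur in
# the review; f(r_i, t_j) = 1 exactly for the words in `present`.
predict_score <- function(review_terms, gains) {
  present <- intersect(unique(review_terms), names(gains))
  sum(gains[present])
}

gains <- c(great = 0.46, broke = -0.31, the = 0.00)  # toy Gain(t) values
predict_score(c("great", "the", "camera"), gains)    # -> 0.46
```

Reviews a customer has not yet voted on can then be ranked by this score, with the top-scoring reviews recommended; under the total method, this ranking is identical for every user.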

3.2 Information gain – Personalized method

Building on the concept of using information gain to classify helpful reviews, we devised several personalized recommendation algorithms. The first personalized algorithm assesses the helpfulness scores of reviews based on each individual's review preferences and the information gain of $t$.

In contrast to the total method described in Section 3.1, which considers the votes of all users, this personalized method classifies a review into the helpful or not helpful review group based on each individual's vote. If the value of an element $v_{c_k,r_i}$ in matrix $V$ is 1, the review is assigned to the helpful review group ($s_1$); if the value of $v_{c_k,r_i}$ is 0, it is assigned to the not helpful group ($s_2$). Based on this classification, the information gain of $t$ for customer $c_k$ can be calculated as follows:

$$G_{c_k}(t) = -\sum_{i=1}^{2} \Pr_{c_k}(s_i) \log_2 \Pr_{c_k}(s_i) + \Pr_{c_k}(t)\sum_{i=1}^{2} \Pr_{c_k}(s_i \mid t) \log_2 \Pr_{c_k}(s_i \mid t) + \Pr_{c_k}(\bar{t})\sum_{i=1}^{2} \Pr_{c_k}(s_i \mid \bar{t}) \log_2 \Pr_{c_k}(s_i \mid \bar{t}) \qquad (5\text{-}1)$$

In the above equation:

– $\Pr_{c_k}(s_i)$ is the probability of reviews' being included in category $s_i$ among all reviews rated by $c_k$;

– $\Pr_{c_k}(t)$ is the probability of reviews containing $t$ among all reviews rated by $c_k$;

– $\Pr_{c_k}(s_i \mid t)$ is the probability of reviews' being included in category $s_i$ among all reviews rated by $c_k$ that contain $t$; and

– $\Pr_{c_k}(s_i \mid \bar{t})$ is the probability of reviews' being included in category $s_i$ among all reviews rated by $c_k$ that do not contain $t$.

Finally, $Gain_{c_k}(t)$ can be calculated analogously to Equation (6):

$$Gain_{c_k}(t) = \begin{cases} G_{c_k}(t) & \text{if } \Pr_{c_k}(s_1 \mid t) \ge \Pr_{c_k}(s_2 \mid t) \\ -G_{c_k}(t) & \text{otherwise} \end{cases} \qquad (6\text{-}1)$$

By using the information gain $Gain_{c_k}(t)$ of the word $t$ as calculated above, the predicted helpfulness of a new review $r_i$ for customer $c_k$ can be calculated as follows:

$$Score_{c_k}(r_i) = \sum_{j=1}^{M} Gain_{c_k}(t_j) \, f(r_i, t_j) \qquad (7\text{-}1)$$

Customer $c_k$ then receives recommendations for the reviews with the highest predicted helpfulness scores. Though the use of the information gain of a word is similar to the total method described in Section 3.1, this personalized method estimates helpfulness scores based on each individual's preferences. In classifying a review's helpfulness, calculating the information gain of a word, and estimating the helpfulness score, this personalized algorithm takes into account only the individual's preferences in selecting reviews.
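Under the same assumptions as the earlier sketches, the personalized variant simply restricts the computation to the reviews a given customer has voted on; `dtm_rated` (a logical reviews-by-terms presence matrix for the customer's rated reviews) and `votes` (that customer's 0/1 votes) are illustrative names, and the sign adjustment of Equation (6-1) is omitted for brevity.

```r
# Sketch of Equation (5-1): per-customer information gain for every term,
# computed only over the reviews customer c_k has voted on. Reuses the
# information_gain() sketch from earlier in this section.
personalized_gains <- function(dtm_rated, votes) {
  apply(dtm_rated, 2, function(has_term) information_gain(has_term, votes == 1))
}
```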

3.3 Information gain—Weighted personalized method

In Section 3.1, when estimating the helpfulness scores of reviews, we included the votes of all users in calculating the information gain of $t$; in Section 3.2, information gain was calculated using only a single user's votes. However, when people decide or evaluate something, they are influenced by others as well as by their own experience and subjective evaluation.

Thus, when calculating the final helpfulness predictions, we considered estimates made using the votes of all users as well as the preferences of the individual. To do this, we devised a weighted personalized method that averages the predicted helpfulness scores from Equation (7) and Equation (7-1) as follows:

$$Score^{w}_{c_k}(r_i) = \frac{Score(r_i) + Score_{c_k}(r_i)}{2} \qquad (8)$$

By taking the average of the predicted helpfulness scores, we considered the opinions of all users as well as each individual’s subjective evaluations.
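As a sketch, Equation (8) is a one-line, element-wise combination of the two score vectors; the argument names are illustrative.

```r
# Sketch of Equation (8): average the total-method score and the customer's
# personalized score for each candidate review.
weighted_score <- function(score_total, score_personal) {
  (score_total + score_personal) / 2
}
```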

3.4 Information gain—Selective personalized method

As in the weighted personalized algorithm of Section 3.3, helpfulness predictions can draw on the opinions of all users as well as on each individual's preferences. Here, instead of averaging, we devised Equation (9), which uses the two predictions selectively based on the values of the prediction scores:

$$Score^{s}_{c_k}(r_i) = \max\left(Score(r_i),\ Score_{c_k}(r_i)\right) \qquad (9)$$

where $Score(r_i)$ is the predicted helpfulness score considering the ratings of all users, taken from Equation (7), and $Score_{c_k}(r_i)$ is the predicted helpfulness score based on the user's own preferences, introduced in Section 3.2.

If the prediction score from the personalized method is greater than that from the total method, then the final prediction score is the score from the personalized method. Otherwise, the predicted scores from the total method are used for the final predictions.

Under this algorithm, we usually followed the opinions of other users, because raters generally showed broad agreement in evaluating helpful reviews. However, we selectively used the prediction score from the personalized method when a review was predicted to be more helpful for a particular user.
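Element-wise over a vector of candidate reviews, the selection rule of Equation (9) reduces to a parallel maximum; a minimal sketch with illustrative argument names:

```r
# Sketch of Equation (9): take the personalized score when it is greater than
# the total-method score, and the total-method score otherwise.
selective_score <- function(score_total, score_personal) {
  pmax(score_total, score_personal)
}
```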

3.5 Information gain—Personalized method with average threshold

The methods introduced in Sections 3.1 and 3.2 employed the vote matrix $V$, in which entries were scored as 1 if helpful and 0 if not helpful. As shown in Table 1, it is common to rate a review by indicating that it is helpful or not helpful. As discussed in Section 3.1, if the ratio of helpful votes to total votes for a review exceeds 0.6, the review is considered helpful. In the personalized algorithm, a review is classified according to each user's own vote, regardless of whether the review is regarded as helpful by users overall.

Another approach is to rate a review using a Likert-scale-like star rating system. As we explain in Section 4, we obtained helpfulness ratings of reviews using a 7-point Likert scale. Because we employed a 0.6 ratio for classifying helpful reviews in Section 3.1, we classified a rating as helpful (or 1) if it was 4 or above, and as not helpful otherwise. In our data, however, the appropriate threshold for classifying helpful reviews differs among users. Thus, when applying personalized methods, we also used each user's average rating as the threshold for classifying helpful reviews. The personalized method described in Section 3.2 designated a review as helpful if its Likert rating was 4 or above, whereas the personalized method with an average threshold used a different classification scheme for each user based on that user's average rating score. If a user's average rating was 4.8, we classified the reviews that user rated 5 and above as helpful, and all other reviews were classified as not helpful, even those rated 4.

The remaining procedures used to estimate helpfulness scores are the same as those in the personalized method described in Section 3.2.
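A minimal sketch of the average-threshold rule follows, assuming `ratings` holds one user's 7-point helpfulness ratings (an illustrative name). Whether the boundary is inclusive is our assumption; the paper's example (mean 4.8, helpful from 5 upward) is consistent with either choice.

```r
# Classify each of a user's ratings as helpful (1) or not helpful (0) using
# that user's own mean rating as the threshold (Section 3.5).
threshold_labels <- function(ratings) {
  as.integer(ratings >= mean(ratings))
}

threshold_labels(c(7, 6, 5, 5, 4, 2))  # mean 4.83 -> 1 1 1 1 0 0
```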

 

4. Data and Experiment

Product reviews used for this study were collected from Amazon.com. Fig. 1 shows an example of a product review posted on the site. At the top of the review, the total number of people who rated the review and the number who considered it helpful are indicated. The website allows customers to leave a textual review and to give the product a star rating. In the present study, perfumes and books were selected as “experience goods,” and shoes and cameras were selected as “search goods” [19, 20]. As shown in Table 3, two product items per product group and six reviews per item were selected. The selected reviews comprised two entries with a high number of votes, two with a medium number of votes, and two with a low number of votes.

Fig. 1. A screenshot of a product review on Amazon.com

Table 3. Products and review data used in the experiment

Photos and information about the selected products were shown to study participants, and a website was built to allow them to rate the helpfulness of each product's set of reviews. A total of 172 people participated in the experiment. Each participant was instructed to read the six reviews for each product and to rate their helpfulness on a scale of 1 to 7, where 1 meant “Not helpful at all” and 7 meant “Very helpful.” To exclude any influence of review sequence on the evaluations, the reviews were presented in random order.

The average rating given by the participants to the forty-eight reviews was 4.862 (SD = 0.83). The distribution of ratings was as follows: 205 ratings of 1; 323 of 2; 695 of 3; 2,153 of 4; 1,888 of 5; 1,790 of 6; and 202 of 7. The most frequently selected rating was 4, and the extreme ratings of 7 (202) and 1 (205) were the least frequent, followed by 2 and 3. The frequency also tended to decrease as ratings increased from 5 to 6 and then to 7.

After calculating the frequency of the words included in the 48 reviews, we created a matrix of words and documents. The tm package of R [21] was used to extract the words. Stemming was done using the tm_map function in the tm package after we removed numbers, stopwords, and symbols. Then, we extracted a total of 2,029 words. Table 4 shows the 10 most frequently used words in our dataset.

Table 4. Top 10 most frequently used words
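The preprocessing described above might look as follows in R with the tm package [21]; `review_texts` (a character vector holding the 48 review texts) is an illustrative name, and the exact order of transformations in the original study may differ.

```r
library(tm)  # text mining framework [21]; stemming additionally requires SnowballC

corpus <- VCorpus(VectorSource(review_texts))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)                      # remove numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # remove stopwords
corpus <- tm_map(corpus, removePunctuation)                  # remove symbols
corpus <- tm_map(corpus, stemDocument)                       # stemming via tm_map

dtm <- DocumentTermMatrix(corpus)  # 48 reviews x stemmed words, with frequencies
```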

The average frequency of use across all words was 2.0001; most words (1,289) appeared only once. To apply the above data to the information gain-based review recommendation algorithms, reviews with ratings of 4 and above were assigned a helpfulness value of 1, and the remaining reviews were assigned a value of 0. To recommend reviews based on the information gain calculated from the ratings of all users, reviews rated as helpful by more than 60% of all users (104 of 172 users) were classified into the helpful review group; the others were classified into the not helpful review group. This conversion was applied for all of the methods introduced in Section 3 except the personalized method with an average threshold (Section 3.5), which used each user's average rating as the threshold for classifying reviews as helpful or not.

 

5. Experimental Results

The experiment was conducted using 30%, 50%, and 70% of the entire dataset as training data. The split for each case was made randomly and repeated 30 times, and the average performance measures were computed over the 30 random splits. Recommendations were generated using the reviews with the top three, top five, and top seven predicted helpfulness scores.

The precision of a recommendation was measured as the proportion of helpful reviews among the recommended reviews, and recall was calculated as the ratio of helpful reviews in the recommendation to the total number of helpful reviews in the test dataset. Given that precision decreased as the number of recommended reviews increased, whereas recall tended to increase, and given its wide use as a performance indicator, the F measure was used as the indicator of performance:

$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
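As a sketch, the three measures can be computed for a single top-k recommendation as follows; `recommended` and `helpful` hold review identifiers and are illustrative names.

```r
# Precision: share of recommended reviews that are helpful.
# Recall: share of all helpful reviews in the test set that were recommended.
# F: harmonic mean of precision and recall.
evaluate_topk <- function(recommended, helpful) {
  hits      <- length(intersect(recommended, helpful))
  precision <- hits / length(recommended)
  recall    <- hits / length(helpful)
  f_measure <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, F = f_measure)
}

evaluate_topk(recommended = c(1, 2, 3), helpful = c(2, 3, 5, 8))
# precision 0.667, recall 0.500, F 0.571
```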

Table 5 shows a summary of the recommendation performance of the methods suggested in Section 3. The performance of a method that randomly recommends reviews was used as a baseline for comparison. To compare the methods, an F test was conducted on the F scores, followed by a Tukey post hoc test.

Table 5. Experiment results

The F test revealed that the mean values for all three measures differed significantly among methods (Table 5). The random method showed the lowest F scores in all situations, as shown by the results of the Tukey test. The F scores of the total method were higher than those of the random method but lower than those of the personalized methods. It was hard to distinguish among the performances of the personalized methods. Generally, the personalized method with average threshold and the basic personalized method had greater F scores than the weighted personalized and selective personalized methods. However, in some cases no significant differences were found among the personalized methods.

When the size of the training set decreased from 70% to 50% and then to 30%, the F values for the top three, five, and seven recommendations tended to decrease. Although the larger training sets yielded higher recall, precision remained between 0.6 and 0.7 across training-set sizes; the lower recall for smaller training sets can be attributed to the larger number of helpful reviews contained in the corresponding test sets. If we consider only precision, the influence of training-set size on the information gain methods appears minimal.

 

6. Conclusion

The personalized recommendation methods suggested in this study performed better than the total method, which recommends the same reviews to all users. This result was consistent for the top three, five, and seven recommendations across training sets of various sizes (70%, 50%, and 30% of the total data). Among the personalized methods, the personalized method with average threshold and the basic personalized method had higher F scores than the weighted personalized and selective personalized methods, but their performances did not always differ significantly. Thus, caution should be used when selecting a method for personalized review recommendations, and candidate algorithms should be tested empirically on the data at hand.

When we recommended the top three reviews using 70% of the data and the personalized methods, the F score was around 0.3; F scores with 50% and 30% of the data were around 0.2 and less than 0.2 respectively. This was also true for the top five and top seven recommendations. This implies that more training data resulted in better performance.

Though the personalized methods outperformed the total method, the total method has the advantage of being able to recommend reviews to a user even if that user has not voted on any reviews, whereas the personalized methods require that users have voted on other reviews, because they must learn each user's preferences.

An information gain approach, whether total or personalized, can recommend a review that has not been rated yet because the method only investigates the contents of the review. Therefore, it is useful for deciding whether a relatively new review is helpful when that review has not received enough votes for other algorithms to be applied.

In this study, we collected the preferences of users in only four product categories: perfume, books, cameras, and shoes. Thus, caution should be used when applying the suggested methods to other product categories. For personalization, we may use other information such as user preferences for products. Thus, opportunities remain for further studies that combine votes on reviews with other information so as to improve recommendations.
