
Information Sciences

Volume 379, 10 February 2017, Pages 274-287

ORFEL: Efficient detection of defamation or illegitimate promotion in online recommendation

https://doi.org/10.1016/j.ins.2016.09.006

Abstract

What if a successful company starts to receive a torrent of low-valued (one or two stars) recommendations for its mobile apps from multiple users within a short (say one month) period of time? Is it legitimate evidence that the apps have lost quality, or an intentional plan (via lockstep behavior) to steal market share through defamation? In the case of a systematic attack on one's reputation, it might not be possible to manually discern between legitimate and fraudulent interaction within the huge universe of possibilities of user-product recommendation. Previous works have focused on this issue, but none of them took into account the context, modeling, and scale that we consider in this paper. Here, we propose the novel method Online-Recommendation Fraud ExcLuder (ORFEL) to detect defamation and/or illegitimate promotion of online products by using vertex-centric asynchronous parallel processing of bipartite (users-products) graphs. With an innovative algorithm, our results demonstrate both efficacy and efficiency – over 95% of potential attacks were detected, and ORFEL was at least two orders of magnitude faster than the state-of-the-art. Building on a novel methodology, our main contributions are: (1) a new algorithmic solution; (2) a scalable approach; and (3) a novel context and modeling of the problem, which now addresses both defamation and illegitimate promotion. Our work deals with relevant issues of the Web 2.0, potentially augmenting the credibility of online recommendation to prevent losses to both customers and vendors.

Introduction

In the Web 2.0, it is up to the users to provide content, such as photos, text, recommendations, and many other types of user-generated information. The more interaction (e.g., likes, recommendations, comments) a product page (or a user profile) gets, the greater the potential profits that a company (or an individual) may achieve through automatic recommendation, advertisement, and/or priority in search engines. In Google Play, for example, mobile apps heavily depend on high-valued (4 or 5 stars) recommendations to gain prominence and expand their pool of customers; on Amazon, users are offered the most recommended products, that is, those that were best rated; and on TripAdvisor, users rely on others' feedback to pick their next trips. The same dynamics apply to defamation, the act of lowering the rank of a product by creating artificial, low-valued recommendations. Sadly, fraudulent interaction has emerged in the Web 2.0 – fraudulent likes, recommendations, and evaluations create artificial interest that may illegitimately manipulate the perceived importance of online competitors.

Attackers create illegitimate interaction by means of fake users, credential-stealing malware, Web robots, and/or social engineering. The identification of such behavior is of great importance to companies, not only because of the potential losses due to fraud, but also because their customers tend to consider the reliability of a given website as an indicator of trustworthiness and quality. According to Facebook [11], fraudulent interaction is harmful to all users and to the Internet as a whole, so it is important that users have genuine engagement with brands and content.

However, catching such attacks is a challenging task, especially when there are millions of users and millions of products being evaluated in a system that deals with billions of interactions per day. In such attacks, multiple fake users interact with multiple products at random moments [1] in a way that their behavior is camouflaged among millions of legitimate interactions per second. The core of the problem is: how can we track the temporal evolution of fraudulent user-product activity when the number of possible user-product group combinations is combinatorially large?
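
To make the scale of this search space concrete, consider a back-of-envelope count (our illustration, not taken from the paper) of the candidate groups, ignoring the time dimension entirely:

```latex
% Number of candidate (user group, product group) pairs, ignoring time:
\sum_{k \ge 1} \binom{|U|}{k} \;\cdot\; \sum_{m \ge 1} \binom{|P|}{m}
  \;=\; \bigl(2^{|U|} - 1\bigr)\bigl(2^{|P|} - 1\bigr)
```

Each binomial coefficient is a ratio of factorials, and with millions of users in U and products in P the product above is astronomically large; constraining the groups to limited time windows only multiplies the count, which is why exhaustive enumeration is hopeless.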

We want to identify the so-called lockstep behavior, i.e., groups of users acting together, generally interacting with the same products at around the same time. As an example, imagine that an attacker creates a set of fake users to artificially promote his e-commerce website; he would then comment on and/or recommend his own Web pages, posts, or advertisements to gain publicity that, in fairness, should come from real customers. Here, an attacker may be employees of a given company, professionals (spammers) hired for this specific kind of job, Web robots, or even anonymous users. The weak point in all these possibilities is that the attacker must substantially interact with the attacked system within limited time windows; also, the attacker must optimize his efforts by using each fake user account to interact with multiple products. This behavior agrees with the lockstep definition. Note that this pattern is well-defined in online recommendation and in many other domains, such as academic co-citation, social network interaction, and search-engine optimization. Since this is not a new problem, we adopt in this paper the definition of lockstep behavior given by Beutel et al. [4]. See the upcoming Section 3.3 for details.

The task of identifying locksteps is commonly modeled as a graph problem – nodes are either users or products; weighted edges represent recommendations – in which we want to detect near-bipartite cores considering a given time constraint. The bipartite cores correspond to groups of users that interacted with groups of products within limited time intervals. One lockstep may be defamation, when the interactions are negative (low-valued) recommendations, or illegitimate promotion, when the recommendations are positive. Therefore, the problem generalizes to finding near-bipartite cores with edges whose weights correspond to the rank (score) of the recommendations. Note that we want to tackle the problem without any prior knowledge about suspicious users, suspicious products, or the moments when frauds occurred in the past.
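
As a concrete illustration of this modeling, the following minimal sketch (ours, in Python; not the paper's implementation, and the class and method names are illustrative) stores recommendations as weighted, timestamped edges of a users-products bipartite graph:

```python
# Minimal sketch of the bipartite model: one edge per recommendation,
# carrying the score (edge weight) and the timestamp.
from collections import namedtuple, defaultdict

Recommendation = namedtuple("Recommendation", ["user", "product", "score", "timestamp"])

class BipartiteRecommendationGraph:
    """Adjacency lists for a users-products bipartite graph."""

    def __init__(self):
        self.by_user = defaultdict(list)     # user id    -> [Recommendation, ...]
        self.by_product = defaultdict(list)  # product id -> [Recommendation, ...]

    def add_recommendation(self, user, product, score, timestamp):
        edge = Recommendation(user, product, score, timestamp)
        self.by_user[user].append(edge)
        self.by_product[product].append(edge)

    def edges_in_window(self, product, t_start, t_end):
        """Recommendations a product received inside a time window --
        the raw material for spotting temporal near-bipartite cores."""
        return [e for e in self.by_product[product]
                if t_start <= e.timestamp <= t_end]
```

A lockstep detector would then look for groups of users whose edges hit the same group of products inside the same window, with consistently low scores (defamation) or consistently high scores (illegitimate promotion).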

This work extends the state-of-the-art solutions for the problem of lockstep identification. Our main contributions are threefold:

  • 1. Novel algorithmic paradigm: We introduce the first vertex-centric algorithm able to spot lockstep behavior in Web-scale graphs using asynchronous parallel processing; vertex-centric processing is a promising paradigm that still lacks algorithms specifically tailored to its modus operandi;

  • 2. Scalability and accuracy: We tackle the problem for billion-scale graphs on a single commodity machine, achieving efficiency comparable to that of state-of-the-art works running on large clusters of computers, whilst obtaining the same efficacy;

  • 3. Generality of scope: We tackle the problem for real weighted graphs ranging from social networks to e-commerce recommendation, expanding the state-of-the-art lockstep semantics to discriminate between defamation and illegitimate promotion.

This paper follows a traditional organization. Section 2 presents background concepts, while Section 3 reviews the related works. Our proposal is described in Section 4. In Section 5, we report experimental results, including real data analyses. Finally, Section 6 concludes the paper and presents ideas for future work.

Section snippets

Vertex-centric graph processing

We use in this paper the well-known concept of vertex-centric processing [23]. Given a graph G=(V,E) with vertices labeled from 1 to |V|, we associate a value with each vertex and with each edge – for a given edge e=(u,v), u is the source and v is the target. With values associated with vertices and edges, vertex-centric processing corresponds to the graph scan approach depicted in Algorithm 1. The values are determined according to the computation that is desired, e.g., PageRank or belief propagation.
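
Since Algorithm 1 is not reproduced in this snippet, the toy sketch below (ours, in Python, with illustrative names; not ORFEL's actual algorithm) shows the generic vertex-centric pattern being referred to: each vertex reads the values on its incoming edges, updates its own value, and writes the result onto its outgoing edges.

```python
from collections import defaultdict

def vertex_centric_scan(vertex_values, edge_values, update, max_iters=10):
    """Toy vertex-centric scan (sequential stand-in for the parallel version).
    vertex_values: dict vertex -> value
    edge_values:   dict (source, target) -> value
    update:        user-supplied function implementing the desired
                   computation (e.g., one step of a PageRank-like update).
    """
    in_of, out_of = defaultdict(list), defaultdict(list)
    for (u, v) in edge_values:
        in_of[v].append((u, v))
        out_of[u].append((u, v))

    for _ in range(max_iters):
        for v in vertex_values:
            # Gather: values sitting on the incoming edges of v.
            gathered = [edge_values[e] for e in in_of[v]]
            # Apply: compute the new value of v.
            vertex_values[v] = update(vertex_values[v], gathered)
            # Scatter: write the new value onto the outgoing edges of v.
            for e in out_of[v]:
                edge_values[e] = vertex_values[v]
    return vertex_values
```

Because updates written to edges are immediately visible to vertices processed later in the same pass, this simple loop already mimics, sequentially, the asynchronous style of processing that ORFEL exploits in parallel.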

Clustering

The identification of lockstep behavior refers to the problem of partitioning both the rows and the columns of a matrix – known in the literature as co-clustering or bi-clustering. Some authors have worked on similar variations of the bi-clustering problem. For example, Papalexakis and Sidiropoulos used PARAFAC decomposition over the ENRON e-mail corpus [29]; Dhillon et al. used information theory over word-document matrices [9], and Banerjee et al. used Bregman divergence for predicting

The generalized lockstep problem

This section shows how to enhance the potential semantics of the lockstep-detection problem by taking into account the weights of edges (e.g., recommendations’ scores).

Given the formulation presented in Section 3.3, we propose new semantics for the problem by considering the weights of edges to define the concepts of defamation – Eq. (6) – and illegitimate promotion – Eq. (7). These weights correspond, for instance, to the numeric evaluation (score) given by a user to a product in a recommendation
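
Since Eqs. (6) and (7) are not shown in this snippet, the sketch below (ours, in Python) only illustrates the intuition: once a candidate lockstep is found, the scores of its edges suggest whether it looks like defamation (coordinated low ratings) or illegitimate promotion (coordinated high ratings). The function name and the thresholds, given for a 1-5 rating scale, are assumptions of the sketch, not the paper's actual criteria.

```python
# Illustrative only: label a candidate lockstep by its recommendation scores.
def classify_lockstep(scores, low_threshold=2.0, high_threshold=4.0):
    """scores: recommendation scores (edge weights) inside the candidate
    near-bipartite core. Returns a coarse label for the lockstep."""
    if not scores:
        return "empty"
    mean_score = sum(scores) / len(scores)
    if mean_score <= low_threshold:
        return "possible defamation"              # coordinated low ratings
    if mean_score >= high_threshold:
        return "possible illegitimate promotion"  # coordinated high ratings
    return "inconclusive"

# Example: a group of accounts handing out 1- and 2-star ratings in lockstep.
print(classify_lockstep([1, 1, 2, 1, 2]))  # -> possible defamation
```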

Experimental setting

We studied two real-world graphs: Amazon.FineFoods and Amazon.Movies. They are publicly available at the Snap project [20] web page at http://snap.stanford.edu/. Both datasets comprise user-product recommendation data from the Amazon website; the first refers to the section of fine food products and the second contains reviews of movies. For each review, we have the corresponding timestamp and a numeric evaluation (score) ranging from 1 to 5. Synthetic graphs were also studied so as to
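
As an illustration of the preprocessing such an experiment requires, the sketch below (ours, in Python) turns a SNAP-style review dump into (user, product, score, timestamp) tuples suitable for building the bipartite graph. The field names (review/userId, product/productId, review/score, review/time) are assumptions based on the public dataset format and may differ between dumps.

```python
def parse_snap_reviews(path):
    """Yield (user, product, score, timestamp) tuples from a text dump in
    which each review is a block of 'key: value' lines separated by blank
    lines (assumed format; adjust the keys for a particular dump)."""
    def emit(record):
        return (record.get("review/userId"),
                record.get("product/productId"),
                float(record.get("review/score", "nan")),
                int(record.get("review/time", "0")))

    record = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            line = line.strip()
            if not line:            # a blank line closes the current review
                if record:
                    yield emit(record)
                    record = {}
                continue
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
    if record:                      # last review (no trailing blank line)
        yield emit(record)
```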

Conclusions

We conclude that, although the problem of detecting fake online interaction over time is NP-hard, it is possible to detect most of the malicious activities in a timely manner even with low-cost computer machinery. To do so, we designed a vertex-centric graph algorithm using the asynchronous parallel processing paradigm. The general problem is modeled as a bipartite, weighted, and timestamped graph from which we want to detect temporal near-bipartite cores. This model suits problems of systematic

Acknowledgments

We thank Prof. Christos Faloutsos and Alex Beutel, from Carnegie Mellon University, for their valuable collaboration. This work was funded by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq - 444985/2014-0), Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP - 2013/10026-7, 2016/02557-0 and 2014/21483-2), and Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Capes).

References (36)

  • A. Cruz-Roa et al.

    Visual pattern mining in histology image collections using bag of features

    Artif. Intell. Med.

    (2011)
  • R. Peeters

    The maximum edge biclique problem is NP-complete

    Discrete Appl. Math.

    (2003)
  • L. Akoglu et al.

    RTM: laws and a recursive generator for weighted time-evolving graphs

    2008 Eighth IEEE International Conference on Data Mining

    (2008)
  • A. Anagnostopoulos et al.

    Approximation algorithms for co-clustering

    Proceedings of the Twenty-Seventh ACM Symposium on Principles of Database Systems

    (2008)
  • A. Banerjee et al.

    A generalized maximum entropy approach to Bregman co-clustering and matrix approximation

    J. Mach. Learn. Res.

    (2007)
  • A. Beutel et al.

    CopyCatch: stopping group attacks by spotting lockstep behavior in social networks

    Proceedings of the 22nd International Conference on World Wide Web

    (2013)
  • Y. Cheng

    Mean shift, mode seeking, and clustering

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1995)
  • P.-A. Chirita et al.

    Preventing shilling attacks in online recommender systems

    Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management

    (2005)
  • K. Crammer et al.

    A needle in a haystack: local one-class optimization

    Proceedings of the Twenty-First International Conference on Machine Learning

    (2004)
  • I.S. Dhillon et al.

    Information-theoretic co-clustering

    Proceedings of the Ninth ACM International Conference on Knowledge Discovery and Data Mining

    (2003)
  • J.R. Douceur

    The sybil attack

    Peer-to-Peer Systems

    (2002)
  • Facebook, Improvements to our site integrity systems, October 2012. http://facebook.com/10151005934870766
  • J.E. Gonzalez et al.

    PowerGraph: distributed graph-parallel computation on natural graphs

    Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation

    (2012)
  • A. Goyal et al.

    Feature subspace selection for efficient video retrieval

    International Conference on Multimedia Modeling

    (2010)
  • G. Gupta et al.

    Robust one-class clustering using hybrid global and local search

    Proceedings of the 22nd International Conference on Machine Learning

    (2005)
  • A.A. Hagberg et al.

    Exploring network structure, dynamics, and function using NetworkX

    Proceedings of the 7th Python in Science Conference, Pasadena, CA USA

    (2008)
  • W.-S. Han et al.

    TurboGraph: a fast parallel graph engine handling billion-scale graphs in a single PC

    Proceedings of the 19th ACM International Conference on Knowledge Discovery and Data Mining

    (2013)
  • U. Kang et al.

    PEGASUS: a peta-scale graph mining system implementation and observations

    2009 Ninth IEEE International Conference on Data Mining

    (2009)