Fourth International Conference on Computer Science and Information Technology (CoSIT 2017) | Principal Component Analysis | Cluster Analysis

Dhinaharan Nagamalai et al. (Eds): CoSIT, SIGL, AIAPP, CYBI, CRIS, SEC, DMA - 2017, pp. 01–07, 2017. © CS & IT-CSCP 2017. DOI: 10.5121/csit.2017.70401

UNSUPERVISED DETECTION OF VIOLENT CONTENT IN ARABIC SOCIAL MEDIA

Kareem E Abdelfatah (1,3), Gabriel Terejanu (1), Ayman A Alhelbawy (2,3)

(1) Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, USA
(2) Computer Science and Electrical Engineering Department, University of Essex, Essex, United Kingdom
(3) Computers and Information Faculty, Fayoum University, Fayoum, Egypt

ABSTRACT

A monitoring system is proposed to detect violent content in Arabic social media. This is a new and challenging task due to the presence of various Arabic dialects in social media and the non-violent contexts in which violent words might be used. We propose to use a probabilistic non-linear dimensionality reduction technique called the sparse Gaussian process latent variable model (SGPLVM), followed by k-means, to separate violent from non-violent content. This framework does not require any labelled corpora for training. We show that violent and non-violent Arabic tweets are not separable using k-means in the original high dimensional space; however, better results are achieved by clustering in the low dimensional latent space of the SGPLVM.

KEYWORDS

Violence, Social Media, Arabic, SGPLVM, Dimensionality Reduction, Unsupervised learning

1. INTRODUCTION

According to the Arab Social Media Report, there were 6 million Twitter users in the Arab world in March 2014, posting on average around 17 million tweets per day [1]. Twitter provides rich information, as people share with others what they like and do not like, their beliefs, their political opinions, and what they observe. Due to the dramatic problems plaguing much of the Arab world, a significant amount of content on social media is about violence and abuse.
Detecting offensive and violent content in social media is a very active research area, especially in the last few years. This type of research is valuable to various organizations such as Human Rights Organizations (HROs). In crisis countries such as Iraq or Syria, it may be dangerous for HROs to obtain reports and monitor the human rights situation through the usual process. Mining social media might therefore offer a way to detect and identify human rights abuses safely. However, to our knowledge there is very little work on detecting violent content in Arabic social media. This is a serious gap, as there is a real need for this kind of research.

Arabic in social media is one of the most challenging languages to study and analyze. Arabic is the official language of around 22 countries, with more than 350 million speakers around the world [2]. All of these countries are diglossic societies where both the standard form of the language, Modern Standard Arabic (MSA), and the regional dialects (DA) are used [3]. MSA is used in official settings, while DA is the native tongue of Arabic speakers. DA does not have a standard orthography and is divided into several groups across these countries [4]. Nowadays, these dialects are used extensively in social media text, in spite of their original absence from a written form [3].

Detecting violent content in Arabic social media is not a trivial task, not only because of the different Arabic dialects mentioned above, but also because violent Arabic words are not always representative of a violent context. For example, the word "killing" has a violent meaning, but it may also be used in a non-violent context, as in the following tweet examples [5]:
    "The memory and the pain twins, you cannot kill the pain without crushing the memory"

    "You may kill the flowers but cannot prevent the arrival of spring"

On the other hand, the same word can be used in a violent context, as in the following example [5]:

    "The killing of five people shot dead by gunmen and arrested six suspects"

In this work, we tackle this problem using a recently released dataset that contains 16,234 manually annotated Arabic tweets [5]. It covers different violent contexts such as killing, raping, kidnapping, terrorism, invasion, explosion, and execution. To our knowledge, this is the first study conducted on this dataset. We use an unsupervised technique to cluster this dataset into two groups, violent and non-violent content. First, the Sparse Gaussian Process Latent Variable Model (SGPLVM) [6] is used as an unsupervised probabilistic non-linear Dimensionality Reduction (DR) model. Then we apply k-means to the features extracted in the previous step. Using this recently released Arabic dataset [5], our experiments show that violent and non-violent Arabic tweets are not separable using k-means in the original high dimensional space; however, better results are achieved using the low dimensional projections provided by the SGPLVM.

2. PREVIOUS WORK

There is much research on detecting violent content on the web [7, 8]. Computer vision techniques have been proposed to detect violence in videos [9–11]. Text mining techniques, on the other hand, have been used to detect violence in English social media, but little work targets this problem in Arabic social media. A probabilistic violence detection model (VDM) is proposed in Ref. [12] to extract violence-related topics from social media data.
The authors propose a weakly supervised technique, using OpenCalais with Wikipedia documents, and Wikipedia and YAGO categories, to build a training corpus. The dataset was built to detect violence categories such as Crimes, Accidents, and War Conflict. Non-violence categories are anything other than violence, such as Education and Sports. We tested OpenCalais, but unfortunately it does not support Arabic text. Also, the number of documents under violence categories in Arabic Wikipedia is very small.

Lexical Syntactical Feature (LSF) [13] has been introduced to detect offensive language in social media. The proposed system uses the user profile to obtain information about the user's English writing style. A set of lexical features, such as Bag of Words and N-grams, and hand-authored syntactic rules are used to identify name-calling harassment. In addition, a user's potential to send out offensive content in social media is predicted using style, structure, and context-specific features. This method uses Naive Bayes and SVM techniques to train a classifier.

3. CLUSTERING IN A LOWER SPACE

It is very common in NLP to have very high dimensional feature vectors. Using unsupervised techniques to cluster patterns is a good and cheap choice, and the k-means algorithm is one of the good candidates among unsupervised learning techniques. However, k-means can give better results when it is applied to low dimensional features [14]. Therefore, it is common to project a high dimensional data set onto a lower dimensional subspace using unsupervised DR techniques such as Principal Component Analysis (PCA) [15] to improve learning. A widely used approach is to project the data onto a lower dimensional subspace using PCA and then use k-means to cluster the data in the lower dimensional space [15].
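As a minimal sketch of the PCA-then-k-means baseline described above (not the SGPLVM pipeline itself), the following Python snippet projects synthetic 5-dimensional points onto two approximate principal directions and clusters them with k = 2. The toy data, the function names (`pca_project`, `kmeans`), the power-iteration PCA, and the farthest-first initialisation are all assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def center(rows):
    """Subtract the per-feature mean from every row."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - mean[j] for j in range(d)] for r in rows]


def top_component(rows, iters=100, seed=0):
    """Approximate leading principal direction of centered rows (power iteration)."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        # Apply X^T X to v without forming the covariance matrix explicitly.
        scores = [sum(r[j] * v[j] for j in range(d)) for r in rows]
        new_v = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in new_v))
        if norm == 0.0:  # no variance left to explain
            break
        v = [x / norm for x in new_v]
    return v


def pca_project(rows, q):
    """Project rows onto q approximate principal directions, using deflation."""
    residual = center(rows)
    latent = [[] for _ in rows]
    for _ in range(q):
        v = top_component(residual)
        scores = [sum(r[j] * v[j] for j in range(len(v))) for r in residual]
        for z, s in zip(latent, scores):
            z.append(s)
        # Remove the direction just found so the next one is orthogonal to it.
        residual = [[r[j] - s * v[j] for j in range(len(v))]
                    for r, s in zip(residual, scores)]
    return latent


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Synthetic stand-in for feature vectors: two well-separated groups in 5-D.
rng = random.Random(1)
group_a = [[rng.gauss(0.0, 0.2) for _ in range(5)] for _ in range(20)]
group_b = [[rng.gauss(4.0, 0.2) for _ in range(5)] for _ in range(20)]
data = group_a + group_b

latent = pca_project(data, q=2)   # 40 points, 2 latent dimensions each
labels = kmeans(latent, k=2)      # one cluster index per point
```

Real tweet features are far less cleanly separated than this toy data, which is why the choice of DR technique matters for the clustering step.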
Because unsupervised clustering algorithms such as k-means operate mainly on distances, it is vital to use a DR technique that preserves the distance metric between the data points in the low dimensional subspace. PCA is the most widely used linear DR technique for obtaining a lower dimensional representation of a data set. PCA may maintain the dissimilarity [14], which can help k-means achieve better separation for clustering. By preserving the dissimilarity we mean that points that are far apart in the data space remain far apart in the latent space. However, due to its linearity, PCA may not capture the structure of the data through a low dimensional embedding [16].

The Gaussian process latent variable model (GPLVM) [17] is a flexible non-linear approach to probabilistic modelling of data in high dimensional spaces. It can be used as a DR method that maps between the observed data points $Y \in \Re^{N \times D}$ and the latent unobserved data points $X \in \Re^{N \times q}$. One of its advantages is that it can preserve the dissimilarity and smoothness between the data in the high and low dimensional spaces. Smoothness means that if two points in the latent space are close to (far from) each other, then they will be mapped to two points that are relatively close to (far from) each other in the data space.

The GPLVM, as a probabilistic approach, models the relationship between the latent variables and the observed data through a non-linear parametrized function

$$ y_{:,i} = f(X, w_{i,:}) + \varepsilon_{:,i}, $$

where $y_{:,i} \in \Re^{N \times 1}$ represents one dimension of the observed data and $w_{i,:} \in \Re^{1 \times D}$ is one row of the parameters $W \in \Re^{q \times D}$, which has a Gaussian prior over each of its rows with zero mean and unit variance, $w_i \sim N(w_i \mid 0, I)$, and the noise is $\varepsilon_{:,i} \sim N(0, \sigma^2 I)$. GPLVM assumes independence across the data dimensions. Thus, the likelihood for all dimensions can be written as a product of the likelihoods of the $D$ observed dimensions.
$$ p(Y \mid X) = \prod_{i=1}^{D} N(y_{:,i} \mid 0, K + \sigma^2 I) $$

The latent projections can be inferred by maximizing the marginal log-likelihood of the data,

$$ \log p(Y \mid X) = -\frac{D}{2} \log |K| - \frac{1}{2} \mathrm{Tr}(K^{-1} Y Y^{T}) + C $$

Here, $C$ is a constant and $K \in \Re^{N \times N}$ is a kernel matrix that is calculated from the training data. Different kernel functions can be used. In our experiments we used the radial basis function (RBF),

$$ k(x_i, x_j) = \theta_{\mathrm{rbf}} \exp\left(-\frac{\gamma}{2} (x_i - x_j)^{T} (x_i - x_j)\right) $$

where $\theta_{\mathrm{rbf}}$ and $\gamma$ are the parameters of the kernel.

However, a major drawback of the standard GPLVM approach is its computational time. To infer the latent variable $X$, GPLVM uses a gradient-based iterative optimization of the log-likelihood, which has $O(N^3)$ complexity due to the inversion of $K$ [6]. The Sparse GPLVM (SGPLVM) [6] addresses this issue by reducing the complexity to $O(u^2 N)$, where $u$ is the number of points retained in the sparse representation. Using the SGPLVM before k-means therefore helps preserve the dissimilarity between the data points in the latent space, which leads to coherent patterns that can be detected easily via clustering.

4. DATASET

A manually annotated dataset of 16,234 tweets is used for training and testing [5]. Every tweet was classified by at least five different annotators. Because every tweet is classified by different users, it may be assigned different classes. A final aggregate class is therefore assigned based on a class confidence score, as described in the original publication [5]. In our experiments we kept only the tweets that have a confidence score of more than 0.7.
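The confidence-based filtering just described can be sketched as follows. The exact confidence score is defined in the original dataset publication [5]; as a stand-in, this hypothetical example uses the share of annotators who agree with the majority label. The record layout, the toy annotations, and the names `aggregate` and `keep_confident` are likewise assumptions made for illustration.

```python
from collections import Counter


def aggregate(votes):
    """Return (majority_label, agreement) for one tweet's annotator votes.

    ASSUMPTION: agreement is the fraction of annotators who chose the
    majority label. The real confidence score is defined in [5] and may differ.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def keep_confident(annotated, threshold=0.7):
    """Keep only tweets whose aggregate score is strictly above the threshold."""
    kept = []
    for text, votes in annotated:
        label, score = aggregate(votes)
        if score > threshold:
            kept.append((text, label, score))
    return kept


# Hypothetical toy annotations: five votes per tweet.
annotated = [
    ("tweet A", ["violence"] * 5),
    ("tweet B", ["violence", "violence", "violence", "other", "other"]),
    ("tweet C", ["other", "other", "other", "other", "crime"]),
]

kept = keep_confident(annotated)
```

With these toy votes, "tweet B" has only three of five annotators in agreement and is dropped, while the other two tweets pass the 0.7 threshold.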
Table 1: Dataset Details

Class          Training   Testing   Total    %
Violence       6573       2759      9332     57.5
Non-Violence   4790       2112      6902     42.5
Total          11363      4871      16234

The original dataset is classified into seven violence classes: crime, violence, human rights abuse, political opinion, crisis, accidents, and conflict. There is an additional class, "other", which contains non-violent tweets in which some violent words are mentioned. Because we are interested in detecting violent acts in Arabic social media regardless of the type of violence, all violence classes are mapped to one class, "violence", while the "other" class is mapped to the "non-violence" class. Around 70% of the dataset is used for training and 30% for testing, as shown in Table 1.

5. EXPERIMENTAL SETUP

Arabic has a complex morphological structure, especially Dialectal Arabic [18]. Until now, there have been no standard resources available for Arabic text mining and morphological analysis [18]. For our study, however, we use the MADAMIRA [18] analysis tool because it has most of