Discovering semantic and sentiment correlations using huge corpus of short informal Arabic language text

Tracking #: 1294-2506

Authors: 
Salha Al-Osaimi
Muhammad Badruddin Khan

Responsible editor: 
Harith Alani

Submission type: 
Full Paper
Abstract: 
Semantic and sentiment analysis have received a great deal of attention over the last few years due to the important role they play in many different fields, including marketing, education, and politics. Social media has given researchers tremendous opportunities to collect huge amounts of data as input for their semantic and sentiment analysis. Using the Twitter API, we collected around 4.5 million Arabic tweets and used them to propose a novel automatic unsupervised approach to capture patterns of words and sentences of similar contextual semantics and sentiment in informal Arabic at the word and sentence levels. We used a language model (LM), a statistical model that can effectively estimate the distribution of natural language. In experiments, the proposed model performed better than the classic bigram and latent semantic analysis (LSA) models in most cases at the word level. To handle the big data, we used different text processing techniques followed by removal of the unique words based on their relevance to the problem.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 04/Mar/2016
Suggestion:
Reject
Review Comment:

The paper describes how to use Language Modeling to estimate the sentiment and semantic correlation of informal Arabic language text. It is not clear what the novel approach and the contribution of the paper are.
The methodology seems a bit loose; further details should have been given. When generating the vocabulary, I am skeptical that words appearing fewer than 400 times in the corpus can be removed without losing information. A link to the generated dataset should have been given (taking care of any copyright issues).
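One way to check the concern about the frequency cutoff is to measure how much of the vocabulary and how much of the running text a given threshold removes. The following is a minimal sketch on a hypothetical toy corpus; the function name `prune_by_frequency`, the data, and the threshold of 3 are illustrative assumptions, not the authors' pipeline (the paper's cutoff is 400).

```python
from collections import Counter

def prune_by_frequency(tokens, min_count):
    """Drop every word type that occurs fewer than min_count times."""
    counts = Counter(tokens)
    kept = {w: c for w, c in counts.items() if c >= min_count}
    # Share of distinct word types that the cutoff removes.
    vocab_removed = 1 - len(kept) / len(counts)
    # Share of running tokens that the cutoff removes.
    tokens_removed = 1 - sum(kept.values()) / sum(counts.values())
    return kept, vocab_removed, tokens_removed

# Hypothetical toy corpus: frequent topic words plus rarer opinion words.
tokens = ["goal"] * 6 + ["match"] * 5 + ["happy"] * 2 + ["sad", "war", "hope"]
kept, vocab_removed, tokens_removed = prune_by_frequency(tokens, min_count=3)

print(sorted(kept))
print(f"vocabulary removed: {vocab_removed:.0%}, tokens removed: {tokens_removed:.0%}")
```

In this toy case the surviving words are the frequent topic words, while the rarer opinion words ("happy", "sad") are dropped, which is the kind of information loss the comment above is worried about.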
It is not clear how sentiment-bearing tweets were identified among the collected ones for analysis.
There is no precision-recall analysis or anything else that demonstrates the accuracy of the proposed method. A comparative analysis should have been provided as well. It is not clear how the correlation between sentiment and semantics has been evaluated.
Also, I do not see how semantics has been employed in such a paper.
The authors claim to resolve the big data problem using different text processing techniques. I wonder whether this is really a big data problem. Big data problems are usually addressed with specific techniques (such as MapReduce), but the authors do not seem to have used any of them.

Minor errors:
In the related work section, the work of other researchers is referred to by author name but without the actual reference. Some references are not really pertinent to the section in which they appear or to the paper.
Figure 3 should be a table and not a screenshot of a spreadsheet.
The English should be thoroughly revised and improved. Several grammatical errors and typos are present.
There are also formatting errors: some columns of several pages of the paper seem wrong.

Review #2
By Taha Tobaili submitted on 14/Mar/2016
Suggestion:
Reject
Review Comment:

1- English is very poor.
2- The paper suggests that there is sentiment analysis; however, the authors focus only on semantic correlations.

Abstract:
3- Why do you need to handle big data?
4- What about sentence level?

Introduction
4- opinion mining is different from sentiment analysis
5- "expensive job"?
6- What is huge in terms of corpus?
7- Why is this work important?

Related Work:
8- Different approaches need to be explained and compared to your work.
9- Related work need not be "encouraging"; rather, it should highlight how your work is better than other approaches.

Methodology:
10- Date is valuable target
11- repeated characters and symbols are valuable- express sentiment.
12- How did you perform normalization, and which tool did you use?
13- tokenization: what about stop words?
14- TF-IDF deletes the highest-frequency words, not the lowest!
15- work in document representation is not clear!
16- Clustering... bad flow
17- "hardware limitation", "out of memory"!
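Point 14 can be illustrated with a small worked example. In the textbook formulation, TF-IDF does not delete words by itself; it assigns a weight, and terms that occur in many documents receive a low (possibly zero) weight, so any thresholding on that weight affects the most widespread terms first. The sketch below assumes the unsmoothed variant idf = log(N / df); the function name `tf_idf` and the toy documents are hypothetical, and library implementations typically use smoothed variants.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (tf[t] / len(doc)) * math.log(n_docs / df[t]) for t in tf}
        )
    return weights

# Hypothetical tokenized documents.
docs = [
    ["the", "match", "was", "great"],
    ["the", "team", "was", "bad"],
    ["the", "goal", "was", "great"],
]
weights = tf_idf(docs)
print(weights[0])
```

Here "the" and "was" appear in every document and get weight zero, "great" appears in two documents and gets a small weight, and "match" appears in one document and gets the largest weight.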

Results
18- How did you find sentiment words in informal language
19- There are no informal words
20- what query words?
21- what is the contribution?
22- what about other k values?
23- a very basic semantic correlation for Arabic in KSA.
24- tables are too vague
25- most topics are about football, what about the war that's taking place?
26- which 5 sentences??

Discussion:
27- no evaluation of results

Conclusion:
28- contribution?

Review #3
Anonymous submitted on 17/Mar/2016
Suggestion:
Reject
Review Comment:

This paper proposes an unsupervised approach for Arabic sentiment analysis at the word and sentence levels. The proposed approach relies on the co-occurrence of word bigrams in tweets in order to extract the contextual semantics and sentiment similarity between these words. Evaluation is conducted on a set of 4.5 million Arabic tweets. Qualitative analysis is done on very few samples of the words and sentences in the dataset to compare the performance of the proposed approach against two context-based sentiment detection baselines.

Strengths: An unsupervised approach for sentiment analysis of informal Arabic texts.

Weaknesses

Contribution and Novelty:
The use of word co-occurrences for sentiment analysis has been extensively studied in previous work (some of it cited in Section 2 of the paper). The proposed approach does not introduce any major addition to the current state of the art, which makes the contribution and novelty of this work very limited.

Methodology:
- Although the proposed approach is simple and straightforward, several parameters and aspects of the approach look ad hoc. For example, it is unclear why the authors chose to remove words that have a co-occurrence frequency of less than 400. More importantly, removing these words shrinks the vocabulary size by 84%, as described in Section 3.4. In my opinion, such a random and unconstrained reduction method would remove many contextual and/or opinionated words from the vocabulary, which in turn would affect sentiment analysis performance.
- Section 3.7: What do the components in the vector representation of words represent? For example, what do 1, 1, and 5 in the vector of the word “love” refer to?
- I understand that you tried several numbers of clusters for K-means, but it is unclear how you decided that 200 is the best choice!
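A common way to justify the number of clusters is to report a quality measure over a range of k rather than a single chosen value. The sketch below runs plain Lloyd's k-means in pure Python and prints the inertia (within-cluster sum of squared distances) for several k, as in the elbow heuristic. The toy 2-D points, the function names, and the choice of inertia as the criterion are assumptions for illustration only; the paper's word vectors and its selection procedure are not known from this review, and a silhouette score would be an equally reasonable criterion.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia

# Hypothetical data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

inertias = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

Reporting such a curve (or the corresponding silhouette values) for the candidate values of k would let readers see whether 200 is a natural choice or an arbitrary one.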

Evaluation:
No proper evaluation of the proposed approach is conducted in this paper. The current evaluation is done by manually comparing the output of the proposed approach against the baselines. To this end, very few samples of the output were analysed, which does not give a clear idea about the performance of the proposed approach.
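A quantitative evaluation of the kind requested here would typically compare system output against a manually labelled gold standard and report precision, recall, and F1. The sketch below shows that computation for a binary sentiment label; the label names, the function name `precision_recall_f1`, and the six toy examples are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(gold, predicted, positive="pos"):
    """Precision, recall, and F1 for one target label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical gold labels and system predictions for six tweets.
gold = ["pos", "pos", "neg", "neg", "pos", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Computed over a sufficiently large labelled sample and for each baseline, these figures would give a clearer picture of performance than a manual inspection of a few outputs.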

Scalability:
Several reduction methods are applied to the data and matrices in this paper in order to reduce the computational cost of the proposed approach. Regardless of whether this kind of reduction is sound or not, this suggests that the scalability of your approach is very limited.

Presentation
The paper is not very well structured and contains many typos. Also, equations are presented in the paper as images. I suggest that the authors use standard LaTeX syntax for equations in order to make the presentation clearer.

Overall, although this paper addresses an important problem in the sentiment analysis area, the proposed approach lacks novelty. Also, the evaluation conducted in this work is very limited.