Master's Thesis, 2019
109 pages, Grade: 89
Abbreviations and Acronyms
Personal Motivation
Scientific Motivation
Acknowledgments
Chapter I
1.1 Introduction
1.2 Problem Statement
1.3 Purpose of the Study
1.4 Research Questions
1.5 Research Hypotheses
1.6 Significance of the Study
1.7 Limitations of the Study
1.8 Challenges of the study
1.9 Research contributions
1.10 Key Terms
1.10.1 Sentiment Analysis
1.10.2 Natural Language Processing (NLP)
1.10.3 Arabizi NLP
1.10.4 Classifier
1.10.5 Big Data
1.10.6 Machine Learning Classifier
1.10.7 Lexicon-based Classifier
1.10.8 Customer Review
1.11 Research Outline
Chapter II
2.1 Literature Review
2.2 Natural Language Processing (NLP)
2.3 Big Data and Sentiment Analysis (SA)
2.4 Approaches to SA
2.4.1 Lexicon-Based Approach
2.4.2 Machine Learning Approach
2.4.3 Hybrid Approach
2.5 Arabizi and the Lebanese Dialect
2.6 Sentiment Analysis and Lebanese Arabizi
Chapter III
3.1 Research Methodology
3.2 Research Design
3.3 Research Sample
3.3.1 The Challenges of Analyzing Arabizi Texts
3.4 Data Preprocessing and Filtering
3.4.1 Removal of reviews with “neutral” sentiment
3.4.2 Ratings' Encodings
3.4.3 Data splitting for training and testing
3.4.4 Data Cleaning
3.5 Reviews Representation
3.5.1 Selected Features
3.6 Research Tools
3.6.1 Machine Learning Classifier
3.6.2 Lexicon-based Classifier
3.7 Research Procedure
Chapter IV
4.1 Experiment Preparation
4.2 Data Preprocessing
4.3 Feature Extraction
4.4 Building Classifiers
4.4.1 Machine Learning
4.4.2 Lexicon-based
4.5 Results and Evaluation
Chapter V
5.1 Research Result
5.2 Machine Learning
5.2.1 First phase (Default settings)
5.2.2 Second phase (hyperparameters tuning settings)
5.2.3 Experiment Summary
5.3 Lexicon-based
5.3.1 Experiment Summary
5.4 Discussion
Chapter VI
6.1 Conclusion
6.2 Future Work
References
Appendix
Appendix A
Appendix B
Endnote
Figure 1 - Turing Test in which person C chats through text only with another person B and machine A, adopted from Bansal (2018, p.3)
Figure 2 - ML techniques, adopted from Wang, Chaovalitwongse, and Babuska (2012, p.2)
Table 1 - Correspondence Difference Between Arabic and Arabizi Encoding Characters. Adopted from Cotterell et al. (2014, p.2) based on Yaghan (2008)
Table 2 - Six Similar Arabic and Arabizi Characters in Relation to Five Regional Dialects, adopted from Tobaili (2015, p.3)
Table 3 - Arabizi Code Categories, adopted from Abed AL-Aziz, et al. (2011)
Table 4 - Generated Arabizi SoundEX Code for Different Orthographic Forms, adopted from Abed AL-Aziz, et al. (2011)
Table 5 - Arabizi letters mapping to corresponding Arabic letters, adopted from Duwairi et al. (2016, pp. 129)
Figure 3 - The Range of Corpus Collection Marked Inside the Dark Line
Table 6 - A sample of the Lebanese Arabizi Corpus
Figure 4 - Distribution of rating count
Figure 5 - Distribution of Review Length Across The Arabizi Corpus
Figure 6 - Distribution of Review in Respect to Service Sector Providers: Private & Public
Figure 7 - The Top 30-page names that contain the highest review count
Figure 8 - Exaggeration in Arabizi texts
Table 7 - Code Mixing and Switching Phenomenon in Lebanese Arabizi Corpus
Table 8 - Examples of Question Arabizi Sentences
Table 9 - Opposite Representation of Negative Lexical Markers
Table 10 - Variations in Linguistics components of Arabizi words
Table 11 - Arabizi Texts Concatenation Phenomenon
Table 12 - Superlative and Comparative of Arabizi Adjectives
Table 13 - Arabizi Terms Variations in Writing
Figure 9 - Sentiment Classes Distributions in Arabizi Corpus
Figure 10 - Example of Text Review a7la 3alam/The best people Conversion to Lower Case
Table 14 - BoW: Terms Frequency of Three Sample Reviews
Table 15 - Inverse Document Frequency of Three Sample Reviews
Table 16 - TF*IDF of Three Sample Reviews
Figure 11 - Logistic (sigmoid) Function specifying the feature x of input xi and the target y sentiment class, adopted from Cramer (2002, p.3)
Table 17 - An Example illustrating Score (xi), Representing Value Coefficients of a Given Input (xi)
Figure 12 - An Example of OM using LR, adopted from (Guestrin & Fox, 2016a)
Figure 13 - Classifier Confidence in the Task of OM, adopted from (Guestrin & Fox, 2016b)
Table 18 - Example: Coefficient Maximization
Figure 14 - Data Representation in LR on OM Task
Figure 15 - Example: Coefficient values
Figure 16 - Example: Computation of Derivative Contribution to Coefficient wl
Figure 17 - SLCSAS Architecture
Figure 18 - Research Procedure to Approach SA
Table 19 - TF*IDF and BoW word level n-gram features
Table 20 - BoW Feature Representation Matrix
Table 21 - TF*IDF Feature Representation Matrix
Table 22 - Main Default LR settings
Table 23 - Learnt Coefficients of both LR Models with BoW and TF*IDF features
Table 24 - Pipeline Hyperparameters tuning architecture
Table 25 - Pipeline Hyperparameters tuning Best Model Architecture
Table 26 - Dictionary Categories For Terms in Positive & Negative Classes
Figure 19 - A Simple Example of Grammar Rule in the Delivery Category
Figure 20 - Complex Example of Grammar Rule in the Price Category of the Negative Class
Figure 21 - Semantic Map of both positive and negative classes with corresponding category
Table 27 - Computer Specification
Table 28 - Confusion Matrix in Binary Classification Task
Table 29 - Performance of BoW LR Model with Default Settings
Table 30 - Performance of TF*IDF LR Model with Default Settings
Figure 22 - The Receiver Operating Characteristics curves of BoW and TF*IDF LR Models with Default Settings
Table 31 - Confusion Matrix of BoW LR Model with Default Settings
Table 32 - Confusion Matrix of TF*IDF LR Model with Default Settings
Table 33 - Performance of BoW LR Model with Hyperparameter Tuning Settings
Table 34 - Performance of TF*IDF LR Model with Hyperparameter Tuning Settings
Figure 23 - The Receiver Operating Characteristics curves of BoW and TF*IDF LR Models with Hyperparameter Tuning Settings
Table 35 - Confusion Matrix of BoW LR Model with Hyperparameter Tuning Settings
Table 36 - Confusion Matrix of TF*IDF LR Model with Hyperparameter Tuning Settings
Figure 24 - Summary: The Receiver Operating Characteristics curves of First- and Second-ML Phases of LR
Table 37 - Semantic map Categories found in test data
Table 38 - Performance of SLCSAS Classifier on 274 text reviews in total
Figure 25 - SLCSAS FP Example
Figure 26 - SLCSAS FN Example
Figure 27 - SLCSAS TP Example
Figure 28 - SLCSAS TN Example
Table 39 - Confusion Matrix of the Results of SLCSAS
Table 40 - Performance Comparison Between ML and Lexicon-based classifiers
Because of the huge amount of data that users generate on the web, it is crucial to build a sentiment analysis tool for the Arabizi language system that recognizes the feelings associated with these data. We propose two approaches: one based on supervised machine learning (ML) using logistic regression (LR), and the other lexicon-based, using the Science of Language and Communication Semantic Analysis System (SLCSAS) tool. Both approaches were developed and tested on a corpus of 2635 reviews of public and private services, including hotels, restaurants, shops, governmental institutions (municipalities, universities, offices, etc.), and other categories, collected from the Facebook, Google, and Zomato platforms. Private-sector services account for 2501 reviews and are therefore overrepresented in the sample relative to the remaining 134 public-sector reviews.
First, the text reviews were preprocessed and filtered by 1) removing users' information, 2) transforming texts to lower case, 3) splitting the data into 80% training and 20% testing sets, and 4) removing reviews of the neutral class and encoding the remaining reviews as 0 (negative) or 1 (positive). Then, the features were represented through BoW and TF*IDF, enhanced with a word-level n-gram dictionary consisting mainly of unigrams, bigrams, and trigrams. For SLCSAS, a dictionary, grammar rules, and a semantic map were constructed. The ML models, on the other hand, were built in two phases. The first phase constructs two LR models with the default settings of the scikit-learn library. The second phase uses a pipeline to facilitate the hyperparameter tuning of two further LR classifiers. Finally, the results of the five classifiers are evaluated in terms of precision, recall, F1-score, confusion matrix, and receiver operating characteristic (ROC) curve.
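The preprocessing steps listed above can be sketched in plain Python. This is only a minimal illustration under stated assumptions, not the code used in the thesis: it assumes reviews arrive as (text, star rating) pairs on a 1-5 scale with 3 stars treated as neutral, it filters neutral reviews before splitting for brevity, and the names `NEUTRAL_RATING`, `preprocess`, and `split_train_test` are hypothetical.

```python
import random

NEUTRAL_RATING = 3  # assumption: on a 1-5 star scale, 3 stars counts as neutral


def preprocess(reviews):
    """Lower-case each text, drop neutral reviews, encode the rest as 0/1."""
    cleaned = []
    for text, rating in reviews:
        if rating == NEUTRAL_RATING:
            continue  # reviews of the neutral class are removed
        label = 1 if rating > NEUTRAL_RATING else 0  # 1 = positive, 0 = negative
        cleaned.append((text.lower(), label))
    return cleaned


def split_train_test(samples, test_ratio=0.2, seed=0):
    """Shuffle a copy of the samples and split it into training and testing sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (text, star rating) pairs; only "a7la 3alam" is taken from the thesis.
reviews = [
    ("A7la 3alam", 5),
    ("Review B", 1),
    ("Review C", 3),
    ("Review D", 4),
    ("Review E", 2),
    ("Review F", 5),
]

cleaned = preprocess(reviews)
train_set, test_set = split_train_test(cleaned)
print(len(cleaned), len(train_set), len(test_set))
```

A fixed seed keeps the 80/20 split reproducible, which matters when two classifiers are to be compared on the same testing set.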
The findings show that the two LR models trained on BoW and TF*IDF features with default settings achieved similar results, while the hyperparameter-tuned LR model trained on TF*IDF surpassed the one trained on BoW. Therefore, the best classifier is the hyperparameter-tuned TF*IDF LR model with word-level unigrams. In addition, the SLCSAS classifier achieved competitive results in comparison to the ML models, but with lower coverage of the test data. Overall, the ML models outperform the lexicon-based classifier.
Keywords: Sentiment Analysis, Natural Language Processing (NLP), Arabizi NLP, Classifier, Big Data, Machine Learning Classifier, Lexicon-based Classifier, Customer Review.
The abbreviations and acronyms used in this thesis are listed below:
Application Programming Interface (API)
Arabic Name Entity Recognition (ANER)
Area Under the Curve (AUC)
Artificial Intelligence (AI)
Automatic Language Processing (ALP)
Automatic Language Processing Advisory Committee (ALPAC)
Bag of Words (BoW)
Computational Linguistics (CL)
Computer-mediated Communication (CMC)
Deep Learning (DL)
Information Extraction (IE)
Information Retrieval (IR)
Logistic Regression (LR)
Machine Learning (ML)
Modern Standard Arabic (MSA)
Natural Language Processing (NLP)
Opinion Mining (OM)
Part of Speech Tagging (POST)
Receiver Operating Characteristics (ROC)
Regular Expressions (REGEX)
Science of Language and Communication Semantic Analysis System (SLCSAS)
Semantic Orientation (SO)
Sentiment Analysis (SA)
Term Frequency and Inverse Document Frequency (TF*IDF)
Traitement Automatique du Langue (TAL)
Transformational Rules (T-rules)
World Wide Web (WWW)
I have been attracted to the field of Natural Language Processing (NLP) since my third year of English language studies in the linguistics branch at the Lebanese University. I was one of hundreds of students learning to represent sentences as dependency trees on whiteboards, without the use of technology, in the structural linguistics course. Since then, I have been fascinated by automatic language processing (ALP)/traitement automatique du langue (TAL) and have searched extensively for opportunities in professional research.
Despite the programming barriers that lie behind intelligent applications, my enthusiasm for knowledge only grew. Every day, I learn more and more in this fast-moving technological world. In practice, Sentiment Analysis (SA) caught my attention as a topic for further research, because users' opinions in the virtual space take much time and effort to analyze by hand, whereas a machine can perform the analysis in the blink of an eye and so saves time. I set my mind on SA because I believe this task lies at the core of artificial intelligence (AI), and because it is highly important for predicting sentiment in relation to the future happiness and statistics of a nation based on the data available on the web. In other words, governments could know what whole nations would feel and think in the near and distant future through deep SA.
Accordingly, I am motivated to analyze texts in theoretical and applied studies on SA. Together with my research team, Dr. Moustafa Al-Hajj and Dr. Amani Sabra, I have researched SA and NLP applications in the Arabic language (Al Omari and Al-Hajj, 2019; Al Omari et al., 2019; Al Omari, Al-Hajj, and Sabra, 2019). Most importantly, we have reviewed lexicon-based, ML, and deep learning (DL) classifiers in a wide variety of NLP tasks and applications: SA, sentence categorization, part-of-speech tagging, language identification, named entity recognition, authorship attribution, word sense disambiguation, and text classification (Al Omari and Al-Hajj, 2019). We have identified recent challenges in Arabic language processing and proposed solutions: 1) solid training in the lexicon-based, ML, and DL approaches to NLP, 2) accessibility of research materials, and 3) increased funding for research development.
Furthermore, we proposed an LR model trained with term frequency and inverse document frequency (TF*IDF) for the sentiment classification of customer reviews in the Arabic language on the OCLAR (Opinion Corpus for Lebanese Arabic Reviews) dataset (Al Omari et al., 2019). The results of the experiment reflected the unbalanced representation of the OCLAR dataset; the classifier's confidence in sentiment prediction was therefore low. In total, the f-measure is 0.15% on the negative class and 0.94% on the positive class. In subsequent research, we structured a DL architecture combining convolutional neural networks (CNNs) and long short-term memory (LSTM) in line with the state-of-the-art literature (Al Omari, Al-Hajj, and Sabra, 2019). The architecture achieved state-of-the-art performance on three of the five benchmark datasets (Sub-AHS, ASTD, and OCLAR), but not on the other two (Main-AHS and Ar-Twitter). The model achieved the following accuracy on the five datasets: 0.881 on Main-AHS, 0.968 on Sub-AHS, 0.842 on Ar-Twitter, 0.7918 on ASTD, and 0.903 on OCLAR.
This study targets the Arabizi language system because only a few studies are dedicated to it. The research investigates the language components of Arabizi through manual analyses of service reviews.
First, this thesis extracts keywords from the text reviews in order to ease sentiment classification by categorizing significant words into classes, that is, the words that indicate the class of the text under study. These hand-crafted keywords are the core of the lexicon-based analyzer. The SLCSAS classifier is used to experiment with all the mappings between keywords across a variety of classes. The study presents a classification approach and methodology constructed to classify reviews in the Arabizi language system. It could also serve as a resource for the general evaluation of services in Lebanon.
Second, the research compares the lexicon-based approach to an ML model, which has not been addressed in the literature before. The research presents the first experiment with LR in classifying Arabizi texts. In addition, the scientific significance of this research lies in the hyperparameter tuning of the LR model with the goal of reaching optimal performance. Finally, this research leaves many questions and further problems for upcoming research studies.
I wish to express my sincere regards and thanks to Dr. Moustafa Al-Hajj and Dr. Amani Sabra, who have been a great research team and supervised me during the master's thesis on SA for the Arabic and Arabizi language systems in the Lebanese context. I am grateful for the help, encouragement, guidance, and support I received from both of them at all times.
Moreover, I wish to express my thanks and gratitude to Coursera's learning platform because it gave me the opportunity to explore knowledge beyond physical boundaries. I would also like to thank my brothers and my mother for the constant support and encouragement I have received throughout my journey.
Nowadays, the huge flow of unstructured (unlabeled) data, speculated to reach about forty thousand exabytes by early 2020 (Gantz and Reinsel, 2012), together with the presence of the World Wide Web (WWW), has attracted a large number of data mining researchers who aim to extract knowledge and other useful information in order to make sense of what people feel and think in the virtual space (Waters, 2010). For such big data analysis, Sentiment Analysis (SA), or opinion mining (OM), is a major means of analyzing and extracting opinions from sequences of texts in the form of reviews, discussions, and blogs (Pang and Lee, 2008). SA is a multidisciplinary research field that includes NLP, Computational Linguistics (CL), Information Retrieval (IR) or Extraction (IE), ML, DL, and Artificial Intelligence (AI) (Feldman, 2013). With regard to understanding and identifying emotion in Computer-mediated Communication (CMC), SA is practical and useful because it bridges the gap between machine understanding and human natural language: it gives the machine the ability to identify and grasp the sentiment information in written expressions within big data by classifying language and utterances into one of the predefined SA classes, for example positive, neutral, or negative (Duwairi et al., 2016).
One of the most popular and widely used social media applications in the Arab world is Facebook. Its user base shows a continuous increase, reaching about 116 and a half million in the Middle Eastern countries, and 360 thousand in Lebanon alone, at the beginning of 2018 (“Middle East Internet Statistics”, 2018). Accordingly, users generate a continuous flow of data on a daily basis, characterized as growing mountains of opinions: reviews, ratings, recommendations, and other useful information (Wright, 2009), especially on public and private services including food, education, hotels, resorts, products, shops, and restaurants (Agarwal et al., 2015). However, the generated big data pose various challenges when one attempts to process them automatically in NLP tasks for the sake of knowledge-making and decision-making, in terms of data size, language dialect, and the [...]
With the extensive amount of data available online, the style of writing can differ from one user to another, depending on the writing system used and the educational background. According to the survey of the Arab Social Media Report (2017), 30% of people in the Middle East review public and private services using Arabic characters and 26% use Latin characters, while another 15% use a combination of Arabic and English texts whenever they would like to express their thoughts. Hence, the mixing of more than one language system with Latin characters paves the way for popularizing new ways of writing on the web in CMC settings (Aboelezz, 2012). One of these writing styles is Arabizi, which is the target language system for automatic processing in the SA task of this research because of its high use and popularity in Lebanon.
Arabizi is “a slang term (slang vernacular, popular informal speech) describing a system of writing Arabic using English characters. This term comes from two words “arabi” (Arabic) and “Engliszi” (English). The actual word would be “3rabizi” if represented in its own system,” (Yaghan, 2008, p. 39). Confirming the use and existence of the Arabizi language system, a survey by the American University in Cairo (2011) on the use of Arabizi by 70 Arabic users on Facebook notes that more than 82% of respondents confirmed using Arabizi, 40% used it most of the time, another 20% always used it, and the 17% of respondents who did not use it said this was out of respect for the Quran and part of an effort to keep their Arabic identities away from the Westernized influence of Arabizi. Besides, Yasmine Dabbous argued in the article Saving Arabic (“Saving Arabic”, 2014), published on UN Arabic Language Day, that Arabizi was used in the past to facilitate communication between young users, whereas nowadays it is used for note-taking at universities in Lebanon, which marks a step forward in its development and a threat to the native Arabic language. Accordingly, Arabizi represents a newly developed language system that can convey any message just as the non-Latin Arabic script does, but within the boundaries of CMC settings.
This thesis focuses on developing and deploying an efficient OM model for automatically processing the Arabizi language system in the context of both public and private customer service providers in Lebanon. Service providers include restaurants, hotels, shopping centers, governmental institutions, etc. An Arabizi corpus of 2635 text reviews, which is essential for building the OM model, was gathered by crawling the pages of service providers on the Facebook, Google, and Zomato websites from April 4, 2018 to October 30, 2018. The research methodology is experimental, quantitative, and descriptive in nature, following both supervised ML and lexicon-based approaches, which were used to build the sentiment classifiers together with the features needed for training and testing them. On the one hand, the ML features include both Bag of Words (BoW) and Term Frequency and Inverse Document Frequency (TF*IDF), paired with word-level n-grams ranging over unigrams, bigrams, and trigrams. The ML approach is represented by LR, which suits binary classification problems such as ours, trained on 80% of the data and tested on the remaining 20%. In the first phase, two LR models with word-level n-grams (unigrams, bigrams, and trigrams) are trained with the default settings of the scikit-learn library, the first on BoW and the second on TF*IDF; in the second phase, two further LR models are trained with their hyperparameters tuned in an attempt to reach optimal performance. In ML, the language analysis is automated through the BoW and TF*IDF feature representations of the text reviews. On the other hand, the rule-based classifier includes the semantic map categories of delivery, customer service, price, recommendation & suggestion, administration, service, product, market, ambiance, and overall.
In addition, the dictionary and grammar parser are harvested from 25% of the training set through deep manual language analysis in order to structure the SLCSAS classifier. In the experiment, it is tested on the same testing set that is used for the LR models.
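The BoW and TF*IDF representations with word-level n-grams described above can be sketched without any library. The following is a minimal illustration, assuming whitespace tokenization and the textbook weighting tf * log(N / df); library implementations such as scikit-learn apply smoothed and normalized variants, so their numbers would differ. The function names and the second sample review are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return all word-level n-grams of a text for n in [n_min, n_max]."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def bag_of_words(docs, n_min=1, n_max=3):
    """BoW: one term-frequency counter per document."""
    return [Counter(word_ngrams(doc, n_min, n_max)) for doc in docs]


def tf_idf(docs, n_min=1, n_max=3):
    """TF*IDF with the textbook weighting tf * log(N / df)."""
    counts = bag_of_words(docs, n_min, n_max)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc_counts in counts for term in doc_counts)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in doc_counts.items()}
        for doc_counts in counts
    ]


# "a7la 3alam" appears in the thesis; the second review is a hypothetical example.
docs = ["a7la 3alam", "a7la akel"]
bow = bag_of_words(docs)
weights = tf_idf(docs)
print(sorted(bow[0].items()))
print(weights[0])
```

In this toy corpus the unigram "a7la" occurs in every review and receives a TF*IDF weight of zero, while terms unique to one review keep a positive weight. This is the intuition behind preferring TF*IDF over raw counts when frequent words carry little discriminative information.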
Because no SA tools are available for automatically processing the Arabizi language system, building one is highly important. Such a tool would give the Arabizi language system recognition in the SA field as the number of internet users who use it, now and in the future, increases. In addition, it would help companies, institutions, and small businesses extract positive and negative sentiments much more efficiently from text reviews written in Arabizi; they could therefore reflect on enhancing the quality of the services they provide.
The main aim of this thesis is to give credit to the feelings and thoughts of Arabizi users in Lebanon by extracting sentiment knowledge from the texts in which they express positive or negative impressions. In addition, it is necessary to highlight the challenges that underlie this language system, for the public and particularly for researchers who wish to study it further. Moreover, it is crucial to distinguish Arabizi, particularly in the Lebanese context, so that this work becomes a starting point for other researchers to build on. Furthermore, this research tests machine capabilities on sentiment prediction and classification tasks for Lebanese Arabizi. The thesis also aims to build a dataset of reliable Arabizi reviews that can be used in further research and that other researchers could expand. In general, it is important to classify the large number of Arabizi sentences, which could be of great help to media offices, government centers, research facilities, and start-up businesses in knowledge-making and in future prediction tasks based on current data.
The research questions of the study are:
1. What would be the best approach that fits the SA task?
2. Which data preprocessing techniques are suitable and available for natural language texts?
3. Which data features most fit the trained ML models?
4. Which LR model would be the best for SA task?
To answer these research questions, experiments must be conducted using different ML models as well as the lexicon-based classifier, combined with diverse data preprocessing steps and feature extraction.
The hypotheses for the study are:
1. ML models could classify Lebanese Arabizi texts into binary classes: positive or negative.
2. A dataset could be constructed from social media sites, from the pages of public and private service providers.
3. Private-sector service providers exceed public (governmental) ones.
4. All experiments would show a very slight difference regarding the achieved results.
5. The ML implementation exceeds the rule-based classifier's performance.
To determine whether these hypotheses hold, the research experiments and data analysis are carried out in the following sections of this thesis.
The significance of this research lies in testing the effectiveness of applying a lexicon-based classifier as well as ML algorithms to SA for Lebanese Arabizi, in order to suggest improvements for further research studies. Several steps are followed to achieve this, as follows:
I. Investigating the most effective and widely used data preprocessing and feature methods for sentiment classification.
II. Deciding which ML and rule-based techniques to use for achieving the optimal result.
III. Using available computational resources for building SA tools.
IV. Analyzing the performance of the applied rule-based classifier as well as the ML algorithm with respect to feature selection.
V. Offering suggestions for further research improvement based on the performance of each ML model and of the lexicon-based one.
VI. Deploying the best classifier of the experiments as a real-time service application for Lebanese Arabizi sentiment classification.
Most importantly, the study's importance lies in the richness of the information that it provides to the public. As a researcher, I find the subject worth further study from multiple perspectives, with the purpose of building an outstanding automatic classifier for Lebanese Arabizi, so that researchers may arrive at a proper classifier that can handle sentiment extraction from any given Lebanese Arabizi text. Moreover, companies, shopping markets, and institutions aiming to improve the visibility of their services might consider integrating an SA model into their systems.
The research study has several limitations that restrict it to the area in which it was conducted. First, because the dataset consists strictly of texts written in the Arabizi script of the Lebanese dialect, the classification results would lose their generality and validity if the classifier were used to predict the sentiment of reviews written in a different dialect. For instance, the research tackles the problem of creating a tool for Lebanese Arabizi sentiment classification, but the tool could not achieve good results in predicting the sentiment of Arabizi written in the Egyptian dialect. Most importantly, the dataset used in this study is small in comparison to those of other well-established studies. This is due to the unavailability of any annotated dataset in Arabizi. Moreover, due to the shortage of time and the heavy pressure during the master's study, the collected corpus contains only a small number of text reviews.
We faced many challenges during the research process on SA for Lebanese Arabizi text reviews of both public and private service providers. First, we had been looking for reviews written in Arabic script, but while surveying social media websites, mainly Facebook, we found that reviews written in Arabizi script dominate alongside those in English. Therefore, we shifted the language choice to Arabizi because of its prominence and its importance for the SA task. Second, reviews were collected through manual extraction, by going through the pages one by one in search of reviews written in Arabizi script. An automatic extractor would have been preferable, but we did not find one. In addition, the Zomato application programming interface (API) restricts access to the top 10 reviews per page, which is quite frustrating. The Google Maps and Facebook APIs do not formally describe how to crawl for reviews in selected areas and on service providers' pages. Therefore, we were left with manual collection, which took about 15588 minutes (= 259.8 hours) to collect the 2635 text reviews. Third, analyzing 25% of the reviews in the training set in order to build the lexicon-based classifier was harder than expected, taking about two weeks of continuous work. Evaluating its results so that they could be compared with those of the ML LR classifier took a similar amount of time. Finally, the small corpus size of 2635 reviews, combined with the high variety of writing within them, was the biggest challenge in training classifiers to reach convincing and optimal performance.
This thesis on OM for the Arabizi language system contributes to the research literature in several ways. First of all, it contributes the collected corpus in the Arabizi language, which marks a starting point for future work. The corpus is also separated into distinct training and testing sets, which makes it a good candidate for testing future experiments. By focusing on the problems, challenges, and suggestions encountered and offered in this master's project, better performance on SA as well as on other tasks, such as part-of-speech tagging, becomes possible. Besides, an automatic ML processing tool would be deployed as a real-time service for processing the Arabizi language in SA. The manual for using the tool is presented in Appendix A. Accordingly, it would be a beneficial tool on which users as well as researchers can give their thoughts for further improvement. In addition, to the best of our knowledge, this is the first experiment with an LR model on OM for the Arabizi script in the Lebanese dialect and for Arabizi in general. Finally, this research project conducts the first experiment of its kind on hyperparameter tuning in SA for Arabizi reviews. All of the code and data used in this thesis are available online. The code and the dataset are described in Appendix B.
1.10.1 Sentiment Analysis: SA is also referred to as subjectivity analysis, opinion mining, or appraisal extraction, and is related to computer-assisted emotion recognition and expression (Pang and Lee, 2008). SA studies subjective elements, which are “linguistic expressions of private states in context,” (Wiebe et al., 2004). These sentiment elements may come from single words, phrases, sentences, or even whole documents, each of which is regarded as a single sentiment unit (Turney and Littman, 2003; Agrawal et al., 2003).
1.10.2 Natural Language Processing (NLP): NLP, sometimes referred to as CL, is the engineering and scientific discipline that guides the understanding of human languages from a computational perspective in order to build machine applications that can communicate and converse with humans through AI and thinking (Schubert, 2019).
1.10.3 Arabizi NLP: Arabizi is the informal use of the Arabic language in the form of Latin characters, represented in written form only (Yaghan, 2008). It is most often typed on mobile phones and computer keyboards and has spread significantly via social media websites (Basis Technology, 2012). NLP analyzes the Arabizi language system, with its wide variations in grammar, dialect, context, and spelling, to derive meaning and understanding based on the assigned task, for example SA, in which the machine is applied to detect the feelings associated with Arabizi natural language texts (Duwairi et al., 2016).
1.10.4 Classifier: To classify is “to classically organize things by classes, by categories, to classify according to a classification,” (Larousse, n.d.). More precisely, a classifier is “A mapping from unlabeled instances to (discrete) classes. Classifiers have a form (e.g., decision tree) plus an interpretation procedure (including how to handle unknowns, etc.). Some classifiers also provide probability estimates (scores), which can be thresholded to yield a discrete class decision thereby considering a utility function,” (Kohavi and Provost, 1998). In the SA task, classification is based on binary or multi-class prediction.
1.10.5 Big Data: “Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process the data with low-latency. And it has one or more of the following characteristics - high volume, high velocity, or high variety. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, web, and social media - much of it generated in real time and in a very large scale,” (“Big Data Analytics”, n.d.).
1.10.6 Machine Learning Classifier: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E,” (Mitchell, 1997, p. 2).
1.10.7 Lexicon-based Classifier: A lexicon-based classifier is an AI application that stores and manipulates data for knowledge extraction based on human-crafted rule sets, which encode human analysis and inference.
1.10.8 Customer Review: “A consumer's opinion and/or experience of a product, service or business. Reviews can be found on specialist websites and on the websites of many retailers, retail platforms, booking agents, and trusted trader schemes (schemes helping consumers to select a trader),” (Valant, 2015). More specifically, a review is a “critical evaluation of a text, event, object, or phenomenon. Reviews can consider books, articles, entire genres or fields of literature, architecture, art, fashion, restaurants, policies, exhibitions, performances, and many other forms [...]. While they vary in tone, subject, and style, they share some common features: First, a review gives the reader a concise summary of the content [...]. Second, and more importantly, a review offers a critical assessment of the content [...]. Finally, a review often suggests whether or not the audience would appreciate it,” (“Book Reviews”, n.d.).
The remaining sections of the thesis are organized as follows:
Section II reviews the literature on the main keywords in a fixed order: NLP; big data; SA and its applicable approaches (rule-based, ML, and hybrid); and the Arabizi language system and its relation to Lebanese culture. Finally, it presents the most relevant work on SA for the Arabic language and for Arabizi in particular.
Section III describes the Arabizi corpus used for sentiment classification and the challenges found while analyzing the text reviews. It also explains the data preprocessing and the selected features, and presents the theory behind the classifiers used in detail.
Section IV details the corpus and the data filtering and preprocessing techniques applied to the dataset, followed by feature extraction and the feeding of the features into the ML model to create the LR classifier. It then describes the lexicon-based classifier used, which depends on a handcrafted grammar, dictionary, and semantic map.
Section V presents the experiments conducted to evaluate the approach taken in this research. Section VI concludes the research and gives insight into future work.
Sentiment Analysis for Lebanese Arabizi Customers' Reviews
This section reviews the main research keywords, with emphasis on the literature background of each and on the latest advances in the field. The keywords are discussed in order, drawing out the relationships between them: NLP, big data, SA, approaches to SA, the Arabizi language system, and the Lebanese dialect.
NLP focuses on the interaction and communication between machines and human natural languages through the deep processing and analysis of natural language data. The field goes back to the 1950s, when Turing published his famous article Computing Machinery and Intelligence, known for the Turing test: a user C chats, in text only through a computer keyboard and screen, with two separate, isolated partners, one a machine A and the other a human B, as presented in Figure (1). For the machine to pass the test, the user must be unable to tell whether he/she is talking to a human or a machine (Saygin, Cicekli, and Akman, 2000). This test remains a challenge for AI advances in robotics, cloud computing systems, and every effort to create humanized robots with intelligence similar to that of humans. Accordingly, AI comes alive whenever a machine A can compete with a human B before the user C, as Turing claimed in his paper:
I believe that in about fifty years' time it will be possible, to programme computers, with a storage capacity of about 10^9, to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning. (Turing, 1950, p.8)
Recently, a chatbot posing as a talkative 13-year-old Ukrainian boy passed the imitation game in a competition held on the 60th anniversary of Turing's death, convincing 33 percent of the judges that it was a real boy (Williams, 2014).
Figure 1 - Turing Test in which person C chats through text only with another person B and machine A, adapted from Bansal (2018, p.3)
[Figure not included in this preview]
Later in the 1950s, the Georgetown experiment demonstrated machine translation from Russian to English by translating sixty sentences (Hutchins, 2004). It was predicted that machine translation would reach its peak within five years, but the 1966 report of ALPAC (Automatic Language Processing Advisory Committee) marked a decline in research in the field (Hutchins and Hays, 2015). Until the 1980s, NLP remained a largely abandoned area of research because the approaches in use relied on simple language analyses and transformations (Noam, 1965). Language transformational rules (T-rules) are rules applied to change the syntactic structure (deep structure) of a sequence of symbols into a new syntactic structure, called the surface structure. These rules take four forms: deletion, insertion, permutation, and substitution.
The 1980s saw a transition from systems based on hand-written rules to statistical algorithms for NLP, such as the decision-tree (D-Tree) algorithm and the hidden Markov model (HMM). These models were notably successful in NLP research, as they achieved proficient results on large corpora. At that time, however, the major limitation of statistical algorithms was their need for ever more annotated data to reach their highest performance (Johnson, 2009).
Recent advances in NLP research involve four main families of learning techniques: unsupervised, semi-supervised, supervised, and reinforcement learning, which are elaborated in 2.4.2 Machine Learning Approach below. In the 2010s, DL changed the course of NLP research with the promising state-of-the-art results it achieved in most tasks, including SA (Heikal, Torki, and El-Makky, 2018).
Big data matters greatly to organizations, institutions, and companies in every aspect of their work, from customer information to employee hiring to product advertising. It forms the building blocks of a successful and prosperous business; a company without big data would be unable to extract valuable knowledge, make strategic decisions, or understand the market forces of supply and demand (Elgendy and Elragal, 2014). Organizations must therefore organize big data in storage for later retrieval. Methods for storing structured data include the database, the data mart, and the data warehouse, populated by tools that extract data from external sources, transform it to fit operational needs, and finally load it into the database or data warehouse. The data is thus made available for data mining and analytical queries (Bakshi, 2012).
Data mining, in turn, works on finding correlations and pattern similarities within the stored big data (“databases”). It is also known as “knowledge discovery in data” (Bastien, 2018). SA is one of the flourishing applications of data mining. It focuses on understanding emotions in subjective texts through pattern correlation. It detects opinions and attitudes toward a certain topic, subject, or service for the purpose of extracting sentiment knowledge. SA takes NLP as its core for building dictionaries of sentiment indicators, which are used to draw relationships and connections among words in the data (Mouthami, Devi, and Bhaskaran, 2014).
SA has multiple analytic levels: document level (Farra et al., 2010), sentence level (Farra et al., 2010; Shoukry and Rafea, 2012a), aspect level (or feature level) (AL-Smadi et al., 2018), word level, and character level (Abdualla et al., 2013).
At the document level, sentiment is predicted for the whole document based on the subjectivity expressed within it. This method gives a useful overview of the subjectivity in each document, but it fails to give specific knowledge about a particular category, such as the food experience. At this level, sentiment is usually a binary classification problem (positive or negative). It is also commonly treated as a regression task in which sentiment is rated on a 5-star scale. Sentence-level classification digs deeper into the document: it treats each sentence as a separate case and predicts the sentiment of each subjective or objective sentence over multiple classes, including positive, negative, and neutral.
Next, sentiment classification at the word and character levels assigns each word in a sentence a score for the positive, negative, or neutral class; these scores are later summed to obtain the overall sentiment polarity of the sentence. In other words, SA at the word level treats each word in an entry as a subject for sentiment classification, and the sum of the polarities of all words in the entry gives the final sentiment prediction.
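The word-level summation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lexicon `TOY_LEXICON`, its English entries, and the function names are hypothetical and are not taken from the thesis corpus or tool.

```python
# Hypothetical toy lexicon: the entries and scores are illustrative only.
TOY_LEXICON = {
    "best": 1,
    "great": 1,
    "tasty": 1,
    "bad": -1,
    "slow": -1,
    "worst": -1,
}


def word_polarities(text, lexicon=TOY_LEXICON):
    """Return (token, score) pairs; words missing from the lexicon score 0 (neutral)."""
    tokens = [t.strip(".,!?;:\"'") for t in text.lower().split()]
    return [(t, lexicon.get(t, 0)) for t in tokens if t]


def sentence_polarity(text, lexicon=TOY_LEXICON):
    """Sum the word scores and map the total to a sentiment class."""
    total = sum(score for _, score in word_polarities(text, lexicon))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(sentence_polarity("Easily the best food in Beirut"))
    print(sentence_polarity("The service was slow and the food was bad."))
```

Note that a sentence with one positive and one negative word sums to zero and is reported as neutral, which is one known weakness of plain summation.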
Sentiment analysis at the aspect or feature level, on the other hand, analyzes a given text by looking at a specific category or feature that the text contains. In this approach, sentiment prediction is based on previously identified features. For example, a feature could be the “food experience” in a restaurant. Thus, the review “Easily the best food in Beirut” would receive a positive sentiment because of the indicator “best” for the food experience. This level gives good results, but the classifier is confident only when it meets already-known features; if it encounters a new text with an unrecognized feature, it loses its confidence. Accordingly, the sentiment classifier looks only at the features it was trained on and pays no attention to other, untrained features within a review. Moreover, aspects at this level are of two types (Pang, Lee, and Vaithyanathan, 2002):
First, an explicit aspect is easy to identify in a sentence, in the form of a phrase. For example, in “The food quality is great”, “food quality” is the extracted aspect, which is obvious to the annotator. Second, an implicit aspect is hard to identify and requires more reasoning about the sentence. For instance, in “The game is not working properly!”, the phrase “not working properly” is not itself the desired aspect; the aspect it implies is the game's functionality.
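As a rough illustration of explicit aspect-level analysis, the sketch below looks for known aspect words and scores the opinion words immediately next to them. The sets `ASPECTS` and `OPINIONS`, the `window` parameter, and the function name are all assumptions made for this example; the thesis does not specify such an implementation.

```python
# Hypothetical, illustrative resources (not from the thesis).
ASPECTS = {"food", "service", "price"}
OPINIONS = {"best": 1, "great": 1, "bad": -1, "slow": -1}


def aspect_sentiments(text, window=1):
    """Map each known aspect in the text to a sentiment class.

    Only opinion words within `window` tokens of the aspect are counted.
    Texts that mention no known aspect yield an empty result, which mirrors
    the limitation that unrecognized features are ignored.
    """
    tokens = [t.strip(".,!?;:\"'").lower() for t in text.split()]
    results = {}
    for i, token in enumerate(tokens):
        if token not in ASPECTS:
            continue
        nearby = tokens[max(0, i - window): i + window + 1]
        score = sum(OPINIONS.get(word, 0) for word in nearby)
        if score > 0:
            results[token] = "positive"
        elif score < 0:
            results[token] = "negative"
        else:
            results[token] = "neutral"
    return results


if __name__ == "__main__":
    print(aspect_sentiments("Easily the best food in Beirut"))
    print(aspect_sentiments("Great food but slow service"))
    print(aspect_sentiments("The game is not working properly!"))
```

The last call returns an empty dictionary: the implicit aspect (functionality) is never named in the sentence, so a matcher of this kind cannot find it.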
Various approaches are employed for the SA task, including lexicon-based, ML and DL, and hybrid approaches. The lexicon-based approach uses a dictionary to decide whether each term of the input text is included or excluded, while ML and DL use large datasets for model training and testing, after which the model can be used to predict the sentiment of a given text. The hybrid approach takes the output of the lexicon-based approach as input for ML, creating a mixed model. DL, in turn, allows significant hybrid architectures that combine many ML and DL models for the purpose of deep NLP.
Many applications today can be traced to Semantic Orientation (SO), which measures the subjectivity expressed in a document d as one of two poles, positive or negative. The SO approach is considered an unsupervised technique for building the sentiment lexicon by assigning each term t a class inferred from its semantic intensity in the given d (Shoukry and Rafea, 2012b). The classes of the terms are then computed together to formulate the sentiment of the given d (Morsy and Rafea, 2012). This approach is sometimes called the rule-based approach, which refers to handwritten linguistic rules using Regular Expressions (REGEX). However, such a system is time-consuming to build and takes a long time to adjust (Petasis et al., 2001).
There are two ways to construct a sentiment gazetteer or dictionary. First, one can extract terms manually from sentences and annotate them with their corresponding SO. Second, the dictionary can be built automatically by the term expansion method, in which a few seed keywords are used to attract other related terms. This idea rests on the assumption that terms co-occurring with one another have a close SO or sentiment polarity (+1 for positive, -1 for negative, or 0 for neutral). El-Beltagy and Ali (2013) provide a clear example illustrating this methodology. It starts by looking at the conjunction patterns of terms occurring in a sentence. For instance, the two terms in the Arabic phrase for “polite and respectable” would be regarded as sharing the same polarity to a certain extent, given that they are connected grammatically by “و” / “and”. After each expansion, the user of the program manually filters undesired terms out of the dictionary. These steps are repeated until the dictionary reaches its target word count.
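The seed-expansion procedure described above can be sketched as follows. This is a simplified illustration, not the cited authors' implementation: it uses English toy sentences, treats only the single word on each side of the conjunction as a term, and models the manual filtering step with a hypothetical `rejected` set supplied by the user.

```python
def expand_lexicon(seeds, sentences, conjunction="and",
                   rejected=frozenset(), max_rounds=5):
    """Grow a polarity lexicon from seed terms via 'X and Y' patterns.

    If one side of the conjunction is already in the lexicon, the other
    side inherits its polarity. Terms in `rejected` stand in for the
    manual filtering step and are never added.
    """
    lexicon = dict(seeds)
    for _ in range(max_rounds):
        added = {}
        for sentence in sentences:
            tokens = [t.strip(".,!?;:").lower() for t in sentence.split()]
            for i in range(1, len(tokens) - 1):
                if tokens[i] != conjunction:
                    continue
                left, right = tokens[i - 1], tokens[i + 1]
                for known, new in ((left, right), (right, left)):
                    if (known in lexicon and new
                            and new not in lexicon
                            and new not in rejected):
                        added[new] = lexicon[known]
        if not added:  # nothing new was found in this round: stop early
            break
        lexicon.update(added)
    return lexicon


if __name__ == "__main__":
    seeds = {"polite": 1, "rude": -1}
    sentences = [
        "The waiter was polite and respectable",
        "Staff were respectable and friendly",
        "The manager was rude and careless",
        "We ordered pizza and pasta",
    ]
    print(expand_lexicon(seeds, sentences))
```

In this toy run, "friendly" is only reached in the second round, after "respectable" has entered the lexicon, which is why the expansion is repeated. A real system would stop at a target dictionary size rather than a fixed number of rounds.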
The term ML was coined by Arthur Samuel in 1959. ML has statistics at its core, so that it can learn and construct knowledge from input data in order to predict an output (Kohavi and Provost, 1998). ML algorithms make data-driven decisions and predictions on input data (Bishop, 2006). ML is employed in a wide range of NLP tasks and applications, including SA, information extraction (IE), email filtering, summarization, text clustering, question-answering chatbots, and many others (Khurana et al., 2017). In implementation, the ML approach takes advantage of a wide range of algorithms and techniques for learning from data for later decision-making. Examples of ML algorithms include decision trees, LR, and support vector machines (Al Omari et al., 2019).
ML algorithms are implemented using four main techniques: supervised, unsupervised, semi-supervised, and reinforcement learning, as elaborated below.
Figure 2 - ML techniques, adapted from Wang, Chaovalitwongse, and Babuska (2012, p.2)
[Figure not included in this preview]
Supervised learning: “Techniques used to learn the relationship between independent attributes and a designated dependent attribute (the label),” (Kohavi and Provost, 1998). In this technique, ML algorithms learn decisions and predictions from inputs (x) and their desired outputs (y) so as to formulate a rule that maps inputs to outputs, as follows:
[Figure not included in this preview]
When the algorithm later encounters a new input, it uses what it has learned to predict its class. This technique resembles a teacher supervising the learning process of a student: the algorithm predicts the classes of the training data while being corrected by the teacher (Brownlee, 2016).
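To make the supervised setting concrete, the following sketch trains a small logistic regression (LR) classifier from labeled (x, y) pairs using plain gradient descent. It is a minimal pure-Python illustration and not the classifier built in the thesis: the toy English reviews, the binary bag-of-words features, and the hyperparameters `epochs` and `learning_rate` are assumptions chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def featurize(text, vocab):
    """Binary bag-of-words vector: 1.0 if the vocabulary word occurs in the text."""
    tokens = set(text.lower().split())
    return [1.0 if word in tokens else 0.0 for word in vocab]


def train(texts, labels, epochs=200, learning_rate=0.5):
    """Fit weights and bias by batch gradient descent on the log loss."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    rows = [featurize(t, vocab) for t in texts]
    weights = [0.0] * len(vocab)
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(vocab)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # the "correction" supplied by the known label
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return vocab, weights, bias


def predict_proba(text, model):
    """Estimated probability that the text belongs to the positive class."""
    vocab, weights, bias = model
    x = featurize(text, vocab)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def predict(text, model):
    return 1 if predict_proba(text, model) >= 0.5 else 0


# Hypothetical labeled reviews: 1 = positive, 0 = negative.
TRAIN_TEXTS = [
    "best food ever", "great service", "tasty and great",
    "worst food ever", "bad service", "slow and bad",
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]
MODEL = train(TRAIN_TEXTS, TRAIN_LABELS)

if __name__ == "__main__":
    for review in ("great food", "bad food"):
        print(review, "->", predict(review, MODEL))
```

Words that appear equally often in both classes (such as "food") end up with weights near zero, so the prediction for a new review is driven by the words that actually distinguished the labeled examples.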
Unsupervised learning: “Learning techniques that group instances without a pre-specified dependent attribute,” (Kohavi and Provost, 1998). In this technique, ML algorithms learn the pattern similarities among inputs (x) because no desired output (y) is available. This technique resembles independent learning, in which there is neither a teacher nor a supervisor to correct the student's work (the model's predictions). Such applications are used to discover the relationships that group data together (Brownlee, 2016).
Semi-supervised learning: This technique integrates supervised and unsupervised ML techniques for tasks in which only some of the data are labeled (y). It is a practical technique that saves time and money: it uses all of the labeled training data as an investment toward the best possible predictions on new, unlabeled data (Brownlee, 2016).
Reinforcement learning: To imitate how human beings learn from their errors and mistakes, reinforcement learning allows the machine to learn from errors committed in the prediction process so that it receives a reward the next time it meets a similar case (Wang, Chaovalitwongse, and Babuska, 2012). This technique could empower the machine to become an independent learner, like a human learning from past experience.
DL, as a branch of ML, has surpassed the baseline ML algorithms with sophisticated architectures and algorithms. A basic architecture consists of an input layer, an output layer, and a hidden layer in between. However, this basic architecture is not considered deep enough to be a DL architecture. A deep architecture may consist of more than 3 hidden layers with hundreds of processing units (neurons). DL architectures include, for example, recurrent neural networks (RNNs), long short-term memory (LSTM), and convolutional neural networks (CNNs) (LeCun, Bengio, and Hinton, 2015).
[...]
1 https://www.facebook.com
[...] complexity of linguistic form and nature (phonology, morphology, syntax, semantics, and pragmatics) (Elgendy and Elragal, 2014).
2 https://www.google.com/maps/@,34.0427069,35.6546808,8.7z
3 https://www.zomato.com/lebanon
4 https://scikit-learn.org/stable/modules/generated/sklearn.linear
5 https://sa-arabizi-lb.herokuapp.com
6 https://github.com/marwanalomari/Sentiment-Analysis-for-Lebanese-Arabizi-Customers-Reviews