Sentiment Analysis of Bank of Khartoum Customers’ Comments Using a Decision Tree Classifier

Hind Hamza Fadul Modwey*1, Dr. Eltyeb Elsamani Abd Elgabar Elsamani1

1: IT Department, Faculty of Computer Science and Information Technology, Al-Neelain University

* Email : Hind7488@gmail.com

HNSJ, 2022, 3(10); https://doi.org/10.53796/hnsj31028


Published at 01/10/2022 Accepted at 22/09/2022

Abstract

Today the banking sector has become one of the most technologically advanced sectors in the world. Banks use technology to speed up daily operations and to create new tools that allow their customers to make autonomous financial decisions faster, more easily, and more safely [1]. Automated sentiment analysis, also known as “opinion mining”, is used primarily to analyze customers’ opinions: it determines whether the opinion expressed in a document or sentence is positive, negative, or neutral [2].

Scientific research on sentiment analysis in Arabic is currently very limited. While there are many applications of sentiment analysis for English, progress for Arabic remains slow [3]. Arabs of different nationalities use their own Arabic dialects in daily life, especially on social media [4]. Although Arabic is increasingly one of the most used languages on the Internet, few studies have so far focused on Arabic sentiment analysis [5]. Sentiment analysis in Arabic faces many challenges, such as grammar, diacritics, the variety of Arabic dialects, and the different forms that words take [6]. Colloquial Arabic is the popular dialect in which ordinary people communicate; it is loose and does not follow fixed rules [7].

This paper addresses the problem of classifying the comments of Sudanese Bank of Khartoum customers on the Google Play Store using machine learning. A Decision Tree (DT) classifier was used to classify comments by polarity as positive, negative, or neutral. The work was evaluated using four measures. The results showed that using DT with lemmatization libraries improves the accuracy of sentiment classification; DT achieved a high accuracy of 91%.

INTRODUCTION:

Sentiment analysis is an NLP technique applied to a text to determine whether the author’s attitude toward a particular topic, product, etc. is positive, negative, or neutral [8].

It is also called opinion mining, a field of study that analyzes people’s opinions, feelings, and emotions toward entities such as products, services, institutions, individuals, issues, events, topics, and their attributes [9].

As a natural language processing (NLP) approach, it identifies the emotional tone behind a body of text. It is a common way for organizations to identify and categorize opinions about a product, service, or idea, and it uses data mining, machine learning, and artificial intelligence to mine text for emotions [10].

The Arabic used on social media is usually a mixture of Modern Standard Arabic and one or more Arabic dialects [11]. Arabic is one of the main languages spoken in the world and has survived for thousands of years. Despite the division and disintegration that the Arab nation has faced, from which new dialects emerged, each region has retained its own dialect. Despite these differences, Arabs have preserved their mother tongue by reading the Qur’an and by relying on it in school curricula, sermons, articles, poetry, and prose, although colloquial dialects are used in daily life [12].

Sudanese customers express their opinions about banking services on various platforms in their colloquial dialect. Their texts are often broken and inaccurate: they contain many abbreviations and spelling errors and do not follow Arabic grammar. This makes analyzing these opinions and extracting emotional information from them a great challenge for banks.

RELATED WORK:

A 2018 study analyzed mobile banking application sentiment using the Naïve Bayes classifier and a rating scale from 1 to 5 (five stars). The data was collected from the Google Play app store, and a confusion matrix was used to evaluate the classification. The results show that the classification rate of the Naïve Bayes classifier was 89.41%, and the 1-to-5 scale results indicate that, out of 1701 reviews, 278 were rated positive and 1432 negative [13].

Michael Adu Kwarteng and others (2020) analyzed the sentiment of tweets relating to UniCredit’s social media posts in Europe to find out opinions about its online services. A total of 953 English tweets were collected using the Twitter API. Of these, 499 were rated positive, 37 negative, and the remaining 417 neutral. The authors take the predominance of positive sentiment as an indication of customer satisfaction with the bank’s services [14].

The Proposed Method:

Data labeling

As a first step, we collected around 1,000 comments by Bank of Khartoum (Sudan) customers from the Google Play Store, and language experts manually categorized them as Positive, Negative, or Neutral.

Of these, 235 were negative, 739 positive, and 26 neutral.

Text pre-processing:

After preparing the data, we load it into Jupyter and apply a number of natural language processing (NLP) techniques to it, using Python and the Natural Language Toolkit (NLTK) library.

Data cleaning

In the first processing step, the researcher removed spelling inconsistencies by unifying Arabic letters, and removed symbols (for example @ and &), English letters, and numbers. The researcher also noted that commenters repeat words or letters to emphasize their feelings; these repetitions were deleted and the text restored to its normal form. For example, ”بطييييييييئ” was changed to ”بطيئ”, ”التطبيق زفتتتتتتتت!!!” was changed to “التطبيق زفت”, and “تطبيق بنكك ممتاز ممتاز للغاية” was changed to “تطبيق بنكك ممتاز لغاية”. Useless data, such as URLs, addresses, and names, was also removed.
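The cleaning rules described above can be sketched with Python's standard `re` module. This is a minimal illustration, not the study's code: the function name `clean_comment`, the exact regular expressions, the order of the rules, and the "three or more repeats" threshold are all assumptions made for the example.

```python
import re

def clean_comment(text):
    """Illustrative cleaning pass; rules and their order are assumptions."""
    # Remove URLs first so their fragments do not leak into the text.
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    # Keep Arabic letters and diacritics (U+0621..U+0652); symbols such as
    # @ and &, English letters, digits and punctuation become spaces.
    text = re.sub(r"[^\u0621-\u0652\s]", " ", text)
    # Collapse a character repeated three or more times in a row to one.
    text = re.sub(r"(\S)\1{2,}", r"\1", text)
    # Drop a word that immediately repeats the previous word.
    words = text.split()
    kept = [w for i, w in enumerate(words) if i == 0 or w != words[i - 1]]
    return " ".join(kept)
```

With these rules, a comment with a stretched letter and trailing exclamation marks is reduced to its plain form, and an immediately repeated word is kept once. A threshold of three repeats leaves ordinary doubled letters untouched; a stricter threshold would be a different design choice.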

Remove Arabic stop words

Many words are used so frequently that they carry little meaning, such as the pronoun ‘she’ and the conjunction ‘and’. They are usually a form of linguistic noise, and removing them helps the learner focus on the rest of the sentence. The researcher therefore removed unneeded words and phrases by filtering out Sudanese stop words. For example, in the comments “صباح الخير خدماتكم تعبانة ردو علينا على الاقل” and “بنكك لو ظبطو الشبكة تاني ما فاضي ليهو لي تطبيق بس انتو اجتهدو”, the stop words include “لو”, “ما”, “بس”, and “لي”.

Text Tokenization:

The text is divided into words. For example, (تطبيق جميل وساهل وسريع اوصي به) becomes, after tokenization: “تطبيق”, “جميل”, “وساهل”, “وسريع”, “اوصي”, “به”.
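Tokenization and stop-word filtering can be sketched together, since filtering operates on tokens. The example below uses only the standard library; NLTK tokenizers could be substituted. The names `tokenize`, `remove_stop_words`, and `SUDANESE_STOP_WORDS` are hypothetical, and the stop-word set holds only the four words named in the text, so a real list would presumably be much longer.

```python
import re

# Only the four stop words named in the text; a real list would be longer.
SUDANESE_STOP_WORDS = {"لو", "ما", "بس", "لي"}

def tokenize(text):
    # Split on whitespace and on Latin or Arabic commas; drop empty pieces.
    return [t for t in re.split(r"[\s,،]+", text) if t]

def remove_stop_words(tokens):
    # Exact-match filtering: a token is dropped only if it is in the set.
    return [t for t in tokens if t not in SUDANESE_STOP_WORDS]

tokens = tokenize("بنكك لو ظبطو الشبكة تاني ما فاضي ليهو لي تطبيق بس انتو اجتهدو")
filtered = remove_stop_words(tokens)
```

Because matching is exact, a longer word that merely begins with a stop word (such as “ليهو”) is kept.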

Normalization:

As Nabil and others show [15], the text in the collected data presents many challenges compared with formally structured text. It contains unstructured language, many spelling errors, colloquial words and expressions, acronyms, and idiomatic expressions, and most of them are inconsistent. Moreover, some words have more than one written form, which highlights the need for normalization, i.e. the standardization of Arabic letters, which the researcher applied to all the data.
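A letter-standardization step of this kind is often written as a character mapping. The sketch below is an assumption about what such a mapping might contain, since the paper does not list the letters that were unified; the function name `normalize` and the specific substitutions (alef variants, alef maqsura, teh marbuta, diacritics, tatweel) are illustrative only.

```python
import re

# Diacritics (U+064B..U+0652) and the tatweel elongation mark (U+0640).
_MARKS = re.compile(r"[\u064B-\u0652\u0640]")

# Hypothetical unification table (not taken from the paper).
_LETTERS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
})

def normalize(text):
    # Strip marks first, then map letter variants to a single form.
    return _MARKS.sub("", text).translate(_LETTERS)
```

Under this mapping, two spellings that differ only in a final teh marbuta versus heh normalize to the same string, which is the kind of variation normalization is meant to remove.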

TABLE 1: EXAMPLE OF PRE-PROCESSING A COMMENT:

| Preprocessing Step | Comment After Preprocessing |
|---|---|
| The original comment | @ fawryتطبيق فوري سيئ سيئ وزبالة وتكلفة التحويل غااااااالية! كماااان! |
| Data cleaning | تطبيق فوري سيئ سيئ وزبالة وتكلفة التحويل غااااااالية! كماااان! |
| Normalization | تطبيق فوري سيئ سيئ وزبالة وتكلفة التحويل غااااااالية كماااان |
| Removing duplicated words or characters | تطبيق فوري سيئ وزبالة وتكلفة التحويل غالية كمان |
| Tokenization | تطبيق ، فوري ، سيئ ، وزبالة ، وتكلفة ، التحويل ،غالية ، كمان |
| Stop word removal | تطبيق ، فوري ، سيئ ، وزبالة ، وتكلفة ، التحويل ،غالية |
| Stemming | تطبيق فوري سيئ وزبالة وتكلفة التحويل غالي |

The researcher also applied variations of other pre-processing steps in order to study their effect on the accuracy of the classification results; they had a clear effect on classification accuracy. These steps can be one of the following:

Lemmatization:

The purpose of this step is to return each word [16] to its base form. This helps the computer because it reduces the vocabulary it has to learn; moreover, differences in word form can make it difficult for the computer to recognize words with similar meanings. There are two ways to return words to their base form. The researcher used the lemmatization-library method, in which words are mapped to the closest common word in meaning even if their infinitives differ, for example (Camel > camel, camel).

Sudanese speakers sometimes attach additional letters to a word, such as (and, may, where). All of this was handled by the polarity lexicon that the researcher prepared manually. The lexicon contains the base forms of negative and positive words that express feelings [17].

TABLE 2: LEMMATIZATION OF WORDS:

| Words | Lemma |
|---|---|
| طاشي, بطش, يطش, مطشش, وطاشي, طاشه | طش |
| بطء, بطيئة, بطيء | بطيئ |
| رافض, يرفض, ويرفض, برفض | رفض |
| اسواء, سيئة, سيئا, سئي, سيئه, سوء, وسيئ | سيئ |
| يعلق, بعلق, معلق, بيعلق, موعلق, ومعلق, معلقة | علق |

Note: The lemma is the base form of each word as found in the polarity lexicon.
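A lexicon-driven lemmatization of this kind can be illustrated as a dictionary lookup. The sketch below assumes the lexicon is applied as a direct variant-to-lemma mapping, which the paper does not state explicitly. It uses two rows of Table 2 as data, and the names `LEMMA_TABLE`, `VARIANT_TO_LEMMA`, and `lemmatize` are hypothetical.

```python
# Two rows of Table 2, written as lemma -> known surface forms.
LEMMA_TABLE = {
    "سيئ": ["اسواء", "سيئة", "سيئا", "سئي", "سيئه", "سوء", "وسيئ"],
    "علق": ["يعلق", "بعلق", "معلق", "بيعلق", "موعلق", "ومعلق", "معلقة"],
}

# Invert the table so each surface form points at its lemma.
VARIANT_TO_LEMMA = {
    variant: lemma
    for lemma, variants in LEMMA_TABLE.items()
    for variant in variants
}

def lemmatize(tokens):
    # Tokens that are not in the lexicon are passed through unchanged.
    return [VARIANT_TO_LEMMA.get(t, t) for t in tokens]
```

A lookup table handles irregular dialect forms that rule-based stemmers tend to miss, at the cost of having to list every variant by hand.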

Machine Learning Algorithms as a Tool:

The data is divided into a training set and a test set: 80% of the data for training and 20% for testing. The training set is used to train the Decision Tree (DT) classifier, which classifies the data by polarity into positive, negative, and neutral categories, while the test set is used to predict the polarity of comments. A decision tree is a predictive modeling method in machine learning and an effective one. It is sometimes called a classification and regression tree (CART) and is used to build classification or prediction models [18].
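To make the 80/20 split and the train/predict cycle concrete, here is a small plain-Python sketch of a CART-style tree that chooses splits by Gini impurity. The paper does not say which implementation or settings were used, so everything here (function names, the fixed random seed, `max_depth=5`, threshold splits on numeric features) is an assumption for illustration, not the study's classifier.

```python
import random
from collections import Counter

def train_test_split(rows, labels, test_ratio=0.2, seed=0):
    # Shuffle indices reproducibly, then cut off the last 20% as the test set.
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) - round(len(idx) * test_ratio)
    tr, te = idx[:cut], idx[cut:]
    return ([rows[i] for i in tr], [labels[i] for i in tr],
            [rows[i] for i in te], [labels[i] for i in te])

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(rows[0])):
        # Skip the largest value so both sides of the split are non-empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=5):
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority  # leaf: a class label
    split = best_split(rows, labels)
    if split is None:
        return majority  # no feature can separate the rows
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    # Internal nodes are tuples; leaves are plain class labels.
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree
```

In practice a library implementation would normally be used instead. With only 26 neutral comments out of 1,000, a stratified split would also keep class proportions stable, though the paper does not say whether this was done.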

Building a Matrix of Numbers:

In this step, a matrix is created in which each row represents one comment from the original comments file and each column represents one word/term found across all comments after all processing operations have been applied. The last column is the Class or Label, which is positive, negative, or neutral according to the manual categorization made in advance. The value at the intersection of a row and a column is determined by the polarity of the corresponding word. The researcher compared the comments file with the polarity glossary in order to filter out words that carry no sentiment and keep only those that do. The comments file was loaded programmatically and each word in it was compared with each word in the positive and negative glossary file. If the word is found in the glossary, it is encoded as 1, -1, or 2 according to its classification; otherwise it is not added.
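The matrix construction described above can be sketched as follows. The codes 1, -1, and 2 come from the text, but which code belongs to which class is not stated; the sketch assumes 1 for positive and -1 for negative. The function name `build_matrix` and the three-word sample lexicon are hypothetical.

```python
def build_matrix(comments, lexicon):
    """comments: list of token lists; lexicon: word -> polarity code."""
    # Columns are only the words that appear in the polarity lexicon.
    vocab = sorted({w for tokens in comments for w in tokens if w in lexicon})
    # Each cell holds the word's polarity code if the comment contains it, else 0.
    matrix = [[lexicon[t] if t in tokens else 0 for t in vocab]
              for tokens in comments]
    return vocab, matrix

# Assumed coding: 1 = positive, -1 = negative (the paper does not spell this out).
lexicon = {"زفت": -1, "سيئ": -1, "دسيس": 1}
comments = [["تطبيق", "فوري", "زفت"], ["دسيس", "شديد"], ["تطبيق", "سيئ", "خالص"]]
vocab, matrix = build_matrix(comments, lexicon)
```

Words without an entry in the lexicon, such as “تطبيق”, get no column at all, which matches the filtering described in the paragraph.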

TABLE 3: MATRIX OF NUMBERS:

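| Documents/term | تطبيق | فوري | زفت | زباله | شبكة | واقعة | دسيس | شديد | سيئ | خالص | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| تطبيق فوري زفت زباله شبكة واقعة | 0 | 0 | زفت/Negative | زبالة/Negative | 0 | واقعة/Negative | 0 | 0 | 0 | 0 | Negative |
| دسيس شديد | 0 | 0 | 0 | 0 | 0 | 0 | دسيس/Positive | 0 | 0 | 0 | Positive |
| تطبيق سيئ خالص | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | سيئ/Negative | 0 | Negative |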

EXPERIMENTATION

NLP with Python: In this study, the researcher uses NLTK (the Natural Language Toolkit), a Python module for natural language processing and one of the leading platforms for working with human language data in Python [19].

Classification Techniques: The researcher used one classification method: Decision Tree (DT).

Data collection: The data was collected from the Google Play Store, one of the most popular application stores. The researcher focused on comments written in the Sudanese dialect and collected 1,000 comments expressing the opinions of Bank of Khartoum customers about the Bankak application.

RESULT:

Four measures (Precision, Recall, Accuracy, and F-Measure) were used to assess how well the DT classifier rated the test comments as positive, negative, or neutral. The results of the experiments are given in Table 4:

TABLE 4: DT CLASSIFIER:

| Classifier | Precision (%) | Recall (%) | Accuracy (%) | F-Score (%) |
|---|---|---|---|---|
| DT | 76 | 93 | 91 | 84 |

Table 4 shows that the DT classifier achieved a good result for accuracy, equal to 91%.
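For reference, the four measures can be computed as in the sketch below, using the standard definitions. The paper does not say how precision and recall were averaged over the three classes, so the per-class form shown here is an assumption, and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    # Share of comments whose predicted label equals the manual label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, cls):
    # Count true positives, false positives and false negatives for one class.
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f_measure(precision, recall):
    # Harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

As a consistency check, the harmonic mean of the reported precision (0.76) and recall (0.93) is about 0.84, which agrees with the F-Score in Table 4.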

CONCLUSIONS:

The results of classifying the opinions of Bank of Khartoum customers written in the Sudanese dialect showed that the DT classifier gave a high accuracy of 91%. This indicates that the pre-processing steps proposed in this paper significantly improved the accuracy of sentiment classification. In addition, extending the feature space with features extracted from the polarity dictionary improves the sentiment classification results. The work was evaluated using four measures, as shown in Table 4, which confirmed the classification accuracy. For future work, we plan to extend our model, create graphical input/output interfaces, test the model on data from other Sudanese banks, and expand the Sudanese banking lexicon to improve the accuracy of sentiment classification of customer comments in the Sudanese dialect.

This study is of great value to Sudanese banks: it enables them to gauge how satisfied their customers are with the services provided. This supports competitive vigilance and sound decision-making, and thus helps banks improve their services and overall performance, develop continuously, and prevent service failures early.

References:

[1]. Muhannad Al-Jamais, 9/26/2021, The Impact of Artificial Intelligence on Banking Services, Xina AI website, https://www.xina.tech/blog-posts/ai-impact-on-banking-services

[2]. GeeksforGeeks, Twitter Sentiment Analysis using Python, 22 Jul. 2021,

https://www.geeksforgeeks.org/twitter-sentiment-analysis-using-python/

[3]. Shamrasy website, Sentence-Level Arabic Sentiment Analysis,

https://shamra-academia.com/show/5ff8323bbb5f3

[4]. Zhao, J., K. Liu, and L. Xu, Sentiment analysis: mining opinions, sentiments, and emotions. 2016, MIT Press.

[5]. ScienceDirect, 2020, A review of sentiment analysis research in Arabic language,

https://www.sciencedirect.com/science/article/abs/pii/S0167739X19311537

[6]. Arab Social Media Report, 2020-07-29, Wayback Machine.

http://www.socialmediatoday.com/social-networks/kadie-regan/2015-08-10/10-amazing-social-media-growth-stats-2015.

[7]. Is colloquial suitable for teaching?, 2020,

https://ar.islamway.net/article/81638/%D9%87%D9%84-%D8%AA%D8%B5%D9%84%D8%AD-%D8%A7%D9%84%D8%B9%D8%A7%D9%85%D9%8A%D8%A9-%D9%84%D9%84%D8%AA%D8%AF%D8%B1%D9%8A%D8%B3

[8]. Zhao, J., K. Liu, and L. Xu, Sentiment analysis: mining opinions, sentiments, and emotions. 2016, MIT Press.

[9]. Narayanan, R., B. Liu, and A. Choudhary. Sentiment analysis of conditional sentences. in Proceedings of the 2009 conference on empirical methods in natural language processing. 2009.

[10]. Feldman, R., Techniques and applications for sentiment analysis. Communications of the ACM, 2013. 56(4): p. 82-89.

[11]. Alwakid, G., T. Osman, and T. Hughes-Roberts, Challenges in sentiment analysis for Arabic social networks. Procedia Computer Science, 2017. 117: p. 89-100.

[12]. Liu, B., Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 2012. 5(1): p. 1-167.

[13]. Science Publishing Corporation, 2018, Mobile banking app sentiment analysis using the Naïve Bayes classifier,

https://www.sciencepubco.com/index.php/ijet/article/view/22998/11443

[14]. Raphael Kwaku Botchway and others, A review of social media posts from UniCredit bank in Europe: a sentiment analysis approach, Tomas Bata University in Zlin.

https://www.researchgate.net/publication/337580801_A_review_of_social_media_posts_from_UniCredit_bank_in_Europe_a_sentiment_analysis_approach

[15]. Nabil, M., Aly, M. and Atiya, A. (2015). Astd: Arabic Sentiment Tweets Dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September 2015. Association for Computational Linguistics

[16]. https://www.tutorialspoint.com/python_data_science/python_stemming_and_lemmatization.htm

[17]. Ghadah Alwakid, Taha Osman and Thomas Hughes-Roberts, Challenges in Sentiment Analysis for Arabic Social Networks, 2017, 5-6 November 2017, Dubai, United Arab Emirates

[18]. Smart Arabic, Iyad Abu Darwish, 2020

https://translate.google.com/?sl=ar&tl=en&text=%D8%A7%D9%8A%D8%A7%D8%AF%20%D8%A7%D8%A8%D9%88%20%D8%AF%D8%B1%D9%88%D9%8A%D8%B4&op=translate

[19]. Perkins, J., 2010. Python text processing with NLTK 2.0 cookbook. Packt Publishing Ltd.