TY - GEN
T1 - Word Level Language Identification of Code Mixing Text in Social Media using NLP
AU - Shanmugalingam, Kasthuri
AU - Sumathipala, Sagara
AU - Premachandra, Chinthaka
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/12
Y1 - 2018/12
N2 - Understanding social media content has been a primary research topic since the dawn of social networking. A particular challenge is the contextual understanding of noisy text, which is characterized by a high percentage of spelling mistakes along with creative spelling, phonetic typing, wordplay, abbreviations, and meta tags. Processing such data therefore demands a more complex system than traditional natural language processors. In addition, people readily mix two or more languages to express their thoughts on social media. Automatic language identification at the word level thus becomes a necessary part of analyzing noisy social media content, and it would help with the automated analysis of content generated on social media. This study uses Tamil-English code-mixed data from popular social media posts and comments and provides word-level language tags using Natural Language Processing (NLP) and modern Machine Learning (ML) technologies. The methodology is a novel approach implemented as a machine learning classifier based on features such as Tamil Unicode characters in Roman script, dictionaries, double consonants, and term frequency. Several machine learning classifiers, namely Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Decision Trees, and Random Forest, were used in training and testing. Among them, the SVM classifier obtained the highest accuracy, 89.46%.
AB - Understanding social media content has been a primary research topic since the dawn of social networking. A particular challenge is the contextual understanding of noisy text, which is characterized by a high percentage of spelling mistakes along with creative spelling, phonetic typing, wordplay, abbreviations, and meta tags. Processing such data therefore demands a more complex system than traditional natural language processors. In addition, people readily mix two or more languages to express their thoughts on social media. Automatic language identification at the word level thus becomes a necessary part of analyzing noisy social media content, and it would help with the automated analysis of content generated on social media. This study uses Tamil-English code-mixed data from popular social media posts and comments and provides word-level language tags using Natural Language Processing (NLP) and modern Machine Learning (ML) technologies. The methodology is a novel approach implemented as a machine learning classifier based on features such as Tamil Unicode characters in Roman script, dictionaries, double consonants, and term frequency. Several machine learning classifiers, namely Naive Bayes, Logistic Regression, Support Vector Machines (SVM), Decision Trees, and Random Forest, were used in training and testing. Among them, the SVM classifier obtained the highest accuracy, 89.46%.
KW - Code-mixing
KW - NLP
KW - language identification
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85068474646&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068474646&partnerID=8YFLogxK
U2 - 10.1109/ICITR.2018.8736127
DO - 10.1109/ICITR.2018.8736127
M3 - Conference contribution
AN - SCOPUS:85068474646
T3 - 2018 3rd International Conference on Information Technology Research, ICITR 2018
BT - 2018 3rd International Conference on Information Technology Research, ICITR 2018
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd International Conference on Information Technology Research, ICITR 2018
Y2 - 5 December 2018 through 7 December 2018
ER -