The Investigation on the Effect of Feature Vector Dimension for Spam Email Detection with a New Framework


ERGİN S., IŞIK Ş.

9th Iberian Conference on Information Systems and Technologies (CISTI), Barcelona, Spain, 18 - 21 June 2014

  • Publication Type: Conference Paper / Full-Text Paper
  • Volume Number:
  • DOI: 10.1109/cisti.2014.6877092
  • City of Publication: Barcelona
  • Country of Publication: Spain
  • Affiliated with Eskişehir Osmangazi University: Yes

Abstract

In this study, the effect of feature vector dimension on the classification of Turkish e-mails as spam or legitimate is investigated. Although hundreds of experimental studies have been carried out, especially for English, which is a non-agglutinative language, the efforts for Turkish, one of the most widely used agglutinative languages in the world, can be counted on the fingers of one hand. Therefore, a solution to the Turkish spam e-mail problem is sought that takes the special characteristics of Turkish e-mails into consideration. The developed spam filtering framework has four components: morphological decomposition, feature selection, training, and testing. A fixed-prefix stemming approach is used to extract the features of an e-mail, and the Mutual Information (MI) method is then applied for feature selection. Decision Tree (DT) and Artificial Neural Network (ANN) classifiers are employed, and the recognition accuracies obtained with these methods are satisfactory. The highest accuracy rates are 91.08% for ANN and 87.67% for DT when the dimensions of the feature vectors are selected as (150x5) and (75x5), respectively.
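
The feature extraction and selection steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prefix length of 5, the whitespace tokenizer, the binary term-presence formulation of MI, and all function names (`fixed_prefix_stem`, `tokenize`, `mutual_information`, `select_features`) are assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter

PREFIX_LEN = 5  # assumed value; the abstract does not state the prefix length


def fixed_prefix_stem(word, n=PREFIX_LEN):
    """Fixed-prefix stemming: keep only the first n characters of a word."""
    return word.lower()[:n]


def tokenize(text):
    """Split on whitespace, keep alphabetic tokens, and stem each one."""
    return [fixed_prefix_stem(w) for w in text.split() if w.isalpha()]


def mutual_information(docs, labels, term):
    """MI (in bits) between binary presence of `term` and the class label."""
    n = len(docs)
    # Count how often each (term present?, label) pair occurs.
    joint = Counter((term in doc, lab) for doc, lab in zip(docs, labels))
    mi = 0.0
    for (present, lab), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (p, _), c in joint.items() if p == present) / n
        p_y = sum(c for (_, l), c in joint.items() if l == lab) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def select_features(docs, labels, k):
    """Return the k stems with the highest MI (ties broken alphabetically)."""
    vocab = sorted({t for doc in docs for t in doc})
    ranked = sorted(vocab, key=lambda t: (-mutual_information(docs, labels, t), t))
    return ranked[:k]


# Toy data, invented for this example only.
emails = ["kazandiniz hemen tiklayin", "toplanti yarin saat"]
labels = ["spam", "legitimate"]
docs = [tokenize(e) for e in emails]
features = select_features(docs, labels, k=3)
```

Truncating to a fixed prefix is a cheap way to approximate stems in an agglutinative language, where suffixes accumulate at the end of a word. The selected stems would then define the feature vector passed to a classifier such as a decision tree or a neural network. Note that Python's `str.lower()` does not apply Turkish casing rules (for example for `İ` and `I`), so real Turkish text would need locale-aware normalization before this step.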