On feature extraction for spam e-mail detection


Gunal S., ERGİN S., Gulmezoglu M. B., Gerek O. N.

MULTIMEDIA CONTENT REPRESENTATION, CLASSIFICATION AND SECURITY, vol.4105, pp.635-642, 2006 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 4105
  • Publication Date: 2006
  • Journal Name: MULTIMEDIA CONTENT REPRESENTATION, CLASSIFICATION AND SECURITY
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, EMBASE, MathSciNet, Philosopher's Index, zbMATH
  • Page Numbers: pp.635-642
  • Eskisehir Osmangazi University Affiliated: Yes

Abstract

Electronic mail is an important communication method for most computer users. Spam e-mails, however, consume bandwidth, fill up server storage, and waste the time of those who have to deal with them. The general way to label an e-mail as spam or non-spam is to define a finite set of discriminative features and use a classifier for detection. In most cases, the selection of such features is empirically verified. In this paper, two different methods are proposed to select the most discriminative features from a set of reasonably arbitrary features for spam e-mail detection. The selection methods are developed using the Common Vector Approach (CVA), which is a subspace-based pattern classifier. Experimental results indicate that the proposed feature selection methods considerably reduce the number of features without affecting recognition rates.
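
The general pipeline described in the abstract (score a pool of candidate features, keep the most discriminative ones, then classify) can be illustrated with a minimal sketch. This is not the CVA-based selection proposed in the paper, whose details the abstract does not give; a simple Fisher-style score and a nearest-centroid classifier are swapped in purely for illustration. The function names, the three toy features, and the data values are all hypothetical.

```python
from statistics import mean, pvariance

def fisher_scores(X, y):
    """Score each feature as (difference of class means)^2 / (sum of class variances).
    Stand-in criterion for illustration only, not the paper's CVA-based method."""
    scores = []
    for j in range(len(X[0])):
        spam = [row[j] for row, label in zip(X, y) if label == 1]
        ham = [row[j] for row, label in zip(X, y) if label == 0]
        between = (mean(spam) - mean(ham)) ** 2
        within = pvariance(spam) + pvariance(ham)
        scores.append(between / (within + 1e-12))  # small constant avoids division by zero
    return scores

def select_top_k(scores, k):
    """Return the indices of the k highest-scoring features, in ascending index order."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])

def fit_centroids(X, y, keep):
    """Compute one mean vector per class, using only the selected features."""
    centroids = {}
    for label in set(y):
        rows = [[row[j] for j in keep] for row, l in zip(X, y) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(centroids, row, keep):
    """Assign the label of the nearest class centroid (squared Euclidean distance)."""
    x = [row[j] for j in keep]
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(x, centroids[label])),
    )

# Hypothetical toy data: [frequency of "free", exclamation marks, uninformative value]
X = [
    [0.9, 5, 3], [0.8, 4, 7], [0.7, 6, 5],   # spam (label 1)
    [0.1, 1, 4], [0.2, 0, 6], [0.0, 1, 5],   # non-spam (label 0)
]
y = [1, 1, 1, 0, 0, 0]

keep = select_top_k(fisher_scores(X, y), k=2)
centroids = fit_centroids(X, y, keep)
label_a = predict(centroids, [0.85, 5, 5], keep)
label_b = predict(centroids, [0.05, 1, 5], keep)
```

In this toy set the third feature has the same mean in both classes, so it scores zero and is dropped, while classification on the two remaining features is unchanged. That is the kind of reduction the abstract reports, although the criterion used in the paper differs.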