scikit-learn : Spam comment filter using SVM
Well, though the title of this chapter is "Spam filter...", it may not be the spam filter you're expecting if you're thinking of email filtering with SVM. In this chapter, I'll show you a sort of spam filter sample if we agree on a definition of 'spam': unwanted text. We usually call it a spam comment.
We're not going to put much effort into refining the detection scheme; rather, we'll focus on SVM classification so that we can learn the basic usage of SVM.
So, in this chapter, I'll make my own data set with features and labels. Then, I'll train an SVM and test it on another set of inputs. We do not have many labels in this example, just two: good or bad.
I have samples of critiques for a web page's content. The majority of the audience is teenagers, and the site gives away points to whoever posts a comment. So, sometimes we get garbage like this:
I WuVS HIM XD HE SHALLL TAKE OVER THIS WORLD LIKE A BOSSY BOSS
Or:
It's so amaz, like, omg, it' bootiful. da cat ish a purdy collor and ish like o amaz. da desigguh ish purdy and da spech bibble ish purdy,
Even worse:
DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE MUCH SEXY DOGE DOGE DOGE DOGE DOGE DOGE
So, we need a system that can detect the bad critiques and remove them from the page automatically.
This is my initial code, and it uses a set of only five features.
Here are the steps:
- Set up the feature set: this is the most important step. We should figure out which features we need to filter out the garbage critiques. In this code, we have only 5 features: the number of characters, the number of unique characters, the ratio of unique to total characters, the total word count, and a flag for whether a single word dominates the comment.
- Read the two data files: good.txt and bad.txt.
- Make each file a list of strings (one string per comment). The pipe ('|') is used as the delimiter.
- Fill out the feature vector for each critique.
- Construct a list of feature vectors (features + label).
- Randomly shuffle the list so that good and bad samples are evenly distributed between the train and test sets.
- Make one group the train set and the other the test set:
XY_train = [[542, 34, 0.06273062730627306, 104, 0, 1], ....]
XY_test  = [[758, 49, 0.06464379947229551, 133, 0, 1], ....]
- Then, convert the two sets to NumPy arrays:
X_train, Y_train = make_np_array_XY(XY_train)
X_test, Y_test = make_np_array_XY(XY_test)
- The make_np_array_XY() call also breaks each set up into two arrays: the features X and the labels Y. X_train and Y_train look like this:
X_train = [[ 5.42000000e+02 3.40000000e+01 6.27306273e-02 1.04000000e+02 0.00000000e+00]....]
Y_train = [ 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1.]
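Under the hood, make_np_array_XY() (defined in the full listing below) separates the features from the label with simple NumPy slicing. A minimal sketch of the idea, using one of the sample rows above:

import numpy as np

a = np.array([[542, 34, 0.0627, 104, 0, 1]])  # last column is the label
x = a[:, 0:-1]    # every column except the last -> features
y = a[:, -1]      # the last column -> label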
- Then, we train the SVM:
# train set
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X_train, Y_train)
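We use the linear kernel here. SVC supports other kernels as well; for example, an RBF kernel could be swapped in like this if the data turned out not to be linearly separable (an untested variation with a made-up gamma value, not what this chapter uses):

svc_rbf = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X_train, Y_train)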
- Now that we have a trained svc object (sklearn.svm.classes.SVC), we're able to predict Y_predict from the test set X_test:
Y_predict = svc.predict(X_test)
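To make the workflow concrete, here is a sketch of how a brand-new incoming comment could be screened with the trained model. It assumes the feature helpers from the full listing below, and the sample string is made up:

s = 'DOGE DOGE DOGE DOGE MUCH SEXY'   # hypothetical new comment
cnt = words_counter_object(s)
x_new = np.array([[number_of_chars(s), unique_chars(s),
                   weighted_unique_chars(s), total_words(cnt),
                   is_repeated(cnt)]])
print svc.predict(x_new)   # [ 0.] -> bad, [ 1.] -> good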
- The next step is to compare the elements of the Y_predict list with those of Y_test.
- With this very small feature vector, we hit about the 75% mark. We can also use a metric that scikit-learn provides: sklearn.metrics.f1_score (a.k.a. balanced F-score or F-measure):
test_size = len(Y_test)
score = 0
for i in range(test_size):
    if Y_predict[i] == Y_test[i]:
        score += 1
print 'Got %s out of %s' %(score, test_size)

f1 = f1_score(Y_test, Y_predict, average='macro')
print 'f1 macro = %.2f' %(f1)
f1 = f1_score(Y_test, Y_predict, average='micro')
print 'f1 micro = %.2f' %(f1)
f1 = f1_score(Y_test, Y_predict, average='weighted')
print 'f1 weighted = %.2f' %(f1)
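For intuition on the three averaging modes: 'macro' is the unweighted mean of the per-class F1 scores, 'micro' computes a single F1 from the global true/false positive counts, and 'weighted' averages the per-class scores weighted by each class's support. A tiny standalone example (the toy labels are made up):

from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]
for avg in ('macro', 'micro', 'weighted'):
    print '%8s f1 = %.2f' % (avg, f1_score(y_true, y_pred, average=avg))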
- To predict better, we need more features as well as a bigger training data set; both come in the next section, where this code, which is admittedly not in good shape, will also be refined further.
from __future__ import division
import collections
import random
import numpy as np
from sklearn import svm
from sklearn.metrics import f1_score

def number_of_chars(s):
    return len(s)

def unique_chars(s):
    s2 = ''.join(set(s))
    return len(s2)

def weighted_unique_chars(s):
    return unique_chars(s)/number_of_chars(s)

def words_count(s):
    return collections.Counter(s)

def words_counter_object(s):
    cnt = collections.Counter()
    words = s.split()
    for w in words:
        cnt[w] += 1
    return cnt

def total_words(cnt):
    sum = 0
    for k in dict(cnt).keys():
        sum += int(cnt[k])
    return sum

def most_common(cnt, n):
    for k,v in cnt.most_common(n):
        #print "most common k = %s : v = %s" %(k,v)
        pass

def is_repeated(cnt):
    for k,v in cnt.most_common(1):
        freq = v/total_words(cnt)
        # print 'freq=',freq
        if freq > 0.5:
            return 1
    return 0

def make_feature_vector(critique, labels):
    " construct feature vector"
    feature_vector = []
    for i in range(len(critique)):
        s = critique[i]
        feature = []
        counter_obj = words_counter_object(s)
        feature.append(number_of_chars(s))
        feature.append(unique_chars(s))
        feature.append(weighted_unique_chars(s))
        feature.append(total_words(counter_obj))
        feature.append(is_repeated(counter_obj))
        feature.append(labels[i])
        feature_vector.append(feature)
    return feature_vector

def read_data():
    ''' read and make a list of critiques'''
    f = open('bad.txt', 'r')
    bad = f.read().split('|')
    f.close()
    f = open('good.txt', 'r')
    good = f.read().split('|')
    f.close()
    return bad+good, [0]*len(bad) + [1]*len(good)

def make_np_array_XY(xy):
    print "make_np_array_XY()"
    a = np.array(xy)
    x = a[:,0:-1]
    y = a[:,-1]
    return x,y

if __name__ == '__main__':
    critiques, labels = read_data()
    features_and_labels = make_feature_vector(critiques, labels)
    number_of_features = len(features_and_labels[0]) - 1

    random.shuffle(features_and_labels)

    # make train / test sets from the shuffled list
    cut = int(len(features_and_labels)/2)
    XY_train = features_and_labels[:cut]
    XY_test = features_and_labels[cut:]

    X_train, Y_train = make_np_array_XY(XY_train)
    X_test, Y_test = make_np_array_XY(XY_test)

    # train set
    C = 1.0  # SVM regularization parameter
    svc = svm.SVC(kernel='linear', C=C).fit(X_train, Y_train)
    print 'type(svc)=', type(svc)
    print 'svc=', svc

    print 'Y_test:\n', Y_test
    Y_predict = svc.predict(X_test)
    print 'Y_predict:\n', Y_predict

    # score
    test_size = len(Y_test)
    score = 0
    for i in range(test_size):
        if Y_predict[i] == Y_test[i]:
            score += 1
    print 'Got %s out of %s' %(score, test_size)

    # f1 score
    f1 = f1_score(Y_test, Y_predict, average='macro')
    print 'f1 macro = %.2f' %(f1)
    f1 = f1_score(Y_test, Y_predict, average='micro')
    print 'f1 micro = %.2f' %(f1)
    f1 = f1_score(Y_test, Y_predict, average='weighted')
    print 'f1 weighted = %.2f' %(f1)
Output:
...
svc= SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
    gamma=0.0, kernel='linear', max_iter=-1, probability=False,
    random_state=None, shrinking=True, tol=0.001, verbose=False)
Y_test:
[ 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.]
Y_predict:
[ 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1.]
Got 29 out of 39
f1 macro = 0.81
f1 micro = 0.81
f1 weighted = 0.81
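To get a feel for what make_feature_vector() actually produces, here is a sketch that runs the helpers from the listing above (including its __future__ division import) on a shortened, made-up version of the DOGE comment, with the label 0 appended by hand:

s = 'DOGE DOGE DOGE DOGE DOGE MUCH SEXY DOGE'
cnt = words_counter_object(s)
feature = [number_of_chars(s), unique_chars(s), weighted_unique_chars(s),
           total_words(cnt), is_repeated(cnt), 0]   # 0 = bad
print feature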
The code below (sv5.py) is almost identical to Code A (svtest.py) used in the previous section. The difference is that we're using linear_model.SGDClassifier() as the classifier, which is much faster. Also, this time we're using a bigger data set (goodCritiques.txt and badCritiques.txt).
# train set
C = 1.0  # SVM regularization parameter
#svc = svm.SVC(kernel='linear', C=C).fit(X_train, Y_train)
print "linear_model.SGDClassifier()..."
from sklearn import linear_model
svc = linear_model.SGDClassifier().fit(X_train, Y_train)
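One detail worth knowing: SGDClassifier's default loss is 'hinge', so it is still fitting a linear SVM, just via stochastic gradient descent instead of the exact solver. Spelled out explicitly:

from sklearn import linear_model

# loss='hinge' is the default, i.e. a linear SVM trained with SGD
svc = linear_model.SGDClassifier(loss='hinge').fit(X_train, Y_train)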
The ratio of the sizes of the two sets (train vs. test) has been changed from 1:1 to 9:1; 4:1 or 9:1 are the usual choices.
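As an aside, scikit-learn ships its own splitter that shuffles and splits in one call. A minimal sketch, assuming X and Y are the full feature matrix and label vector (the import path is sklearn.model_selection in recent releases, sklearn.cross_validation in old ones):

from sklearn.model_selection import train_test_split

# 9:1 train/test split, shuffled by default
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)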
The code looks like this:
from __future__ import division
import collections
import random
import numpy as np
from sklearn import svm
from sklearn.metrics import f1_score

def number_of_chars(s):
    return len(s)

def unique_chars(s):
    s2 = ''.join(set(s))
    return len(s2)

def weighted_unique_chars(s):
    return unique_chars(s)/number_of_chars(s)

def words_count(s):
    return collections.Counter(s)

def words_counter_object(s):
    cnt = collections.Counter()
    words = s.split()
    for w in words:
        cnt[w] += 1
    return cnt

def total_words(cnt):
    sum = 0
    for k in dict(cnt).keys():
        sum += int(cnt[k])
    return sum

def most_common(cnt, n):
    for k,v in cnt.most_common(n):
        #print "most common k = %s : v = %s" %(k,v)
        pass

def is_repeated(cnt):
    for k,v in cnt.most_common(1):
        freq = v/total_words(cnt)
        # print 'freq=',freq
        if freq > 0.5:
            return 1
    return 0

def make_feature_vector(critique, labels):
    " construct feature vector"
    feature_vector = []
    for i in range(len(critique)):
        s = critique[i]
        feature = []
        counter_obj = words_counter_object(s)
        feature.append(number_of_chars(s))
        feature.append(unique_chars(s))
        feature.append(weighted_unique_chars(s))
        feature.append(total_words(counter_obj))
        feature.append(is_repeated(counter_obj))
        feature.append(labels[i])
        feature_vector.append(feature)
    return feature_vector

def read_data():
    ''' reads data files and returns lists of comments and labels'''
    f = open('badCritiques.txt', 'r')
    #f = open('bad.txt', 'r')
    bad = f.read().split('|')
    f.close()
    f = open('goodCritiques.txt', 'r')
    #f = open('good.txt', 'r')
    good = f.read().split('|')
    f.close()
    return bad+good, [0]*len(bad) + [1]*len(good)

def make_np_array_XY(xy):
    """ takes XY (features + label) lists, then makes np arrays for X, Y """
    a = np.array(xy)
    x = a[:,0:-1]
    y = a[:,-1]
    return x,y

def get_f1_score(Y_test, Y_predict):
    test_size = len(Y_test)
    score = 0
    for i in range(test_size):
        if Y_predict[i] == Y_test[i]:
            score += 1
    print 'Got %s out of %s' %(score, test_size)
    print 'f1 macro = %.2f' %(f1_score(Y_test, Y_predict, average='macro'))
    print 'f1 micro = %.2f' %(f1_score(Y_test, Y_predict, average='micro'))
    print 'f1 weighted = %.2f' %(f1_score(Y_test, Y_predict, average='weighted'))

if __name__ == '__main__':
    critiques, labels = read_data()
    features_and_labels = make_feature_vector(critiques, labels)
    number_of_features = len(features_and_labels[0]) - 1

    # shuffle to mix good and bad ones
    random.shuffle(features_and_labels)

    # make train / test sets from the shuffled list
    cut = int(len(features_and_labels)*0.9)
    XY_train = features_and_labels[:cut]
    XY_test = features_and_labels[cut:]

    X_train, Y_train = make_np_array_XY(XY_train)
    X_test, Y_test = make_np_array_XY(XY_test)
    print 'len(X_test) = %s len(Y_test) = %s' %(len(X_test), len(Y_test))

    # train set
    C = 1.0  # SVM regularization parameter
    #svc = svm.SVC(kernel='linear', C=C).fit(X_train, Y_train)
    print "linear_model.SGDClassifier()..."
    from sklearn import linear_model
    svc = linear_model.SGDClassifier().fit(X_train, Y_train)

    print "svc.predict()..."
    Y_predict = svc.predict(X_test)
    print 'Y_predict:\n', Y_predict
    print 'Y_test: \n', Y_test

    # get f1 score
    get_f1_score(Y_test, Y_predict)
Output:
len(X_test) = 115 len(Y_test) = 115
linear_model.SGDClassifier()...
svc.predict()...
Y_predict:
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Y_test:
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
Got 100 out of 115
f1 macro = 0.93
f1 micro = 0.93
f1 weighted = 0.93
It appears to work fine, and we got better f1 scores. However, if we look carefully at the Y_predict values, they are all 1s. If we run it again, we may get all 0s. What does this mean?
The SVM is not really learning; it is just predicting a single class for every sample, primarily because the features don't carry enough information. So, we need more features.
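Before adding features, it is worth checking what an "always predict the majority class" baseline would score; with a test set that is mostly 1s, an all-1s prediction already looks deceptively good. A sketch:

import numpy as np

majority = 1.0 if np.mean(Y_train) >= 0.5 else 0.0
baseline = np.mean(Y_test == majority)
print 'majority-class baseline accuracy = %.2f' % baseline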
As discussed in the previous section, we need more features. I added two new features to the feature vector:
def repeated_count_top_3(cnt):
    "returns the average count of the most common 3 words"
    freq = 0
    for k,v in cnt.most_common(3):
        freq += v
    return freq/3

def longest(s):
    "returns the length of the longest word"
    mylist = s.split()
    if len(mylist) == 0:
        return 0
    return len(max(mylist, key=len))
The first one returns the average count of the three most common words (their total frequency divided by 3), and the second one returns the length of the longest word in a comment.
We also need to modify the following function to append the two new features to the feature vector:
def make_feature_vector(critique, labels):
    ...
    feature.append(repeated_count_top_3(counter_obj))
    feature.append(longest(s))
    ...
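A quick sanity check of the two new features on a shortened, made-up version of the DOGE comment, using the helpers defined above:

s = 'DOGE DOGE DOGE DOGE MUCH SEXY'
cnt = words_counter_object(s)
print repeated_count_top_3(cnt)   # (4 + 1 + 1) / 3 = 2.0 with true division
print longest(s)                  # 'DOGE' -> 4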
With the new code (sv6.py), we get the following output:
len(X_test) = 229 len(Y_test) = 229
linear_model.SGDClassifier()...
svc.predict()...
Y_predict:
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
Y_test:
[ 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1.]
Got 67 out of 229
f1 macro = 0.29
f1 micro = 0.29
f1 weighted = 0.29
Note that the Y_predict array has a mix of 1s and 0s now! The scores for this particular run are lower, but the classifier is no longer stuck predicting a single class.
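To see exactly how the predictions break down per class, scikit-learn's confusion matrix is handy. A short sketch:

from sklearn.metrics import confusion_matrix

# rows are the true classes (0, 1), columns the predicted classes (0, 1)
print confusion_matrix(Y_test, Y_predict)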