scikit-learn : Spam comment filter using SVM
Well, though the title of this chapter is "Spam filter...", it may not be the spam filter you're expecting if you're thinking of email filtering with SVM. In this chapter, I'll show you a sort of spam filter sample if we agree on a definition of 'spam': unwanted text. We usually call it a spam comment.
We're not going to put much effort into refining the detection scheme; rather, we'll focus on SVM classification so that we can learn the basic usage of SVM.
So, in this chapter, I'll make my own data set with features and labels. Then, I'll train an SVM and test it on another set of inputs. We do not have many labels in this example, just two: good or bad.
I have samples of critiques for a web page's content. The majority of the audience is teenagers, and the site gives away points to whoever posts a comment. So, sometimes we get garbage like this:
I WuVS HIM XD HE SHALLL TAKE OVER THIS WORLD LIKE A BOSSY BOSS
Or:
It's so amaz, like, omg, it' bootiful. da cat ish a purdy collor and ish like o amaz. da desigguh ish purdy and da spech bibble ish purdy,
Even worse:
DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE DOGE MUCH SEXY DOGE DOGE DOGE DOGE DOGE DOGE
So, we need a system that can detect the bad critiques and remove them from the page automatically.
This is my initial code, and it uses a set of only five features.
Here are the steps:
- Set up the feature set: this is the most important step. We should figure out which features we need to filter out the garbage critiques. In this code, we have only 5 features: the number of characters, the number of unique characters, the ratio of unique to total characters, the total word count, and a flag for whether a single word dominates the comment.
- Read the two data files: good.txt and bad.txt.
- Make each file a list of strings (one string per comment). The pipe ('|') is used as the delimiter.
- Fill out the feature vector for each critique.
- Construct a list of feature vectors (features + label).
- Randomly shuffle the list so that good and bad samples are evenly distributed between the train and test sets.
- Make one group the train set and the other the test set:
XY_train = [[542, 34, 0.06273062730627306, 104, 0, 1], ....]
XY_test  = [[758, 49, 0.06464379947229551, 133, 0, 1], ....]
- Then, convert the two sets to NumPy arrays:
X_train, Y_train = make_np_array_XY(XY_train)
X_test, Y_test = make_np_array_XY(XY_test)
- The make_np_array_XY() call also breaks each set up into two arrays: the features X and the labels Y. X_train and Y_train look like this:
X_train = [[ 5.42000000e+02 3.40000000e+01 6.27306273e-02 1.04000000e+02 0.00000000e+00]....]
Y_train = [ 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1.]
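Under the hood, make_np_array_XY() (defined in the full listing below) separates the features from the label with simple NumPy slicing. A minimal sketch of the idea, using one of the sample rows above:

import numpy as np

a = np.array([[542, 34, 0.0627, 104, 0, 1]])  # last column is the label
x = a[:, 0:-1]    # every column except the last -> features
y = a[:, -1]      # the last column -> label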
- Then, we train the SVM:
# train set
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X_train, Y_train)
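We use the linear kernel here. SVC supports other kernels as well; for example, an RBF kernel could be swapped in like this if the data turned out not to be linearly separable (an untested variation with a made-up gamma value, not what this chapter uses):

svc_rbf = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X_train, Y_train)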
- Now that we have a trained svc object (sklearn.svm.classes.SVC), we're able to predict Y_predict from the test set X_test:
Y_predict = svc.predict(X_test)
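To make the workflow concrete, here is a sketch of how a brand-new incoming comment could be screened with the trained model. It assumes the feature helpers from the full listing below, and the sample string is made up:

s = 'DOGE DOGE DOGE DOGE MUCH SEXY'   # hypothetical new comment
cnt = words_counter_object(s)
x_new = np.array([[number_of_chars(s), unique_chars(s),
                   weighted_unique_chars(s), total_words(cnt),
                   is_repeated(cnt)]])
print svc.predict(x_new)   # [ 0.] -> bad, [ 1.] -> good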
- The next step is to compare the elements of the Y_predict list with those of Y_test.
- With this very small feature vector, we hit about the 75% mark. We can also use a metric that scikit-learn provides: sklearn.metrics.f1_score (a.k.a. balanced F-score or F-measure):
test_size = len(Y_test)
score = 0
for i in range(test_size):
    if Y_predict[i] == Y_test[i]:
        score += 1
print 'Got %s out of %s' %(score, test_size)

f1 = f1_score(Y_test, Y_predict, average='macro')
print 'f1 macro = %.2f' %(f1)
f1 = f1_score(Y_test, Y_predict, average='micro')
print 'f1 micro = %.2f' %(f1)
f1 = f1_score(Y_test, Y_predict, average='weighted')
print 'f1 weighted = %.2f' %(f1)
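For intuition on the three averaging modes: 'macro' is the unweighted mean of the per-class F1 scores, 'micro' computes a single F1 from the global true/false positive counts, and 'weighted' averages the per-class scores weighted by each class's support. A tiny standalone example (the toy labels are made up):

from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 1]
for avg in ('macro', 'micro', 'weighted'):
    print '%8s f1 = %.2f' % (avg, f1_score(y_true, y_pred, average=avg))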
- To predict better, we need more features as well as a bigger training data set; both come in the next section, where this code, which is admittedly not in good shape, will also be refined further.
from __future__ import division
import collections
import random
import numpy as np
from sklearn import svm
from sklearn.metrics import f1_score

def number_of_chars(s):
    return len(s)

def unique_chars(s):
    s2 = ''.join(set(s))
    return len(s2)

def weighted_unique_chars(s):
    return unique_chars(s)/number_of_chars(s)

def words_count(s):
    return collections.Counter(s)

def words_counter_object(s):
    cnt = collections.Counter()
    words = s.split()
    for w in words:
        cnt[w] += 1
    return cnt

def total_words(cnt):
    sum = 0
    for k in dict(cnt).keys():
        sum += int(cnt[k])
    return sum

def most_common(cnt, n):
    for k,v in cnt.most_common(n):
        #print "most common k = %s : v = %s" %(k,v)
        pass

def is_repeated(cnt):
    for k,v in cnt.most_common(1):
        freq = v/total_words(cnt)
        # print 'freq=',freq
        if freq > 0.5:
            return 1
    return 0

def make_feature_vector(critique, labels):
    " construct feature vector"
    feature_vector = []
    for i in range(len(critique)):
        s = critique[i]
        feature = []
        counter_obj = words_counter_object(s)
        feature.append(number_of_chars(s))
        feature.append(unique_chars(s))
        feature.append(weighted_unique_chars(s))
        feature.append(total_words(counter_obj))
        feature.append(is_repeated(counter_obj))
        feature.append(labels[i])
        feature_vector.append(feature)
    return feature_vector

def read_data():
    ''' read and make a list of critiques'''
    f = open('bad.txt', 'r')
    bad = f.read().split('|')
    f.close()
    f = open('good.txt', 'r')
    good = f.read().split('|')
    f.close()
    return bad+good, [0]*len(bad) + [1]*len(good)

def make_np_array_XY(xy):
    print "make_np_array_XY()"
    a = np.array(xy)
    x = a[:,0:-1]
    y = a[:,-1]
    return x,y

if __name__ == '__main__':
    critiques, labels = read_data()
    features_and_labels = make_feature_vector(critiques, labels)
    number_of_features = len(features_and_labels[0]) - 1

    random.shuffle(features_and_labels)

    # make train / test sets from the shuffled list
    cut = int(len(features_and_labels)/2)
    XY_train = features_and_labels[:cut]
    XY_test = features_and_labels[cut:]

    X_train, Y_train = make_np_array_XY(XY_train)
    X_test, Y_test = make_np_array_XY(XY_test)

    # train set
    C = 1.0  # SVM regularization parameter
    svc = svm.SVC(kernel='linear', C=C).fit(X_train, Y_train)
    print 'type(svc)=', type(svc)
    print 'svc=', svc

    print 'Y_test:\n', Y_test
    Y_predict = svc.predict(X_test)
    print 'Y_predict:\n', Y_predict

    # score
    test_size = len(Y_test)
    score = 0
    for i in range(test_size):
        if Y_predict[i] == Y_test[i]:
            score += 1
    print 'Got %s out of %s' %(score, test_size)

    # f1 score
    f1 = f1_score(Y_test, Y_predict, average='macro')
    print 'f1 macro = %.2f' %(f1)
    f1 = f1_score(Y_test, Y_predict, average='micro')
    print 'f1 micro = %.2f' %(f1)
    f1 = f1_score(Y_test, Y_predict, average='weighted')
    print 'f1 weighted = %.2f' %(f1)
Output:
...
svc= SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
    gamma=0.0, kernel='linear', max_iter=-1, probability=False,
    random_state=None, shrinking=True, tol=0.001, verbose=False)
Y_test:
[ 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0.]
Y_predict:
[ 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1.]
Got 29 out of 39
f1 macro = 0.81
f1 micro = 0.81
f1 weighted = 0.81
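To get a feel for what make_feature_vector() actually produces, here is a sketch that runs the helpers from the listing above (including its __future__ division import) on a shortened, made-up version of the DOGE comment, with the label 0 appended by hand:

s = 'DOGE DOGE DOGE DOGE DOGE MUCH SEXY DOGE'
cnt = words_counter_object(s)
feature = [number_of_chars(s), unique_chars(s), weighted_unique_chars(s),
           total_words(cnt), is_repeated(cnt), 0]   # 0 = bad
print feature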
The code below (sv5.py) is almost identical to Code A (svtest.py) used in the previous section. The difference is that we're using linear_model.SGDClassifier() as the classifier, which is much faster. Also, this time we're using a bigger data set (goodCritiques.txt and badCritiques.txt).
# train set
C = 1.0  # SVM regularization parameter
#svc = svm.SVC(kernel='linear', C=C).fit(X_train, Y_train)
print "linear_model.SGDClassifier()..."
from sklearn import linear_model
svc = linear_model.SGDClassifier().fit(X_train, Y_train)
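One detail worth knowing: SGDClassifier's default loss is 'hinge', so it is still fitting a linear SVM, just via stochastic gradient descent instead of the exact solver. Spelled out explicitly:

from sklearn import linear_model

# loss='hinge' is the default, i.e. a linear SVM trained with SGD
svc = linear_model.SGDClassifier(loss='hinge').fit(X_train, Y_train)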
The ratio of the sizes of the two sets (train vs. test) has been changed from 1:1 to 9:1; 4:1 or 9:1 are the usual choices.
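As an aside, scikit-learn ships its own splitter that shuffles and splits in one call. A minimal sketch, assuming X and Y are the full feature matrix and label vector (the import path is sklearn.model_selection in recent releases, sklearn.cross_validation in old ones):

from sklearn.model_selection import train_test_split

# 9:1 train/test split, shuffled by default
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1)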
The code looks like this:
from __future__ import division
import collections
import random
import numpy as np
from sklearn import svm
from sklearn.metrics import f1_score

def number_of_chars(s):
    return len(s)

def unique_chars(s):
    s2 = ''.join(set(s))
    return len(s2)

def weighted_unique_chars(s):
    return unique_chars(s)/number_of_chars(s)

def words_count(s):
    return collections.Counter(s)

def words_counter_object(s):
    cnt = collections.Counter()
    words = s.split()
    for w in words:
        cnt[w] += 1
    return cnt

def total_words(cnt):
    sum = 0
    for k in dict(cnt).keys():
        sum += int(cnt[k])
    return sum

def most_common(cnt, n):
    for k,v in cnt.most_common(n):
        #print "most common k = %s : v = %s" %(k,v)
        pass

def is_repeated(cnt):
    for k,v in cnt.most_common(1):
        freq = v/total_words(cnt)
        # print 'freq=',freq
        if freq > 0.5:
            return 1
    return 0

def make_feature_vector(critique, labels):
    " construct feature vector"
    feature_vector = []
    for i in range(len(critique)):
        s = critique[i]
        feature = []
        counter_obj = words_counter_object(s)
        feature.append(number_of_chars(s))
        feature.append(unique_chars(s))
        feature.append(weighted_unique_chars(s))
        feature.append(total_words(counter_obj))
        feature.append(is_repeated(counter_obj))
        feature.append(labels[i])
        feature_vector.append(feature)
    return feature_vector

def read_data():
    ''' reads data files and returns lists of comments and labels'''
    f = open('badCritiques.txt', 'r')
    #f = open('bad.txt', 'r')
    bad = f.read().split('|')
    f.close()
    f = open('goodCritiques.txt', 'r')
    #f = open('good.txt', 'r')
    good = f.read().split('|')
    f.close()
    return bad+good, [0]*len(bad) + [1]*len(good)

def make_np_array_XY(xy):
    """ takes XY (features + label) lists, then makes np arrays for X, Y """
    a = np.array(xy)
    x = a[:,0:-1]
    y = a[:,-1]
    return x,y

def get_f1_score(Y_test, Y_predict):
    test_size = len(Y_test)
    score = 0
    for i in range(test_size):
        if Y_predict[i] == Y_test[i]:
            score += 1
    print 'Got %s out of %s' %(score, test_size)
    print 'f1 macro = %.2f' %(f1_score(Y_test, Y_predict, average='macro'))
    print 'f1 micro = %.2f' %(f1_score(Y_test, Y_predict, average='micro'))
    print 'f1 weighted = %.2f' %(f1_score(Y_test, Y_predict, average='weighted'))

if __name__ == '__main__':
    critiques, labels = read_data()
    features_and_labels = make_feature_vector(critiques, labels)
    number_of_features = len(features_and_labels[0]) - 1

    # shuffle to mix good and bad ones
    random.shuffle(features_and_labels)

    # make train / test sets from the shuffled list
    cut = int(len(features_and_labels)*0.9)
    XY_train = features_and_labels[:cut]
    XY_test = features_and_labels[cut:]

    X_train, Y_train = make_np_array_XY(XY_train)
    X_test, Y_test = make_np_array_XY(XY_test)
    print 'len(X_test) = %s len(Y_test) = %s' %(len(X_test), len(Y_test))

    # train set
    C = 1.0  # SVM regularization parameter
    #svc = svm.SVC(kernel='linear', C=C).fit(X_train, Y_train)
    print "linear_model.SGDClassifier()..."
    from sklearn import linear_model
    svc = linear_model.SGDClassifier().fit(X_train, Y_train)

    print "svc.predict()..."
    Y_predict = svc.predict(X_test)
    print 'Y_predict:\n', Y_predict
    print 'Y_test: \n', Y_test

    # get f1 score
    get_f1_score(Y_test, Y_predict)
Output:
len(X_test) = 115 len(Y_test) = 115
linear_model.SGDClassifier()...
svc.predict()...
Y_predict:
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Y_test:
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
Got 100 out of 115
f1 macro = 0.93
f1 micro = 0.93
f1 weighted = 0.93
It appears to work fine, and we got better f1 scores. However, if we look carefully at the Y_predict values, they are all 1s. If we run it again, we may get all 0s. What does this mean?
The SVM is not really learning; it is just predicting a single class for every sample, primarily because the features don't carry enough information. So, we need more features.
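Before adding features, it is worth checking what an "always predict the majority class" baseline would score; with a test set that is mostly 1s, an all-1s prediction already looks deceptively good. A sketch:

import numpy as np

majority = 1.0 if np.mean(Y_train) >= 0.5 else 0.0
baseline = np.mean(Y_test == majority)
print 'majority-class baseline accuracy = %.2f' % baseline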
As discussed in the previous section, we need more features. I added two new features to the feature vector:
def repeated_count_top_3(cnt):
    "returns the average count of the most common 3 words"
    freq = 0
    for k,v in cnt.most_common(3):
        freq += v
    return freq/3

def longest(s):
    "returns the length of the longest word"
    mylist = s.split()
    if len(mylist) == 0:
        return 0
    return len(max(mylist, key=len))
The first one returns the average count of the three most common words (their total frequency divided by 3), and the second one returns the length of the longest word in a comment.
We also need to modify the following function to append the two new features to the feature vector:
def make_feature_vector(critique, labels):
    ...
    feature.append(repeated_count_top_3(counter_obj))
    feature.append(longest(s))
    ...
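A quick sanity check of the two new features on a shortened, made-up version of the DOGE comment, using the helpers defined above:

s = 'DOGE DOGE DOGE DOGE MUCH SEXY'
cnt = words_counter_object(s)
print repeated_count_top_3(cnt)   # (4 + 1 + 1) / 3 = 2.0 with true division
print longest(s)                  # 'DOGE' -> 4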
With the new code (sv6.py), we get the following output:
len(X_test) = 229 len(Y_test) = 229
linear_model.SGDClassifier()...
svc.predict()...
Y_predict:
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
Y_test:
[ 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1.]
Got 67 out of 229
f1 macro = 0.29
f1 micro = 0.29
f1 weighted = 0.29
Note that the Y_predict array has a mix of 1s and 0s now! The scores for this particular run are lower, but the classifier is no longer stuck predicting a single class.
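To see exactly how the predictions break down per class, scikit-learn's confusion matrix is handy. A short sketch:

from sklearn.metrics import confusion_matrix

# rows are the true classes (0, 1), columns the predicted classes (0, 1)
print confusion_matrix(Y_test, Y_predict)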