Machine learning and Steem #3: Account classification - accuracy improvement up to 95%

jacekw.dev (58) in #utopian-io • 6 years ago (edited)

Repository

https://github.com/keras-team/keras

What Will I Learn?

Collect data from beem and SteemSQL
Build neural network model for multiclass classification problem
Build decision tree for multiclass classification problem
Visualize decision tree

Requirements

python
basic concepts of data analysis / machine learning

Tools:

python 3
- pandas
- matplotlib
- seaborn
- jupyter notebook
- scikit-learn
- keras + tensorflow (as backend to keras wrapper)

It looks like a lot of libraries, but it's a standard python toolset for data analysis / machine learning.

Difficulty

Intermediate

Tutorial Contents

Problem description
Collecting data
Building neural network
Building decision tree
Visualization of the decision tree

Problem description

The purpose of this tutorial is to improve the accuracy of the model used to classify Steem accounts (content creator vs scammer vs comment scammer vs bid-bot). In the previous part of this tutorial, we achieved an accuracy of 83%. And this is what the confusion matrix looks like:

We will try to get improvement through:

enlarging the training set
increasing the number of iterations
using a decision tree instead of a neural network

Collecting data

The data will be downloaded in a similar way to the previous part of the tutorial, but we want to collect as much as possible. Last time, each class had 100 elements, but this is not enough.

Scammers will be collected from the comments of the users guard and arcange, who warn about this type of account. For example:

We will use the following queries:

SELECT DISTINCT SUBSTRING(body, CHARINDEX('@', body) + 1, CHARINDEX(' leads', body) - CHARINDEX('@', body) - 1) as account
FROM Comments (NOLOCK)
WHERE depth > 0 AND
      author = 'guard' AND CONTAINS(body, 'phishing')

SELECT DISTINCT SUBSTRING(body, CHARINDEX('@', body) + 1, CHARINDEX(' is a', body) - CHARINDEX('@', body) - 1) as account
FROM Comments (NOLOCK)
WHERE depth > 0 AND
      author = 'arcange' AND
      CONTAINS(body, 'CONFIRMED AND SCAM') AND
      body LIKE '%The message you received from%'

In this way, we've obtained 750 unique accounts. Therefore, we will try to get the same number of elements for other classes.
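The SUBSTRING/CHARINDEX expressions in the two queries above cut out the text between the first '@' and a marker phrase (' leads' for guard, ' is a' for arcange). Below is a minimal Python sketch of the same idea, which may be handy for checking the logic locally. The function name extract_account and the sample comment body are assumptions made up for illustration; they are not real comments and not part of the original script.

```python
def extract_account(body, marker):
    """Return the text between the first '@' and the first occurrence of marker.

    Mirrors SUBSTRING(body, CHARINDEX('@', body) + 1,
                      CHARINDEX(marker, body) - CHARINDEX('@', body) - 1).
    Returns None when '@' or the marker is missing, or the marker comes first.
    """
    at = body.find('@')
    end = body.find(marker)
    if at == -1 or end == -1 or end <= at:
        return None
    return body[at + 1:end]


# Hypothetical comment body, only to show the call.
sample = "Warning! The link posted by @bad.actor leads to a phishing site."
print(extract_account(sample, ' leads'))
```

Returning None instead of raising makes it easy to skip comments that do not follow the expected template.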

Content creators will be collected from the SteemSQL database with the following script:

SELECT TOP 750 author
FROM Comments (NOLOCK)
WHERE depth = 0 AND
      category in ('utopian-io', 'dtube', 'dlive', 'steemhunt', 'polish')
ORDER BY NEWID()

To get a list of spammers, first find the most frequent short comments.

SELECT body, COUNT(*) as cnt
FROM Comments (NOLOCK)
WHERE depth = 1 AND LEN(body) < 15 AND created BETWEEN GETUTCDATE() - 60 AND GETUTCDATE()
GROUP BY body
ORDER BY cnt DESC

This gives us the following list:

spam_phrases = [
    'nice', 'nice post', 'good', 'beautiful', 'good post',
    'thanks', 'upvoted', 'very nice', 'great', 'nice blog',
    'thank you', 'wow', 'amazing', 'nice one', 'awesome',
    'great post', 'lol', 'like', 'cool', 'hi',
    'nice, upvoted', 'good job', 'nice article', 'nice pic', 'nice photo',
    'welcome', 'hello', 'good article', 'nice picture', 'nice info',
    'promote me', 'fantastic', 'super', 'nice work', 'nice video',
    'good project', 'wonderful', 'nice bro', 'lovely', 'nice shot'
]

Now we can collect the list of spammers with the following query:

query = """\
SELECT TOP 750 author
FROM Comments (NOLOCK)
WHERE depth = 1 AND
      created BETWEEN GETUTCDATE() - 60 AND GETUTCDATE() AND
      body in """ + to_sql_list(spam_phrases) + """
GROUP BY author
ORDER BY COUNT(*) DESC"""
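The query above relies on a to_sql_list helper that is not shown in the post. A minimal sketch of what such a helper could look like follows; this is an assumed implementation, not necessarily the one used in the original script.

```python
def to_sql_list(values):
    """Format Python strings as a SQL list literal, e.g. ('a', 'b')."""
    # Double any single quote so it cannot end the SQL string literal early.
    escaped = ["'" + v.replace("'", "''") + "'" for v in values]
    return "(" + ", ".join(escaped) + ")"


print(to_sql_list(['nice', 'good post']))  # ('nice', 'good post')
```

Building SQL by string concatenation is acceptable here because the phrase list is hard-coded; for values that come from users, query parameters are the safer choice.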

The list of bid-bots has been collected manually from https://steembottracker.com/. But here's the problem: we only have 100 records. There are two options:

leave 100 records and thus have unbalanced classes
use some method to add records

We will use the simplest method: we copy the data until we have 750 records. This is not ideal, but it is better than having unbalanced classes.
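Copying records until a class reaches the target size can be sketched in a few lines. The example below uses only the standard library and placeholder account names; the function name oversample is an assumption, and the original script may do this differently (for example on a pandas DataFrame).

```python
from itertools import cycle, islice


def oversample(records, target_size):
    """Repeat records in their original order until there are target_size of them."""
    if not records:
        raise ValueError("cannot oversample an empty list")
    return list(islice(cycle(records), target_size))


# Placeholder names, not real bid-bot accounts.
bid_bots = ['bot%d' % i for i in range(100)]
balanced = oversample(bid_bots, 750)
print(len(balanced), len(set(balanced)))  # 750 100
```

The result has 750 rows but still only 100 distinct accounts, so the class is balanced in size without gaining any new information.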

The full script that retrieves the features of accounts can be found here.

As a reminder, the features analyzed are:

['followers', 'followings', 'follow ratio', 'muters',
'reputation', 'effective sp', 'own sp', 'sp ratio', 'curation_rewards',
'posting_rewards', 'witnesses_voted_for', 'posts', 'average_post_len', 'comments',
'average_comment_len', 'comments_with_link_ratio', 'posts_to_comments_ratio']

I will not focus on visualization here, because it was shown in previous parts and I do not want to repeat myself. The sample chart looks as follows and all of them can be seen here.

Building neural network

Let's start with the simplest neural network possible.

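import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix

# one hidden layer with 17 neurons, one output neuron per class
model = Sequential()
model.add(Dense(17, input_dim=17, activation='relu'))
model.add(Dense(4, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='nadam',
              metrics=['accuracy'])

model.fit(X_train, y_train, epochs=50, batch_size=1, verbose=0)
score = model.evaluate(X_test, y_test, verbose=0)
# predict_classes was removed from recent Keras versions; taking argmax over
# the softmax output gives the same class indices
y_pred = np.argmax(model.predict(X_test), axis=1)

print('accuracy: %.3f' % score[1])
cm = confusion_matrix(np.argmax(y_test, axis=1), y_pred)
plot_confusion_matrix(cm)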

accuracy: 0.847

A short glossary:

Name	Description
model.add	adds a new layer
Dense	fully connected layer
relu	ReLU activation function
softmax	softmax activation function, turns the outputs into class probabilities
categorical_crossentropy	loss used for multiclass classification problems
nadam	Nesterov-accelerated Adaptive Moment Estimation optimizer, basically Adam with Nesterov momentum
accuracy	percentage of correctly classified inputs
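To make the softmax output layer and the categorical_crossentropy loss used in the model a bit more concrete, here is a small standard-library sketch of both for a single sample. The input numbers are arbitrary, and the functions only illustrate the formulas; they are not how Keras implements them internally.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_prob):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)


probs = softmax([2.0, 1.0, 0.1, -1.0])   # arbitrary scores for 4 classes
loss = categorical_crossentropy([1, 0, 0, 0], probs)
print(probs, loss)
```

With a one-hot label, the loss reduces to the negative log of the probability assigned to the true class, so confident correct predictions give a loss close to zero.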

Let's add more layers and neurons and increase the number of iterations.

model = Sequential()
model.add(Dense(85, input_dim=17, activation='relu'))
model.add(Dense(40, activation='relu'))
model.add(Dense(20, activation='relu'))
model.add(Dense(4, activation='softmax'))

accuracy: 0.851

We see some improvement over the model from the previous part of the tutorial (from 83% to about 85%), but it is not significant yet.

Building decision tree

Let's try to use a different model - a decision tree. It is a decision support tool that uses a tree-like graph of decisions and their consequences.

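from sklearn import tree
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X_cols = columns
y_cols = ['class']
X = df[X_cols]
y = df[y_cols]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
input_dim = len(X_cols)

model = tree.DecisionTreeClassifier(max_depth=8)

model.fit(X_train, y_train)
# evaluate on the held-out test set; calling fit here would retrain the tree on the test data
score = model.score(X_test, y_test)
y_pred = model.predict(X_test)

print('accuracy: %.3f' % accuracy_score(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)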

accuracy: 0.951

The accuracy of the model has increased significantly up to 95%!

Visualization of the decision tree

We will use the graphviz library for visualization.

import graphviz
from sklearn.tree import export_graphviz

# export the fitted tree to DOT format; class_names must list the class
# labels in ascending order of the encoded classes
dot_data = export_graphviz(
    model,
    out_file=None,
    feature_names=X_cols,
    class_names=class_names,
    filled=True,
    rounded=True,
    special_characters=True)

# render the DOT source to dtree_render.png and open it
graph = graphviz.Source(dot_data)
graph.format = 'png'
graph.render('dtree_render', view=True)

As we can see, the decision tree is very large (to see the details, open the image in a new tab).

Let's look at a fragment of this tree. The color of nodes is related to the class. The value gini (Gini impurity) measures how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset.
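The Gini impurity shown in each node can be computed directly from the labels that reach it: it is 1 minus the sum of the squared class proportions. A minimal sketch follows; the function name and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity(['scammer'] * 4))                       # pure node
print(gini_impurity(['a', 'b', 'c', 'd']))                  # four equally likely classes
print(gini_impurity(['a', 'a', 'b', 'b']))                  # two equally likely classes
```

A pure node has impurity 0, while four equally represented classes give 0.75, which is the largest value possible with four classes. The tree chooses splits that reduce this number as much as possible.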

Curriculum

Conclusions

the bigger the dataset, the better we can train our model
it's worth trying different algorithms instead of sticking to one
the decision tree turned out to be a better choice, both in terms of accuracy and execution time

Proof of Work Done

Collecting data
Building classifiers

#tutorials #programming #machinelearning #python

portugalcoin (72) 6 years ago

Thank you for your contribution.

Very good tutorial, thanks for your effort for this tutorial.

paulag (73) 6 years ago

this is really awesome. Well done. I am thinking I would love to work with you on something. You have fantastic skills and your posts are a breath of fresh air.

Hope you don't mind me asking, but did you sign up to steemit via utopian?

jacekw.dev (58) 6 years ago

Thanks. I came to steemit about a year ago, but I only wrote in the #polish community using the @jacekw account. Recently I have some ideas for using Machine Learning (eg account classification), so I decided to start creating contributions for #utopian-io.

"I am thinking I would love to work with you on something."

With pleasure!

paulag (73) 6 years ago

wow just looking at some of your posts on your other account too. I had never come across this account before, this is such a pity. but I am glad I have discovered you now. Steem on
