Machine learning and Steem #5: Account classification - understanding and using Ensemble Methods + tweaking performance


Repository

https://github.com/scikit-learn/scikit-learn

What Will I Learn?

  • Collect data using beem and SteemSQL
  • Understand the idea behind Ensemble Methods
  • Use Ensemble Methods
    • Averaging Methods: Bagging, RandomForest
    • Boosting Methods: AdaBoost, GradientTreeBoosting
  • Compare Ensemble Methods performance
  • Tweak performance

Requirements

  • python
  • intermediate knowledge of data science / machine learning

Tools:

It may look like a lot of libraries, but this is a standard Python toolset for data analysis and machine learning.
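For reference, the imports used throughout this tutorial look roughly as follows (reconstructed from the code in the later sections, not the exact header of the original scripts):

import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, confusion_matrix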

Difficulty

  • Advanced

Tutorial Contents

  • Problem description
  • Collecting data
  • 3D visualization
  • Idea behind Ensemble Methods
  • Using Ensemble Methods
  • Comparing performance
  • Tweaking performance
  • What's next?

Problem description

The purpose of this tutorial is to build a Steem account classifier that will be able to distinguish between:

  • content creator
  • scammer
  • comment spammer
  • bid-bot

It is possible that the number of classes will change in the final version. The classifier created in the previous part of this tutorial achieved an accuracy of around 96%. In this section, Ensemble Methods will be used to further improve this result. It is also important that the execution time remains acceptable.

The final goal is to build an API and web interface that will allow any Steem user to use this classifier.

Collecting data

The data was downloaded in a similar way to the previous part of the tutorial. Currently, each class has 822 records. The script that retrieves all the data can be found here.
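For completeness, loading the collected dataset might look as follows; the accounts.csv filename is just an assumption, as the actual storage format is defined in the linked script.

import pandas as pd

# load the collected account records (the filename is hypothetical)
df = pd.read_csv('accounts.csv')
print(df)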

The data looks as follows:

                 name  followers  followings  ...    comments_with_link_ratio  posts_to_comments_ratio  class
0          fuadmaulana        198          24  ...                    0.000000                 0.000000      0
1               ltfbyz         58          31  ...                    0.000000                 0.000000      0
2             papanero        460         191  ...                    0.000000                 0.624113      0
3           pialejoana        356          10  ...                    0.704641                 0.046414      0
4           adarshagni        700          63  ...                    0.000000                 3.000000      0
5           pialejoana        356          10  ...                    0.704641                 0.046414      0
6      uniqueknowledge        226          81  ...                    0.000000                 0.000000      0
7       jesusdiazmaica        190           7  ...                    0.000000                 0.000000      0
8        playfulfoodie       3820         537  ...                    0.022099                 0.254144      0
9             wakjal12        152          38  ...                    0.000000                 0.000000      0
10           iamfusion        201          60  ...                    0.000000                 0.000000      0
11           naruitchi        570         483  ...                    0.000000                 0.000000      0
12     mugiwaranohaqqi        395          83  ...                    0.000000                65.000000      0
13            click3rs       3814        5003  ...                    0.000000                10.125000      0
14          harwalis05        369         139  ...                    0.000000                 3.400000      0
15           taufit333        438         527  ...                    0.000000                 0.000000      0
16               unnun        354         121  ...                    0.000000                 2.000000      0
17             schubes        492         111  ...                    0.083333                 0.916667      0
18        ahmetsaritas        287          51  ...                    0.000000                 0.000000      0
19    grecki-bazar-ewy        697         181  ...                    0.142336                 0.054745      0
20        arthurbishop        116          43  ...                    0.000000                 0.000000      0
21            jozef230        982         504  ...                    0.291045                 0.212687      0
22             redjepi       1391          34  ...                    0.083333                 4.000000      0
23         puneetgamer         74          24  ...                    0.000000                 0.000000      0
24              godwyn        446         311  ...                    0.000000                 0.000000      0
25          lexikon082       1561         238  ...                    0.113095                 0.351190      0
26        good-joinned        871         194  ...                    0.000000                 0.000000      0
27          jumpmaster        805         194  ...                    0.000000                 0.671642      0
28       cheapproperty        283           8  ...                    0.000000                 0.333333      0
29            webgrrrl        368          25  ...                    0.000000                 3.000000      0
...                ...        ...         ...  ...                         ...                      ...    ...

3D visualization of features

To understand the data we are dealing with, it is worth visualizing it. We will use the mpl_toolkits.mplot3d toolkit and the code below. The label and color helpers build the colored legend entries; a minimal version of them is included in the snippet.

from matplotlib.lines import Line2D

# legend helpers (minimal reconstruction): color converts 0-255 RGB
# values to the 0-1 range used by matplotlib, label creates a proxy
# marker drawn in that color
def color(r, g, b):
    return (r / 255, g / 255, b / 255)

def label(c):
    return Line2D([0], [0], linestyle='none', marker='o',
                  markerfacecolor=c, markeredgecolor='k')

# create 3D scatter plot
# x, y, z - names of features
def scatter_plot_3d(d, x, y, z):
    fig = plt.figure(figsize=(7, 7))
    # create 3D axes
    ax = Axes3D(fig)
    # plot points, cmap is the colormap
    ax.scatter(d[x], d[y], d[z], c=d['class'], cmap=plt.cm.Accent, edgecolor='k', s=80)
    # set labels of each axis
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_zlabel(z)
    # add a legend; the color of each legend label corresponds
    # to the color of the given class
    ax.legend([label(color(150, 210, 150)), label(color(247, 192, 135)),
               label(color(235, 8, 144)), label(color(102, 102, 102))],
              ['content-creator', 'scammer', 'comment-spammer', 'bid-bot'], numpoints=1)
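A single chart can then be produced with a call like the one below; the feature names are taken from the chart captions in this section.

# plot three of the collected features against each other
scatter_plot_3d(df, 'followings', 'followers', 'muters')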

If we use raw data, the visualization may not be very readable, due to a number of values that differ significantly from the rest of the dataset.

followings + followers + muters:

For this reason, it is worth cleaning up the data that we will use for visualization by simply removing some of the outliers.

# work on a copy so the original dataframe stays intact
df1 = df.copy()
for column in columns:
    # calculate the 96th percentile (we drop the top 4% of values as outliers)
    quantile = df[column].quantile(0.96)
    # keep only the rows below the threshold
    df1 = df1[df1[column] < quantile]

The charts are now much more readable.

followings + followers + muters:



If we run the code in a Jupyter Notebook, the charts are interactive.
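For example, in a classic Jupyter Notebook the interactive backend can be enabled with the %matplotlib notebook magic (the original notebook setup is not shown, so this is an assumption):

# run this in a notebook cell before plotting
%matplotlib notebook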

Idea behind Ensemble Methods

Ensemble Methods can be divided into two groups:

  • Averaging Methods
  • Boosting Methods

The idea behind the Averaging Methods is to build several classifiers independently and then average their predictions (or combine them by voting) to produce the final result. This usually leads to a better result, due to the reduced variance of the combined estimator.

In the case of Boosting Methods it is different: base classifiers are built sequentially, each one trying to reduce the bias of the combined estimator.

  • Averaging Methods
    • Bagging methods build several independent instances of a classifier on random subsets of the training dataset and then aggregate the individual predictions into a final prediction. In this way, the variance of the base classifier can be reduced (a hand-rolled sketch of this idea follows this list).
    • Random forests consist of trees built from bootstrap samples of the training set. What's more, each split is the best split among a random subset of the features, not among all features.
    • Extremely Randomized Trees are similar to random forests, but when performing a split, instead of looking for the most discriminating thresholds, the thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is selected as the splitting rule.
  • Boosting Methods
    • AdaBoost fits a sequence of weak estimators on repeatedly modified versions of the training set. The final prediction is a weighted vote of the individual estimators.
    • GradientTreeBoosting generalizes boosting to arbitrary differentiable loss functions.
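To make the averaging idea concrete, below is a hand-rolled sketch of bagging, assuming X_train, y_train and X_test are plain numpy arrays with integer class labels. Scikit-learn's BaggingClassifier, used later, does the same thing (and more) for us.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# bagging by hand: train several trees on bootstrap samples
# and take a majority vote over their predictions
def bagged_predict(X_train, y_train, X_test, n_estimators=10, seed=0):
    rng = np.random.RandomState(seed)
    all_predictions = []
    for _ in range(n_estimators):
        # draw a bootstrap sample (sampling with replacement) of the training set
        idx = rng.randint(0, len(X_train), len(X_train))
        tree = DecisionTreeClassifier(max_depth=8)
        tree.fit(X_train[idx], y_train[idx])
        all_predictions.append(tree.predict(X_test))
    # majority vote: pick the most frequent class for each test sample
    all_predictions = np.array(all_predictions)
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(),
                               0, all_predictions)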

Using Ensemble Methods

The following code will be used to test the classifiers.

X_cols = columns
y_cols = ['class']
X = df[X_cols]
y = df[y_cols]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# convert from pandas dataframe to plain array
y_train = np.array(y_train).ravel()
y_test = np.array(y_test).ravel()

# used classifiers: Decision tree for reference + ensemble methods
classifiers = [
    ('DecisionTreeClassifier', DecisionTreeClassifier(max_depth=8, random_state=0)), 
    ('BaggingClassifier', BaggingClassifier(DecisionTreeClassifier(max_depth=8), max_samples=0.5, max_features=0.5)),
    ('RandomForestClassifier', RandomForestClassifier(max_depth=10, n_estimators=10)),
    ('ExtraTreesClassifier', ExtraTreesClassifier(max_depth=8, n_estimators=10, min_samples_split=2, random_state=0)),
    ('AdaBoost', AdaBoostClassifier(n_estimators=100)),
    ('GradientTreeBoosting', GradientBoostingClassifier(n_estimators=60, learning_rate=1.0, max_depth=1, random_state=0))
]

# iterate over classifiers
for clf_name, clf in classifiers:
    start = time.time()
    # train the classifier on the training set only
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    end = time.time()
    # time.time() returns seconds, so the difference is reported in seconds
    print('%25s - accuracy: %.3f, execution time: %.3f s' %
          (clf_name, accuracy_score(y_test, y_pred), end - start))
    cm = confusion_matrix(y_test, y_pred)
    plot_confusion_matrix(cm)
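The plot_confusion_matrix helper comes from the earlier parts of this series; a minimal sketch of such a helper (not necessarily the original implementation) could look like this:

import matplotlib.pyplot as plt

# a minimal confusion-matrix plot: one colored cell per
# (true class, predicted class) pair, annotated with counts
def plot_confusion_matrix(cm):
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], 'd'), ha='center', va='center')
    ax.set_xlabel('predicted label')
    ax.set_ylabel('true label')
    plt.show()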

Compare Ensemble Methods performance

Classifier         Accuracy   Time [s]   Confusion matrix
DecisionTree       0.945      0.026      (image)
Bagging            0.941      0.069      (image)
RandomForest       0.986      0.067      (image)
ExtraTrees         0.884      0.029      (image)
AdaBoost           0.767      0.617      (image)
GradientBoosting   0.974      0.464      (image)

RandomForestClassifier and GradientBoostingClassifier achieved significantly better results than a single decision tree. AdaBoostClassifier turned out to be the worst classifier here, with accuracy far below all the others. However, it is worth remembering that using ensemble methods often means longer execution times.

A RandomForestClassifier consists of single decision trees, so we can visualize one of them with the following code.

import graphviz
from sklearn.tree import export_graphviz

# class labels in the same order as the numeric class ids
class_names = ['content-creator', 'scammer', 'comment-spammer', 'bid-bot']

# convert decision tree metadata to a graph
graph = graphviz.Source(export_graphviz(
    classifiers[2][1].estimators_[0],  # select the first decision tree from the RandomForestClassifier
    out_file=None,
    feature_names=X_cols,
    class_names=class_names,
    filled=True,
    rounded=True))

# set the format of the output file
graph.format = 'png'
# save the graph to RandomForestClassifier.png
graph.render('RandomForestClassifier')

Tweaking performance

RandomForestClassifier achieved the best accuracy, so we will now focus on this classifier and try to improve it even further.

We will change the values of several parameters and check how this affects the result:

  • criterion - the function that measures the quality of a split: gini or entropy
  • min_samples_split - the minimum number of samples required to split a node
  • bootstrap - whether or not to use bootstrap samples
  • n_estimators - the number of trees in the forest
  • max_depth - the maximum depth of a tree

The list of classifiers looks as follows.

classifiers = [
    ('RandomForestClassifier1', RandomForestClassifier(max_depth=8, n_estimators=10, random_state=0)),
    ('RandomForestClassifier2', RandomForestClassifier(max_depth=8, n_estimators=10, random_state=0, criterion='entropy')),
    ('RandomForestClassifier3', RandomForestClassifier(max_depth=8, n_estimators=10, random_state=0, min_samples_split=2)),
    ('RandomForestClassifier4', RandomForestClassifier(max_depth=8, n_estimators=10, random_state=0, bootstrap=True)),
    ('RandomForestClassifier5', RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0)),
    ('RandomForestClassifier6', RandomForestClassifier(max_depth=10, n_estimators=100, random_state=0))
]

And here are the results.

   RandomForestClassifier1 - accuracy: 0.971, execution time: 0.065 s
   RandomForestClassifier2 - accuracy: 0.976, execution time: 0.102 s
   RandomForestClassifier3 - accuracy: 0.971, execution time: 0.061 s
   RandomForestClassifier4 - accuracy: 0.971, execution time: 0.065 s
   RandomForestClassifier5 - accuracy: 0.976, execution time: 0.603 s
   RandomForestClassifier6 - accuracy: 0.992, execution time: 0.643 s

Most parameters did not have a significant impact on the result. The exception was max_depth: increasing it to 10 raised the accuracy to 99.2%, and increasing it further keeps the result at a similar level. However, it must be remembered that too large a depth may not be beneficial, because it increases the execution time and can lead to overfitting of the classifier.
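Manually enumerating configurations like this quickly becomes tedious; the search can also be automated with scikit-learn's GridSearchCV. Below is a minimal sketch, where the grid values are just examples and X_train, y_train come from the earlier code:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# example grid over the parameters discussed above
param_grid = {
    'max_depth': [8, 10, 12],
    'n_estimators': [10, 100],
    'criterion': ['gini', 'entropy'],
}

# 5-fold cross-validated search over all parameter combinations
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)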

Below is the confusion matrix for the best result from the manual comparison (max_depth=10, n_estimators=100).




What's next?

  • building API for classifier
  • building web interface for classifier
  • enlarging the dataset

Curriculum

  1. Machine learning (Keras) and Steem #1: User vs bid-bot binary classification
  2. Machine learning and Steem #2: Multiclass classification (content creator vs scammer vs comment spammer vs bid-bot)
  3. Machine learning and Steem #3: Account classification - accuracy improvement up to 95%
  4. Machine learning and Steem #4: Account classification - comparison of available classifiers + 3D visualization of features

Conclusions

  • it is worth cleaning the data before visualization to remove outliers
  • 3D visualization can show more information than 2D visualization, but on the other hand, the data can be less readable
  • classifiers can be combined with each other in different ways
  • combining classifiers can reduce bias or variance, so the final accuracy can be higher
  • combining classifiers also means longer execution times, so you often have to choose between accuracy and speed
  • performance can be improved further by tuning parameters

Proof of Work Done

Collecting data
Classification
