Sincerity Project Update
Over the last week I've been integrating some of the data from the SteemPlus crowdsourcing process. This expands the training data I can use to teach the machine learning algorithm to distinguish spammers and bots from human content creators.
I'm fairly new to machine learning, and when somebody (sorry, I can't remember who) mentioned random forest classifiers a couple of weeks ago, I decided to investigate them further. As a result, I have now changed part of the classifier from a nearest neighbours algorithm to a random forest classifier.
In terms of how well the software predicts held-out portions of the training data (cross-validation), these two changes have resulted in a slight improvement in accuracy. The 'false positives' have also been reduced, meaning that fewer non-spammers should be wrongly labelled as spammers in SteemPlus and other services using the Sincerity spam API. Obviously, people have different ideas of what a spammer is, so this tool can at best reflect some kind of community average.
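To make the two measures above concrete, here is a minimal sketch of how accuracy and the false positive rate can be computed from known labels and predictions. The labels, the two sets of predictions, and the function names are all made up for illustration; they are not Sincerity data or code.

```python
def confusion_counts(actual, predicted, positive="spammer"):
    """Count true/false positives and negatives, treating `positive` as the flagged class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1  # a non-spammer wrongly labelled as a spammer
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(actual, predicted):
    """Fraction of accounts whose predicted label matches the known label."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(actual, predicted):
    """Fraction of genuine non-spammers that were flagged as spammers."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return fp / (fp + tn)


# Hypothetical held-out labels and two sets of predictions (illustrative only).
actual = ["human"] * 6 + ["spammer"] * 4
predicted_old = ["human"] * 4 + ["spammer"] * 2 + ["spammer"] * 3 + ["human"]
predicted_new = ["human"] * 5 + ["spammer"] * 1 + ["spammer"] * 3 + ["human"]

for name, predicted in [("old", predicted_old), ("new", predicted_new)]:
    print(name, accuracy(actual, predicted), false_positive_rate(actual, predicted))
```

In this toy example the second set of predictions wrongly flags one human instead of two, so both the accuracy and the false positive rate improve.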
I am also now collecting extra data, which will be ready to incorporate at the end of the month and seems likely to further improve accuracy.
By that time, I would also like to have collected even more crowdsourced data for adding to the training set. I am very happy with the software, but the training data could still be better.
The trouble with allowing people to anonymously report spammers is that some people seem to use it as a way to try to remove content that they just don't like or understand. For example, many non-English language accounts were labelled incorrectly as spammers, as were some popular YouTubers who have recently joined the platform, perhaps with divisive content.
I am thinking about better ways to collect crowdsourced data. Feel free to let me know if you can help or have any ideas. If you are in a position to delegate some SP for me to upvote comments with useful training data, that could be very useful.
Current Training Data
Whilst I can't reveal full details about the classification algorithm for a couple of reasons, the training data I used is shown here (and will be kept up to date when retraining happens). I don't have time to check all these accounts myself, but if you do and find any inaccuracies, that would be very helpful information for improving the classifier. One incorrectly labelled account here could affect the software's classification of many accounts!
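As a rough illustration of why a single wrong label matters, here is a toy 1-nearest-neighbour sketch. It is not the Sincerity classifier (whose details are not published); the two features, the account values, and the function name are all hypothetical.

```python
def nearest_label(point, training):
    """Return the label of the training example closest to `point` (1-nearest-neighbour)."""
    closest = min(
        training,
        key=lambda example: sum((a - b) ** 2 for a, b in zip(example[0], point)),
    )
    return closest[1]


# Hypothetical features: (posts per day, fraction of duplicated comments).
correct_training = [
    ((2.0, 0.05), "human"),
    ((3.0, 0.10), "human"),
    ((40.0, 0.90), "spammer"),
    ((35.0, 0.80), "spammer"),
]

# The same training set with ONE account wrongly reported as a spammer.
mislabelled_training = list(correct_training)
mislabelled_training[1] = ((3.0, 0.10), "spammer")

# Unlabelled accounts that the classifier has to judge.
accounts = [(3.2, 0.12), (3.5, 0.08), (2.9, 0.15), (1.8, 0.04)]

# Accounts whose predicted label changes because of the single wrong label.
flipped = [
    account
    for account in accounts
    if nearest_label(account, correct_training)
    != nearest_label(account, mislabelled_training)
]
print(len(flipped), "of", len(accounts), "accounts change label")
```

Every account that resembles the mislabelled one inherits its wrong label, which is why corrections to the training data are so valuable.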
Pending changes (changes I plan before the next software training)
Thanks to the validation by @fraenk, I've updated the changes list.
To be removed from all lists:
To relabel as bots:
New API Methods Coming Soon
Project Sincerity isn't just about spam and bot classification, though. More of the large amount of data being collected will soon be available through new API methods relating to characteristics of voting, commenting, etc.