@calamus056 recently made a report, part 3 in his series on the self-voting user list since HF19. It was basically a data dump without much interpretation of the data gathered. It of course comes from his stance against self voting, which is evident from his other posts and the comments, but the data itself was presented as is. You can see it here.
I thought it would be interesting to try and interpret it and graph it. Statistics asks questions of data and attempts to make sense of it. See this from a stats research guide on the subject:
In regular conversation, both words [data and statistics] are often used interchangeably. In the world of libraries, academia and research there is an important distinction between data and statistics. Data is the raw information from which statistics are created. Put in the reverse, statistics provide an interpretation and summary of data.
[...] A statistic will answer “how much” or “how many”. [...]
These data and interpretation are for comments only, not root posts.
@calamus056 has asked me to re-relay his disclaimer:
DISCLAIMER: The information in this article shouldn't be perceived as 100% accurate. When you spot significant errors, please leave a comment. Also keep in mind that the full list below is a raw data dump. In no way is it implied that all cases are considered problematic. It's for you to decide what you think about it and what to do with the information. The reason for names being included is that this is public information and others will release (and some already have released) the information independently.
But also note that I did not run this interpretation by him and it is not endorsed by him unless he chooses to do so in the comments 🙂
Foreword on data
The data is from users whose self vote ratio on comments is 50% or above.
Is there a consistent relationship between SBD rewarded from voting on others' comments compared with SBD rewarded from voting on own comments?
The first graph here shows the SBD on own comments plotted against SBD on others' comments for users in the data set, as a scatter plot.
See that most points are clustered in the bottom left, though there does seem to be some linear relationship. This is very common in real world data which is often distributed logarithmically. When making an interpretation we would always adjust for this, but don't forget that we are adjusting.
Here the data on both axes are adjusted for their logarithmic distribution and we can see a clear linear relationship, which is denoted by a barely visible line (sorry about that, it's the best the software I'm using could provide).
This is called linear regression. The best formula we can fit to the data is y = 0.557x - 7.5226, where x is the SBD on own comments and y is the predicted SBD on others' comments. You might notice from the graph that the line does not go all the way through: this approximate equation has a lower limit of about x = $13.506, because below that it would predict negative rewards for others.
E.g. $50 on own comments would predict about $20.33 rewards for others. In this case the self vote reward ratio is 71.09%
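To make the arithmetic concrete, here is a minimal Python sketch of that calculation. The slope and intercept are the ones from the fitted line quoted above; the function names are my own, and rounding the prediction to cents before taking the ratio is an assumption made to match the figures in the text.

```python
SLOPE = 0.557        # slope of the fitted line quoted above
INTERCEPT = -7.5226  # intercept of the fitted line quoted above


def predict_others_sbd(own_sbd):
    """Predicted SBD rewarded to others for a given SBD on own comments."""
    return SLOPE * own_sbd + INTERCEPT


def self_vote_ratio(own_sbd, others_sbd):
    """Share of comment vote rewards that went to the user's own comments."""
    return own_sbd / (own_sbd + others_sbd)


def lower_limit():
    """Value of x where the line crosses y = 0 (no rewards for others)."""
    return -INTERCEPT / SLOPE


own = 50.0
others = round(predict_others_sbd(own), 2)  # rounded to cents (assumption)
ratio_pct = round(100 * self_vote_ratio(own, others), 2)

print(others)                  # 20.33
print(ratio_pct)               # 71.09
print(round(lower_limit(), 3)) # 13.506
```

Below the lower limit the line predicts a negative reward for others, so the formula should only be read for x above roughly $13.51.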
What do we get from this? Well, it should be no surprise that for users who give themselves at least half of their comment vote rewards, a predictor fitted to this group alone gives us an equation predicting smaller rewards for others than for themselves. There is some variation, but only on the downward side of the y = x line (rewards for others never exceed rewards for self), which has to be true by definition of our data set, i.e. 50% self vote rewards or above.
This is worth pointing out so that we are not misled into a false conclusion.
What might be more interesting than something we already know is that there is something of a pattern in rewards for self and others among these users. The majority seem to be clustered around the main line, so we can say that (among this user set) generally as self rewards increase so do rewards for others.
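For anyone who wants to try this kind of fit themselves, here is a rough sketch of an ordinary least-squares line fitted to log-transformed values in plain Python. The numbers are hypothetical and NOT taken from the report, the helper name `fit_line` is my own, and this is not necessarily how the charting software I used does it. Note that a logarithm needs positive values, so a user with zero rewards for others would have to be dropped or handled separately.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical SBD values for illustration only, not from the report.
own_sbd = [1.0, 10.0, 100.0, 1000.0]
others_sbd = [0.5, 5.0, 50.0, 500.0]

# Adjust both axes with a base-10 logarithm before fitting.
log_own = [math.log10(v) for v in own_sbd]
log_others = [math.log10(v) for v in others_sbd]

slope, intercept = fit_line(log_own, log_others)
print(slope, intercept)
```

In this made-up example others' rewards are always exactly half of own rewards, so the fitted slope on the log scale comes out at 1 and the intercept at log10(0.5).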
Another graph on this: ratio plot of SBD rewarded from voting on others' comments compared with SBD rewarded from voting on own comment
This shows us the distribution of the 216 users that vote more than or equal to 50% for themselves (in terms of SBD rewards).
We can see that close to 100% self voting is relatively rare (there is only one user at 100%, and the next is at 98.34%) and that it falls away quickly among the population.
The median self vote percentage here is 63.85%, so half of these users have self vote rewards between 50% and 63.85%
This is consistent with our scatter plot above.
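As a small illustration of how a median like this is computed, here is a sketch using Python's standard `statistics.median`. The reward pairs are hypothetical and NOT the 216 users from the report; the 50% cut-off mirrors the definition of the data set.

```python
from statistics import median

# Hypothetical (own SBD, others SBD) pairs for illustration only.
rewards = [
    (50.0, 50.0),
    (60.0, 40.0),
    (75.0, 25.0),
    (90.0, 10.0),
    (30.0, 0.0),
    (20.0, 80.0),  # below 50% self vote, so excluded from the list
]

# Self vote ratio as a percentage of total comment vote rewards.
ratios = [100 * own / (own + others) for own, others in rewards]

# Keep only users at or above the 50% threshold, as in the data set.
listed = sorted(r for r in ratios if r >= 50)

mid = median(listed)
print(listed)  # [50.0, 60.0, 75.0, 90.0, 100.0]
print(mid)     # 75.0
```

Half of the listed users fall at or below the median, which is how the 50% to 63.85% statement above should be read.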
Is there a consistent relationship between number of comments and self vote reward ratio?
This is a textbook example of no trend whatsoever; the data points are very well distributed. They bunch a bit more toward the 50% side of self vote reward ratio, but we expect that because we know most users in the group skew towards 50% rather than 100%.
What we don't see is any relationship between the number of comments made and self vote reward ratio. This suggests that we cannot say, for example, that if a user posts a lot of comments they would probably self vote a lot, or that if a user self votes a lot they must spam a lot of comments.
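One way to put a number on "no trend" is the Pearson correlation coefficient, which is close to 0 when there is no linear relationship and close to 1 or -1 when there is a strong one. The sketch below is my own illustration with hypothetical numbers, not a calculation on the report's data, and `pearson_r` is a hand-written helper rather than anything from the tools I used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for illustration only, not from the report.
comment_counts = [10, 20, 30, 40]
self_vote_pct = [80.0, 60.0, 60.0, 80.0]   # no linear trend with comments
total_sbd = [5.0, 10.0, 15.0, 20.0]        # rises steadily with comments

r_ratio = pearson_r(comment_counts, self_vote_pct)
r_total = pearson_r(comment_counts, total_sbd)
print(r_ratio, r_total)
```

Here the first pair gives a coefficient of about 0 and the second about 1. A coefficient near 0 only rules out a linear relationship, so it is still worth looking at the scatter plot itself.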
Is there a consistent relationship between number of comments and SBD self vote reward?
At first glance there seems to be a vague relationship between number of comments and SBD from self vote, right? We might be tempted to conclude that those who get more SBD from self voting are leaving more comments and so are spammers!
But something comes to mind - wouldn't more comments usually tend towards more rewards? 🤔
So let's look at the number of comments compared with the overall SBD rewards (including both self rewards and those for others).
EDIT: The legend name was wrong. I have updated. The data is still the same.
It's almost the exact same graph!
So this is a kind of a red herring, and not useful to us. It just shows us what is intuitively true - those who comment more generally will get more rewards, both for themselves and for others. In other words, this is not interesting at all.
Summary of all aspects
This chart is not really that useful but kind of asks the same question as above. I just wanted to do it mainly for the visual effect.
This is the same as the first graph but with circle size added in proportion to the number of comments. We can see that as SBD increases the number of comments generally increases too (as we saw above), but also that it is not strictly so. The very top earners are not making as many comments as some a little lower than them.
We have seen some relevant stats based on the data @calamus056 published recently.
Though the data is limited, and @calamus056 himself cautions on the accuracy, the overall impression is that the number of users giving more than 50% of their comment vote rewards to themselves since HF 19 is comparatively small (only 216 users are on the list), and those who do self vote their comments more than half generally still vote a lot on others. This was claimed anecdotally by some, and the data seems to reflect it.
I would like to see this same data but from the same number of days before HF 19. Then we could compare to see if there has been a significant rise. We should also get this data from the same number of days after this snapshot so we can see if the situation is stable or getting "worse" (more self voting).
Currently all I can say for sure is that it is less than I thought it was in the general population. This does not mean that I no longer disagree with self voting, or that I don't think it's open to abuse, or that the very top self voters are not abusing the system.
Caution on conclusion
This is based on what has already happened and may not be a predictor of things to come. I think it is still very important to safeguard the platform against abuse. Even if abuse is not happening at quite the level we thought it might have been, we should still look at the example of the bad apples and wonder if this is an acceptable level of abuse (given of course that you agree that "too much" self voting is abuse - you may not).
Effect on #project-smackdown
I have not had a lot of time to reflect on these findings, but I will. My initial reaction, paired with recent conversations in the #steem-coop, is that a more targeted approach to dealing with comment abuse is required. I am moving to the position that top self voters by ratio are really the ones abusing the self vote loophole at the moment, instead of those with just high rewards.
^^^ Please let me know what you think of this.
I defend (and have defended) the publishing of information like this. At a basic level it is free speech, but in any case there is a need to understand issues and statistics can help with that, as long as the data is solid and the questions are appropriate for that.