OK, I see some of you are impatient to move forward, so let's proceed to the next phase.
I have now written a summary of the findings of the first phase. It incorporates most of the ideas presented in this topic so far (those related to the first phase). You may want to have a brief look at that list now. We may still drop some of those metrics later on if they're deemed impractical or ineffective for some reason, but until then, let's keep them on our list.
The next step is to figure out how to convert those listed metrics into numbers.
Some hints and guidelines for this phase:
- Please note that we're still discussing individual metrics at this stage -- the formula for combining all of them is scheduled for the next phase
- For simplicity's sake, we should use the same scale for all the individual metrics. I propose a scale of 0-100.
- You don't need to consider how important this particular metric is. We'll assign the weight (=importance) later on.
- If a hit fulfills the ideal criteria for some particular metric, it should be assigned a score of 100 for that metric
- Please try to use the full scale from 0 to 100 (or 1 to 100 if it's easier for you that way)
- Negative metrics ("this is a meeting hit") can also be scaled from 0 (or 1) to 100 (we'll take care of the sign in the next phase)
- "Simple mathematics" is preferred, but if you can justify using "higher mathematics", feel free. Some metrics can be scored linearly, but for some metrics it might be useful to use logarithmic scale, for example.
- Methods that rely only on the information contained in that single hit are preferred. Yes, we could do a full statistical analysis of how frequent such hits are, but if we can come up with a close enough approximation through other means, it'll be a bit easier to implement.
- The brainstorming topic has lots of material related to this phase, feel free to use the ideas presented there
- When thinking about the scores, please also pay attention to how the proposal would work with triples and up. Generally, we can assign scores for each leg of the hit separately and combine them in the next phase.
- If you can improve someone else's proposal, go ahead and provide your counterproposal. There are probably quite a few methods for calculating the numbers. Only saying that "this is bad" is not productive, try to come up with your own proposal instead.
- For the hit frequencies, I'd suggest that we use the hit grouping method. I think it solves more problems than it creates. The capping methods described in the brainstorming topic would also work, but grouping would additionally solve a few somewhat unrelated user interface issues. In summary, this means that if someone finds a bundle of notes from some other user, that bundle counts as a single occurrence for the frequency calculations. The individual notes will still all be hits, each with the same score.
- Relatedly, I'd also like to put more weight on long-term hit frequencies than on short-term ones. If a tracker moves into another tracker's territory, there's a chance that they'll initially get some higher-rated hits. The hit grouping makes this less of a concern. However, if the users continue having daily hits, the interestingness of those subsequent hits will slowly decrease.
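To make the grouping idea concrete, here's a minimal Python sketch. It assumes hits are available as (other user, timestamp) pairs and that a "bundle" means hits from the same user arriving within some time window; both the data shape and the 24-hour window are my assumptions for illustration, not an EBT implementation detail:

```python
from datetime import datetime, timedelta

def group_hits(hits, window=timedelta(hours=24)):
    """Collapse bundles of hits from the same user into single occurrences.

    `hits` is a list of (other_user, timestamp) pairs, sorted by timestamp.
    A hit falling within `window` of the previous hit from the same user
    is treated as part of the same bundle, i.e. the same occurrence.
    """
    last_seen = {}    # other_user -> timestamp of the latest hit seen
    occurrences = []
    for user, ts in hits:
        prev = last_seen.get(user)
        if prev is None or ts - prev > window:
            occurrences.append((user, ts))  # start of a new occurrence
        last_seen[user] = ts
    return occurrences

hits = [
    ("alice", datetime(2024, 1, 1, 10, 0)),
    ("alice", datetime(2024, 1, 1, 10, 5)),  # same bundle as the previous hit
    ("alice", datetime(2024, 1, 3, 9, 0)),   # a separate occurrence
    ("bob",   datetime(2024, 1, 1, 12, 0)),
]
print(len(group_hits(hits)))  # 4 hits, but only 3 occurrences
```

Each note in a bundle still keeps its own score; only the frequency counter treats the bundle as one event.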
Here's an example (this can be improved if you can come up with something better):
Hit with an unusual denomination
Scoring: 0 for the most common hit denomination (fivers), 100 for the least common (200e), with the others scaled linearly between those two.
Here's a summary of all hits and their denominations on EBT, including moderated hits:
Code: Select all
mysql> select denomination, count(distinct serial) from hits group by 1;
+--------------+------------------------+
| denomination | count(distinct serial) |
+--------------+------------------------+
| 5 | 559156 |
| 10 | 159650 |
| 20 | 141445 |
| 50 | 38525 |
| 100 | 4219 |
| 200 | 846 |
| 500 | 1040 |
+--------------+------------------------+
(total: 904881)
Applying some basic linear scoring, we'd end up with these scores:
Code: Select all
mysql> select denomination, count(distinct serial), 100-(( count(distinct serial) - 846) / (559156-846))*100 from hits group by 1;
+--------------+------------------------+----------------------------------------------------------+
| denomination | count(distinct serial) | 100-(( count(distinct serial) - 846) / (559156-846))*100 |
+--------------+------------------------+----------------------------------------------------------+
| 5 | 559156 | 0.0000 |
| 10 | 159650 | 71.5563 |
| 20 | 141445 | 74.8170 |
| 50 | 38525 | 93.2512 |
| 100 | 4219 | 99.3959 |
| 200 | 846 | 100.0000 |
| 500 | 1040 | 99.9653 |
+--------------+------------------------+----------------------------------------------------------+
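For reference, the same linear formula can be sketched outside SQL. This Python snippet reproduces the scores above; the counts are copied from the summary table, not recomputed from the database:

```python
# Hit counts per denomination, copied from the EBT summary above
counts = {5: 559156, 10: 159650, 20: 141445, 50: 38525,
          100: 4219, 200: 846, 500: 1040}

lo, hi = min(counts.values()), max(counts.values())  # 846 (200e), 559156 (5e)

def linear_score(n):
    """Linear scaling: 100 for the rarest count, 0 for the most common."""
    return 100 - (n - lo) / (hi - lo) * 100

for denom in sorted(counts):
    print(denom, round(linear_score(counts[denom]), 4))  # e.g. 5 -> 0.0, 200 -> 100.0
```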
An alternative would be to use a logarithmic scale:
Code: Select all
mysql> select denomination, count(distinct serial), 100-( log(count(distinct serial) - 846) / log(559156-846))*100 from hits where denomination not in ('5','200') group by 1;
+--------------+------------------------+----------------------------------------------------------------+
| denomination | count(distinct serial) | 100-( log(count(distinct serial) - 846) / log(559156-846))*100 |
+--------------+------------------------+----------------------------------------------------------------+
| 10 | 159650 | 9.501058064713135 |
| 20 | 141445 | 10.421196432291310 |
| 50 | 38525 | 20.372392921419873 |
| 100 | 4219 | 38.609834187384350 |
| 500 | 1040 | 60.190511050902900 |
+--------------+------------------------+----------------------------------------------------------------+
In this scenario, 5e notes would again get 0 points and 200e notes 100 points (hence their exclusion from the query above).
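The logarithmic variant can be sketched the same way. Here 200e is handled as an explicit special case, since subtracting the minimum count makes the logarithm's argument zero for that row:

```python
import math

# Hit counts per denomination, copied from the EBT summary above
counts = {5: 559156, 10: 159650, 20: 141445, 50: 38525,
          100: 4219, 200: 846, 500: 1040}
lo, hi = 846, 559156  # least common (200e) and most common (5e) counts

def log_score(n):
    """Logarithmic scaling: compresses the differences at the rare end."""
    if n == lo:
        return 100.0  # log(0) is undefined; the rarest denomination scores 100 by definition
    return 100 - math.log(n - lo) / math.log(hi - lo) * 100

for denom in sorted(counts):
    print(denom, round(log_score(counts[denom]), 2))
```

Note that the most common denomination still lands exactly on 0, so no special case is needed at that end.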
We'd also need to consider how common the denominations are in each country. For example, the least common denomination for hits in Austria is 500e instead of 200e, and the hit ratio of other denominations could be different as well.
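A per-country variant could simply apply the same formula against country-specific counts. A sketch, where the counts below are made up purely for illustration (not real EBT statistics):

```python
# Hypothetical per-country hit counts -- illustration only, not real data
counts_by_country = {
    "AT": {5: 1000, 10: 400, 20: 350, 50: 90, 100: 20, 200: 10, 500: 5},
}

def country_linear_score(country, denom):
    """Score a denomination relative to its own country's hit counts."""
    counts = counts_by_country[country]
    lo, hi = min(counts.values()), max(counts.values())
    return 100 - (counts[denom] - lo) / (hi - lo) * 100

print(country_linear_score("AT", 500))  # rarest in this made-up data -> 100.0
print(country_linear_score("AT", 5))    # most common -> 0.0
```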
If you can come up with better proposals, go ahead. We need to come up with similar methods for all the metrics, and I could use your math skills for that. I can provide extra statistical information as needed. The field is yours. Discuss!
As usual, if you have something to say that isn't related to the current phase of assigning scores for individual metrics, please use the brainstorming topic. Thanks.
(Note: I'll likely be a bit absent for the next few days due to other commitments, so don't expect immediate and/or long replies to your proposals from me during that time.)