Excel formula to distribute proportionally (not provided) attribution
We all know by now (not provided) is a pain in the ass. Not only because part of the picture has gone and you have to work harder to find the missed pieces of the puzzle, also because while putting SEO results in charts and figures you have to deal with that or your reports and dashboards are going to be harder to understand for your clients.
Traditionally we segment organic in two types, Branded and non-Branded traffic for obvious reasons but we have now a new type around messing it up a bit, (not provided) organic traffic. Thanks Google.
The next chart is an example of that mess. This is revenue from organic traffic segmented by the three types since the nightmare started.
In many accounts I've checked, visits from (not provided) go from 9% to 13% although the proportion could be higher in certain cases or segments of your traffic. Two explicit tweets from yesterday on that note.
"Keyword (not provided)" is 35% of my new visits. <shakes fist at Google>— Marc Bitanga (@marcbitanga) March 8, 2012
How to represent (not provided)?
Basically we have two options:
Leave it as it comes and represent it as a third bucket for visits, conversion, revenue or any other metric you might want to include
- Disadvantages: rebuild tons of segments in your analytics tool and having to explain why more frequently that you would like
- Advantage: the nature of the (not provided) remains uncertain but the rest of the numbers for Branded and non-Branded are accurate
Break it down proportionally and integrate those numbers within the Branded and non-Branded ones
- Disadvantages: the higher the proportion of (not provided) compared to total organic traffic, the higher the inaccuracy for all segments knowing in advance that splitting (not provided) proportionally is an inaccurate solution per se
- Advantage: K.I.S.S. principle
I am inclined to use the last solution but a second opinion is always welcome so I tweeted.
@anilopez I've been breaking them down proportionally as best as I can measure.— Matthew Brown (@MatthewJBrown) January 26, 2012
Excel formula to distribute (not provided)
With the confidence inherited from Matthew's and some of my workmates (@jackson_lo and @adrianvender) words, next step was simple, put excel to work and create a formula* solving the issue. Here you have it:
Once the formula is applied to initial numbers of the chart above this the resulting new one. Total organic traffic is the same, as expected, but we get rid of anoying (not provided) integrating it within the two typical organic traffic types.
I guess we can live with the disadvantages of this method right? At least me.
*This was my first attempt to write something more complicated than 'a+b=c' in the shape of a formula since I left high school so, math and physics nerds in the house tonight, don't laugh so hard at me, ok?
Update. My apreciation to the comments from john Andrews you see below (#3 & #4) triggered a conversation in Twitter (worst place to discuss whatever but anyway) interesting enough, in my opinion, to reporduce it here.
I really appreciate your comment @johnandrews "trust value based on the relative size of the np set" is a great idea.
It's nothing more than error or variance... standard for any model/stat. NEVER accept anything without variance numbers.
There def is standard in stats/data. Any measure has a variance. Any combination of measures has a variance.
"robustness" is a measure of ability to handle real world data, as opposed to test data or ideal data. Sensitivity is also stnd.
Can always review variance, sensitivity, and robustness with any proposed derivation/algo. Great way to "see" utility.
That's why import to state assumptions... that limits the application to real data; allows us to evaluate utility.
From there you can plot diagram of when it's useful, when it's not, and when it might be useful. Great stuff.
I had been adjusting for that based on what the split between branded and non-branded looked like prior to NP.
What sucks is new sites that come in with a high NP right off the bat. Lots of guessing and modeling at that point.
Prob need to model the NP user. She's logged into Google. Assume no reason to be diff w/r to B/NB, or some reason.
Key is the assumptions made... that NP mirrors rest, or not, and why.
Otherwise drop the NP from analysis, and then normalize results to recover lost volume. Same outcome as Ani method.
Yeah, run the numbers based on a few different splits and compare the difference.