Peltier Tech Blog

Creative Commons License
Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.

Web Browser Stats: Problems With Data Gaps

 
by Jon Peltier
Monday, April 5th, 2010
Peltier Technical Services, Inc., Copyright © 2012.

Last week in Quick Analysis of Web Browser Stats I provided a line chart (“timeline”) to augment Ed Bott’s discussion of Google Chrome’s inroads in the browser market (from Chrome takes a bite out of IE and Firefox). Ed occasionally revisits the relative market share of the different browsers, but it’s not a regular monthly feature. In his article, he compared three time points: September 2008, September 2009, and March 2010. I suffer from a unique symptom of OCD called Data Loss Aversion, so I noticed right away that there were eleven missing months between the first two time points, and another five in the gap between the last two.

I don’t have access to Ed’s web site statistics, but I can use my own data to illustrate the issues with gaps in a timeline. This chart shows browser share on my site for the same months that Ed reported. Ed and I have different demographics among our followers, so IE makes up a larger share of my visitors. But I’ve seen the same steady rise in users of Chrome.

Web Browser Usage at Three Arbitrary Times

Because Safari and Opera represent such a small and relatively constant percentage of browser usage, I’ll leave them out of the rest of this analysis.

Interpolated Data

As soon as I noticed the gaps in the data, I wondered what the browser shares looked like during the intervening months. The first inclination is to assume that the data follows the lines between the known points. This is not a good assumption, since the lines themselves provide no information about how the data behaves between known points.

Web Browser Usage at Three Arbitrary Months Interpolated to All Months

Note that in a line chart, we use distinct markers to indicate where we have actual measurements, and we use straight segments to join the points so the reader can trace the measured values for each series. Using curved or smoothed lines between actual points would incorrectly imply that the intervening data followed particular trajectories.
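Assuming that the data follows the connecting lines is the same as linearly interpolating between the known months. Here is a minimal sketch of that calculation in Python. The share values are made-up placeholders, not my site's statistics, and the function name `interpolate` is only for illustration.

```python
# Known points as (month index, share in percent).
# Month 0 = September 2008, month 12 = September 2009, month 18 = March 2010.
# The share values are hypothetical placeholders, not real statistics.
known = [(0, 1.0), (12, 5.0), (18, 8.0)]


def interpolate(known_points, month):
    """Linearly interpolate a value at `month` from sorted (x, y) pairs."""
    for (x0, y0), (x1, y1) in zip(known_points, known_points[1:]):
        if x0 <= month <= x1:
            fraction = (month - x0) / (x1 - x0)
            return y0 + fraction * (y1 - y0)
    raise ValueError("month lies outside the known range")


# One value per month from September 2008 through March 2010 (19 months).
filled = [interpolate(known, m) for m in range(19)]
```

The sixteen filled-in values are computed from the three known points alone. They contain no new information about what happened in those months.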

Actual Data

The actual data is shown here for all months between September 2008 and March 2010.

Web Browser Usage at All Months

Here I’ve plotted the three widely spaced points as solid markers and the intervening months as unfilled markers. The assumption that the data follows the drawn lines isn’t accurate, though it’s not too far off.

Web Browser Usage at Three Arbitrary Months and All Months

If we didn’t have the intervening data points, it would be a matter of faith to believe that the actual behavior followed the connecting lines.

What If? (Random Data)

What if the data were random, and our three “known” points were samples from normally distributed data sets? To illustrate this, I used the same means as my real data, with slightly elevated standard deviations, and generated normally distributed random points based on these statistics.

The randomly generated points no longer appear to be following the connecting lines.

Web Browser Usage at Three Arbitrary Times as Part of Random Sample
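The random series can be produced with any normal random number generator. The sketch below uses Python's standard library rather than whatever tool produced the chart above. The mean and standard deviation are hypothetical placeholders, and `simulate_series` is an illustrative name.

```python
import random


def simulate_series(mean, sd, n_months=19, seed=None):
    """Draw one normally distributed value per month."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [rng.gauss(mean, sd) for _ in range(n_months)]


# Hypothetical mean and standard deviation, in percent share.
series = simulate_series(mean=5.0, sd=1.5, seed=1)

# The three "known" months: September 2008, September 2009, March 2010.
sampled = [series[i] for i in (0, 12, 18)]
```

Every value in `series` is drawn independently of the others, so any apparent trend among the three sampled months is an accident of the draw.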

In fact we can construct run charts for our three main browsers, as I’ve done non-rigorously below. The three dotted lines around each set of points denote the mean, the mean plus three standard deviations, and the mean minus three standard deviations. (See Introducing Control Charts (Run Charts) and Statistical Process Control.)

Web Browser Usage as Set of Run Charts

All three data sets appear to represent processes governed by normal random fluctuation, with no systematic variation in any of them. No points fall outside the ±3 SD control limits, and in no case do two out of three consecutive points fall beyond ±2 SD (not drawn, but trust me). In no case do we find more than six points in a row all above or all below the mean (the usual rule flags nine such points as out of control). We never see more than half a dozen points in an alternating up-and-down pattern (the usual rule is 15 points), or more than half a dozen points all increasing or all decreasing (the usual rule is 7 points).
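Some of these checks are easy to automate. The following Python sketch covers three of them: points beyond the ±3 SD limits, runs on one side of the mean, and runs of steadily increasing or decreasing points. It uses the same quick mean ± k·SD limits as the run charts here, not proper Shewhart limits, and the function names are only illustrative.

```python
from statistics import mean, stdev


def control_limits(values, k=3):
    """Return (mean - k*sd, mean, mean + k*sd) using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return m - k * s, m, m + k * s


def points_outside(values, k=3):
    """Indexes of points beyond the mean +/- k*sd limits."""
    low, _, high = control_limits(values, k)
    return [i for i, v in enumerate(values) if v < low or v > high]


def longest_run_one_side(values):
    """Longest streak of consecutive points all above or all below the mean."""
    m = mean(values)
    best = current = 0
    previous_side = 0
    for v in values:
        side = (v > m) - (v < m)  # +1 above, -1 below, 0 exactly on the mean
        if side != 0 and side == previous_side:
            current += 1
        else:
            current = 1 if side != 0 else 0
        best = max(best, current)
        previous_side = side
    return best


def longest_monotone_run(values):
    """Longest streak of consecutive points all increasing or all decreasing."""
    best = current = 1 if values else 0
    previous_direction = 0
    for a, b in zip(values, values[1:]):
        direction = (b > a) - (b < a)  # +1 rising, -1 falling, 0 flat
        if direction != 0 and direction == previous_direction:
            current += 1
        else:
            current = 2 if direction != 0 else 1
        best = max(best, current)
        previous_direction = direction
    return best
```

For a series that rises every month, `points_outside` can return an empty list while `longest_monotone_run` returns the full length of the series, so the trend rule fires even though no single point breaches the limits.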

If we plot the actual data as run charts, we can conclude that there is a systematic variation in the Chrome data. Too many consecutive values for Chrome are increasing, for example, which indicates nonrandom variation.

Web Browser Usage as Set of Run Charts

While the full data set does confirm that the trends seen in the three original time points are real, without studying the intervening points, we cannot tell the difference between systematic trends and randomly sampled points.


Comments


Comment from Jeff Weir
Time: Monday, April 5, 2010, 1:19 am

I switched to using Excel as my browser in June ’09…how come I don’t show up above? Perhaps a #DIV/0! error is masking my spike?


Comment from Jon Peltier
Time: Monday, April 5, 2010, 6:57 am

Jeff -

Excel doesn’t appear in the list, but there are dozens of items that only have fifty or fewer visits per month. If Excel internally uses one of Microsoft’s libraries to access the web pages, perhaps your visits are lumped in with IE. Assuming you’re serious and not just a little late for April Fools’ Day.


Comment from DaleW
Time: Monday, April 5, 2010, 1:18 pm

Jon,

For your browser data, linear regression is another way to show that Chrome is significantly gaining market share, while IE is significantly losing market share, based on a t-test of the regression slope.

For run charts, what does it mean to say — and what would cause — Firefox to be too closely grouped around its mean value? Typically, that sort of SPC violation is detected relative to a previously larger variation, by the rule of 15 or more straight points falling within 1 standard deviation of the mean, but here you have only 19 points to estimate a standard deviation, and your limits are perhaps too wide? Staring at your magnified run chart in Firefox, it appears to me that Firefox also had bigger ups and downs than we’d expect from purely random variation, but that would take us back to the difference between proper Shewhart SPC charts and quick & dirty ±3 sd run charts.


Comment from Jon Peltier
Time: Monday, April 5, 2010, 3:02 pm

Dale -

I considered LR, but decided that this post was about SPC. T-tests rule out a non-zero slope for Firefox, but indicate that the Chrome and IE slopes are in fact non-zero.

I forgot when I looked at the Firefox run chart that I had used “slightly elevated standard deviations” to derive the “random” data. I used these values to compute the control lines, but plotted the original data, which had smaller standard deviations. I’ve redrawn the chart with the proper control limits.


Comment from Calvin Graham
Time: Tuesday, April 6, 2010, 3:46 am

I suppose the decider would be to see the browser market shares plotted against each other to spot the real trend. Interesting stuff though.

I think it would also be interesting to see the comparative data from a non-techie site. I normally use Safari at home (where I have a Mac) but I look over this site at work where we all have IE. I suspect the browser usage argument on the net for different types of site is skewed by this factor quite a bit.


Comment from Jon Peltier
Time: Tuesday, April 6, 2010, 7:19 am

Calvin -

Good idea. Here’s a scatter plot matrix of the three main browsers.

Scatter Plot Matrix Comparing Main Browsers

Here I’ve added a line of best fit to each series along with R² values.

Scatter Plot Matrix Comparing Main Browsers

Internet Explorer and Chrome actually show a pretty good negative correlation. IE and Firefox also have a negative correlation, but it’s weak. If we ignored a month or two on either end of the data the correlation would be stronger (R² = -0.6703).


Comment from DaleW
Time: Tuesday, April 6, 2010, 7:47 am

Jon,

I can’t wait for you to build your matrix plot utility for Excel!

Since you intended this post to be about SPC, this would be a great opportunity to teach how much more powerful a true Shewhart Individuals control chart can be when looking for a trend, compared to the quick & dirty ±3 stdev() run chart that you’ve drawn here.

I’ll wager your Chrome data will bust outside its ±3 sigma limits on both ends.


Comment from Jon Peltier
Time: Wednesday, April 7, 2010, 8:01 am

Dale -

Ask and ye shall receive: SPC Approach to Browser Stats.
