Over on Beyond the Box Score, Justin Bopp has treated us to a graphical analysis of attendance at major league baseball stadiums. Beyond the Box Score is a blog about baseball that follows the SABR (Society for American Baseball Research) tradition, which involves intensive statistical analysis. About the time physicists refine their Theory of Everything, the Sabermetricians will have derived a Statistic of Everything that will replace batting average, RBI, ERA, and fielding percentage. I like to keep an eye on baseball, so I’m glad I’ve found Justin’s informative and entertaining blog.
In this analysis, Justin has related attendance to the home team’s won-lost record. The season isn’t over yet, so the data covers games through the past weekend. Justin covers the National and American Leagues in two separate posts as part of his Graph of the Day series:
Justin’s analyses use some awkward charts and he introduces an awkward statistic.
Beyond the Box Score Charts
Here is Justin’s analysis of the American League’s Eastern Division, where my Sox play:
There are a total of six charts, since each league has three divisions. I’ve reduced them here:
The charts are dark, and the colors and gradients bold, so the charts weigh heavily on the eyes. There are only twelve data points per chart, six on each axis, and part of the analysis involves comparing the relative heights of the bars on primary and secondary axes. I’ve discussed the problems of comparing series on primary and secondary axes in Secondary Axes in Charts. Any conclusions you may reach are affected by the relatively arbitrary scales on the primary and secondary axes.
It’s not easy to make comparisons among these charts, and you miss the opportunity to see how attendance and number of wins are correlated. In fact, you don’t even see attendance in these charts: Justin has introduced a statistic called Attendance Per Win, which he is comparing to the team’s wins.
Chart Busters Charts
Rather than plot the two variables of interest as columns on primary and secondary axes, Chart Busters have made an XY chart (or as statisticians and Microsoft call it, a scatter chart). All three divisions of both leagues fit onto the chart, which shows wins on the X axis and attendance along the Y axis. The teams are identified by the data labels, and there is only one case of overlap, the red squares for the Minnesota Twins and the Chicago White Sox, in the middle of the chart.
There is an obvious positive correlation between wins and attendance, which makes sense, because fans love to cheer for a winner.
Justin’s use of the derived Attendance Per Win statistic initially makes some sense. But when Chart Busters replace the Y axis of the XY chart with Attendance Per Win, we lose the clear correlation. The National League looks purely random, while the American League has at best a slight positive trend.
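For readers who want to try this comparison on their own numbers, here is a minimal sketch in Python. It assumes Attendance Per Win is simply a team's attendance divided by its wins, which is my reading of the statistic rather than a definition taken from Justin's posts. The function names are made up for this example, and the numbers are placeholders, not the actual season data.

```python
from math import sqrt


def attendance_per_win(attendance, wins):
    """Divide each team's attendance by its win total (assumed definition)."""
    return [a / w for a, w in zip(attendance, wins)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Placeholder numbers for illustration only, NOT the real season data.
wins = [55, 62, 68, 74, 81]
attendance = [1_400_000, 1_900_000, 1_800_000, 2_300_000, 2_600_000]

per_win = attendance_per_win(attendance, wins)
print("wins vs attendance:         r =", round(pearson_r(wins, attendance), 2))
print("wins vs attendance per win: r =", round(pearson_r(wins, per_win), 2))
```

Note that wins appears on the X axis and also in the denominator of the Y value, so the two axes no longer measure independent things. That may be one reason the pattern blurs when the derived statistic is plotted.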
Justin’s intent was to show which teams gain more attendance than expected for the number of wins. Without inventing a new statistic, Chart Busters show this by simply drawing the lines of best fit. There are separate lines for each league, since there’s some difference in average attendance between leagues.
Linear regression supports our eyeballs: there is a positive correlation for each league. R² for the American League is 74%, and for the National League it is 39%. That’s not a bad correlation, considering that the fit ignores such factors as weather, stadium size, and scheduling variations. The points Justin makes are visible in this chart: fans in Florida, Oakland, Texas, and Tampa do not support their teams, as shown by the distance of these points below the fitted lines. In fact, Florida has abysmal fan support.
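The line of best fit, R², and each team's distance from the line can be reproduced outside a charting program. The sketch below uses ordinary least squares in plain Python. It is not the worksheet behind the charts above: the team labels and figures are hypothetical placeholders, and the function names are my own.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit with one predictor, R squared equals the
    # squared Pearson correlation between x and y.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def residuals(xs, ys, slope, intercept):
    """Vertical distance of each point from the fitted line (actual - fitted)."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical placeholder values, NOT the actual season figures.
teams = ["Team A", "Team B", "Team C", "Team D", "Team E"]
wins = [50, 60, 70, 80, 90]
attendance = [1.6, 2.4, 2.1, 2.9, 3.0]  # millions of fans

slope, intercept, r2 = fit_line(wins, attendance)
gaps = residuals(wins, attendance, slope, intercept)

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, R squared = {r2:.2f}")
# Teams sorted from furthest below the line to furthest above it.
for team, gap in sorted(zip(teams, gaps), key=lambda pair: pair[1]):
    print(f"{team}: {gap:+.2f} million relative to the fitted line")
```

A negative residual corresponds to a point below the fitted line, meaning fewer fans than the win total would predict. A positive residual means stronger than expected support. Fitting each league separately, as in the chart, just means calling fit_line once per league.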
The Mets, in contrast, have the strongest fan support. This in conjunction with their sub-mediocre win total shows that my previous statement, that fans love to see a winner, is not strictly true. The Brewers, Cubs, Phillies, and Dodgers also have stronger than expected fan support, but then, the Dodgers and Phillies are leading their respective divisions.
Let’s see how Justin’s Attendance Per Win stands up to best fits.
The American League shows a positive correlation, though the line is not very steep, and the National League has only a very slight positive slope. The R² values are 23% for the AL and 2% for the NL. The same teams show strong and weak fan support here as in the previous chart, but that chart made the point without clouding the analysis with an unnecessary derived statistic.