Any student of statistics, and in fact almost any middle school student, has constructed a box plot. A simple box plot (or box-and-whisker chart), like the one above, needs five parameters for each category: the minimum and maximum values for that category, the median (middle value), and the first and third quartiles. The median can also be called the second quartile. The box plot above is not a meaningless sample of the chart type: it shows the variation in quartiles determined by several different methods. These will be described in excruciating detail in this tutorial.
A useful, if vague, definition of quartile is “one of three values that approximately divide a sorted data set into four parts of equal size”. This division is easy and exact if the number of values in the set is evenly divisible by four. But in the majority of cases, it is less certain.
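To make the "four parts of equal size" idea concrete, here is a minimal Python sketch. The function name split_into_quarters and the sample values are my own illustrative assumptions, not part of any library. It only shows when an even four-way split is possible; it does not pick quartile values.

```python
def split_into_quarters(values):
    """Split sorted values into four equal-sized groups, or return None
    when the count is not evenly divisible by four (hypothetical helper)."""
    ordered = sorted(values)
    size, remainder = divmod(len(ordered), 4)
    if remainder != 0:
        return None  # no exact four-way split exists
    return [ordered[i * size:(i + 1) * size] for i in range(4)]


print(split_into_quarters(range(1, 9)))   # eight values: splits evenly
print(split_into_quarters(range(1, 10)))  # nine values: no even split
```

With eight values the groups come out as pairs; with nine values there is no exact split, which is where the competing quartile definitions come in.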
Many techniques have been put forth for determining quartiles, and mostly they resolve into the handful of methods shown above, which are used by software packages. The techniques give similar, though not exactly the same, results. In this document I will describe these definitions of quartiles in hopes of shedding some light on this topic, which is more widely used than understood.
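As a rough illustration of how two conventions can disagree on the same data, here is a short Python sketch using the standard library's statistics.quantiles function (Python 3.8 or later), which offers an "exclusive" and an "inclusive" method. The nine-value data set is an assumption chosen for illustration, and these two methods are only a subset of the techniques discussed in this tutorial.

```python
import statistics

# Assumed sample data: the integers 1 through 9.
data = list(range(1, 10))

# n=4 asks for the three cut points that divide the data into quarters.
exclusive = statistics.quantiles(data, n=4, method="exclusive")
inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("exclusive:", exclusive)  # [2.5, 5.0, 7.5]
print("inclusive:", inclusive)  # [3.0, 5.0, 7.0]
```

Both methods agree on the median (5.0) but place the first and third quartiles differently, which is the kind of small discrepancy described above.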
I am not a statistician, but I’ve had to understand quartiles for my Peltier Tech Charts for Excel. Many of my users wonder about how quartiles are calculated, so I’ve decided to document my understanding. If you have further questions, or if you find any mistakes, please let me know in the comments.
Median
The median is the central value in a sorted data set. If the values are listed from left to right in order of increasing value, there are as many values to the left of the median as to the right.
Determining the median is easy. If there is an odd number of values, the median is the value in the middle. For example, in this set of nine values, the median is the fifth value (in this case, 5), with four values below it and four above.
If there is an even number of values, the median generally does not correspond to a single value in the data set. Instead the median is the average of the largest value in the lower half and the smallest value in the upper half. In this set of eight values, the median separates the bottom four from the top four, so we define it as the average of the fourth and fifth values, in this case, 4.5.
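The odd and even cases can be sketched in a few lines of Python. The data sets below (1 through 9 and 1 through 8) are assumptions: they are consistent with the medians quoted above, 5 and 4.5, but the original values are not listed in the text.

```python
def median(values):
    """Return the middle value of the sorted data, or the average of the
    two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average


print(median([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # 5
print(median([1, 2, 3, 4, 5, 6, 7, 8]))     # 4.5
```

Python's built-in statistics.median follows the same rule, so the hand-written function is only there to show the logic.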
For a small number of simple data sets, determining quartiles is just as easy, but usually it’s more involved. Even when it’s easy, the statistical treatments make it seem harder than it is.
Hinge Techniques for Determining Quartiles
This topic is covered in the companion page Hinges.
Interpolation Methods of Determining Quartiles
This topic is covered in the companion page Quartiles.
Comparison of Values from All Hinge and Quartile Methods
This topic is covered in the companion page Comparison.
Quartiles in the Peltier Tech Chart Utility
This topic is covered in the companion page Quartiles in the Peltier Tech Chart Utility.
References
I found innumerable sources for this information about quartiles. Most were either very basic, or not useful at all. The following three are the most useful links I found.
Quartiles in Elementary Statistics
Eric Langford, California State University, Chico
Journal of Statistics Education Volume 14, Number 3 (2006).
This paper had an extensive and highly mathematical discussion of the methods described here, and several others.
Quartiles: How to calculate them?
David Journet, iTSS Wallingford
This short paper provided a summary of the SAS, Minitab, and Excel methods, supporting the information in the first reference.
Calculating Quartiles: Why Computer-Generated Results Don’t Always Agree
Delmar E. Searles, Asbury University
This article was the only place I’d ever seen a number line used to explain the difference between the N-1 and N+1 approaches to percentile definitions. I found this description almost intuitive, and decided to adopt it for all of my descriptions here. We are, after all, visual creatures, and most of us are predominantly visual learners.
DaleW says
Jon,
Even perfectly drawn box plots often leave me wondering whether the raw data is trying to say that we have a significantly skewed population, or that we might still have a symmetric underlying population (even a nice bell curve) somewhat obscured by random sampling variation. That’s typically more of an issue with small samples, and boxplots don’t visually reveal their sample size (unless they show lots of outliers).
Sometimes the 5-number summary of a box plot isn’t as useful as the 2 parameter summary which defines the simplest bell curve for the same data. I wonder if a future generation of your box plot — more than just supporting the wonderful diversity of quartile definitions — might tackle the more fundamental question of whether a given sample distribution is skewed enough that we NEED to use a representation such as a box plot to grasp the distribution, rather than trying a much simpler symmetric or normal population model. This is a surprisingly hard general problem (unknown distribution, unknown median) to tackle in Excel — but not so hard if we just check whether the skew is too large in magnitude to make it likely that the data would come from a normal distribution. (Search for “Measuring Skewness: A Forgotten Statistic” if you ever find time to ponder how a boxplot could justify its own existence — or I can send you a prototype spreadsheet.)