Statistics

Trendlines and Chart Types in Excel

Friday, October 21, 2022 by Jon Peltier Leave a Comment

Peltier Technical Services, Inc., Copyright © 2023, All rights reserved.

tl;dr: Stick to XY Scatter charts if you need trendlines for your data. Line charts may misrepresent the relationships in your data.

Trendlines and Chart Types

A user had problems with my Trendline Calculator for Multiple Series and sent me his workbook. It turns out, he was using the program on a Line chart, and I recalled that Line charts can have problems when calculating trendlines.

Here is some simple data, plotted in a Line chart. A trendline has been calculated, and the formula and R² are shown in the chart. Wow, that’s a very nice straight line fit, with all points exactly on the line.

Trendline on a Line Chart: Bad Statistics Warning

But wait! Look at the X axis: the labels go from 1 to 2, then to 5 and 6, then to 9 and 10, and the spacing between labels is equal! That happens because unless Excel recognizes the X values as dates, it treats them as non-numeric text labels, and spaces them evenly as categories across the chart. Excel also uses the category number, not the inherent numerical value of the category label, when calculating the trendline formula. So Excel uses 1 through 6 as X values, which match the Y values perfectly, resulting in a perfect fit to the formula y = x.

The solution, of course, is to use an XY Scatter chart instead. X and Y values are all treated as numeric, and these numbers are used as is when calculating the trendline formula. The horizontal spacing of the points matches the X values, and we see that while the fit is rather close, the points are not all exactly aligned with the trendline.

Trendline on an XY Scatter Chart: Good Statistics
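To see the difference in numbers, here is a minimal Python sketch of the two fits. The data is an assumption inferred from the description above (X labels 1, 2, 5, 6, 9, 10 and Y values 1 through 6), and `fit_line` is a hypothetical helper that does an ordinary least-squares fit, which is the same kind of fit a linear trendline uses.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy ** 2 / (sxx * syy)


# Assumed data, inferred from the axis labels described in the text.
x_labels = [1, 2, 5, 6, 9, 10]
y_values = [1, 2, 3, 4, 5, 6]

# Line chart: the trendline sees the category position (1..6), not the label.
category_positions = list(range(1, len(x_labels) + 1))
line_chart_fit = fit_line(category_positions, y_values)

# XY Scatter chart: the trendline sees the numeric X values themselves.
scatter_fit = fit_line(x_labels, y_values)

print("Line chart fit (slope, intercept, R^2):", line_chart_fit)
print("XY Scatter fit (slope, intercept, R^2):", scatter_fit)
```

With this assumed data the category-position fit is exactly y = x with R² of 1, while the fit against the real X values has a shallower slope and an R² a little below 1.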

Another solution would be to format the X axis as a Date axis. This forces Excel to treat the numbers numerically, even though they are not dates. Now the horizontal spacing of the points matches their true values, and the trendline matches the one calculated for the XY Scatter chart above.

Trendline on a Line Chart with a Date Axis: Good Statistics, but Why?

But using an XY Scatter chart type is better than using a line chart in most cases, even with a Date axis (unless you are working with dates). The X axis does not begin at zero, for one thing. Other chart formatting is designed for line charts (you can change the formatting, but why bother?).

Why Do People Use Line Charts Instead of XY Scatter Charts?

When inserting a chart, a user encounters a set of icons similar to those below. The Line chart icon shows data points connected by lines, while the XY Scatter icon shows points without connecting lines. Through inexperience or haste, I think a lot of people insert Line charts because they want lines connecting their points.

Portion of Chart Type Selection User Interface

But this is not the difference between Line and XY charts. In both cases, you can format your data with or without markers, with or without connecting lines. The difference is in how the two chart types treat X values. As we saw above, Line charts will treat numerical X values as non-numeric labels, potentially spoiling your whole day.

More About Trendlines and Regression

Posted: Friday, October 21st, 2022 under Statistics.
Tags: Line Charts, Regression, Statistics, XY Charts.
Comments: none

Dynamic Array Histogram

Monday, March 8, 2021 by Jon Peltier 4 Comments


My friend Thom and I had a discussion about tracking weight and using histograms to show the shape of the weight distribution. He told me he disliked the native Excel Histograms, and I agreed. I’ve written a lot of articles about histograms (see the list at the end of this article), and my commercial software, Peltier Tech Charts for Excel, provides a couple types of histogram which are much more flexible than Excel’s.

I was going to show Thom how to build his own histograms, with a frequency table and all, and I thought, you know, all those new functions and features will make this easier. So I’m going to build a histogram using Dynamic Arrays and show you how easy it can be.

The Data

I’m using my recorded weights for 2020, which has a column for date and one for weight. The dates are in A2:A288 and the weights are in B2:B288. Easy peasy.

I like to make a little summary table when I do an analysis like this. Below I’ve included the number of data points (Count), the Mean and Standard Deviation, and the Minimum and Maximum values. The formulas are =COUNT(B2:B288), =AVERAGE(B2:B288), =STDEV(B2:B288), =MIN(B2:B288), and =MAX(B2:B288).

Generate Chart Data

Let’s make a list of weights. I’ll use =UNIQUE(B2:B288) to produce a list of the distinct weights found in column B. But let’s also sort the list, using =SORT(UNIQUE(B2:B288)). The dynamic array formula starts in cell D6, and spills down as far as it needs to, in this case to cell D19. The spill range is indicated by the blue shadowed border of D6:D19. This spilling into appropriate-size ranges makes Dynamic Arrays flexible and powerful.

Given the weights we need counts for, we can use a simple COUNTIF formula. In cell E6 I have =COUNTIF(B2:B288,D6#). The # symbol after D6 in the formula means Excel will use the entire Dynamic Array defined in D6, and spill the results starting in cell E6, however long it may be.

And we see the Dynamic Array result in E6:E19.

We can calculate the points we need for a Normal Curve using the NORM.DIST function. The function in cell F6, which spills into F6:F19, is =NORM.DIST(D6#,E3,F3,FALSE), using the mean and standard deviation calculated in cells E3 and F3.

Actually, I can fix the curve data. The results above are in fractions while the counts are in whole numbers. But multiplying the fractions by the total number of input values will put the curve and the count on the same scale. So I’ll change the formula in F6 to =D3*NORM.DIST(D6#,E3,F3,FALSE).
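The scaling step can be checked outside Excel. The sketch below is only an illustration: the weights are made up, and `scaled_density` is a hypothetical helper. It relies on the fact that each bar is one pound wide, so the count times the normal density approximates the expected number of values in that bar.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical whole-pound weights, standing in for B2:B288.
weights = [171, 172, 172, 173, 173, 173, 175, 175]

count = len(weights)   # like =COUNT(...)
avg = mean(weights)    # like =AVERAGE(...)
sd = stdev(weights)    # like =STDEV(...), the sample standard deviation
curve = NormalDist(avg, sd)


def scaled_density(x):
    """Normal density at x, scaled to expected counts for one-pound-wide bars."""
    return count * curve.pdf(x)


# Columns: weight, raw density (a fraction), density scaled to counts.
for w in sorted(set(weights)):
    print(w, round(curve.pdf(w), 4), round(scaled_density(w), 2))
```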

Build the Chart

Select the range D6:F19 and insert a clustered column chart.

Right click on one of the visible data points, and choose Change Series Chart Type from the pop-up menu. Select the Curve series, and change the chart type to Line in the dropdown.

It’s a good start, but still a bit rugged.

Format the Curve series line to use a Smoothed Line, and change the Gap Width of the columns to 50.

Finally I deleted the legend.

The bars and the normal curve are not perfectly aligned, because there’s a longer tail at the higher weights, but that’s not a problem.

So it’s a pretty good chart. Or is it…??

The First Correction

Did you notice that there was no value 174 between 173 and 175? The problem with using UNIQUE is that it only gives you what values are in the range, not every value you might expect.

I’ll fix this by using SEQUENCE(rows, columns, start, increment) rather than UNIQUE(range). Cell D6 has the formula =SEQUENCE(H3+1-G3,1,G3,1). The number of rows is the max plus one minus the min, H3+1-G3; the number of columns is 1, the starting value is the minimum, G3, and the increment is 1. This Dynamic Array formula requires an extra row (for the previously missing 174) but the Count and Curve Dynamic Arrays keep up easily.

The chart now shows all categories, with a zero-height bar at 174.
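The difference between the two category lists is easy to reproduce. This is a small sketch with made-up weights that skip 174; the variable names are illustrative only.

```python
from collections import Counter

# Hypothetical weights with no 174 recorded.
weights = [172, 173, 173, 175, 175, 175, 176]
counts = Counter(weights)

# Like =SORT(UNIQUE(range)): only values that actually occur.
unique_axis = sorted(counts)

# Like =SEQUENCE(max+1-min,1,min,1): every whole number from min to max.
full_axis = list(range(min(weights), max(weights) + 1))

# A Counter returns 0 for a missing key, like COUNTIF on an absent value.
unique_counts = [counts[w] for w in unique_axis]
full_counts = [counts[w] for w in full_axis]

print(list(zip(unique_axis, unique_counts)))
print(list(zip(full_axis, full_counts)))
```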

That’s even better. But is it good enough?

Non-Integer Inputs

Thom’s data is different from mine: I record weights as whole number pounds, but Thom records tenths of pounds. This requires a few changes.

If I regress to the first example and use =SORT(UNIQUE()), I get a list of every unique value in the data range. Obviously this isn’t what I want; I really just want whole numbers in the chart’s X values.

If I use the SEQUENCE approach with =SEQUENCE(H3+1-G3,1,G3,1) in cell D6, I still encounter a problem because the minimum isn’t a whole number.

I need to adjust my calculated Min and Max, using =FLOOR(MIN(B2:B288),1) and =CEILING(MAX(B2:B288),1) to give me whole numbers in cells G3 and H3. The function in D6, =SEQUENCE(H3+1-G3,1,G3,1), now provides what I need.

But now the COUNTIF() function in cell E6 falls flat. As written, the function looks for exact matches with the results from the D6# Dynamic Array, counting only 23 of the 287 weights.

I need a smarter counting function in cell E6, so I will use =COUNTIFS(B2:B288,">="&D6#,B2:B288,"<"&D6#+1), which counts values from D6# up to, but not including, D6#+1. This counts all 287 of these values.
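Here is a hedged Python sketch of the same idea, with made-up weights recorded in tenths of a pound. It contrasts exact matching (the COUNTIF behavior) with half-open bins that include the lower bound and exclude the upper (the COUNTIFS behavior).

```python
import math

# Hypothetical weights recorded to a tenth of a pound.
weights = [160.8, 161.0, 161.4, 161.9, 162.0, 163.7]

low = math.floor(min(weights))   # like =FLOOR(MIN(range),1)
high = math.ceil(max(weights))   # like =CEILING(MAX(range),1)
bin_starts = list(range(low, high + 1))

# Exact matching only finds weights that happen to be whole numbers.
exact_counts = [sum(1 for w in weights if w == b) for b in bin_starts]

# Half-open bins: b <= w < b + 1, so every weight lands in exactly one bin.
binned_counts = [sum(1 for w in weights if b <= w < b + 1) for b in bin_starts]

print(bin_starts)
print(exact_counts, sum(exact_counts))
print(binned_counts, sum(binned_counts))
```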

The same NORM.DIST function as before, scaled by the count in D3, gives me the normal curve coordinates. Cell F6 has the formula =D3*NORM.DIST(D6#+0.45,E3,F3,FALSE).

The reason for the offset of 0.45 in the formula for the curve is that I’m not counting whole numbers, I’m counting values between one whole number and the next. The bar for 161 reflects values between 161.0 and 161.9 (assuming a resolution of 0.1), which average 161.45.

The chart is nearly identical to the one with whole number weights above.

Make the Chart Symmetrical

All the charts so far suffer from a certain asymmetry. Because it’s easier for my weight to float upward a bit for a few days than downward, there is a longer upward tail on the distribution, and the bulge of the chart is off center.

I can fix that with a further modification to the Min and Max in my summary table. To make the chart symmetrical, I need the same amount of space above and below the mean.

The distance from the mean to the max is MAX(B2:B288)-E3, while the distance from the mean to the min is E3-MIN(B2:B288). The symmetric space above and below the mean is given by MAX(MAX(B2:B288)-E3,E3-MIN(B2:B288)), so the new min and max values in G3 and H3 are:

=FLOOR(E3-MAX(MAX(B2:B288)-E3,E3-MIN(B2:B288)),1)
=CEILING(E3+MAX(MAX(B2:B288)-E3,E3-MIN(B2:B288)),1)

When these are used as the bounds, I get the following Dynamic Arrays:

When I create a chart as above, the bulge of values is centered in the chart. This satisfies my internal aesthetic. Because of the longer tail at the top end, the curve and bars are not perfectly aligned, but I won’t dispute the data.
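The symmetric bounds can be sketched in a few lines of Python. The function name `symmetric_bounds` and the sample data are hypothetical; the logic follows the two worksheet formulas above.

```python
import math
from statistics import mean


def symmetric_bounds(values):
    """Whole-number bounds placed equally far below and above the mean."""
    avg = mean(values)
    # The larger of the two distances from the mean to an extreme value.
    delta = max(max(values) - avg, avg - min(values))
    return math.floor(avg - delta), math.ceil(avg + delta)


# Hypothetical data with a longer upper tail: mean is 172, max is 176.
sample = [170, 171, 171, 172, 176]
print(symmetric_bounds(sample))
```

For this sample the mean is 172 and the farther extreme is 4 above it, so the bounds extend 4 below the mean as well.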

LET Me Take it Further

If the Dynamic Arrays are becoming easy for you, we can take it further, using the new LET function. With LET, I can define inputs and intermediate calculations, and use them in downstream calculations. I’m continuing with whole number data from here on, but these principles could be applied to either case.

My formula in cell D6 is shown below. I define the input range of weights, rng, and the calculated minimum and maximum values, datamin and datamax. I compute delta, the larger of the spans between the max or the min and the mean. Based on delta I compute my new minimum and maximum values, newmin and newmax. I determine my list of weights, then do my COUNTIF and NORM.DIST as in the individual Dynamic Arrays in D6:F6. By using CHOOSE({1,2,3},... I can output all of these from a single formula. It’s mind-boggling at first, but also exciting.

=LET(rng,B2:B288,
     datamin,MIN(rng),
     datamax,MAX(rng),
     avg,AVERAGE(rng),
     delta,MAX(datamax-avg,avg-datamin),
     newmin,FLOOR(avg-delta,1),
     newmax,CEILING(avg+delta,1),
     weights,SEQUENCE(newmax+1-newmin,1,newmin,1),
     CHOOSE({1,2,3},
            weights,
            COUNTIF(rng,weights),
            COUNT(rng)*NORM.DIST(weights,avg,STDEV(rng),FALSE)
            )
     )

With this large formula in cell D6, here is my new Dynamic Array. It is just like the previous one, but it takes one formula, not three.

Since the calculations are identical, the resulting chart is identical.

Is your mind blown yet? If not, read on.

LAMBDA Anyone?

Dynamic Array formulas came first, and they awed and amazed us all. Then came the LET function, which allowed us to input arguments and perform intermediate calculations leading to our desired results. But LAMBDA takes Excel an order of magnitude further, allowing us to define a formula, then use it as a custom function wherever we need it.

Using LAMBDA I’m going to define a function HistoNormData, which will allow me to input a range, such as my weights, and spit out a data range that I can use in a histogram.

The bulk of my LAMBDA formula will be the LET formula from the previous section. I input the data range into LAMBDA, pass it into the LET, and output the chart data range. The LAMBDA formula looks like this:

=LAMBDA(rng,
        LET(datamin,MIN(rng),
            datamax,MAX(rng),
            avg,AVERAGE(rng),
            delta,MAX(datamax-avg,avg-datamin),
            newmin,FLOOR(avg-delta,1),
            newmax,CEILING(avg+delta,1),
            weights,SEQUENCE(newmax+1-newmin,1,newmin,1),
            CHOOSE({1,2,3},
                   weights,
                   COUNTIF(rng,weights),
                   COUNT(rng)*NORM.DIST(weights,avg,STDEV(rng),FALSE)
                   )
            )
     )

I can’t use it like this. But I can enter it into a formula and append values for the LAMBDA arguments. My only argument is rng, and I want to use B2:B288, so I enter the formula in cell D6, and append the range address in parentheses at the end of the formula:

=LAMBDA(rng,
        LET(datamin,MIN(rng),
            datamax,MAX(rng),
            avg,AVERAGE(rng),
            delta,MAX(datamax-avg,avg-datamin),
            newmin,FLOOR(avg-delta,1),
            newmax,CEILING(avg+delta,1),
            weights,SEQUENCE(newmax+1-newmin,1,newmin,1),
            CHOOSE({1,2,3},
                   weights,
                   COUNTIF(rng,weights),
                   COUNT(rng)*NORM.DIST(weights,avg,STDEV(rng),FALSE)
                   )
            )
     )(B2:B288)

This approach helps to debug the LAMBDA formula.

The result of the LAMBDA formula is identical to that of the LET formula in the previous section, and the resulting chart is also identical. Since the LAMBDA works out in this test, I’m ready to convert it into a custom function. This is done using Excel’s Defined Name infrastructure.

On the Formulas tab of the ribbon, click Define Name. When the New Name dialog pops up, enter the name of the custom function in the Name textbox, and enter the formula (not including the arguments in parentheses at the end) in the Refers To textbox. It’s easiest to just copy and paste the formula that you worked on in the Excel formula bar above: you can make the Name dialog larger, but you can’t make the Refers To box more than one row of text tall. Microsoft assures us they are working on a better formula editing experience, and we can’t wait.

Click OK and the custom function is created.

The function is used like any others in Excel. Cell D6 contains the formula

=HistoNormData(B2:B288)

and my three columns of values are output in the sheet, spilling to fill as many rows as are needed.

The range and chart are identical to what we’ve already seen above, but I can easily use my HistoNormData function to compute similar output ranges for other data in the same workbook. For example, in the worksheet shown below, I have a much larger range of data. I entered this formula in cell D6

=HistoNormData(B2:B783)

And I get a corresponding chart of the larger data set, without any additional work.

Names are defined for a given workbook, so you would have to define your custom function in any workbook where it is needed. But it’s easier than you think: you use the custom LAMBDA function on a worksheet, then copy that worksheet to another workbook, and the custom function is also copied to the new workbook.
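For readers who think in code, a reusable function is the closest analogy to a named LAMBDA. The Python sketch below is an assumption-laden analog, not a translation Excel itself would produce: `histo_norm_data` is a hypothetical name, and the sample data is made up. It takes any list of whole-number values and returns the three columns.

```python
import math
from statistics import NormalDist, mean, stdev


def histo_norm_data(values):
    """Return rows of (weight, count, scaled normal curve) for whole-number data."""
    avg = mean(values)
    delta = max(max(values) - avg, avg - min(values))
    new_min = math.floor(avg - delta)
    new_max = math.ceil(avg + delta)
    curve = NormalDist(avg, stdev(values))
    rows = []
    for w in range(new_min, new_max + 1):
        count = sum(1 for v in values if v == w)          # like COUNTIF
        rows.append((w, count, len(values) * curve.pdf(w)))  # scaled curve
    return rows


# The same function works on any data set, like reusing the named LAMBDA.
for row in histo_norm_data([170, 171, 171, 172, 176]):
    print(row)
```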

Make the Chart Dynamic

I’ve written a follow-up article, Dynamic Charts Using Dynamic Arrays, that shows how to make this histogram dynamic, so that changes to the size of the Dynamic Array’s spill range are reflected in the chart.

A year or so after I posted these articles, Microsoft released an enhancement to Excel that made Dynamic-Array-driven charts themselves dynamic. If all of the data in the chart comes from a single Dynamic Array formula, the chart’s source data will change size to match the Dynamic Array’s spill range. This means we can select the original Dynamic Array and insert our chart, ignore the need to create Names for the X and Y values, and the chart will dynamically change its source data range as the Dynamic Array changes.

More About Dynamic Arrays, LET, and LAMBDA

Posted: Monday, March 8th, 2021 under Dynamic Arrays.
Tags: Dynamic Arrays, Histograms, LAMBDA Function, LET Function, Office 365, Statistics.
Comments: 4

Watching my Weight with SPC (Statistical Process Control)

Tuesday, April 28, 2020 by Jon Peltier 8 Comments


I’ve been working on a Statistical Process Control project for a client, building a workbook to automate construction of control charts. Years ago I wrote a tutorial called Introducing Control Charts (Run Charts). Many processes, in manufacturing, in business, or in nature, show fluctuations in their outputs. We can use Statistical Process Control (SPC) techniques to monitor these processes and ensure the fluctuations stay within expected limits.

I was looking for data to proof out the tool I was building, and I thought I could use my weight as a decent data set. My wife bought a new digital scale in 2006, and I’ve been weighing myself almost every day since then. And being an Excel jock, I put my measurements into a spreadsheet.

In the chart below, you can see how I fluctuated around 200 lb for over a decade. Then 20 months ago my wife and I joined Weight Watchers, and over the course of 6 or 8 months I lost 40 lb.

I thought looking at the past few months would be a good way to illustrate the use of SPC to track a process. This exercise will construct a series of control charts of this data.

Learning about Statistical Process Control

I first learned about Statistical Process Control as a practitioner and as a trainer, while employed as a scientist/engineer for a large manufacturing corporation. One of the resources we had was a deceptively small book called Understanding Variation: The Key to Managing Chaos by Donald J. Wheeler.

Understanding Variation: The Key to Managing Chaos
by Donald J. Wheeler

There are many other information sources about SPC and control charts. The National Institute of Standards and Technology (NIST) has an online Engineering Statistics Handbook, which has a chapter on Univariate and Multivariate Control Charts. Wikipedia has brief articles with many references covering SPC and Control Charts. And Google shows about 1.2 billion results for SPC and 0.5 billion results for Control Charts.

Getting Started

Prepare the Data

The first step is to identify the data and get it into a form where it can be analyzed. I decided to track from 1-Sept-2019 to 1-Feb-2020. Below is the top of my data worksheet, with a few calculations. The data is in three columns of an Excel Table named Table_1. The first two columns are date and weight, manually entered. The third column is Moving Range (MR), which we will use as a measure of variability in the data. The formula in cell C2 and filled down the Table column is

=IFERROR(ABS([@Weight]-OFFSET([@Weight],-1,0)),NA())

Essentially it determines the absolute value of my change in weight from one day to the next. Any error in the calculation (such as trying to subtract the column header) returns NA(), or the #N/A error.
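The moving range is simple enough to sketch in Python. In this illustration `None` plays the role of #N/A for the first row, and the weights are made up.

```python
def moving_ranges(values):
    """Absolute change from each point to the next; None for the first point."""
    result = [None]  # the first row has no previous value to compare with
    for previous, current in zip(values, values[1:]):
        result.append(abs(current - previous))
    return result


# Hypothetical daily weights.
print(moving_ranges([168, 169, 167, 167]))
```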

Weight data and preliminary calculations

I’ve calculated some values in a range beside the table, and I’ll explain them as I go along. The little table below the calculations shows the formulas I’ve used. I’ve also named these cells as indicated, to make it easier to use the cells in formulas.

Chart the Data

The next step is to plot the data. I’ve made two charts, one of my weight, the other of the calculated moving range. We look first for any obvious issues in the data, such as the spike late in September. If you look at the data above, apparently I gained 18 lb one day, and lost it the next. A more likely explanation is that I transposed digits in 168 and instead entered 186 in the worksheet. I’ll deal with this data issue soon, but for now I’ll continue with the SPC construction.

I added the calculated items as columns in my Table to make it easier to chart them. Having named the cells, I could use simple formulas in the Table: =Mean in cell D2, =LCL in cell E2, etc.

Data table with calculated items — *click on image to enlarge*

Among my calculations are averages of the weight data (Mean) and of the moving range data (MR Bar). Let’s add these as green horizontal lines to the weight and MR charts for reference.

Compute Limits

So far, so good. Now let’s add a measure of “allowable” or “acceptable” variation. If the process is following statistical rules and its variability follows a normal distribution, we would use multiples of sigma, the standard deviation, to identify limits. According to the definition of a normal distribution, 68.3% of values fall within ±1 standard deviation of the mean, 95.5% fall within ±2 sigma, and 99.7% fall within ±3 sigma of the mean. By convention, 3 sigma is commonly used to identify acceptable variations.

We could measure the sample’s standard deviation (SD) directly, multiply it by 3, and use this to determine our limits. But using moving range is more robust, since outliers and non-normal distributions have a greater effect on sigma than on moving range.

The average moving range, or MR Bar, is used to calculate control limits. Less commonly, the median of the moving range is used to compute these limits.

First we determine MR UCL, which is the Upper Control Limit on the moving range, by multiplying the average moving range by 3.268. This is plotted on the moving range chart as a horizontal orange line (bottom chart below). We would expect 99.7% of our MR values to fall below this limit.

In the same way, we calculate the UCL and LCL (Upper and Lower Control Limits) of our individual data. We multiply MR Bar by 2.66, and add it to or subtract it from the mean to get our limits. These are plotted on our chart of individual values as horizontal orange lines (top chart below). Again, we expect 99.7% of our individuals to fall between these two lines.

IMR Chart = Combined Individuals and Moving Range Charts
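The limit calculations can be sketched as follows. This is an illustration under stated assumptions: the weights are made up, `imr_limits` is a hypothetical helper, the MR chart multiplier is the 3.268 quoted above, and the individuals multiplier is written as 3/d2 with d2 = 1.128, the usual constant for two-point moving ranges (about 2.66).

```python
from statistics import mean

D2 = 1.128             # standard constant for moving ranges of two points
MR_UCL_FACTOR = 3.268  # multiplier for the MR chart, as quoted in the text


def imr_limits(values):
    """Center line and control limits for an individuals and moving range chart."""
    mrs = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    mr_bar = mean(mrs)
    half_width = 3 * mr_bar / D2  # roughly 2.66 times MR Bar
    return {
        "mean": center,
        "mr_bar": mr_bar,
        "lcl": center - half_width,
        "ucl": center + half_width,
        "mr_ucl": MR_UCL_FACTOR * mr_bar,
    }


# Hypothetical daily weights.
limits = imr_limits([168, 169, 167, 168, 170, 168])
for name, value in limits.items():
    print(name, round(value, 3))
```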

These charts of measurements along with means and limits are called Control Charts. The chart of individual values is called an I Chart (no, not “eye chart”), and the moving range chart is the MR Chart. Together they are referred to as an IMR (sometimes ImR) Chart.

Our ±3 SD limits are shown in the dashed red lines below (they are calculated as LCL 2 and UCL 2). They fall pretty far outside the MR-based control limits. All points fall well within the SD-based limits, except for the one obvious outlier.

Standard Deviation and Moving Range based control limits

In fact, because the outlier causes two excessive moving range values, the MR-based limits are also too wide, and would lead us to accept points that would otherwise be out of control.

Clean Up Special Cause Variations

Special and Common Cause Variation

The spike in my weight in September is a “special cause” variation, because it is a one-off problem. Since it is obviously not a valid measurement, we can attribute it to a recording error, and ignore it. We want to remove this value from our moving range calculations, since it resulted in limits which were too wide.

The other variation we see in the timeline is “common cause” variation. It comes from variations in inputs, like exercise, meals, and other factors, which are themselves subject to normal variation.

Clean Up the Data

In my adjusted table below (Table_2), I’ve added two columns. Wt 2 simply repeats the data in Weight, using the Table formula =[@Weight]. I can replace any special cause deviation with =NA() or #N/A in this column. MR 2 uses the same formula as MR, based on the Wt 2 column:

=IFERROR(ABS([@[Wt 2]]-OFFSET([@[Wt 2]],-1,0)),NA())

Where there was one bad weight and two bad moving ranges, we now have #N/A values in the table, which we can ignore in the chart and in our other calculations.

Plot the New Data

When we plot our individual and moving range values, the chart scales now show much narrower ranges, and there are no longer any obvious outliers: there is one high individual value and corresponding moving range in January, a few low weights in November, and a few high weights in December.

Let’s add our means and control limits, and see what we have. The MR chart shows the outlying value in late January, and four more moving range values that are just at the limit. In the individuals chart, the low values are within the limits (“in control”) while the high values we eyeballed before are above the UCL (“out of control”).

When values are out of control, we have to examine the process, to ensure that nothing is wrong with our process, and that nothing has changed. I can actually explain some of the variations. On Thanksgiving, I ran a “Turkey Trot” with my daughter, so for a couple weeks I was running more than my usual 3 miles a day: thus the few low values in November. And of course, the few values of 172 coincide with the Christmas and New Year’s holidays.

Standard Deviation vs Moving Range

Below I’ve plotted the SD-based limits along with the MR-based limits. The limits are much closer to each other and closer to the mean than when the outlier was included in the calculations.

Here I’ve plotted these control limits as calculated with and without the outlier. The outlier had a substantial effect on the limits, especially on the SD limits.

Comparison of moving range based control limits and standard deviation based limits

When the variation fits a normal distribution, the two sets of limits are close together, with the MR-based limits wider sometimes and the SD-based limits wider other times. The larger the data set, the closer they will be.
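A small experiment shows how one bad value distorts both kinds of limits. The sketch below uses made-up weights with a single transposed entry (186 instead of 168); `None` stands in for #N/A, and `half_widths` is a hypothetical helper. As before, d2 = 1.128 is the usual constant for two-point moving ranges.

```python
from statistics import mean, stdev

D2 = 1.128  # standard constant for moving ranges of two points


def half_widths(values):
    """Return (3 * SD, 3 * MR Bar / d2), skipping None the way Excel skips #N/A."""
    present = [v for v in values if v is not None]
    mrs = [
        abs(b - a)
        for a, b in zip(values, values[1:])
        if a is not None and b is not None
    ]
    return 3 * stdev(present), 3 * mean(mrs) / D2


# Hypothetical weights; 186 is a typo for 168.
raw = [168, 169, 167, 186, 168, 170, 169, 168]
cleaned = [None if v == 186 else v for v in raw]

sd_raw, mr_raw = half_widths(raw)
sd_clean, mr_clean = half_widths(cleaned)

print("with outlier:    SD-based", round(sd_raw, 2), " MR-based", round(mr_raw, 2))
print("without outlier: SD-based", round(sd_clean, 2), " MR-based", round(mr_clean, 2))
```

Replacing the one bad weight with `None` removes one individual value and the two moving ranges that touch it. In this made-up sample both limits tighten a great deal, and the SD-based limit shrinks by the larger factor.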

For the rest of this analysis, I’ll ignore sigma and stick to MR-based calculations.

Highlighting Outliers

Enhanced Data

We can enhance our IMR Chart by highlighting points which are out of control. I’ve added two columns to my table to support this. Wt X has this formula

=IF(OR([@[Wt 2]]<=LCL,[@[Wt 2]]>=UCL),[@[Wt 2]],NA())

which shows the value from Wt 2 if it falls outside the control limits, and #N/A otherwise. MR X has this formula

=IF([@[MR 2]]>=MR_UCL,[@[MR 2]],NA())

which again shows the value from MR 2 if it falls above the control limit, otherwise #N/A.
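The two highlight columns follow the same pattern: keep the value if it violates a limit, otherwise return a blank. A minimal Python sketch, with hypothetical function names, made-up limits, and `None` standing in for #N/A:

```python
def flag_outside(values, lcl, ucl):
    """Like the Wt X column: keep values on or beyond either limit, else None."""
    return [
        v if v is not None and (v <= lcl or v >= ucl) else None
        for v in values
    ]


def flag_above(values, upper):
    """Like the MR X column: keep values on or above the upper limit, else None."""
    return [v if v is not None and v >= upper else None for v in values]


# Hypothetical values and limits.
print(flag_outside([168, 172, None, 164, 169], lcl=165, ucl=172))
print(flag_above([None, 1, 5, 2], upper=4.5))
```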

Highlighting the Chart

I’ve added these columns to my IMR Chart as red/orange markers.

Additional Control Chart Rules

There are other features of control charts that indicate a process which is out of control. These are conditions which are not expected to be found in about 99.7% of cases. Here are a handful of common out-of-control rules; the first one is the one I highlighted above.

One point beyond 3-sigma control limits
2 of 3 points outside 2-sigma on same side of mean
4 of 5 points outside 1-sigma on same side of mean
8 consecutive points outside 1-sigma on both sides of mean
15 consecutive points inside 1-sigma on both sides of mean
9 consecutive points on same side of mean
6 consecutive points moving in same direction
14 consecutive points alternating up and down

Advanced SPC software highlights any of these situations, in addition to the 3-sigma violations.
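As an illustration of how such a rule might be checked, here is a sketch of just one of them, nine consecutive points on the same side of the mean. The function name and data are hypothetical, and real SPC packages may treat points exactly on the center line differently; here such a point simply resets the run.

```python
def same_side_runs(values, center, run_length=9):
    """Indexes at which a run of run_length points on one side of center is reached."""
    hits = []
    streak = 0
    last_side = 0
    for i, v in enumerate(values):
        side = (v > center) - (v < center)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == last_side:
            streak += 1
        else:
            streak = 1 if side != 0 else 0
        last_side = side
        if streak >= run_length:
            hits.append(i)
    return hits


# Hypothetical data: 8 points above, 1 below, then 9 above a center of 168.
data = [169] * 8 + [167] + [169] * 9
print(same_side_runs(data, center=168))
```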

Extending the Data

To show how to manage a growing data set, I added ten more weeks of my weight tracking.

Frozen Control Limits

Typically, when a process is determined to be steady, the limits are calculated and frozen, then these are extended forward. This is illustrated below: the frozen limits were calculated from September through February, indicated with solid lines, and extended into April, shown with dashed lines.

Where I had a few values above the UCL in December and January, I now had several below the LCL and only a few above the mean in February and beyond.

This is evidence of a process shift. Several of the additional rules mentioned at the end of the last section would have been triggered. Checking my exercise records gives us an explanation. For much of the period from September through January, I was running 3 miles a day, four or five days a week. The weather in February was rather mild, so I increased my mileage to about 3.5 miles a day, six days a week.

Moving (Variable) Limits

The control charts below show control limits calculated over the entire range. The process change is still noticeable, but it’s not as clear as with the frozen and extended limits above.

Another problem with continually recalculating limits is that the limits move over time. Points which were in control at one time may be pushed out of control by later measurements. A December point at 170 which was in control when the limits were frozen is now out of control under the newly computed limits.

Staged Analysis

We can overcome this concern by staging our analysis, that is, computing different limits for different subsets of our data. In my latest Table below, I’ve added a column named Stage, which contains 1 for the first stage and 2 for the second; these can be entered manually or with a formula, which for example increments the stage number on a given date. The control limits are computed separately for different stages.
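A staged calculation can be sketched by grouping the values by stage before computing limits. This is a simplified illustration: `staged_limits` is a hypothetical helper, the data is made up, each stage is assumed to be one contiguous block of rows, and d2 = 1.128 is the usual constant for two-point moving ranges.

```python
from statistics import mean

D2 = 1.128  # standard constant for moving ranges of two points


def staged_limits(values, stages):
    """Return {stage: (lcl, mean, ucl)}, computed separately within each stage."""
    limits = {}
    for stage in sorted(set(stages)):
        subset = [v for v, s in zip(values, stages) if s == stage]
        # Moving ranges are taken only between points in the same stage.
        mrs = [abs(b - a) for a, b in zip(subset, subset[1:])]
        center = mean(subset)
        half_width = 3 * mean(mrs) / D2
        limits[stage] = (center - half_width, center, center + half_width)
    return limits


# Hypothetical weights, with a shift to lower values in stage 2.
weights = [168, 169, 168, 170, 166, 165, 166, 164]
stage_numbers = [1, 1, 1, 1, 2, 2, 2, 2]
stage_limits = staged_limits(weights, stage_numbers)
for stage, (lcl, center, ucl) in stage_limits.items():
    print(stage, round(lcl, 2), round(center, 2), round(ucl, 2))
```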

The IMR Chart below shows a staged analysis. Stage 1 looks familiar; the UCLs for both MR and Individuals are slightly lower because the large MR late in January coincided with the process change. The violations in stage 1 are the same as before; the few outliers in stage 2 would have been well within the stage 1 limits, but are actually above the stage 2 UCL.

It’s common practice not to compute a separate average moving range for each stage, especially if the stages have small numbers of points, but instead to use an overall MR Bar. The chart below uses this combined measure of variation. Stage 1’s control limits are now a bit tighter, so the low weights measured during the Turkey Trot training in November are now outliers. Conversely, Stage 2’s control limits are slightly wider, so there are no outliers in Stage 2.

Statistical Process Control Articles in this Blog

Posted: Tuesday, April 28th, 2020 under SPC.
Tags: Control Charts, Run Charts, Statistical Process Control, Statistics.
Comments: 8

Trendline Calculator for Multiple Series

Tuesday, February 12, 2019 by Jon Peltier 33 Comments


A couple months back I wrote Add One Trendline for Multiple Series which shows how to add a trendline to a chart, and have the trendline calculated for multiple series in the chart. In fact, that tutorial was based on my answer to a question on Quora, How can I have multiple scatter plots and one trendline for all of them combined in Excel? Some Quora questions can be kind of lame, but this was a good one, especially if I’m getting a second blog post out of it.

Feedback on that tutorial was positive, but it seems that people would like the process to be faster and simpler. Fair enough.

So I decided to write a small add-in that automates the process.

The Manual Process

If you recall, the original problem was that we had three series of data in the chart, and we can easily get a trendline for any or all individual series, but we want a trendline that covers all points in all three series. You can download a workbook with my dummy data and charts here: MultiScatterTrendlineData.xlsx.

Note: this approach will not work in Line charts; in general you should not use Line charts if you need trendlines for your data.

Here is the original chart from the earlier tutorial:

And here is the chart with a trendline for each individual series:

Multiple XY Series and Trendlines in One Chart

We created a new series in the chart that included all points from the first three series (the yellow markers cover the blue, orange, and green ones):

Multiple XY Series, Including the Combined Series, in One Chart

This was the tedious step, adding all the data to a new series, and this is the part that my add-in will speed through.

Then we hid the new series by formatting it without markers, and added a trendline:

Multiple XY Series with One Combined Trendline in One Chart

The VBA Code

The three original series in the chart have these formulas:

=SERIES(Sheet1!$C$2,Sheet1!$B$3:$B$11,Sheet1!$C$3:$C$11,1)
=SERIES(Sheet1!$E$2,Sheet1!$D$3:$D$11,Sheet1!$E$3:$E$11,2)
=SERIES(Sheet1!$G$2,Sheet1!$F$3:$F$11,Sheet1!$G$3:$G$11,3)

Remember, a series formula has four arguments (a bubble chart series has a fifth argument, but we’ll ignore bubble charts here):

=SERIES(Series Name, X Values, Y Values, Plot Order)

We’ll give our added series a new name, “Combined”, and it will automatically be 4th in the plot order. In between we will combine the X values and Y values of the original three series. Our constructed series formula looks like:

=SERIES("Combined",
    (Sheet1!$B$3:$B$11,Sheet1!$D$3:$D$11,Sheet1!$F$3:$F$11),
    (Sheet1!$C$3:$C$11,Sheet1!$E$3:$E$11,Sheet1!$G$3:$G$11),
    4)

The multiple X value ranges are enclosed in parentheses, as are the multiple Y value ranges.

What our code will do is count the series in the chart, read each series formula in turn, split out its arguments, and concatenate the separate X and Y values into combined X and Y values. The code will then add the new series, apply the arguments of the series formula, hide the markers, and add a trendline.

Here is the simple procedure:

Sub ComputeMultipleTrendline()
  If Not ActiveChart Is Nothing Then
    With ActiveChart
      Dim ixSeries As Long
      For ixSeries = 1 To .SeriesCollection.Count
        ' strip "=SERIES(" from the start and ")" from the end of the formula
        Dim SeriesFormula As String
        SeriesFormula = .SeriesCollection(ixSeries).Formula
        SeriesFormula = Mid$(SeriesFormula, InStr(SeriesFormula, "(") + 1)
        SeriesFormula = Left$(SeriesFormula, Len(SeriesFormula) - 1)

        ' arguments are: series name, X values, Y values, plot order
        Dim SeriesArgs As Variant
        SeriesArgs = Split(SeriesFormula, ",")

        ' append this series' X and Y addresses to the running lists
        Dim XAddress As String, YAddress As String
        XAddress = XAddress & SeriesArgs(LBound(SeriesArgs) + 1) & ","
        YAddress = YAddress & SeriesArgs(LBound(SeriesArgs) + 2) & ","
      Next

      ' drop the trailing comma and wrap each list as "=(...)"
      XAddress = "=(" & Left$(XAddress, Len(XAddress) - 1) & ")"
      YAddress = "=(" & Left$(YAddress, Len(YAddress) - 1) & ")"

      ' add the combined series, hide its line and markers, add a trendline
      With .SeriesCollection.NewSeries
        .Name = "Combined"
        .XValues = XAddress
        .Values = YAddress
        .Format.Line.Visible = False
        .MarkerStyle = xlMarkerStyleNone
        ' solid line in the Text 1 theme color instead of the default dashes
        With .Trendlines.Add.Format.Line
          .DashStyle = msoLineSolid
          .ForeColor.ObjectThemeColor = msoThemeColorText1
          .ForeColor.Brightness = 0
        End With
      End With
    End With
  End If
End Sub

If you want to run this code, open the VB Editor (easiest way: use the Alt+F11 shortcut), find your workbook in the Project Explorer, and insert a fresh module (Insert menu > Module, or simply Alt+I+M).

If the new module doesn’t say Option Explicit at the top, type it yourself, then go to the Tools menu > Options, and on the Editor tab of the dialog, check the box labeled Require Variable Declaration, and you may as well uncheck the box for Auto Syntax Check. I discuss why in a decade-old tutorial, VB Editor Settings.

Skip a line after Option Explicit in your brand new code module, then copy the code from above, and paste it into the module.

Before you run the code, select a chart. Then press Alt+F8 to open the Macros dialog. Select ComputeMultipleTrendline and click Run. In the blink of an eye, the new series is added, though it’s not visible, and the trendline appears. I used a solid black line, rather than the default dotted line Excel uses, because I think a solid line makes it easier to see.

It doesn’t matter if all series use the same or different X values; the code doesn’t even compare the X values of the different series, it just puts them all into the series formula.
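One caveat about the simple procedure, offered as a sketch rather than as part of the original code or the add-in: Split breaks the series formula at every comma. That works for the sample chart, but it assumes that no series name or sheet name contains a comma, and that each series uses a single-area range for its X and Y values. The helper below splits only on top-level commas, skipping any comma inside parentheses, double quotes, or a single-quoted sheet name. The name SplitSeriesArgs is made up for this illustration.

```vba
' Hypothetical helper, not part of the original procedure or the add-in.
' Returns the arguments of a SERIES formula as a zero-based string array,
' splitting only on commas that are outside parentheses and quotes.
Function SplitSeriesArgs(ByVal SeriesFormula As String) As Variant
  ' strip "=SERIES(" from the start and ")" from the end
  Dim Body As String
  Body = Mid$(SeriesFormula, InStr(SeriesFormula, "(") + 1)
  Body = Left$(Body, Len(Body) - 1)

  Dim Args() As String
  ReDim Args(0 To 0)
  Dim Depth As Long, InDouble As Boolean, InSingle As Boolean
  Dim i As Long, Ch As String

  For i = 1 To Len(Body)
    Ch = Mid$(Body, i, 1)

    ' track whether we are inside quotes or parentheses
    Select Case True
      Case Ch = """" And Not InSingle
        InDouble = Not InDouble
      Case Ch = "'" And Not InDouble
        InSingle = Not InSingle
      Case InDouble Or InSingle
        ' quoted text: parentheses here have no structural meaning
      Case Ch = "("
        Depth = Depth + 1
      Case Ch = ")"
        Depth = Depth - 1
    End Select

    If Ch = "," And Depth = 0 And Not InDouble And Not InSingle Then
      ' top-level comma: start the next argument
      ReDim Preserve Args(0 To UBound(Args) + 1)
    Else
      Args(UBound(Args)) = Args(UBound(Args)) & Ch
    End If
  Next

  SplitSeriesArgs = Args
End Function
```

In the procedure above, SplitSeriesArgs(.SeriesCollection(ixSeries).Formula) could stand in for the Mid$, Left$, and Split lines. A multi-area argument comes back with its own parentheses, which would have to be removed before the addresses are concatenated.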

The Multi Scatter Trendline Calculator

I used the code above as the basis for my add-in. I added a custom ribbon tab named Multi Trendline with a custom button labeled Multi Scatter Trendline to invoke the code. I also designed a dialog so that you can select which series in the chart to include in your analysis (and which to exclude).

Preparing to Install the Add-In

You can download the add-in from this link: MultiScatterTrendlineCalculator.xlam. The add-in is packaged in a zip file. Unzip the file, and store the add-in in the User Add-in Library, which is

C:\Users\USERNAME\AppData\Roaming\Microsoft\AddIns\

where USERNAME is your Windows login. You can get there quickly by pressing Win+R (Win = Windows key), typing %appdata% in the Run box, and clicking OK. This opens the Roaming directory; from there, drill down to Microsoft and then AddIns.

You can actually store the add-in in almost any convenient folder, but when you use the Add-In Library, it’s easy to find the add-in from within Excel when you install it.

Windows protects your computer from malicious software that came from a different computer than yours, but it also protects your computer from useful software that came from my computer, so you need to unblock the add-in. Right click on the add-in file in Windows Explorer, and choose Properties. At the bottom of the General tab of the Properties dialog, there may be a notice that the file may be blocked, and there is a checkbox to unblock the file.

Check the box, and click OK.

Installing the Add-In

If you have the Developer tab showing on Excel’s ribbon, go there and click on Excel Add-Ins (or if it’s an older version of Excel that has no Excel Add-Ins button, click on Add-Ins) to open the Add-Ins dialog.

Otherwise, click on the File tab > Options > Add-Ins. Click the Go button near the bottom of the list to open the Add-Ins dialog.

Or you can use the old Excel 2003 shortcut, Alt+T+I to open the Add-Ins dialog.

If you stored the add-in in the User Library, it will appear in the Add-Ins dialog as MultiScatter Trendline Calculator. Otherwise you will have to click Browse, then navigate to find the add-in.

Check the box in front of this entry, then click OK, and the add-in is installed, available whenever you run Excel.

If you don’t want the add-in installed all the time, you can simply start it when you need it, using File > Open in Excel, double clicking in Windows Explorer, or dragging it from Windows Explorer and dropping it on Excel.
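The same installation can also be done from VBA. This is a minimal sketch, not code from the add-in: it assumes the unzipped file is named MultiScatterTrendlineCalculator.xlam and sits in the User Add-in Library, which Excel reports through Application.UserLibraryPath. The procedure name is made up for illustration. As far as I know, AddIns.Add needs at least one workbook to be open, so run it from an ordinary workbook.

```vba
' Hypothetical installer, assuming the file name and location described above.
Sub InstallMultiTrendlineAddIn()
  ' folder that the Add-Ins dialog searches by default
  Dim Folder As String
  Folder = Application.UserLibraryPath
  If Right$(Folder, 1) <> Application.PathSeparator Then
    Folder = Folder & Application.PathSeparator
  End If

  ' assumed file name; change it if your unzipped file is named differently
  Dim AddInPath As String
  AddInPath = Folder & "MultiScatterTrendlineCalculator.xlam"
  If Len(Dir$(AddInPath)) = 0 Then
    MsgBox "Add-in not found: " & AddInPath
    Exit Sub
  End If

  ' Add lists the file in the Add-Ins dialog;
  ' Installed = True is the same as checking its box
  Application.AddIns.Add(Filename:=AddInPath).Installed = True
End Sub
```

Setting Installed back to False unloads the add-in again without deleting the file.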

Using the Add-In

Select a chart, then click the button on the custom ribbon tab.

Multi Scatter Trendline Button on the Ribbon

The dialog pops up. Select which series you want to include, and click OK.

Multiple Series Trendline Calculator Dialog

The series is added invisibly, and a trendline is added using Excel’s default settings. You can format this just like any other trendline, to change the fitting model used, to show the trendline formula on the chart, or to change the trendline’s formatting.

Multiple Series Trendline Added to Chart
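If you would rather not open the Format Trendline pane each time, the formula and R² can be switched on with a few lines of VBA. This is a sketch under my own assumptions, not part of the add-in: it acts on every trendline in the active chart, whichever series it belongs to, and the procedure name is made up for illustration.

```vba
' Hypothetical helper: show the equation and R-squared
' for every trendline in the active chart.
Sub ShowTrendlineEquations()
  If ActiveChart Is Nothing Then Exit Sub

  Dim srs As Series
  Dim ixTrend As Long
  For Each srs In ActiveChart.SeriesCollection
    For ixTrend = 1 To srs.Trendlines.Count
      With srs.Trendlines(ixTrend)
        ' .Type could also be set here, e.g. xlLinear or xlExponential
        .DisplayEquation = True
        .DisplayRSquared = True
      End With
    Next
  Next
End Sub
```

Select the chart first, then run the procedure from the Macros dialog (Alt+F8).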

About the Add-In

I have left the add-in unprotected in case you want to see how it all works. There is XML code that handles the custom ribbon tab and button. There is code in a second module to handle clicks from the ribbon button. A UserForm (i.e., a dialog) has been added to get input from the user. The main procedure is more detailed than shown in this article, to accommodate this dialog, and to compile data selectively.

I enjoy doing this kind of project. Even with the ribbon components and the dialog, it only takes a few hours. If you need something like this done, send me your requirements and I’ll generate a quote.

I liked this little utility so much that I’ve added it to Jon’s Toolbox, which is a nifty set of tools for quickly manipulating charts and data for such purposes as teaching and preparation for publishing.

Add One Trendline for Multiple Series

Wednesday, December 5, 2018 by Jon Peltier 23 Comments

A member of the online forum Quora asked, How can I have multiple scatter plots and one trendline for all of them combined in Excel? I interpreted this to mean, “I have multiple scatter series in my chart, how do I get a trendline for the combined data in the chart?”

I made up some dummy data, and generated this XY chart. Note: this approach will not work in Line charts; in general you should not use Line charts if you need trendlines for your data.

You can download a workbook with my dummy data and charts here: MultiScatterTrendlineData.xlsx.

Of course you can add a trendline to each of the series, and you get this cluttered mess. It has multiple trendlines, not a multiple-trendline.

Note: In a recent version of Excel (I don’t recall if it was 2013 or 2007), trendlines changed from black to the color of the points, which was good for visibility, and they also became a dotted line, which was bad for visibility. I always change to a solid line.

Well, this is not what the person wanted to know. But it’s no big deal to get what they wanted.

First, let’s just check out the chart. The three series have these formulas:

=SERIES(Sheet1!$C$2,Sheet1!$B$3:$B$11,Sheet1!$C$3:$C$11,1)
=SERIES(Sheet1!$E$2,Sheet1!$D$3:$D$11,Sheet1!$E$3:$E$11,2)
=SERIES(Sheet1!$G$2,Sheet1!$F$3:$F$11,Sheet1!$G$3:$G$11,3)

I’ve color coded the formula arguments to show series names in red, series X values in purple, and series Y values in blue (the colors of the series highlights in the worksheet). The numbers at the end are the plot order of each series.

What we need to do is add a series to the chart that uses all of these X values and all of these Y values. There are at least two ways to get this series.

Select Data

Right click on the chart and click on Select Data from the pop up menu.

The Select Data Source dialog appears.

Click the Add button, and the Edit Series dialog appears.

Click in the Series Name box, and add a descriptive label. “Combined” works.

Click in the Series X Values box, then with the mouse select the first range of X values. Type a comma, and select the second range of X values. Type another comma, and select the third range. Note: instead of typing a comma, you could hold the Ctrl key while selecting the second and third ranges, but I find that I make fewer errors if I type the comma and then don’t have to worry about how many buttons I’m holding with how many fingers.

Click in the Series Y Values box, delete the “={1}” that is there, then with the mouse select the first range of Y values. Type a comma, and select the second range of Y values. Type another comma, and select the third range.

The populated Edit Series dialog looks like this:

The Series X Values and Series Y Values boxes contain

=Sheet1!$B$3:$B$11,Sheet1!$D$3:$D$11,Sheet1!$F$3:$F$11

and

=Sheet1!$C$3:$C$11,Sheet1!$E$3:$E$11,Sheet1!$G$3:$G$11

Excel will put parentheses around these comma-separated addresses, and double quotes around the Series Name.

Click OK twice, and a new series will appear, whose markers obscure those of the existing three series.

The formula for the added series is:

=SERIES("Combined",(Sheet1!$B$3:$B$11,Sheet1!$D$3:$D$11,Sheet1!$F$3:$F$11), (Sheet1!$C$3:$C$11,Sheet1!$E$3:$E$11,Sheet1!$G$3:$G$11),4)

We can see that it includes the X values and the Y values for the three original series.

Add Series Formula Directly

You don’t need to use the Select Data Source dialog to add data to a chart. If you select the chart area (just the outermost rectangle containing the chart), you can click in the Formula bar, and enter your formula. Note that you have to enter double quotes around the series name and parentheses around the multiple areas of the X and Y values. You can type the addresses of the individual ranges, which is inconvenient since you have to remember to include the sheet name and exclamation point; it’s easier to select the ranges with the mouse. You can hold Ctrl while selecting multiple areas, but I find it easier to type a comma between range addresses. Don’t forget to end the formula with a comma, the Plot Order 4, and the closing parenthesis.

When you press Enter, the chart has a new series that hides the old series, just like above.

Add Trendline

Select this new series, click on the plus “skittle” next to the chart, and check Trendline.

Now format the “Combined” series to hide it (no lines or markers), and format the trendline to enhance visibility and to display the trendline formula and R² value.
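As an optional cross-check, and not something from the original steps: a linear trendline is an ordinary least-squares fit, so the slope and intercept displayed on the chart should agree with a direct calculation over the pooled points. The sketch below assumes the added series is named "Combined", that the trendline is linear with no forced intercept, and that every X and Y value is numeric with no blanks or text. Both procedure names are made up for illustration.

```vba
' Hypothetical cross-check, not part of the tutorial or the add-in.
' Returns Array(slope, intercept) of the least-squares line
' through the points (Xs(i), Ys(i)).
Function PooledLinearFit(Xs As Variant, Ys As Variant) As Variant
  Dim n As Long, i As Long
  Dim SumX As Double, SumY As Double, SumXY As Double, SumXX As Double

  For i = LBound(Xs) To UBound(Xs)
    n = n + 1
    SumX = SumX + Xs(i)
    SumY = SumY + Ys(i)
    SumXY = SumXY + Xs(i) * Ys(i)
    SumXX = SumXX + Xs(i) * Xs(i)
  Next

  ' fails with division by zero if all X values are identical
  Dim Slope As Double
  Slope = (n * SumXY - SumX * SumY) / (n * SumXX - SumX ^ 2)
  PooledLinearFit = Array(Slope, (SumY - Slope * SumX) / n)
End Function

' Prints the fitted line for the "Combined" series of the selected chart.
Sub CheckCombinedTrendline()
  If ActiveChart Is Nothing Then Exit Sub

  Dim Fit As Variant
  With ActiveChart.SeriesCollection("Combined")
    Fit = PooledLinearFit(.XValues, .Values)
  End With
  Debug.Print "y = " & Fit(0) & "x + " & Fit(1)
End Sub
```

The result appears in the Immediate Window (Ctrl+G in the VB Editor), where it can be compared with the formula shown on the chart.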

Update: Trendline Calculator for Multiple Series

For anyone who wants to apply this technique but finds it tedious to add a series with all the X and Y ranges, I’ve built a small utility that does the heavy lifting for you.

It’s a simple Excel add-in that lets you select a chart and choose which series in the chart to include (and exclude). It then builds a hidden series with a visible trendline combining all of the series that you’ve selected.

In that article I show some simple code that compiles the data into a new series. The add-in is free and unlocked, so you can look into its inner workings: the code, the dialog (UserForm), and the custom ribbon elements.

See Trendline Calculator for Multiple Series for information.

Trendline and Regression Articles on this Blog

Posted: Wednesday, December 5th, 2018 under Data Techniques.
Tags: Statistics, Trendlines.
Comments: 23
