Statistics for Agile Teams: Understanding Variation

Posted on March 26th, 2012 in Lean by siddharta

A question came up in a recent discussion about why agile teams need to understand basic statistics.

Here is why…

[Note: the following post talks about Scrum teams and velocity. The equivalent for Kanban teams is lead time, so replace velocity with lead time everywhere if you are doing Kanban]

Take the example of a Scrum team that committed to 40 points, delivered 20, then had a fiery debate during the retrospective about what went wrong and how to prevent it next time.

The next sprint they make some changes, deliver 45 points, get excited about the impact of their changes and decide that they should aim for 50 points now.

The sprint after, they deliver only 30 points. Disaster! It’s doom and gloom at the retrospective about what went wrong this time. The Scrum Master steps in and says that it’s a major regression and this can’t continue. That’s two sprints where they missed commitments by 20 points, and stakeholders aren’t impressed. He recommends that they seriously consider working weekends to hit the commitments made to the stakeholders…

How many times have we seen this pattern? Without realising it, this team is doing the absolute worst thing possible. It’s called tampering. What is tampering? Read on…

Tampering

Let’s change the context.

Assume that you drive to work every day. It takes you 30 minutes to get there. Now, that does not mean it takes you exactly 30 minutes every day. Some days it might be 25 minutes, other days it might take 35 minutes, but on average you know it’s somewhere around 30.

In fact, if you were to plot your times every day for forty days, it might look something like this (the Y-axis represents how many minutes early or late you arrive at work each day)

Most people would not raise an eyebrow at this graph. It is quite natural that the driving time varies each day. The cause of the variation could be anything: differences in traffic levels, how many red lights you hit, and so on.

But now you decide to get a little clever. You decide to look at yesterday’s time and use that to change your behaviour the following day.

The first day you reach the office in 25 minutes. So the next day you think, “yesterday it took just 25 minutes, so I can leave 5 minutes late today”. Unfortunately, today it takes 35 minutes. Add the 5 minutes you lost by starting late, and you reach the office 10 minutes late. The day after, you are cautious. You arrived 10 minutes late yesterday, so today you leave 10 minutes early. This time it takes the usual 30 minutes, and you reach the office 10 minutes early (because you started 10 minutes early).

In fact, if you followed this strategy, then with the same sequence of driving times above, this is what your result would be

With this clever new strategy, the graph swings wildly. Some days you are as much as 30 minutes early. Other days 30 minutes late.
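The two strategies can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the graphs in this post: the function name `arrival_offsets`, the 30-minute baseline and the three sample driving times (taken from the walkthrough above) are assumptions made for the example.

```python
def arrival_offsets(drive_times, usual=30, compensate=False):
    """Return minutes late (+) or early (-) for each day.

    With compensate=True, each day's departure is shifted to cancel
    out yesterday's arrival offset (the "clever" strategy).
    """
    offsets = []
    shift = 0  # minutes we leave later (+) or earlier (-) than usual
    for minutes in drive_times:
        offset = shift + (minutes - usual)
        offsets.append(offset)
        if compensate:
            shift = -offset  # react to yesterday's single data point
    return offsets


times = [25, 35, 30]  # the three days described above
print(arrival_offsets(times))                   # [-5, 5, 0]
print(arrival_offsets(times, compensate=True))  # [-5, 10, -10]
```

Under these assumptions, each compensated day carries yesterday's offset forward with its sign flipped and adds today's variation on top, which is why the swings grow instead of cancelling out.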

Let’s recap…

If you just blindly leave home 30 minutes before your start time every day, you’ll arrive between 10 minutes early and 10 minutes late each day. This is a stable system.

If you look at yesterday’s time and try to compensate the next day, the variation increases. After a few days of following this strategy, you’ll be arriving way too early or late. This system is out of control.

This is called tampering: meddling with the system when you should leave it alone.

Or as Dr. Deming said, “Don’t just do something, stand there”.

Understanding Variation

Everything has variation. Some things have less variation, others have more, but everything has variation. The variation inherent in the system is called Common Cause variation. In the driving example, the traffic pattern changes slightly and your luck with the signals changes. All this causes variation in the time you take to reach the office.

Another type of variation is called Special Cause variation. Special cause variation has an assignable cause: something specific you can point to. For example, if there is an accident or a diversion on the road, the delay due to that is special cause variation.

For special cause variation, you can point to a particular point and ask “who or what caused this?”. You can do a root cause analysis to find and eliminate the problem.
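One common way to separate the two kinds of variation is an individuals (XmR) control chart. This post does not prescribe that method, so treat the sketch below as one possible approach. It assumes a hypothetical list of past commute times and uses the standard XmR constant 2.66 to set limits around the mean. The names `control_limits` and `is_special_cause` are made up for the example.

```python
from statistics import mean


def control_limits(samples):
    """Natural process limits for an individuals (XmR) chart."""
    centre = mean(samples)
    # Moving range: absolute difference between consecutive samples
    moving_ranges = [abs(b - a) for a, b in zip(samples, samples[1:])]
    spread = 2.66 * mean(moving_ranges)  # 2.66 is the standard XmR constant
    return centre - spread, centre + spread


def is_special_cause(value, limits):
    """True if the value falls outside the limits and deserves a closer look."""
    low, high = limits
    return value < low or value > high


history = [30, 25, 35, 28, 32, 27, 33, 31, 29, 30]  # hypothetical commute times
limits = control_limits(history)
print(is_special_cause(40, limits))  # False: within routine variation
print(is_special_cause(90, limits))  # True: look for an assignable cause
```

With this made-up history the limits come out at roughly 17.6 and 42.4 minutes, so a 40-minute drive is routine while a 90-minute drive is worth a root cause analysis.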

An agile team also has variation in how it works. Some days you work well, other days things don’t go so great. This is just normal day-to-day stuff, and it causes variation. One sprint the team does 40 points, the next sprint 20 points. The sprint after, they might do more than 40 points. That’s pretty normal, and it’s usually caused by common cause variation. Sometimes you have special cause variation, e.g. if a sprint spans the Christmas and New Year week, then the loss in velocity is a special cause.

Tampering revisited

The problem arises when we mistake common cause variation for special cause variation, look for an assignable cause where there is none, and attempt to “fix it”.

When the velocity drops from 40 points last sprint to 20 points this sprint, teams typically try to figure out why that happened. The problem is, nothing might have happened. It could have just been the usual routine variation in the system. In that case, any “fixes” that are applied may end up tampering with the system. The next sprint, the velocity changes again due to routine variation (plus any side effects from the “fix”). Teams then scramble to figure out what happened this time.

The result is a system that goes out of control.

The only two ways to hit a constant velocity

Everyone says that agile teams should have a constant velocity. There are only two ways to do this:

  1. Be an insanely amazing team that has absolutely no variation at all
  2. Game the numbers [Slow down when you are ahead, and work overtime when you are behind]

There are probably only a handful of teams (I might even go out on a limb and say there are none) that can produce perfectly consistently without any variation whatsoever.

That means if your velocity is constant, you’re gaming it. Worse, it probably means you are working overtime to get there. And once you hit the number, you’re going to be working overtime most of the other sprints in order to keep your velocity there.

The difference between tampering and process improvement

Some of you might ask, “what about process improvement? Isn’t that tampering?”. No, it’s not. Tampering is looking at one data point which does not have any special cause variation, drawing conclusions from it, and attempting to fix the system based on it. Process improvement looks at a much longer period of time to draw conclusions from.

“Yesterday’s weather” (planning the next sprint from the last sprint’s number alone) is a bad way to plan and, as we saw in the second graph above, it can lead to severe problems.

Some teams use the average of the last five sprints, which is a better approach. Say that average is 30 points. You did 40 points this sprint? Great, now plan for 30 points next sprint. You did 20 points? That’s okay, plan for 30 points again.
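A minimal sketch of that averaging approach, assuming a hypothetical velocity history and a made-up helper called `planning_velocity`:

```python
from statistics import mean


def planning_velocity(velocities, window=5):
    """Average of the most recent `window` sprints."""
    return mean(velocities[-window:])


history = [30, 25, 35, 20, 40]             # hypothetical last five sprints
print(planning_velocity(history))          # 30
print(planning_velocity(history + [20]))   # 28
```

In this made-up history, a 20-point sprint right after a 40-point one moves the planning number by only 2 points, whereas planning from the last sprint alone would swing it by 20.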

Also, don’t mistake variation for improvement. So you did 30 points the last sprint, and did 40 points this sprint? Hold the celebrations. How do you know that this is improvement and not just usual variation?

Answer: You don’t. Two points don’t make a trend. Only recalibrate your forecast after delivering more than planned for 4-5 sprints in a row.
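The “several sprints in a row” rule can be written down as a simple check. This is only a sketch: the function name `should_recalibrate`, the default run length of 4 and the sample numbers are assumptions for illustration.

```python
def should_recalibrate(planned, delivered, run_length=4):
    """True only if each of the last `run_length` sprints beat its plan."""
    if len(planned) != len(delivered):
        raise ValueError("planned and delivered must have the same length")
    if len(delivered) < run_length:
        return False  # not enough history to call it a trend
    recent = list(zip(planned, delivered))[-run_length:]
    return all(done > plan for plan, done in recent)


# One good sprint is not a trend
print(should_recalibrate([30, 30], [30, 40]))                   # False
# Four sprints in a row above plan
print(should_recalibrate([30, 30, 30, 30], [34, 38, 33, 41]))   # True
# The run is broken by one sprint below plan
print(should_recalibrate([30, 30, 30, 30], [34, 38, 28, 41]))   # False
```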

Even better, don’t use a target sprint commitment at all. Let the team do however much they can, and deliver whatever was completed.

 


3 Responses to “Statistics for Agile Teams: Understanding Variation”

  1. Leena Says:

    Hi Sidhartha,

    Interesting post. So what you are saying is that since the variations are normal, it’s OK to have velocity variations across iterations. A couple of questions though:

    1. Are you saying that we should treat all variations as normal and not react to them at all? Don’t we need to think about what can be done to achieve the stability that is called “sustainable pace”?

    2. I am a little confused about the “Process Improvement” you are referring to. Are you talking about it in general, or about improvements in process to achieve stability in velocity?

    Thanks,
    Leena

  2. siddharta Says:

    Hi Leena,

    Thanks for your questions.

    Some variation is normal, but there is also some variation that you need to react to.

    For example, if it generally takes you about 30 minutes to reach work, and today it takes you 40 minutes, then that’s probably normal variation. But if it takes you 90 minutes today, then something odd has happened and you need to find out what it is. Usually it means there is a special cause of variation that day: an accident, a road diversion, a flat tire, etc.

    There are ways to differentiate the two, which I’ll cover later in the series.

    You bring up a good question on sustainable pace. Again, let’s look at the driving example. Look at the first graph above. In normal circumstances, it takes 20-40 minutes (avg 30 min) to arrive at work. That is the natural, sustainable pace.

    Now suppose I say that you should take *exactly* 30 minutes to get to work, not a minute more or less. Do you think that’s sustainable? Probably not.

    In the same way, hitting the exact same velocity number every sprint is not a sustainable pace.

    I know a lot of people talk about having a sustainable pace as hitting the exact same velocity number every sprint. In my opinion, that is very wrong and harmful.

    Re: Process Improvement. Suppose you start off without doing TDD, then after a few sprints you adopt TDD. This causes your velocity to rise. Now, is this tampering, or is it genuine process improvement? This is an important question. I’ll blog more about this later in the series.

  3. Leena Says:

    Thanks a lot Sidhartha for the clarifications. Looking forward to your later posts.
