                                            ## Posts Tagged ‘data analysis’

### A sigma here, a sigma there…

Wednesday, May 9th, 2012

Whenever we come across a new result one of the first things we ask is “How many sigma is it?!” It’s a strange question, and one that deserves a good answer. What is a sigma? How do sigmas get (mis)used? How many sigmas is enough?

The name “sigma” refers to the symbol for the standard deviation, σ. When someone says “It’s a one sigma result!” what they really mean is “If you drew a graph of the underlying model, this result would sit one standard deviation away from it.” Or to use a simple analogy: the mean height of adult men in the USA is 178cm, with a standard deviation of 8cm. A man measuring 170cm is one standard deviation below the mean, so we could call him a one sigma effect. As you can probably guess, saying something is a one sigma effect is not very impressive. We need to know a bit more about sigmas before we can say anything meaningful.
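That comparison is nothing more than a z-score: the distance from the mean, divided by the standard deviation. A minimal sketch in Python, using the numbers from the height analogy above:

```python
def z_score(value, mean, sigma):
    """How many standard deviations separate value from mean."""
    return (value - mean) / sigma

# The height example: a 170cm man against a 178cm mean with an 8cm spread.
print(z_score(170, 178, 8))  # -1.0, i.e. one sigma below the norm
```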

The term sigma is usually used for the Gaussian (or normal) distribution, and the normal distribution looks like this:

The area under the curve tells us the population in that region. We can color in the region that is more than one sigma away from the mean on the high side like this:

This accounts for about one sixth of the total, so the probability of getting a one sigma fluctuation upward is about 16%. If we include the downward fluctuations (on the low side of the peak) as well then this becomes about 32%.

If we color in a few more sigmas, we can see that the probability of getting a two, three, four, or five sigma effect above the underlying distribution is about 2%, 0.1%, 0.003%, and 0.00003%, respectively. A five sigma result is much more than five times as impressive as a one sigma result!

*The normal distribution with each sigma band shown in a different color. Within one sigma is green, two sigma is yellow, three sigma is... well, can you see past the second sigma?*
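Those tail probabilities are easy to reproduce yourself. Here is a small sketch using the complementary error function from Python's standard library; the one-sided tail of a standard normal distribution beyond n sigma is erfc(n/√2)/2:

```python
from math import erfc, sqrt

def one_sided_tail(n_sigma):
    """Probability of a Gaussian fluctuation at least n_sigma above the mean."""
    return 0.5 * erfc(n_sigma / sqrt(2))

for n in range(1, 6):
    # roughly 16%, 2%, 0.1%, 0.003%, 0.00003% for 1 through 5 sigma
    print(f"{n} sigma: {one_sided_tail(n):.7%}")
```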

When confronted with a result that is (for example) three sigma above what we expect we have to accept one of two conclusions:

1. the distribution shows a fluctuation that has a roughly one in 700 chance of happening at that point
2. there is some effect that is not accounted for in the model (e.g. a new particle exists, perhaps a massive scalar boson!)

Unfortunately it’s not as simple as that, since we have to ask ourselves “What is the probability of getting a deviation that large somewhere in the distribution?” rather than “What is the probability of getting it at a single data point?”. Let’s say we have a spectrum with 100 data points. The probability that every single one of those data points will be within the one sigma band (upward and downward fluctuations) is 68% to the power 100, or $$2\times 10^{-17}$$, a tiny number! In fact, we should expect one sigma effects in every plot we see! By comparison, the probability that every point falls within the three sigma band is 76%, and for five sigma it’s so close to 100% it’s not even worth writing out.
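The same standard-library function gives the “every point in the band” numbers quoted above: the two-sided coverage within n sigma is 1 − erfc(n/√2), raised to the power of the number of independent data points.

```python
from math import erfc, sqrt

def prob_all_within(n_sigma, n_points):
    """Probability that every one of n_points independent measurements
    lands inside the two-sided +/- n_sigma band."""
    coverage = 1.0 - erfc(n_sigma / sqrt(2))  # ~68% for one sigma
    return coverage ** n_points

print(prob_all_within(1, 100))  # a few times 1e-17: one sigma outliers are guaranteed
print(prob_all_within(3, 100))  # about 0.76: three sigma outliers are rare
```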

A typical distribution with a one sigma band drawn on it looks like the plot below. There are plenty of one and two sigma deviations. So whenever you hear someone say “It’s an X sigma effect!” ask them how many data points there are. Ask them what the probability of seeing an X sigma effect somewhere in the plot is. Three sigma is unlikely for 100 data points. Five sigma is pretty much unheard of for that many data points!
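You can check this claim by brute force. The sketch below simulates many 100-point spectra of pure Gaussian noise and counts how often at least one point fluctuates past three sigma; the seed and trial count are arbitrary choices for illustration:

```python
import random

random.seed(12345)
N_POINTS = 100     # data points per spectrum
N_TRIALS = 20_000  # simulated spectra of pure noise

hits = sum(
    1 for _ in range(N_TRIALS)
    if any(abs(random.gauss(0, 1)) >= 3 for _ in range(N_POINTS))
)
fraction = hits / N_TRIALS

# analytic expectation: 1 - 0.9973**100, i.e. roughly a quarter of spectra
print(f"spectra with at least one three sigma point: {fraction:.1%}")
```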

So far we’ve only looked at statistical effects, and found the probability of getting an X sigma deviation due to fluctuations. Let’s consider what happens with systematic uncertainties. Suppose we have a spectrum that looks like this:

It seems like we have a two-to-three sigma effect at the fourth data point. But if we look more closely we can see that the fifth data point looks a little low. Here we have to accept one of three conclusions:

1. the distribution shows a fluctuation that has a one in 50 chance of happening (when we take all the data points into account)
2. there is some effect that is not accounted for in the model
3. the model is correct, but something is causing events from one data point to “migrate” to another data point

In many cases the third conclusion will be correct. There are all kinds of non-trivial effects which can change the shape of the data points, push events around from one data point to another and create false peaks where really, there is nothing to discover. In fact I generated the distribution randomly and then manually moved 20 events from the 5th data point to the 4th data point. The correct distribution looks like this:
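The migration trick is easy to reproduce with made-up bin counts. In this sketch (the numbers are hypothetical, chosen to mirror the example above) each bin of a flat spectrum expects 100 events, so the Poisson uncertainty per bin is √100 = 10, and moving 20 events from one bin to its neighbour fakes a two sigma peak right next to a two sigma dip:

```python
from math import sqrt

expected = [100] * 8
# the same spectrum after 20 events "migrate" from bin 5 into bin 4
observed = [100, 100, 100, 120, 80, 100, 100, 100]

deviations = [(obs - exp) / sqrt(exp) for exp, obs in zip(expected, observed)]
for i, d in enumerate(deviations, start=1):
    print(f"bin {i}: {d:+.1f} sigma")  # bin 4 is +2.0, bin 5 is -2.0
```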

So when we throw around sigmas in conversation we should also ask people what the shape of the data points looks like. If there is a suspicious downward fluctuation in the vicinity of an upward fluctuation be careful! Similarly, if someone points to an upward fluctuation while ignoring a similarly sized downward fluctuation, be careful! Fluctuations happen all the time, because of statistical effects and systematic effects. Take X sigma with a pinch of salt. Ask for more details and look at the whole spectrum available. Ask for a probability that the effect is due to the underlying model.

Most of the time it’s a matter of “A sigma here, a sigma there, it all balances out in the end.” It’s only when the sigmas continue to pile up as we add more data that we should start to take things seriously. Right now I’d say we’re at the point where a potential Higgs discovery could go either way. There’s a good chance that there is a Higgs at 125 GeV, but there’s also a reasonable chance that it’s just a fluctuation. We’ve seen so many bumps and false alarms over the years that another one would not be a big surprise. Keep watching those sigmas! The magic number is five.

### Analyzing New Data: Never the Same Twice

Monday, February 20th, 2012

Physicists did a lot of planning for data analysis before the LHC ever ran, and we’ve put together a huge number of analyses since it started. We’ve already looked for most of the things we’ll ever look for. Of course, many of the things we’ve looked for haven’t shown up yet; in fact, in many cases including the Higgs, we didn’t expect them to show up yet! We’ll have to repeat the analysis on more data. But that’s got to be easier than it was to collect and analyze the data the first time, right? Well, not necessarily. We always hope it will be easier the second or third time around, but the truth is that updating an analysis is a lot more complicated than just putting more numbers into a spreadsheet.

For starters, every batch of new data is collected under different conditions. For example, going from 2011 to 2012, the LHC beam energy will be increasing. The number of collisions per crossing will be larger too, which means the triggers we use to collect our data are changing as well. All our calculations of what the pileup on top of each interesting collision looks like will change. Some of our detectors might work better as we fix glitches, or they might work worse as they are damaged in the course of running. All these details affect the calculations for the analysis and the optimal way to put the data together.

But even if we were running on completely stable conditions, there are other reasons an analysis has to be updated as you collect more data. When you have more events to look at, you might be interested in limiting the events you look at to those you understand best. (In other words, if an analysis was previously limited by statistical uncertainties, as those shrink, you want to get rid of your largest systematic uncertainties.) To get all the power out of the new data you’ve got, you might have to study new classes of events, or get a better understanding of questions where your understanding was “good enough.”

So analyzing LHC data is really an iterative process. Collecting more data is always presenting new challenges and new opportunities that require understanding things better than before. No analysis is ever the same twice.