The Future is Open

Jesse Thaler (MIT)

In November 2014, the CMS experiment at the Large Hadron Collider (LHC) released the first batch of research-grade data from the 2010 proton-proton collision run. This was an unprecedented move in the field of particle physics, since up until that point, access to hadron collider data had been restricted to members of the experimental collaborations.

When I heard about the CMS open data project, I immediately downloaded the CERN Virtual Machine to see what kind of data had been made available. As a theoretical particle physicist, I can slam together particles and study their debris… on my chalkboard, or through pen-and-paper calculations, or using software simulations. For the first time, I had access to real collision data from a cutting-edge experiment, as well as an opportunity to demonstrate the scientific value of public data access.

It was not easy. In fact, it was one of the most challenging research projects of my career. But roughly three years later, in 2017, my research group proudly published two journal articles using CMS open data, one in Physical Review Letters and one in Physical Review D. And from our experience, I can say confidently that the future of particle physics is open.

Putting Theory into Practice

In particle physics, there has long been a division between theorists like myself and experimentalists who work directly with collision data. There are good reasons for this divide, since the expertise needed to perform theoretical calculations is rather different from the expertise needed to build and deploy particle detectors. That said, there is substantial overlap between theory and experiment in the area of data analysis, which requires an understanding of statistics, and data interpretation, which requires an understanding of the underlying physical principles at play.

One of the main reasons for restricting data access is that collider data is extremely complicated to interpret properly. As an example, the center-of-mass collision energy of the LHC in 2010 was 7 TeV, and by conservation of energy, one should never find more than 7 TeV of total energy in the debris of a single collision event. In the CMS open data, however, we found an event with over 10 TeV of total energy. Was this dramatic evidence for a subtle violation of the laws of nature? Or just a detector glitch? Not surprisingly, this event did not pass the recommended data quality cuts from CMS, which demonstrates the importance of having a detailed knowledge of particle detectors before claiming evidence for new physics.
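To make this concrete, here is a minimal Python sketch of the kind of sanity check involved. It is purely illustrative: the field names (passes_quality_flags, energy_gev, and so on) are hypothetical placeholders, not the actual CMS open data format, and the real recommended cuts are far more detailed.

# Toy event-selection sketch (illustrative only): drop events failing quality
# cuts and discard events whose summed energy exceeds the beam energy.

SQRT_S_TEV = 7.0  # LHC proton-proton center-of-mass energy in 2010

def select_events(events):
    """Keep events that pass quality cuts and have a physically sensible energy."""
    selected = []
    for event in events:
        if not event["passes_quality_flags"]:   # hypothetical quality-cut flag
            continue                            # e.g. detector glitches, beam backgrounds
        total_e_tev = sum(p["energy_gev"] for p in event["particles"]) / 1000.0
        if total_e_tev > SQRT_S_TEV:
            continue                            # more energy than the beams provided: mismeasured
        selected.append(event)
    return selected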

Therefore, progress in particle physics typically proceeds via a vigorous dialogue between the theoretical and experimental communities. An experimental advance can inspire a new theoretical method, which launches a new experimental measurement, which motivates a new theoretical calculation, and so on. While there are some theoretical physicists who have officially joined an experimental team, either in a short-term advisory role or as a long-term collaboration member, that is relatively rare. Thus, the best way for me to influence how LHC data is analyzed is to write and publish a paper, and I'm proud that a number of my theoretical ideas have found applications at the LHC.

With the release of the CMS open data, though, I was presented with the opportunity to perform exploratory physics studies directly on data. My friend (and CMS open data consultant) Sal Rappoccio is fond of saying that “data makes you smarter”. This aphorism applies both to detector effects, where “smarter” means processing the data with improved precision and robustness, and to physics effects, where “smarter” means extracting new kinds of information from the collision debris. So while I didn't know exactly what I wanted to do with the CMS open data when I first downloaded the CERN Virtual Machine, I knew that, no matter what, I was going to learn something.

Gathering a Team

The first thing I learned was somewhat demoralizing, since within the first few weeks, I realized that I had neither the coding proficiency nor the stamina to wrestle with the CMS open data by myself. While I regularly use particle physics simulation tools like Pythia and Delphes, the CMS software framework required a much higher level of sophistication and care than I was used to.

Luckily, an MIT postdoctoral fellow, Wei Xue (now at CERN), had extensive experience using public data from the Fermi Large Area Telescope, and he started processing the 2 terabytes of data in the Jet Primary Dataset (more about that later). Around the same time, an ambitious MIT sophomore, Aashish Tripathee (now a graduate student at the University of Michigan), joined the project with no prior experience in particle physics but ample enthusiasm and a solid background in programming.

So what were we actually going to do with the data? My first idea was to try out a somewhat obscure LHC analysis technique my collaborators and I had developed in 2013, since it had never been tested directly on LHC data. (It may eventually be incorporated into a hardware upgrade of the ATLAS detector, or it may remain in obscurity.) Wei was even able to make a plot (slide 29) for me to show in March 2015 as part of a long-range planning study for the next collider after the LHC. There is a big difference, though, between making a plot and really understanding the physics at play, and despite our having performed a precision calculation for this technique, it was not clear that we could carry out a robust analysis.

In early 2015, though, I had the pleasure of collaborating with two MIT postdoctoral fellows, Andrew Larkoski (now at Reed College) and Simone Marzani (now at University of Genova), to develop a novel method to analyze jets at the LHC. While new, this method had a “timeless” quality to it, exhibiting remarkable theoretical robustness that we hoped would carry over to the experimental regime.

The Substructure of Jets

Jets are collimated sprays of particles that arise whenever quarks and gluons are produced in high-energy collisions. Almost every collision at the LHC involves jets in some way, either as part of the signal of interest or as an important component of the background. In the 2010 CMS open data, the Jet Primary Dataset contains collision events exhibiting a wide range of different jet configurations, from the most ubiquitous case of back-to-back jet pairs, to more exotic cases with just a single jet (which might be a signal of dark matter) or with a high multiplicity of energetic jets (which might arise from black hole production).

While the presence of jets has been known since 1975 (and arguably even earlier than that), there has been remarkable progress in the past decade in understanding the substructure of jets. A typical jet has around 10-20 particles, and the pattern of those particles encodes information about whether the jet comes from a quark or from a gluon or from a more exotic object. The field of jet substructure continues to be an active area of development in collider physics, as summarized in this review article.
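To give a concrete (if highly simplified) sense of what it means for the pattern of particles to encode information, here is a short Python sketch, not taken from our analysis, that computes two basic substructure observables from a jet's constituent four-vectors: the constituent multiplicity and the jet mass. Both tend to be larger for gluon jets than for quark jets of the same energy, which is the kind of handle that jet substructure techniques exploit.

import math

# Toy jet-substructure observables (illustrative only).
# Each constituent is a four-vector (E, px, py, pz) in GeV.

def jet_mass(constituents):
    """Invariant mass of the sum of the constituent four-vectors."""
    E  = sum(c[0] for c in constituents)
    px = sum(c[1] for c in constituents)
    py = sum(c[2] for c in constituents)
    pz = sum(c[3] for c in constituents)
    return math.sqrt(max(E**2 - px**2 - py**2 - pz**2, 0.0))

def multiplicity(constituents):
    """Number of particles clustered into the jet."""
    return len(constituents)

# Example with two constituents moving mostly along the z axis:
jet = [(50.0, 5.0, 0.0, 49.0), (40.0, -3.0, 2.0, 39.0)]
print(multiplicity(jet), round(jet_mass(jet), 1))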

A fascinating feature of jets is that their substructure is self-similar, and this “fractal” behavior is a consequence of the way quarks and gluons repeatedly radiate further quarks and gluons across a wide range of energy scales.
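As a cartoon of this self-similarity (and only a cartoon: the real story is governed by the QCD splitting functions and a proper parton shower), the toy Python model below splits a particle's energy recursively, with the energy-sharing fraction drawn from the same 1/z-like distribution at every step. Because the splitting rule is independent of the absolute energy, the branching tree looks statistically the same at every scale.

import random

# Toy self-similar splitting: a cartoon of "fractal" jet substructure.
# At every step, an energy fraction z is drawn from a distribution proportional
# to 1/z, independent of the absolute energy, until a low-energy cutoff is reached.

Z_MIN = 0.05   # avoid the divergence as z -> 0
E_STOP = 5.0   # stop splitting below this energy (GeV), a stand-in for hadronization

def sample_z():
    """Draw z in [Z_MIN, 0.5] with probability density proportional to 1/z."""
    return Z_MIN * (0.5 / Z_MIN) ** random.random()

def shower(energy):
    """Recursively split a particle; return the list of final-particle energies."""
    if energy < E_STOP:
        return [energy]
    z = sample_z()
    return shower(z * energy) + shower((1.0 - z) * energy)

if __name__ == "__main__":
    particles = shower(100.0)  # shower a 100 GeV "jet"
    print(f"{len(particles)} final particles from a 100 GeV jet")

Running this a few times gives trees with different particle counts but the same qualitative structure, much as no two jets at the LHC look identical even though they share a common statistical fingerprint.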
