This shows you the differences between two versions of the page.
| playground:cmstest [2017/11/20 18:53] jthaler | playground:cmstest [2017/11/28 14:02] (current) jthaler | ||
|---|---|---|---|
| Line 18: | Line 18: | ||
| Because of these complications, progress in particle physics typically proceeds via a vigorous dialogue between the theoretical and experimental communities.  An experimental advance can inspire a new theoretical method, which launches a new experimental measurement, which motivates a new theoretical calculation, and so on. While there are some theoretical physicists who have officially joined an experimental team, either in a short-term advisory role or as a long-term collaboration member, that is relatively rare. Thus, the best way for me to influence how LHC data is analyzed is to write and publish a paper, and I'm proud that a number of my theoretical ideas have found applications at the LHC. | Because of these complications, progress in particle physics typically proceeds via a vigorous dialogue between the theoretical and experimental communities.  An experimental advance can inspire a new theoretical method, which launches a new experimental measurement, which motivates a new theoretical calculation, and so on. While there are some theoretical physicists who have officially joined an experimental team, either in a short-term advisory role or as a long-term collaboration member, that is relatively rare. Thus, the best way for me to influence how LHC data is analyzed is to write and publish a paper, and I'm proud that a number of my theoretical ideas have found applications at the LHC. | ||
| - | With the release of the CMS open data, though, I was presented with the opportunity to perform exploratory physics studies directly on data. My friend (and CMS open data consultant)  [[https://arts-sciences.buffalo.edu/physics/faculty/salvatore-rappoccio.html|Sal Rappoccio]] is fond of saying that "data makes you smarter".  This aphorism applies both to detector effects, where "smarter" means processing the data with improved precision and robustness, and to physics effects, where "smarter" means extracting new kinds of information from the collision debris.  So while I didn't know exactly what I wanted to do with the CMS open data when I first downloaded the CERN Virtual Machine, I knew that, no matter what, I was going to learn something. | + | With the release of the CMS open data, though, I was presented with the opportunity to perform exploratory physics studies directly on data. My friend (and CMS open data consultant) [[https://arts-sciences.buffalo.edu/physics/faculty/salvatore-rappoccio.html|Sal Rappoccio]] always reminds us of the apocryphal saying: "data makes you smarter".  This aphorism applies both to detector effects, where "smarter" means processing the data with improved precision and robustness, and to physics effects, where "smarter" means extracting new kinds of information from the collision debris.  So while I didn't know exactly what I wanted to do with the CMS open data when I first downloaded the CERN Virtual Machine, I knew that, no matter what, I was going to learn something. | 
| Line 27: | Line 27: | ||
| Luckily, an MIT postdoctoral fellow [[https://th-dep.web.cern.ch/roster/xue-wei|Wei Xue]] (now at CERN) had extensive experience using public data from the [[https://fermi.gsfc.nasa.gov/ssc/data/|Fermi Large Area Telescope]], and he started processing the 2 Terabytes of data in the [[http://opendata.cern.ch/record/5|Jet Primary Dataset]] (more about that later).  Around the same time, an ambitious MIT sophomore [[https://lsa.umich.edu/physics/people/graduate-students/aashisht.html|Aashish Tripathee]] (now a graduate student at University of Michigan) joined the project with no prior experience in particle physics but ample enthusiasm and a solid background in programming. | Luckily, an MIT postdoctoral fellow [[https://th-dep.web.cern.ch/roster/xue-wei|Wei Xue]] (now at CERN) had extensive experience using public data from the [[https://fermi.gsfc.nasa.gov/ssc/data/|Fermi Large Area Telescope]], and he started processing the 2 Terabytes of data in the [[http://opendata.cern.ch/record/5|Jet Primary Dataset]] (more about that later).  Around the same time, an ambitious MIT sophomore [[https://lsa.umich.edu/physics/people/graduate-students/aashisht.html|Aashish Tripathee]] (now a graduate student at University of Michigan) joined the project with no prior experience in particle physics but ample enthusiasm and a solid background in programming. | ||
| - | So what were we actually going to do with the data? My first idea was to try out a [[https://arxiv.org/abs/1310.7584|somewhat obscure]] LHC analysis technique my collaborators and I had developed in 2013, since it had never been tested directly on LHC data. (It may eventually be incorporated into a [[https://hep.uchicago.edu/atlas/trigger/|hardware upgrade]] of the ATLAS detector, or it may remain in obscurity.)  Wei was even able to make a plot [[https://indico.cern.ch/event/340703/contributions/802184/attachments/668768/919259/jthaler_FCC_nobuilds.pdf|(slide 29)]] for me to show in March 2015 as part of a long-range planning study for the next collider after the LHC. There is a big difference, though, between making a plot and really understanding the physics at play, and despite performing a [[https://arxiv.org/abs/1501.01965|precision calculation]] of this technique, it was not clear we could do a robust analysis. | + | So what were we actually going to do with the data? My first idea was to try out a [[https://arxiv.org/abs/1310.7584|somewhat obscure]] LHC analysis technique my collaborators and I had developed in 2013, since it had never been tested directly on LHC data. (It may eventually be incorporated into a [[https://hep.uchicago.edu/atlas/trigger/|hardware upgrade]] of the ATLAS detector, or it may remain in obscurity.)  Wei was even able to make a plot [[https://indico.cern.ch/event/340703/contributions/802184/attachments/668768/919259/jthaler_FCC_nobuilds.pdf|(slide 29)]] for me to show in March 2015 as part of a long-range planning study for the next collider after the LHC. There is a big difference, though, between making a plot and really understanding the physics at play, and despite performing a [[https://arxiv.org/abs/1501.01965|precision calculation]] of this technique, it was not clear whether we could do a robust analysis. | 
| In early 2015, though, I had the pleasure of collaborating with two MIT postdoctoral fellows, [[http://people.reed.edu/~larkoski/|Andrew Larkoski]] (now at Reed College) and [[https://www.difi.unige.it/it/dipartimento/persone/marzani-simone|Simone Marzani]] (now at University of Genova), to develop a [[https://arxiv.org/abs/1502.01719|novel method]] to analyze jets at the LHC. While new, this method had a timeless quality to it, exhibiting remarkable theoretical robustness that we hoped would carry over into the experimental regime. | In early 2015, though, I had the pleasure of collaborating with two MIT postdoctoral fellows, [[http://people.reed.edu/~larkoski/|Andrew Larkoski]] (now at Reed College) and [[https://www.difi.unige.it/it/dipartimento/persone/marzani-simone|Simone Marzani]] (now at University of Genova), to develop a [[https://arxiv.org/abs/1502.01719|novel method]] to analyze jets at the LHC. While new, this method had a timeless quality to it, exhibiting remarkable theoretical robustness that we hoped would carry over into the experimental regime. | ||
| Line 34: | Line 34: | ||
| ==== The Substructure of Jets ==== | ==== The Substructure of Jets ==== | ||
| - | Jets are collimated sprays of particles that arise whenever quarks and gluons are produced in high-energy collisions.  Almost every collision at the LHC involves jets in some way, either as part of the signal of interest or as an important component of the background noise.  In the 2010 CMS open data, the [[http://opendata.cern.ch/record/5|Jet Primary Dataset]] contains collision events exhibiting a wide range of different jet configurations, from the most ubiquitous case with [[https://arxiv.org/abs/1705.02628|back-to-back jet pairs]], to the exotic case with just a single jet (which might be a [[https://arxiv.org/abs/1703.01651|signal of dark matter]]), to the explosive case with a high multiplicity of energetic jets (which might arise from [[https://arxiv.org/abs/1705.01403|black hole production]]). | + | Jets are collimated sprays of particles that arise whenever quarks and gluons are produced in high-energy collisions.  Almost every collision at the LHC involves jets in some way, either as part of the signal of interest or as an important component of the background noise.  In the 2010 CMS open data, the [[http://opendata.cern.ch/record/5|Jet Primary Dataset]] contains collision events exhibiting a wide range of different jet configurations, from the most ubiquitous case with [[https://arxiv.org/abs/1705.02628|back-to-back jet pairs]], to the more exotic case with just a single jet (which might be a [[https://arxiv.org/abs/1703.01651|signal of dark matter]]), to the explosive case with a high multiplicity of energetic jets (which might arise from [[https://arxiv.org/abs/1705.01403|black hole production]]). | 
| - | While the formation of jets in high-energy collisions has been known [[https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.35.1609|since 1975]] (and arguably even earlier than that), there has been remarkable progress in the past decade in understanding the [[https://arxiv.org/abs/1709.04464|substructure of jets]].  A typical jet is composed of around 10-30 individual particles, and the pattern of those particles encodes subtle information about whether the jet comes from a quark, or from a gluon, or from a more exotic object.  Jet substructure continues to be an active area of development in collider physics, with many new advances made every year. | + | While the formation of jets in high-energy collisions has been known [[https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.35.1609|since 1975]] (and arguably even earlier than that), there has been remarkable progress in the past decade in understanding the [[https://arxiv.org/abs/1709.04464|substructure of jets]].  A typical jet is composed of around 10-30 individual particles, and the pattern of those particles encodes subtle information about whether the jet comes from a quark, or from a gluon, or from a more exotic object.  Jet substructure continues to be an active area of development in collider physics, with many new [[https://arxiv.org/abs/1012.5412|advances]] [[https://arxiv.org/abs/1201.0008|made]] [[https://arxiv.org/abs/1311.2708|every]] [[https://arxiv.org/abs/1504.00679|year]]. | 
| A fascinating feature of jets is that they exhibit fractal-like behavior.  As one zooms in on a jet and examines its substructure, one finds that the substructure itself has sub-substructure, which has sub-sub-substructure, and so on. This recursive self-similar behavior is captured by the [[http://www.sciencedirect.com/science/article/pii/0550321377903844?via%3Dihub|"QCD splitting functions"]], which describe how a quark or gluon fragments into more quarks and gluons.  (QCD refers to quantum chromodynamics, which is the theory that describes the interactions of quarks and gluons.)  While the QCD splitting functions are well-known and have been indirectly tested through a multitude of collider measurements, they had never before been tested directly. | A fascinating feature of jets is that they exhibit fractal-like behavior.  As one zooms in on a jet and examines its substructure, one finds that the substructure itself has sub-substructure, which has sub-sub-substructure, and so on. This recursive self-similar behavior is captured by the [[http://www.sciencedirect.com/science/article/pii/0550321377903844?via%3Dihub|"QCD splitting functions"]], which describe how a quark or gluon fragments into more quarks and gluons.  (QCD refers to quantum chromodynamics, which is the theory that describes the interactions of quarks and gluons.)  While the QCD splitting functions are well-known and have been indirectly tested through a multitude of collider measurements, they had never before been tested directly. | ||
| Line 60: | Line 60: | ||
| ==== Learning from the Community ==== | ==== Learning from the Community ==== | ||
| - | While our two publications only list 5 authors (Aashish, Wei, Andrew, Simone, and myself), our acknowledgements recognize around 40 experimentalists who generously offered their time, advice, and, in some cases, code. Without help from [[https://arts-sciences.buffalo.edu/physics/faculty/salvatore-rappoccio.html|Sal Rappoccio]], we would have struggled to figure out how the extract the proper jet correction factors.  Without help from the CMS open data team, including [[https://tuhat.helsinki.fi/portal/en/person/kmlassil|Kati Lassila-Perini]] and [[http://www.desy.de/~geiser/|Achim Geiser]], we would have never figured out how to determine the "integrated luminosity", which tells you how much total data CMS had collected.  Whenever I gave talks about our CMS open data effort, experimentalists in the audience would kindly point out some of our "rookie mistakes" (often made by starting experimental PhD students).  We also benefitted from having a 2015 summer student [[https://alexis-romero.weebly.com/|Alexis Romero]] (now a graduate student at University of California, Irvine) test whether the CMS open data results agreed with those obtained from simulated LHC samples. | + | While our two publications only list 5 authors (Aashish, Wei, Andrew, Simone, and myself), our acknowledgements recognize around 40 experimentalists who generously offered their time, advice, and, in some cases, code. Without help from [[https://arts-sciences.buffalo.edu/physics/faculty/salvatore-rappoccio.html|Sal Rappoccio]], we would have struggled to figure out how to extract and apply the proper jet correction factors.  Without help from the CMS open data team, including [[https://tuhat.helsinki.fi/portal/en/person/kmlassil|Kati Lassila-Perini]] and [[http://www.desy.de/~geiser/|Achim Geiser]], we would have never figured out how to determine the "integrated luminosity", which tells you how much total data CMS had collected.  Whenever I gave talks about our CMS open data effort, experimentalists in the audience would kindly point out some of our "rookie mistakes" (often made by starting experimental PhD students).  We also benefitted from having a 2015 summer student [[https://alexis-romero.weebly.com/|Alexis Romero]] (now a graduate student at University of California, Irvine) test whether the CMS open data results agreed with those obtained from simulated LHC samples. | 
| Most of the feedback we got from the experimental particle physics community was [[https://twitter.com/KyleCranmer/statuses/913112593715335168|very positive]].  Though there was considerable initial skepticism that a team of 5 theoretical physicists could perform a publishable analysis based on open collider data, much of that skepticism dissipated once it became clear that our analysis was based largely on the same workflow used by CMS. Our analysis is by no means perfect, since there are places where we simply didn't have the information (or the expertise) to address a known shortcoming.  But I am proud that we applied a high degree of scrutiny to our own work, even though the final plots in our September 2017 publication are essentially the same as the ones I showed back in August 2015. | Most of the feedback we got from the experimental particle physics community was [[https://twitter.com/KyleCranmer/statuses/913112593715335168|very positive]].  Though there was considerable initial skepticism that a team of 5 theoretical physicists could perform a publishable analysis based on open collider data, much of that skepticism dissipated once it became clear that our analysis was based largely on the same workflow used by CMS. Our analysis is by no means perfect, since there are places where we simply didn't have the information (or the expertise) to address a known shortcoming.  But I am proud that we applied a high degree of scrutiny to our own work, even though the final plots in our September 2017 publication are essentially the same as the ones I showed back in August 2015. | ||
| Line 68: | Line 68: | ||
| In my view, though, the scientific benefits of making data public outweigh the scientific costs.  With the CMS open data, there is a 4-5 year time lag between when the data is collected and when it is made public.  That time lag helps ensure that open data complements, rather than competes with, the needs of the CMS collaboration.  Moreover, open data is a stepping stone towards full archival access, such that even when the LHC is eventually decommissioned, the data will be preserved for future use. By making the data public, there is a chance to perform a back-to-the-future analysis like ours, where 2010 data, released in 2014, is analyzed using a 2015 technique, for publication in 2017. | In my view, though, the scientific benefits of making data public outweigh the scientific costs.  With the CMS open data, there is a 4-5 year time lag between when the data is collected and when it is made public.  That time lag helps ensure that open data complements, rather than competes with, the needs of the CMS collaboration.  Moreover, open data is a stepping stone towards full archival access, such that even when the LHC is eventually decommissioned, the data will be preserved for future use. By making the data public, there is a chance to perform a back-to-the-future analysis like ours, where 2010 data, released in 2014, is analyzed using a 2015 technique, for publication in 2017. | ||
| - | Interestingly, as we were pursuing our open data analysis, there was an [[https://arxiv.org/abs/1708.09429|official CMS analysis]] on a similar topic.  Our analysis was based on proton-proton collisions from 2010, while the CMS analysis was based mostly on lead-lead collisions from 2015. Our analysis was an exploratory study of jet substructure, while the CMS analysis was far more ambitious, using jet substructure to probe the properties of a hot dense state of matter called the quark/gluon plasma.  One could cynically say that our analysis was stealing thunder from CMS, but I see these two studies as being synergistic, since we made different analysis choices that led to complimentary physics insights.  In this way, open data can enrich the dialogue between the theoretical and experimental communities. | + | Interestingly, as we were pursuing our open data analysis, there was an [[https://arxiv.org/abs/1708.09429|official CMS analysis]] on a similar topic.  Our analysis was based on proton-proton collisions from 2010, while the CMS analysis was based mostly on lead-lead collisions from 2015. Our analysis was an exploratory study of jet substructure, while the CMS analysis was far more ambitious, using jet substructure to probe the properties of a hot dense state of matter called the quark/gluon plasma.  One could cynically say that our analysis was stealing thunder from CMS, but I see these two studies as being synergistic, since we made different analysis choices that led to complementary physics insights.  In this way, open data can enrich the dialogue between the theoretical and experimental particle physics communities. | 
| Line 79: | Line 79: | ||
| Eventually, the CMS experiment will release the third batch of open data from 2012, with hopefully enough information to reproduce the monumental [[https://press.cern/press-releases/2012/07/cern-experiments-observe-particle-consistent-long-sought-higgs-boson|discovery of the Higgs boson]]. | Eventually, the CMS experiment will release the third batch of open data from 2012, with hopefully enough information to reproduce the monumental [[https://press.cern/press-releases/2012/07/cern-experiments-observe-particle-consistent-long-sought-higgs-boson|discovery of the Higgs boson]]. | ||
| - | Beyond the CMS open data, I am also looking for ways to use archival data from the [[https://hep-project-dphep-portal.web.cern.ch/sites/hep-project-dphep-portal.web.cern.ch/files/archive_data.pdf|ALEPH experiment]].  ALEPH was one of the four main experiments at the former Large Electron-Positron (LEP) collider at CERN. LEP closed in 2000 such that the tunnel could be reused for the LHC. With the help of a former ALEPH collaboration member [[https://www.rd-alliance.org/users/mmaggi|Marcello Maggi]], we are taking ALEPH data from the 1990s and applying jet substructure techniques that weren't even conceived of until 2008. While LEP data is very different from LHC data, I expect some of the lessons from our archival LEP studies to inform ongoing analyses at the LHC. | + | Beyond the CMS open data, I am also looking for ways to use archival data from the [[https://hep-project-dphep-portal.web.cern.ch/sites/hep-project-dphep-portal.web.cern.ch/files/archive_data.pdf|ALEPH experiment]].  ALEPH was one of the four main experiments at the former Large Electron-Positron (LEP) collider at CERN. LEP closed in 2000 such that the tunnel could be reused for the LHC. With the help of ALEPH collaboration member [[https://www.rd-alliance.org/users/mmaggi|Marcello Maggi]], we are taking ALEPH data from the 1990s and applying jet substructure techniques that weren't even conceived of until 2008. While LEP data is very different from LHC data, I expect some of the lessons from our archival LEP studies to inform ongoing analyses at the LHC. | 
| Line 86: | Line 86: | ||
| When I first started working with the CMS open data, people would often ask me why I didn't just join CMS. After all, instead of trying to lead a small group of theorists with no experimental experience, I could have leveraged the power and insights of a few-thousand-person collaboration.  This is true... if my only goal was to perform one specific jet substructure analysis. | When I first started working with the CMS open data, people would often ask me why I didn't just join CMS. After all, instead of trying to lead a small group of theorists with no experimental experience, I could have leveraged the power and insights of a few-thousand-person collaboration.  This is true... if my only goal was to perform one specific jet substructure analysis. | ||
| - | But what about more exploratory studies where the theory hasn't yet been invented?  What about engaging undergraduate students who haven't decided if they want to pursue theoretical or experimental work? What about examining old data for signs of new physics?  What about citizen-scientists who might not have world experts on [[http://web.mit.edu/lns/research/particle.html|proton-proton]] and [[http://web.mit.edu/mithig/|lead-lead]] collisions in the building next door? And what happens if I have a great new theoretical idea after the LHC has already shut down? These were the questions that motivated me to dig into the CMS open data, and I hope that they might motivate some of you to take a look as well. Our two publications are a proof of principle that open collider analyses are feasible and potentially impactful. | + | But what about more exploratory studies where the theory hasn't yet been invented?  What about engaging undergraduate students who haven't decided if they want to pursue theoretical or experimental work? What about examining old data for signs of new physics?  What about citizen-scientists who might not have world experts on [[http://web.mit.edu/lns/research/particle.html|proton-proton]] and [[http://web.mit.edu/mithig/|lead-lead]] collisions in the building next door? And what happens if I have a great new theoretical idea after the LHC has already shut down? These were the [[https://indico.cern.ch/event/639314/contributions/2721635/attachments/1540724/2415986/jthaler_Fermilab2017_OpenData.pdf|questions that motivated me]] to dig into the CMS open data, and I hope that they might motivate some of you to take a look as well. Our two publications are a proof of principle that open collider analyses are feasible and potentially impactful. | 
| - | Ultimately, physics is an experimental science, and Sal's aphorism that "data makes you smarter" is true at the highest level.  It is true that theoretical insights have played a crucial role in solidifying the principles of fundamental physics.  But almost everything we know for certain about the universe has originated from centuries of keen observations and detailed measurements.  Without experimental data, physical principles would be mere speculations.  With experimental data, we have an opportunity to expose the deepest structures of the universe... not just by scribbling on a chalkboard but also by smashing together particles at ever-increasing energies. | + | Ultimately, physics is an experimental science, and the aphorism "data makes you smarter" holds at the most foundational level.  It is true that theoretical insights have played a crucial role in solidifying the principles of fundamental physics.  But almost everything we know for certain about the universe has originated from centuries of keen observations and detailed measurements.  Without experimental data, physical principles would be mere speculations.  With experimental data, we have an opportunity to expose the deepest structures of the universe... not just by scribbling on a chalkboard but by smashing together particles at ever-increasing energies. | 
| - | When you decide to jump into the CMS open data yourself---and I hope you do---you will be confronted with this question:  [[http://opendata.cern.ch/getting-started/CMS|"I have installed the CERN Virtual Machine: now what?"]]  However you answer this question, I'm sure that you are going to learn something.  And hopefully, you will teach the rest of us something, too. | + | When you decide to jump into the CMS open data yourself (and I hope you do), you will be confronted with this question:  [[http://opendata.cern.ch/getting-started/CMS|"I have installed the CERN Virtual Machine: now what?"]]  However you answer this question, I am sure that you are going to learn something.  And hopefully, you will teach the rest of us something, too. |