In their recent measurement of the neutrino oscillation parameters, NOvA uses a sample of approximately 25 million reconstructed spills to search for electron-neutrino appearance events. These events are stored in an n-tuple format, in 250 thousand ROOT files. File sizes range from a few hundred KiB to a few MiB; the full dataset is approximately 1.4 TiB. The NOvA event selection code is currently a serial C++ program that reads these n-tuples. These millions of events are reduced to a few tens of events by the application of strict event selection criteria, and then summarized by a handful of numbers each, which are used in the extraction of the neutrino oscillation parameters. The current table data format and organization, and the selection/reduction processing involved, provide us with an opportunity to explore alternate approaches to represent the data and implement the processing.

A full High Energy Physics (HEP) data analysis is divided into multiple data reduction phases. Processing within these phases is extremely time consuming, therefore intermediate results are stored in files held in mass storage systems and referenced as part of large datasets. This processing model limits what can be done with interactive data analytics. Growth in size and complexity of experimental datasets, along with emerging big data tools, is beginning to cause changes to the traditional ways of doing data analyses. Use of big data tools for HEP analysis looks promising, mainly because extremely large HEP datasets can be represented and held in memory across a system, and accessed interactively by encoding an analysis using high-level programming abstractions. The mainstream tools, however, are not designed for scientific computing or for exploiting the available HPC platform features. We use an example from the Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) in Geneva, Switzerland. The LHC is the highest energy particle collider in the world. Our use case focuses on searching for new types of elementary particles explaining Dark Matter in the universe. We use HDF5 as our input data format, and Spark to implement the use case. We show the benefits and limitations of using Spark with HDF5 on Edison at NERSC.
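To make the selection/reduction step from the NOvA abstract above concrete, here is a minimal sketch of cut-based n-tuple reduction. The actual NOvA code is a serial C++ program; this sketch instead uses Python with the uproot library to read ROOT n-tuples. The tree name, branch names, and cut thresholds are hypothetical placeholders, not NOvA's.

```python
# Sketch only: reduce millions of candidate events to the few passing
# strict cuts, reading ROOT n-tuple files one at a time (serially).
# Branch names and thresholds below are invented for illustration.
import glob
import numpy as np
import uproot  # pip install uproot

def select_events(paths, tree_name="events"):
    kept = []
    for path in paths:
        with uproot.open(path) as f:
            arr = f[tree_name].arrays(
                ["reco_energy", "track_length", "pid_score"], library="np")
            # Strict selection criteria (placeholder values).
            mask = (
                (arr["reco_energy"] > 1.0)
                & (arr["track_length"] < 500.0)
                & (arr["pid_score"] > 0.9)
            )
            kept.append(arr["reco_energy"][mask])
    return np.concatenate(kept) if kept else np.array([])

selected = select_events(glob.glob("ntuples/*.root"))
print(f"{selected.size} events pass all cuts")
```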
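Similarly, here is a hedged sketch of one common way to combine Spark with HDF5 input (Spark has no built-in HDF5 reader): parallelize the list of files across the cluster and open each file with h5py inside the tasks, returning only small per-file results to the driver. The file pattern, dataset path, and cut are assumptions for illustration, not taken from the paper.

```python
# Sketch only: count events passing a placeholder cut across many HDF5
# files using Spark. h5py must be available on the worker nodes.
import glob
import h5py
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="hdf5-reduction-sketch")

def count_passing(path):
    # Open the file on the worker; only the count travels back.
    with h5py.File(path, "r") as f:
        energy = f["/events/missing_et"][:]  # hypothetical dataset name
    return int(np.count_nonzero(energy > 200.0))

paths = glob.glob("cms_data/*.h5")  # hypothetical file layout
total = sc.parallelize(paths).map(count_passing).sum()
print("events passing cut:", total)
sc.stop()
```

Reading inside the tasks keeps the raw arrays on the workers; only the per-file counts are shuffled back to the driver.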
High level abstractions in Python that can utilize computing hardware well seem to be an attractive option for writing data reduction and analysis tasks. In this paper, we explore the features available in Python which are useful and efficient for end user analysis in High Energy Physics (HEP). A typical vertical slice of an HEP data analysis is somewhat fragmented: the state of the reduction/analysis process must be saved at certain stages to allow for selective reprocessing of only parts of a generally time-consuming workflow. Also, algorithms tend to be modular because of the heterogeneous nature of most detectors and the need to analyze different parts of the detector separately before combining the information. This fragmentation causes difficulties for interactive data analysis, and as data sets increase in size and complexity (O(10) TiB for a “small” neutrino experiment to the O(10) PiB currently held by the CMS experiment at the LHC), data analysis methods traditional to the field must evolve to make optimum use of emerging HPC technologies and platforms. Mainstream big data tools, while suggesting a direction in terms of what can be done if an entire data set can be available across a system and analysed with high-level programming abstractions, are not designed with either scientific computing generally, or modern HPC platform features in particular, such as data caching levels, in mind. Our example HPC use case is a search for a new elementary particle which might explain the phenomenon known as “Dark Matter”. Here, using data from the CMS detector, we will use HDF5 as our input data format, and MPI with Python to implement our use case.
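A minimal sketch of the MPI-with-Python pattern this abstract describes, assuming the CMS events have been converted to columnar HDF5 datasets: each rank reads a contiguous slice of the file and the per-rank results are combined with a reduction. The file name, dataset path, and cut threshold are placeholders.

```python
# Sketch only; run with e.g.: mpirun -n 4 python mpi_reduce_sketch.py
import h5py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

with h5py.File("cms_events.h5", "r") as f:   # hypothetical file
    ds = f["/events/missing_et"]             # hypothetical dataset
    n = ds.shape[0]
    # Split [0, n) into `size` nearly equal contiguous slices.
    start = rank * n // size
    stop = (rank + 1) * n // size
    local = ds[start:stop]

# Each rank applies the (placeholder) cut to its slice.
local_count = int(np.count_nonzero(local > 200.0))
total = comm.reduce(local_count, op=MPI.SUM, root=0)
if rank == 0:
    print(f"{total} of {n} events pass the cut")
```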