Speaker: Yanping Wang, Apache Mnemonic (Incubating) Project Lead, Apache Software Foundation

In-Memory Computing frameworks such as Spark are gaining tremendous popularity for Big Data processing because their in-memory primitives make it possible to eliminate the disk I/O bottleneck. Logically, the more memory they have available, the better performance they can achieve. However, unpredictable GC activity from on-heap memory management, the high cost of serialization/de-serialization (SerDe), and bursts of temporary object creation/destruction greatly impact their performance and scale-out ability. For example, in Spark, when the volume of the datasets is much larger than the system memory, SerDe has a significant impact on almost every in-memory computing step, such as caching, checkpointing, shuffling/dispatching, and data loading/storing.
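
To make those hot spots concrete, the short sketch below uses the plain Spark Java API to mark the three steps named above where SerDe is unavoidable with ordinary on-heap RDDs: serialized caching, checkpointing, and shuffling. The class name and checkpoint directory are illustrative placeholders; the Spark calls themselves are standard.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.StorageLevels;

    import scala.Tuple2;

    public class SerDeHotSpots {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("serde-hotspots").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.setCheckpointDir("/tmp/spark-checkpoints");   // illustrative path

        JavaRDD<Integer> data = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // 1) Caching with a serialized storage level: records are serialized
        //    when stored and deserialized on every access.
        data.persist(StorageLevels.MEMORY_ONLY_SER);

        // 2) Checkpointing: the RDD's contents are serialized and written out.
        data.checkpoint();

        // 3) Shuffling: map outputs are serialized to shuffle files and
        //    deserialized again on the reduce side.
        JavaPairRDD<Integer, Integer> sums =
            data.mapToPair(x -> new Tuple2<>(x % 2, x)).reduceByKey(Integer::sum);

        System.out.println(sums.collect());
        sc.stop();
      }
    }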

With fast-growing advanced server platforms offering significantly increased non-volatile memory, such as NVMe devices powered by Intel 3D XPoint technology and fast SSD array storage, how best to use the various hybrid memory-like resources, from DRAM to NVMe and SSD, determines the performance and scalability of Big Data applications.

In this presentation, we will first introduce our non-volatile generic Java object programming model for In-Memory Computing. This programming model defines in-memory non-volatile objects that can be operated on directly on memory-like resources. We then discuss our structured-data in-memory persistence library, which can be used to load/store non-volatile generic Java objects from/to underlying heterogeneous memory-like resources such as DRAM, NVMe, and even SSD.
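
As a concrete illustration of that model, the Java sketch below shows what a non-volatile generic object can look like from the application's point of view: the entity is declared through abstract getters/setters, a concrete implementation is generated at build time, and its fields live directly on the chosen memory-like resource, so no SerDe is needed to read or update them. The annotation, allocator, and factory names here are placeholders invented for this sketch, not the exact Apache Mnemonic API.

    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;

    // Placeholder annotations standing in for the real ones (illustration only).
    @Retention(RetentionPolicy.RUNTIME) @interface DurableEntity {}
    @Retention(RetentionPolicy.RUNTIME) @interface DurableGetter {}
    @Retention(RetentionPolicy.RUNTIME) @interface DurableSetter {}

    // A non-volatile generic Java object: an annotation processor generates the
    // concrete class, and its fields are laid out directly on a memory-like
    // resource (DRAM, NVMe, SSD) instead of the Java heap.
    @DurableEntity
    abstract class DurablePerson {
      @DurableGetter public abstract String getName();
      @DurableSetter public abstract void setName(String name);

      @DurableGetter public abstract short getAge();
      @DurableSetter public abstract void setAge(short age);
    }

    // Usage sketch (allocator/factory names are also placeholders):
    //   NonVolatileAllocator alloc =
    //       new NonVolatileAllocator(CAPACITY, "./people.dat");
    //   DurablePerson p = DurablePersonFactory.create(alloc);  // generated class
    //   p.setName("Alice");            // written straight to the durable resource
    //   long handle = p.getHandle();   // keep the handle; re-attach after restart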

We then present a non-volatile computing case using Spark. We will show that this model can (1) lazily load data to minimize memory footprint, (2) naturally fit both non-volatile RDDs and off-heap RDDs, (3) use non-volatile/off-heap RDDs to transform Spark datasets, and (4) avoid memory caching by using in-place non-volatile datasets.
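
As a sketch of points (1) and (4), the snippet below uses the plain Spark Java API rather than the actual Mnemonic-Spark integration: openDurablePartition is a hypothetical accessor over an in-place, durable dataset, and the resulting RDD materializes records lazily from it, partition by partition, instead of persisting them on the Java heap with cache()/persist().

    import java.util.Iterator;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class InPlaceDatasetSketch {

      // Hypothetical per-partition view over a durable, in-place dataset; in the
      // real setup this would read objects living on non-volatile memory.
      static Iterator<double[]> openDurablePartition(int partitionId) {
        return IntStream.range(0, 3)
            .mapToObj(i -> new double[] { partitionId, i })
            .iterator();
      }

      static JavaRDD<double[]> nonVolatileRdd(JavaSparkContext sc, int numPartitions) {
        List<Integer> ids =
            IntStream.range(0, numPartitions).boxed().collect(Collectors.toList());
        // One driving element per durable partition; records are produced lazily
        // on each executor, with no cache()/persist() and no SerDe round trip.
        return sc.parallelize(ids, numPartitions)
                 .mapPartitionsWithIndex((idx, it) -> openDurablePartition(idx), true);
      }
    }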

Finally, we will show that a performance boost of up to 2X can be achieved on Spark ML tests after applying this non-volatile computing approach, which removes SerDe, eliminates the need to cache hot data, and dramatically reduces GC pause times.