Integrating Data-Parallel Analytics into Stream-Processing Using an In-Memory Data Grid

This talk is aimed at application developers who want to explore the use of in-memory computing for streaming analytics. Its goal is to describe a key innovation in stateful stream-processing: the incorporation of real-time, data-parallel analytics. It reviews how the use of an in-memory data grid (IMDG) enables the creation of powerful “digital twin” models of data sources and then shows how IMDGs enable the seamless incorporation of data-parallel analytics to further enhance the quality of feedback these models can provide. The audience should gain an understanding of a new design technique for in-memory computing (IMC) applications, learn how to make use of it, and explore the advantages it offers for streaming analytics. The talk matters because this technique gives streaming applications new tools made possible by in-memory computing platforms.

In use cases ranging from IoT to ecommerce, an ongoing challenge for stream-processing applications is to extract important insights from real-time systems as fast as possible and then generate effective feedback that optimizes operations or avoids costly failures. Unlike popular software platforms for streaming analytics (e.g., Apache Storm, Flink, Spark Streaming, and legacy CEP), which focus on extracting value from unfiltered data streams, in-memory data grids (IMDGs) have opened the door to stateful stream-processing that correlates event streams by data sources using a “digital twin” model and enables much deeper introspection on these data sources. This talk describes the next step in the evolution of the digital twin model made possible by IMDGs: the incorporation of real-time, data-parallel analytics that further enhances introspection by providing immediate feedback on aggregate behaviors. As illustrated by several real-world applications, this new capability leverages the power of IMDGs to significantly increase the effectiveness of stream-processing applications.

Stream processing and data-parallel analytics have traditionally led parallel lives. As described by the Lambda architecture, streaming analytics are usually hosted in the “speed layer,” while data-parallel analytics are hosted in the “batch layer,” with results appearing later in queries that merge these two views. Data-parallel analytics captures vital aggregate trends that can enhance introspection. However, because it was driven by the need to host these processing layers on different systems, the Lambda architecture fails to deliver real-time feedback from data-parallel analytics to stream-processing applications.

Consider a medical monitoring application that captures and analyzes telemetry from hundreds of thousands of monitoring devices. Using an IMDG to host a digital twin model of the patients enables the tracking of real-time state for each patient, and it automatically correlates incoming telemetry from the data sources to the respective digital twin models. This allows the application to analyze telemetry in real time with rich context that includes the patient’s medical history and recent events, allowing much deeper introspection than available in stateless stream-processing systems.
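The correlation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PatientTwin` class, its fields, the 1.5x alert threshold, and the plain dictionary standing in for the grid are all hypothetical and do not represent any IMDG product's API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PatientTwin:
    """Hypothetical per-patient state object (the "digital twin")."""
    patient_id: str
    resting_heart_rate: float  # assumed to come from the patient's history
    recent: deque = field(default_factory=lambda: deque(maxlen=5))

    def on_telemetry(self, heart_rate: float) -> bool:
        """Record a reading; return True if it warrants an alert."""
        self.recent.append(heart_rate)
        average = sum(self.recent) / len(self.recent)
        # Illustrative rule: compare against this patient's own baseline
        # rather than a single global threshold.
        return average > 1.5 * self.resting_heart_rate


# A plain dict stands in for the grid's in-memory key/value store.
grid = {
    "p1": PatientTwin("p1", resting_heart_rate=60.0),
    "p2": PatientTwin("p2", resting_heart_rate=80.0),
}


def dispatch(event: dict) -> bool:
    """Correlate an incoming event with its twin by patient ID."""
    return grid[event["patient_id"]].on_telemetry(event["heart_rate"])
```

In this sketch, the same reading of 100 beats per minute triggers an alert for patient `p1` but not for `p2`, because each twin evaluates telemetry against its own stored context.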

The next step is to extract and analyze salient state information from the real-time state of all digital twin models and feed the results back to these models for incorporation into the analysis algorithm. This offers the next level of introspection that considers dynamic, aggregate trends. For example, the medical application can average key parameters, such as heart rate, across all patients and pivot this data by region, age group, gender, etc. These results can be reported back to the digital twin models in real time to enrich the analysis of incoming telemetry.
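As a rough illustration of this feedback loop, the following Python sketch averages heart rate across a handful of twins, pivots the result by one attribute, and writes each group's average back into its members. The record layout and the `cohort_average` field are assumptions made for the example, not part of any described system.

```python
from collections import defaultdict

# Hypothetical snapshot of salient state extracted from each digital twin.
twins = [
    {"id": "p1", "region": "west", "heart_rate": 70.0},
    {"id": "p2", "region": "west", "heart_rate": 90.0},
    {"id": "p3", "region": "east", "heart_rate": 60.0},
]


def average_by(twins, key):
    """Average heart rate across all twins, pivoted by the given attribute."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for twin in twins:
        acc = totals[twin[key]]
        acc[0] += twin["heart_rate"]
        acc[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def feed_back(twins, key, averages):
    """Report each group's aggregate back to the twins in that group."""
    for twin in twins:
        twin["cohort_average"] = averages[twin[key]]


averages = average_by(twins, "region")
feed_back(twins, "region", averages)
```

Each twin can then compare incoming telemetry with its cohort's current average as well as with the patient's own history.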

The use of an IMDG for stateful stream-processing makes integration of data-parallel analytics possible because the grid hosts state information for data sources in memory. Because an IMDG can perform data-parallel analytics (for example, MapReduce) in place, where the data lives, the large volume of accumulating state data does not have to be moved to a separate system for analysis. This allows results to be generated in real time and immediately fed back into the state models. It also avoids data motion, which creates network bottlenecks. With these new capabilities, IMDGs significantly increase the power of digital twin models for stream-processing.
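The in-place pattern can be sketched as a small map/reduce in Python. Here each inner list is assumed to represent the readings held by one grid node, and a thread pool merely simulates nodes working in parallel; the function names are hypothetical. The point of the sketch is that only a small partial result, a (sum, count) pair, leaves each partition, never the underlying state data.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Each inner list stands in for the heart-rate state held on one grid node.
partitions = [
    [72.0, 88.0],
    [65.0],
    [90.0, 75.0, 60.0],
]


def map_partition(readings):
    """Runs where the data lives; returns only a small (sum, count) pair."""
    return (sum(readings), len(readings))


def combine(a, b):
    """Merge two partial results."""
    return (a[0] + b[0], a[1] + b[1])


def grid_average(partitions):
    """Compute a global average from per-partition partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_partition, partitions))
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count
```

Because only the partial results cross partition boundaries, the traffic between nodes stays small no matter how much state accumulates on each one.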

Schedule

Wed, 10/03/2018 - 14:40

Room

Ballroom A

Tracks:

Streaming Data

Speakers

William Bain

CEO

ScaleOut Software, Inc.

Dr. William L. Bain founded ScaleOut Software in 2003 to develop in-memory data grid and in-memory computing products. As CEO, he has led the creation of numerous innovations for integrating data-parallel computing with in-memory data storage. Bill holds a Ph.D. in electrical engineering from Rice University. Over a 38-year career focused on parallel computing, he has contributed to advancements at Bell Labs Research, Intel, and Microsoft, and holds several patents in computer architecture and distributed computing. Bill founded and ran three start-up companies prior to ScaleOut Software. The most recent, Valence Research, developed and distributed Web load-balancing software; it was acquired by Microsoft Corporation, and its software is a key feature within the Windows Server operating system. As an investor and member of the screening committee for the Seattle-based Alliance of Angels, Bill is actively involved in entrepreneurship and the angel community. Bill has presented at the prior three IMCS conferences in San Francisco.
Recent talks presented by Bill Bain:
• In-Memory Computing Summit Amsterdam and San Francisco 2017: Stream Processing with In Memory Data Grids: Creating the Digital Twin
• DEVintersection Spring 2017: Supercomputing with Microsoft’s Task Parallel Library
• In-Memory Computing Summit 2016: Implementing User-Defined Data Structures in In-Memory Data Grids
• Database Month New York April 2016: Using Memory-Based NoSQL Data Structures to Eliminate the Network Bottleneck
• IBM POWER8 ISV Testimonial 2015: POWER8 and ScaleOut Software: In-memory computing for operational intelligence
• In-Memory Computing Summit 2015: Implementing Operational Intelligence Using In-Memory, Data-Parallel Computing
• Database Month New York May 2015: Using In-Memory, Data-Parallel Computing for Operational Intelligence
• Big Data Spain 2014: Real Time Analytics with MapReduce And In-Memory
• Strata+Hadoop World 2014: Using Operational Intelligence to Track 10M Cable TV Viewers in Real Time
URLs of previous presentations:
• In-Memory Computing Summit Amsterdam 2017: https://imcsummit.org/2018/us/sessions/stream-processing-memory-data-grids-c…
• In-Memory Computing Summit 2016: https://imcsummit.org/2016/videos-and-slides/implementing-user-defined-…
• Database Month New York April 2016: http://www.databasemonth.com/database/nosql-data, https://youtu.be/2KfiQPkuemM
• IBM POWER8 ISV Testimonial 2015: https://www.youtube.com/watch?v=7q5ERajssvs
• In-Memory Computing Summit 2015: http://www.slideshare.net/imcsummit/imcs2015-1-devimplementing-operatio…
• Database Month New York May 2015: http://www.databasemonth.com/database/scaleout-data, https://youtu.be/xaFcJmu1yqg
• Big Data Spain 2014: https://www.youtube.com/watch?v=52smTmprT7w
• Strata + Hadoop 2014: http://conferences.oreilly.com/strata/stratany2014/public/content/solut…, https://www.youtube.com/watch?v=nOSk5nnzUpA