SQL engines such as Presto, Apache Spark SQL, and Apache Hive consume data structured as tables of rows and columns, whereas filesystems arrange and access data as files and directories. As a result, there is often a mismatch between SQL engines and storage systems. This disparity is analogous to a conversation between two people who speak different languages: for one to understand the other, a translator must always be present. The inefficiency grows with data scale, since each piece of data retrieved must be converted before it can be consumed, and computed results must be converted back before they are stored.

In this talk, I will go over the challenges created by the mismatch between SQL engines and storage systems and introduce a solution, using Alluxio as an example. Alluxio is an open source data orchestration system that sits between compute and storage to deliver physical data independence, where the logical access of data by SQL engines is independent of the physical format of the stored data.

Speakers
Gene Pang
Software Engineer at Alluxio, Inc.
Gene Pang is the PMC Maintainer of the Alluxio open source project and a founding member of Alluxio, Inc. He earned a Ph.D. from the AMPLab at UC Berkeley, where he worked on distributed database systems. Before starting at Berkeley, he worked at Google. He holds an M.S. from Stanford University and a B.S. from Cornell University.
