Engineering Velox: the unified execution engine for varied data infrastructures

By Sanjiban Sengupta

Elevator Pitch

Velox, an open-source project from Meta, is a unified execution engine that reduces the complexity of data compute operations by providing universal semantics and consistent features across engines. Through PyVelox, its Python bindings, Velox can be adopted from Python in a variety of architectures.

Description

Velox, an open-source project by Meta, is a C++ database acceleration library that provides high-performance components for processing large datasets. Voltron Data, in collaboration with the Meta open-source team, has been developing PyVelox, a Python package that adds bindings to commonly used Velox APIs. These bindings let Velox developers use Python’s interactive REPL to explore and triage Velox vectors and associated data components. They also let Python developers execute data queries, including SQL queries, across a wide range of workloads such as batch processing, stream processing, and AI/ML.

Velox stands out as a unified execution engine that integrates with diverse data compute architectures. This integration minimizes redundancy while extending consistent functionality across frameworks. Velox is currently used in engines such as Presto and Apache Spark, as well as in Meta’s internal streaming service XStream, and integration with Apache Flink is planned. Previously, a data compute architecture built on Spark or Presto had to rely on that engine's own execution layer to run data queries. Velox makes it possible to use the same execution engine for both, unifying the execution step across data compute operations. This unification not only reduces complexity but also ensures universal semantics throughout the entire data lifecycle, so features generated during ad hoc training or online execution remain consistent.
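The idea of one execution engine behind several frontends can be illustrated with a toy sketch in plain Python. It does not use Velox or PyVelox; the names `Plan`, `execute`, `plan_from_sql_like`, and `plan_from_dataframe_like` are hypothetical and exist only for this illustration. Two different frontends lower a query to the same plan, and a single executor defines the semantics (here, how nulls are handled), so both frontends return the same result.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# A column is a list of nullable integers; a batch maps column names to columns.
Column = List[Optional[int]]
Batch = Dict[str, Column]


@dataclass(frozen=True)
class Plan:
    """Engine-neutral plan (hypothetical): keep left + right where the sum > threshold."""
    left: str
    right: str
    threshold: int


def execute(plan: Plan, batch: Batch) -> List[int]:
    """The single shared executor: null semantics are defined once, here."""
    out = []
    for a, b in zip(batch[plan.left], batch[plan.right]):
        # A null in either input makes the sum null, and null rows are filtered out.
        total = None if a is None or b is None else a + b
        if total is not None and total > plan.threshold:
            out.append(total)
    return out


def plan_from_sql_like(text: str) -> Plan:
    """Toy frontend 1: parses strings of the exact form '<col> + <col> > <int>'."""
    expr, threshold = text.split(">")
    left, right = (part.strip() for part in expr.split("+"))
    return Plan(left, right, int(threshold))


def plan_from_dataframe_like(left: str, right: str, greater_than: int) -> Plan:
    """Toy frontend 2: builds the same plan from keyword-style arguments."""
    return Plan(left, right, greater_than)


batch = {"a": [1, 2, None, 4], "b": [10, None, 30, 40]}
sql_result = execute(plan_from_sql_like("a + b > 5"), batch)
df_result = execute(plan_from_dataframe_like("a", "b", greater_than=5), batch)

# Both frontends share one executor, so their results cannot drift apart.
assert sql_result == df_result
print(sql_result)
```

Because null handling lives only in `execute`, changing it changes every frontend at once, which is the consistency property the paragraph above attributes to a unified engine.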

In this talk, we will briefly discuss Velox, its philosophy, and its methodology. We will then move on to PyVelox, covering its data types, expressions, and functionality, and demonstrate how simple it is to run database queries on Velox through Python APIs without losing efficiency.


Outline:

  • Velox Incubation
    • Open-sourced by Facebook in late 2021.
  • Velox development over the years
  • Need for PyVelox
  • PyVelox Developments
    • Data types
    • Expressions
    • Serialization-deserialization
    • Conversion to-and-from Apache Arrow
    • Type and function signatures
  • Demo
  • PyVelox future goals

Pre-requisites:
Knowledge of data engineering and analytics will be helpful. The project is an execution engine that can be integrated into a data compute architecture, so experience with SQL or relational data queries will be beneficial. The talk covers Python bindings built with pybind11, so familiarity with Python and intermediate C++ is expected.


Project URLs:
Project GitHub Link
Project Webpage
Project Documentation

Notes

Why am I the best person to speak on this subject?
As part of Voltron Data, I have been collaborating with Meta's official open-source team on the development of PyVelox, working in particular on features such as Arrow-Velox conversion and the implementation of complex types. I plan to talk about the project and discuss its future directions and practical applications.