NVIDIA announced that it is collaborating with the open-source community to bring end-to-end GPU acceleration to Apache Spark 3.0, an analytics engine for big data processing used by more than 500,000 data scientists worldwide.
With the anticipated late-spring release of Spark 3.0, data scientists and machine learning engineers will, for the first time, be able to apply revolutionary GPU acceleration to the ETL (extract, transform and load) data processing workloads widely performed using SQL database operations.
In another first, AI model training will run on the same Spark cluster, instead of as a separate workload on separate infrastructure. This enables high-performance data analytics across the entire data science pipeline, accelerating the processing of tens to thousands of terabytes of data from data lake to model training, without changes to the existing code of Spark applications running on premises and in the cloud.
“Data analytics is the greatest high performance computing challenge facing today’s enterprises and researchers,” said Manuvir Das, head of Enterprise Computing at NVIDIA. “Native GPU acceleration for the entire Spark 3.0 pipeline from ETL to training to inference delivers the performance and scale needed to finally connect the potential of big data with the power of AI.”
Building on its strategic AI partnership with NVIDIA, Adobe is one of the first companies working with a preview release of Spark 3.0 running on Databricks. In an initial test using GPU-accelerated data analytics, it achieved a 7x performance improvement and 90 percent cost savings for product development in Adobe Experience Cloud and the features that power digital businesses.
The performance gains in Spark 3.0 enhance model accuracy by enabling scientists to train models with larger datasets and retrain models more frequently. This makes it possible to process terabytes of new data every day, which is critical for data scientists supporting online recommender systems or analyzing new research data. In addition, faster processing means that fewer hardware resources are needed to deliver results, providing significant cost savings.
“We’re seeing significantly faster performance with NVIDIA-accelerated Spark 3.0 compared to running Spark on CPUs,” said William Yan, senior director of Machine Learning at Adobe. “With these game-changing GPU performance gains, entirely new possibilities open up for enhancing AI-driven features in our full suite of Adobe Experience Cloud apps.”
Databricks and NVIDIA Bring More Speed to Spark
Apache Spark was originally created by the founders of Databricks, whose cloud-based Unified Data Analytics Platform runs on over 1 million virtual machines every day. NVIDIA and Databricks have collaborated to optimize Spark with the RAPIDS™ software suite for Databricks, bringing GPU acceleration to data science and machine learning workloads running on Databricks across healthcare, finance, retail and many other industries.
“Our continued work with NVIDIA improves performance with RAPIDS optimizations for Apache Spark 3.0 and Databricks to benefit our joint customers like Adobe,” said Matei Zaharia, original creator of Apache Spark and chief technologist at Databricks. “These contributions lead to faster data pipelines, model training and scoring that directly translate to more breakthroughs and insights for our community of data engineers and data scientists.”
Faster ETL and Data Transfers in Spark with NVIDIA GPUs
NVIDIA is contributing a new open-source RAPIDS Accelerator for Apache Spark to help data scientists increase the end-to-end performance of their pipelines. The accelerator intercepts operations that previously ran on CPUs and executes them on GPUs to:
- Accelerate ETL pipelines in Spark by dramatically improving the performance of Spark SQL and DataFrame operations without requiring any code changes.
- Accelerate data preparation and model training on the same set of infrastructure, where a separate cluster is not required for machine learning and deep learning.
- Accelerate data transfer performance across nodes in a distributed Spark cluster. These libraries leverage the open-source Unified Communication X (UCX) framework of the UCF Consortium and minimize latency by enabling data to move directly between GPU memories.
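As a concrete illustration of the "without requiring any code changes" point above, enabling the accelerator is a matter of cluster configuration rather than application changes. The sketch below is a hedged example: the jar filename, application name and GPU resource numbers are illustrative assumptions, and the exact artifact must match your Spark and CUDA versions.

```shell
# Hedged sketch: launching an unmodified Spark application with the
# RAPIDS Accelerator enabled via configuration only.
# spark.plugins=com.nvidia.spark.SQLPlugin routes Spark SQL/DataFrame
# operations through the GPU plugin; spark.rapids.sql.enabled toggles
# GPU SQL execution; the resource settings let four tasks share each
# executor's GPU. Jar name and app script are placeholders.
spark-submit \
  --jars rapids-4-spark.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  etl_and_train.py
```

Because acceleration is switched on at submit time, the same application code can run on CPU-only and GPU clusters, which is the "no code changes" property the accelerator promises.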