Speak dataholics, in today's post we will continue the theme of the last post, some really cool comments in the last post asking if Deletion Vector or Photon had been used in the tests between Merge and ReplaceWhere.
If you haven't read the last post yet, check it out:
In this post I showed the difference in performance between the Merge and ReplaceWhere commands, where ReplaceWhere when well applied is extremely more performant than the Merge command. Some comments encouraged me to write this post, where we will show the Merge command acting in different engines and contexts.
This article is not about Photon itself, but after these tests, I will bring a lot more Photon things to the Blog.
What we will see in this post:
Deletion Vector
Photon
Predictive IO
Test parameters
Merge performance comparison
Delete performance comparison
Summary
How much did this comparison cost?
Deletion Vector
There's not much to say about Deletion Vector, as I already wrote this post below, despite it being old, its base hasn't changed:
Basically, Deletion Vector comes to optimize writing operations in Storage, commands such as Delete, Update and MERGE that do a lot of writing in Storage were optimized with this engine.
In a very simplistic way, the main purpose of Deletion Vector is to reduce writing to storage by using a vector map to mark changed records in a PARQUET .
In other words, if you change 1 record within 1 giant PARQUET, instead of the engine marking it as deleted and creating a new giant PARQUET , it simply creates a vector map with a few KBs , reducing writing and storage, in the post above there is a lot more details and technical explanation.
Photon
Photon is a Databricks query acceleration engine , focused on SQL code, it is an engine fully compatible with Spark APIS and can run your previously created codes without the need for changes, that is, all SQL code that already worked with Spark SQL on an All Purspose cluster, it works natively on Photon.
Photon is the native Engine of SQL Warehouse clusters, while in All Purpose or Jobs Clusters, it is optional and even when Photon is enabled in All Purpose clusters, the engine will not always be used, for this you must use SQL code, To ensure whether you used Photon or not, just look at the execution plan.
Photon is a completely new engine written in C++, it has several features to accelerate query performance, more efficient cache control, better treatments for aggregations and JOINs, which is why it is natively used for SQL Warehouse, where it is connected to visualization tools like Power BI.
If you are interested, follow the Photon paper, I recommend reading it:
It's fascinating to see how Photon has integrated with DBR and interacts strongly with Spark APIs, which work outside the box, the interoperability of the two languages, in this case Java and C++ done all via JNI ( Java Native Interface ) to achieve surreal performance .
How to enable Photon on All Purspose and Job Cluster clusters:
Now in SQL Warehouse clusters, you don't need to enable it, it is the Default engine, what you choose are the Warehouse types:
Basically the main differences between the 3 models are these:
In our demo we will use a Warehouse PRO to compare performance, with this type, we have Photon + Predictive Optimization .
Predictive IO is a feature for optimizing access to Storage, using artificial intelligence for more efficient searches and writes to Storage, minimizing the number of reads or reading ahead to optimize the data cache, it is truly sensational.
Test parameters
For comparison tests we will use the following environment:
All Purpose Cluster : DBR: 14.3 LTS, 1x Standard_E8ds_v4
SQL Warehouse PRO : 2X-Small (Behind the scenes the machines used are Standard_E8ds_v4 )
Data tables : 1x table with 59 million records, partitioned by year, month and day.
Tests : We will apply a MERGE operation to a partition of 11 million records, we will test with and without Deletion Vector .
All Purpose Cluster:
Note: It is worth mentioning that when you enable Photon in the All Purpose or Job Cluster clusters, the cost of DBUs also increases, in other words, it is not free to enable Photon.
For this cluster below, the DBU cost goes from 2.75 to 5.5 DBU/h .
SQL Warehouse:
As Photon is native, the cost is already built in, which is why you will notice that using a SQL Warehouse is a little more expensive than using an All Purpose Cluster, the infrastructure of both are also very different.
Performance comparison in MERGE
We create two tables:
pedidosSemDeletionVector - Sem Deletion Vector
pedidosComDeletionVector - Com Deletion Vector
Both tables with the same number of files and size.
No Storage: Both folders exactly the same after creation.
Starting the tests:
For the test we will create a temporary view that contains the records that we want to apply MERGE to.
No Deletion Vector and no Photon running on an All Purpose cluster.
Merge: 2.91 minutes (Rounding up 3 minutes)
With Deletion Vector and without Photon running on an All Purpose cluster.
Merge: 2.46 minutes (Rounding 2 minutes)
Metrics:
No Deletion Vector
With Deletion Vector
If you look at the execution metrics, there is no difference and the time is relatively similar.
In storage we will notice that the Deletion Vector is being used in the enabled table.
Let's move on to testing Warehouse PRO.
No Deletion Vector and COM Photon running on a SQL Warehouse PRO cluster.
Merge: 1.58 minutes (Rounding up 2 minutes)
Note: Without data cache, testing with the cache reduced the time to 59 seconds.
With Deletion Vector and COM Photon running on a SQL Warehouse PRO cluster.
Merge: 58 seconds
Note: No data cache.
Conclusion using MERGE with or without Deletion Vector : Very equivalent performance .
Conclusion using With or without Photon : Very significant difference.
Performance comparison in DELETE
Now, we will compare the DELETION VECTOR in pure DELETES operations, I will simulate some 200x sequential DELETEs operations.
200x Deletes Without Photon and without Deletion Vector: 24 minutes
200x Deletes Without Photon and WITH Deletion Vector: 25 minutes
200x Deletes WITH Photon and Without Deletion Vector : 12 minutes
200x Deletes COM Photon and Com Deletion Vector : 10 minutes
Here we can confirm that the Deletion Vector is indeed working.
Looking at Storage we can also say that the Deletion Vector is in action:
Test summary
This wasn't quite the result I expected, I confess that I would have liked to see a bigger difference with the Deletion Vector , although, in relation to the performance of using or not using the Deletion Vector, we can notice that the difference was very small in all scenarios.
Where Deletion Vector stood out was in the space consumed by Storage, after all Delete operations are finished, note that the folder with Deletion Vector, despite having more files, is smaller in terms of space consumed , this is due to the fact that many PARQUETs files do not needed to be rewritten and only a vector map was created with a few KBs .
Despite the tests above, I recommend using Deletion Vector, as it is enabled by default for new tables created with the DBR 14.x version or created in SQL Warehouse using UC.
Today's big point goes to Photon , with it we can notice extreme differences in performance both in MERGE and in simple DELETE operations.
We can see that in all tests Photon managed to reduce all of them in half the time, which can greatly justify its use.
These tests are not an official benchmark , although anyone can simulate, I will not make all the details of the simulations available.
I tested several other situations, such as Delete in Merge, operations with a larger mass of data or with few records, all followed the same patterns, where Photon proved to be much more performant and Deletion Vector was imperceptible in terms of performance (but efficient in the space consumed in the storage).
These tests cost me around R$250 to keep the clusters connected during the tests , it's not cheap to play with these things, but I will definitely bring more experience to the environments I have Photon or Deletion Vector.
I hope you liked it, comment if you want to see new performance tests and how you would like to compare, I will add it to my testing backlog.
Stay safe and see you next time.
References:
Comments