Pages

Wednesday, April 16, 2014

Big Data – Real-Time Analytics Performance with ClustrixDB

Note: The product used in comparison is ClustrixDB. It is available to download for FREE.
NewSQL databases provide scale-out of NoSQL without giving up on SQL or ACID transactions. While most NewSQL databases focus only on transactions, ClustrixDB also provides fast real-time analytics that are becoming increasing important to many businesses. ClustrixDB does this by bringing Massively Parallel Processing (MPP) used in data warehouses, to the primary database.
So, I decided to get a workload and try it out to see what kind of performance improvements one can get, if any. Since, joins and aggregates are the workhorses of real-time analytics processing, they are a good place to start.

Configuration

I built a simple dataset with two tables USERS (100K rows), USER_ADDRESSES (200K rows) and BIDS (10M rows) so this dataset has 2GB of data (mysqldump). For platform I used AWS and got ClustrixDB from AWS Marketplace. For comparison, I decided to use MySQL 5.6 since the exact same data and queries can be run on both databases. For both databases, the instance types are m1.xlarge.
MySQL does not scale beyond a single server and is usually deployed with master and two read slaves. Since ClustrixDB provides horizontal scale-out within one cluster, rather than master-slave (with multiple copies of data), the equivalent configuration is 3 servers.ClustrixDB horizontal scaling allows all nodes to participate in all query types. For measuring performance single MySQL is enough because performance for one query will be the same – whether we use the master or read slave.
For ClustrixDB, I also tried out 6 servers to see if analytics get faster as you add servers.
Here is the resulting table:

Results

We see that some queries get significantly faster, however one query showed no performance improvement. The count query on users is only counting 100K rows so it is likely not enough work. The count query on the bids table (counting 10M rows) shows speedup with 3 nodes, but with 6 nodes we don’t get as much improvement. This is still a very simple query. The queries with aggregates and joins get significantly faster (23x and 8.79x) on 3 nodes. These queries also get nearly twice as fast as you go from 3-node ClustrixDB to 6-node ClustrixDB, this is because of MPP in ClustrixDB.
Overall, we see that for more complex analytical queries ClustrixDB gets significant advantage. This means reports will get much faster with ClustrixDB. For some other queries, there is not enough work or being distributed does not offer that much advantage and here the performance is about the same. For real-time analytics requirements, ClustrixDB seems like a good solution.
Note: The product used in comparison is ClustrixDB. It is available to download for FREE.

No comments:

Post a Comment