Breaking the Data Replication Wall

Data replication is a highly complex topic and one of the long-standing problems in distributed systems (and it will continue to be). Achieving low latency in data replication is hard: the latency of operations over distributed data fluctuates with the network medium, the replication protocol, the transport, the algorithm, and the network topology.

Beyond the inherent difficulty of building a distributed system, the main problem arises when the system is latency oriented and handles highly critical data.

We realize everything slowly but steadily…

Things were different…

Data replication strategies have been in everyday use for years. These strategies are not latency critical and are mostly used in RDBMS replication systems like Postgres BDR, KV replication via sharded reads with Sentinels, and many more like these...

Since these strategies rely mostly on the reliability/recoverability perspective of the systems (in addition to partition tolerance), current volatile-memory-based systems have taken a different development direction than ordinary *DBMS systems. Surely NoSQL systems are on the rise and they need to excel at how they operate on data. That's true too.

As with every tradeoff in computer science, once we include data in the mix we need to consider what our workloads look like. Current workloads are increasingly tailored to custom use cases. Everybody has a different use case, and approaches differ completely from the game industry to the e-commerce sector.

Pillars of the future

I have worked in the security industry for 3+ years. That security industry experience goes back to a time before blockchain existed. You watch how other vulnerability researchers do their jobs, and how system code gets hacked and handed back to you in a matter of seconds. I learned a lot while writing CXX code too. The pain was real…

Meanwhile, Rust was developed and became the de facto systems programming language for the masses. That is nice. We started with the mantra "Pursuing the trifecta" and we continued… ¹

Day to day I work with data processing systems, mostly wrapping my head around the hardcore problems of ddata (short for Distributed Data from now on).

The pillars I mention are the ones that enable me and others to do our jobs and improve our systems. So I started to build those pillars for everyone.

Ionic order: Artillery

I am working on Artillery, a cluster management & distributed data management library. It currently does zeroconf service discovery (SD) and AP cluster instantiation: it already assembles a cluster whose node status changes can be observed, and it operates efficiently in medium-scale, datacenter-local deployments; a hypothetical sketch of that follows below. That's beautiful, but what about replication?
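Every type and method name in this sketch is an assumption of mine for illustration, not Artillery's actual API; it only shows the shape of a zeroconf join plus membership notifications:

```rust
/// Hypothetical membership states; illustrative only.
#[derive(Debug, Clone, Copy)]
enum NodeStatus {
    Alive,
    Suspect,
    Down,
}

#[derive(Debug)]
struct MembershipEvent {
    node_id: u64,
    status: NodeStatus,
}

/// Stub cluster handle. A real implementation would announce and browse
/// the service over zeroconf/mDNS and gossip membership in an AP fashion.
struct Cluster {
    events: Vec<MembershipEvent>,
}

impl Cluster {
    fn join(service_name: &str) -> Self {
        println!("joining cluster advertised as {service_name}");
        // Faked events, standing in for real gossip notifications.
        Cluster {
            events: vec![
                MembershipEvent { node_id: 1, status: NodeStatus::Alive },
                MembershipEvent { node_id: 2, status: NodeStatus::Suspect },
            ],
        }
    }
}

fn main() {
    let cluster = Cluster::join("_artillery._udp");
    for ev in cluster.events {
        println!("node {} is now {:?}", ev.node_id, ev.status);
    }
}
```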


I have been working on replication for the last two weeks. There are a couple of methods I am very fond of, and implementing them in Rust is my primary goal, for the reasons of the trifecta.

For the design of a low-latency system I decided to use CRAQ², Chain Replication with Apportioned Queries, which is an improvement over general CR-based systems. CRAQ targets object-based storage (OBS) systems and enables a high-performance, low-latency replication scheme over the existing nodes. Its major improvement over plain CR is the apportioned queries: read workloads are scattered across the whole cluster instead of being served by a single node, and the cluster itself might span zones and datacenters all over the world. These are the systems Tanenbaum classifies as geographically dispersed systems.

Artillery's ddata module is getting baked these days, and CRAQ was the fuel for reaching peak performance.

Kanagawa

CRAQ works like a bunch of nodes woven together as a chain (which forms the cluster, and which can also span multiple regions). This is how chain replication works in general. The advantage of this type of replication is that most of the time the workload is designed for data reads/consumption, not for writes/dissemination. That's why that long backstory was written: to envision where I want to go. You can imagine the read operations hitting every node like the Great Wave off Kanagawa hitting the coastline.

In a nutshell, this is how CRAQ works:

Every node is aware of the other nodes, and except for the head and tail, every node in the chain holds a bidirectional connection to its predecessor and its successor. The tail is the one and only node that handles the versioning and the consistency guarantees.
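Put differently, under strong consistency a read served by a mid-chain node is only answered locally when the object is clean; a dirty object forces a version query to the tail. A minimal sketch of that read path, assuming a simplified in-memory model (illustrative, not Artillery's implementation):

```rust
use std::collections::HashMap;

/// A stored object version. A node may hold several versions of the same
/// key while a write is still propagating down the chain.
struct Version {
    number: u64,
    value: String,
    clean: bool, // has the tail committed this version?
}

struct Node {
    store: HashMap<String, Vec<Version>>,
    is_tail: bool,
}

impl Node {
    /// Strongly consistent read: serve locally if the newest version is
    /// clean; otherwise ask the tail which version is committed and serve
    /// that one from the local store.
    fn read(&self, key: &str, tail: &Node) -> Option<String> {
        let versions = self.store.get(key)?;
        let newest = versions.last()?;
        if newest.clean || self.is_tail {
            return Some(newest.value.clone());
        }
        let committed = tail.committed_version(key)?;
        versions
            .iter()
            .find(|v| v.number == committed)
            .map(|v| v.value.clone())
    }

    /// The tail answers version queries with its newest clean version.
    fn committed_version(&self, key: &str) -> Option<u64> {
        self.store
            .get(key)?
            .iter()
            .rev()
            .find(|v| v.clean)
            .map(|v| v.number)
    }
}

fn main() {
    let tail = Node {
        store: HashMap::from([(
            "k".to_string(),
            vec![Version { number: 1, value: "v1".into(), clean: true }],
        )]),
        is_tail: true,
    };
    let mid = Node {
        store: HashMap::from([(
            "k".to_string(),
            vec![
                Version { number: 1, value: "v1".into(), clean: true },
                Version { number: 2, value: "v2".into(), clean: false },
            ],
        )]),
        is_tail: false,
    };
    // The mid node holds a dirty v2, so the read resolves through the
    // tail to the committed version: "v1".
    assert_eq!(mid.read("k", &tail), Some("v1".to_string()));
}
```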

CRAQ also has three modes of consistency (it defaults to Strong; that is why CRAQ was made): Strong, Eventual, and Eventual with a Maximum Version Bound. We are not going to examine these approaches in depth, but the main difference between them is how dirty objects are disseminated through the chain.
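As a hedged sketch, here is how those modes could differ on a dirty read; the mode names mirror the paper, while resolve_dirty_read and the max_lag knob are assumptions of mine for illustration:

```rust
/// The three CRAQ read-consistency modes; a sketch, not Artillery's types.
enum Consistency {
    /// Dirty reads are always resolved through the tail (the default).
    Strong,
    /// Return whatever version the contacted node holds, even if dirty.
    Eventual,
    /// Like Eventual, but never serve a version that lags the newest
    /// locally known version by more than `max_lag` versions.
    EventualMaxVersionBounded { max_lag: u64 },
}

/// Decide which version number a dirty read should return, given the
/// node's newest local version and the tail's committed version.
fn resolve_dirty_read(mode: &Consistency, local: u64, committed: u64) -> u64 {
    match mode {
        Consistency::Strong => committed,
        Consistency::Eventual => local,
        Consistency::EventualMaxVersionBounded { max_lag } => {
            if local - committed <= *max_lag { local } else { committed }
        }
    }
}

fn main() {
    let bounded = Consistency::EventualMaxVersionBounded { max_lag: 1 };
    // Local dirty version 5 vs. committed 3: the lag of 2 exceeds the
    // bound, so fall back to the committed version.
    assert_eq!(resolve_dirty_read(&bounded, 5, 3), 3);
}
```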

In Artillery I have defined the protocol for transport and messaging, and for the sake of future-proofing I have enabled versioned, strict message transport.
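Versioned strict transport here means each frame declares the protocol version it speaks and incompatible frames are rejected at decode time. A rough sketch of that idea (the message names are made up for illustration, not Artillery's actual protocol):

```rust
/// Wire protocol version this build speaks.
const PROTOCOL_VERSION: u8 = 1;

/// Illustrative message set; not Artillery's actual protocol.
#[derive(Debug)]
enum Message {
    Write { key: String, value: Vec<u8>, version: u64 },
    ReadRequest { key: String },
    VersionQuery { key: String }, // "which version is committed?"
}

/// Every frame carries the protocol version up front.
#[derive(Debug)]
struct Envelope {
    protocol: u8,
    payload: Message,
}

/// Strict decoding: refuse frames from an incompatible protocol version
/// instead of guessing at their meaning.
fn decode(env: Envelope) -> Result<Message, String> {
    if env.protocol != PROTOCOL_VERSION {
        return Err(format!("unsupported protocol version {}", env.protocol));
    }
    Ok(env.payload)
}

fn main() {
    let frame = Envelope {
        protocol: PROTOCOL_VERSION,
        payload: Message::ReadRequest { key: "k".into() },
    };
    println!("{:?}", decode(frame));
}
```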

After hard work on the Rust implementation, adopting a secure and reliable implementation mindset, and heavy testing for performance under the expected message rate (and beyond: I am not performance crazy, but when a wonderful approach like this exists I won't hold back from putting more load onto it, so I did), I finalized the draft of the first implementation.

Performance talks

Take this with a grain of salt, like every benchmark, performance, or low-tail-latency post or section you read. Here we are going to do something different: our benchmarks won't be asynchronous on the user side. We don't include an executor or any threading on the client side, so none of their latency leaks into the loop that reads the data while we benchmark. The whole benchmark is intentionally not concurrent, and the operations-per-second calculation on the Y axis is pure blocking. That said, any async or multithreaded code working with this algorithm is expected to be ~2.3x faster than the aforementioned approach.
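In loop form, the measurement is just this (a sketch of the methodology described above, not the actual harness):

```rust
use std::time::{Duration, Instant};

/// Pure blocking ops/sec measurement: no executor and no client-side
/// threads, so only the operation itself contributes to the timing.
fn bench_blocking<F: FnMut()>(mut op: F, run_for: Duration) -> f64 {
    let start = Instant::now();
    let mut ops: u64 = 0;
    while start.elapsed() < run_for {
        op(); // a synchronous read against the chain
        ops += 1;
    }
    ops as f64 / start.elapsed().as_secs_f64()
}

fn main() {
    // Stand-in workload; the real benchmark issues chain reads here.
    let ops_per_sec =
        bench_blocking(|| { std::hint::black_box(1 + 1); }, Duration::from_secs(1));
    println!("{ops_per_sec:.0} ops/sec");
}
```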

Below you can see operations per second on the Y axis and connected clients on the X axis, for both classic CR and apportioned-query-based CR.

[Figure: Artillery DData Chain Replication (Sync). Operations per second (Y axis) vs. connected clients (X axis), CRAQ vs. CR.]

Still, this is a very early sneak peek of the DData implementation in Artillery. After all, everything is flexible: you can pair it with a fault-tolerant executor for reliability, a speed-oriented executor, or no executor at all. It isn't tied to anything specific. Pure Rust code. That said, let's take a look at the speedup gained by datacenter-spanning reads over the assembled nodes versus single-node-read CR:

[Figure: Artillery DData Chain Replication (Sync). Speedup of datacenter-spanning CRAQ reads over single-node-read CR.]

And finally, the direct correlation between the median timing of an operation (in µs) and the respective operation counts is shown below (lower is better):

[Figure: Artillery DData Chain Replication, median µs per operation (Sync), CRAQ vs. CR. Lower is better.]

Final words

Currently, the DData code is under construction. Soon I will release and publish a new version of Artillery with both the core and ddata components. Until then, feel free to raise your questions, share your comments, and support me in the Bastion project's Discord server. Please consider supporting me via GitHub Sponsors if you like my work.


1- If you have been around Rust for a long time, you will know what these are. They are safe, performant, and concurrent execution.

2- Terrace, J., & Freedman, M. J. (2009). Object storage on CRAQ: High-throughput chain replication for read-mostly workloads. Proceedings of the 2009 USENIX Annual Technical Conference, 143–158.