Paper summary: A Taxonomy of Partitioned Replicated Cloud-based Database Systems

- April 22, 2015

This paper is by Divy Agrawal, Amr El Abbadi, and Kenneth Salem, and appeared in IEEE Data Engineering journal in 2015.

This paper proposes a taxonomy of large scale partitioned replicated transactional databases. Partitioned replicated means, the database is divided in partitions and the partitions are replicated across different sites as in Figure 1. The motivation for partitioning is scalability, and the motivation for replication is to enable high availability even when some of the replicas are down. For geo-replicated databases, sites are maintained at different datacenters/regions, although the paper surveys non-geo-replicated databases as well.

The taxonomy

The taxonomy is based on the relationship between transaction management and replica management. This paper considers transactions that provide one-copy serializability guarantee, where concurrent transactions behave as if they execute sequentially on a single database. For a partitioned database, it is necessary to coordinate the database partitions to enforce transactional (ACID) guarantees. In addition, in order to support replication, the database system must also synchronize the database replicas so that replication can be hidden from the application.

Figure 2 shows the proposed taxonomy. Replicated object systems is a single leaf. Replicated transactions systems further divide in to symmetric and asymmetric replicated systems.

Replicated object systems

Replicated object systems implement transaction management on top of replica management, which ensures that partitions are consistently replicated across sites. A prominent example of this category is Spanner, another is Granola (which is not geo-replicated).

If you like to refresh your knowledge of these systems, here is a link to my Spanner review, and here a link to my Granola review.

A tablet in Spanner is a collection of directories or fragments of directories. Tablets correspond to partitions of the database and they are replicated across sites. Spanner uses Paxos to synchronize the replicas of each partition across sites. Spanner uses a separate instance of Paxos, with a long-lived leader, for each partition.

To implement transactions (including multiple partition transactions) Spanner uses two-phase locking for concurrency control, and two-phase commit. The Paxos leader in each partition is responsible for participating in these protocols on behalf of that partition. It does so in much the same way that it would if its partition were not replicated, except that changes to the partition's state are replicated using Paxos instead of just being stored locally at the leader's server. The leaders serve as the link between the transaction protocols and Paxos by ensuring that both database updates and changes in the state of the transaction are replicated.

Replicated transaction systems

Replicated transaction systems implement replica management on top of transaction management. Transactions run over individual partition replicas, rather than one logical partition as was the case in replicated object systems.

Each site has a local transaction manager that ensures that its local transactions have ACID properties with respect to other local transactions at that site. Each local transaction is responsible for applying the effects of its parent global transaction to the replicas at its own local site.

In symmetric replicated transaction systems, all of the local transactions of a given parent transaction are the same and run concurrently at the different sites. In the UCSB's replicated commit protocol the global transaction will be committed if the local coordinators at a majority of the sites vote to commit it. In return a client that wishes to read data from a partition sends its read request to all replicas of that partition, waits for a majority of the replicas to respond, and chooses the latest version that it receives from that majority. (Similar to the idea in Attiya, Bar-Noy, and Dolev, 1995.)

In an asymmetric replicated system, one local master transaction runs first at a single site. If that local transaction is able to commit, then the remaining local transactions are run at the other sites. These remaining transactions are called update propagation transactions. Typically, the update propagation transactions perform only the updates of the master transaction, and do not perform any reads.

An example of an asymmetric replicated primary copy system is Microsoft's Cloud SQL Server, which is the database management system behind the SQL Azure cloud relational database service. Cloud SQL server is not designed for geo-replication, so all of database replicas ("sites") are located in the same datacenter. Transactions are limited to a single partition unless they are willing to run at the read committed SQL isolation level.

An example of an asymmetric replicated update anywhere system is Google Megastore, where database partitions (called entity groups) are replicated and geographically distributed. In Megastore, a client can initiate a single-partition transaction at any replica of that partition --typically, the client will use a nearby replica. For each partition, Megastore manages a transaction log, which is replicated to all sites using Paxos to ensure that transactions commit in the same order everywhere. Here is a link to my Megastore review.

I guess Yahoo PNUTS could be considered somewhere between primary copy and update anywhere system due to its per-record master scheme.

What's next?

This taxonomy is useful for thinking about the cloud-based transactional database systems in a more systematic way. So where does your favorite transactional distributed database system fit? Are there inherent limitations/strengths to one category over another? Is it possible to have efficient multi-partition transaction capability for asymmetric replicated transaction systems?