A distributed systems research agenda for cloud computing

Distributed systems (a.k.a. distributed algorithms) is an old field of almost 40 years old. It gave us impossibility proofs on the theory side, and also algorithms like Paxos, logical/vector clocks, 2/3-phase commit, leader election, dining philosophers, graph coloring, spanning tree construction which are adopted in practice widely. Cloud computing is a relatively new field in contrast. It provides new opportunities as well as new challenges for the distributed systems/algorithms area. Below I briefly discuss some of these opportunities and challenges.

Opportunities

Cloud computing provides abundance. Nodes are replaceable, even hot swappable. You can dedicate several nodes for running customized support services, such as monitoring, logging, storage/recovery service. These opportunities are likely to have impact on how fault-tolerance is considered in distributed systems/algorithms work.

Programmatic interfaces and service-oriented architecture are also hallmarks of cloud computing domain. Similarly, virtualization/containerization facilitates many things, including moving the computation to the data. However, it is unclear how these technologies can be employed to provide substantial benefits for distributed algorithms which operate at a more abstract plane.

Challenges

Coordination introduces synchronization, which introduces potential for cascading shutdowns and halts.  Especially, in a large-scale system of systems, any coordination introduced may lead to spooking/latent/self-organized synchronization. This point is explained nicely in "Towards A Cloud Computing Research Agenda". Thus the challenge is to avoid coordination as much as possible, and still build useful and "consistent" systems.

Another curse of the extreme scale in cloud computing is the large fan-in/fan-out of components. These challenges are explained nicely in the "Tail at Scale" paper. These can lead to broadcast storms and incast storms, and algorithms/systems should be designed to avoid these problems.

Finally, another major challenge is geographically distributed services. Due to speed of light communication over large distances are prone to latencies and especially pose consistency versus availability challenges due to the CAP theorem.

The current state of distributed systems research on cloud computing

This is not an exhaustive list. From what I can recollect now, here is a crude categorization of current research in distributed systems on the cloud computing domain.


What is next?

If only I knew... I can only speculate, I have some ideas, and that will have to wait for another post.

In the meanwhile, I can point to these two articles which talked about what would be good/worthy research areas in cloud computing.
How can academics do research on cloud computing?
Academic cloud: pitfalls, opportunities

UPDATE: Part 2 of this post (i.e., new directions) is available here. 

Comments

Jhonathan said…
This comment has been removed by a blog administrator.
Unknown said…
This comment has been removed by a blog administrator.

Popular posts from this blog

The end of a myth: Distributed transactions can scale

Hints for Distributed Systems Design

Foundational distributed systems papers

Learning about distributed systems: where to start?

Metastable failures in the wild

Scalable OLTP in the Cloud: What’s the BIG DEAL?

The demise of coding is greatly exaggerated

SIGMOD panel: Future of Database System Architectures

Dude, where's my Emacs?

There is plenty of room at the bottom