Wednesday, February 21, 2018

Paper summary: Bitcoin-NG -- A scalable Blockchain Protocol

This week in our seminar, we discussed the Bitcoin-NG paper. The paper appeared in NSDI 16, and is authored by Ittay Eyal, Adem Efe Gencer, Emin Gün Sirer, and Robbert van Renesse at the Cornell University.

The model section of this paper is very well formalized and written. This is like a Rosetta Stone find for classical distributed systems researchers that want to enter blockchain research. So in this summary, I will start by covering as much of that as I can.

The Nakamoto Consensus Problem 

The blockchain system is comprised of a set of nodes N connected by a reliable peer-to-peer network. Nodes can generate public-private key-pairs for themselves. The system employs a cryptopuzzle system, defined by a cryptographic hash function H. The solution to a puzzle defined by the string y is a string x such that H(y|x) --the hash of the concatenation of the two-- is smaller than some target (i.e., the hash has k number of leading zeros). Each node i has a limited amount of compute power, called mining power, measured by the number of potential puzzle solutions it can try per second. A solution to a puzzle constitutes a "proof of work", as it statistically indicates the amount of work a node had to perform in order to find it.

At any time t, a subset of nodes B(t) are Byzantine and can behave arbitrarily, controlled by a single adversary. The mining power of the Byzantine nodes is less than 1/4 of the total compute power at any given time.

Comparing the Nakamoto consensus problem with the classic Byzantine consensus problem is very instructional. Nakamoto consensus has probabilistic guarantees for Termination, Agreement, and Validity, whereas the classic Byzantine Consensus has deterministic guarantees for them.

BitCoin's Blockchain protocol

Bitcoin provides a Byzantine fault tolerant blockchain protocol for implementing a decentralized cryptocurrency. Each user commands addresses, and sends Bitcoins by forming a transaction from her address to another's address and sending it to the Bitcoin mining nodes.  Transactions are protected with cryptographic techniques that ensure only the rightful owner of a Bitcoin address can transfer funds from it. A client owns x Bitcoins at time t if the aggregate of unspent outputs to its address is x. Miners accept transactions only if their sources have not been spent, preventing users from double-spending their funds. The miners commit the transactions into a global append-only log called the blockchain. If multiple miners create blocks with the same preceding block, the chain is forked into branches, forming a tree. All miners add blocks to the heaviest chain of which they know, with random tie-breaking.

If you want to understand the blockchain data structure, this youtube video and demo is fantastic for it.

Unfortunately Bitcoin is haunted with scalability woes. The maximum rate at which Bitcoin can process transactions is capped by the block size and block interval.

Increasing the block size (which is currently set at 1MB) improves throughput, but the resulting bigger blocks take longer to propagate in the network. Of course increasing that to 10 MB will not cause a significant problem in propagation, after all bandwidth is not that limited. However, taken at extreme the tradeoff will be there. Moreover, every second of headstart counts for mining the next block: New blocks received late means miners are wasting resources by building on an old block that is no longer the most recent.

Reducing the block interval reduces latency, but leads to instability due to frequent forks. Bitcoin currently targets a conservative 10 minutes between blocks, yielding 10-minute expected latencies for transactions to be encoded in the blockchain.

With this setup, Bitcoin yields a wimpy 1 to 3.5 transactions per second. Transaction finalization is also problematic. If you wait for 6 blocks to append to declare a transaction to be finalized, that will take an expected 60 minutes time.


Bitcoin-NG is a protocol that improves on Bitcoin in terms of transaction throughput and latency of propagation. Bitcoin-NG’s latency is limited only by the propagation delay of the network, and its bandwidth is limited only by the processing capacity of the individual nodes. Bitcoin-NG achieves this performance improvement by decoupling Bitcoin’s blockchain operation into two planes: leader election and transaction serialization. In Bitcoin-NG, consensus is pushed back only to identify the leader, and serialization is performed by the leader. This seems to me to be a classic application of chain replication (while the paper does not cite chain replication).

Bitcoin-NG divides time into epochs, where each epoch has a single leader. As in Bitcoin, leader election is performed randomly and infrequently via proof-of-work. Once a leader is chosen, it is entitled to serialize transactions via microblocks unilaterally until a new leader is chosen, marking the end of the former’s epoch.

Thus the protocol introduces two types of blocks: key blocks for leader election and microblocks that contain the ledger entries. Like a Bitcoin block, a key block contains the reference to the previous block (either a key block or a microblock, usually the latter), the current Unix time, a coinbase transaction to pay out the reward, a target value, and a nonce field containing arbitrary bits. Unlike Bitcoin, a key block in Bitcoin-NG contains a public key that will be used in subsequent microblocks.

Once a node generates a key block it becomes the leader. As a leader, the node is allowed to generate microblocks at a set rate smaller than a predefined maximum. (Specifically, if the timestamp of a microblock is in the future, or if its difference with its predecessor's timestamp is smaller than the minimum, then the microblock is invalid. This prohibits a greedy leader from swamping the system with microblocks.)  A microblock contains ledger entries and a header. The header contains the reference to the previous block, the current Unix time, a cryptographic hash of its ledger entries, and a cryptographic signature of the header. The signature uses the private key that matches the public key in the latest key block in the chain.

Microblock fork prevention is essential for this system, since the leader can spit out many microblocks quickly to make more money from transaction fees.  To report on a leader that violates the microblock production rate, a poison transaction is employed. The poison transaction contains the header of the first block in the pruned branch as a proof of fraud, and it has to be placed on the blockchain within the maturity window of the misbehaving leader’s key block, and before the revenue is spent by the malicious leader.

The new leader cannot fake an offending microblock to accuse the old leader, because the poison transaction should contain the private signature of the previous leader. Moreover the 40/60 partitioning of credit for microblock transactions between the old leader and the new leader incentivizes the new leader not to cut the microblock set short, because it would like to get more revenue from them. So the new leader that mines the last key block has all the incentive to behave according to the protocol. The 40/60 partitioning of the credit is shown to satisfy these constraints via mathematical modeling in the paper.

It looks like the microblocks need not be chained in a sequence. Rather the microblocks form a set that follow the keyblock of the corresponding leader and satisfying the rate limitations prescribed in the protocol. So what is the authoritative microblock set taken then, given that slightly different set of microblocks may arrive at different nodes of the network. It looks like the new leader (that mines the next key block) is the one to authoritatively decide on that.

MAD questions

1) In the Bitcoin protocol, if we increase the block size what could go wrong? That would definitely help with the throughput problems Bitcoin is facing. It would mean some slight increase in delivery latency depending on the bandwidth available in the path. But even increasing it to 10MB is unlikely to break anything, I think. (But hey I am a newbie in this space.) So why are there endless debates about this in the Bitcoin community? Could it be that the small blockers are concerned more about actually the increase in the transaction volume, which would mean the history grows quickly, which would make being a full Bitcoin node (rather than just a client) become more of a burden? But isn't that just delaying the inevitable? Is the idea here to delay the inevitable until more storage space and computing become available? That sounds so wrong to me as an engineer.

2) It looks like a hard fork will be needed for switching to the Bitcoin-NG protocol, because the current Bitcoin clients are unaware of the concept of microblocks. Oh, well, good luck with getting on a "consensus" on a hard fork in Bitcoin protocol. In his latest blog post "Why Decentralization Matters", Chris Dixon cited as the biggest problem with the centralized services that they can change the rules on the users: "The good news is that billions of people got access to amazing technologies, many of which were free to use. The bad news is that it became much harder for startups, creators, and other groups to grow their internet presence without worrying about centralized platforms changing the rules on them, taking away their audiences and profits. This in turn stifled innovation, making the internet less interesting and dynamic."

On the other hand, the "community-governed" decentralized protocols seem to be haunted by the reverse problem: It is very hard to change the rules even to fix problems with the protocol. Isn't that as big, if not a bigger, problem as the former for stifling innovation?

Tuesday, February 13, 2018

Paper review. IPFS: Content addressed, versioned, P2P file system

This week we discussed the IPFS whitepaper by Juan Benet in my Distributed Systems Seminar.

Remember peer-to-peer systems? IPFS is "peer-to-peer systems reloaded" with improved features. IPFS is a content-addressed distributed file system that combines Kademlia + BitTorrent + Git ideas. IPFS also offers better privacy/security features: it provides cryptographic hash content addressing, file integrity and versioning, and filesystem-level encryption and signing support.

The question is will it stick? I think it won't stick, but this work will still be very useful because we will transfer the best bits of IPFS to our datacenter computing as we did with other peer-to-peer systems technology. The reason I think it won't stick has nothing to do with the IPFS development/technology, but has everything to do with the advantages of centralized coordination and the problems surrounding decentralization. I rant more about this later in this post. Read on for the more detailed review on IPFS components, killer app for IPFS, and MAD questions.

IPFS components


Nodes are identified by a NodeId, the cryptographic hash3 of a public-key, created with S/Kademlia’s static crypto puzzle. Nodes store their public and private keys (encrypted with a passphrase).


Transport: IPFS can use any transport protocol, and is best suited for WebRTC DataChannels(for browser connectivity) or uTP.
Reliability: IPFS can provide reliability if underlying networks do not provide it, using uTP or SCTP.
Connectivity: IPFS also uses the ICE NAT traversal techniques.
Integrity: IPFS optionally checks integrity of messages using a hash checksum.
Authenticity: IPFS optionally checks authenticity of messages using HMAC with sender’s public key.


To find other peers and objects, IPFS uses a DSHT based on S/Kademlia and Coral. Coral DSHT improves over by Kademlia based on the three rules of real-estate: location, location, location. Coral stores addresses to peers who can provide the data blocks taking advantage of data locality. Coral can distribute only subsets of the values to the nearest nodes avoiding hot-spots. Coral organizes a hierarchy of separate DSHTs called clusters depending on region and size. This enables nodes to query peers in their region first, "finding nearby data without querying distant nodes" and greatly reducing the latency of lookups.


In IPFS, data distribution happens by exchanging blocks with peers using a BitTorrent inspired protocol: BitSwap. Unlike BitTorrent, BitSwap is not limited to the blocks in one torrent. The blocks can come from completely unrelated files in the filesystem. In a sense, nodes come together to barter in the marketplace. BitSwap incentivizes nodes to seed/serve blocks even when they do not need anything in particular. To avoid leeches (freeloading nodes that never share), peers track their balance (in bytes verified) with other nodes, and peers send blocks to debtor peers according to a function that falls as debt increases. For bartering, potentially, a virtual currency like FileCoin (again by Juan Benet) can be used.


IPFS builds a Merkle DAG, a directed acyclic graph where links between objects are cryptographic hashes of the targets embedded in the sources. (This video explains Merkle Trees superbly.) Merkle DAGs provide IPFS many useful properties:
1. Content addressing: All content is uniquely identified by its multihash checksum.
2. Tamper resistance: all content is verified with its checksum.
3. Deduplication: all objects that hold the exact same content are equal, and only stored once.


IPFS also defines a set of objects for modeling a versioned filesystem on top of the Merkle DAG. This object model is similar to Git’s:
1. block: a variable-size block of data.
2. list: an ordered collection of blocks or other lists.
3. tree: a collection of blocks, lists, or other trees.
4. commit: a snapshot in the version history of a tree.


IPNS is the DNS for IPFS. We have seen that NodeId is obtained by hash(node.PubKey). Then IPNS assigns every user a mutable namespace at: /ipns/<NodeId>. A user can publish an Object to this /ipns/<NodeId> path signed by her private key. When other users retrieve the object, they can check the signature matches the public key and NodeId. This verifies the authenticity of the Object published by the user, achieving mutable state retrieval.

Unfortunately since <NodeId> is a hash, it is not human friendly to pronounce and recall. For this DNS TXT IPNS Records are employed. If /ipns/<domain> is a valid domain name, IPFS looks up key ipns in its DNS TXT records: TXT "ipfs=XLF2ipQ4jD3U ..." 
# the above DNS TXT record behaves as symlink
ln -s /ipns/XLF2ipQ4jD3U /ipns/

There is even the Beaker browser to help you surf IPFS. But its usability is not great.  If IPFS wants to manage the web, it should further improve its IPNS and content discovery game. Where is the search engine for IPFS content? Do we need to rely on links from friends like the 1993's Web?

What is the killer app for IPFS?

The introduction of the paper discusses HTTP and Web, and then says:
"Industry has gotten away with using HTTP this long because moving small files around is relatively cheap, even for small organizations with lots of traffic. But we are entering a new era of data distribution with new challenges: (a) hosting and distributing petabyte datasets, (b) computing on large data across organizations, (c) high-volume high-definition on-demand or real-time media streams, (d) versioning and linking of massive datasets, (e) preventing accidental disappearance of important files, and more. Many of these can be boiled down to "lots of data, accessible everywhere." Pressed by critical features and bandwidth concerns, we have already given up HTTP for different data distribution protocols. The next step is making them part of the Web itself. 
What remains to be explored is how [Merkle DAG] data structure can influence the design of high-throughput oriented file systems, and how it might upgrade the Web itself. This paper introduces IPFS, a novel peer-to-peer version-controlled filesystem seeking to reconcile these issues."
How common are petabyte or even gigabyte files on the Internet? There is definitely an increase in size trend due to the popularity of the multimedia files. But when will this become a pressing issue? It is not a pressing issue right now because CDNs help a lot for reducing traffic for the Internet. Also bandwidth is relatively easy to add compared to latency improvements. Going for a decentralized model globally comes with several issues/headaches, and I don't know how bad the bandwidth problems would need to get before starting to consider that option. And it is not even clear that the peer-to-peer model would provide more bandwidth savings than CDNs at the edge model.

I am not convinced that the Web is the killer application for IPFS, although at the end, the paper gets ambitious:
"IPFS is an ambitious vision of new decentralized Internet infrastructure, upon which many different kinds of applications can be built. At the bare minimum, it can be used as a global, mounted, versioned filesystem and namespace, or as the next generation file sharing system. At its best, it could push the web to new horizons, where publishing valuable information does not impose hosting it on the publisher but upon those interested, where users can trust the content they receive without trusting the peers they receive it from, and where old but important files do not go missing. IPFS looks forward to bringing us toward the Permanent Web."
Decentralization opens a Pandora's box of issues. Centralized is efficient and effective. Coordination wants to be centralized. A common and overhyped misconception is not centralized is not scalable and centralized is a single point of failure. After close to two decades of work in cluster computing and cloud computing, we have good techniques in place for achieving scalability and fault-tolerance for centralized (or logically centralized, if you like) systems. For scalability, shard it, georeplicate it, and provide CDNs for reading. For fault-tolerance, slap Paxos on it, or use chain replication systems (where Paxos guards the chain configuration), or use the globe-spanning distributed datastores available today. Case in point, Dropbox is logically-centralized but is very highly available and fault-tolerant, while serving to millions of users. Facebook is able to serve billions of users.

If you want to make the natural disaster tolerance argument to motivate the use of IPFS, good luck trying to use IPFS over landlines when power and ISPs are down, and good luck trying to form a multihop wireless ad hoc network over laptops using IPFS. Our only hope in a big natural disaster is cell towers and satellite communication. Disaster tolerance is serious work and I hope governments around the world are funding sufficient research into operational, planning, and communications aspects of that.

In Section 3.8, the whitepaper talks about the use cases for IPFS:
1. As a mounted global filesystem, under /ipfs and /ipns.
2. As a mounted personal sync folder that automatically versions, publishes, and backs up any writes.
3. As an encrypted file or data sharing system.
4. As a versioned package manager for all software.
5. As the root filesystem of a Virtual Machine.
6. As the boot filesystem of a VM (under a hypervisor).
7. As a database: applications can write directly to the Merkle DAG data model and get all the versioning, caching, and distribution IPFS provides.
8. As a linked (and encrypted) communications platform.
9. As an integrity checked CDN for large files (without SSL).
10. As an encrypted CDN.
11. On webpages, as a web CDN.
12. As a new Permanent Web where links do not die.
I don't think any of these warrant going full peer-to-peer. There are centralized solutions for them, or centralized solutions are possible for them.

An important use case for IPFS is to circumvent government censorship. But isn't it easier to use VPNs then to use IPFS for this purpose. (Opera browser comes with VPN build-in, and many easy to use VPN apps are available.) If the argument is that the governments can ban VPNs or prosecute people using VPN software, those issues also apply to IPFS unfortunately. Technology is not always the solution especially when dealing with big social issues.

IPFS may be a way of sticking it to the man. But the invisible hand of the free market forces also help here; when one big corporation starts playing foul and upsets the users, new companies and startups quickly move in to disrupt the space and fill in the void.

Again, I don't want to come across wrong. I think IPFS is great work, and Juan Benet and IPFS contributors accomplished a gigantic task, with a lot of impact on future systems (I believe the good parts of IPFS will be "adopted" to improve Web and datacenter/cloud computing). I just don't believe dialing the crank to 11 on decentralization is the right strategy for wide adoption. I don't see the killer application that makes it worthwhile to move away from the convenience of the more centralized model to open the Pandora's box with a fully-decentralized model.

MAD questions 

1) Today's networking ecosystem evolved for the client-server model, what kind of problems could this create for switching to peer-to-peer model? As a basic example, the uplink at residential (or even commercial) spaces is an order of magnitude less than downlink assuming they are consumers of traffic not originators of traffic. Secondly, ISPs (for good or bad) evolved to take on traffic shaping/engineering responsibilities peering with other ISPs. It is a complex system. How does popular IPFS use interact with that ecosystem.

2) As a related point, smartphones gained primary citizenship status in today's Internet. How well can peer-to-peer and IPFS get along with smartphones? Smartphones are very suitable to be thin clients in the cloud computing model, but they are not suitable to act as peers in a peer-to-peer system (both for battery and connection bandwidth reasons). To use a technical term, the smartphones will be leeches in a peer-to-peer model. (Well unless there is good token/credit system in place, but it is unrealistic to expect that soon.)

3) On the academic side of things, designing a decentralized search engine for IPFS sounds like a great research problem. Google had it easy in the datacenter but can you design a decentralized keyword/content based search engine (or one day old indexes) maintained in a P2P manner over IPFS nodes? Popularity of a file in the system (how many copies it has in the system) can play a role in its relevance ranking for the keyword. Also could a bloom filter like data structure be useful in a p2p search?

4) Here are some more pesky problems with decentralization. I am not clear if satisfactory answers exist on these. Does IPFS mean I may be storing some illegal content originated by other users?
How does IPFS deal with the volatility? Just closing laptops at night may cause unavailability under an unfortunate sequence of events. What is the appropriate number of replicas for a data to avoid this fate? Would we have to over-replicate to be conservative and provide availability?
If IPFS is commonly deployed, how do we charge big content providers that benefit from their content going viral over the network? Every peer chips in distributing that content, but the content generator benefits let's say by way of sales. Would there need to be a token economy that is all seeing and all fair to solve this issue?

5) Is it possible to use Reed-Solomon erasure coding with IPFS? Reed-Solomon codes are very popular in the datacenters as they provide great savings for replication.

6) IPFS does not tolerate Byzantine behavior, right? The crypto puzzle needed for Node Id can help reduce the false spammers, as it makes them do some work. But after joining, there is no guarantee that the peers will play it fair: they can be Byzantine to wreak havoc on the system. But how much problems can they cause? Using cryptos and signatures prevent many problems. But can the Byzantine nodes somehow collude to cause data loss in the system, making the originator think the data is replicated, but then deleting this data? What other things can go wrong?

Saturday, February 10, 2018

Paper summary. SnailTrail: Generalizing critical paths for online analysis of distributed dataflows

Monitoring is very important for distributed systems, and I wish it would receive more attention in research conferences. There has been work on monitoring for predicate detection purposes and for performance problem detection purposes. As machine learning and big data processing frameworks are seeing more action, we have been seeing more work on the latter category. For example in ML there have been work on how to figure out what is the best configuration to run. And in the context of general big data processing framework there has been work on identifying performance bottlenecks.

Collecting information and creating statistics about a framework to identify the bottleneck activities seems like an easy affair. However, the "making sense of performance" paper (2015) showed that this is not as simple as it seems, and sophisticated techniques such as blocked time analysis are needed to get a more accurate picture of performance bottlenecks.

This paper (by ETH Zurich and due to appear in NSDI 18) extends the performance bottleneck analysis to more general frameworks and to be supported by an online monitoring tool called SnailTrail. SnailTrail is written in Rust, over the Timely Dataflow framework. It supports monitoring of applications written in Flink, Spark, Timely Dataflow, Tensorflow, and Heron.

SnailTrail overview

The SnailTrail tool operates in 5 stages:
  • it ingests the streaming logs from the monitored distributed application,
  • slices those streams into windows, 
  • constructs a program activity graph (PAG) for the windows, 
  • computes the critical path analysis of the windows, and 
  • outputs the summaries. 

Program activity graph (PAG) is a directed acyclic graph. The vertices denote the start and end of activities, such as: data processing, scheduling, buffer management, serialization, waiting, application data exchange, control messages, or unknown activities. The edges has a type and a weight for the activities. The edges capture the happened-before relationships between the vertices. Figure 2.a shows an example. The figure also shows how to project this PAG to an interval, which is pretty straightforward and as you expect.

As output, SnailTrail provides summaries for operation, worker, and communication bottlenecks. These summaries help identify which machine is overloaded, and which operators should be re-scaled. The figure below shows examples.

Critical path analysis

Critical path analysis (CPA) is a technique originally introduced in the context of project planning. CPA produces a metric which captures the importance of each activity in the transient critical paths of the computation.
The CPA score calculation in SnailTrail centers on the Transient Path Centrality concept which corresponds to the number of paths this activity appears on. In the formula e[w] corresponds to the weight of the edge e, which is taken as the start/end time duration of the activity.

In Figure 3, the activity (u,v) has the highest CPA score because it is involved in all of the 9 paths that appear in this window.

A noteworthy thing in the PAG in Figure 3 is the wait period (denoted as dashed lines) after b in worker 1. Because worker 1 is blocked with a wait after b, that path terminates at b, and does not feed into another path. The paper explains this as follows: "Waiting in our model is always a consequence of other, concurrent, activities, and so is a key element of critical path analysis: a worker does not produce anything useful while waiting, and so waiting activities can never be on the critical path." Because the two wait states terminates in deadend some of the paths, in this interval depicted in Figure 3, there are only 9 possible paths.

The formula above enables us to calculate the CPA scores without enumerating and materializing all the transient critical paths of the computation in each window. But how do you calculate N (which, for Figure 3 is calculated as 9) without enumerating the critical paths? It is simple really. The PAG is a directed acyclic graph (DAG). Everytime there is a split in the path, you copy path_count to both edges. Everytime two edges join, you set the new path_count to be the sum of the path_counts in the two incoming edges. At the end of the interval, look at how many paths are exiting, that is N. Give this a try for Figure 3 yourself.

MAD questions

1) What could be some drawbacks/shortcomings of PCA? One reason PCA may not be representative is because one execution of the application may not be typical of all executions. For example in Fig 3, w3 may take long time in the next execution in c-d activity and that could become the bottleneck in another execution of the application. But it is possible to argue that since SnailTrail produces summaries, this may not be a big issue. Another reason PCA may not be representative is because the execution may be data dependent. And again it is possible to argue that this won't be a big issue if the application uses several data in processing, and things get averaged.

Another criticism of CPA could be that it does not compose. Try combining two adjacent windows; that would lead to changing the CPA scores for the activities. Some previously low scored  activities will jump up to be high, and some highly scored activities will go down. Because of this non-composition, it becomes important to decide what is the best window/interval size to calculate these CPA metrics for summarization purposes. On the other hand, it may be reasonable to expect these metrics to be non-composable, since these metrics (blocked time analysis and critical time analysis) are designed to be complex to capture inherent/deep critical activities in the computations.

Could there be another approach than the CPA scoring presented here to capture the importance of an execution activity in the paths of the computation?  Maybe something that uses the semantics of activity types and their relation to each other?

2) The SnailTrail method uses snapshots to determine the start and end of the windows that the CPA scoring algorithm works on. Does time synchronization need to be perfect for snailtrail snapshot and analysis to work? What are the requirements on the time synchronization for this to work? 

It turns out this currently requires perfect clock synchronization. The evaluation experiments are run in the same machine with 32 cores. Without perfect clock synchronization, the snapshots may not be consistent and that could break the path calculation and CPA scoring. Hybrid Logical Clocks can help deal with the uncertainty periods in NTP clocks and can make the method work. The paper addresses this issue in Appendix F: "In SnailTrail, we assume that the trend toward strong clock synchronization in datacenters means that clock skew is not, in practice, a significant problem for our analysis. If it were to become an issue, we would have to consider adding Lamport clocks and other mechanisms for detecting and correcting for clock skew."

3) The paper doesn't have an example of improvement/optimization based on summary analytics. Even when the summary shows the high scored CPA activities to improve upon, it will be tricky for a developer to go in and change the code to improve things, because there is no indication of how easy it would be to improve the performance on this high CPA score activities. 

One way to address this could be to extend the CPA idea as follows. After ranking the activities based on CPA scores, construct another ranking of activities based on ease of optimizing/improving (I don't know how to do this, but some heuristics and domain knowledge can help). Then start the improvements/optimizations from the easiest to optimize activities that have highest CPAs.
Another way to make the summary analytics more useful is to process it further to provide more easily actionable suggestions, such as add more RAM, more CPU, more network/IO bandwidth. Or the suggestions could also say that you can do with less of resources of one kind, which can helpf for saving money in cloud deployments. If you can run with similar performance but on cheaper configurations, you would like to take that option. Would this extension require adopting SnailTrail monitoring and CPA scoring more towards capturing  resource usage related metrics? 

4) Since CPA first appeared in the project/resource management domain, could there be other techniques there that can apply to performance monitoring of distributed execution?

5) I think SnailTrail breaks new ground in the "context" or "philosophy" of monitoring: it is not trace based, it is not snapshot based, but it is window/interval-based. In our work on Retroscope, we argued it is important to aggregate the paths for both spatially (across the distributed computation) and temporally (as an evolving picture of the system's performance). SnailTrail extends the snapshot view to window views.


Frank McSherry recently joined ETH Zurich's systems group and will be helping with the strymon project and the SnailTrail work, so I'll be looking forward to more monitoring work from there.  

Wednesday, February 7, 2018

Think before you code

I have been reading Flash boys by Michael Lewis. It is a fascinating book about Wall Street and high-frequency traders. I will write a review about the book later but I can't refrain from an overview comment. Apart from the greediness, malice, and sneakiness of Wall Street people, another thing that stood out about them was the pervasive cluelessness and ineptness. It turns out nobody knew what they were doing, including the people that are supposed to be regulating things, even sometimes the people that are gaming the system. This brought to my mind Steve Jobs' quote from his 1994 interview: "Everything in this world... was created by people no smarter than you."

Anyways, today I will talk about something completely different in the book that caught my eye. It is about thinking before coding.

On Page 132 of the book:
Russians had a reputation for being at the best programmers on Wall Street, and Serge thought he knew why: They had been forced to learn to program computers without the luxury of endless computer time. Many years later, when he had plenty of computer time, Serge still wrote out new programs on paper before typing them into the machine. "In Russia, time on the computer was measured in minutes," he said. "When you write a program, you are given a tiny time slot to make it work. Consequently we learned to write the code in ways that minimized the amount of debugging. And so you had to think about it a lot before you committed it to paper.... The ready availability of computer time creates this mode of working where you just have an idea and type it and maybe erase it ten times. Good Russian programmers, they tend to have had that one experience at some time in the past--the experience of limited access to computer time."

Of course, Dijkstra was a big proponent of thinking before programming. Here in EWD 1284, he compares European CS vs American CS. He didn't have nice things to say about the keyboard-happy American CS.
The major differences between European and American CS are that American CS is more machine-oriented, less mathematical, more closely linked to application areas, more quantitative and more willing to absorb industrial products in its curriculum. For most of these differences there are perfect historical explanations, many of which reflect the general cultural differences between the two continents, but for CS we have also to take into account the special circumstance that due to the post-war situation, American CS emerged a decade earlier, for instance at the time when design, production, maintenance and reliability of the hardware were still causes for major concern. The names of the early professional societies are in this respect revealing: the “Association for Computing Machinery” and the “British Computer Society”. And so are the names of the scientific discipline and the academic departments: in the US, CS is short for Computer Science, in Europe it is short for Computing Science.
The other circumstance responsible for a transatlantic difference in how CS evolved I consider a true accident of history, viz. that for some reason IBM was very slow in getting interested in Europe as a potential market for its computers, and by the time it targeted Europe, this was no longer virgin territory. Consequently, IBM became in Europe never as predominant as it has been in Northern America. 

Here in EWD1165, Dijkstra shares an anecdote about thinking before coding and uses that as an opportunity to take a shot at "software engineering". Man, Dijkstra had the best rants.
A recent CS graduate got her first job, started in earnest on a Monday morning and was given her first programming assignment. She took pencil and paper and started to analyse the problem, thereby horrifying her manager 1.5 hours later because “she was not programming yet”. She told him she had been taught to think first. Grudgingly the manager gave her thinking permission for two days, warning her that on Wednesday she would have to work at her keyboard “like the others”! I am not making this up. And also the programming manager has found the euphemism with which to lend an air of respectability to what he does: “software engineering”.

In fact, Dijkstra takes things further, and advocates even forgoing the pen and the paper when thinking:
What is the shortest way to travel from Rotterdam to Groningen, in general: from given city to given city. It is the algorithm for the shortest path, which I designed in about twenty minutes. One morning I was shopping in Amsterdam with my young fiancée, and tired, we sat down on the café terrace to drink a cup of coffee and I was just thinking about whether I could do this, and I then designed the algorithm for the shortest path. As I said, it was a twenty-minute invention. In fact, it was published in ’59, three years late. The publication is still readable, it is, in fact, quite nice. One of the reasons that it is so nice was that I designed it without pencil and paper. I learned later that one of the advantages of designing without pencil and paper is that you are almost forced to avoid all avoidable complexities. Eventually that algorithm became, to my great amazement, one of the cornerstones of my fame.
Dijkstra (2001), in an interview with Philip L. Frana. (OH 330; Communications of the ACM 53(8):41–47)"

In several of his EWDs, Dijkstra mentioned how he favored solving problems without pen and paper and just by thinking hard, and how he has the entire article sketched in his mind before he sits to write it down in one go. Here is an example where he mentions the Mozart versus Beethoven approach to composing. Obviously Dijkstra was on the Mozart camp.
There are very different programming styles. I tend to see them as Mozart versus Beethoven. When Mozart started to write, the composition was finished. He wrote the manuscript and it was 'aus einem Guss' (from one cast). In beautiful handwriting, too. Beethoven was a doubter and a struggler who started writing before he finished the composition and then glued corrections onto the page. In one place he did this nine times. When they peeled them, the last version proved identical to the first one.
Dijkstra (2001) Source: Denken als discipline, a program from Dutch public TV broadcaster VPRO from April 10th, 2001 about Dijkstra"

For the record, and not that you care, I am on team Beethoven. Being perfectionist and taking an "I will get it right in one shot" approach to thinking/designing makes me freeze. Separating concerns by dividing my thinking between a creative mode and criticizing mode works much better for me. (I talked about that in here, here, and here.)

Maybe my thinking is sloppy and I need crutches. But it is possible to argue that writing/prototyping is a tool--not a crutch-- for thinking, and that tools are invaluable capability multipliers. And on the power of writing as a tool, I will end with these quotes:
Writing is nature's way of letting you know how sloppy your thinking is. -- Guindon. 
If you think without writing, you only think you're thinking. -- Leslie Lamport 
Without writing, you are reduced to a finite automaton.
With writing you have the extraordinary power of a Turing machine. -- Manuel Blum

MAD questions

1) Hmm, a handicap that becomes an advantage. I think that was the entire theme of the "David and Goliath" book by Gladwell. Of course, that book got a largely negative critical response. But that doesn't mean it cannot be a useful model sometimes.

Are there scientifically proven studies of a seeming handicap becoming an advantage? What are some examples?

2) Yeah, about that whole Mozart versus Beethoven mode thing... Am I mistaken? Should I be more open-minded about the Mozart mode? What is doable by the Mozart mode that would be impossible to do with the Beethoven mode?

3) Well, this is not a question. This is self reflection. Way to go Murat! You started with Flash Boys, jumped to Steve Jobs's 1994 interview, and finished with comparing Mozart and Beethoven mode. A great display of "discipline in thought" indeed.

Thursday, February 1, 2018

Paper review. Blockchains from a distributed computing perspective

Our distributed systems seminar met the first time this Monday. I went through the syllabus, explained how we will run the seminar, and introduced the papers we will discuss. In the second hour, to give students an overview of the blockchains technology, I went through this paper: "Blockchains from a distributed computing perspective".

I liked this paper a lot. It is from a respected distributed systems expert, Maurice Herlihy. It is written to be accessible and expository. The paper gives several examples (with increasing sophistication) to explain the blockchain concepts concretely. Finally, the paper is a perfect fit with our seminar's theme of looking at blockchains from a distributed systems perspective. Herlihy states that the paper is "colored by the perspective that much of the blockchain world is a disguised, sometimes distorted, mirror-image of the distributed computing world."

For a data processing perspective on blockchains, see this other paper.

Simple ledger example

The paper first introduces a simple ledger system using Alice's online news service as an example. To record articles as they arrive, Alice created a ledger that is implemented as a simple linked list. When an article arrives, it is placed in a shared pool, and a set of dedicated threads, called miners, collectively run a repeated protocol, called consensus, to select which item to append to the ledger. And to query for an article, a thread scans the linked-list ledger.

Is this a too simple, too constrained way of implementing the system? The paper says that the log-based system architecture has two compelling advantages:

  1. it is universal; it can implement any type of data structure, no matter how complex
  2. all questions of concurrency and fault-tolerance are compartmentalized in the consensus protocol 

Indeed this log/ledger based architecture is very popular for modern cloud-native distributed systems. This writeup, by Jay Kreps, tells you how important is the log/ledger abstraction for building distributed systems.

Kafka and BookKeeper are very popular platforms that enables compartmentalizing the concurrency and fault-tolerance in the framework's consensus protocol---realized by ZooKeeper. (As a side note, the private blockchain, Hyperledger v1.0.0-rc1, adopts a no-Byzantine consensus protocol based on Kafka.)

Finally, the Tango paper (SOSP'13) showed a good example of universality of logs: it showed how to build replicated in-memory data structures, and reliable and available distributed transactions and applications, on top of a shared log architecture. I believe the ideas explored in Tango paper can be used for efficiently maintaining multiple streams in the same log so that reading the whole chain is not needed for materializing the updates at the nodes.


Going from a centralized log implementation to a fully-decentralized public blockchain implementation needs some motivation. After her online news business, Alice wants to go to restaurant business. Like any trendy social influencer, she does an initial certificate sale (redeemable for meals) for her restaurant for raising capital.

This is somewhat like using Kickstarter for the restaurant. (Centralized is not de facto bad. Why would you not trust KickStarter? The market and the laws can straighten Kickstarter up, if it plays crooked.) However, the initial coin offering (ICO) adds more functionality on top of Kickstarter: you can sell/trade fractions of your token. In other words, the ICO creates a whole market around kickstarting.

The paper introduces the Proof-of-Work (PoW) idea using this example. Most miners are probably honest, content to collect their fees, but there is still a threat that even a small number of dishonest miners might collude with one another to cheat Alice’s investors. Alice’s first idea is to have miners, identified by their IP addresses, vote via a Byzantine fault-tolerant consensus algorithm. Alice quickly realizes this is a bad idea. Alice has a nemesis, Sybil, who is skilled in the art of manufacturing fake IP addresses. So Alice employs PoW for the blockchain. Sybil’s talent for impersonation is useless to her if each of her sock puppet miners must buy an expensive, long-shot lottery ticket.

PoW is a form of costly signaling: it is expensive in terms of time wasted and electricity bills. As a famous example, Bitcoin uses PoW for consensus: only a miner which has successfully solved a computationally hard puzzle (finding the right nonce for the block header) can append to the blockchain.

This is a good point for a concrete example. Luckily, this demonstration of PoW-based blockchain is a perfect way of doing that. If you haven't watched/tried this, take 15 minutes now to do it. Someone should give an award to Anders.

PoW has several shortcomings, of course. It is computationally very expensive and extremely wasteful for resources around the globe. Its throughput is miserable and its transactions are slow. It is nondeterministic: "It is still possible that two blocks are appended at the same time, creating a fork in the blockchain. Which block should subsequent miners build on? The usual answer is to build on the block whose chain is longest." So Bitcoin consolidates this by only considering a block as confirmed after it is followed by a number of blocks (typically six blocks).


The paper introduces smartcontracts using cross-chain swap as an example. Suppose Alice wants to trade some of her coupons to Bob in return for some bitcoins. Alice’s coupons live on one blockchain, and Bob’s bitcoin live on another, so they need to devise an atomic cross-chain swap protocol to consummate their deal. Naturally, neither one trusts the other.

I am acutely aware that technology is not the cure for everything. But, for the usecases where you can go all-digital, I think smartcontracts will be a great tool. It is an executable contract, so you can avoid notaries, lawyers, courts, and cops, because the failure clauses will be executed automatically. I think the smartcontracts idea is here to stay even when the fully-decentralized PoW-based blockchain approach dies.

On the other hand, the smartcontracts are not devoid of challenges either. The paper talks about the DAO attack, and makes a great point about the importance of concurrency control for smartcontracts: "We have seen that the notion that smart contracts do not need a concurrency model because execution is single-threaded is a dangerous illusion."  In a post last week, I had presented a TLA+/PlusCal modeling of the DAO attack.

MAD questions

Blockchain PoW Consensus vulnerabilities

The consensus problem, which is also at the heart of the blockchain, has been studied for more than 40 years in distributed systems. The PoW based blockchain takes a radical approach to implement it in an approximated manner.

Now, I am not saying distributed systems have consensus solved at scale. There is no solution that scales to the big number of participants that may be desired as in public blockchains. I wrote about it this earlier: "I love that Blockchains brings new constraints and requirements to the consensus problem. In blockchains, the participants can now be Byzantine, motivated by financial gains. And it is not sufficient to limit the consensus participants to be 3 nodes or 5 nodes, which was enough for tolerating crashes and ensuring persistency of the data. In blockchains, for reasons of attestability and tolerating colluding groups of Byzantine participants, it is preferred to keep the participants at 100s. Thus the problem becomes: How do you design a byzantine tolerant consensus algorithm that scales to 100s or 1000s of participants?"

On the other hand, the research on consensus in distributed systems literature has established a rigorous perimeter around the safety and liveness/progress guarantees that can be provided. The FLP impossibility result shows that it is impossible guarantee progress under a full asynchronous model with a crash failure---even with reliable channels. Paxos approach to solving consensus compartmentalizes the safety and progress properties nicely. Even under a fully asynchronous model (where all reasonable timing expectations break), Paxos preserves safety thanks to its balloting and anchoring system. Paxos also provides progress as soon as the partial synchrony kicks in and weak-complete & eventually-weak-accurate failure detectors are implementable (i.e., when we are out of the realm of the FLP result).

Because of this compartmentalization, we are guaranteed that Paxos's safety guarantees is not vulnerable to a timing attack where malicious parties break all reasonable timing assumptions about the system. This is a hard thing to get right, and several protocols have been shown to be vulnerable to timing attacks because they depended on some perfectly reasonable time synchronization assumptions.

It seems like the Bitcoin protocol makes several reasonable timing assumptions, but when an attacker manages to violate those timing assumptions, the safety of the protocol can be violated as well. 

How will the Bitcoin bubble end?

I had done a Twitter poll on this a while ago. What other options do you think are plausible? (Of course I am not claiming that this poll means anything.)

I still have a lot to learn about blockchains and cryptocurrencies, and in particular Bitcoin. Something I don't yet understand is whether it is possible for the ecosystem to erode in a slippery slope, tragedy of the commons fashion, to a majority Byzantine behavior. Say if a slightly different version of the protocol starts cutting some corners, and since that turns out to be advantageous financially more miners start adopting it, and soon that becomes the majority behavior and give rise to a hack (via a trojan) or slaying of the goose that lays the golden eggs.

While I don't know much about the dirty implementation details of Bitcoin, this incident doesn't give me much confidence that Bitcoin is decentralized, immutable, unforgeable, and unhackable.

What is the best medium for smartcontracts?

Is a scripting language the best medium? It is easy to screw that up as the DAO attack and ERC20 standard examples in the paper show. Maybe using a more declarative approach, and employing formal specifications and invariant-based reasoning to writing contracts could prove to be more resistant to errors.

What about applying model checking to smartcontracts? Could that help reduce the risks?

Sunday, January 28, 2018

Paxos derived

Lamport's fault-intolerant state machine replication algorithm

In 1978, Lamport published his classic "Time, Clocks, and the Ordering of Events in a Distributed System". As an application of logical clocks, he presented a distributed replicated state machine algorithm (and then he instantiated that algorithm to solve mutual exclusion as an example). Lamport complains that no one seemed to be aware of the distributed replicated state machine algorithm introduced in the paper:
"This is my most often cited paper. Many computer scientists claim to have read it. But I have rarely encountered anyone who was aware that the paper said anything about state machines. People seem to think that it is about either the causality relation on events in a distributed system, or the distributed mutual exclusion problem. People have insisted that there is nothing about state machines in the paper. I’ve even had to go back and reread it to convince myself that I really did remember what I had written."
I had talked about this distributed replicated state machine algorithm earlier. This algorithm is decentralized to a defect. It is not even tolerant to a single node failure. It assumes failure-free nodes.

The idea of the algorithm is as follows: In order to ensure that processes do not have different views of the order of updates, logical clocks is used to impose a total ordering on the updates. Each process keeps as part of its state the following: copy of the state, logical clock, queue of "modify requests" (with their logical time stamps), list of "known-times", one for every other process. Each process executes an update request on its copy of the state in increasing order of timestamps. For safety, all "known times" from other processes should be later than the time of the request.

The algorithm works as follows:
  1. Push your request in your own queue (timestamped with your logical clock)
  2. Broadcast your request to every node timestamp included.
  3. Wait for replies from all other nodes.
  4. If your request is now at the head of your queue and the known-times for other processes is ahead of its request timestamp (known-times is updated as processes send replies to the update request), enter critical section (where update to the state is done).
  5. Upon exiting the critical section, remove your request from the queue and send a release message to every process.

A fault-intolerant version of Paxos

I recently realized that the algorithm above (from the 1978 paper) constitutes a fault-intolerant instance of Paxos!

This occurred to me after thinking about it in the context of flexible quorums result. The flexible quorums idea (2016) states that we can weaken Paxos’ "all quorums should intersect" assertion to instead "only quorums from different phases should intersect". That is, majority quorums are not necessary for Paxos, provided that phase-1 quorums (Q1) intersect with phase-2 quorums (Q2).

This result allows trading off Q1 and Q2 sizes to improve performance (to the detriment of fault-tolerance)  Assuming failures and resulting leader changes are rare, phase-2 (where the leader tells the acceptors to decide values) is run more often than phase-1 (where a new leader is elected). Thus it is possible to improve performance of Paxos by reducing the size of Q2 at the expense of making the infrequently used Q1 larger. For example in a system of 10 acceptors, we can safely allow any set of only 3 acceptors to participate in Phase2, provided that we require 8 acceptors to participate for Phase1.  Note that the majority quorums (Q1=Q2=6) would be able to mask upto 5 node failures (f=5), whereas the Q1=8 configuration can only with stand upto 2 node failures (f=2) as it needs 8 nodes to be able to perform phase-1 if needed.

So, if you take Q1=N and Q2=1, the Paxos algorithm simplifies to the Lamport's distributed state machine replication algorithm above. Note that Q1=N implies the algorithm cannot tolerate any node failures, i.e., f=0. On the other hand, with this setup, you can combine phase 2 and phase 3 because you are writing to only one node, yourself. So phase 3 is non-existent in that algorithm.

The road from f=0 to Paxos

Ok, let's approach our claim from the other side as well. How do we take that f=0 protocol and strengthen it so that it doesn't block (lose progress) with one node failure?

This is how Phase 3 comes in to play as we add fault-tolerance. In order to tolerate one node crash  (in a fault-masking manner), you need Q2 to be 2. Then things suddenly get complicated, because you are not just writing to yourself, you will also need to write to another node in a guaranteed manner to persist the state. But, another leader may be stealing your turn before you can write to your other Q2 node your decision at Phase 2, so it is not safe to commit the update request! Therefore, Phase 2 clearing, which is phase 3, is needed to make this check, and it helps you replicate your state so it is preserved to the face of one node failure.

This is a point of objection, though. In Lamport's f=0 algorithm, logical clocks (LC) are used for reservation; every node respects LC, and puts requests into its queue ordered by LC. If one node needs to get its update done, it eventually will because the system is making progress. On the other hand, in Paxos, using the ballot numbers, for whose implementation LC could be used, a leader steals the previous leader's turn instead of patiently waiting the previous round to be complete. So what gives?

Well... In Lamport's f=0 algorithm, you could afford to be nice and patiently wait for each node to finish its turn, because f=0, and you are guaranteed to reach what you wait for. But when f>0 and a node can fail, you can't afford to wait for it to finish its turn (otherwise you would have to wait for an eternity in an asynchronous system model), and that is why Paxos is happy to change leaderships, and dueling leaders can arise (even to the point of violating progress).

In sum, something "fundamental" changes when you want to go fault-tolerant and tolerate node failure in an asynchronous system. When you combine faults and full-asynchrony, you get the FLP impossibility result. That means you lose progress! That is why Paxos does not guarantee making progress under a full asynchronous model with a crash failure. However, it preserves safety thanks to its balloting and anchoring system, and will provide progress as soon as the partial synchrony kicks in and weak-complete & eventually-weak-accurate failure detectors are implementable (i.e., when we are out of the realm of the FLP result). So, yes, there is a phase transition going from no faults to faults in asynchronous system.

I thank my PhD students, Ailidani Ailijiang and Aleksey Charapko, for discussion on this idea.

MAD questions

Was this actually how Leslie Lamport come up with the Paxos protocol? Does the 1978 fault-intolerant distributed state machine replication form a basis to evolve a fault-tolerant version?

I am not aware of any paper that makes this connection. Was this connection noticed and mentioned before?

Friday, January 26, 2018

Modeling the DAO attack in PlusCal

Maurice Herlihy's paper: "Blockchains from a distributed computing perspective" explains the DAO attack as follows:

"Figure 1 shows a fragment of a DAO-like contract, illustrating a function that allows an investor to withdraw funds. First, the function extracts the client's address (Line 2), then checks whether the client has enough funds to cover the withdrawal (Line 3). If so, the funds are sent to the client through an external function call (Line 4), and if the transfer is successful, the client’s balance is decremented (Line 5). 
This code is fatally  flawed. In June 2016, someone exploited this function to steal about $50 million funds from the DAO. As noted, the expression in Line 3 is a call to a function in the client's contract. Figure 2 shows the client's code. The client's contract immediately calls withdraw() again (Line 4). This re-entrant call again tests whether the client has enough funds to cover the withdrawal (Line 3), and because withdraw() decrements the balance only after the nested call is complete, the test erroneously passes, and the funds are transferred a second time, then a third, and so on, stopping only when the call stack overflows."
(Of course, that is a very simplified description of the DAO attack. More accurate descriptions are provided here and here.)

Even though the code seems sequential (after all the blockchain serializes everything), it has concurrency problems built in. This was a point made in Herlihy's paper as follows:
"In Ethereum, all contracts are recorded on the blockchain, and the ledger includes those contracts' current states. When a miner constructs a block, it fills that block with smart contracts and exe- cutes them one-by-one, where each contract's  final state is the next contract's initial state. These contract executions occur in order, so it would appear that there is no need to worry about concurrency." 
After showing DAO vulnerability and ERC20 token standard vulnerability, the paper says:
"We have seen that the notion that smart contracts do not need a concurrency model because execution is single-threaded is a dangerous illusion. Sergey and Hobor give an excellent survey of pitfalls and common bugs in smart contracts that are disguised versions of familiar concurrency pitfalls and bugs." 

Enter TLA+/PlusCal

Since TLA+/PlusCal is a great tool for catching concurrency problems, I thought it would be useful to model this DAO attack in PlusCal. After I got the idea, it took me a short time to model and model-check this in PlusCal. I used procedures in PlusCal (which I don't use often) to match the description of the problem.

TLA+ is all about invariant-based reasoning so I wrote the invariant first. Writing "SafeWithdrawal == (bankBalance=BALANCE /\ malloryBalance=0) \/ (bankBalance=BALANCE-AMOUNT /\ malloryBalance=AMOUNT)was too tight, because the updates of the balances are not happening atomically. That is how the invariant-based thinking helps us immediately: we can see that the withdrawal is a non-atomic operation, and realize that we should be more careful with the updates.

In the model checking pane, I set BALANCE as 10 and AMOUNT as 10. That is, initially Mallory has 10 coins in her bankBalance, and 0 in her wallet and wants to transfer her bankBalance and sets AMOUNT=10. When I run the model checker, it finds the double withdrawal problem immediately. Mallory's account got to 20 starting from 0! Normally we would expect it to go to 10 (line 27) temporarily, and then her bankBalance to be set to 0 (line 22). But this code managed to do double withdrawal, and the SafeWithdrawal invariant is violated.

The error trace contains 8 steps: Initially BankWithdraw is called, which then calls the MallorySendMoney to complete withdrawal. However, Mallory's SendMoney implementation includes another call to BankWithdraw and the balance check in line 18 passes because bankBalance is not decremented by amount (that comes in line 22). So the second BankWithdraw executes concurrently and Mallory manages to do double (and later triple) withdrawal.

Fixing things

Ok, let's check if we can fix this if we move the bankBalance subtraction before MallorySendMoney.
Of course for that we change SafeWithDrawal to accommodate the new way of updating bankBalance. But it turns out that is still too tight. If I call this with BALANCE=10 and AMOUNT=4, it is OK to have two withdrawals concurrently provided that in the final state no new money is produced: Invariant == bankBalance+malloryBalance <= BALANCE. I also model check for progress and write an EndState temporal formula for it: EndState == <>(bankBalance<=BALANCE-AMOUNT /\ bankBalance+malloryBalance=BALANCE). When we model check it, we see that this solves the problem.  So it leaves me puzzled, why, when it was this easy, the original BankWithdraw code was not coded this way and was left vulnerable to the attack.

These PlusCal models are available on my Github directory.

MAD questions

Should we come up with a PlusCal framework to facilitate modeling and model-checking of smart-contracts?

I had written about why you should model. Those apply here as well, and here things become even more critical. When money is involved, attackers get smart quickly, and it is easy to have vulnerabilities in concurrent code due to the many corner cases. Let TLA+/PlusCal show you those cornercases and help you design your protocol to achieve correctness guarantees. So if you are writing smartcontracts, I think it makes sense to first model-check and verify them. It doesn't take much effort, and it can save you from big problems.

Related links

Here is some previous discussion/context about why I started assigning TLA+/PlusCal modeling projects in distributed systems classes.

There is a vibrant Google Groups forum for TLA+ :!forum/tlaplus

Clicking on label "tla" at the end of the post you can reach all my posts about TLA+

Wednesday, January 24, 2018

Spring 18 Distributed Systems Seminar

I considered doing a blast from the past edition of the seminar focusing only on classical papers published before 2000. But, nah, I decided, I am doing a blockchain seminar instead. There has been many interesting papers on blockchains and this will be a good chance to catch up with those and look at the distributed systems, concurrency and coordination issues there.

Here is the list. I thank Adem Efe Gencer for suggesting several interesting papers for this list.

Blockchain papers:
  1. Blockchains from a Distributed Computing Perspective 
  2. Blockstack: A Global Naming and Storage System Secured by Blockchains 
  3. IPFS 
  4. Bitcoin-NG: A Scalable Blockchain Protocol 
  5. A Secure Sharding Protocol For Open Blockchains
  6. Enhancing Bitcoin Security and Performance with Strong Consistency via Collective Signing 
  7. Service-Oriented Sharding for Blockchains
  8. The stellar consensus protocol: A federated model for internet-level consensus
Smart contracts papers:
  1. Step by Step Towards Creating a Safe Smart Contract: Lessons and Insights from a Cryptocurrency Lab 
  2. A Concurrent Perspective on Smart Contracts
  1. Zerocash: Decentralized Anonymous Payments from Bitcoin 
  2. SoK: Research Perspectives and Challenges for Bitcoin and Cryptocurrencies 
  3. Lightning Network 

Finally, I will also cover some distributed machine learning papers as well:

Related links

2016 Seminar reading list
2015 Seminar reading list

Monday, January 22, 2018

Erasable pens for editing papers

I recently discovered the Pilot Frixion pens and I like them a lot. (I am not getting any advertisement money from them I swear :-)

The pens have erasable ink, so they are great for marking your comments/edits on a paper while reading. They erase via heat. Each pen comes with a plastic nub, and if you apply friction to the page with the plastic nub at the top, and it erases the writing --mostly clean. A word of caution though, this means if you leave your writing in a hot car, you will find it erased, which you can remedy by putting it in a freezer. I am not kidding.

So, don't use it for writing you want to keep permanently, but it is great for writing comments and marking on a paper when you are reading.

I print the research paper I am reading and I do a lot of marking on paper. If I use a regular pen, I cross over some of my guesswork, nonsensical questions, or misinformed comments, and it messes up the paper. But using Frixion pens, I erase and modify my comments without creating a mess. Well, at least, less of a mess.

I am still not an online shopping junkie. Although these pens are available on Amazon, I still like to buy them from a store. I was happy to see that Walmart has the 3 colors (black, red, blue) for $5. I use the red pen for question,  blue pen for important comments, and black pen for all the other comments and doodling.

MAD questions

I once prematurely advocated going back to the fountain pen.

I still switch between the two. Mostly I live in Emacs happily. But I print and edit with erasable pens when I need to edit. And I switch to a fountain pen (a cheap Pilot Metropolitan pen combined with a refillable ink-pump --- again not advertising), when I am badly stuck and I need to think on paper, or when I want to break monotony. I sign things with that fountain pen, and the pen surprises the students even my colleagues. Fountain pens are now considered antique.

I wonder if it would be possible to combine the perfect pen on paper feel with all the ease/convenience of text wrangling in Emacs within the next 10 years. I feel that would be the ultimate productivity product for me.  

No I don't think MS surface* is there yet; it is still lacking in both fronts: writing-feel and editing-convenience. I was hoping Apple with get there, but after Steve Jobs passed away, I am not counting on it anymore.

Thursday, January 18, 2018

Remember peer-to-peer systems?

Traditionally computer systems use client server model. This is more of a centralized approach; server sits there and responds to clients requests. If one server is not enough for computation/analysis, a "hierarchical" organization of servers model is adopted in datacenter and cloud computing. One node becomes the master, other nodes act as workers. This is called the master-worker model. This simple model make sense if you have an infrastructure. Centralized control architecture is simple, so you can keep the coordination simple and efficient.

Peer-to-peer model is on the other end of the spectrum: it calls for a fully decentralized system model. There is no distinguished master. Each node acts as both server and client, each node is a peer. This model does not require stable infrastructure and it can self-organize with what is presently available. As such, they are great for circumventing laws, bans, and censorship.
In 2000s, peer-to-peer systems were all the craze. Peer-to-peer music sharing applications, Napster, Gnutella, and Kazaa, were very popular and successful. There were systems called CAN, Chord, Pastry, and Tapestry. Bittorrent, peer-to-peer communication protocol was also very popular: "In November 2004, BitTorrent was responsible for 25% of all Internet traffic." 
Then, peer-to-peer systems disappeared from the scene in the next 5 years or so. The peer-to-peer architecture got abolished, but the best ideas from those work found their way to traditional datacenter computing. Consistent hashing got adopted for distributed key-value stores. Bittorrent saw some uses in datacenter application-layer networking. 

Today, the pendulum is swinging back again to peer-to-peer systems for decentralized attestation with blockchain and ipfs applications. 

As a distributed systems professor, I should be exuberant like everybody else, but I am cautious. (As I wrote in 2014, distributed is not necessarily more scalable than centralized). The centralized coordination architectures (today's cloud computing and datacenter architectures) have a strong attraction point: they are simple and efficient to coordinate. Even then we mess up building those systems. So we don't stand much chance in building, scaling, and maintaining fully-decentralized systems, let alone leveraging on them as scaffolding to build/control/coordinate more sophisticated and scalable applications.

So, most likely, this will play out similar to the peer-to-peer systems of 2000. The blockchain architectures will fade away, but the best ideas of blockchain systems will be adopted for adding attestation, authentication and smart-contracts for cloud computing and datacenter applications. Hopefully those ideas can be used to fix the problems with social networks and make them used for enabling collaboration to make/build things, where individual effort/labor/contribution can be tracked and rewarded appropriately. 

Tuesday, January 16, 2018

Paper summary. A Berkeley view of systems challenges for AI

This position paper from Berkeley identifies an agenda for systems research in AI for the next 10 years. The paper also serves to publicize/showcase their research, and steer interest towards these directions, which is why you really write position papers.

The paper motivates the systems agenda by discussing how systems research/development played a crucial role in fueling AI’s recent success. It says that the remarkable progress in AI has been made possible by a "perfect storm" emerging over the past two decades, bringing together: (1) massive amounts of data, (2) scalable computer and software systems, and (3) the broad accessibility of these technologies.

The rest of the paper talks about the trends in AI and how those map to their systems research agenda for AI.

Trends and challenges

The paper identifies 4 basic trends in the AI area:
  • Mission-critical AI: Design AI systems that learn continually by interacting with a dynamic environment in a timely, robust, and secure manner.
  • Personalized AI: Design AI systems that enable personalized applications and services while respecting users’ privacy and security.
  • AI across organizations: Design AI systems that can train on datasets owned by different organizations without compromising their confidentiality. (I think it was possible to simplify presentation by combining this with the Personalized AI.)
  • AI demands outpacing the Moore’s Law: Develop domain-specific architectures and distributed software systems to address the performance needs of future AI applications in the post-Moore’s Law era.
To enable progress on these fronts, the paper then identifies 9 research topics, across 3 main areas: Acting in dynamic environments, Secure AI, and AI specific architectures.

Acting in dynamic environments

R1: Continual learning

Despite Reinforcement Learning (RL)'s successes (Atari games, AlphaGo in chess and Go games), RL has not seen widescale real-world application. The paper argues that coupling advances in RL algorithms with innovations in systems design will drive new RL applications.

Research: (1) Build systems for RL that fully exploit parallelism, while allowing dynamic task graphs, providing millisecond-level latencies, and running on heterogeneous hardware under stringent deadlines. (2) Build systems that can faithfully simulate the real-world environment, as the environment changes continually and unexpectedly, and run faster than real time.
Of course, the second part here refers to research described in "Real-Time Machine Learning: The Missing Pieces". Simulated Reality (SR) focuses on continually simulating the physical world with which the agent is interacting. Trying to simulate multiple possible futures of a physical environment in high fidelity within a couple milliseconds is a very ambitious goal. But research here can also help other fields, so this is interesting.

R2: Robust decisions

The challenges here are: (1) robust learning in the presence of noisy and adversarial feedback, and (2) robust decision-making in the presence of unforeseen and adversarial inputs.

Research: (1) Build  fine grained provenance support into AI systems to connect outcome changes (e.g., reward or state) to the data sources that caused these changes, and automatically learn causal, source-specific noise models. (2) Design API and language support for developing systems that maintain confidence intervals for decision-making, and in particular can flag unforeseen inputs.

R3: Explainable decisions

Here we are in the domain of causal inference, a field "which will be essential in many future AI applications, and one which has natural connections to diagnostics and provenance ideas in databases."

Research: Build AI systems that can support interactive diagnostic analysis, that faithfully replay past executions, and that can help to determine the features of the input that are responsible for a particular decision, possibly by replaying the decision task against past perturbed inputs. More generally, provide systems support for causal inference.

Secure AI

R4: Secure enclaves

A secure enclave is a secure execution environment—which protects the application running within from malicious code running outside.

Research: Build AI systems that leverage secure enclaves to ensure data confidentiality, user privacy and decision integrity, possibly by splitting the AI system’s code between a minimal code base running within the enclave, and code running outside the enclave. Ensure the code inside the enclave does not leak information, or compromise decision integrity.

R5: Adversarial learning

The adaptive nature of ML algorithms opens the learning systems to new categories of attacks: evasion attacks and data poisoning attacks.

Research: Build AI systems that are robust against adversarial inputs both during training and prediction (e.g., decision making), possibly by designing new machine learning models and network architectures, leveraging provenance to track down fraudulent data sources, and replaying to redo decisions after eliminating the fraudulent sources.

R6: Shared learning on confidential data

The paper observes that, despite the large volume of theoretical research, there are few practical differential privacy systems in use today, and proposes to simplify differential privacy use for real-world applications.

Research: Build AI systems that (1) can learn across multiple data sources without leaking information from a data source during training or serving, and (2) provide incentives to potentially competing organizations to share their data or models.

AI specific architectures

R7: Domain specific hardware

The paper argues that "the one path left  to continue the improvements in performance-energy-cost of processors is developing domain-specific processors." It mentions the Berkeley Firebox project, which proposes a multi-rack supercomputer that connects thousands of processor chips with thousands of DRAM chips and nonvolatile storage chips using fiber optics to provide low-latency, high-bandwidth, and long physical distance.

Research: (1) Design domain-specific hardware architectures to improve the performance and reduce power consumption of AI applications by orders of magnitude, or enhance the security of these applications. (2) Design AI software systems to take advantage of these domain-specific architectures, resource disaggregation architectures, and future non-volatile storage technologies.

R8: Composable AI systems

The paper says modularity and composition will be key to increasing development speed and adoption of AI. The paper cites the Clipper project.

Research: Design AI systems and APIs that allow the composition of models and actions in a modular and  flexible manner, and develop rich libraries of models and options using these APIs to dramatically simplify the development of AI applications.

R9: Cloud-edge systems

The paper mentions the need to repurpose code to multiple heterogeneous platforms via re-targetable software design and compiler technology. It says "To address the wide heterogeneity of edge devices and the relative difficulty of upgrading the applications running on these devices, we need new software stacks that abstract away the heterogeneity of devices by exposing the hardware capabilities to the application through common APIs."

Research: Design cloud-edge AI systems that (1) leverage the edge to reduce latency, improve safety and security, and implement intelligent data retention techniques, and (2) leverage the cloud to share data and models across edge devices, train sophisticated computation-intensive models, and take high quality decisions.

MAD questions

(The questions that led to these explanations are left as an exercise to the reader.)

1) In 2009, there was a similar position paper from Berkeley called "Above the Clouds: A Berkeley View of Cloud Computing". That paper did a very good job of summarizing, framing, and selling the cloud computing idea to the academia. But it looks like the research agenda/directions from that report didn't fare very well after 8 years---which is totally expected. Plans are useless but planning is indispensable. The areas of interest change after some time and the research adapts to it. It is impossible to tightly plan and manage exploratory research in CS areas (maybe this is different in biology and sciences areas.)

I think it is a YES for items 4, 5, 6, and partial for the rest, with very little progress in items 2 and 9. While the opportunities did not include them, the following developments have since reshaped the cloud computing landscape:

  • dominance of machine learning workloads in the cloud
  • the rise of NewSQL systems, the trend for more consistent distributed databases, and the importance of coordination/Paxos/ZooKeeper in the cloud 
  • the development of online in-memory dataflow and stream processing systems, such as Spark, which came out of Berkeley
  • the race towards finer-granularity virtualization via containers and functions as a service
  • the prominence of SLAs (mentioned only once in the paper)

So even though the AI-systems agenda from Berkeley makes a lot of sense, it will be instructive to watch how these pan out and what unexpected big AI-systems areas open up in the coming years.

2) Stanford also released a similar position paper earlier this year, although theirs was for a limited scope/question for developing a [re]usable infrastructure for ML. Stanford's DAWN project aims to target end-to-end ML workflows, empower domain experts, and optimize end-to-end. This figure summarizes their vision for the reusable ML stack:

Of course, again, this inevitably reflects the strengths and biases of the Stanford team; they are more on the database, datascience, production side of things. It  looks like this has some commonalities with the AI-specific architectures section of the Berkeley report, but different approaches are proposed for the same questions.

3) For R2: Robust decisions, it seems like formal methods, modeling, invariant-based reasoning, can be useful, especially when concurrency control becomes an issue in distributed ML deployments.