Shoal: How We Reduce Bullshark Latency on The Aptos Blockchain

Aptos Labs · Published in Aptos · Jun 13, 2023

By Balaji Arun and Alexander Spiegelman

World-class engineering meets cutting-edge research at Aptos. For more details, please refer to the Shoal paper.

TL;DR: We have solved two important open problems in DAG BFT that significantly reduce latency and, for the first time, have eliminated the need for timeouts in practical deterministic protocols. Overall, we have improved Bullshark’s latency by 40% in the failure-free case and by 80% in the failure case.

Shoal is a framework for enhancing any Narwhal-based consensus protocol (e.g., DAG-Rider, Tusk, Bullshark) with pipelining and leader reputation. Pipelining reduces DAG ordering latency by introducing an anchor in every round, and leader reputation further improves latency by ensuring anchors are associated with the fastest validators. Moreover, leader reputation allows Shoal to take advantage of the asynchronous DAG construction to eliminate timeouts in all but extremely uncommon scenarios. This allows Shoal to provide a property we name Prevalent Responsiveness, which subsumes the often-desired Optimistic Responsiveness.

Our technique, which is surprisingly simple, involves running multiple instances of the underlying protocol one after another sequentially. Consequently, when instantiated with Bullshark, we obtain a Shoal of sharks running a relay race.

Motivation

In the pursuit of high performance in blockchain networks, there has been a focus on reducing communication complexity. However, this approach has not led to significant improvements in throughput. For instance, Hotstuff — implemented in earlier versions of Diem — achieves only 3500 TPS, falling significantly short of our target of achieving 100k+ TPS.

A recent breakthrough, however, stemmed from the realization that data dissemination is the primary bottleneck in leader-based protocols, and it can benefit from parallelization. The Narwhal system separated data dissemination from the core consensus logic and proposed an architecture where all validators simultaneously disseminate data, while the consensus component orders only a smaller amount of metadata. The Narwhal paper reported a throughput of 160,000 TPS.

In a previous post, we presented Quorum Store, our Narwhal implementation that decouples data dissemination from consensus, and described how we use it to scale our current consensus protocol, Jolteon. Jolteon is a leader-based protocol that combines the linear fast path of Tendermint with a PBFT-style view-change, reducing Hotstuff’s latency by 33%. Nevertheless, it has become apparent that leader-based consensus protocols cannot fully harness Narwhal’s throughput potential. Despite separating data dissemination from consensus, the leaders in Hotstuff/Jolteon are still constrained as throughput increases.

We, therefore, decided to deploy Bullshark, a zero communication overhead consensus protocol, on top of the Narwhal DAG. Unfortunately, the DAG construction that enables Bullshark’s high throughput comes with a 50% latency price compared to Jolteon.

In this post, we present Shoal and explain how we drastically reduce Bullshark latency.

DAG-BFT Background

Let us start with the relevant background for understanding this post. Please refer to the DAG meets BFT post for a detailed description of Narwhal and Bullshark.

Each vertex within the Narwhal DAG is associated with a round number. In order to progress to round r, a validator must first obtain n-f vertices belonging to round r-1. Every validator can broadcast one vertex per round, with each vertex referencing a minimum of n-f vertices from the previous round. Due to network asynchrony, different validators may observe different local views of the DAG at any point in time. Below is an illustration of a possible local view:

Figure 1. The causal history of the vertex of validator 2 in round 2 is highlighted in green.

A key property of the DAG is non-equivocation: if two validators have the same vertex v in their local view of the DAG, then they have exactly the same causal histories of v.
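
To make the round structure concrete, here is a minimal sketch of the rules above in Rust. The types and field names are illustrative (they are not the Narwhal/Quorum Store implementation); only the n-f rules come from the description above.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative vertex: one per validator per round, referencing at least
/// n - f vertices from the previous round (field names are hypothetical).
struct Vertex {
    round: u64,
    author: usize,                  // index of the broadcasting validator
    parents: HashSet<(u64, usize)>, // references to vertices of round - 1
}

/// A validator's local view of the DAG, keyed by (round, author).
struct LocalDag {
    n: usize, // number of validators
    f: usize, // fault-tolerance bound
    vertices: HashMap<(u64, usize), Vertex>,
}

impl LocalDag {
    /// A validator may progress to `round` once its local view contains at
    /// least n - f vertices belonging to `round - 1`.
    fn can_progress_to(&self, round: u64) -> bool {
        round == 1
            || self.vertices.keys().filter(|(r, _)| *r + 1 == round).count() >= self.n - self.f
    }
}
```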

Total Order

It is possible to agree on the total order of all vertices in the DAG with no additional communication overhead. To do so, validators in DAG-Rider, Tusk, and Bullshark interpret the structure of the DAG as a consensus protocol, where a vertex represents a proposal and an edge represents a vote.

While the quorum intersection logic on the DAG structure differs, all existing Narwhal-based consensus protocols share the following structure:

  1. Predetermined anchors. Every few rounds (e.g., two in Bullshark) there is a round with a pre-determined leader. The vertex of the leader is called an anchor.
  2. Ordering anchors. Validators independently, but deterministically, decide which anchors to order and which to skip.
  3. Order causal histories. Validators process their list of ordered anchors one by one and, for each anchor, order all previously unordered vertices in its causal history by some deterministic rule (a sketch follows Figure 2 below).

Figure 2. An illustration of a possible local view of the DAG in the Bullshark protocol. In this example, the red and yellow anchors are ordered, while the green one (not in the DAG) is skipped. As a result, to order the DAG, the validator deterministically orders the red anchor’s causal history first, and immediately after, the yellow’s.
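
To illustrate how steps (2) and (3) fit together, here is a small Rust sketch using a simplified DAG representation (a map from each vertex to its parent references). The list of ordered anchors is taken as given, i.e., the output of step (2), and sorting by round and then author index is just one possible deterministic rule for step (3).

```rust
use std::collections::{HashMap, HashSet};

/// Vertex identifier: (round, author index). A hypothetical representation.
type VertexId = (u64, usize);

/// Order each anchor's not-yet-ordered causal history, using a deterministic
/// tie-breaking rule shared by all validators.
fn order_dag(
    parents: &HashMap<VertexId, Vec<VertexId>>, // local DAG: vertex -> parent references
    ordered_anchors: &[VertexId],               // output of step (2)
) -> Vec<VertexId> {
    let mut ordered = Vec::new();
    let mut seen: HashSet<VertexId> = HashSet::new();
    for &anchor in ordered_anchors {
        // Collect the anchor plus everything reachable through parent edges
        // that has not already been ordered by an earlier anchor.
        let mut history = Vec::new();
        let mut stack = vec![anchor];
        while let Some(v) = stack.pop() {
            if seen.insert(v) {
                history.push(v);
                if let Some(ps) = parents.get(&v) {
                    stack.extend(ps.iter().copied());
                }
            }
        }
        // Any agreed-upon deterministic rule works; (round, author) order is one choice.
        history.sort();
        ordered.extend(history);
    }
    ordered
}
```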

The key to satisfying safety is to ensure that in step (2) above, all honest validators create a list of ordered anchors such that all lists share the same prefix. In Shoal, we make the following observation regarding all the above protocols:

All validators agree on the first ordered anchor.

Bullshark Latency

Bullshark’s latency depends on the number of rounds between the ordered anchors in the DAG. While the most practical partially synchronous version of Bullshark has better latency than the asynchronous version, it is far from optimal.

Problem 1: Average block latency. In Bullshark, there is an anchor in every even round, and vertices in every odd round are interpreted as votes. In the common case, two DAG rounds are required to order an anchor. However, the other vertices in the anchor’s causal history require additional rounds, since they must wait for an anchor to be ordered. In the common case, vertices in odd rounds require three rounds, while non-anchor vertices in even rounds require four rounds (see Figure 3).

Figure 3. In the common case, the anchor in round i+1 requires two rounds to be ordered. The vertices in round i are ordered at the same time, therefore their latency is three rounds. The vertices in round i+1, however, must wait for the next anchor to be ordered (the one in round i+3) and thus their order latency is four rounds.
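
In code, the common-case arithmetic of Problem 1 is simply the following (a sketch that assumes the Bullshark convention above, anchors on even rounds and votes on odd rounds, and counts latency inclusively in DAG rounds):

```rust
/// Common-case ordering latency, in DAG rounds, of a Bullshark vertex.
fn common_case_latency_in_rounds(round: u64, is_anchor: bool) -> u64 {
    match (round % 2 == 0, is_anchor) {
        (true, true) => 2,  // an anchor: its own round plus one round of votes
        (false, _) => 3,    // a vote-round vertex is ordered together with the next anchor
        (true, false) => 4, // a non-anchor vertex in an anchor round waits for the *next* anchor
    }
}
```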

Problem 2: Failure-Case latency. The above latency analysis applies to the failure-free case. On the other hand, if a leader of a round fails to broadcast the anchor fast enough, then the anchor cannot be ordered (and is therefore skipped), thus all the unordered vertices in previous rounds must wait for the next anchor to be ordered. This can significantly degrade performance in geo-replicated networks, especially because Bullshark uses timeouts to wait for leaders.

Shoal Framework

Shoal solves both latency issues. It enhances Bullshark (or any other Narwhal-based BFT protocol) with pipelining, which allows an anchor in every round and reduces the latency for all non-anchor vertices in the DAG to three rounds. Shoal also introduces a zero overhead leader reputation mechanism into the DAG, which biases the selection towards fast leaders.

Challenges

Pipelining and leader reputation in the context of DAG protocols were considered hard problems for the following reasons:

  • Previous pipelining attempts tried to modify the core Bullshark logic, but this appears to be inherently impossible.
  • Introduced in DiemBFT and formalized in Carousel, leader reputation is the idea of dynamically selecting future leaders (anchors in Bullshark) based on validators’ past performance. While disagreeing on leaders’ identities does not violate safety in those protocols, in Bullshark it can lead to completely different orderings. This is the crux of the problem: selecting round anchors dynamically and deterministically is required to solve consensus, yet validators need to agree on the ordered history, which itself requires consensus, in order to select future anchors.

As evidence of the problems’ difficulty, we note that none of the Bullshark implementations, including the ones currently in production in the wild, support these features.

Protocol

Despite the challenges above, it turns out the solution was hidden behind simplicity, as the saying goes.

In Shoal, we lean into the power of performing local computations on the DAG, which lets us preserve and re-interpret information from previous rounds. With the core insight that all validators agree on the first ordered anchor, Shoal sequentially combines several instances of Bullshark, pipelining them such that (1) the first ordered anchor is the switching point between instances, and (2) the anchor’s causal history is used for computing leaders’ reputations.

Pipelining

Similar to Bullshark, validators agree a priori on the potential anchors, i.e., there is a known mapping F: R -> V from rounds to leaders. Shoal runs instances of Bullshark one after another such that, for each instance, the anchors are predetermined by the mapping F. Each instance orders one anchor, which triggers a switch to the next instance.

Initially, Shoal starts the first instance of Bullshark in the first round of the DAG and runs it until the first ordered anchor is determined, say in round r. All validators agree on this anchor. Therefore, all validators can deterministically agree to reinterpret the DAG starting from round r+1. Shoal simply starts a new instance of Bullshark in round r+1.

In the best case, this allows Shoal to order an anchor in every round. The anchor in the first round is ordered by the first instance. Then, Shoal starts a new instance in the second round, which itself has an anchor. This anchor is ordered by that instance. Then, another new instance orders an anchor in the third round and the process continues. See the figure below for an illustration:

Figure 4. The vertices corresponding to the leaders determined by F are marked by a crown. The first instance of Bullshark starts by interpreting the DAG with anchors in rounds 1, 3, and 5. Bullshark determines that the anchor in round 1, marked by a green checkmark, is the first to be ordered in the first instance. (Note that in the general case, this anchor could be skipped and some other anchor would be the first to be ordered.) Then, the rest of the first instance is ignored and a new instance of Bullshark starts at round 2 with the anchors marked in rounds 2 and 4.
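
The switching rule can be sketched as follows. The `BullsharkInstance` trait and its methods are hypothetical stand-ins for the real instance logic over the shared DAG; the only point illustrated is the loop: run an instance until its first anchor is ordered, then reinterpret the DAG from the next round with a fresh instance.

```rust
/// Hypothetical interface for a single Bullshark instance interpreted over
/// the shared DAG, with anchors fixed by the mapping F from `start_round` on.
trait BullsharkInstance {
    fn new(start_round: u64) -> Self;
    /// Blocks until this instance determines its first ordered anchor and
    /// returns the round of that anchor.
    fn run_until_first_ordered_anchor(&self) -> u64;
}

/// Shoal's pipelining loop: each ordered anchor is the switching point.
fn shoal_pipelining<I: BullsharkInstance>(mut start_round: u64) {
    loop {
        let instance = I::new(start_round);
        let anchor_round = instance.run_until_first_ordered_anchor();
        // All validators agree on this anchor, so they all reinterpret the
        // DAG from the next round with a fresh instance; the remainder of
        // the current instance is discarded.
        start_round = anchor_round + 1;
    }
}
```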

Leader reputation

Latency increases when anchors are skipped during the Bullshark ordering. In such cases, the pipelining technique cannot help, since a new instance cannot be started until the previous instance orders an anchor. Shoal deals with missed anchors by making the corresponding leaders less likely to be selected in the future, using a reputation mechanism that assigns each validator a score based on the history of its recent activity. A validator that has been responsive and participating in the protocol is assigned a high score; otherwise, the validator is assigned a low score, since it may be crashed, slow, or malicious.

The idea is to deterministically re-compute the pre-defined mapping F from rounds to leaders every time the scores are updated, biasing towards leaders with higher scores. In order for validators to agree on the new mapping, they should agree on the scores, and thus on the history used to derive the scores.

In Shoal, pipelining and leader reputation can be naturally combined as they both utilize the same core technique of re-interpreting the DAG after agreeing on the first ordered anchor.

In fact, the only difference is that after ordering an anchor in round r, validators just have to compute a new mapping F’, starting from round r+1, based on the causal history of the ordered anchor in round r. Then, the validators start executing a new instance of Bullshark from round r+1 with the updated anchor selection function F’. See the illustration below:

Figure 5. The vertices corresponding to the leaders determined by F are marked by a transparent crown. The first instance of Bullshark orders an anchor in round 1, marked by a green checkmark. Then, a new mapping F’ is computed according to the information in the anchor’s causal history. The leaders determined by F’ are marked by a colorful crown.
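
A sketch of that combined step is below. The scoring rule (+1 for validators whose vertices appear in the ordered anchor’s causal history, -1 otherwise) and the rank-and-rotate assignment of leaders are illustrative assumptions; any deterministic function of the agreed history works.

```rust
use std::collections::HashMap;

/// Recompute the anchor-selection mapping F' for rounds r+1, r+2, ... after
/// the anchor of round r is ordered (hypothetical bookkeeping).
fn recompute_mapping(
    num_validators: usize,
    history_authors: &[usize], // authors of vertices in the ordered anchor's causal history
    scores: &mut [i64],        // running reputation scores, one per validator
    next_round: u64,           // r + 1
    horizon: u64,              // how many future rounds to (re)assign
) -> HashMap<u64, usize> {
    for v in 0..num_validators {
        if history_authors.contains(&v) {
            scores[v] += 1; // responsive and participating
        } else {
            scores[v] -= 1; // possibly crashed, slow, or malicious
        }
    }
    // Deterministically bias F' towards high-score validators, e.g. by
    // rotating through validators ranked by (score, index).
    let mut ranked: Vec<usize> = (0..num_validators).collect();
    ranked.sort_by_key(|&v| (std::cmp::Reverse(scores[v]), v));
    (0..horizon)
        .map(|i| (next_round + i, ranked[i as usize % num_validators]))
        .collect()
}
```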

No More Timeouts!

Timeouts play a crucial role in all leader-based deterministic partially synchronous BFT implementations. However, they introduce intricacies that increase the number of internal states to manage and observe. This added complexity complicates the debugging process and necessitates more involved observability techniques.

Timeouts also significantly increase latency, because configuring them appropriately is non-trivial, highly environment (network) dependent, and often requires dynamic adjustment. Before it can move on to the next leader, the protocol pays the full timeout latency penalty for a faulty leader. Therefore, timeouts cannot be set too conservatively, but if they are too short, the protocol may skip good leaders. For example, we observed that under high load, the leaders in Jolteon/Hotstuff are overwhelmed and timeouts expire before they can drive progress.

Unfortunately, leader-based protocols (like Hotstuff and Jolteon) inherently require timeouts to ensure progress whenever a leader is faulty. Without timeouts, even a crashed leader could stall the protocol forever. Since it is impossible to distinguish between a faulty leader and a slow leader during asynchronous periods, timeouts may cause validators to view-change through all the leaders without making any progress.

In Bullshark, timeouts are used in the DAG construction to ensure that in synchronous periods honest leaders add the anchors to the DAG fast enough for them to be ordered.

We observe that the DAG construction provides a “clock” that estimates the network speed. Without timeouts, the rounds keep advancing as long as n-f honest validators continue adding vertices to the DAG. While Bullshark might not order at network speed (due to faulty leaders), the DAG still grows at network speed despite some leaders being faulty or the network being asynchronous. Eventually, when a non-faulty leader is fast enough to broadcast the anchor, the entire causal history of the anchor will be ordered.
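
As a sketch of the contrast (the exact condition that Bullshark’s timeout-based construction uses is more involved; this only captures the distinction drawn above):

```rust
use std::time::{Duration, Instant};

/// Illustrative round-advancement check. `timeout = None` corresponds to the
/// timeout-free construction, where n - f vertices are the only "clock".
fn can_advance_round(
    vertices_in_round: usize,
    anchor_received: bool,
    round_started: Instant,
    timeout: Option<Duration>,
    n: usize,
    f: usize,
) -> bool {
    let have_quorum = vertices_in_round >= n - f;
    match timeout {
        // No timeouts: rounds advance at network speed as soon as a quorum of
        // vertices arrives, regardless of how slow any leader is.
        None => have_quorum,
        // With timeouts: additionally wait for the anchor (or for the timer to
        // expire), so that in synchronous periods anchors make it into the DAG
        // in time to be ordered.
        Some(t) => have_quorum && (anchor_received || round_started.elapsed() >= t),
    }
}
```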

In our evaluation, we compare Bullshark with and without timeouts across the following cases:

  • Fast leader, meaning faster than at least f other validators. In this case, both approaches provide the same latency as anchors are ordered and timeouts are not used.
  • Faulty leader. In this case, Bullshark with no timeouts provides much better latency as validators will skip its anchor immediately, whereas with timeouts validators will wait for their expiration before moving on.
  • Slow leader. This is the only case where Bullshark with timeouts will perform better. This is because, without timeouts, the anchor will likely be skipped as the leader will fail to broadcast it fast enough, whereas, with timeouts, validators will wait for the anchor.

In Shoal, avoiding timeouts and leader reputation go hand in hand. Instead of repeatedly waiting for the slow leaders and as a result increasing latency, the leader reputation mechanism excludes slow validators from being selected as leaders. This way, the system takes advantage of the fast validators to operate at network speed in all real-world scenarios.

Note that the FLP impossibility result implies that no deterministic consensus protocol can guarantee progress in a fully asynchronous network, so timeouts cannot be avoided altogether. Shoal does not circumvent this result, since there exists a theoretical adversarial schedule of events that can prevent all anchors from being ordered. Instead, Shoal falls back to timeouts after a configurable number of consecutive skipped anchors. In practice, such a scenario is extremely unlikely.
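
This fallback can be sketched with a simple counter (names are hypothetical; the threshold is the configurable parameter mentioned above):

```rust
/// Track consecutive skipped anchors and re-enable timeouts only past a
/// configurable threshold; ordering any anchor resets the counter.
struct TimeoutFallback {
    consecutive_skipped: u32,
    max_skipped_before_timeouts: u32, // configurable threshold
}

impl TimeoutFallback {
    fn on_anchor_decision(&mut self, anchor_was_ordered: bool) {
        if anchor_was_ordered {
            self.consecutive_skipped = 0; // back on the timeout-free path
        } else {
            self.consecutive_skipped += 1;
        }
    }

    fn timeouts_enabled(&self) -> bool {
        self.consecutive_skipped >= self.max_skipped_before_timeouts
    }
}
```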

Prevalent Responsiveness

The Hotstuff paper popularized the notion of optimistic responsiveness. Although not formally defined, intuitively it means that the protocol can operate at network speed in a good case that includes fast leaders and synchronous networks.

Shoal provides a strictly better property, which we call prevalent responsiveness. Specifically, compared to Hotstuff, Shoal continues to operate at network speed even if leaders fail for a configurable number of consecutive rounds or the network experiences asynchronous periods. See a more detailed comparison in the table below.

Note that prevalent responsiveness provides a strictly better progress guarantee during asynchronous periods and when leaders are faulty. During synchrony with slow leaders, the two properties are incomparable, as the outcome depends on how slow the leader is. However, with leader reputation, slow leaders should rarely appear in Shoal.

Evaluation

We implemented Bullshark and Shoal on top of Quorum Store, our Narwhal implementation. A detailed comparison between Shoal, Bullshark, and Jolteon can be found in the evaluation Section of the Shoal paper. Here we provide some highlights.

First, to demonstrate the power of the asynchronous DAG construction, we compare Bullshark with and without timeouts. The full Bullshark protocol assumes an asynchronous network but provides a fast-path mode, which requires timeouts in all rounds. We refer to this version as Vanilla Bullshark. We observed that for the stand-alone version under the partial synchrony assumption, the timeout in the voting round is not required. We refer to this version as Vanilla Bullshark w/o Vote timeout. Baseline Bullshark is the version without any timeouts.

The figures below compare the effect of timeouts on Bullshark’s latency with and without failures. Obviously, Baseline Bullshark (no timeouts) performs the best in the presence of failures. In the failure-free case, Baseline Bullshark is comparable to Vanilla Bullshark w/o Vote timeout. This is because, as discussed earlier, without a leader reputation mechanism, timeouts may have an advantage in the case of good but slow leaders.

Figure 6. Effect of timeouts on Bullshark latency without failures (left) and with failures (right). There are 50 validators in the failure case.

Next, we instantiate Shoal with Baseline Bullshark (no timeouts) and demonstrate the latency improvement of the pipelining and leader reputation mechanisms with and without failures. For completeness, we also compare it to Jolteon in the failure-free case.

Figure 7 below presents the failure-free case. While both pipelining and leader reputation individually reduce latency, combining them together achieves the best possible latency.

As for Jolteon, it fails to scale beyond 20 validators and achieves only about half of the Bullshark/Shoal throughput even though it runs on top of Quorum Store, which removes the data dissemination bottleneck.

The results show that Shoal drastically improves Bullshark’s latency. As for Jolteon, it is important to note that we measured consensus latency only. Since Jolteon cannot run natively on top of the DAG, decoupling data dissemination requires additional latency, which we do not measure. Thus, under high load, Shoal should match Jolteon’s end-to-end latency.

Figure 7. Throughput and latency in the failure-free case. Shoal PL and Shoal LR support only pipelining and leader reputation, respectively.

Figure 8 below shows the failure case with 50 validators, where the leader reputation mechanism significantly improves latency by reducing the likelihood that failed validators are selected as leaders. Note that, with 16 out of 50 validators failing, Shoal achieves 65% lower latency than Baseline Bullshark.

Figure 8. Latency in the failure case with 50 validators.

World-class engineering meets cutting-edge research at Aptos. For more details, please refer to the Shoal paper.
