Monday, December 18, 2017

An overview of some existing distributed systems (of the time)


Existing distributed systems which make use of multicast communication can be categorised as using either one—to—many communication (one client interacting with a replica group) or many—to—many communication (replica groups interacting with replica groups). We shall now look at a number of the popular protocols and show how they have approached the problems of maintaining replica consistency to both external and internal events.


One-to-Many Communication 

The V System

The V System [Cheriton 84][Cheriton 85makes use of what they term one-to-many communication transactions, where the start of a transaction marks the end of the previous transaction (any outstanding messages associated with the old transaction are destroyed without being processed). Each group member has a finite message buffer into which replies are queued provided they arrive before the start of the next send transaction. Although a reliable communication protocol is not used, the assumption is made that by making use of retransmissions is it possible to ensure (with a high probability) that messages are delivered if the sender and receiver remain operational. A further assumption is that at least one operational group member receives and replies to a message.

The designers of the V System recognized the possibility of message loss due to message buffer overflow, and their solution was to make use of larger buffers and to modify the kernel to introduce a random delay when replying to a group send, thereby reducing the number of messages in a queue at any time. This imposes a consistent overhead on all group replies but still does not ensure that buffer overflow will never occur (if the number of different groups interacting with each other increases then even with this delay it is still probable that the finite sized message queues will overflow).

When servers reply to a client they do so on a many-to--one basis i.e., all servers which received a request from a particular client will attempt to reply to that client (and not to any replica group the client may be a member of). This means that there is the possibility of members of the same group taking different actions as a result of local events such as timeouts (resulting in state divergence), because the servers replied to some, but not all, clients on time.

The Andrew System 

The Andrew System [Satyanarayanan 90] assumes that the communication system is reliable in much the same manner as the V System i.e., given a sufficient number of retries then any operational server can always be contacted. Communication is one-to-many and the termination conditions for a particular client-server interaction is set on a per call basis (depending on the application). The designers recognized that calls between remote processes can take an indeterminate amount of time and so if a client times out it sends an "are you alive" message to any server replica which has not responded and if a reply is returned then it continues to wait. However, because servers reply on a many-to-one basis, state divergence between replicated clients can still occur: if only one client receives a reply sent before all servers fail, then the replicated clients may take different actions.

It was recognized that messages might be lost by processes because of message buffer overflow, but in Andrew it is assumed that such loses will be detected by retries. This implicitly relies on the servers keeping the last transmitted replies to every client, but could still result in state divergence if, for example, a server group is seen to have failed by one replicated client when it retries a request, when in fact all other members of the client group received the reply before the failure. 

Many-to-Many Communication 

The Circus System

The Circus System [Cooper 84b] (and Multi-Functions [Banatre 86b] in Gothic [Banatre 86a]) makes use of many-to-many communication, with both client and server groups. As with the V System they assume that with a sufficient number of transmission retries it is possible to deliver messages to all operational members of a group. Thus, if a call eventually times out (after retrying a sufficient number of times) a sender can assume that the receiver(s) has failed. Each member of a group has a message queue which is assumed to be infinite in size (if this assumption cannot be made then the designers of Circus assume that retransmissions of requests and replies can account for losses due to buffer overflow).


Throughout the description of Circus the assumption is made that the members of a group are detenninistic (given the same set of messages in the same order and the same starting state then they will all arrive at the same next state). Client groups' members operate as though they were not replicated and do not know that they are members of a replica group. No mention is made of how this determinism of group members extends to asynchronous local events such as timeouts.

Despite the assumptions made by the designers, both the problem of message buffer overflow, and local timeouts, can occur in the Circus system, leading to replica state divergence.  

No comments: