Designing a Global Name Service - Research Papers | CS 6393, Papers of Cryptography and System Security

Material Type: Paper; Class: AT: Cyber Security; Subject: Computer Science; University: University of Texas - San Antonio; Term: Fall 2007;

Designing a Global Name Service¹

Butler W. Lampson²
Digital Equipment Corporation³

¹ This paper originated as an invited talk at the 1985 Conference on Principles of Distributed Computing, Minaki, Ontario. It was published in the proceedings of the 1986 Conference on Principles of Distributed Computing.
² This design was done jointly by the author, Andrew Birrell, Roger Needham and Michael Schroeder.
³ Author’s address: Systems Research Center, Digital Equipment Corporation, 130 Lytton Avenue, Palo Alto, CA 94301.

Abstract

A name service maps a name of an individual, organization or facility into a set of labeled properties, each of which is a string. It is the basis for resource location, mail addressing, and authentication in a distributed computing system. The global name service described here is meant to do this for billions of names distributed throughout the world. It addresses the problems of high availability, large size, continuing evolution, fault isolation and lack of global trust. The non-deterministic behavior of the service is specified rather precisely to allow a wide range of client and server implementations.

Introduction

There are already enough names.
One must know when to stop.
Knowing when to stop averts trouble.
    Tao Te Ching

The name service I am describing in this talk is intended to be the basis for resource location, mail addressing, and authentication in a distributed computing system. The system I have in mind is a large one, large enough to encompass all the computers in the world and all the people who use them. Of course, the amount of communication between most pairs of computers or people in such a system is small, just as the number of letters or telephone calls between most pairs of people is small. But we expect the postal system or the telephone system to handle such communication on demand, and we should expect the same from a computing system.

A name service maps a name for an entity (an individual, organization, or facility) into a set of labeled properties, each of which is a string. Typical properties are:

Password = XQE$#
Mailboxes = {Cabernet, Zinfandel}
network address = 173#4456#1655476653
distribution list = {Birrell, Needham, Schroeder}

Grapevine [1, 3] and the Xerox Clearinghouse [4] are examples of such a name service, and they are the basis for the present design. I exclude descriptive “names” from consideration, since I don’t know how to specify, much less implement, a service which maps predicates into strings and meets the other requirements of a global name service.

A name service is not a general database: the set of names changes slowly, and the properties of a given name also change slowly. Furthermore, the integrity constraints of a useful name service are much weaker than those of a database. Nor is it like a file directory system, which must create and look up names much faster than a name service, but need not be as large or as available. Either a database or a file system can be named by the service, though.

The name service has its own requirements:
• Large size, to handle an essentially arbitrary number of names and serve an arbitrary number of administrative organizations.
• Long life, during which many changes will occur in the organization of the name space and the components that implement the service.
• High availability, because the system can’t work when the name service is broken.
• Fault isolation, so that local failures don’t cause the entire service to fail.
• Tolerance of mistrust, since a large-scale service won’t have any component which is trusted by all the clients.
These requirements imply a hierarchical system; hierarchy is the fundamental method for accommodating growth and isolating faults.

In addition to the functional requirements, there is a need for a precise specification of how the service behaves, especially in the presence of faults. The designers devoted a good deal of effort to such a specification.

The system described here was designed by Andrew Birrell, Butler Lampson, Roger Needham, and Michael Schroeder. We talked extensively with Dave Oran and Tony Lauck. A toy implementation has been done by William Stoye, but no real one has yet been attempted.

The next section gives an overview of the name service, from the viewpoint first of a client and then of an administrator. Next is an explanation of the precise nature of the name space and the provisions for changing it, followed by informal but fairly precise specifications for both client and administrative levels of the service. Interesting [...]

An update to a directory makes the node at the end of a given path present or absent. The update is time-stamped, and a later time-stamp takes precedence over an earlier one with the same path. The subtleties of this scheme are discussed later; its purpose is to allow the tree to be updated concurrently from a number of places without any prior synchronization.

Access control is based on the notion of a principal, which is an entity that can be authenticated by its knowledge of some encryption key (which acts as its password). A principal is identified either by a full name or, in case the root of the full name is not trusted, by a relative name, a path through the directory tree starting at the target directory and using ‘..’ to denote the parent; for example, in the Finance directory the principal ANSI/DEC/SRC/Lampson can also be identified by the relative name ../SRC/Lampson.
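The relation between a relative name and a full name can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name resolve_relative is hypothetical, and names are treated as plain '/'-separated strings. In the design itself a relative name matters precisely because the root need not be trusted, so a real directory would not simply expand it this way.

```python
def resolve_relative(directory: str, relative: str) -> str:
    """Expand a relative principal name against the full name of a directory.

    Hypothetical helper for illustration only; '..' denotes the parent.
    """
    parts = directory.split("/")
    for label in relative.split("/"):
        if label == "..":
            if len(parts) <= 1:
                raise ValueError("relative name climbs above the root")
            parts.pop()          # step up to the parent directory
        else:
            parts.append(label)  # step down along the arc with this label
    return "/".join(parts)


# The example from the text: seen from the Finance directory under ANSI/DEC.
full = resolve_relative("ANSI/DEC/Finance", "../SRC/Lampson")
print(full)  # ANSI/DEC/SRC/Lampson
```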
Each directory has an access control function which maps a principal and a path into a set of rights drawn from {read, write, test}. Each of the operations provided by the name service requires the principal that invokes it to have certain rights to the nodes involved in the operation. For the convenience of the users, the access control function is defined by a set of triples (principal pattern, path pattern, rights); in the directory of figure 2 the triple (ANSI/DEC/*, Lampson/*, {read}) gives every principal starting with ANSI/DEC read rights to the subtree which is the value of Lampson. The triple (../*, Lampson/*, {read}) has the same effect, but the authentication of the principal must come from the parent directory.

Authentication is based on the use of encryption to provide a secure channel between the caller of an operation and its implementor. A directory has an authentication function af, which is a mapping from keys to principals; it accepts a message encrypted with key k as coming from principal af(k). Each directory has a few values for which af is defined by some external means (such as a courier). In particular, there is a secure channel for each parent-child link; the parent’s af maps this channel’s key to the child’s name, and the child’s af maps it to ‘..’.

The authentication function can be extended by a certificate, a message encrypted with key k’ which says, “Key k authenticates the principal whose name is N relative to me.” This allows af(k) to be defined as af(k’)/N. For example, suppose ANSI/DEC sends the SRC directory a certificate, over the secure channel from DEC to SRC, saying that k authenticates Finance/Wright; then SRC’s af can be extended with (k, ../Finance/Wright). A sequence of certificates can establish a secure channel between any two directories; the relative names to which the directories map the channel will depend on what other directories participated in setting it up.
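The rule af(k) = af(k’)/N can be made concrete with a small Python sketch. It models af as a dictionary from keys to relative names and ignores encryption entirely; the function extend_af and the key labels "k_parent" and "k" are assumptions made for this illustration, not part of the design.

```python
def extend_af(af: dict, cert_key: str, new_key: str, name: str) -> None:
    """Extend an authentication function with one certificate.

    The certificate arrives under cert_key (k' in the text) and says that
    new_key (k) authenticates `name` (N) relative to its sender.
    Hypothetical helper; real certificates are encrypted messages.
    """
    if cert_key not in af:
        raise KeyError("certificate key is not known to this directory")
    af[new_key] = af[cert_key] + "/" + name  # af(k) = af(k')/N


# SRC's af initially maps the key of the channel to its parent DEC to '..'.
src_af = {"k_parent": ".."}

# DEC certifies, over that channel, that k authenticates Finance/Wright.
extend_af(src_af, "k_parent", "k", "Finance/Wright")
print(src_af["k"])  # ../Finance/Wright
```

Chaining further calls, each using a key already present in the table, corresponds to the sequence of certificates mentioned in the text.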
The details of this scheme, together with arguments for its soundness, can be found in [2].

Administrative level

The client sees a single name service and is not concerned with the actual machines on which it is implemented or the replication of the database that makes it reliable. The administrator allocates resources to the implementation of the service and reconfigures it to deal with long term failures. Instead of a single directory, she sees a set of directory copies (DC), each one stored on a different server (S) machine. Figure 3 shows this situation for the directory DEC/SRC, which is stored on four servers named alpha, beta, gamma, and delta. A directory reference (DR) now includes not just the DI of the directory, but also a list of the servers that store its DCs. A lookup can try one or more of the servers to find a copy from which to read.

The copies are kept approximately but not exactly the same. The figure shows four updates to SRC, with timestamps 10, 11, 12 and 14. The copy on delta is current to time 12, as indicated by the italic 12 under it. This is called its lastSweep time. The others have different sets of updates, but all have lastSweep = 10. Each copy also has a nextTS value (not shown), the next time-stamp it will assign to a new update; this value can only increase.

An update originates at one DC, and is initially recorded there. The basic method for spreading updates to all the copies is a sweep operation, which visits every DC, collects a complete set of updates, and then writes this set back to every DC. The sweep has a time-stamp sweepTS. Before it reads from a DC it increases that DC’s nextTS to sweepTS; this ensures that the sweep collects all updates earlier than sweepTS. After it writes back to a DC, it sets that DC’s lastSweep to sweepTS. Figure 4 shows the state of SRC after a sweep at time 14.
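The sweep just described can be sketched as follows. This is a minimal in-memory model, assuming that all copies are reachable and that nothing fails partway through. The class DirectoryCopy, the function sweep, and the next_ts starting values are assumptions made for illustration (the figure does not show nextTS); the update sets and lastSweep values are those of figure 3.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryCopy:
    updates: set = field(default_factory=set)  # (name, time-stamp) pairs
    last_sweep: int = 0
    next_ts: int = 0


def sweep(copies: dict, sweep_ts: int) -> None:
    """Collect every update from every copy, then write the union back."""
    collected = set()
    for dc in copies.values():
        # nextTS can only increase, so the copy assigns no new update
        # a time-stamp earlier than sweep_ts after it has been read.
        dc.next_ts = max(dc.next_ts, sweep_ts)
        collected |= dc.updates
    for dc in copies.values():
        dc.updates |= collected
        dc.last_sweep = sweep_ts


# State of figure 3; the next_ts values are placeholders.
copies = {
    "alpha": DirectoryCopy({("Lampson", 10), ("Birrell", 12)}, 10, 13),
    "beta": DirectoryCopy({("Lampson", 10), ("Schroeder", 14)}, 10, 15),
    "gamma": DirectoryCopy({("Lampson", 10)}, 10, 11),
    "delta": DirectoryCopy(
        {("Lampson", 10), ("Needham", 11), ("Birrell", 12)}, 12, 13),
}

sweep(copies, 14)
for server, dc in copies.items():
    print(server, dc.last_sweep, sorted(dc.updates, key=lambda u: u[1]))
```

After the call every copy holds the same four updates and has lastSweep = 14, which is the state shown in figure 4.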
[Figure 3: Directory copies with different contents. The directory DEC/SRC is stored on servers alpha, beta, gamma and delta:
  alpha (lastSweep 10): Lampson 10, Birrell 12
  beta (lastSweep 10): Lampson 10, Schroeder 14
  gamma (lastSweep 10): Lampson 10
  delta (lastSweep 12): Lampson 10, Needham 11, Birrell 12
A message carrying the updates Birrell 12 and Needham 11 is on its way to beta.]

In order to speed up the spreading of updates, any DC may send some updates to any other DC in a message. Figure 3 shows the updates for Birrell and Needham being sent to server beta. I expect that most updates will be distributed in messages, but it is extremely difficult to make this method fully reliable. The sweep, on the other hand, is quite easy to implement reliably.

A sweep’s major problem is to obtain the set of DCs reliably. The set of servers in the DR stored in the parent is not suitable, because it is too difficult to ensure that the sweep gets a complete set if the directory’s parent or the set of DCs is changing during the sweep. Instead, all the DCs are linked into a ring, shown by the fat arrows in figure 5. Each arrow represents the name of the server to which it points. The sweep starts at any DC and follows the arrows; if it eventually reaches the starting point, then it has found a complete set of DCs. Of course, this operation need not be done sequentially; given a hint about the contents of the set, say from the parent DR, the sweep can visit all the DCs and [...]

[Figure 4: The directory of figure 3 after a sweep. All four copies (alpha, beta, gamma, delta) have lastSweep 14 and hold Lampson 10, Needham 11, Birrell 12, Schroeder 14.]

[Figure 5: The ring used for a sweep, linking the copies on alpha, beta, gamma and delta.]

[...] directory. Thus the DIs act as names in an imaginary super-root which has all the directories as its children; an FN behaves like a file system name relative to the super-root.
There are two obvious questions: how do users type FNs, and how can a directory be found from its DI among millions of directories scattered all over the world?

The first question has the same answer as it does in a file system: a user will have a working root, which is prefixed to any name she types. Many variations are possible. For example, as in Unix, the user could have a working root which is prefixed to typed names that start with / and a working directory which is prefixed to typed names that do not. Of course the user is also free to type an initial DI explicitly, or to define named links to various roots in his working directory.

The second question is more subtle. At any time, an instance of the name service has a single root, and there are data structures maintained by the administrative level that allow a copy of the root to be found from any server; these are discussed later. Taken without qualification, this means that only FNs beginning with the root’s DI can be looked up, which is fine when the root is created first and growth occurs at the leaves. To handle the growth by combination shown in figure 6, the root keeps an ersatz super-root, in the form of a table of well-known directories that maps certain DIs into links which are FNs relative to the root. Thus in the figure the well-known DIs in ANSI (shown in gray) are #311 and #552, the DIs for DEC and IBM. Now when a lookup reaches the root, it can consult the well-known table and replace the FN’s DI with a path that starts at the root itself. Thus #311 is replaced by #999/DEC, and hence #311/SRC/Lampson becomes #999/DEC/SRC/Lampson, which can be looked up starting at the ANSI root. When combining name services, it is prudent to make the old roots well-known in the new root, so that old names can still be looked up.

Restructuring

Sometimes what is wanted is not growth but restructuring. Suppose that DEC buys IBM.
The subtree rooted in the IBM directory should be moved under the DEC directory, as shown in figure 7. Moving a subtree is the only restructuring operation; as long as it doesn’t form a cycle (not allowed), it preserves the tree structure of the service. The obvious problem with this move is that all the names that begin ANSI/IBM no longer work, since IBM is no longer a child of ANSI. The solution is familiar from the telephone system: when a number changes, a call to the old number elicits the response, “The number you have reached has been changed. The new number is....” Similarly, an entry in ANSI for IBM, with the link ANSI/DEC/IBM as its value, gives names beginning ANSI/IBM the same meaning they had before the takeover.

Caching

Name lookup is not likely to be especially cheap. Indeed, if the servers that store the name or its parent directories are far away in the network, lookup may be quite expensive. Hence it is very desirable for a client to be able to cache the result of a lookup for a while, rather than repeating it every time the value is needed. Since it is impractical for the service to keep track of clients that are doing this and notify them when there is a change, caching must be paid for either by enforcing a slow rate of change on the naming database, or by tolerating some inaccuracy in the cached information. The latter requires no work from the service, but the former does.

The enforcement mechanism is an expiration time (TX) on entries in the database, and in particular on parent-child arcs in the directory tree and on links. The rule is: an arc or link may not be changed until its TX has expired, except that an arc may be deleted by a subtree move if it is replaced by a link to the moved subtree; e.g., see figure 7. With this restriction, the result of a directory lookup can be safely cached until the minimum TX of any arc or link that was followed.
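The caching rule amounts to taking a minimum over the arcs followed. The sketch below uses the expiration dates of figure 8; the table ARCS, keyed by (parent label, child label), and the function valid_until are assumptions made for illustration, and labels are taken to be unique only because the example is tiny.

```python
from datetime import date

# Expiration time (TX) of each parent-child arc, as in figure 8.
ARCS = {
    ("ANSI", "DEC"): date(1985, 9, 15),
    ("DEC", "SRC"): date(1985, 10, 1),
}


def valid_until(full_name: str) -> date:
    """Return the date until which a cached lookup result may be used."""
    labels = full_name.split("/")
    # The result is only as durable as the shortest-lived arc followed.
    return min(ARCS[(parent, child)]
               for parent, child in zip(labels, labels[1:]))


print(valid_until("ANSI/DEC/SRC"))  # 1985-09-15
```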
In figure 8, for example, the result of looking up ANSI/DEC/SRC is valid until 15 Sept 1985, which is the minimum of the two TX values encountered.

One important client for caching is the name service itself: directories are expected to cache their names from the root, so that a lookup which encounters a server storing the SRC directory need not find one that stores ANSI in order to look up ANSI/DEC/SRC. Without this mechanism, access to ANSI might well become a bottleneck.

[Figure 7: Restructuring a name space by moving the IBM node. Before and after the move the ANSI root (#999) has the well-known table #311 = #999/DEC, #552 = #999/IBM; after the move the ANSI entry for IBM is a link with value #999/DEC/IBM.]

[Figure 8: Using expiration times to validate caches. The arc from ANSI (#999) to DEC (#311) has TX 15 Sep 85 and the arc from DEC to SRC (#783) has TX 1 Oct 85. Hence #999/DEC = #311 is valid until 15 Sep 85, #311/SRC = #783 is valid until 1 Oct 85, and #999/DEC/SRC = #783 is valid until 15 Sep 85.]

Authentication is another client of caching, since “key authenticates principal” is the result of a name lookup.

The name service interface

The following table gives the procedures in the interface to the name service. I have included it to give a feeling for the complexity of the system when viewed from the outside; many programming details are missing. As you can see, the interface is based on remote procedure calls. It is organized according to four main abstractions: values and directories at the client level, and directory copies and servers at the administrative level. Procedures that create a directory have a server name argument; although this isn’t logically necessary, allowing the service to choose the server would be impractical.

A few types are used in the table as abbreviations. A path is a sequence of labels on arcs in the value tree. A TSpath is a sequence of (label, time-stamp) pairs. A value designator (VD) is a pair (full name, path) that designates a node in the value tree.
A tree is either a mark (present or absent) or a set of (label, tree) pairs; it is a representation of a value tree without the time-stamps. All these procedures can also return various errors, such as “value not present.”

Values and updates
  Snapshot      VD → (mark, TSpath)         give status and path
  DoUpdates     (VD, tree) → time-stamp      add the updates
  Enumerate     VD → set of labels           give all VD’s children
  GetValue      VD → tree                    give all of VD’s value
  SetValue      (VD, tree) → time-stamp      replace VD’s value

Directories
  FirstRoot     server address → DI
  NewRoot       (SN, FN, name) → DI          old root FN, called name in new root
  NewD          (SN, FN) → DI                named FN
  MoveSubtree   (VD, FN) → ()                give VD name FN

Directory copies
  NewDC         (SN, FN) → ()                copy of FN at SN
  RemoveDC      (SN, FN) → ()
  Sweep         (FN, SN) → time-stamp        start at SN
  NewEpoch      (DI, set of SN) → epoch
  Baptise       (FN, epoch, SN) → ()         add FN on SN to ring

[...]

This complicated definition serves to make the order of updates immaterial to the result. Why is this important? A value is determined by the sequence of update operations that have been applied to an initial empty value. An update can be thought of as a function that takes one value into another. Suppose the update functions have the following properties:
• Total: it always makes sense to apply an update function.
• Commutative: the order in which two updates are applied does not affect the result.
• Idempotent: applying the same update twice has the same effect as applying it once.

Then it follows that the set of updates that have been applied uniquely defines the state of the value. It can be shown that the updates on values defined earlier are total, commutative and idempotent. Hence a set of updates uniquely defines a value. This observation is the basis of the concurrency control scheme, as explained in the next subsection.

The right side of figure 9 gives one set of updates that will produce the value on the left.
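The three properties can be illustrated with a deliberately simplified model, which is not the update definition of this paper: each update is a (name, time-stamp, mark) triple on a flat set of names, and the entry with the later time-stamp wins. The functions apply_update and apply_all and the sample updates are assumptions made for this sketch; ties on the time-stamp are broken by comparing the marks so that the outcome stays deterministic.

```python
from itertools import permutations


def apply_update(value: dict, update: tuple) -> dict:
    """Apply one (name, time-stamp, mark) update; the later time-stamp wins."""
    name, ts, mark = update
    new = dict(value)  # total: defined for every value and every update
    current = new.get(name)
    if current is None or (ts, mark) > current:
        new[name] = (ts, mark)
    return new


def apply_all(updates) -> dict:
    value = {}  # the initial empty value
    for u in updates:
        value = apply_update(value, u)
    return value


# Illustrative updates only; they are not taken from figure 9.
updates = [
    ("Lampson", 10, "present"),
    ("Needham", 11, "present"),
    ("Birrell", 12, "present"),
    ("Birrell", 13, "absent"),
]

# Commutative: every ordering of the updates yields the same value.
results = [apply_all(p) for p in permutations(updates)]
print(all(r == results[0] for r in results))

# Idempotent: applying each update a second time changes nothing.
print(apply_all(updates + updates) == apply_all(updates))
```

Because the final state depends only on the set of updates, two copies that have received the same updates in different orders agree, which is the property the sweep relies on.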
The presence of the time-stamps in p ensures that the update is modifying the value that the client intended. This is significant when two clients concurrently try to create the same name. The two updates will have different time-stamps, and the earlier one will lose. The fact that later modifications, e.g. to set the password, include the creation time-stamp ensures that those made by the earlier client will also lose. Without the time-stamps in p there would be no way to tell them apart, and the final value might be a mixture of the two sets of updates.

Directories (D)

The main post-condition for the value of a directory depends on the property of value updates established above. It makes precise the notion that the result of a name lookup depends on which updates have reached the directory copy being read.

(D1) A read operation (Snapshot, Enumerate, or GetValue) on a directory d with identifier d.di returns a result determined by the state of d after some set S of updates which is a subset of the updates in DB(d.di). S includes:
• All the updates with a time-stamp less than d.lastSweep, the time of the last completed sweep of d.
• An arbitrary subset of the updates with a time-stamp greater than d.lastSweep.

This is a fairly weak post-condition, since S is chosen non-deterministically; one might well ask why it isn’t stronger. The reason is that it is sufficient for the needs of a name service client, allows both read and update operations even if only one copy of the directory is accessible, and admits of a simple and robust implementation.

In addition to this condition, there are the obvious post-conditions on the update operations: that they modify DB appropriately.

The other directory predicates have to do with the tree structure. The next one gives a condition under which looking up a FN is guaranteed to succeed.
(D2) Looking up the FN di/n_1/.../n_k yields a directory if for the entire duration of the lookup operation
• di is the root, and each n_j is defined in the directory di/n_1/.../n_(j-1) and always yields the directory reference dr_j (in spite of the non-determinism of (D1)), or
• di is in the root’s well-known table with value di’/n_1’/.../n_l’, and di’/n_1’/.../n_l’/n_1/.../n_k satisfies the conditions of (D2).

Of course a lookup can also succeed if some of the prefixes yield links, or if the directory structure is changing during the lookup, but this is the fundamental rule.

There are two invariants to ensure that the tree structure of the directories remains well-formed. They are based on the observation that a collection of nodes forms a tree if every node (except for one called the root) has a single back-reference (BR) to another node, provided the back-references form no cycles. The BRs are the child-parent arcs in this tree. We therefore take the BRs as the primary structure defining the tree, and view the DRs as secondary. Figure 10 illustrates. Thus MoveSubtree simply changes the BR, and then adjusts the DR to agree.

(D3) The D’s defined in DB form a tree rooted in root whose arcs are DRs that are the reverse of the BR backpointers.

(D4) Each DR is pointed to by a BR with a longer TX.

[Figure 10: Back-references in the directory tree. The tree has root ANSI (#999) with children DEC (#311) and IBM (#552); SRC (#783) is a child of DEC and TJW (#935) a child of IBM. The root's well-known table is #311 = #999/DEC, #552 = #999/IBM.]

Note that because of (D1) the tree you see by doing lookups can shift around if the BRs change during the lookups, or faster than sweeps can keep up.

Actually (D3) is a bit oversimplified. Since MoveSubtree cannot be atomic, each directory has a set of BRs; MoveSubtree adds the new parent to the set, adjusts the DR, and finally removes the old parent. (D3) should say that there is some BR in each set which satisfies the predicate above.
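The observation behind (D3), that single back-references without cycles define a tree, can be checked mechanically. The sketch below is an illustration under simplifying assumptions: each directory has exactly one BR (not the set of BRs needed during MoveSubtree), directories are identified by their labels, and the function name is_tree is hypothetical.

```python
def is_tree(back_refs: dict, root: str) -> bool:
    """Check that single back-references define a tree rooted at `root`.

    back_refs maps each non-root node to its parent, so every node has
    one BR by construction; what remains is to rule out cycles and
    dangling parents.
    """
    if root in back_refs:
        return False  # the root must have no back-reference
    for node in back_refs:
        seen = set()
        current = node
        while current != root:
            if current in seen or current not in back_refs:
                return False  # a cycle, or a parent that is not a node
            seen.add(current)
            current = back_refs[current]
    return True


# The directory tree of figure 10.
brs = {"DEC": "ANSI", "IBM": "ANSI", "SRC": "DEC", "TJW": "IBM"}
print(is_tree(brs, "ANSI"))

# Moving IBM under DEC, as in figure 7, changes one BR and keeps a tree.
moved = dict(brs, IBM="DEC")
print(is_tree(moved, "ANSI"))

# Making DEC a child of its own descendant SRC would form a cycle.
cyclic = dict(brs, DEC="SRC")
print(is_tree(cyclic, "ANSI"))
```

The last case is the kind of move that the restructuring rules forbid.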
In addition to these invariants, there are the obvious post-conditions on the procedures that modify the tree: NewRoot establishes a new value for root, and so forth.

Directory copies (DC)

The actual implementation of values and directories is based on directory copies stored in servers, as described earlier. The main invariant relates the contents of the copies to the value of DB.

(DC1) The database value of a directory, DB(di), is equal to the union of the updates in all the DCs for di, and each DC has at least all the updates with time-stamps earlier than its lastSweep.

From this it is easy to deduce that reading from any copy satisfies (D1). It is also not hard to show that the sweep operation, which increases lastSweep, maintains this invariant.

There are no new invariants for the tree. The intended implementation of procedures that change the tree is somewhat subtle, however. Since these procedures all involve changes at more than one server, they cannot be implemented atomically. Instead, each procedure makes an atomic change at one server, and then a cleanup procedure propagates the consequences implied by this change to the other servers involved. If some server crashes during this process, the cleanup procedure restarts. It is constructed in such a way that it maintains the invariants, and when it finally completes successfully it has completed the tree-changing operation, if that is possible. The D invariants take account of the intermediate states during cleanup by allowing the back-reference to be a set, as the previous subsection points out.

There are obvious post-conditions for the procedures that add and subtract directory copies, and for the Sweep procedure (it leaves all the lastSweep values at least as late as the time-stamp that it returns).

Two invariants govern the epoch mechanism.

(DC2) There is always at least one complete ring in the set of DCs for a directory.

(DC3) There are never two complete non-intersecting rings.
The latter condition is essential to ensure that a sweep cannot complete without seeing all the updates. However, when an administrator invokes NewEpoch to construct a new ring,