[0001] This application is related to and claims priority to U.S. provisional application entitled CLUSTERED FILESYSTEM having Ser. No. 60/296,046, by Bannister et al., filed Jun. 5, 2001 and incorporated by reference herein. This application is a continuation-in-part of the U.S. application entitled CLUSTERED FILESYSTEM having Ser. No. 10/162,258, filed Jun. 5, 2002 and incorporated by reference herein.
[0002] 1. Field of the Invention
[0003] The present invention is related to data storage, and more particularly to a system and method for creating a copy of data during operation of a computing system.
[0004] 2. Description of the Related Art
[0005] A storage area network (SAN) provides direct, high-speed physical connections, e.g., Fibre Channel connections, between multiple hosts and disk storage. The emergence of SAN technology offers the potential for multiple computer systems to have high-speed access to shared data. However, the software technologies that enable true data sharing are mostly in their infancy. While SANs offer the benefits of consolidated storage and a high-speed data network, existing systems do not share that data as easily and quickly as directly connected storage. Data sharing is typically accomplished using a network filesystem such as Network File System (NFS™ by Sun Microsystems, Inc. of Santa Clara, Calif.) or by manually copying files using file transfer protocol (FTP), a cumbersome and unacceptably slow process.
[0006] The challenges faced by a distributed SAN filesystem are different from those faced by a traditional network filesystem. For a network filesystem, all transactions are mediated and controlled by a file server. While the same approach could be transferred to a SAN using much the same protocols, that would fail to eliminate the fundamental limitations of the file server or take advantage of the true benefits of a SAN. The file server is often a bottleneck hindering performance and is always a single point of failure. The design challenges faced by a shared SAN filesystem are more akin to the challenges of traditional filesystem design combined with those of high-availability systems.
[0007] Traditional filesystems have evolved over many years to optimize the performance of the underlying disk pool. Data concerning the state of the filesystem (metadata) is typically cached in the host system's memory to speed access to the filesystem. This caching—essential to filesystem performance—is the reason why systems cannot simply share data stored in traditional filesystems. If multiple systems assume they have control of the filesystem and cache filesystem metadata, they will quickly corrupt the filesystem by, for instance, allocating the same disk space to multiple files. On the other hand, implementing a filesystem that does not allow data caching would provide unacceptably slow access to all nodes in a cluster.
[0008] Systems or software for connecting multiple computer systems or nodes in a cluster to access data storage devices connected by a SAN have become available from several companies. EMC Corporation of Hopkinton, Mass. offers HighRoad filesystem software for their Celerra™ Data Access in Real Time (DART) file server. Veritas Software of Mountain View, Calif. offers SANPoint which provides simultaneous access to storage for multiple servers with failover and clustering logic for load balancing and recovery. Sistina Software of Minneapolis, Minn. has a similar clustered filesystem called Global File System™ (GFS). Advanced Digital Information Corporation of Redmond, Wash. has several SAN products, including CentraVision for sharing files across a SAN. As a result of mergers in the last few years, Hewlett-Packard Company of Palo Alto, Calif. has more than one cluster operating system offered by their Compaq Computer Corporation subsidiary which use the Cluster File System developed by Digital Equipment Corporation in their TruCluster and OpenVMS Cluster products. However, none of these products are known to provide direct read and write over a Fibre Channel by any node in a cluster. What is desired is a method of accessing data within a SAN which provides true data sharing by allowing all SAN-attached systems direct access to the same filesystem. Furthermore, conventional hierarchical storage management uses an industry standard interface called data migration application programming interface (DMAPI). However, if there are five machines, each accessing the same file, there will be five separate events and there is nothing tying those DMAPI events together.
[0009] It is an aspect of the present invention to create a point-in-time image of a filesystem without interruption, using minimal storage.
[0010] It is another aspect of the present invention to allow point-in-time backups of a filesystem while the base filesystem is still being used.
[0011] It is a further aspect of the present invention to keep low overhead “versions” of a filesystem online.
[0012] It is yet another aspect of the present invention to provide a recovery mechanism in the event of data loss.
[0013] It is a still further aspect of the present invention to create archive or backup volumes that are readable and writable.
[0014] At least one of the above aspects can be attained by a method of maintaining a copy of at least one data volume in a computer system for at least one point in time, including establishing a first repository for a first snapshot of a base volume; and prior to a write operation to a first region of the base volume, copying the first region of the base volume to the first repository. Preferably, the copying is performed only for regions in the base volume for which write operations are detected and further, only if the first region was not previously written to the first repository. Preferably, data is read from a second region of the first snapshot by determining whether the second region has changed in the base volume, reading the second region from the first repository if the second region has changed; and reading the second region from the base volume if the second region has not changed.
[0015] The method may also include establishing a second repository for a second snapshot of the base volume at a point in time later than the first repository was established and, prior to writing to a third region of the base volume after the second repository was established, copying the third region of the base volume to the second repository. Under these circumstances, data is preferably read from a fourth region of the second snapshot by determining whether the fourth region has changed in the base volume since establishing the second repository, reading the fourth region from the second repository if the fourth region has changed since establishing the second repository, and reading the fourth region from the base volume if the fourth region has not changed since establishing the second repository.
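The copy-on-write behavior summarized in the two preceding paragraphs can be illustrated with a minimal sketch. It is not the disclosed implementation: the class `BaseVolume`, its method names, and the use of in-memory dictionaries as repositories are all hypothetical, and the sketch assumes, as a simplification, that every repository lacking a region receives its own copy before the write proceeds.

```python
class BaseVolume:
    """Toy base volume with copy-on-write snapshots (illustrative only)."""

    def __init__(self, regions):
        self.regions = list(regions)   # current contents, one entry per region
        self.repositories = []         # one dict per snapshot, oldest first

    def create_snapshot(self):
        """Establish an empty repository and return its snapshot index."""
        self.repositories.append({})
        return len(self.repositories) - 1

    def write(self, region, data):
        """Preserve the old contents in each repository lacking them, then write."""
        for repo in self.repositories:
            if region not in repo:     # copy only on the first write since the snapshot
                repo[region] = self.regions[region]
        self.regions[region] = data

    def read_snapshot(self, snap, region):
        """Read from the repository if the region changed, else from the base."""
        repo = self.repositories[snap]
        return repo[region] if region in repo else self.regions[region]
```

In this sketch a repository stays empty until the base volume is written, so an unchanged region costs no snapshot storage, and a region already present in a repository is never copied a second time.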
[0016] These together with other aspects and advantages which will be subsequently apparent, reside in the details of construction and operation as more fully hereinafter described and claimed, reference being had to the accompanying drawings forming a part hereof, wherein like numerals refer to like parts throughout.
[0036] Following are several terms used herein that are in common use in describing filesystems or SANs, or are unique to the disclosed system. Several of the terms will be defined more thoroughly below.
[0037] bag: indefinitely sized container object for tagged data
[0038] behavior chain: vnode points to head, elements are inode, and vnode operations
[0039] cfs or CXFS: cluster filesystem (CXFS is from Silicon Graphics, Inc.)
[0040] chandle: client handle: barrier lock, state information and an object pointer
[0041] CMS: cell membership services
[0042] CORPSE: common object recovery for server endurance
[0043] dcvn: filesystem specific components for vnode in client, i.e., inode
[0044] DMAPI: data migration application programming interface
[0045] DNS: distributed name service, such as SGI's white pages
[0046] dsvn: cfs specific components for vnode in server, i.e., inode
[0047] heartbeat: network message indicating a node's presence on a LAN
[0048] HSM: hierarchical storage management
[0049] inode: filesystem specific information, i.e., metadata
[0050] KORE: kernel object relocation engine
[0051] manifest: bag including object handle and pointer for each data structure
[0052] quiesce: render quiescent, i.e., temporarily inactive or disabled
[0053] RPC: remote procedure call
[0054] token: an object having states used to control access to data and metadata
[0055] vfs: virtual filesystem representing the filesystem itself
[0056] vnode: virtual inode to manipulate files without filesystem details
[0057] XVM: volume manager for CXFS
[0058] In addition, there are three types of input/output operations that can be performed in a system according to the present invention: buffered I/O, direct I/O and memory mapped I/O. Buffered I/O operations are read and write operations via system calls where the source or result of the I/O operation can be system memory on the machine executing the I/O, while direct I/O operations are read and write operations via system calls where the data is transferred directly between the storage device and the application program's memory without being copied through system memory.
[0059] Memory mapped I/O are read and write operations performed by page fault. The application program makes a system call to memory map a range of a file. Subsequent read memory accesses to the memory returned by this system call cause the memory to be filled with data from the file. Write accesses to the memory cause the data to be stored in the file. Memory mapped I/O uses the same system memory as buffered I/O to cache parts of the file.
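As a small illustration of memory mapped I/O on an ordinary local file (not on the cluster filesystem described here), the following sketch uses Python's standard `mmap` module. The helper name `mmap_roundtrip` is hypothetical; the sketch stores data through a mapping and then reads the file back with a buffered read to show that the stores reached the file.

```python
import mmap
import os
import tempfile


def mmap_roundtrip(payload: bytes) -> bytes:
    """Write through a memory mapping, then read the file back with buffered I/O."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * len(payload))   # the file must already span the mapped range
        with mmap.mmap(fd, len(payload)) as view:
            view[:] = payload                # stores to the mapped memory land in the file
            view.flush()                     # push modified pages out to the file
        os.close(fd)
        fd = None
        with open(path, "rb") as f:          # ordinary buffered read
            return f.read()
    finally:
        if fd is not None:
            os.close(fd)
        os.remove(path)
```

The payload must be non-empty, because a mapping needs a non-zero length.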
[0060] A SAN layer model is illustrated in
[0061] Layer
[0062] The real promise of SANs, however, lies in layer
[0063] In practice, this means that on most SANs, storage is still partitioned between various systems. SAN managers may be able to quickly reassign storage to another system in the face of a failure and to more flexibly manage their total available storage, but independent systems cannot simultaneously access the same data residing in the same filesystems.
[0064] Shared, high-speed data access is critical for applications where large data sets are the norm. In fields as diverse as satellite data acquisition and processing, CAD/CAM, and seismic data analysis, it is common for files to be copied from a central repository over the LAN to a local system for processing and then copied back. This wasteful and inefficient process can be completely avoided when all systems can access data directly over a SAN.
[0065] Shared access is also crucial for clustered computing. Access controls and management are more stringent than with network filesystems to ensure data integrity. In most existing high-availability clusters, storage and applications are partitioned and another server assumes any failed server's storage and workload. While this may prevent denial of service in case of a failure, load balancing is difficult and system and storage bandwidth is often wasted. In high-performance computing clusters, where workload is split between multiple systems, typically only one system has direct data access. The other cluster members are hampered by slower data access using network filesystems such as NFS.
[0066] In a preferred embodiment, the SAN includes hierarchical storage management (HSM) such as data migration facility (DMF) by Silicon Graphics, Inc. (SGI) of Mountain View, Calif. The primary purpose of HSM is to preserve the economic value of storage media and stored data. The high input/output bandwidth of conventional machine environments is sufficient to overrun online disk resources. HSM transparently solves storage management issues, such as managing private tape libraries, making archive decisions, and journaling the storage so that data can be retrieved at a later date.
[0067] Preferably, a volume manager, such as XVM from SGI supports the cluster environment by providing an image of storage devices across all nodes in a cluster and allowing for administration of the devices from any cell in the cluster. Disks within a cluster can be assigned dynamically to the entire cluster or to individual nodes within the cluster. In one embodiment, disk volumes are constructed using XVM to provide disk striping, mirroring, concatenation and advanced recovery features. Low-level mechanisms for sharing disk volumes between systems are provided, making defined disk volumes visible across multiple systems. XVM is used to combine a large number of disks across multiple Fibre Channels into high transaction rate, high bandwidth, and highly reliable configurations. Due to its scalability, XVM provides an excellent complement to CXFS and SANs. XVM is designed to handle mass storage growth and can configure millions of terabytes (exabytes) of storage in one or more filesystems across thousands of disks.
[0068] An example of a cluster computing system formed of heterogeneous computer systems or nodes is illustrated in
[0069] Other kinds of storage devices besides disk drives
[0070] In a conventional SAN, the disks are partitioned for access by only a single node per partition and data is transferred via the LAN. On the other hand, if node
[0071] In the preferred embodiment, the cluster filesystem is a layer that distributes input/output directly between the disks and the nodes via Fibre Channel
[0072] Preferably, the underlying layer uses a directory structure based on B-trees, which allow the cluster filesystem to maintain good response times, even as the number of files in a directory grows to tens or hundreds of thousands of files. The cluster filesystem adds a coordination layer to the underlying filesystem layer. Existing filesystems defined in the underlying layer can be migrated to a cluster filesystem according to the present invention without necessitating a dump and restore (as long as the storage can be attached to the SAN). For example, in the IRIX nodes
[0073] In the cluster filesystem of the preferred embodiment, one of the nodes, e.g., IRIX node
[0074] As illustrated in
[0075] Token Infrastructure
[0076] The tokens operated on by the token client
[0077] Certain types of write operations may be performed simultaneously by more than one client, in which case the shared write level is used. An example is maintaining the timestamps for a file. To reduce overhead, when reading or writing a file, multiple clients can hold the shared write level and each update the timestamps locally. If a client needs to read the timestamp, it obtains the read level of the token. This causes all the copies of the shared write token to be returned to the metadata server
[0078] Acquiring a token puts a reference count on the token, and prevents it from being removed from the token client. If the token is not already present in the token client, the token server is asked for it. This is sometimes also referred to as obtaining or holding a token. Releasing a token removes a reference count on a token and potentially allows it to be returned to the token server. Recalling or revoking a token is the act of asking a token client to give a token back to the token server. This is usually triggered by a request for a conflicting level of the token.
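The acquire, release and recall terminology above can be sketched as follows. This is a toy model, not the CXFS token client: the class `TokenClient`, the `fetch_from_server` callable and the `returned` list are hypothetical names, and the sketch assumes a recalled token is handed back as soon as its reference count reaches zero.

```python
class TokenClient:
    """Toy client-side token cache with reference counts (illustrative only)."""

    def __init__(self, fetch_from_server):
        self._fetch = fetch_from_server   # hypothetical callable that asks the token server
        self._refs = {}                   # token name -> reference count
        self._recall_pending = set()      # tokens the server has asked to get back
        self.returned = []                # tokens handed back to the server, in order

    def acquire(self, token):
        if token not in self._refs:       # not cached locally: ask the server for it
            self._fetch(token)
            self._refs[token] = 0
        self._refs[token] += 1            # a referenced token cannot be removed

    def release(self, token):
        self._refs[token] -= 1
        if self._refs[token] == 0 and token in self._recall_pending:
            self._give_back(token)

    def recall(self, token):
        """Server asks for the token: return it now if idle, else once released."""
        if token not in self._refs:
            return
        if self._refs[token] == 0:
            self._give_back(token)
        else:
            self._recall_pending.add(token)

    def _give_back(self, token):
        del self._refs[token]
        self._recall_pending.discard(token)
        self.returned.append(token)
```

A token that is released without a pending recall stays cached in the client, which is what lets a later acquire proceed without a message to the server.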
[0079] When a client needs to ask the server to make a modification to a file, it will frequently have a cached copy of a token at a level which will conflict with the level of the token the server will need to modify the file. In order to minimize network traffic, the client ‘lends’ its read copy of the token to the server for the duration of the operation, which prevents the server from having to recall it. The token is given back to the client at the end of the operation.
[0080] Following is a list of tokens in an exemplary embodiment:
[0081] DVN_EXIST is the existence token. It represents the fact that a client has references to the vnode. Each client which has a copy of the inode has the read level of this token and keeps it until it is done with the inode. The client does not acquire and release this token around operations; it just keeps it in the token client. The server keeps one reference to the vnode (which keeps it in memory) for each client which has an existence token. When the token is returned, this reference count is dropped. If someone unlinks the file, which means it no longer has a name, the server will conditionally recall all the existence tokens. A conditional recall means the client is allowed to refuse to send the token back. In this case the clients will send back all the tokens and state they have for the vnode if no application is currently using it. Once all the existence tokens are returned, the reference count on the server's vnode drops to zero, and this results in the file being removed from the filesystem.
[0082] DVN_IOEXCL is the I/O exclusive token. The read token is obtained by any client making read or write calls on the vnode. The token is held across read and write operations on the file. The state protected by this token is what is known as the I/O exclusive state. This state is cached on all the clients holding the token. If the state is true then the client knows it is the only client performing read/write operations on the file. The server keeps track of when only one copy of the token has been granted to a client, and before it will allow a second copy to be given out, it sends a message to the first client informing it that the I/O exclusive state has changed from true to false. A client that has an I/O exclusive state of true is allowed to cache changes to the file more aggressively than otherwise.
[0083] DVN_IO is the I/O token which is used to synchronize between read and write calls on different computers. CXFS enforces a rule that buffered reads are atomic with respect to buffered writes, and writes are atomic with respect to other writes. This means that a buffered read operation happens before or after a write, never during a write. Buffered read operations hold the read level of the token; buffered writes hold the write level of the token. Direct reads and writes hold the read level of the token.
[0084] DVN_PAGE_DIRTY represents the right to hold modified file data in memory on a system.
[0085] DVN_PAGE_CLEAN represents the right to hold unmodified file data in memory on a computer. Combinations of levels of DVN_PAGE_DIRTY and DVN_PAGE_CLEAN are used to maintain cache coherency across the cluster.
[0086] DVN_NAME is the name token. A client with this token in the token client for a directory is allowed to cache the results of lookup operations within the directory. So if we have a name we are looking up in a directory, and we have done the same lookup before, the token allows us to avoid sending the lookup to the server. An operation such as removing, renaming, or creating a file in a directory will obtain the write level of the token on the server and recall the read token from the clients holding it, invalidating any cached names for that directory on those clients.
[0087] DVN_ATTR protects fields such as the ownership information, the extended attributes of the file, and other small pieces of information. Held by the client for read, and by the server for write when the server is making modifications. Recall of the read token causes the invalidation of the extended attribute cache.
[0088] DVN_TIMES protects timestamp fields on the file. It is held at the read level by hosts that are looking at timestamps, at the shared write level by hosts doing read and write operations, and at the write level on the server when setting timestamps to an explicit value. Recall of the shared write token causes the client to send back its modified timestamps; the server uses the largest of the returned values as the true value of the timestamp.
[0089] DVN_SIZE protects the size of the file, and the number of disk blocks in use by the file. Held for read by a client who wants to look at the size, or for write by a client who has a true
[0090] DVN_EXTENT protects the metadata which indicates where the data blocks for a file are on disk, known as the extent information. When a client needs to perform a read or write operation it obtains the read level of the token and gets a copy of the extent information with it. Any modification of the extent information is performed on the server and is protected by the write level of the token. A client which needs space allocated in the file will lend its read token to the server for this operation.
[0091] DVN_DMAPI protects the DMAPI event mask. Held at the read level during
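The server-side bookkeeping described for the DVN_IOEXCL token in the list above can be sketched as below. The class `IoExclServer` and its `notifications` list are hypothetical; the sketch assumes a simple list of holders and records, rather than sends, the message telling the first client that its I/O exclusive state has changed from true to false.

```python
class IoExclServer:
    """Toy tracker of the I/O exclusive state for one file (illustrative only)."""

    def __init__(self):
        self.holders = []         # clients holding the read level of the token
        self.notifications = []   # (client, new_state) messages that would be sent

    def grant(self, client):
        """Grant the read level and return the I/O exclusive state the client caches."""
        if client not in self.holders:
            if len(self.holders) == 1:
                # Before giving out a second copy, tell the sole holder
                # that it is no longer the only client doing I/O.
                self.notifications.append((self.holders[0], False))
            self.holders.append(client)
        return len(self.holders) == 1
```

Only the transition from one holder to two generates a message in this sketch, since later holders already start with a state of false.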
[0092] Data coherency is preferably maintained between the nodes in a cluster which are sharing access to a file by using combinations of the DVN_PAGE_DIRTY and DVN_PAGE_CLEAN tokens for the different forms of input/output. Buffered and memory mapped read operations hold the DVN_PAGE_CLEAN_READ token, while buffered and memory mapped write operations hold the DVN_PAGE_CLEAN_WRITE and DVN_PAGE_DIRTY_WRITE tokens. Direct read operations hold the DVN_PAGE_CLEAN_SHARED_WRITE token and direct write operations hold the DVN_PAGE_CLEAN_SHARED_WRITE and DVN_PAGE_DIRTY_SHARED_WRITE tokens. Obtaining these tokens causes other nodes in the cluster which hold conflicting levels of the tokens to return their tokens. Before the tokens are returned, these client nodes perform actions on their cache of file contents. On returning the DVN_PAGE_DIRTY_WRITE token a client node must first flush any modified data for the file out to disk and then discard it from cache. On returning the DVN_PAGE_CLEAN_WRITE token a client node must first flush any modified data out to disk. If both of these tokens are being returned then both the flush and discard operations are performed. On returning the DVN_PAGE_CLEAN_READ token to the server, a client node must first discard any cached data for the file it has in system memory.
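The cache actions a client performs before returning these tokens can be restated as a small lookup. This is only a restatement of the rules in the preceding paragraph under hypothetical names (`ACTIONS`, `actions_on_return`); the action strings "flush" and "discard" stand in for the real cache operations.

```python
# Cache actions required before each token is returned, per the paragraph above.
ACTIONS = {
    "DVN_PAGE_DIRTY_WRITE": ("flush", "discard"),
    "DVN_PAGE_CLEAN_WRITE": ("flush",),
    "DVN_PAGE_CLEAN_READ": ("discard",),
}


def actions_on_return(tokens):
    """Return the ordered cache actions needed before handing back the given tokens."""
    needed = {action for token in tokens for action in ACTIONS.get(token, ())}
    # Modified data is always flushed to disk before anything is discarded.
    return [action for action in ("flush", "discard") if action in needed]
```

For example, returning both write tokens together yields one flush followed by one discard rather than repeating either action.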
[0093] An illustration to aid in understanding how tokens are requested and returned is provided in
[0094] If metadata client
[0095] Appropriate control of the tokens for each file by metadata server
[0096] Mounting of a filesystem as a metadata server is arbitrated by a distributed name service (DNS), such as “white pages” from SGI. A DNS server runs on one of the nodes, e.g., node
[0097] Hierarchical Storage Management
[0098] In addition to caching data that is being used by a node, in the preferred embodiment hierarchical storage management (HSM), such as the data migration facility (DMF) from SGI, is used to move data to and from tertiary storage, particularly data that is infrequently used. As illustrated in
[0099] Flowcharts of the operations performed when client node
[0100] As illustrated in
[0101] The possible DMAPI events are read, write and truncate. When a read event is queued, the DMAPI server informs the HSM software to ensure that data is available on disks. If necessary, the file requested to be read is transferred from tape to disk. If a write event is set, the HSM software is informed that the tape copy will need to be replaced or updated with the contents written to disk. Similarly, if a truncate event is set, the appropriate change in file size is performed, e.g., by writing the file to disk, adjusting the file size and copying to tape.
[0102] Upon completion of the DMAPI event, a reply is forwarded
[0103] Maintaining System Availability
[0104] In addition to high-speed disk access obtained by caching data and shared access to disk drives via a SAN, it is desirable to have high availability of the cluster. This is not easily accomplished with so much data being cached and multiple nodes sharing access to the same data. Several mechanisms are used to increase the availability of the cluster as a whole in the event of failure of one or more of the components or even an entire node, including a metadata server node.
[0105] One aspect of the present invention that increases the availability of data is the mirroring of data volumes in mass storage
[0106] The volume manager may have several servers which operate independently, but are preferably chosen using the same logic. A node is selected from the nodes that have been in the cluster membership the longest and are capable of hosting the server. From that pool of nodes the lowest numbered node is chosen. The volume manager servers are chosen at cluster initialization time or when a server failure occurs. In an exemplary embodiment, there are four volume manager servers, termed boot, config, mirror and pal.
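The selection rule in the preceding paragraph can be sketched as a short function. The `Node` type, its `joined_at` field (a hypothetical membership epoch where a smaller value means the node joined earlier) and the function name `choose_server` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    number: int
    joined_at: int    # hypothetical membership epoch; smaller means joined earlier
    can_host: bool    # whether this node is capable of hosting the server


def choose_server(nodes):
    """Pick the lowest numbered node among the longest-standing capable members."""
    capable = [n for n in nodes if n.can_host]
    if not capable:
        return None
    oldest = min(n.joined_at for n in capable)
    pool = [n for n in capable if n.joined_at == oldest]
    return min(pool, key=lambda n: n.number)
```

Because the rule depends only on membership data every node already has, each node can evaluate it independently and arrive at the same choice.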
[0107] The volume manager exchanges configuration information at cluster initialization time. The boot server receives configuration information from all client nodes. Some of the client nodes could have different connectivity to disks and thus, could have different configurations. The boot server merges the configurations and distributes changes to each client node using a volume manager multicast facility. This facility preferably ensures that updates are made on all nodes in the cluster or none of the nodes using two-phase commit logic. After cluster initialization it is the config server that coordinates changes. The mirror server maintains the mirror specific state information about whether a revive is needed and which mirror legs are consistent.
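The all-or-none update described above can be illustrated with a generic two-phase commit sketch. It does not reproduce the volume manager multicast facility; `Participant`, `multicast_update` and the `accepts` flag are hypothetical, and failures during the commit phase are ignored for brevity.

```python
class Participant:
    """Toy cluster node that stages a configuration change before applying it."""

    def __init__(self, name, accepts=True):
        self.name = name
        self.accepts = accepts    # whether this node votes yes in the prepare phase
        self.config = {}
        self._staged = None

    def prepare(self, change):
        if self.accepts:
            self._staged = dict(change)
        return self.accepts

    def commit(self):
        self.config.update(self._staged)
        self._staged = None

    def abort(self):
        self._staged = None


def multicast_update(participants, change):
    """Apply the change on every participant, or on none of them."""
    prepared = []
    for p in participants:                 # phase 1: every node must vote yes
        if p.prepare(change):
            prepared.append(p)
        else:
            for q in prepared:
                q.abort()
            return False
    for p in participants:                 # phase 2: all voted yes, so commit everywhere
        p.commit()
    return True
```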
[0108] In a cluster system according to the present invention, all data volumes and their mirrors in mass storage
[0109] If a client node, e.g., node
[0110] When a mirror revive is in progress, the mirror master coordinates input/output to the mirror. The mirror revive process uses an overlap queue to hold I/O requests from client nodes made during the mirror revive process. Prior to beginning to read from an intact leg of the mirror, the mirror revive process ensures that all other input/output activity to the range of addresses is complete. Any input/output requests made to the address range being revived are refused by the mirror master until all the data in that range of addresses has been written by the mirror revive process.
[0111] If there is an I/O request for data in an area that is currently being copied in reconstructing the mirror, the data access is retried after a predetermined time interval without informing the application process which requested the data access. When the mirror master node
[0112] Input/output access to the mirror continues during the mirror revive process with the volume manager process keeping track of the first unsynchronized block of data to avoid unnecessary communication between client and server. The client node receives the revive status and can check to see if it has an I/O request preceding the area being synchronized. If the I/O request precedes that area, the I/O request will be processed as if there was no mirror revive in progress.
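One way to picture how a client might classify an I/O request during a mirror revive is sketched below. The function `route_io`, its parameters and the three labels are hypothetical; the sketch assumes the region being revived is the half-open block range from `revive_start` (the first unsynchronized block) to `revive_end`, and that requests beyond that range are coordinated by the mirror master.

```python
def route_io(start, length, revive_start, revive_end):
    """Classify an I/O request against the range currently being revived."""
    end = start + length
    if end <= revive_start:
        return "normal"       # entirely precedes the unsynchronized area
    if start < revive_end:
        return "retry"        # overlaps the range being copied; retry after a delay
    return "via_master"       # not yet revived; coordinated by the mirror master
```

Under these assumptions, a request that merely touches the first block of the range being revived is retried rather than processed, which errs on the side of consistency.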
[0113] Data read from unreconstructed portions of the mirror by applications is preferably written to the copy being reconstructed, to avoid an additional read at a later time. The mirror revive process keeps track of what blocks have been written in this manner. New data written by applications to the portions of the mirror that have already been copied by the mirror revive process is mirrored using conventional mirroring. If an interior mirror is present, it is placed in writeback mode. When the outer revive causes reads to the interior mirror, it will automatically write to all legs of the interior mirror, thus synchronizing the interior mirror at the same time.
[0114] Snapshot Copying of Volumes
[0115] In addition to maintaining consistency of mirrors, it is desirable to be able to create a “snapshot” of a data volume, i.e., a copy of the data volume contents at a point in time. Such capability is desirable for both clusters and stand-alone systems. A block diagram of a filesystem and a repository which will hold the snapshot of the data volume is illustrated in
[0116] Once the repository volume
[0117] Once the snapshot has been initiated, the volume manager, such as XVM, monitors all write operations to the original or base filesystem. When a write to the base volume
[0118] In an embodiment using XFS, the function F_SETLKW (set file lock) is preferably used to coordinate the copying of data to repository
[0119] One purpose of a snapshot is to retain the contents of data at a point in time. When used for this purpose, write operations are not permitted to a repository for a snapshot volume, except to preserve the contents of the original or base volume
[0120] When a snapshot volume is retained over a long period of time relative to the rate of change of the base volume
[0121] Multiple snapshots may be initiated for the same base volume
[0122] In the preferred embodiment, when reading
[0123] Examples of reading from multiple snapshot volumes
[0124] In an alternative embodiment, writing directly to a snapshot volume is permitted, e.g., where the snapshot volume contains an earlier version of software that has been patched and the snapshot volume is maintained with the patches included. In this case, operations are performed to ensure that an older snapshot of the base volume has a copy of the region being written. Examples are illustrated in
[0125] Base volume
[0126] Recovery and Relocation
[0127] In the preferred embodiment, a common object recovery protocol (CORPSE) is used for server endurance. As illustrated in
[0128] As illustrated in
[0129] The node in the leader state
[0130] If a node is deactivated in an orderly fashion, the node sends a withdrawal request to the other nodes in the cluster, causing one of the nodes to transition to the leader state
[0131] In the stable state
[0132] Upon notification of a node failure, the CMS daemon blocks
[0133] In the preferred embodiment, CMS includes support for nodes to operate under different versions of the operating system, so that it is not necessary to upgrade all of the nodes at once. Instead, a rolling upgrade is used in which a node is withdrawn from the cluster, the new software is installed and the node is added back to the cluster. The time period between upgrades may be fairly long, if the people responsible for operating the cluster want to gain some experience using the new software.
[0134] Version tags and levels are preferably registered by the various subsystems to indicate version levels for various functions within the subsystem. These tags and levels are transmitted from follower nodes to the CMS leader node during the membership protocol
[0135] Upon initiation
[0136] When the CMS daemon receives acknowledgment that the credentials have been flushed, common object recovery is initiated
[0137] After all of the messages from the failed node have been processed, CORPSE recovers the system in three passes starting with the lowest layer (cluster infrastructure) and ending with the filesystem. In the first pass, recovery of the kernel object relocation engine (KORE) is executed
[0138] As illustrated in
[0139] The next step to be performed includes detargeting a chandle. A chandle or client handle is a combination of a barrier lock, some state information and an object pointer that is partially subsystem specific. A chandle includes a node identifier for where the metadata server can be found and a field that the subsystem defines which tells the chandle how to locate the metadata server on that node, e.g., using a hash address or an actual memory address on the node. Also stored in the chandle is a service identifier indicating whether the chandle is part of the filesystem, vnode file, or distributed name service and a multi-reader barrier lock that protects all of this. When a node wants to send a message to a metadata server, it acquires a hold on the multi-reader barrier lock and once that takes hold the service information is decoded to determine where to send the message and the message is created with the pointer to the object to be executed once the message reaches the metadata server.
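The fields of a chandle listed above can be pictured with the following data structure sketch. All names (`Service`, `Chandle`, `build_message`, `barrier_raised`) are hypothetical, the multi-reader barrier lock is reduced to a reader count plus a flag with no real synchronization, and the message is a plain dictionary.

```python
from dataclasses import dataclass
from enum import Enum


class Service(Enum):
    FILESYSTEM = 1
    VNODE_FILE = 2
    NAME_SERVICE = 3


@dataclass
class Chandle:
    server_node: int              # node where the metadata server can be found
    locator: int                  # subsystem-defined value, e.g. a hash or memory address
    service: Service              # which service this chandle belongs to
    object_ref: object = None     # pointer to the object on the server
    readers: int = 0              # senders currently holding the barrier lock
    barrier_raised: bool = False  # set while the chandle is detargeted for recovery

    def build_message(self, payload):
        """Take a hold on the barrier, decode the target, and build a message."""
        if self.barrier_raised:
            raise RuntimeError("chandle is detargeted; message must wait")
        self.readers += 1
        try:
            return {
                "to_node": self.server_node,
                "service": self.service.name,
                "locator": self.locator,
                "object": self.object_ref,
                "payload": payload,
            }
        finally:
            self.readers -= 1
```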
[0140] With messages interrupted and create locks held, celldown callouts are performed
[0141] The CORPSE subsystems executing on the metadata clients go through all of the objects involved in recovery and determine whether the server for that client object is in the membership for the cluster. One way of making this determination is to examine the service value in the chandle for that client object, where the service value contains a subsystem identifier and a server node identifier. Object handles which identify the subsystems and subsystem specific recovery data necessary to carry out further callouts are placed in the manifest. Server nodes recover from client failure during celldown callouts by returning failed client tokens and purging any state associated with the client.
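The membership check on the client side can be sketched as follows. This is a hedged illustration, not the described implementation: ServiceValue, build_manifest and the dictionary layout of a client object are hypothetical, and the sketch assumes that only objects whose server node is missing from the membership are placed in the manifest.

```python
from collections import namedtuple

# Hypothetical layout of the service value: subsystem id plus server node id.
ServiceValue = namedtuple("ServiceValue", ["subsystem_id", "server_node"])


def build_manifest(client_objects, membership):
    """Collect handles for client objects whose server left the membership."""
    manifest = []
    for obj in client_objects:
        svc = obj["service"]
        if svc.server_node not in membership:
            # Record what later callouts need: subsystem, handle, recovery data.
            manifest.append({
                "subsystem_id": svc.subsystem_id,
                "handle": obj["handle"],
                "recovery_data": obj.get("recovery_data"),
            })
    return manifest
```

With a membership of nodes {1, 2, 3}, an object served by node 5 would be entered in the manifest, while an object served by node 2 would be skipped.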
[0142] When celldown callouts have been performed
[0143] After the celldown callouts
[0144] When all of the nodes are notified of the results of the election, gather callouts are performed
[0145] The reconstruct callouts
[0146] In the case of the second pass, WP/XVM
[0147] After all of the retargets are performed
[0148] Kernel Object Relocation Engine
[0149] As noted above, the first pass
[0150] As illustrated in
[0151] In response, the source metadata server initiates
[0152] The stages of the relocation process are illustrated in FIGS.
[0153] Interruptible Token Acquisition
[0154] Preferably interruptible token acquisition is used to enable recovery and relocation in several ways: (1) threads processing messages from failed nodes that are waiting for the token state to stabilize are sent an interrupt so that they terminate and allow recovery to begin; (2) threads processing messages from failed nodes which may have initiated a token recall and are waiting for the tokens to come back are interrupted; (3) threads that are attempting to lend tokens which are waiting for the token state to stabilize and are blocking recovery/relocation are interrupted; and (4) threads that are waiting for the token state to stabilize in a filesystem that has been forced offline due to error are interrupted early. Threads waiting for the token state to stabilize first call a function to determine whether they are allowed to wait, i.e., none of the factors above apply, and then go to sleep until some other thread signals a change in token state.
[0155] To interrupt, CORPSE and KORE each wake all sleeping threads. These threads loop, check whether the token state has changed and, if not, attempt to go back to sleep. This time, one of the factors above may apply, and if so, the thread discovering it returns immediately with an “early” status. This tells the upper level token code to stop trying to acquire, lend, etc. and to return immediately with whatever partial results are available. This requires processes calling token functions to be prepared for partial results. In the token acquisition case, the calling process must be prepared to not get the token(s) requested and to be unable to perform the intended operation. In the token recall case, this means the thread will have to leave the token server data structure in a partially recalled state. This transitory state is exited when the last of the recalls comes in, and the thread returning the last recalled token clears the state. In lending cases, the thread will return early, potentially without all tokens desired for lending.
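The wait loop and the "early" return can be illustrated with a user-space sketch built on a condition variable. It is an assumption-laden analogy rather than the kernel code: TokenState, may_wait, wait_for_stable, interrupt, set_stable and acquire are hypothetical names, and a single interrupted flag stands in for the four factors listed above.

```python
import threading


class TokenState:
    def __init__(self):
        self.cond = threading.Condition()
        self.stable = False
        self.interrupted = False  # set by recovery/relocation (assumed flag)

    def may_wait(self):
        # Stand-in for the "allowed to wait" check described in the text.
        return not self.interrupted

    def wait_for_stable(self):
        with self.cond:
            while not self.stable:
                if not self.may_wait():
                    return "early"   # give up; caller handles partial results
                self.cond.wait()     # sleep until another thread signals
            return "ok"

    def interrupt(self):
        # Recovery or relocation wakes every sleeper so it rechecks may_wait().
        with self.cond:
            self.interrupted = True
            self.cond.notify_all()

    def set_stable(self):
        with self.cond:
            self.stable = True
            self.cond.notify_all()


def acquire(state, wanted):
    # Upper-level code must accept that it may get none of the tokens.
    status = state.wait_for_stable()
    if status == "early":
        return status, []
    return status, list(wanted)
```

A caller of acquire checks the status first and abandons the intended operation when it is "early", mirroring the requirement that processes calling token functions be prepared for partial results.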
[0156] The many features and advantages of the invention are apparent from the detailed specification and, thus, it is intended by the appended claims to cover all such features and advantages of the invention that fall within the true spirit and scope of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation illustrated and described, and accordingly all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.