This application relates to and claims priority from Japanese Patent Application No. 2007-85792, filed on Mar. 28, 2007 and Japanese Patent Application No. 2006-293485, filed on Oct. 30, 2006, the entire disclosure of which is incorporated herein by reference.
The present invention relates to a storage system comprising a plurality of storage areas, and a host computer coupled to the storage system.
Generally, an information system is equipped with a storage apparatus that uses an HDD (hard disk drive) as a storage device, and a storage system including this storage apparatus is accessed from a plurality of host systems (hosts, for example) via a storage area network (SAN: Storage Area Network). Generally, with a storage apparatus, a high-reliability method according to RAID (Redundant Array of Independent (or Inexpensive) Disks) technology is adopted to provide reliability to the storage apparatus beyond the reliability of a stand-alone HDD. Nevertheless, pursuant to the advancement of information society in recent years, the availability (service continuity) of information systems depending on reliability based on RAID is becoming inadequate.
Japanese Patent Laid-Open Publication No. H7-244597 (Patent Document 1) describes high-availability technology to deal with the foregoing situation. This technology prepares a production site and a backup site respectively including a host computer (hereinafter abbreviated as a “host”) and a storage apparatus, and mirrors data stored in the storage apparatus of the production site to the storage apparatus of the backup site. If the storage apparatus of the production site fails and shuts down, application processing that was suspended as a result of such storage apparatus failure is resumed using the storage apparatus and the host of the backup site. This technology is generally referred to as remote copy or remote mirroring.
With the technology of Patent Document 1, since the application is resumed with a different host when a storage apparatus fails and shuts down, re-boot processing of the application is required. Needless to say, there will be a problem concerning availability since the application will not be able to perform its normal operation from the time such application is suspended until the re-boot is complete.
Thus, an object of the present invention is to improve the availability of an information system including a storage system that performs remote copy between two or more storage apparatuses, and a host that uses this storage system.
In order to achieve the foregoing object, the present invention provides an information system comprising a computer as a host system, a first storage apparatus coupled to the computer and including a first primary volume and a first secondary volume, and a second storage apparatus coupled to the first storage apparatus and the computer and including a second primary volume and a second secondary volume. The first and second storage apparatuses execute remote copy of copying data written into the first primary volume from the computer to the second primary volume. At least one of the first and second storage apparatuses executes local copy of copying the data written into the first or second primary volume in a self-storage apparatus to the corresponding first or second secondary volume. The computer switches the destination of a write request of the data from the first storage apparatus to the second storage apparatus in case of a failure occurring in the first storage apparatus.
The present invention also provides a data transfer method in an information system comprising a computer as a host system, a first storage apparatus coupled to the computer and including a first primary volume and a first secondary volume, and a second storage apparatus coupled to the first storage apparatus and the computer and including a second primary volume and a second secondary volume. The information system further comprises a third storage apparatus coupled to the first storage apparatus and including a third volume. The data transfer method comprises a first step of the first and second storage apparatuses executing remote copy of copying data written into the first primary volume to the second primary volume, only one of the first and second storage apparatuses copying data stored in the first or second secondary volume to the third volume, and the third storage apparatus creating a snapshot constituted as a replication of the third volume, and a second step of switching the destination of a write request of the data from the first storage apparatus to the second storage apparatus in case of a failure occurring in the first storage apparatus.
The present invention further provides an information system comprising a computer as a host system, a first storage apparatus coupled to the computer and including a first primary volume and a first secondary volume, and a second storage apparatus coupled to the first storage apparatus and the computer and including a second primary volume and a second secondary volume. The first and second storage apparatuses execute remote copy of copying data written into the first primary volume from the computer to the second primary volume. At least one of the first and second storage apparatuses saves pre-updated data of the first or second primary volume updated following a creation command of a logical snapshot in the first or second secondary volume. The computer switches the destination of a write request of the data from the first storage apparatus to the second storage apparatus in case of a failure occurring in the first storage apparatus.
The present invention additionally provides a data transfer method in an information system comprising a computer as a host system, a first storage apparatus coupled to the computer, and a second storage apparatus coupled to the first storage apparatus and the computer. The first storage apparatus includes a first primary volume and a first secondary volume. The second storage apparatus includes a second primary volume and a second secondary volume. This data transfer method comprises a first step of the first and second storage apparatuses executing remote copy of copying data written into the first primary volume from the computer to the second primary volume, and at least one of the first and second storage apparatuses saving pre-updated data of the first or second primary volume updated following a creation command of a logical snapshot in the first or second secondary volume, and a second step of the computer switching the destination of a write request of the data from the first storage apparatus to the second storage apparatus in case of a failure occurring in the first storage apparatus.
According to the present invention, it is possible to improve the availability of an information system including a storage system that performs remote copy between two or more storage apparatuses, and a host that uses this storage system.
FIG. 1 is a block diagram showing an example of the hardware constitution of an information system according to a first embodiment of the present invention;
FIG. 2 is a first conceptual diagram showing the overview of a first embodiment of the present invention;
FIG. 3 is a second conceptual diagram showing the overview of the first embodiment;
FIG. 4 is a third conceptual diagram showing the overview of the first embodiment;
FIG. 5 is a conceptual diagram representing the software constitution in a host;
FIG. 6 is a block diagram representing the software constitution in a virtual storage apparatus and a storage apparatus;
FIG. 7 is a conceptual diagram representing the pair status of remote copy and the transition of pair status;
FIG. 8 is a conceptual diagram showing a device relation table to be managed by an I/O path manager;
FIG. 9 is a flowchart showing the flow when the I/O path manager performs initialization processing;
FIG. 10 is a flowchart showing the flow when the I/O path manager performs write processing;
FIG. 11 is a flowchart showing the flow when the I/O path manager performs read processing;
FIG. 12 is a conceptual diagram showing the overview of a second embodiment of the present invention;
FIG. 13 is a conceptual diagram showing the overview of a third embodiment of the present invention;
FIG. 14 is a conceptual diagram showing the overview of a fourth embodiment of the present invention;
FIG. 15 is a conceptual diagram showing the overview of a fifth embodiment of the present invention;
FIG. 16 is a conceptual diagram showing the overview of a sixth embodiment of the present invention;
FIG. 17 is a conceptual diagram showing the overview of a seventh embodiment of the present invention;
FIG. 18 is a conceptual diagram showing the overview of an eighth embodiment of the present invention;
FIG. 19 is a flowchart explaining another read/write processing method in the first embodiment;
FIG. 20 is a flowchart explaining another read/write processing method in the first embodiment;
FIG. 21 is a flowchart explaining another read/write processing method in the first embodiment;
FIG. 22 is a conceptual diagram showing local copy pair information in a ninth embodiment of the present invention;
FIG. 23 is a conceptual diagram explaining local copy pair information;
FIG. 24 is a flowchart explaining the write processing to be performed by a primary virtual storage apparatus in the ninth embodiment;
FIG. 25 is a flowchart explaining the remote copy processing in the ninth embodiment;
FIG. 26 is a flowchart explaining the write processing to be performed by a secondary virtual storage apparatus in the ninth embodiment;
FIG. 27 is a flowchart explaining the local copy processing in the ninth embodiment;
FIG. 28 is a flowchart explaining the background copy processing in the ninth embodiment;
FIG. 29 is a flowchart explaining the pair operation processing in the ninth embodiment;
FIG. 30 is a flowchart explaining the destaging processing in the ninth embodiment;
FIG. 31 is a flowchart explaining the write processing to be performed by the secondary virtual storage apparatus during a failure in the ninth embodiment;
FIG. 32 is a conceptual diagram showing the overview of a tenth embodiment of the present invention;
FIG. 33 is a diagram showing the write processing to be performed by the secondary virtual storage apparatus in the tenth embodiment;
FIG. 34 is a flowchart explaining the background copy processing in the tenth embodiment;
FIG. 35 is a flowchart explaining the pair operation processing in the tenth embodiment;
FIG. 36 is a flowchart explaining the local copy processing in the tenth embodiment;
FIG. 37 is a flowchart explaining the destaging processing in the tenth embodiment;
FIG. 38 is a conceptual diagram showing the overview of an eleventh embodiment of the present invention;
FIG. 39 is a conceptual diagram explaining a virtual address/real address mapping table;
FIG. 40 is a diagram showing the local copy processing (Copy-On-Write mode) in the eleventh embodiment;
FIG. 41 is a flowchart explaining the background copy processing (Copy-On-Write mode) in the eleventh embodiment;
FIG. 42 is a flowchart explaining the local copy processing (Copy-After-Write mode) in the eleventh embodiment;
FIG. 43 is a flowchart explaining the background copy processing (Copy-After-Write mode) in the eleventh embodiment;
FIG. 44 is a conceptual diagram showing the overview of a twelfth embodiment of the present invention;
FIG. 45 is a flowchart explaining the background copy processing (Copy-On-Write mode) in the twelfth embodiment; and
FIG. 46 is a flowchart explaining the background copy processing (Copy-After-Write mode) in the twelfth embodiment.
Embodiments of the present invention are now explained with reference to the attached drawings.
FIG. 1 is a diagram showing an example of the hardware constitution (configuration) of an information system according to an embodiment of the present invention.
The information system, for example, comprises a storage apparatus 1500 , a host computer (hereafter abbreviated as a “host”) 1100 , a management host 1200 , and two or more virtual storage apparatuses 1000 . A plurality of storage apparatuses 1500 , host computers (hereafter abbreviated as the “hosts”) 1100 , and management hosts 1200 may be provided, respectively. The virtual storage apparatus 1000 and the host 1100 are mutually connected via an I/O network 1300 . The virtual storage apparatus 1000 and the storage apparatus 1500 and the management host 1200 are mutually connected via a management network (not shown) or the I/O network 1300 .
The host 1100 has a host internal network 1104 , and coupled to this network 1104 are a processor (abbreviated as Proc in the diagrams) 1101 , a memory (abbreviated as Mem in the diagrams) 1102 , and an I/O port (abbreviated as I/O P in the diagrams) 1103 . The management host 1200 may also have the same hardware constitution as the host 1100 . Incidentally, an expansion card for adding an I/O port to the host 1100 is sometimes referred to as an HBA (Host Bus Adapter).
The management host 1200 has a display device, and this display device is able to display a screen for managing the virtual storage apparatus 1000 and the storage apparatus 1500 . Further, the management host 1200 is able to receive a management operation request from a user (for instance, an operator of the management host 1200 ), and send the received management operation request to the virtual storage apparatus 1000 and the storage apparatus 1500 . The management operation request is a request for operating the virtual storage apparatus 1000 and the storage apparatus 1500 , and, for example, there are a parity group creation request, an internal LU (Logical Unit) creation request, a path definition request, and operations related to a virtualization function.
Connection via Fibre Channel is foremost considered as the I/O network 1300 , but in addition thereto, FICON (FIbre CONnection: registered trademark), a combination of Ethernet (registered trademark), TCP/IP (Transmission Control Protocol/Internet Protocol) and iSCSI (internet SCSI (Small Computer System Interface)), and a combination of Ethernet (registered trademark) and a network file system such as NFS (Network File System) or CIFS (Common Internet File System) may also be considered. Further, the I/O network 1300 may also be other than the above so long as it is a communication device capable of transferring I/O requests. Further, the network that connects the virtual storage apparatus 1000 and the storage apparatus 1500 is also the same as the I/O network 1300 .
The virtual storage apparatus 1000 comprises a controller (indicated as CTL in the diagrams) 1010 , a cache memory (indicated as CM in the diagrams) 1020 , and a plurality of HDDs 1030 . As a preferred embodiment, the controller 1010 and the cache memory 1020 are respectively constituted of a plurality of components. The reason for this is because even if a failure occurs in a single component and such component is blocked, the remaining components can be used to continue receiving I/O requests as represented by read and write requests.
The controller 1010 is an apparatus (a circuit board, for example) for controlling the operation of the virtual storage apparatus 1000 . The controller 1010 has an internal network 1017 , and coupled to this internal network 1017 are an I/O port 1013 , a cache port (abbreviated as CP in the diagrams) 1015 , a management port (abbreviated as MP in the diagrams) 1016 , a back-end port (abbreviated as B/E P in the diagrams) 1014 , a processor (a CPU (Central Processing Unit), for instance) 1011 , and a memory 1012 . The controllers 1010 and the cache memories 1020 are mutually connected via a storage internal network 1050 . Further, the controller 1010 and the respective HDDs 1030 are mutually connected via a plurality of back-end networks 1040 .
The hardware constitution of the storage apparatus 1500 is constituted of similar components as those of the virtual storage apparatus 1000 . Incidentally, when the virtual storage apparatus 1000 is a dedicated device or switch for virtualization without an HDD, the storage apparatus 1500 does not need to be constituted of similar components as those of the virtual storage apparatus 1000 . Further, the internal network of the host 1100 and the virtual storage apparatus 1000 is preferably of a broader bandwidth than the transfer bandwidth of the I/O port 1013 , and all or a part thereof may be substituted with a bus or switch-type network. Further, in FIG. 1, although only one I/O port 1013 is provided to the controller 1010 , in reality, a plurality of I/O ports 1013 may exist in the controller 1010 .
According to the foregoing hardware constitution, the host 1100 will be able to read or write all or a part of the data stored in the HDD of the virtual storage apparatus 1000 and the storage apparatus 1500 . Incidentally, in the ensuing explanation, the system handling the storage of data is referred to as a storage cluster. Further, the storage cluster realizes high availability by including two subsystems, and each such subsystem, which includes the virtual storage apparatus 1000 and/or the storage apparatus 1500 , is referred to as a storage subsystem.
In this embodiment, in order to improve the availability of a storage system including the virtual storage apparatus 1000 having a virtualization function for virtualizing a storage area such as a volume in another storage apparatus, a redundant constitution using another virtual storage apparatus 1000 is adopted. FIG. 2 is a diagram showing an overview of such a duplex constitution.
In this overview, the storage system includes a virtual storage apparatus 1000 L, a virtual storage apparatus 1000 R, a storage apparatus 1500 L, and a storage apparatus 1500 R. Incidentally, in order to simplify the following explanation, let it be assumed that the virtual storage apparatus 1000 L and the storage apparatus 1500 L serve as a primary system (production system), and the virtual storage apparatus 1000 R and the storage apparatus 1500 R serve as a secondary system (backup system). Nevertheless, when the virtual storage apparatuses 1000 L, 1000 R each provide two or more volumes to the host 1100 , instead of handling the primary system/secondary system in virtual storage apparatus units, it suffices to define, in volume units, which of the virtual storage apparatuses 1000 L, 1000 R serves as the primary system.
The respective virtual storage apparatuses 1000 L, 1000 R provide partial or all areas of a parity group (configured based on RAID technology) with its own HDD 1030 as the constituent element as a volume 3000 LA and a volume 3000 RA to the host 1100 (corresponds to the portion in which ‘A’ is indicated in a cylinder in FIG. 2). Further, the virtual storage apparatus 1000 is also able to optionally provide, based on the virtualization function, virtual volumes 3000 LB, 3000 RB (volumes in which the nonvolatile storage areas of the corresponding HDD or the like exist outside the virtual storage apparatuses 1000 L, 1000 R). In this overview, a part or all of the volumes 3500 LB, 3500 RB provided by the storage apparatuses 1500 L, 1500 R are used as the corresponding nonvolatile storage areas. Incidentally, reference to “data of a volume” in the following explanation includes, in addition to the data stored in the HDD 1030 , data that is temporarily stored in the cache memory 1020 . Further, “data of a virtual volume” described later includes, in addition to the data stored in the volumes 3500 LB, 3500 RB of the storage apparatuses 1500 L, 1500 R, data that is temporarily stored in the cache memory 1020 of the virtual storage apparatuses 1000 L, 1000 R.
Meanwhile, an application program (hereinafter sometimes abbreviated as an “application”) 2010 , an OS, and system programs as represented by daemons and management programs for assisting in the setting and processing of the OS are executed in the host 1100 . The OS provides to the application 2010 an interface for I/O requests to data existing in the volumes 3000 LA, 3000 LB, 3000 RA, 3000 RB provided by the virtual storage apparatuses 1000 L, 1000 R, and sends I/O requests to the appropriate virtual storage apparatuses 1000 L, 1000 R and volumes 3000 LA, 3000 LB, 3000 RA, 3000 RB according to the request from the application 2010 . In a normal status, the host 1100 issues an I/O request as represented by a read or write request to the volumes 3000 LA, 3000 LB of the virtual storage apparatus 1000 L, and thereby sends and receives data. In other words, upon receiving a read request, the virtual storage apparatus 1000 L reads data from the HDD 1030 and returns such data to the host 1100 when the requested volumes 3000 LA, 3000 LB, 3500 LB correspond to the HDD 1030 inside the virtual storage apparatus 1000 L, or acquires the necessary data and returns such data (all or a part) to the host 1100 by issuing a read request to the storage apparatus 1500 L.
In the case of a write request, in order to make the data redundant, the virtual storage apparatus 1000 L that received the write data sends the write data to the virtual storage apparatus 1000 R as the secondary system, and returns the write complete message to the host 1100 after the virtual storage apparatus 1000 L receives a write data reception complete message from the virtual storage apparatus 1000 R. Incidentally, write data to the virtual storage apparatus 1000 L and write data received by the virtual storage apparatus 1000 R via the virtual storage apparatus 1000 L may also be temporarily retained in the cache memories 1020 L, 1020 R of the respective virtual storage apparatuses 1000 L, 1000 R. Incidentally, as one example of this embodiment, the transfer of this write data is conducted via storage remote copy.
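The ordering of the write path described above (retain locally, forward to the secondary system, wait for the reception complete message, and only then report completion to the host) can be sketched as follows. This is a minimal illustration in Python, not the implementation of the embodiment; the class `VirtualStorage`, its methods, the `pairs` mapping and the string `"WRITE_COMPLETE"` are hypothetical names introduced only for this example.

```python
class VirtualStorage:
    """Hypothetical stand-in for a virtual storage apparatus (1000 L or 1000 R)."""

    def __init__(self, name, secondary=None, pairs=None):
        self.name = name
        self.secondary = secondary   # remote copy target, set only on the primary
        self.pairs = pairs or {}     # assumed mapping: primary volume -> secondary volume
        self.cache = {}              # stands in for the cache memory 1020

    def receive_remote_copy(self, volume, lba, data):
        # Secondary side: retain the write data, then acknowledge reception.
        self.cache[(volume, lba)] = data
        return True

    def write(self, volume, lba, data):
        # Primary side: retain the write data locally first.
        self.cache[(volume, lba)] = data
        if self.secondary is not None and volume in self.pairs:
            # Forward the write data and wait for the reception complete message.
            acked = self.secondary.receive_remote_copy(self.pairs[volume], lba, data)
            if not acked:
                raise IOError("secondary did not acknowledge the write data")
        # Completion is reported to the host only after the acknowledgement.
        return "WRITE_COMPLETE"


secondary = VirtualStorage("1000R")
primary = VirtualStorage("1000L", secondary=secondary, pairs={"3000LA": "3000RA"})
status = primary.write("3000LA", lba=0, data=b"payload")
```

Because completion is returned only after the secondary acknowledges, any write the host has seen complete is present on both sides in this sketch.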
FIG. 3 shows the processing overview of the information system after a failure occurred in the virtual storage apparatus 1000 L under a normal status.
When the primary virtual storage apparatus 1000 L fails and shuts down, the system program in the host 1100 detects this failure, and switches the destination of the I/O request from the primary virtual storage apparatus 1000 L to the secondary virtual storage apparatus 1000 R. Nevertheless, in this case also, the application 2010 is able to continue I/O without being aware that the destination of the I/O request has been switched. Thus, normally, as a volume identifier designated at the time of an I/O request from the application 2010 or the file system, the system program provides a virtual volume identifier (or a device file) at an OS layer (more specifically, a layer that is lower than the file system), and the lower layer of OS manages the correspondence of that identifier and the identifier (or device file) actually allocated to the volume. When switching the destination of the I/O request, the correspondence thereof is switched from the volume 3000 LA and the volume 3000 LB of the virtual storage apparatus 1000 L to the volume 3000 RA and the volume 3000 RB of the virtual storage apparatus 1000 R, so as to realize switching that will be transparent to the application 2010 .
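The transparent switching described above can be illustrated with a small table that maps the identifier seen by the upper layer to the identifier of the volume actually in use. The following Python sketch is illustrative only: `IOPathManager`, `PathError`, the table layout and the identifier strings are assumptions made for this example and are not the structure of the device relation table shown in FIG. 8.

```python
class PathError(Exception):
    """Raised by the assumed lower layer when a storage apparatus is unreachable."""


class IOPathManager:
    def __init__(self, device_relation_table, send):
        # Assumed layout: virtual id -> {"primary": id, "secondary": id, "active": key}
        self.table = device_relation_table
        self.send = send  # callable(real_id, request) supplied by the lower layer

    def submit(self, virtual_id, request):
        entry = self.table[virtual_id]
        try:
            return self.send(entry[entry["active"]], request)
        except PathError:
            if entry["active"] == "secondary":
                raise  # both systems failed; nothing is left to switch to
            # Switch the correspondence; the caller keeps using virtual_id.
            entry["active"] = "secondary"
            return self.send(entry["secondary"], request)


failed = {"1000L:3000LA"}  # simulate a failure of the primary system


def send(real_id, request):
    if real_id in failed:
        raise PathError(real_id)
    return (real_id, request)


table = {"vsda": {"primary": "1000L:3000LA",
                  "secondary": "1000R:3000RA",
                  "active": "primary"}}
manager = IOPathManager(table, send)
result = manager.submit("vsda", "write block 7")
```

The caller issues the request against `vsda` both before and after the switch, which is the sense in which the change of destination stays hidden from the application in this sketch.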
Further, the virtual storage apparatus 1000 R is also able to process the write request, according to the arrival of such write request to the volumes 3000 RA, 3000 RB from the host 1100 , or other express fail over requests. As an example of this change processing, in line with the data copy from the virtual storage apparatus 1000 L to the virtual storage apparatus 1000 R, when the setting is configured to deny the write request from the host 1100 to the volumes 3000 RA, 3000 RB of the virtual storage apparatus 1000 R, such setting is cancelled. Further, when write data is being transferred using remote copy, the copy status of remote copy may also be changed.
FIG. 4 shows the processing overview of the information system after the occurrence of a failure in the network between the virtual storage apparatuses 1000 L, 1000 R.
The virtual storage apparatus 1000 L that detected the network failure notifies this failure to the host 1100 . The host 1100 that received the failure notice requests the secondary virtual storage apparatus 1000 R to process the write request and issues subsequent write requests to both the primary virtual storage apparatus 1000 L and the secondary virtual storage apparatus 1000 R so as to make the data of the primary system and the data of the secondary system uniform.
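The host-side behavior after such a failure notice can be sketched as a router that first asks the secondary system to accept writes and then issues every subsequent write to both systems. This is a simplified Python illustration; `HostWriteRouter`, the `"ENABLE_WRITE"` request and the `record` callback are hypothetical names, and error handling is omitted.

```python
class HostWriteRouter:
    def __init__(self, primary, secondary, send):
        self.primary = primary
        self.secondary = secondary
        self.send = send          # callable(target, request); assumed lower layer
        self.dual_write = False

    def on_link_failure_notice(self):
        # Ask the secondary system to process write requests from the host,
        # then duplicate all subsequent writes to both systems.
        self.send(self.secondary, "ENABLE_WRITE")
        self.dual_write = True

    def write(self, request):
        targets = [self.primary]
        if self.dual_write:
            targets.append(self.secondary)
        return [self.send(target, request) for target in targets]


log = []


def record(target, request):
    log.append((target, request))
    return "OK"


router = HostWriteRouter("1000L", "1000R", record)
router.write("write A")            # normal status: primary only
router.on_link_failure_notice()    # the inter-apparatus network has failed
router.write("write B")            # now issued to both systems
```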
FIG. 5 is a diagram illustrating the concept to be provided by the respective software programs in addition to the software programs to be executed in the host 1100 and information to be used by such software programs. Incidentally, although the software programs are retained in the memory 1102 (FIG. 1) and executed by the processor 1101 (FIG. 1), such software programs may be partially realized as hardware and executed.
In the host 1100 , in addition to the application 2010 and the remote copy manager 5030 , a file system 5020 , an I/O path manager 5000 and an HBA device driver 5010 are executed as program modules inside the OS or Kernel (for the file system 5020 , the I/O path manager 5000 or the HBA device driver 5010 , it is not necessary to execute all processing inside the Kernel).
The HBA device driver 5010 is a program for sending and receiving I/O requests and incidental data through the I/O port 1103 (FIG. 1) mounted on the HBA, and controlling communication with the other virtual storage apparatuses 1000 L, 1000 R and the storage apparatuses 1500 L, 1500 R. The HBA device driver 5010 is also able to provide an identifier corresponding to the volumes 3000 LA, 3000 LB, 3000 RA, 3000 RB provided by the virtual storage apparatuses 1000 L, 1000 R to the upper layer, and receive an I/O request accompanied with such identifier. The volume 5040 illustrates this concept, and corresponds to the respective volumes 3000 LA, 3000 LB, 3000 RA, 3000 RB provided by the virtual storage apparatuses 1000 L, 1000 R.
The I/O path manager 5000 is a module for switching the I/O request destination of the application 2010 . This module provides to the file system 5020 an I/O request interface and the identifier, which is the same type of identifier corresponding to the volume 5040 provided by the HBA device driver 5010 and corresponds to a virtual volume in the host 1100 . The identifier corresponding to the virtual volume in the host 1100 corresponds to the identifier corresponding to the volume 5040 provided by the HBA device driver 5010 in the module, and the device relation table 5001 retains the correspondence thereof. The volume 5050 illustrates the concept of this virtual volume in the host 1100 , and, in FIG. 5, an example of the correspondence thereof corresponds to the identifier corresponding to the volumes 3000 LA, 3000 LB of the virtual storage apparatus 1000 L (to put it differently, it could be said that the entities of the virtual volume 5050 in the host 1100 are the volumes 3000 LA, 3000 LB of the virtual storage apparatus 1000 L).
An I/O request up to this layer is usually designated in a fixed-length block access format. Nevertheless, the I/O request is not limited thereto when the host 1100 is a mainframe, and it may also be designated in a CKD (Count Key Data) format.
The file system 5020 is a module for sending an I/O request and sending and receiving data from/to the virtual storage apparatuses 1000 L, 1000 R, which is done through the identifier and the I/O interface corresponding to the volume 5040 provided by the HBA device driver 5010 , and the identifier and the interface corresponding to the virtual volume 5050 in the host 1100 provided by the I/O path manager 5000 . FIG. 5 illustrates, as an example, the structure of a directory tree inside the file system 5020 in a state where a part of such tree structure 5052 is stored in the volume 5050 provided through virtualization in the host 1100 by the I/O path manager 5000 (as explained above, more precisely, the I/O path manager 5000 provides the virtual volume 5050 in the host 1100 through the identifier, and the data indicated as being stored in the volume 5050 is actually stored in the volumes 3000 LA, 3000 LB, 3000 RA, 3000 RB provided by the virtual storage apparatuses 1000 L, 1000 R shown in the device relation table 5001 ). The file system 5020 provides a file I/O interface to the application 2010 . The file system 5020 called from the application 2010 through the file I/O interface converts the read or write request accompanied with a file name and data offset in the file into a read or write request of a block format while referring to structural information in the file system 5020 such as a directory file or an inode, and delivers the read or write request to the I/O path manager 5000 or the HBA device driver 5010 .
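As a rough illustration of the conversion from a file offset to a block-format request, the following Python sketch maps a byte range of a file onto fixed-length blocks. It assumes, purely for simplicity, a 512-byte block and a file stored in contiguous blocks starting at `file_start_block`; an actual file system would instead look up the block locations through structural information such as an inode. The function name and parameters are hypothetical.

```python
BLOCK_SIZE = 512  # assumed fixed block length in bytes


def to_block_request(file_start_block, offset, length):
    """Map a byte range of a contiguously stored file to (first_block, block_count)."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = file_start_block + offset // BLOCK_SIZE
    last = file_start_block + (offset + length - 1) // BLOCK_SIZE
    return first, last - first + 1


# A 100-byte read at byte offset 1000 of a file whose data starts at block 100
# spans bytes 1000..1099, which touch the file's second and third blocks.
request = to_block_request(file_start_block=100, offset=1000, length=100)
```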
Incidentally, with a Unix system or Windows (registered trademark) system OS, the file I/O interface is used to provide a function referred to as a device file system as the interface for directly operating the data of volumes. Normally, the device file system is deployed under the control of the ‘/dev’ directory of the file space, and the file name of the file of the foregoing directory and below (rsda and so on in the illustrated example) corresponds to the volumes 5040 , 5050 provided by the lower layer (HBA device driver 5010 and I/O path manager 5000 ) of the file system 5020 . Then, data stored in the volumes 5040 , 5050 can be read and written with the file I/O interface as though such data is stored in the device files 5070 , 5080 . Incidentally, in the example shown in FIG. 5, the device file 5070 (rsda, rsdb, rsdc, rsdd) corresponds to the volume 5040 recognized and provided by the HBA device driver 5010 , and the device file 5080 (vsda, vsdb) corresponds to the volume 5050 provided by the I/O path manager 5000 . These device files 5070 , 5080 may be used for the purpose of realizing independent data organization or buffer management when the application 2010 is a database.
The remote copy manager 5030 is a program for acquiring the status of remote copy for realizing the data transfer between the virtual storage apparatuses 1000 L, 1000 R, and for the host 1100 and the I/O path manager 5000 to perform the operation of remote copy, and communicates with the virtual storage apparatuses 1000 L, 1000 R according to the request of a program, a user or the I/O path manager 5000 using this program.
Incidentally, as explained above, it would be desirable if the functions of the HBA device driver 5010 and the I/O path manager 5000 could be partially or wholly installed and uninstalled as modules inside the Kernel. This is because, since the HBA device driver 5010 is a program for controlling the HBA, it is often provided by the manufacturer of the HBA. Similarly, since the processing of the I/O path manager 5000 is decided subject to the processing of the virtual storage apparatuses 1000 L, 1000 R, it is possible that some or all of the modules will be provided by the manufacturer of the virtual storage apparatuses 1000 L, 1000 R. Therefore, as a result of being able to install/uninstall this program, it will be possible to constitute an information system based on a broad range of combinations of HBA and virtual storage apparatuses 1000 L, 1000 R. Further, with the present invention, since the primary system and the secondary system are switched in a manner that is transparent to the application 2010 , transparent switching that does not require the recompilation or the like of the application 2010 can be realized by executing processing inside the Kernel. Moreover, since the I/O path manager 5000 exists in the intermediate layer of the file system 5020 and the HBA device driver 5010 , recompilation of the file system 5020 is no longer required, and transparency of the file system is also secured. In addition, the I/O path manager 5000 is able to use the functions of the HBA device driver 5010 .
Further, the following two methods can be considered when the I/O path manager 5000 inside the Kernel calls the remote copy manager 5030 or performing the opposite communication method thereof.
(A) The I/O path manager 5000 creates a virtual volume for communication, and the file system 5020 creates this communication volume as a device file in the file space. The remote copy manager 5030 stands by in a state of periodically executing a read system call to the device file. The I/O path manager 5000 receives an I/O request from the remote copy manager 5030 , but pends it internally. Then, when it becomes necessary for this module to send a message to the remote copy manager 5030 , the I/O path manager 5000 returns the data containing the message defined as a return value of the I/O request to the remote copy manager 5030 through the file system 5020 . Incidentally, the read system call issued by the remote copy manager thereupon will be forced to wait inside the Kernel for a long period of time. If this is not preferable, the I/O path manager 5000 should return data indicating that there is no message to the remote copy manager 5030 through the file system 5020 after the lapse of a prescribed period of time, and the remote copy manager 5030 that received this message should execute the read system call once again.
(B) Unix (registered trademark) domain socket is used and this is treated as a virtual network communication. Specifically, the remote copy manager 5030 operates one end of the socket, and the I/O path manager 5000 operates the remaining end.
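As an illustration of method (A), the following minimal Python sketch models the pended read with a timeout. It is only a user-space analogy under stated assumptions: the class `CommunicationVolume`, the function `poll_once`, and the `NO_MESSAGE` marker are hypothetical names, and an in-process queue stands in for the device file and the Kernel-side pending of the I/O request.

```python
import queue

NO_MESSAGE = b""  # hypothetical marker meaning "there is no message"


class CommunicationVolume:
    """Stand-in for the virtual communication volume of method (A)."""

    def __init__(self, timeout_sec):
        self._messages = queue.Queue()
        self._timeout_sec = timeout_sec

    def post(self, message):
        # I/O path manager side: a message for the remote copy manager is ready.
        self._messages.put(message)

    def read(self):
        # Pended read: block until a message arrives, or report "no message"
        # after the prescribed period so the caller is not kept waiting forever.
        try:
            return self._messages.get(timeout=self._timeout_sec)
        except queue.Empty:
            return NO_MESSAGE


def poll_once(volume):
    """One iteration of the remote copy manager's read loop."""
    data = volume.read()
    return None if data == NO_MESSAGE else data
```

A caller would invoke `poll_once` repeatedly; a `None` result corresponds to the case where the remote copy manager executes the read system call once again.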
Incidentally, in the following explanation, when the I/O path manager 5000 is to operate remote copy or refer to the status, let it be assumed that such operation is conducted by calling the remote copy manager 5030 through the foregoing communication.
FIG. 6 is a diagram showing the programs to be executed by the virtual storage apparatuses 1000 ( 1000 L, 1000 R) and the storage apparatuses 1500 ( 1500 L, 1500 R), and information to be managed by these programs. Incidentally, although the programs are retained in the memory 1102 (FIG. 1) and the cache memory 1020 and executed by the processor 1101 (FIG. 1), such programs may be partially constituted as hardware and executed.
<4.1. I/O Processing Program 6020 , Parity Group Information 6060 and Volume Information 6050 >
The parity group information 6060 contains information relating to the following configuration of each parity group.
(1) Identifier of HDD 1030 configuring the parity group. Since a plurality of HDDs 1030 are participating in the parity group, this information exists in a plurality for each parity group.
(2) RAID level
Further, the volume information 6050 contains information relating to the following configuration of each volume.
(1) Volume capacity
(2) Identifier of the parity group and areas (start address and/or end address) in the parity group storing data corresponding to the volume.
The I/O processing program 6020 executes the following processing relating to the I/O request received from the host 1100 by referring to the volume information 6050 and the parity group information 6060 .
(A) Staging: Copying data stored in the HDD 1030 to the cache memory 1020 .
(B) Destaging: Copying data stored in the cache memory 1020 to the HDD 1030 . Incidentally, as the pre-processing thereof, redundant data based on RAID technology may also be created.
(C) Read processing: Determining whether data corresponding to the request exists in the cache memory 1020 in response to the read request received from the host 1100 . In case of the data corresponding to the request not existing in the cache memory 1020 , staging processing is executed to copy the data to the cache memory 1020 , and such data is sent to the host 1100 . Incidentally, in case of such data existing in the cache memory 1020 , this data is sent to the host 1100 .
(D) Write processing: Storing the write data received from the host 1100 in the cache memory 1020 . Incidentally, when there is not enough free area in the cache memory 1020 during the processing, destaging processing is executed to copy appropriate data to the HDD 1030 , and the area in the cache memory 1020 is thereafter reused. Further, when the target area of the write request includes an address whose data is already stored in the cache memory 1020 , the data of that area may sometimes be directly overwritten in the cache memory 1020 .
(E) Cache algorithm: Deciding the data in the HDD 1030 to be staged and the data in the cache memory 1020 to be destaged according to an LRU algorithm or the like based on the reference frequency or reference period of data in the cache memory 1020 .
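The interplay of processing (A) through (E) can be sketched as a small write-back cache with LRU eviction. This is a simplified model, not the apparatus's actual implementation: the class name `CachedVolume` is hypothetical, a dictionary stands in for the HDD 1030 , the unit of caching is a single block, and RAID redundancy generation is omitted.

```python
from collections import OrderedDict


class CachedVolume:
    def __init__(self, hdd, cache_blocks):
        self.hdd = hdd                  # dict: block address -> data (stand-in for the HDD)
        self.cache = OrderedDict()      # address -> (data, dirty), least recently used first
        self.cache_blocks = cache_blocks  # assumed to be at least 1

    def _make_room(self):
        # Cache algorithm (E): evict the least recently used block when full.
        while len(self.cache) >= self.cache_blocks:
            addr, (data, dirty) = self.cache.popitem(last=False)
            if dirty:
                self.hdd[addr] = data   # destaging (B)

    def read(self, addr):
        # Read processing (C): stage on a cache miss, then answer from the cache.
        if addr not in self.cache:
            self._make_room()
            self.cache[addr] = (self.hdd[addr], False)  # staging (A)
        self.cache.move_to_end(addr)
        return self.cache[addr][0]

    def write(self, addr, data):
        # Write processing (D): store in the cache; overwrite in place on a hit.
        if addr not in self.cache:
            self._make_room()
        self.cache[addr] = (data, True)
        self.cache.move_to_end(addr)
```

In this sketch a write completes as soon as the data is in the cache, and the HDD copy is updated only when the block is later evicted.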
<4.2. Virtualization Program 6030 and Virtualization Information 6070 >
The virtualization information 6070 contains information relating to the following configuration of each virtual volume.
(1) The following information concerning the areas in the volume of the storage apparatus 1500 and the address space in the virtual volume through which those areas are provided to the host 1100 . When the virtual volume is constituted of a plurality of volumes, the following information also exists in a plurality.
(1-1) Identifier of the storage apparatus 1500 (or identifier of the port), identifier of the volume, and areas (start address and end address) in the volume, constituting the virtual volume
(1-2) Areas (start address and end address) in the virtual volume
(2) Capacity of the virtual volume
The virtualization program 6030 is a program for the virtual storage apparatus 1000 to provide a volume to the host 1100 by using the volume provided by the storage apparatus 1500 . Incidentally, there are the following patterns as the correspondence of the virtual volume provided by the virtualization program 6030 and the relating volume in the storage apparatus 1500 .
(A) A case of using the overall volume in the storage apparatus 1500 as the storage area of the virtual volume. In this case, capacity of the virtual volume will be roughly the same as the capacity of the selected volume (‘roughly the same’ refers to the case where control information and redundant information are stored in the volume of the storage apparatus 1500 ; when there is no such information, the capacity will be the same).
(B) A case of using a part of the volume in the storage apparatus 1500 as the storage area corresponding to the virtual volume. Here, capacity of the virtual volume will be roughly the same as the area capacity to be used.
(C) A case of combining and using a plurality of volumes in a plurality of storage apparatuses 1500 as the storage area of the virtual volume. Here, capacity of the virtual volume will be roughly the same as the total capacity of the respective volumes. Incidentally, as this kind of combination method, there are striping, concatenation (a method of linking a plurality of volumes and treating them as a single volume) and so on.
(D) In addition to pattern (C), further storing parity information or mirror data. Here, capacity of the virtual volume will be half of (C) when storing mirror data, or depend on the parity calculation method when storing parity. Reliability of data stored in the virtual volume can be improved through combination with high-reliability based on RAID inside the storage apparatus 1500 .
Incidentally, in every pattern, the storage apparatus identifier (or port identifier) and the volume identifier designated in the I/O request (information used in the I/O request for identifying volumes in the virtual storage apparatus or volumes controlled by the port, such as a LUN (Logical Unit Number), CKD-format CU number, LDEV (Logical DEVice) number, and the like) differ from those of the original volume.
The virtualization program 6030 is called by the I/O processing program 6020 when the data to be subject to staging or destaging corresponds to the virtual volume, and uses the virtualization information 6070 to execute the following processing.
(A) Staging: Deciding which data stored in the volume of which storage apparatus 1500 should be copied to the cache memory 1020 based on the correspondence of the virtual volume and the volume of the storage apparatus 1500 , and thereafter copying such data to the cache memory 1020 .
(B) Destaging: Deciding which volume of the storage apparatus 1500 should be the target to which data in the cache memory 1020 is copied, based on the correspondence of the virtual volume and the volume of the storage apparatus 1500 , and thereafter copying such data to the storage apparatus 1500 . Incidentally, as the pre-processing thereof, redundant data based on RAID technology may also be created.
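The address resolution that precedes staging and destaging can be sketched as follows for the concatenation case of pattern (C). The names `Extent`, `resolve`, and `capacity` are hypothetical, and each area is expressed as a start address plus a length rather than a start and end address purely for brevity; parity and mirror handling of pattern (D) are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    storage_id: str   # identifier of the storage apparatus (or port)
    volume_id: int    # identifier of the volume in that apparatus
    vol_start: int    # start address of the area within the backing volume
    virt_start: int   # start address of the area within the virtual volume
    length: int       # size of the area in blocks


def resolve(extents, virt_addr):
    """Map a virtual-volume address to (storage, volume, address)."""
    for e in extents:
        if e.virt_start <= virt_addr < e.virt_start + e.length:
            return e.storage_id, e.volume_id, e.vol_start + (virt_addr - e.virt_start)
    raise ValueError(f"address {virt_addr} is outside the virtual volume")


def capacity(extents):
    """Capacity of the virtual volume when no control information is stored."""
    return sum(e.length for e in extents)
```

With two extents of 50 and 30 blocks, for example, the virtual volume has a capacity of 80 blocks and address 50 resolves to the first block of the second backing area.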
<4.3. Remote Copy Program 6010 and Copy Pair Information 6040 >
The copy pair information 6040 possesses the following information for each copy pair (hereinafter sometimes abbreviated as a “pair”) of the copy source volume and the copy destination volume of remote copy. Incidentally, in this embodiment, volumes that are the target of high availability are designated as the copy source volume and the copy destination volume.
(1) Identifier of the virtual storage apparatus 1000 having the copy source volume, and identifier of the volume
(2) Identifier of the virtual storage apparatus 1000 having the copy destination volume, and identifier of the volume
(3) Status of the copy pair (details will be described later)
The remote copy program 6010 is a program for mirroring the data stored in the copy source volume to the copy destination volume, and refers to the copy pair information 6040 to perform the processing. The processing overview and pair status of remote copy (in particular synchronous remote copy) are explained below.
<4.3.1. Copy Processing Operation of Synchronous Remote Copy>
In the synchronous remote copy described above, when the virtual storage apparatus 1000 of the copy source receives a write request for writing into the copy source volume from the host 1100 , the virtual storage apparatus 1000 of the copy source sends the write data to the virtual storage apparatus 1000 of the copy destination, and thereafter returns a write request completion notice to the host 1100 .
When synchronous remote copy is to be executed, the controller 1010 of the virtual storage apparatus 1000 manages information referred to as a copy pair status (Simplex, Initial-Copying, Duplex, Suspend and Duplex-Pending), in order to display the status of remote copy between the pair of copy source volume and copy destination volume on a management screen 1200 or operate the status of remote copy. FIG. 7 shows a status transition diagram relating to the pair status of synchronous remote copy. The respective pair statuses are explained below.
<4.3.1.1. Simplex Status>
The Simplex status is a status where copy between the copy source volume and the copy destination volume configuring a pair has not been started.
<4.3.1.2. Duplex Status>
The Duplex status is a status where synchronous remote copy has been started, the initialization copy described later is complete and the data contents of the copy source volume and the copy destination volume configuring a pair are the same. In this status, excluding the areas that are currently being written, data contents of the copy source volume and data contents of the copy destination volume will be the same. Incidentally, during the Duplex status and in the Duplex-Pending and Initial-Copying statuses, write requests from the host 1100 to the copy destination volume are denied.
<4.3.1.3. Initial-Copying Status>
The Initial-Copying status is an intermediate status during the transition from the Simplex status to the Duplex status, and initialization copy from the copy source volume to the copy destination volume (copy of data already stored in the copy source volume to the copy destination volume) is performed as required during this period. When initialization copy is complete and processing necessary for making the transition to the Duplex status is complete, the pair status becomes a Duplex status.
<4.3.1.4. Suspend Status>
The Suspend status is a status where the contents written into the copy source volume are not reflected in the copy destination volume. In this status, data contents of the copy source volume and the copy destination volume configuring a pair are not the same. Triggered by a command from the user or the host 1100 , the pair status makes a transition from another status to the Suspend status. In addition, a case may be considered where, when it is no longer possible to perform synchronous remote copy due to a network failure or the like between the virtual storage apparatuses 1000 , the pair status makes an automatic transition to the Suspend status.
In the following explanation, the latter case; that is, the Suspend status caused by a failure will be referred to as a Failure Suspend status. Representative examples that cause such Failure Suspend status are, in addition to a network failure, failures in the copy source volume and the copy destination volume, and failure of the controller 1010 .
When a write request is issued to the copy source volume after entering the Suspend status, the virtual storage apparatus 1000 of the copy source receives the write data and stores it in the copy source volume, but does not send the write data to the virtual storage apparatus 1000 of the copy destination. Further, the virtual storage apparatus 1000 of the copy source records the write location of the written data in the copy source volume as a differential bitmap or the like.
Incidentally, when a write request is issued to the copy source volume subsequent to entering the Suspend status, the virtual storage apparatus 1000 of the copy destination also performs the foregoing operation. Further, when a setting referred to as “fence” is configured in a pair before such pair enters the Failure Suspend status, writing of the copy source volume is denied after the pair status makes a transition to the Failure Suspend status. Incidentally, the virtual storage apparatus 1000 of the copy destination may also deny the write request to the copy destination volume during the Failure Suspend status.
<4.3.1.5. Duplex-Pending Status>
The Duplex-Pending status is the intermediate status during the transition from the Suspend status to the Duplex status. In this status, data copy from the copy source volume to the copy destination volume is executed in order to make the data contents of the copy source volume and the copy destination volume coincide. After the data contents of the copy source volume and the copy destination volume become identical, the pair status becomes a Duplex status.
Incidentally, data copy during the Duplex-Pending status is executed, via differential copy of copying only the portions that need to be updated (in other words, the inconsistent data between the copy source volume and the copy destination volume) by using the writing location (for instance, the foregoing differential bitmap or the like) recorded in the virtual storage apparatus 1000 of the copy source or the virtual storage apparatus 1000 of the copy destination during the Suspend status.
Further, although the Initial-Copying status and the Duplex-Pending status were explained above as being separate statuses, these may also be combined and displayed as one status on the screen of the management host 1200 , or subject to transition as one status.
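The relationship between synchronous copy, the Suspend status, and differential copy can be sketched as follows. This is a toy model under stated assumptions: the class `SyncCopyPair` is a hypothetical name, both volumes are held in one process as lists, a set of block numbers stands in for the differential bitmap, and network transfer and failure handling are omitted.

```python
class SyncCopyPair:
    def __init__(self, blocks):
        self.source = [b""] * blocks       # copy source volume
        self.destination = [b""] * blocks  # copy destination volume
        self.suspended = False
        self.dirty = set()                 # stand-in for the differential bitmap

    def write(self, block, data):
        self.source[block] = data
        if self.suspended:
            # Suspend status: only remember where the write landed.
            self.dirty.add(block)
        else:
            # Duplex status: the data reaches the copy destination
            # before the completion notice is returned.
            self.destination[block] = data
        return "completion"

    def suspend(self):
        self.suspended = True

    def resynchronize(self):
        # Duplex-Pending status: copy only the recorded locations.
        copied = sorted(self.dirty)
        for block in copied:
            self.destination[block] = self.source[block]
        self.dirty.clear()
        self.suspended = False
        return copied
```

Only the blocks written during the Suspend status are transferred at resynchronization, which is the point of recording write locations.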
<4.3.1.6. Pair Operation Command>
The pair status makes a transition to another status based on the following commands from the host 1100 or the management host 1200 .
(A) Initialization command: When this command is received during the Simplex status, transition is made to the Initial-Copying status.
(B) Resynchronization command: When this command is received during the Suspend status or the Failure Suspend status, transition is made to the Duplex-Pending status.
(C) Partition command: When this command is received during the Duplex status, transition is made to the Suspend status.
(D) Copy direction inversion command: When this command is received during the Duplex status, Suspend status or Failure Suspend status, relationship of the copy source and the copy destination is inverted. In the case of a Duplex status, the copy direction is also inverted when this command is received.
Incidentally, the initialization command is expected to designate the virtual storage apparatus 1000 of the copy source and the copy source volume, and the virtual storage apparatus 1000 of the copy destination and the copy destination volume, and the remaining commands merely need to designate identifiers showing the pair relationship since such pair relationship has already been formed (combination of the virtual storage apparatus 1000 of the copy source and the copy source volume, and the virtual storage apparatus 1000 of the copy destination and the copy destination volume is also one of such identifiers).
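The pair statuses and the commands above can be summarized as a small state machine. The sketch below is illustrative only: the event names (`initialize`, `resynchronize`, `partition`, `invert`, `copy_complete`, `failure`) and the class `CopyPair` are hypothetical, and the set of statuses from which a failure leads to the Failure Suspend status is an assumption, since the text does not enumerate them.

```python
from enum import Enum


class PairStatus(Enum):
    SIMPLEX = "Simplex"
    INITIAL_COPYING = "Initial-Copying"
    DUPLEX = "Duplex"
    SUSPEND = "Suspend"
    FAILURE_SUSPEND = "Failure Suspend"
    DUPLEX_PENDING = "Duplex-Pending"


S = PairStatus
TRANSITIONS = {
    ("initialize", S.SIMPLEX): S.INITIAL_COPYING,           # command (A)
    ("resynchronize", S.SUSPEND): S.DUPLEX_PENDING,         # command (B)
    ("resynchronize", S.FAILURE_SUSPEND): S.DUPLEX_PENDING,
    ("partition", S.DUPLEX): S.SUSPEND,                     # command (C)
    ("copy_complete", S.INITIAL_COPYING): S.DUPLEX,
    ("copy_complete", S.DUPLEX_PENDING): S.DUPLEX,
    # Assumption: a failure can interrupt any status in which copy is running.
    ("failure", S.DUPLEX): S.FAILURE_SUSPEND,
    ("failure", S.INITIAL_COPYING): S.FAILURE_SUSPEND,
    ("failure", S.DUPLEX_PENDING): S.FAILURE_SUSPEND,
}
INVERTIBLE = (S.DUPLEX, S.SUSPEND, S.FAILURE_SUSPEND)


class CopyPair:
    def __init__(self, source, destination):
        self.source = source
        self.destination = destination
        self.status = S.SIMPLEX

    def apply(self, event):
        if event == "invert":                               # command (D)
            if self.status not in INVERTIBLE:
                raise ValueError(f"cannot invert in {self.status.value}")
            self.source, self.destination = self.destination, self.source
            return
        key = (event, self.status)
        if key not in TRANSITIONS:
            raise ValueError(f"{event} is not accepted in {self.status.value}")
        self.status = TRANSITIONS[key]
```

Events that are not listed for the current status are rejected, which mirrors the way each command is described as valid only in particular statuses.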
FIG. 6 illustrates the programs and information to be executed by the storage apparatus 1500 , and the respective programs and information perform the same operation as the virtual storage apparatus 1000 .
FIG. 8 is a diagram showing the information contained in the device relation table 5001 . The device relation table 5001 manages the following information for each virtual volume (more specifically, for each identifier corresponding to such volume) in the host 1100 provided by the I/O path manager 5000 .
(A) Identifiers of the virtual volumes in the host 1100
(B) Related volume identifier list: Identifiers of volumes of the storage apparatus 1500 that may become the entity of virtual volumes in the host 1100 are included. Incidentally, as said individual identifiers, the identifiers allocated by the HBA device drivers 5010 as the lower layer of the I/O path manager 5000 are used. In this embodiment, identifiers of volumes in the primary virtual storage apparatus 1000 ( 1000 L) and volumes in the secondary virtual storage apparatus 1000 ( 1000 R) are listed (if a normal status).
(C) Primary volume: Shows which volume listed at (B) is a primary.
(D) Failure status
(E) Pair status
Incidentally, since the identifiers of (A) and the identifiers of (B) are handled the same from the perspective of the file system 5020 , overlap of the identifiers of (A) and (B) is not allowed. Further, since overlap is also not allowed in the case of combining (A) and (B), the I/O path manager 5000 needs to create the identifiers of (A) while giving consideration to this point.
FIG. 9 is a flowchart illustrating the initialization processing of the I/O path manager 5000 . This initialization processing is now explained with reference to the flowchart. Incidentally, although there are cases below where the processing subject of various processes is explained as the “I/O path manager 5000,” in reality, it goes without saying that the processor 1101 (FIG. 1) of the host 1100 executes the corresponding processing based on a program called the “I/O path manager 5000.”
(S 9001 ) The I/O path manager 5000 receives an initialization command containing the following information from the user of the management host 1200 or the host 1100 . Incidentally, as the initialization processing of a duplex system, this is also referred to as an HA (High Availability) initialization command.
(A) Primary virtual storage apparatus 1000 and its volumes
(B) Secondary virtual storage apparatus 1000 and its volumes
(S 9002 ) The I/O path manager 5000 communicates with both virtual storage apparatuses 1000 commanded at S 9001 and acquires the existence of volumes and the capacity thereof.
(S 9003 ) The I/O path manager 5000 confirms that volumes commanded at S 9001 exist and are of the same capacity. When this cannot be confirmed, the I/O path manager 5000 returns an error to the command source.
(S 9004 ) The I/O path manager 5000 sends a remote copy initialization command to one or both virtual storage apparatuses 1000 . This initialization command is commanded with the primary volume as the copy source volume and the secondary volume as the copy destination volume. Based on this command, the virtual storage apparatus 1000 starts remote copy.
(S 9005 ) The I/O path manager 5000 registers the following information in the device relation table 5001 , and thereafter returns an initialization start reply to the source of the initialization command.
(A) Identifiers of the virtual volumes in the host 1100 (=values created by the I/O path manager 5000 )
(B) Related volume identifier list (=two identifiers corresponding to the virtual storage apparatus 1000 and the volume designated at S 9001 (both the primary system and secondary system)).
(C) Identifier of the primary volume (=primary volume designated at S 9001 )
(D) Failure status (=secondary system in preparation)
(E) Pair status (=Initial-Copying)
(S 9006 ) The I/O path manager 5000 monitors the pair status of remote copy, and updates the device relation table 5001 to the following information upon transition to the Duplex status.
(D) Failure status (=normal status)
(E) Pair status (=Duplex)
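The table entry registered at S 9005 and updated at S 9006 can be sketched as follows. The names `DeviceRelation`, `ha_initialize`, and `on_pair_status`, the `vsd<n>` naming scheme, and the `capacities` dictionary (identifiers allocated by the HBA device driver mapped to capacities) are assumptions made for illustration; communication with the virtual storage apparatuses and the remote copy initialization command of S 9004 are not modeled.

```python
from dataclasses import dataclass


@dataclass
class DeviceRelation:
    virtual_volume_id: str     # (A)
    related_volume_ids: list   # (B)
    primary_volume_id: str     # (C)
    failure_status: str        # (D)
    pair_status: str           # (E)


def ha_initialize(table, capacities, primary_id, secondary_id):
    # S9003: both volumes must exist and be of the same capacity.
    for vol in (primary_id, secondary_id):
        if vol not in capacities:
            raise ValueError(f"volume {vol} does not exist")
    if capacities[primary_id] != capacities[secondary_id]:
        raise ValueError("primary and secondary capacities differ")

    # S9005: create an identifier that overlaps neither (A) nor (B).
    used = set(capacities) | set(table)
    n = 0
    while f"vsd{n}" in used:
        n += 1
    entry = DeviceRelation(
        virtual_volume_id=f"vsd{n}",
        related_volume_ids=[primary_id, secondary_id],
        primary_volume_id=primary_id,
        failure_status="secondary system in preparation",
        pair_status="Initial-Copying",
    )
    table[entry.virtual_volume_id] = entry
    return entry


def on_pair_status(entry, pair_status):
    # S9006: reflect the transition to the Duplex status.
    if pair_status == "Duplex":
        entry.failure_status = "normal status"
        entry.pair_status = "Duplex"
```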
As a result of the foregoing processing, the I/O path manager 5000 is able to start the preparation for high availability including the setting of remote copy according to the user's command. Incidentally, in reality, since the I/O path manager 5000 is able to provide the virtual volume in the host 1100 immediately after S 9005 , users who wish to make access in a file format are able to start file I/O by issuing a mount command to the volume. Further, as a different method, the I/O path manager 5000 may define the virtual volume in the host 1100 corresponding to the volume for which high availability is to be realized before remote copy is set, and the foregoing processing may then be started, with the file system 5020 already mounting that volume, when the user designates a volume to become the secondary system.
FIG. 10 is a diagram showing the processing flow when the I/O path manager 5000 receives a write request from the file system 5020 .
(S 10001 ) From the file system 5020 , the I/O path manager 5000 is called (or receives a message of) a write request function including the identifier of the virtual volume in the host 1100 to become the write destination, write location of the volume, and the write length.
(S 10002 ) The I/O path manager 5000 confirms the failure status of the virtual volume and, if it is a remote copy failed status, transfers the control to the dual write processing at S 10020 , and otherwise executes S 10003 .
(S 10003 ) The I/O path manager 5000 issues a write request to the primary volume. Incidentally, issuance of the write request is actually realized by calling the HBA device driver 5010 of the lower layer.
(S 10004 ) The I/O path manager 5000 confirms the reply of the write request. It returns a completion reply to the file system 5020 if the reply is a normal end, transfers the control to the dual write processing at S 10020 if the reply is a remote copy failure, or transfers the control to the switch processing at S 10010 if there is no reply or in other cases.
Incidentally, the dual write processing at S 10020 is executed at the following steps.
(S 10021 ) If the writing into the primary or secondary volume is denied due to the setting of remote copy, the I/O path manager 5000 cancels this setting.
(S 10022 ) The I/O path manager 5000 issues a write request to the primary volume.
(S 10023 ) The I/O path manager 5000 issues a write request to the secondary volume. The I/O path manager 5000 waits for the arrival of a write request reply from both the primary system and secondary system, and returns a completion reply to the file system 5020 .
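The branching of S 10002 through S 10004 and the dual write of S 10021 through S 10023 can be sketched as follows. Everything here is a simplified stand-in: `StubVolume`, `handle_write`, `dual_write`, and the reply strings are hypothetical, a `fenced` flag models a volume that denies writes due to the remote copy setting, the two writes are issued one after the other rather than in parallel, and failure handling inside the dual write is reduced to an error reply.

```python
OK = "ok"
REMOTE_COPY_FAILURE = "remote_copy_failure"


class StubVolume:
    """Hypothetical volume reachable through the HBA device driver."""

    def __init__(self, fenced=False, alive=True):
        self.fenced = fenced    # writes denied due to the remote copy setting
        self.alive = alive      # False models "no reply"
        self.written = []

    def allow_writes(self):
        self.fenced = False

    def write(self, data):
        if not self.alive:
            return None
        if self.fenced:
            return REMOTE_COPY_FAILURE
        self.written.append(data)
        return OK


def dual_write(primary, secondary, data):
    primary.allow_writes()      # S10021: cancel the write denial, if any
    secondary.allow_writes()
    replies = (primary.write(data), secondary.write(data))  # S10022, S10023
    return "completion" if all(r == OK for r in replies) else "error"


def handle_write(entry, primary, secondary, switch, data):
    if entry["failure_status"] == "remote copy failed":     # S10002
        return dual_write(primary, secondary, data)
    reply = primary.write(data)                             # S10003
    if reply == OK:                                         # S10004
        return "completion"
    if reply == REMOTE_COPY_FAILURE:
        return dual_write(primary, secondary, data)
    return switch(data)         # no reply or other cases
```

The `switch` argument is a callable standing in for the switch processing at S 10010 .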
<8.1. Flow of Switch Processing>
The processing realized by the switch processing is further explained.
(S 10011 ) The I/O path manager 5000 foremost confirms whether the secondary volume is available by referring to the failure status of the device relation table 5001 , and returns an error reply to the file system 5020 if it determines that the secondary volume is unavailable, or executes S 10012 if the secondary volume is available. Incidentally, the statuses regarded as unavailable include a status where there is no secondary system (when the secondary virtual storage apparatus 1000 is not functioning due to a failure, or in the case of a volume for which the secondary virtual storage apparatus 1000 was not set to begin with), and the status where the secondary system is still in preparation through the initialization described above.
(S 10012 ) The I/O path manager 5000 issues a remote copy stop command to the secondary virtual storage apparatus 1000 and, after confirming that the copy status entered the Suspend status, issues a copy direction inversion command.
(S 10013 ) The I/O path manager 5000 issues a remote copy resynchronization command to the secondary virtual storage apparatus 1000 . Incidentally, there is no need to wait until the resynchronization is actually complete and the pair status enters the Duplex status.
(S 10014 ) The I/O path manager 5000 updates the primary volume identifier of the device relation table 5001 to a volume identifier that was a secondary system theretofore, and switches the primary system and the secondary system. Then, the I/O path manager 5000 sends a write request to the new primary volume through the HBA device driver 5010 .
(S 10015 ) The I/O path manager 5000 confirms the reply of the write request, returns a completion reply to the file system 5020 if it is a normal end or returns an error reply if it is an error, and ends the processing.
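The order of operations in S 10011 through S 10015 can be sketched as follows. The names `StubRemoteCopy`, `switch_and_write`, and the command strings are hypothetical, the entry of the device relation table is a plain dictionary, and returning an error when the Suspend status cannot be confirmed is an assumption, since the text does not describe that case.

```python
UNAVAILABLE = {"no secondary system", "secondary system in preparation"}


class StubRemoteCopy:
    """Hypothetical stand-in for commands sent to the secondary apparatus."""

    def __init__(self):
        self.commands = []
        self.status = "Duplex"

    def send(self, command):
        self.commands.append(command)
        if command == "stop":
            self.status = "Suspend"
        elif command == "resynchronize":
            self.status = "Duplex-Pending"
        return self.status


def switch_and_write(entry, remote_copy, write_to, data):
    # S10011: the secondary volume must be usable.
    if entry["failure_status"] in UNAVAILABLE:
        return "error"
    # S10012: stop remote copy, confirm Suspend, then invert the direction.
    if remote_copy.send("stop") != "Suspend":
        return "error"  # assumption: give up if Suspend cannot be confirmed
    remote_copy.send("invert")
    # S10013: resynchronize without waiting for the Duplex status.
    remote_copy.send("resynchronize")
    # S10014: the former secondary volume becomes the primary volume.
    old_primary = entry["primary"]
    entry["primary"] = next(v for v in entry["related"] if v != old_primary)
    reply = write_to(entry["primary"], data)
    # S10015: report the result to the file system.
    return "completion" if reply == "ok" else "error"
```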
<8.1.1. Countermeasures Against Write Request Failure During Dual Write Processing>
When the write request to the primary volume at S 10022 ends in a failure during the dual write processing at S 10020 , control may be transferred to the switch processing at S 10010 . Further, when the write request to the secondary volume at S 10023 ends in a failure, the failure status of the device relation table 5001 is changed to ‘no secondary system,’ and writing is thereby completed.
Further, since the pair status is a Failure Suspend status during the dual write processing, write locations are recorded for the volume of the virtual storage apparatus 1000 in the differential bitmap of remote copy. Nevertheless, since the write data written in both volumes based on the dual write processing are the same, it is desirable to avoid recording in the differential bitmap while the dual write processing is being conducted normally, and to copy only the differential data during the resynchronization processing after recovery from the communication failure. As a solution for the above, while the dual write processing is being conducted normally, the differential bitmaps of the volumes of both the primary and secondary virtual storage apparatuses 1000 may be cleared periodically and repeatedly. With this method, there is no need to issue a clear command for each write request, and it is possible to avoid copying all areas of the target volume during the resynchronization of remote copy. This is because, although the write requests of the dual write issued after the most recent clearing and the write requests issued while the dual write was failing will both be recorded as write locations in the differential bitmap, there will be no data inconsistency or copy omission area: even when a data area recorded during a normal dual write is copied at resynchronization, the data contents of the copy destination will not change.
Incidentally, in the foregoing solution, processing of write requests may be temporarily stopped in order to clear the differential bitmaps of both the primary and secondary systems. As methods of stopping the processing, the I/O path manager 5000 may refrain from transferring the write requests received from the file system 5020 to the virtual storage apparatus 1000 until both differential bitmaps are cleared, or the write request processing may be pended in the primary virtual storage apparatus 1000 until both differential bitmaps are cleared.
As a second solution, there is a method of allocating two differential bitmaps respectively to the primary and secondary volumes. The processing contents thereof are shown below.
(Initial status) The primary and secondary virtual storage apparatuses 1000 respectively record the location of the write request on one side of the two differential bitmaps. Thus, both virtual storage apparatuses 1000 will retain and manage information concerning an active side (this side refers to the side recording the write location when the write request arrives, and the other side of the differential bitmap is referred to as an inactive side). Further, it is desirable that there is nothing recorded on the inactive side of the differential bitmap.
(Step 1 ) The primary virtual storage apparatus 1000 switches the differential bitmap that serves as the recording destination of write locations by updating the management information of the active side to the alternative differential bitmap that was the inactive side, so that subsequent write requests are recorded in the switched differential bitmap. The secondary virtual storage apparatus 1000 is similarly switched. The trigger for starting the switch processing is given from the I/O path manager 5000 to both virtual storage apparatuses 1000 . Incidentally, the switch processing of the primary system and secondary system may be executed in any order, or may be executed in parallel.
(Step 2 ) The I/O path manager 5000 issues a differential bitmap clear command to both virtual storage apparatuses 1000 upon waiting for a switch completion reply from both virtual storage apparatuses 1000 . The virtual storage apparatus 1000 that received the clear command clears the write location of the differential bitmap that is an inactive side, and returns a reply to the I/O path manager 5000 . Similar to the switch processing, the clear processing of the primary system and secondary system may be executed in any order, or may be executed in parallel.
(Step 3 ) The I/O path manager 5000 waits for a clear completion reply from both virtual storage apparatuses 1000 , and re-executes the process from Step 1 after the lapse of a certain period of time.
In the case of this solution, with the resynchronization processing after recovery of the communication failure, the area to perform differential copy can be decided during the Duplex-Pending status by calculating the logical sum of four bitmaps of the primary system and secondary system. Further, although there are many bitmaps in this method, there is no need to pend the write request.
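The second solution can be sketched as follows. The class `DualBitmap` and the function `blocks_to_copy` are hypothetical names, each differential bitmap is a list of booleans indexed by block number, and the command exchange between the I/O path manager and the virtual storage apparatuses is reduced to direct method calls.

```python
class DualBitmap:
    """Two differential bitmaps held by one virtual storage apparatus."""

    def __init__(self, blocks):
        self.maps = [[False] * blocks, [False] * blocks]
        self.active = 0  # index of the side that records write locations

    def record(self, block):
        self.maps[self.active][block] = True

    def switch(self):
        # Step 1: the inactive side becomes the recording destination.
        self.active ^= 1

    def clear_inactive(self):
        # Step 2: clear the side that is no longer recording.
        inactive = self.active ^ 1
        self.maps[inactive] = [False] * len(self.maps[inactive])

    def union(self):
        return [a or b for a, b in zip(*self.maps)]


def blocks_to_copy(primary, secondary):
    """Logical sum of the four bitmaps, as a list of block numbers."""
    merged = zip(primary.union(), secondary.union())
    return [i for i, (p, s) in enumerate(merged) if p or s]
```

Because writes are always recorded on the active side while only the inactive side is cleared, no write request has to be pended during a clear cycle.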
The following third solution is a modified example of the foregoing second solution.
(Initial status) The primary and secondary virtual storage apparatuses 1000 respectively record the location of the write request on both sides of the differential bitmaps. Thus, both virtual storage apparatuses 1000 will retain and manage information concerning the differential bitmap side that was previously cleared.
(Step 1 ) The I/O path manager 5000 issues a differential bitmap clear command to both virtual storage apparatuses 1000 . The virtual storage apparatus 1000 that received the clear command clears the write location of the alternative differential bitmap, that is, the one that is not the differential bitmap that was cleared previously, and returns a reply to the I/O path manager 5000 .
(Step 2 ) The I/O path manager 5000 waits for a clear completion reply from both virtual storage apparatuses 1000 , and re-executes the process from Step 1 after the lapse of a certain period of time.
FIG. 11 is a flowchart showing the processing contents when the I/O path manager 5000 receives a read request from the file system 5020 .
(S 11001 ) The I/O path manager 5000 is called by the file system 5020 through a read request function (or receives a read request message) that includes the identifier of the virtual volume in the host 1100 to become the read destination, the read location in the volume, and the read length.
(S 11002 ) The I/O path manager 5000 confirms the failure status of the virtual volume, executes S 11021 if it is a normal status and the I/O load against the primary volume is high (for instance, when a given IOPS is exceeded or a given bandwidth is exceeded) or otherwise executes S 11003 (no secondary system, secondary system in preparation, normal status, etc.).
(S 11003 ) The I/O path manager 5000 issues a read request to the primary volume.
(S 11004 ) The I/O path manager 5000 confirms the reply of the read request, returns a completion reply to the file system 5020 if it is a normal end or transfers the control to the switch processing at S 11010 in other cases.
(S 11021 ) The I/O path manager 5000 issues a read request to the secondary volume.
(S 11022 ) The I/O path manager 5000 confirms the reply of the read request, returns a completion reply to the file system 5020 if it is a normal end or executes S 11023 in other cases.
(S 11023 ) The I/O path manager 5000 updates a failure status of the device relation table 5001 to ‘no secondary system,’ and executes S 11003 .
<9.1. Flow of Switch Processing>
The processing realized by the switch processing is further explained.
(S 11011 ) The I/O path manager 5000 first confirms whether the secondary volume is available by referring to the failure status of the device relation table 5001 , and returns an error reply to the file system 5020 if it determines that the secondary volume is unavailable or executes S 11012 if the secondary volume is available. Incidentally, statuses in which the secondary volume is determined to be unavailable include a status where there is no secondary system (when the secondary virtual storage apparatus 1000 is not functioning due to a failure, or in a case of a volume in which the secondary virtual storage apparatus 1000 is not set to begin with), and the status of initialization in preparation described above.
(S 11012 ) The I/O path manager 5000 issues a remote copy stop command to the secondary virtual storage apparatus 1000 and, after confirming that the copy status entered the Suspend status, issues a copy direction inversion command.
(S 11013 ) The I/O path manager 5000 issues a remote copy resynchronization command to the secondary virtual storage apparatus 1000 . Incidentally, there is no need to wait until the resynchronization is actually complete and the pair status enters the Duplex status.
(S 11014 ) The I/O path manager 5000 updates the primary volume identifier of the device relation table 5001 to the volume identifier that had been the secondary system until then, and thereby switches the primary system and the secondary system. Then, the I/O path manager 5000 sends a read request to the new primary volume through the HBA device driver 5010 .
(S 11015 ) The I/O path manager 5000 confirms the reply of the read request, returns a completion reply to the file system 5020 if it is a normal end or returns an error reply if it is an error and ends the processing.
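The read flow of FIG. 11 together with its switch processing can be sketched as follows. This is a minimal illustration and not the embodiment's implementation: the Relation class stands in for one row of the device relation table 5001 , the volumes mapping of identifiers to callables is an assumed interface, and the remote copy stop, inversion and resynchronization commands are omitted.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    """Hypothetical stand-in for one row of the device relation table 5001."""
    primary: str
    secondary: str
    failure_status: str = "normal status"


def read(rel, volumes, location, length, primary_load_high=False):
    """Route one read request. `volumes` maps a volume identifier to a
    callable that returns data or raises IOError (an assumed interface)."""
    # S 11002: use the secondary only in normal status under high primary load.
    if rel.failure_status == "normal status" and primary_load_high:
        try:
            return volumes[rel.secondary](location, length)     # S 11021, S 11022
        except IOError:
            rel.failure_status = "no secondary system"          # S 11023
    try:
        return volumes[rel.primary](location, length)           # S 11003, S 11004
    except IOError:
        return switch_and_read(rel, volumes, location, length)  # S 11010


def switch_and_read(rel, volumes, location, length):
    # S 11011: without a usable secondary volume, report the error.
    if rel.failure_status in ("no secondary system",
                              "secondary system in preparation"):
        raise IOError("secondary volume unavailable")
    # Remote copy stop, inversion and resynchronization are omitted here.
    rel.primary, rel.secondary = rel.secondary, rel.primary     # swap the roles
    return volumes[rel.primary](location, length)               # re-send the read


def ok(location, length):
    return b"x" * length


def down(location, length):
    raise IOError("no reply")


rel = Relation("vol-L", "vol-R")
print(read(rel, {"vol-L": down, "vol-R": ok}, 0, 4), rel.primary)
```

An IOError raised by the re-sent read corresponds to the error reply returned to the file system 5020 at the end of the switch processing.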
In this section, the flow of processing from the time the I/O path manager 5000 detects a failure until the recovery is complete is explained. Incidentally, this processing is periodically executed in the background.
<10.1. Network Failure between Virtual Storage Apparatuses 1000 >
(Step 1 ) The I/O path manager 5000 monitors the pair status of remote copy and detects the occurrence of some kind of failure by discovering a Failure Suspend status.
(Step 2 ) The I/O path manager 5000 issues a remote copy stop command to the secondary virtual storage apparatus 1000 , inverts the copy direction after confirming that the copy status entered a Suspend status, and queries the status of the respective virtual storage apparatuses 1000 . Then the I/O path manager 5000 updates the failure status of the device relation table 5001 to ‘remote copy failure’ after confirming that no failure has occurred in the virtual storage apparatuses 1000 themselves and that the cause is a network failure. Incidentally, this processing may also utilize the result of work performed by the storage administrator.
(Step 3 ) Wait until the network recovers.
(Step 4 ) The I/O path manager 5000 issues a pair resynchronization command to the primary virtual storage apparatus 1000 .
(Step 5 ) The I/O path manager 5000 updates the failure status of the device relation table 5001 to ‘secondary system in preparation.’
(Step 6 ) The I/O path manager 5000 waits for the pair status to become a Duplex status, and thereafter updates the failure status of the device relation table 5001 to ‘normal status.’
<10.2. Failure and Shutdown of Primary Virtual Storage Apparatus 1000 >
(Step 1 ) The I/O path manager 5000 detects the occurrence of a failure by monitoring the status of the primary virtual storage apparatus 1000 .
(Step 2 ) The I/O path manager 5000 switches the subsequent I/O request destination to the secondary virtual storage apparatus 1000 by changing the identifier of the primary volume of the device relation table 5001 to the identifier of the secondary volume, and further updates the failure status to ‘no secondary system.’
(Step 3 ) The I/O path manager 5000 waits until the old primary (currently secondary switched at Step 2 ) virtual storage apparatus 1000 recovers.
(Step 4 ) The I/O path manager 5000 issues a pair resynchronization command or initialization command to the primary virtual storage apparatus 1000 .
(Step 5 ) The I/O path manager 5000 updates the failure status of the device relation table 5001 to ‘secondary system in preparation.’
(Step 6 ) The I/O path manager 5000 waits for the pair status to become a Duplex status, and then updates the failure status of the device relation table 5001 to ‘normal status.’
<10.3. Failure and Shutdown of Secondary Virtual Storage Apparatus 1000 >
(Step 1 ) The I/O path manager 5000 detects the occurrence of a failure by monitoring the status of the secondary virtual storage apparatus 1000 .
(Step 2 ) The I/O path manager 5000 updates the failure status of the device relation table 5001 to ‘no secondary system.’
(Step 3 ) The I/O path manager 5000 waits until the secondary virtual storage apparatus 1000 recovers.
(Step 4 ) The I/O path manager 5000 issues a pair resynchronization command or initialization command to the primary virtual storage apparatus 1000 .
(Step 5 ) The I/O path manager 5000 updates the failure status of the device relation table 5001 to ‘secondary system in preparation.’
(Step 6 ) The I/O path manager 5000 waits for the pair status to become a Duplex status, and then updates the failure status of the device relation table 5001 to ‘normal status.’
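The failure status updates described in 10.1 to 10.3 can be summarized as a small transition table. The following is only a sketch: the event names (network_failure, primary_down, secondary_down, resync_issued, pair_duplex) are hypothetical labels introduced for illustration, while the status strings are the ones recorded in the device relation table 5001 .

```python
# (current failure status, event) -> next failure status
TRANSITIONS = {
    ("normal status", "network_failure"): "remote copy failure",     # 10.1 Step 2
    ("normal status", "primary_down"): "no secondary system",        # 10.2 Step 2
    ("normal status", "secondary_down"): "no secondary system",      # 10.3 Step 2
    ("remote copy failure", "resync_issued"): "secondary system in preparation",
    ("no secondary system", "resync_issued"): "secondary system in preparation",
    ("secondary system in preparation", "pair_duplex"): "normal status",
}


def next_status(status, event):
    """Return the next failure status; pairs not listed leave it unchanged."""
    return TRANSITIONS.get((status, event), status)


status = "normal status"
for event in ("network_failure", "resync_issued", "pair_duplex"):
    status = next_status(status, event)
    print(event, "->", status)
```

The sketch covers only the status column. In the case of 10.2, the primary volume identifier of the device relation table 5001 is also changed at Step 2 , which is not modeled here.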
In the foregoing explanation, although remote copy was configured to the virtual storage apparatus 1000 according to an initialization request issued from the I/O path manager 5000 , the opposite method described below can also be considered.
(Step 1 ) The management host 1200 starts remote copy by issuing a remote copy pair initialization command to the virtual storage apparatus 1000 .
(Step 2 ) The I/O path manager 5000 receives a scanning request.
(Step 3 ) The I/O path manager 5000 acquires the configuration of remote copy in the respective volumes through the HBA device driver 5010 (status of remote copy configuration, whether it is a copy source or a copy destination, the virtual storage apparatus 1000 to become the other pair and its volume). Incidentally, as the foregoing acquisition method, a SCSI command can be used in the I/O network, or information can be acquired using other communication networks.
(Step 4 ) The I/O path manager 5000 creates a device relation table 5001 based on the information acquired at the previous step, and starts the processing described above. Incidentally, creation examples of the device relation table 5001 are shown below.
(A) Identifier of the virtual volume in the host 1100 =value created by the I/O path manager 5000
(B) Related volume identifier list=identifiers of the copy source volume and the copy destination volume of remote copy
(C) Primary volume=copy source volume of remote copy
(D) Failure status=‘normal status’ when the pair status acquired from the virtual storage apparatus 1000 is a Duplex status, ‘secondary system in preparation’ when it is an Initial-Copying status or a Duplex-Pending status, ‘remote copy failure’ when it is a Suspend status or a Failure Suspend status
(E) Pair status=pair status acquired from the virtual storage apparatus 1000
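Creation examples (A) to (E) can be sketched as follows. This is a minimal illustration, assuming a hypothetical DeviceRelation record and a hypothetical build_relation helper; the generated identifier format "vvol1", "vvol2", and so on is likewise an assumption.

```python
from dataclasses import dataclass
from itertools import count

# (D) Pair status acquired from the apparatus -> failure status.
FAILURE_STATUS = {
    "Duplex": "normal status",
    "Initial-Copying": "secondary system in preparation",
    "Duplex-Pending": "secondary system in preparation",
    "Suspend": "remote copy failure",
    "Failure Suspend": "remote copy failure",
}

_next_id = count(1)  # source of identifiers created by the I/O path manager


@dataclass
class DeviceRelation:
    virtual_volume_id: str   # (A)
    related_volumes: tuple   # (B)
    primary_volume: str      # (C)
    failure_status: str      # (D)
    pair_status: str         # (E)


def build_relation(copy_source, copy_destination, pair_status):
    """Create one table row from the acquired remote copy configuration."""
    return DeviceRelation(
        virtual_volume_id=f"vvol{next(_next_id)}",
        related_volumes=(copy_source, copy_destination),
        primary_volume=copy_source,
        failure_status=FAILURE_STATUS[pair_status],
        pair_status=pair_status,
    )


row = build_relation("L:0001", "R:0001", "Duplex-Pending")
print(row.primary_volume, "/", row.failure_status)
```

A pair status outside the five listed in (D) raises KeyError in this sketch, since the text does not state which failure status such a pair should receive.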
High availability is realized in this embodiment based on the operation of the hardware and programs described above. Incidentally, as a countermeasure to be taken when much time is required for the switch processing illustrated in FIG. 10 and FIG. 11, a part of the foregoing switch processing can be executed as preliminary processing when it becomes necessary for the I/O path manager 5000 to re-send the I/O request. Here, the preliminarily performed switch processing can be reverted if the re-sent I/O request is returned with a normal reply, and the remaining portions of the foregoing switch processing can be executed if the re-sent I/O request is returned with an error reply or if there is no reply. Further, in this embodiment, all volumes may be virtualized with the virtual storage apparatus 1000 , the entity may be a virtual volume in the storage apparatus 1500 , and the virtual storage apparatus 1000 may be an apparatus dedicated to virtualization, or contrarily a constitution where the entity of all volumes is inside the virtual storage apparatus 1000 may be adopted. Moreover, in addition to the capacity, various other attributes may be configured to the volumes provided by the virtual storage apparatus 1000 (for instance, an emulation type or a volume identification number acquirable with an Inquiry command defined based on a SCSI standard).
Such attribute information and attribute change are also transferred from the primary virtual storage apparatus to the secondary virtual storage apparatus based on remote copy, and these may also be managed in both virtual storage apparatuses.
In the write/read processing illustrated in FIG. 10 and FIG. 11, the I/O path manager 5000 specifically transfers the operation of remote copy to the virtual storage apparatus 1000 . Nevertheless, since the operation of remote copy may differ for each vendor of the virtual storage apparatus 1000 , there are cases when it would be more preferable not to include such operation in the write processing and read processing of the I/O path manager 5000 . FIG. 19 to FIG. 21 show the processing contents of such a form. Incidentally, although there are cases below where the processing subject of various processes is explained as the “virtual storage apparatus 1000,” in reality, it goes without saying that the processor 1101 (FIG. 1) of the virtual storage apparatus 1000 executes the corresponding processing based on programs stored in the memory 1012 (FIG. 1).
<12.1. Write Processing of I/O Path Manager>
FIG. 19 is a flowchart showing the processing contents of write processing to be executed by the I/O path manager 5000 . The processing contents at the respective steps of S 19001 to S 19023 in FIG. 19 are the same as the processing contents at the respective steps of S 10001 to S 10023 in FIG. 10. FIG. 19 differs from FIG. 10 in the following points.
(Difference 1) The operation of remote copy at steps S 19012 , S 19013 and S 19021 is skipped.
(Difference 2) The routine does not reach step S 19020 of the flow during remote copy failure. Nevertheless, these differences only occur when it is not possible to identify an error message signifying remote copy failure in normal read/write processing.
<12.2. Processing of Storage Apparatus 1000 >
FIG. 21 is a diagram showing the operation of remote copy to be performed when the virtual storage apparatus 1000 receives a write request.
(S 21001 ) The virtual storage apparatus 1000 receives a write request.
(S 21002 ) The virtual storage apparatus 1000 determines whether the target volume of the write request is related to remote copy, and executes S 21003 when it is unrelated, and executes S 21004 when it is related.
(S 21003 ) The virtual storage apparatus 1000 performs normal write processing, returns a reply to the host 1100 and ends this processing.
(S 21004 ) The virtual storage apparatus 1000 determines the remote copy attribute of the target volume of the write request, and executes S 21005 when it is a copy source attribute, and executes S 21011 when it is a copy destination attribute.
(S 21005 ) The virtual storage apparatus 1000 executes synchronous remote copy processing, transfers write data to the secondary storage, and waits for a reply.
(S 21006 ) The virtual storage apparatus 1000 determines whether the copy was successful, and executes S 21008 if the copy was successful, and executes S 21007 if the copy was unsuccessful.
(S 21007 ) The virtual storage apparatus 1000 changes the status of the remote copy pair in which the target volume is the copy source to a Failure Suspend status. However, writing to this volume is not prohibited.
(S 21008 ) The virtual storage apparatus 1000 performs normal write processing, returns a reply to the host 1100 , and ends this processing.
(S 21011 ) The virtual storage apparatus 1000 stops remote copy, and inverts the relationship of the copy source and the copy destination.
(S 21012 ) The virtual storage apparatus 1000 starts the resynchronization processing.
(S 21013 ) The virtual storage apparatus 1000 performs normal write processing, returns a reply to the host 1100 , and then ends this processing.
Incidentally, it is not necessary to wait until the resynchronization processing at S 21012 is complete. This is because the virtual storage apparatus 1000 executing S 21012 is a secondary system, the primary virtual storage apparatus 1000 is not necessarily operating normally, and much time may be required until the resynchronization processing is complete. Incidentally, the foregoing case is the same in that it can be recovered with the processing described in <10. Failure Measure Processing Flow>.
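The branching of S 21001 to S 21013 can be sketched as follows. This is a minimal illustration under stated assumptions: the Volume class, the role strings, the resync_started flag and the send_to_secondary callable are all hypothetical, and the sketch assumes that the write proceeds to normal write processing after S 21007 , which the text suggests by noting that writing is not prohibited.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Hypothetical model of a volume and its remote copy attribute."""
    role: str = "none"            # "none", "source" or "destination"
    pair_status: str = "Duplex"
    data: dict = field(default_factory=dict)
    resync_started: bool = False


def handle_write(vol, location, payload, send_to_secondary=None):
    """Sketch of the write handling. `send_to_secondary` is an assumed
    callable that returns True when the synchronous transfer succeeded."""
    if vol.role == "source":                            # S 21004 -> S 21005
        if not send_to_secondary(location, payload):    # S 21006
            vol.pair_status = "Failure Suspend"         # S 21007
    elif vol.role == "destination":                     # S 21004 -> S 21011
        vol.role = "source"                             # stop copy, invert direction
        vol.resync_started = True                       # S 21012, completion not awaited
    # Normal write processing and reply (S 21003, S 21008 or S 21013).
    vol.data[location] = payload
    return "ok"


src = Volume(role="source")
print(handle_write(src, 0, b"a", lambda loc, buf: False), src.pair_status)
```

The write to a copy destination volume returns as soon as the resynchronization has been started, matching the remark that its completion need not be awaited.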
<12.3. Read Processing of I/O Path Manager>
FIG. 20 is a flowchart showing the processing contents of read processing to be executed by the I/O path manager 5000 . The processing contents at the respective steps of S 20001 to S 20023 in FIG. 20 are the same as the processing contents at the respective steps of S 11001 to S 11023 in FIG. 11. FIG. 20 differs from FIG. 11 in the following point.
(Difference 1) The operation of remote copy at steps S 20012 and S 20013 is skipped.
Incidentally, although in FIG. 11 the direction of remote copy was inverted according to the read processing, the remote copy direction is not inverted in this processing. This is because a read request is issued to the secondary virtual storage apparatus 1000 not only when the primary virtual storage apparatus 1000 does not return a reply to the read request (including cases caused by a communication failure between the host and the virtual storage apparatuses), but also when the primary virtual storage apparatus 1000 is under excess load. Thus, if the secondary virtual storage apparatus 1000 were to perform the pair inversion of remote copy triggered by a read request to the copy destination volume, the pair would be inverted by a read request that just happened to be issued to the secondary virtual storage apparatus 1000 , inverted once again by the subsequent read request, and the read performance would deteriorate as a result.
Nevertheless, when the execution of S 20021 is inhibited, the virtual storage apparatus 1000 may perform pair inversion of remote copy by performing the following processing upon read processing.
(Step 1 ) The virtual storage apparatus 1000 receives a read request.
(Step 2 ) The virtual storage apparatus 1000 performs normal read processing.
(Step 3 ) The virtual storage apparatus 1000 determines whether the read-target volume is the copy destination volume of remote copy, and executes subsequent Step 4 if so, and ends this processing if not.
(Step 4 ) The virtual storage apparatus 1000 stops remote copy, and inverts the relationship of the copy source and the copy destination.
The second embodiment is now explained with reference to FIG. 12. The second embodiment differs from the first embodiment in that the storage apparatus 1500 L is coupled to a plurality of virtual storage apparatuses 1000 L, 1000 R, and these virtual storage apparatuses 1000 L, 1000 R share the volumes in the storage apparatus 1500 L to enable the continuation of service at a lower cost than the first embodiment even when one of the virtual storage apparatuses 1000 L, 1000 R shuts down.
Nevertheless, since the virtual storage apparatuses 1000 L, 1000 R include cache memories 1020 L, 1020 R, in preparation for a case when the primary virtual storage apparatus 1000 L shuts down due to a disaster immediately after write data is written into the virtualization volume, it is necessary to also store the write data into the cache memory 1020 R of the secondary virtual storage apparatus 1000 R, and the destaging and staging of both virtual storage apparatuses 1000 L, 1000 R must be devised accordingly.
A write request in a normal status is processed according to the following steps.
(Step 1 ) The primary virtual storage apparatus 1000 L that received a write request from the host 1100 determines whether the write request is addressed to the volume 3000 LA corresponding to the HDD 1030 inside the virtual storage apparatus 1000 L, addressed to the virtualization volume (hereinafter referred to as the “shared virtualization volume”) 3000 LB provided by both virtual storage apparatuses 1000 L, 1000 R by sharing the volume 3500 L of the storage apparatus 1500 L, or addressed to the normal virtualization volume. Incidentally, processing other than the shared virtualization volume 3000 LB is the same as the processing of the first embodiment.
(Step 2 ) The primary virtual storage apparatus 1000 L stores the write data in its internal cache memory 1020 L, stores the write data in the cache memory 1020 R of the secondary virtual storage apparatus 1000 R based on a remote copy program, and thereafter returns a normal reply to the host 1100 .
(Step 3 ) The caching algorithm of the primary virtual storage apparatus 1000 L decides the data in the cache memory 1020 L to be destaged, and destages the data to the volume of the storage apparatus 1500 L.
(Step 4 ) After destaging is complete, the primary virtual storage apparatus 1000 L notifies the secondary virtual storage apparatus 1000 R of the address of the data destaged from the cache memory 1020 L and commands it to discard that data. Incidentally, the secondary virtual storage apparatus 1000 R that received the command discards the target data from the cache memory 1020 R.
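Steps 2 to 4 for the shared virtualization volume can be sketched as follows. This is a minimal illustration, assuming a hypothetical VirtualStorage class in which the attribute dirty holds only the write data that has not yet been destaged and the shared volume is modeled as a plain dictionary; the caching algorithm that chooses what to destage is not modeled.

```python
class VirtualStorage:
    """Hypothetical model of one virtual storage apparatus and its cache."""

    def __init__(self, shared_volume):
        self.dirty = {}                     # cached write data not yet destaged
        self.shared_volume = shared_volume  # volume in the shared storage apparatus
        self.peer = None                    # the other virtual storage apparatus

    def write(self, address, data):
        # Step 2: cache locally, mirror into the peer's cache, then reply.
        self.dirty[address] = data
        self.peer.dirty[address] = data
        return "ok"

    def destage(self, address):
        # Step 3: write the cached data out to the shared volume.
        self.shared_volume[address] = self.dirty.pop(address)
        # Step 4: command the peer to discard its copy of the destaged data.
        self.peer.dirty.pop(address, None)


shared = {}
left, right = VirtualStorage(shared), VirtualStorage(shared)
left.peer, right.peer = right, left
left.write(10, b"d")
print(right.dirty)   # the secondary also holds the write data
left.destage(10)
print(shared, right.dirty)
```

Until the destage completes, the write data exists in both caches, so the secondary side can still supply it if the primary side shuts down.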
Incidentally, in this constitution, when switching of the I/O request is conducted to the secondary virtual storage apparatus 1000 R in a state where the network between the virtual storage apparatuses 1000 L, 1000 R is disconnected, there are cases where the virtual storage apparatuses 1000 L, 1000 R will both autonomously perform destaging as primary systems. In order to avoid this kind of situation, when both virtual storage apparatuses 1000 L, 1000 R are to perform processing as primary systems, they may first perform exclusion control using a function such as SCSI Reserve or the like on the volume 3500 L shared in the storage apparatus 1500 L. Further, as another method, caching of the virtual storage apparatus 1000 L may be invalidated regarding the shared virtualization volume 3000 LB, and, in such a case, when the access authority of the shared virtualization volume 3000 LB is changed to a read-only access authority, caching may be validated according to such change.
The third embodiment is now explained with reference to FIG. 13. In this embodiment, the information system described in the foregoing embodiments is separately prepared at a remote site (backup site) that is different from the production site to perform remote copy, and the service can be resumed at the backup site when the production site is subject to a disaster.
Incidentally, in the following explanation, there are cases where the foregoing “virtual storage apparatus” is referred to as a storage apparatus, the “copy source volume” as a primary volume, the “copy destination volume” as a secondary volume, the “primary system” as an active side, and the “secondary system” as a standby side. Further, the information systems of the production site and the backup site may be collectively referred to as a remote copy system.
<1. Constitution of Remote Copy System>
In this embodiment, each site is constituted of hosts 13010 , 13020 and a plurality of storage subsystems 13001 , 13002 , 13003 , 13004 . At the production site, the storage subsystems 13001 , 13002 jointly adopt the high availability constitution described above. Moreover, at the backup site also, the storage subsystems 13003 , 13004 jointly adopt the high availability constitution.
Further, in this embodiment, synchronous or asynchronous remote copy is performed from the active-side storage subsystem (with a copy source volume) 13001 of the production site to the active-side storage subsystem (with a copy destination volume) 13003 of the backup site. When the production site is subject to a disaster, the host 1310 of the backup site issues an I/O request to the active side of the storage subsystems 13003 , 13004 of a high availability constitution, and the re-booted application 2010 thereby resumes the processing.
Incidentally, as described above, the term storage subsystem covers both a constitution that does not use the virtualization function of the virtual storage apparatus 1000 (FIG. 1) and a constitution where the virtual storage apparatus 1000 provides a virtualization volume using the virtualization function based on a combination of the virtual storage apparatus 1000 and the storage apparatus 1500 (FIG. 1). Further, in this embodiment, each storage subsystem 13001 , 13002 , 13003 , 13004 may adopt separate internal constitutions (for instance, configuring only the storage subsystem 13001 with the virtual storage apparatus 1000 without using the virtualization function, or sharing the storage apparatus 1500 (FIG. 1) with the storage subsystems 13003 and 13004 of the backup site, but not sharing the same on the production site side).
Incidentally, although there are cases below where the processing subject of various processes is explained as the “storage subsystem,” in reality, it goes without saying that the processor of the storage subsystem executes the corresponding processing based on programs stored in the memory of the storage subsystem.
<2. Processing>
When the application 2010 of the host 13010 of the production site issues a write request, the OS determines the active-side storage subsystem in the production site, and transfers the write request thereto. Incidentally, the storage subsystem 13001 corresponds to this in FIG. 13.
The active-side storage subsystem 13001 of the production site transfers write data to the standby-side storage subsystem ( 13002 corresponds to this in FIG. 13) in the production site based on synchronous remote copy. Further, the active-side storage subsystem 13001 transfers write data to the active-side storage subsystem ( 13003 corresponds to this in FIG. 13) of the backup site as synchronous or asynchronous remote copy (since only the active side processes the write request in the high availability constitution in this embodiment, remote copy is also similarly processed on the active side). The active-side storage subsystem 13003 in the backup site that received the write data transfers the received write data to the standby-side storage subsystem 13004 in the site based on synchronous remote copy.
Thus, the storage subsystems 13001 , 13002 of the production site are keeping track of the active-side storage subsystem of the backup site, and the storage subsystems 13003 , 13004 of the backup site are also keeping track of the active-side storage subsystem (storage subsystem 13001 ) of the production site so that they will not accept remote copy from an unexpected storage subsystem.
As a result of the foregoing processing, high availability is realized in both the production site and the backup site. However, the backup site may be of a constitution that does not adopt the high availability constitution for reduction of costs.
<3. Asynchronous Remote Copy>
Unlike the synchronous remote copy described above, asynchronous remote copy does not transfer write data at the time a write request arrives from the host 13010 , but rather transfers such write data after the request completion reply (to put it differently, asynchronous remote copy transfers write data at a timing independent of the request reply to the host 13010 ). Thus, with asynchronous remote copy, it is possible to perform remote copy without deteriorating the response time of the write request even when the communication delay is significant because the distance between the sites is long. Nevertheless, with asynchronous remote copy, it is necessary to buffer write data in the storage subsystem 13001 on the side of the production site. The following methods for buffering write data may be considered.
(1) The storage subsystem 13001 of the production site creates a journal containing write data to the copy source volume and sequence information of such write data, stores this in its own cache memory or a dedicated volume, and transfers this journal to the storage subsystem 13003 of the backup site. The storage subsystem 13003 of the backup site then stores the write data in the copy destination volume by referring to the sequence information of the journal. Thereby, when the production site is subject to a disaster, it is possible to provide data with a protected write sequence (more specifically, write data with dependence) on the side of the backup site.
(2) The storage subsystem 13001 of the production site groups the data written into the copy source volume every given period and stores such group in its own cache memory or a dedicated volume, transfers this asynchronously to the storage subsystem 13003 of the backup site, and stores data in group units in the copy destination volume of the storage subsystem 13003 of the backup site.
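Buffering method (1) can be sketched as follows. This is a minimal illustration and not the embodiment's implementation: the classes JournalSource and JournalTarget are hypothetical, the sequence information is assumed to be a simple increasing integer, and the sketch assumes that journal entries may arrive out of order, which the text does not state.

```python
import heapq


class JournalSource:
    """Hypothetical copy source side: buffers writes as journal entries."""

    def __init__(self):
        self.seq = 0
        self.journal = []

    def write(self, address, data):
        self.seq += 1                                   # sequence information
        self.journal.append((self.seq, address, data))

    def drain(self):
        entries, self.journal = self.journal, []
        return entries


class JournalTarget:
    """Hypothetical copy destination side: applies entries in sequence order."""

    def __init__(self):
        self.volume = {}
        self.applied = 0
        self.pending = []   # min-heap keyed by sequence number

    def receive(self, entries):
        for entry in entries:
            heapq.heappush(self.pending, entry)
        # Apply only while the next sequence number is contiguous, so the
        # copy destination volume never reflects a later write without the
        # earlier writes it may depend on.
        while self.pending and self.pending[0][0] == self.applied + 1:
            seq, address, data = heapq.heappop(self.pending)
            self.volume[address] = data
            self.applied = seq


src, dst = JournalSource(), JournalTarget()
src.write(0, b"a")
src.write(0, b"b")
src.write(1, b"c")
e1, e2, e3 = src.drain()
dst.receive([e3])        # held back: entries 1 and 2 are still missing
dst.receive([e1, e2])    # now all three are applied in write order
print(dst.volume, dst.applied)
```

Holding back entries until the sequence is contiguous is one way to keep the write order protected at the backup site; method (2) would instead apply each group as a unit.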
Thus, unless the write data to be buffered for asynchronous remote copy is also retained in the standby-side storage subsystem 13002 , it will not be possible to succeed the asynchronous remote copy when the active-side storage subsystem 13001 shuts down. Thus, the active-side storage subsystem 13001 of the production site conveys, in addition to write data, information of the copy destination volume, foregoing sequence information or timing of performing the grouping process to the standby-side storage subsystem 13002 , and the standby-side storage subsystem 13002 creates buffering data for asynchronous