The present invention relates to a method and apparatus for distributing files within a network.
Traditionally, computer data is stored in the form of individual files on a computer's long-term (non-volatile) storage, most commonly a "hard disk". Hard disks suffer from the following issues:
Limited capacity
Prone to damage and failure because of mechanical (moving) parts, resulting in a short lifetime
Not shared; generally only one machine can access a disk at a time
To overcome such problems, disks can be combined into larger pools of storage with data protection, for example, in a RAID array. In terms of their interface to a host computer, disks or pools of disks can be either:
Internal—e.g. IDE, SATA, SCSI disks.
External—DAS (Directly Attached Storage), e.g. USB, SCSI, Fibre Channel. However, DAS is only capable of being connected to a very limited number (<10) of servers at a time.
External—NAS (Network Attached Storage), e.g. Ethernet, TCP/IP, IPX/SPX. NAS is essentially a more advanced file server, purpose-designed in hardware.
External—SAN (Storage Area Network), e.g. Fibre Channel (FC) network infrastructure. It is acknowledged that a SAN is capable of being connected to multiple machines; however, the infrastructure costs for doing so are prohibitive for desktops and, in spite of improvements such as iSCSI, a SAN is typically used only for servers.
A pool of storage usually needs to be accessible by more than just a single machine. Traditionally, the most common way of sharing storage is to use a "file server": a dedicated computer on the network that provides its storage pool (connected in any of the above four ways, internal or external) transparently to other computers via a "File Sharing Protocol" over a computer network (LAN/WAN/etc.), with the possibility of adding extra security (access control) and availability (backups) from a central location.
Some commonly used file sharing protocols are SMB/CIFS, NFS and AFP.
However, file servers suffer from some serious issues:
Single point of failure: when the server fails, all clients are unable to access data
The present invention provides a virtual storage pool formed from a combination of unused disk resources from participating nodes, presented as a single virtual disk device (of combined size) available to and shared with all nodes of a network. Under a host operating system, the virtual storage pool is visible as a normal disk drive (a disk letter on Windows and a mount point on Unix); however, all disk I/O is distributed to participating nodes over the network. In the present specification, this is referred to as a Network Distributed File System (NDFS).
If any of the peer workstations becomes unavailable (even for a short period of time), the virtual storage pool could become unavailable or inconsistent. In preferred embodiments of the invention, to achieve availability comparable to server or NAS storage, data is distributed in such a way that if up to a predefined number of participating nodes become unavailable, the virtual storage pool is still accessible to the remaining computers.
The invention is based on a Peer-to-Peer (P2P) network protocol which allows data to be stored and retrieved in a distributed manner. A data redundancy mechanism is used at the protocol level such that if any participating node is inaccessible or slow to respond to requests, it is automatically omitted from processing. The protocol is therefore resistant to network congestion and breaks.
For example, given a network of 25 workstations, each having a 120 GB disk of which 100 GB is unused, a virtual storage pool of 2.5 TB could be formed and made available to all nodes on the network.
The storage pool of the preferred embodiments, in contrast to the traditional file server approach, has the following characteristics:
In accordance with one aspect, the present invention provides a storage pool component operable on a computing device including a storage medium having an otherwise free storage capacity for forming a portion of a storage capacity of a storage pool and being operably connected across a network to at least one other storage pool component. Each storage pool component operates on a computing device providing a respective portion of the storage pool capacity. The storage pool component has: configuration data identifying the at least one other computing device to which the computing device may connect across the network; a directory for identifying file information for files of the storage pool stored on the storage medium, the file information being stored with a degree of redundancy across the computing devices of the storage pool; means, responsive to instantiation of the component, for communicating with at least one other component operating on one of the at least one other computing devices for verifying the contents of the directory; means for reconciling file information stored on the storage medium with file information from the remainder of the storage pool; and a driver. The driver is responsive to an access request for a file stored in the storage pool received across the network from another component of the storage pool, and determines a location of the file on the storage medium from the directory and accesses the file accordingly.
In accordance with another aspect, the present invention provides a system having a plurality of computing devices. The plurality of computing devices each has a storage medium. At least one of the computing devices includes a storage pool component. The storage pool component is operable on the computing device and the storage medium has an otherwise free storage capacity for forming a portion of a storage capacity of a storage pool and is operably connected across a network to at least one other storage pool component. Each storage pool component operates on a computing device providing a respective portion of the storage pool capacity. The system also includes one or more legacy clients accessing the storage pool through a legacy disk device driver.
A more complete understanding of the present invention, and the attendant advantages and features thereof, will be more readily understood by reference to the following detailed description when considered in conjunction with the accompanying drawings wherein:
FIG. 1 shows schematically a pair of virtual storage pools (VSP) distributed across a set of nodes according to an embodiment of the present invention;
FIG. 2 shows a client application accessing a virtual storage pool according to an embodiment of the invention;
FIG. 3 shows the main components within a Microsoft Windows client implementation of the invention;
FIG. 4 shows the main components within an alternative Microsoft Windows client implementation of the invention;
FIG. 5 shows a write operation being performed according to an embodiment of the invention; and
FIG. 6 shows a cluster of VSPs in a high availability group.
In traditional data storage, the term "Storage Pool" refers to a pool of physical disks or logical disks served by a SAN, or LUNs (Logical Units) in DAS. Such a storage pool can be used, either as a whole or partially, to create one or more higher-level "logical volumes" by means of RAID-0, 1, 5, etc. before being finally presented through the operating system.
According to the present invention, a storage pool is created through a network of clients communicating through a P2P network protocol. For the purposes of the present description, the network includes the following node types:
From the above it will be seen that a virtual storage pool can be created from Peer Nodes or Server Nodes and made accessible to Active Clients or Peer Nodes. A VSP can therefore function, for example, on:
Referring now to FIG. 1, a VSP (Virtual Storage Pool), VSP A or VSP B, according to the preferred embodiment is formed from Local Storage Entities (LSEs) served by either Server or Peer Nodes 1 . . . 5. In a simple implementation, an LSE can be just a hidden subdirectory on a disk of the node. However, alternative implementations referred to later could implement an LSE as an embedded transactional database. In general, LSE size is determined by the available free storage space on the various nodes contributing to the VSP. Preferably, LSE size is the same on every node, so the global LSE size within a VSP will be dependent on the smallest LSE in the VSP.
The size of the VSP is calculated from the VSP Geometry:
When RAID-3/5 is used (Geometry=N+1), the size of the VSP equals N+1 multiplied by the size of the LSE.
When RAID-6 is used (Geometry=N+2), the size of the VSP equals N+2 multiplied by the size of the LSE.
If N+M redundancy is used (Geometry=N+M), the size of the VSP equals N multiplied by the size of the LSE.
Because the LSE size is the same on every node, a situation may occur where one or a few nodes having a major storage size difference could be under-utilized in contributing to virtual network storage. For example, in a workgroup of 6 nodes, two nodes having 60 GB disks and four having 120 GB disks, the LSE on the two smaller nodes may be only 60 GB, and so a single VSP could only be 6*60 GB=360 GB in size, as opposed to 120+120+120+120+60+60=600 GB. In such a situation, multiple VSPs can be defined. So in the above example, two VSPs could be created, one of 6*60 GB and a second of 4*60 GB, and these will be visible as two separate network disks. In fact, multiple VSPs enable different redundancy levels and security characteristics to be applied to different VSPs, enabling greater flexibility for administrators.
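By way of non-limiting illustration, the sizing arithmetic of the above workgroup example can be sketched as follows; the figures follow the example's convention of quoting combined raw LSE capacity, before any redundancy overhead is taken into account.

```python
# Illustrative sketch of the workgroup sizing example above (raw combined
# capacity, ignoring redundancy overhead).
node_free_gb = [120, 120, 120, 120, 60, 60]   # free space usable as an LSE per node

# A single VSP uses the same LSE size on every node, so it is limited by
# the smallest contribution: 6 * 60 GB = 360 GB.
single_vsp_size = min(node_free_gb) * len(node_free_gb)

# Alternatively, two VSPs: VSP A across all six nodes (60 GB LSEs) and
# VSP B across the four larger nodes only (their remaining 60 GB each).
vsp_a = 60 * 6    # 360 GB
vsp_b = 60 * 4    # 240 GB

print(single_vsp_size, vsp_a + vsp_b)   # 360 GB versus 600 GB in total
```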
Using the invention, a VSP is visible to an Active Client, Peer Node or indeed Legacy Client as a normal disk formed from the combination of LSEs with one of the geometries outlined above. When a client stores or retrieves data from a VSP, it attempts to connect to every Server or Peer Node of the VSP and to perform an LSE I/O operation with an offset based on the VSP Geometry.
Before describing an implementation of the invention in detail, we define the following terms:
VSP Cluster Size (VCS)—data (the contents of files before redundancy is calculated) is divided into so-called clusters, similar to the data clusters of traditional disk-based file systems (FAT, NTFS). Cluster size is determined by the VSP Geometry and NBS (Network Block Size) in the following way:
VCS = Number of Data Nodes * NBS
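As a minimal, non-limiting sketch of this formula (the 32 KiB NBS value below is merely an assumed example; the text does not fix a block size):

```python
# VCS = Number of Data Nodes * NBS (Network Block Size).
NBS = 32 * 1024   # assumed example value only

def vsp_cluster_size(data_nodes: int, nbs: int = NBS) -> int:
    """Cluster size for a VSP geometry with `data_nodes` data nodes."""
    return data_nodes * nbs

print(vsp_cluster_size(4))   # 131072 bytes (128 KiB) for a geometry with 4 data nodes
```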
Reference is now made to the implementation of FIG. 3, where only Peer Nodes and a single VSP per network are considered. In this implementation, a simple user-mode application (u_ndfs.exe) is used for startup, maintenance, recovery, cleanup, VSP forming, LSE operations and the P2P protocol; however, it will be seen that this functionality could equally be implemented in separate applications.
Upon startup, u_ndfs.exe reads config.xml, a configuration file which defines the LSE location and the VSP properties, i.e. geometry, disk name and the IP addresses of peer nodes. (The configuration file is defined through user interaction with a configuration GUI portion (CONFIG GUI) of u_ndfs.) u_ndfs.exe then spawns a networking P2P protocol thread, NDFS Service. The network protocol used by the thread binds to a local interface on a UDP port and starts network communications with the other nodes contributing to the VSP.
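A sketch of reading such a configuration file is given below. The element and attribute names are invented purely for illustration; the text specifies only that config.xml defines the LSE location and the VSP properties (geometry, disk name and peer node IP addresses).

```python
# Sketch of reading a config.xml of the kind described above. The element
# and attribute names are invented for illustration.
import xml.etree.ElementTree as ET

EXAMPLE_CONFIG = """
<ndfs>
  <lse path="C:\\NDFS\\LSE"/>
  <vsp name="VSP_A" geometry="4+1" disk="N:">
    <peer ip="192.168.1.11"/>
    <peer ip="192.168.1.12"/>
    <peer ip="192.168.1.13"/>
    <peer ip="192.168.1.14"/>
    <peer ip="192.168.1.15"/>
  </vsp>
</ndfs>
"""

root = ET.fromstring(EXAMPLE_CONFIG)
lse_path = root.find("lse").get("path")      # where the hidden LSE directory lives
vsp = root.find("vsp")
geometry = vsp.get("geometry")               # e.g. N+M
peer_ips = [peer.get("ip") for peer in vsp.findall("peer")]
print(lse_path, vsp.get("disk"), geometry, peer_ips)
```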
If fewer than a quorum N of the N+M nodes are detected by the node on start-up, the VSP is suspended for that node until a quorum is reached.
Where there is N+M redundancy and N<=M, it is possible for two separate quorums to exist on two detached networks. In such a case, if N<=50% of N+M but a quorum is reached at a node, the VSP is set to read-only mode at that node.
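The start-up decision just described can be expressed as a short, non-limiting sketch, assuming the node simply counts how many of the N+M configured nodes it can currently reach:

```python
# Sketch of the start-up quorum decision described above.
# n = number of data nodes, m = number of redundancy nodes,
# reachable = how many of the n + m configured nodes currently respond.

def vsp_mode_on_startup(n: int, m: int, reachable: int) -> str:
    if reachable < n:
        return "suspended"        # below quorum: wait until N nodes are seen
    if n <= (n + m) / 2:
        # N <= 50% of N+M: two detached partitions could each reach quorum,
        # so allow only read-only access to avoid divergent writes.
        return "read-only"
    return "read-write"

print(vsp_mode_on_startup(4, 1, 3))  # suspended
print(vsp_mode_on_startup(2, 2, 2))  # read-only
print(vsp_mode_on_startup(4, 1, 5))  # read-write
```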
Once a quorum is present, local LSE to VSP directory comparison is performed by recovering directory metadata from another node.
If the VSP contains any newer files/directories than the local LSE (for instance if the node has been off the network and files/directories have been changed), a recovery procedure is performed by retrieving redundant network parts from one or more other nodes and rebuilding LSE data for the given file/directory. In a simple implementation, for recovery, the node closest to the requesting node based on network latency is used as the source for metadata recovery.
So for example, in an N+M redundancy VSP implementation, a file is split into N+M clusters, each cluster containing a data component and a redundant component. Where one or more of the N+M nodes of the VSP was unavailable when the file was written or updated, then during recovery the previously unavailable node must obtain at least N of the clusters in order to rebuild the cluster which should be stored for the file on the recovering node, so as to maintain the overall level of redundancy for all files of the VSP.
It will also be seen that, after start-up and recovery, the networking protocol should remain aware of network failures and needs to perform an LSE rescan and recovery every time the node is reconnected to the network. The user should be alerted as to what access to the VSP to expect when this happens.
A transaction log can be employed to speed up the recovery process instead of using a directory scan, and if the number of changes to the VSP exceeds the log size, a full recovery could be performed.
It can also be useful during recovery to perform a full disk scan in the manner of fsck ("file system check" or "file system consistency check" on UNIX) or chkdsk (on Windows) to ensure files have not been corrupted.
When the LSE data is consistent with the VSP, the networking thread begins server operations and u_ndfs.exe loads a VSP disk device kernel driver (ndfs.sys). The disk device driver (NDFS Driver) then listens to requests from the local operating system and applications, while u_ndfs.exe listens to requests from other nodes through the networking thread.
Referring to FIG. 2, in operation, an application (for instance Microsoft Word) running on the host operating system calls the I/O subsystem in the OS kernel and requests a portion of data with an offset (0 to file length) and size. (If the size is bigger than HBS, the kernel will fragment the request into smaller subsequent requests.) The I/O subsystem then sends an IRP (I/O request packet) message to the responsible device driver module, the NDFS driver. In the case of a request to the VSP, the kernel device driver receives the request and passes it on to the P2P network protocol thread, NDFS Service, for further processing based on the VSP geometry.
At the same time, when the server side of the networking thread receives a request from a client node through the network, an LSE I/O operation is performed on the local storage.
Both client and server I/Os can be thought of as normal I/O operations, with the exception that they are intercepted and passed through the NDFS driver and NDFS service in the manner of a proxy. N+M redundancy can thus be implemented with the P2P network protocol transparent to both clients and servers.
Referring now to FIG. 4, in a further refined implementation of the invention, a separate kernel driver, NDFS Net Driver, is implemented for high-speed network communications instead of using Winsock. This driver implements its own layer-3 protocol and only reverts to IP/UDP in case of communication problems.
Also, instead of using the Windows file system for the LSE, a database, NDFS DB, can be used. Such a database implemented LSE can also prevent users from manipulating the raw data stored in a hidden directory as in the implementation of FIG. 3.
For the implementation of FIG. 3, a P2P network protocol is used to provide communications between VSP peer nodes on the network. Preferably, every protocol packet comprises:
Protocol ID
Protocol Version
Geometry
Function ID
Function Data
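Purely as an illustration of this packet layout, the fields above could be packed as follows; the field widths and byte order are assumptions, since the text names the fields but does not fix their encoding.

```python
# Assumed encoding of the packet fields listed above (widths and byte order
# are illustrative only): protocol ID, protocol version, geometry, function ID,
# followed by the variable-length function data.
import struct

HEADER_FMT = "!HBBH"   # network byte order; 16-bit IDs, 8-bit version and geometry

def pack_packet(proto_id: int, version: int, geometry: int,
                function_id: int, function_data: bytes = b"") -> bytes:
    return struct.pack(HEADER_FMT, proto_id, version, geometry, function_id) + function_data

def unpack_packet(packet: bytes):
    proto_id, version, geometry, function_id = struct.unpack_from(HEADER_FMT, packet)
    function_data = packet[struct.calcsize(HEADER_FMT):]
    return proto_id, version, geometry, function_id, function_data
```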
For the implementations of FIGS. 3 and 4, the following functions are defined:
NDFS_FN_READ_FILE_REQUEST | 0x0101
NDFS_FN_READ_FILE_REPLY | 0x0201
NDFS_FN_WRITE_FILE | 0x0202
NDFS_FN_CREATE_FILE | 0x0102
NDFS_FN_DELETE_FILE | 0x0103
NDFS_FN_RENAME_FILE | 0x0104
NDFS_FN_SET_FILE_SIZE | 0x0105
NDFS_FN_SET_FILE_ATTR | 0x0106
NDFS_FN_QUERY_DIR_REQUEST | 0x0207
NDFS_FN_QUERY_DIR_REPLY | 0x0203
NDFS_FN_PING_REQUEST | 0x0108
NDFS_FN_PING_REPLY | 0x0204
NDFS_FN_WRITE_MIRRORED | 0x0109
NDFS_FN_READ_MIRRORED_REQUEST | 0x0205
NDFS_FN_READ_MIRRORED_REPLY | 0x0206
As can be seen above, every function has a unique ID, and the highest-order byte defines whether the given function is BROADCAST (1) or UNICAST (2) based.
The functions can be categorized as carrying data or metadata (directory operations). Also defined are control functions such as PING, which do not directly influence the file system or data.
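The function table and the encoding rule above can be captured directly; the following sketch uses only the identifiers and values listed.

```python
# Function IDs exactly as listed in the table above. The high-order byte of
# each ID encodes the transport type: 0x01xx = BROADCAST, 0x02xx = UNICAST.
NDFS_FUNCTIONS = {
    "NDFS_FN_READ_FILE_REQUEST":     0x0101,
    "NDFS_FN_READ_FILE_REPLY":       0x0201,
    "NDFS_FN_WRITE_FILE":            0x0202,
    "NDFS_FN_CREATE_FILE":           0x0102,
    "NDFS_FN_DELETE_FILE":           0x0103,
    "NDFS_FN_RENAME_FILE":           0x0104,
    "NDFS_FN_SET_FILE_SIZE":         0x0105,
    "NDFS_FN_SET_FILE_ATTR":         0x0106,
    "NDFS_FN_QUERY_DIR_REQUEST":     0x0207,
    "NDFS_FN_QUERY_DIR_REPLY":       0x0203,
    "NDFS_FN_PING_REQUEST":          0x0108,
    "NDFS_FN_PING_REPLY":            0x0204,
    "NDFS_FN_WRITE_MIRRORED":        0x0109,
    "NDFS_FN_READ_MIRRORED_REQUEST": 0x0205,
    "NDFS_FN_READ_MIRRORED_REPLY":   0x0206,
}

def is_broadcast(function_id: int) -> bool:
    return (function_id >> 8) == 0x01

def is_unicast(function_id: int) -> bool:
    return (function_id >> 8) == 0x02
```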
Functions which carry data are as follows:
READ_REQUEST
READ_REPLY
WRITE
WRITE_MIRRORED
READ_MIRRORED_REQUEST
READ_MIRRORED_REPLY
whereas functions which carry metadata (directory operations) are as follows:
CREATE_FILE
DELETE_FILE
RENAME_FILE
SET_FILE_SIZE
SET_FILE_ATTR
QUERY_DIR_REQUEST
QUERY_DIR_REPLY
In the present implementations, all metadata (directory information) is available on every participating node. All functions manipulating metadata are therefore BROADCAST based and do not require two-way communications: the modification is sent as a broadcast message to all other nodes, which update their metadata accordingly. Verification of such operations is performed only on the requesting node.
The rest of the metadata functions are used to read directory contents and are used in the recovery process. These functions are unicast based, because the implementations assume metadata to be consistent on all available nodes.
After fragmentation of a file into clusters, the last fragment usually has a size smaller than the full cluster size (unless the file size is an exact multiple of the cluster size). Such a fragment cannot easily be distributed using N+M redundancy and is instead stored using 1+M redundancy (replication) by means of the function WRITE_MIRRORED. The same applies to files that are smaller than the cluster size. (Alternative implementations may handle this differently, for example by padding or by reducing the block size to 1 byte.)
WRITE_MIRRORED is a BROADCAST function because an identical data portion is replicated to all nodes. It should be noted that for READ_MIRRORED operations, all data is available locally (because it is identical on every node) and no network I/O is required for such small portions of data (except for recovery purposes).
Note that the mirrored block size has to be smaller than the cluster size; however, it can be larger than the NBS. In such cases more than one WRITE_MIRRORED packet has to be sent, each with a different offset for the data being written.
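To make the fragmentation rule concrete, the following non-limiting sketch divides a file length into full clusters (distributed with N+M redundancy) and a mirrored tail (replicated with WRITE_MIRRORED); the NBS value is again only an assumed example.

```python
# Sketch of how a file of a given length is divided: full clusters are
# written with N+M distribution, and the remaining tail (or a file smaller
# than one cluster) is replicated with WRITE_MIRRORED.
import math

NBS = 32 * 1024   # assumed example Network Block Size

def plan_file_layout(file_size: int, data_nodes: int, nbs: int = NBS):
    cluster_size = data_nodes * nbs            # VCS = data nodes * NBS
    full_clusters = file_size // cluster_size  # written with N+M distribution
    tail = file_size % cluster_size            # replicated with WRITE_MIRRORED
    mirrored_packets = math.ceil(tail / nbs)   # a tail larger than NBS needs several packets
    return full_clusters, tail, mirrored_packets

# A 1,000,000 byte file on a geometry with 4 data nodes (128 KiB clusters):
print(plan_file_layout(1_000_000, 4))   # (7, 82496, 3)
```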
In implementing N+M redundancy, clusters are divided into individual packets. To read data from a file, the broadcast function READ_REQUEST is used. The function is sent to all nodes with the cluster offset to be retrieved. Every node replies with the unicast function READ_REPLY, carrying its own data for the cluster at NBS size.
The node performing the READ_REQUEST waits for the first READ_REPLY packets, up to the number of data nodes, which are sufficient to recover the data. Once enough packets have been received, any further reply packets are discarded. The data is then processed by an N+M redundancy function to recover the original file data.
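The read path just described can be sketched as follows. The broadcast transport, the reply stream and the N+M decoding function are placeholders, since the text does not define their exact interfaces.

```python
# Sketch of the read path described above: broadcast a READ_REQUEST, then
# collect unicast READ_REPLY packets until any N (number of data nodes)
# replies have arrived; later replies are discarded. `broadcast`, `replies`
# and `nm_decode` are placeholders for the transport and the N+M redundancy
# recovery function.
from typing import Callable, Iterable

def read_cluster(request_id: int,
                 data_nodes: int,
                 broadcast: Callable[[int], None],
                 replies: Iterable[tuple[int, int, bytes]],
                 nm_decode: Callable[[dict[int, bytes]], bytes]) -> bytes:
    broadcast(request_id)                       # READ_REQUEST to all nodes
    received: dict[int, bytes] = {}
    for reply_id, node_id, block in replies:    # stream of READ_REPLY packets
        if reply_id != request_id:
            continue                            # stale or unrelated reply: discard
        received[node_id] = block
        if len(received) >= data_nodes:         # the first N replies are enough
            break
    return nm_decode(received)                  # N+M redundancy recovery
```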
Functions of the REQUEST/REPLY type have a 64-bit unique identification number, generated from the computer's system clock, inserted when the REQUEST is sent. The packet ID is stored in a queue. When the required number of REPLY packets with the same ID has been received, the REQUEST ID is removed from the queue. Packets with IDs not matching those in the queue are discarded.
The packet ID is also used in functions other than REQUEST/REPLY to prevent execution of a function on the same node that sent it. When a node receives a REQUEST packet with an ID matching a REQUEST ID in its own REQUEST queue, the REQUEST is simply removed from the queue; otherwise the REQUEST function in the packet is executed.
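A minimal sketch of this ID handling is given below, assuming (as the text suggests) that the 64-bit identifier is derived from the system clock; the class and method names are invented for illustration.

```python
# Sketch of the packet ID handling described above. IDs are 64-bit values
# derived from the system clock; outstanding IDs are kept so that (a) only
# awaited REPLY packets are accepted and (b) a node's own broadcast, looped
# back to it, is recognised and not executed locally.
import time

class RequestTracker:
    def __init__(self) -> None:
        self.pending: set[int] = set()

    def new_request_id(self) -> int:
        request_id = time.time_ns() & 0xFFFFFFFFFFFFFFFF   # 64-bit clock-derived ID
        self.pending.add(request_id)
        return request_id

    def accept_reply(self, reply_id: int) -> bool:
        # A reply is only accepted while we are still waiting on its ID;
        # the ID is removed elsewhere once enough replies have arrived.
        return reply_id in self.pending

    def should_execute_broadcast(self, request_id: int) -> bool:
        # Our own broadcast comes back with an ID we queued: drop it.
        if request_id in self.pending:
            self.pending.discard(request_id)
            return False
        return True
```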
The broadcast function PING_REQUEST is sent when the networking thread is started on a given node. In response, the node receives a number of unicast PING_REPLY responses from the other nodes, and if these are fewer than required, the VSP is suspended until a quorum is reached.
Each further node starting up sends its own PING_REQUEST packets, and these can be used to indicate to a node that the required number of nodes has become available, so that VSP operations can be resumed in read-only or read/write mode.
The PING functions are used to establish the closest (lowest latency) machine to the requesting node and this is used when recovery is performed. As explained above, re-sync and recovery are initiated when a node starts up and connects to the network that has already reached quorum. This is done to synchronize any changes made to files when the node was off the network. When the recovery process is started, every file in every directory is marked with a special attribute. The attribute is removed after recovery is performed. During the recovery operation the disk is not visible to the local user. However, remote nodes can perform I/O operations on the locally stored files not marked with the recovery attribute. This ensures that data cannot be corrupted by desynchronization.
The recovering node reads the directory from the lowest-latency node using the QUERY_DIR_REQUEST/REPLY functions. The directory is compared to the locally stored metadata for the VSP. When comparing individual files, the following properties are taken into consideration:
Note that last-modification-time recovery would not make sense if local time were used on every machine. Instead, every WRITE and WRITE_MIRRORED request carries a timestamp, generated by the requesting node, in the packet payload, and this timestamp is assigned to the metadata for the file/directory on every node.
The per-file data recovery process is performed by first retrieving the file size from the metadata (which, prior to data recovery, must itself have been "metadata recovered"). The file size is then divided into cluster sizes and standard READ_REQUESTs are performed to retrieve the data. The exception is the last cluster, which is retrieved from the metadata source node (lowest latency) using READ_MIRRORED_REQUEST. The last part of the recovery process comprises setting the proper metadata parameters (size, attributes, last modification time) on the file.
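The per-file recovery just described can be sketched as follows; read_cluster, read_mirrored_from and apply_metadata are placeholder callables standing in for the READ_REQUEST path, the READ_MIRRORED_REQUEST to the lowest-latency node, and the final metadata update respectively.

```python
# Sketch of per-file data recovery as described above. The three callables
# are placeholders for the network operations and metadata update, whose
# exact signatures the text does not define.
def recover_file(file_size: int, data_nodes: int, nbs: int,
                 read_cluster, read_mirrored_from, apply_metadata,
                 metadata_source_node) -> bytes:
    cluster_size = data_nodes * nbs
    full_clusters = file_size // cluster_size
    data = bytearray()

    # Full clusters are rebuilt from any N of the N+M distributed parts.
    for index in range(full_clusters):
        data += read_cluster(offset=index * cluster_size)

    # The last, partial cluster is mirrored, so it is fetched whole from the
    # lowest-latency (metadata source) node.
    tail = file_size % cluster_size
    if tail:
        data += read_mirrored_from(metadata_source_node,
                                   offset=full_clusters * cluster_size,
                                   size=tail)

    # Finally the recovered metadata (size, attributes, modification time)
    # is applied to the local copy.
    apply_metadata()
    return bytes(data)
```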
File and attribute comparison is performed recursively for all files and folders on the disk storage. When recovery is finished all data is in sync and normal operations are resumed.
Alternative implementations of the invention can have dynamic recovery, as opposed to recovery on startup only. For example, the networking thread can detect that the node has lost communication with the other nodes and perform recovery each time communication is restored.
As mentioned above, a live transaction log file (journaling) can assist such recovery, and the node could periodically check the journal or its serial number to detect whether any changes have been made that the node is unaware of. The journal checking, as well as the metadata and last-cluster recovery, should also be performed in a more distributed manner than simply trusting the node with the lowest latency.
While the above implementations have been described as implemented in Windows platforms, it will be seen that the invention can equally be implemented with other operating systems, as despite operating system differences a similar architecture to that shown in FIGS. 3 and 4 can be used.
In more extensive implementations of the invention, different security models can be applied to a VSP:
While the implementations above have been described in terms of active clients, servers and peer nodes, it will be seen that the invention can easily be made available to legacy clients, for example using a Windows Share. It may be particularly desirable to allow only this form of access to clients which are unlikely to be highly available, for example a laptop, as becoming a peer in a VSP could place an undue recovery burden not only on the laptop but on the other nodes participating in the VSP, as the laptop connects to and disconnects from the network.
Further variations of the above-described implementations are also possible. For example, rather than using an IP or MAC address to identify nodes participating in a VSP, a dedicated NODE_ID could be used. Administration functions could also be expanded to enable one node to be replaced with another node in the VSP, individual nodes to be added to or removed from the VSP, or the VSP geometry to be changed.
Additionally, the VSP could be implemented in a way that presents a continuous random-access device formatted with a native file system such as FAT, NTFS or, on Unix, EXT/UFS. The VSP could also be used as a virtual magnetic tape device for storing backups made with traditional backup software.
Native file system usage presents a potential problem in that multiple nodes updating the same volume could corrupt the VSP file system metadata owing to multi-node locking issues. To mitigate this, either a clustered file system could be used, or each node could access only a separate virtualized partition at a time.
For example, in a High Availability cluster such as Microsoft Cluster Server, Sun Cluster or HP Serviceguard, an HA Resource Group traditionally comprises a LUN, disk volume or partition residing on shared storage (a disk array or SAN) that is used only by this Resource Group and moves between nodes together with the other resources. Referring now to FIG. 6, such a LUN or partition could be replaced with an NDFS VSP formed out of the cluster nodes and their internal disks, so removing the HA cluster software's dependency on shared physical storage.
It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described herein above. In addition, unless mention was made above to the contrary, it should be noted that all of the accompanying drawings are not to scale. A variety of modifications and variations are possible in light of the above teachings without departing from the scope and spirit of the invention, which is limited only by the following claims.