[0001] This Application is a Continuation-in-Part of U.S. application Ser. No. 10/093,340 filed on Mar. 6, 2002 and claims benefit of U.S. application Ser. No. 10/131,118 filed on Apr. 23, 2002, and U.S. Provisional Patent Application Serial No. 60/386,924, filed on Jun. 6, 2002.
[0002] 1. Technical Field
[0003] The invention relates to telecommunications. More particularly, the invention relates to a method and apparatus for processing data in connection with communication protocols that are used to send and receive data.
[0004] 2. Description of the Prior Art
[0005] Computer networks necessitate the provision of various communication protocols to transmit and receive data. Typically, a computer network comprises a system of devices such as computers, printers and other computer peripherals, communicatively connected together. Data are transferred between each of these devices through data packets which are communicated through the network using a communication protocol standard. Many different protocol standards are in current use today. Examples of popular protocols are Internet Protocol (IP), Internetwork Packet Exchange (IPX), Sequenced Packet Exchange (SPX), Transmission Control Protocol (TCP), and Point to Point Protocol (PPP). Each network device contains a combination of hardware and software that translates protocols and processes data.
[0006] An example is a computer attached to a Local Area Network (LAN) system, wherein a network device uses hardware to handle the Link Layer protocol, and software to handle the Network, Transport, and Communication Protocols and information data handling. The network device normally implements the one Link Layer protocol in hardware, limiting the attached computer to only that particular LAN protocol. The higher protocols, e.g. Network, Transport, and Communication protocols, along with the Data handlers, are implemented as software programs which process the data once they are passed through the network device hardware into system memory. The advantage to this implementation is that it allows a general purpose device such as the computer to be used in many different network setups and support any arbitrary network application that may be needed. The result of this implementation, however, is that the system requires a high processor overhead, a large amount of system memory, and a complicated configuration setup on the part of the computer user to coordinate the different software protocol and data handlers communicating with the computer's Operating System (O.S.) and the computer and network hardware.
[0007] This high overhead required in processing time is demonstrated in U.S. Pat. No. 5,485,460 issued to Schrier et al on Jan. 16, 1996, which teaches a method of operating multiple software protocol stacks implementing the same protocol on a device. This type of implementation is used in Disk Operating System (DOS) based machines running Microsoft Windows. During normal operation, once the hardware verifies the transport or link layer protocol, the resulting data packet is sent to a software layer which determines the packet's frame format and strips any specific frame headers. The packet is then sent to different protocol stacks where it is evaluated for the specific protocol. However, the packet may be sent to several protocol stacks before it is accepted or rejected. The time lag created by software protocol stacks prevents audio and video transmissions from being processed in real-time; the data must be buffered before playback. It is evident that the amount of processing overhead required to process a protocol is very high and extremely cumbersome and lends itself only to applications with a powerful Central Processing Unit (CPU) and a large amount of memory.
[0008] Consumer products that do not fit in the traditional models of a network device are entering the market. A few examples of these products are pagers, cellular phones, game machines, smart telephones, and televisions. Most of these products have small footprints, eight-bit controllers, limited memory or require a very limited form factor. Consumer products such as these are simplistic and require low cost and low power consumption. The previously mentioned protocol implementations require too much hardware and processor power to meet these requirements. Such implementations are too complex to incorporate into consumer products in a cost-effective way. If network access can be simplified such that it may be easily manufactured on a low-cost, low-power, and small form-factor device, these products can access network services, such as the Internet.
[0009] Communications networks use protocols to transmit and receive data. Typically, a communications network comprises a collection of network devices, also called nodes, such as computers, printers, storage devices, and other computer peripherals, communicatively connected together. Data is transferred between each of these network devices using data packets that are transmitted through the communications network using a protocol. Many different protocols are in current use today. Examples of popular protocols include the Internet Protocol (IP), Internetwork Packet Exchange (IPX) protocol, Sequenced Packet Exchange (SPX) protocol, Transmission Control Protocol (TCP), Point-to-Point Protocol (PPP) and other similar new protocols that are under development. A network device contains a combination of hardware and software that processes protocols and data packets.
[0010] In 1978, the International Organization for Standardization (ISO), a standards setting body, created a network reference model known as the Open System Interconnection (OSI) model. The OSI model includes seven conceptual layers: 1) The Physical (PHY) layer that defines the physical components connecting the network device to the network; 2) The Data Link layer that controls the movement of data in discrete forms known as frames that contain data packets; 3) The Network layer that builds data packets following a specific protocol; 4) The Transport layer that ensures reliable delivery of data packets; 5) The Session layer that allows for two way communications between network devices; 6) The Presentation layer that controls the manner of representing the data and ensures that the data is in correct form; and 7) The Application layer that provides file sharing, message handling, printing and so on. Sometimes the Session and Presentation layers are omitted from this model. For an explanation of how modern communications networks and the Internet relate to the ISO seven-layer model see, for example, chapter 11 of the text “Internetworking with TCP/IP” by Douglas E. Comer (volume 1, fourth edition, ISBN 0201633469) and Chapter 1 of the text “TCP/IP Illustrated” by W. Richard Stevens (volume 1, ISBN 0130183806).
[0011] An example of a network device is a computer attached to a Local Area Network (LAN), wherein the network device uses hardware in a host computer to handle the Physical and Data Link layers, and uses software running on the host computer to handle the Network, Transport, Session, Presentation and Application layers. The Network, Transport, Session, and Presentation layers are implemented using protocol-processing software, also called protocol stacks. The Application layer is implemented using application software that processes the data once the data is passed through the network-device hardware and protocol-processing software. The advantage to this software-based protocol processing implementation is that it allows a general-purpose computer to be used in many different types of communications networks and supports any applications that may be needed. The result of this software-based protocol processing implementation, however, is that the overhead of the protocol-processing software, running on the Central Processing Unit (CPU) of the host computer, to process the Network, Transport, Session and Presentation layers is very high. A software-based protocol processing implementation also requires a large amount of memory on the host computer, because data must be copied and moved as the software processes it. The high overhead required by protocol-processing software is demonstrated in U.S. Pat. No. 5,485,460 issued to Schrier et al. on Jan. 16, 1996, which teaches a method of operating multiple software protocol stacks. This type of software-based protocol processing implementation is used, for example, in computers running Microsoft Windows.
[0012] During normal operation of a network device, the network-device hardware extracts the data packets that are then sent to the protocol-processing software in the host computer. The protocol-processing software runs on the host computer, and this host computer is not optimized for the tasks to be performed by the protocol-processing software. The combination of protocol-processing software and a general-purpose host computer is not optimized for protocol processing and this leads to performance limitations. Performance limitations in protocol processing, such as the time lag created by the execution of protocol-processing software, are deleterious and may prevent, for example, audio and video transmissions from being processed in real-time or prevent the full speed and capacity of the communications network from being used. It is evident that the amount of host-computer CPU overhead required to process a protocol is very high and extremely cumbersome and requires the use of the CPU and a large amount of memory in the host computer.
[0013] New consumer and industrial products that do not fit in the traditional models of a network device are entering the market and, at the same time, network speed continues to increase. Examples of these consumer products include Internet-enabled cell phones, Internet-enabled TVs, and Internet appliances. Examples of industrial products include network interface cards (NICs), Internet routers, Internet switches, and Internet storage servers. Software-based protocol processing implementations are too inefficient to meet the requirements of these new consumer and industrial products. Software-based protocol processing implementations are difficult to incorporate into consumer products in a cost effective way because of their complexity. Software-based protocol processing implementations are difficult to implement in high-speed industrial products because of the processing power required. If protocol processing can be simplified and optimized such that it may be easily manufactured on a low-cost, low-power, high-performance, integrated, and small form-factor device, these consumer and industrial products can read and write data on any communications network, such as the Internet.
[0014] A hardware-based, as opposed to software-based, protocol processing implementation, an Internet tuner, is described in J. Minami; R. Koyama; M. Johnson; M. Shinohara; T. Poff; D. Burkes;
[0015] It would be advantageous to provide a gigabit Ethernet adapter that provides a hardware solution to high network communication speeds. It would further be advantageous to provide a gigabit Ethernet adapter that adapts to multiple communication protocols.
[0016] The invention is embodied in a gigabit Ethernet adapter. A system according to the invention provides a compact hardware solution to handling high network communication speeds. In addition, the invention adapts to multiple communication protocols via a modular construction and design. A presently preferred embodiment of the invention provides an integrated network adapter for decoding and encoding network protocols and processing data. The network adapter comprises a hardwired data path for processing streaming data; a hardwired data path for receiving and transmitting packets and for encoding and decoding packets; a plurality of parallel, hardwired protocol state machines; wherein each protocol state machine is optimized for a specific network protocol; and wherein said protocol state machines execute in parallel; and means for scheduling shared resources based on traffic.
[0085] The invention is embodied in a gigabit Ethernet adapter. A system according to the invention provides a compact hardware solution to handling high network communication speeds. In addition, the invention adapts to multiple communication protocols via a modular construction and design.
[0086] General Description
[0087] The invention comprises an architecture to be used in a high-speed hardware network stack (hereafter referred to as the IT10G). The description herein defines the data paths and flows, registers, theory of applications, and timings. Combined with other system blocks, the IT10G provides the core for line speed TCP/IP processing.
[0088] Definitions
[0089] As used herein, the following terms shall have the corresponding meaning:
10 Gbps 10 Gigabit (10,000,000,000 bits per second)
[0090] ACK Acknowledgment
[0091] AH Authentication Header
[0092] AHS Additional Header Segment
[0093] ARP Address Resolution Protocol
[0094] BHS Basic Header Segment
[0095] CB Control Block
[0096] CPU Central Processing Unit
[0097] CRC Cyclic Redundancy Check
[0098] DAV Data Available
[0099] DDR Double Data Rate
[0100] DIX Digital Intel Xerox
[0101] DMA Direct Memory Access
[0102] DOS Denial of Service
[0103] DRAM Dynamic RAM
[0104] EEPROM Electrically Erasable PROM
[0105] ESP Encapsulating Security Payload
[0106] FCIP Fiber Channel over IP
[0107] FIFO First-In First-Out
[0108] FIM Fixed Interval Marker
[0109] FIN Finish
[0110] Gb Gigabit (1,000,000,000 bits per second)
[0111] HDMA Host DMA
[0112] HO Half Open
[0113] HR Host Retransmit
[0114] HSU Header Storage Unit
[0115] IB Instruction Block
[0116] ICMP Internet Control Message Protocol
[0117] ID Identification
[0118] IGMP Internet Group Management Protocol
[0119] IP Internet Protocol
[0120] IPsec IP Security
[0121] IPX Internetwork Packet Exchange
[0122] IQ Instruction Block Queue
[0123] iSCSI Internet Small Computer System Interface
[0124] ISN Initial Sequence Number
[0125] LAN Local Area Network
[0126] LDMA Local DMA
[0127] LIP Local IP Address
[0128] LL Linked List
[0129] LP Local Port
[0130] LSB Least-Significant Byte
[0131] LUT Look-Up Table
[0132] MAC Media Access Controller
[0133] MCB CB Memory
[0134] MDL Memory Descriptor List
[0135] MIB Management Information Base
[0136] MII Media Independent Interface
[0137] MPLS Multiprotocol Label Switching
[0138] MRX Receive Memory
[0139] MSB Most-Significant Bit
[0140] MSS Maximum Segment Size
[0141] MTU Maximum Transmission Unit
[0142] MTX TX Memory
[0143] NAT Network Address Translation
[0144] NIC Network Interface Card
[0145] NS Network Stack
[0146] OR OR Logic Function
[0147] PDU Protocol Data Unit
[0148] PIP Peer IP Address
[0149] PP Peer Port
[0150] PROM Programmable ROM
[0151] PSH Push
[0152] PV Pointer Valid
[0153] QoS Quality of Service
[0154] RAM Random Access Memory
[0155] RARP Reverse Address Resolution Protocol
[0156] Rcv Receive
[0157] RDMA Remote DMA
[0158] ROM Read-Only Memory
[0159] RST Reset
[0160] RT Round Trip
[0161] RTO Retransmission Timeout
[0162] RTT Round-Trip Time
[0163] RX Receive
[0164] SA Security Association
[0165] SB Status Blocks
[0166] SEQ Sequence
[0167] SM Status Message
[0168] SNMP Simple Network Management Protocol
[0169] SPI Security Parameter Index
[0170] Stagen Status Generator
[0171] SYN Synchronization
[0172] TCP Transmission Control Protocol
[0173] TOE Transport Offload Engine
[0174] TOS Type of Service
[0175] TTL Time to Live
[0176] TW Time Wait
[0177] TX Transmit
[0178] UDP User Datagram Protocol
[0179] URG Urgent
[0180] VLAN Virtual LAN
[0181] VSOCK Virtual Socket
[0182] WS Window Scaling
[0183] XMTCTL Transmit Control
[0184] XOR Exclusive-OR
[0185] Overview
[0186] As bandwidth continues to increase, the ability to process TCP/IP communications becomes more of an overhead for system processors. Many sources state that as Ethernet rates reach the gigabit per second (Gbps) range, TCP/IP protocol processing will consume close to 100% of the host computer's CPU bandwidth, and that when rates increase further to 10 Gbps, the entire TCP/IP protocol processing must be off-loaded to dedicated sub-systems. The herein described IT10G implements TCP and IP, along with related protocols including, for example, ARP, RARP, and IP host routing, as a series of state machines. The IT10G core forms an accelerator or engine, also known as a Transport Offload Engine (TOE). The IT10G core uses no processor or software, although hooks are provided so that a connected on-chip processor can be used to extend the features of the network stack.
[0187] Sample Applications
[0188] An example usage of the IT10G core is an Intelligent Network Interface Card (NIC). In a typical application, the NIC is plugged into a computer server and natively processes TCP/UDP/IP packets.
[0191] The Challenge
[0192] The challenge for high-speed bandwidths is in processing TCP/IP packets at wire line speeds. This is shown in the following table.
TABLE 1 Processing Power Requirements
Rate        Bytes/sec        Packets/sec    Instr/sec
10 Mbps     1,000,000        2,000          2 MIPs
100 Mbps    10,000,000       20,000         20 MIPs
1 Gbps      100,000,000      200,000        200 MIPs
10 Gbps     1,000,000,000    2,000,000      2 GIPs
[0193] The figures in the above table are very conservative, and do not take into account, for example, the full duplex nature of networking. If full-duplex operation is factored in, then the processing power requirements can easily double. In any case, it is apparent that starting at the gigabit level, the processing overhead of TCP/IP becomes a major drain on host computer processing power and that another solution is needed.
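The figures in Table 1 can be reproduced with a short calculation. The sketch below is illustrative only: the three constants are assumptions inferred from the ratios between the table's columns (roughly 10 line bits per byte, 500-byte packets, and 1,000 instructions per packet) and are not stated in the text.

```python
# Assumptions inferred from the column ratios in Table 1 (not stated in the text).
LINE_BITS_PER_BYTE = 10      # rate / bytes-per-second in every row
AVG_PACKET_BYTES = 500       # bytes-per-second / packets-per-second
INSTR_PER_PACKET = 1_000     # instructions-per-second / packets-per-second


def processing_requirements(rate_bps, full_duplex=False):
    """Return (bytes/sec, packets/sec, instructions/sec) for a line rate."""
    bytes_per_sec = rate_bps // LINE_BITS_PER_BYTE
    packets_per_sec = bytes_per_sec // AVG_PACKET_BYTES
    instr_per_sec = packets_per_sec * INSTR_PER_PACKET
    if full_duplex:
        # Traffic in both directions doubles the processing load.
        instr_per_sec *= 2
    return bytes_per_sec, packets_per_sec, instr_per_sec


if __name__ == "__main__":
    for rate in (10_000_000, 100_000_000, 1_000_000_000, 10_000_000_000):
        print(rate, processing_requirements(rate))
```

Passing `full_duplex=True` illustrates the doubling mentioned above for full-duplex operation.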
[0194] Bandwidth Limitation
[0195] The IT10G addresses the limitation of host computer processing power by various architecture implementations. These include the following features:
[0196] On the fly (streaming) processing of incoming and outgoing data
[0197] Ultra wide datapaths (64 bits in the current implementation)
[0198] Parallel execution of protocol state machines
[0199] Intelligent scheduling of shared resources
[0200] Minimized memory copying
[0201] Overview
[0202] This section describes the top level of the preferred embodiment. It provides a block level description of the system as well as a theory of operation for different data paths and transfer types.
[0203] This embodiment of the invention incorporates the IT10G network stack and combines it with a processor core, and system components to provide a complete networking sub-system for different applications. A block level diagram for the system is shown in
[0204] Clock Requirements
[0205] The presently preferred embodiment of the invention is a chip that is designed to operate with different clock domains. The following table lists all clock domains for both 1 Gbps and 10 Gbps operations.
TABLE 2 Clock Domains
Domain              Symbol    1 Gb (MHz)    10 Gb (MHz)    Notes
MAC                 CLK       125           125
System Clock        CLK       20            200            This clock serves the network stack and the on-chip processor core.
System Interface    CLK       66/133        133            PCI 64/66 or PCI-X 133 is used for 1 Gbps. PCI-Express is used for 10 Gbps.
[0206] Overview
[0207] This section provides an overview of the internal Protocol processor.
[0208] Processor Core
[0209] The herein described chip uses an internal (or on-chip) processor for programmability and flexibility. This processor is also furnished with all the peripherals needed to complete a working system. Under normal operating conditions, the on-chip processor controls the network stack.
[0210] Memory Architecture
[0211] The on-chip processor has the capability to address up to 4 GBytes of memory. Within this address space are located all of its peripherals, its RAM, ROM, and the network stack.
[0212] Network Stack Architecture
[0213] Overview
[0214] This section overviews the IT10G architecture. Subsequent sections herein go into detail on individual modules. The IT10G takes the hardware protocol processing function of a network stack, and adds enhancements that enable it to scale up to 10 Gbps rates. The major additions to previous versions are widening of the data paths, parallel execution of state machines, and intelligent scheduling of shared resources. In addition, support is added for protocols that were previously not supported, such as RARP, ICMP, and IGMP.
[0215] Theory of Operation
[0216] TCP/UDP Socket Initialization
[0217] Prior to transferring any data using the IT10G, a socket connection must be initialized. This can be done either by using command blocks or by programming the TCP socket registers directly. Properties that must be programmed for every socket include the Destination IP address, Destination Port number, and type of connection (TCP or UDP, Server or Client, for example). Optional parameters include such settings as a QoS level, Source Port, TTL, and TOS setting. Once these parameters have been entered, the socket may be activated. In the case of UDP sockets, data can start to be transmitted or received immediately. For TCP clients, a socket connection must first be established, and for TCP servers a SYN packet must be received from a client, and then a socket connection established. All these operations may be performed completely by the IT10G hardware.
[0218] Transmission of Packets
[0219] When TCP packets need to be transmitted, the application running on the host computer first writes the data to a socket (either a fixed socket or a virtual socket; virtual sockets are supported by the IT10G architecture). If the current send buffer is empty, then a partial running checksum is kept as the data is being written to memory. The partial checksum is used as the starting seed for checksum calculations, and alleviates the need for the TCP layers in the IT10G network stack to read through the data again prior to sending data out. Data can be written to the socket buffer in either 32-bit or 64-bit chunks. Up to four valid_byte signals are used to indicate which bytes are valid. Data should be packed when writing to the socket buffers, with only the last word having possible invalid bytes. This stage also applies to UDP packets for which there is an option of not calculating the data checksum.
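The running checksum described above can be sketched in software. The following is a minimal illustration, assuming the standard 16-bit ones' complement Internet checksum used by TCP and UDP; the class and method names are hypothetical and do not correspond to any IT10G register or signal.

```python
class RunningChecksum:
    """Accumulates a 16-bit ones' complement sum as data is written.

    Hypothetical helper for illustration; `seed` is a previously
    computed partial sum (for example, the payload sum).
    """

    def __init__(self, seed=0):
        self._sum = seed
        self._pending = None  # odd trailing byte carried to the next write

    def update(self, data: bytes) -> None:
        if self._pending is not None and data:
            data = bytes([self._pending]) + data
            self._pending = None
        if len(data) % 2:
            # Hold back the odd byte until its partner arrives.
            self._pending = data[-1]
            data = data[:-1]
        for i in range(0, len(data), 2):
            self._sum += (data[i] << 8) | data[i + 1]
        # Fold carries back into the low 16 bits (end-around carry).
        self._sum = (self._sum & 0xFFFF) + (self._sum >> 16)

    def partial(self) -> int:
        """Folded partial sum, usable as the seed for a later pass."""
        total = self._sum
        if self._pending is not None:
            total += self._pending << 8  # pad the odd byte with zero
        while total >> 16:
            total = (total & 0xFFFF) + (total >> 16)
        return total

    def finalize(self) -> int:
        """Ones' complement of the sum, as placed in the header."""
        return ~self.partial() & 0xFFFF


if __name__ == "__main__":
    payload = RunningChecksum()
    payload.update(bytes.fromhex("0001f203"))   # first write to the socket
    payload.update(bytes.fromhex("f4f5f6f7"))   # second write
    header = RunningChecksum(seed=payload.partial())
    header.update(b"\x12\x34")                  # header words added later
    print(hex(header.finalize()))
```

Because the sum is commutative, seeding the header pass with the payload's partial sum gives the same result as summing header and payload together, which is why the data does not need to be read a second time.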
[0220] Once all the data has been written, the SEND command can be issued by the application running on the host computer. At this point, the TCP/UDP engine calculates the packet length and checksums, and builds the TCP/IP header. This TCP/IP header is prepended to the socket data section. The buffer pointer for the packet, along with the socket's QoS level, is then put on the transmission queue.
[0221] The transmission scheduler looks at all sockets that have pending packets and selects the packet with the highest QoS level. This transmission scheduler looks at all types of packets that need transmission. These packets may include TCP, UDP, ICMP, ARP, RARP, and raw packets, for example. A minimum-bandwidth algorithm is used to make sure that no socket is completely starved. When a socket packet is selected for transmission, the socket buffer pointer is passed to the MAC TX Interface. The MAC TX Interface is responsible for reading the data from the socket buffer and sending the data to the MAC. A buffer is used to store the outgoing packet in case it needs to be retransmitted due to Ethernet collisions or for other reasons. Once the packet data is sent from the original socket buffer, then that data buffer is freed. When a valid transmit status is received back from the MAC, the data buffer is flushed, and the next packet can then be sent. If an invalid transmission status is received from the MAC, then the last packet stored in the data buffer is retransmitted.
[0222] Reception of Packets
[0223] When a packet is received from the MAC, the Ethernet header is parsed to determine if the packet is destined for this network stack. The MAC address filter may be programmed to accept a unicast address, unicast addresses that fall within a programmed mask, broadcast addresses, or multicast addresses. In addition, the encapsulating protocol is also determined. If the 16-bit TYPE field in the Ethernet header indicates an ARP (0x0806) or RARP (0x8035) packet, then the ARP/RARP module is enabled to further process the packet. If the TYPE field decodes to IPv4 (0x0800), then the IP module is enabled to process the packet further. A complete list of example supported TYPE fields is shown in the following table. If the TYPE field decodes to any other value, the packet may optionally be routed to a buffer and the host computer notified that an unknown Ethernet packet has been received. In this last case, the application may read the packet, and determine the proper course of action. With this construction of the datapath any protocol not directly supported in hardware, such as IPX for example, may be indirectly supported by the IT10G.
TABLE 3 Supported Ethernet TYPE Field Values
TYPE Field    Description
0x0800        IPv4 Packet
0x0806        ARP Packet
0x8035        RARP Packet
0x8100        VLAN Tagged Packets
0x8847        MPLS Unicast Packets
0x8848        MPLS Multicast Packets
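The TYPE-field dispatch described above can be sketched as a table lookup. This is an illustrative software model only: the TYPE values come from Table 3, while the module labels, function names, and the `route_unknown_to_host` flag are hypothetical stand-ins for the hardware modules and the optional host buffer.

```python
import struct

# TYPE values from Table 3; the module labels are hypothetical names.
TYPE_TO_MODULE = {
    0x0800: "ip",
    0x0806: "arp_rarp",
    0x8035: "arp_rarp",
    0x8100: "vlan",
    0x8847: "mpls_unicast",
    0x8848: "mpls_multicast",
}


def parse_ethernet_header(frame: bytes):
    """Split a DIX Ethernet header into (destination, source, TYPE)."""
    if len(frame) < 14:
        raise ValueError("frame shorter than a 14-byte Ethernet header")
    return struct.unpack_from("!6s6sH", frame, 0)


def select_module(frame: bytes, route_unknown_to_host: bool = True):
    """Return the label of the module that should process the frame."""
    _, _, eth_type = parse_ethernet_header(frame)
    module = TYPE_TO_MODULE.get(eth_type)
    if module is not None:
        return module
    # Unknown TYPE: optionally buffer the packet and notify the host.
    return "host_buffer" if route_unknown_to_host else None


if __name__ == "__main__":
    arp_frame = bytes(12) + b"\x08\x06" + bytes(28)
    print(select_module(arp_frame))
```

An IPX frame (TYPE 0x8137) has no entry in the table, so it falls through to the host buffer, which matches the indirect support for unsupported protocols described above.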
[0224] ARP/RARP Packets
[0225] If the received packet is an ARP or RARP packet, then the ARP/RARP module is enabled. It examines the OP field in the packet and determines if it is a request or a reply. If it is a request, then an outside entity is polling for information. If the address that is being polled is for the IT10G, then a reply_req is sent to the ARP/RARP reply module. If the packet received is an ARP or RARP reply, then the results, i.e. the MAC and IP addresses, are sent to the ARP/RARP request module.
[0226] In an alternative embodiment the ARP and/or RARP functions are handled in the host computer using dedicated and optimized hardware in the IT10G to route ARP/RARP packets to the host via the exception path.
[0227] IP Packets
[0228] If the received packet is an IP packet, then the IP module is enabled. The IP module first examines the version field in the IP header to determine if the received packet is an IPv4 packet.
[0229] The IP module parses the embedded protocol of the received packet. Depending on what protocol is decoded, the received packet is sent to the appropriate module. Protocols supported directly by hardware in the current embodiment include TCP and UDP, for example. Other protocols, such as RDMA, may be supported by other optimized processing modules. All unknown protocols are processed using the exception handler.
[0230] TCP Packets
[0231] If a TCP packet is received by the IT10G, then the socket information is parsed, and the corresponding socket enabled. The state information of the socket is retrieved, and based on the type of packet received, the socket state is updated accordingly. The data payload of the packet (if applicable) is stored in the socket data buffer. If an ACK packet needs to be generated, the TCP state module generates the ACK packet and schedules the ACK packet for transmission. If a TCP packet is received that does not correlate to an open socket, then the TCP state module generates a RST packet and the RST packet is scheduled for transmission.
[0232] UDP Packets
[0233] If a UDP packet is received, then the socket information is parsed, and the data stored in the socket receive data buffer. If no open socket exists, then the UDP packet is silently discarded.
[0234] In an alternative embodiment UDP packets may be handled by the host computer using the exception handler.
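The receive-side handling of TCP and UDP packets described above can be modeled roughly as follows. This is a simplified sketch under stated assumptions: sockets are keyed by protocol, local port, peer IP address, and peer port; an ACK is scheduled whenever a TCP segment carries payload; and the TCP state machine itself is omitted. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Socket:
    proto: str
    rx_buffer: bytearray = field(default_factory=bytearray)


class ReceiveDemux:
    """Toy model of socket lookup on packet reception (names are hypothetical)."""

    def __init__(self):
        self.sockets = {}    # (proto, local_port, peer_ip, peer_port) -> Socket
        self.tx_queue = []   # control packets scheduled for transmission

    def open(self, proto, local_port, peer_ip, peer_port):
        sock = Socket(proto)
        self.sockets[(proto, local_port, peer_ip, peer_port)] = sock
        return sock

    def receive(self, proto, local_port, peer_ip, peer_port, payload=b""):
        sock = self.sockets.get((proto, local_port, peer_ip, peer_port))
        if sock is None:
            if proto == "tcp":
                # No matching open socket: schedule a RST for transmission.
                self.tx_queue.append(("RST", peer_ip, peer_port))
                return "rst_scheduled"
            # UDP with no open socket is silently discarded.
            return "discarded"
        sock.rx_buffer += payload
        if proto == "tcp" and payload:
            # Assumption: acknowledge every segment that carries data.
            self.tx_queue.append(("ACK", peer_ip, peer_port))
        return "delivered"


if __name__ == "__main__":
    demux = ReceiveDemux()
    demux.open("udp", 5000, "10.0.0.2", 6000)
    print(demux.receive("udp", 5000, "10.0.0.2", 6000, b"hello"))
    print(demux.receive("tcp", 80, "10.0.0.3", 1234))
    print(demux.tx_queue)
```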
[0235] Network Stack Registers
[0236] The hardware network stack of the IT10G is configured to appear as a peripheral to the on-chip processor. The base address for the network stack is programmed via the on-chip processor's NS_Base_Add register. This architecture allows the on-chip processor to put the network stack at various locations in its memory or I/O space.
[0237] Ethernet MAC Interface
[0238] Overview
[0239] The following discussion describes the Ethernet MAC interface module. The function of the Ethernet MAC interface module is to abstract the Ethernet MAC from the core of the IT10G. This allows the IT10G network stack core to be coupled to different speed MACs and/or MACs from various sources without changing the IT10G core architecture, for example. This section describes the interface requirements for communication with the IT10G core.
[0240] Module I/Os
[0242] Ethernet Interface
[0243] Overview
[0244] This section describes the Ethernet Interface module. The Ethernet interface module communicates with the Ethernet MAC interface at the lower end, and to blocks such as the ARP, and IP modules on the upper end. The Ethernet interface module handles data for both the receive and transmit paths. On the transmit side, the Ethernet interface module is responsible for scheduling packets for transmission, setting up DMA channels for transmission, and communicating with the Ethernet MAC interface transmit signals. On the receive side, the Ethernet interface module is responsible for parsing the Ethernet header, determining if the packet should be received based upon address filter settings, enabling the next encapsulated protocol based upon the TYPE field in the packet header, and aligning the data so that it starts on a 64-bit boundary for the upper layer protocols.
[0245] Sub Module Block Descriptions
[0246] Transmission Scheduler
[0247] The Transmission Scheduler block
[0248] Check whether any packet channel has reached the starved state. This is a programmable level, per channel type, i.e. TCP, IP, ARP, and Raw buffers, that states how many times a channel is passed over before the scheduler overrides the QoS level and the packet is sent out. If two or more packets have reached the starved state at the same time, then the channel with the higher weighting is given priority. The other packet is then scheduled to be sent next. If the packets have the same priority weighting they are sent out one after the other according to the following order: TCP/UDP, then ARP, then IP, then Raw Ethernet.
[0249] If no channel has a packet in the starved state, then the channel with the highest combined QoS level and channel weighting is sent.
[0250] If only one channel has a packet to be sent, it is sent immediately.
[0251] Once a packet channel has been selected for transmission, the channel memory pointer, packet length, and type are passed to the DMA engine. The DMA engine in turn signals back to the transmission scheduler when the transfer has been completed. At this point the scheduler sends the next packet's parameters to the DMA engine.
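The selection rules above can be sketched as a small software model. This is not the hardware implementation: how the QoS level and channel weighting are combined is not specified in the text, so the sketch assumes a simple sum, and it also assumes that a channel's passed-over counter resets when that channel is served. All names are hypothetical.

```python
from dataclasses import dataclass

# Tie-break order given in the text for equal weightings.
TIE_ORDER = ["tcp_udp", "arp", "ip", "raw"]


@dataclass
class Channel:
    name: str               # one of TIE_ORDER
    weight: int             # channel weighting
    starve_limit: int       # programmable passed-over limit for this channel
    qos: int = 0            # QoS level of the pending packet
    pending: bool = False
    passed_over: int = 0


def select_channel(channels):
    """Pick the next channel to hand to the DMA engine, or None if idle."""
    ready = [c for c in channels if c.pending]
    if not ready:
        return None
    if len(ready) == 1:
        chosen = ready[0]  # a lone pending packet is sent immediately
    else:
        starved = [c for c in ready if c.passed_over >= c.starve_limit]
        if starved:
            # Starved channels override QoS; higher weighting wins,
            # then the fixed TCP/UDP, ARP, IP, Raw order.
            chosen = min(starved,
                         key=lambda c: (-c.weight, TIE_ORDER.index(c.name)))
        else:
            # Assumption: "combined" QoS and weighting means their sum.
            chosen = min(ready,
                         key=lambda c: (-(c.qos + c.weight),
                                        TIE_ORDER.index(c.name)))
    for c in ready:
        c.passed_over = 0 if c is chosen else c.passed_over + 1
    chosen.pending = False
    return chosen


if __name__ == "__main__":
    tcp = Channel("tcp_udp", weight=3, starve_limit=2, qos=5)
    raw = Channel("raw", weight=1, starve_limit=2)
    for _ in range(3):
        tcp.pending = raw.pending = True
        print(select_channel([tcp, raw]).name)
```

In the usage example the raw channel loses twice to the higher-priority TCP/UDP channel, reaches its starve limit, and is then served on the third round.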
[0252] DMA Engine
[0253] The DMA Engine
[0254] Transmitter Interface
[0255] The Transmitter Interface
[0256] Receiver Interface
[0257] The Receiver Interface
[0258] Address Filter and Packet Type Parser
[0259] The Address Filter and Packet Type Parser
[0260] Determine if the packet is for the local network stack
[0261] Parse the encapsulated packet type to determine where to send the rest of the packet.
[0262] Address Filtering
[0263] The network stack can be programmed with the following filter options:
[0264] Accept a programmed unicast address
[0265] Accept broadcast packets
[0266] Accept multicast packets
[0267] Accept addresses within a range specified by a netmask
[0268] Promiscuous mode (accepts all packets)
[0269] These parameters are all settable by the host computer via registers.
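The filter options listed above can be illustrated with a short model. The field names below are hypothetical and do not correspond to actual IT10G register names; the sketch assumes the netmask is applied byte-wise to the 6-byte destination MAC address, and it relies on the standard Ethernet convention that the least-significant bit of the first address byte marks a group (multicast) address.

```python
from dataclasses import dataclass

BROADCAST = bytes([0xFF] * 6)


@dataclass
class AddressFilter:
    """Software model of the filter settings (field names are hypothetical)."""
    unicast: bytes                    # programmed unicast MAC address
    mask: bytes = bytes([0xFF] * 6)   # all ones means exact match only
    accept_broadcast: bool = True
    accept_multicast: bool = False
    promiscuous: bool = False

    def accepts(self, dest: bytes) -> bool:
        if self.promiscuous:
            return True
        if dest == BROADCAST:
            return self.accept_broadcast
        if dest[0] & 0x01:
            # Group bit set: a multicast destination.
            return self.accept_multicast
        # Unicast: compare only the bits selected by the mask.
        return all((d & m) == (u & m)
                   for d, u, m in zip(dest, self.unicast, self.mask))


if __name__ == "__main__":
    flt = AddressFilter(unicast=bytes.fromhex("020000000001"))
    print(flt.accepts(bytes.fromhex("020000000001")))
    print(flt.accepts(bytes.fromhex("01005e000001")))
```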
[0270] Packet Types Supported
[0271] The following packet types are known by the IT10G hardware and are natively supported:
[0272] IPv4 packets with type=0x0800
[0273] ARP packets with type=0x0806
[0274] RARP packets with type=0x8035
[0275] The packet type parser also handles the case where an 802.3 length parameter is included in the TYPE field. This case is detected when the value is less than or equal to 1500 (decimal). When this condition is detected, the type parser sends the encapsulated packet to both the ARP and IP receive modules, and asserts an 802_frame signal so that each subsequent module knows that it must decode the packet with the understanding that the packet may not actually be meant for that module.
[0276] Note: IPv6 packets are treated as exception packets by the Ethernet layer.
[0277]
[0278] If the Address Filter and Packet Type Parser module parses a packet that it does not understand, and if the unsupported type feature is enabled, then the packet is routed to the Exception Handler for storage and further processing.
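The type-field decoding described above can be sketched as follows, using the standard EtherType values for IPv4, ARP, and RARP. The function name, the destination labels, and the choice to drop unknown types when the unsupported type feature is disabled are assumptions for illustration; the text only says what happens when the feature is enabled.

```python
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
ETHERTYPE_RARP = 0x8035
MAX_8023_LENGTH = 1500  # values at or below this are 802.3 lengths, not types

def route_by_type(type_field: int, unsupported_enabled: bool = True):
    """Return (destination modules, 802_frame flag) for a TYPE field value."""
    if type_field <= MAX_8023_LENGTH:
        # 802.3 length: offer the packet to both modules and flag it.
        return (["arp", "ip"], True)
    if type_field == ETHERTYPE_IPV4:
        return (["ip"], False)
    if type_field in (ETHERTYPE_ARP, ETHERTYPE_RARP):
        return (["arp"], False)  # the ARP module also handles RARP
    if unsupported_enabled:
        return (["exception"], False)  # e.g. IPv6 (0x86DD)
    return ([], False)  # assumption: dropped when the feature is disabled
```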
[0279] Data Aligner
[0280] The Data Aligner
[0281] Ethernet Packet Formats
[0282] The IT10G accepts both 802.3(SNAP) and DIX format packets from the network, but only transmits packets in DIX format. Furthermore, when 802.3 packets are received, they are first translated into DIX format, and then processed by the Ethernet filter. Therefore, all Ethernet exception packets are stored in DIX format.
[0283] ARP Protocol and ARP Cache Modules
[0284] Overview
[0285] The following discussion details the ARP Protocol and ARP Cache modules. In one embodiment of the IT10G architecture, the ARP protocol module also supports the RARP protocol, but does not include the ARP cache itself. Because each module capable of transmitting a packet queries the ARP cache ahead of time, this common resource is separated from the ARP protocol module. The ARP protocol module may send updates to the ARP cache based upon the packet types received.
[0286] ARP Feature List:
[0287] Able to respond to ARP requests by generating ARP replies
[0288] Able to generate ARP requests in response to the ARP cache
[0289] Able to provide ARP replies for multiple IP addresses (multi-homed host/ARP proxy)
[0290] Able to generate targeted (unicast) ARP requests
[0291] Filters out illegal addresses
[0292] Passes aligned ARP data up to the processor
[0293] Capable of performing a gratuitous ARP
[0294] CPU may bypass automatic ARP reply generation, dumping ARP data into the exception handler
[0295] CPU may generate custom ARP replies (when in bypass mode)
[0296] Variable priority of ARP packets, depending on network conditions
[0297] RARP Feature List:
[0298] Request an IP address
[0299] Request a specific IP address
[0300] RARP requests are handed off to the exception handler
[0301] Handles irregular RARP replies
[0302] Passes aligned RARP data up to the processor
[0303] CPU may generate custom RARP requests and replies
[0304] ARP Cache Features:
[0305] Dynamic ARP table size
[0306] Automatically updated ARP entry information
[0307] Interrupt when sender's hardware address changes
[0308] Capable of promiscuous collection of ARP data
[0309] Duplicate IP address detection and interrupt generation
[0310] ARP request capability via the ARP module
[0311] Support for static ARP entries
[0312] Option for enabling static ARP entries to be replaced by dynamic ARP data
[0313] Support for ARP proxying
[0314] Configurable expiration time for ARP entries
[0315] (The CPU may be either the host computer CPU or the on-chip processor in this context.)
[0316] ARP Module Block Diagram
[0317]
[0318] ARP Cache Module Block Diagram
[0319]
[0320] ARP Module Theory of Operations
[0321] Parsing Packets
[0322] The ARP module
[0323] Data is read from the Ethernet interface in 64-bit words. An ARP packet takes up 3.5 words. The first word of an ARP-type packet contains mostly static information. The first 48 bits of the first word of an ARP-type packet contain the Hardware Type, Protocol Type, Hardware Address Length, and Protocol Address Length. These received values are compared with the values expected for ARP requests for IPv4 over Ethernet. If the received values do not match, the data is passed to the exception handler for further processing. Otherwise, the ARP module continues with parsing. The last 16 bits of the first word of an ARP-type packet contain the opcode. The ARP module stores the opcode and checks if it is valid, i.e. 1, 2 or 4. If the opcode is invalid, the data is passed to the exception handler for further processing. Otherwise, the ARP module continues with parsing.
[0324] The second word of an ARP-type packet contains the Source Ethernet Address and half of the Source IP Address. The ARP module stores the first 48 bits into the Source Ethernet Address register. Then the ARP module checks if this field is a valid Source Ethernet Address. The address should not be the same as the address of the IT10G network stack. If the source address is invalid, the packet is discarded. The last 16 bits of the word are then stored in the upper half of the Source IP Address register.
[0325] The third word of an ARP-type packet contains the second half of the Source IP Address and the Target Ethernet Address. The ARP module stores the first 16 bits in the lower half of the Source IP Address register, and checks if this stored value is a valid Source IP Address. The address should not be the same as that of the IT10G hardware, or the broadcast address. Also, the source address should be in the same subnet. The ARP module discards the packet if the source address is invalid. If the packet is an ARP/RARP reply, the ARP module compares the Target Hardware Address with the local Ethernet address. If the addresses do not match, the ARP module discards the packet. Otherwise the ARP module continues with parsing.
[0326] Only the first 32 bits of the last word of an ARP-type packet contain data (the Target IP Address). The ARP module stores the Target IP Address in a register. If the packet is an ARP packet (as opposed to a RARP packet), the ARP module compares the Target IP Address with the local IP address. If the addresses do not match, the packet is discarded. Otherwise, if the packet is an ARP request, an ARP reply is generated. If the packet is a RARP reply, the assigned IP address is passed to the RARP handler.
[0327] Once all the address data have been validated, the source addresses are passed to the ARP Cache.
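The parsing and validation sequence described above can be summarized in software form. This is only a sketch: the hardware processes the packet one 64-bit word at a time, whereas the code below unpacks the whole 28-byte ARP body at once. The function name and the returned action labels are hypothetical.

```python
import struct

# Hardware Type, Protocol Type, HLEN, PLEN, opcode, then the four addresses.
ARP_FMT = "!HHBBH6s4s6s4s"  # 28 bytes, i.e. 3.5 64-bit words
VALID_OPCODES = {1, 2, 4}   # ARP request, ARP reply, RARP reply

def classify_arp(payload, my_mac, my_ip, netmask):
    """Return the action for a received ARP/RARP body (addresses as bytes)."""
    htype, ptype, hlen, plen, op, sha, spa, tha, tpa = struct.unpack(
        ARP_FMT, payload[:28])
    # First word: static fields and opcode; mismatches go to the exception handler.
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4) or op not in VALID_OPCODES:
        return "exception"
    # Second word: the source Ethernet address must not be our own.
    if sha == my_mac:
        return "discard"
    # Third word: the source IP must not be ours or broadcast, and must be on our subnet.
    src, me, mask = (int.from_bytes(x, "big") for x in (spa, my_ip, netmask))
    subnet_broadcast = (me | ~mask) & 0xFFFFFFFF
    if src == me or src in (subnet_broadcast, 0xFFFFFFFF) or (src & mask) != (me & mask):
        return "discard"
    # Replies must be addressed to our Ethernet address.
    if op in (2, 4) and tha != my_mac:
        return "discard"
    if op == 4:
        return "rarp_reply"  # assigned IP address goes to the RARP handler
    # Last word: ARP packets must target our IP address.
    if tpa != my_ip:
        return "discard"
    return "send_reply" if op == 1 else "update_cache"
```

In every non-discard, non-exception case the validated source addresses would also be passed to the ARP Cache; the single returned label is a simplification.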
[0328] Transmitting Packets
[0329] The ARP module may receive requests for transmitting packets from three sources: the ARP Cache
[0330] Transmission requests are placed in the queue in a first-come first-served order, except when two or more entities want to transmit. In that case, the next request placed in the queue depends on its priority. RARP requests normally have the highest priority, followed by ARP requests. ARP replies usually have the lowest priority. Using priority allows resources to be shared depending on data traffic.
[0331] There is one condition where ARP replies have the highest priority. This occurs when the ARP reply FIFO buffer is full. When the FIFO buffer is full, incoming ARP requests begin to be discarded; therefore, ARP replies should have the highest priority at that point to avoid forcing retransmissions of ARP requests.
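The priority rule for simultaneous transmission requests, including the exception for a full ARP reply FIFO buffer, might look like the following sketch. The request labels and the function name are hypothetical, and the relative order of RARP and ARP requests while the FIFO buffer is full is an assumption.

```python
NORMAL_ORDER = ["rarp_request", "arp_request", "arp_reply"]
FIFO_FULL_ORDER = ["arp_reply", "rarp_request", "arp_request"]

def next_to_queue(contending, reply_fifo_full=False):
    """Choose which of several simultaneous requests enters the queue next."""
    order = FIFO_FULL_ORDER if reply_fifo_full else NORMAL_ORDER
    for kind in order:
        if kind in contending:
            return kind
    return None  # nothing is waiting
```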
[0332] When the transmission queue is full, no more requests may be made until one or more transmission requests have been fulfilled (and removed from the queue). When the ARP module detects a full queue, it requests an increase in priority from the transmission arbiter. Because there should be only two conditions for the queue, full or not full, this request signal may be a single bit.
[0333] When the transmission arbiter allows the ARP module to transmit, ARP/RARP packets are generated dynamically depending on the type of packet to be sent. The type of packet is determined by the opcode, which is stored with each entry in the queue.
[0334] Bypass Mode
[0335] The ARP module has the option of bypassing the automatic processing of incoming packet data. When a bypass flag is set, incoming ARP/RARP data are transferred to the exception handler buffer. The CPU then accesses the buffer, and processes the data. When in bypass mode, the CPU may generate ARP replies on its own, passing data to the transmission scheduler. The fields that can be customized in outgoing ARP/RARP packets are: the source IP address, the source Ethernet address, the target IP address, and the opcode. All other fields match the standard values used in ARP/RARP packets for IPv4 over Ethernet, and the source Ethernet address is set to that of the Ethernet interface. (The CPU may be either the host computer or the on-chip processor in this context.)
[0336] Note: If it is necessary to modify these other ARP/RARP fields, the CPU must generate a raw Ethernet frame itself.
[0337] ARP Cache Theory of Operation
[0338] Adding Entries to the ARP Cache
[0339] ARP entries are created when receiving targeted ARP requests and replies (dynamic), or when requested by the CPU (static). (The CPU may be either the host computer or the on-chip processor in this context.) Dynamic entries are ARP entries that are created when an ARP request or reply is received for one of the interface IP addresses. Dynamic entries exist for a limited time as specified by the user or application program running on the host computer; typically five to 15 minutes. Static entries are ARP entries that are created by the user and do not normally expire.
[0340] New ARP data come from two sources: the CPU via the ARP registers and the ARP packet parser. When both sources request to add an ARP entry at the same time the dynamic ARP entries have priority, because it is necessary to process incoming ARP data as quickly as possible.
[0341] Once an ARP data source has been selected, we need to determine where in IT 10G hardware memory the ARP entry is to be stored. To do this we use a lookup table (LUT) to map a given IP address to a location in memory. The lookup table contains 256 entries. Each entry is 16 bits wide and contains a memory pointer and a pointer valid (PV) bit. The PV bit is used to determine if the pointer is pointing to a valid address, i.e. the starting address of a memory block allocated by the ARP cache.
[0342] To determine from where in the LUT we need to retrieve the pointer, we use an 8-bit index. The index is taken from the last octet of a 32-bit IP address. The reason for using the last octet is that in a local area network (LAN) this is the portion of the IP address that varies the most between hosts.
[0343] Once we determine which slot in the LUT to use, we check to see if there is a valid pointer contained in that slot (PV=“1”). If there is a valid pointer, that means there is a block of memory allocated for this index, and the target IP address may be found in that block. At this point, the block of memory being pointed to is retrieved and the target IP address is searched for. If the LUT does not contain a valid pointer in this slot, then memory must be allocated from an internal memory, malloc1. Once the memory has been allocated the address of the first word of the allocated memory is stored in the pointer field of the LUT entry.
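A minimal sketch of the LUT indexing and pointer check follows. The text gives the entry width (16 bits) but not the bit layout, so placing the PV bit in the most significant bit with a 15-bit pointer is an assumption, as is the `allocate_block` callback standing in for malloc1.

```python
PV_BIT = 0x8000    # assumed position of the pointer valid flag
PTR_MASK = 0x7FFF  # assumed 15-bit pointer field

def lut_index(ip: int) -> int:
    """Index the 256-entry LUT with the last octet of a 32-bit IP address."""
    return ip & 0xFF

def lookup_or_allocate(lut, ip, allocate_block):
    """Return (block pointer, newly_allocated) for the slot that ip maps to."""
    idx = lut_index(ip)
    entry = lut[idx]
    if entry & PV_BIT:
        return entry & PTR_MASK, False  # a block already exists for this index
    ptr = allocate_block() & PTR_MASK   # request memory from the allocator
    lut[idx] = PV_BIT | ptr             # store the pointer and mark it valid
    return ptr, True
```

For example, 192.168.1.20 and 192.168.2.20 both map to index 20, which is the collision case handled by chaining.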
[0344] After allocating memory and storing the pointer in the LUT, we need to store the necessary ARP data. This ARP data includes the IP address, necessary for determining if this is the correct entry during cache lookups. Also used is a set of control fields. The retry counter is used to keep track of the number of ARP request attempts performed for a given IP address. The type field indicates the type of cache entry (000=dynamic entry; 001=static entry; 010=proxy entry; 011=ARP check entry). The resolved flag indicates that this IP address has been successfully resolved to an Ethernet address. The valid flag indicates that this ARP entry contains valid data. Note: an entry may be valid and unresolved while the initial ARP request is being performed. The src field indicates the source of the ARP entry (00=dynamically added, 01=system interface, 10=IP router, and 11=both system interface and IP router). The interface field allows the use of multiple Ethernet interfaces, but defaults to a single interface (0). Following the control fields is the link address that points to the following ARP entry. The most significant bit (MSB) of the link address is actually a flag, link_valid. The link_valid bit indicates that there is another ARP entry following this one. The last two fields are the Ethernet address to which the IP address has been resolved, and the timestamp. The timestamp indicates when the ARP entry was created, and is used to determine if the entry has expired.
[0345] In LANs with more than 256 hosts or with multiple subnets, collisions between different IP addresses may occur in the LUT. In other words, more than one IP address may map to the same LUT index. This would be due to more than one host having a given value in the last octet of its IP address. To deal with collisions, the ARP cache uses chaining, which we describe next.
[0346] When performing a lookup in the LUT, and an entry is found to already exist in that slot, we retrieve the ARP entry that is being pointed to from memory. We examine the IP address in the ARP entry and compare it to the target IP address. If the IP addresses match, then we can simply update the entry. However, if the addresses do not match, then we look at the link_valid flag and the last 16 bits of the ARP entry. The last 16 bits contain a link address pointing to another ARP entry that maps to the same LUT index. If the link_valid bit is asserted, then we retrieve the ARP entry pointed to in the Link Address field. Again the IP address in the entry is compared with the target IP address. If there is a match then the entry is updated; otherwise the lookup process continues (following the links in the chain) until a match is found or the link_valid bit is not asserted.
[0347] When the end of a chain is reached and a match has not been found, a new ARP entry is created. Creating a new ARP entry may require the allocation of memory by the malloc1 memory controller. Each block of memory is 128 bytes in size. Thus, each block can accommodate 8 ARP entries. If the end of a block has been reached, then a new memory block must be requested from malloc1.
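The chained lookup and append described above can be sketched as follows. Cache memory is modelled as a dictionary from address to entry, and `new_addr` stands in for the location that the allocator would provide; both are assumptions for illustration. Only the fields needed for the chain walk are shown.

```python
from dataclasses import dataclass

ENTRY_BYTES = 128 // 8  # 128-byte blocks hold 8 entries, so 16 bytes each

@dataclass
class ArpEntry:
    ip: int
    mac: bytes
    link_valid: bool = False  # set when another entry follows in the chain
    link: int = 0             # address of the next entry if link_valid

def find_or_append(memory, head, ip, mac, new_addr):
    """Walk the chain starting at head; update a match or append a new entry."""
    addr = head
    while True:
        entry = memory[addr]
        if entry.ip == ip:
            entry.mac = mac  # IP addresses match: update in place
            return addr, "updated"
        if not entry.link_valid:
            break            # end of chain reached without a match
        addr = entry.link
    memory[new_addr] = ArpEntry(ip, mac)
    entry.link, entry.link_valid = new_addr, True
    return new_addr, "created"
```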
[0348] As previously mentioned, the user (or application running on the host computer) has the option of creating static or permanent ARP entries. The user may have the option of allowing dynamic ARP data to replace static entries. In other words, when ARP data are received for an IP address that already has a static ARP entry created for it, that static entry may be replaced with the received data. The benefit of this arrangement is that static entries may become outdated and allowing dynamic data to overwrite static data may result in a more current ARP table. This update capability may be disabled if the user is confident that IP-to-Ethernet address mappings will remain constant, e.g. storing the IP and Ethernet addresses of a router interface. The user may also choose to preserve static entries to minimize the number of ARP broadcasts on a LAN. Note: ARP proxy entries can never be overwritten by dynamic ARP data.
[0349] Looking Up Entries in the Cache
[0350] Looking up entries in the ARP cache follows a process similar to that for creating ARP entries. Lookups begin by using the LUT to determine if memory has been allocated for a given index. If memory has been allocated, the memory is searched until either the entry is found (a cache hit occurs), or an entry with the link_valid flag set to zero (a cache miss) is encountered.
[0351] If a cache miss occurs, an ARP request is generated. This involves creating a new ARP entry in the cache, and a new LUT entry if necessary. In the new ARP entry, the target IP address is stored, the resolved bit is set to zero and the valid bit is set to one. The request counter is set to zero as well. The entry is then time stamped and an ARP request is passed to the ARP module. If a reply is not received after one second, then the request counter is incremented and another request is sent. After sending three requests and receiving no replies, attempts to resolve the target IP are abandoned. Note: the retry interval and number of request retries are user-configurable.
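The retry behavior after a cache miss can be sketched as a handler invoked each time the retry interval elapses without a reply. The names are hypothetical, and the default of three requests reflects the values given above, which the text says are user-configurable.

```python
from dataclasses import dataclass

@dataclass
class PendingEntry:
    target_ip: int
    request_counter: int = 0  # zero when the first request is sent
    resolved: bool = False

def on_retry_timeout(entry, max_requests=3):
    """Decide what to do when the retry interval expires for an entry."""
    if entry.resolved:
        return "done"
    # request_counter + 1 requests have been sent so far.
    if entry.request_counter + 1 >= max_requests:
        return "abandon"  # notify the requester that resolution failed
    entry.request_counter += 1
    return "resend"
```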
[0352] When a cache miss occurs, the requesting module is notified of the miss. This allows the CPU or IP router the opportunity to decide to wait for an ARP reply for the current target IP address, or to begin a new lookup for another IP address and place the current IP address at the back of the queue. This helps to minimize the impact of a cache miss on establishing multiple connections.
[0353] If a matching entry is found (cache hit), then the resolved Ethernet address is returned to the module requesting the ARP lookup. Otherwise, if the target IP address was not found in the cache and all ARP request attempts have timed out, the requesting module is notified that the target IP address could not be resolved.
[0354] Note: if an ARP lookup request from the IP router fails, the router must wait a minimum of 20 seconds before initiating another lookup for that address.
[0355] Cache Initialization
[0356] When the ARP cache is initialized, several components are reset. The lookup table (LUT) is cleared by setting all the PV bits to zero. All memory currently in use is de-allocated and released back to the malloc1 memory controller. The ARP expiration timer is also set to zero.
[0357] During the initialization period, no ARP requests are generated. Also, any attempts to create ARP entries from the CPU (static entries), or from received ARP data (dynamic entries) are ignored or discarded.
[0358] Expiring ARP Entries
[0359] Dynamic ARP entries may only exist in the ARP cache for a limited amount of time. This is to prevent any IP-to-Ethernet address mappings from becoming stale. Outdated address mappings could occur if a LAN uses DHCP to assign IP addresses or if the Ethernet interface on a device is changed during a communications session.
[0360] To keep track of the time, a 16-bit counter is used. Operating with a clock frequency of 1 Hz the counter is used to track the number of seconds that have passed. Each ARP entry contains a 16-bit timestamp taken from this counter. This timestamp is taken when an IP address is successfully resolved.
[0361] ARP entry expiration occurs when the ARP cache is idle, i.e. no requests or lookups are currently being processed. At this time, an 8-bit counter is used to cycle through and search the LUT. Each slot in the LUT is checked to see if it contains a valid pointer. If a pointer is valid, the memory block pointed to is retrieved. Then, each entry within that block is checked to see if the difference between its timestamp and the current time is greater than or equal to the maximum lifetime of an ARP entry. If other memory blocks are chained off the first memory block, the entries contained in those blocks are also checked. Once all the entries associated with a given LUT index have been checked, then the next LUT slot is checked.
[0362] If an entry is found to have expired, the valid bit in the entry is set to zero. If there are no other entries within the same memory block, then the block is de-allocated and returned to malloc1. If the block being de-allocated is the only block associated with a given LUT slot, the PV bit in that slot is also set to zero.
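The expiration test can be sketched as a comparison of 16-bit second counts. The text does not say how counter rollover is handled; the modular subtraction below is an assumption that keeps the age correct across a single wrap of the 16-bit counter (about 18.2 hours at 1 Hz).

```python
COUNTER_BITS = 16
COUNTER_MASK = (1 << COUNTER_BITS) - 1

def age_seconds(now: int, timestamp: int) -> int:
    """Seconds since the timestamp, assuming at most one counter wrap."""
    return (now - timestamp) & COUNTER_MASK

def is_expired(now: int, timestamp: int, max_lifetime: int) -> bool:
    """True when the entry has reached or exceeded its maximum lifetime."""
    return age_seconds(now, timestamp) >= max_lifetime
```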
[0363] Performing ARP Proxying
[0364] The ARP cache supports proxy ARP entries. ARP proxying is used when this device acts as a router for LAN traffic, or there are devices on the LAN that are unable to respond to ARP queries.
[0365] With ARP proxying enabled, the ARP module passes requests for IP addresses that do not belong to the host up to the ARP cache. The ARP cache then does a lookup to search for the target IP address. If it finds a match, it checks the type field of the ARP entry to determine if it is a proxy entry. If it is a proxy entry, the ARP cache passes the corresponding Ethernet address back to the ARP module. The ARP module then generates an ARP reply using the Ethernet address found in the proxy entry as the source Ethernet address. Note: ARP proxy lookups occur only for incoming ARP requests.
[0366] Detection of Duplicate IP Addresses (ARP Check)
[0367] When the system (host computer plus IT 10G hardware) initially connects to a network, the user or application running on the host computer should perform a gratuitous ARP request to test if any other device on the network is using one of the IP addresses assigned to its interface. If two devices on the same LAN use the same IP address, this could result in problems with routing packets for the two hosts. A gratuitous ARP request is a request for the host's own IP address. If no replies are received for the queries, then it can be assumed that no other host on the LAN is using our IP address.
[0368] An ARP check is initiated in a manner similar to that of performing an ARP lookup. The only difference is that the cache entry is discarded once the gratuitous ARP request has been completed. If no replies are received, the entry is removed. If a reply is received, an interrupt is generated to notify the host computer that the IP address is in use by another device on the LAN, and the entry is removed from the cache.
[0369] Cache Access Priorities
[0370] Different tasks have different priorities in terms of access to the ARP cache memory. Proxy entry lookups have the highest priority due to the need for rapid responses to ARP requests. Second in priority is adding dynamic entries to the cache; incoming ARP packets may be received at a very high rate and must be processed as quickly as possible to avoid retransmissions. ARP lookups from the IP router have the next highest priority, followed by lookups by the host computer. The manual creation of ARP entries has the second lowest priority. Expiring cache entries has the lowest priority and is performed whenever the cache is not processing an ARP lookup or creating a new entry.
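The access ordering above can be expressed as a fixed-priority arbiter. The enumeration names are hypothetical; lower values denote higher priority.

```python
from enum import IntEnum

class CacheTask(IntEnum):
    PROXY_LOOKUP = 0   # highest: rapid responses to ARP requests
    ADD_DYNAMIC = 1    # entries from received ARP packets
    ROUTER_LOOKUP = 2  # lookups from the IP router
    HOST_LOOKUP = 3    # lookups by the host computer
    ADD_STATIC = 4     # manual creation of entries
    EXPIRE = 5         # lowest: runs only when the cache is otherwise idle

def grant(requests):
    """Return the highest-priority pending task, or None if none is pending."""
    return min(requests) if requests else None
```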
[0371] Overview
[0372] The IT 10G natively supports IPv4 packets with automatic parsing for all types of received packets.
[0373] IP Module Block Diagram
[0374]
[0375] IP Sub Module Descriptions
[0376] IP Parser
[0377] The IP Parser module
[0378] IP Header Field Parsing
[0379] IP Version
[0380] Only IPv4 packets are accepted and parsed by the IP module; therefore this field must be 0x4 for the packet to be processed. If an IPv6 packet is detected, it is handled as an exception and processed by the Exception Handler. Any packet having a version that is less than 0x4 is considered malformed (illegal) and the packet is dropped.
[0381] IP Header Length
[0382] The IP Header Length field is used to determine if any IP options are present. This field must be greater than or equal to five. If it is less, the packet is considered malformed and dropped.
[0383] IP TOS
[0384] This field is not parsed or kept for received packets.
[0385] Packet Len
[0386] This field is used to determine the total number of bytes in the received packet, and is used to indicate to the next level protocol where the end of its data section is. All data bytes received after this count expires and before the ip_packet signal de-asserts are assumed to be padding bytes and are silently discarded.
[0387] Packet ID, Flags, and Fragmentation Offset
[0388] These fields are used for defragmenting packets. Fragmented IP packets may be handled by dedicated hardware or may be treated as exceptions and processed by the Exception Handler.
[0389] TTL
[0390] This field is not parsed or kept for received packets.
[0391] PROT
[0392] This field is used to determine the next encapsulated protocol. The following protocols are fully supported (or partially supported in alternative embodiments) in hardware:
TABLE 4 Supported Protocol Field Decodes
Hex value    Protocol
0x06         TCP
0x11         UDP
[0393] If any other protocol is received, and if the unsupport_prot feature is enabled, then the packet may be sent to the host computer. A protocol filter may be enabled to selectively receive certain protocols. Otherwise, the packet is silently discarded.
[0394] Checksum
[0395] This field is not parsed or kept. It is used only to verify that the checksum is correct. If the checksum is bad, then the bad_checksum signal, which goes to all the next layers, is asserted. It stays asserted until it is acknowledged.
[0396] Source IP Address
[0397] This field is parsed and sent to the TCP/UDP layers.
[0398] Destination IP Address
[0399] This field is parsed and checked against valid IP addresses that the local stack should be responding to. This may take more than one clock cycle, in which case the parsing should continue. If the packet turns out to be misdirected, then the bad_ip_add signal is asserted. It stays asserted until it is acknowledged.
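The header field checks described above can be gathered into one sketch. The hardware asserts signals such as bad_checksum and bad_ip_add rather than returning values; the returned labels here are a simplification. Treating versions greater than 4 other than IPv6 as exceptions is an assumption, since the text covers only IPv6 and versions less than 4.

```python
import struct

PROTOCOLS = {0x06: "tcp", 0x11: "udp"}  # Table 4 decodes

def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum used by the IP header checksum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total

def check_ip_header(packet: bytes, local_ips, unsupport_prot=False):
    """Classify a received IP packet from its header fields."""
    version, ihl = packet[0] >> 4, packet[0] & 0x0F
    if version < 4:
        return "drop"       # malformed
    if version != 4:
        return "exception"  # IPv6 (assumed also for other versions above 4)
    if ihl < 5:
        return "drop"       # header length below the minimum
    header = packet[:ihl * 4]
    if ones_complement_sum(header) != 0xFFFF:
        return "bad_checksum"
    if header[16:20] not in local_ips:
        return "bad_ip_add"  # misdirected packet
    proto = header[9]
    if proto in PROTOCOLS:
        return PROTOCOLS[proto]
    return "host" if unsupport_prot else "discard"
```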
[0400] IP ID Generation Algorithm
[0401] The on-chip processor can set the IP ID seed value by writing any 16-bit value to the IP_ID_Start register. The ID generator takes this value and does a mapping of the 16 bits to generate the IP ID used by different requestors. The on-chip processor, TCP module, and ICMP echo reply generator can all request an IP ID. A block diagram of one implementation of the ID generator is shown in the
[0402] The IP ID Seed register is incremented every time a new IP ID is requested. The Bit Mapper block rearranges the IP_ID_Reg value such that the IP_ID_Out bus is not a simple incrementing value.
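The ID generator can be sketched as an incrementing 16-bit register followed by a fixed bit permutation. The text does not give the Bit Mapper's actual mapping, so the bit reversal used here is purely a hypothetical choice; any fixed permutation of the 16 bits would likewise yield 65,536 distinct IDs before repeating. Whether the output is taken before or after the increment is also an assumption.

```python
def bit_map(value: int) -> int:
    """Hypothetical Bit Mapper: reverse the order of the 16 bits."""
    return int(format(value & 0xFFFF, "016b")[::-1], 2)

class IpIdGenerator:
    def __init__(self, seed: int = 0):
        self.reg = seed & 0xFFFF  # value written to the IP_ID_Start register

    def next_id(self) -> int:
        """Return a mapped IP ID and increment the seed register."""
        out = bit_map(self.reg)
        self.reg = (self.reg + 1) & 0xFFFF
        return out
```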
[0403] IP Injector Module
[0404] The IP injector module is used to inject packets from the on-chip processor into the IP and TCP modules. The IP injector control registers are located in the IP module register space, and these registers are programmed by the on-chip processor. A block diagram depicting the data flow of the IP Injector is shown in
[0405] As can be seen, the IP Injector is capable of inserting data below the IP module. To use IP Injection, the on-chip processor programs the IP Injector module with the starting address in its memory of where the packet resides, the length of the packet, and the source MAC address. The injector module generates an interrupt when it has completed transmitting the packet from the on-chip proce