Chapter 2. The Transport Layer: TCP, UDP, and SCTP


This chapter focuses on the transport layer: TCP, UDP, and Stream Control Transmission Protocol (SCTP). UDP is a simple, unreliable datagram protocol, while TCP is a sophisticated, reliable byte-stream protocol. SCTP is similar to TCP as a reliable transport protocol, but it also provides message boundaries, transport-level support for multihoming, and a way to minimize head-of-line blocking.

The Big Picture

Overview of TCP/IP protocols:

Protocol: Description
IPv4: Internet Protocol version 4. IPv4 uses 32-bit addresses and provides packet delivery service for TCP, UDP, SCTP, ICMP, and IGMP.
IPv6: Internet Protocol version 6. IPv6 uses 128-bit addresses.
TCP: Transmission Control Protocol. TCP is a connection-oriented protocol that provides a reliable, full-duplex byte stream to its users.
UDP: User Datagram Protocol. UDP is a connectionless protocol, and UDP sockets are an example of datagram sockets.
SCTP: Stream Control Transmission Protocol. SCTP is a connection-oriented protocol that provides a reliable full-duplex association.
ICMP: Internet Control Message Protocol. ICMP handles error and control information between routers and hosts.
IGMP: Internet Group Management Protocol. IGMP is used with multicasting.
ARP: Address Resolution Protocol. ARP maps an IPv4 address into a hardware address (such as an Ethernet address). ARP is normally used on broadcast networks such as Ethernet, token ring, and FDDI, and is not needed on point-to-point networks.
RARP: Reverse Address Resolution Protocol. RARP maps a hardware address into an IPv4 address. It is sometimes used when a diskless node is booting.
ICMPv6: Internet Control Message Protocol version 6. ICMPv6 combines the functionality of ICMPv4, IGMP, and ARP.
BPF: BSD packet filter. This interface provides access to the datalink layer. It is normally found on Berkeley-derived kernels.
DLPI: Datalink provider interface.

User Datagram Protocol (UDP)

Transmission Control Protocol (TCP)

Stream Control Transmission Protocol (SCTP)

Like TCP, SCTP provides reliability, sequencing, flow control, and full-duplex data transfer.

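Unlike TCP, SCTP provides message boundaries, transport-level support for multihoming, and a way to minimize head-of-line blocking.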

TCP Connection Establishment and Termination

Three-Way Handshake

Figure 2.2. TCP three-way handshake.

  1. Server: passive open, by calling socket, bind, and listen
  2. Client: active open, by calling connect. This causes the client TCP to send a "synchronize" (SYN) segment, which carries no data but contains the client's initial sequence number for the data to be sent on the connection.
  3. Server: acknowledges (ACK) the client's SYN and sends its own SYN, which contains the server's initial sequence number for the data to be sent on the connection. The server sends its SYN and the ACK of the client's SYN in a single segment.
  4. Client: acknowledges the server's SYN.

Figure 2.2 shows the client's initial sequence number as J and the server's initial sequence number as K. The acknowledgment number in an ACK is the next sequence number that the end sending the ACK expects to receive. Since a SYN occupies one byte of the sequence number space, the acknowledgment number in the ACK of each SYN is the initial sequence number plus one.
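The numbering rule above can be sketched in a few lines of Python. This is only an illustration: the function name `three_way_handshake` and the tuple layout are assumptions made for this example, and no real segments are sent.

```python
SEQ_SPACE = 2 ** 32  # TCP sequence numbers are 32 bits wide and wrap around


def three_way_handshake(client_isn, server_isn):
    """Return the three handshake segments as (sender, flags, seq, ack) tuples.

    client_isn plays the role of J and server_isn the role of K.
    """
    j = client_isn % SEQ_SPACE
    k = server_isn % SEQ_SPACE
    return [
        # The client's SYN carries J and acknowledges nothing yet.
        ("client", "SYN", j, None),
        # The server's SYN carries K; its ACK is J + 1 because a SYN uses one sequence number.
        ("server", "SYN+ACK", k, (j + 1) % SEQ_SPACE),
        # The client's ACK of the server's SYN is K + 1.
        ("client", "ACK", (j + 1) % SEQ_SPACE, (k + 1) % SEQ_SPACE),
    ]


for segment in three_way_handshake(client_isn=1000, server_isn=5000):
    print(segment)
```

The modulo keeps the arithmetic correct when an initial sequence number sits at the top of the 32-bit range.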

TCP Options

The common TCP options are the MSS option, the window scale option, and the timestamp option, and they are supported by most implementations. The latter two are sometimes called the "RFC 1323 options", or "long fat pipe options", since a network with either a high bandwidth or a long delay is called a long fat pipe.

TCP Connection Termination

Figure 2.3. Packets exchanged when a TCP connection is closed.

It takes four segments to terminate a connection:

  1. One application calls close first, which causes its TCP to send a FIN segment meaning it is finished sending data. This end performs the active close.
  2. The other end that receives the FIN performs the passive close. The received FIN is acknowledged by TCP (sending an ACK segment). The receipt of the FIN is also passed to the application as an end-of-file.
  3. Sometime later, the application that received the end-of-file will close its socket. This causes its TCP to send a FIN.
  4. The TCP on the system that receives this final FIN (the end that did the active close) acknowledges the FIN.

A FIN occupies one byte of sequence number space just like a SYN. Therefore, the ACK of each FIN is the sequence number of the FIN plus one.
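Step 2 above, where a received FIN reaches the application as an end-of-file, can be observed over the loopback interface. The following is a minimal sketch rather than a reference implementation: the function name `demo_active_close`, the loopback address, and the 5-second timeout are assumptions made for this example.

```python
import socket


def demo_active_close(payload=b"request"):
    """Open a loopback TCP connection, close the client first, and read to EOF."""
    # Passive open: socket, bind, listen (port 0 lets the kernel pick a free port).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    # Active open: connect returns once the kernel has completed the handshake.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))
    server, _ = listener.accept()
    server.settimeout(5.0)  # avoid blocking forever if something goes wrong

    client.sendall(payload)
    client.close()  # active close: the client's TCP sends a FIN after the data

    chunks = []
    saw_eof = False
    while not saw_eof:
        chunk = server.recv(4096)
        if chunk:
            chunks.append(chunk)
        else:
            saw_eof = True  # an empty read is how the FIN shows up as end-of-file

    server.close()  # passive close: this end now sends its own FIN
    listener.close()
    return b"".join(chunks), saw_eof


data, saw_eof = demo_active_close()
print(data, saw_eof)
```

The loop reads until `recv` returns an empty byte string, since TCP is a byte stream and a single `recv` is not guaranteed to return the whole payload.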

TCP State Transition Diagram

Figure 2.4. TCP state transition diagram.

There are 11 different states defined for a connection and the rules of TCP dictate the transitions from one state to another, based on the current state and the segment received in that state.
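One way to picture the diagram is as a lookup table keyed by the current state and an event. The sketch below covers only the common open and close paths, not the full diagram (simultaneous open and close, resets, and timeouts other than 2MSL are left out). The event labels and the helper `run` are assumptions made for this example.

```python
STATES = (
    "CLOSED", "LISTEN", "SYN_RCVD", "SYN_SENT", "ESTABLISHED", "CLOSE_WAIT",
    "LAST_ACK", "FIN_WAIT_1", "FIN_WAIT_2", "CLOSING", "TIME_WAIT",
)

# (current state, event) -> next state; a partial table for the usual paths only.
TRANSITIONS = {
    ("CLOSED", "app: active open"): "SYN_SENT",
    ("CLOSED", "app: passive open"): "LISTEN",
    ("LISTEN", "recv: SYN"): "SYN_RCVD",
    ("SYN_SENT", "recv: SYN+ACK"): "ESTABLISHED",
    ("SYN_RCVD", "recv: ACK"): "ESTABLISHED",
    ("ESTABLISHED", "app: close"): "FIN_WAIT_1",
    ("ESTABLISHED", "recv: FIN"): "CLOSE_WAIT",
    ("FIN_WAIT_1", "recv: ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "recv: FIN"): "TIME_WAIT",
    ("CLOSE_WAIT", "app: close"): "LAST_ACK",
    ("LAST_ACK", "recv: ACK"): "CLOSED",
    ("TIME_WAIT", "timeout: 2MSL"): "CLOSED",
}


def run(state, events):
    """Apply each event in order and return the list of states visited."""
    visited = [state]
    for event in events:
        state = TRANSITIONS[(state, event)]  # KeyError means the sketch lacks that edge
        visited.append(state)
    return visited


client_path = run("CLOSED", [
    "app: active open", "recv: SYN+ACK", "app: close",
    "recv: ACK", "recv: FIN", "timeout: 2MSL",
])
print(" -> ".join(client_path))
```

The client path shown is the one taken by the end that performs both the active open and the active close.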

Watching the Packets

Figure 2.5. Packet exchange for TCP connection.

The client in this example announces an MSS of 536 (indicating that it implements only the minimum reassembly buffer size) and the server announces an MSS of 1,460 (typical for IPv4 on an Ethernet). It is okay for the MSS to be different in each direction. The acknowledgment of the client's request is sent with the server's reply. This is called piggybacking, and it normally happens when the server takes less than around 200 ms to process the request and generate the reply.

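If the entire purpose of the connection were to send a one-segment request and receive a one-segment reply, TCP would add eight segments of overhead. If UDP were used instead, only two packets would be exchanged: the request and the reply.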


The end that performs the active close goes through the TIME_WAIT state. This endpoint remains in the TIME_WAIT state for twice the maximum segment lifetime (MSL), sometimes called 2MSL, which is between 1 and 4 minutes. The MSL is the maximum amount of time that any given IP datagram can live in a network. The IPv4 TTL field and the IPv6 hop limit field both have a maximum value of 255. The assumption is made that a packet with the maximum hop limit of 255 cannot exist in a network for more than MSL seconds. [p43]
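The 1-to-4-minute range follows directly from the 2MSL rule. As a small worked example, assume the two MSL values commonly cited for it: 30 seconds (traditional in Berkeley-derived implementations) and 2 minutes (the RFC 1122 recommendation). These two values and the function name are assumptions brought in for this sketch, not taken from the text above.

```python
def time_wait_seconds(msl_seconds):
    """TIME_WAIT lasts twice the maximum segment lifetime (2MSL)."""
    return 2 * msl_seconds


# Assumed MSL values: 30 s (Berkeley-derived) and 120 s (RFC 1122 recommendation).
for label, msl in (("MSL = 30 s", 30), ("MSL = 120 s", 120)):
    print(label, "-> TIME_WAIT =", time_wait_seconds(msl), "s")
```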

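TCP must correctly handle lost duplicates (also called wandering duplicates): segments that were delayed in the network, retransmitted by the sender, and then arrive after the retransmission has already been delivered.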

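There are two reasons for the TIME_WAIT state:

  1. To implement TCP's full-duplex connection termination reliably.
  2. To allow old duplicate segments to expire in the network.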

Port Numbers

All three transport protocols (UDP, SCTP, and TCP) use 16-bit integer port numbers to differentiate between processes.

Figure 2.10. Allocation of port numbers.
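As a rough illustration of how a 16-bit port number falls into an allocation range, the sketch below uses the three IANA ranges: well-known (0 to 1023), registered (1024 to 49151), and dynamic or private (49152 to 65535). The ranges are stated here as an assumption for the example, since they are not spelled out in the text above, and the function name `classify_port` is hypothetical.

```python
def classify_port(port):
    """Return the IANA range name for a 16-bit port number."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port numbers are 16-bit integers (0 to 65535)")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"


for p in (80, 8080, 51000):
    print(p, classify_port(p))
```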

Some notes from the figure above:

Socket Pair

TCP Port Numbers and Concurrent Servers


Buffer Sizes and Limitations

Figures: IPv4 Header, IPv6 Header

TCP Output

The following figure shows what happens when an application writes to a TCP socket:

Figure 2.15. Steps and buffers involved when an application writes to a TCP socket.

Every TCP socket has a send buffer, and we can change the size of this buffer with the SO_SNDBUF socket option. When an application calls write, the kernel copies all the data from the application buffer into the socket send buffer. If there is insufficient room in the socket buffer for all the application's data, the process is put to sleep. This assumes the normal default of a blocking socket. The kernel will not return from the write until the final byte in the application buffer has been copied into the socket send buffer. Therefore, the successful return from a write to a TCP socket only tells us that we can reuse our application buffer. It does not tell us that either the peer TCP or the peer application has received the data. This is discussed further with the SO_LINGER socket option.
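Reading and changing the send buffer size might look like the following sketch, which uses Python's standard `socket` module. The helper names and the requested size of 64 KiB are assumptions made for this example. The value the kernel reports back may differ from the value requested, because some systems round, double, or clamp it.

```python
import socket


def send_buffer_size(sock):
    """Return the current SO_SNDBUF value reported by the kernel, in bytes."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


def resize_send_buffer(sock, requested):
    """Request a new send buffer size and return what the kernel actually reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
    return send_buffer_size(sock)


# No connection is needed: the option can be set on a freshly created socket.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    default_size = send_buffer_size(s)
    new_size = resize_send_buffer(s, 64 * 1024)
    print("default:", default_size, "after request:", new_size)
```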

TCP takes the data in the socket send buffer and sends it to the peer TCP. The peer TCP must acknowledge the data, and as the ACKs arrive from the peer, only then can our TCP discard the acknowledged data from the socket send buffer. TCP must keep a copy of our data until it is acknowledged by the peer.

TCP sends the data to IP in MSS-sized or smaller chunks, prepending its TCP header to each segment, where the MSS is the value announced by the peer, or 536 if the peer did not send an MSS option. IP prepends its header, searches the routing table for the destination IP address, and passes the datagram to the appropriate datalink. IP might perform fragmentation before passing the datagram to the datalink, but one goal of the MSS option is to try to avoid fragmentation, and newer implementations also use path MTU discovery. Each datalink has an output queue, and if this queue is full, the packet is discarded and an error is returned up the protocol stack. [p58]
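The chunking rule can be sketched as follows. This is a simplification: a real TCP also takes the peer's advertised window, congestion control, and other factors into account, so actual segment sizes may be smaller. The function name `segment_payloads` is an assumption made for this example.

```python
DEFAULT_MSS = 536  # used when the peer did not send an MSS option


def segment_payloads(data, peer_mss=None):
    """Split send-buffer data into MSS-sized or smaller payloads (headers not included)."""
    mss = DEFAULT_MSS if peer_mss is None else peer_mss
    if mss <= 0:
        raise ValueError("MSS must be positive")
    return [data[i:i + mss] for i in range(0, len(data), mss)]


# 1,500 bytes with no MSS option from the peer versus a peer MSS of 1,460.
print([len(p) for p in segment_payloads(b"x" * 1500)])
print([len(p) for p in segment_payloads(b"x" * 1500, peer_mss=1460)])
```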

UDP Output

The following figure shows what happens when an application writes data to a UDP socket:

Figure 2.16. Steps and buffers involved when an application writes to a UDP socket.

A UDP socket doesn't have a socket send buffer, since it does not need to keep a copy of the application's data. It does have a send buffer size (which we can change with the SO_SNDBUF socket option), but this is simply an upper limit on the maximum-sized UDP datagram that can be written to the socket. If an application writes a datagram larger than the socket send buffer size, EMSGSIZE is returned.

UDP simply prepends its 8-byte header and passes the datagram to IP. IP determines the outgoing interface by performing the routing function, and then either adds the datagram to the datalink output queue (if it fits within the MTU) or fragments the datagram and adds each fragment to the datalink output queue (see UDP and IP Fragmentation in TCPv1). If a UDP application sends large datagrams, there is a much higher probability of (IP) fragmentation than with TCP.

Standard Internet Services

Protocol Usage by Common Internet Applications