CEC2006 DTrace TCP Demo

At the CEC 2006 conference in San Francisco, Bryan Cantrill and Brendan Gregg demonstrated a prototype DTrace tcp provider - part of a planned collection of DTrace network providers. The following documents the demonstrations that were performed (expanded a little further, and updated to match provider developments). These are actual screenshots from DTrace (command line output).

I originally posted this at http://www.opensolaris.org/os/community/dtrace/NetworkProvider/CEC2006/

Last Update: 16-Oct-2006

Intro

A few weeks ago, a proposed design for a DTrace Network provider was posted here on OpenSolaris, and then discussed on DTrace discuss. This design has been improved from the DTrace discuss feedback and from discoveries made when attempting to implement these providers. Last week we began putting together a prototype tcp provider (based on SDT tracing) to see how these ideas will work. This provider was first demonstrated at the CEC 2006 conference.

Network Utilisation

Let's begin with the basics: how do we measure network utilisation? That is surprisingly hard, given that the network is a key resource; while CPUs have numerous tools for per-CPU and per-process performance observability, Solaris by default provides little for per-NIC and per-connection observability.

Kstat solves many per-NIC observability questions, as first shown (and shown well) by the SE Toolkit. netstat -i, a Kstat consumer, only prints packet counts - which have some value, but we really want bytes as well. Among other things, bytes allow us to calculate a per-NIC utilisation metric. nicstat, an open source tool available in both Perl and C, does this:

# ./nicstat -i e1000g0 1
    Time     Int   rKB/s   wKB/s   rPk/s   wPk/s    rAvs    wAvs   %Util    Sat
13:46:08 e1000g0   77.43   19.89   59.65   24.26  1329.3   839.7    0.08   0.00
13:46:09 e1000g0  1387.7   387.9  1837.2  1337.5   773.4   297.0    1.45   0.00
13:46:10 e1000g0  1396.1   350.5  1734.3  1188.5   824.3   302.0    1.43   0.00
13:46:11 e1000g0   129.6   42.62   213.2   171.5   622.7   254.5    0.14   0.00
13:46:12 e1000g0   196.1   66.27   279.2   219.8   719.1   308.7    0.21   0.00
13:46:13 e1000g0   155.7   54.13   240.6   193.1   662.5   287.1    0.17   0.00
13:46:14 e1000g0   122.8   58.01   229.7   201.0   547.6   295.6    0.15   0.00
^C

Ok, we see some network traffic (%Util around 1%) in this example output. If you have multiple interfaces, nicstat prints a line for each.

nicstat uses Kstat, and such ongoing statistics are (and should be) maintained by Kstat. The MIB provides similar information for SNMP. So per-NIC observability has a number of solutions, depending on what interests you.

There aren't many options for by-connection observability. We can capture and examine packets using tools such as snoop; however, the overhead of promiscuous tracing can cost double-digit percentages of performance (depending on the interface and workload). You won't want to run snoop for long in production, and even if you do, the performance impact casts doubt on any derived performance metrics.

With DTrace we can safely explore the potential of per-connection and per-packet tracing, armed with the rich programming functionality that DTrace provides.

DTT/Net/tcpsnoop

The following is tcpsnoop from the DTraceToolkit:

# ./tcpsnoop
  UID    PID LADDR           LPORT DR RADDR           RPORT  SIZE CMD
    0    418 192.168.1.185      79 <- 192.168.1.109   55036    54 inetd
    0    418 192.168.1.185      79 -> 192.168.1.109   55036    54 inetd
    0    418 192.168.1.185      79 <- 192.168.1.109   55036    54 inetd
    0   3265 192.168.1.185      79 -> 192.168.1.109   55036   423 in.fingerd
    0   3265 192.168.1.185      79 <- 192.168.1.109   55036    54 in.fingerd
    0   3265 192.168.1.185      79 -> 192.168.1.109   55036    54 in.fingerd
    0   3265 192.168.1.185      79 <- 192.168.1.109   55036    54 in.fingerd
    0   3265 192.168.1.185      79 <- 192.168.1.109   55036    54 in.fingerd
    0   3265 192.168.1.185      79 -> 192.168.1.109   55036    54 in.fingerd
^C

This output shows an inbound finger connection, and shows when the connection was passed from inetd to in.fingerd. The tool was designed to print a line for each Ethernet frame, along with IP addresses, TCP ports and process details. tcpsnoop was written by Brendan Gregg in July 2005, after studying the public OpenSolaris code since its release in June 2005. It uses the DTrace fbt provider (function boundary tracing), which has an unstable interface whose probes and arguments change as the kernel changes; and since its release, tcpsnoop has indeed needed a few updates to keep up with Solaris 10 updates.

So, since DTrace exists (Dec 2004), and the fbt provider exists, and OpenSolaris provides public code access (June 2005), customers already have all the components needed to solve such observability needs - should they have the skills, the time, and the patience to get it right. tcpsnoop is an example of this, and it is a giant observability win that this is possible at all.

However, writing tcpsnoop required reading through thousands of lines of kernel TCP/IP C code, something that most customers are unlikely to have the time or skills to do (especially time), let alone maintain as the kernel changes. Such network observability is important and should be available to all Solaris users, in a simple and stable manner. This is a key need that the stable networking providers will fulfill.

tcpaccept1.d

The following is the first demonstration of the prototype tcp provider:

# ./tcpaccept1.d
dtrace: script './tcpaccept1.d' matched 1 probe
^C
192.168.1.107                                         79                1
fe80::214:4fff:fe3b:76c8                              79                1
192.168.1.107                                        513                2
192.168.1.109                                         22                2
192.168.1.109                                         79                3
192.168.1.109                                         80                8 

This counts inbound TCP connections ("accepts") by IP address and TCP port number. The output shows 8 connections to port 80 (HTTP) from 192.168.1.109, and 1 connection to port 79 (finger) from fe80::214:4fff:fe3b:76c8 - an IPv6 connection. Great!

Also great is the source code,

#!/usr/sbin/dtrace -s

tcp:::accept-established {
        @num[args[1]->ip_saddr, args[2]->tcp_dport] = count();
}

The tcp:::accept-established probe fires for inbound TCP connections. The values we aggregate on are args[1]->ip_saddr, the remote IP address, and args[2]->tcp_dport, the local (destination) TCP port.

tcpaccept2.d

# ./tcpaccept2.d 
dtrace: script './tcpaccept2.d' matched 1 probe 
Tracing... Hit Ctrl-C to end.
^C
   HOSTNAME                   PORT        COUNT
   marlin6                    finger          1
   bass                       login           2
   marlin                     ssh             2
   marlin                     telnet          2
   marlin                     finger          3

Fantastic! Rather than printing IP addresses and port numbers, tcpaccept2.d prints hostnames and service names. This was achieved by adding new format characters to the DTrace printf and printa functions, as can be seen in the source of tcpaccept2.d below.

#!/usr/sbin/dtrace -qs

dtrace:::BEGIN
{
        printf("Tracing... Hit Ctrl-C to end.\n");
}

tcp:::accept-established
{
        @num[args[1]->ip_saddr, args[2]->tcp_dport] = count();
}

dtrace:::END
{
        printf("   %-26s %-8s %8s\n", "HOSTNAME", "PORT", "COUNT");
        printa("   %-26I %-8P %@8d\n", @num);
}

The %I will look up and print a hostname, and %P a service name. This translation is optional - some people may prefer to keep the IP addresses and port numbers as they were.

tcpio1.d

# ./tcpio1.d
dtrace: script './tcpio1.d' matched 5 probes
CPU     ID                    FUNCTION:NAME
  0  47016            tcp_rput_data:receive   192.168.1.109:53816  ->    192.168.1.185:23
  0  46929               tcp_send_data:send   192.168.1.185:23     ->    192.168.1.109:53816
  0  47017         tcp_conn_request:receive   192.168.1.109:37879  ->    192.168.1.185:79
  0  46929               tcp_send_data:send   192.168.1.185:79     ->    192.168.1.109:37879
  0  47016            tcp_rput_data:receive   192.168.1.109:37879  ->    192.168.1.185:79
  0  47016            tcp_rput_data:receive   192.168.1.109:37879  ->    192.168.1.185:79
  0  46929               tcp_send_data:send   192.168.1.185:79     ->    192.168.1.109:37879
  0  46929               tcp_send_data:send   192.168.1.185:79     ->    192.168.1.109:37879
  0  47016            tcp_rput_data:receive   192.168.1.109:37879  ->    192.168.1.185:79
  0  46929               tcp_send_data:send   192.168.1.185:79     ->    192.168.1.109:37879
  0  47016            tcp_rput_data:receive   192.168.1.109:37879  ->    192.168.1.185:79
  0  47016            tcp_rput_data:receive   192.168.1.109:37879  ->    192.168.1.185:79
  0  46929               tcp_send_data:send   192.168.1.185:79     ->    192.168.1.109:37879
^C

The above output shows the tcpio1.d script snooping an inbound finger connection. By examining the IP addresses and direction of packets, we can see where the initial TCP SYN/SYN-ACK/ACK three-way handshake would be. The output is fairly raw - it contains the default DTrace fields - as this demonstration is more about the source of the D script, tcpio1.d:

#!/usr/sbin/dtrace -s

tcp:::send,
tcp:::receive
{
        printf("%15s:%-5d  ->  %15s:%-5d",
            args[1]->ip_saddr, args[2]->tcp_sport,
            args[1]->ip_daddr, args[2]->tcp_dport);
}

This could not get any easier. For sent and received TCP packets, print the source IP address and source TCP port, and the destination IP address and destination TCP port.

The next demonstration spends a few more lines in our D-script to improve the output.

tcpio2.d

# ./tcpio2.d
           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
   192.168.1.185:79     <-    192.168.1.109:56145      0 (SYN)
   192.168.1.185:79     ->    192.168.1.109:56145      0 (SYN|ACK)
   192.168.1.185:79     <-    192.168.1.109:56145      0 (ACK)
   192.168.1.185:79     <-    192.168.1.109:56145      2 (PUSH|ACK)
   192.168.1.185:79     ->    192.168.1.109:56145      0 (ACK)
   192.168.1.185:79     ->    192.168.1.109:56145    126 (PUSH|ACK)
   192.168.1.185:79     <-    192.168.1.109:56145      0 (ACK)
   192.168.1.185:79     ->    192.168.1.109:56145      0 (FIN|ACK)
   192.168.1.185:79     <-    192.168.1.109:56145      0 (ACK)
   192.168.1.185:79     <-    192.168.1.109:56145      0 (FIN|ACK)
   192.168.1.185:79     ->    192.168.1.109:56145      0 (ACK)
^C

The tcpio2.d script prints headers, TCP payload bytes and TCP flags. Now we can clearly see where the TCP SYN/SYN-ACK/ACK handshake is, right down to the final FINs. Rather than always having the source address on the left and the destination on the right, we now print the local address on the left (LADDR) and the remote address on the right (RADDR); the arrow changes so that we know which direction the packet travelled.

The script for tcpio2.d did get a little longer, but it is still fairly straightforward:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option switchrate=10hz

dtrace:::BEGIN
{
        printf(" %15s:%-5s      %15s:%-5s %6s %s\n",
            "LADDR", "PORT", "RADDR", "PORT", "BYTES", "FLAGS");
}

tcp:::send
{
        this->length = args[1]->ip_plength - args[2]->tcp_offset;
        printf(" %15s:%-5d  ->  %15s:%-5d %6d (",
            args[1]->ip_saddr, args[2]->tcp_sport,
            args[1]->ip_daddr, args[2]->tcp_dport, this->length);
}

tcp:::receive
{
        this->length = args[1]->ip_plength - args[2]->tcp_offset;
        printf(" %15s:%-5d  <-  %15s:%-5d %6d (",
            args[1]->ip_daddr, args[2]->tcp_dport,
            args[1]->ip_saddr, args[2]->tcp_sport, this->length);
}

tcp:::send,
tcp:::receive
{
        printf("%s", args[2]->tcp_flags & TH_FIN ? "FIN|" : "");
        printf("%s", args[2]->tcp_flags & TH_SYN ? "SYN|" : "");
        printf("%s", args[2]->tcp_flags & TH_RST ? "RST|" : "");
        printf("%s", args[2]->tcp_flags & TH_PUSH ? "PUSH|" : "");
        printf("%s", args[2]->tcp_flags & TH_ACK ? "ACK|" : "");
        printf("%s", args[2]->tcp_flags & TH_URG ? "URG|" : "");
        printf("%s", args[2]->tcp_flags & TH_ECE ? "ECE|" : "");
        printf("%s", args[2]->tcp_flags & TH_CWR ? "CWR|" : "");
        printf("%s", args[2]->tcp_flags == 0 ? "null " : "");
        printf("\b)\n");
}

The last section prints each of the flags as a string, if it is set. Note that we can directly match args[2]->tcp_flags using a bitwise AND with meaningful TCP flag names - TH_FIN, TH_SYN, etc. (TH stands for transport header).

This also allows us to match TCP traffic with particular flags in a DTrace predicate, so that scripts could identify and measure different TCP events.
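For example, a sketch of such a predicate (using only the prototype probes and TH_ names shown above) that counts packets whose only flag is SYN - inbound connection attempts - by source address:

```d
/* count SYN-only TCP packets (connection attempts) by source address */
tcp:::receive
/args[2]->tcp_flags == TH_SYN/
{
        @syns[args[1]->ip_saddr] = count();
}
```

Run with dtrace -s; the aggregation is printed when you hit Ctrl-C.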

tcpio2x.d

Now the tcpio2.d script is made a little wider, so that IPv6 addresses look neat and tidy:

# ./tcpio2x.d
C                     LADDR:PORT                           RADDR:PORT   BYTES FLAGS
0    fe80::214:4fff:fe2a:64:79     <-   fe80::214:4fff:fe3b:76c8:61706      0 (SYN)
0    fe80::214:4fff:fe2a:64:79     ->   fe80::214:4fff:fe3b:76c8:61706      0 (SYN|ACK)
0    fe80::214:4fff:fe2a:64:79     <-   fe80::214:4fff:fe3b:76c8:61706      0 (ACK)
0    fe80::214:4fff:fe2a:64:79     <-   fe80::214:4fff:fe3b:76c8:61706      2 (PUSH|ACK)
0    fe80::214:4fff:fe2a:64:79     ->   fe80::214:4fff:fe3b:76c8:61706      0 (ACK)
0    fe80::214:4fff:fe2a:64:79     ->   fe80::214:4fff:fe3b:76c8:61706    126 (PUSH|ACK)
0    fe80::214:4fff:fe2a:64:79     <-   fe80::214:4fff:fe3b:76c8:61706      0 (ACK)
0    fe80::214:4fff:fe2a:64:79     ->   fe80::214:4fff:fe3b:76c8:61706      0 (FIN|ACK)
0    fe80::214:4fff:fe2a:64:79     <-   fe80::214:4fff:fe3b:76c8:61706      0 (ACK)
0    fe80::214:4fff:fe2a:64:79     <-   fe80::214:4fff:fe3b:76c8:61706      0 (FIN|ACK)
0    fe80::214:4fff:fe2a:64:79     ->   fe80::214:4fff:fe3b:76c8:61706      0 (ACK)
^C

Perhaps more exciting is the ability to examine loopback traffic:

# ./tcpio2x.d
C                     LADDR:PORT                           RADDR:PORT   BYTES FLAGS
0                 127.0.0.1:47335  ->                  127.0.0.1:79         0 (SYN)
0                 127.0.0.1:79     <-                  127.0.0.1:47335      0 (SYN)
0                 127.0.0.1:79     ->                  127.0.0.1:47335      0 (SYN|ACK)
0                 127.0.0.1:47335  <-                  127.0.0.1:79         0 (SYN|ACK)
0                 127.0.0.1:47335  ->                  127.0.0.1:79         0 (ACK)
0                 127.0.0.1:79     <-                  127.0.0.1:47335      0 (ACK)
0                 127.0.0.1:79     ->                  127.0.0.1:47335      0 (FIN|ACK)
0                 127.0.0.1:47335  <-                  127.0.0.1:79         0 (FIN|ACK)
0                 127.0.0.1:47335  ->                  127.0.0.1:79         0 (ACK)
0                 127.0.0.1:79     <-                  127.0.0.1:47335      0 (ACK)
0                 127.0.0.1:47335  ->                  127.0.0.1:79         0 (FIN|ACK)
0                 127.0.0.1:79     <-                  127.0.0.1:47335      0 (FIN|ACK)
0                 127.0.0.1:79     ->                  127.0.0.1:47335      0 (ACK)
0                 127.0.0.1:47335  <-                  127.0.0.1:79         0 (ACK)

0                       ::1:51380  ->                        ::1:79         0 (SYN)
0                       ::1:79     <-                        ::1:51380      0 (SYN)
0                       ::1:79     ->                        ::1:51380      0 (SYN|ACK)
0                       ::1:51380  <-                        ::1:79         0 (SYN|ACK)
0                       ::1:51380  ->                        ::1:79         0 (ACK)
0                       ::1:79     <-                        ::1:51380      0 (ACK)
0                       ::1:79     ->                        ::1:51380      0 (FIN|ACK)
0                       ::1:51380  <-                        ::1:79         0 (FIN|ACK)
0                       ::1:51380  ->                        ::1:79         0 (ACK)
0                       ::1:79     <-                        ::1:51380      0 (ACK)
0                       ::1:51380  ->                        ::1:79         0 (FIN|ACK)
0                       ::1:79     <-                        ::1:51380      0 (FIN|ACK)
0                       ::1:79     ->                        ::1:51380      0 (ACK)
0                       ::1:51380  <-                        ::1:79         0 (ACK)
^C

The traffic appears duplicated - this is because the source and destination are on the same host, localhost, and we can trace both the send and the receive of the same packet.

So far this is sufficient to see which loopback connections are being made, what the destination ports are, and the rate of connections; however, note that the BYTES column is always zero. For performance reasons, once loopback connections are established their data transfers bypass costly TCP code (the session becomes "fused"), and the regular tcp:::send and tcp:::receive probes don't fire. It is still possible to trace this traffic, and we plan to do so; we just haven't added the probes yet.

The first column in the previous output is the CPU id. When DTrace prints events as they occur, the output can become shuffled as data is collected from each CPU's buffer in turn. So a CPU column is handy to have if you are on a multi-CPU server, you are printing events as they occur, and you expect them to appear in the correct order. A different tactic is to print a timestamp, then post-process and sort the output.
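The timestamp tactic can be sketched as follows (a minimal example; the sorting happens outside DTrace, e.g. by piping through sort -n):

```d
/* print a nanosecond timestamp first, for post-sorting with sort -n */
tcp:::send,
tcp:::receive
{
        printf("%d %s %s:%d -> %s:%d\n", timestamp, probename,
            args[1]->ip_saddr, args[2]->tcp_sport,
            args[1]->ip_daddr, args[2]->tcp_dport);
}
```

timestamp is a built-in D variable with nanosecond resolution, so sorting numerically on the first field restores event order.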

The Probes

So far we have seen probes with intuitive names - accept-established, send, receive. The entire list of probenames so far is as follows:

# dtrace -lP tcp | awk '{ print $5 }' | sort -u
NAME
accept-established
accept-refused
connect-established
connect-refused
connect-request
receive
send
state-bound
state-close-wait
state-closed
state-closing
state-established
state-fin-wait1
state-fin-wait2
state-idle
state-last-ack
state-listen
state-syn-received
state-syn-sent
state-time-wait

The accept-* probes are high-level events for inbound connections; the connect-* probes are similar events for outbound connections. send and receive fire for all TCP traffic, including ACKs and exotic port scans. The state-* probes allow us to trace TCP state changes by connection.

This collection of probes opens the door to a huge number of tracing possibilities. More probes have been imagined; however, due to concerns about the maintainability of the tcp code, these may not be implemented. At the very least, send and receive allow all traffic to be examined.
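As a quick illustration of the state-* probes, here is a sketch that frequency counts state-change events by probe name (wildcards are valid in D probe descriptions, so one clause covers them all):

```d
/* frequency count TCP state-change events by probe name */
tcp:::state-*
{
        @states[probename] = count();
}
```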

one-liners

Another way to appreciate a provider is to see what one-liners are possible. Here is a collection that now work with this prototype tcp provider:

DTrace Network Providers One-Liners
-----------------------------------

# Watch inbound TCP connections by remote address,
dtrace -n 'tcp:::accept-established { trace(args[1]->ip_saddr); }'

# Inbound TCP connections by remote address summary,
dtrace -n 'tcp:::accept-established { @addr[args[1]->ip_saddr] = count(); }'

# Inbound TCP connections by destination port summary,
dtrace -n 'tcp:::accept-established { @port[args[2]->tcp_dport] = count(); }'

# Outbound TCP connections by destination port summary,
dtrace -n 'tcp:::connect-established { @port[args[2]->tcp_dport] = count(); }'

# Inbound TCP refused connections by destination port summary (port scan?),
dtrace -n 'tcp:::accept-refused { @port[args[2]->tcp_dport] = count(); }'

# Outbound TCP refused connections by destination port summary,
dtrace -n 'tcp:::connect-refused { @port[args[2]->tcp_dport] = count(); }'

# Inbound TCP packets by remote address summary,
dtrace -n 'tcp:::receive { @addr[args[1]->ip_saddr] = count(); }'

# Outbound TCP packets by remote address summary,
dtrace -n 'tcp:::send { @addr[args[1]->ip_daddr] = count(); }'

# Inbound TCP packets by destination port summary,
dtrace -n 'tcp:::receive { @port[args[2]->tcp_dport] = count(); }'

# Outbound TCP packets by destination port summary,
dtrace -n 'tcp:::send { @port[args[2]->tcp_dport] = count(); }'

# sent TCP bytes summary,
dtrace -n 'tcp:::send { @b = sum(args[1]->ip_plength - args[2]->tcp_offset); }'

# TCP events by type summary,
dtrace -n 'tcp::: { @event[probename] = count(); }'

For this short demonstration we won't run each of these one by one - but from the descriptions and code we can get a feeling for what is possible.

tcpbytes1.d

Back to the scripts. Here we create summaries of by-byte traffic:

# ./tcpbytes1.d
dtrace: script './tcpbytes1.d' matched 5 probes
^C

  192.168.1.107                                        513               87
  fe80::214:4fff:fe3b:76c8                              79              128
  192.168.1.109                                         79              256
  192.168.1.109                                         22             9128
  192.168.1.109                                        514         21380483

The tcpbytes1.d script sums sent and received bytes, by remote IP address and local port. Here we can see that 192.168.1.109 transferred 9128 bytes to port 22 on our server. The source to tcpbytes1.d is as follows:

#!/usr/sbin/dtrace -s

tcp:::receive
{
        @bytes[args[1]->ip_saddr, args[2]->tcp_dport] =
            sum(args[1]->ip_plength - args[2]->tcp_offset);
}

tcp:::send
{
        @bytes[args[1]->ip_daddr, args[2]->tcp_sport] =
            sum(args[1]->ip_plength - args[2]->tcp_offset);
}

The bytes are found by taking the payload length from the IP observability info structure (args[1] - which is based on the IP header), and then subtracting the TCP header length (args[2]->tcp_offset - from the TCP header). Both of these members return a byte count for convenience, whereas the RFCs specify some of these fields (such as the TCP data offset) as a number of 32-bit words (of course, the raw TCP header will be made available if people want it; args[2] provides an endian-correct, processed form for convenience).
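Since the payload length is a simple expression, it can also serve in a predicate - for example, this sketch traces only data-carrying segments, skipping the bare ACKs:

```d
/* trace only TCP segments that carry payload bytes */
tcp:::send
/args[1]->ip_plength - args[2]->tcp_offset > 0/
{
        printf("%s:%d  %d bytes\n", args[1]->ip_daddr,
            args[2]->tcp_dport,
            args[1]->ip_plength - args[2]->tcp_offset);
}
```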

unprepared - TCP byte size

About this point in the demonstration, someone in the audience asked Brendan if the size in bytes of the TCP I/O could be quantized. Yes - that's a great idea! Brendan ran the following while Bryan generated some test traffic:

# dtrace -n 'tcp:::send { @q = quantize(args[1]->ip_plength - args[2]->tcp_offset); }'
^C


           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@                                    227
               1 |@                                        24
               2 |                                         4
               4 |                                         0
               8 |                                         1
              16 |                                         1
              32 |                                         6
              64 |                                         1
             128 |                                         0
             256 |                                         2
             512 |                                         2
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       1501
            2048 |                                         0

To start with, let's feed the byte size into a single quantize. We'll examine sent traffic; adding a receive probe as well would be no problem. The above one-liner quantizes by byte size, and we can see that our TCP traffic is in the 0 to 2047 byte range (which we can assume is bounded by the MTU). There is a large spike in the 1024 to 2047 byte bucket, which looks like a data transfer from someone...

By adding keys to our aggregation, we can find out which IP addresses are receiving bytes from us, and from which port they are receiving:

# dtrace -n 'tcp:::send { @q[args[1]->ip_daddr, args[2]->tcp_sport] =
     quantize(args[1]->ip_plength - args[2]->tcp_offset); }'
dtrace: description 'tcp:::send ' matched 2 probes
^C


  192.168.1.107                                         79
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         4
               1 |                                         0
               2 |                                         0
               4 |                                         0
               8 |                                         0
              16 |                                         0
              32 |                                         0
              64 |                                         0
             128 |@@@@@@@@                                 1
             256 |                                         0

  192.168.1.109                                         22
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@                                       5
               1 |                                         0
               2 |                                         0
               4 |                                         0
               8 |                                         0
              16 |                                         1
              32 |@@@@@@@@@@@@@@@@@                        38
              64 |@@@@@@@@@@@@@@@@@                        37
             128 |                                         1
             256 |@@                                       4
             512 |                                         1
            1024 |                                         0

  192.168.1.109                                        514
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |                                         5
               1 |                                         2
               2 |                                         1
               4 |                                         0
               8 |                                         0
              16 |                                         1
              32 |                                         1
              64 |                                         0
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1499
            2048 |                                         0

Superb! This immediately shows that our spike of large packets (this time 1499 packets in the 1024 to 2047 byte bucket) belongs to 192.168.1.109 connecting to port 514. Port 22 (ssh) from the same host is responsible for some medium-sized traffic - mostly 32 to 128 bytes in size; and most of the TCP traffic to port 79 was zero-sized (ACKs).

Now if only we could remember who 192.168.1.109 was, and what port 514 was for...

# dtrace -qn 'tcp:::send { @q[args[1]->ip_daddr, args[2]->tcp_sport] = 
    quantize(args[1]->ip_plength - args[2]->tcp_offset); } END { printa("   %-32I %21P %@d", @q); }'
^C
   marlin                                           shell
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |                                         5
               1 |                                         3
               2 |                                         1
               4 |                                         0
               8 |                                         0
              16 |                                         1
              32 |                                         1
              64 |                                         0
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1499
            2048 |                                         0

We are now using the enhanced printf format characters to fetch the hostname and service name - so it was marlin connecting to our shell port, probably doing an rcp. (We would need to run DTrace on marlin to check.)

tcpbytes2.d

Back to the scripts we brought to the demonstration; this one is tcpbytes2.d:

# ./tcpbytes2.d

   HOST                                      BYTES/s
   192.168.1.109                            21210735

   HOST                                      BYTES/s
   192.168.1.109                            16511328

   HOST                                      BYTES/s
   192.168.1.109                            26249638

   HOST                                      BYTES/s
   192.168.1.107                                  54

   HOST                                      BYTES/s
   192.168.1.109                                 418

   HOST                                      BYTES/s
   fe80::214:4fff:fe3b:76c8                      209

   HOST                                      BYTES/s
^C

Per second, tcpbytes2.d prints a report of TCP payload byte transfers by IP address. In the output we can see that 192.168.1.109 began transferring at around 20 Mbytes/sec. The source was:

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::receive
{
        @bytes[args[1]->ip_saddr, args[2]->tcp_dport] =
            sum(args[1]->ip_plength - args[2]->tcp_offset);
}

tcp:::send
{
        @bytes[args[1]->ip_daddr, args[2]->tcp_sport] =
            sum(args[1]->ip_plength - args[2]->tcp_offset);
}

profile:::tick-1sec
{
        printf("\n   %-32s %16s\n", "HOST", "BYTES/s");
        printa("   %-32s %@16d\n", @bytes);
        trunc(@bytes);
}

This can be customised as desired, and provides a quick summary of who is transferring data to our host.

tcpnmap.d

Since the tcp provider allows us to easily match on TCP flags, it will be possible to write numerous tools that use this capability to look for particular packets. Here is a demonstration that is more about the D script than it is about practicality: long-term defense against port scans involves installing a NIDS (Network Intrusion Detection System), such as snort.

The nmap port scanner is a popular and powerful security tool for host vulnerability and remote identification analysis. By varying the TCP flags, various exotic scans can be performed, including Xmas and null scans. The tcpnmap.d tool examines these flags to find traffic that may be scan events - however (as with the connect() scan) it may also be normal traffic. The differentiator (left to the user) is the volume of these suspicious events.

# ./tcpnmap.d
Tracing for possible nmap scans... Hit Ctrl-C to end.
^C
Possible scan events,

   TYPE                     HOST                            COUNT
   TCP_null_scan            192.168.1.109                     208
   TCP_Xmas_scan            192.168.1.109                     304
   TCP_connect()_scan       192.168.1.109                     388

Here all our scan types had counts of more than 100, which (for this 10 second sample) is evidence of scanning.

The tcpnmap.d script was:

#!/usr/sbin/dtrace -s

#pragma D option quiet

dtrace:::BEGIN
{
        printf("Tracing for possible nmap scans... Hit Ctrl-C to end.\n");
}

tcp:::accept-refused
{
        @num["TCP_connect()_scan", args[1]->ip_daddr] = count();
}

tcp:::receive
/args[2]->tcp_flags == 0/
{
        @num["TCP_null_scan", args[1]->ip_saddr] = count();
}

tcp:::receive
/args[2]->tcp_flags == (TH_URG|TH_PUSH|TH_FIN)/
{
        @num["TCP_Xmas_scan", args[1]->ip_saddr] = count();
}

dtrace:::END
{
        printf("Possible scan events,\n\n");
        printf("   %-24s %-28s %8s\n", "TYPE", "HOST", "COUNT");
        printa("   %-24s %-28s %@8d\n", @num);
}

Null and Xmas scans were identified by matching on the TCP flags. Connect scans were matched simply by examining accept-refused events - an inbound SYN to a closed port. This means that we will see connect() scans to closed ports, but we will miss those to open ones. Wait! We can in fact measure connect() scans to open ports by checking whether accept-established events were immediately closed with a RST - which is what nmap does. In fact, I just verified it by using the tcpio2.d script! Hey - this is really cool - sorry, I didn't do this at CEC, but check this out:

nmap connect() to an open port

# ./tcpio2.d
           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
   192.168.1.185:22     <-    192.168.1.109:60631      0 (SYN)
   192.168.1.185:22     ->    192.168.1.109:60631      0 (SYN|ACK)
   192.168.1.185:22     <-    192.168.1.109:60631      0 (ACK)
   192.168.1.185:22     <-    192.168.1.109:60631      0 (RST)

nmap connect() to a closed port

# ./tcpio2.d
           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
   192.168.1.185:26     <-    192.168.1.109:47113      0 (SYN)
   192.168.1.185:26     ->    192.168.1.109:47113      0 (RST|ACK)

nmap TCP Xmas to an open port

# ./tcpio2.d
           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
   192.168.1.185:22     <-    192.168.1.109:43968      0 (FIN|PUSH|URG)

nmap TCP Xmas to a closed port

# ./tcpio2.d
           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
   192.168.1.185:26     <-    192.168.1.109:33466      0 (FIN|PUSH|URG)

nmap TCP Null to an open port

# ./tcpio2.d
           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
   192.168.1.185:22     <-    192.168.1.109:53034      0 (null)

nmap TCP Null to a closed port

# ./tcpio2.d
           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
   192.168.1.185:26     <-    192.168.1.109:61054      0 (null)

nmap TCP Stealth to an open port

# ./tcpio2.d
           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
   192.168.1.185:22     <-    192.168.1.109:39438      0 (SYN)
   192.168.1.185:22     ->    192.168.1.109:39438      0 (SYN|ACK)
   192.168.1.185:22     <-    192.168.1.109:39438      0 (RST)

nmap TCP Stealth to a closed port

# ./tcpio2.d
           LADDR:PORT                 RADDR:PORT   BYTES FLAGS
   192.168.1.185:26     <-    192.168.1.109:57433      0 (SYN)
   192.168.1.185:26     ->    192.168.1.109:57433      0 (RST|ACK)

Very, very cool. Bryan commented that this script needs to be used at universities when teaching TCP - you can probably see why!

Note that Solaris doesn't respond to null or Xmas scans -- good Solaris! :-)
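That open-port connect() scan signature - an accept-established followed almost immediately by an inbound RST - could be folded into tcpnmap.d. Here is a rough sketch of the idea (untested; it assumes the prototype probe and argument names used above, arg0 as the connection-ID, and picks an arbitrary 100 ms threshold for "immediately"):

#!/usr/sbin/dtrace -s

tcp:::accept-established
{
        /* remember when this inbound connection completed, keyed by cid */
        born[arg0] = timestamp;
}

tcp:::receive
/born[arg0] && (args[2]->tcp_flags & TH_RST) &&
    timestamp - born[arg0] < 100000000/
{
        /* a RST this soon after the accept looks like a connect() scan */
        @num["TCP_connect()_scan_open", args[1]->ip_saddr] = count();
}

tcp:::receive
/born[arg0] && (args[2]->tcp_flags & TH_RST)/
{
        born[arg0] = 0;
}

The aggregated counts print automatically when the script exits.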

tcpstate.d

A TCP connection moves between well-defined states (see RFC 793), which are highly useful to observe. The following script simply prints when the state has changed, showing the previous and the new state:

# ./tcpstate.d
C   PREV                    NEW
0   state-established    -> state-idle
0   state-idle           -> state-bound
0   state-bound          -> state-syn-sent
0   state-syn-sent       -> state-established
0   state-established    -> state-close-wait
0   state-close-wait     -> state-last-ack
0   state-last-ack       -> state-closed
^C

And the script was:

#!/usr/sbin/dtrace -s

#pragma D option quiet

dtrace:::BEGIN
{
        printf("%s   %-20s    %-20s\n", "C", "PREV", "NEW");
}

tcp:::state-bound,
tcp:::state-close-wait,
tcp:::state-closed,
tcp:::state-closing,
tcp:::state-established,
tcp:::state-fin-wait1,
tcp:::state-fin-wait2,
tcp:::state-idle,
tcp:::state-last-ack,
tcp:::state-listen,
tcp:::state-syn-received,
tcp:::state-syn-sent,
tcp:::state-time-wait
{
        printf("%d   %-20s -> %-20s\n", cpu, args[2]->ts_prevstr, probename);
}

All TCP probes provide a connection-ID as arg0, similar to a process-ID, so that we can uniquely identify TCP connections. Where possible (it may not make sense in all contexts), we will also make it available as the variable cid. Apart from the old and new states, these state probes also provide the cid. They will not, however, provide IP or TCP header information - such as IP addresses or ports - as the header information is not in a stable state when these probes fire. This is not a great problem - the cid allows us to associate IP and TCP header details with the state events, by caching them when they are available.
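As a sketch of that caching approach (untested; it assumes the prototype probe names above, arg0 as the cid, and the string ip_daddr member used in the other scripts), we could remember the peer address when a connection is established, and report it when the connection later reaches the closed state:

#!/usr/sbin/dtrace -s

#pragma D option quiet

tcp:::connect-established,
tcp:::accept-established
{
        /* cache the peer address, keyed by connection-ID */
        peer[arg0] = args[1]->ip_daddr;
        known[arg0] = 1;
}

tcp:::state-closed
/known[arg0]/
{
        printf("cid %d to %s closed\n", arg0, peer[arg0]);
        /* assigning zero deallocates the associative array elements */
        known[arg0] = 0;
        peer[arg0] = 0;
}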

About cids: if we had a system with sudden CPU load and suspected a runaway process, our instinct would be to go hunting for the PID responsible. What about a server with sudden network load? We need a similar mechanism to refer to established network connections, so that tools can be written to associate and present per-cid data. The kernel already has such a mechanism, the struct conn_s *, and we plan to make the cid an integer form of that pointer (it's unique, it's valid for the life of the session, and it works for both TCP and UDP). Perhaps years from now we will take for granted a wide range of by-cid tools (powered by DTrace) for network connection observability, just as we currently have a range of by-pid tools (powered by procfs). (Either that or I invent my own filesystem - /connfs ;-) - Brendan)

These state probes should allow some really amazing scripts to be written - watch this space!

tcpconnlat.d

The tcp provider gives us the means to measure numerous new and useful latency metrics. To begin with, there is latency incurred for the TCP 3-way handshake to complete for outbound connections. The tcpconnlat.d script measures this.

# ./tcpconnlat.d
dtrace: script './tcpconnlat.d' matched 2 probes
^C

  Connect Latency (ns)                                192.168.1.109
           value  ------------- Distribution ------------- count
           65536 |                                         0
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@                 3
          262144 |@@@@@@@@@@@@@@@@                         2
          524288 |                                         0

  Connect Latency (ns)                                72.5.124.61
           value  ------------- Distribution ------------- count
         4194304 |                                         0
         8388608 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           3
        16777216 |@@@@@@@@@@                               1
        33554432 |                                         0

192.168.1.109 is a host on the same subnet; 72.5.124.61 is www.sun.com. This measures the time from the outbound SYN to the final ACK of the 3-way handshake. So a nearby host was connecting in less than 0.3 ms, while www.sun.com was connecting in around 10 ms. This latency includes network hop latency, plus the time for the remote kernel to create a new TCP session and reply.

This example is much more interesting than what we could do at the CEC conference, where we only had two laptops and a crossover cable!

The source for tcpconnlat.d is:

#!/usr/sbin/dtrace -s

tcp:::connect-request
{
        start[arg0] = timestamp;
}

tcp:::connect-established
/start[arg0]/
{
        @latency["Connect Latency (ns)", args[1]->ip_daddr] =
            quantize(timestamp - start[arg0]);
        start[arg0] = 0;
}
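As a variation on the same approach (untested, same assumptions as tcpconnlat.d above), the quantize() aggregation could be swapped for avg() and max() to give one summary line per destination instead of a distribution:

#!/usr/sbin/dtrace -s

tcp:::connect-request
{
        start[arg0] = timestamp;
}

tcp:::connect-established
/start[arg0]/
{
        this->delta = timestamp - start[arg0];
        @avgl["avg connect latency (ns)", args[1]->ip_daddr] = avg(this->delta);
        @maxl["max connect latency (ns)", args[1]->ip_daddr] = max(this->delta);
        start[arg0] = 0;
}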

tcp1stbyte.d

Another useful latency metric is the time from when an outbound connection is established to when the first application data arrives - which we can call "1st byte latency". The tcp1stbyte.d script measures this:

# ./tcp1stbyte.d
dtrace: script './tcp1stbyte.d' matched 4 probes
^C

  1st Byte Latency (ns)                               192.168.1.109
           value  ------------- Distribution ------------- count
         8388608 |                                         0
        16777216 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
        33554432 |@@@@@@@@@@@@@                            1
        67108864 |                                         0

  1st Byte Latency (ns)                               72.5.124.61
           value  ------------- Distribution ------------- count
         8388608 |                                         0
        16777216 |@@@@@@                                   1
        33554432 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       6
        67108864 |                                         0

This is the time for the first TCP data byte to be received, and includes both network latency and application latency. Now www.sun.com is mostly in the 32 to 64 ms bucket, and these times are much larger than the connection latencies. We would usually expect this application latency to be larger, as it includes the time for the application to be scheduled, read the request, process it and respond - which is likely to be much slower than a streamlined kernel TCP handshake.

The source for tcp1stbyte.d is:

#!/usr/sbin/dtrace -s

tcp:::connect-established
{
        start[arg0] = timestamp;
}

tcp:::receive
/start[arg0] && (args[1]->ip_plength - args[2]->tcp_offset) > 0/
{
        @latency["1st Byte Latency (ns)", args[1]->ip_saddr] =
            quantize(timestamp - start[arg0]);
        start[arg0] = 0;
}

Conclusion

Hopefully these short demonstrations of this prototype tcp provider will help convey what we are trying to achieve. This is by no means a certain addition to Solaris, and there are many challenges ahead.

Questions

Q: Where can I get a BFU archive so I can help by testing this?
A: We'll put something together and post on DTrace-discuss.

