
zfsday 2012: ZFS Performance Analysis and Tools

zfsday 2012 talk by Brendan Gregg.

Video: http://www.youtube.com/watch?v=xkDqe6rIMa0

Description: "The performance of the file system, or disks, is often the target of blame, especially in multi-tenant cloud environments. At Joyent we deploy a public cloud on ZFS-based systems, and frequently investigate performance with a wide variety of applications in growing environments. This talk is about ZFS performance observability, showing the tools and approaches we use to quickly show what ZFS is doing. This includes observing ZFS I/O throttling, an enhancement added to illumos-ZFS to isolate performance between neighbouring tenants, and the use of DTrace and heat maps to examine latency distributions and locate outliers."




PDF: zfsperftools2012.pdf

Keywords (from pdftotext):

slide 1:
    ZFS Performance
    Analysis and Tools
    Brendan Gregg
    Lead Performance Engineer
    brendan@joyent.com
    @brendangregg
    October, 2012
    
slide 2:
    whoami
    • G’Day, I’m Brendan
    • These days I do systems performance analysis of the cloud
    • Regarding ZFS:
    • Perf analysis of ZFS (mostly using DTrace) for 5+ years,
    both enterprise and cloud usage
    • Wrote many DTrace-based ZFS perf analysis tools
    including those in the DTrace book
    • Developed ZFS L2ARC while at Sun
    
slide 3:
    Who is Joyent
    • Cloud computing provider (public cloud + software)
    • Use ZFS as much as possible:
    • Host storage
    • Guest storage: OS virtualization (SmartOS), and KVM guests
    (Linux, Windows)
    • We use ZFS because
    • Reliability: checksums, COW
    • Features: snapshots, clones, compression, ...
    • Performance: large ARCs
    • It can boil oceans
    
slide 4:
    Joyent, cont.
    • We build tools for ZFS automation and observability.
    • Performance is a key company feature.
    • Need to solve FS/disk issues fast.
    
slide 5:
    Agenda
    • My top 12 tools for ZFS performance analysis (unsorted):
    • iostat
    • vfsstat
    • zfsslower.d
    • iosnoop
    • iostacks
    • metaslab_free.d
    • spasync.d
    • arcstat
    • arcaccess.d
    • latency counters
    • scatter plots
    • heat maps (CA)
    For cloud computing from within a Zone, add:
    • mysqld_pid_fslatency.d
    • syscall with fi_fs == zfs
    
slide 6:
    Functional diagram: full stack
    • Unix 101
    [diagram] Process (user-land) -> Syscall Interface -> VFS -> ZFS ->
    Block Device Interface -> Disks (kernel); logical I/O at the VFS,
    physical I/O at the disks
    
slide 7:
    Functional diagram: full stack, cont.
    • Unix 102
    [diagram] Process -> Syscall Interface -> VFS -> ZFS -> Block Device Interface ->
    Disks; I/O at the syscall interface is synchronous, while the disk I/O seen by
    iostat(1) is often asynchronous: write buffering, read ahead
    
slide 8:
    Functional diagram: full stack, cont.
    • DTrace 301
    [diagram] tools by layer: Syscall Interface: mysql_pid_fslatency.d, syscall with
    fi_fs == zfs; VFS: vfsstat; ZFS: zioslower.d, spasync.d, metaslab_free.d,
    arcstat.pl, arcaccess.d, iostacks.d; Block Device Interface: iostat, iosnoop;
    kernel drivers as needed (see DTrace book chap 4); Disks
    
slide 9:
    ZFS Internals
    • That’s just my top 12
    • Use more as needed
    
slide 10:
    ZFS Internals
    • That’s just my top 12
    • Use more as needed
    DTRACE ALL THE THINGS!
    http://hub.opensolaris.org/bin/view/Community+Group+zfs/source
    http://hyperboleandahalf.blogspot.com/2010/06/this-is-why-ill-never-be-adult.html
    
slide 11:
    iostat
    • Block-device level (almost disk-level) I/O statistics:
    $ iostat -xnz 1
    [...]
                        extended device statistics
        r/s    w/s    kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
    [per-interval rows garbled in extraction; devices c0t0d0 and c0t1d0 shown]
    Annotations: the request columns (r/s, w/s, kr/s, kw/s) show the ZFS -> disk
    workload; the latency and utilization columns show the resulting disk performance.
    
slide 12:
    iostat, cont.
    • Effective tool for a class of disk issues, especially:
    $ iostat -xnz 1
    [...]
                        extended device statistics
        r/s    w/s    kr/s     kw/s wait actv wsvc_t asvc_t  %w  %b device
    [rows garbled in extraction; c0t0d0 is repeatedly shown at %b 100 while
    completing almost no I/O]
    • Disks "out to lunch" (PERC ECC error)
    
slide 13:
    iostat, cont.
    • Minor nits:
    • does not show read/write latency separately. ZFS TXG flushes
    drag up the latency, which looks alarming, but they are
    asynchronous. Can use DTrace for the split (sketched below).
    • no higher level context: PID, ZFS dataset, file pathname, ...
    (not its role)
    • Major problem (although not iostat's fault): commonly confused
    with application-level (logical) I/O.
    • The I/O rates, sizes, and latency can differ dramatically
    between logical file system I/O and physical disk I/O.
    • Users commonly draw the wrong conclusions when given only
    iostat statistics to understand a system.
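    • For example, a minimal sketch of that read/write split (using the io provider,
    with the same b_edev/b_blkno keys and B_READ test that iosnoop uses, shown later):
    #!/usr/sbin/dtrace -s
    /* block I/O latency, quantized separately for reads and writes */
    io:::start
    {
            start_ts[args[0]->b_edev, args[0]->b_blkno] = timestamp;
    }
    io:::done
    /start_ts[args[0]->b_edev, args[0]->b_blkno]/
    {
            @[args[0]->b_flags & B_READ ? "read (ns)" : "write (ns)"] =
                quantize(timestamp - start_ts[args[0]->b_edev, args[0]->b_blkno]);
            start_ts[args[0]->b_edev, args[0]->b_blkno] = 0;
    }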
    
slide 14:
    iostat, cont.
    • iostat output (or disk kstats) graphed by various monitoring software
    So MANY graphs!
    
slide 15:
    iostat, cont.
    For so LITTLE visibility: of the full-stack diagram (mysql_pid_fslatency.d and
    syscall with fi_fs == zfs at the syscall interface; vfsstat at VFS; zioslower.d,
    spasync.d, metaslab_free.d, arcstat.pl, arcaccess.d, iostacks.d at ZFS), iostat
    and iosnoop observe only the block device interface and disks.
    This leaves many perf issues unsolved.
    
slide 16:
    vfsstat
    • VFS-level I/O statistics (VFS-iostat):
    # vfsstat -Z 1
      r/s   w/s   kr/s   kw/s ractv wactv read_t writ_t   d/s  del_t zone
    [per-zone rows garbled in extraction; zones shown: global (0), 9cc2d0d3 (2),
     72188ca0 (3), 4d2a62bb (4), 8bbc4000 (5), d305ee44 (6), 9897c8f5 (7),
     5f3c7d9e (9), 22ef87fc (10), with del_t from 0.0 (global) up to 132.0]
    Annotations: the request columns show the app -> ZFS workload; the latency
    columns show the resulting ZFS performance; d/s and del_t show ZFS I/O throttling.
    
slide 17:
    vfsstat, cont.
    • Good high-level summary of logical I/O: the application's FS workload
    • Summarizes by zone (a rough DTrace fallback is sketched below)
    • Impetus was observability for cloud "noisy neighbors"
    • Shows the effect of ZFS I/O throttling (performance isolation)
    • Summarizes the performance that applications actually experience!
    • Usually a lot better than disk-level, due to ZFS caching (ARC,
    L2ARC) and buffering
    • Required kernel changes, new kstats (thanks Bill Pijewski)
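    • If vfsstat is not installed, a rough sketch of the per-zone idea (from the
    global zone, using fbt; it counts VFS-level calls for all file system types,
    and does not show latency or throttling):
    # dtrace -n 'fbt::fop_read:entry,fbt::fop_write:entry
        { @[zonename, probefunc] = count(); }
        tick-1s { printa("  %-20s %-10s %@d\n", @); trunc(@); }'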
    
slide 18:
    zfsslower.d
    • ZFS reads/writes slower than 10 ms:
    # ./zfsslower.d 10
    TIME                 PROCESS  D   ms FILE
    2012 Sep 30 04:56:22 beam.smp R   12 /var/db/riak/leveldb/.../205788.sst
    2012 Sep 30 04:56:23 beam.smp R   15 /var/db/riak/leveldb/.../152831.sst
    2012 Sep 30 04:56:24 beam.smp R   11 /var/db/riak/leveldb/.../220432.sst
    2012 Sep 30 04:56:24 beam.smp R   12 /var/db/riak/leveldb/.../208619.sst
    2012 Sep 30 04:56:25 beam.smp R   21 /var/db/riak/leveldb/.../210710.sst
    2012 Sep 30 04:56:25 beam.smp R   18 /var/db/riak/leveldb/.../217757.sst
    2012 Sep 30 04:56:25 beam.smp R   13 /var/db/riak/leveldb/.../191942.sst
    2012 Sep 30 04:56:26 cat          10 /db/run/beam.smp.pid
    2012 Sep 30 04:56:26 beam.smp R   11 /var/db/riak/leveldb/.../220389.sst
    2012 Sep 30 04:56:27 beam.smp R   12 /var/db/riak/leveldb/.../186749.sst
    [...]
    • Traces at the VFS level to show the I/O time the application actually suffered
    • allows immediate confirm/deny of an FS (incl. disk) based issue
    
slide 19:
    zfsslower.d
    • ZFS reads/writes slower than 100 ms:
    # ./zfsslower.d 100
    TIME                 PROCESS  D   ms FILE
    2012 Sep 30 05:01:17 beam.smp R  144 /var/db/riak/leveldb/.../238108.sst
    2012 Sep 30 05:01:54 beam.smp R  149 /var/db/riak/leveldb/.../186222.sst
    2012 Sep 30 05:02:35 beam.smp R  188 /var/db/riak/leveldb/.../200051.sst
    2012 Sep 30 05:02:35 beam.smp R  159 /var/db/riak/leveldb/.../209376.sst
    2012 Sep 30 05:02:35 beam.smp R  178 /var/db/riak/leveldb/.../203436.sst
    2012 Sep 30 05:02:40 beam.smp R  172 /var/db/riak/leveldb/.../204688.sst
    2012 Sep 30 05:03:11 beam.smp R  200 /var/db/riak/leveldb/.../219837.sst
    2012 Sep 30 05:03:38 beam.smp R  142 /var/db/riak/leveldb/.../222443.sst
    [...]
    less frequent
    
slide 20:
    zfsslower.d, cont.
    • All ZFS read/writes:
    # ./zfsslower.d
    TIME                 PROCESS  D   ms FILE
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../221844.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../221155.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../215917.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../190134.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../234539.sst
    2012 Sep 30 04:46:09 nginx         0 /db/log/apps/prod17_nginx_access.log
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../205368.sst
    2012 Sep 30 04:46:09 beam.smp R    1 /var/db/riak/leveldb/.../199665.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../177866.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../177866.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../177866.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../177866.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../218473.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../210712.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../234194.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../233302.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../233302.sst
    2012 Sep 30 04:46:09 beam.smp R    0 /var/db/riak/leveldb/.../233302.sst
    [...]
    
slide 21:
    zfsslower.d, cont.
    • Written in DTrace
    [...]
    fbt::zfs_read:entry,
    fbt::zfs_write:entry
    {
            self->path = args[0]->v_path;
            self->kb = args[1]->uio_resid / 1024;
            self->start = timestamp;
    }

    fbt::zfs_read:return,
    fbt::zfs_write:return
    /self->start && (timestamp - self->start) >= min_ns/
    {
            this->iotime = (timestamp - self->start) / 1000000;
            this->dir = probefunc == "zfs_read" ? "R" : "W";
            printf("%-20Y %-16s %1s %4d %6d %s\n", walltimestamp,
                execname, this->dir, self->kb, this->iotime,
                self->path != NULL ? stringof(self->path) : "<null>");
    }
    [...]
    • zfsslower.d, also on github, originated from the DTrace book
    
slide 22:
    zfsslower.d, cont.
    • Traces VFS/ZFS interface (kernel)
    from usr/src/uts/common/fs/zfs/zfs_vnops.c:
    /* Regular file vnode operations template */
    vnodeops_t *zfs_fvnodeops;
    const fs_operation_def_t zfs_fvnodeops_template[] = {
        VOPNAME_OPEN,     { .vop_open = zfs_open },
        VOPNAME_CLOSE,    { .vop_close = zfs_close },
        VOPNAME_READ,     { .vop_read = zfs_read },
        VOPNAME_WRITE,    { .vop_write = zfs_write },
        VOPNAME_IOCTL,    { .vop_ioctl = zfs_ioctl },
        VOPNAME_GETATTR,  { .vop_getattr = zfs_getattr },
    [...]
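    • Since zfs_read() and zfs_write() are the ZFS entry points for these vnode
    operations, a quick sketch (same fbt probes zfsslower.d uses) can count which
    processes are calling them before running the full script:
    # dtrace -n 'fbt::zfs_read:entry,fbt::zfs_write:entry
        { @[execname, probefunc] = count(); }'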
    
slide 23:
    iosnoop
    • Traces block-device-level I/O:
    # ./iosnoop
      UID   PID D     BLOCK   SIZE      COMM PATHNAME
      103  5891 R 238517878 131072  beam.smp <none>
      103  5891 R 544800120 131072  beam.smp <none>
      103  5891 R 286317766 131072  beam.smp <none>
      103  5891 R  74515841 131072  beam.smp <none>
      103  5891 R  98341339 131072  beam.smp <none>
    [...]
    • Has many options:
    # ./iosnoop -Dots
    STIME(us) TIME(us) DELTA(us) DTIME(us) UID PID D BLOCK SIZE COMM PATHNAME
    [timestamp and UID columns lost in extraction; rows show PID 5891 (beam.smp)
    doing 131072-byte reads at blocks 128302372, 143783200, 84913822, 14964144,
    283398320 and 288468148]
    [...]
    
slide 24:
    iosnoop, cont.
    • Written in DTrace
    [...]
    io:genunix::done
    /start_time[args[0]->b_edev, args[0]->b_blkno]/
    [...]
    /* fetch entry values */
    this->dev = args[0]->b_edev;
    this->blk = args[0]->b_blkno;
    this->suid = start_uid[this->dev, this->blk];
    this->spid = start_pid[this->dev, this->blk];
    this->sppid = start_ppid[this->dev, this->blk];
    self->sargs = (int)start_args[this->dev, this->blk] == 0 ?
        "" : start_args[this->dev, this->blk];
    self->scomm = start_comm[this->dev, this->blk];
    [...]
    printf("%5d %5d %1s %8d %6d ",
        this->suid, this->spid, args[0]->b_flags & B_READ ? "R" : "W",
        args[0]->b_blkno, args[0]->b_bcount);
    [...]
    • From the DTraceToolkit
    
slide 25:
    iosnoop, cont.
    • 370 lines of code, mostly to process options:
    USAGE: iosnoop [-a|-A|-DeghiNostv] [-d device] [-f filename]
    [-m mount_point] [-n name] [-p PID]
    iosnoop             # default output
        -a              # print all data (mostly)
        -A              # dump all data, space delimited
        -D              # print time delta, us (elapsed)
        -e              # print device name
        -g              # print command arguments
        -i              # print device instance
        -N              # print major and minor numbers
        -o              # print disk delta time, us
        -s              # print start time, us
        -t              # print completion time, us
        -v              # print completion time, string
        -d device       # instance name to snoop
        -f filename     # snoop this file only
        -m mount_point  # this FS only
        -n name         # this process name only
        -p PID          # this PID only
    eg,
    iosnoop -v          # human readable timestamps
    iosnoop -N          # print major and minor numbers
    iosnoop -m /        # snoop events on filesystem / only
    
slide 26:
    iosnoop, cont.
    • I originally wrote it in 2004 to solve disk-I/O-by-process observability
    • Shipped by default on Mac OS X (/usr/bin), Oracle Solaris 11
    • I wrote a companion called iotop. A similar iotop tool later
    appeared for Linux (via the blk tracing framework)
    • Doesn't do ZFS pathnames yet (known issue; hackathon?)
    • Pathnames from the block-device layer were always a party
    trick: relied on a cached vnode->v_path
    • ZFS aggregates I/O. At the block-device layer, no single vnode
    is responsible (or vnode pointer available)
    
slide 27:
    iosnoop, cont.
    • Locate latency origin:
    [diagram] Process -> Syscall Interface -> VFS -> ZFS -> Block Device Interface ->
    Disks; zfsslower.d traces at the VFS/ZFS level, iosnoop at the block device level:
    correlate between the two to locate where the latency originates
    
slide 28:
    iostacks.d
    • Just a one-liner really (could make an “iostacks.d”):
    # dtrace -n 'io:::start { @[stack()] = count(); }'
    dtrace: description 'io:::start ' matched 6 probes
    genunix`ldi_strategy+0x53
    zfs`vdev_disk_io_start+0xcc
    zfs`zio_vdev_io_start+0xab
    zfs`zio_execute+0x88
    zfs`zio_nowait+0x21
    zfs`vdev_mirror_io_start+0xcd
    zfs`zio_vdev_io_start+0x250
    zfs`zio_execute+0x88
    zfs`zio_nowait+0x21
    zfs`arc_read_nolock+0x4f9
    zfs`arc_read+0x96
    zfs`dsl_read+0x44
    zfs`dbuf_read_impl+0x166
    zfs`dbuf_read+0xab
    zfs`dmu_buf_hold_array_by_dnode+0x189
    zfs`dmu_buf_hold_array+0x78
    zfs`dmu_read_uio+0x5c
    zfs`zfs_read+0x1a3
    genunix`fop_read+0x8b
    genunix`read+0x2a7
    
slide 29:
    iostacks.d, cont.
    • Stack recognition chart:
    # dtrace -n 'io:::start { @[stack()] = count(); }'
    dtrace: description 'io:::start ' matched 6 probes
    genunix`ldi_strategy+0x53
    zfs`vdev_disk_io_start+0xcc
    zfs`zio_vdev_io_start+0xab
    zfs`zio_execute+0x88
    zfs`zio_nowait+0x21
    zfs`vdev_mirror_io_start+0xcd
    zfs`zio_vdev_io_start+0x250
    zfs`zio_execute+0x88
    zfs`zio_nowait+0x21
    zfs`arc_read_nolock+0x4f9
    zfs`arc_read+0x96
    zfs`dsl_read+0x44
    zfs`dbuf_read_impl+0x166
    zfs`dbuf_read+0xab
    zfs`dmu_buf_hold_array_by_dnode+0x189
    zfs`dmu_buf_hold_array+0x78
    zfs`dmu_read_uio+0x5c
    zfs`zfs_read+0x1a3
    genunix`fop_read+0x8b
    genunix`read+0x2a7
    (annotations: read() syscall at the bottom of the stack; arc-miss in the middle)
    
slide 30:
    iostacks.d, cont.
    • Stack recognition chart:
    # dtrace -n 'io:::start { @[stack()] = count(); }'
    dtrace: description 'io:::start ' matched 6 probes
    genunix`ldi_strategy+0x53
    zfs`vdev_disk_io_start+0xcc
    zfs`zio_vdev_io_start+0xab
    zfs`zio_execute+0x88
    zfs`vdev_queue_io_done+0x70
    zfs`zio_vdev_io_done+0x80
    zfs`zio_execute+0x88
    genunix`taskq_thread+0x2d0
    unix`thread_start+0x8
    (annotation: this stack is the ZIO pipeline)
    
slide 31:
    iostacks.d, cont.
    • From zio, when parent == NULL (first zio):
    # dtrace -n 'zio_create:entry /arg0 == NULL/ { @[stack()] = count(); }'
    [...]
    zfs`zio_null+0x77
    zfs`zio_root+0x2d
    zfs`dmu_buf_hold_array_by_dnode+0x113
    zfs`dmu_buf_hold_array+0x78
    zfs`dmu_write+0x80
    zfs`space_map_sync+0x288
    zfs`metaslab_sync+0x135
    zfs`vdev_sync+0x7f
    zfs`spa_sync+0x38b
    zfs`txg_sync_thread+0x204
    unix`thread_start+0x8
    (annotation: SPA sync)
    
slide 32:
    iostacks.d, cont.
    • Can identify the reason for disk I/O
    • including unexplained additional I/O from ZFS
    • When that doesn't work (the I/O was issued from the ZIO pipeline task pool),
    try a similar one-liner on fbt::zio_create:entry, to get the stack trace from
    the creation of all ZIOs, as shown below.
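    • For example (a sketch of that fallback; tracing zio_create catches the creation
    of every ZIO, so it is more verbose than the io:::start one-liner):
    # dtrace -n 'fbt::zio_create:entry { @[stack()] = count(); }'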
    
slide 33:
    metaslab_free.d
    • Traces ZFS metaslab details:
    # ./metaslab_free.d
    Tracing ZFS metaslab alloc.
    metaslab_df_free_pct = 4 %
    2012 Sep 30 05:32:47 free %pct by allocations:
    2012 Sep 30 05:32:49 free %pct by allocations:
    • ZFS pools are split into metaslabs (eg, 10 GB each)
    • Metaslabs have two allocation algorithms:
    metaslab %free >= metaslab_df_free_pct == first fit (fast)
    metaslab %free <  metaslab_df_free_pct == best fit (slow)
    (the current value of the tunable can be printed as sketched below)
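    • One way to check the current threshold (a sketch reusing the kernel-variable
    backquote syntax from metaslab_free.d on the next slide; mdb -k works too):
    # dtrace -qn 'dtrace:::BEGIN { printf("metaslab_df_free_pct = %d\n",
        `metaslab_df_free_pct); exit(0); }'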
slide 34:
    metaslab_free.d, cont.
    • Written in DTrace
    dtrace:::BEGIN
    {
            printf("Tracing ZFS metaslab alloc.  metaslab_df_free_pct = %d %%\n",
                `metaslab_df_free_pct);
    }

    fbt::metaslab_df_alloc:entry
    {
            this->pct = args[0]->sm_space * 100 / args[0]->sm_size;
            @[this->pct] = count();
    }

    profile:::tick-1s
    {
            printf("\n%Y free %%pct by allocations:", walltimestamp);
            printa(@);
            trunc(@);
    }
    • metaslab_free.d is also on github
    
slide 35:
    metaslab_free.d, cont.
    • Shows if allocations are currently fast fit or best fit
    • Correlate to performance changes
    • Use while tuning
    • Can't just try out a new setting and watch performance: perf
    can improve if ZFS switched to a new, emptier metaslab at
    the same time. The script identifies whether that occurred.
    
slide 36:
    spasync.d
    • Traces ZFS SPA syncs:
    # ./spasync.d
    Tracing ZFS spa_sync() slower than 1 ms...
    2012 Sep 30 05:56:07 zones     30 ms, 25 MB 464 I/O
    2012 Sep 30 05:56:12 zones     22 ms, 15 MB 366 I/O
    2012 Sep 30 05:56:17 zones     39 ms, 19 MB 361 I/O
    2012 Sep 30 05:56:22 zones    143 ms, 31 MB 536 I/O
    2012 Sep 30 05:56:27 zones     32 ms, 27 MB 465 I/O
    2012 Sep 30 05:56:32 zones     27 ms, 24 MB 468 I/O
    2012 Sep 30 05:56:37 zones     31 ms, 23 MB 489 I/O
    2012 Sep 30 05:56:42 zones    200 ms, 85 MB 956 I/O
    2012 Sep 30 05:56:47 zones    122 ms, 46 MB 593 I/O
    2012 Sep 30 05:56:52 zones     48 ms, 18 MB 394 I/O
    2012 Sep 30 05:56:57 zones     43 ms, 16 MB 409 I/O
    2012 Sep 30 05:57:02 zones     33 ms, 27 MB 526 I/O
    [...]
    • Check for correlations with I/O latency
    
slide 37:
    spasync.d, cont.
    • Written in DTrace
    [...]
    fbt::spa_sync:entry
    /!self->start/
    {
            in_spa_sync = 1;
            self->start = timestamp;
            self->spa = args[0];
    }

    io:::start
    /in_spa_sync/
    {
            @io = count();
            @bytes = sum(args[0]->b_bcount);
    }

    fbt::spa_sync:return
    /self->start && (this->ms = (timestamp - self->start) / 1000000) > MIN_MS/
    {
            normalize(@bytes, 1048576);
            printf("%-20Y %-10s %6d ms, ", walltimestamp,
                stringof(self->spa->spa_name), this->ms);
            printa("%@d MB %@d I/O\n", @bytes, @io);
    }
    [...]
    • spasync.d is also on github
    
slide 38:
    spasync.d, cont.
    • Simple and effective: helps diagnose TXG sync issues, providing
    information to help guide tuning.
    • Long heritage: Matt Ahrens -> Roch Bourbonnais -> Ben
    Rockwood -> Brendan Gregg
    
slide 39:
    %iowait
    • I/O wait percent
    # mpstat 1
    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
    [per-CPU rows garbled in extraction; the wt (wait I/O) column is 0 on every CPU]
    [...]
    
slide 40:
    %iowait
    • I/O wait percent ... still zero in illumos!
    # mpstat 1
    CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
    [per-CPU rows garbled in extraction; the wt (wait I/O) column is 0 on every CPU]
    [...]
    
slide 41:
    arcstat
    • ZFS ARC and L2ARC statistics:
    $ ./arcstat 1
    time     read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
    [per-second rows garbled in extraction; reads of 15K-65K/s, with arcsz and c
    both at 14G]
    [...]
    • Statistics are per-interval:
    • read, miss: total ARC accesses, misses
    • miss%, dm%, pm%, mm%: ARC miss percentage total,
    demand, prefetch, metadata
    • dmis, pmis, mmis: misses for demand, prefetch, metadata
    • arcsz, c: ARC size, ARC target size
    
slide 42:
    arcstat, cont.
    • Written in Perl, uses kstats (zfs::arcstats:):
    [...]
    sub snap_stats {
            my %prev = %cur;
            if ($kstat->update()) {
                    printf("<State Changed>\n");
            }
            my $hashref_cur = $kstat->{"zfs"}{0}{"arcstats"};
            %cur = %$hashref_cur;
            foreach my $key (keys %cur) {
                    next if $key =~ /class/;
                    if (defined $prev{$key}) {
                            $d{$key} = $cur{$key} - $prev{$key};
    [...]
    sub calculate {
            %v = ();
            $v{"time"} = strftime("%H:%M:%S", localtime);
            $v{"hits"} = $d{"hits"}/$int;
            $v{"miss"} = $d{"misses"}/$int;
            $v{"read"} = $v{"hits"} + $v{"miss"};
            $v{"hit%"} = 100*$v{"hits"}/$v{"read"} if $v{"read"} > 0;
            $v{"miss%"} = 100 - $v{"hit%"} if $v{"read"} > 0;
    [...]
    • github.com/mharsch/arcstat
    
slide 43:
    arcstat, cont.
    • Options to configure output, including L2ARC stats
    • Crucial data when analyzing cache performance
    (and easier to read than the raw form: kstat -pn arcstats)
    • Originally by Neelakanth Nadgir, then Jason Hellenthal (FreeBSD
    port) and Mike Harsch
    
slide 44:
    arcaccess.d
    • ARC population age:
    # ./arcaccess.d -n 'tick-10s { exit(0); }'
    lbolt rate is 100 Hertz.
    Tracing lbolts between ARC accesses...
    value  ------------- Distribution ------------- count
       -1 |
        0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 729988
    [buckets from 1 through 33554432 lbolts follow; bars and counts lost in extraction]
    Age annotation along the bucket axis: 10 ms, 1 second, 1 minute, 1 hour, 1 day
    
slide 45:
    arcaccess.d, cont.
    • Written in DTrace
    #!/usr/sbin/dtrace -s
    #pragma D option quiet
    dtrace:::BEGIN
    {
            printf("lbolt rate is %d Hertz.\n", `hz);
            printf("Tracing lbolts between ARC accesses...");
    }

    fbt::arc_access:entry
    {
            self->ab = args[0];
            self->lbolt = args[0]->b_arc_access;
    }

    fbt::arc_access:return
    /self->lbolt/
    {
            @ = quantize(self->ab->b_arc_access - self->lbolt);
            self->ab = 0;
            self->lbolt = 0;
    }
    • http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/
    
slide 46:
    arcaccess.d, cont.
    • Shows the population age of the ARC based on access time
    • Helps determine working set size and turnover rate
    • The previous example showed the ARC was so large, it was able to keep
    buffers that had not been accessed in over a day.
    • Turnover rate is low; the working set fits entirely, and is likely
    much smaller.
    
slide 47:
    Latency counters
    • kstat -p zone_vfs:::*_ops shows latency outlier counts since boot (counter
    values lost in extraction):
    $ kstat -p zone_vfs:::*_ops
    zone_vfs:0:global:100ms_ops
    zone_vfs:0:global:10ms_ops
    zone_vfs:0:global:10s_ops
    zone_vfs:0:global:1s_ops
    zone_vfs:1:5cedb79e:100ms_ops
    zone_vfs:1:5cedb79e:10ms_ops
    zone_vfs:1:5cedb79e:10s_ops
    zone_vfs:1:5cedb79e:1s_ops
    • reads/writes:
    $ kstat -p zone_vfs:::reads zone_vfs:::writes
    zone_vfs:0:global:reads 666148884
    zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:reads 1791879457
    zone_vfs:0:global:writes
    zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:writes
    
slide 48:
    Latency counters
    • kstat -p zone_vfs:::*_ops shows latency outlier counts since boot (counter
    values lost in extraction):
    $ kstat -p zone_vfs:::*_ops
    zone_vfs:0:global:100ms_ops
    zone_vfs:0:global:10ms_ops
    zone_vfs:0:global:10s_ops
    zone_vfs:0:global:1s_ops
    zone_vfs:1:5cedb79e:100ms_ops
    zone_vfs:1:5cedb79e:10ms_ops
    zone_vfs:1:5cedb79e:10s_ops
    zone_vfs:1:5cedb79e:1s_ops
    • reads/writes:
    yikes, 430 ops > 1s!
    however, this is a small fraction
    $ kstat -p zone_vfs:::reads zone_vfs:::writes
    zone_vfs:0:global:reads 666148884
    zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:reads 1791879457
    zone_vfs:0:global:writes
    zone_vfs:1:5cedb79e-51c6-41b3-b99d-358b33:writes
    >1s = 0.00002%
    
slide 49:
    Latency counters, cont.
    • Added by Bill Pijewski of Joyent
    • Proven invaluable, usually to quickly eliminate ZFS (and disk) as a
    source of issues, after the fact (when counters are near-zero)
    • Has also identified real ZFS/disk-level issues
    
slide 50:
    Scatter plots
    iosnoop -Dots > outfile, then awk/perl, then gnuplot/R
    [scatter plot: latency (up to 60 ms) vs time (s); read = red, write = blue]
    
slide 51:
    Scatter plots, cont.
    • Has been an effective last resort for investigating nasty PERC
    issues: reads queueing behind writes
    • Has a big data problem: many points, x-y coordinates, takes time
    & space to capture and render.
    • Heat maps solve this, and can be real-time
    
slide 52:
    Heat maps
    • WHAT DOES IT MEAN?
    
slide 53:
    Heat maps
    • WHAT DOES IT MEAN?
    [heat map annotations: disk I/O; DRAM I/O; workload becomes cached]
    
slide 54:
    Heat maps
    • WHAT DOES IT MEAN?
    [heat map annotation: bi-modal distribution]
    
slide 55:
    Heat maps
    • File system latency is commonly bi-modal: cache hits vs misses
    • Average latency, stddev, and 99th percentile don't make a lot of sense here
    • Given line graphs of these, we would not be able to solve
    many problems
    • Need to see the full distribution: heat maps (a quick text approximation
    is sketched below)
    • Also good at identifying outliers
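    • For example, a minimal text sketch of that distribution (reusing the
    fbt::zfs_read probes from zfsslower.d); the quantize() histogram shows the
    bi-modal cache hit/miss split that averages hide:
    # dtrace -n 'fbt::zfs_read:entry { self->ts = timestamp; }
        fbt::zfs_read:return /self->ts/ {
            @["zfs_read (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'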
    
slide 56:
    Heat maps, cont.
    • Latency outliers, cloud wide (entire datacenter):
    
slide 57:
    Heat maps, cont.
    • Latency outliers, cloud wide (entire datacenter):
    [heat map annotations, from highest latency to lowest: Inconceivable, Very Bad,
    Bad, Good]
    
slide 58:
    Cloud computing: within the zone
    • Suspected “noisy neighbors” hurting ZFS performance
    • ZFS I/O throttling was added to address this
    • Need ZFS observability to confirm/deny issues from within a zone
    (eg, a Joyent SmartMachine); however, kernel access (including the
    DTrace io and fbt providers) is not available.
    • vfsstat (already covered) does work, as it is zone-aware
    • Cloud safe DTrace tools include:
    • mysql_pid_latency.d: tracing FS calls from within the app
    • syscall tracing when fds[].fi_fs == "zfs" (sketched below)
    • Scripts are in:
    http://dtrace.org/blogs/brendan/2011/05/18/file-system-latency-part-3/
    http://dtrace.org/blogs/brendan/2011/05/24/file-system-latency-part-4/
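    • A minimal sketch of that syscall approach (read(2)/write(2) only; other
    variants such as pread(2) are omitted for brevity), runnable from within the zone:
    # dtrace -n 'syscall::read:entry,syscall::write:entry
        /fds[arg0].fi_fs == "zfs"/ { self->ts = timestamp; }
        syscall::read:return,syscall::write:return /self->ts/ {
            @[probefunc] = quantize(timestamp - self->ts); self->ts = 0; }'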
    
slide 59:
    Recap
    [diagram] tools by layer: Syscall Interface: mysql_pid_fslatency.d, syscall with
    fi_fs == zfs; VFS: vfsstat; ZFS: zioslower.d, spasync.d, metaslab_free.d,
    arcstat.pl, arcaccess.d, iostacks.d; Block Device Interface: iostat, iosnoop;
    kernel drivers as needed (see DTrace book chap 4); Disks
    
slide 60:
    There’s more
    • There are more tools
    • zpool iostat
    • zpool status
    • degraded? resilvering?
    • zfs get all
    • recsize match workload? compression? dedup?
    • ziostat
    
slide 61:
    There’s more, cont.
    • There are more tools
    • And much, much, more DTrace
    DTRACE ALL THE THINGS?
    
slide 62:
    DTrace book, Chapter 4 Disks: disk I/O subsystem scripts
    Chapter 5 (File Systems) includes ZFS scripts
    
slide 63:
    Future
    • More tools
    • as needed, working on customer issues
    
slide 64:
    Future
    • More tools
    • as needed, working on customer issues
    • But not too many...
    Wenger giant
    
slide 65:
    Thank you!
    • email: brendan@joyent.com
    • twitter: @brendangregg
    • Resources:
    • http://dtrace.org/blogs/brendan
    • http://dtrace.org/blogs/brendan/2012/01/09/activity-of-the-zfs-arc/
    • http://dtrace.org/blogs/brendan/2011/06/03/file-system-latency-part-5/
    • https://github.com/brendangregg/dtrace-cloud-tools/tree/master/fs
    • http://www.dtracebook.com
    • More on latency heat maps:
    http://queue.acm.org/detail.cfm?id=1809426