Monday, February 28, 2011

net-snmp ip-forward table performance problems

Some performance problems are hard and complex, but others seem to be
due to just plain stupidity. This is the saga of SNMP daemon and a
full BGP route table. Way back in 2006, Vyatta discovered that if an
SNMP walk was done on a server with a full BGP route table, it would
peg the CPU and never complete. A full BGP route table is 500K entries
or more so it does a good job of exposing scalability nightmares. The
initial fix was to disable the caching of the route table in SNMP
which made it return no entries. Hardly a good fix, but returning
nothing is better than crashing.

I began investigating with the simple tools of packet capture with wireshark
and syscall capturing with strace. The first discovery was that each
request caused the TCP wrappers library to open and read
/etc/hosts.allow and /etc/hosts.deny. Bogus on two counts:

  1. Debian is shipping the 2 files with no real entries only comments.
    Each packet caused file to be read but there was really no data.
    It would have been better to have the file not exist and have the open fail.
  2. But for our distribution, there was no point in enabling
    TCP wrappers anyway.

The fix was simple to disable tcp-wrappers.

The net-snmp daemon retrieves the ipv4 and ipv6 routing table the old
school way through /proc. This isn't a total disaster but since the
route entries in /proc start with an interface name and net-snmp wants
an ifindex it looks up each entry. That is 300K extra ioctl
calls. Short term hack was to just cache last ifname -> ifindex
translation
; later I replaced it with a netlink route dump
which gives ifindex (surprisingly netlink route dump is already used
in another MIB).

Next observation was that it is stupid to use snmpwalk to walk
the whole system and instead use snmpbulk. This helps but still
the walk would not complete.

The real discovery was when looking at the net-snmp container
code. Internally, net-snmp uses an objectish abstraction to store
data, and the main ones are a flat table and a linked list. The table
is stored in sorted order for fast lookup and sequential access. New
entries are placed at the end of the table and a dirty bit is set for
next lookup. The problem is that each insert also does a lookup for
duplicates which causes a sort
. This makes inserts do quicksort for
each entry -- there is the scalability problem.

To make it more interesting net-snmp creates the route table
twice. First it reads table from /proc and puts entries in one table,
then walks that table to create the cache table used for lookup.

Loading the cache with non-scalable insertion takes several minutes on
a really fast machine, and the cache timeout is 30 seconds. This
ends up causing the CPU load because each request finds a dirty cache
and does a full reload.

Now for the good news, fixing the insert wasn't the hard. The first
step was realizing that the temporary table doesn't have to a table
container, instead it can be changed to a FIFO (linked list). The FIFO
container is O(1) on insert. The actual cache container requires a
different approach. The table container has an unused flag to allow
duplicates in the table. Turning the ALLOW_DUPLICATES flag makes
inserts much faster because the table is not sorted until the first
request. These get the table load down to less than a 5 seconds
on fast machine.

Lastly a couple of other improvements help as well. When the
binary_table is expanded, the code would calloc a new area, copy the
old data and then free the original. This is much worse than just
using realloc which can usually in place expansion when table is
getting large. The sort function can be optimized to avoid calling the
comparison function, and using a faster insertion sort for small sub
sections. These get the load down to less than a second.

Extra credit to the first developer who implements a new net-snmp
container using something better for big tables like AVL or B-tree.