Monday, September 24, 2012

VXLAN for Linux

I just published a Linux kernel implementation of VXLAN for possible inclusion in the 3.7 kernel (patches).
For those unfamiliar with VXLAN, here are some common questions.

Q: What is VXLAN?

It is a proposed standard protocol for carrying layer 2 Ethernet frames over UDP.
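To make "Ethernet over UDP" a bit more concrete, here is a minimal sketch in Python of the 8-byte VXLAN header as laid out in the draft: a flags byte with the "VNI valid" bit set, 24 reserved bits, a 24-bit network identifier (VNI), and 8 more reserved bits. The function names are made up for illustration and this is not the kernel code; the inner Ethernet frame simply follows the header inside the UDP payload.

```python
import struct
from typing import Tuple

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid


def vxlan_encap(vni: int, ethernet_frame: bytes) -> bytes:
    """Prefix an Ethernet frame with a VXLAN header (the result is the UDP payload)."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits)
    # Word 2: VNI (24 bits) + reserved (8 bits)
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + ethernet_frame


def vxlan_decap(payload: bytes) -> Tuple[int, bytes]:
    """Split a UDP payload into (vni, inner Ethernet frame)."""
    if len(payload) < 8:
        raise ValueError("payload shorter than a VXLAN header")
    word1, word2 = struct.unpack("!II", payload[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return word2 >> 8, payload[8:]
```

The receiver uses the VNI to decide which virtual layer 2 segment the inner frame belongs to.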

Q: What is the VXLAN protocol?

The standard is still under development; the current draft RFC is at version 2.

Q: Why do we need yet another tunnel protocol? Why not just use GRE?

Existing tunnel protocols depend on properties of the backbone which may not be available. Generic Routing Encapsulation (GRE) works by tunneling directly over IP, and may be blocked at routers by firewalls that only accept TCP and UDP.

Q: Does Open vSwitch already do VXLAN?

The development version of Open vSwitch does have VXLAN support, but OVS is fundamentally different from normal Linux networking. Many people may not want to take the jump into OVS. There are many cases where existing Linux networking technologies are easier to configure and use.

Q: What could VXLAN in Linux be used for?

It could be used to terminate VXLAN on a Linux router, to link Linux bridges across hypervisors, or to talk to expensive legacy virtualization products.

Q: Why is VXLAN cool?

Read the blogosphere; here are some good starting points.

Q: That's too technical, what can I show my manager?

There is a short introductory video on the fundamentals of VXLAN.

Monday, February 28, 2011

net-snmp ip-forward table performance problems

Some performance problems are hard and complex, but others seem to be
due to just plain stupidity. This is the saga of the SNMP daemon and a
full BGP route table. Way back in 2006, Vyatta discovered that if an
SNMP walk was done on a server with a full BGP route table, it would
peg the CPU and never complete. A full BGP route table is 500K entries
or more, so it does a good job of exposing scalability nightmares. The
initial fix was to disable the caching of the route table in SNMP,
which made it return no entries. Hardly a good fix, but returning
nothing is better than crashing.

I began investigating with the simple tools of packet capture with Wireshark
and syscall tracing with strace. The first discovery was that each
request caused the TCP wrappers library to open and read
/etc/hosts.allow and /etc/hosts.deny. Bogus on two counts:

  1. Debian ships the two files with no real entries, only comments.
    Each packet caused the files to be read, but there was really no data.
    It would have been better to have the files not exist and have the open fail.
  2. For our distribution, there was no point in enabling
    TCP wrappers anyway.

The fix was simple: disable TCP wrappers.

The net-snmp daemon retrieves the IPv4 and IPv6 routing tables the old
school way through /proc. This isn't a total disaster, but since the
route entries in /proc start with an interface name and net-snmp wants
an ifindex, it does a lookup for each entry. That is 300K extra ioctl
calls. The short term hack was to just cache the last ifname -> ifindex
mapping; later I replaced it with a netlink route dump,
which gives the ifindex directly (surprisingly, a netlink route dump is
already used in another MIB).
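The short term hack is easy to picture. Here is a rough sketch in Python (the real code is C inside net-snmp, and the class name here is made up): remember only the most recent name and its index, and call the expensive lookup only when the name changes. The `lookup` argument stands in for the ioctl.

```python
from typing import Callable, Optional


class LastIfindexCache:
    """Remember only the most recent ifname -> ifindex result."""

    def __init__(self, lookup: Callable[[str], int]):
        self._lookup = lookup  # the expensive call (an ioctl in net-snmp)
        self._name: Optional[str] = None
        self._index = 0
        self.misses = 0  # number of times the expensive lookup ran

    def ifindex(self, name: str) -> int:
        if name != self._name:
            self._index = self._lookup(name)
            self._name = name
            self.misses += 1
        return self._index
```

A one-entry cache is enough when long runs of routes point at the same interface, which is the usual case for a box with a handful of interfaces and a full BGP table. On a real system `socket.if_nametoindex` could be passed in as `lookup`.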

The next observation was that it is stupid to use snmpwalk to walk
the whole system; bulk requests (snmpbulk) should be used instead. This
helps, but the walk still would not complete.

The real discovery came from looking at the net-snmp container
code. Internally, net-snmp uses an objectish abstraction to store
data, and the main ones are a flat table and a linked list. The table
is stored in sorted order for fast lookup and sequential access. New
entries are placed at the end of the table and a dirty bit is set for
the next lookup. The problem is that each insert also does a lookup for
duplicates, which causes a sort. This makes every insert do a quicksort
of the whole table, and there is the scalability problem.

To make it more interesting, net-snmp creates the route table
twice. First it reads the table from /proc and puts the entries in one table,
then it walks that table to create the cache table used for lookup.

Loading the cache with non-scalable insertion takes several minutes on
a really fast machine, and the cache timeout is 30 seconds. The cache
has expired before the load even finishes, so each request finds a
stale cache and does a full reload. That is what pegs the CPU.

Now for the good news: fixing the insert wasn't that hard. The first
step was realizing that the temporary table doesn't have to be a table
container; instead it can be changed to a FIFO (linked list). The FIFO
container is O(1) on insert. The actual cache container requires a
different approach. The table container has an unused flag to allow
duplicates in the table. Turning on the ALLOW_DUPLICATES flag makes
inserts much faster because the table is not sorted until the first
request. These changes get the table load down to less than 5 seconds
on a fast machine.
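Here is a toy model of the table container in Python, just to show why the duplicate check hurts. It is only a sketch of the behavior described above, not the net-snmp code, and the class and attribute names are invented. The `sorts` counter records how many full sorts get triggered.

```python
import bisect


class SortedTable:
    """Flat array that is appended to and sorted lazily (toy model)."""

    def __init__(self, allow_duplicates: bool = False):
        self.allow_duplicates = allow_duplicates
        self._items = []
        self._dirty = False
        self.sorts = 0  # how many full sorts were triggered

    def _sort_if_dirty(self):
        if self._dirty:
            self._items.sort()
            self._dirty = False
            self.sorts += 1

    def find(self, key) -> bool:
        # Lookups need sorted data, so a dirty table is sorted first.
        self._sort_if_dirty()
        i = bisect.bisect_left(self._items, key)
        return i < len(self._items) and self._items[i] == key

    def insert(self, key) -> bool:
        # Without the flag, every insert first looks for a duplicate,
        # and that lookup forces a sort of the whole table.
        if not self.allow_duplicates and self.find(key):
            return False
        self._items.append(key)
        self._dirty = True
        return True
```

Loading N entries without the flag triggers N - 1 sorts; with `allow_duplicates=True` the load triggers none, and the single sort happens on the first `find`. Python's built-in sort is kind to nearly sorted input, so count the sorts rather than timing this toy.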

Lastly, a couple of other improvements help as well. When the
binary_table is expanded, the code would calloc a new area, copy the
old data and then free the original. This is much worse than just
using realloc, which can usually do in-place expansion when the table is
getting large. The sort function can be optimized to avoid calling the
comparison function, and to use a faster insertion sort for small
subsections. These get the table load down to less than a second.
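For anyone who hasn't seen the small-subsection trick, here is a sketch of the general technique in Python: quicksort partitions until a range drops below a cutoff, then insertion sort finishes that range. The cutoff value and function names are illustrative assumptions, not what net-snmp uses, and the other half of the optimization (inlining the comparison instead of calling through a function pointer) only makes sense in C and is not shown.

```python
SMALL = 16  # cutoff chosen for illustration only


def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place; cheap when the range is short."""
    for i in range(lo + 1, hi + 1):
        item = a[i]
        j = i - 1
        while j >= lo and a[j] > item:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = item


def hybrid_sort(a, lo=0, hi=None):
    """Quicksort that hands small ranges to insertion sort."""
    if hi is None:
        hi = len(a) - 1
    while hi - lo + 1 > SMALL:
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:  # partition around the pivot value
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Recurse into the smaller side, keep looping on the larger one,
        # so the recursion depth stays logarithmic.
        if j - lo < hi - i:
            hybrid_sort(a, lo, j)
            lo = i
        else:
            hybrid_sort(a, i, hi)
            hi = j
    insertion_sort(a, lo, hi)
```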

Extra credit to the first developer who implements a new net-snmp
container using something better for big tables like AVL or B-tree.

Thursday, March 18, 2010


I like seeing LWN writers pick up small patches and explain what they are and why they are important. As a developer, I find the impact of a change is often not obvious, and without further explanation significant changes go unnoticed. The recent story about the Generalized TTL Security Mechanism (GTSM) is one such example.
But when a story comes out, the writer should do research on the background. First, it is nice to give some credit to the author :-) and Vyatta, as well as some history. I did this patch based on an enhancement request for the current Vyatta version. The starting point was an (unaccepted) patch to Quagga, and an existing implementation for FreeBSD systems. It was one of those patches where the kernel change took less time than writing the test programs.

Also, the initial patch wasn't perfect (nothing ever is): it broke time-wait sockets and missed the case of ICMP messages. Both should be fixed by the time 2.6.34-rc2 comes out. The necessary support has not been integrated into upstream Quagga (yet).

I appreciate the review and feedback from Eric, Andi, David, and Pekka for making this work.

Wednesday, November 11, 2009

Powerpoint® Karaoke contest

Anyone in the Portland area interested in a fun and creative event is invited to the 1st Timbertalkers Powerpoint® Karaoke contest on Tuesday 11/24 at noon.

Meeting location is: 9403-B SW Nimbus Ave., Beaverton, Oregon

If you have never done PPTK, here are the rules:

  • Topic is drawn from a set of 30 topics. Probably 10 to 15 slides
  • Speaker will have 2 to 3 minutes
  • Prizes awarded

In the spirit of open source, it will really be an OpenOffice Impress contest, and the slides will be drawn from Creative Commons licensed decks.

Tuesday, October 27, 2009

Ubuntu 9.10 hates kernel developers?

Ubuntu has never been the easiest distribution for kernel development, but it looks like 9.10 has made things too painful. I need to build and install kernels all the time, and usually just update the grub menu manually. But now, with grub 2 in Ubuntu 9.10, they have wrapped the grub menu in grub-mkconfig. Why?

It would be great if the system were set up so that just doing 'make install' in the kernel source put in the kernel and updated grub.cfg, but no, that would make too much sense.

P.S. They managed to break the sky2 driver somehow: the connection won't come up and it negotiates the wrong speed. It turned out not to be a kernel problem; it was a wiring issue (speed), combined with some Network Manager changes.

Tuesday, October 20, 2009

Japan Linux Symposium

I am giving three talks: 1) routing performance, 2) staging drivers, 3) Vyatta CLI.
So if you are attending JLS please stop by and give me support.

Thursday, September 17, 2009

Netconf / LinuxCon / Linux Plumber's Conference

It will be a busy week. The network developers are getting together at Netconf over the weekend,
then comes LinuxCon, followed by Linux Plumber's Conference. Hope the weather holds out; Portland has a tendency to rain whenever there is a big event.