tag:blogger.com,1999:blog-46863839737659118222024-02-08T08:51:09.988-08:00Network Plumber's JournalLinux networking and related topicsLinux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.comBlogger19125tag:blogger.com,1999:blog-4686383973765911822.post-26941977192187972322012-09-24T12:37:00.002-07:002012-09-24T21:28:17.622-07:00VXLAN for Linux<br />
<h2>
VXLAN for Linux</h2>
Just published a Linux kernel implementation of VXLAN for possible inclusion in 3.7 kernel (<a href="http://www.spinics.net/lists/netdev/msg211564.html">patches</a>).<br />
For those unfamiliar with VXLAN, here are some common questions.<br />
<h4>
Q: What is VXLAN?</h4>
It is a standard protocol to transfer layer 2 Ethernet packets over UDP.<br />
<h4>
Q: What is the VXLAN protocol?</h4>
The standard is under development, the <a href="http://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-02">current draft RFC</a> is at version 2.<br />
<h4>
Q: Why do we need yet another tunnel protocol? Why not just use GRE?</h4>
Existing tunnel protocols depend on properties of the backbone which may not be available. Generic Routing Encapsulation works by tunneling over IP and maybe blocked at routers by firewalls that only accept TCP and UDP.<br />
<br />
<h4>
Q: Does Openvswitch already do VXLAN?</h4>
The development version of Openvswitch does have <a href="http://openvswitch.org/pipermail/dev/2011-October/012051.html">VXLAN support</a>, but OVS is fundamentally different than normal Linux networking. Many people may not want to take the jump into OVS. There are many cases where existing Linux networking technologies are easier to configure and use.<br />
<h4>
Q: What could VXLAN in Linux be used for?</h4>
It could be used to terminate VXLAN in Linux router, or link Linux bridges across hypervisors, or talk to legacy expensive virtualization products.<br />
<h4>
Q: Why is VXLAN cool?</h4>
Read the blogosphere, here are some good starting points<br />
<br />
<ul>
<li><a href="http://blog.ioshints.info/2012/08/vxlan-and-otv-ive-been-suckered.html">Ivan Pepelnjak: VXLAN and OTV I've been suckered</a></li>
<li></li>
<li><a href="http://blogs.cisco.com/datacenter/digging-deeper-into-vxlan/">Digging Deeper into VXLAN, Part 1</a></li>
<li><a href="http://blog.scottlowe.org/2011/12/02/examining-vxlan/">Examining VXLAN</a></li>
<li><a href="http://packetpushers.net/show-66-vxlan-nvgre-ken-duda/">NVGRE vs VXLAN on Packetpushers podcast</a></li>
</ul>
<div>
<br />
<h4>
Q: That's too technical, what can I show my manager.</h4>
There is a short introductory video on the fundamentals of VXLAN<br />
<br />
<iframe allowfullscreen="allowfullscreen" frameborder="0" height="315" src="http://www.youtube.com/embed/j7on2iLk5ls" width="560"></iframe><br />
<br />
<br /></div>
Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0tag:blogger.com,1999:blog-4686383973765911822.post-78382196253332157682011-02-28T12:45:00.000-08:002011-02-28T12:47:19.294-08:00net-snmp ip-forward table performance problemsSome performance problems are hard and complex, but others seem to be<br />due to just plain stupidity. This is the saga of SNMP daemon and a<br />full BGP route table. Way back in 2006, Vyatta discovered that if an<br />SNMP walk was done on a server with a full BGP route table, it would<br />peg the CPU and never complete. A full BGP route table is 500K entries<br />or more so it does a good job of exposing scalability nightmares. The<br />initial fix was to disable the caching of the route table in SNMP<br />which made it return no entries. Hardly a good fix, but returning<br />nothing is better than crashing.<br /><br />I began investigating with the simple tools of packet capture with wireshark<br />and syscall capturing with strace. The first discovery was that each<br />request caused the TCP wrappers library to open and read<br />/etc/hosts.allow and /etc/hosts.deny. Bogus on two counts:<br /><ol><br /><li> Debian is shipping the 2 files with no real entries only comments.<br />Each packet caused file to be read but there was really no data.<br />It would have been better to have the file not exist and have the open fail.<br /><li> But for our distribution, there was no point in enabling<br />TCP wrappers anyway.<br /></ol><br />The fix was simple to <b>disable tcp-wrappers</b>.<br /><br />The net-snmp daemon retrieves the ipv4 and ipv6 routing table the old<br />school way through /proc. This isn't a total disaster but since the<br />route entries in /proc start with an interface name and net-snmp wants<br />an ifindex it looks up each entry. That is 300K extra ioctl<br />calls. Short term hack was to just <b>cache last ifname -> ifindex<br />translation</b>; later I replaced it with a netlink route dump<br />which gives ifindex (surprisingly netlink route dump is already used<br />in another MIB).<br /><br />Next observation was that it is stupid to use <i>snmpwalk</i> to walk<br />the whole system and instead use <i>snmpbulk</i>. This helps but still<br />the walk would not complete.<br /><br />The real discovery was when looking at the net-snmp container<br />code. Internally, net-snmp uses an objectish abstraction to store<br />data, and the main ones are a flat table and a linked list. The table<br />is stored in sorted order for fast lookup and sequential access. New<br />entries are placed at the end of the table and a dirty bit is set for<br />next lookup. The problem is that <b>each insert also does a lookup for<br />duplicates which causes a sort</b>. This makes inserts do quicksort for<br />each entry -- there is the scalability problem.<br /><br />To make it more interesting net-snmp creates the route table<br />twice. First it reads table from /proc and puts entries in one table,<br />then walks that table to create the cache table used for lookup.<br /><br />Loading the cache with non-scalable insertion takes several minutes on<br />a really fast machine, and the cache timeout is 30 seconds. This<br />ends up causing the CPU load because each request finds a dirty cache<br />and does a full reload.<br /><br />Now for the good news, fixing the insert wasn't the hard. The first<br />step was realizing that the temporary table doesn't have to a table<br />container, instead it can be changed to a FIFO (linked list). The FIFO<br />container is O(1) on insert. The actual cache container requires a<br />different approach. The table container has an unused flag to allow<br />duplicates in the table. Turning the ALLOW_DUPLICATES flag makes<br />inserts much faster because the table is not sorted until the first<br />request. These get the table load down to less than a 5 seconds<br />on fast machine.<br /><br />Lastly a couple of other improvements help as well. When the<br />binary_table is expanded, the code would calloc a new area, copy the<br />old data and then free the original. This is much worse than just<br />using realloc which can usually in place expansion when table is<br />getting large. The sort function can be optimized to avoid calling the<br />comparison function, and using a faster insertion sort for small sub<br />sections. These get the load down to less than a second.<br /><br />Extra credit to the first developer who implements a new net-snmp<br />container using something better for big tables like AVL or B-tree.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com1tag:blogger.com,1999:blog-4686383973765911822.post-69477348499030730502010-03-18T22:53:00.000-07:002010-03-18T23:15:38.050-07:00GTSMI like seeing LWN writers pick up small patches and explain what they are why they are important. As a developer, often the impact of a change is not obvious and without further explanation significant changes go unnoticed. <a href="http://lwn.net/SubscriberLink/379070/ab6ef67695c8a859/">The recent story about Generalized TTL Security Measures</a> in <a href="http://lwn.net">lwn.net</a> is one such example.<span style="display: block;" id="formatbar_Buttons"><span class=" on down" style="display: block;" id="formatbar_CreateLink" title="Link" onmouseover="ButtonHoverOn(this);" onmouseout="ButtonHoverOff(this);" onmouseup="" onmousedown="CheckFormatting(event);FormatbarButton('richeditorframe', this, 8);ButtonMouseDown(this);"></span></span><br />But, when a story comes out, the writer should do research on the background. First, it is nice to give some credit to the <a href="mailto:shemminger@vyatta.com">author</a> :-) and <a href="http://vyatta.com/">Vyatta</a>, as well as also some history. I did this patch based on <a href="https://bugzilla.vyatta.com/show_bug.cgi?id=4937">an enhancement request for the current Vyatta version</a>. The starting point was <a href="http://www.gossamer-threads.com/lists/quagga/dev/17389">a (unaccepted) patch to Quagga</a>, and <a href="http://www.freebsd.org/releases/7.0R/relnotes.html">existing implementation for FreeBSD systems</a>. It was one of those patches where the kernel change took less time than writing the test programs.<br /><br />Also, the initial patch wasn't perfect since (nothing ever is), since <a href="http://article.gmane.org/gmane.linux.network/154355/match=d218d111">it broke time wait sockets,</a> and missed the case of ICMP messages. Both should be fixed by the time 2.6.34-rc2 comes out. Also, the necessary support has not been integrated into upstream Quagga (yet).<br /><br />I appreciate the review and feedback from Eric, Andi, David, and Pekka for making this work.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com1tag:blogger.com,1999:blog-4686383973765911822.post-62072086589764491052009-11-11T16:43:00.000-08:002009-11-11T16:58:39.404-08:00Powerpoint® Karoke contestAnyone in the Portland area interested in a fun and creative event is invited to the 1st <a href="http://timbertalkers.com/">Timbertalkers</a> <a href="http://www.powerpointkaraoke2009.com/">Powerpoint® Karoke</a> contest on Tuesday 11/24 at noon.<br /><br />Meeting location is:<a href="http://www.timbertalkers.com/Meeting-Location.html"> 9403-B SW Nimbus Ave., Beaverton, Oregon</a><br /><br />If you have never done PPTK, here are the rules:<br /><ul><br /><li>Topic is draw from set of 30 topics. Probably 10 to 15 slides<br /></li><li>Speaker will have 2 to 3 minutes<br /></li><li>Prizes awarded<br /></li></ul><br /><br />In spirit of open source, it will really be a <a href="http://www.openoffice.org/">OpenOffice Impress</a> contest, and the slides will be drawn from <a href="http://creativecommons.org/">Creative Commons</a> licensed decks.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0tag:blogger.com,1999:blog-4686383973765911822.post-21906544458691573532009-10-27T20:42:00.000-07:002009-10-27T21:02:01.669-07:00Ubuntu 9.10 hates kernel developers?Ubuntu has never been the easiest distribution to do kernel development, but it looks like with 9.10 it has made things too painful. I need to build and install kernels all the time, and usually just update grub menu manually. But now with grub 2 in Ubuntu 9.10 they have wrapped the grub menu in grub-mkconfig. Why?<br /><br />It would be great if the system was setup so just doing 'make install' in the kernel source put in the kernel and updated the grub.cfg, but no that would make too much sense.<br /><br />P.s: they managed to break the sky2 driver somehow, the connection won't come up and negotiates the wrong speed. <i>It turned out not to be a kernel problem; wiring issue (speed), combined with some Network Manager changes</i>Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com5tag:blogger.com,1999:blog-4686383973765911822.post-23735411578015666482009-10-20T01:25:00.000-07:002009-10-20T01:27:22.656-07:00Japan Linux SymposiumI am giving three talks: 1) routing performance, 2) staging drivers, 3) Vyatta CLI.<br />So if you are attending JLS please stop by and give me support.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0tag:blogger.com,1999:blog-4686383973765911822.post-36257957364546480572009-09-17T09:20:00.001-07:002009-09-17T09:26:15.602-07:00Netconf / LinuxCon / Linux Plumber's ConferenceIt will be a busy week. The network developer's are getting together at <a href="http://vger.kernel.org/netconf2009.html">Netconf</a> over the weekend,<br />then <a href="http://events.linuxfoundation.org/events/linuxcon">LinuxCon</a> followed by <a href="http://linuxplumbersconf.org/2009/">Linux Plumber's Conference</a>. Hope the<a href="http://forecast.weather.gov/MapClick.php?site=pqr&smap=1&textField1=45.52361&textField2=-122.675"> weather</a> holds out, Portland has a tendency to rain when ever there is a big event.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0tag:blogger.com,1999:blog-4686383973765911822.post-61226519580321291942009-07-20T09:56:00.000-07:002009-07-20T11:49:56.709-07:00Congratulations MicrosoftNice. <a href="http://lkml.org/lkml/2009/7/20/167">Microsoft has released the Hyper-V drivers as GPLv2</a>. I know was a hard step for Microsoft to take, since it means acknowledging GPL and respecting the Linux community. The releasing of the drivers is good news for users, developers, and in the end Microsoft as well. Like most GPL related actions, a lot of work was done behind the scenes to get the offending company into compliance.<br /><br />This saga started when one of the user's on the Vyatta forum inquired about supporting Hyper-V network driver in the Vyatta kernel. A little<a href="http://www.google.com/search?q=hyper-v+linux+driver&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:unofficial&client=firefox-a"> googling</a> found the necessary drivers, but on closer examination there was a problem. The driver had both open-source components which were under GPL, and statically linked to several binary parts. The GPL does not permit mixing of closed and open source parts, so this was an obvious violation of the license. Rather than creating noise, my goal was to resolve the problem, so I turned to <a href="http://www.kroah.com/log/">Greg Kroah-Hartman</a>. Since Novell has a (too) close association with Microsoft, my expectation was that Greg could prod the right people to get the issue resolved.<br /><br />It took longer than expected, but finally Microsoft decided to do the right thing and release the drivers.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com16tag:blogger.com,1999:blog-4686383973765911822.post-47294764902534893762009-06-05T14:52:00.000-07:002009-06-05T15:00:09.732-07:00Networking at Linux Plumbers ConferenceHey kernel developers, more <a href="http://linuxplumbersconf.org/ocw/events/2009/proposals/">proposals </a>related to networking submitted for the <a href="http://linuxplumbersconf.org/2009/">Linux Plumbers Conference</a>. This is the chance to have in-person discussions about future proposals like<a href="http://lwn.net/Articles/331582/"> receive packet steering</a>, <a href="http://lkml.org/lkml/2009/4/26/186">RCU netfilter optimization</a>, <a href="http://lwn.net/Articles/194443/">unified flow cache</a>, and all those other topics that need need more brainstorming and discussion.<br /><br />The Netconf 2009 is also being planned to occur before LPC.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com1tag:blogger.com,1999:blog-4686383973765911822.post-59948029959330896952009-02-17T21:44:00.000-08:002009-02-17T21:51:40.471-08:00Parallelizing netfilterThe Linux networking receive performance has been mostly single threaded until the advent of MSI-X and multiqueue receive hardware. Now with many cards, it is possible to be processing packets on multiple CPU's and cores at once. All this is great, and improves performance for the simple case.<br /><br />But most users don't just use simple networking. They use useful features like netfilter to do firewalling, NAT, connection tracking and all other forms of wierd and wonderful things. The netfilter code has been tuned over the years, but there are still several hot locks in the receive path. Most of these are reader-writer locks which are actually the worst kind, much worse than a simple spin lock. The problem with locks on modern CPU's is that even for the uncontested case, a lock operation means a full-stop cache miss.<br /><br />With the help of Eric Duzmet, Rick Jones, Martin Josefsson and others, it looks like there is a solution to most of these. I am excited to see how it all pans out but it could mean a big performance increase for any kind of netfilter packet intensive processing. Stay tuned.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0tag:blogger.com,1999:blog-4686383973765911822.post-78742254984441539002008-12-18T20:47:00.000-08:002008-12-18T21:33:44.293-08:00GPL violations close to homeMany times I hear about GPL violations in vendors software, especially it seems in embedded routers. There are two cases which hit me in my home.<br /><br />The first is our <a href="http://verizon.com/fios">FIOS</a> router which is an <a href="http://www.fibercrap.com/article/actiontec-mi424wr-router-fios-weapon-of-choice-1150-1.html">Actionec MI424-WR</a> which runs Linux inside. You can even get to a telnet prompt. The problem is that it has a crappy DHCP server and always seems to assign different IP addresses even to the same MAC address. This breaks ssh and other services which do strong man-in-the-middle prevention. It seem the vendor hasn't fixed the problem, but as a <a href="http://www.softwarefreedom.org/news/2007/dec/07/busybox/">result of a GPL violations sui</a>t the <a href="http://opensource.actiontec.com/">some source is available</a> but the DHCP code is not included probably because it is BSD licensed so they don't have to. Given this I'll just punt and do the lazy solution and just turn it into an dumb Ethernet bridge and use something better like <a href="http://www.vyatta.com/products/hardware_appliances.php">Vyatta V514</a> test box or<a href="http://en.wikipedia.org/wiki/Linksys_WRT54G_series"> Linksys WR54TG</a>, both of which are repairable.<br /><br />The second is the <a href="http://www.asus.com/products.aspx?l1=3&l2=179&l3=815&l4=0&model=2593&modelmenu=1">Asus P6T motherboard</a> which has a <a href="http://hardware.slashdot.org/article.pl?no_d2=1&sid=08/05/14/173220">SplashVM</a> feature. This allows booting to a lightweight desktop in less than a minute (the BIOS is still slow to get its hardware setup). The desktop is based on Linux with standard kernel and browser. It is kind of a toy, but good for checking gmail etc. Since SplashVM is using GPL, if the vendor was following the GPL license I should be able to find the source on their website. It is possible to <a href="http://www.splashtop.com/open_source.php">find some pieces on the Splashtop vendor website</a>, but it is the responsibility of the system vendor not the subcontractor to make available the source for the <span style="font-weight: bold;">actual</span> firmware they are shipping. In this case, it matters to me for a couple of reasons. I wrote the driver for the Marvell Yukon-2 EC Ultra NIC's on this motherboard and would like to know if 1) the vendor fixed some bugs 2) the vendor still has some bugs that other users will pester me about. As copyright holder for this driver, I may have to go nasty to find out; stay tuned.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com1tag:blogger.com,1999:blog-4686383973765911822.post-68344324842002351312008-10-01T01:57:00.000-07:002008-10-01T02:07:25.185-07:00Netfilter workshop day 1At<a href="http://workshop.netfilter.org/2008/"> netfilter workshop</a>, Patrick McHardy described an exciting new feature implementation of netfilter firewalling called <a href="http://nfws.inl.fr/en/?p=92">nftables</a>. This has the promise of reducing 100's of netfilter modules down to a smaller kernel footprint, and allow for optimization of rulesets. <a href="http://nfws.inl.fr/en/">Eric Leblond's blog</a> has more information.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0tag:blogger.com,1999:blog-4686383973765911822.post-76625395286236141322008-09-12T09:16:00.000-07:002008-09-12T09:28:54.840-07:00Open Source is alive and well in PDX thank youI really should stop reading the <a href="http://www.oregonlive.com/">Oregonian</a>, they do such a poor job of covering high tech and the business section is especially weak. The<a href="http://www.oregonlive.com/business/oregonian/index.ssf?/base/business/1221092714119880.xml&coll=7"> recent piece</a> about OSCON moving to Silly Valley overlooked so many obvious things like the<a href="http://linuxplumbersconf.org/"> Linux Plumber's Conference</a> next week, the <a href="https://www.linuxfoundation.org/events/kernel">Kernel Summit</a> not to mention the <a href="http://www.beavertonoregon.gov/business/businessincubator.aspx">Open Source technology center</a>, Oracle office in Portland, Portland State, and<a href="http://freegeek.org/"> Free Geek</a>. So the loss of one conference which is mostly attended by out of town people is really no impact on the local open source infrastructure.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com1tag:blogger.com,1999:blog-4686383973765911822.post-54418221280658816052008-08-31T12:54:00.000-07:002008-08-31T13:02:07.757-07:00Only aliens can configure selinux?<a href="http://ars.userfriendly.org/cartoons/?id=20080831">Sunday 8/31 user friendly cartoon</a> is great.<br /><a href="http://flickr.com/photos/x_jamesmorris/2693101534/">Do these people look like aliens?</a><br />Guess I'll have to give up on trying to setup selinux.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0tag:blogger.com,1999:blog-4686383973765911822.post-88533420120954086822008-08-27T15:10:00.001-07:002008-08-27T15:20:44.804-07:00Exploring transactional filesystemsIn order to implement router style semantics, Vyatta allows setting many different configuration variables and then applying them all at once with a <span style="font-weight: bold;">commit</span> command. Currently, this is implemented by a combination of shell magic and <a href="http://www.filesystems.org/project-unionfs.html"><span style="font-style: italic;">unionfs</span></a>. The problem is that keeping <span style="font-style: italic;">unionfs</span> up to date and fixing the resulting crashes is major pain.<br /><br />There must be better alternatives, current options include:<br /><ul><li>Replace unionfs with <a href="http://aufs.sourceforge.net/">aufs</a> which has less users yelling at it and more developers.</li><li>Use a filesystem like <a href="http://btrfs.wiki.kernel.org/index.php/Main_Page">btrfs</a> which has snapshots. This changes the model and makes api's like "what changed?" hard to implement.</li><li>Move to a pure userspace model using <a href="http://git.or.cz/">git</a>. The problem here is that git as currently written is meant for users not transactions.<br /></li><li>Use combination of copy, bind mount, and <a href="http://samba.anu.edu.au/rsync/">rsync</a>.</li><li>Use a database for configuration. This is easier for general queries but is the most work. Conversion from existing format would be a pain.<br /></li></ul>Looks like a fun/hard problem. Don't expect any resolution soon.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com1tag:blogger.com,1999:blog-4686383973765911822.post-76841018611626159972008-06-26T22:42:00.000-07:002008-06-26T22:48:10.845-07:00TCP MD5 debuggingAdded <span class="blsp-spelling-error" id="SPELLING_ERROR_0">CLI</span> support for <span class="blsp-spelling-error" id="SPELLING_ERROR_1">TCP</span> MD5 (via <span class="blsp-spelling-error" id="SPELLING_ERROR_2">Quagga</span>) to the upcoming <span class="blsp-spelling-error" id="SPELLING_ERROR_3">Vyatta</span> release. It worked fine under testing (<span class="blsp-spelling-error" id="SPELLING_ERROR_4">VM</span>) but wouldn't operate with <span class="blsp-spelling-error" id="SPELLING_ERROR_5">IOS</span>. Reduced the problem down by making some useful <span class="blsp-spelling-corrected" id="SPELLING_ERROR_6">utilities</span>:<br /><ul><li>Patch for <a href="http://netcat.sourceforge.net/">Netcat</a> to support MD5</li><li>Standalone using libpcap to validate MD5 option in capture file</li></ul>It turned out that the sender was generating wrong MD5 option after the initial SYN handshake. When data is finally sent, the problem is that the data in the kernel is fragmented because the underlying device supports scatter/gather but the md5_calc doesn't do scatter gather.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0tag:blogger.com,1999:blog-4686383973765911822.post-30010930378765601392008-06-19T16:51:00.000-07:002008-06-19T16:56:12.707-07:00Linux Plumbers ConferenceI have high hopes for the first <a href="http://linuxplumbersconf.org">Linux Plumbers Conference</a>. Unlike an academic conference with papers, or an un-conference with no agenda; the plumbers conference is using a mini-conference format to break down by topic. There is even a<a href="http://linuxplumbersconf.org/cfp/"> Call For Speakers</a> to get speakers in topic areas.<br /><br />First time conferences have a different feel, more rough edges, but more passion and fun. So I hope it works out. There is no particular networking track, mostly because the other areas seemed to need more work.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0tag:blogger.com,1999:blog-4686383973765911822.post-49479061466829363962008-01-23T20:03:00.000-08:002008-01-23T20:15:40.379-08:00FIB Trie sagaFor the next release of <a href="http://vyatta.com">Vyatta,</a> I wanted to enable the <a href="http://en.wikipedia.org/wiki/Trie">Trie</a> algorithm for routing in our kernel. Since FIB Trie is compatible with the previous hash, I expected no change. Well the day after enabling it caused an immediate failure in the regression test. The regression test plays back a full BGP input stream into the router and polls for the result. The problem was that the Trie to a long time, ... a really long time, to dump the routes. For the full 163395 routes, the dump was taking 20 seconds vs 1/2 sec for FIB_HASH.<br /><br />As expected the problem turned out to be an N^2 lookup. The code was basically:<br />walk tree to find a route, and put it in buffer; if buffer is full, then give up, then go<br />back and walk to the last location. Since the trie has nice fast lookup function the change to just record the last route dumped (rather than offset), then use the lookup to find the location. This dropped the dump down to under 3/10 sec.<br /><br />Finding this took several days, mostly because of looking at the profile; see "nextleaf" is the hotspot, so let's look at that. The real breakthrough came when I realized there were other operations that were walking the tree, like collecting stats, but they were fast. The next diversion was figuring out all the other suboptimal behaviour (ip route flush calls fflush for each route), which although slow weren't the real issue.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0tag:blogger.com,1999:blog-4686383973765911822.post-46078328169644968082007-12-17T19:40:00.000-08:002007-12-17T19:45:32.837-08:00A new dayStarted out for first day at <a href="http://www.blogger.com/vyatta.com">Vyatta</a>. There is a lot of overlap between what I know from doing iproute2 utilities so hopefully the learning curve won't be too steep.Linux Network Plumberhttp://www.blogger.com/profile/01514158449435119324noreply@blogger.com0