Category Archives: networking

Configuring OpenBGPD to announce VM’s virtual networks

We use BGP quite heavily at work, and even though I don't interact with it directly, it feels like something very useful to learn, at least at a basic level. The most effective and fun way of learning a technology is to find a practical application for it, so I decided to see if it could help me improve networking management for my virtual machines.

My setup is fairly simple: I have a host that runs bhyve VMs and a desktop system from which I ssh to those VMs; both hosts run FreeBSD. All VMs are connected to each other through a bridge and share a common network, 10.0.1/24. The point of this exercise is to be able to ssh to these VMs from the desktop without adding static routes and without adding vmhost's external interfaces to the VMs' bridge.
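
For context, the VM-side plumbing on vmhost boils down to something like this (a rough sketch, not covered in this post; bridge0 and tap0 are example names, and a VM manager such as vm-bhyve usually sets this up for you):

# ifconfig bridge0 create
# ifconfig bridge0 inet 10.0.1.1/24 up    # vmhost's own address on the VM network (assumed here)
# ifconfig bridge0 addm tap0              # one tap per VM gets added to the bridge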

I've installed openbgpd on both hosts and configured it like this:

vmhost: /usr/local/etc/bgpd.conf

AS 65002
router-id 192.168.87.48
fib-update no

network 10.0.1.0/24

neighbor 192.168.87.41 {
        descr "desktop"
        remote-as 65001
}

Here, router-id is set to vmhost's IP address in my home network (192.168.87/24). fib-update no forbids updates to the routing table; I initially set it for testing, but I'm keeping it since vmhost isn't supposed to learn any routes from desktop anyway. network announces my VMs' network, and neighbor describes my desktop box.
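
One extra step worth doing after editing the file: bgpd can check a configuration for validity without starting (configtest mode, the -n flag), which catches typos before you restart the daemon:

# bgpd -n -f /usr/local/etc/bgpd.conf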

Now the desktop box:

desktop: /usr/local/etc/bgpd.conf

AS 65001
router-id 192.168.87.41
fib-update yes

neighbor 192.168.87.48 {
        descr "vmhost"
        remote-as 65002
}

It's pretty similar to vmhost's bgpd.conf, but no networks are announced here, and fib-update is set to yes because the whole point is to get VM routes added.

Both hosts have to have the openbgpd service enabled:

/etc/rc.conf.local

openbgpd_enable="YES"
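
If you don't want to edit the file by hand, sysrc(8) can add the line for you:

# sysrc -f /etc/rc.conf.local openbgpd_enable=YES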

Now start the service (or wait until the next reboot) with service openbgpd start and check that the neighbors are there:

vmhost: bgpctl show summary

$ bgpctl show summary
Neighbor                 AS    MsgRcvd    MsgSent  OutQ  Up/Down   State/PrfRcvd
desktop               65001       1089       1090     0  09:03:17              0
$

desktop: bgpctl show summary

$ bgpctl show summary
Neighbor                 AS    MsgRcvd    MsgSent  OutQ  Up/Down   State/PrfRcvd
vmhost                65002       1507       1502     0  09:04:58              1
$

Get some detailed information about the neighbor:

desktop: bgpctl sh nei vmhost

$ bgpctl sh nei vmhost
BGP neighbor is 192.168.87.48, remote AS 65002
 Description: vmhost
  BGP version 4, remote router-id 192.168.87.48
  BGP state = Established, up for 09:06:25
  Last read 00:00:21, holdtime 90s, keepalive interval 30s
  Neighbor capabilities:
    Multiprotocol extensions: IPv4 unicast
    Route Refresh
    Graceful Restart: Timeout: 90, restarted, IPv4 unicast
    4-byte AS numbers

  Message statistics:
                  Sent       Received
  Opens                3              3
  Notifications        0              2
  Updates              3              6
  Keepalives        1499           1499
  Route Refresh        0              0
  Total             1505           1510

  Update statistics:
                  Sent       Received
  Updates              0              1
  Withdraws            0              0
  End-of-Rib           1              1

  Local host:  192.168.87.41, Local port:   179
  Remote host: 192.168.87.48, Remote port: 13528

$

By the way, as you can see, bgpctl supports shortened commands, e.g. sh nei instead of show neighbor.

Now look for the VMs' route:

desktop: bgpctl show rib

$ sudo bgpctl show rib
flags: * = Valid, > = Selected, I = via IBGP, A = Announced, S = Stale
origin: i = IGP, e = EGP, ? = Incomplete

flags destination          gateway          lpref   med aspath origin
*>    10.0.1.0/24          192.168.87.48      100     0 65002 i
$

So the VMs' network, 10.0.1/24, is there! Now check whether the system routing table was updated and has this route:

desktop

$ route -n get 10.0.1.45
   route to: 10.0.1.45
destination: 10.0.1.0
       mask: 255.255.255.0
    gateway: 192.168.87.48
        fib: 0
  interface: re0
      flags:
 recvpipe  sendpipe  ssthresh  rtt,msec    mtu        weight    expire
       0         0         0         0      1500         1         0
$ ping -c 1 10.0.1.45
PING 10.0.1.45 (10.0.1.45): 56 data bytes
64 bytes from 10.0.1.45: icmp_seq=0 ttl=63 time=0.192 ms

--- 10.0.1.45 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.192/0.192/0.192/0.000 ms
$

Whoa, things work as expected!

Conclusion

As mentioned already, a similar result could be achieved without BGP, either with static routes or by bridging the interfaces differently, but the purpose of this exercise was to get some basic hands-on experience with BGP. Right now I'm looking into extending my setup to try a more complex BGP topology. I'm thinking about adding some software switches in front of my VMs, or maybe adding a second VM host (if the budget allows). You're welcome to comment if you have ideas on how to extend this setup for educational purposes in the context of BGP and networking.
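
For comparison, the static-route alternative would have been a single command on the desktop, pointing the VM network at vmhost (plus a static_routes entry in rc.conf to make it survive reboots):

# route add -net 10.0.1.0/24 192.168.87.48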

As a side note, I really like OpenBGPD so far. Its configuration file format is clean and simple, the documentation is good, error and informational messages are clear, and the CLI has intuitive syntax.

Linux kernel TCP smoothed-RTT estimation

Recently I decided to look under the hood to see how exactly srtt is calculated in Linux. The actual (exponentially weighted moving average) srtt calculation is a rather straightforward part, but what goes into that calculation as input under various scenarios is interesting and very important for getting a correct RTT estimate.

It's also useful to note the difference between Linux and FreeBSD in this regard: whenever possible, Linux doesn't trust the value provided by the TCP Timestamps option, since middle-boxes can meddle with it.

The basic algorithm is:
* For non-retransmitted packets, use the saved packet send timestamp and the ack arrival time.
* For retransmitted packets, use the Timestamps option; if that's not enabled, RTT is not calculated for such packets.

Let’s look at the code. I am using net-next.
When a TCP sender sends packets, it has to wait for acks for those packets before throwing them away, so it stores them in a queue called the ‘retransmission queue’.
When sent packets get acked, tcp_clean_rtx_queue() gets called to clear those packets from the retransmission queue.

A few useful variables in that function are:
seq_rtt_us – uses the first packet from the acked range
ca_rtt_us – uses the last packet from the acked range (mainly used for congestion control)
sack_rtt_us – uses the sacked ack
tcp_mstamp is a tcp_sock member which holds the timestamp of the most recent packet received/sent. It gets updated by tcp_mstamp_refresh().

For a clean ack (not a sack), seq_rtt_us = ca_rtt_us (as there is no range).

If such a clean ack is also for a non-retransmitted packet,

seq_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, first_ackt);

and for a sack which is again for a non-retransmitted packet,

sack_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, sack->first_sackt);

The code that updates sack->first_sackt is in tcp_sacktag_one(), where it gets populated when the sack is for a non-retransmitted packet.

tcp_stamp_us_delta() computes the difference against the timestamp that the stack maintains (tp->tcp_mstamp).
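
From memory, that helper is essentially a clamped subtraction of two microsecond timestamps (paraphrasing include/net/tcp.h; check the tree you're reading for the exact definition):

/* difference t1 - t0 in microseconds, never negative */
static inline u32 tcp_stamp_us_delta(u64 t1, u64 t0)
{
        return max_t(s64, t1 - t0, 0);
}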

Now tcp_ack_update_rtt() gets called, which starts out with:

/* Prefer RTT measured from ACK's timing to TS-ECR. This is because
 * broken middle-boxes or peers may corrupt TS-ECR fields. But
 * Karn's algorithm forbids taking RTT if some retransmitted data
 * is acked (RFC6298).
 */
if (seq_rtt_us < 0)
        seq_rtt_us = sack_rtt_us;

For acks that ack retransmitted packets, seq_rtt_us would be negative.
But if there is a SACK timestamp from a non-retransmitted packet, it is used instead, as it carries a valid and useful timestamp.

Then it takes the TS-opt provided timestamp only if seq_rtt_us is still negative.

if (seq_rtt_us < 0 && tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
    flag & FLAG_ACKED) {
        u32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
        u32 delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ);
 
        seq_rtt_us = ca_rtt_us = delta_us;
}

By this point we have a seq_rtt_us that can be fed into tcp_rtt_estimator(), which generates the smoothed RTT (more or less based on Van Jacobson's SIGCOMM '88 paper).
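
For reference, here is the textbook form of that estimator (a simplified sketch of the RFC 6298 / VJ'88 math in plain microseconds; the kernel's tcp_rtt_estimator() implements the same idea in scaled fixed-point arithmetic, storing srtt as 8*srtt):

/* Simplified smoothed-RTT / RTO update, RFC 6298 style.
 * m is the new RTT measurement in microseconds. */
void rtt_estimator_sketch(long m, long *srtt, long *rttvar, long *rto)
{
        if (*srtt == 0) {                       /* first measurement */
                *srtt = m;
                *rttvar = m / 2;
        } else {
                long err = m - *srtt;           /* error vs. old srtt */
                *srtt += err / 8;               /* alpha = 1/8 */
                if (err < 0)
                        err = -err;
                *rttvar += (err - *rttvar) / 4; /* beta = 1/4 */
        }
        *rto = *srtt + 4 * *rttvar;             /* clamped to RTO min/max in practice */
}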

Drop a packet

A few months back, when I started looking into improving FreeBSD TCP's response to packet loss, I looked around for traffic simulators that could do deterministic packet drops for me.

I had used dummynet(4) before, so I thought of using it, but the problem is that it only provides probabilistic drops: you can specify dropping 10% of the total packets, for example. I came across the dpd work from CAIA, Swinburne University, but it was written for FreeBSD 7 and I couldn't port it forward to FreeBSD 11 with reasonable time/effort, as ipfw/dummynet has changed quite a bit.

So I decided to hack dummynet to provide me deterministic drops. Here is the patch: drop.patch
(Yes, it’s a hack and it needs polishing.)

Here is how I use it:
Setup:

client                 dummynet                 server
10.10.10.10         10.10.10.11
                    10.10.11.11            10.10.11.12

Both client and server need their routing tables set up correctly so that they can reach each other.
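
With the addressing above (assuming /24 subnets on both segments), that is just one static route on each end:

client# route add -net 10.10.11.0/24 10.10.10.11
server# route add -net 10.10.10.0/24 10.10.11.11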

The dummynet node is the traffic-shaping node here. We need to enable forwarding between its interfaces:

sysctl net.inet.ip.forwarding=1

We need to set up links (called ‘pipes’) and their parameters on the dummynet node like this:

# ipfw add pipe 100 ip from 10.10.11.12 to 10.10.10.10 out 
# ipfw add pipe 101 ip from 10.10.10.10 to 10.10.11.12 out
# ipfw pipe 100 config mask proto TCP src-ip 10.10.11.12 dst-ip 10.10.10.10 pls 3,4,5 plsr 7
# ipfw pipe 101 config mask proto TCP src-ip 10.10.10.10 dst-ip 10.10.11.12

‘pls 3,4,5 plsr 7’ is the new configuration that the patch provides here:
pls: packet loss sequence
plsr: repeat frequency for the loss pattern

In the example above, this configures pipe 100 to drop the 3rd, 4th and 5th packets going from server to client, and to repeat this pattern every 7 packets. So it would also drop the 10th, 11th and 12th packets, and so on and so forth.

Side note: delay, bw and queue depth are other very useful parameters that you can set to make the link behave however you want. For example, ‘delay 5ms bw 40Mbps queue 50Kbytes’ applied to each pipe (i.e. 5 ms each way) would create a link with a 10 ms RTT, 40 Mbps of bandwidth and 50 Kbytes worth of queue depth/capacity. Queue depth is usually chosen based on the BDP (bandwidth-delay product) of the link. Dummynet drops packets once that limit is reached.
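
Putting it all together, the shaping parameters and the deterministic drops can be combined on the same pipes; a sketch based on the pipes above (note that ipfw spells bandwidth as Mbit/s):

# ipfw pipe 100 config mask proto TCP src-ip 10.10.11.12 dst-ip 10.10.10.10 delay 5ms bw 40Mbit/s queue 50Kbytes pls 3,4,5 plsr 7
# ipfw pipe 101 config mask proto TCP src-ip 10.10.10.10 dst-ip 10.10.11.12 delay 5ms bw 40Mbit/s queue 50Kbytes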

For simulations, I run a lighttpd web server on the server, which serves objects of different sizes, and I request them via curl or wget from the client. I have tcpdump running on any/all of the four interfaces involved to observe the traffic, and I can see the specified packets getting dropped by dummynet.
The sysctl net.inet.ip.dummynet.io_pkt_drop counter is incremented with each packet that dummynet drops.
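
A typical run looks roughly like this (the interface name and object name are placeholders for whatever your setup uses):

dummynet# tcpdump -ni em0 tcp and host 10.10.10.10
client$ curl -o /dev/null http://10.10.11.12/1m.bin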

Future work:
* Work on getting this patch committed into FreeBSD-head.
* sysctl net.inet.ip.dummynet.io_pkt_drop increments on any type of loss (which includes queue overflow and any other random error), so I am planning to add a more specific counter that shows explicitly dropped packets only.
* I've (unsuccessfully) tried adding deterministic delay to dummynet so that specific packet(s) can be delayed, which would be useful for simulating link delays and for debugging delay-based congestion control algorithms. It turns out to be trickier than I thought. I'd like to resume working on it as time permits.

Improving FreeBSD’s transport layer

The FreeBSD network stack is quite stable but lacks some of the improvements/features available in other OSes.

A bunch of us have started an effort to identify the current problems and improve the transport layer (TCP, UDP, SCTP and others) in FreeBSD:

Transport Protocols wiki

Traditionally, freebsd-net has been the mailing list where networking problems get discussed, but some have complained that it is too spammy and too focused on NIC driver issues. So a new mailing list has been created specifically for discussing transport-level protocols: transport@

We’ve also started creating a list of TCP-related RFCs and their support status in FreeBSD, to have a single point of reference.

The plan is to have a coordinated effort to improve TCP, UDP, etc., so if you are interested in any of those protocols, please join the mailing list and help FreeBSD. :-)

Updated TCP Proposals and FreeBSD

There are a number of proposals for improving TCP performance coming out of Google that have some implications for FreeBSD. These proposals have taken the form of a group of IETF proposals, RFCs, patches to the Linux kernel, and research publications. A nice summary of the different initiatives is available from "Let's Make TCP Faster" on the Google Code Blog.

TCP Fast Open by Radhakrishnan, Cheng, Chu, Jain, and Raghavan is based on the observation that modern web services are dominated by TCP flows so short that they terminate a few round trips after handshaking. This means that the 3-way TCP handshake is a significant source of latency for such flows, and they describe a new mechanism for secure data exchange during the initial handshake to reduce some of the round-trip network transmission and associated latency for such short TCP transfers. This work shares many goals and challenges with T/TCP, which was previously in FreeBSD but suffered from some security vulnerabilities.

David Malone posted some thoughts on my Google+ post about how FreeBSD could implement the various changes. Maybe we could have some Summer of Code students work in this area this summer?

FreeBSD as a WiFi Access Point

At a recent Linux users' gathering I temporarily saved the day when a WRT router got practically bricked: I set up my netbook (an Acer Aspire One) running 8-CURRENT as a wireless access point. It had wired connectivity to the Internet on one side and offered WiFi via its Atheros card on the other. In between, it did NAT and protected the LAN side from the Linux hackers, both with ipfw. Here is how I configured it.

Read more...