Saturday, December 31, 2016

Discovering (tagged only?) vlans using tcpdump

Why use VLANs? Well, part of it is having a bunch of different devices -- computers, printers, toasters -- connected to the same switches but sitting in different networks, so you can control how, or whether, they can see and talk to each other. Those switches are connected to each other and to routers that need to route traffic between the different networks, which probably means we are trunking and tagging the vlans between those switches and routers. If we now add virtual machines to the mix, the physical machines they run on -- the virtualization hosts -- need to be able to connect to all the networks their guests belong to. And that, once again, is done using trunking and vlan tagging.

What if it is not working properly?

Let me illustrate with an example: I recently had a router running OpenWrt which I assumed I had configured to handle multiple vlans and which was connected to the rest of the network through an 802.1Q trunk; think of it as a router-on-a-stick setup. The router was configured with one IP in each of two vlans, which I shall call vlan2 and vlan3: 192.168.2.1 and 192.168.3.1 respectively, to keep things simple. The problem was that I could only reach it through vlan2; nothing on vlan3.

The first thing I did was recheck the switch to see if that port was configured as an 802.1Q trunk. Not only was it, but both vlans were tagged; I did not have an untagged/default vlan on this trunk. So for all practical purposes both vlans should look the same to the router. So why didn't they?

At this point I decided to do some packet sniffing, which would be the cue to unleash Wireshark. But we are talking about a little router running an OS designed to fit a resource-limited device, so why not use a command-line alternative like the humble tcpdump? It can show not only all vlan traffic on the wire but also traffic on one specific vlan, just as it can show traffic for a specific IP or protocol. For instance, since we really only care about traffic on vlan 3 on interface eth0, we can run:

root@router:~# tcpdump -i eth0.3
tcpdump: WARNING: eth0.3: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0.3, link-type EN10MB (Ethernet), capture size 65535 bytes

And then, after letting it run for ten minutes, all I got back were tumbleweeds. In the same time frame, if I had asked for traffic on vlan 2 it would have filled the screen before I finished this sentence (and, yes, I am a slow typist, but still). The logical thing to do now is to see if there is any traffic on the wire that carries an 802.1Q tag (vlan):

tcpdump -i eth0 -e -n vlan
tcpdump: WARNING: eth0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
01:32:27.440782 e2:46:9a:5e:a7:45 > c0:ff:ee:3e:06:eb, ethertype 802.1Q (0x8100), length 122: vlan 2, p 0, ethertype IPv4, 192.168.2.1.22 > 192.168.2.249.56762: Flags [P.], seq 2026124025:2026124077, ack 2216938037, win 21298, options [nop,nop,TS val 346274301 ecr 697366229], length 52
01:32:27.441234 e2:46:9a:5e:a7:45 > c0:ff:ee:3e:06:eb, ethertype 802.1Q (0x8100), length 122: vlan 2, p 0, ethertype IPv4, 192.168.2.1.22 > 192.168.2.249.56762: Flags [P.], seq 52:104, ack 1, win 21298, options [nop,nop,TS val 346274301 ecr 697366232], length 52
01:32:27.441946 e2:46:9a:5e:a7:45 > c0:ff:ee:3e:06:eb, ethertype 802.1Q (0x8100), length 138: vlan 2, p 0, ethertype IPv4, 192.168.2.1.22 > 192.168.2.249.56762: Flags [P.], seq 104:172, ack 1, win 21298, options [nop,nop,TS val 346274301 ecr 697366232], length 68

Do you see the vlan info on those lines? Let's trim it a bit:

tcpdump -i eth0 -e -n vlan | cut -f 6,10,11 -d ' '
tcpdump: WARNING: eth0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
802.1Q vlan 2,
^C21 packets captured
29 packets received by filter
0 packets dropped by kernel
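
Picking fields by position with cut works, but it gives no totals and breaks if the line layout shifts. A minimal sketch of an alternative, assuming awk is available on the router (it normally is with busybox): look for the word "vlan" and count how many frames carry each ID. The name count_vlans is made up for this example and is not part of tcpdump.

```shell
# Tally frames per VLAN ID from `tcpdump -e -n` output read on stdin.
# count_vlans is a hypothetical helper name, not a tcpdump feature.
count_vlans() {
    awk '
        {
            # Find the "vlan" keyword; the next token is the ID.
            for (i = 1; i < NF; i++) {
                if ($i == "vlan") {
                    id = $(i + 1)
                    sub(/,$/, "", id)   # drop the trailing comma
                    count[id]++
                    break
                }
            }
        }
        END {
            # Output order is not guaranteed; pipe through sort if it matters.
            for (id in count)
                printf "vlan %s: %d\n", id, count[id]
        }
    '
}

# Usage (on the router; -c 200 stops the capture after 200 frames):
#   tcpdump -i eth0 -e -n -c 200 vlan | count_vlans
```

If only one vlan shows up in the tally, the others are not reaching the interface tagged.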

So all the traffic we have seen above is for vlan 2. In fact, 192.168.2.1.22 > 192.168.2.249.56762 is port 22 (ssh) on 192.168.2.1, which is the router, sending to port 56762 on 192.168.2.249. In other words, someone at 192.168.2.249 has an ssh session open to the router. And that is indeed the case, because that is how I connected to the router!
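
As for the "vlan 2, p 0" part of each line: an 802.1Q tag is the 0x8100 ethertype followed by a 16-bit control field holding a 3-bit priority, a 1-bit drop eligible indicator and a 12-bit vlan ID. Here is a small sketch of how those bits split up, using only shell arithmetic. The name decode_tci is made up for this example.

```shell
# Split a 16-bit 802.1Q tag control field (given as hex digits) into its parts.
# decode_tci is a hypothetical helper name used only for illustration.
decode_tci() {
    tci=$((0x$1))
    pcp=$(( (tci >> 13) & 0x7 ))   # priority, shown by tcpdump as "p"
    dei=$(( (tci >> 12) & 0x1 ))   # drop eligible indicator
    vid=$(( tci & 0xfff ))         # vlan ID, shown by tcpdump as "vlan"
    echo "vlan $vid, p $pcp, dei $dei"
}

# decode_tci 0002   prints: vlan 2, p 0, dei 0
```

So a frame on vlan 2 with default priority carries the two bytes 00 02 right after 81 00.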

NOTE: A long time ago there was a bug in tcpdump which required you to run it twice, piping a raw capture into a second instance that applied the vlan filter:

tcpdump -Uw - | tcpdump -en -r - vlan 

But that is no longer the case.

But where's vlan 3? That is a great question! At first I thought I had set it up wrong on the router, even though both vlans have similar configurations, save for the network info. And that kept me awake for a few nights.

It turns out there is an ongoing bug that would not allow me to run more than one tagged vlan. The workaround is to "kick" it by adding

option enable_vlan4k 1

to /etc/config/network
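
If you would rather not edit the file by hand, the same option can probably be set through uci. This is only a sketch under two assumptions: that the option belongs in the "config switch" section (where OpenWrt's switch options usually live) and that the switch you want is the first one, index 0. The name enable_vlan4k_on is made up for this example.

```shell
# Set enable_vlan4k on a switch section through uci instead of editing the file.
# enable_vlan4k_on is a hypothetical helper; defaulting to section index 0
# is an assumption, so check /etc/config/network on your own device first.
enable_vlan4k_on() {
    idx=${1:-0}
    uci set "network.@switch[$idx].enable_vlan4k=1" &&
    uci commit network
}

# Usage, followed by a network restart so the switch is reprogrammed:
#   enable_vlan4k_on && /etc/init.d/network restart
```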

We feel confident we solved the problem, so instead of listening to all vlans this time we will only listen for vlan 3 traffic:

root@router:~# tcpdump -i eth0.3
tcpdump: WARNING: eth0.3: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0.3, link-type EN10MB (Ethernet), capture size 65535 bytes
19:15:25.822964 IP 0.0.0.0 > all-systems.mcast.net: igmp query v2
19:15:25.823010 IP6 fe80::e046:9aff:fe5e:a745 > ff02::1: HBH ICMP6, multicast listener querymax resp delay: 10000 addr: ::, length 24
19:15:26.568728 ARP, Request who-has 192.168.3.10 tell 192.168.3.1, length 28
19:15:26.569022 ARP, Reply 192.168.3.10 is-at c0:ff:ee:14:25:ad (oui Unknown), length 42
19:15:26.569092 IP 192.168.3.1.32101 > 192.168.3.10.domain: 29094+ PTR? 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.2.0.f.f.ip6.arpa. (90)
19:15:26.606872 IP 192.168.3.10.domain > 192.168.3.1.32101: 29094 NXDomain 0/1/0 (160)
19:15:27.797110 IP6 fe80::c2ff:eeff:fe14:25ad > ff02::1:ff14:25ad: HBH ICMP6, multicast listener reportmax resp delay: 0 addr: ff02::1:ff14:25ad, length 24
19:15:28.014715 IP6 fe80::be5f:f4ff:fe54:d78d > ff02::1:ff54:d78d: HBH ICMP6, multicast listener reportmax resp delay: 0 addr: ff02::1:ff54:d78d, length 24
19:15:28.615536 IP 192.168.3.1.16915 > 192.168.3.10.domain: 15895+ PTR? d.a.5.2.4.1.e.f.f.f.e.e.f.f.2.c.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa. (90)
19:15:28.616059 IP 192.168.3.10.domain > 192.168.3.1.16915: 15895 NXDomain* 0/1/0 (125)
19:15:28.618233 IP 192.168.3.1.36929 > 192.168.3.10.domain: 54136+ PTR? d.a.5.2.4.1.f.f.1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.2.0.f.f.ip6.arpa. (90)
19:15:28.646935 IP 192.168.3.10.domain > 192.168.3.1.36929: 54136 NXDomain 0/1/0 (160)

And that is the kind of output I wanted to see!
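
The same check can also be made from the trunk side, filtering on the tag itself instead of listening on the eth0.3 sub-interface. tcpdump's vlan filter accepts an optional vlan ID, and -c stops after a given number of packets so it does not run forever. Below is a rough sketch of a wrapper around that. The name watch_vlan and its default of 20 packets are my own choices, and whether tags are visible on the parent interface can depend on the driver.

```shell
# Capture a bounded number of frames for one vlan ID on a trunk interface.
# watch_vlan is a hypothetical helper name; it needs root, like tcpdump itself.
watch_vlan() {
    iface=$1 vid=$2 count=${3:-20}
    # Reject anything that is not a plain number.
    case $vid in
        ''|*[!0-9]*) echo "vlan id must be a number" >&2; return 2 ;;
    esac
    # Usable 802.1Q vlan IDs are 1 through 4094 (0 and 4095 are reserved).
    if [ "$vid" -lt 1 ] || [ "$vid" -gt 4094 ]; then
        echo "vlan id out of range: $vid" >&2
        return 2
    fi
    tcpdump -i "$iface" -e -n -c "$count" vlan "$vid"
}

# Usage: watch_vlan eth0 3
```

One caveat: stick to one vlan ID per run. In a pcap filter each vlan keyword shifts how the rest of the expression is decoded, so something like "vlan 2 or vlan 3" does not do what it looks like.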

Moral of the story: when vlans and their trunks get funky, it might be time to unleash some packet analysis tool to figure out where in the network the problem is. And, even though Wireshark is a pretty nice tool, sometimes you can get the work done with something a bit more lightweight.