
OVN Distributed East/West and L3HA routing on VLAN

HA and Distributed are beautiful words, but complex ones. Within a few seconds, MAC addresses flap in the switch. It is for a good cause, but it annoys the admin, who runs to disable port-flapping detection in the switch and then breathes again.

I guess you can picture the situation; I have seen it happen a few times. The admin already needs to disable port-flapping detection for L3HA to work, so it’s not a big deal.

In the next few pages I explain how OVN L3 routing works over VLAN once Anil’s patches are in place, and I review the life of a couple of ICMP packets through the network.

The network

As the diagram shows, the example network is composed of:

  • 4 Chassis:
    • Gateway Node 1
    • Gateway Node 2
    • Compute Node A
    • Compute Node B
  • 3 Physical networks:
    • Interface 1: VLAN provider network: the network for logical switch traffic; each logical switch will have its own VLAN ID.
    • Interface 2: The overlay network: although it’s not used for carrying traffic we still rely on BFD monitoring over the overlay network for L3HA purposes (deciding on the master/backup state)
    • Interface 3: Internet/Provider Network: Our external network.
  • 2 Logical Switches (or virtual networks):
    • A with CIDR 20.1.0.0/24 and a localnet port to the VLAN provider network, tag 2011 (see the localnet sketch after this list)
    • B with CIDR 20.0.0.0/24 and a localnet port to the VLAN provider network, tag 2010
    • C: irrelevant to this example
  • 1 Logical Router:
    • R1 which has three logical router ports:
      • On LS A
      • On LS B
      • On external network
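
For reference, the localnet wiring can be sketched with a few ovn-nbctl/ovs-vsctl commands. The names below (net2 for LS A, net1 for LS B, physnet1, br-phy) are assumptions for illustration, chosen to match the datapath names that appear later in the ovn-trace annex:

ovs-vsctl set open . external-ids:ovn-bridge-mappings=physnet1:br-phy   # per chassis

# localnet port for logical switch A (net2), VLAN 2011
ovn-nbctl lsp-add net2 ln-net2
ovn-nbctl lsp-set-type ln-net2 localnet
ovn-nbctl lsp-set-addresses ln-net2 unknown
ovn-nbctl lsp-set-options ln-net2 network_name=physnet1
ovn-nbctl set logical_switch_port ln-net2 tag=2011

# localnet port for logical switch B (net1), VLAN 2010
ovn-nbctl lsp-add net1 ln-net1
ovn-nbctl lsp-set-type ln-net1 localnet
ovn-nbctl lsp-set-addresses ln-net1 unknown
ovn-nbctl lsp-set-options ln-net1 network_name=physnet1
ovn-nbctl set logical_switch_port ln-net1 tag=2010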

The journey of the packet

In the next few lines I track a small ICMP echo packet on its journey through the different network elements. You can see a detailed route plan in the Packet 1 ovn-trace annex.

The inception

The packet is created inside VM1, which has a virtual interface with address 20.1.0.11/24 (MAC fa:16:3e:16:07:92) and a default route to 20.1.0.1 (fa:16:3e:7e:d6:7e) for anything outside its subnet.
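
Inside the guest, that setup would look roughly like this (illustrative output for a standard Linux image, not captured from the environment):

$ ip route
default via 20.1.0.1 dev eth0
20.1.0.0/24 dev eth0  src 20.1.0.11
$ ip neigh
20.1.0.1 dev eth0 lladdr fa:16:3e:7e:d6:7e REACHABLE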

On its way out of VM1

As the packet is handled by the br-int OpenFlow rules for the logical router pipeline, the source MAC address is replaced with the MAC of the router’s logical port on logical switch B, and the destination MAC is replaced with the MAC of VM4’s port. Afterwards, the VLAN tag of the destination network (logical switch B) is attached to the packet.
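
If you want to see this on the wire, a capture on the compute node’s VLAN provider interface should show the rewritten addresses and the tag. This is only an illustration; the interface name eth1 is an assumption:

# expect src fa:16:3e:65:f2:ae (R1's port on LS B) > VM4's MAC, vlan 2010, ICMP echo request
sudo tcpdump -eni eth1 'vlan 2010 and icmp'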

The physical switch

The packet leaves the virtual switch br-phy through interface 1, reaching the Top of Rack switch.

The ToR switch CAM table is updated with 2010 / fa:16:3e:65:f2:ae, which is R1’s leg into virtual network B (logical switch B).

vid   MAC                port  age
2010  fa:16:3e:65:f2:ae  1     0
2011  fa:16:3e:16:07:92  1     12
2010  fa:16:3e:75:ca:89  9     10
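
On the ToR itself you can usually confirm what was learned; the exact command is vendor-specific, but on a Cisco-style CLI it would look something like this (illustrative only):

show mac address-table vlan 2010
show mac address-table vlan 2011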

Going into VM4

As the packet arrives at the hypervisor, the VLAN tag is stripped and the packet is directed to the VM4 tap.

The return

VM4 receives the ICMP request and responds to it with an ICMP echo reply. The new packet is directed to R1’s MAC and VM1’s IP address.

On its way out of VM4

As the packet is handled by the br-int OpenFlow rules for the logical router pipeline, the source MAC address is replaced with the MAC of the router’s logical port on logical switch A, and the destination MAC is replaced with the MAC of VM1’s port.

Afterwards, the VLAN tag of the destination network (logical switch A) is attached to the packet.

The physical switch (on its way back)

The packet leaves the virtual switch br-phy through interface 1, reaching the Top of Rack switch on port 9.

The ToR switch CAM table is updated with 2011 / fa:16:3e:7e:d6:7e on port 9, which is R1’s leg into virtual network A (logical switch A).

vid   MAC                port  age
2010  fa:16:3e:65:f2:ae  1     1
2011  fa:16:3e:16:07:92  1     12
2011  fa:16:3e:7e:d6:7e  9     0
2010  fa:16:3e:75:ca:89  9     10

The end

By the end of its journey, the ICMP packet crosses br-phy, where the OpenFlow rules strip the VLAN tag from the localnet port into LS A and direct the packet to VM1, as eth.dst matches VM1’s MAC address.

VM1 receives the packet normally, coming from VM4 (20.0.0.10) through our virtual R1 (fa:16:3e:7e:d6:7e).

The end? Oh no

We still need to explore the case where there are ongoing communications from VM6 to VM3 and from VM1 to VM4 at the same time. Both are East/West flows, and together they make R1’s MAC addresses flip back and forth in the ToR switch CAM table.
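
For example, assuming VM6 also sends into logical switch B and its flow is routed on Compute Node B while VM1’s flow is routed on Compute Node A, the ToR keeps relearning the same router MAC on different ports (illustrative snapshot, derived from the tables above):

vid   MAC                port
2010  fa:16:3e:65:f2:ae  1    <- learned while Compute Node A routes VM1 -> VM4
2010  fa:16:3e:65:f2:ae  9    <- moments later, while Compute Node B routes VM6 -> VM3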

Annex

Packet 1 ovn-trace

$ ovn-trace --detailed neutron-0901bce9-c812-4fab-9844-f8ac1cdee066 'inport == "port-net2" && eth.src == fa:16:3e:16:07:92 && eth.dst==fa:16:3e:7e:d6:7e && ip4.src==20.1.0.11 && ip4.dst==20.0.0.10 && ip.ttl==32'
# ip,reg14=0x4,vlan_tci=0x0000,dl_src=fa:16:3e:16:07:92,dl_dst=fa:16:3e:7e:d6:7e,nw_src=20.1.0.11,nw_dst=20.0.0.10,nw_proto=0,nw_tos=0,nw_ecn=0,nw_ttl=32

ingress(dp="net2", inport="port-net2")
--------------------------------------
 0. ls_in_port_sec_l2 (ovn-northd.c:3847): inport == "port-net2" && eth.src == {fa:16:3e:16:07:92}, priority 50, uuid 72657159
    next;
 1. ls_in_port_sec_ip (ovn-northd.c:2627): inport == "port-net2" && eth.src == fa:16:3e:16:07:92 && ip4.src == {20.1.0.11}, priority 90, uuid 2bde621e
    next;
 3. ls_in_pre_acl (ovn-northd.c:2982): ip, priority 100, uuid 6a0c272e
    reg0[0] = 1;
    next;
 5. ls_in_pre_stateful (ovn-northd.c:3109): reg0[0] == 1, priority 100, uuid 00eac4fb
    ct_next;

ct_next(ct_state=est|trk /* default (use --ct to customize) */)
---------------------------------------------------------------
 6. ls_in_acl (ovn-northd.c:3292): !ct.new && ct.est && !ct.rpl && ct_label.blocked == 0 && (inport == "port-net2" && ip4), priority 2002, uuid 25b34866
    next;
16. ls_in_l2_lkup (ovn-northd.c:4220): eth.dst == fa:16:3e:7e:d6:7e, priority 50, uuid 21005439
    outport = "575fb1";
    output;

egress(dp="net2", inport="port-net2", outport="575fb1")
-------------------------------------------------------
 1. ls_out_pre_acl (ovn-northd.c:2938): ip && outport == "575fb1", priority 110, uuid 6d74b82c
    next;
 9. ls_out_port_sec_l2 (ovn-northd.c:4303): outport == "575fb1", priority 50, uuid d022b28d
    output;
    /* output to "575fb1", type "patch" */

ingress(dp="R1", inport="lrp-575fb1")
-------------------------------------
 0. lr_in_admission (ovn-northd.c:4871): eth.dst == fa:16:3e:7e:d6:7e && inport == "lrp-575fb1", priority 50, uuid 010fb48c
    next;
 7. lr_in_ip_routing (ovn-northd.c:4413): ip4.dst == 20.0.0.0/24, priority 49, uuid 4da9c83a
    ip.ttl--;
    reg0 = ip4.dst;
    reg1 = 20.0.0.1;
    eth.src = fa:16:3e:65:f2:ae;
    outport = "lrp-db51e2";
    flags.loopback = 1;
    next;
 8. lr_in_arp_resolve (ovn-northd.c:6010): outport == "lrp-db51e2" && reg0 == 20.0.0.10, priority 100, uuid 89c23f94
    eth.dst = fa:16:3e:76:ca:89;
    next;
10. lr_in_arp_request (ovn-northd.c:6188): 1, priority 0, uuid 94e042b9
    output;

egress(dp="R1", inport="lrp-575fb1", outport="lrp-db51e2")
----------------------------------------------------------
 3. lr_out_delivery (ovn-northd.c:6216): outport == "lrp-db51e2", priority 100, uuid a127ea78
    output;
    /* output to "lrp-db51e2", type "patch" */

ingress(dp="net1", inport="db51e2")
-----------------------------------
 0. ls_in_port_sec_l2 (ovn-northd.c:3829): inport == "db51e2", priority 50, uuid 04b4900d
    next;
 3. ls_in_pre_acl (ovn-northd.c:2885): ip && inport == "db51e2", priority 110, uuid fe072d82
    next;
16. ls_in_l2_lkup (ovn-northd.c:4160): eth.dst == fa:16:3e:76:ca:89, priority 50, uuid 3a1af0d6
    outport = "a0d121";
    output;

egress(dp="net1", inport="db51e2", outport="a0d121")
----------------------------------------------------
 1. ls_out_pre_acl (ovn-northd.c:2933): ip, priority 100, uuid ffea7ed3
    reg0[0] = 1;
    next;
 2. ls_out_pre_stateful (ovn-northd.c:3054): reg0[0] == 1, priority 100, uuid 11c5e570
    ct_next;

ct_next(ct_state=est|trk /* default (use --ct to customize) */)
---------------------------------------------------------------
 4. ls_out_acl (ovn-northd.c:3289): ct.est && ct_label.blocked == 0 && (outport == "a0d121" && ip), priority 2001, uuid f9826b44
    ct_commit(ct_label=0x1/0x1);
[vagrant@hv1 devstack]$

Glossary

  • E/W or East/West : This is the kind of traffic that traverses a router from one subnet to another subnet, going through two legs of a router.
  • N/S or North/South : this kind of traffic is very similar to E/W, but we distinguish it, at least in the world of virtual networks, when the router has connectivity to an external network: it is the traffic that traverses the router to or from that external network. In the case of OVN or OpenStack, it implies the use of DNAT and/or SNAT in the router to translate internal addresses into external addresses and back.
  • L3HA : Highly available L3 service, which eliminates any single point of failure on the routing service of the virtual network.
  • ToR switch : Top of Rack switch, the switch generally placed at the top of a rack and connected to all the servers in that rack. It provides L2 connectivity.
  • CAM table : CAM means Content Addressable Memory, a specific type of memory that is accessed by key instead of by address; in the case of a switch’s MAC table, it is accessed by MAC + VLAN ID.

Neutron QoS service plugin

Finally, I’ve been able to record a video showing how the QoS service plugin works. If you want to deploy it yourself, follow the instructions under the video (open it in Vimeo for better quality).

Deployment instructions

Add to your devstack/local.conf

enable_plugin neutron git://git.openstack.org/openstack/neutron
enable_service q-qos

Let’s stack!

~/devstack/stack.sh

Now create security group rules to allow traffic to the VM on port 22 (SSH) and ICMP:

source ~/devstack/accrc/demo/demo

neutron security-group-rule-create --protocol tcp       \
                                   --direction ingress  \
                                   --port-range-min 22  \
                                   --port-range-max 22  \
                                   default

neutron security-group-rule-create --protocol icmp     \
                                   --direction ingress \
                                   default

nova net-list
nova boot --image cirros-0.3.4-x86_64-uec \
          --flavor m1.tiny \
          --nic net-id=*your-net-id* qos-cirros
#wait....

nova show qos-cirros  # look for the IP
neutron port-list # look for the IP and find your *port id*

In another console, run the packet pusher

ssh cirros@$THE_IP_ADDRESS \
     'dd if=/dev/zero  bs=1M count=1000000000'

In yet another console, look for the port and monitor it

# given a port id 49d4a680-4236-4d0c-9feb-8b4990ac35b9
# look for the ovs port:
$ sudo ovs-vsctl show | grep qvo49d4a680-42
       Port "qvo49d4a680-42"
           Interface "qvo49d4a680-42"

Finally, try the QoS rules:

source ~/devstack/accrc/admin/admin

neutron qos-policy-create bw-limiter
neutron qos-bandwidth-limit-rule-create bw-limiter \
        --max-kbps 3000 --max-burst-kbps 300

# after next command, the port will quickly
# go down to 3Mbps
neutron port-update *your-port-id* --qos-policy bw-limiter
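
If you want to double-check that the limit actually landed in Open vSwitch, these illustrative commands (not from the original post) show the two places it may appear, depending on how the backend applies it:

# as ingress policing on the port:
sudo ovs-vsctl get interface qvo49d4a680-42 ingress_policing_rate ingress_policing_burst
# or as QoS/queue records:
sudo ovs-vsctl list qos
sudo ovs-vsctl list queue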

You can change rules at runtime, and ports will be updated accordingly

neutron qos-bandwidth-limit-rule-update *rule-id* \
        bw-limiter \
        --max-kbps 5000 --max-burst-kbps 500

Or you can remove the policy from the port, and traffic will quickly spike back up to the original maximum.

neutron port-update *your-port-id* --no-qos-policy

Neutron Quality of Service coding sprint

Last week we had the Openstack Neutron Quality of Service coding sprint in Ra‘anana, Israel to work on [1].

It’s been an amazing experience; we’ve accomplished a lot, but we still have a lot ahead. We gathered at the Red Hat office for three days [2], delivering almost (sigh!) the full stack for the QoS service with bandwidth limiting. On the first day we had a short meeting where we went over the whole picture of blocks and dependencies that we had to complete.

The people from Huawei India (hi Vikram Choudhary & Ramanjaneya Reddy) helped us remotely by bootstrapping the DB models and the neutron client.

Eran Gampel (Huawei), Irena Berezovsky (Midokura) and Mike Kolesnik (Red Hat) revised the API for REST consistency during the first day, providing an amendment to the original spec [12], the API extension and the service plugin [13]. Concurrently, John Schwarz (Red Hat) was working on the API tests, which acted as validation of the work they were doing.

Ihar Hrachyshka (Red Hat) finished the DB models and submitted the first neutron versioned objects ever on top of those models [3][4]; I recommend reading those patches, they are like the nirvana of coding ;).

Mike Kolesnik plugged in the missing callbacks for extending networks and ports; some of those, extending object reads, will be moved to a new neutron.callbacks interface. I mostly worked on coordination and on writing some code for the generic RPC callbacks [5] to be used with versioned objects, where I had lots of help from Eran and Moshe Levi (Mellanox). The current version is very basic, not supporting object updates but only initial retrieval of the resources, hence not a real callback 😉 (yet!).

Eran wrote a pluggable driver backend interface for the service [6], with a default rpc/messaging backend that fit very nicely.

Gal Sagie (Huawei) and Moshe Levi worked at the agent level. Gal created the QoS OvS library with the ability to manipulate queues, configure the limits, and attach those queues to ports [7]. Moshe led the agent design, providing an interface for dynamic agent extensions [8], a QoS agent extension interface [9], and the example for SR-IOV [10]. Gal then coded the OvS QoS extension driver [11].

During the last day, we tried to put all the pieces together. John was debugging API->SVC->versioned objects->DB (you’d be amazed if you saw him going through vim or ipdb at high speed), Ihar was polishing the models and versioned objects, Mike was polishing the callbacks, and I was tying together the agent side. We were not able to fully assemble a POC in the end, but we were able to interact from the neutron client to the server across all the layers. The agent side was looking good too, but I managed to destroy the environment I was using, so I will be working on it next week.

The plan ahead

We need to assemble the basic POC, make a checklist for missing tests and TODO(QoS), and start enforcing full testing for any other non-POC-essential patch. I’m doing it as I write: https://etherpad.openstack.org/p/neutron-qos-testing-gaps

Once that’s done we may be ready to merge back at the end of liberty-2, or at the very start of the next milestone: liberty-3. Since QoS is designed as a separate service, most of the pieces won’t be activated unless explicitly installed, which makes the risk of breaking anything for anyone not using QoS very low.

What can be done better

  • Better coordination (in general): I’m not awesome at that, but I guess I had the whole picture of the service, so that’s what I did.
  • Better coordination with remotes: it’s hard when you have a lot of ongoing local discussions and very limited time to sprint; I’m looking forward to finding formulas to enhance that part.

Notes

In my opinion, the mid-cycle coding sprint was very positive: the ability to meet every day, do fast cross-reviews, and very quickly loop in specific people on specific topics was very productive. I guess remote coding sprints should be very productive too, as long as companies guarantee people the ability to focus on the specific topic; that said, the face-to-face part is always very valuable. I was able to learn a lot from all the other participants about specific parts of neutron I wasn’t fully aware of, and by building a service plugin we all got an understanding of full-stack development, from API request to database, messaging (or not), agents, and how it all fits together.

Special thanks to Gary Kotton for joining us the first day to understand our plan, and for helping us later with reviews towards merging patches on the branch. To Livnat Peer, for organizing the event within Red Hat and making sure we prioritized everything correctly. To Doug Wiegley and Kyle Mestery for helping us with rebases from master to the feature branch to clean up gate bugs on time.

References:

[1] http://specs.openstack.org/openstack/neutron-specs/specs/liberty/qos-api-extension.html#rest-api-impact

[2] https://www.dropbox.com/sh/0ixsqk4dz092ppv/AAAd2hVFP-vXErKacjAdc90La?dl=0

[3] Versioned objects 1/2: https://review.openstack.org/#/c/197047

[4] Versioned objects 2/2: https://review.openstack.org/#/c/197876/

[5] Generic RPC callbacks: https://review.openstack.org/#/c/190635/

[6] Pluggable driver backend: https://review.openstack.org/#/c/197631/

[7] OVS Low level (ovsdb): https://review.openstack.org/196373

[8] Agent extensions: https://review.openstack.org/#/c/195439/

[9] QoS agent extension: https://review.openstack.org/#/c/195440/

[10] SRIOV agent extension: https://review.openstack.org/#/c/195441/

[11] OvS QoS extension: https://review.openstack.org/#/c/197557/

[12] API amendment: https://review.openstack.org/#/c/197004/

[13] SVC and extension amendment: https://review.openstack.org/#/c/197078/


How to split a git commit

Sometimes you write a piece of code within a context, and that context grows wider and wider, or you simply need all the pieces in one place to make sure it works.

Then, during review, or in order to work in parallel, it makes sense to split your patch into smaller, more logical patchlets. I always needed to ask Google how, so I decided to write it down here.

Let’s assume $COMMIT is the commit you want to split (mark it with the edit action in the interactive rebase):

git rebase -i $COMMIT^

Then reset the commit; this will leave your commit’s changes in the working tree, but you will be back at the previous commit:

git reset HEAD^

Now loop over the following until everything is committed:

git add -p   # pick only the hunks you want in this patchlet
git commit
git rebase --continue

If you were working with Gerrit, make sure that only one of your patches (probably the biggest one) keeps the original Change-Id, so the change can still be tracked and old comments remain available.
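
A quick, illustrative way to check which of the new commits kept the Change-Id, and to reword one if needed:

git log -3 --oneline
git log -3 | grep Change-Id
# to edit a commit message, mark the commit with "reword":
git rebase -i HEAD~3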
