You can't get there from here: solving Kubernetes flannel configuration issues

After following the Kubernetes (K8) documentation for installing K8 on bare-metal Ubuntu nodes, I had trouble getting Services to work properly and getting containers on different nodes to see each other.

With the help of Justin Santa Barbara <https://twitter.com/justinsantab> in the K8 #kubernetes-users Slack channel, I was able to determine my flannel overlay network was misconfigured.  (Previously I had assumed it was working OK, because I could create and manage replication controllers, pods, etc. with no problems.)

Basically, containers could see the Internet, but could not see each other unless two pods happened to land on the same node.  Services created via the K8 API were not reachable from outside the node they ran on either, whether from elsewhere in the cluster or from outside it.

Solving the immediate problem

I'm not sure how that configuration got messed up, but I suspect it happened when I installed the latest version of Docker from the Docker-managed PPA.  In order to fix it, here's what I did:

  • I stopped all current replication controllers and my gluster volume. Just in case I screwed something up, I didn't want to accidentally create a split brain.
  • I stopped the flannel daemons on all nodes participating in the cluster.
  • I used /opt/bin/etcdctl to reconfigure the network range under the key "/coreos.com/network/config" -- specifying a range that did not conflict with existing private-addressing schemes on the LAN.
  • After taking note of how to restore the entries just in case, I used /opt/bin/etcdctl to delete all existing subnet leases.  After reading up on flannel a bit, I found that the configuration is actually pretty simple. I would liken it to the way that DHCP assigns IPs from a range; flannel, communicating via etcd, assigns subnet ranges to participating nodes in the same way.  (I also tried to configure it for host-gw networking, since my flat layer 2 network should permit that, but I couldn't get it to work and switched back to vxlan, which seems to work fine.)
  • I restarted the flannel daemons on all hosts, making sure that /etc/default/flanneld contained correct etcd-endpoints and iface lines on all nodes.  (My bare-metal nodes have two ethernet adapters; I've configured the secondary adapter on each node to serve as an isolated network just for gluster nodes to talk to each other.)
  • I looked at /opt/bin/etcdctl ls /coreos.com/network/subnets to make sure the old subnet info hadn't re-created itself somehow, and to make sure that new leases were being assigned out of the newly configured range.
  • I noted the range assigned to each node, which you can see in the output of ifconfig under the flannel.1 adapter.
  • Docker, running on each node, needs to use an IP from the node's assigned flannel subnet as its bridge address for docker0.  Flannel writes this info to /run/flannel/subnet.env, but in the time I had, I couldn't figure out how to get the docker daemon to read those values. Apparently whatever startup process reads /etc/default/docker does not evaluate that file as shell -- I couldn't get it to expand the environment variables.  But, since I knew the subnet range assigned to each node, I just configured the --bip parameter for each node by hand. (This is a very small cluster and this method obviously would not work for anything more than a few nodes.)
  • I restarted the docker daemon on each node so it would pick up the new configuration, and confirmed through ifconfig that the correct docker0 address had been assigned.  It had.
  • I spun up a container with an exposed port and confirmed that I could reach that port from another node in the cluster.  I created a type:NodePort K8 service entry to expose that port to the rest of the world, and that worked also.  Finally I confirmed that I could reach the exposed service through any node in the cluster when connecting to the correct port.  Previously none of this had worked, but now it does.
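
For a cluster with more than a handful of nodes, the hand-configured --bip step could probably be scripted. Here is a minimal sketch in Python, assuming subnet.env holds plain KEY=VALUE lines including FLANNEL_SUBNET and FLANNEL_MTU. The function names and the sample addresses are made up for illustration, and getting the resulting line into /etc/default/docker is left out.

```python
# Sample contents in the KEY=VALUE form flannel writes to
# /run/flannel/subnet.env. The addresses are invented for illustration.
SAMPLE_SUBNET_ENV = """\
FLANNEL_SUBNET=10.200.37.1/24
FLANNEL_MTU=1450
"""


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def docker_opts(env):
    """Build docker daemon flags that put docker0 inside the flannel subnet."""
    opts = ["--bip=" + env["FLANNEL_SUBNET"]]
    if "FLANNEL_MTU" in env:
        opts.append("--mtu=" + env["FLANNEL_MTU"])
    return " ".join(opts)


if __name__ == "__main__":
    # On a real node, read /run/flannel/subnet.env instead of the sample text.
    print('DOCKER_OPTS="%s"' % docker_opts(parse_env(SAMPLE_SUBNET_ENV)))
```

Because the output is a literal string with the values already substituted, it does not matter whether the process reading /etc/default/docker expands shell variables.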

Here are the rc and service YAML files I used: http://pastebin.com/d1J3H5xL

Takeaways

During this process, the logs in /var/log/upstart/docker.log and /var/log/upstart/flanneld.log were invaluable.

It seems like there should be some kind of config lint tool for flannel/kubernetes installs that warns of problems like this. Until I poked around looking for the problem, I never got any messages that anything was wrong! Automated installs are great, but if you don't understand what they are doing under the hood and they get messed up, they are tough to debug.

Solve the world's greatest mystery: Where Did All the Humans Go?

I started thinking about this as a joke.

But the more I think about it, the more I think that it must be done.

Thus I bring to you... Mountain Goat from Fort Pedro Informatics' game division.

"Mountain Goat Mount Massive" by Darklich14. Licensed under CC BY 3.0 via Wikimedia Commons

Observer pattern / Event oriented programming in Python

I recently decided that in one portion of Ancho's code, it would make sense to use an event-oriented paradigm. I'm familiar with the Observer pattern from the Gang of Four (Design Patterns) book, and I've used event models in things like GUI programming and JavaScript. The general idea is that instead of having your code constantly check for some condition, you can tell the event system "let me know when this happens," which can be much more efficient.
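
To make the "let me know when this happens" idea concrete, here is a minimal sketch of the Observer pattern in plain Python. It is not taken from Ancho or from any of the libraries discussed below; the Event class, the file_changed event, and the handler are all hypothetical names.

```python
class Event:
    """A hook that observers can subscribe to."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)
        return handler

    def unsubscribe(self, handler):
        self._handlers.remove(handler)

    def fire(self, *args, **kwargs):
        # Iterate over a copy so a handler may unsubscribe itself safely.
        # Each handler returns before the next one is called.
        for handler in list(self._handlers):
            handler(*args, **kwargs)


received = []


def on_file_changed(path):
    received.append(path)


file_changed = Event()
file_changed.subscribe(on_file_changed)
file_changed.fire("notes.txt")      # the observer is called; nobody polled
file_changed.unsubscribe(on_file_changed)
file_changed.fire("ignored.txt")    # no observers left, so nothing happens
print(received)                     # ['notes.txt']
```

The code that fires the event needs no knowledge of who is listening, which is the main appeal over polling for a condition.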

Event programming can be hard to wrap your head around at first, but this is in a part of Ancho that casual users probably won't see much. I think it's worth the added complexity.

In Python, there is often only one obvious way to do something. But I've never done event programming in Python before, and initial searches did not yield a single obvious preferred way. So, I'm documenting my search here in hopes that it saves others some effort in answering the same question.

Narrowing the Search

There are some sub-categories of event-based programming:

  1. Networked message passing, job queues, etc. This would include things like Celery on top of an underlying task queue. (Actually Queues.io seems like a good summary of all the queue systems out there.)
  2. Responding to events from the network, dealing with concurrency and non-blocking I/O. This would include things like Twisted, Eventlet, gevent, and the like.
  3. Libraries for handling in-process event propagation without regard for networking or I/O. Mostly these are about performance, concurrency issues, and prevention of memory leaks. These can be synchronous (calls to event handlers must return before the next handler is called) or asynchronous (calls to event handlers execute in their own threads and may be executing simultaneously). I plan to use synchronous calls.

For my purposes, I am exclusively interested in the last of the above categories.

Found, but Rejected

  • The Python standard library's signal module. This is really more for dealing with BSD-style signals, which are interruptions from outside the current process, like what happens when you hit Ctrl-C.
  • PyDispatcher. This software was last updated in 2011, which is either because it's really stable and perfect (not likely) or because it's no longer actively used or maintained. Since I'm looking for something that is actively under development, this means that PyDispatcher is out. Besides, Django stopped using PyDispatcher several years ago, in favor of a modified version that was roughly 10 times faster.
  • Michael Foord's Event Hook. This is more of a recipe than an actual piece of code. I'm looking for something that's being maintained.
  • Circuits. This appears to be more heavyweight than what I'm looking for. It is described as a "component architecture" and includes a web server and networking i/o components, which would put it more in my second category.
  • Py-Notify. Last release was in 2008. The author's benchmarking, however, is of interest: compared with PyGTK (a C-based implementation), Py-Notify is considerably slower. That's no surprise, but I might use a similar benchmarking method to compare the libraries that seem like viable candidates.
  • Observed. Interesting, but it is more language-level in that it sends events for all method calls.  You don't seem to have the level of control to create your own events. Not quite what I was looking for.
  • PySignals. Based on the django.dispatch code that replaced PyDispatcher; but apparently not touched since 2010.

Candidates

These are all close enough to what I'm looking for to merit a fuller evaluation.

  • zope.event. Updated in 2014, and seems to be regularly developed. I'm not sure if it has any dependencies on the rest of Zope; if so, that's a big negative. I have also read that it does not support passing arguments through to handlers.  That seems like a potential downside, depending on how my usage patterns shake out.
  • Django's signals implementation. Like zope.event, it's probably integrated somewhat with other code I don't need, but it's well-tested in production environments.
  • PyPubSub. Last updated in February 2014, so that's fairly recent. Support activity seems weak.  I don't see many references to it online.  I'll try it anyway, because its implementation (publish/subscribe with a central dispatcher) is a little different from the others.
  • Events. Bills itself as "Python event handling the C# style." The C# resemblance doesn't do anything for me, but I do like that it's well-documented, has frequent commit activity, and has explicit support for Python 2.7, 3.3 and 3.4.
  • Smokesignal. This seems to be an indirect descendant of Django's signals, as PySignals was, except that this is still actively maintained. As of this writing the last commit was 11 days ago.

So, what I plan to do is read up on these candidates; develop a test workload similar to my use case that can be implemented on each of them; benchmark them for performance; and report back on their overall fitness and ease of use. My initial guess is that zope.event and django.dispatch will not be easy fits, because of their connections to those other frameworks... but we'll see.
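
As a rough idea of what such a benchmark might look like, here is a sketch using the standard library's timeit module. The BaselineDispatcher below is a hypothetical stand-in (a plain list of callables), not one of the candidates; each real library would need a small adapter exposing the same connect/send shape, and the handler count and iteration numbers are arbitrary.

```python
import timeit


class BaselineDispatcher:
    """Hypothetical stand-in: a plain list of callables, fired in order."""

    def __init__(self):
        self.handlers = []

    def connect(self, handler):
        self.handlers.append(handler)

    def send(self, payload):
        for handler in self.handlers:
            handler(payload)


def make_workload(dispatcher, n_handlers=10):
    """Register n_handlers cheap handlers; return (fire, counter)."""
    counter = {"calls": 0}

    def handler(payload):
        counter["calls"] += 1

    for _ in range(n_handlers):
        dispatcher.connect(handler)
    return (lambda: dispatcher.send("tick")), counter


def benchmark(fire, number=1000, repeat=3):
    """Return the best-of-repeat time, in seconds, per fire() call."""
    return min(timeit.repeat(fire, number=number, repeat=repeat)) / number


fire, counter = make_workload(BaselineDispatcher())
per_call = benchmark(fire)
print("%.2f microseconds per event" % (per_call * 1e6))
```

Taking the minimum over several repeats is the usual timeit advice, since the slower runs mostly reflect interference from other processes rather than the code under test.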

The heat will be on!