Multithreaded Rust on Threadripper

I recently ran some benchmarks on a Threadripper 3960X system and the results surprised me quite a bit. Simplified, the throughput the benchmark recorded went down from 341 MB/s on a MBP to 136 MB/s on a Threadripper desktop. I had previously read Daniel Lemire’s notes on the suboptimal performance of simdjson on Zen 2, which is heavily used in the benchmark, but the suggested drop was a few percent, not half.

Long story short, this made me curious what caused this. First stop: perf.

perf crossbeam

Notice the first item? It is crossbeam_channel::flavors::array::Channel<T>::recv. Oh my, I never saw that one hogging so much CPU time; in fact, we spend more time receiving from the channel than we spend parsing or serializing JSON!

Let’s add a bit of Threadripper trivia: the design AMD went with was splitting the CPU from a single piece of silicon into multiple small dies they call CCDs, which in turn consist of two CCXs that then contain the cores and level 1-3 cache. So let’s look at another thing, htop (a trusty little tool to show our load):

htop load on cores

In this screenshot we can spot that one thread seems to be running on the 5th core, one on the 16th and one on the 19th and 20th. Thinking back to the design of the Threadripper this is a bit of a hint: those cores are on different CCXs and, even further, on different CCDs. So what happens if they are on the same one?

Boom 400+ MB/s! taskset -c 0,1,2 does the trick. That’s a really nice improvement, and looking at the perf output we can see recv move from nearly 11% of CPU time to 7.28%. Not only is it nearly 3x faster than the first benchmark, it is also about 20% faster than on the laptop. So far so good.

perf crossbeam

But it still leaves the question of why, and whether we can do something about this. Enter a little benchmark; let’s look at what it puts out for the first core (it’s a lot of output otherwise).

B 0 -  0: -
B 0 -  1: 818us/send
B 0 -  2: 673us/send
B 0 -  3: 2839us/send
B 0 -  4: 2421us/send
B 0 -  5: 2816us/send
B 0 -  6: 3466us/send
B 0 -  7: 3634us/send
B 0 -  8: 3267us/send
B 0 -  9: 3042us/send
B 0 - 10: 3633us/send
B 0 - 11: 3535us/send
B 0 - 12: 3334us/send
B 0 - 13: 3443us/send
B 0 - 14: 3348us/send
B 0 - 15: 3398us/send
B 0 - 16: 3459us/send
B 0 - 17: 3108us/send
B 0 - 18: 3287us/send
B 0 - 19: 3393us/send
B 0 - 20: 3369us/send
B 0 - 21: 3248us/send
B 0 - 22: 3290us/send
B 0 - 23: 3323us/send

B 0 - 24: 487us/send
B 0 - 25: 812us/send
B 0 - 26: 676us/send
B 0 - 27: 2859us/send
B 0 - 28: 2853us/send
B 0 - 29: 2864us/send
B 0 - 30: 3475us/send
B 0 - 31: 3620us/send
B 0 - 32: 3582us/send
B 0 - 33: 3497us/send
B 0 - 34: 3524us/send
B 0 - 35: 3488us/send
B 0 - 36: 3331us/send
B 0 - 37: 3303us/send
B 0 - 38: 3365us/send
B 0 - 39: 3333us/send
B 0 - 40: 3324us/send
B 0 - 41: 3363us/send
B 0 - 42: 3554us/send
B 0 - 43: 3351us/send
B 0 - 44: 3207us/send
B 0 - 45: 3240us/send
B 0 - 46: 3377us/send
B 0 - 47: 3275us/send

First things first, the numbers here are 0 indexed, unlike in htop where they’re 1 indexed. So core 0 here means core 1 in htop. The test runs for only a second per core combination, since it goes through all combinations and longer runs get really slow really fast, so some variation is to be expected. We can see that cores 24-47 are the SMT threads of the physical cores 0-23, so 24 is the second thread on core 0. The second observation is that cores 0-2 are in the same CCX; performance is reasonably fast here. Cores 3-5 seem to be on the same CCD, and so on.
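
The layout read off the numbers above can be written down as a small mapping. This is only a sketch under the assumptions stated in the text (24 physical cores, SMT siblings at index + 24, 3 cores per CCX, 2 CCXs per CCD); the names Placement and placement are made up for illustration, and the enumeration order is what this particular machine showed, not something guaranteed by the hardware or the OS.

```rust
/// Where a logical CPU sits, under the layout inferred from the benchmark.
#[derive(Debug, PartialEq)]
struct Placement {
    physical_core: usize,
    ccx: usize,
    ccd: usize,
}

// Assumed layout of the 3960X as enumerated on this machine.
const PHYSICAL_CORES: usize = 24;
const CORES_PER_CCX: usize = 3;
const CCX_PER_CCD: usize = 2;

/// Maps a 0-indexed logical CPU (0-47) to its assumed placement.
fn placement(logical_cpu: usize) -> Placement {
    // Logical CPUs 24-47 are the SMT siblings of 0-23.
    let physical_core = logical_cpu % PHYSICAL_CORES;
    let ccx = physical_core / CORES_PER_CCX;
    Placement {
        physical_core,
        ccx,
        ccd: ccx / CCX_PER_CCD,
    }
}

fn main() {
    // htop's 5th, 16th and 19th cores are 0-indexed 4, 15 and 18.
    for cpu in [4, 15, 18, 24] {
        println!("cpu {}: {:?}", cpu, placement(cpu));
    }
}
```

Under these assumptions the three busy cores from the htop screenshot land on three different CCDs, while taskset -c 0,1,2 keeps everything inside a single CCX.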

Let’s look at the code for the crossbeam channel. The interesting part is that both head and tail are wrapped in CachePadded. Fortunately I have a friend who keeps going on about false sharing whenever performance becomes a topic, so that was a really good hint here. Looking through the struct, aligning head and tail to the cache line makes a lot of sense since they’re frequently accessed from both sides of the queue, but there is another part that’s frequently used on both sides: the buffer. That is just an array of T, so it might not align well to cache lines. In other words, if we access buffer[x] we might invalidate buffer[x-1] or buffer[x+1] (or more). So what happens if we wrap the elements in CachePadded? The result looks quite nice, the time per send is cut by roughly 50% when going over CCX boundaries:

B 0 -  0: -
B 0 -  1: 630us/send
B 0 -  2: 678us/send
B 0 -  3: 1319us/send
B 0 -  4: 1256us/send
B 0 -  5: 1291us/send
B 0 -  6: 1438us/send
B 0 -  7: 1504us/send
B 0 -  8: 1525us/send
B 0 -  9: 1660us/send
B 0 - 10: 1772us/send
B 0 - 11: 1807us/send
B 0 - 12: 1382us/send
B 0 - 13: 1380us/send
B 0 - 14: 1387us/send
B 0 - 15: 1375us/send
B 0 - 16: 1382us/send
B 0 - 17: 1383us/send
B 0 - 18: 1471us/send
B 0 - 19: 1471us/send
B 0 - 20: 1463us/send
B 0 - 21: 1462us/send
B 0 - 22: 1468us/send
B 0 - 23: 1457us/send

B 0 - 24: 466us/send
B 0 - 25: 619us/send
B 0 - 26: 671us/send
B 0 - 27: 1438us/send
B 0 - 28: 1422us/send
B 0 - 29: 1514us/send
B 0 - 30: 1789us/send
B 0 - 31: 1688us/send
B 0 - 32: 1812us/send
B 0 - 33: 1820us/send
B 0 - 34: 1719us/send
B 0 - 35: 1797us/send
B 0 - 36: 1383us/send
B 0 - 37: 1364us/send
B 0 - 38: 1373us/send
B 0 - 39: 1383us/send
B 0 - 40: 1370us/send
B 0 - 41: 1390us/send
B 0 - 42: 1468us/send
B 0 - 43: 1467us/send
B 0 - 44: 1464us/send
B 0 - 45: 1463us/send
B 0 - 46: 1475us/send
B 0 - 47: 1467us/send

With all of this, the code went from 136 MB/s to over 150 MB/s when not pinned to cores. While this isn’t close to where I’d like it to be, it is a 10% improvement in throughput. And looking at perf again, recv is completely gone from the list, which is nice!
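
To make the padding idea concrete, here is a minimal sketch using only the standard library. It is not crossbeam’s code: Padded is a stand-in for CachePadded, and the 64-byte line size is an assumption (crossbeam picks the alignment per target). It only shows how far apart neighbouring slots end up in memory, which is what decides whether two cores touch the same cache line.

```rust
use std::mem::{align_of, size_of};

/// Pads and aligns a value to an assumed 64-byte cache line.
#[repr(align(64))]
struct Padded<T>(T);

/// Byte distance between the first two elements of a slice.
fn stride<T>(items: &[T]) -> usize {
    assert!(items.len() >= 2);
    let first = &items[0] as *const T as usize;
    let second = &items[1] as *const T as usize;
    second - first
}

fn main() {
    let plain = [0u64; 4];
    let padded = [Padded(0u64), Padded(0u64), Padded(0u64), Padded(0u64)];

    // Plain u64 slots sit 8 bytes apart, so several share one cache line.
    println!("plain stride:  {} bytes", stride(&plain));
    // Each padded slot starts on its own (assumed) cache line.
    println!("padded stride: {} bytes", stride(&padded));
    println!(
        "padded size/align: {}/{}",
        size_of::<Padded<u64>>(),
        align_of::<Padded<u64>>()
    );
}
```

The trade-off is memory: every slot now occupies a full line regardless of how small T is, which is the price paid for the lower cross-CCX send times.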

perf crossbeam

This is the conclusion for now. I’ll keep digging, and if I have more interesting finds I’ll add a continuation.

Dell XPS/Windows as a Dev Env

I’ve recently gotten a Dell XPS 15" 2-in-1 and have been using it as a development environment for the last week. To be honest, as a long-term MacBook user I expected a rather disappointing experience, but to my big surprise I really like it so far. But enough of a preamble. Why am I writing this? Because I figured that a collection of the mistakes I made and the hints I got all over the place would have really helped me, so I am putting one together now. Mind you, a lot of it is not specific to the Dell and will work for every system.

The goal

Let’s set expectations. This is not about making the perfect system; after all, a week isn’t enough to decide what perfect is and find out all the little things that need to get tweaked. The goal here is to get, in as little time as possible, a ‘good enough’ system that works for most everything without regretting it or missing my MacBook.

The OS

I suspect many people will be surprised, but I decided to stay with Windows as an operating system, partially because I don’t have the best opinion of Linux (which I’m not going to discuss here) and partially because I wanted to see how good the out-of-the-box experience is. Last but not least, the whole 2-in-1 experience seems to be excellently integrated with the OS, and I’m a bit scared of having to fight a different OS to get it all working, as it’d be two unknowns to fight instead of just one.

WSL is brilliant, I’ve had very few problems with it (read: none but one super esoteric one we’ll get to later).

My setup

To my great fortune I use Emacs, which means that my environment of choice works nearly everywhere. OSX, Linux, Windows, WSL: there is an Emacs. Other than that I obviously need my Erlang, I really can’t go without it ;). In addition I tossed in some Rust and, since I’m writing this post on the XPS, Ruby. The usual suspects (gcc, make, git and friends) come with the territory.

My initial attempt was to run things on Windows. I tried VS Code and Windows Emacs. Erlang, Rust and git all exist as native Windows binaries, but the whole experience wasn’t great, especially since in the end the code will run on a Unix or Unix-like system anyhow. So in the end I turned to WSL for help and just set everything up there.

Ubuntu 18.04

Ubuntu 18.04 is the WSL installation I picked. I don’t want to go into a distribution war and I’m sure all other distributions work just as well, so pick your favourite. I simply know Ubuntu a bit better than I know SUSE (the alternative from the app store) and can’t be bothered to find out how to add additional distributions.

Most everything comes from apt, with the exception of Rust, which uses rustup, and Erlang, for which I fetched the ESL package to get around Ubuntu’s horrible decision to split up Erlang into multiple packages (seriously WTF?!?).


I use Spacemacs, which installs without problems on WSL: just clone the repository and call it a day. It’ll ask you about additions for languages as you open files. A lot exist, but if you do something more esoteric than Erlang you might have to do some additional fiddling.

Here comes the first issue, WSL does not have an X server (I really hope at some point Microsoft will add one) but in the meantime VcXsrv works quite well.

You can add it to auto-start too, which helps! After running it the first time it’ll ask you to save the config. Do so, then press Windows+R and enter shell:startup to open the auto-start folder, drop the file in there and call it a day.

Once X is started you need to tell the WSL shell where your display is. You can do that by adding the line export DISPLAY=localhost:0 at the end of the ~/.profile file.

With that all set you should have a running Emacs. Feel free to skip the next section, or keep reading here if you want to know a few more fancies.

I really like the Fira Code font. Fortunately it is packaged for Ubuntu, so you can install it by running sudo apt install fonts-firacode. To enable it in Emacs you can follow their tutorial. Here is the gotcha I talked about: while the font works, I’ve not yet managed to get the code points to work, but it’s not that big of a deal to me.

A second fancy but not required thing is an Emacs desktop icon. I really wanted the option to just click it and have Emacs start without having to go through bash and run a command, or having a console window open. There is a simple trick for that. Create a VB script somewhere with the following content:

Set oShell = CreateObject ("Wscript.Shell") 
Dim strArgs
strArgs = "bash -c ~/start-emacs"
oShell.Run strArgs, 0, false

and a shell script in your home directory called start-emacs:


# Point X clients at the VcXsrv display, then launch Emacs.
export DISPLAY=localhost:0
emacs

That’s it. You can then create a shortcut to the VB script on the desktop and give it the nice Spacemacs icon.

Windows / The hardware

There are a few tweaks to the system I had to make to make it more friendly to my use (read MY use, so YMMV).


The touchpad just isn’t on par with Apple’s touchpad. Honestly none is, and I suspect that if you have ever used a MacBook’s touchpad you will easily agree. But the one on the XPS is one of the better ones, probably the best on a PC I’ve ever used, and there are a few tweaks that make it more user friendly.

  1. Disable tap to click, it’s just super annoying if you’re used to the Mac touchpad.
  2. Disable the right side as a right click; two fingers like on the Mac work just fine and feel more natural (to a Mac user).
  3. Set 3 finger gestures to “Switching Desktops and showing desktops”.


The XPS uses what Dell calls a maglev keyboard, which sounds really cool, and it also types quite well. It’s a bit clicky, which is not for everyone, but pressing a key is distinct and you don’t get this “did I press it or not” feeling. Not good or bad, but I noticed that compared to the Mac the alt and ctrl keys are placed differently. While on the Mac FN and Command/apple/windows are on the outside and alt and ctrl are on the inside, the Dell does it the opposite way around. It takes some time to get used to, but I think after a few days the way it’s done on the Dell feels more natural and easier to reach (I still press Windows-D instead of Alt-D way too often :P).

Now the downer. Dell has made the horrible decision to put the page up/down keys right above the arrow keys. I hate it. It’s unnatural, and trying to count the times I went a page up instead of left resulted in an integer overflow. Fortunately Kevin suggested a tool called AutoHotkey, and a simple remapping script solves the problem. AHK translates it to a standalone executable that you can put into the auto-start folder mentioned above. Effectively this re-binds the page up/down keys to left/right, so that no matter which you end up hitting you get the arrow keys.

Migrating to Rebar3

A long journey from rebar2 to rebar3

Rebar 3 has recently started to surface out of alpha state and entered beta1, about time for the crazy people like me to abandon tried and tested tools to venture into the great vastness of the unknown!

So with a backpack, hiking shoes, food for about a week and a direct line to the rebar3 IRC channel I set off to migrate sniffle from rebar2 to rebar3. Now, after it looks like everything is working, I want to write up what exactly went down.

The complete delta can be seen here; please be aware that the upgrade kicked off a bit of a chain reaction with updating libraries too.

3 =/= 2 + 1

The most important thing I found, or rather the biggest misconception I had, is that rebar3 is the next iteration of rebar2. This led to a lot of misery on my part. rebar3 is an entirely different application: the workflow is different, the logic is different and the behavior is different. Just dropping it in and expecting everything to keep working the same will not end well. Treat it like migrating to a different build tool and things are a lot easier.

The simple stuff

Folders and files

Some of the directories change. deps no longer exists and has to be deleted; it can also be removed from the .gitignore file. Instead _build now takes its place, somewhat. It’s different, but it can go into .gitignore.

In the same way, ebin doesn’t exist any longer and should be deleted. The former ebin now also lives in _build, so we don’t need to add anything new to the .gitignore file here. An old ebin will take priority over the .beam files generated in _build. That said, I was pointed to the fact that there are valid reasons to have it around, for example to prevent rebar3 from generating the .app file from .app.src, or if there are additional files compiled by a tool other than rebar3.

Now I said, and even said it in bold, that those folders HAVE TO BE DELETED. That is because they do; if not, an axe murderer will come by your house and kill your cat. Seriously, I was just lucky that I didn’t have a cat, so he left disappointed. Aside from the axe murderer, rebar3 will also rather unexpectedly load things from there, which, unlike the axe murderer, did affect me and caused quite some headache.

There is a new file, rebar.lock, which you want to add to your repository, not ignore. It keeps track of the versions of the libraries that are in the _build directory, or that should be fetched into it if they don’t already exist.


rebar get-deps is deprecated, and so is rebar update-deps. You don’t need them any more; rebar3 figures out by itself when dependencies need to be installed or updated (from the rebar.lock file). There is a new rebar3 deps command, which has nothing to do with the old commands. Instead it is used to list the dependencies of your project (but not the sub-dependencies).

rebar doc is now called rebar3 edoc, which should be noted.

rebar3 dialyzer is new. It replaces the old workflow of running dialyzer on its own and does all the building and checking of .plt files for you. The old trick of grep-ing away known errors to mitigate them no longer works due to the mixed output; however, I was told that Erlang 18 comes with a -dialyzer attribute that can be used to handle this. I am not really sure about this, especially with third-party libraries.

The handling of dependencies changed too: skip_deps=true is no longer needed, nor is -r if you are using an apps/*/... structure for your project. Along with those the -D flag is gone; however, the same can be achieved with profiles, more on that later.

generate was replaced by release, and it now uses relx and not reltool. If you are one of the poor sods (like me) who were using reltool, you are in for a lot more fun here, but that is mostly beyond the scope of this post. If you used relx before this should be straightforward, except that the config now lives in rebar.config. Existing relx.config files will still be used as long as no relx section exists. It should be noted that this also takes care of linking instead of copying files when used with {dev_mode, true}, which can be very nice for developing.


Now there is probably a lot to say about profiles. They are a way to handle differences in behavior, and can for example replace the -D flag like this: {profiles, [{long, [{erl_opts, [{d, 'LONGTESTS'}]}]}]}. I haven’t fully grasped the power and best practices of this, and there is a good article in the docs about it, so I won’t dive further into that.

Something worth pointing out before moving on is that everything that can be in a rebar.config can be in a profile, including plugins, dependencies, erlang options and so on. This makes it incredibly powerful.


Plugins have changed a bit and become a lot more important. Some common tasks in rebar2 now live in a plugin instead of being part of the core system. The most notable here is probably the Port Compiler (or pc as the plugin is called) which is used for building NIFs (like eleveldb).

Hex comes as a plugin, which is really nice; however, the plugin is needed to publish, not to fetch dependencies. This plugin could happily go into the global config; yes, there is a global config in ~/.config/rebar3/rebar.config. However, it is best to keep other plugins out of there.

The EQC (QuickCheck) plugin is very nice if you have QuickCheck, either the free or the commercial version. It should be pointed out here not to put this in the global config, no matter how tempting it is, or the axe murderer will come back. Other than that, you can now put the properties into an eqc folder and separate them from tests, and it is no longer necessary to wrap them in -ifdef(EQC) and -ifdef(TEST). What is especially nice here is that it picks up on the same naming as quickcheck-ci, so that will make things easier.


Dependencies are quite a big topic, but it can be summed up as: forget everything you know about rebar’s handling of dependencies, it’s invalid now.

Perhaps the most obvious change is that in addition to source dependencies you can now include hex packages. The packages can take the form: dflow as ‘the latest version’ (or the version fitting to other packages), or as {dflow, "0.1.6"} to pick a specific version (more details here).

Using packages has a huge advantage, they are cached locally which makes fetching them, especially for big projects, a lot nicer.

Now my experience with rebar2 was that dependencies were handled by just cloning all of the dependencies in the deps folder and then adding them to the library path. This also had the effect that order didn’t really matter. For example you could happily include header files from projects you were not depending on in the application including it.

Now rebar3 actually cares about what you do. For example, I ran into the following situation. I have an application sniffle, and this application has a file include/sniffle_version.hrl. Now sniffle was depending on sniffle_watchdog; however, sniffle_watchdog was including include/sniffle_version.hrl.

     +-------------+             +-------------+
     |   sniffle   |<------------|  watchdog   |
     +-------------+             +-------------+
            |                           ^
            |                           |
            |                           |
            |                           |
            |                           |
+-----------------------+               |
|  sniffle_version.hrl  |---------------+
+-----------------------+

This setup is no problem with rebar2: those files were in apps/sniffle/include and it works great, the file exists and that is all that’s needed. However, with rebar3 this approach is problematic: since sniffle_watchdog does not depend on sniffle, the header will not exist when sniffle_watchdog is compiled. This means that I would have needed to include sniffle in sniffle_watchdog, which is not possible since it would create a circular dependency. The solution for this was simply to put the version header in its own application that gets included into both.

+-------------+             +-------------+
|   sniffle   |<------------|  watchdog   |
+-------------+             +-------------+
       ^                           ^
       |                           |
       |                           |
       |                           |
       | +-----------------------+ |
       +-|    sniffle_version    |-+
         +-----------------------+
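
The reasoning behind the split can be checked mechanically. The following is a small sketch in Rust, used here purely for illustration; it has nothing to do with how rebar3 resolves dependencies internally. The two graphs and all function names are made up for the example: one graph is what would have been needed to keep the header inside sniffle, the other is the layout with a separate sniffle_version application.

```rust
use std::collections::HashMap;

/// Application name -> names of the applications it depends on.
type Deps = HashMap<&'static str, Vec<&'static str>>;

/// Depth-first walk; returns true if `app` can reach something already on `path`.
fn visit(app: &'static str, deps: &Deps, path: &mut Vec<&'static str>) -> bool {
    if path.contains(&app) {
        return true; // we came back to an application we are still visiting
    }
    path.push(app);
    let mut found = false;
    if let Some(children) = deps.get(app) {
        for &child in children {
            if visit(child, deps, path) {
                found = true;
                break;
            }
        }
    }
    path.pop();
    found
}

fn has_cycle(deps: &Deps) -> bool {
    deps.keys().any(|&app| visit(app, deps, &mut Vec::new()))
}

/// Hypothetical: the header stays in sniffle, so the watchdog must depend on it.
fn header_in_sniffle() -> Deps {
    HashMap::from([
        ("sniffle", vec!["sniffle_watchdog"]),
        ("sniffle_watchdog", vec!["sniffle"]),
    ])
}

/// The header lives in its own application that both depend on.
fn header_in_own_app() -> Deps {
    HashMap::from([
        ("sniffle", vec!["sniffle_watchdog", "sniffle_version"]),
        ("sniffle_watchdog", vec!["sniffle_version"]),
    ])
}

fn main() {
    println!("header in sniffle: cycle = {}", has_cycle(&header_in_sniffle()));
    println!("header in own app: cycle = {}", has_cycle(&header_in_own_app()));
}
```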

Another slightly related topic is that when building releases the content of the app file now matters more. It is probably my own shortcoming that I ran into this problem, but I did not include many library applications in the applications section of the .app.src file. That led to them missing from the release and the application dying a horribly painful death. I found the following code snippet rather helpful to track down which applications were missing, followed by a lot of manual labour to find where they should be included.

ls -1 _build/default/rel/sniffle/lib/  | sed 's/-.*//g' | sort > rlibs
ls -1 _build/default/lib | sort > libs
vimdiff libs rlibs

The bottom line

After working with it a bit I think rebar3, when treated as its own tool and not an iteration of rebar, is going to be a huge improvement over existing Erlang build tools: both rebar2 and, most likely, any of the others lurking in the shadows.

The devs are very friendly and responsive and have helped me a great deal during this rather interesting exercise and deserve a lot of credit for the work and for putting up with the involved hatred and anger they receive.

Yes, rebar3 has a learning curve, and in the beginning it can be quite steep, but so does any other tool to be fair. It still is in beta (for a good reason), but bugs are fixed very fast and the help debugging them is outstanding.

If you require a rock solid tool today it is probably best to wait a bit longer until the final release. That said, I have come full circle, from utter hatred and frustration (on day one) to loving it after a week, and will be using it from now on.

Postmortem of an Interesting Bug


After a full network outage in a larger system (7 FiFo instances, a few dozen hypervisors, and a three-digit number of VMs), a small percentage of the VMs stored in FiFo lost the information about which package was assigned to them and which organization they belong to.

Technical background

As part of planned maintenance on the switching layer the entire network was taken down, effectively cutting the communication between any two systems in the network. During the maintenance the FiFo system was kept “hot”, with no services disabled or paused. At the end of the maintenance window the network was enabled again as planned, reinstating communications.

FiFo background

We will focus on the sniffle and chunter components, since those are the parts of FiFo that handle information related to the symptoms. Generally all FiFo data is stored in CRDTs, which provide conflict resolution and nearly loss-free merging even of entirely divergent data.


Sniffle is a distributed system that runs on top of riak_core. All (7) nodes are connected as a mesh network via distributed Erlang. While the nodes are connected, data and functionality are split up in Dynamo fashion; at the default N value of 3 that means every node handles (total data)*3/7 of the data.

In the case of a node failure the adjacent node takes over the function of the failed node. Missing data is resolved by a method called “read repair”, meaning that when data is requested and 1 out of the 3 nodes responds with no data that “missing” data is repaired.

Once the node failure is repaired, the system that took over the work for the failed node performs what is called a handoff, sending its (supposedly) updated data to the returning node to bring it up to speed.


Under normal conditions chunter will send incremental updates about changing VM state to sniffle. If a chunter system connects to sniffle for the first time it will register its VMs, meaning they are created if they do not exist, and assigned to the hypervisor. The registration contains nearly all relevant VM information, with some exceptions that do not concern the hypervisor. The information that went missing was among those exceptions.

Sniffle nodes are discovered via mDNS broadcasts, the first broadcast received will trigger the initial registration sending the request to this node.

What happened?

It was a combination of conditions that led to the problem, which explains why only a small percentage of VMs were affected.

During the outage

1) The network outage cut the communication between all sniffle nodes, so each node was on its own handling the full load of the requests.
2) The network outage cut communication of nearly all chunter nodes; only those that resided on a hypervisor that also contained a FiFo zone did not lose connectivity.

After the outage

Now a race condition occurred. On one side distributed Erlang started to re-discover the other nodes, reestablishing communication in sniffle and initiating handoffs (handoffs of nearly all parts).

On the other side the periodic mDNS broadcasts from the sniffle nodes triggered re-registration of the chunter nodes.

The bug

The bug was triggered by the fact that, while receiving a handoff, a vnode wrongly assumed that the received data must be newer than its own and overwrote the currently stored data instead of merging it.

Due to the usual repair during reads this turned out to be quite an edge case that was only triggered when:

1) The mDNS broadcast was received from a node that neither holds the right partition nor is connected to a node holding the right partition.
2) The registration happened on three vnodes that did not hold that data, since as long as one node held the data it would have been merged with it instead of being re-created.
3) The handoff on all three systems was performed before any read happened, since the read repair would otherwise have merged the data.
4) The handoff of all three systems was performed before AAE triggered, since AAE triggers a read repair on inconsistent data.

The fix

The fix was rather simple: instead of overwriting existing data on a handoff, the handoff is now treated like a read repair, and the received and existing data are merged.
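
The difference between the two behaviours can be shown with a toy model. This is a sketch in Rust for illustration only, not FiFo’s Erlang code or its actual CRDT types: it assumes a record is a map of fields with a logical timestamp each, merged per field with last-write-wins. The field names and values (package "small", org "acme") are invented.

```rust
use std::collections::HashMap;

/// A field value tagged with a logical timestamp (last write wins).
#[derive(Clone, Debug, PartialEq)]
struct Versioned {
    time: u64,
    value: String,
}

type VmRecord = HashMap<String, Versioned>;

fn field(time: u64, value: &str) -> Versioned {
    Versioned { time, value: value.to_string() }
}

/// The buggy behaviour: the incoming handoff replaces whatever is stored.
fn overwrite(_local: &VmRecord, incoming: &VmRecord) -> VmRecord {
    incoming.clone()
}

/// The fixed behaviour in spirit: per field, keep whichever side wrote last.
fn merge(local: &VmRecord, incoming: &VmRecord) -> VmRecord {
    let mut merged = local.clone();
    for (key, theirs) in incoming {
        let keep_ours = merged
            .get(key)
            .map_or(false, |ours| ours.time >= theirs.time);
        if !keep_ours {
            merged.insert(key.clone(), theirs.clone());
        }
    }
    merged
}

/// What the vnode already holds: the full record, including package and org.
fn stored() -> VmRecord {
    HashMap::from([
        ("package".to_string(), field(1, "small")),
        ("org".to_string(), field(1, "acme")),
        ("state".to_string(), field(1, "running")),
    ])
}

/// What arrives via handoff: a record re-created by registration, without
/// the fields the hypervisor does not know about.
fn handoff() -> VmRecord {
    HashMap::from([("state".to_string(), field(2, "stopped"))])
}

fn main() {
    println!("overwrite: {:?}", overwrite(&stored(), &handoff()));
    println!("merge:     {:?}", merge(&stored(), &handoff()));
}
```

With overwrite the package and org fields vanish, which matches the symptom; with merge they survive and only the newer state is taken over.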

Post-mortem of a Failed Support Case.

Every now and then I check the link reports for Project FiFo to see what people think and write about it. Recently I stumbled upon an article that oddly enough made me both proud and sad. It actually was a rather negative one, which is a shame, but on the other hand a project isn’t mature until people care enough to complain.

Yet even though it would be very easy to cast this aside as a ‘success’ of a strange kind, it still bothers me that someone is upset enough with FiFo to spend their time writing a longish blog article and writing their own management software. I do take great pride in the fact that we do our best to have outstanding documentation and good support for users of FiFo, and I dare say so does everyone else who is involved in the project.

So armed with the full history of events plus the additional background from the article mentioned above, I want to try a post-mortem analysis of this support case: what went wrong and how it could have been avoided. I obviously will try to be as impartial as I can be, but in the end I can only look at what I see. Hopefully there is a lesson to be learned from this so that next time things go smoother.

As a disclaimer: the problem is still a mystery and will probably remain so forever. Also all documentation linked is from the time the bug was filed (thanks for versioning) to give a fair picture of what was available back then.

Act 3 - the mailing list

The history of this case starts a bit into the history of the whole story, but for now let’s skip the first part and look at the first time we (as in the FiFo team) became aware of the problem: a mail to the mailing list and the conversation following it.

Sadly the initial mail gives very little to no information on the problem aside from ‘not firing on all cylinders/Suddenly FiFo stopped working all by it self’. I must admit that for me such a thing is a big red flag that the person on the other end does not care much or is not willing to invest energy into finding the problem. Looking back from today, and after reading the articles on Gordon’s blog, it might just as well have been the kind of humor used; we will never know.

But fortunately I am not the only one on the mailing list, and others are slower to judge and have more patience. Mark jumped in to help, starting by asking for additional information that was missing from the original mail. After that information did not show any additional evidence that could lead to an easy conclusion, Mark asked him to take a look at FiFo’s Problem Checklist and to come to the IRC channel for further help.

Act 3 conclusion

Mark replied to the initial mail within less than an hour, which is a truly amazing response time that can put most commercial support to shame. The request to go through the checklist and join the channel for more direct help is a very good approach.

Since I can actually look into my own head, I can shed a bit more light on my take at that point. I decided to opt out of that request at that moment, since Mark seemed to have it covered and the tone of the mail had already made me sort it into the bucket of ‘going to be annoying’. So I probably should not judge as fast.

To receive the best possible support there are some things Gordon could have done: the initial mail could have contained more information, the tone could have been different (jokes often communicate badly over written text), and following Mark’s request to go through the checklist would probably have helped too, if only to get some additional facts for later.

Act 4 - IRC

Mark suggested swapping to IRC for quicker communication; we all hate mail ping-pong, I guess. Fortunately we keep logs of the FiFo channel to make it easy to google for known problems that were discussed before. In this case it allows us to look at the conversation between Mark (aka trentster) and Gordon (g-flemming).

This is pretty uneventful: Mark asks a few more questions to rule out common errors, none of which seem to be the cause. At this point Mark asks to escalate the issue and file a ticket with the logs, and offers the suggestion to move to a newer version of FiFo (the development build at that time), along with the manual containing the steps necessary to do this. The reasoning is that the development build is quite well tested, with multiple people (including Mark and myself) running it.

Act 4 conclusions

Things are still going pretty well at this point; our escalation process seems to work wonderfully and the documentation holds a lot of valuable information.

This is where the story ends for Mark, and in all honesty he did an astonishing job providing support here and escalating when he ran out of ways to help.

On the channel Gordon mentions for the first time that he is running FiFo with customers on it. This indicates some urgency on the matter, and I, at least, missed that line. Sadly he did not seem to have gone through the troubleshooting steps or read the update documentation.

The lesson to learn here for us is to either ask people to include an IRC log in a new ticket or do it ourselves after they create it, this might have added some additional info to the ticket that was not present when it was created.

Act 5 - The ticketing system

Now this is where my part in the story starts, just as before I’ll try to give some additional insight on what I was thinking at this time to shed a bit of light on my end - I obviously can’t do the same for Gordon.

As asked by Mark, Gordon created a JIRA ticket with the logs. When I saw the ticket I had already read the mailing list but not the IRC backlog (I don’t always do that since I don’t want to spend too much time reading old things that might not even affect me). From the ML I already had an unhappy feeling about the issue, as it looked to me as if Gordon was not willing to put much effort into resolving it.

Nonetheless I take a look at the logs and try to piece together what exactly happened; to this date I do not know what it was. The suggestions are pretty close to what Mark suggested earlier: a problem with too little memory or with the filesystem, since the einval error in the logs hinted at some issue with a POSIX filesystem call.

And at this point things go wrong. Lacking the information that there are production users on the system, I make the mistake of judging the issue as non-urgent, especially after the full reply from Gordon takes a day. At that point I pretty much stop caring about the issue since I feel neither does Gordon (which is a grave misjudgment, as it turns out). I revisit the issue only a week later, at which point I can only assume Gordon has given up, and in a last attempt to provide a direction I suggest looking at FS errors, missing the question asked in Gordon’s reply.

Gordon makes the mistake of not reading the documentation Mark gave him earlier, which would have answered the question he posted the next day, and probably also of waiting a day before replying.

Act 5 conclusion

I need to stop drawing conclusions so quickly and read bug reports more carefully, or I would not have missed the migration question in the bug report. It would also be worth trying not to treat less urgent tickets with less care. It sucks to have tickets open; the goal should be to close (as in resolve) them as soon as possible even if they do not seem urgent, and this applies especially to bugs.

Gordon could have made his own life a lot easier by including more information in the ticket: noting that it was an urgent issue and including the history of what he had already done to debug it would both have helped a lot. In addition, actually reading the documentation Mark provided would have answered the migration question beforehand.

Mark, while already out of the picture, could have included the chat history in the ticket when seeing that Gordon did not; this admittedly is asking a lot.

Act 0 - how everything started

I know this is the wrong order, but Star Wars got away with it too. Now, after the fact, I know more than I did before. Reading Gordon’s blog posts shed some light on the history, and I think this is where things actually started to go wrong.

I am glad for every person choosing to try out FiFo; it means people put trust in what we’ve built and that is a really cool thing. But please, if you want to use something in production, inform yourself ahead of time. Don’t put yourself and your customers at risk by blindly running into things.

Talk to us, even before you start deploying! We know FiFo inside out; everyone in the FiFo team is running it themselves, either for fun, for profit, for testing or for all three. The channel is helpful too: there are people outside the core team there who will gladly share their stories.

To put this straight: you’ll not only get the software for free, you will even get some “consulting” tossed into the mix for nothing more than just asking! That is of course within sensible limits and given we have time, but there is always half an hour to spare here or there.

Act 0 conclusions

Deploying on a single node for a production system was not the best move; FiFo clusters for a reason, and a distributed setup has many advantages over a single node. Probably some of the config settings could have been tweaked for a better user experience. It might even have made sense to run on dev instead of release, or at least to switch early.

Had we known the surrounding circumstances helping might have been a lot easier.

Act 1 - be active in the community

There is a huge advantage to this, and I don’t even mean the fact that the community grows and everyone benefits. Being present in the channel and occasionally talking to people helps you stay in the loop, know what is going on, what problems other people face and what solutions they find.

It also gives you the chance to influence the course of the project; a lot of the features were thought up and discussed within the community. And last but not least, being active might actually end up helping others, which will make the community as a whole stronger.

Act 1 conclusion

I admit I am totally biased here. Everyone’s time is limited. When I have to choose between helping a stranger I know nothing about and someone I know who has contributed in one way or another to the community, I will pick the community member every time.

I can only speak for myself here, but I know for a fact that if Gordon or one of his colleagues had been around in the channel and been a known face I would have taken the problem more seriously. That said, I have no idea whether it is good or bad that I put community members before strangers.

Act 2 - read the fantastic manual

As Gordon points out, FiFo is not a trivial application. And he is entirely right, it is not, and there are many reasons for that. I am not going to argue here whether that is a good or a bad thing; I will simply state that I have thought long about every choice I made in FiFo and claim them to be sound.

But we are well aware that it is not a simple piece of software like an editor, which is why we put a lot of time and effort into providing a manual and an extensive set of information surrounding it.

We have a fully fledged manual and guides for installation, migration and updates. Best practice articles for scaling, networking and clustering. Checklists for problems and known issues. A list of terminology, information about our versioning system, recorded trainings (admittedly not many) and videos on usage.

For people interested in developing we have API documentation, starter guides and a documented build process. We have a list of internal libraries and specifications of the data structures. A guide to plugins, the messaging system and how to write plugins.

And last but not least we even have a page in which we explain how to best submit a bug.

Act 2 conclusion

Reading those documents would probably have saved a lot of time and pain, and the FiFo team some work. This is a recurring problem that we sadly see way too often; if anyone has suggestions on how to encourage users to read manuals, please share the holy grail.

The documents would have given good advice on how to set up a redundant FiFo, how to check for problems and, perhaps most importantly, how to properly report a bug.

The Rant

Given I spent the last two years of my life working on Project FiFo I feel I am entitled to this. Bad bug reports are a pet peeve! And this was a prime example of one.

All the right signs were there: an entirely nonsensical title and the catch phrases ‘it stopped working’ and ‘nothing had changed’. I can’t say if this is the one case in a million where that was actually true, but in all my time in IT I have never seen those words be correct, nor have I ever heard of someone else who saw the mystical situation of something just ceasing to work.

There was no usable information in there, no sign of interest in actually helping the process of finding the root cause. Just because FiFo is free and there is no charge for using it does not mean my time is worth any less than yours. I’ll gladly help you with a problem, but I expect some engagement in return. I do not like having my time wasted by having to pull every bit of information out of someone’s nose.

Bottom line is: If you don’t care enough about your problem to put some effort into getting help I will not care enough to help.

Backups With Project FiFo

With 0.4.3 FiFo introduces support for LeoFS, and this allows for some quite nice new features. Most importantly it decouples FiFo’s operations from storing large amounts of data, which makes maintaining either of them much more sensible and scaling storage much easier.

Then again, while nice, that is not the important part. Just storing datasets somewhere else does not make much of a difference for most users, but LeoFS allows FiFo to store much more data than would be sensible in the old setup. ‘A lot more’ here means pretty much as much as you can store.

So with these options FiFo 0.4.3 introduces backups! Backups complement the snapshots that have been in the system for quite a while, but while snapshots were made to stay on the hypervisor, backups are supposed to be shipped off to LeoFS. This not only helps to keep the number of snapshots limited and does not count against the local quota, it also widens the failure domain.

And to make it better there is a sensible concept of incremental and full backups. That said, there are a few limitations to be aware of:

  • Backups that stay on the hypervisor will count against the local quota.
  • Once a backup is moved away from the hypervisor it can’t be restored without overwriting the current state.
  • Restoring a backup might mean first deleting the local ZFS volume for the VM.

But that aside, there are some very interesting things:

  • While it’s not possible to keep multiple branches with snapshots this is very well possible with backups.
  • It’s possible to choose between incremental and full backups, trading recovery speed against space efficiency.

All that sums up to something quite awesome: it allows for proper grandfather backup concepts for VMs, and they can even be scripted using the FiFo Python client. So here is an example of how this could be done.

Let’s quickly describe what we want to achieve:

  • Every month we want a full backup.
  • Every week we want an incremental backup
    • for the first week in a month towards the monthly full backup
    • for other weeks to the previous week
  • Every day we want an incremental backup
    • for the first day of a week from the week backup
    • for other days from the previous day

To allow for the incremental backups we need to keep some of the backups around:

  • monthly until the first weekly was done.
  • weekly until the next weekly was done but not longer than the next monthly.
  • daily until the next daily but not longer than the next weekly or monthly.
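To make the scheme above a bit more tangible, here is a small Python sketch that decides which kind of backup to take on a given day and which kind of backup it should hang off. It is only an illustration: the function name `backup_plan` is made up, and it assumes that the monthly backup runs on the 1st, the weekly one on Mondays, and that a daily backup simply uses whatever was taken the day before as its parent. None of that is prescribed by FiFo.

```python
from datetime import date, timedelta


def backup_plan(day: date) -> dict:
    """Return the kind of backup to take on `day` and the kind of its parent.

    Assumptions (for illustration only): monthly on the 1st, weekly on
    Mondays, daily otherwise.
    """
    if day.day == 1:
        # Full backup, no parent needed.
        return {"kind": "monthly", "parent": None}
    if day.weekday() == 0:  # Monday
        # If the previous Monday was in the last month (or was the 1st, when
        # the monthly ran instead), this is the first weekly of the month.
        first_weekly = day.day <= 8
        return {"kind": "weekly", "parent": "monthly" if first_weekly else "weekly"}
    # Daily: the parent is whatever backup was taken yesterday.
    yesterday = day - timedelta(days=1)
    if yesterday.day == 1:
        parent = "monthly"
    elif yesterday.weekday() == 0:
        parent = "weekly"
    else:
        parent = "daily"
    return {"kind": "daily", "parent": parent}


if __name__ == "__main__":
    for d in (date(2024, 1, 1), date(2024, 1, 8), date(2024, 1, 9), date(2024, 1, 10)):
        print(d, backup_plan(d))
```

A cron job could call something like this once a day and hand the result to the backup script.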

The FiFo backup code fortunately helps a lot with this. An important part of the logic is ‘create an incremental snapshot and delete its parent from the hypervisor’, and that is exactly the behavior of the backup code when passing a parent and requesting a delete.

Here is some code (with some comments added):

#!/usr/bin/env bash
# Usage (assumed): <script> monthly|weekly|daily <vm>
fifo=${fifo:-fifo}  # assumption: the FiFo client is on the PATH as 'fifo'
vm=$2               # assumption: the VM is passed as the second argument

case $1 in
    monthly)
        $fifo vms backups $vm create monthly
        # After creating a new monthly backup we delete the last weekly and daily backup
        last_daily=$($fifo vms backups $vm list -pH --fmt uuid,local,comment | grep 'daily' | grep 'YES' | tail -1)
        if [ ! -z "$last_daily" ]; then
            daily_uuid=$(echo $last_daily | cut -d: -f1)
            # the -l flag tells FiFo to only remove the backup from the hypervisor.
            $fifo vms backups $vm delete -l $daily_uuid
        fi
        last_weekly=$($fifo vms backups $vm list -pH --fmt uuid,local,comment | grep 'weekly' | grep 'YES' | tail -1)
        if [ ! -z "$last_weekly" ]; then
            weekly_uuid=$(echo $last_weekly | cut -d: -f1)
            $fifo vms backups $vm delete -l $weekly_uuid
        fi
        ;;
    weekly)
        last_backup=$($fifo vms backups $vm list -pH --fmt uuid,local,comment | grep 'monthly\|weekly' | grep 'YES' | tail -1)
        uuid=$(echo $last_backup | cut -d: -f1)
        $fifo vms backups $vm create --parent $uuid -d weekly
        # After creating a new weekly we need to make sure to delete the last daily one
        last_daily=$($fifo vms backups $vm list -pH --fmt uuid,local,comment | grep 'daily' | grep 'YES' | tail -1)
        if [ ! -z "$last_daily" ]; then
            daily_uuid=$(echo $last_daily | cut -d: -f1)
            $fifo vms backups $vm delete -l $daily_uuid
        fi
        ;;
    daily)
        last_backup=$($fifo vms backups $vm list -pH --fmt uuid,local,comment | grep 'daily\|weekly' | grep 'YES' | tail -1)
        uuid=$(echo $last_backup | cut -d: -f1)
        # assuming the fields come out in --fmt order, the comment is field 3
        type=$(echo $last_backup | cut -d: -f3)
        case $type in
            weekly)
                # keep the weekly on the hypervisor, the next weekly needs it as parent
                $fifo vms backups $vm create --parent $uuid daily
                ;;
            daily)
                $fifo vms backups $vm create --parent $uuid -d daily
                ;;
        esac
        ;;
esac

An Asynchronously GCed OR Set

Following the article about Asynchronous garbage collection with CRDTs I experimented with implementing the concept. The OR Set is a very nice data structure for this since it’s rather simple and so is its garbage!

To garbage collect the OR Set we do the following: we take some of the elements of the remove set and delete them from both the add and the remove set. This way we save the space for them and generate a new baseline.

The first step was to implement the data structure described to hold the collectable items. I call it a ROT (Roughly Ordered Tree); it’s a nice name for garbage related stuff ;) and it is treeish and mostly ordered.

The interface of the ROT is rather simple. Elements must be time tagged, in the form {Time, Element}, where Time does not need to be a clock as long as the Erlang comparison operations work on it to give an order. The ROT then allows asking for full buckets, and removing buckets based on their hash value and newest message timestamp.
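To give a feeling for that interface, here is a toy Python stand-in. It is not the Erlang `rot` module: the class name `Rot` and its methods are made up for this sketch, it keeps a flat sorted list instead of a tree, and it simply cuts that list into fixed-size buckets, so bucket boundaries move when elements arrive out of order (the real structure has to deal with that).

```python
import hashlib


class Rot:
    """Toy ROT: time-tagged elements, grouped into fixed-size buckets."""

    def __init__(self, bucket_size=3):
        self.bucket_size = bucket_size
        self.items = []  # list of (time, element), kept ordered by time

    def add(self, tagged):
        """Insert a (time, element) pair; any comparable `time` will do."""
        if tagged not in self.items:
            self.items.append(tagged)
            self.items.sort(key=lambda te: (te[0], repr(te[1])))

    def _buckets(self):
        """Cut the ordered list into chunks and keep only the full ones."""
        n = self.bucket_size
        chunks = [self.items[i:i + n] for i in range(0, len(self.items), n)]
        return [c for c in chunks if len(c) == n]

    @staticmethod
    def _tag(bucket):
        """Identify a bucket by a hash of its content and its newest time."""
        digest = hashlib.sha1(repr(bucket).encode("utf-8")).hexdigest()
        return (digest, bucket[-1][0])

    def full(self):
        """Return the (hash, newest_time) tag of every full bucket."""
        return [self._tag(b) for b in self._buckets()]

    def remove(self, tag):
        """Drop the bucket with the given tag and return its content."""
        for bucket in self._buckets():
            if self._tag(bucket) == tag:
                self.items = [x for x in self.items if x not in bucket]
                return bucket
        return []
```

Asking `full()` for tags and handing one of them to `remove()` is roughly the round trip the garbage collection needs.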

While the elements in the OR Set are already tagged with a timestamp, this timestamp records the addition, not the deletion. It would be misleading to use it, since the ROT would think the remove happened when the addition actually happened, and this would violate the rule that no event can travel back behind T100. As a result we have to double timestamp the removes, that is, add a second timestamp for when the element was removed.

So since the ROT has a very similar interface to a G Set (which implemented the remove set before) the change is trivial. Remove, GC and the merge function are more interesting.


remove(Id, Element, ORSet = #vorsetg{removes = Removes}) ->
    CurrentExisting = [Elem || Elem = {_, E1} <- raw_value(ORSet),
                               E1 =:= Element],
    Removes1 = lists:foldl(fun(R, Rs) ->
                                   rot:add({Id, R}, Rs)
                           end, Removes, CurrentExisting),
    ORSet#vorsetg{removes = Removes1}.

Id defaults to the current time in nanoseconds since it’s precise enough for most cases, but can be given any value that provides timed order. Lines 2 and 3 collect all observed and not yet removed instances of the element to delete; we then fold over those instances and add each of them to the ROT.


gc(HashID,
   #vorsetg{
      adds = Adds,
      removes = Removes,
      gced = GCed}) ->
    {Values, Removes1} = rot:remove(HashID, Removes),
    Values1 = [V || {_, V} <- Values],
    Values2 = ordsets:from_list(Values1),
    #vorsetg{adds = ordsets:subtract(Adds, Values2),
             gced = ordsets:add_element(HashID, GCed),
             removes = Removes1}.

To GC the set we take the HashID (this is what the ROT returns when it reports full buckets) and in line 6 remove it from the ROT. Thankfully the ROT returns the content of the deleted bucket. This comes in very handy, since in the process of garbage collecting the bucket we also need to remove the items once and for all from the add list, as seen in line 9. We then record the GC action in line 10 to make sure it will be applied during a merge.

Please note that currently this set, even though it is garbage collected, still grows without bounds since the GC actions themselves are not (yet) garbage collected. This will be added in a later iteration.


merge(ROTA = #vorsetg{gced = GCedA},
      ROTB = #vorsetg{gced = GCedB}) ->
    #vorsetg{
       adds = AddsA,
       gced = GCed,
       removes = RemovesA}
        = lists:foldl(fun gc/2, ROTA, GCedB),
    #vorsetg{
       adds = AddsB,
       removes = RemovesB}
        = lists:foldl(fun gc/2, ROTB, GCedA),
    ROT1 = rot:merge(RemovesA, RemovesB),
    #vorsetg{adds = ordsets:union(AddsA, AddsB),
             gced = GCed,
             removes = ROT1}.

Merging gets a bit more complicated due to the fact that we now have to take into account that values might be garbage collected in one set but not in the other. While merging them would do no harm, it would recreate the garbage, which isn’t too nice. So what we do is apply the recorded GC actions to both sets first, as seen in lines 3 to 11, then merge the remove values (line 12) and finally the add values (line 13).
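For readers who don’t speak Erlang, the same gc-then-merge idea can be sketched in Python. This is a simplified stand-in, not a port: the names `ORSet`, `gc`, `merge` and `value` are made up, and the remove set is modelled as a plain dict from bucket id to entries instead of a ROT. It also leans on the T100 assumption: a bucket that one replica has collected is assumed to be fully present on the other.

```python
from dataclasses import dataclass, field


@dataclass
class ORSet:
    adds: set = field(default_factory=set)       # {(tag, element)}
    removes: dict = field(default_factory=dict)  # bucket_id -> {(tag, element)}
    gced: set = field(default_factory=set)       # bucket ids already collected


def gc(bucket_id, s):
    """Apply one GC action: drop the bucket and its entries from the adds."""
    values = s.removes.get(bucket_id, set())
    removes = {k: v for k, v in s.removes.items() if k != bucket_id}
    return ORSet(adds=s.adds - values,
                 removes=removes,
                 gced=s.gced | {bucket_id})


def merge(a, b):
    """Replay the other side's GC actions first, then union what is left."""
    for bucket_id in b.gced:
        a = gc(bucket_id, a)
    for bucket_id in a.gced:
        b = gc(bucket_id, b)
    removes = {k: set(v) for k, v in a.removes.items()}
    for k, v in b.removes.items():
        removes[k] = removes.get(k, set()) | v
    return ORSet(adds=a.adds | b.adds, removes=removes, gced=a.gced | b.gced)


def value(s):
    """Elements that were added and not (yet) removed."""
    removed = set().union(*s.removes.values())
    return {e for (_, e) in s.adds - removed}
```

Replaying the GC actions before taking the unions is what keeps already collected garbage from creeping back in through the other replica.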


I set up some proper tests for the implementation, comparing the GCed OR Set (bucket size 10) with a normal OR Set, running 1000 iterations with a set of 1000 instructions composed of 70% adds and removes, 20% merges and 10% GC events. T100 is a sliding time that allows the collection of events older than the last merge.

Each stored element had a size of between 500 and 600 bytes (so there were 100 possible elements). A remove always removes the stalest element; since elements are added in random order this equals a random remove.

The operations are carried out on two replicas of the set, where add and remove have an equal chance of happening just on copy A, just on copy B, or on both replicas at the same time. GC operations are always carried out on both replicas, but it should be noted that the GC operation does not include a merge operation and so can be considered asynchronous.

All operations but the GC operation are executed exactly the same on the GCed OR Set and the non-GCed OR Set, in the same order and with the same spread.

At the end a final merge was performed and the resulting values compared for each iteration, no additional GC action takes place at the end.

Measured were both the space reduction per GC run and the final difference in size. Per GC run about 15% of the space was reclaimed, and at the end the GCed set had a total space consumption of around 26% of the normal OR Set on average, 6% in the best and 143% in the worst case.

src/vorsetg.erl:389:<0.135.0>: [Size] Cnt: 1000,   Avg: 0.261,  Min: 0.062, Max: 1.507
src/vorsetg.erl:389:<0.135.0>: [ RS ] Cnt: 49221,  Avg: 0.866,  Min: 0.064, Max: 1.0
src/vorsetg.erl:389:<0.135.0>: [ GC ] Cnt: 49221,  Avg: 55.870, Min: 0,     Max: 6483
src/vorsetg.erl:389:<0.135.0>: [ MG ] Cnt: 98357,  Avg: 58.110, Min: 0,     Max: 6596
src/vorsetg.erl:389:<0.135.0>: [ OP ] Cnt: 344708, Avg: 38.539, Min: 0,     Max: 6916

The numbers are from a test run, manually truncated after 3 digits for readability and aligned to be easier to read. Size is the total size at the end of the iteration, RS is the space reduction per GC run. GC, MG and OP are the time used for garbage collection, merging and other operations respectively; the numbers are per execution and measured in microseconds. Time measurements also include noise from additional operations required for the test and should not be seen as a useful benchmark!


The GC method described seems to work, and not even too badly. In the course of experimenting with values it showed that the conserved space is heavily dependent on the environment, like the bucket size chosen, the size of the elements, the add/remove ratio and the rate at which merges happen.

The OR Set it was compared with was not optimised at all, but thanks to its simplicity it is a rather good candidate; the gains on already optimised sets will likely be lower (a run with an optimised OR Set gave only a 54% reduction in space instead of the 74% with a normal OR Set).

The downside is that garbage collection takes time, and so does merging, so a structure like this is overall slower than a non garbage collected version.

Asynchronous Garbage Collection With CRDTs

So CRDTs are very very nice data structures, awesome for eventually consistent applications like Riak or the components of Project-FiFo. But they have one big drawback: most of them collect garbage, and over time that can sum up to a lot, making them pretty impractical in many cases. Collecting this garbage is a bit tricky, since it usually means synchronising the data, which, going back to the eventual consistency stuff, rules out either A or P.

I want to outline some thoughts here on how one could deal with this issue. As usual the idea isn’t without tradeoffs: it imposes certain constraints on the system’s behaviour and does not fit every use case, in exchange for allowing garbage to be disposed of without the need for synchronisation. Now then, let’s dive right in.

What’s that trash?

We start with understanding what the garbage is that piles up. To allow CRDTs to work the way they do, they need to store some kind of history or legend of how the current state (version/value) of the CRDT came into existence.

If we look at an OR Set for example, the history of this set is stored by recording all elements ever added along with all elements ever deleted. Elements are tagged to be unique too, so adding 5, removing 5, adding 5 again and removing that again leaves not a data structure with 0 elements but one with 4. That said, there are ways to optimise the OR Set, but let’s ignore this for the sake of the example. We can’t just store an empty list, since we need to make sure that another copy of the same set can recreate the steps even if it just missed one of the events.
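The add 5, remove 5, add 5, remove 5 example is easy to replay in a few lines of Python. This is a deliberately naive OR Set written just for this illustration (the class name `NaiveORSet` and the counter used for unique tags are made up), not an optimised one:

```python
import itertools

_tags = itertools.count(1)  # source of unique tags for added elements


class NaiveORSet:
    def __init__(self):
        self.adds = set()     # {(tag, element)} everything ever added
        self.removes = set()  # {(tag, element)} everything ever removed

    def add(self, element):
        self.adds.add((next(_tags), element))

    def remove(self, element):
        # Only instances we have observed and not yet removed are affected.
        observed = {(t, e) for (t, e) in self.adds - self.removes if e == element}
        self.removes |= observed

    def value(self):
        return {e for (_, e) in self.adds - self.removes}

    def stored_records(self):
        return len(self.adds) + len(self.removes)


s = NaiveORSet()
s.add(5)
s.remove(5)
s.add(5)
s.remove(5)
print(s.value(), s.stored_records())  # set() 4
```

The visible value is empty, yet four records are kept around, and that is exactly the garbage this article is about.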

Actually we could, if we synchronised all copies and said ‘hey, from now on you all agree that the new baseline (this is bold since it will come up a few more times) is an empty set’. Doing that, we would have garbage collected the OR Set: disposed of data that isn’t directly relevant to the current state any more.

If we don’t guarantee that all objects are garbage collected to the same state we face a real issue: the new baseline will cause quite some trouble, since the partially applied effects will just be applied again and possibly end up doubly applied. Or in short, partially applied GCing will cause the CRDT to stop functioning.

Things get old.

Looking at the data that gathers and how it is distributed there is one observation to be made: the older a change in state is, the more likely it is to be present in all replicas. It makes sense: with eventual consistency we say our data will ‘eventually’ be the same everywhere, and the chances of ‘eventually’ grow the older the change is, since it has had more chances to replicate (mechanisms similar to Riak’s AAE greatly help here).

state distribution

So generally there is a T100, a point from which on older data is shared between all instances and thus no longer relevant, if only we could garbage collect it. But we don’t want synchronous operations, nor do we want partial garbage collection (since that really would suck).

Back to state. We know which changes we want to garbage collect, so let’s say we record not only the state change but also a timestamp; a simple non-monotonic system timestamp will do, it’s cheap to get. Keep in mind T100 is well in the past, so as long as the precision of the timestamps is good enough to guarantee that an event at T0 cannot travel back behind T100, it’s OK if the order between T0 and T99 changes all the time. We don’t really care about that, so let’s store the state data in a way that helps us with this:

T0 [S0,S1, …, Sn] T100 [Sn+1, …, Sn+m]
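In Python terms the split could look roughly like this. It is only a sketch: the function name `split_at_t100` is invented, events are assumed to be `(timestamp, state_change)` pairs, and T100 is expressed as an age relative to ‘now’ in the same unit as the timestamps.

```python
def split_at_t100(events, now, t100):
    """Split events into (recent, settled) around the T100 cutoff.

    recent:  younger than T100, order may still change, not collectable.
    settled: older than T100, assumed to be present on every replica.
    Both lists are returned newest first, like in the notation above.
    """
    cutoff = now - t100
    newest_first = sorted(events, key=lambda e: e[0], reverse=True)
    recent = [e for e in newest_first if e[0] > cutoff]
    settled = [e for e in newest_first if e[0] <= cutoff]
    return recent, settled


if __name__ == "__main__":
    events = [(95, "a"), (10, "b"), (50, "c"), (99, "d")]
    print(split_at_t100(events, now=100, t100=40))
```

Only the settled half is a candidate for garbage collection; whatever happens in the recent half is nobody’s business yet.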

A trash bin

But since it would really suck (I know I’m repeating myself) if we partially GCed the data, we want to be sure that we agree, so we could go and ask all the replicas for their old data (older than T100). Yet this approach has problems: for one, T100 will shift in the time it takes us to check, and then this might be more data to move than we care for.

So let’s use a trash bin, or multiple ones, and sort our data into them so we have groups of old messages bunched together that can be agreed on no matter how time moves, and that come in smaller portions. Something like this:

… T100 [Sn+1, …, Sn+100] [Sn+101, …, Sn+200]…

So we just have to agree on some bucket to garbage collect. If there is another half full bucket by now because T100 has moved since the agreement, we don’t really care about that. Thanks to the fact that operations are commutative we can also garbage collect out of order, so it’s not a biggie if we take just one bucket and not the oldest one.
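A tiny sketch of the binning, again with made-up names (`fill_buckets`) and events as `(timestamp, state_change)` pairs. Buckets are filled from the oldest event upwards, so as long as nothing slips in behind T100 the full buckets stay the same while T100 moves on, and only the leftover at the young end changes.

```python
def fill_buckets(settled, bucket_size=100):
    """Group settled events into full buckets, oldest first.

    Returns (buckets, leftover); the leftover is the not yet full bucket
    that has to wait for more events to pass T100.
    """
    ordered = sorted(settled, key=lambda e: e[0])
    full_count = len(ordered) // bucket_size
    buckets = [ordered[i * bucket_size:(i + 1) * bucket_size]
               for i in range(full_count)]
    leftover = ordered[full_count * bucket_size:]
    return buckets, leftover


if __name__ == "__main__":
    events = [(ts, "s%d" % ts) for ts in range(1, 8)]
    buckets, leftover = fill_buckets(events, bucket_size=3)
    print(len(buckets), leftover)
```

Any of the full buckets can be picked for collection; the half full one is simply ignored until it fills up.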

We’re still left with transmitting (in this example) 100 elements to delete and haven’t solved the problem of partial garbage collection, but at least we’re a good step closer: we’ve put the garbage in bins now, which are much easier to handle than one huge pile.

A garbage compactor

To tackle the last two issues we do a little trick: instead of sending out the entire bucket we compress it, that is, create a hash of it and send this back and forth. So instead of:

[Sn+1, …, Sn+100]

We tag this bucket with a hash (over its content) and the timestamp of its first, that is newest, element. Since the bucket is older than T100 we do not need to worry about it changing and the hash having to be recreated, and we get something like this:

(hash, TSn+1)[Sn+1, …, Sn+100]

To agree on buckets to collect and to give the collect order we just need to send the hash, the timestamp and an identifier, which is pretty little data to send back and forth. This solves the problem of sending lots of data; curiously it also helps a lot with the partial garbage collection problem.
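Computing such a tag is cheap. The following sketch uses SHA-256 over a canonical text form of the bucket; both the hash function and the use of `repr` for serialisation are arbitrary choices made for this example, as is the function name `bucket_tag`.

```python
import hashlib


def bucket_tag(bucket):
    """Return (hash, newest_timestamp) for a bucket of (timestamp, element) pairs."""
    # Sort first so every replica hashes the same bytes regardless of the
    # order in which it happens to hold the elements.
    ordered = sorted(bucket, key=lambda e: (e[0], repr(e[1])))
    digest = hashlib.sha256(repr(ordered).encode("utf-8")).hexdigest()
    newest = max(ts for ts, _ in ordered)
    return digest, newest


if __name__ == "__main__":
    print(bucket_tag([(2, "b"), (1, "a")]))
```

Two replicas holding the same bucket content end up with the same tag, so comparing a few dozen bytes is enough to know they agree on the whole bucket.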

A schedule for garbage collection

With only the bucket’s tag identifying it we can solve the partial collection issue: we treat garbage collection as just another event, storing it and replaying it if it wasn’t present in an old replica. So we gradually progress the baseline of a replica towards the common baseline, somewhat like this:

gc graph

Ideally we store the GC operations in their own list, since we can then apply them more easily and guarantee that the GC events are synchronised and applied before other events.
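Replaying such a GC list on a replica that missed some collections could look like the sketch below. The replica layout (a dict with `buckets`, `state` and `applied`) and the function name `catch_up` are invented for the example; the point is only that GC events are ordinary events that get applied once and in order.

```python
def catch_up(replica, gc_log):
    """Replay the GC events from gc_log that the replica has not applied yet.

    replica["buckets"]: tag -> set of elements still held in that bucket
    replica["state"]:   set of all elements the replica currently stores
    replica["applied"]: list of bucket tags already garbage collected
    """
    for tag in gc_log:
        if tag in replica["applied"]:
            continue  # this baseline step was already taken
        garbage = replica["buckets"].pop(tag, set())
        replica["state"] -= garbage
        replica["applied"].append(tag)
    return replica


if __name__ == "__main__":
    old = {"buckets": {"h2": {3}}, "state": {3, 4}, "applied": ["h1"]}
    print(catch_up(old, ["h1", "h2"]))
```

A replica that was offline for a while just walks the list and ends up on the same baseline as everyone else before it handles any other event.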

That’s it, and it should be a somewhat working implementation of asynchronous garbage collection for CRDTs. But it’s not perfect, so let’s take a look at the downsides before we end this.

Let’s be honest, it still has a downside

This concept of GCing does not come for free. The data structure required isn’t entirely trivial, so it will add overhead. Even though the current implementation is pretty cheap when events are added in the right order, adding them in the wrong order will cause additional overhead because it might cause elements to shift around in the structure.

It requires events to be timestamped. Even though there is no requirement for absolute order, this adds a constraint to messages and events that wasn’t there before. Also this is additional work and space that is consumed.

We need to define a T100 for the system and guarantee it, and find a balance between choosing a T100 big enough to ensure its correctness and keeping it small enough not to keep a huge tail of non garbage collected events. That said, this can be mitigated slightly by using a dynamic T100, for example by recording when an object was last written to all primary nodes.

If T100 isn’t chosen correctly it might end up getting really messy! If an element slips past T100 that wasn’t there before, it could mean that garbage collection is broken for quite a while or, worse, that state gets inconsistent.

Bucket size is another matter. It needs to be chosen carefully: big enough to not spam the system but small enough to not take ages to fill, since an event passing T100 but not filling the bucket isn’t doing much good.

This is just a crazy idea. I haven’t tried this or implemented it, nor do I have a formal proof; it is based on common sense and my understanding of matters, so it might just explode ;)

Happy Birthday Project FiFo

Some might know it, some might not and some might not care, but for what it’s worth I’m the author of Project-FiFo (or most of it) and today is Project-FiFo’s first birthday (counting from the domain registration). I want to take this chance to look back at the past year and reflect, say thank you to all of you and take a look into the future.

When I started Project FiFo a year ago it was more of a tiny hobby project, and I could have sworn it would stand in line with all the other little open source projects no one would ever give a damn about. I really could not have been more wrong: what started as a few lines of ClojureScript has grown into a beast of a project with thousands of lines of code, an ever growing and incredible community (the project page gets between 2.5 and 3 thousand visitors a month by now, and there are constantly more than 20 people in the IRC channel) and a totally enthusiastic team!

Thank you

With a year gone it is about time to call out a few people and say ‘thanks’ because without their time, effort and work FiFo would not exist and in the day to day business of killing bugs, adding features and contemplating world domination it’s easy to forget this.

I want to start with Mark Slatem aka trentster, author of the SmartCore blog and FiFo’s number one. He was pretty much the first person to look into FiFo and has stuck around till now, going from first observer to tester, writer, helper and most of all a good friend.

Deirdré, the Joyent community manager for SmartOS. Solaris, and with it SmartOS, is an underdog, and I’m sure without the incredibly brilliant community it would have been doomed from the start. But it is not, in good part thanks to Deirdré’s effort to shape the community and make it part of the ecosystem instead of a second class citizen.

Killphil, author of jingles, FiFo’s web UI. It is amazing: he popped up one day out of the blue saying ‘hey I’ve played a bit with improving your UI’ and put down jingles, which became the official UI within a matter of days. Sadly I could not get rid of him again since then, so you all have to live with him adding crazy new features.

Joyent as a whole, for open sourcing SmartOS and making it better and better. Without SmartOS there sure would be no FiFo and I’d be stuck with Linux/KVM, which would not be too much fun.

Basho, who open sourced riak_core and riak_test, which make FiFo so much more incredible, and who provide me with free T-shirts (I’ve got four by now, but please don’t tell them or I might not get any more).

Every single person using FiFo, it’s amazing to see how well the project is received, hear the feedback. Thanks a lot for all the help with the little things, for putting up with the occasional bugs and bearing with the time it might take to fix them or add new features.

Looking back

I’ve been working on FiFo for a year now (a bit longer even, counting the first attempts), and without exaggerating this was the most amazing year of my life. It has been a blast: I’ve learned tons of things, both technically and socially, met some of the most amazing people I can think of and honestly have never been this happy before.

The whole thing started when I wanted to share a co-located server with some friends who are kind of consoleophobic, and I wasn’t happy with the approach of giving everyone root access. So a solution had to be found, but vanilla SmartOS provided none, and SDC made no sense with a single node and was too expensive for a hobby system. Everything else was simply not Solaris, period. Add to that that Deirdré showed me the community by randomly answering a question on Twitter with a hint to visit the channel, which was incredibly surprising after experiencing some of the Linux community… really, there was no chance in hell of ending up with anything but SmartOS.

But all in all there was no virtualisation solution that suited what I wanted, not even if I had taken cost out of the equation. And since I refuse to wave the white flag and surrender to something I don’t like, the only way was: build one! (also I’m crazy about challenges and seeing how far I can push things ;) And after

That sums up how the whole thing started: with a little Node.js/ClojureScript application that could be used with sdc-* commands over HTTP. But that only worked with a single host. Not that I had more to serve, but it looked kind of clumsy and unprofessional for a cloud operating system like SmartOS, so wiggle was born as a kind of broker in front of multiple vmwebadm instances (man, that name was horribly boring). And from there on it kept growing and growing.

Now, after a year of work and lots and lots of input from the community, the one badly named program has become 5 services, three of them distributed via riak_core, with an HTML/JS UI on top, and, most importantly, all of them have very cool names.

All the technical things aside, running an open source project where people get engaged is a fascinating experience. There is so much to learn that I would never have dreamed of, and so much to take away that helped me understand the problems and inner workings of teams, people and projects better. That alone was worth every second of time invested.

Dogs and funky names

Now before we go on I’ll share something that has been asked a few times now: why the crazy names and the obsession with dogs?

So the story starts with naming the first component, which back then was wiggle (after being very disappointed with my name choice for vmwebadm). Wiggle was the component that gave a unified interface to multiple hypervisors, and in the SmartOS channel I had heard that Joyent called its thing ‘headnode’. For FiFo the goal was never to clone something that already existed, and I wanted to make a point of it, so here is how the chain of thought went: head -> tail, node -> nod -> wiggle, tail&wiggle -> dog.

Now I had a promise to keep. Back when I was younger and my brother was even younger (since he is my little brother), he got a pet (as in not real) dog, and I had just learned what a FiFo (First in, First out) queue is. I found that Fifo is an amazing name for a dog, so I talked my brother into naming his pet dog Fifo, telling him that if I ever had a dog I’d name it Fifo too. That said, I’m a man of my word, even if it takes something like 15 years to make good on it.

All other names just followed the same naming scheme, except jingles, but then again I did not name that one myself. That said, Fifo, the dog, is still with us; he is sitting on my window board, drying after a shower earlier today to get cleaned up for his birthday!

As a nice side effect, things named as silly as FiFo’s components are easy for people to remember!

Looking ahead

To be honest, I feel that even after a year FiFo is still in its infancy. Don’t get me wrong, it’s quite stable and the features it provides build a very good foundation, but there is so much more it can and will be!

PXE booting, integrated in FiFo, allowing you to spin up new hypervisors with a click in the UI (or a call from the console). Adding IPMI to the mix makes it even more exciting! Think about automatically booting a new hypervisor when the capacity reaches a certain percentage, or shutting an empty one down when it’s below.

Support for clouds spread over WAN, location awareness of VMs and hypervisors with a notion of distance, and deployment rules that take this into account (please don’t deploy all my database cluster VMs on the same physical host, but don’t spread them over multiple datacenters either!).

Cold migration of VMs from one host to another, and putting them into cold storage / backing them up as a whole.

Well, I could go on rambling about crazy ideas for another thousand words or so, but let’s save this for another time. All in all I wanted to say it was an amazing year: amazing to see the community develop and to see how FiFo gets used, and I am hugely excited to see how things continue from here on! I can’t wait to see the first 10+ node FiFo setup, to hear what people make of it, to see a first adopted UI and people starting to build things around FiFo. There already is a ruby implementation of the API along with a chef knife thing.

So to close: Happy birthday FiFo, thanks to all of you for joining this journey, and let’s brace for another year of dog-named components!

FiFo + 80LOC of Bash = 5 Node Riak Cluster

The reason

The question ‘why would I want at least 5 nodes’ comes up very often in the #riak IRC channel, and there is a good explanation. But that’s boring; no one likes reading manuals. We, as engineers, like to try things out (aka. break stuff).

The only downside is that we need to set things up before we can break them, or even worse, need to un-break them later to try out different things (aka. break them in different ways). Admittedly setting up a riak instance is easy, but setting up 5, connecting them, breaking them and then doing it all again to break them again, erm… I mean try things out of course, can get really tedious, and I for one am too lazy to bother with that.

The goal

Make setting up our breakage, erm, test bed as simple as possible, and make whipping things up and tearing them down trivial. Ideally we have one simple command like ./ setup to do that for us and ./ delete to undo it all and get back to a clean state.
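A rough sketch of what such a two-command entry point could look like. The function names and the echoed messages are placeholders made up for illustration; the real work (creating and removing the zones) would go where the echo lines are.

```bash
#!/usr/bin/env bash
# Hypothetical skeleton: 'setup' and 'delete' are stand-ins for the real steps.

setup() {
    echo "creating riak1..riak5"
}

delete() {
    echo "deleting riak1..riak5"
}

main() {
    case "${1:-}" in
        setup)  setup ;;
        delete) delete ;;
        *)      echo "usage: $0 {setup|delete}" >&2; return 1 ;;
    esac
}

# Only dispatch when a sub-command was given on the command line.
if [ "$#" -gt 0 ]; then
    main "$@"
fi
```

Keeping both directions in one file means the names of the zones only have to be written down once.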

The tools

To build anything we’ll need some tools, hammer and nails will not do us much good here so we are going to pick:

  • Project FiFo - my favourite virtualisation tool (I am biased, I wrote it), but it’s very easy to set up and very powerful.
  • The FiFo Console Client - we want to script things, a UI isn’t helpful.
  • bash - the simplest possible scripting tool.
  • curl - since riak offers an HTTP frontend, it’s a wonderful way to check if the system is up.
  • jsontool - a nifty utility to traverse JSON documents.

With that we should be set and good to go.
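Since the script depends on all of these being installed, it can be worth failing early when one is missing. A minimal sketch, assuming the console client is installed as fifo and jsontool as json (the names used in the commands in this post); missing_tools is a helper name made up for this example.

```bash
# missing_tools TOOL...: print every tool that is not found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Possible use at the top of the script:
#   missing=$(missing_tools fifo curl json)
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```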

The steps

We’ll have to perform multiple steps to build our wrecking ground for riak, so let’s look at them one by one:

Preparing the environment

Before we can begin we have to set up a few things. I won’t go into detail on how to set up FiFo; there is a [good manual] for that, with only about 5 steps required. So let’s start with some of the script’s variables:

#!/usr/bin/env bash
PACKAGE=small
DATASET=base64-1.9.1
NET=7df94bc3-6a9f-4c88-8f80-7a8f4086b79d

  • small is the name of the package created in FiFo; I picked something with 512MB of memory since that should be enough for now.
  • base64-1.9.1 is the dataset. It means things are running in a Solaris zone; it can also be installed from the FiFo UI.
  • 7df94bc3-6a9f-4c88-8f80-7a8f4086b79d is the UUID of the network; you can find that with fifo networks list.

schroedinger:fifopy heinz [master] $ fifo packages list
                                UUID Name       RAM        CPU cap    Quota
------------------------------------ ---------- ---------- ---------- ----------
5f9f6c41-d700-4b4f-80f1-7350a71ed2e6 small      512 MB     100%       10 GB
schroedinger:fifopy heinz [master] $ fifo networks list
                                UUID Name       Tag                  First            Last
------------------------------------ ---------- ---------- --------------- ---------------
7df94bc3-6a9f-4c88-8f80-7a8f4086b79d test       admin
schroedinger:fifopy heinz [master] $ fifo datasets list
                                UUID Name       Version Type  Description
------------------------------------ ---------- ------- ----- ----------
60ed3a3e-92c7-11e2-ba4a-9b6d5feaa0c4 base       1.9.1   zone  A SmartOS ...

Creating a VM with riak installed

Creating a VM is rather simple: we need a little JSON, which we pipe to fifo with a cat. Please note the section reading user-script; this is where we do the setup. Here is how it looks:

cat <<EOF | fifo vms create -p $PACKAGE -d $DATASET
{
  "alias": "riak1",
  "networks": {"net0": "$NET"},
  "metadata": {"user-script": "/opt/local/bin/sed -i.bak \\"s/pkgsrc/pkgsrc-eu-ams/\\" /opt/local/etc/pkgin/repositories.conf; /opt/local/bin/pkgin update; /opt/local/bin/pkgin -y install riak; export IP=\`ifconfig net0 | head -n 2 | tail -n 1 | awk '{print \$2}'\`; /opt/local/bin/sed -i.bak \\"s/127.0.0.1/\$IP/\\" /opt/local/etc/riak/app.config; /opt/local/bin/sed -i.bak \\"s/127.0.0.1/\$IP/\\" /opt/local/etc/riak/vm.args; svcadm enable epmd riak"}
}
EOF

To get a better look at the user-script section, here it is with the escaping removed:

# We configure pkgin to use the European mirror; you might not need to do that.
/opt/local/bin/sed -i.bak "s/pkgsrc/pkgsrc-eu-ams/" /opt/local/etc/pkgin/repositories.conf;
# We update the pkgin database and install riak
/opt/local/bin/pkgin update;
/opt/local/bin/pkgin -y install riak;
# We find out what IP our VM has from within the VM.
export IP=`ifconfig net0 | head -n 2 | tail -n 1 | awk '{print $2}'`;
# We update the app.config and vm.args to use the 'public' IP instead of the
# loopback address 127.0.0.1.
/opt/local/bin/sed -i.bak "s/127.0.0.1/$IP/" /opt/local/etc/riak/app.config;
/opt/local/bin/sed -i.bak "s/127.0.0.1/$IP/" /opt/local/etc/riak/vm.args;
# Start epmd and riak
svcadm enable epmd riak
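The escaping in the JSON version is easier to follow with a tiny experiment. The sketch below uses a shortened, made-up user-script and a placeholder network value; it only shows which parts an unquoted here-document expands on the machine running the script and which parts it passes through for the zone to evaluate later.

```bash
NET="7df94bc3-demo"   # placeholder value, not a real network UUID

render() {
cat <<EOF
{"networks": {"net0": "$NET"}, "metadata": {"user-script": "sed \\"s/a/b/\\" f; export IP=\`hostname\`; echo \$IP"}}
EOF
}

render
# prints:
# {"networks": {"net0": "7df94bc3-demo"}, "metadata": {"user-script": "sed \"s/a/b/\" f; export IP=`hostname`; echo $IP"}}
```

So $NET is filled in right away, \\" turns into the \" that JSON needs for a quote inside a string, and the escaped backticks and \$IP survive untouched until the script runs inside the zone.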

Waiting for riak

Now that the first zone is set up, we’ll want to wait for riak to start up properly. This is needed since the commands are asynchronous and installing the packages can be a tad slow. But we can just curl the HTTP interface to check for this, so it’s rather simple:

# We'll ask fifo for the IP of our first zone.
IP1=`fifo vms get riak1 | json networks[0].ip`
# Print some info so waiting is not so boring
echo -n 'Waiting until riak is up and running on the primary node.'
# now we curl the http interface every second to see if things are good.
until curl http://${IP1}:8098 2>/dev/null >/dev/null
do
    sleep 1
    echo -n '.'
done
# and we're done!
echo " done."
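One thing the loop above does not handle is a zone that never comes up, in which case it waits forever. A possible variation with an upper bound, as a sketch: wait_until is a helper name invented here, and the timeout is only approximate because the time spent inside the checked command is not counted.

```bash
# wait_until TIMEOUT CMD...: retry CMD once a second until it succeeds,
# giving up after roughly TIMEOUT seconds of sleeping.
wait_until() {
    local timeout=$1
    shift
    local waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
        echo -n '.'
    done
    return 0
}

# Possible use in place of the plain until loop:
#   wait_until 300 curl -s --max-time 2 "http://${IP1}:8098" \
#       || { echo " riak did not come up in time." >&2; exit 1; }
```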

Setting up the remaining zones

We’re not going to go into too much detail here, since it works pretty much the same as for the first VM; the only difference is that the user-script holds a few more commands:

for i in 2 3 4 5
do
    cat <<EOF | fifo vms create -p $PACKAGE -d $DATASET
{
    "alias": "riak${i}",
    "networks": {"net0": "$NET"},
    "metadata": {"user-script": "/opt/local/bin/sed -i.bak \\"s/pkgsrc/pkgsrc-eu-ams/\\" /opt/local/etc/pkgin/repositories.conf; /opt/local/bin/pkgin update; /opt/local/bin/pkgin -y install riak; export IP=\`ifconfig net0 | head -n 2 | tail -n 1 | awk '{print \$2}'\`; /opt/local/bin/sed -i.bak \\"s/127.0.0.1/\$IP/\\" /opt/local/etc/riak/app.config; /opt/local/bin/sed -i.bak \\"s/127.0.0.1/\$IP/\\" /opt/local/etc/riak/vm.args; svcadm enable epmd riak; sleep 10; /opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster join riak@${IP1}; /opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster plan; /opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster commit"}
}
EOF
    IP=`fifo vms get riak$i | json networks[0].ip`
    echo -n "Waiting until riak is up and running on the node $i."
    until curl http://${IP}:8098 2>/dev/null >/dev/null
    do
        sleep 1
        echo -n '.'
    done
    echo " done."
done


The new commands join the node to the existing riak node, which is quite easy since we can reuse the $IP1 we got in the first step:

/opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster join riak@${IP1}
/opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster plan
/opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster commit
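Because the user-script has to be a single line inside the JSON, the join steps are easy to get wrong when editing them by hand. One way to keep them readable is to assemble that part in a small function; join_commands is a helper name made up for this sketch and 10.0.0.1 is a placeholder address, while the riak-admin calls are the ones from the script.

```bash
# join_commands PRIMARY_IP: print the extra user-script steps for nodes 2..5
# as one ';'-separated line, ready to be appended to the first node's script.
join_commands() {
    local primary_ip=$1
    local admin="/opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin"
    echo "sleep 10; $admin cluster join riak@${primary_ip}; $admin cluster plan; $admin cluster commit"
}

join_commands 10.0.0.1
```

The output can then be interpolated into the here-document with $(join_commands "$IP1") instead of repeating the three long command lines.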

That’s it: run that and you have a 5 node riak cluster, and it’s quick, at least if you’re in the US and have a good connection to the package repository.

Here is this all slapped together.