Postmortem of an Interesting Bug

Symptoms

After a full network outage in a larger system (7 FiFo instances, a few dozen hypervisors, and VMs in the three-digit range), a small percentage of the VMs stored in FiFo lost the information about which package was assigned to them and which organization they belonged to.

Technical background

As part of planned maintenance on the switching layer, the entire network was taken down, effectively cutting communication between any two systems in the network. During the maintenance the FiFo system was kept “hot”: no services were disabled or paused. At the end of the maintenance window the network was, as planned, enabled again, reinstating communication.

FiFo background

We will focus on the sniffle and chunter components since those are the relevant parts of FiFo that handle information related to the symptoms. Generally, all FiFo data is stored in CRDTs, which provide conflict resolution and nearly loss-free merging, even of entirely divergent data.

Sniffle

Sniffle is a distributed system that runs on top of riak_core; all (7) nodes are connected as a mesh network via distributed Erlang. While the nodes are connected, data and functionality are split up in Dynamo fashion: at the default N value of 3, every node handles (total data)*3/7 of the data.
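For illustration, this is roughly how a riak_core based application asks the ring which vnodes are responsible for a given key. The sketch uses riak_core's public API; the service name and the wrapper function are assumptions, not sniffle's actual call sites:

%% Map a key onto the ring and fetch the N vnodes responsible for it.
%% With N = 3 this returns three {Partition, Node} pairs.
preflist_for(Bucket, Key) ->
    DocIdx = riak_core_util:chash_key({Bucket, Key}),
    riak_core_apl:get_apl(DocIdx, 3, sniffle).  % 'sniffle' = assumed service name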

In the case of a node failure, an adjacent node takes over the function of the failed node. Missing data is resolved by a method called “read repair”: when data is requested and 1 out of the 3 nodes responds with no data, that “missing” data is repaired.

Once the node failure is repaired, the system that took over the work for the failed node performs what is called a handoff, sending its (supposedly) updated data to the returning node to bring it up to speed.

Chunter

Under normal conditions chunter sends incremental updates about changing VM state to sniffle. When a chunter system connects to sniffle for the first time, it registers its VMs, meaning they are created if they don't exist and are assigned to the hypervisor. The registration contains nearly all relevant VM information, with some exceptions that do not concern the hypervisor. The information that went missing was among those exceptions.

Sniffle nodes are discovered via mDNS broadcasts; the first broadcast received triggers the initial registration, sending the request to this node.

What happened?

It is a combination of conditions that led to the problem, which also explains why only a small percentage of VMs were affected.

During the outage

1) The network outage cut the communication between all sniffle nodes, so each node was on its own, handling the full load of requests.
2) The network outage cut the communication of nearly all chunter nodes; only those that resided on a hypervisor which also contained a FiFo zone did not lose connectivity.

After the outage

Now a race condition occurred. On one side, distributed Erlang started to re-discover the other nodes, reestablishing communication within sniffle and initiating handoffs (handoffs of nearly all partitions).

On the other side, the periodic mDNS broadcasts from the sniffle nodes triggered re-registration of the chunter nodes.

The bug

The bug was triggered by the fact that, when receiving a handoff, a vnode made the wrong assumption that the received data must be newer than its own, and overwrote the currently stored data instead of merging it.

Due to the usual repair during reads, this turned out to be quite an edge case that was only triggered when:

1) The mDNS broadcast that triggered the registration came from a node that neither holds the right partition nor is connected to a node holding the right partition.
2) The registration happened on three vnodes that did not hold that data, since as long as one vnode held the data the registration would have been merged with it instead of re-created.
3) The handoff on all three systems was performed before any read happened, since a read repair would otherwise have merged the data.
4) The handoff on all three systems was performed before AAE triggered, since AAE triggers a read repair on inconsistent data.

The fix

This was fixed rather simply: instead of overwriting existing data on a handoff, the handoff is now treated like a read repair, and old and incoming data are merged.
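A minimal sketch of the idea as a riak_core vnode callback; the record fields and the db_get/db_put/crdt_merge helpers are illustrative stand-ins, not sniffle's actual code:

%% Handoff data is treated like a read repair: merge with the local
%% copy instead of blindly overwriting it.
handle_handoff_data(BinObj, State = #state{db = DB}) ->
    {Key, RemoteObj} = binary_to_term(BinObj),
    Merged = case db_get(DB, Key) of
                 not_found   -> RemoteObj;                    % nothing stored locally
                 {ok, Local} -> crdt_merge(Local, RemoteObj)  % old code let RemoteObj win
             end,
    ok = db_put(DB, Key, Merged),
    {reply, ok, State}.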

Post-mortem of a Failed Support Case

Every now and then I check the link reports for Project FiFo to see what people think and write about it. Recently I stumbled upon an article that, oddly enough, made me both proud and sad. It actually was a rather negative one, which is a shame, but on the other hand a project isn't mature until people care enough to complain.

Yet even though it would be very easy to cast this aside as a strange kind of 'success', it still bothers me that someone is upset enough with FiFo to spend their time writing a longish blog article and writing their own management software. I take great pride in the fact that we do our best to provide outstanding documentation and good support for users of FiFo, and I dare say so does everyone else involved in the project.

So armed with the full history of events, plus the additional background from the article mentioned above, I want to attempt a post-mortem analysis of this support case: what went wrong and how it could have been avoided. I will obviously try to be as impartial as I can, but in the end I can only look at what I see. Hopefully there is a lesson to be learned from this so things go smoother next time.

As a disclaimer: the problem is still a mystery and will probably remain so forever. Also, all documentation linked is from the time the bug was filed (thanks to versioning) to give a fair picture of what was available back then.

Act 3 – the mailing list

The history of this case starts a bit into the whole story, but for now let's skip the first part and look at the first time we (as in the FiFo team) became aware of the problem: a mail to the mailing list and the conversation following it.

Sadly, the initial mail gives very little to no information on the problem aside from 'not firing on all cylinders / Suddenly FiFo stopped working all by itself'. I must admit that for me such a thing is a big red flag that the person on the other end does not care much or is not willing to invest energy into finding the problem. Looking at it from today, after reading the articles on Gordon's blog, it might as well just have been the kind of humor used; we will never know.

But fortunately I am not the only one on the mailing list, and others are slower to judge and have more patience. Mark jumped in and tried to help, starting by asking for the additional information that was missing from the original mail. After that information did not show any additional evidence that could lead to an easy conclusion, Mark asked him to take a look at FiFo's Problem Checklist and come to the IRC channel for further help.

Act 3 conclusion

Mark replied to the initial mail within less than an hour, which is a truly amazing response time that can put most commercial support to shame. His request to go through the checklist and join the channel for more direct help is a very good approach.

Since I can actually look into my own head, I can shed a bit more light on my take at that point: I decided to opt out of the request at that moment, since Mark seemed to have it covered and the tone of the mail had already made me sort it into the bucket of 'going to be annoying'. So I probably should not have judged so fast.

To receive the best possible support there are some things Gordon could have done: the initial mail could have contained more information, the tone could have been different (jokes often communicate badly in written text), and following Mark's request to go through the checklist would probably have helped too, if only to gather some additional facts for later.

Act 4 – IRC

Mark suggested switching to IRC for quicker communication; we all hate mail ping-pong, I guess. Fortunately we keep logs of the FiFo channel to make it easy to google for known problems that were discussed before; in this case it also allows us to look at the conversation between Mark (aka trentster) and Gordon (g-flemming).

This is pretty uneventful: Mark asks a few more questions to rule out common errors, none of which seem to be the cause. At this point Mark asks him to escalate the issue and file a ticket with the logs, and offers the suggestion to move to a newer version of FiFo (the development build at that time), along with the manual containing the steps necessary to do so. The reasoning being that the development build is quite well tested, with multiple people (including Mark and myself) running it.

Act 4 conclusions

Things are still going pretty well at this point; our escalation process seems to work wonderfully, and the documentation holds a lot of valuable information.

This is where the story ends for Mark, and in all honesty he did an astonishing job supporting here and escalating when he ran out of ways to help.

On the channel, Gordon mentions for the first time that he is running FiFo with customers on it. This indicates some urgency on the matter, and at least I missed that line. Sadly, he also does not seem to have gone through the troubleshooting steps or read the update documentation.

The lesson for us to learn here is to either ask people to include an IRC log in a new ticket or to do it ourselves after they create it; this might have added some additional information to the ticket that was not present when it was created.

Act 5 – The ticketing system

Now this is where my part in the story starts. Just as before, I'll try to give some additional insight into what I was thinking at the time to shed a bit of light on my end; I obviously can't do the same for Gordon.

As asked by Mark, Gordon created a JIRA ticket with the logs. When seeing the ticket I had already read the mailing list, but not the IRC backlog (I don't always do that, since I don't want to spend too much time reading old things that might not even affect me). From the ML I already had an unhappy feeling towards the issue, as it looked to me like Gordon was not willing to put much effort into resolving it.

Nonetheless, I take a look at the logs and try to piece together what exactly happened; to this date I do not know what it was. The suggestions were pretty close to what Mark suggested earlier: a problem with too little memory, or with the filesystem, as the error in the logs (einval) hinted at some issue with a POSIX filesystem call.

And at this point things go wrong. Lacking the information that there are production users on the system, I make the mistake of judging the issue as non-urgent, especially after the full reply from Gordon takes a day. At that point I pretty much stop caring about the issue, since I feel Gordon doesn't either (which turns out to be a grave misjudgment). I revisit the issue only a week later, at which point I can only assume Gordon had given up, and in a last attempt to provide a direction I suggest looking at FS errors, missing the question asked in Gordon's reply.

Gordon makes the mistake of not reading the documentation Mark gave him earlier, or the question he posted the next day would already have been answered, and probably also of waiting a day before replying to the question.

Act 5 conclusion

I need to stop drawing conclusions so quickly and read bug reports more carefully; had I done so, I would not have missed the migration question in the bug report. It would also be worth a try not to treat less urgent tickets with less care. It sucks to have tickets open; the goal should be to close (as in resolve) them as soon as possible, even if they seem non-urgent. This applies especially to bugs.

Gordon could have made his own life a lot easier by including more information in the ticket; noting that it was an urgent issue and including the history of what he had already done to debug would both have helped a lot. In addition, actually reading the documentation Mark provided would have answered the migration question beforehand.

Mark, while already out of the picture, could have included the chat history in the ticket when he saw that Gordon did not; this, admittedly, is asking a lot.

Act 0 – how everything started

I know this is the wrong order, but Star Wars got away with it too. Now, after the fact, I know more than I did before. Reading Gordon's blog posts shed some light on the history, and I think this is where things actually started to go wrong.

I am glad for every person choosing to try out FiFo; it means people put trust in what we've built, and that is a really cool thing. But please, if you want to use something in production, inform yourself ahead of time. Don't put yourself and your customers at risk by blindly running into things.

Talk to us, even before you start deploying! We know FiFo inside out; everyone on the FiFo team runs it themselves, either for fun, for profit, for testing, or for all three. The channel is helpful too; there are people outside the core team on the channel who will gladly share their stories.

To put this straight: you'll not only get the software for free, you'll even get some “consulting” tossed into the mix for nothing more than just asking! That is of course within a sensible limit and given we have time, but there is always half an hour to spare here or there.

Act 0 conclusions

Deploying on a single node for a production system was not the best move; FiFo clusters for a reason, and a distributed setup has many advantages over a single node. Some of the config settings could probably have been tweaked for a better user experience. It might even have made sense to run on dev instead of release, or at least to switch early.

Had we known the surrounding circumstances, helping might have been a lot easier.

Act 1 – be active in the community

There is a huge advantage to this. And I don't even mean the fact that the community grows and everyone benefits. Being present in the channel and occasionally talking to people helps you stay in the loop: you know what is going on, what problems other people face and what solutions they find.

It also gives you the chance to influence the course of the project; a lot of the features were thought up and discussed within the community. And last but not least, being active might actually end up helping others, which makes the community as a whole stronger.

Act 1 conclusion

I admit I am totally biased here. Everyone's time is limited. When I have to choose between helping a stranger I know nothing about or someone I know who has contributed in one way or another to the community, I will pick the community member every time.

I can only speak for myself here, but I know for a fact that if Gordon or one of his colleagues had been around in the channel and been a known face, I would have taken the problem more seriously. That said, I have no idea whether it is good or bad that I put community members before strangers.

Act 2 – read the fantastic manual

As Gordon points out, FiFo is not a trivial application. And he is entirely right, it is not, and there are many reasons for that. I am not going to argue here whether that is a good or a bad thing; I will simply state that I have thought long about every choice I made in FiFo and claim them to be sound.

But we are well aware that it is not a simple piece of software like an editor, which is why we put a lot of time and effort into providing a manual and an extensive set of information surrounding it.

We have a fully fledged manual; guides for installation, migration, and updates; best-practice articles for scaling, networking, and clustering; checklists for problems and known issues; a list of terminology; information about our versioning system; recorded trainings (admittedly not many) and videos on usage.

For people interested in developing, we have API documentation, starter guides, and a documented build process. We have a list of internal libraries, and specifications of data structures. A guide to plugins, the messaging system, and how to write plugins.

And last but not least, we even have a page in which we explain how to best submit a bug.

Act 2 conclusion

Reading those documents would probably have saved him a lot of time and pain, and the FiFo team some work. This is a recurring problem that we sadly see way too often; if anyone has suggestions on how to encourage users to read manuals, please share the holy grail.

The documents would have given good advice on how to set up a redundant FiFo, how to check for problems, and, perhaps most importantly, how to properly report a bug.

The Rant

Given that I spent the last two years of my life working on Project FiFo, I feel I am entitled to this. Bad bug reports are a pet peeve of mine! And this was a prime example of one.

All the right signs were there: an entirely nonsensical title, the catch phrases 'it stopped working' and 'nothing had changed'. I can't say if this is the one in a million where that was actually true, but in all my time in IT I have never seen those words be correct, nor have I ever heard from anyone else who saw the mythical situation of something just stopping to work.

There was no usable information in there, no sign of interest in actually helping the process of finding the root cause. Just because FiFo is free and there is no charge for using it does not mean my time is worth any less than yours. I'll gladly help you with a problem, but I'll expect some engagement in return. I do not like having my time wasted by having to pull every bit of information out of someone's nose.

Bottom line: if you don't care enough about your problem to put some effort into getting help, I will not care enough to help.

Backups With Project FiFo

With 0.4.3, FiFo introduces support for LeoFS, which allows for some quite nice new features. Most importantly, it decouples FiFo's operations from storing large amounts of data, which makes maintaining either much more sensible and scaling storage much easier.

Then again, while nice, that is not the important part; just storing datasets somewhere else does not make much of a difference for most users. What LeoFS allows is for FiFo to store much more data than would have been good in the old setup. 'A lot more' here means pretty much as much as you can store.

So with these options, FiFo 0.4.3 introduces backups! Backups complement the snapshots that have been in the system for quite a while, but while snapshots are made to stay on the hypervisor, backups are meant to be shipped off to LeoFS. This not only helps to keep the number of snapshots limited and avoids counting against the local quota, it also widens the failure domain.

And to make it better, there is a sensible concept of incremental and full backups. That said, there are a few limitations to be aware of:

  • Backups that stay on the hypervisor will count against the local quota.
  • Once a backup is moved away from the hypervisor it can’t be restored without overwriting the current state.
  • Restoring a backup might mean first deleting the local ZFS volume for the VM.

But that aside, there are some very interesting things:

  • While it’s not possible to keep multiple branches with snapshots, this is very well possible with backups.
  • It’s possible to choose between incremental and full backups, trading recovery speed against space efficiency.

All that sums up to something quite awesome: it allows for proper grandfather-father-son backup schemes for VMs, and they can even be scripted using the fifo Python client. So here is an example of how this could be done.

Let’s quickly describe what we want to achieve:

  • Every month we want a full backup.
  • Every week we want an incremental backup
    • for the first week in a month towards the monthly full backup
    • for other weeks to the previous week
  • Every day we want an incremental backup
    • for the first day of a week from the week backup
    • for other days from the previous day

To allow for the incremental backups we need to keep some of the backups around:

  • monthly until the first weekly is done.
  • weekly until the next weekly is done, but not longer than the next monthly.
  • daily until the next daily, but not longer than the next weekly or monthly.

The FiFo backup code fortunately helps a lot with this; an important part of the logic is 'create an incremental backup and delete its parent from the hypervisor', and that is exactly the behavior of the backup code when passing both a parent and requesting a delete.

Here is some code (with some comments added):

#!/usr/bin/env bash
fifo=fifo
vm="$2"
case $1 in
    monthly)
        $fifo vms backups $vm create monthly
        # After creating a new monthly backup we first delete the last weekly and daily backup
        last_daily=$($fifo vms backups $vm list -pH --fmt uuid,local,comment | grep 'daily' | grep 'YES' | tail -1)
        if [ ! -z "$last_daily" ]
        then
            daily_uuid=$(echo $last_daily | cut -d: -f1)
            # the -l flag tells FiFo to only remove the backup from the hypervisor.
            $fifo vms backups $vm delete -l $daily_uuid
        fi
        last_weekly=$($fifo vms backups $vm list -pH --fmt uuid,local,comment | grep 'weekly' | grep 'YES' | tail -1)
        if [ ! -z "$last_weekly" ]
        then
            weekly_uuid=$(echo $last_weekly | cut -d: -f1)
            $fifo vms backups $vm delete -l $weekly_uuid
        fi
        ;;
    weekly)
        last_backup=$($fifo vms backups $vm list -pH --fmt uuid,local,comment | grep 'monthly\|weekly' | grep 'YES' | tail -1)
        uuid=$(echo $last_backup | cut -d: -f1)
        $fifo vms backups $vm create --parent $uuid -d weekly
        # After creating a new weekly we need to make sure to delete the last daily one
        last_daily=$($fifo vms backups $vm list -pH --fmt uuid,local,comment | grep 'daily' | grep 'YES' | tail -1)
        if [ ! -z "$last_daily" ]
        then
            daily_uuid=$(echo $last_daily | cut -d: -f1)
            $fifo vms backups $vm delete -l $daily_uuid
        fi
        ;;
    daily)
        last_backup=$($fifo vms backups $vm list -pH --fmt uuid,local,comment | grep 'daily\|weekly' | grep 'YES' | tail -1)
        uuid=$(echo $last_backup | cut -d: -f1)
        type=$(echo $last_backup | cut -d: -f2)
        case $type in
            weekly)
                $fifo vms backups $vm create --parent $uuid daily
                ;;
            daily)
                $fifo vms backups $vm create --parent $uuid -d daily
                ;;
        esac
        ;;
esac

An Asynchronously GCed OR Set

Following the article about asynchronous garbage collection with CRDTs, I experimented with implementing the concept. The OR Set is a very nice data structure for this, since it's rather simple, and so is its garbage!

To garbage collect the OR Set we do the following: we take some of the elements of the remove set and delete them from both the add and the remove set; this way we save the space for them and generate a new baseline.

The first step was to implement the data structure described there to hold the collectable items. I call it a ROT (Roughly Ordered Tree); it's a nice name for garbage-related stuff ;) and it is tree-ish and mostly ordered.

The interface of the ROT is rather simple. Elements must be time-tagged, in the form {Time, Element}, where Time does not need to be a clock as long as the Erlang comparison operations give an order over it. The ROT then allows asking for full buckets, and removing buckets based on their hash value and newest element timestamp.
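To make that concrete, here is a sketch of how the interface could be driven; rot:add/2 and rot:remove/2 appear in the code below, while rot:new/1 and rot:full/1 are assumed names for the constructor and the full-bucket query:

%% Elements are {Time, Element} pairs; buckets fill up to the
%% configured size and can then be dropped wholesale by their tag.
rot_example() ->
    R0 = rot:new(10),            % bucket size of 10
    R1 = rot:add({1, foo}, R0),
    R2 = rot:add({2, bar}, R1),
    case rot:full(R2) of
        []                  -> R2;  % no full bucket yet
        [{HashID, _TS} | _] ->
            %% removing a bucket hands back its contents
            {_Elements, R3} = rot:remove(HashID, R2),
            R3
    end.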

While the elements in the OR Set are already tagged with a timestamp, this timestamp records the addition, not the deletion, so it would be misleading to use it: the ROT would think the remove happened when actually the addition happened, and this would violate the rule that no event can travel back behind T100. As a result we have to double-timestamp the removes, that is, add a second timestamp recording when the element was removed.

Since the ROT has a very similar interface to a G Set (which implemented the remove set before), the change is trivial. Remove, GC and the merge function are the more interesting parts.

remove

1  remove(Id, Element, ORSet = #vorsetg{removes = Removes}) ->
2      CurrentExisting = [Elem || Elem = {_, E1} <- raw_value(ORSet),
3                                 E1 =:= Element],
4      Removes1 = lists:foldl(fun(R, Rs) ->
5                                     rot:add({Id, R}, Rs)
6                             end, Removes, CurrentExisting),
7      ORSet#vorsetg{removes = Removes1}.

Id defaults to the current time in nanoseconds, since that is precise enough for most cases, but it can be given any value that provides a timed order. Lines 2 and 3 collect all observed and not yet removed instances of the element to delete; we then fold over those instances and add each of them to the ROT.

GC

 1  gc(HashID,
 2     #vorsetg{
 3        adds = Adds,
 4        removes = Removes,
 5        gced = GCed}) ->
 6      {Values, Removes1} = rot:remove(HashID, Removes),
 7      Values1 = [V || {_, V} <- Values],
 8      Values2 = ordsets:from_list(Values1),
 9      #vorsetg{adds = ordsets:subtract(Adds, Values2),
10               gced = ordsets:add_element(HashID, GCed),
11               removes = Removes1}.

To GC the set we take the HashID (this is what the ROT returns when it reports full buckets) and in line 6 remove it from the ROT. Thankfully, the ROT returns the content of the deleted bucket; this comes in very handy, since in the process of garbage collecting the bucket we also need to remove the items once and for all from the add list, as seen in line 9. We then record the GC action in line 10 to make sure it will be applied during a merge.

Please note that currently this set, even though it is garbage collected, still grows without bounds, since the GC actions themselves are not (yet) garbage collected; this will be added in a later iteration.

merge

 1  merge(ROTA = #vorsetg{gced = GCedA},
 2        ROTB = #vorsetg{gced = GCedB}) ->
 3      #vorsetg{
 4         adds = AddsA,
 5         gced = GCed,
 6         removes = RemovesA}
 7          = lists:foldl(fun gc/2, ROTA, GCedB),
 8      #vorsetg{
 9         adds = AddsB,
10         removes = RemovesB}
11          = lists:foldl(fun gc/2, ROTB, GCedA),
12      ROT1 = rot:merge(RemovesA, RemovesB),
13      #vorsetg{adds = ordsets:union(AddsA, AddsB),
14               gced = GCed,
15               removes = ROT1}.

Merging gets a bit more complicated due to the fact that we now have to take into account that values might be garbage collected in one set but not in the other. While merging them would do no harm, it would recreate the garbage, which isn't too nice. So what we do is apply the recorded GC actions to both sets first, as seen in lines 3 to 11, then merge the remove values (line 12) and finally the add values (line 13).

Results

I set up some proper tests for the implementation, comparing the GCed OR Set (bucket size 10) with a normal OR Set, running 1000 iterations with a set of 1000 instructions composed of 70% adds and removes, 20% merges and 10% GC events. T100 is sliding, allowing collection of events older than the last merge.

Each stored element had a size between 500 and 600 bytes (so there were 100 possible elements). A remove always removes the stalest element; since elements are added in random order, this equals a random remove.

The operations are carried out on replica copies of the set, where add and remove have an equal chance of happening just on copy A, just on copy B, or on both replicas at the same time. GC operations are always carried out on both replicas, but it should be noted that the GC operation does not include a merge operation, so it can be considered asynchronous.

All operations except the GC operation are executed exactly the same on the GCed OR Set and the non-GCed OR Set, in the same order and with the same spread.

At the end, a final merge was performed and the resulting values compared for each iteration; no additional GC action takes place at the end.

Measured were both the space reduction per GC run and the final difference in size. Per GC run about 15% of space was reclaimed, and at the end the GCed set had a total space consumption of around 26% of the normal OR Set on average, 6% in the best and 143% in the worst case.

src/vorsetg.erl:389:<0.135.0>: [Size] Cnt: 1000,   Avg: 0.261,  Min: 0.062, Max: 1.507
src/vorsetg.erl:389:<0.135.0>: [ RS ] Cnt: 49221,  Avg: 0.866,  Min: 0.064, Max: 1.0
src/vorsetg.erl:389:<0.135.0>: [ GC ] Cnt: 49221,  Avg: 55.870, Min: 0,     Max: 6483
src/vorsetg.erl:389:<0.135.0>: [ MG ] Cnt: 98357,  Avg: 58.110, Min: 0,     Max: 6596
src/vorsetg.erl:389:<0.135.0>: [ OP ] Cnt: 344708, Avg: 38.539, Min: 0,     Max: 6916

The numbers are from a test run, truncated manually after 3 digits and aligned for readability. Size is the total size at the end of the iteration, RS is the space reduction per GC run. GC, MG and OP are the times used for garbage collection, merging and other operations respectively; the numbers are per execution and measured in microseconds. The time measurements also include noise from additional operations required by the test and should not be seen as a useful benchmark!

Conclusion

The GC method described seems to work, and not even too badly. In the course of experimenting with values it showed that the conserved space is heavily dependent on the environment: the bucket size chosen, the size of the elements, the add/remove ratio and the rate at which merges happen.

The OR Set it was compared with was not optimised at all, but thanks to its simplicity it was a rather good candidate; the gains on already optimised sets will likely be lower. (A run with an optimised OR Set gave only a 54% reduction in space instead of the 74% with a normal OR Set.)

The downside is that garbage collection takes time, and so does merging, so a structure like this is overall slower than a non-garbage-collected version.

Asynchronous Garbage Collection With CRDTs

CRDTs are very, very nice data structures, awesome for eventually consistent applications like riak or the components of Project-FiFo. But they have one big drawback: most of them collect garbage, and over time that can add up to a lot, making them pretty impractical in many cases. Collecting this garbage is a bit tricky, since usually it means synchronising the data, which, going back to the eventually consistent stuff, costs either A or P.

I want to outline some thoughts here on how one could deal with this issue. As usual, the idea isn't without tradeoffs; it imposes certain constraints on the system's behaviour and does not fit every use case, in exchange for allowing garbage to be disposed of without the need for synchronisation. Now then, let's dive right in.

What’s that trash?

We start with understanding what the garbage is that adds up. To allow CRDTs to work the way they do, they need to store some kind of history or legend of how the current state (version/value) of the CRDT came into existence.

If we look at an OR Set for example, the history of this set is stored by recording all elements ever added along with all elements ever deleted. Elements are tagged to be unique too, so adding 5, removing 5, then adding 5 again and removing that again leaves not a data structure with 0 elements but one with 4 (see the sketch at the end of this section). That said, there are ways to optimise the OR Set, but let's ignore them for the sake of the example. We can't just store an empty list, since we need to make sure that another copy of the same set can recreate the steps even if it just missed one of the events.

Actually, we could, if we synchronised all copies and said: hey, from now on you all agree that the new baseline (remember this term, it will come up a few more times) is an empty set. And by doing that we would have garbage collected the OR Set, disposed of data that isn't directly relevant to the current state any more.

If we don't guarantee that all copies are garbage collected to the same state, we face a real issue: the new baseline will cause quite some trouble, since partially applied effects will just be applied again, possibly causing them to be applied twice. Or in short: partially applied GCing will cause the CRDT to stop functioning.
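As a toy illustration of the bookkeeping from the OR Set example above (a deliberately naive OR Set, tagging elements with unique references; real implementations tag and store differently):

%% add/remove only ever grow the history: after add(5), remove(5),
%% add(5), remove(5) the visible value is [] but four tagged entries
%% remain (two adds, two removes).
new() -> {[], []}.

add(E, {Adds, Removes}) ->
    {[{make_ref(), E} | Adds], Removes}.

remove(E, {Adds, Removes}) ->
    Observed = [T || T = {_, E1} <- Adds,
                     E1 =:= E,
                     not lists:member(T, Removes)],
    {Adds, Observed ++ Removes}.

value({Adds, Removes}) ->
    lists:usort([E || T = {_, E} <- Adds,
                      not lists:member(T, Removes)]).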

Things get old.

Looking at the data that gathers and how it is distributed, there is one observation to be made: the older a change in state is, the more likely it is to be present in all replicas. It makes sense: with eventual consistency we say 'eventually' our data will be the same everywhere, and the chance of 'eventually' grows the older the change is, since it gets more chances to replicate. (Mechanisms similar to riak's AAE greatly help here.)

(Figure: state distribution across replicas over time)

So generally there is a T100, a point from which on all older data is shared between all instances and thereby no longer relevant, if we could just garbage collect it. But we don't want synchronous operations, nor do we want partial garbage collection (since that really would suck).

Back to the state: we know which entries we want to garbage collect. Let's say we record not only the state change but also a timestamp, a simple non-monotonic system timestamp; it's cheap to get. Keep in mind that T100 is well in the past, so if the precision of the timestamps is good enough to guarantee that an event at T0 cannot travel back behind T100, it's OK if the order between T0 and T99 changes all the time; we don't really care about that. So let's store the state data in a way that helps us with this:

T0 [S0,S1, …, Sn] T100 [Sn+1, …, Sn+m]
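A sketch of that split, assuming the state changes are stored as {Timestamp, Event} pairs and T100 is known:

%% Partition the history into a collectable part (older than T100)
%% and a hot part whose order may still change between replicas.
split_by_t100(T100, Events) ->
    lists:partition(fun({TS, _Event}) -> TS < T100 end, Events).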

A trash bin

But since it would really suck (I know, I'm repeating myself) if we partially GCed the data, we want to be sure that we agree, so we could go and ask all the replicas for their old data (older than T100). Yet this approach has a problem: for one, T100 will shift while we check, and this might also be more data to move than we care for.

So let's use a trash bin, or multiple ones, and order our data into them so we have groups of old messages bunched together which can be agreed on, no matter how time moves, and which are smaller portions. Something like this:

… T100 [Sn+1, …, Sn+100] [Sn+101, …, Sn+200]…

So we just have to agree on some bucket to garbage collect; if there is another half-full bucket by now, because T100 has moved since the agreement, we don't really care. Thanks to the fact that operations are commutative, we can also garbage collect out of order, so it's not a biggie if we take just one bucket and not the oldest one.

We're still left with transmitting (in this example) 100 elements to delete, and we haven't solved the problem of partial garbage collection, but at least we're a good step closer: we've put the garbage in bins that are much easier to handle than one huge pile.

A garbage compactor

To tackle the last two issues we do a little trick: instead of sending out the entire bucket, we compress it, create a hash of it, and send that back and forth. So instead of:

[Sn+1, …, Sn+100]

We tag this bucket with a hash (over its content) and the timestamp of its newest element. Since it's older than T100 we do not need to worry about it changing and having to recreate the hash, and we get something like this:

(hash, TSn+1)[Sn+1, …, Sn+100]
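A minimal sketch of such a tag, again assuming {Time, Element} events; the concrete hash function here is an arbitrary choice, it only matters that all replicas compute it the same way:

%% Sort before hashing so replicas holding the same events in a
%% different order still agree on the tag.
tag_bucket(Events) ->
    Hash = crypto:hash(sha, term_to_binary(lists:sort(Events))),
    {NewestTS, _} = lists:max(Events),  % timestamp of the newest element
    {{Hash, NewestTS}, Events}.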

To agree on buckets to collect, and to give the collect order, we just need to send the hash, the timestamp and an identifier; that is pretty little data to send back and forth. This solves the 'sending much data' problem; curiously, it also helps a lot with the partial garbage collection issue.

A schedule for garbage collection

With the bucket's tag identifying it, we can solve the partial collection issue: we just treat garbage collection as just another event, storing it and replaying it if it wasn't present on an old replica. So we gradually progress the baseline of a replica towards the common baseline, somewhat like this:

(Figure: a replica's baseline gradually progressing towards the common baseline)

Ideally we store the GC operations in their own list, since we can then apply them more easily and guarantee that the GC events are synchronised and applied before other events.
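As a sketch, with apply_gc/2 standing in for whatever drops a tagged bucket locally (for example the gc/2 function from the OR Set implementation above):

%% Before merging, replay any GC events the local replica has not
%% seen yet, so both sides share the same baseline.
catch_up(RemoteGCLog, LocalGCLog, LocalSet) ->
    Missing = [Tag || Tag <- RemoteGCLog,
                      not lists:member(Tag, LocalGCLog)],
    lists:foldl(fun apply_gc/2, LocalSet, Missing).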

That's it; this should be a somewhat working implementation of asynchronous garbage collection for CRDTs. But it's not perfect, so let's take a look at the downsides before we end this.

Lets be honest, it still has a downside

This concept of GCing does not come for free. The data structure required isn't entirely trivial, so it will add overhead. Even so, the current implementation is pretty cheap when adding the events in the right order; the wrong order will cause additional overhead, because it might cause elements to shift around in the structure.

It requires events to be timestamped. Even though there is no requirement for absolute order, this adds a constraint to messages and events that wasn't there before. It is also additional work, and additional space is consumed.

We need to define a T100 for the system and guarantee it, and find a balance: a T100 big enough to ensure correctness while small enough not to keep a huge tail of non-garbage-collected events. That said, this can be mitigated slightly by using a dynamic T100, for example recording when an object was last written to all primary nodes.

If T100 isn't chosen correctly it might end up getting really messy! If an element slips past T100 that wasn't accounted for, it could mean that the garbage collection is broken for quite a while or, worse, state gets inconsistent.

Bucket size is another matter; it needs to be chosen carefully: big enough not to spam the system, but small enough not to take ages to fill. An event passing T100 but not filling a bucket isn't doing much good.

This is just a crazy idea. I haven't tried it, implemented it, or written a formal proof; it is based on common sense and my understanding of matters, so it might just explode ;)

Happy Birthday Project FiFo

Some might know it, some might not, and some might not care, but for what it's worth I'm the author of Project-FiFo (or most of it), and today is Project-FiFo's first birthday (counting from the domain registration). I want to take this chance to look back at the past year and reflect, say thank you to all of you, and take a look into the future.

When I started Project FiFo a year ago it was more of a tiny hobby project, and I could have sworn it would stand in a row with all the other little open source projects no one would ever give a damn about. I really could not have been more wrong: what started as a few lines of clojurescript has grown into a beast of a project with thousands of lines of code, an ever-growing and incredible community (the project page gets between 2,500 and 3,000 visitors a month by now, and there are constantly more than 20 people in the IRC channel) and a totally enthusiastic team!

Thank you

With a year gone, it is about time to call out a few people and say 'thanks', because without their time, effort and work FiFo would not exist, and in the day-to-day business of killing bugs, adding features and contemplating world domination it's easy to forget this.

I want to start with Mark Slatem aka trentster, author of the SmartCore blog and FiFo's number one. He was pretty much the first person looking into FiFo and has stuck around till now, going from first observer to tester, writer, helper and, most of all, a good friend.

Deirdré, the Joyent community manager for SmartOS. Solaris, and with it SmartOS, is an underdog, and I'm sure without the incredibly brilliant community it would have been doomed from the start. But it is not, in good part thanks to Deirdré's effort to shape the community and make it part of the ecosystem instead of a second-class citizen.

Killphil, author of jingles, FiFo's web UI. It is amazing: he popped up one day, out of the blue, saying 'hey, I've played a bit with improving your UI' and put down jingles, which became the official UI within a matter of days. Sadly I could not get rid of him again since then, so you all have to live with him adding crazy new features.

Joyent as a whole, for open sourcing SmartOS and making it better and better. Without SmartOS there sure would be no FiFo, and I'd be stuck with Linux/KVM, which would not be too much fun.

Basho, who open sourced riak_core and riak_test, which make FiFo so much more incredible and provide me with free T-shirts (I got four by now, but please don't tell them or I might not get any more).

Every single person using FiFo: it's amazing to see how well the project is received and to hear the feedback. Thanks a lot for all the help with the little things, for putting up with the occasional bugs and for bearing with the time it might take to fix them or add new features.

Looking back

I've been working on FiFo for a year now (counting first attempts, even a bit longer), and without exaggerating this was the most amazing year of my life. It has been a blast: I've learned tons of things, both technical and social, met some of the most amazing people I can think of, and honestly have never been this happy before.

The whole thing started when I wanted to share a co-located server with some friends who are somewhat console-phobic, and I wasn't happy with the approach of giving everyone root access. So a solution had to be found, but vanilla SmartOS provided none; SDC made no sense with a single node and was too expensive for a hobby system. Everything else was simply not Solaris, period. Add to that that Deirdré showed me the community, randomly answering a question on Twitter with a hint to visit the channel (which was incredibly surprising after experiencing some of the Linux community)… really, there was no chance in hell of ending up with anything but SmartOS.

But all in all there was no virtualisation solution that suited what I wanted, not even if I had taken cost out of the equation. And since I refuse to wave the white flag and surrender to something I don't like, the only way was: build one! (Also, I'm crazy about challenges and seeing how far I can push things ;)

That sums up how the whole thing started: with a little nodejs/clojurescript application that could be used with sdc-* commands over HTTP. But that only worked with a single host (not that I had more to serve, but it looked kind of clumsy and unprofessional for a cloud operating system like SmartOS), so wiggle was born as a kind of broker in front of multiple vmwebadm instances (man, that name was horribly boring). And from there on it kept growing and growing.

Now, over the last year of work and lots and lots of input from the community, the one badly named program has become 5 services, three of them distributed via riak_core, and an HTML/JS UI on top of that, all of which, most importantly, have very cool names.

All the technical things aside, running an open source project where people get engaged is a fascinating experience. There is so much to learn that I would never have dreamed of, so much to take away that helped me better understand the problems and inner workings of teams, people and projects. That alone was worth every second of time invested.

Dogs and funky names

Now, before we go on, let me share something that has been asked a few times now: why the crazy names and the obsession with dogs?

The story starts with naming the first component, which back then was wiggle (after being very disappointed with my name choice for vmwebadm). Wiggle was the component that gave a unified interface to multiple hypervisors, and in the SmartOS channel I had heard that Joyent called its thing a 'headnode'. But for FiFo the goal was never to clone something that already existed, and I wanted to make a point of it, so here is how the thought chain went: head –> tail, node –> nod –> wiggle, tail & wiggle –> dog.

Now I had a promise to keep. Back when I was younger and my brother was even younger (since he is my little brother), he got a pet (as in not real) dog, and I had just learned what a FiFo (First In First Out) queue is. I found that Fifo is an amazing name for a dog, so I talked my brother into naming his pet dog Fifo, telling him that if I ever had a dog I'd name it Fifo too. That said, I'm a man of my word, even if it takes something like 15 years to make good on it.

All other names just followed the same naming scheme, except jingles, but then again I did not name that myself. That said, Fifo the dog is still with us; he is sitting on my window board, drying off from a shower he took earlier today to get cleaned up for his birthday!

As a nice side effect, names as silly as those of FiFo's components are great for people to remember!

Looking ahead

To be honest, I feel that even after a year FiFo is still in its infancy. Don't get me wrong, it's quite stable and the features it provides build a very good foundation, but there is so much more it can and will be!

PXE booting, integrated into FiFo, allowing new hypervisors to be spun up with a click in the UI (or a call from the console); adding IPMI to the mix makes it even more exciting! Think about automatically booting a new hypervisor when capacity reaches a certain percentage, or shutting an empty one down when it drops below.

Support for clouds spread over WAN: location awareness of VMs and hypervisors, with a notion of distance and deployment rules that take this into account (please don't deploy all my database cluster VMs on the same physical host, but don't spread them over multiple datacenters either!).

Cold migration of VMs from one host to another, and putting them into cold storage / backing them up as a whole.

Well, I could go on rambling about crazy ideas for another thousand or so words, but let's save that for another time. All in all I wanted to say it was an amazing year, amazing to see the community develop and how FiFo gets used, and I am hugely excited to see how things continue from here! I can't wait to see the first 10+ node FiFo setup and hear what people make of it; to see a first adapted UI, and people starting to build things around FiFo (there already is a Ruby implementation of the API, along with a chef knife plugin).

So, to close: happy birthday FiFo, thanks to all of you for joining this journey, and let's brace for another year of dog-named components!

FiFo + 80LOC of Bash = 5 Node Riak Cluster

The reason

The question 'why would I want at least 5 nodes' comes up very often in the #riak IRC channel, and there is a good explanation for it. But that's boring, no one likes reading manuals; we, as engineers, like to try things out (aka break stuff).

The only downside with that is that we need to set things up before we can break them, or even worse, need to un-break them later to try out different things (aka break them in different ways). Admittedly, setting up a riak instance is easy, but setting up 5 and connecting them, then breaking them and doing it all again to break them again, erm, I mean to try things out of course, can get really tedious, and I for one am too lazy to bother with that.

The goal

Make setting up our breakage (erm, test) bed as simple as possible, and make whipping things up and tearing them down trivial; ideally with one simple command like ./riak.sh setup to do it for us and ./riak.sh delete to undo it all and get back to a clean state.

The tools

To build anything we'll need some tools; hammer and nails will not do us much good here, so we are going to pick:

  • Project FiFo – my favourite virtualisation tool (I am biased, I wrote it), but it's very easy to set up and very powerful.
  • The FiFo Console Client – we want to script things, a UI isn’t helpful.
  • bash – the simplest possible scripting tool.
  • curl – since riak offers an HTTP frontend, it's a wonderful way to check if the system is up.
  • jsontool – a nifty utility to traverse JSON documents.

With that we should be set and good to go.

The steps

We'll have to perform multiple steps to build our wrecking ground for riak; let's look at them one by one:

Preparing the environment

Before we can begin we have to set up a few things. I'll not go into detail on how to set up FiFo; there is a good manual for that with only about 5 steps required. So let's start with some of the script's variables:

#!/usr/bin/env bash
PACKAGE="small"
DATASET="base64-1.9.1"
NET="7df94bc3-6a9f-4c88-8f80-7a8f4086b79d"
  • small is the name of the package created in FiFo; I picked something with 512 MB of memory since that should be enough for now.
  • base64-1.9.1 is the dataset; it means things run in a Solaris zone. It can also be installed from the FiFo UI.
  • 7df94bc3-6a9f-4c88-8f80-7a8f4086b79d is the UUID of the network; you can find it with fifo networks list:
schroedinger:fifopy heinz [master] $ fifo packages list
                                UUID Name       RAM        CPU cap    Quota
------------------------------------ ---------- ---------- ---------- ----------
5f9f6c41-d700-4b4f-80f1-7350a71ed2e6 small      512 MB     100%       10 GB
schroedinger:fifopy heinz [master] $ fifo networks list
                                UUID Name       Tag                  First            Last
------------------------------------ ---------- ---------- --------------- ---------------
7df94bc3-6a9f-4c88-8f80-7a8f4086b79d test       admin        192.168.0.210   192.168.0.220
schroedinger:fifopy heinz [master] $ fifo datasets list
                                UUID Name       Version Type  Description
------------------------------------ ---------- ------- ----- ----------
60ed3a3e-92c7-11e2-ba4a-9b6d5feaa0c4 base       1.9.1   zone  A SmartOS ...

Creating a VM with riak installed

Creating a VM is rather simple: we need a little JSON and pipe it to fifo with a cat. Please note the section reading user-script; this is where we do the setup. Here is how it looks:

cat <<EOF | fifo vms create -p $PACKAGE -d $DATASET
{
  "alias": "riak1",
  "networks": {"net0": "$NET"},
  "metadata": {"user-script": "/opt/local/bin/sed -i.bak \\"s/pkgsrc/pkgsrc-eu-ams/\\" /opt/local/etc/pkgin/repositories.conf; /opt/local/bin/pkgin update; /opt/local/bin/pkgin -y install riak; export IP=\`ifconfig net0 | head -n 2 | tail -n 1 | awk '{print \$2}'\`; /opt/local/bin/sed -i.bak \\"s/127.0.0.1/\$IP/\\" /opt/local/etc/riak/app.config; /opt/local/bin/sed -i.bak \\"s/127.0.0.1/\$IP/\\" /opt/local/etc/riak/vm.args; svcadm enable epmd riak"}
}
EOF

To get a better look, here is the user-script section with the escaping removed:

# We configure pkgin to use the european mirror you might not need to do that.
/opt/local/bin/sed -i.bak "s/pkgsrc/pkgsrc-eu-ams/" /opt/local/etc/pkgin/repositories.conf;
# We update the pkgin database and install riak
/opt/local/bin/pkgin update;
/opt/local/bin/pkgin -y install riak;
# We find out what IP our VM has from within the VM.
export IP=`ifconfig net0 | head -n 2 | tail -n 1 | awk '{print $2}'`;
# We update the app.config and vm.args to use the 'public' ip instead of the 127.0.0.1
/opt/local/bin/sed -i.bak "s/127.0.0.1/$IP/" /opt/local/etc/riak/app.config;
/opt/local/bin/sed -i.bak "s/127.0.0.1/$IP/" /opt/local/etc/riak/vm.args;
# Start epmd and riak
svcadm enable epmd riak

Waiting for riak

Now the first zone is set up; next we'll want to wait for riak to properly start. This is needed since the commands are asynchronous and installing the packages can be a tad slow. But we can just curl the HTTP interface to check for this, so it's rather simple:

# We'll ask fifo for the IP of our first zone.
IP1=`fifo vms get riak1 | json networks[0].ip`
# Print some info so waiting is not so boring
echo -n 'Waiting until riak is up and running on the primary node.'
# now we curl the http interface every second to see if things are good.
until curl http://${IP1}:8098 2>/dev/null >/dev/null
do
  sleep 1
  echo -n '.'
done
# and we're done!
echo " done."

Setting up the remaining zones

We're not going to go into too much detail here, since it works pretty much the same as the first VM, the only difference being that the user-script holds a few more lines:

for i in 2 3 4 5
do
  cat <<EOF | fifo vms create -p $PACKAGE -d $DATASET
  {
    "alias": "riak${i}",
    "networks": {"net0": "$NET"},
    "metadata": {"user-script": "/opt/local/bin/sed -i.bak \\"s/pkgsrc/pkgsrc-eu-ams/\\" /opt/local/etc/pkgin/repositories.conf; /opt/local/bin/pkgin update; /opt/local/bin/pkgin -y install riak; export IP=\`ifconfig net0 | head -n 2 | tail -n 1 | awk '{print \$2}'\`; /opt/local/bin/sed -i.bak \\"s/127.0.0.1/\$IP/\\" /opt/local/etc/riak/app.config; /opt/local/bin/sed -i.bak \\"s/127.0.0.1/\$IP/\\" /opt/local/etc/riak/vm.args; svcadm enable epmd riak; sleep 10; /opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster join riak@${IP1}; /opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster plan; /opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster commit"}
  }
EOF
  IP=`fifo vms get riak$i | json networks[0].ip`
  echo -n "Waiting untill riak is up and running on the node $i."
  until curl http://${IP}:8098 2>/dev/null >/dev/null
  do
      sleep 1
      echo -n '.'
  done
  echo " done."

done

The new lines join the node to the existing riak node, which is quite easy; we can use the $IP1 we obtained in the first step too:

/opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster join riak@${IP1}
/opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster plan
/opt/local/bin/sudo -uriak /opt/local/sbin/riak-admin cluster commit

Run all of that and you've got a 5-node riak cluster, and it's quick, at least if you're in the US and have a good connection to the package repository.

Here is all of this slapped together.

Writing Your First Riak Test Test (Yes I Know There Are Two Tests There)

As promised in a previous post, I'll talk a bit about writing tests for riak_test. To start with the obvious: it's pretty simple and pretty awesome. riak_test gives you the tools you've dreamed of when testing distributed riak_core applications:

  • a backchannel to communicate and execute commands on the nodes.
  • a nice and easy way to bring up and tear down the test environment.
  • helper functions to deal with the riak_core cluster.
  • something called intercepts that allow you to mock certain behaviours in the cluster.
  • all the power of Erlang.

How a test works

Tests have a very simple structure; they pretty much contain a single function: confirm/0.

This function gets called when the test is executed and should return pass when everything works well, or throw an exception when not. The actual tests are simple unit asserts.

That in itself is not really overly exciting, and those of you with a short attention span might start to think 'boooooooring', so let's look at the exciting part.

Starting your application

riak_test offers a way to start instances of your application and communicate with them. The common pattern is to start one (or more) nodes as the first part of the script and check that they are up and running. That could look something like this:

confirm() ->
    [Node] = rt:deploy_nodes(1),
    ?assertEqual(ok, rt:wait_until_nodes_ready([Node])),
    pass.

This is a minimal test it sets up one instance of our application and waits for it to be ready.

rt:deploy_nodes(1) will deploy one node; the id of the node (an atom that identifies it to Erlang) will be stored in Node. You can deploy more nodes by increasing the number passed to rt:deploy_nodes/1.

?assertEqual(ok, rt:wait_until_nodes_ready([Node])) will make the test wait for the nodes to be ready; ready here means that all ring services we defined in the config are provided.

The node will be running in its own Erlang VM and will have the test suite connected as a hidden node. This is the first thing that is truly fun, since the connection allows us to run rpc calls on the node.

An official channel

Now we've got a basic test running: our nodes start up and the test waits until everything is up and happy.

Chances are that, aside from this backchannel communication, the node provides some kind of API, and we want to be able to connect to this API to run our tests. In my case it's a simple TCP port that announces itself over mDNS. We could simply listen to the broadcast and use the information it provides to talk to the node. This would work as long as we have a single node; the moment we have two, we would never know which node we're talking to, and that would make testing hard.

So, backchannel to the rescue! We'll just get the node's configuration from the host. For my application I store this kind of information in the application configuration, so I've made a function that, given a Node, returns the IP and port to talk to, plus one to send data:

node_endpoint(Node) ->
    %% read the listener's IP and port from the node's application env
    {ok, IP} = rpc:call(Node, application, get_env, [mdns_server_lib, ip]),
    {ok, Port} = rpc:call(Node, application, get_env, [mdns_server_lib, port]),
    {IP, Port}.

call(Node, Msg) ->
    {IP, Port} = node_endpoint(Node),
    lager:debug("~s:~p <- ~p", [IP, Port, Msg]),
    %% one connection per request keeps the helper simple
    {ok, Socket} = gen_tcp:connect(IP, Port, [binary, {active,false}, {packet,4}], 100),
    ok = gen_tcp:send(Socket, term_to_binary(Msg)),
    {ok, Repl} = gen_tcp:recv(Socket, 0),
    {reply, Res} = binary_to_term(Repl),
    lager:debug("~s:~p -> ~p", [IP, Port, Res]),
    gen_tcp:close(Socket),
    Res.

There is some gen_tcp stuff in there, but let’s just ignore it for now; it’s a detail that is not important. The first function is the more interesting one: it uses the rpc module to execute calls on the node we just started, which is quite awesome.

As a note, I’ve put those functions into rt_<application name>, so for example rt_sniffle.
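Such a module is nothing fancy, just a plain helper module; a sketch of the packaging (function bodies exactly as above):

-module(rt_sniffle).
-export([node_endpoint/1, call/2]).

%% node_endpoint/1 and call/2 from above go here unchanged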

Testing the API

With call/2 we now have a way to send data directly to the node over the official API channel. So let’s get back to our confirm/0 and add some real tests:

confirm() ->
    [Node] = rt:deploy_nodes(1),
    ?assertEqual(ok, rt:wait_until_nodes_ready([Node])),
    ?assertEqual({ok,[]},
                 rt_sniffle:call(Node, {vm, list})),
    ?assertEqual(ok,
                 rt_sniffle:call(Node, {vm, register, <<"vmid">>, <<"hypervisor">>})),
    ?assertEqual({ok,[<<"vmid">>]},
                 rt_sniffle:call(Node, {vm, list})),
    pass.

It’s as easy as that. This test will:

  • List the registered VM’s and expect none to be there.
  • Create a new VM and expect an ok result.
  • List the registered VM’s and expect the one we just registered to be present.

And that’s it for basic testing with riak_test. I’ll follow up on this with an article about intercepts, since they add another cool feature to riak_test.

Getting Started With Riak_test and Riak_core

If you don’t know what riak_core is, or don’t have a riak_core based application, you’ll probably not get too much practical use out of this post; you might want to start with Ryan Zezeski’s “working” blog try try try and the rebar plugin.

That said, if you have a riak_core app this post should get you started on how to test it with riak_test. We’ll not go through the topic of how to write the tests themselves, this might come in a later post; also, the riak_kv tests are a good point to start.

Please note that the approach described here is what I chose to do; it might not be best practice or the best way for you to do things. I will also link to my fork of riak_test instead of the official one, since it includes some modifications required for testing apps other than riak_kv. I hope those modifications will be merged back at some point, but for now I want to get it ironed out a bit more before making a pull request.

What is riak_test?

So before we start, a few words about riak_test. riak_test is a pretty nice framework for testing distributed applications; it is, just like about all the other riak_* stuff, created by Basho, and it is pretty darn awesome.

In its current state it is very focused on testing riak_kv (or riak, as in the database), but from a first glance a lot of the functionality is very universal, and after all riak_kv is also built on top of riak_core, so modifying it to run with other riak_core based apps is pretty easy.

The setup

Since I will be testing multiple riak_core apps and not just one, I decided to go the following path: have the entire setup in a git repository, then have one branch for general fixes/changes for riak_core support, then have one branch for each application I want to test, based on the general branch so common changes can easily be merged. It will look like this:

---riak_test--------- (bashos master tree)
   `---riak_core----- (modifications to make riak_test work with core apps)
    ` `  `---sniffle- (tests for sniffle)
     ` `---snarl----- (tests for snarl)
      `---howl------- (tests for howl)

We’ll go over this by setting up tests for the howl application; it’s rather small and simple, and it’s easier to follow along with something real instead of a made up situation.

Getting started

Step one of getting started is to get a clone of the riak_test repository; that’s pretty simple (alter the path if you decided to fork):

cd ~/Projects
git clone https://github.com/Licenser/riak_test.git
cd riak_test

Now we branch off to have a place for our howl application, but first we need to check out the riak_core branch to make sure we get the changes included in it:

git checkout riak_core
git branch howl
git checkout howl

Okay, that’s it for the basic setup; not that bad so far, is it?

Configuration

The next thing we need to do is create a configuration. At this point we assume you don’t have one yet, so we’ll start from scratch; if you add more than one application later on you can just add them to an existing configuration.

riak_test looks for the configuration file ~/.riak_test.config and reads all the data from there so we’ll first need to copy the sample config there:

cp riak_test.config.sample ~/.riak_test.config

The next step is to open it in your favourite editor; you’ll recognise it’s a good old Erlang config file with tuples to group sections. We’ll be ignoring the default section for now; if you’re interested in it the documentation is quite good!

So let’s go down to where it reads:

%% ===============================================================
%%  Project-specific configurations
%% ===============================================================

Here is where the fun starts. You’ll see a tuple starting with {rtdev, (note that this rtdev has nothing whatsoever to do with the rtdev that appears in the default section as {rt_harness, rtdev}). The rtdev in the project part is just the name of the project; since your project is named howl, not rtdev, we’ll go and change that first.

{rtdev, [

Now we can go and set up some variables, first up the project name and executables. The name itself is just for information (or if you use giddyup); the executables are how your application is started. Since our application is named howl, it’s started with the command howl and the admin command for it is howl-admin.

%% The name of the project/product, used when fetching the test
%% suite and reporting.
{rt_project, "howl"},

{rc_executable, "howl"},
{rc_admin, "howl-admin"},

With that done come the services; those are the buggers you register in your _app.erl file. Let’s have a look at the howl_app.erl:

%...
            ok = riak_core_node_watcher:service_up(howl, self()),
%...

So we only have one service here we need to watch out for, named … you might guess it … right: howl. That makes the list rather short:

{rc_services, [howl]},

Now the cookie: it’s a bit hidden in the code that you need to set it, but you do, and you will need it later so remember it! Since I am bad at remembering things I named it … howl … again.

{rt_cookie, howl},

Now comes the setup of paths. For this we have to decide where we want to put our data later on; I’ve put all my riak_test things in /Users/heinz/rt/... so we’ll follow with this. Also note that my development process works on three branches:

  • test – the most unstable branch.
  • dev – here things go that should work.
  • master – only full releases go in here.

This setup might not work for you at all, but since they are only path names it should be easy enough to adapt them.

Note that by default riak_test will run tests on the current environment.

%% Paths to the locations of various versions of the project. This
%% is only valid for the `rtdev' harness.
{rtdev_path, [
              %% This is the root of the built `rtdev' repository,
              %% used for manipulating the repo with git. All
              %% versions should be inside this directory.
              {root, "/Users/heinz/rt/howl"},

              %% The path to the `current' version, which is used
              %% exclusively except during upgrade tests.
              {current, "/Users/heinz/rt/howl/howl-test"},

              %% The path to the most immediately previous version
              %% of the project, which is used when doing upgrade
              %% tests.
              {previous, "/Users/heinz/rt/howl/howl-dev"},

              %% The path to the version before `previous', which
              %% is used when doing upgrade tests.
              {legacy, "/Users/heinz/rt/howl/howl-stable"}
             ]}
]}

And that’s it; now the config is set up and should look like this:

{rtdev, [
    %% The name of the project/product, used when fetching the test
    %% suite and reporting.
    {rt_project, "howl"},

    {rc_executable, "howl"},
    {rc_admin, "howl-admin"},
    {rc_services, [howl]},
    {rt_cookie, howl},
    %% Paths to the locations of various versions of the project. This
    %% is only valid for the `rtdev' harness.
    {rtdev_path, [
                  %% This is the root of the built `rtdev' repository,
                  %% used for manipulating the repo with git. All
                  %% versions should be inside this directory.
                  {root, "/Users/heinz/rt/howl"},

                  %% The path to the `current' version, which is used
                  %% exclusively except during upgrade tests.
                  {current, "/Users/heinz/rt/howl/howl-test"},

                  %% The path to the most immediately previous version
                  %% of the project, which is used when doing upgrade
                  %% tests.
                  {previous, "/Users/heinz/rt/howl/howl-dev"},

                  %% The path to the version before `previous', which
                  %% is used when doing upgrade tests.
                  {legacy, "/Users/heinz/rt/howl/howl-stable"}
                 ]}
]}

Setting up the application

We’ve got riak_test ready; next we need to prepare howl to be tested. We’ll only look at the current (aka test) setup since the steps for the others are pretty much the same.

The first step is that we need the folder, so let’s create it:

mkdir -p /Users/heinz/rt/raw/howl
cd /Users/heinz/rt/raw/howl

Since howl lives with the octocat on github it’s easy to fetch our application and check out the test branch (remember, current is on the test branch for me):

git clone https://github.com/project-fifo/howl.git howl-test
cd howl-test
git checkout test

And done. Now, since it’s a riak_core app we should have a task called stagedevrel in our makefile which will generate four copies of howl for us in the folders dev/dev{1,2,3,4} and in the process take care of compiling and getting the dependencies. I prefer stagedevrel over the normal devrel since later on it will make it easier to recompile code files (make is enough), because it links them to the right place instead of copying them.

make stagedevrel

Now we have to do a bit of cheating: riak_test expects the root dir to be a git repository, which is why we can’t just put the data in there directly; we have to manually build the tree and set it up as a git repository.

mkdir -p /Users/heinz/rt/howl
cd /Users/heinz/rt/howl
git init

cat <<EOF > /Users/heinz/rt/howl/.gitignore
*/dev/*/bin
*/dev/*/erts-*
*/dev/*/lib
*/dev/*/releases
EOF

Now we need to link our devrel files, and for my setup I have to copy the *.example files of the app.config and vm.args into the right place; they might be named differently for you.

export RT_BASE=/Users/heinz/rt/howl/howl-test
export RC_BASE=/Users/heinz/rt/raw/howl/howl-test
for i in 1 2 3 4
do
  mkdir -p ${RT_BASE}/dev/dev${i}/
  cd ${RT_BASE}/dev/dev${i}/
  mkdir data etc
  touch data/.gitignore
  ln -s ${RC_BASE}/dev/dev${i}/{bin,erts-*,lib,releases} .
  cp ${RC_BASE}/dev/dev${i}/etc/vm.args.example etc/vm.args
  cp ${RC_BASE}/dev/dev${i}/etc/app.config.example etc/app.config
done

We still need to edit the vm.args in dev/dev{1,2,3,4}/etc/ since we need to set the correct cookie – I hope you still remember yours, I told you you’d need it (if not you can just look in the ~/.riak_test.config)!
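The relevant lines look something like this (a sketch, the node name differs per dev directory; only the cookie has to match the rt_cookie from your config):

## dev/dev1/etc/vm.args
-name dev1@127.0.0.1
-setcookie howl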

That’s it.

Running a first test

In the riak_core branch of riak_test I’ve moved all the riak_kv specific tests from tests to tests_riakkv so you can still look at them, but I left one of them in tests, namely the basic command test – it will check if your application’s command (howl in our case) is well behaved.

We’ll want to run it to see if howl is a good boy and behaves well. To do so we’ll need to get back into the riak_test folder and run the riak_test command:

cd ~/Projects/riak_test
./riak_test -t tests/* -c howl -v -b none

I’d like to explain this a bit, the arguments have the following meaning:

  • -t tests/* – we’ll be running all tests in the folder tests/.
  • -c howl – the application we want to test is named howl; this is the first element of the tuple we put in our config file, if you remember.
  • -v – This just turns on verbose output.
  • -b none – This is still a relic from the riak_kv roots; it specifies which backend to test with. Since we don’t have backends at all we’ll just pass none, which riak_test will happily ignore.
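You don’t have to glob either; a single test can be named directly. A hedged example, assuming the command test kept its basho name basic_command_line:

./riak_test -t tests/basic_command_line.erl -c howl -v -b none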

That’s it! Now go and test all the things!

This is the first part of a series that goes on here.

Plugins With Erlang

Preamble

Let’s start with this: Erlang releases are super useful, they are one of the features I like most about Erlang – you get an entirely self contained package you can deploy and forget: no library trouble, no wrong version of the VM, no trouble at all.

BUT (this had to come, didn’t it) sometimes they are limiting and kind of inflexible: adding a tiny little feature means rolling out a new release. With automated builds that is not so bad, but ‘not so bad’ isn’t good either. And things get worse when there are different wishes.

A little glimpse into reality: I’m currently working a lot on Project FiFo and one of the issues I faced is that – surprisingly – not everyone wants things to work exactly as I do. Which was a real shock, how could anyone ever disagree with me? Well … I got over it, really I did. Still, solving this issue by adding one code path for every preference and making it configurable didn’t look like a good solution.

Also, recently we have been thinking a lot about performance metrics, and there are like a gazillion of them; if you pick two random people I think they want three different sets of metrics. Ask again after 5 minutes and their opinions have changed to 7 new metrics and certainly not the old ones!

Plugins

The problem is a very old one: extending the software after it was shipped, possibly letting the community extend it beyond what was dreamed of in the beginning. The solution is pretty much one day younger than the problem: plugins, meaning a way to load code into the system.

Sounds easy, but it is a bit more complex: just having something load into the VM doesn’t do much good when it does not get executed in the proper place – so sadly this comes with extra work for the developer, who has to sprinkle their code with hooks and callbacks for the plugins.

With this I’d like to introduce eplugin, a very simplistic library for exactly that task: introducing plugins into an Erlang release or application. It takes care of discovering and loading plugins, letting them register for certain calls, does a little dependency management on startup, and provides functions to call registered plugins. Erlang already comes with great tools for this, so the whole thing sums up to under 400 LOC.

Types of plugins

I feel it’s kind of interesting to look at the different kinds of plugins that exist and how to handle those cases with eplugin; also, a post entirely without code would look boring.

informative plugins

Sometimes a plugin just wants to know that something happened but doesn’t care about the result; eplugin provides the call (and apply) functions for that. A logger is a good example for this, so let’s have a look:

%%% plugin.conf
{syslog_plugin,
  [{syslog_plugin, [{'some:event', log}]}],
  []}.

%%% syslog_plugin.erl
-module(syslog_plugin).
-export([log/1]).

log(String) ->
  os:cmd("logger '" ++ String ++ "'").

%%% in your code
  %%... 
  eplugin:call('some:event', "logging this!"),
  %%... 

That’s pretty much it: provided you’ve started the eplugin application in your code and put the plugins in the right place, this will just work. You could also use this to trigger side effects, like deleting all files when an error occurs to remove traces of your failure.
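Starting it is nothing special since it’s a plain OTP application; in a test shell that boils down to this (a sketch, in a release your start sequence would normally take care of it):

%% start eplugin and everything it depends on
{ok, _} = application:ensure_all_started(eplugin).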

messing around plugins

This kind of plugin processes some data and returns a new version of that data; we have fold for this case, fold since it internally uses a fold to pass the data from one plugin to the next. There are many applications for this, one would be to replace all occurrences of ‘not js’ with ‘node.js’ to prevent freudian typos in your texts.

%%% plugin.conf
{not_js,
  [{not_js, [{'text:check', replace}]}],
  []}.

%%% not_js.erl
-module(not_js).
-export([replace/1]).

replace(String) ->
  %% {return, list} keeps the result a plain string for the next plugin
  re:replace(String, "not js", "node.js", [global, {return, list}]).

%%% in your code
  %%... 
  String1 = eplugin:fold('text:check', "I'm writing a not js application!"),
  %%... 

fold and call are the most interesting and important kinds of plugins; they cover most if not all of the possible use cases. Still, there is a special case left which I found useful to have.

checking plugins

Checking plugins are plugins which are supposed to decide if something is OK or not; they are pretty much a case of fold that returns true or false (or actually whatever is not true). But eplugin solves this too, with the test function! An example here is authentication:

%%% plugin.conf
{get_out,
  [{get_out, [{'login:allowed', no_really_not}]}],
  []}.

%%% get_out.erl
-module(get_out).
-export([no_really_not/1]).

no_really_not(Login) ->
  {forbidden, ["Dear ", Login, " we don't want you here go away!"]}.

%%% in your code
  %%... 
  case eplugin:test('login:allowed', "Licenser") of
     true ->
      ok; %%% huzza!
     Error ->
      io:format("~p~n", [Error])
  end
  %%...