Drawing the Curtain Back on the Magic Elixir – Part 2

Introduction
In the last post, we looked at pattern matching and data structures in Elixir. In this post, we’re going to look particularly at Elixir processes and how they affect handling errors and state.

Supervisors & Error Handling
Supervisors, or rather supervision trees as controlled by supervisors, are another differentiating feature of Elixir. Processes in Elixir are supposed to be much lighter weight than threads in other languages,^[1] and you’re encouraged to create separate processes for many different purposes. These processes can be organized into supervision trees that control what happens when a process within the tree terminates unexpectedly. There are a lot of options for what happens in different cases, so I’ll refer you to the documentation to learn about them and just focus on my experiences with them. Elixir has a philosophy of “failing fast” and letting a supervision tree handle restarting the failing process automatically. I tried to follow this philosophy to start with, but I’ve been finding that I’m less enamoured with this feature than I thought I would be. My initial impression was that this philosophy would free me of the burden of thinking about error handling, but that turned out not to be the case. If nothing else, you have to understand the supervision tree structure and how that will affect the state you’re storing (yes, Elixir does actually have state, more below) because restarting a process resets its state. I had one case where I accidentally brought a server down because I just let a supervisor restart a failed process. The process dutifully started up again, hit the same error and failed again, over and over again. In surprisingly short order, this caused the log file to take up all the available disk space and took down the server entirely. Granted this could happen if I had explicit error handling, too, but at least there I’d have to think about it consciously and I’m more likely to be aware of the possibilities. 
You also have to be aware of how any child processes you created, either knowingly or unknowingly through libraries, are going to behave and add some extra code to your processes if you want to prevent the children bringing your process down with them. So, I’m finding supervisors a mixed blessing at best. They do offer some convenience, but they don’t really allow you to do anything that couldn’t be done in other languages with careful error handling. Because they hide a lot of the necessity for that error handling, though, I feel like they discourage me from thinking as much as I should about what might happen and how I would want to react to those events. As a final note tangentially related to supervisors, I also find that the heavy use of processes and Elixir’s inlining of code can make stack traces less useful than one might like.
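
To make the point about restarts and state concrete, here is a minimal sketch. The Counter and RestartDemo modules are hypothetical names invented for this illustration, and the forced kill merely stands in for an unexpected failure. The max_restarts and max_seconds options shown are the supervisor settings that limit how often children may be restarted before the supervisor itself gives up.

```elixir
defmodule Counter do
  # A tiny stateful process. `use Agent` also defines the child_spec/1
  # that the supervisor below relies on.
  use Agent

  def start_link(_opts), do: Agent.start_link(fn -> 0 end, name: __MODULE__)
  def increment, do: Agent.update(__MODULE__, &(&1 + 1))
  def value, do: Agent.get(__MODULE__, & &1)
end

defmodule RestartDemo do
  def run do
    # 3 restarts within 5 seconds are the documented defaults; they are
    # spelled out here only to show where the limit is configured.
    {:ok, sup} =
      Supervisor.start_link([Counter],
        strategy: :one_for_one,
        max_restarts: 3,
        max_seconds: 5
      )

    Counter.increment()
    Counter.increment()
    before_crash = Counter.value()

    # Simulate an unexpected failure of the child process.
    old_pid = Process.whereis(Counter)
    Process.exit(old_pid, :kill)
    new_pid = wait_for_restart(Counter, old_pid)

    after_restart = Counter.value()
    Supervisor.stop(sup)
    {before_crash, after_restart, new_pid != old_pid}
  end

  # Polls until the supervisor has registered a replacement process.
  defp wait_for_restart(name, old_pid) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        wait_for_restart(name, old_pid)
    end
  end
end

IO.inspect(RestartDemo.run())
```

The counter reads 2 before the crash and 0 after the restart: the replacement process starts from its initial state, which is exactly the thing you have to plan for.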

All that being said, supervisors are very valuable in Elixir because, at least to my thinking, the error handling systems are a mess. You can have “errors,” which are “raised” and may be handled in a “rescue” clause. You have “throws,” which appear not to have any other name, are “thrown,” and may be handled in a “catch” clause. And you have “exits,” which can also be handled in a “catch” clause or trapped and converted to a message that is sent to the process outside of the current flow of control. I feel like the separation of errors and throws is a distinction without a difference. In theory, throws should be used if you intend to control the flow of execution and errors should be used for “unexpected and/or exceptional situations,” but the idea of what’s “unexpected and/or exceptional” can vary a lot between people, so there have been a few times when third-party libraries I’ve been using raised an error in a situation that I think should have been perfectly expectable and should have been handled by returning a result code instead. (To be fair, I haven’t come across any code using throws. The habit seems to be to handle things with “with” statements instead.) There is a recommendation that one should have two versions of a function, one that returns a result code and another, with the same name suffixed with a bang, that raises an error (File.read vs. File.read!, for example), but library authors don’t follow this convention religiously, and sometimes you’ll find a function that’s supposed to return no matter what calling a function in yet another library that does raise an error, so you can’t really rely on it.
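
The three mechanisms can be seen side by side in a short sketch. ErrorKinds.classify/1 is a hypothetical helper written for this illustration, not part of any library, and the file name is assumed not to exist.

```elixir
defmodule ErrorKinds do
  # Runs `fun` and reports which of the three mechanisms, if any, ended it.
  def classify(fun) do
    try do
      {:ok, fun.()}
    rescue
      # Errors (exceptions) are raised and rescued.
      error -> {:error, Exception.message(error)}
    catch
      # Throws and exits are both handled in catch, told apart by kind.
      :throw, value -> {:throw, value}
      :exit, reason -> {:exit, reason}
    end
  end
end

IO.inspect(ErrorKinds.classify(fn -> 1 + 1 end))
IO.inspect(ErrorKinds.classify(fn -> raise "boom" end))
IO.inspect(ErrorKinds.classify(fn -> throw(:early) end))
IO.inspect(ErrorKinds.classify(fn -> exit(:shutdown) end))

# The bang convention: File.read reports failure as a value, while
# File.read! raises for the same (assumed missing) file.
IO.inspect(File.read("no_such_file_for_demo.txt"))
IO.inspect(ErrorKinds.classify(fn -> File.read!("no_such_file_for_demo.txt") end))
```

A caller who wants to be safe against all three has to write both the rescue and the catch clauses, which is part of why the split feels heavier than it needs to be.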

I haven’t seen any code explicitly using an exit statement or any flow control trying to catch exits, but I have found that converting the unintentional errors that sometimes happen into messages allows me to stop other processes from taking down my processes, thus allowing me to maintain the state of my process rather than having it start from its initial state again. That’s something I appreciate, but I still find it less communicative when reading code: the conversion is handled by setting a flag on the process, and that flag persists until explicitly turned off, so I find it harder to pinpoint where in my code I was calling the thing that actually failed.
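
A minimal sketch of that conversion, assuming a hypothetical TrapDemo module and a linked process that exits straight away:

```elixir
defmodule TrapDemo do
  def run do
    # From here on, exit signals from linked processes arrive as
    # {:EXIT, pid, reason} messages instead of terminating this process.
    previous = Process.flag(:trap_exit, true)

    pid = spawn_link(fn -> exit(:boom) end)

    result =
      receive do
        {:EXIT, ^pid, reason} -> {:trapped, reason}
      after
        1_000 -> :no_exit_seen
      end

    # The flag persists until it is explicitly switched back.
    Process.flag(:trap_exit, previous)
    result
  end
end

IO.inspect(TrapDemo.run())
```

Note that the failure shows up in a receive block (or a handle_info callback in a GenServer), well away from the line that started the linked process. That distance is what makes the failing call harder to pinpoint.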

All-in-all, I think that three forms of error handling are too many and that explicit error handling is easier to understand and forces one into thinking about it more. (I happen to believe that the extra thinking is worthwhile, but others may disagree.) That being said, adding supervision trees to other languages might be a nice touch.

State
Given that the hallmark of functional programming languages is being stateless, it may seem odd to conclude with a discussion of state, but the world is actually a stateful place and Elixir provides ways to maintain state that are worth discussing.

At one level, Elixir is truly stateless in that it’s not possible to modify the value of a variable. Instead, you create a new value that’s a function of the old value. For example, to remove an element from a list, you’d have to write something like:

my_list = List.delete_at(my_list, 3)
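
As a small sketch of what that rebinding means (the Immutability module and the variable names are made up for this illustration), anything that captured the old value keeps seeing it:

```elixir
defmodule Immutability do
  def run do
    my_list = [:a, :b, :c, :d, :e]

    # The closure captures the value my_list has right now.
    peek = fn -> my_list end

    # Rebinding: the name my_list now refers to a new, shorter list.
    my_list = List.delete_at(my_list, 3)

    {peek.(), my_list}
  end
end

IO.inspect(Immutability.run())
```

The closure still returns all five elements, while the rebound name holds the four-element list without :d.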

It’s not just that the List.delete_at function happens to create a new list with one fewer element; there’s no facility in the language for modifying the original value. Elixir, however, does let you associate a value with a process, and this effectively becomes the state of the process. There are several different forms that these processes can take, but the one that I’ve used (and seen used most) is the GenServer. The example in the linked document is quite extensive, and I refer you to that for details, but essentially there are two things to understand for the purposes of my discussion. On the client side, just like in OOP, you don’t need to know anything about the state associated with a process, so, continuing with the Stack example from the documentation, a client would add an element to a stack by calling something like:

GenServer.call(process_identifier, {:push, "value"})

(Note that process_identifier can be either a system-generated identifier or a name that you give the process when creating it. I’ll explain why that’s important in a little while.) On the server side, you need code something like:

def handle_call({:push, value}, from, state) do
…
{:reply, :ok, new_state}
end

Here, the tuple {:push, value} matches the second parameter of the client’s invocation. The from parameter identifies the caller (it is a tuple containing the client’s process identifier and a unique tag for the call), and state is the state associated with the server process. The return from the server function must be a tuple, the first value of which is a command to the infrastructure code. In the case of the :reply command, the second value of the tuple is what will be returned to the client, and the final value in the tuple is the state that should be associated with the process going forward. Within a process, only one GenServer callback (handle_call, handle_cast, handle_info; see the documentation for details) can be running at a time. Strictly in terms of the deliberate state associated with the process, this structure makes concurrent programming quite safe because you don’t need to worry about concurrent modification of the state. Honestly, you’d achieve the same effect in Java, for example, by marking every public method of a class as synchronized. It works, and it does simplify matters, but it might be overkill, too. On the other hand, when I’m working with concurrency in OOP languages, I tend towards patterns like Producer/Consumer relationships, which I feel are just as safe and offer me more flexibility.
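
Putting the client and server halves together, a complete stack might look something like the sketch below. It follows the shape described above (push as a call) rather than reproducing the documentation’s example exactly; the pop replies, the registered name and the StackDemo module are assumptions made for this illustration.

```elixir
defmodule Stack do
  use GenServer

  # Client side: callers never see the state, only the replies.
  def start_link(initial \\ []) do
    GenServer.start_link(__MODULE__, initial, name: __MODULE__)
  end

  def push(value), do: GenServer.call(__MODULE__, {:push, value})
  def pop, do: GenServer.call(__MODULE__, :pop)

  # Server side: each callback receives the current state and returns
  # the state to keep for the next message.
  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call({:push, value}, _from, state) do
    {:reply, :ok, [value | state]}
  end

  def handle_call(:pop, _from, [head | tail]), do: {:reply, {:ok, head}, tail}
  def handle_call(:pop, _from, []), do: {:reply, :empty, []}
end

defmodule StackDemo do
  def run do
    {:ok, pid} = Stack.start_link()
    :ok = Stack.push("first")
    :ok = Stack.push("second")
    popped = [Stack.pop(), Stack.pop(), Stack.pop()]
    GenServer.stop(pid)
    popped
  end
end

IO.inspect(StackDemo.run())
```

Because the callbacks run one at a time, the list held as state is never modified concurrently, no matter how many client processes call push and pop.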

There is a dark side to Elixir’s use of processes to maintain state, too: Beyond the explicit state discussed above, I’ve run into two forms of implicit state that have tripped me up from time to time. Not because they’re unreasonable in and of themselves, but because they are hidden and, as I become more aware of them, I realize that they do need thinking about. The first form of implicit state is the message queue that’s used to communicate with each process. When a client makes a call like GenServer.call(pid, {:push, "something"}), the system actually puts the {:push, "something"} message onto the message queue for the given pid, and that message waits in line until it gets to the front of the queue and then gets executed by a corresponding handle_call function. The first time this tripped me up was when I was using a third-party library to talk to a device via a TCP socket. I used this library for a while without any issues, but then there was one case where it would start returning the data from one device when I was asking for data from another device. As I looked at the code, both the library code and mine, I realized that my call was timing out (anytime you use GenServer.call, there’s a timeout involved; I was using the default of 5 seconds and it was that timeout being triggered instead of a TCP timeout), but the device was still returning its data somewhat after that. The timing was such that the code doing the low-level TCP communications put the response data into the higher-level library’s message queue before my code put its next message requesting data into the library’s queue, so the library sent my query to the device but gave my code the response to the previous query and went back to waiting for another message from either side, thus making everything one query off until another query call timed out.
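
The timeout half of that story can be reproduced in miniature. This is only a sketch under stated assumptions: SlowDevice is a hypothetical stand-in for the library process, Process.sleep stands in for a slow device, and the short timeout is chosen so the call gives up early.

```elixir
defmodule SlowDevice do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, [])

  @impl true
  def init(handled), do: {:ok, handled}

  @impl true
  def handle_call({:query, name, delay_ms}, _from, handled) do
    # Stand-in for waiting on a slow device.
    Process.sleep(delay_ms)
    {:reply, {:result, name}, [name | handled]}
  end

  def handle_call(:handled, _from, handled) do
    {:reply, Enum.reverse(handled), handled}
  end
end

defmodule TimeoutDemo do
  def run do
    {:ok, pid} = SlowDevice.start_link()

    first =
      try do
        # Give up after 50 ms even though the server needs 200 ms.
        GenServer.call(pid, {:query, :device_a, 200}, 50)
      catch
        :exit, {:timeout, _} -> :timed_out
      end

    # This request waits in the server's queue behind the slow one.
    handled = GenServer.call(pid, :handled)
    GenServer.stop(pid)
    {first, handled}
  end
end

IO.inspect(TimeoutDemo.run())
```

The caller sees a timeout, yet the server still finishes the :device_a query. The client giving up does not cancel the work, so a result nobody is waiting for can still be produced later.
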
This was easy enough to solve by making the library use a different process for each device, but it illustrates that you’re not entirely relieved of thinking about state when writing concurrent programs in Elixir. Another issue I’ve had with the message queue is that anything that has your process identifier can send you any message. I ran into this when I was using a different third-party library to send text messages to Slack. The documentation for that library said that I should use the asynchronous send function if I didn’t care about the result. In fact, it described the function as “fire-and-forget.” These were useful but non-essential status messages, so that seemed the right answer for me. Great, until the library decided to put a failure message into the message queue of my process and my code wasn’t written to accept it. Because there was no function clause that matched the message sent by the library, the process crashed, the supervisor restarted it, and it just crashed again, and again, and again until the original condition that caused the problem went away. Granted, this was mostly a problem with the library’s documentation and was easy enough to solve by adding a callback to my process to handle the library’s messages, but it does illustrate another area where the state of the message queue can surprise you.
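
A minimal sketch of that kind of defensive callback follows. StatusReporter, MailboxDemo and the {:slack_error, :rate_limited} message are hypothetical; the real library’s message shape is not shown in this post.

```elixir
defmodule StatusReporter do
  use GenServer

  def start_link, do: GenServer.start_link(__MODULE__, 0)

  @impl true
  def init(ignored), do: {:ok, ignored}

  @impl true
  def handle_call(:ignored_count, _from, ignored), do: {:reply, ignored, ignored}

  # A message this process expects.
  @impl true
  def handle_info(:tick, ignored), do: {:noreply, ignored}

  # Catch-all: once handle_info has any clauses of its own, a message
  # matching none of them raises a FunctionClauseError and crashes the
  # process. This clause counts and drops such messages instead.
  def handle_info(_unexpected, ignored), do: {:noreply, ignored + 1}
end

defmodule MailboxDemo do
  def run do
    {:ok, pid} = StatusReporter.start_link()

    # Anything holding the pid can send anything.
    send(pid, {:slack_error, :rate_limited})
    send(pid, :tick)

    count = GenServer.call(pid, :ignored_count)
    alive = Process.alive?(pid)
    GenServer.stop(pid)
    {count, alive}
  end
end

IO.inspect(MailboxDemo.run())
```

Whether to drop, count or log unexpected messages is a design choice; the point is only that the process survives them and keeps its state.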

The second place that I’ve been surprised to find implicit state is the state of the processes themselves. To be fair, this has only been an issue in writing unit tests, but it does still make life somewhat more difficult. In the OOP world, I’m used to having a new instance of the class I’m testing created for each and every test. That turns out not to be the case when testing GenServers. It’s fine if you’re using the system-generated process identifiers, but I often have processes that have a fixed name so that any other process can send a message to them. I ended up with messages from one test getting into the message queue of the process while another test was running. It turns out that test modules marked async: true run concurrently with one another, but even when I went through and turned that off, I still had trouble controlling the state of these specially named processes. I tried restarting them in the test setup, but that turned out to be more work than I expected because the shutdown isn’t instantaneous and trying to start a new process with the same name causes an error. Eventually I settled on just adding a special function to reset the state without having to restart the process. I’m not happy about this because that function should never be used in production, but it’s the only reliable way I’ve found to deal with the problem. Another interesting thing I’ve noticed about tests and state is that the mocking framework is oddly stateful, too. I’m not yet sure why this happens, but I’ve found that when I mock some module, like the library that talks over TCP, a failing test can cause other tests to fail because the mocking framework still has hold of the module. Remove the failing test and the rest pass without issue; the same is true if you fix the failing test. But a failing test that uses the mocking framework can cause other tests to fail, and all the irrelevant failures really clutter up the test results.
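
The reset-function workaround might look something like this sketch. Tally, its reset/0 function and ResetDemo are hypothetical names for illustration, not the code from the project described above.

```elixir
defmodule Tally do
  use GenServer

  @initial_state 0

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, @initial_state, name: __MODULE__)
  end

  def add(n), do: GenServer.call(__MODULE__, {:add, n})
  def total, do: GenServer.call(__MODULE__, :total)

  # Test-only helper: puts the named process back into its initial state
  # without stopping it and re-registering the name.
  def reset, do: GenServer.call(__MODULE__, :reset)

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:add, n}, _from, total), do: {:reply, :ok, total + n}
  def handle_call(:total, _from, total), do: {:reply, total, total}
  def handle_call(:reset, _from, _total), do: {:reply, :ok, @initial_state}
end

defmodule ResetDemo do
  def run do
    {:ok, pid} = Tally.start_link()
    :ok = Tally.add(5)
    before_reset = Tally.total()
    :ok = Tally.reset()
    after_reset = Tally.total()
    GenServer.stop(pid)
    {before_reset, after_reset}
  end
end

IO.inspect(ResetDemo.run())
```

Because reset is a synchronous call, it is handled only after every message that was already waiting in the queue, so leftovers from a previous test are drained before the state is cleared. ExUnit also provides start_supervised!/2, which stops the process it started when the test finishes; whether that fits depends on how the named process is normally started in the application.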

Conclusion
I am reminded of a story wherein Jascha Heifetz’s young nephew was learning the violin and asked his uncle to demonstrate something for him. Heifetz apparently took up his nephew’s “el cheapo student violin” and made it sound like the finest Stradivarius. I’m honestly not sure if this is a true story or not, but I like the point that we shouldn’t let our tools dictate what we’re capable of. We may enjoy playing a Stradivarius/using our favorite programming language more than an “el cheapo” violin/other programming language, but we shouldn’t let that stop us achieving our ends. Elixir is probably never going to be my personal Strad, but I’d be happy enough to use it commercially once it matures some more. Most of the issues I’ve described in these postings are likely solvable with more experience on my part, but the lack of maturity would hold me back from recommending it for commercial purposes at this point. When I started using it last June, Dave Thomas’ Programming Elixir was available for version 1.2 and the code was written in version 1.3, although later versions were already available. Since then, the official release version has gotten up to 1.6.4, with some significant-seeming changes in each minor version. Other things that make me worry about the maturity of Elixir include Ecto, the equivalent of an ORM for Elixir, which doesn’t list Oracle as an officially supported database, and the logging system, which only writes to the console out of the box. (You can implement custom backends for it, but I personally would prefer to spend my time solving business problems rather than writing code to make more useful logs, and I haven’t yet seen a third-party library that would do what I want for logging.) For now, my general view is that Elixir should be kept for hobby projects and personal interest, but it could be a viable tool for commercial projects once the maturity issues are dealt with.

About the Author
Narti is a former CC Pace employee now working at ConnectDER. He’s been interested in the design of programming languages since first reading The Dragon Book and the proceedings of HOPL I some thirty years ago.

^[1] I’ve no reason to doubt this. I just haven’t tested it myself.