Drawing the Curtain Back on the Magic Elixir – Part 2
Introduction
In the last post, we looked at pattern matching and data structures in Elixir. In this post, we’re going to look particularly at Elixir processes and how they affect handling errors and state.
Supervisors & Error Handling
Supervisors, or rather supervision trees as controlled by supervisors, are another differentiating feature of Elixir. Processes in Elixir are supposed to be much lighter weight than threads in other languages,[1] and you’re encouraged to create separate processes for many different purposes. These processes can be organized into supervision trees that control what happens when a process within the tree terminates unexpectedly. There are a lot of options for what happens in different cases, so I’ll refer you to the documentation to learn about them and just focus on my experiences with them. Elixir has a philosophy of “failing fast” and letting a supervision tree handle restarting the failing process automatically. I tried to follow this philosophy to start with, but I’ve been finding that I’m less enamoured with this feature than I thought I would be. My initial impression was that this philosophy would free me of the burden of thinking about error handling, but that turned out not to be the case. If nothing else, you have to understand the supervision tree structure and how that will affect the state you’re storing (yes, Elixir does actually have state, more below) because restarting a process resets its state. I had one case where I accidentally brought a server down because I just let a supervisor restart a failed process. The process dutifully started up again, hit the same error and failed again, over and over again. In surprisingly short order, this caused the log file to take up all the available disk space and took down the server entirely. Granted this could happen if I had explicit error handling, too, but at least there I’d have to think about it consciously and I’m more likely to be aware of the possibilities. 
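To make the restart behaviour a little more concrete, here is a minimal sketch of a supervisor with an explicit restart budget. The module names Demo.Flaky and Demo.Sup are hypothetical, and the limits (3 restarts within 5 seconds) are the documented defaults written out explicitly rather than values taken from the project described above. Once the budget is exhausted the supervisor itself terminates, and what happens next depends on whatever sits above it in the tree.

```elixir
defmodule Demo.Flaky do
  # Hypothetical worker; a real one would do something that can fail.
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok), do: {:ok, %{}}
end

defmodule Demo.Sup do
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    children = [Demo.Flaky]

    # If a child is restarted more than max_restarts times within
    # max_seconds, the supervisor gives up and exits.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```

Calling Demo.Sup.start_link() starts the worker under the supervisor, and Supervisor.which_children/1 lists it. Tuning these two numbers is one place where the thinking about failure still has to happen.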
You also have to be aware of how any child processes you created, either knowingly or unknowingly through libraries, are going to behave and add some extra code to your processes if you want to prevent the children bringing your process down with them. So, I’m finding supervisors a mixed blessing at best. They do offer some convenience, but they don’t really allow you to do anything that couldn’t be done in other languages with careful error handling. By hiding a lot of the necessity for that error handling, though, I feel like it discourages me from thinking about what might happen and how I would want to react to those events as much as I should. As a final note tangentially related to supervisors, I also find that the heavy use of processes and Elixir’s inlining of code can make stack traces less useful than one might like.
All that being said, supervisors are very valuable in Elixir because, at least to my thinking, the error handling systems are a mess. You can have “errors,” which are “raised” and may be handled in a “rescue” clause. You have “throws,” which appear not to have any other name, are “thrown” and may be handled in a “catch” clause. And you have “exits,” which can also be handled in a “catch” clause or trapped, converted to a message and sent to the process outside of the current flow of control. I feel like the separation of errors and throws is a distinction without a difference. In theory, throws should be used if you intend to control the flow of execution and errors should be used for “unexpected and/or exceptional situations,” but the idea of what’s “unexpected and/or exceptional” can vary a lot between people, so there have been a few times when third-party libraries I’ve been using raised an error in a situation that I think should have been perfectly expectable and should have been handled by returning a result code instead. (To be fair, I haven’t come across any code using throws. The habit seems to be to handle things with with statements instead.) There is a recommendation that one should have two versions of a function, one that returns a result code and another, with the same name suffixed with a bang, that raises an error (File.read vs. File.read!, for example), but library authors don’t follow this convention religiously, and sometimes you’ll find a function that’s supposed to return no matter what calling a function in yet another library that does raise an error, so you can’t really rely on it.
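For readers who haven’t met all three mechanisms, the following sketch puts them side by side in a single try expression. Demo.Errors and classify are hypothetical names made up for illustration; only raise, throw, exit, rescue and catch come from the language itself.

```elixir
defmodule Demo.Errors do
  # Runs a zero-arity function and reports which mechanism, if any,
  # interrupted it.
  def classify(fun) do
    try do
      {:ok, fun.()}
    rescue
      # Errors (exceptions) are raised and rescued.
      e -> {:raised, Exception.message(e)}
    catch
      # Throws and exits are both handled in catch, tagged by kind.
      :throw, value -> {:thrown, value}
      :exit, reason -> {:exited, reason}
    end
  end
end
```

For example, Demo.Errors.classify(fn -> throw(:x) end) returns {:thrown, :x}, while a function that calls raise "bad" comes back as {:raised, "bad"}.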
I haven’t seen any code explicitly using an exit statement or any flow control trying to catch exits, but I have found that converting the unintentional errors that sometimes happen into messages allows me to avoid other processes taking down my processes, thus allowing me to maintain the state of my process rather than having it start from its initial state again. That’s something I appreciate, but I still find it less communicative when reading code as the conversion is handled by setting a flag on the process, and that flag persists until explicitly turned off, so I find it harder to pinpoint where in my code I was calling the thing that actually failed.
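The flag in question is :trap_exit. A minimal sketch of the idea, assuming a hypothetical Demo.Keeper process that links to a helper which then dies, might look like the following. The module, its functions and the :boom reason are all invented for the example.

```elixir
defmodule Demo.Keeper do
  use GenServer

  def start_link(initial), do: GenServer.start_link(__MODULE__, initial)
  def crash_helper(server), do: GenServer.call(server, :crash_helper)
  def state(server), do: GenServer.call(server, :state)

  @impl true
  def init(initial) do
    # From here on, exits of linked processes arrive as messages
    # instead of terminating this process.
    Process.flag(:trap_exit, true)
    {:ok, %{value: initial, exits: []}}
  end

  @impl true
  def handle_call(:crash_helper, _from, state) do
    # A linked helper that exits abnormally straight away.
    spawn_link(fn -> exit(:boom) end)
    {:reply, :ok, state}
  end

  def handle_call(:state, _from, state), do: {:reply, state, state}

  @impl true
  def handle_info({:EXIT, _pid, reason}, state) do
    # Record the failure and carry on with the existing state.
    {:noreply, %{state | exits: [reason | state.exits]}}
  end
end
```

Note that nothing at the spawn_link call site hints that the failure will surface in handle_info, which is the readability cost described above.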
All-in-all, I think that three forms of error handling are too many and that explicit error handling is easier to understand and forces one into thinking about it more. (I happen to believe that the extra thinking is worthwhile, but others may disagree.) That being said, adding supervision trees to other languages might be a nice touch.
State
Given that the hallmark of functional programming languages is being stateless, it may seem odd to conclude with a discussion of state, but the world is actually a stateful place and Elixir provides ways to maintain state that are worth discussing.
At one level, Elixir is truly stateless in that it’s not possible to modify the value of a variable. Instead, you create a new value that’s a function of the old value. For example, to remove an element from a list, you’d have to write something like:
my_list = List.delete_at(my_list, 3)
It’s not just that the List.delete_at function happens to create a new list with one fewer element; there’s no facility in the language for modifying the original value. Elixir, however, does let you associate a value with a process, and this effectively becomes the state of the process. There are several different forms that these processes can take, but the one that I’ve used (and seen most used) is the GenServer. The example in the linked document is quite extensive, and I refer you to that for details, but essentially there are two things to understand for the purposes of my discussion. On the client side, just like in an OOP language, you don’t need to know anything about the state associated with a process, so, continuing with the Stack example from the documentation, a client would add an element to a stack by calling something like:
GenServer.call(process_identifier, {:push, "value"})
(Note that process_identifier can be either a system generated identifier or a name that you give it when creating the process. I’ll explain why that’s important in a little while.) On the server side, you need code something like:
def handle_call({:push, value}, from, state) do
  …
  {:reply, :ok, new_state}
end
Here, the tuple {:push, value} matches the second parameter of the client’s invocation. The from parameter identifies the calling client (it is a tuple containing the client’s process identifier and a unique tag), and state is the state associated with the server process. The return from the server function must be a tuple, the first value of which is a command to the infrastructure code. In the case of the :reply command, the second value of the tuple is what will be returned to the client, and the final value in the tuple is the state that should be associated with the process going forward. Within a process, only one GenServer callback (handle_call, handle_cast, handle_info, see the documentation for details) can be running at a time. Strictly in terms of the deliberate state associated with the process, this structure makes concurrent programming quite safe because you don’t need to worry about concurrent modification of the state. Honestly, you’d achieve the same effect in Java, for example, by marking every public method of a class as synchronized. It works, and it does simplify matters, but it might be overkill, too. On the other hand, when I’m working with concurrency in OOP languages, I tend towards patterns like Producer/Consumer relationships, which I feel are just as safe and offer me more flexibility.
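Putting the client and server halves together, a minimal runnable sketch of such a stack could look like this. Demo.Stack is a hypothetical module written for this post rather than the one from the official documentation, and representing the stack as a plain list is an assumption.

```elixir
defmodule Demo.Stack do
  use GenServer

  # Client side: callers never see the state itself.
  def start_link(initial \\ []), do: GenServer.start_link(__MODULE__, initial)
  def push(server, value), do: GenServer.call(server, {:push, value})
  def pop(server), do: GenServer.call(server, :pop)

  # Server side: each callback receives the current state and
  # returns the state to use from now on.
  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_call({:push, value}, _from, state) do
    {:reply, :ok, [value | state]}
  end

  def handle_call(:pop, _from, [head | tail]), do: {:reply, {:ok, head}, tail}
  def handle_call(:pop, _from, []), do: {:reply, :empty, []}
end
```

After {:ok, stack} = Demo.Stack.start_link(), pushing "value" and then popping returns {:ok, "value"}, and a further pop returns :empty.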
There is a dark side to Elixir’s use of processes to maintain state, too: Beyond the explicit state discussed above, I’ve run into two forms of implicit state that have tripped me up from time to time. Not because they’re unreasonable in-and-of themselves, but because they are hidden and as I become more aware of them, I realize that they do need thinking about. The first form of implicit state is the message queue that’s used to communicate with each process. When a client makes a call like GenServer.call(pid, {:push, “something”}), the system actually puts the {:push, “something”} message onto the message queue for the given pid and that message waits in line until it gets to the front of the queue and then gets executed by a corresponding handle_call function. The first time this tripped me up was when I was using a third-party library to talk to a device via a TCP socket. I used this library for a while without any issues, but then there was one case where it would start returning the data from one device when I was asking for data from another device. As I looked at the code, both the library code and mine, I realized that my call was timing out (anytime you use GenServer.call, there’s a timeout involved; I was using the default of 5 seconds and it was that timeout being triggered instead of a TCP timeout), but the device was still returning its data somewhat after that. The timing was such that the code doing the low-level TCP communications put the response data into the higher-level library’s message queue before my code put its next message requesting data into the library’s queue, so the library sent my query to the device but gave my code the response to the previous query and went back to waiting for another message from either side, thus making everything one query off until another query call timed out. 
This was easy enough to solve by making the library use a different process for each device, but it illustrates that you’re not entirely relieved of thinking about state when writing concurrent programs in Elixir. Another issue I’ve had with the message queue is that anything that has your process identifier can send you any message. I ran into this when I was using a different third-party library to send text messages to Slack. The documentation for that library said that I should use the asynchronous send method if I didn’t care about the result. In fact, it described the method as “fire-and-forget.” These were useful but non-essential status messages, so that seemed the right answer for me. Great, until the library decided to put a failure message into the message queue of my process and my code wasn’t written to accept it. Because there was no function that matched the parameters sent by the library, the process crashed, the supervisor restarted it, and it just crashed again, and again, and again until the original condition that caused the problem went away. Granted, this was mostly a problem with the library’s documentation and was easy enough to solve by adding a callback to my process to handle the library’s messages, but it does illustrate another area where the state of the message queue can surprise you.
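A sketch of that kind of defensive callback follows. Once a module defines its own handle_info clauses, any message that matches none of them raises a FunctionClauseError, so a catch-all clause at the end is one way to absorb surprises. Demo.Notifier, the :tick message and the {:slack_error, ...} message shape are all hypothetical; the real library’s message format isn’t shown in this post.

```elixir
defmodule Demo.Notifier do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, :ok)
  def unexpected_count(server), do: GenServer.call(server, :unexpected_count)

  @impl true
  def init(:ok), do: {:ok, %{unexpected: 0}}

  @impl true
  def handle_call(:unexpected_count, _from, state) do
    {:reply, state.unexpected, state}
  end

  # A message this process knows about.
  @impl true
  def handle_info(:tick, state), do: {:noreply, state}

  # Catch-all: count anything else instead of crashing on it.
  def handle_info(_other, state) do
    {:noreply, %{state | unexpected: state.unexpected + 1}}
  end
end
```

In real code the catch-all would more likely log the message than count it, but the point is the same: the mailbox is part of the process’s interface whether or not you planned for it.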
The second place that I’ve been surprised to find implicit state is the state of the processes themselves. To be fair, this has only been an issue in writing unit tests, but it does still make life somewhat more difficult. In the OOP world, I’m used to having a new instance of the class I’m testing created for each and every test. That turns out not to be the case when testing GenServers. It’s fine if you’re using the system generated process identifiers, but I often have processes that have a fixed identifier so that any other process can send a message to them. I ended up with messages from one test getting into the message queue of the process while another test was running. It turns out that tests run asynchronously by default, but even when I went through and turned that off, I still had trouble controlling the state of these specially named processes. I tried restarting them in the test setup, but that turned out to be more work than I expected because the shutdown isn’t instantaneous and trying to start a new process with the same name causes an error. Eventually I settled on just adding a special function to reset the state without having to restart the process. I’m not happy about this because that function should never be used in production, but it’s the only reliable way I’ve found to deal with the problem. Another interesting thing I’ve noticed about tests and state is that the mocking framework is oddly stateful, too. I’m not yet sure why this happens, but I’ve found that when I have some module, like the library that talks over TCP, a failing test can cause other tests to fail because the mocking framework still has hold of the module. Remove the failing test and the rest pass without issue, same thing if you fix the failing test. But having a failing test that uses the mocking framework can cause other tests to fail. It really clutters up the test results to have all the irrelevant failures.
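One alternative worth knowing about, offered as a sketch rather than as a fix for the project described here, is ExUnit’s start_supervised!/1. It starts a process under a supervisor owned by the test and stops it before the next test begins, which sidesteps the same-name clash on restart. Demo.Counter and Demo.CounterTest are hypothetical, and the example assumes the named process can be started by the test itself rather than by the application’s own supervision tree.

```elixir
# In a Mix project this call lives in test/test_helper.exs (without options).
ExUnit.start(autorun: false)

defmodule Demo.Counter do
  use GenServer

  # Registered under a fixed name so any process can reach it.
  def start_link(_opts), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)
  def increment, do: GenServer.call(__MODULE__, :increment)
  def value, do: GenServer.call(__MODULE__, :value)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call(:increment, _from, count), do: {:reply, :ok, count + 1}
  def handle_call(:value, _from, count), do: {:reply, count, count}
end

defmodule Demo.CounterTest do
  use ExUnit.Case, async: false

  setup do
    # A fresh, named counter for every test; ExUnit shuts it down
    # when the test finishes.
    start_supervised!(Demo.Counter)
    :ok
  end

  test "starts from zero" do
    assert Demo.Counter.value() == 0
  end

  test "increments" do
    :ok = Demo.Counter.increment()
    assert Demo.Counter.value() == 1
  end
end
```

Run as a script, the tests execute when ExUnit.run() is called; under mix test they run automatically. Both tests pass regardless of the order in which they run because neither sees the other’s counter.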
Conclusion
I am reminded of a story wherein Jascha Heifetz’s young nephew was learning the violin and asked his uncle to demonstrate something for him. Heifetz apparently took up his nephew’s “el cheapo student violin” and made it sound like the finest Stradivarius. I’m honestly not sure if this is a true story or not, but I like the point that we shouldn’t let our tools dictate what we’re capable of. We may enjoy playing a Stradivarius/using our favorite programming language more than an “el cheapo” violin/other programming language, but we shouldn’t let that stop us achieving our ends. Elixir is probably never going to be my personal Strad, but I’d be happy enough to use it commercially once it matures some more. Most of the issues I’ve described in these postings are likely solvable with more experience on my part, but the lack of maturity would hold me back from recommending it for commercial purposes at this point. When I started using it last June, Dave Thomas’ Programming Elixir was available for version 1.2 and the code was written in version 1.3, although later versions were already available. Since then, the official release version has gotten up to 1.6.4 with some significant-seeming changes in each minor version. Other things that make me worry about the maturity of Elixir include Ecto, the equivalent of an ORM for Elixir, which doesn’t list Oracle as an officially supported database, and the logging system, which only writes to the console out of the box. (Although you can implement custom backends for it, I personally would prefer to spend my time solving business problems rather than writing code to make more useful logs, and I haven’t seen a third-party library yet that would do what I want for logging.) For now, my general view is that Elixir should be kept for hobby projects and personal interest, but it could be a viable tool for commercial projects once the maturity issues are dealt with.
About the Author
Narti is a former CC Pace employee now working at ConnectDER. He’s been interested in the design of programming languages since first reading The Dragon Book and the proceedings of HOPL I some thirty years ago.
[1] I’ve no reason to doubt this. I just haven’t tested it myself.
Introduction
Last June I took over an Elixir based project that involves a central server communicating with various devices over a TCP/IP network. This was my first real exposure to Elixir, so I quickly got a copy of Dave Thomas’ Programming Elixir and went through Code School’s Elixir tutorials so that I could at least start understanding the code I was inheriting. Since then, I’ve spent a fair amount of time changing and adding to this project and feel that I’m starting to come to terms with Elixir and have something to say about it.
My purpose in these posts is not to offer another introductory course in the language but rather to look at some of the features that are said to distinguish it from other languages and think about how different these features really are and how they affect the readability and expressivity of the code that one writes. In this first entry, we’ll look at pattern matching, which constitutes Elixir’s biggest differentiator, and the data structures available in Elixir. A forthcoming post will look at supervision trees, error handling and the handling of state in Elixir.
Pattern Matching
Pattern matching is perhaps the feature that most distinguishes Elixir from traditional imperative languages and, for me, is the most interesting part of the language. Take a statement like:
x = 1
In an imperative language, this assigns the value of 1 to the variable x. In Elixir, it effectively does the same thing in this case. But in Elixir, the equals symbol is called a match operator and actually performs a match between the left-hand side and the right-hand side of the operator. So, we can do things like
{:ok, x} = Enum.fetch([2, 4, 6, 8], 2)
do_something(x)
report_success()
In this case, the fetch function of the Enum module would return {:ok, 6}, which would match the left-hand side of the expression and the variable x would take on the value of 6. We then call the do_something function, passing it the value of x. What would happen, though, were we to call fetch with a position that didn’t exist in the collection we passed it? In this case, fetch returns the atom :error, which would not match the left-hand side of the expression. If we don’t do anything to handle that non-matching expression, the VM will raise a MatchError and stop the process that tried to execute the failed match. (This is not as drastic as it sounds; see the discussion of supervisors in Part 2.) All is not lost, though. There are several ways we could avoid terminating the process when the match fails, but I’ll just describe two: the with statement and pattern matching as part of the formal parameters to a function declaration. Strictly speaking, the former is a macro, which I’m not going to cover, but it acts like a flow control statement and quacks like a flow control statement, so I’m not going to worry about the distinctions. Rewriting the above to avoid terminating the running process using a with statement would look like
with {:ok, x} <- Enum.fetch([2, 4, 6, 8], 2) do
  do_something(x)
  report_success()
else
  :error -> report_fetch_error()
end
In this case, we’re saying that we’ll call do_something if and only if the result of calling fetch matches {:ok, x}; if the result of calling fetch is :error, we’re going to call report_fetch_error, and if the result is something else, which isn’t actually possible with this function, we’d still raise an error (a WithClauseError) because no else clause matches. We can actually have multiple matches in both the with and the else clauses, so we could also do something like:
with {:ok, x} <- Enum.fetch([2, 4, 6, 8], 2),
     :ok <- do_something(x) do
  report_success()
else
  :do_something_error -> report_do_something_error()
  :error -> report_fetch_error()
end
Here, do_something will still not be executed unless the result of fetch matches the first clause, but the else clauses are executed with the first non-matching value in the with clauses. If do_something also just returned :error, though, we’d have to separate this out into two with statements.
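As an aside, one idiom that keeps everything in a single with even when both steps return a bare :error is to wrap each step in a tagged tuple so the else clauses can tell them apart. The sketch below assumes a hypothetical Demo.With module whose step argument stands in for do_something; the tags :fetch and :step are arbitrary names chosen for the example.

```elixir
defmodule Demo.With do
  # step is any one-argument function returning :ok or :error.
  def run(list, index, step) do
    with {:fetch, {:ok, x}} <- {:fetch, Enum.fetch(list, index)},
         {:step, :ok} <- {:step, step.(x)} do
      :success
    else
      # The tag says which clause produced the non-matching value.
      {:fetch, :error} -> :fetch_failed
      {:step, :error} -> :step_failed
    end
  end
end
```

Whether the extra tuples read better than two separate with statements is a matter of taste; they do add noise to the happy path.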
The other way we could handle the original problem of the non-matching expression would be to use pattern matching in the formal parameters of a set of functions we define. For example, we could express the code above as:
do_something(Enum.fetch([2, 4, 6, 8], 2))
def do_something({:ok, x}) do
  result = … # :ok or :error
  report_do_something_result(result)
end
def do_something(:error) do
  report_fetch_error()
end
def report_do_something_result(:ok) do
  report_success()
end
def report_do_something_result(:error) do
  # whatever it takes to report the error
end
In this case, we’re using pattern matching to overload a function definition and do different things based on what parameters are passed instead of using flow control statements. The Elixir literature tends to say that this is actually preferable to using flow control statements as it makes each function simpler and easier to understand, but I have my doubts about it. I find that I tend to duplicate a lot of code when I use pattern matching as a flow control mechanism. To be fair, I’m learning more and more techniques to avoid this, but in the end I suspect that the discouragement of flow control structures and, perhaps more importantly, the lack of subclassing makes the code harder to understand and makes it harder to avoid duplicating code than I’d like.
Sometimes, too, I feel like the cure for avoiding this kind of duplication can be worse than the disease. One technique I’ve seen used is essentially to use reflection, and I found that code very hard to follow and some of the code just got duplicated between modules instead of being duplicated in a single module. There are other techniques for dealing with this kind of duplication, too, but I can’t shake the feeling that I’m being pushed into making the code less expressive when trying to remove duplication when other languages would have made removing the duplication easier and more obvious.
Another concern I have around this form of overloading functions through pattern matching is that the meaning of the patterns to match isn’t always obvious. For example, look at two functions like:
def do_something(item, false) do
  …
end
def do_something(item, true) do
  …
end
How do you know what the true and false mean? Sadly, you have to go back and find the code that’s calling these functions to see what the distinguishing parameter represents. You can actually write the declarations to be more intention revealing:
def do_something(item, false = _active) do
  …
end
def do_something(item, true = _active) do
  …
end
But the majority of the code I’ve looked at is written in the former style that doesn’t reveal its intentions as well as the latter.
Pattern matching is at once both the most intriguing and a sometimes troubling element of the language. I find it an interesting concept, but I’m still not sure whether it’s something I’m really happy about.
Data Structures
Elixir displays a curious dichotomy in regards to data structures. On the one hand, we’re told that we shouldn’t use data structures because they represent the bad-old object oriented way of doing things. On the other hand, Elixir does allow one to create structs, which define a set of fields, but no behavior, and set default values for them. Structs are really just syntactic sugar on top of Elixir’s maps, though. An Elixir map is a set of key/value mappings and a struct enforces the set of keys that one can use for that particular map. Otherwise they’re interchangeable, so I’m going to ignore structs for the rest of this discussion. Elixir offers two other data structures that aggregate data: lists and tuples. Conceptually, Elixir lists are pretty much like a list in other programming languages. (Remember that all of these data structures are immutable in Elixir.) The documentation says that there are some low-level implementation differences that make Elixir lists more efficient to work with than in other languages, but I haven’t concerned myself terribly about this. Finally, in terms of aggregating data, we have tuples. One can use a tuple much like a list, except that again there are implementation details that make them more efficient to use in some situations than lists, while in other situations the lists are more efficient. In practice, I’ve mostly seen tuples used when one wants to match a pattern and lists used when one is just storing the data. One somewhat contrived example of all of this and then some discussion:
{:ok, %{grades: [g1, g2, g3], names: [n1, "student2", n3]}} = get_grades()
So, what’s going on here? We’re again using pattern matching to expect the function get_grades to return a tuple consisting of the atom :ok and a map containing two lists of three elements each, where the name of the second student in the list must be “student2,” just to show that we can do a match to that level.
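To try this out, the following sketch supplies a hypothetical get_grades with made-up data and performs the same nested match inside a case, so that a non-matching result is reported rather than raising a MatchError. Demo.Grades, summary and the sample grades are all invented for the example.

```elixir
defmodule Demo.Grades do
  # Hypothetical stand-in for the get_grades function in the text.
  def get_grades do
    {:ok, %{grades: [90, 85, 77], names: ["student1", "student2", "student3"]}}
  end

  def summary do
    case get_grades() do
      # Matches only three grades, three names, and a literal
      # "student2" in the middle position.
      {:ok, %{grades: [g1, g2, g3], names: [n1, "student2", n3]}} ->
        {:matched, n1, n3, g1 + g2 + g3}

      other ->
        {:unexpected, other}
    end
  end
end
```

With this data, Demo.Grades.summary() returns {:matched, "student1", "student3", 252}; change the second name and it falls through to the {:unexpected, ...} branch.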
While more specific than is typical, the pattern of returning status and response is fairly common in Elixir. This supposedly obviates a lot of exception handling, but to my way of thinking, it really just replaces it with with statements, as described previously. I’m not convinced that it makes that much difference.
I think that the prevalent use of atoms, on the other hand, makes a large difference. Sadly, it’s a negative difference in that atoms turn into another form of magic numbers (in the code smell sense). Elixir has no way to define a truly global constant, so there’s actually no way around polluting one’s code with these magic numbers and there’s no compiler support for when one mistypes one of them. For example, consider some hypothetical code dealing with a TCP socket:
def send_stuff(socket, stuff) do
  with :ok <- :gen_tcp.send(socket, stuff) do
    :ok
  else
    {:error, :econnrefused} -> …
    {:error, :timeout} -> …
  end
end
def open_socket(address, port, options) do
  with {:ok, socket} <- :gen_tcp.connect(address, port, options) do
    {:ok, socket}
  else
    {:error, :econnnrefused} -> …
    {:error, :timeout} -> …
  end
end
There’s nothing to tell me that I’ve accidentally added an extra “n” to :econnrefused in the open_socket function until it turns into a runtime error. (This is also the kind of code that will often not get unit tested because it’s hard to deal with. I haven’t had to write any code at this level yet, so I’m unsure if I could mock gen_tcp or not, but that’s the only way I could think of to ensure that all the error handling worked correctly.)
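One partial mitigation, sketched here under the assumption that all the matching happens within a single module, is to name each atom once as a module attribute and match on the attribute. A mistyped attribute name should then draw a compiler warning about an undefined attribute, whereas a mistyped atom is silently accepted. It is not a global constant, since attributes are only visible inside the module that defines them. Demo.SocketErrors and describe are hypothetical names.

```elixir
defmodule Demo.SocketErrors do
  # Each atom is spelled out exactly once.
  @refused :econnrefused
  @timeout :timeout

  # Attributes expand to their values at compile time, so they can
  # appear directly in patterns.
  def describe(:ok), do: "sent"
  def describe({:error, @refused}), do: "connection refused"
  def describe({:error, @timeout}), do: "timed out"
end
```

Demo.SocketErrors.describe({:error, :econnrefused}) returns "connection refused". The single spelling of each atom can of course still be wrong, so this narrows the problem rather than removing it.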
Conclusion
I find Elixir’s pattern matching a very interesting feature of the language. Sometimes I think that it’s a nice step towards adding assertions as first-class elements of production code. At other times, though, I wonder if it really adds to the readability of the code. For example, is the send_stuff example really more readable than:
public send_stuff(socket, stuff) {
  try {
    socket.send(stuff);
  } catch (ConnectionRefusedException cre) {
    …
  } catch (TimeoutException te) {
    …
  }
}
That being said, I often vacillate between Java’s checked exceptions and .NET’s unchecked exceptions. Elixir has pushed me much more towards making exceptions checked by default since that makes sure that I think about them when .NET and Elixir let me forget about them, often to my detriment.
Unless and until Elixir offers truly global constants, however, I find myself distrusting the use of atoms in pattern matching to distinguish between functions. I worry that I have too much code that’s like:
def do_something(:type1 = _device_type, other_params) …
def do_something(:type2 = _device_type, other_params) …
def do_something(:type3 = _device_type, other_params) …
Granted that I could probably do more to avoid this, but I also feel like this would make the code less expressive.
Next time we’ll look at supervision trees, error handling and handling state in Elixir.