Optimising for Concurrency: Comparing and contrasting the BEAM and JVM virtual machines.
by Francesco Cesarini & Gabor Olah
The success of any programming language in the Erlang ecosystem can be apportioned into three tightly coupled components. They are:
1) the semantics of the Erlang programming language, on top of which other languages are implemented
2) the OTP libraries and middleware used to architect scalable and resilient concurrent systems and
3) the BEAM Virtual Machine tightly coupled to the language semantics and OTP.
Take any one of these components on their own, and you have a runner up. But, put the three together, and you have the uncontested winner for scalable, resilient soft-real time systems. To quote Joe Armstrong, “You can copy the Erlang libraries, but if it does not run on BEAM, you can’t emulate the semantics”. This gets enforced by Robert Virding’s First Rule of Programming, which states that “Any sufficiently complicated concurrent program in another language contains an ad hoc informally-specified bug-ridden slow implementation of half of Erlang.”
In this post, we want to explore the BEAM VM internals. We will compare and contrast them with the JVM where applicable, highlighting why you should pay attention to them and care. For too long, this component has been treated as a black box and taken for granted, without understanding the reasons or implications. It is time to change that!
Highlights of the BEAM
Erlang and the BEAM VM were invented to be the right tool to solve a specific problem. They were developed by Ericsson to help implement telecom infrastructure handling both mobile and fixed networks. This infrastructure is highly concurrent and scalable in nature. It has to display soft real-time properties and may never fail. We don’t want our Hangouts calls on our mobile with our grandmothers dropped or our online gaming experience of Fortnite to be affected by system upgrades, high user load or software, hardware and network outages. The BEAM VM is optimised to solve many of these challenges by providing fine-tuned features which work on top of a predictable concurrent programming model.
Its secret sauce are light-weight processes which don’t share memory, managed by schedulers which can manage millions of them across multiple cores. It uses a garbage collector which runs on a per-process basis, highly optimized to reduce any impact on other processes. As a result, the garbage collectors do not impact the overall soft real time properties of the system. The BEAM is also the only widely used VM used at scale with a built-in distribution model which allows a program to run on multiple machines transparently.
Highlights of the JVM
The Java Virtual Machine (JVM) was developed by Sun Microsystem with the intent to provide a platform for ‘write once’ code that runs everywhere. They created an object oriented language similar to C++, but memory-safe because its runtime error detection checks array bounds and pointer dereferences. The JVM ecosystem became extremely popular with the Internet-era, making it the de-facto standard for enterprise server applications. The wide range of applicability was enabled by a virtual machine that caters for a wide range use cases, and an impressive set of libraries catering for enterprise development.
The JVM was designed with efficiency in mind. Most of its concepts are an abstraction of features found in popular operating systems such as the threading model which maps to operating system threads. The JVM is highly customisable, including the garbage collector (GC) and class loaders. Some state-of-the-art GC implementations provide highly tunable features catering for a programming model based on shared memory. The JVM allows you to change the code while the program is running. And, a JIT compiler allows byte code to be compiled to the native machine code with an intent to speed up parts of the application.
Concurrency in the Java world is mostly concerned with running applications in parallel threads, ensuring they are fast. Programming with concurrency primitives is a difficult task because of the challenges created by its shared memory model. To overcome these difficulties, there are attempts to simplify and unify the concurrent programming models, with the most successful attempt being the Akka framework.
Concurrency and Parallelism
We talk about parallel code execution if parts of the code are run at the same time on multiple cores, processors or computers, while concurrent programming refers to handling events arriving to the system independently. Concurrent execution can be simulated on single threaded hardware, while parallel execution cannot. Although this distinction may seem pedantic, the difference results in some very different problems to solve. Think of many cooks making a plate of Carbonara pasta. In the parallel approach, the tasks are split across the number of cooks available, and a single portion would be completed as quickly as it took these cooks to complete their specific tasks. In a concurrent world, you would get a portion for every cook, where each cook does all of the tasks. You use parallelism for speed and concurrency for scale.
Parallel execution tries to solve an optimal decomposition of the problem to parts that are independent of each other. Boil the water, get the pasta, mix the egg, fry the guanciale ham, grate the pecorino cheese. The shared data (or in our example, the serving dish) is handled by locks, mutexes and various other techniques to guarantee correctness. Another way to look at this is that the data (or ingredients) are present, and we want to utilise as many parallel CPU resources as possible to finish the job as quickly as possible.
Concurrent programming, on the other hand, deals with many events that arrive at the system at different times and tries to process all of them within a reasonable time. On multi-core or distributed architectures, some of the execution is run parallel, but this is not a requirement. Another way to look at it is that the same cook boils the water, gets the pasta, mixes the eggs and so on, following a sequential algorithm which is always the same. What changes across processes (or cooks) is the data (or ingredients) to work on, which exist in multiple instances.
The JVM is built for parallelism, the BEAM for concurrency. They are two intrinsically different problems, requiring different solutions.
The BEAM and Concurrency
The BEAM provides light-weight processes to give context to the running code. These processes, also called actors, don’t share memory, but communicate through message passing, copying data from one process to another. Message passing is a feature that the virtual machine implements through mailboxes owned by individual processes. The message passing is a non-blocking operation, which means that sending a message to another process is almost instantaneous and the execution of the sender is not blocked. The messages sent are in the form of immutable data, copied from the stack of the sending process to the mailbox of the receiving one. This is achieved without the need for locks and mutexes among the processes, only a lock on the mailbox in case multiple processes send a message to the same recipient in parallel.
The immutable data and the message passing enable the programmer to write processes which work independently of each other and focus on the functionality instead of the low-level management of the memory and scheduling of tasks. It turns out that this simple design not only works on a single thread, but also on multiple threads on a local machine running in the same VM and, using the built in distribution, across the network with a cluster of VMs and machines. If the messages are immutable between processes, they can be sent to another thread (or machine) without locks, scaling almost linearly on distributed, multi-core architectures. The processes are addressed in the same way on a local VM as in a cluster of VMs, message sending works transparently regardless of the location of the receiving process.
Processes do not share memory, allowing you to replicate your data for resilience and distribute it for scale. This means having two instances of the same process on two separate machines, sharing state updates among each other. If a machine fails, the other has a copy of the data and can continue handling the request, making the system fault tolerant. If both machines are operational, both processes can handle requests, giving you scalability. The BEAM provides the highly optimised primitives for all of this to work seamlessly, while OTP (the “standard library”) provides the higher level constructs to make the life of the programmers easy.
Akka does a great job at replicating the higher level constructs, but is somewhat limited by the lack of primitives provided by the JVM, allowing it to be highly optimised for concurrency. While the primitives of the JVM enable a wider range of use cases, they make programming distributed systems harder as they have no built in primitives for communication and are often based on a shared memory model. For example, where in a distributed system do you place your shared memory? And what is the cost of accessing it?
We mentioned that one of the strongest features of the BEAM is the ability to break a program into small, light-weight processes. Managing these processes is the task of the scheduler. Unlike the JVM, which maps its threads to OS threads and lets the operating system schedule them, the BEAM comes with its own scheduler.
The scheduler starts, by default, an OS thread for every core and optimises the workload between them. Each process consists of code to be executed and a state which changes over time. The scheduler picks the first process in the run queue that is ready to run, gives it a certain amount of reductions to execute, where each reduction is the rough equivalent of a command. Once the process has either run out of reductions, is blocked by I/O, is waiting for a message or completes executing its code, the scheduler picks the next process in the run queue and dispatches it. This scheduling technique is called pre-emptive.
We mentioned the Akka framework many times, as its biggest drawback is the need to annotate the code with scheduling points, as the scheduling is not done at the JVM level. By removing the control from the hands of the programmer, soft real time properties are preserved and guaranteed, as there is no risk of them accidentally causing process starvation.
The processes can be spread around the available scheduler threads and maximise CPU utilisation. There are many ways to tweak the scheduler but it is rare and needed only for edge and borderline cases, as the default options cover most usage patterns.
There is a sensitive topic that frequently pops up regarding schedulers: how to handle Natively Implemented Functions (NIFs). A NIF is a code snippet written in C, compiled as a library and run in the same memory space as the BEAM for speed. The problem with NIFs is that they are not pre-emptive, and can affect the schedulers. In recent BEAM versions, a new feature, dirty schedulers, was added to give better control for NIFs. Dirty schedulers are separate schedulers that run in different threads to minimise the interruption a NIF can cause in a system. The word dirty refers to the nature of the code that is run by these schedulers.
Modern, high level programming languages today mostly use a garbage collector for memory management. The BEAM languages are no exception. Trusting the virtual machine to handle the resources and manage the memory is very handy when you want to write high level concurrent code, as it simplifies the task. The underlying implementation of the garbage collector is fairly straightforward and efficient, thanks to the memory model based on immutable state. Data is copied, not mutated and the fact that processes do not share memory removes any process interdependencies, which, as a result, do not need to be managed.
Another feature of the BEAM is that garbage collection is run only when needed, on a per process basis, without affecting other processes waiting in the run queue. As a result, the garbage collection in Erlang does not ‘stop-the-world’. It prevents processing latency spikes, because the VM is never stopped as a whole — only specific processes are, and never all of them at the same time. In practice, it is just part of what a process does and treated as another reduction. The garbage collector collecting process suspends the process for a very short interval, often microseconds. As a result, there will be many small bursts, triggered only when the process needs more memory. A single process usually doesn’t allocate large amounts of memory, and is often short lived, further reducing the impact by immediately freeing up all its allocated memory on termination. A feature of the JVM is the ability to swap garbage collectors, so by using a commercial GC, it is also possible to achieve non-stopping GC in the JVM.
The features of the garbage collector are discussed in an excellent blog post by Lukas Larsson. There are many intricate details, but it is optimised to handle immutable data in an efficient way, dividing the data between the stack and the heap for each process. The best approach is to do the majority of the work in short lived processes.
A question that often comes up on this topic is how much memory the BEAM uses. Under the hood the VM allocates big chunks of memory and uses custom allocators to store the data efficiently and minimise the overhead of system calls. This has two visible effects:
1) The used memory decreases gradually after the space is not needed
2) Reallocating huge amounts of data might mean doubling the current working memory.
The first effect can, if really necessary, be mitigated by tweaking the allocator strategies. The second one is easy to monitor and plan for if you have visibility of the different types of memory usage. (One such monitoring tool that provides system metrics out of the box is WombatOAM).
Hot code loading is probably the most cited unique feature of the BEAM. Hot code loading means that the application logic can be updated by changing the runnable code in the system whilst retaining the internal process state. This is achieved by replacing the loaded BEAM files and instructing the VM to replace the references of the code in the running processes.
It is a crucial feature for no downtime code upgrades for telecom infrastructure, where redundant hardware was put to use to handle spikes. Nowadays, in the era of containerization, other techniques are also used for production updates. Those who have never used it dismiss it as a less important feature, but it is none-the-less useful in the development workflow. Developers can iterate faster by replacing part of their code without having to restart the system to test it. Even if the application is not designed to be upgradable in production, this can reduce the time needed for recompilation and redeployments.
When not to use the BEAM
It is very much about the right tool for the job. You need a system to be extremely fast, but are not concerned about concurrency? Handling a few events in parallel, and having to handle them fast? Need to crunch numbers for Graphics, AI or analytics? Go down the C++, Python or Java route. Telecom infrastructure does not need fast operations on floats, so speed was never a priority. Aided with dynamic typing, which has to do all type checks at runtime means compiler time optimizations are not as trivial. So number crunching is best left to the JVM, Go or other languages which compile to native. It is no surprise that floating point operations on Erjang, the version of Erlang running on the JVM, was 5000% faster than the BEAM. But where we’ve seen the BEAM shine is using its concurrency to orchestrate number crunching, outsourcing the analytics to C, Julia, Python or Rust. You do the map outside the BEAM and the reduce within the BEAM.
The mantra has always been fast enough. It takes a few hundred milliseconds for humans to perceive a stimulus (an event) and process it in their brain, meaning that micro or nano second response time is not necessary for many applications. Nor would you use the BEAM for microcontrollers, it is too resource hungry. But for embedded systems with a bit more processing power, where multi-core is becoming the norm, you need concurrency, and the BEAM shines. Back in the 90s, we were implementing telephony switches handling tens of thousands of subscribers running in embedded boards with 16mb of memory. How much memory does a RaspberryPi have these days? And finally, hard real time systems. You would probably not want the BEAM to manage your airbag control system. You need hard guarantees, something only a hard real time OS and a language with no garbage collection or exceptions. An implementation of an Erlang VM running on the bare metal such as GRiSP will give you similar guarantees.
Use the right tool for the job. If you are writing a soft real time system which has to scale out of the box and never fail, and do so without the hassle of having to reinvent the wheel, the BEAM is the battle proven technology you are looking for. For many, it works as a black box. Not knowing how it works would be analogous to driving a Ferrari and not being capable of achieving optimal performance or not understanding what part of the motor that strange sound is coming from. This is why you should learn more about the BEAM, understand its internals and be ready to finetune and fix it. For those who have used Erlang and Elixir in anger, we have launched a one day instructor-led course which will demystify and explain a lot of what you saw whilst preparing you to handle massive concurrency at scale. The course is available through our new instructor lead remote training, learn more here. We also recommend The BEAM book by Erik Stenman and the BEAM Wisdoms, a collection of articles by Dmytro Lytovchenko.
You may also like:
1.One of the authors also likes a similar looking dish not called carbonara but made with cream..↩
Originally published at https://www.erlang-solutions.com on May 11, 2020.