ROS 2 – a viable real-time solution?

Preface

It sometimes scares me how many commercially available products are prone to safety failures[1]. In the robotics domain (including autonomous driving, domestic care robots, medical robots, etc.), a low-hanging fruit among safety failures is failing the real-time criteria. For those not familiar with the term: very loosely speaking, real-time requirements imply that the system must guarantee a response within specified time constraints. For instance, if your autonomous car detects a pedestrian and commands evasive braking, the time between issuing the brake command and actually executing it must remain within the bounds of a deadline. This means real-time requirements are about determinism, not swiftness (you could have a strictly real-time system with a response time of 2 days, and a system with a response time of 2 nanoseconds that is not real-time). With that being said, how would you feel if you were interacting with a machine that does not adhere to its real-time requirements?

Many robotics applications and research projects rely on ROS, a set of software libraries and a middleware suite for robot control. And it is not real-time 😐 At least it wasn’t until its long-awaited successor, ROS 2, was introduced recently; it is slowly being adopted in both research and industry. In this post I look at ROS 2 and compare it to an actually real-time and overall great framework called OROCOS.

This post is a brief summary of a paper that I recently co-authored and that was accepted at the IEEE International Conference on Robotics and Automation (ICRA) 2021. ICRA is considered the #1 robotics conference, and we were very happy to have this paper accepted. It is quite difficult to get software/middleware/framework publications into ICRA.

My co-authors were Sinan Barut, a PhD candidate at my former institute; Marco Boneberger, my student, whose bartender-robot project I also supervised; and Prof. Dr. Jochen Steil, the chair of robotics at TU Braunschweig.

ABSTRACT: Numerous robotic and control applications have strict real-time requirements which, when violated, result in reduced quality of service or, in safety-critical applications, might have catastrophic consequences. To ensure that these real-time constraints are satisfied, roboticists have relied on real-time safe frameworks, environments and middleware. With the introduction of ROS 2, alongside kernel patches such as PREEMPT_RT, there is an abundance of solutions to pick from. This paper compares OROCOS and ROS 2 over PREEMPT_RT and vanilla Linux kernels in a variety of benchmarks and draws conclusions on their performance in real-time critical applications. The outcomes of the benchmarks show comparable performance under normal conditions; however, when the system is under stress, both frameworks suffer in different fashions. The results, furthermore, show an accumulating error which over time violates the real-time requirements in both frameworks. These findings are paramount in conducting real-world applications with real-time constraints.

The scope

As I mentioned, OROCOS was/is the granddaddy of real-time frameworks, with many smart people behind it. It offers two abstractions:

  1. RTT, the Real-Time Toolkit, which is for all intents and purposes an abstraction of the operating system’s scheduler
  2. A full-fledged component system, where components are basically an abstraction of processes.

Its main drawback is in distributed systems, where it relies on CORBA (I am yet to meet a professional who doesn’t make a face of disgust when I mention CORBA!). ROS, on the other hand, is great for distributed systems but not real-time. ROS 2 is supposed to tackle this challenge by relying on the Data Distribution Service (DDS) when it is deployed in a distributed fashion. When all components are in a single process, ROS 2 uses intra-process communication. This single-process case was the main point of comparison in our paper.

We also needed to consider the role of operating systems’ schedulers and their intrinsic non-real-time behavior. There are two main options within the Linux universe for achieving some level of real-timeness, namely XENOMAI and PREEMPT_RT. How these two work is well outside the scope of this post, but here is a nice read. Because ROS 2 has no support for XENOMAI, we had to limit the benchmark scope to PREEMPT_RT and the vanilla Linux kernel (shame…).

What we compared were message latency and jitter, both under stress and under normal conditions.
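As a rough illustration of these two quantities, the sketch below summarizes a list of latency samples. It assumes that jitter means the peak-to-peak spread (max minus min) of the latencies, which is only one common definition and not necessarily the one used in the paper. The function name, the sample values and the deadline are all made up for illustration.

```python
from statistics import mean


def summarize_latencies(latencies_us, deadline_us):
    """Summarize latency samples (microseconds) against a deadline.

    Assumption: jitter is the peak-to-peak spread (max - min).
    """
    worst = max(latencies_us)
    best = min(latencies_us)
    return {
        "mean_us": mean(latencies_us),
        "worst_us": worst,
        "jitter_us": worst - best,
        # Real-time is about the worst case: a single miss breaks the guarantee.
        "deadline_misses": sum(1 for x in latencies_us if x > deadline_us),
    }


# Hypothetical samples: fast on average, but with one large spike.
samples_us = [100, 102, 98, 101, 950, 99]
summary = summarize_latencies(samples_us, deadline_us=500)
print(summary)
```

The mean looks harmless here, which is exactly why the worst case and the number of deadline misses are the figures to watch.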

Comparison of the real-time frameworks

Feature               | OROCOS           | ROS 2
Communication Pattern | Ports (1–1, 1–n) | Publish/Subscribe
Parameter Settings    | Per Component    | Parameter-Server
Logging               | rt-safe log4cpp  | rosbag2 & ROS 2 logger
Distributed           | With CORBA       | native DDS
System Introspection  | Fully Supported  | Fully Supported
Xenomai Support       | Yes              | No
PREEMPT_RT Support    | Yes              | Yes

Timing

It is not trivial to factor in the unspecified monotonic time of the kernel. Our best effort to mitigate the issue can be summarized in these two pictures:

Different times contributing to the benchmark are depicted. The instant labeled t_ref is the unspecified monotonic time of the kernel. That this value is unspecified is a consequence of how the Linux kernel works, not a shortcoming of the experiment design. Once tau_min (green solid line) is determined, it is reflected on all cycles (green dotted lines). The dotted orange line is similar to the orange line in the image below.

The sequence of the experiment. The dashed horizontal lines are the time stamps logged as measurements. The time stamp in orange is taken when the cycle gets executed. The other time stamps are logged immediately before a message is sent or immediately after a message is received. Thus tau_res and tau_req can each be calculated as the difference between two time stamps.
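The two captions can be turned into a small post-processing sketch. This is my reading of the figures, not the code from the paper: tau_req and tau_res are plain differences between a send and a receive time stamp, and because t_ref is unknown, the cycle time stamps are compared against a periodic grid that is shifted by the smallest observed offset (tau_min). The function names, argument names and the 1 ms period are assumptions made for illustration.

```python
def message_delays(t_req_sent, t_req_received, t_res_sent, t_res_received):
    """Return (tau_req, tau_res) for one request/response exchange."""
    tau_req = t_req_received - t_req_sent
    tau_res = t_res_received - t_res_sent
    return tau_req, tau_res


def cycle_delays(cycle_stamps_ns, period_ns):
    """Delay of each cycle relative to the best-aligned periodic grid.

    Assumption: the unknown reference t_ref is replaced by the smallest
    observed offset (tau_min), which is then applied to every cycle.
    """
    first = cycle_stamps_ns[0]
    offsets = [t - (first + k * period_ns) for k, t in enumerate(cycle_stamps_ns)]
    tau_min = min(offsets)
    return [off - tau_min for off in offsets]


# Hypothetical time stamps in nanoseconds, with a nominal period of 1 ms.
tau_req, tau_res = message_delays(10, 150, 400, 520)
delays = cycle_delays([5_000, 1_005_200, 2_004_900, 3_006_000], period_ns=1_000_000)
print(tau_req, tau_res, delays)
```

With this convention the best-aligned cycle has a delay of zero and every other cycle is measured relative to it, so the absolute value of t_ref never has to be known.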

We conducted 4 benchmarks, each under both normal and stressed conditions. The stress was created with the standard Linux stress tool. In this case, loads on CPU, memory, disk and I/O were deployed with 10 workers each:

stress --cpu 10 --io 10 --vm 10 --hdd 10

Benchmark results

Feel free to jump down to the plots (open them in new tabs for better visibility). The results can be summarized as follows:

Vanilla kernel

  • Without stress, ROS 2 has bounded latencies in message delivery; however, there are large spikes as well. The cycle delays, furthermore, exhibit even stronger spikes. Under stress, ROS 2 goes completely off the chart: the cycle delays grow linearly and escape any bounds. The request delays share the same problem, while the response delays do not. Note that after 10 minutes, ROS 2 has an accumulated delay of nearly 3 seconds.
  • Without stress, OROCOS behaves similarly to ROS 2 to a certain degree; however, it has less jitter. The request and response delays show a cyclic behaviour[…] Under stress, OROCOS does not show a clear trend, but it shows a considerably higher spike count, which could result in poor performance.
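To put the "nearly 3 seconds after 10 minutes" into perspective: if the growth is linear throughout, as described above, it corresponds to roughly 3 s / 600 s = 5 ms of additional delay per second of runtime. The sketch below shows how such a trend could be estimated from logged delays with a least-squares fit. The data are synthetic and only mimic the reported end value; they are not measurements from the paper, and the function name is made up.

```python
def drift_rate(times_s, delays_s):
    """Least-squares slope of delay over time (seconds of delay per second)."""
    n = len(times_s)
    mean_t = sum(times_s) / n
    mean_d = sum(delays_s) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(times_s, delays_s))
    var = sum((t - mean_t) ** 2 for t in times_s)
    return cov / var


# Synthetic data: a delay that grows linearly to 3 s over 600 s.
times_s = [float(t) for t in range(0, 601, 60)]
delays_s = [3.0 * t / 600.0 for t in times_s]
rate = drift_rate(times_s, delays_s)
print(f"{rate * 1000:.1f} ms of extra delay per second")
```

A slope that stays clearly above zero over the whole run is the signature of an accumulating error, as opposed to isolated spikes, which leave the slope close to zero.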

PREEMPT_RT

  • Without stress, the performance of ROS 2 looks quite similar to vanilla, although it shows more spikes in the cycle delays. Additionally, the peak delay in message transmission is lower. Under stress, the performance is better than on the vanilla kernel: the problematic trend of linear growth is no longer present, and the delays are bounded even though they are quite noisy.
  • Surprisingly, OROCOS performs poorly both with and without stress (we concluded this is probably due to a bug, and we submitted the issue). The performance without stress is as bad as under stress. Its cycle delays follow a trend similar to stressed ROS 2 on the vanilla kernel, which results in growing delays. On the other hand, the message delays seem to be well bounded.

Vanilla kernel under stress and under normal conditions. The picture-in-picture (PiP) plots show the overall trend throughout the 10-minute benchmark. The larger plots are a scaled-up region of interest (yellow rectangle in PiP). Note that apart from ROS 2 under stress, all other plots share axes horizontally and vertically. ROS 2 has a strange trend in its cycle delays. Even without stress, both frameworks show occasional spikes, as seen in the PiPs.

PREEMPT_RT under stress and under normal conditions. The picture-in-picture plots show the overall trend throughout the 10-minute benchmark. The larger plots are a scaled-up region of interest (yellow rectangle in PiP). All plots share the same vertical axis except those marked with red labels. For the request and response, the delays are quite constant without stress. There are occasional spikes, but these are also bounded. Under stress, the delays have more jitter but are still bounded. For ROS 2 the cycle delays seem promising but are still high, whereas OROCOS shows linear growth.

Conclusion

ROS 2 is promising, and it would be great to see whether it also delivers with DDS in distributed situations. On a more subjective note, I personally still prefer the OROCOS ecosystem and its tools and assets. It is truly model-based, with a great composition model and a very nice introspection mechanism (you can cd into your components, ls them and so on). But it just doesn’t scale for distributed applications, and it doesn’t have the wide user base of ROS.

A study with CORBA (sigh) and DDS for distributed scenarios, alongside a XENOMAI single-process benchmark, would make a great TODO list for future work.

To summarize the summary, ROS 2 probably delivers but it is still clearly a work in progress.

To add a surprising conclusion, I would like to add the following controversial opinion, which at least one of my co-authors shared:

Frameworks like OROCOS and ROS 2 provide a toolset to simplify creation and deployment of real-time components and applications. But the simplification comes at the cost of hiding important aspects. Hence, for highly critical cases (e.g., safety critical applications) we suggest to directly use system calls to create real-time applications.

It is not really that surprising when you think about it. ROS/ROS 2/OROCOS and all are great tools for creating and prototyping your controllers and algorithms. When you go to production, particularly in safety-critical applications, you cannot rely on a generic operating system with many pieces, including a scheduler that is designed for your office work, not for preemption. After all, this is what Linus Torvalds had to say about PREEMPT_RT:

Controlling a laser with Linux is crazy, but everyone in this room is crazy in his own way. So if you want to use Linux to control an industrial welding laser, I have no problem with your using PREEMPT_RT.

Linus Torvalds
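Coming back to the suggestion of using system calls directly, here is a minimal sketch of what a bare cyclic task might look like without any framework. It is written in Python only to keep it short; a production implementation would make the equivalent calls from C or C++, typically sched_setscheduler with SCHED_FIFO, clock_nanosleep with TIMER_ABSTIME and mlockall (memory locking is not shown here). The function names, the priority of 80 and the 1 ms period are assumptions for illustration, and none of this is taken from the paper's code.

```python
import os
import time


def try_enable_fifo(priority=80):
    """Ask the kernel for SCHED_FIFO; return True on success.

    Needs Linux plus root or CAP_SYS_NICE, so failure is reported, not fatal.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
        return True
    except OSError:
        return False


def run_cyclic(period_ns, cycles, work=lambda: None):
    """Run `work` on absolute deadlines; return wake-up latencies in ns."""
    latencies = []
    next_wake = time.monotonic_ns() + period_ns
    for _ in range(cycles):
        remaining = next_wake - time.monotonic_ns()
        if remaining > 0:
            time.sleep(remaining / 1e9)
        latencies.append(time.monotonic_ns() - next_wake)
        work()
        # Advance from the planned wake-up, not the actual one, so that
        # lateness in one cycle does not accumulate into the next.
        next_wake += period_ns
    return latencies


realtime = try_enable_fifo()
latencies_ns = run_cyclic(period_ns=1_000_000, cycles=50)
print("SCHED_FIFO:", realtime, "worst wake-up latency [us]:", max(latencies_ns) / 1000)
```

The design choice worth noting is the absolute deadline: each wake-up time is derived from the previous planned one, which is one way to avoid the kind of accumulating cycle delay discussed in the results.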

Thank you very much for reading. If you are interested in seeing the full paper, feel free to drop me an email. You can also get the entire code and data from our git server at https://git.rob.cs.tu-bs.de/public_repos/irp_papers/rt-benchmarks


1. It is also shocking how autonomous cars are permitted in the US and other countries. They clearly do not match the governments’ safety standards. In the same vein, I was quite surprised when I saw job adverts for ROS experts for self-driving cars. ROS doesn’t function properly even in the lab…
