Denys Rtveliashvili

GC Pauses in High-Frequency Trading

Introduction

GC pauses are a phenomenon as old as GC itself, and garbage collection was invented around 1959. These pauses affect software written in Java, C#, Python, and JavaScript, just to name a few. The timing of GC pauses is largely unpredictable. Their length depends on the specific GC algorithm, the heap size, the specific kind of GC operation, and other factors, but there tends to be a natural limit to how long the pauses can be.

For the vast majority of software, GC pauses are small. This is especially true for modern, generational implementations of garbage collectors. For user-oriented software, GC pauses are not an issue because an unexpected delay of ten milliseconds will go unnoticed. For batch processing, they are even less of a concern.

However, not every application can survive even the smallest unpredictable delays. For example, the software responsible for computer-aided manufacturing has to control the process of machining in real-time and even the smallest unpredictable delays can result in defective products and possibly even damage to the plant.

In the financial industry, there is a deeply rooted belief that GC pauses are extremely harmful and must be avoided at all costs. The result is significant effort spent on building “zero-GC software”, low code readability, many more bugs, and a drop in productivity. These negative aspects are swept under the rug and the perceived benefit of “low latency” is proudly touted. This can be a particularly big deal in HFT (High-Frequency Trading) firms.

But is there true merit in all these efforts at building “zero-GC software”? Why? And importantly, in which situations does it matter? Let’s try to find out.

Zero-GC software

So what is zero-GC software?

The answer is not what one might expect as opinions differ. Let’s consider three types of zero-GC (the terminology is mine).

True zero-GC

True zero-GC software is software that has no GC whatsoever. This would be the case where the runtime does not have any implementation of GC, which filters out everything written in Python, Java, C#, various LISPs, JavaScript, Go, Ruby, and so on. It might also filter out software in languages like D, where the garbage collector is optional (but happens to be required for most things you might need).

This leaves you with languages like C, C++, and Assembly. However, nothing is stopping you from adding a garbage collector to a program in, say, C++. Once you’ve done it, your software is no longer of the true zero-GC kind.

The nice feature of true zero-GC software is that you know for a fact that there will be no pauses caused by a garbage collector. Their absence is guaranteed, which helps to reason about the software’s behaviour. This may be valuable in some contexts. However, the absence of pauses caused by a GC does not mean the absence of other pauses. There are many other sources of latency, some of which can be very large.

Partial zero-GC

This kind of software is almost non-existent, but let’s mention it regardless. There are, or were, runtimes like Sun Java Real-Time System and others where significant effort was put into improving the real-time behaviour of applications without dropping the garbage collector entirely.

Examples of implementing a partial zero-GC approach would be:

  1. Segregating threads into real-time and background ones and ensuring that the real-time threads have some amount of memory reserved for them.
  2. Dedicated memory areas for different threads are used in a way where garbage can be collected for a specific thread only without pausing the rest.
  3. Ability to mark certain parts of code as zero-GC so that the code would be verified, compiled, and executed in a way which guarantees that its execution will not trigger a GC.
  4. Running software in a way where full-scale garbage collection is forced to happen at carefully selected times when it is known that at those moments the effects of the pauses are not going to be harmful (see the sketch after this list).
  5. Using a no-GC implementation of a garbage collector and hoping that GC will not be necessary.

…and so on.
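As a rough illustration of item 4 above, here is a minimal Java sketch of forcing a full collection at a moment when a pause is known to be harmless; the timing is hypothetical, and a comment notes the kind of collector item 5 refers to:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class QuietTimeGc {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // Hypothetical: the venue is known to be quiet four hours from now,
        // so a full collection is requested then rather than mid-session.
        scheduler.schedule(() -> {
            // System.gc() is only a request, but with most collectors it triggers
            // a full GC, which is exactly why it should be issued only when pauses
            // are known to be harmless.
            System.gc();
        }, 4, TimeUnit.HOURS);

        // Item 5 corresponds to something like OpenJDK's Epsilon collector
        // (-XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC), which never
        // collects at all and simply fails once the heap is exhausted.
    }
}
```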

However, all of these are half-measures. They do not provide solid proof that execution will not be suddenly paused by a garbage collection, and they come at a high engineering and maintenance cost.

“Pretend” zero-GC

This is what people in finance call “zero-GC” and it is very popular.

Here, “zero-GC” means that the code is run on a generic runtime, but the code is designed and written in a manner which reduces the number of object allocations as much as possible.

The approach relies on using object pools instead of normal on-demand allocation, using setters rather than constructors, returning values from methods by modifying one of the arguments, and so on.
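A minimal sketch of this style in Java might look like the following; the class and field names are hypothetical, chosen only to illustrate the pattern:

```java
import java.util.ArrayDeque;

// A message type written in the "pretend zero-GC" style: fully mutable,
// reset and reused instead of being allocated once per event.
final class OrderEvent {
    long instrumentId;
    long priceTicks;
    long quantity;

    void clear() {
        instrumentId = 0;
        priceTicks = 0;
        quantity = 0;
    }
}

final class OrderEventPool {
    private final ArrayDeque<OrderEvent> free = new ArrayDeque<>();

    OrderEvent acquire() {
        OrderEvent e = free.poll();
        return (e != null) ? e : new OrderEvent(); // allocate only when the pool is empty
    }

    void release(OrderEvent e) {
        e.clear();     // the object now sits in the pool in a deliberately "empty" state
        free.push(e);
    }
}

final class Decoder {
    // "Returning" a value by mutating an argument instead of constructing a result.
    void decode(long rawInstrument, long rawPrice, long rawQuantity, OrderEvent out) {
        out.instrumentId = rawInstrument;
        out.priceTicks   = rawPrice;
        out.quantity     = rawQuantity;
    }
}
```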

Here, the hope is that all these efforts will result in the program creating no new objects during its normal operation and so never triggering a garbage collection.

With this approach, there is no guarantee that there would be no GC pauses. However, the frequency of GC pauses is reduced, sometimes significantly. There is also a cost, mainly related to the quality of the code and the cost of writing it (more on that later).

Before we continue with the matter of zero-GC as it is understood in finance, let’s consider a few other things first. And the first of them would be the importance of latencies and pauses in HFT business.

Latencies in High-Frequency Trading

There is no strict definition of what HFT is, but it is generally understood to be a kind of financial trading where turnover is high, frequency of trading is also high (many trades per second), and importantly the position in a single financial instrument is opened and closed frequently (a position is open for seconds to minutes rather than days to months). Your mileage may vary, of course.

In the HFT business, profits are extracted from short-term market “inefficiencies” rather than long-term speculations or investments. Unlike trading at lower frequencies, HFT is particularly sensitive to latencies: there is only so much inefficiency in the markets and whoever notices and reacts to each case first ends up being the winner.

Non-HFT finance may also be somewhat sensitive to latencies. For example, a business may have a low-frequency strategy which makes decisions to buy and sell certain volumes throughout a day, but the strategy of executing these high-level decisions may attempt to do so in a clever way, where trading individual small units is done at the times when it appears that execution would be particularly efficient.

The latency game

The importance of latencies is well known and much is published on the latencies and how people are trying to win in this competition. The game is also very old. For an example of a latency-sensitive trade which was done two centuries ago, one may consider a story of how Nathan Mayer Rothschild traded on the news of the outcome of the Battle of Waterloo.

The modern latency game is tough. To be faster, companies have to invest in low latency network communications, specialised software, custom hardware, and a lot of research. These activities are expensive and require access to competent people and unusual resources.

Latencies in HFT

Which latency is good in the HFT business?

There is no specific number, and if someone tells you one, you can be sure that person has no idea what he/she is talking about. Let’s consider this matter in more detail.

Firstly, a number on its own does not say much. Let’s say someone tells you their latency is 45 microseconds. What does it mean? It could mean lots of things. Here are a few examples:

  1. Their algorithm makes a decision in 45 μsec (from the point it gets a signal and to the point it makes a decision);
  2. Their algorithm makes a decision in 45 μsec (from the point a network packet with a signal arrives at the network card and to the point when an order leaves a network card);
  3. It takes 45 μsec for their order management system to receive the order and route it to a system which would be placing the order on the market;
  4. It takes 45 μsec for their market adapter to receive market data from the market, read it, and publish it to algorithms;
  5. It takes 45 μsec to react to market data (from the moment the market data packet enters the firm’s network and to the moment when the resulting order leaves the firm’s network);
  6. It takes 45 μsec to react to market data (from the moment the market data packet is formed on the exchange’s side and to the moment when the resulting order enters the exchange’s matching engine).

We can continue like this for days, but I suppose it is clear that all of these cases are different and hence the meaning of the statement “our latency is 45 μsec” can be different too. Context is important.

Secondly, no real system has a latency of “x μsec” or “y msec”. Latencies have a distribution, and that distribution has a significant meaning.

You could talk about the minimum, average, or median latency, but the value of those metrics is limited. Can you talk about the maximum latency? No, not really. You cannot measure it, because large latencies happen as a result of infrequent events, and you cannot measure them with precision.
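As a small illustration of why a single figure hides the interesting part, here is a hypothetical Java sketch that reports a few percentiles of recorded latencies instead of one number (the sample values are made up):

```java
import java.util.Arrays;

final class LatencySummary {
    // Value at the given percentile (0-100] of the recorded latencies, nearest-rank method.
    static long percentile(long[] latenciesNanos, double pct) {
        long[] sorted = latenciesNanos.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] samples = {3_100, 2_900, 3_050, 45_000, 3_000, 2_950, 3_010, 900_000};
        System.out.printf("p50 = %d ns, p99 = %d ns, largest observed = %d ns%n",
                percentile(samples, 50), percentile(samples, 99),
                percentile(samples, 100));
    }
}
```

Even the “largest observed” value is only the largest one that happened to be sampled; it says nothing about the true maximum.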

Does the absolute value of latency matter?

No, it does not. What matters is whether your systems are the first to react when it matters. It may be fun to set a record in a 100-metre dash, but in a competition one only needs to be faster than the others, even if by just a little bit. The exact time does not matter much.

The mathematics of the latency game

A trivial example and terminology

Let’s forget about software and hardware for a moment, and consider latencies and their role in the profitability of HFT trading in a toy example.

It is assumed here that there is an existing HFT trading strategy which can make money if it always manages to trade at the times when it needs to. For simplicity, it is assumed that the strategy is purely reactive: an event comes in, and the strategy opens/closes the position. Let’s also say there is only one asset being traded and the algorithm aims to open the position at an opportune moment and then close it when the conditions are optimal.

When running concurrently with other similar algorithms, the competition results in some of the attempts at opening the position failing. That alone means that the strategy cannot exploit the discovered opportunities as fully as desired. The measure of the ability to trade a fill-or-kill order at the desired moment is called “fill rate”. If you cannot open the position at that specific moment when the valuable signal has just arrived, you have already lost the race. Unfortunately, the fill rate is particularly bad exactly at the moment when HFT firms have spotted an inefficiency and are competing for it.

Now let’s say the algorithm has won the first battle and has opened the position. Next, it needs to close the position at the correct time. The easiest thing would be to close it at a time when there is no competition and there is a good chance of doing so profitably. This may be done on a time-out, for example. However, the algorithm may notice a signal that the market is going against it. In that case, it must close the position as soon as possible. If it fails, it will end up with a loss, and the loss can be dramatic. The difficulty is that the same signal will likely be interpreted by others in the same way, and they will aim to make the same trade, competing for the same liquidity on the market. Sometimes an algorithm would fail to win. Failing to trade at the desired price would likely result in having to trade at a worse one. This difference between the expected/desired price which was missed and the actual, realised price is called “slippage”.

Latency and P&L

The P&L of a single trade is — simplifying a bit — the volume traded times the change in price minus fees.

\[
\mathrm{PnL} = v\,\Delta p - f
\]

For simplicity, let’s say v>0 and the algorithm was expecting the price to rise.

The change of the price can be seen as a combination of the intentional, expected change of price Δp̃ and undesirable slippage s. However, the slippage occurs only if the algorithm was not the fastest one in a specific case. Writing lᵢ for the latency of trade i relative to the fastest competitor, with the indicator δ equal to 1 when lᵢ > 0 (the race was lost) and 0 otherwise, the P&L of a specific trade i is:

\[
\mathrm{PnL}_i = \Delta\tilde{p}_i - s_i\,\delta_{l_i > 0} - f_i
\]

and the overall P&L of multiple trades is

\[
\mathrm{PnL} = \sum_i \mathrm{PnL}_i
            = \sum_i \left(\Delta\tilde{p}_i - s_i\,\delta_{l_i > 0} - f_i\right)
            = \sum_i \left(\Delta\tilde{p}_i - f_i\right) - \sum_i s_i\,\delta_{l_i > 0}
\]

The first part of the P&L formula represents the expected profit of trades less the costs of trading. That term is hopefully positive, as the model was backtested before going live and the expected P&L of trading should at least cover the trading fees.

The second part, however, represents the financial loss caused by losing in the latency game. That loss is the sum of all slippages which have occurred due to the algorithm’s latency being higher than that of the fastest competitor(s).
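To make the decomposition concrete, here is a small Java sketch with made-up numbers (every value is hypothetical) that evaluates the two sums separately:

```java
final class PnlDecomposition {
    public static void main(String[] args) {
        // Hypothetical per-trade data: expected price move, fee, realised slippage,
        // and whether the algorithm lost the latency race on that trade (l_i > 0).
        double[] expectedMove = {0.50, 0.40, 0.55, 0.45, 0.60};
        double[] fee          = {0.10, 0.10, 0.10, 0.10, 0.10};
        double[] slippage     = {0.00, 0.30, 0.00, 0.00, 0.80};
        boolean[] lostRace    = {false, true, false, false, true};

        double expectedPart = 0.0; // sum over i of (expected move - fee)
        double slippagePart = 0.0; // sum over i of slippage on lost races only
        for (int i = 0; i < expectedMove.length; i++) {
            expectedPart += expectedMove[i] - fee[i];
            if (lostRace[i]) {
                slippagePart += slippage[i];
            }
        }
        // Prints: PnL = 2.00 - 1.10 = 0.90
        System.out.printf("PnL = %.2f - %.2f = %.2f%n",
                expectedPart, slippagePart, expectedPart - slippagePart);
    }
}
```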

Winning or losing is a binary outcome. Not only do absolute latencies not matter, but even relative ones matter only so much.

Not all latencies matter

To maximise the P&L of trading one needs to reduce the realised slippage. But how?

The first thing to note is the distribution of slippages. Almost all of them are going to be modest, although there will be a few large outliers. Instinctively, one may wish to go after those outliers.

However, the second thing to note is that the magnitude of each slippage is almost independent of the nature of the corresponding relative latency and its magnitude. So trying to attack the problem of large slippages by identifying and eliminating the corresponding latencies is a fruitless endeavour.

The fruitful one is to eliminate sources of latencies which — even though their latency contribution may be small — tip the scales and frequently make the algorithm lose the latency game.

In other words, it simply does not matter that occasionally your algorithm gets stuck for, say, 100 milliseconds if 99.9% of the time it is faster than its competition, even by a mere nanosecond.

Latencies in software

There is a tremendous number of things which cause latencies in software: cache misses, PCI latencies, contention on various buses, the cost of coherence protocols, the behaviour of DRAM, various interrupts, context switches, updates to virtual-physical mappings, priority inversion, etc. I am not going to cover them all, but suffice it to say that their contribution is on a scale from nanoseconds to seconds, and while an algorithm is running, it is accumulating them all the time. The final latency experienced by an algorithm in each case is a sum of thousands of latencies: great and small.

The latencies caused by a GC are nothing special.

A “full GC” is expensive as it needs to stop all threads and then go through the whole memory of the process to sweep away the unused objects. This can take seconds if you are unlucky, especially on massive heaps. However, full GC happens very infrequently.

On the opposite side of the spectrum you have ultra-fast GC which cleans up the so-called “Eden”. It costs almost nothing as it has to scan through a small volume of memory which is most likely in L1 cache anyway. The cost of that is sub-microsecond.

There are also situations in between, where a GC pause takes anywhere from a few microseconds to a few milliseconds.

If your software is well-written, it won’t cause much strain on GC. Most of your objects will be short-lived and will not be promoted from Eden, so heavier GC cycles would be infrequent and mostly irrelevant. However, even when they do happen, it is likely the pauses would happen at some random time rather than when your algorithm is trying to close the position and the slippage is unusually high.

In other words, GC latencies mostly do not matter as far as P&L is concerned and there is nothing that makes them special compared to a host of others. It is irrational to prioritise the latencies of GC over everything else.

Why do people care?

Given all the above, the natural question is why people care about GC latencies that much.

It is clear that they do: when hiring, various financial institutions deliberately look for people who write “zero-GC” code and describe their systems as “low latency” (often without real justification).

I believe there are several reasons.

For quants and traders it is clear that latencies must be reduced in order to minimise slippage. That is fair and true. They understand the mathematics of trading in detail, but the signal they give to engineers is somewhat simplified: “Latencies are bad, get rid of them”.

Engineers do not understand — or do not want to understand — the nuances. So they look at latencies and notice the “horrible” thing: an algorithm is usually “fast” but occasionally it is very slow due to a GC pause. The conclusion is that it must be a very bad thing, and an insane effort is spent on re-writing the software in a pretend zero-GC way. The GC pauses are almost gone, it’s time to drink champagne. Or is it? Has anyone cared to check the distribution of latencies? Could it be that all this effort has resulted in a situation where the infrequent and high latencies are gone but the vast majority of latencies are now much higher, resulting in lower fill rates overall?

GC latencies are also easily visible. Runtime would normally log them if required, and it is so much easier to work on removing a problem which is readily observable.
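For example, on the JVM the cumulative effect of collections can be read back from the standard management beans, and individual pauses can be written to a log by the runtime itself; a minimal sketch:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcVisibility {
    public static void main(String[] args) {
        // One bean per collector (typically the young- and old-generation collectors).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms spent in total%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        // Per-pause timings can be obtained from the GC log instead,
        // e.g. by running the JVM with -Xlog:gc on modern JDKs.
    }
}
```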

For some, removing GC latencies is a matter of pride. It doubtless requires much effort and some understanding of the platform. However, this somewhat difficult job is not necessarily productive.

It is also good for job security: once the software has been optimised for “zero-GC” operation, it is harder for anyone but its author to work with its code.

Eventually, the culture changes. The company has to hire new people with a view that there is zero-GC code to support, a team which sees “zero-GC” as a value, and a proud tradition of tilting at windmills to uphold. The actual merit of the approach is rarely questioned.

One step forwards, two steps back

It is both funny and sad that in order to create zero-GC software, steps are taken that increase the latencies which matter in trading.

Often, the memory footprint of processes increases, which means software has to touch more memory and hence accumulate memory-related latencies more actively.

The avoidance of memory allocation demands that certain safety-oriented approaches are never used. Instead of an approach where each object is valid upon construction — and is possibly even immutable — each object has to be mutable in its entirety and often has an invalid state by design (for example, when it is in a memory pool).

This greatly increases code complexity and the frequency of bugs. More than that, the bugs tend to be of the nastier kind, where one may end up with a chimaera. For example, an order for ¥1,000,000 is returned to the pool, then re-used to create an object for $50, and suddenly the system ends up with an effective order for $1,000,000.
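A sketch of how such a chimaera can come about, with hypothetical types and a deliberately buggy pool that does not clear fields on release:

```java
import java.util.ArrayDeque;

// Hypothetical order type in the pooled, fully mutable style.
final class Order {
    String currency;
    long   notional;
    // ... many more fields in a real system ...
}

final class OrderPool {
    private final ArrayDeque<Order> free = new ArrayDeque<>();
    Order acquire()        { Order o = free.poll(); return (o != null) ? o : new Order(); }
    void  release(Order o) { free.push(o); }   // bug: fields are not cleared on release
}

final class Chimaera {
    public static void main(String[] args) {
        OrderPool pool = new OrderPool();

        Order first = pool.acquire();
        first.currency = "JPY";
        first.notional = 1_000_000;        // an order for 1,000,000 yen
        pool.release(first);               // returned to the pool with its fields intact

        Order second = pool.acquire();     // the very same object comes back out
        second.currency = "USD";           // meant to be an order for $50,
                                           // but the notional is never set...
        System.out.println(second.notional + " " + second.currency); // prints "1000000 USD"
    }
}
```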

Rationally, a loss of $10 on a severe GC pause is less important than hitting the market with a $1M order instead of $50, but it is not always seen this way.

A case of intense data processing

But what if you have a system which needs to process huge volumes of data?

For example, let’s say there is a market data connection over a 10 Gbps line and sometimes the data rate approaches the capacity of the line. Surely, a single latency of, say, 1 millisecond would cause problems: during that latency a little over 1 MB of data would buffer up, and the buffer might overflow. You can raise the buffer size, but you cannot do it indefinitely. If the latency is long enough, no buffer will be big enough. For UDP-based connections, this can be the end.
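The arithmetic behind that estimate, assuming the line runs at full rate during the pause:

\[
10\ \text{Gbit/s} = 1.25\ \text{GB/s}, \qquad 1.25\ \text{GB/s} \times 1\ \text{ms} \approx 1.25\ \text{MB}
\]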

That is fair. I agree that on high volumes and with protocols like UDP you may end up in trouble if your application is suffering from GC pauses. Does it mean that “zero-GC” code is the way? Of course not. All it means is that you need to pick the right tool for the job.

If you need predictability so much, write in C. Or C++ if you really want to. It is not a whim, it is a practical choice. You would have a fairly predictable foundation to build upon and yes, C and C++ are quite good for soft-realtime systems. Also, it is not like there are no people around who can write decent code in them.

But let’s not forget that even code in C or C++ will occasionally experience random delays. I did see situations where the whole system was locking up for several seconds once in a blue moon, and it took plenty of time to figure out why. But for some, even less severe latencies may be a problem. In that case, there are hard-realtime solutions. They are… hard. But they are totally possible. Most firms (I’d say everyone apart from five or so) won’t need that anyway.

Conclusion

In finance — especially in High-Frequency Trading — latencies do matter. The latencies caused by garbage collection can be quite high. This has resulted in a popular view that these GC pauses must be eliminated at all costs for P&L to improve.

However, this view is based on a misunderstanding. Not only are these latencies usually irrelevant, but trying to get rid of them in an aggressive way results in various additional costs which a rational person would rather avoid.

If you do need to get rid of random latencies, or have to deal with extremely high-volume streams of data, you would be better off using technologies which are suited for soft- or hard-realtime operation (depending on your needs) rather than pretending that you can teach a frog to fly.