Denys Rtveliashvili

ChatGPT as a Search Engine

The Adventure Begins

I tried AIs based on large language models and was mostly unimpressed. For example, you can ask one to write an email for you, and it will do a better-than-average job, but shouldn’t we be literate enough to do that ourselves? After all, if you have typed the request to ChatGPT or a similar tool, why not type the email itself? Nevertheless, there is a valid use case for LLMs, and with some care, there may be value in them. Don’t hope for much, however!

Once upon a time, I was writing some code in Rust and using Polars to work with tabular data. The code was fairly straightforward: it read data from a custom format, turned it into a table, did some trivial transformations such as sorting by a column, and wrote the result out as Parquet. A triviality, really. I thought the only tricky bit was the volume of data, so I wrote the code to go through it in independent iterations (no data carried over from one iteration to the next), keeping the volume processed in each iteration reasonable (less than 10 GB).
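
The overall shape of the code was roughly the following. This is a minimal sketch, not the original program: read_batch and NUM_BATCHES are hypothetical stand-ins for the custom-format reader, and the exact Polars signatures (sort, ParquetWriter) vary between releases.

    use polars::prelude::*;

    const NUM_BATCHES: usize = 100; // hypothetical; one per chunk of input

    // Stand-in for the real custom-format reader.
    fn read_batch(_batch_id: usize) -> PolarsResult<DataFrame> {
        // ... parse the custom format into columns ...
        df!("key" => &[3i64, 1, 2], "value" => &[0.3f64, 0.1, 0.2])
    }

    fn main() -> PolarsResult<()> {
        for batch_id in 0..NUM_BATCHES {
            // Each iteration owns its data; nothing escapes the loop body.
            let df = read_batch(batch_id)?;
            let mut sorted = df.sort(["key"], SortMultipleOptions::default())?;
            let file = std::fs::File::create(format!("batch-{batch_id}.parquet"))?;
            ParquetWriter::new(file).finish(&mut sorted)?;
            // `df` and `sorted` are dropped here, so memory use should
            // fall back to almost zero after every iteration... in theory.
        }
        Ok(())
    }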

The code was written and launched. It quickly devoured far more RAM than expected and crashed. The adventure had begun.

Due Diligence

So I am checking the logic, and the logic is sound. It does not keep anything around for too long, no fancy pointers prolong the life of various tables, and no data is carried over between iterations. Nothing explains why the code consumes quite so much, or why its consumption grows from step to step, even though it should drop to almost zero after each group of data is processed.

I ran it under “perf”, and the findings were too slim to be useful: a ridiculous amount of time spent copying memory, context switching, and so on, but nothing specific to point a finger at.

The first discovery was that sorting tabular data in Polars is amazingly hungry. If I sort a DataFrame - even in-place - the memory usage suddenly explodes. It is also slow. Long story short, not having the time or desire to mend someone else’s rather arcane software, I implemented the sorting procedure myself. The result was code that does its job in a fraction of a second and temporarily doubles the amount of RAM used. Not ideal, but acceptable.
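
For the curious, the idea was along these lines (a minimal sketch, not my actual implementation, and the Polars method names here - arg_sort, take - have shifted between releases): compute the permutation that sorts the key column, then gather the rows once in that order, so the only extra allocation is the sorted copy itself.

    use polars::prelude::*;

    // Sort a DataFrame "by hand": find the permutation that orders the
    // key column, then gather all rows in that order in a single pass.
    // Peak memory is the original frame plus one sorted copy.
    fn sort_by_key(df: &DataFrame, key: &str) -> PolarsResult<DataFrame> {
        // Indices that would put `key` into ascending order.
        let perm: IdxCa = df.column(key)?.arg_sort(SortOptions::default());
        // Gather rows by the permutation; this allocates the sorted copy.
        df.take(&perm)
    }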

But with that elephant out of the room, I was left with a whale shark: the memory consumption still grows, and it does not shrink. Worse still, it does so only sometimes, and the behaviour seems to depend both on the data and on the computer where the code is running, exhibiting classic features of a Heisenbug. Most perplexing of all, the used-up memory was not freed once an iteration was over, despite my code clearly releasing all the relevant data structures. This smelled like a memory leak in Polars.

Searching Online

When I typed “memory management in Polars” into Google, hoping to find an explanation for the effective leaks and excessive memory use that I had observed, I got nothing of value. The first link led to a text saying “It is fascinating to see that Polars is so efficient and can process our dataset with only 512MB of RAM”, while the second pointed to an article titled “Why Polars uses less memory than Pandas”.

That did not help, so let’s be more specific: “memory leak in Polars”. OK, that shows a few links to bug reports.

… and so on. Nothing truly relevant, however.

ChatGPT to the Rescue

After a few hours of rather vexing profiling, I decided that I needed help.

In the olden times, people in distress talked to Eliza the psychotherapist, who/which was easily accessible from Emacs by typing M-x doctor. I did not have Emacs at hand, but I did have ChatGPT.

“I have a code in Rust which uses Polars library and works with DataFrame objects. It seems that the code leaks memory. What could be the problem?” I typed. ChatGPT obligingly dumped a few pages about possible root causes of memory leaks. It mentioned reference cycles, Rc and Arc (irrelevant), unintentional cloning of DataFrame (also irrelevant), but also the possibility of bugs in Polars or Polars not releasing memory to the OS after the lifetime of DataFrames ends. OK, that’s a good start.
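
The cloning point, for what it is worth, does refer to a real pattern - just not one present in my code. A hypothetical sketch: cloning a DataFrame is cheap because the column buffers are reference-counted, but a clone parked in a long-lived collection keeps every one of those buffers alive.

    use polars::prelude::*;

    // The "unintentional cloning" pattern ChatGPT warned about (a made-up
    // example; it was not the culprit in my code). The clone only bumps
    // reference counts on the shared column buffers, but as long as it
    // sits in `history`, none of those buffers can be freed.
    fn process(history: &mut Vec<DataFrame>, df: &DataFrame) {
        history.push(df.clone()); // buffers now live as long as `history`
    }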

“Are there any known bugs in Polars causing the library to keep the memory and not release it to OS?” I tried. ChatGPT suggested I check the bug reports on GitHub. Basically, it subtly asked me whether I was banned from using Google. Fair enough.

“Does Polars make use of a memory allocator?” I persisted, as - obviously - if Polars is using a custom memory allocator, that allocator may decide to keep the previously used memory around for a while, making it look like a leak. ChatGPT produced a couple more pages, essentially insisting that Polars uses standard memory allocation (“such as malloc on Unix-like systems and HeapAlloc on Windows”). OK.
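
There is some substance behind the question, though. In Rust, the allocator is chosen per binary, and swapping in jemalloc is a common setup for allocation-heavy Polars programs (the Python wheels ship with jemalloc on Linux). A minimal sketch, assuming the tikv-jemallocator crate:

    // By default a Rust binary uses the system allocator (malloc).
    // Opting into jemalloc instead is a one-liner:
    use tikv_jemallocator::Jemalloc;

    #[global_allocator]
    static GLOBAL: Jemalloc = Jemalloc;

    fn main() {
        // All heap allocations, including those made by Polars, now go
        // through jemalloc, which retains freed ("dirty") pages for reuse
        // and returns them to the OS only gradually - so the process RSS
        // can stay high long after the DataFrames are dropped.
    }

Notably, jemalloc’s retention is tunable: setting its dirty_decay_ms option to 0, for instance, makes it hand dirty pages back to the OS almost immediately, at some performance cost.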

Need to prod it a bit more… I suppose Polars may rely on Apache Arrow for its internal data representation. So I asked: “Does Polars rely on Apache Arrow or on something else for memory allocation when it works with DataFrame’s data?” ChatGPT replied: “Polars primarily relies on Apache Arrow for memory allocation when working with DataFrame … Each column in a Polars DataFrame is stored as a Series, which in turn is based on Arrow Arrays”.

Need a bit more pushing here. “How does Apache Arrow (when used by Polars) work with memory?” I asked. The reply was: “When Polars uses Apache Arrow for memory management, it leverages Arrow’s memory format, memory pool system, and columnar storage model to efficiently allocate, manage, and access large datasets”. Oh yes, here it comes. There is a memory pool there.

“Which memory pool system is used?” produced two pages more: “In Apache Arrow, the memory pool system used is called the MemoryPool … Once memory is no longer needed, it is returned to the memory pool. This allows Arrow to reuse the memory for future allocations, minimizing the need for the system to allocate new memory repeatedly.”

Let’s go for the jugular: “Is it true that when Polars in Rust is allocating memory for columns of a DataFrame, the memory is requested from MemoryPool of Apache Arrows, and that when DataFrame is no longer needed, the memory is returned to the same MemoryPool, but is not immediately returned to OS?”

And here we get the confession: “(1) When a DataFrame is no longer needed, and the Rust DataFrame object (or its associated Series) is dropped (either by going out of scope or being explicitly dropped via std::mem::drop), the memory used by the Arrow arrays (i.e., the buffers) is returned to the Arrow MemoryPool. (2) Arrow’s memory pool does not immediately return the memory to the OS. Instead, it retains the memory within the pool for potential reuse by future allocations. (3) However, Arrow (and consequently Polars) may eventually release memory back to the OS during specific memory pool clean-up operations or when the process terminates, but it is not done immediately upon deallocation.” In other words, there is a custom memory allocator in play, it works this way by design, and memory is not returned to the OS promptly - just as I suspected.
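
Whatever the mechanism behind it actually is, the claimed symptom is easy to observe. A minimal sketch of such an experiment (the frame size and the sleep are arbitrary): build and drop a large DataFrame, then watch the resident set size of the process from outside.

    use polars::prelude::*;

    fn main() -> PolarsResult<()> {
        {
            // Build a throwaway frame of roughly 80 MB in its own scope.
            let big = df!("x" => (0..10_000_000i64).collect::<Vec<_>>())?;
            println!("rows: {}", big.height());
        } // `big` is dropped here: as far as Rust is concerned, it is gone.

        // Yet the RSS (see /proc/self/status, or `ps`) often stays elevated:
        // the allocator keeps the freed pages around for reuse rather than
        // handing them back to the OS right away.
        std::thread::sleep(std::time::Duration::from_secs(60));
        Ok(())
    }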

Somewhat hopeful, I asked: “Is it possible to speed up the return of memory already returned to the MemoryPool to the OS?”, to which the reply was “Currently, Apache Arrow does not provide an explicit API to force the immediate return of memory to the OS from its MemoryPool”.

Did it tell me the truth, the whole truth, and nothing but the truth?

Of course not, not even close.

It contradicted itself: on the one hand, it said Polars relies on standard memory management, but on the other, it later admitted that Polars relies on Arrow, which in turn has a memory pool. It did not even mention Apache Arrow until I explicitly asked whether Polars uses it. Finally, it did mention Apache Arrow’s MemoryPool, but I cannot see one in the source. I can see a merge request which was closed rather than merged, but that’s about it. It’s not that MemoryPool is a complete illusion, however: it is part of the C++ implementation of Arrow, but that is not the same thing as the Rust implementation.

Did it help me to solve the problem with my code?

Definitely not. But it did calm me down. Chatting with it had a mild therapeutic effect.

Why is ChatGPT Better, Then?

Despite not being able to find anything of practical use, I do believe that ChatGPT was a better alternative to traditional search engines.

While Google Search and ChatGPT were both fruitless at getting the job done, ChatGPT was definitely less aggravating. It calmed me down and shone a light of hope - even though it was likely a false one - which ultimately helped me accept reality and rewrite the code one more time, relying on Polars even less. Et voilà! The memory consumption is minimal now.

It is not a cheerful thought, but the volume of “information” online is so great, and the quality of it is so low, that finding gems there is next to impossible. You may complain about it, but it won’t get you anywhere. Instead, accept it. You won’t find anything that really matters. But you can find a kind digital soul to talk to you calmly and politely. That digital soul might lie to you, but you know it will, and you know it was not meant to cause injury. So it is OK, really.