GCC, glibc, stack unwinding and relocations – A war story
I’ve been meaning to write a post about this bug for a while, so here it is (before I forget the details!).
First, I’d like to thank a few people:
- My friend Gabriel F. T. Gomes, who helped with debugging and simply talking about the issue. I love doing some pair debugging, and I noticed that he also had a great time diving into the internals of glibc and libgcc.
- My teammate Dann Frazier, who always provides invaluable insights and was there to motivate me to push a bit more in order to figure out what was going on.
- The upstream GCC and glibc developers who finally drove the investigation to completion and came up with an elegant fix.
I’ll probably forget some details because it’s been more than a week (and life at work moves fast), but we’ll see.
The background story
Wolfi OS takes security seriously, and one of the things we have is a package which sets the hardening compiler flags for C/C++ according to the best practices recommended by OpenSSF. At the time of this writing, these flags are (in GCC’s spec file parlance):
*self_spec:
+ %{!O:%{!O1:%{!O2:%{!O3:%{!O0:%{!Os:%{!0fast:%{!0g:%{!0z:-O2}}}}}}}}} -fhardened -Wno-error=hardened -Wno-hardened %{!fdelete-null-pointer-checks:-fno-delete-null-pointer-checks} -fno-strict-overflow -fno-strict-aliasing %{!fomit-frame-pointer:-fno-omit-frame-pointer} -mno-omit-leaf-frame-pointer
*link:
+ --as-needed -O1 --sort-common -z noexecstack -z relro -z now
The important part for our bug is the usage of -z now and -fno-strict-aliasing.
As I was saying, these flags are set for almost every build, but sometimes things don’t work as they should and we need to disable them. Unfortunately, one of these problematic cases was glibc.
There was an attempt to enable hardening while building glibc, but that introduced a strange breakage to several of our packages and had to be reverted.
Things stayed pretty much the same until a few weeks ago, when I started working on one of my roadmap items: figure out why hardening glibc wasn’t working, and get it to work as much as possible.
Reproducing the bug
I started off by trying to reproduce the problem. It’s important to mention this because I often see young engineers forgetting to check if the problem is even valid anymore. I don’t blame them; the anxiety to get the bug fixed can be really blinding.
Fortunately, I already had one simple test to trigger the failure.
All I had to do was install the py3-matplotlib package and then invoke:
$ python3 -c 'import matplotlib'
This would result in an abort with a core dump.
I tried following the steps above, and readily saw the problem manifesting again. OK, first step is done; I wasn’t getting out easily from this one.
Initial debug
The next step is to actually try to debug the failure. In an ideal world you get lucky and are able to spot what’s wrong after just a few minutes. Or even better: you also can devise a patch to fix the bug and contribute it to upstream.
I installed GDB, and then ran the command above inside it. When the abort happened, I issued a backtrace command to see where exactly things had gone wrong. I got a stack trace similar to the following:
#0 0x00007c43afe9972c in __pthread_kill_implementation () from /lib/libc.so.6
#1 0x00007c43afe3d8be in raise () from /lib/libc.so.6
#2 0x00007c43afe2531f in abort () from /lib/libc.so.6
#3 0x00007c43af84f79d in uw_init_context_1[cold] () from /usr/lib/libgcc_s.so.1
#4 0x00007c43af86d4d8 in _Unwind_RaiseException () from /usr/lib/libgcc_s.so.1
#5 0x00007c43acac9014 in __cxxabiv1::__cxa_throw (obj=0x5b7d7f52fab0, tinfo=0x7c429b6fd218 <typeinfo for pybind11::attribute_error>, dest=0x7c429b5f7f70 <pybind11::reference_cast_error::~reference_cast_error() [clone .lto_priv.0]>)
at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:93
#6 0x00007c429b5ec3a7 in ft2font__getattr__(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [clone .lto_priv.0] [clone .cold] () from /usr/lib/python3.13/site-packages/matplotlib/ft2font.cpython-313-x86_64-linux-gnu.so
#7 0x00007c429b62f086 in pybind11::cpp_function::initialize<pybind11::object (*&)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), pybind11::object, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, pybind11::name, pybind11::scope, pybind11::sibling>(pybind11::object (*&)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), pybind11::object (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#1}::_FUN(pybind11::detail::function_call&) [clone .lto_priv.0] ()
from /usr/lib/python3.13/site-packages/matplotlib/ft2font.cpython-313-x86_64-linux-gnu.so
#8 0x00007c429b603886 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) () from /usr/lib/python3.13/site-packages/matplotlib/ft2font.cpython-313-x86_64-linux-gnu.so
...
Huh. Initially this didn’t provide me with much information. There was something strange about seeing the abort function being called right after _Unwind_RaiseException, but at the time I didn’t pay much attention to it.
OK, time to expand our horizons a little. Remember when I said that several of our packages would crash with a hardened glibc? I decided to look for another problematic package so that I could make it crash and get its stack trace. My thinking was that if I could compare both traces, something would come up.
I happened to find an old discussion where Dann Frazier mentioned that Emacs was also crashing for him. He and I share the Emacs passion, and I totally agreed with him when he said that “Emacs crashing is priority -1!” (I’m paraphrasing).
I installed Emacs, ran it, and voilà: the crash happened again. OK, that was good. When I ran it inside GDB and asked for a backtrace, here’s what I got:
#0 0x00007eede329972c in __pthread_kill_implementation () from /lib/libc.so.6
#1 0x00007eede323d8be in raise () from /lib/libc.so.6
#2 0x00007eede322531f in abort () from /lib/libc.so.6
#3 0x00007eede262879d in uw_init_context_1[cold] () from /usr/lib/libgcc_s.so.1
#4 0x00007eede2646e7c in _Unwind_Backtrace () from /usr/lib/libgcc_s.so.1
#5 0x00007eede3327b11 in backtrace () from /lib/libc.so.6
#6 0x000059535963a8a1 in emacs_backtrace ()
#7 0x000059535956499a in main ()
Ah, this backtrace is much simpler to follow. Nice.
Hmmm. Now the crash is happening inside _Unwind_Backtrace. A pattern emerges! This must have something to do with stack unwinding (or so I thought… keep reading to discover the whole truth). You see, the backtrace function (yes, it’s a function) and C++’s exception handling mechanism use similar techniques to do their jobs, and it pretty much boils down to unwinding frames from the stack.
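To make that connection concrete, here is a minimal sketch of walking the stack by calling the unwinder directly, which is the same libgcc entry point that shows up in the Emacs trace. It assumes GCC’s <unwind.h>; the names on_frame and count_frames are made up for illustration.

```c
#include <stdio.h>
#include <unwind.h>

/* Callback invoked by the unwinder once per stack frame.  */
static _Unwind_Reason_Code
on_frame (struct _Unwind_Context *ctx, void *arg)
{
  int *depth = arg;

  printf ("frame %d: pc=%p\n", *depth, (void *) _Unwind_GetIP (ctx));
  (*depth)++;
  return _URC_NO_REASON;        /* keep walking towards the outermost frame */
}

/* Walks the current call stack and returns the number of frames seen.  */
int
count_frames (void)
{
  int depth = 0;

  _Unwind_Backtrace (on_frame, &depth);
  return depth;
}
```

Calling count_frames () from main on a healthy system prints one line per frame and returns how many frames were walked. On the hardened glibc described in this post, it would presumably run into the same abort path as Emacs did.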
I looked into Emacs’ source code, specifically the emacs_backtrace function, but could not find anything strange over there. This bug was probably not going to be an easy fix…
The quest for a minimal reproducer
Being able to easily reproduce the bug is awesome and really helps with debugging, but even better is being able to have a minimal reproducer for the problem.
You see, py3-matplotlib is a huge package and pulls in a bunch of extra dependencies, so it’s not easy to ask other people to “just install this big package plus these other dependencies, and then run this command…”, especially if we have to file an upstream bug and talk to people who may not even run the distribution we’re using. So I set out to try and come up with a smaller recipe to reproduce the issue, ideally something that’s not tied to a specific package from the distribution.
Having all the information gathered from the initial debug session, especially the Emacs backtrace, I thought that I could write a very simple program that just invoked the backtrace function from glibc in order to trigger the code path that leads to _Unwind_Backtrace. Here’s what I wrote:
#include <execinfo.h>

int
main(int argc, char *argv[])
{
  void *a[4096];
  backtrace (a, 100);
  return 0;
}
After compiling it, I determined that yes, the problem did happen with this small program as well. There was only a small nuisance: the manifestation of the bug was not deterministic, so I had to execute the program a few times until it crashed. But that’s much better. Having a minimal reproducer pretty much allows us to switch our focus to what really matters. I wouldn’t need to dive into Emacs’ or Python’s source code anymore.
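For comparison, a slightly more talkative variant of the same idea makes it easier to tell a healthy run from a crashing one, because it prints the frames it collected and reports how many there were. This is only a sketch on top of glibc’s <execinfo.h>; dump_backtrace and MAX_FRAMES are hypothetical names, not part of the reproducer I actually used.

```c
#include <execinfo.h>
#include <unistd.h>

#define MAX_FRAMES 64

/* Collects the current call stack, writes one line per frame to stderr
   and returns the number of frames that backtrace () reported.  */
int
dump_backtrace (void)
{
  void *frames[MAX_FRAMES];
  int n = backtrace (frames, MAX_FRAMES);

  backtrace_symbols_fd (frames, n, STDERR_FILENO);
  return n;
}
```

On a working system this returns a small positive number. With the bug present, the process would be expected to abort inside backtrace () before anything is printed.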
At the time, I was sure this was a glibc bug. But then something else happened.
GCC 15
I had to stop my investigation efforts because something more important came up: it was time to upload GCC 15 to Wolfi. I spent a couple of weeks working on this (it involved rebuilding the whole archive, filing hundreds of FTBFS bugs, patching some programs, etc.), and by the end of it the transition went smoothly. When the GCC 15 upload was finally done, I switched my focus back to the glibc hardening problem.
The first thing I did was to… yes, reproduce the bug again. It had been a few weeks since I had touched the package, after all. So I built a hardened glibc with the latest GCC and… the bug did not happen anymore!
Fortunately, the very first thing I thought was “this must be GCC”, so I rebuilt the hardened glibc with GCC 14, and the bug was there again. Huh, unexpected but very interesting.
Diving into glibc and libgcc
At this point, I was ready to start some serious debugging. And then I got a message on Signal. It was one of those moments where two minds think alike: Gabriel decided to check how I was doing, and I was thinking about him because this involved glibc, and Gabriel contributed to the project for many years. I explained what I was doing, and he promptly offered to help. Yes, there are more people who love low level debugging!
We spent several hours going through disassemblies of certain functions (because we didn’t have any debug information in the beginning), trying to make sense of what we were seeing. There was some heavy GDB involved; unfortunately I completely lost the session’s history because it was done inside a container running inside an ephemeral VM. But we learned a lot. For example:
- It was hard to actually understand the full stack trace leading to uw_init_context_1[cold]. _Unwind_Backtrace obviously didn’t call it (it called uw_init_context_1, but what was that [cold] doing?). We had to investigate the disassembly of uw_init_context_1 in order to determine where uw_init_context_1[cold] was being called.
- The [cold] suffix marks the part of a function that GCC has split off because it considers it unlikely to be reached (there is also a cold function attribute that can be used to tell the compiler the same thing explicitly). When I read that, my mind immediately jumped to “this must be an assertion”, so I went to the source code and found the spot.
- We were able to determine that the return code of uw_frame_state_for was 5, which means _URC_END_OF_STACK. That’s why the assertion was triggering.
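The shape of that assertion can be sketched in a few lines of C. This is not libgcc’s code: fail_hard, frame_state_ok and init_context_sketch are hypothetical names, and the only thing taken from the real headers is the _Unwind_Reason_Code enum from GCC’s <unwind.h>.

```c
#include <stdlib.h>
#include <unwind.h>

/* Hypothetical failure path.  Marking it cold tells GCC that it is
   unlikely to run, so the compiler may move it away from the hot code.  */
static void fail_hard (void) __attribute__ ((cold, noreturn));

static void
fail_hard (void)
{
  abort ();
}

/* Returns 1 when the unwinder found frame state, 0 otherwise.  */
int
frame_state_ok (_Unwind_Reason_Code code)
{
  return code == _URC_NO_REASON;
}

/* Sketch of an init routine that refuses to continue on any error,
   including _URC_END_OF_STACK (the value 5 we saw in GDB).  */
void
init_context_sketch (_Unwind_Reason_Code code)
{
  if (!frame_state_ok (code))
    fail_hard ();
}
```

The point of the sketch is only that a “this should never happen” check ends in a call to abort (), and that this rarely taken branch is exactly the kind of code that ends up under a [cold] label in a backtrace.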
After finding these facts without debug information, I decided to bite the bullet and recompiled GCC 14 with -O0 -g3, so that we could debug what uw_frame_state_for was doing. After banging our heads a bit more, we found that fde was NULL in this excerpt:
// ...
  fde = _Unwind_Find_FDE (context->ra + _Unwind_IsSignalFrame (context) - 1,
                          &context->bases);
  if (fde == NULL)
    {
#ifdef MD_FALLBACK_FRAME_STATE_FOR
      /* Couldn't find frame unwind info for this function.  Try a
         target-specific fallback mechanism.  This will necessarily
         not provide a personality routine or LSDA.  */
      return MD_FALLBACK_FRAME_STATE_FOR (context, fs);
#else
      return _URC_END_OF_STACK;
#endif
    }
// ...
We’re debugging on amd64, which means that MD_FALLBACK_FRAME_STATE_FOR is defined and therefore is called. But that’s not really important for our case here, because we had established before that _Unwind_Find_FDE would never return NULL when using a non-hardened glibc (or a glibc compiled with GCC 15). So we decided to look into what _Unwind_Find_FDE did.
The function is complex because it deals with .eh_frame, but we were able to pinpoint the exact location where find_fde_tail (one of the functions called by _Unwind_Find_FDE) is returning NULL:
  if (pc < table[0].initial_loc + data_base)
    return NULL;
We looked at the addresses of pc and table[0].initial_loc + data_base, and found that the former fell within libgcc’s text section, while the latter fell within the text of /lib/ld-linux-x86-64.so.2.
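We did this comparison by hand in GDB, but the same question (“which mapping owns this address?”) can also be answered from inside a program. Here is a minimal sketch that assumes Linux and a mounted /proc; mapping_for_address is a hypothetical helper name.

```c
#include <stdint.h>
#include <stdio.h>

/* Returns 1 and copies the backing file of the mapping that contains
   ADDR into NAME (an empty string for anonymous mappings), or returns 0
   if no mapping of the current process contains ADDR.  */
int
mapping_for_address (const void *addr, char *name, size_t len)
{
  FILE *maps = fopen ("/proc/self/maps", "r");
  char line[512];
  int found = 0;

  if (maps == NULL)
    return 0;

  while (!found && fgets (line, sizeof line, maps) != NULL)
    {
      unsigned long start, end;
      char path[256] = "";

      /* Line format: start-end perms offset dev inode [path]  */
      if (sscanf (line, "%lx-%lx %*s %*s %*s %*s %255[^\n]",
                  &start, &end, path) < 2)
        continue;

      if ((uintptr_t) addr >= start && (uintptr_t) addr < end)
        {
          snprintf (name, len, "%s", path);
          found = 1;
        }
    }

  fclose (maps);
  return found;
}
```

Feeding it the two addresses from the comparison above would be one way to see at a glance that they belong to two different objects, which is what made the NULL return so suspicious.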
At this point, we were already too tired to continue. I decided to keep looking at the problem later and see if I could get any further.
Bisecting GCC
The next day, I woke up determined to find what changed in GCC 15 that caused the bug to disappear. Unless you know GCC’s internals like they are your own home (which I definitely don’t), the best way to do that is to git bisect the commits between GCC 14 and 15.
I spent a few days running the bisect. It took me more time than I’d have liked to find the right range of commits to pass git bisect (because of how branches and tags are done in GCC’s repository), and I also had to write some helper scripts that:
- Modified the gcc.yaml package definition to make it build with the commit being bisected.
- Built glibc using the GCC that was just built.
- Ran tests inside a docker container (with the recently built glibc installed) to determine whether the bug was present.
At the end, I had a commit to point to:
commit 99b1daae18c095d6c94d32efb77442838e11cbfb
Author: Richard Biener <rguenther@suse.de>
Date: Fri May 3 14:04:41 2024 +0200
tree-optimization/114589 - remove profile based sink heuristics
Makes sense, right?! No? Well, it didn’t for me either. Even after reading what was changed in the code and the upstream bug fixed by the commit, I was still clueless as to why this change “fixed” the problem (I say “fixed” because it may very well be an unintended consequence of the change, and some other problem might have been introduced).
Upstream takes over
After obtaining the commit that possibly fixed the bug, while talking to Dann and explaining what I did, he suggested that I should file an upstream bug and check with them. Great idea, of course.
I filed the following upstream bug:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120653
It’s a bit long, very dense and complex, but ultimately upstream was able to find the real problem and have a patch accepted in just two days. Nothing like knowing the code base. The initial bug became:
https://sourceware.org/bugzilla/show_bug.cgi?id=33088
In the end, the problem was indeed in how the linker defines __ehdr_start, which, according to the code (from elf/dl-support.c):
  if (_dl_phdr == NULL)
    {
      /* Starting from binutils-2.23, the linker will define the
         magic symbol __ehdr_start to point to our own ELF header
         if it is visible in a segment that also includes the phdrs.
         So we can set up _dl_phdr and _dl_phnum even without any
         information from auxv.  */
      extern const ElfW(Ehdr) __ehdr_start attribute_hidden;
      assert (__ehdr_start.e_phentsize == sizeof *GL(dl_phdr));
      _dl_phdr = (const void *) &__ehdr_start + __ehdr_start.e_phoff;
      _dl_phnum = __ehdr_start.e_phnum;
    }
But the following definition is the problematic one (from elf/rtld.c):
extern const ElfW(Ehdr) __ehdr_start attribute_hidden;
This symbol (along with its counterpart, __ehdr_end) was being run-time relocated when it shouldn’t be. The fix that was pushed added optimization barriers to prevent the compiler from doing the relocations.
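To give a feel for the pieces involved, here is a small sketch that reads __ehdr_start from an ordinary program and passes its address through an optimization barrier. The barrier is the generic empty-asm GCC idiom, not the actual glibc patch, and opt_barrier, phnum_from_ehdr and phnum_from_auxv are names I made up. It assumes a GNU toolchain on Linux.

```c
#include <elf.h>
#include <link.h>
#include <sys/auxv.h>

/* Defined by the linker (binutils >= 2.23, per the glibc comment above).  */
extern const ElfW(Ehdr) __ehdr_start __attribute__ ((visibility ("hidden")));

/* Generic optimization barrier: the empty asm claims to modify P, so the
   compiler can no longer assume anything about where the value came from.
   This is an illustration of the idea, not the code used in the fix.  */
static inline const void *
opt_barrier (const void *p)
{
  __asm__ volatile ("" : "+r" (p));
  return p;
}

/* Number of program headers according to our own ELF header.  */
unsigned int
phnum_from_ehdr (void)
{
  const ElfW(Ehdr) *ehdr = opt_barrier (&__ehdr_start);

  return ehdr->e_phnum;
}

/* Number of program headers according to the kernel's auxiliary vector.  */
unsigned long
phnum_from_auxv (void)
{
  return getauxval (AT_PHNUM);
}
```

In a normally linked executable both functions should agree, which is the same trick the dl-support.c excerpt relies on when it sets up _dl_phdr and _dl_phnum without any information from auxv.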
I don’t claim to fully understand what was done here, and Jakub’s analysis is a thing to behold, but in the end I was able to confirm that the patch fixed the bug. And in the end, it was indeed a glibc bug.
Conclusion
This was an awesome bug to investigate. It’s one of those that deserve a blog post, even though some of the final details of the fix flew over my head.
I’d like to start blogging more about these sorts of bugs, because I’ve encountered my fair share of them throughout my career. And it was great being able to do some debugging with another person, exchange ideas, learn things together, and ultimately share that deep satisfaction when we find why a crash is happening.
I have at least one more bug in my TODO list to write about (another one with glibc, but this time I was able to get to the end of it and come up with a patch). Stay tuned.