This essay is about an investigation into the Mac OS X kernel to track down a strange bug in Chrome on Lion. It details the steps that I took during the investigation and discusses how to write a kernel extension to expose private data in the kernel to an analysis program.

The Problem

In Chrome 17, we started receiving reports of Chrome crashing on Lion with __THE_SYSTEM_HAS_NO_PORTS_AVAILABLE__ as the crashing stack frame. A colleague seeing this filed a bug to track the issue. Initially I thought there was just a problem with his machine, but as more reports of the issue started coming in, an investigation was needed. This crash boils down to the kernel running out of space to allocate Mach ports for the process.

Mach ports are a kernel-provided inter-process communication (IPC) mechanism used heavily throughout the operating system. A Mach port is a unidirectional, kernel-protected channel that can have multiple send endpoints and only one receive endpoint. Because they are unidirectional, a common way to use a Mach port is to send a message to the receiver, and to include another Mach port on which the receiver can reply to the sender. Because of this port-passing, tracking usage and leaks can be a bit difficult, as I’ll explain in this essay.

The obvious question is: what is leaking Mach ports?

Options in Userspace

I first tried to find the source of the leaks by writing a DTrace script. DTrace is a powerful debugging tool written by Sun, which Apple ships with Mac OS X and uses in its Instruments application. The script I wrote monitored all calls to functions matching mach* and tracked the process ID (PID) and execname to try to correlate the number of ports between processes. Unfortunately, the data was inconclusive because it wasn’t granular enough (remember: N senders and only one receiver), and I appeared to have reached a dead end on this bug for the time being.

New Discoveries

Meanwhile, another obscure Lion crash bug was brought to my attention. Specifically, this bug was about crashes inside __CFRunLoopFindMode, which is a helper called by various runloop functions. One of the great things about Mac OS X is the amount of code that Apple open sources. As I wrote on the bug, it’s possible to disassemble that function (using otool -tV) to see what’s going on. Look for the function __CFRunLoopFindMode and then take the address for the first instruction in that function and add the frame offset (the number after the + in an Apple crash report) to get the instruction address at which the function is crashing. The last step is to translate the assembly into lines of code, which is possible through the open source code of the CF library. After reading the code, I was able to determine that the bug had the same root cause as the one I was already working on. This also meant that port exhaustion was not just an isolated issue, but it was one that was affecting a significantly larger number of users than previously thought.
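The address arithmetic in that procedure is small enough to sketch in C. This is only an illustration: the function name is made up, and the real inputs come from the otool -tV listing and from the crash report.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the address of the crashing instruction, given the address of the
 * function's first instruction (from the otool -tV listing) and the frame
 * offset (the number after the + in the crash report). */
uint64_t crash_instruction_address(uint64_t function_start, uint64_t frame_offset) {
    /* Guard against wraparound caused by a mistyped value. */
    assert(function_start <= UINT64_MAX - frame_offset);
    return function_start + frame_offset;
}
```

With invented values, a function starting at 0x1000 and a frame offset of 0x2e puts the crashing instruction at 0x102e, which is the line to look up in the disassembly.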

Due to the nature of the crash – an exhaustion of Mach ports – the Chrome crash reporting infrastructure was not able to send reports of this. The reason? Breakpad (the crash reporter) communicates using Mach ports to send data about the crashing process. Therefore, the only way we noticed this was through user reports, which also meant that we didn’t have an accurate count of the frequency at which this crash was occurring.

The Mach API

The kernel of Mac OS X is called XNU. It’s a hybrid kernel composed of two distinct parts: the BSD interface, used to build the POSIX API; and the Mach interface, which handles processes, virtual memory, scheduling, and threading. Apple considers the BSD interface to be higher on the system stack and a more stable API, with Mach being the implementation details behind it (though technically both are exposed to userspace and are on roughly equal footing). In Mach parlance, a task is an address space with at least one thread of execution. This is roughly equivalent to a UNIX process, and in fact the BSD proc_t structure holds a task_t pointer to represent this relationship between the two parts of the kernel.

In BSD, communication with the kernel happens using a syscall interface. Mach is designed as a microkernel, in which IPC plays a large role. So, to communicate with the Mach portion of the kernel, you use Mach IPC in the form of Mach messages. This IPC mechanism is used to implement a remote procedure call system, in which ports are used to represent objects in the kernel or other processes, to which a client can send messages to have services performed on its behalf.

In the context of the bug, the Mach API exposes a helpful function called mach_port_names. A call to this function returns two arrays containing the “names” of the ports and their “types.” A name is a task-relative integer address, that is to say, a sending end of a port will have a different name than that of the receiving end in a different (or the same) process. A port type is a bitmask of the rights that are held on the port, which essentially means whether the holder can send or receive. The type can contain some other information, which I won’t detail here, but for which the header files are a good reference. This function is how Activity Monitor counts the number of “Ports” a process has open, a discovery which confirmed that Chrome was in fact leaking Mach ports.
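To make the “type is a bitmask” idea concrete, here is a minimal sketch in C that counts how many entries in a types array carry a given set of rights. The two SKETCH_ constants are local stand-ins that, to my understanding, match MACH_PORT_TYPE_SEND and MACH_PORT_TYPE_RECEIVE in the Mach headers; check the headers before relying on the values. The function name is hypothetical and the array would normally come from mach_port_names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the send and receive type bits. They are assumed to
 * mirror the values in <mach/port.h>; verify against the header. */
#define SKETCH_TYPE_SEND    (UINT32_C(1) << 16)
#define SKETCH_TYPE_RECEIVE (UINT32_C(1) << 17)

/* Counts the entries in `types` that include every bit set in `wanted`. */
size_t count_ports_with_rights(const uint32_t *types, size_t count, uint32_t wanted) {
    assert(types != NULL || count == 0);
    size_t matches = 0;
    for (size_t i = 0; i < count; ++i) {
        if ((types[i] & wanted) == wanted) {
            ++matches;
        }
    }
    return matches;
}
```

Passing SKETCH_TYPE_SEND counts every name on which the task can send, including names that also hold the receive right.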

I used this API to create a program called port_tracer, which iterates over all the processes on the system and collects a list of all the port names. It then iterates over the list of ports and calls mach_port_kobject, which returns the address of the corresponding kernel object. I believed this kernel address would be able to connect the two task-relative addresses/port names into an address that links two ends of a port. It turns out, though, that this was incorrect. After reading xnu/osfmk/ipc/mach_port.c, it became clear as to why: this just returns the address in the kernel of the port object, not the object that connects ports.

Enter the Kernel

At this point in the investigation, it became clear to me that the kernel does not expose the information that links two ends of a port together. This link would be used to find the other side of the ports Chrome leaks. As discussed earlier, there are two parts of the kernel: the Mach part and the BSD part. To create a kernel extension to service userspace requests in Mach, you’d define a Mach Interface Generator (MIG) .defs file. This gets processed by the MIG command and creates C stubs that are called by the kernel IPC trampolines. In the BSD part, you’d create a new sysctl, which is an expansion on the standard syscall interface. Because I had experience writing an ioctl on Linux (a very similar interface) and had never used MIG before, I decided to go the latter route. Apple has a document called Boundary Crossings, which is a pretty helpful resource that explains these subsystems.

Writing a sysctl

The sysctl function (man 3 sysctl) traps into the kernel and executes some service request, returning the result to userspace. The function signature is:

int sysctl(int *name, u_int namelen,
           void *oldp, size_t *oldlenp,
           void *newp, size_t newlen);

The first component is the “name” of the sysctl, which is in the format of a Management Information Base (MIB). A MIB is an integer array that represents nodes in a tree structure. There are multiple top-level nodes including kernel, debug, vm, hw, etc. Nodes below this are either leaves or intermediate nodes to descend into a sub-category. The “old” and “new” parameters are used to either get or set values in the kernel. For example, you can change various configuration options like the hostname or maximum number of processes using this interface. A concrete example is to use it to get a list of all the processes running on the system, like so:

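int mib[] = { CTL_KERN, KERN_PROC, KERN_PROC_ALL };
size_t buf_size = 0;
// First call: pass a NULL buffer to learn the required size in bytes.
int err = sysctl(mib, 3 /* len of mib */, NULL, &buf_size, NULL, 0);

// Second call: fetch the process list into a buffer of that size.
std::vector<kinfo_proc> buf(buf_size / sizeof(kinfo_proc));
err = sysctl(mib, 3, buf.data(), &buf_size, NULL, 0);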

The MIB targets the kernel category, then information about processes, then all processes. Instead of KERN_PROC_ALL, you could use KERN_PROC_UID and append a user ID to the MIB to get all processes owned by a user. We make the sysctl call twice: the first time we only get the size of the buffer in bytes, from which the number of returned items follows. After that, we can allocate a buffer in userspace, into which the kernel will copy the result buffer from kernelspace.

The sysctl I wrote to debug the problem at hand needs to take two parameters in the MIB: the process ID for which the list of ports will be returned, and a bitmask that indicates the port rights/types that the caller is interested in. In order to do that, you define the sysctl as a CTLTYPE_NODE, indicating to the kernel that subnodes (parameters) are expected in the MIB. The format of the MIB would be:

{ CTL_DEBUG, /* my parent node */, /* process id */, ~0 /* all rights */ }

Because kexts are dynamically loaded and unloaded, unlike the system-provided MIBs, the MIB defined in the kext will not have a constant number. To look up the number, you use sysctlnametomib, which takes the string name (as defined in the sysctl initializer) and writes out the MIB array.
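A rough sketch of that lookup in C follows. The helper name build_port_mib and the node name in the usage note are hypothetical, not taken from the attached code. The real lookup is only compiled on Apple platforms; elsewhere the helper simply reports failure so the sketch still builds.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#if defined(__APPLE__)
#include <sys/sysctl.h>
#endif

/* Resolves `node_name` to its MIB, then appends the process ID and the rights
 * mask as the two trailing parameters. Returns the total MIB length, or 0 if
 * the name cannot be resolved or `mib` is too small. */
size_t build_port_mib(const char *node_name, pid_t pid, int rights_mask,
                      int *mib, size_t capacity) {
    assert(node_name != NULL && mib != NULL);
#if defined(__APPLE__)
    size_t len = capacity;  /* in: slots available, out: slots filled */
    if (sysctlnametomib(node_name, mib, &len) != 0) {
        return 0;
    }
    if (len + 2 > capacity) {
        return 0;
    }
    mib[len++] = (int)pid;
    mib[len++] = rights_mask;
    return len;
#else
    /* No sysctlnametomib here; treat every name as unresolved. */
    (void)pid; (void)rights_mask; (void)capacity;
    return 0;
#endif
}
```

A caller would pass something like build_port_mib("debug.some_kext_node", pid, ~0, mib, 8) and hand the resulting array and length to sysctl. A return value of 0 usually means the kext is not loaded.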

To create a sysctl, start with the kernel extension Xcode project, and use the SYSCTL_PROC macro to define a new sysctl. The document linked above (“Boundary Crossings”) explains this in more detail. You can also refer to the code attached to this essay for the exact incantation. What this macro does is define a struct that holds a function pointer to the sysctl handler, and it generates the sysctl number for the given named string.

The sysctl handler’s function signature is:

// The actual signature is partly expanded by the SYSCTL_HANDLER_ARGS macro.
int SysctlHandlerFunc(struct sysctl_oid *oidp,
                      void *arg1, int arg2,
                      struct sysctl_req *req);

The oidp is a pointer to the structure that’s defined by the SYSCTL_PROC macro. arg1 is the MIB and arg2 is its length. The req parameter is the structure that holds data about this specific sysctl invocation, namely the oldp and newp parameters to sysctl in userspace.

MACH_KERNEL_PRIVATE

The hardest part of this investigation was dealing with various ifdefs of KERNEL, XNU_KERNEL_PRIVATE, MACH_KERNEL_PRIVATE, etc. The headers provided to your kext by Kernel.framework are preprocessed, so a lot of functionality is not exposed except to the core kernel, of which kexts are not a part. These become problematic when trying to access two things: contents of structs and functions defined as kernel-private.

The latter is in most cases impossible to work around. Your kext is linked against /System/Library/Frameworks/Kernel.framework, which only exposes symbols that Apple considers to be part of the stable KPI (Kernel Programming Interface). Thus, even if you forward-declare kernel private symbols, your kext will fail to link because those symbols are not exported to the libraries against which your kext is linking. (Technically it could be possible to call non-exported symbols in the kernel, but that’s an entirely different rabbit hole.)

Structs, on the other hand, are merely human-friendly offsets into a region of memory. Their definition and layout can be shamelessly copied from the XNU open source headers into your kext’s project so that you can access fields in kernel private structures. As it turns out, virtually every structure within the kernel is designed to be opaque to a kext. Apple decided to do this so that they can freely change the kernel structures, but it also makes writing a debugging tool like this a little harder. To do so you need to edit the headers so they compile in your project through a process I call “munging.”

What does it mean to “munge” a header? First, copy the headers into your project and remove the #ifdefs to expose the structs you need. Then, you need to clean up #includes, removing any includes that are only part of the kernel core and aren’t needed to compile the file. You may come across certain kernel-private files that are dependencies for the header you need, which will mean performing this munging recursively. For this kext, this primarily applied to the kernel locking primitives in xnu/osfmk/i386/locks.h. I did this by importing the headers I needed into my project and attempting to compile the kext until I could clear all the errors. Doing this header munging essentially ties you to the binary interface of a specific kernel version that is liable to change between OS releases. The only reason I did it here is because I was debugging an issue on a specific OS version. You should never ship a kext that relies on kernel private interfaces.

For this kext, I needed to specifically expose:
proc_t to get the task_t
task_t to get the ipc_space_t
ipc_space_t to get the array of ipc_entry_t
ipc_entry_t to get the ipc_port_t

as well as a handful of supporting files that contained things like locking primitives and return codes.

The Magic Number

As I mentioned above, the goal of this entire exercise is to figure out which process owns the other side of the ports that Chrome is leaking. To do this, I needed to turn the task-relative address of the port objects into a single address that connects the two. Inside the kernel, each task_t holds an ipc_space_t. An IPC space is another type of address space (the primary one being a virtual address space in the VM subsystem) that holds all the ports in a task.

Inside a space, there are two structures for storing ports: the first is an array and free-list, the second is a splay tree. I found the splay tree to be seldom used, so I’m not going to cover it in depth here (nor did my debugging tool iterate over it). The array is the more interesting piece: the first element is always considered NULL and contains a pointer to the next free entry in the array. The array stores ipc_entry_t objects, which is a structure that contains the field ie_object, which is a pointer to an ipc_object_t. If the ie_object is NULL, the element is considered free. The free list pointers are fixed up when adding and removing entries in the table.
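The array-plus-free-list arrangement can be modeled in a few lines of C. This is a simplified model, not the XNU code: struct sketch_entry is a stand-in for ipc_entry_t with only two fields, the table has a fixed size, and all names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8

/* Simplified stand-in for ipc_entry_t; the real structure has more fields. */
struct sketch_entry {
    void  *object;  /* NULL means the slot is free */
    size_t next;    /* index of the next free slot; 0 ends the list */
};

/* Marks every slot free and chains slots 1..N-1 behind slot 0, the list head. */
void table_init(struct sketch_entry *table) {
    for (size_t i = 0; i < TABLE_SIZE; ++i) {
        table[i].object = NULL;
        table[i].next = (i + 1 < TABLE_SIZE) ? i + 1 : 0;
    }
}

/* Takes the first free slot, stores `object` there, and returns its index.
 * Returns 0 (the reserved slot) when the table is full. */
size_t table_alloc(struct sketch_entry *table, void *object) {
    assert(object != NULL);
    size_t index = table[0].next;
    if (index == 0) {
        return 0;
    }
    table[0].next = table[index].next;  /* unlink the slot from the free list */
    table[index].object = object;
    table[index].next = 0;
    return index;
}

/* Releases a slot by pushing it onto the front of the free list. */
void table_free(struct sketch_entry *table, size_t index) {
    assert(index > 0 && index < TABLE_SIZE && table[index].object != NULL);
    table[index].object = NULL;
    table[index].next = table[0].next;
    table[0].next = index;
}
```

Walking the table to list live ports is then a matter of visiting indices 1 and up and skipping any slot whose object is NULL, which is essentially what a debugging tool needs to do.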

The IPC subsystem makes use of object-oriented programming in C, and the ipc_object_t is the “base class” for most objects in the system. In this case, the ie_object stored in the ipc_space_t is an ipc_port_t. A quick down-cast later and we have the kernel data structure responsible for one endpoint of communication (the userspace equivalent is a mach_port_t).
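The “base class” pattern and the down-cast can be sketched as follows. The struct and field names here are invented and much smaller than the real ipc_object_t and ipc_port_t; the only point is that the base struct sits at the start of the derived one, so a pointer to one can be converted to a pointer to the other.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_kind { SKETCH_KIND_PORT = 1, SKETCH_KIND_PORT_SET = 2 };

/* Stand-in for the common "base class" header. */
struct sketch_object {
    enum sketch_kind kind;
    unsigned int references;
};

/* Stand-in for a port: the base must stay the first member so that a pointer
 * to the port and a pointer to its base refer to the same address. */
struct sketch_port {
    struct sketch_object base;
    unsigned int message_count;
};

/* Down-casts an object to a port, or returns NULL if it is not one. */
struct sketch_port *sketch_object_to_port(struct sketch_object *object) {
    assert(object != NULL);
    if (object->kind != SKETCH_KIND_PORT) {
        return NULL;
    }
    return (struct sketch_port *)object;
}
```

Checking the kind before casting is a precaution added for this sketch; it keeps a port set from being misread as a port.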

Now that we have that data structure in hand, it’s time for some critical reasoning: how are two ports connected? The kernel isn’t heavily commented and its internals aren’t documented, so you learn by tracing code by hand. The easiest way to find out how two ports are connected is to see what happens when a message is sent between them. For that, you trace mach_msg_send (the userspace call) down to ipc_kmsg_send, which is called after the Mach message is translated from userspace into kernelspace. It eventually places the message on the receiving port’s message queue, on which the receiver blocks via mach_msg_trap in userspace. That last sentence is key: a sending port places outgoing messages onto a message queue – the same queue from which the receiving port takes messages out. Eureka! The message queue is the object that links together all the send ports and the receive port (remember that there can be N senders but only one receiver).

The address of the ipc_mqueue_t is what we’ve been looking for. We can compare these addresses between processes to see who holds the end of the port we’re leaking. All that needs to be done is allocate an array to hold all these addresses and send it across the kernel boundary back to userspace, for use in the port_tracer program.
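On the userspace side, the comparison amounts to intersecting two arrays of addresses. The following is a minimal sketch of that step, assuming the queue addresses have already been fetched for each process; the function name is hypothetical and the actual port_tracer code may be organized differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts the entries in `target` whose message queue address also appears in
 * `other`. Each matching entry means the two processes hold ends of the same
 * port. The scan is quadratic, which is acceptable for a few thousand ports. */
size_t count_shared_queues(const uint64_t *target, size_t target_count,
                           const uint64_t *other, size_t other_count) {
    assert(target != NULL || target_count == 0);
    assert(other != NULL || other_count == 0);
    size_t shared = 0;
    for (size_t i = 0; i < target_count; ++i) {
        for (size_t j = 0; j < other_count; ++j) {
            if (target[i] == other[j]) {
                ++shared;
                break;  /* count each target entry at most once */
            }
        }
    }
    return shared;
}
```

Running this for the target against every other process gives a per-process count of shared ports, and an unusually high count points at the process holding the other ends.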

Interlude: Allocating Memory in the Kernel

In order to send an array of arbitrary length across the kernel boundary, a dynamic memory allocation needs to be made. In userspace, this would be done with a call to the malloc family of functions. In the kernel, though, there are a myriad of ways to allocate memory. However, most of these are protected with MACH_KERNEL_PRIVATE, rendering them unusable in the context of a kernel extension. I ultimately found the right allocator for a kext, but it took examining each to find it. Below is an explanation of various kernelspace allocators.

kalloc: This is the allocator for the BSD part of the kernel. It’s very similar to malloc in terms of usage and it allocates a contiguous block of memory. It is the allocator upon which most other allocators are built.
vm_allocate: The Mach part of the kernel allocates memory through its virtual memory subsystem in terms of pages. The VM system could be an entirely different essay, so I’m not going to dive into it here. Most userspace tasks allocate memory either directly or indirectly through this allocator.
_MALLOC: Another BSD-style allocator. This is just another wrapper around kalloc, but provides more options to the interface, such as nonblocking allocation and a “types” system that allocates memory for specific kernel subsystems.
OSMalloc: The first of the memory allocators available to a kext. This allocator has some unusual semantics. In order to allocate memory, you first must create an OSMallocTag, which is used to reference count and identify blocks of memory that are allocated to it. Once you have a tag, you can call OSMalloc with it and a size to get a block of memory. It doesn’t appear that deallocating a tag deallocates its corresponding memory, so I’m not really sure why the API was designed this way.
mac_kalloc: The final allocator worth looking at is mac_kalloc, which is part of the security module of XNU. It provides plain wrappers around the kalloc allocator family and essentially just exposes kernel-private interfaces. At the top of xnu/security/mac_alloc.c, there’s a comment that reads: “We should probably make sure only registered policies can call these, otherwise we’re effectively changing Apple’s plan wrt exported allocators.” Because of that, I decided that this allocator should probably not be used.

Putting It All Together

With all the pieces in hand, it’s just a matter of putting them all together. To recap: the key piece of data that links two ends of a Mach port is the kernel message queue, whose address is now accessible in the kext. The return buffer has been allocated using the right kernelspace allocator (OSMalloc). The sysctl has been registered so that the userspace debugging program can get access to the array of message queue addresses. And the userspace program has been written to iterate over all processes, finding ports in common with some target process.

The last piece was to change the port_tracer program to use the sysctl I wrote instead of mach_port_names(). After running the program and looking at the initial data, no other process stood out as having an inordinate number of ports shared with the Chrome browser process, which was clearly leaking ports. I ran port_tracer with Safari as the target and received similar output, so it started to look like either my data was incorrect or Chrome itself was leaking Mach ports.

In port_tracer I included an if statement to exclude from the list of output any ports held by the target process. After removing that if, it became clear that Chrome was in fact the process leaking ports. Earlier in the week I had audited all instances of Mach calls in the code base, including two to mach_thread_self(). This call is very similar to mach_task_self(), but instead of returning the task port, it returns the thread port (in Mac OS X, userspace threads are mapped directly to kernel threads). On each call, though, mach_thread_self() will also obtain a send right on the port, which must be deallocated manually. The mach_task_self() function does not have this requirement. In terms of API design, this is a little confusing because two similarly named functions have very different ownership semantics. The reason for this is an odd implementation detail:

In xnu/libsyscall/mach/headers/mach_init.h, Apple #defines mach_task_self() to be the extern’d global mach_task_self_, a cached copy of the task port. Because it’s cached, the send right is already held and does not get incremented when you invoke the macro. If you were to call task_self_trap(), though, a call into the kernel would occur and a new Mach port with a send right would be returned. Because mach_thread_self() is not cached in this manner, you need to release the send right with a call to mach_port_deallocate().
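The ownership rule can be illustrated with a short C sketch of a thread-identifier helper that balances the send right. This is one plausible shape for such a helper, not necessarily the code that went into Chrome; the function name is hypothetical, and on non-Apple platforms it returns a placeholder so the sketch still compiles.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__APPLE__)
#include <mach/mach.h>
#endif

/* Returns an integer identifying the current thread without leaking a send
 * right on the thread port. */
uint64_t current_thread_id(void) {
#if defined(__APPLE__)
    /* mach_thread_self() hands back the thread port with a send right that
     * the caller owns. */
    mach_port_t self = mach_thread_self();
    /* Give that right back right away; only the port name is used as an ID.
     * mach_task_self() is the cached task port and needs no cleanup. */
    kern_return_t kr = mach_port_deallocate(mach_task_self(), self);
    assert(kr == KERN_SUCCESS);
    (void)kr;
    return (uint64_t)self;
#else
    /* Placeholder for platforms without Mach; not a real thread ID. */
    return 1;
#endif
}
```

Without the mach_port_deallocate call, every invocation would leave another reference behind, which is exactly the kind of slow leak that only shows up once a hot code path starts calling the helper.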

During my audit, I noticed that we were not releasing this send right from mach_thread_self(), so I made a change to do so. It hadn’t occurred to me that I might have already fixed the problem when I started this project. I sent a copy of the kext and port_tracer to a colleague who was able to reproduce the crash and port leak very easily. I had him try the Chrome Canary (nightly builds) from the day before my change and the day after my change. The results were conclusive after testing those two versions: Chrome’s port usage dropped significantly, specifically the number of ports with send rights held by the browser process. As the change got pushed to the dev channel, users reported that they were no longer seeing the crash, and victory was declared.

Chrome leaked ports like this because mach_thread_self() is used to implement the PlatformThread::CurrentId() function in Chrome, returning a semi-unique identifier for the current thread. It turns out that this is called all over the place. And I had actually introduced an unrelated change in November that called CurrentId() in a core piece of Chrome code, which is why the uptick in port usage started in Chrome 17 and not before.

With lessons now learned about Mach and XNU, I decided it’d be wise to add a performance test to Chrome’s continuous integration system to count the number of open Mach ports. That way, if a change introduces a port leakage, it will hopefully be easy to spot.

This entire investigation took a little over a week working full-time. A good reference I used is Amit Singh’s Mac OS X Internals, which is a bit dated for some things but is still quite useful. I hope you’ve found this essay and attached code interesting and helpful, should you ever find yourself in a sticky situation needing to do deep digging inside the OS X kernel.

The Code

The code is available as a ZIP file. Inside is a README that explains the various components and how to use them.

Debugging Mach Ports

25 January 2012 by Robert Sesek