I have recently been thinking about ways to optimize a workload that makes heavy use of a WebAssembly runtime with JIT code generation capabilities. The runtime implementation uses a fairly typical construction for a JIT-based engine; the workload, however, uses the runtime in a somewhat unusual manner. Every time a function within a module is invoked, it runs within a fresh instance. Furthermore, the invocations are user-generated and involve a reasonably randomized set of WebAssembly modules, precluding the use of a cache which would otherwise help significantly in this sort of situation. This workload definitely poses a fair share of interesting challenges to beat!

To give a brief overview of the workload and the runtime’s construction:

  1. A random blob of WebAssembly code is transmitted to the machine running this workload;
  2. The blob is parsed, validated, and machine code is generated for the module;
  3. The machine code is loaded into a region of memory;
  4. Relocations are applied and functions provided by the embedder are linked in;
  5. The memory region is made executable;
  6. An instance of the module is created and the code within it is executed.

Machine code generation is by far the most expensive step here. Fortunately, given a large enough disk, a fair amount of the generated machine code can be cached, and this cache helps throughput tremendously. The same cannot be said about the next most expensive step in this workload – loading, relocation and linking. There isn’t a machine with enough memory for a cache of linked machine code to be effective. That calls for making the operation itself faster. Much faster.

Naïveté

In order to load some executable code into memory, a typical runtime implementation will request some writeable memory pages using the mmap (or VirtualAlloc on Windows) system call, copy the executable code over, apply any fix-ups (relocation, linking) and then adjust the mapping such that its contents become executable. Once the module is discarded, the mapping is returned to the kernel for recycling.

The runtime I am working on is no different. Here’s a rough sketch of what a code loading routine looks like. This code is written in the form of a benchmark that can easily be profiled with criterion’s --profile-time flag:

benchmark.iter(|| {
    let map_size = ceil_to_page(module_size);
    let map = io::mmap_anonymous(
        ptr::null_mut(),
        map_size,
        ProtFlags::READ | ProtFlags::WRITE,
        MapFlags::PRIVATE,
    )?;
    write_module(map, module_size);
    io::mprotect(
        map,
        map_size,
        MprotectFlags::READ | MprotectFlags::EXEC
    )?;
    execute_module(map, module_size);
    io::munmap(map, map_size)?;
    Ok(())
})

I used the rustix crate to interface with the kernel here. The APIs exposed by this library largely mirror the C APIs for these functionalities, with some minor conveniences. The write_module function models the process of copying the machine code from disk to memory, applying relocations, linking and such. For the purposes of this benchmark it will only write a function return sequence at the beginning of each page. execute_module then takes the populated mapping and calls the function at the beginning of each page once. If you would like to follow along, the implementation of these functions, as well as an easily runnable benchmark suite, can be found on GitHub1.
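To give an idea of what these two helpers model, here is a minimal sketch – assuming x86_64, where 0xC3 encodes ret, and a 4 KiB page size – rather than the exact code from the repository:

// Minimal sketch of the benchmark helpers. Assumes x86_64 (0xC3 is `ret`)
// and 4 KiB pages; the real implementations live in the linked repository.
const PAGE_SIZE: usize = 4096;

unsafe fn write_module(map: *mut std::ffi::c_void, module_size: usize) {
    // Model the "copy + relocate + link" step: plant a bare `ret` at the
    // beginning of every page.
    for offset in (0..module_size).step_by(PAGE_SIZE) {
        (map as *mut u8).add(offset).write(0xC3);
    }
}

unsafe fn execute_module(map: *const std::ffi::c_void, module_size: usize) {
    // Call the one-instruction "function" at the beginning of every page once.
    for offset in (0..module_size).step_by(PAGE_SIZE) {
        let f: extern "C" fn() = std::mem::transmute((map as *const u8).add(offset));
        f();
    }
}

Let’s run the benchmark under perf and see what the hot spots look like: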

Flamegraph for the naive benchmark2

I had anticipated seeing a lot of time being spent in kernel land – the model is invoking the mmap, mprotect and munmap system calls in a pretty tight loop here. What I did not expect was that 30% of the time would be spent in the kernel as a result of the write_module function, which does not call into the kernel in any way. Out of curiosity, and as a sanity check, I also ran a similar benchmark on other UNIX-like systems such as FreeBSD and OpenBSD. The general shape of the flamegraph was reproducible there just as well.

Reduce, [Reuse], Recycle

Let’s construct some hypotheses. First, these are all UNIX-like systems that rely on memory overcommitment to operate well. It is most likely that mmap will not actually cause any physical memory pages to be allocated (faulted in) eagerly, at least not without additional flags. The allocation will only happen when these memory pages are accessed for the first time. Second, the pages returned by mmap are guaranteed to be initialized with every byte set to 0. This initialization needs to happen somewhere, and it is quite plausible that this could also be a part of an explanation why page faults are so prominent3.
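As an aside, eagerly faulting the pages in is something that has to be requested explicitly. With the rustix API used throughout this post, that could look roughly like the following (none of the benchmarks here do this; it is shown only to illustrate the point):

// Aside: MAP_POPULATE asks Linux to fault the pages in up front
// instead of lazily on first access.
let map = io::mmap_anonymous(
    ptr::null_mut(),
    map_size,
    ProtFlags::READ | ProtFlags::WRITE,
    MapFlags::PRIVATE | MapFlags::POPULATE,
)?;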

If these hypotheses turn out to be true, then it must be the case that reusing the old mmap allocation would result in neither of these operations occurring more than once during the entire lifetime of the mapping. The physical memory pages will have been paged in by the write_module call executed during the very first benchmark iteration, and the kernel no longer needs to zero out these pages again! Not to mention, the time spent handling repeated mmap and munmap system calls would also go away. Some adjustments later:

let map_size = ceil_to_page(module_size);
let map = io::mmap_anonymous(
    ptr::null_mut(),
    map_size,
    ProtFlags::empty(),
    MapFlags::PRIVATE
)?;
benchmark.iter(|| {
    io::mprotect(
        map,
        map_size,
        MprotectFlags::READ | MprotectFlags::WRITE
    )?;
    write_module(map, module_size);
    io::mprotect(
        map,
        map_size,
        MprotectFlags::READ | MprotectFlags::EXEC
    )?;
    execute_module(map, module_size);
    Ok(())
});
io::munmap(map, map_size);

With the benchmark iteration time decreasing to less than 50% of the original naive benchmark (from 4.6µs/iteration down to 2.2µs/it4), the improvement is definitely very appreciable. Notwithstanding, a significant chunk of the time is still being spent inside the kernel. mprotect is one thing – no work has gone into improving it yet – but why is write_module still causing such a significant number of page faults?

Flamegraph for the reuse benchmark

CoW-culator

A long look at the Linux source code seemed to suggest that every time this mapping was made writable by the mprotect call, the kernel would walk through all the affected pages and mark them to be copied out into a new page the next time there’s a write to the page. This mechanism is widely known as Copy-on-Write, abbreviated as CoW, and is a crucial piece of what makes operations such as fork(2) fast.

However, in our case we’re holding the only reference to the physical memory region, and the kernel knows that. Each kernel page has an associated reference count, which is necessary to ensure that pages shared between virtual mappings aren’t recycled for a different mapping while still in use. My intuition suggests that all of the CoW overhead observed in the benchmark above could be entirely avoided if only the kernel had consulted that reference count. There must be a good reason why it didn’t, but it is beyond me to find out what that reason might be. Whatever the case may be, I need my stuff to go fast last week, not next kernel LTS release.

Taking stock of the learnings so far: the benchmark is now reusing some resources, but one of the resources we were hoping to reuse is still being recycled. This recycling takes the form of pages being copied around on each iteration, severely impacting throughput. Taking some guesses: POSIX systems provide a mechanism, mlock, to ensure pages stay allocated in physical memory and are not moved to swap or otherwise paged out. mmap exposes this mechanism directly via the LOCKED flag as well. Maybe this would help? The benchmark code is largely the same as in the reuse case, with this small change to the mmap flags:

 let map = io::mmap_anonymous(
     ptr::null_mut(),
     map_size,
     ProtFlags::empty(),
-    MapFlags::PRIVATE
+    MapFlags::PRIVATE | MapFlags::LOCKED // ← here
 )?;

Running the adapted benchmark reveals a decrease in iteration time from 2.2µs/it down to 1.9µs/it. More importantly, lo and behold, page faults are no more! Yay! Spoiler alert: the kernel is still recycling the pages, even if it is doing so faster in this neatly tucked away call to populate_vma_page_range. Yikes.

Flamegraph for the reuse_lock benchmark

Write, Read and Execute

At this point, establishing the absolute fastest implementation that could possibly be had would be nice, if only to learn just how much overhead is still waiting to be eliminated. A mapping that is readable, writeable and executable all at once would likely fit the bill. No mprotect calls between write_module and execute_module, the mapping is still reused and there are no apparent cracks for the party-ruining kernel to squeeze through:

let map_size = ceil_to_page(module_size);
let map = io::mmap_anonymous(
    ptr::null_mut(),
    map_size,
    ProtFlags::READ | ProtFlags::WRITE | ProtFlags::EXEC,
    MapFlags::PRIVATE,
)?;
benchmark.iter(|| {
    write_module(map, module_size);
    execute_module(map, module_size);
});
io::munmap(map, map_size);

At 100 nanoseconds per iteration, more than 15 times faster than reuse_lock, this benchmark spends the majority of its time doing the Actually Useful Work. This is so fast that the overheads from criterion become quite prominent in the flamegraph, with the actual benchmark representing just 80% of the samples.

Flamegraph for the rwx benchmark

Time for trade-offs! Read-write-execute mappings have a fatal flaw that makes them unusable as-is. Maintaining an invariant that all pages in the process are either writable or executable, and never both, is an important and very potent defense-in-depth strategy. This mitigation can single-handedly prevent an attacker’s memory write gadgets from quickly escalating into trivial arbitrary code execution exploits. With all pages either writable or executable, an attacker may be able to write all the arbitrary code they want, but the large majority of those writes will never get executed no matter what. This mitigation is so useful and prevalent that some major computer hardware vendors figured it was prime time to enforce W^X at the hardware level. Even if RWX pages weren’t a scary enough concept by themselves, sacrificing support for hardware with a significant market share definitely makes this an uninteresting option.

HugeTLB

The review of the kernel code did reveal another interesting tidbit. Handling of huge pages (MAP_HUGETLB) branches out into an entirely different module. The code along this other path seemed different enough to suggest it might be worth trying out. Having fewer pages to slog through cannot hurt either way, so I felt compelled to find out what it would take to get some of these sweet and fancy pages. Quite a lot, it turns out.

In order to use huge pages in a userspace process, the huge page pool must be set up and filled with some of these pages first. While there is a mechanism to do so at any time – by writing the desired number of huge pages to the vm.nr_hugepages sysctl – this is not guaranteed to succeed at all. If the machine has been running for long enough to significantly fragment its memory, this allocation can and will fail, leaving vm.nr_hugepages at 0. If the stars align just right, it might even succeed in part, with vm.nr_hugepages becoming anything between 0 and the requested number. The only way to set up the pool reliably turned out to involve adjusting the kernel’s boot parameters and rebooting.
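Because of that, it may be prudent to check the pool before even attempting a MAP_HUGETLB mapping. A minimal sketch of such a check – huge_page_pool_size is a hypothetical helper, reading the same knob the vm.nr_hugepages sysctl controls through procfs:

// Hypothetical pre-flight check: read the current size of the global huge
// page pool (the knob the vm.nr_hugepages sysctl controls) via procfs.
fn huge_page_pool_size() -> std::io::Result<u64> {
    let contents = std::fs::read_to_string("/proc/sys/vm/nr_hugepages")?;
    Ok(contents.trim().parse().unwrap_or(0))
}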

It is important to note that this pool is also a global resource shared across all processes on the system. Heavy reliance on huge pages can expose an application to weird, non-deterministic and difficult to reproduce performance problems that only show up in scenarios where this resource is exhausted5. Well… trying it out won’t hurt:

let map_size = ceil_to_page(module_size, 2 * 1024 * 1024);
let map = io::mmap_anonymous(
    ptr::null_mut(),
    map_size,
    ProtFlags::empty(),
    MapFlags::PRIVATE | MapFlags::HUGETLB | MapFlags::HUGE_2MB,
)?;
benchmark.iter(|| {
    io::mprotect(
        map,
        map_size,
        MprotectFlags::READ | MprotectFlags::WRITE
    )?;
    write_module(map, module_size);
    io::mprotect(
        map,
        map_size,
        MprotectFlags::READ | MprotectFlags::EXEC
    )?;
    execute_module(map, module_size);
    Ok(())
});
io::munmap(map, map_size);

Running this benchmark yields a significant improvement over the reuse_lock benchmark, with the iteration time decreasing to 1.2µs/it. Unfortunately, this benchmark also marks the second coming of the write_module page faults. Any attempt to combine the MapFlags::LOCKED and MapFlags::HUGETLB flags would result in a deluge of undocumented EWOULDBLOCK errors for reasons I still don’t understand. The complex setup required to use HugeTLB makes this approach inapplicable for most use-cases where a JIT engine is employed – including my own – so digging any further didn’t seem worthwhile. I’ll spare you yet another flamegraph for this one.

What was that about privacy?

There is actually a sanctioned way to entirely circumvent the copy-on-write semantics that I had missed despite reading through the mmap(2) page multiple times over – the SHARED mapping type. So far all of the benchmarks have utilized the PRIVATE mapping type. The purpose of this type is to ensure that any changes to the mapped memory or file do not escape the specific mapping through which those changes were introduced. For JIT purposes this is an important property – things would Crash and Burn and Explode if perchance something else overwrote code that was being executed by the process running The Workload. The copy-on-write semantics ensure that this cannot happen, even in the off chance the process decides to self-identify as a prokaryote and fork itself. If fork is of no concern, a SHARED mapping is definitely the type to consider. Replacing the mmap call in one of the benchmarks above as such…

 let map = io::mmap_anonymous(
     ptr::null_mut(),
     map_size,
     ProtFlags::empty(),
-    MapFlags::PRIVATE,
+    MapFlags::SHARED, // ← here
 );

…and running the adapted benchmark reveals a sizable improvement in throughput, with each iteration taking ~1.2µs. While this code is still spending a significant amount of time in the kernel, all of that work seems to be doing something useful – flipping permissions back and forth, flushing the TLBs and, most importantly, none of the stuff it shouldn’t be doing.

Flamegraph for the reuse_shared benchmark

Pseudo read-write-execute maps

We’ve already established that it is possible to go even faster! More than 10 times faster, in fact. It turns out that by combining the worst parts of the other approaches, we can construct an implementation that works on the shiny new Apple systems, does not need to deal with heavily limited global resource pools and is as fast as a proper read-write-execute mapping, all without exposing an attack surface as large as the naive read-write-execute option.

In addition to disabling the CoW semantics, SHARED mmaps also allow the creation of multiple mappings to the same underlying region of memory. If we cannot have the blazing-fast RWX mappings, maybe it would be possible to set up two mappings, where one is readable and executable, while the other is writable? If that doesn’t sound promising, I don’t know what does!

Here is what is necessary for a shared mapping: a file descriptor representing the region of memory we want to create a mapping to. And… that’s it! On Linux a memfd can serve as a straightforward mechanism to come up with such a file descriptor, but shm_open could be used as a reasonably portable alternative. Performance-wise, it probably doesn’t matter which one is used and, as one of the maintainers of memfd-rs, I’m more familiar with memfd. So memfd it is:

let map_size = ceil_to_page(module_size);
let memfd = fs::memfd_create("gotta_go_fast", MemfdFlags::CLOEXEC)?;
fs::ftruncate(&memfd, map_size as u64)?;
let executable_map = io::mmap(
    ptr::null_mut(),
    map_size,
    ProtFlags::READ | ProtFlags::EXEC,
    MapFlags::SHARED,
    &memfd,
    0,
)?;
let write_map = io::mmap(
    ptr::null_mut(),
    map_size,
    ProtFlags::WRITE,
    MapFlags::SHARED,
    &memfd,
    0,
)?;
benchmark.iter(|| {
    write_module(write_map, module_size);
    execute_module(executable_map, module_size);
    Ok(())
});
io::munmap(write_map, map_size);
io::munmap(executable_map, map_size);

As hoped, this implementation hits the 100ns/iteration north star, matching the throughput of the fastest benchmark we have had so far. No significant time is spent in the kernel at all:

Flamegraph for the memfd benchmark

Yes, this implementation is still technically constructing a region of memory that’s readable, writable and executable at the same time, but it actually is not as terrible as it sounds. Think about it this way: the JIT-compiled machine code will typically stay in writable memory for a long time (on computer time scales, anyway) before it is copied over into an executable region of memory. Placing a write-only view into some executable memory at some random offset within the 256TiB of 48-bit address space can’t make that status quo much worse. For the particularly paranoid, re-creating the view at a different random location every couple of executions would likely generate enough chaos in the system to thwart anybody looking to find this mapping in the first place, let alone exploit it.
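For illustration, here is a minimal sketch of what that re-creation could look like, using the same rustix calls as the benchmark above; rotate_write_map is a hypothetical helper, not something the runtime described here actually implements:

// Hypothetical hardening helper: drop the current write-only view and map the
// same memfd again, letting the kernel pick a fresh randomized base address.
unsafe fn rotate_write_map(
    memfd: &OwnedFd,
    old_write_map: *mut c_void,
    map_size: usize,
) -> io::Result<*mut c_void> {
    io::munmap(old_write_map, map_size)?;
    io::mmap(
        ptr::null_mut(), // NULL hint: the kernel chooses a new address, ASLR applies
        map_size,
        ProtFlags::WRITE,
        MapFlags::SHARED,
        memfd,
        0,
    )
}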

Of course, I’m not suggesting that taking such chances would be appropriate in workloads where lives are on the line. Such systems should keep all the airbags they can. For my specific workload where things just need to go fast? Well, a seatbelt may have to do.


  1. Beware: no effort went into making the benchmarks runnable outside x86_64-linux.

  2. Note that the perf_event frames are profiling overhead. I couldn’t find a nice way to get perf to exclude them. Those frames should largely be ignored.

  3. Linux, at least, might already have some mechanism to zero these pages out more eagerly, in the background, with PG_zero. This work does not appear to have made its way to the upstream kernel yet, though.

  4. Benchmarks were executed on a system with a Ryzen 1700 processor, running Linux 5.15, with plenty of available pages.

  5. For what it is worth, there is another global resource related to map permissions – pkeys. These are supposedly much, much faster but do require hardware I don’t have access to, so I won’t be covering them here. You can read about them in the pkeys(7) manual entry.