Reading privileged memory with a side-channel

Google Project Zero - Wed, 01/03/2018 - 17:27
Posted by Jann Horn, Project Zero

We have discovered that CPU data cache timing can be abused to efficiently leak information out of mis-speculated execution, leading to (at worst) arbitrary virtual memory read vulnerabilities across local security boundaries in various contexts.
Variants of this issue are known to affect many modern processors, including certain processors by Intel, AMD and ARM. For a few Intel and AMD CPU models, we have exploits that work against real software. We reported this issue to Intel, AMD and ARM on 2017-06-01 [1].
So far, there are three known variants of the issue:
  • Variant 1: bounds check bypass (CVE-2017-5753)
  • Variant 2: branch target injection (CVE-2017-5715)
  • Variant 3: rogue data cache load (CVE-2017-5754)

Before the issues described here were publicly disclosed, Daniel Gruss, Moritz Lipp, Yuval Yarom, Paul Kocher, Daniel Genkin, Michael Schwarz, Mike Hamburg, Stefan Mangard, Thomas Prescher and Werner Haas also reported them; their [writeups/blogposts/paper drafts] are at:

During the course of our research, we developed the following proofs of concept (PoCs):
  1. A PoC that demonstrates the basic principles behind variant 1 in userspace on the tested Intel Haswell Xeon CPU, the AMD FX CPU, the AMD PRO CPU and an ARM Cortex A57 [2]. This PoC only tests for the ability to read data inside mis-speculated execution within the same process, without crossing any privilege boundaries.
  2. A PoC for variant 1 that, when running with normal user privileges under a modern Linux kernel with a distro-standard config, can perform arbitrary reads in a 4GiB range [3] in kernel virtual memory on the Intel Haswell Xeon CPU. If the kernel's BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time. [4]
  3. A PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific (now outdated) version of Debian's distro kernel [5] running on the host, can read host kernel memory at a rate of around 1500 bytes/second, with room for optimization. Before the attack can be performed, some initialization has to be performed that takes roughly between 10 and 30 minutes for a machine with 64GiB of RAM; the needed time should scale roughly linearly with the amount of host RAM. (If 2MB hugepages are available to the guest, the initialization should be much faster, but that hasn't been tested.)
  4. A PoC for variant 3 that, when running with normal user privileges, can read kernel memory on the Intel Haswell Xeon CPU under some precondition. We believe that this precondition is that the targeted kernel memory is present in the L1D cache.

For interesting resources around this topic, look down into the "Literature" section.
A warning regarding explanations about processor internals in this blogpost: This blogpost contains a lot of speculation about hardware internals based on observed behavior, which might not necessarily correspond to what processors are actually doing.
We have some ideas on possible mitigations and provided some of those ideas to the processor vendors; however, we believe that the processor vendors are in a much better position than we are to design and evaluate mitigations, and we expect them to be the source of authoritative guidance.
The PoC code and the writeups that we sent to the CPU vendors will be made available at a later date.Tested Processors
  • Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz (called "Intel Haswell Xeon CPU" in the rest of this document)
  • AMD FX(tm)-8320 Eight-Core Processor (called "AMD FX CPU" in the rest of this document)
  • AMD PRO A8-9600 R7, 10 COMPUTE CORES 4C+6G (called "AMD PRO CPU" in the rest of this document)
  • An ARM Cortex A57 core of a Google Nexus 5x phone [6] (called "ARM Cortex A57" in the rest of this document)
Glossaryretire: An instruction retires when its results, e.g. register writes and memory writes, are committed and made visible to the rest of the system. Instructions can be executed out of order, but must always retire in order.
logical processor core: A logical processor core is what the operating system sees as a processor core. With hyperthreading enabled, the number of logical cores is a multiple of the number of physical cores.
cached/uncached data: In this blogpost, "uncached" data is data that is only present in main memory, not in any of the cache levels of the CPU. Loading uncached data will typically take over 100 cycles of CPU time.
speculative execution: A processor can execute past a branch without knowing whether it will be taken or where its target is, therefore executing instructions before it is known whether they should be executed. If this speculation turns out to have been incorrect, the CPU can discard the resulting state without architectural effects and continue execution on the correct execution path. Instructions do not retire before it is known that they are on the correct execution path.
mis-speculation window: The time window during which the CPU speculatively executes the wrong code and has not yet detected that mis-speculation has occurred.Variant 1: Bounds check bypassThis section explains the common theory behind all three variants and the theory behind our PoC for variant 1 that, when running in userspace under a Debian distro kernel, can perform arbitrary reads in a 4GiB region of kernel memory in at least the following configurations:
  • Intel Haswell Xeon CPU, eBPF JIT is off (default state)
  • Intel Haswell Xeon CPU, eBPF JIT is on (non-default state)
  • AMD PRO CPU, eBPF JIT is on (non-default state)

The state of the eBPF JIT can be toggled using the net.core.bpf_jit_enable sysctl.Theoretical explanationThe Intel Optimization Reference Manual says the following regarding Sandy Bridge (and later microarchitectural revisions) in section ("Branch Prediction"):
Branch prediction predicts the branch target and enables theprocessor to begin executing instructions long before the branchtrue execution path is known.
In section ("L1 DCache"):
Loads can:[...]
  • Be carried out speculatively, before preceding branches are resolved.
  • Take cache misses out of order and in an overlapped manner.

Intel's Software Developer's Manual [7] states in Volume 3A, section 11.7 ("Implicit Caching (Pentium 4, Intel Xeon, and P6 family processors"):
Implicit caching occurs when a memory element is made potentially cacheable, although the element may never have been accessed in the normal von Neumann sequence. Implicit caching occurs on the P6 and more recent processor families due to aggressive prefetching, branch prediction, and TLB miss handling. Implicit caching is an extension of the behavior of existing Intel386, Intel486, and Pentium processor systems, since software running on these processor families also has not been able to deterministically predict the behavior of instruction prefetch.Consider the code sample below. If arr1->length is uncached, the processor can speculatively load data from arr1->data[untrusted_offset_from_caller]. This is an out-of-bounds read. That should not matter because the processor will effectively roll back the execution state when the branch has executed; none of the speculatively executed instructions will retire (e.g. cause registers etc. to be affected).
struct array {  unsigned long length;  unsigned char data[];};struct array *arr1 = ...;unsigned long untrusted_offset_from_caller = ...;if (untrusted_offset_from_caller < arr1->length) {  unsigned char value = arr1->data[untrusted_offset_from_caller];  ...}However, in the following code sample, there's an issue. If arr1->length, arr2->data[0x200] and arr2->data[0x300] are not cached, but all other accessed data is, and the branch conditions are predicted as true, the processor can do the following speculatively before arr1->length has been loaded and the execution is re-steered:
  • load value = arr1->data[untrusted_offset_from_caller]
  • start a load from a data-dependent offset in arr2->data, loading the corresponding cache line into the L1 cache

struct array {  unsigned long length;  unsigned char data[];};struct array *arr1 = ...; /* small array */struct array *arr2 = ...; /* array of size 0x400 *//* >0x400 (OUT OF BOUNDS!) */unsigned long untrusted_offset_from_caller = ...;if (untrusted_offset_from_caller < arr1->length) {  unsigned char value = arr1->data[untrusted_offset_from_caller];  unsigned long index2 = ((value&1)*0x100)+0x200;  if (index2 < arr2->length) {    unsigned char value2 = arr2->data[index2];  }}
After the execution has been returned to the non-speculative path because the processor has noticed that untrusted_offset_from_caller is bigger than arr1->length, the cache line containing arr2->data[index2] stays in the L1 cache. By measuring the time required to load arr2->data[0x200] and arr2->data[0x300], an attacker can then determine whether the value of index2 during speculative execution was 0x200 or 0x300 - which discloses whether arr1->data[untrusted_offset_from_caller]&1 is 0 or 1.
To be able to actually use this behavior for an attack, an attacker needs to be able to cause the execution of such a vulnerable code pattern in the targeted context with an out-of-bounds index. For this, the vulnerable code pattern must either be present in existing code, or there must be an interpreter or JIT engine that can be used to generate the vulnerable code pattern. So far, we have not actually identified any existing, exploitable instances of the vulnerable code pattern; the PoC for leaking kernel memory using variant 1 uses the eBPF interpreter or the eBPF JIT engine, which are built into the kernel and accessible to normal users.
A minor variant of this could be to instead use an out-of-bounds read to a function pointer to gain control of execution in the mis-speculated path. We did not investigate this variant further.Attacking the kernelThis section describes in more detail how variant 1 can be used to leak Linux kernel memory using the eBPF bytecode interpreter and JIT engine. While there are many interesting potential targets for variant 1 attacks, we chose to attack the Linux in-kernel eBPF JIT/interpreter because it provides more control to the attacker than most other JITs.
The Linux kernel supports eBPF since version 3.18. Unprivileged userspace code can supply bytecode to the kernel that is verified by the kernel and then:
  • either interpreted by an in-kernel bytecode interpreter
  • or translated to native machine code that also runs in kernel context using a JIT engine (which translates individual bytecode instructions without performing any further optimizations)

Execution of the bytecode can be triggered by attaching the eBPF bytecode to a socket as a filter and then sending data through the other end of the socket.
Whether the JIT engine is enabled depends on a run-time configuration setting - but at least on the tested Intel processor, the attack works independent of that setting.
Unlike classic BPF, eBPF has data types like data arrays and function pointer arrays into which eBPF bytecode can index. Therefore, it is possible to create the code pattern described above in the kernel using eBPF bytecode.
eBPF's data arrays are less efficient than its function pointer arrays, so the attack will use the latter where possible.
Both machines on which this was tested have no SMAP, and the PoC relies on that (but it shouldn't be a precondition in principle).
Additionally, at least on the Intel machine on which this was tested, bouncing modified cache lines between cores is slow, apparently because the MESI protocol is used for cache coherence [8]. Changing the reference counter of an eBPF array on one physical CPU core causes the cache line containing the reference counter to be bounced over to that CPU core, making reads of the reference counter on all other CPU cores slow until the changed reference counter has been written back to memory. Because the length and the reference counter of an eBPF array are stored in the same cache line, this also means that changing the reference counter on one physical CPU core causes reads of the eBPF array's length to be slow on other physical CPU cores (intentional false sharing).
The attack uses two eBPF programs. The first one tail-calls through a page-aligned eBPF function pointer array prog_map at a configurable index. In simplified terms, this program is used to determine the address of prog_map by guessing the offset from prog_map to a userspace address and tail-calling through prog_map at the guessed offsets. To cause the branch prediction to predict that the offset is below the length of prog_map, tail calls to an in-bounds index are performed in between. To increase the mis-speculation window, the cache line containing the length of prog_map is bounced to another core. To test whether an offset guess was successful, it can be tested whether the userspace address has been loaded into the cache.
Because such straightforward brute-force guessing of the address would be slow, the following optimization is used: 215 adjacent userspace memory mappings [9], each consisting of 24 pages, are created at the userspace address user_mapping_area, covering a total area of 231 bytes. Each mapping maps the same physical pages, and all mappings are present in the pagetables.

This permits the attack to be carried out in steps of 231 bytes. For each step, after causing an out-of-bounds access through prog_map, only one cache line each from the first 24 pages of user_mapping_area have to be tested for cached memory. Because the L3 cache is physically indexed, any access to a virtual address mapping a physical page will cause all other virtual addresses mapping the same physical page to become cached as well.
When this attack finds a hit—a cached memory location—the upper 33 bits of the kernel address are known (because they can be derived from the address guess at which the hit occurred), and the low 16 bits of the address are also known (from the offset inside user_mapping_area at which the hit was found). The remaining part of the address of user_mapping_area is the middle.

The remaining bits in the middle can be determined by bisecting the remaining address space: Map two physical pages to adjacent ranges of virtual addresses, each virtual address range the size of half of the remaining search space, then determine the remaining address bit-wise.
At this point, a second eBPF program can be used to actually leak data. In pseudocode, this program looks as follows:
uint64_t bitmask = <runtime-configurable>;uint64_t bitshift_selector = <runtime-configurable>;uint64_t prog_array_base_offset = <runtime-configurable>;uint64_t secret_data_offset = <runtime-configurable>;// index will be bounds-checked by the runtime,// but the bounds check will be bypassed speculativelyuint64_t secret_data = bpf_map_read(array=victim_array, index=secret_data_offset);// select a single bit, move it to a specific position, and add the base offsetuint64_t progmap_index = (((secret_data & bitmask) >> bitshift_selector) << 7) + prog_array_base_offset;bpf_tail_call(prog_map, progmap_index);
This program reads 8-byte-aligned 64-bit values from an eBPF data array "victim_map" at a runtime-configurable offset and bitmasks and bit-shifts the value so that one bit is mapped to one of two values that are 27 bytes apart (sufficient to not land in the same or adjacent cache lines when used as an array index). Finally it adds a 64-bit offset, then uses the resulting value as an offset into prog_map for a tail call.
This program can then be used to leak memory by repeatedly calling the eBPF program with an out-of-bounds offset into victim_map that specifies the data to leak and an out-of-bounds offset into prog_map that causes prog_map + offset to point to a userspace memory area. Misleading the branch prediction and bouncing the cache lines works the same way as for the first eBPF program, except that now, the cache line holding the length of victim_map must also be bounced to another core.Variant 2: Branch target injectionThis section describes the theory behind our PoC for variant 2 that, when running with root privileges inside a KVM guest created using virt-manager on the Intel Haswell Xeon CPU, with a specific version of Debian's distro kernel running on the host, can read host kernel memory at a rate of around 1500 bytes/second.BasicsPrior research (see the Literature section at the end) has shown that it is possible for code in separate security contexts to influence each other's branch prediction. So far, this has only been used to infer information about where code is located (in other words, to create interference from the victim to the attacker); however, the basic hypothesis of this attack variant is that it can also be used to redirect execution of code in the victim context (in other words, to create interference from the attacker to the victim; the other way around).

The basic idea for the attack is to target victim code that contains an indirect branch whose target address is loaded from memory and flush the cache line containing the target address out to main memory. Then, when the CPU reaches the indirect branch, it won't know the true destination of the jump, and it won't be able to calculate the true destination until it has finished loading the cache line back into the CPU, which takes a few hundred cycles. Therefore, there is a time window of typically over 100 cycles in which the CPU will speculatively execute instructions based on branch prediction.Haswell branch prediction internalsSome of the internals of the branch prediction implemented by Intel's processors have already been published; however, getting this attack to work properly required significant further experimentation to determine additional details.
This section focuses on the branch prediction internals that were experimentally derived from the Intel Haswell Xeon CPU.
Haswell seems to have multiple branch prediction mechanisms that work very differently:
  • A generic branch predictor that can only store one target per source address; used for all kinds of jumps, like absolute jumps, relative jumps and so on.
  • A specialized indirect call predictor that can store multiple targets per source address; used for indirect calls.
  • (There is also a specialized return predictor, according to Intel's optimization manual, but we haven't analyzed that in detail yet. If this predictor could be used to reliably dump out some of the call stack through which a VM was entered, that would be very interesting.)
Generic predictorThe generic branch predictor, as documented in prior research, only uses the lower 31 bits of the address of the last byte of the source instruction for its prediction. If, for example, a branch target buffer (BTB) entry exists for a jump from 0x4141.0004.1000 to 0x4141.0004.5123, the generic predictor will also use it to predict a jump from 0x4242.0004.1000. When the higher bits of the source address differ like this, the higher bits of the predicted destination change together with it—in this case, the predicted destination address will be 0x4242.0004.5123—so apparently this predictor doesn't store the full, absolute destination address.
Before the lower 31 bits of the source address are used to look up a BTB entry, they are folded together using XOR. Specifically, the following bits are folded together:
bit Abit B0x40.00000x20000x80.00000x40000x100.00000x80000x200.00000x1.00000x400.00000x2.00000x800.00000x4.00000x2000.00000x10.00000x4000.00000x20.0000
In other words, if a source address is XORed with both numbers in a row of this table, the branch predictor will not be able to distinguish the resulting address from the original source address when performing a lookup. For example, the branch predictor is able to distinguish source addresses 0x100.0000 and 0x180.0000, and it can also distinguish source addresses 0x100.0000 and 0x180.8000, but it can't distinguish source addresses 0x100.0000 and 0x140.2000 or source addresses 0x100.0000 and 0x180.4000. In the following, this will be referred to as aliased source addresses.
When an aliased source address is used, the branch predictor will still predict the same target as for the unaliased source address. This indicates that the branch predictor stores a truncated absolute destination address, but that hasn't been verified.
Based on observed maximum forward and backward jump distances for different source addresses, the low 32-bit half of the target address could be stored as an absolute 32-bit value with an additional bit that specifies whether the jump from source to target crosses a 232 boundary; if the jump crosses such a boundary, bit 31 of the source address determines whether the high half of the instruction pointer should increment or decrement.Indirect call predictorThe inputs of the BTB lookup for this mechanism seem to be:
  • The low 12 bits of the address of the source instruction (we are not sure whether it's the address of the first or the last byte) or a subset of them.
  • The branch history buffer state.

If the indirect call predictor can't resolve a branch, it is resolved by the generic predictor instead. Intel's optimization manual hints at this behavior: "Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior."
The branch history buffer (BHB) stores information about the last 29 taken branches - basically a fingerprint of recent control flow - and is used to allow better prediction of indirect calls that can have multiple targets.
The update function of the BHB works as follows (in pseudocode; src is the address of the last byte of the source instruction, dst is the destination address):
void bhb_update(uint58_t *bhb_state, unsigned long src, unsigned long dst) {  *bhb_state <<= 2;  *bhb_state ^= (dst & 0x3f);  *bhb_state ^= (src & 0xc0) >> 6;  *bhb_state ^= (src & 0xc00) >> (10 - 2);  *bhb_state ^= (src & 0xc000) >> (14 - 4);  *bhb_state ^= (src & 0x30) << (6 - 4);  *bhb_state ^= (src & 0x300) << (8 - 8);  *bhb_state ^= (src & 0x3000) >> (12 - 10);  *bhb_state ^= (src & 0x30000) >> (16 - 12);  *bhb_state ^= (src & 0xc0000) >> (18 - 14);}
Some of the bits of the BHB state seem to be folded together further using XOR when used for a BTB access, but the precise folding function hasn't been understood yet.
The BHB is interesting for two reasons. First, knowledge about its approximate behavior is required in order to be able to accurately cause collisions in the indirect call predictor. But it also permits dumping out the BHB state at any repeatable program state at which the attacker can execute code - for example, when attacking a hypervisor, directly after a hypercall. The dumped BHB state can then be used to fingerprint the hypervisor or, if the attacker has access to the hypervisor binary, to determine the low 20 bits of the hypervisor load address (in the case of KVM: the low 20 bits of the load address of kvm-intel.ko).Reverse-Engineering Branch Predictor InternalsThis subsection describes how we reverse-engineered the internals of the Haswell branch predictor. Some of this is written down from memory, since we didn't keep a detailed record of what we were doing.
We initially attempted to perform BTB injections into the kernel using the generic predictor, using the knowledge from prior research that the generic predictor only looks at the lower half of the source address and that only a partial target address is stored. This kind of worked - however, the injection success rate was very low, below 1%. (This is the method we used in our preliminary PoCs for method 2 against modified hypervisors running on Haswell.)
We decided to write a userspace test case to be able to more easily test branch predictor behavior in different situations.
Based on the assumption that branch predictor state is shared between hyperthreads [10], we wrote a program of which two instances are each pinned to one of the two logical processors running on a specific physical core, where one instance attempts to perform branch injections while the other measures how often branch injections are successful. Both instances were executed with ASLR disabled and had the same code at the same addresses. The injecting process performed indirect calls to a function that accesses a (per-process) test variable; the measuring process performed indirect calls to a function that tests, based on timing, whether the per-process test variable is cached, and then evicts it using CLFLUSH. Both indirect calls were performed through the same callsite. Before each indirect call, the function pointer stored in memory was flushed out to main memory using CLFLUSH to widen the speculation time window. Additionally, because of the reference to "recent program behavior" in Intel's optimization manual, a bunch of conditional branches that are always taken were inserted in front of the indirect call.
In this test, the injection success rate was above 99%, giving us a base setup for future experiments.

We then tried to figure out the details of the prediction scheme. We assumed that the prediction scheme uses a global branch history buffer of some kind.
To determine the duration for which branch information stays in the history buffer, a conditional branch that is only taken in one of the two program instances was inserted in front of the series of always-taken conditional jumps, then the number of always-taken conditional jumps (N) was varied. The result was that for N=25, the processor was able to distinguish the branches (misprediction rate under 1%), but for N=26, it failed to do so (misprediction rate over 99%).Therefore, the branch history buffer had to be able to store information about at least the last 26 branches.
The code in one of the two program instances was then moved around in memory. This revealed that only the lower 20 bits of the source and target addresses have an influence on the branch history buffer.
Testing with different types of branches in the two program instances revealed that static jumps, taken conditional jumps, calls and returns influence the branch history buffer the same way; non-taken conditional jumps don't influence it; the address of the last byte of the source instruction is the one that counts; IRETQ doesn't influence the history buffer state (which is useful for testing because it permits creating program flow that is invisible to the history buffer).
Moving the last conditional branch before the indirect call around in memory multiple times revealed that the branch history buffer contents can be used to distinguish many different locations of that last conditional branch instruction. This suggests that the history buffer doesn't store a list of small history values; instead, it seems to be a larger buffer in which history data is mixed together.
However, a history buffer needs to "forget" about past branches after a certain number of new branches have been taken in order to be useful for branch prediction. Therefore, when new data is mixed into the history buffer, this can not cause information in bits that are already present in the history buffer to propagate downwards - and given that, upwards combination of information probably wouldn't be very useful either. Given that branch prediction also must be very fast, we concluded that it is likely that the update function of the history buffer left-shifts the old history buffer, then XORs in the new state (see diagram).

If this assumption is correct, then the history buffer contains a lot of information about the most recent branches, but only contains as many bits of information as are shifted per history buffer update about the last branch about which it contains any data. Therefore, we tested whether flipping different bits in the source and target addresses of a jump followed by 32 always-taken jumps with static source and target allows the branch prediction to disambiguate an indirect call. [11]
With 32 static jumps in between, no bit flips seemed to have an influence, so we decreased the number of static jumps until a difference was observable. The result with 28 always-taken jumps in between was that bits 0x1 and 0x2 of the target and bits 0x40 and 0x80 of the source had such an influence; but flipping both 0x1 in the target and 0x40 in the source or 0x2 in the target and 0x80 in the source did not permit disambiguation. This shows that the per-insertion shift of the history buffer is 2 bits and shows which data is stored in the least significant bits of the history buffer. We then repeated this with decreased amounts of fixed jumps after the bit-flipped jump to determine which information is stored in the remaining bits.Reading host memory from a KVM guestLocating the host kernelOur PoC locates the host kernel in several steps. The information that is determined and necessary for the next steps of the attack consists of:
  • lower 20 bits of the address of kvm-intel.ko
  • full address of kvm.ko
  • full address of vmlinux

Looking back, this is unnecessarily complicated, but it nicely demonstrates the various techniques an attacker can use. A simpler way would be to first determine the address of vmlinux, then bisect the addresses of kvm.ko and kvm-intel.ko.
In the first step, the address of kvm-intel.ko is leaked. For this purpose, the branch history buffer state after guest entry is dumped out. Then, for every possible value of bits 12..19 of the load address of kvm-intel.ko, the expected lowest 16 bits of the history buffer are computed based on the load address guess and the known offsets of the last 8 branches before guest entry, and the results are compared against the lowest 16 bits of the leaked history buffer state.
The branch history buffer state is leaked in steps of 2 bits by measuring misprediction rates of an indirect call with two targets. One way the indirect call is reached is from a vmcall instruction followed by a series of N branches whose relevant source and target address bits are all zeroes. The second way the indirect call is reached is from a series of controlled branches in userspace that can be used to write arbitrary values into the branch history buffer.Misprediction rates are measured as in the section "Reverse-Engineering Branch Predictor Internals", using one call target that loads a cache line and another one that checks whether the same cache line has been loaded.

With N=29, mispredictions will occur at a high rate if the controlled branch history buffer value is zero because all history buffer state from the hypercall has been erased. With N=28, mispredictions will occur if the controlled branch history buffer value is one of 0<<(28*2), 1<<(28*2), 2<<(28*2), 3<<(28*2) - by testing all four possibilities, it can be detected which one is right. Then, for decreasing values of N, the four possibilities are {0|1|2|3}<<(28*2) | (history_buffer_for(N+1) >> 2). By repeating this for decreasing values for N, the branch history buffer value for N=0 can be determined.
At this point, the low 20 bits of kvm-intel.ko are known; the next step is to roughly locate kvm.ko.For this, the generic branch predictor is used, using data inserted into the BTB by an indirect call from kvm.ko to kvm-intel.ko that happens on every hypercall; this means that the source address of the indirect call has to be leaked out of the BTB.
kvm.ko will probably be located somewhere in the range from 0xffffffffc0000000 to 0xffffffffc4000000, with page alignment (0x1000). This means that the first four entries in the table in the section "Generic Predictor" apply; there will be 24-1=15 aliasing addresses for the correct one. But that is also an advantage: It cuts down the search space from 0x4000 to 0x4000/24=1024.
To find the right address for the source or one of its aliasing addresses, code that loads data through a specific register is placed at all possible call targets (the leaked low 20 bits of kvm-intel.ko plus the in-module offset of the call target plus a multiple of 220) and indirect calls are placed at all possible call sources. Then, alternatingly, hypercalls are performed and indirect calls are performed through the different possible non-aliasing call sources, with randomized history buffer state that prevents the specialized prediction from working. After this step, there are 216 remaining possibilities for the load address of kvm.ko.
Next, the load address of vmlinux can be determined in a similar way, using an indirect call from vmlinux to kvm.ko. Luckily, none of the bits which are randomized in the load address of vmlinux  are folded together, so unlike when locating kvm.ko, the result will directly be unique. vmlinux has an alignment of 2MiB and a randomization range of 1GiB, so there are still only 512 possible addresses.Because (as far as we know) a simple hypercall won't actually cause indirect calls from vmlinux to kvm.ko, we instead use port I/O from the status register of an emulated serial port, which is present in the default configuration of a virtual machine created with virt-manager.
The only remaining piece of information is which one of the 16 aliasing load addresses of kvm.ko is actually correct. Because the source address of an indirect call to kvm.ko is known, this can be solved using bisection: Place code at the various possible targets that, depending on which instance of the code is speculatively executed, loads one of two cache lines, and measure which one of the cache lines gets loaded.Identifying cache setsThe PoC assumes that the VM does not have access to hugepages.To discover eviction sets for all L3 cache sets with a specific alignment relative to a 4KiB page boundary, the PoC first allocates 25600 pages of memory. Then, in a loop, it selects random subsets of all remaining unsorted pages such that the expected number of sets for which an eviction set is contained in the subset is 1, reduces each subset down to an eviction set by repeatedly accessing its cache lines and testing whether the cache lines are always cached (in which case they're probably not part of an eviction set) and attempts to use the new eviction set to evict all remaining unsorted cache lines to determine whether they are in the same cache set [12].Locating the host-virtual address of a guest pageBecause this attack uses a FLUSH+RELOAD approach for leaking data, it needs to know the host-kernel-virtual address of one guest page. Alternative approaches such as PRIME+PROBE should work without that requirement.
The basic idea for this step of the attack is to use a branch target injection attack against the hypervisor to load an attacker-controlled address and test whether that caused the guest-owned page to be loaded. For this, a gadget that simply loads from the memory location specified by R8 can be used - R8-R11 still contain guest-controlled values when the first indirect call after a guest exit is reached on this kernel build.
We expected that an attacker would need to either know which eviction set has to be used at this point or brute-force it simultaneously; however, experimentally, using random eviction sets works, too. Our theory is that the observed behavior is actually the result of L1D and L2 evictions, which might be sufficient to permit a few instructions worth of speculative execution.
The host kernel maps (nearly?) all physical memory in the physmap area, including memory assigned to KVM guests. However, the location of the physmap is randomized (with a 1GiB alignment), in an area of size 128PiB. Therefore, directly bruteforcing the host-virtual address of a guest page would take a long time. It is not necessarily impossible; as a ballpark estimate, it should be possible within a day or so, maybe less, assuming 12000 successful injections per second and 30 guest pages that are tested in parallel; but not as impressive as doing it in a few minutes.
To optimize this, the problem can be split up: First, brute-force the physical address using a gadget that can load from physical addresses, then brute-force the base address of the physmap region. Because the physical address can usually be assumed to be far below 128PiB, it can be brute-forced more efficiently, and brute-forcing the base address of the physmap region afterwards is also easier because then address guesses with 1GiB alignment can be used.
To brute-force the physical address, the following gadget can be used:
ffffffff810a9def:       4c 89 c0                mov    rax,r8ffffffff810a9df2:       4d 63 f9                movsxd r15,r9dffffffff810a9df5:       4e 8b 04 fd c0 b3 a6    mov    r8,QWORD PTR [r15*8-0x7e594c40]ffffffff810a9dfc:       81 ffffffff810a9dfd:       4a 8d 3c 00             lea    rdi,[rax+r8*1]ffffffff810a9e01:       4d 8b a4 00 f8 00 00    mov    r12,QWORD PTR [r8+rax*1+0xf8]ffffffff810a9e08:       00
This gadget permits loading an 8-byte-aligned value from the area around the kernel text section by setting R9 appropriately, which in particular permits loading page_offset_base, the start address of the physmap. Then, the value that was originally in R8 - the physical address guess minus 0xf8 - is added to the result of the previous load, 0xfa is added to it, and the result is dereferenced.Cache set selectionTo select the correct L3 eviction set, the attack from the following section is essentially executed with different eviction sets until it works.Leaking dataAt this point, it would normally be necessary to locate gadgets in the host kernel code that can be used to actually leak data by reading from an attacker-controlled location, shifting and masking the result appropriately and then using the result of that as offset to an attacker-controlled address for a load. But piecing gadgets together and figuring out which ones work in a speculation context seems annoying. So instead, we decided to use the eBPF interpreter, which is built into the host kernel - while there is no legitimate way to invoke it from inside a VM, the presence of the code in the host kernel's text section is sufficient to make it usable for the attack, just like with ordinary ROP gadgets.
The eBPF interpreter entry point has the following function signature:
static unsigned int __bpf_prog_run(void *ctx, const struct bpf_insn *insn)
The second parameter is a pointer to an array of statically pre-verified eBPF instructions to be executed - which means that __bpf_prog_run() will not perform any type checks or bounds checks. The first parameter is simply stored as part of the initial emulated register state, so its value doesn't matter.
The eBPF interpreter provides, among other things:
  • multiple emulated 64-bit registers
  • 64-bit immediate writes to emulated registers
  • memory reads from addresses stored in emulated registers
  • bitwise operations (including bit shifts) and arithmetic operations

To call the interpreter entry point, a gadget that gives RSI and RIP control given R8-R11 control and controlled data at a known memory location is necessary. The following gadget provides this functionality:
ffffffff81514edd:       4c 89 ce                mov    rsi,r9
ffffffff81514ee0:       41 ff 90 b0 00 00 00    call   QWORD PTR [r8+0xb0]
Now, by pointing R8 and R9 at the mapping of a guest-owned page in the physmap, it is possible to speculatively execute arbitrary unvalidated eBPF bytecode in the host kernel. Then, relatively straightforward bytecode can be used to leak data into the cache.Variant 3: Rogue data cache loadBasically, read Anders Fogh's blogpost:
In summary, an attack using this variant of the issue attempts to read kernel memory from userspace without misdirecting the control flow of kernel code. This works by using the code pattern that was used for the previous variants, but in userspace. The underlying idea is that the permission check for accessing an address might not be on the critical path for reading data from memory to a register, where the permission check could have significant performance impact. Instead, the memory read could make the result of the read available to following instructions immediately and only perform the permission check asynchronously, setting a flag in the reorder buffer that causes an exception to be raised if the permission check fails.
We do have a few additions to make to Anders Fogh's blogpost:
"Imagine the following instruction executed in usermodemov rax,[somekernelmodeaddress]It will cause an interrupt when retired, [...]"
It is also possible to already execute that instruction behind a high-latency mispredicted branch to avoid taking a page fault. This might also widen the speculation window by increasing the delay between the read from a kernel address and delivery of the associated exception.
"First, I call a syscall that touches this memory. Second, I use the prefetcht0 instruction to improve my odds of having the address loaded in L1."
When we used prefetch instructions after doing a syscall, the attack stopped working for us, and we have no clue why. Perhaps the CPU somehow stores whether access was denied on the last access and prevents the attack from working if that is the case?
"Fortunately I did not get a slow read suggesting that Intel null’s the result when the access is not allowed."
That (read from kernel address returns all-zeroes) seems to happen for memory that is not sufficiently cached but for which pagetable entries are present, at least after repeated read attempts. For unmapped memory, the kernel address read does not return a result at all.Ideas for further researchWe believe that our research provides many remaining research topics that we have not yet investigated, and we encourage other public researchers to look into these.This section contains an even higher amount of speculation than the rest of this blogpost - it contains untested ideas that might well be useless.Leaking without data cache timingIt would be interesting to explore whether there are microarchitectural attacks other than measuring data cache timing that can be used for exfiltrating data out of speculative execution.Other microarchitecturesOur research was relatively Haswell-centric so far. It would be interesting to see details e.g. on how the branch prediction of other modern processors works and how well it can be attacked.Other JIT enginesWe developed a successful variant 1 attack against the JIT engine built into the Linux kernel. It would be interesting to see whether attacks against more advanced JIT engines with less control over the system are also practical - in particular, JavaScript engines.More efficient scanning for host-virtual addresses and cache setsIn variant 2, while scanning for the host-virtual address of a guest-owned page, it might make sense to attempt to determine its L3 cache set first. This could be done by performing L3 evictions using an eviction pattern through the physmap, then testing whether the eviction affected the guest-owned page.
The same might work for cache sets - use an L1D+L2 eviction set to evict the function pointer in the host kernel context, use a gadget in the kernel to evict an L3 set using physical addresses, then use that to identify which cache sets guest lines belong to until a guest-owned eviction set has been constructed.Dumping the complete BTB stateGiven that the generic BTB seems to only be able to distinguish 231-8 or fewer source addresses, it seems feasible to dump out the complete BTB state generated by e.g. a hypercall in a timeframe around the order of a few hours. (Scan for jump sources, then for every discovered jump source, bisect the jump target.) This could potentially be used to identify the locations of functions in the host kernel even if the host kernel is custom-built.
The source address aliasing would reduce the usefulness somewhat, but because target addresses don't suffer from that, it might be possible to correlate (source,target) pairs from machines with different KASLR offsets and reduce the number of candidate addresses based on KASLR being additive while aliasing is bitwise.
This could then potentially allow an attacker to make guesses about the host kernel version or the compiler used to build it based on jump offsets or distances between functions.Variant 2: Leaking with more efficient gadgetsIf sufficiently efficient gadgets are used for variant 2, it might not be necessary to evict host kernel function pointers from the L3 cache at all; it might be sufficient to only evict them from L1D and L2. Various speedupsIn particular the variant 2 PoC is still a bit slow. This is probably partly because:
  • It only leaks one bit at a time; leaking more bits at a time should be doable.
  • It heavily uses IRETQ for hiding control flow from the processor.

It would be interesting to see what data leak rate can be achieved using variant 2.Leaking or injection through the return predictorIf the return predictor also doesn't lose its state on a privilege level change, it might be useful for either locating the host kernel from inside a VM (in which case bisection could be used to very quickly discover the full address of the host kernel) or injecting return targets (in particular if the return address is stored in a cache line that can be flushed out by the attacker and isn't reloaded before the return instruction).
However, we have not performed any experiments with the return predictor that yielded conclusive results so far.Leaking data out of the indirect call predictorWe have attempted to leak target information out of the indirect call predictor, but haven't been able to make it work.Vendor statementsThe following statement were provided to us regarding this issue from the vendors to whom Project Zero disclosed this vulnerability:IntelNo current statement provided at this time.AMDNo current statement provided at this time.ARMArm recognises that the speculation functionality of many modern high-performance processors, despite working as intended, can be used in conjunction with the timing of cache operations to leak some information as described in this blog. Correspondingly, Arm has developed software mitigations that we recommend be deployed.
Specific details regarding the affected processors and mitigations can be found at this website:
Arm has included a detailed technical whitepaper as well as links to information from some of Arm’s architecture partners regarding their specific implementations and mitigations.LiteratureNote that some of these documents - in particular Intel's documentation - change over time, so quotes from and references to it may not reflect the latest version of Intel's documentation.
  • Intel's optimization manual has many interesting pieces of optimization advice that hint at relevant microarchitectural behavior; for example:
    • "Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch predication [sic] hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems."
    • "Loads can:[...]Be carried out speculatively, before preceding branches are resolved."
    • "Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage of that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition."
    • "if mapped as WB or WT, there is a potential for speculative processor reads to bring the data into the caches"
    • "Failure to map the region as WC may allow the line to be speculatively read into the processor caches (via the wrong path of a mispredicted branch)."
  • Intel's Software Developer Manuals
  • Agner Fog's documentation of reverse-engineered processor behavior and relevant theory was very helpful for this research.
  • and Prior research by Dmitry Evtyushkin, Dmitry Ponomarev and Nael Abu-Ghazaleh on abusing branch target buffer behavior to leak addresses that we used as a starting point for analyzing the branch prediction of Haswell processors. Felix Wilhelm's research based on this provided the basic idea behind variant 2.
  • The rowhammer.js research by Daniel Gruss, Clémentine Maurice and Stefan Mangard contains information about L3 cache eviction patterns that we reused in the KVM PoC to evict a function pointer.
  • Matt Godbolt blogged about reverse-engineering the structure of the branch predictor on Intel processors.
  • Sophia D'Antoine wrote a thesis that shows that opcode scheduling can theoretically be used to transmit data between hyperthreads.
  • Daniel Gruss, Moritz Lipp, Michael Schwarz, Richard Fellner, Clémentine Maurice, and Stefan Mangard wrote a paper on mitigating microarchitectural issues caused by pagetable sharing between userspace and the kernel.
  • This journal contains many articles on branch prediction.
  • This blogpost by Henry Wong investigates the L3 cache replacement policy used by Intel's Ivy Bridge architecture.
References[1] This initial report did not contain any information about variant 3. We had discussed whether direct reads from kernel memory could work, but thought that it was unlikely. We later tested and reported variant 3 prior to the publication of Anders Fogh's work at[2] The precise model names are listed in the section "Tested Processors". The code for reproducing this is in the writeup_files.tar archive in our bugtracker, in the folders userland_test_x86 and userland_test_aarch64.[3] The attacker-controlled offset used to perform an out-of-bounds access on an array by this PoC is a 32-bit value, limiting the accessible addresses to a 4GiB window in the kernel heap area.[4] This PoC won't work on CPUs with SMAP support; however, that is not a fundamental limitation.[5] linux-image-4.9.0-3-amd64 at version 4.9.30-2+deb9u2 (available at, sha256 5f950b26aa7746d75ecb8508cc7dab19b3381c9451ee044cd2edfd6f5efff1f8, signed via Release.gpg, Release, Packages.xz); that was the current distro kernel version when I set up the machine. It is very unlikely that the PoC works with other kernel versions without changes; it contains a number of hardcoded addresses/offsets.[6] The phone was running an Android build from May 2017.[7][8], section "background"[9] More than 215 mappings would be more efficient, but the kernel places a hard cap of 216 on the number of VMAs that a process can have.[10] Intel's optimization manual states that "In the first implementation of HT Technology, the physical execution resources are shared and the architecture state is duplicated for each logical processor", so it would be plausible for predictor state to be shared. While predictor state could be tagged by logical core, that would likely reduce performance for multithreaded processes, so it doesn't seem likely.[11] In case the history buffer was a bit bigger than we had measured, we added some margin - in particular because we had seen slightly different history buffer lengths in different experiments, and because 26 isn't a very round number.[12] The basic idea comes from, section IV, although the authors of that paper still used hugepages.
Categories: Security

aPAColypse now: Exploiting Windows 10 in a Local Network with WPAD/PAC and JScript

Google Project Zero - Mon, 12/18/2017 - 12:47
by Ivan Fratric, Thomas Dullien, James Forshaw and Steven VittitoeIntroMany widely-deployed technologies, viewed through 20/20 hindsight, seem like an odd or unnecessarily risky idea. Engineering decisions in IT are often made with imperfect information and under time pressure, and some oddities of the IT stack can best be explained with “it seemed like a good idea at the time”. In the personal view of some of the authors of this post, WPAD (“Web Proxy Auto Discovery Protocol” - and more specifically “Proxy Auto-Config”), is one of these oddities.
At some point in the very early days of the Internet - prior to 1996 - engineers at Netscape decided that JavaScript was a good language to write configuration files in. The result was PAC - a configuration file format that works as follows: The browser connects to a pre-configured server, downloads the PAC file, and executes a particular Javascript function to determine proper proxy configuration. Why not? It certainly is more expressive and less verbose than (let’s say) XML, and seems a reasonable way to provide configurations to many clients.
PAC itself was coupled with a protocol called WPAD - a protocol that makes it unnecessary for the browser to have a pre-configured server to connect to. Instead, WPAD allows the computer to query the local network to determine the server from which to load the PAC file.
Somehow this technology ended up being an IETF draft which expired in 1999, and now, in 2017, every Windows machine will ask the local network: “Hey, where can I find a Javascript file to execute?”. This can happen via a number of mechanisms: DNS, WINS, but - perhaps most interestingly - DHCP.
In recent years, browser exploits have mutated from being primarily DOM-oriented to targeting Javascript engines directly, so the mere mention that we can get Javascript execution over the network without the browser was motivating. An initial investigation revealed that the JS Engine responsible for executing these configuration files was jscript.dll - the legacy JS Engine that also powered IE7 and IE8 (and is still reachable in IE11 in IE7/8 compatibility mode if appropriate script attributes are used). This is both good and bad - on the one hand, it means that not every Chakra bug is automatically a local network remote attack, but on the other hand, it means that some pretty old code will be responsible for executing our Javascript.
Security researchers have previously warned about the dangers of WPAD. But, as far as we know, this is the first time that an attack against WPAD is demonstrated that results in the complete compromise of the WPAD user’s machine.
Windows is certainly not the only piece of software that implements WPAD. Other operating systems and applications do as well. For example Google Chrome also has a WPAD implementation, but in Chrome’s case, evaluating the JavaScript code from the PAC file happens inside a sandbox. And other operating systems that support WPAD don’t enable it by default. This is why Windows is currently the most interesting target for this sort of attack.Web Proxy Auto-DiscoveryAs mentioned above, WPAD will query DHCP and DNS (in that order) to obtain a URL to connect to - apparently LLMNR and Netbios can also be used if no response from DNS is available. Some peculiarities of WPAD-over-DNS enable surprising attack vectors.Attack scenario: Local network via DHCPIn the most common scenario, a machine will query the local DHCP server using option code 252. The DHCP server replies with a string - like “http://server.domain/proxyconfig.pac”, which specifies a URL from which the configuration file should be fetched. The client then proceeds to fetch this file, and execute the contents as Javascript.
In a local network, an attacker can simply impersonate the DHCP server - either by ARP games or by racing the legitimate DHCP. The attacker can then provide a URL where the malicious Javascript file is hosted.Attack scenario: Remote over the internet via privileged position and DNSAside from the local-network attack scenario, the fact that lookup for WPAD may also happen via DNS creates a secondary attack scenario. Many users configure their computers to perform DNS lookups against one of the public, globally visible DNS servers (such as,, and In such a scenario, a machine will send DNS queries (such as wpad.local) to the server which sits outside of the local network. An attacker in a privileged position on the network (e.g. a gateway, or any other upstream host) can monitor the DNS queries and spoof a reply, directing the client to download and execute a malicious Javascript file.
Setups like these seem to be common - according to this Wikipedia entry, a nontrivial proportion of the traffic that the DNS root servers see are .local requests.Attack scenario: Remote over the internet via malicious wpad.tldA particular oddity of WPAD is that it recursively walks the local machine name to find domains to query. If a machine is called “”, the following domains are supposedly queried in order:

This has (according to this Wikipedia entry) in the past led to people registering and redirecting traffic to an online auction site. Further quoting from that entry:
Through the WPAD file, the attacker can point users' browsers to their own proxies and intercept and modify all of WWW traffic. Although a simplistic fix for Windows WPAD handling was applied in 2005, it only fixed the problem for the .com domain. A presentation at Kiwicon showed that the rest of the world was still critically vulnerable to this security hole, with a sample domain registered in New Zealand for testing purposes receiving proxy requests from all over the country at the rate of several a second. Several of the wpad.tld domain names (including COM, NET, ORG, and US) now point to the client loopback address to help protect against this vulnerability, though some names are still registered (, an administrator should make sure that a user can trust all the DHCP servers in an organisation and that all possible wpad domains for the organisation are under control. Furthermore, if there's no wpad domain configured for an organisation, a user will go to whatever external location has the next wpad site in the domain hierarchy and use that for its configuration. This allows whoever registers the wpad subdomain in a particular country to perform a man-in-the-middle attack on large portions of that country's internet traffic by setting themselves as a proxy for all traffic or sites of interest.
The IETF draft, on the other hand, explicitly asks for clients to only allow “canonical” (e.g. non-top-level domains). We have not investigated to what extent clients implement this, or if second-level domains (such as are the culprit in the historical cases of traffic redirection.
Either way: Bugs in the Javascript engine under consideration can be exploited remotely via the internet if one manages to register wpad.$TLD for a given organization’s TLD, provided said TLD is not explicitly blacklisted by the client implementation. Given that the IETF draft from 1999 refers to a list of TLDs from 1994 (RFC1591), it is unlikely that clients have been updated to reflect the proliferation of new TLDs.
Our attempts to register$TLD for a variety of TLDs were not (yet) successful.BugsWe spent some time looking for bugs in jscript.dll and employed both manual analysis and fuzzing. JScript initially posed some challenge because a lot of “features” useful for triggering bugs in JavaScript engines can’t be used in JScript, simply due to it being too old to support them. For example:
  • There are no multiple arrays types (int array, float array etc.). Thus confusing one array type for another is not possible.
  • There are not as many optimizations (“fast paths”) as in the newer, faster JavaScript engines. These fast paths are often the source of bugs.
  • It is not possible to define a getter/setter on a generic JavaScript object. It is possible to call defineProperty but only on DOM objects which doesn’t work for us as there won’t be a DOM in the WPAD process. Even if there were, a lot of JScript functions will simply fail when called on a DOM object with a message “JScript object expected”.
  • It is impossible to change an object’s prototype once it is created (i.e. there is no “__proto__” property).

However, JScript does suffer from more “old-school” vulnerability classes such as use-after-free. JScript’s garbage collector is described in this old MSDN article. JScript uses a non-generational mark-and-sweep garbage collector. Essentially, whenever a garbage collection is triggered, it marks all the JScript objects. Then it scans them starting from a set of “root” objects (sometimes also referred to as “scavengers”) and clears the mark from all the objects it encounters. All the objects that are still marked get deleted. One recurring problem is that local variables on the stack aren’t added to the list of root objects by default, meaning that a programmer needs to remember to add them to the garbage collector’s root list, especially if those variables refer to objects that can be deleted during the function’s lifetime.
Other possible types of vulnerabilities include buffer overflows, uninitialized variables etc.
For fuzzing, we used the grammar-based Domato fuzzing engine and wrote a new grammar specifically for JScript. We identified interesting built-in properties and functions to add to the grammar by looking at EnsureBuiltin methods of various JScript objects. The JScript grammar has been added to the Domato repository here.
Between fuzzing and manual analysis we identified seven security vulnerabilities. They are summarized in the table below:
Vulnerability classVulnerabilities affecting IE8 modeVulnerabilities affecting IE7 modeUse-after-free1340, 1376, 13811376Heap overflow1369, 13831369, 1383Uninitialized variable13781378Out-of-bounds read13821382Total75
At the time of publishing this blog post, all the bugs have been fixed by Microsoft.
The table breaks down the vulnerabilities by class and compatibility mode required to trigger them. JScript in WPAD is equivalent to running a script in IE7 compatibility mode, which means that, although we found 7 vulnerabilities, “only” 5 of them can be triggered in WPAD. However, the other vulnerabilities can still be used against Internet Explorer (including IE11) when put into IE8 compatibility mode by a malicious webpage.Exploit
Understanding JScript VARs and Strings
Since in the remainder of this blogpost we’re going to talk about JScript VARs and Strings a lot, it is useful to describe these before going deeper into how the exploits work.
JScript VAR is a 24-byte (on 64-bit builds) structure that represents a JavaScript variable and is essentially the same as the VARIANT data structure described in this MSDN article. In most cases (sufficient to follow the exploit) its memory layout looks like this:
OffsetSizeDescription02Variable type, 3 for integer, 5 for double, 8 for string etc.88Depending on the type, either an immediate value or a pointer168Unused for most types
For example, we can represent a double precision number by a VAR that has 5 written in the first 2 bytes (indicating the double type), followed by an actual double value at offset 8. The last 8 bytes are going to be unused but they are going to be copied around if a value of another VAR is copied from this VAR.
A JScript string is a type of VAR that has the type 8 and a pointer at offset 8. The pointer points into a BSTR structure described here. On 64-bit builds BSTR layout looks like this:
OffsetSizeDescription04Unused44String length in bytes not counting the null character at the end8length+2String characters (16-bit) followed by a null character
A String VAR points directly to the character array, which means that, to obtain a String's length, the pointer needs to be decremented by 4 and the length read from there. Note that BSTRs are handled by OleAut32.dll and are allocated on a separate heap (i.e. a different heap than is being used for other JScript objects).
Freeing of BSTRs is also different than for most objects because, instead of directly freeing a BSTR, when SysFreeString is called, it first puts a string in a cache controlled by OleAut32.dll. This mechanism is described in detail in Heap Feng Shui in JavaScript.
Stage 1: Infoleak
The purpose of the infoleak will be to obtain the address of a string in memory whose content we fully control. We won’t be leaking any executable module addresses at this point, that will come later. Instead, the goal is to defeat high-entropy heap randomization and make the second stage of the exploit reliable without having to use heap spraying.
For the infoleak we’re going to use this bug in RegExp.lastParen. To understand the bug let’s first take a closer look at the memory layout of jscript!RegExpFncObj which corresponds to the JScript RegExp object. At offset 0xAC RegExpFncObj contains a buffer of 20 integers. Actually these are 10 pairs of integers: the first element of the pair is the start index into the input string and the second element is the end index. Whenever RegExp.test, RegExp.exec or with a RegExp parameter encounter a capturing group (parentheses in the RegExp syntax), the start and end index of the match are stored here. Obviously in the buffer there is space for only 10 matches, so only the first 10 matches are stored in this buffer.
However, if RegExp.lastParen is called and there were more than 10 capturing groups, RegExpFncObj::LastParen will happily use the number of capturing groups as an index into the buffer, leading to out-of-bounds read. Here is a PoC:
 var r= new RegExp(Array(100).join('()'));  ''.search(r);  alert(RegExp.lastParen);
The 2 indices (let’s call them start_index and end_index) are read outside the bounds of the buffer and can thus be made arbitrarily large. Assuming this first out-of-bounds access doesn’t cause a crash, if the values in those indices are larger than the length of the input string, then a second out-of-bounds access is going to occur which allows us to read a outside the bounds of the input string. The string content read out-of-bounds like this is going to be returned to the caller in a String variable where it can be examined.
This second out-of-bounds read is what we’re going to use, but first we need to figure out how to get controlled data into start_index and end_index. Fortunately, looking at the layout of RegExpFncObj, there is data we control after the end of the index buffer: RegExp.input value. By setting RegExp.input to an integer value and using a RegExp composed of 41 sets of empty parentheses, when  RegExp.lastParen gets called, start_index is going to be 0 and the end_index is going to be whatever value we wrote to RegExp.input.
If we make an input string adjacent to a freed string, then by reading after the bounds of input string, we can obtain the heap metadata such as the pointers to the other free heap segments (Left, Right and Parent node in the red-black tree of heap chunks, see Windows 10 Segment Heap Internals for more information). Image 1 shows the relevant objects at the moment of infoleak.
Image 1: Heap infoleak layout
We are using 20000 bytes-long strings as input in order for them not to be allocated on the Low Fragmentation Heap (LFH can only be used for allocations of 16K bytes and smaller) since the heap metadata for the LFH is different and does not include useful pointers in Windows 10 Segment Heap. Additionally, LFH introduces randomness that would affect our ability to place the input string next to a freed string.
By reading the heap metadata out of the returned string, we can obtain an address of a freed string. Then, if we allocate a string of the same size as the freed string, it might be placed at this address and we achieved our goal, that is we know the address of memory of a string whose content we control.
The whole infoleak process looks like this:
  1. Allocate 1000 10000-character strings (note: 10000 characters == 20000 bytes).
  2. Free every second one.
  3. Trigger the info leak bug. Use one of the remaining strings as an input strings and read 20080 bytes.
  4. Analyze the leaked string and obtain the pointer to one of the freed strings.
  5. Allocate 500 strings of the same length as the freed strings (10000 characters) with a specially crafted content.

The content of the specially crafted strings is not important at this stage, but will be important in the next one, so it will be described there. Also note that, by examining heap metadata, we can easily determine which heap implementation the process is using (Segment Heap vs NT heap).
Images 2 and 3 show heap visualization created using Heap History Viewer at the time around the infoleak. Green stripes represent allocated blocks (occupied by strings), grey stripes represent allocated blocks that are then freed by later allocated again (the stings we free and then reallocate after triggering the infoleak bug) and the white stripes represent data that is never allocated (guard pages). You can see how strings get allocated as the time passes, then half of them are freed (grey ones) and sometime later get allocated again (the stripes become green).
We can see that there are going to be guard pages after every 3 allocations of this size. Our exploit is never actually going to touch any of these guard pages (it reads too little data past the end of the string for that to occur) but in ⅓ of the cases there won’t be a free string after the input string for the infoleak so the expected heap metadata will be missing. We can, however, easily detect this case and either trigger the infoleak bug using another input string or silently abort the exploit (note: we didn’t trigger any memory corruption up to this point).
Image 2: Heap Diagram: Showing the evolution of the heap over time Image 3: Step-by-step illustration of leaking a pointer to a string.
Stage 2: Overflow
In stage 2 of the exploit we’re going to use this heap overflow bug in Array.sort. In case the number of elements in the input array to Array.sort is larger than Array.length / 2, JsArrayStringHeapSort (called by Array.sort if a comparison function isn’t specified) is going to allocate a temporary buffer of the same size as the number of elements currently in the array (note: can be smaller than array.lenght). It is then going to attempt to retrieve the corresponding elements for every array index from 0 to Array.length and, if that element exists, add it to the buffer and convert to string. If the array doesn’t change during the lifetime of JsArrayStringHeapSort, this will work fine. However, JsArrayStringHeapSort converts array elements into strings which can trigger toString() callbacks. If during one of those toString() callbacks elements are added to the array where they were previously undefined, an overflow is going to occur.
To understand the bug and its exploitability better let’s take a closer look at the structure of the buffer we’ll overflow out of. It is already mentioned that the array will have the same size as the number of elements currently in input array (to be exact, it is going to be number of elements + 1). Each element of the array is going to be 48 bytes in size (in a 64-bit build) with the following structure:
OffsetSizeDescripion08Pointer to a string VAR after the original VAR at offset 16 is converted to string84Index (int) of the current element1624VAR holding the original array element404int 0 or 1 depending on the type of VAR at offset 16
During JsArrayStringHeapSort, each element of the array with index < array.length is retrieved, and if the element is defined the following happens:
  1. The array element is read into VAR at offset 16
  2. The original VAR is converted into a string VAR. A pointer to the string VAR is written at offset 0.
  3. At offset 8, the index of the current element in array is written
  4. Depending on the original VAR type, 0 or 1 is written at offset 40

Looking at the structure of the temporary buffer, we don’t control a lot of it directly. If an array member is a string, then at offsets 0 and 24 we’re going to have a pointer that, when dereferenced, at offset 8 contains another pointer to the data we control. This is, however, one level of indirection larger than what would be useful to us in most situations.
However, if a member of array is a double precision number, then at offset 24 (corresponding to offset 8 into the original VAR) the value of that number is going to be written and it is directly under our control. If we create a number with the same double representation as the pointer obtained in Stage 1, then we can use our overflow to overwrite a pointer somewhere after the end of the buffer with a pointer to the memory we directly control.
Now the question becomes, what can we overwrite in this way to advance the exploit. One of the possible answers presents itself if we take a closer look at how Objects work in JScript.
Each Object (more specifically, a NameList JScript object) is going to have a pointer to a hashtable. This hashtable is just an array of pointers. When a member element of an Object is accessed, a hash of the name of the element is computed. Then, a pointer at the offset corresponding to the lowest bits of the hash is dereferenced. This pointer points to a linked list of object elements and this linked list is traversed until we reached an element with the same name as the requested element. This is shown in image 4.
Image 4: JScript Object element internals
Note that, when the name of the element is less than 4 bytes, it is stored in the same structure as the VAR (element value). Otherwise, there is going to be a pointer to the element name. Name lengths <=4 are sufficient for us so we don’t need to go into the details of this.
An Object hashtable is a good candidate to overwrite because:
  • We can control which elements of it are dereferenced by accessing the corresponding object members. Elements we overwrite with data we don’t control will simply never be accessed.
  • We have limited control over the hashtable size by controlling how many members the corresponding object has. For example a hashtable starts with 1024 bytes, but if we add more than 512 elements to the object, the hashtable will be reallocated to 8192 bytes.
  • By overwriting a hashtable pointer with a pointer to data we control, we can create fake JScript vars in the data we control and access them simply by accessing the corresponding object members.

To perform the overwrite reliably we do the following:
  1. Allocate and free a lot of memory blocks with size 8192. This will turn on the Low Fragmentation Heap for allocation of size 8192. This will ensure that the buffer we are overflowing out of, as well as hashtable we are overflowing into will be allocated on the LFH. This is important because it means there will be no other allocations of other sizes nearby to spoil the exploit attempt (since an LFH bucket can only contain allocations of a certain size). This in turn ensures that we will be overwriting exactly what we want with high reliability.
  2. Create 2000 objects, each containing 512 members. In this state, each object has a hashtable of 1024 bytes. However, adding just one more element to one of these objects will cause its hashtable to grow to 8192 bytes.
  3. Add the 513 element to the first 1000 objects, causing 1000 allocations of 8192-byte hashtables.
  4. Trigger Array.sort with an array with length=300 and 170 elements. This allocates a buffer of size (170+1)*48=8208 bytes. Due to LFH granularity this object will be allocated in the same LFH bucket as 8192-byte hashtables.
  5. Immediately (in the toString() method of the first array element) add 513th element to the second 1000 objects. This makes us pretty certain that by now the sort buffer is neighboring one of the hashtables. In the same toString() method also add more elements to the array which will cause it to grow out-of-bounds.

Image 5 shows heap visualization around the address of the sort buffer (red line). You can see the sort buffer is surrounded by allocations of similar size which all correspond to Object hashtables. You can also observe the LFH randomness in the sense that subsequent allocations are not necessarily on subsequent addresses, however this makes no difference for our exploit.
Image 5: Heap visualization around the overflow buffer
As mentioned previously, we crafted our overflow in such a way that some of the hashtable pointers of an unlucky JScript object will get overwritten with pointers into the data we control. Now finally what exactly we put into this data comes into play: we crafted it in such a way that it contains 5 (fake) JavaScript variables:
  • Variable 1 just contains number 1337.
  • Variable 2 is of special type 0x400C. This type basically tells JavaScript that the actual VAR is pointed to by pointer at offset 8, and this pointer should be dereferenced before reading or writing this variable. In our case, this pointer points 16 bytes before Variable 1. This basically means that the last 8-byte qword of Variable 2 and the first 8-byte qword of Variable 1 overlap.
  • Variable 3, Variable 4 and Variable 5 are simple integers. What is special about them is that they contain numbers 5, 8 and 0x400C in their last 8 bytes, respectively.

The state of the corrupted Object after the overflow is shown in image 6.

Image 6: State of objects after the overflow. Red areas indicate where the overflow occurred. Each box in the bottom row (except those marked as ‘...’) corresponds to 8 bytes. Data contained in ‘...’ boxes is omitted for clarity
We can access Variable 1 by simply accessing the corrupted object at the correct index (let’s call it index1) and similarly for Variables 2-5. In fact, we can detect which Object we corrupted by accessing index1 of all objects and seeing which now has the value 1337.
Overlapping Variable 1 and Variable 2 has the effect that we can change the type (first WORD) of Variable 1 into 5 (double), 8 (string) or 0x400C (pointer). We do this by reading Variable 2, 3 or 4 and then writing the read value into Variable 2. For example the statement
corruptedobject[index2] = corruptedobject[index4];
Has the effect that the type of Variable 1 will be changed into a String (8), while all other fields of Variable 1 will remain unchanged.
This layout gives us several very powerful exploitation primitives:
  • If we write some variable that contains a pointer into Variable 1, we can disclose the value of this pointer by changing the type of Variable 1 to double (5) and reading it out
  • We can disclose (read) memory at an arbitrary address by faking a String at that address. We can accomplish this by first writing a double value corresponding to the address we want to read into Variable 1 and then changing the type of Variable 1 toString (8).
  • We can write to an arbitrary address by first writing a numeric value corresponding to the address into Variable 1, then changing the type of Variable 1 to 0x400C (pointer) and finally writing some data to Variable 1.

With these exploit primitives, normally getting the code execution would be pretty simple, but since we’re exploiting Windows 10 we first need to bypass the Control Flow Guard (CFG).
Stage 3: CFG bypass
There are probably other known bypasses we could have used here, but it turns out that there are some very convenient bypasses (once attacker has a read/write primitive) specific to jscript.dll. We are going to exploit the facts that:
  • Return addresses are not protected by CFG
  • Some Jscript objects have pointers to the native stack

Specifically, each NameTbl object (in Jscript, all JavaScript objects inherit from NameTbl), at offset 24 holds a pointer to CSession object. CSession object, at offset 80 holds a pointer to near the top of the native stack.
Thus, with an arbitrary read, by following a chain of pointers from any JScript object, it is possible to retrieve a pointer to the native stack. Then, with an arbitrary write, it is possible to overwrite a return address, bypassing CFG.
Stage 4: Getting code execution as Local Service
With all the exploit elements in place, we can now proceed to getting the code execution. We are doing it in these steps:
  1. Read the address of jscript.dll from a vtable of any JScript object
  2. Read the address of kernel32.dll by reading the import table of jscript.dll
  3. Read the address of kernelbase.dll by reading the import table of kernel32.dll
  4. Scan kernel32.dll for rop gadgets we are going to need
  5. Get the address of WinExec from the export table of kernel32.dll
  6. Leak the stack address as explained in the previous section
  7. Prepare the ROP chain and write it to the stack, starting with a return address closest to our leaked stack address.

The ROP chain we are using looks like this:
[address of RET]  //needed to align the stack to 16 bytes[address of POP RCX; RET] //loads the first parameter into rcx[address of command to execute][address of POP RDX; RET] //loads the second parameter into rdx1[address of WinExec]
By executing this ROP chain we are calling WinExec with a command we specified. For example, if we run the command ‘cmd’ we are going to see a command prompt being spawned, running as Local Service (the same user WPAD service runs as).
Unfortunately, from a child process running as Local Service, we can’t talk to the network, but what we can do is drop our privilege escalation payload from memory to a disk location Local Service can write and execute it from there.
Stage 5: Privilege escalation
While the Local Service account is a service account, it doesn’t have administrative privileges. This means the exploit is quite limited in what it can access and modify on the system, especially to persist after exploitation or after the system has been rebooted. While there’s always likely to be an unfixed privilege escalation in Windows we don’t need to find a new vulnerability to escalate our privileges. Instead we can abuse a built-in feature to escalate from Local Service to the SYSTEM account. Let’s look at the privileges that the service account for WPAD has been granted:
Image 7: Service Access Token’s Privileges showing Impersonate Privilege
We’ve only got three privileges, but the highlighted privilege, SeImpersonatePrivilege is important. This privilege allows the service to impersonate other users on the local system. The reason the service has impersonate privilege is it accepts requests from all users on the local system and might need to perform actions on their behalf. However, as long as we can get an access token for the account we want to impersonate we can get the full access rights of the token’s user account, including SYSTEM which would give us administrator rights on the local system.
Abusing impersonation is a known issue with the Windows security model (you can find more details by searching for Token Kidnapping). Microsoft have tried to make it harder to get an access token for a privileged user but it’s virtually impossible to close all possible routes. For example, James discovered a vulnerability in Windows’ implementation of DCOM which allows any user to get access to a SYSTEM access token. While Microsoft fixed the direct privilege escalation vulnerability they didn’t, or perhaps couldn’t, fix the token kidnapping issue. We can abuse this feature to capture the SYSTEM token, impersonate the token, then completely compromise the system, such as installing a privileged service.
There’s an existing implementation of the token kidnapping via DCOM (RottenPotato) however the implementation was designed for use with the Metasploit framework’s getsystem command which we’re not using. Therefore, we implemented our own simpler version in C++ which directly spawns an arbitrary process with a SYSTEM token using the CreateProcessWithToken API. As a bonus we were able to compile it to an executable of 11KiB in size, much smaller than RottenPotato, which made it easier to drop to disk and run from the ROP payload.
Tying it all together
When the WPAD service queries for the PAC file, we serve the exploit file which exploits the WPAD service and runs WinExec to drop and execute the privilege escalation binary. This binary then executes a command (hardcoded ‘cmd’ in our case) as SYSTEM.
The exploit worked pretty reliably in our experiments, but it is interesting to note that a 100% reliable exploit isn’t required - if the exploit crashes the WPAD service, a new instance is going to get spawned when a client makes another request from WPAD service, so an attacker can just try again. There will be no indication in the UI that the WPAD service has crashed, although Window Error Reporting will likely pick up the crash and report it to Microsoft, provided that the user didn’t disable it.
In fact, our exploit doesn’t clean up gracefully and will crash the WPAD service once it runs its payload, so if we keep serving the exploit PAC file after the service has been exploited, it will just get exploited again. You can see the effect of that in Image 7, which was taken after leaving the exploit server running for some minutes and making a lot of HTTP requests in the victim machine.
Image 7: Did we leave the exploit running for too long?
We’ll publish the exploit source code in the issue tracker shortly.ConclusionExecuting untrusted JavaScript code is dangerous, and executing it in an unsandboxed process is even more so. This is true even if it’s done by a relatively compact JavaScript engine such as jscript.dll. We identified 7 security vulnerabilities in it and successfully demonstrated reliable code execution from local network (and beyond) against a fully patched (at the time of writing) Windows 10 64-bit with Fall Creators Update installed.
Since the bugs are now fixed, does this mean we are done and can go home? Unlikely. Although we spent a fair amount of time, effort and compute power on finding jscript.dll bugs, we make no claims that we found all of them. In fact, where there are 7 bugs, there is likely to be an 8th. So if something doesn’t change it is quite possible we’ll see a chain like this used in the wild someday (and that is, of course, optimistically assuming that attackers don’t have this capability already).
So, what can Microsoft do to make future attacks like this harder:
  • Disable WPAD by default. In fact, while the other operating systems support WPAD, Windows is the only one where it is enabled by default.
  • Sandbox the JScript interpreter inside the WPAD service. Since the interpreter needs to execute a JavaScript function with well defined inputs and return the output string, sandboxing it should be pretty straightforward. Given the simplicity of the input-output model, it would be great if Microsoft introduced a sandbox of comparable restrictiveness to seccomp-strict: Some processes really do not need more privileges than “receive a bit of data”, “perform a bit of computation”, “return a bit of data”.

In case you want to take action on your own, the only way to prevent this type of attack using new, currently unknown vulnerabilities, seems to be to completely disable the WinHttpAutoProxySvc service. Sometimes this can’t be done in the Services UI (“Startup type” control will be grayed out) due to other services depending on WPAD, but it can be done via the corresponding registry entry. Under “HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\WinHttpAutoProxySvc” change the value of “Start” from 3 (manual) to 4 (disabled).
These are some of the advices commonly found online when searching for “disabling WPAD” that did not work to prevent the attack in our experiments:
  • Turning off “Automatically detect settings” in Control Panel
  • Setting “WpadOverride” registry key
  • Putting “ wpad” in the hosts file (this is going to stop the DNS variant but likely not the DHCP variant)
Categories: Security
Subscribe to aggregator - Security