Posted by Natalie Silvanovich, Project Zero
Over the past five years, video conferencing support in websites and applications has exploded. Facebook, WhatsApp, FaceTime and Signal are just a few of the many ways that users can make audio and video calls across networks. While a lot of research has been done into the cryptographic and privacy properties of video conferencing, there is limited information available about the attack surface of these platforms and their susceptibility to vulnerabilities. We reviewed the three most widely-used video conferencing implementations. In this series of blog posts, we describe what we found.
This part will discuss our analysis of WebRTC. Part 2 will cover our analysis of FaceTime. Part 3 will discuss how we fuzzed WhatsApp. Part 4 will describe some attacks against WhatsApp that didn’t work out. And finally, Part 5 will discuss the future of video conferencing and steps that developers can take to improve the security of their implementation.
Typical Video Conferencing Architecture
All the video conferencing implementations we investigated allow at least two peers anywhere on the Internet to communicate through audiovisual streams. Implementing this capability so that it is reliable and has good audio and video quality presents several challenges. First, the peers need to be able to find each other and establish a connection regardless of NATs or other network infrastructure. Then they need to be able to communicate, even though they could be on different platforms, application versions or browsers. Finally, they need to maintain audio and video quality, even if the connection is low-bandwidth or noisy.
Almost all video conferencing solutions have converged on a single architecture. It assumes that two peers can communicate via a secure, integrity checked channel which may have low bandwidth or involve an intermediary server, and it allows them to create a faster, higher-bandwidth peer-to-peer channel.
The first stage in creating a connection is called signalling. It is the process through which the two peers exchange the information they will need to create a connection, including network addresses, supported codecs and cryptographic keys. Usually, the calling peer sends a call request including information about itself to the receiving peer, and then the receiving peer responds with similar information. SDP is a common protocol for exchanging this information, but it is not always used, and most implementations do not conform to the specification. It is common for mobile messaging apps to send this information in a specially formatted message, delivered through the same channel that carries text messages. Websites that support video conferencing often use WebSockets to exchange the information, or exchange it via HTTPS using the webserver as an intermediary.
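To give a sense of what gets exchanged, a stripped-down SDP offer might look something like the fragment below. The exact fields vary by implementation, and this is purely an illustrative sketch rather than output from any particular application, but it shows the three kinds of information mentioned above: connectivity candidates, supported codecs and key material (here, a DTLS certificate fingerprint).

v=0
o=- 46117314 2 IN IP4 192.0.2.10
s=-
t=0 0
m=audio 9 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
a=ice-ufrag:EsAw
a=ice-pwd:P2uYro0UCOQ4zxjKXaWCBui1
a=fingerprint:sha-256 19:E2:1C:3B:4B:9F:81:E6:B8:5C:F4:A5:A8:D8:73:04:BB:05:2F:70:9F:04:A9:0E:05:E9:26:33:E8:70:88:A2
a=setup:actpass
a=candidate:1 1 udp 2113937151 192.0.2.10 54400 typ host
m=video 9 UDP/TLS/RTP/SAVPF 96
a=rtpmap:96 VP8/90000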
Once signalling is complete, the peers find a way to route traffic to each other using the STUN, TURN and ICE protocols. Depending on what these protocols determine, the peers can create UDP, UDP-over-STUN and occasionally TCP connections, choosing whichever is favorable for the network conditions.
Once the connection has been made, the peers communicate using the Real-time Transport Protocol (RTP). Though this protocol is standardized, most implementations deviate somewhat from the standard. RTP can be encrypted using a protocol called Secure RTP (SRTP), and some implementations also encrypt streams using DTLS. Under the encryption envelope, RTP supports features that allow multiple streams and formats of data to be exchanged simultaneously. Then, based on how RTP classifies the data, it is passed on to other processing, such as video codecs. The Stream Control Transmission Protocol (SCTP) is also sometimes used to exchange small amounts of data (for example, a text message on top of a call) during video conferencing, but it is less commonly used than RTP.
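To make the RTP layer a bit more concrete, the sketch below parses the fixed 12-byte RTP header described in RFC 3550. It is illustrative code, not WebRTC's implementation; the payload type field is what tells the receiving side which depacketizer and codec the rest of the packet should be handed to.

#include <cstddef>
#include <cstdint>
#include <optional>

struct RtpHeader {
  uint8_t version;          // always 2 in practice
  bool has_padding;
  bool has_extension;
  uint8_t csrc_count;
  bool marker;
  uint8_t payload_type;     // selects the codec/depacketizer for the payload
  uint16_t sequence_number;
  uint32_t timestamp;
  uint32_t ssrc;            // identifies the stream within the session
};

// Parses the fixed part of an RTP header, returning nothing if the packet
// is too short to contain it.
std::optional<RtpHeader> ParseRtpHeader(const uint8_t* data, size_t len) {
  if (len < 12) return std::nullopt;
  RtpHeader h;
  h.version = data[0] >> 6;
  h.has_padding = (data[0] >> 5) & 1;
  h.has_extension = (data[0] >> 4) & 1;
  h.csrc_count = data[0] & 0x0f;
  h.marker = data[1] >> 7;
  h.payload_type = data[1] & 0x7f;
  h.sequence_number = (data[2] << 8) | data[3];
  h.timestamp = (uint32_t{data[4]} << 24) | (uint32_t{data[5]} << 16) |
                (uint32_t{data[6]} << 8) | data[7];
  h.ssrc = (uint32_t{data[8]} << 24) | (uint32_t{data[9]} << 16) |
           (uint32_t{data[10]} << 8) | data[11];
  return h;
}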
Even when it is encrypted, RTP often doesn’t include integrity protection, and if it does, it usually doesn’t discard malformed packets. Instead, it attempts to recover them using strategies such as Forward Error Correction (FEC). Most video conferencing solutions also detect when a channel is noisy or low-bandwidth and attempt to handle the situation in a way that leads to the best audio and video quality, for example, sending fewer frames or changing codecs. The RTP Control Protocol (RTCP) is used to exchange statistics on network quality and coordinate adjusting properties of RTP streams to adapt to network conditions.
WebRTC
WebRTC is an open source project that enables video conferencing. It is by far the most commonly used implementation. Chrome, Safari, Firefox, Facebook Messenger, Signal and many other mobile applications use WebRTC. WebRTC seemed like a good starting point for looking at video conferencing as it is heavily used, open source and reasonably well-documented.
WebRTC Signalling
I started by looking at WebRTC signalling, because it is an attack surface that does not require any user interaction. Protocols like RTP usually start being processed after a user has picked up the video call, but signalling is performed before the user is notified of the call. WebRTC uses SDP for signalling.
I reviewed the WebRTC SDP parser code, but did not find any bugs. I also compiled it so it would accept an SDP file on the command line and fuzzed it, but I did not find any bugs through fuzzing either. I later discovered that WebRTC signalling is not implemented consistently across browsers anyhow: Chrome uses the main WebRTC implementation, Safari has branched slightly, and Firefox uses its own implementation. Most mobile applications that use WebRTC also implement their own signalling using a protocol other than SDP. So it is not likely that a bug in WebRTC signalling would affect a wide variety of targets.
RTP Fuzzing
I then decided to look at how RTP is processed in WebRTC. While RTP is not an interaction-less attack surface because the user usually has to answer the call before RTP traffic is processed, picking up a call is a reasonable action to expect a user to take. I started by looking at the WebRTC source, but it is very large and complex, so I decided fuzzing would be a better approach.
The WebRTC repository contains fuzzers written for OSS-Fuzz for every protocol and codec supported by WebRTC, but they do not simulate the interactions between the various parsers, and do not maintain state between test cases, so it seemed likely that end-to-end fuzzing would provide additional coverage.
Setting up end-to-end fuzzing was fairly time intensive, so to see if it was likely to find many bugs, I altered Chrome to send malformed RTP packets. I changed the srtp_protect function in libsrtp so that it ran the following fuzzer on every packet:
void fuzz(char* buf, int len) {
  int q = rand() % 10;
  if (q == 7) {
    // Flip a single random byte in roughly 10% of packets.
    int ind = rand() % len;
    buf[ind] = rand();
  }
  if (q == 5) {
    // Replace another ~10% of packets entirely with random bytes.
    for (int i = 0; i < len; i++)
      buf[i] = rand();
  }
}
RTP fuzzer (fuzzer q)
When this version was used to make a WebRTC call to an unmodified instance of Chrome, the unmodified instance crashed roughly every 30 seconds.
Most of the crashes were due to divide-by-zero exceptions, which I submitted patches for, but there were three interesting crashes. I reproduced them by altering the WebRTC source in Chrome so that it would generate the packets that caused the same crashes, and then set up a standalone build of WebRTC to reproduce them, so that it was not necessary to rebuild Chrome to reproduce the issues.
The first issue, CVE-2018-6130, is an out-of-bounds memory issue related to the use of std::map find in processing VP9 (a video codec). In the following code, the value tl0_pic_idx is pulled out of an RTP packet unverified (GOF stands for group of frames).
if (frame->frame_type() == kVideoFrameKey) {
...
GofInfo info = gof_info_.find(codec_header.tl0_pic_idx)->second;
FrameReceivedVp9(frame->id.picture_id, &info);
UnwrapPictureIds(frame);
return kHandOff;
}
If this value does not exist in the gof_info_ map, std::map::find returns the map's end iterator, which points one element past the allocated values for the map. Depending on memory layout, dereferencing this iterator will either crash or return the contents of unallocated memory.
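The hazard is easy to reproduce in isolation. The snippet below is a minimal standalone sketch (not WebRTC code) showing why the lookup needs to be compared against end() before the iterator is dereferenced.

#include <cstdio>
#include <map>

struct GofInfo { int num_frames; };

int main() {
  std::map<int, GofInfo> gof_info;
  gof_info[1] = GofInfo{4};

  int untrusted_idx = 200;  // imagine this came straight from a packet

  // Unsafe: if the key is missing, find() returns gof_info.end(), and
  // dereferencing that iterator is undefined behaviour (an out-of-bounds read).
  // GofInfo info = gof_info.find(untrusted_idx)->second;

  // Safe pattern: check the iterator before using it.
  auto it = gof_info.find(untrusted_idx);
  if (it == gof_info.end()) {
    printf("unknown index, dropping packet\n");
    return 0;
  }
  GofInfo info = it->second;
  printf("num_frames: %d\n", info.num_frames);
}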
The second issue, CVE-2018-6129, is a more typical out-of-bounds read issue, where the index of a field is read out of an RTP packet, and not verified before it is used to index a vector.
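Sketched in isolation (again, illustrative code rather than the actual WebRTC source), the pattern and its fix look like this:

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  std::vector<uint32_t> layer_info = {10, 20, 30};
  uint8_t untrusted_index = 250;  // field taken directly from an RTP packet

  // Unsafe: reads out of bounds when the packet lies about the index.
  // uint32_t value = layer_info[untrusted_index];

  // Safe: verify the index against the vector's size before using it.
  if (untrusted_index >= layer_info.size()) {
    printf("invalid index, dropping packet\n");
    return 0;
  }
  printf("value: %u\n", layer_info[untrusted_index]);
}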
The third issue, CVE-2018-6157, is a type confusion issue that occurs when a packet that looks like a VP8 packet is sent to the H264 parser. The packet will eventually be treated like an H264 packet even though it hasn’t gone through the necessary checks for H264. The impact of this issue is also limited to reading out of bounds.
There are a lot of limitations to the approach of fuzzing in a browser. It is very slow, the issues are difficult to reproduce, and it is difficult to fuzz a variety of test cases, because each call needs to be started manually and certain properties, such as the default codec, can’t be changed in the middle of a call. After I reported these issues, the WebRTC team suggested that I use the video_replay tool, which can replay RTP streams recorded in a patched browser. The tool was not able to reproduce many of my issues because they relied on non-default WebRTC settings configured through signalling, so I added the ability to load a configuration file alongside the RTP dump. This made it possible to quickly reproduce vulnerabilities in WebRTC.
This tool also had the benefit of enabling much faster fuzzing, as it was possible to fuzz RTP by fuzzing the RTP dump file and loading it into video_replay. There were some false positives, as it was also possible that fuzzing caused bugs in parsing the RTP dump file format, but most of the bugs were actually in RTP processing.
Fuzzing with the video_replay tool with code coverage and ASAN enabled led to four more bugs. We ran the fuzzer on 50 cores for about two weeks to find these issues.
CVE-2018-6156 is probably the most exploitable bug uncovered. It is a large overflow in FEC. The buffer WebRTC uses to process FEC packets is 1500 bytes, but WebRTC does no size checking of these packets once they are extracted from RTP. In practice, they can be up to about 2000 bytes long.
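The sizes below mirror the description above, but the code is only a sketch of the bug class with hypothetical names (kFecPacketBufferSize, StoreFecPacket*), not the actual FEC implementation. The point is that the payload length extracted from RTP is never compared against the fixed-size buffer before the copy.

#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kFecPacketBufferSize = 1500;

struct FecPacket {
  uint8_t data[kFecPacketBufferSize];
  size_t length;
};

// Vulnerable pattern: the payload extracted from RTP can be larger than the
// 1500-byte buffer, so memcpy writes past the end of packet->data.
void StoreFecPacketUnsafe(FecPacket* packet, const uint8_t* payload,
                          size_t payload_len) {
  memcpy(packet->data, payload, payload_len);  // no size check: buffer overflow
  packet->length = payload_len;
}

// Fixed pattern: reject payloads that do not fit in the buffer.
bool StoreFecPacketSafe(FecPacket* packet, const uint8_t* payload,
                        size_t payload_len) {
  if (payload_len > kFecPacketBufferSize) return false;
  memcpy(packet->data, payload, payload_len);
  packet->length = payload_len;
  return true;
}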
CVE-2018-6155 is a use-after-free in a video codec called VP8. It is interesting because it affects the VP8 library, libvpx, as opposed to code in WebRTC, so it has the potential to affect software other than WebRTC that uses this library. A generic fix for libvpx was released as a result of this bug.
CVE-2018-16071 is a use-after-free in VP9 processing that is somewhat similar to CVE-2018-6130. Once again, an untrusted index is pulled out of a packet, but this time it is used as the upper bound of a vector erase operation, so it is possible to delete all the elements of the vector before they are used.
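In isolation the shape of the bug looks roughly like this (an illustrative sketch, not the actual WebRTC source): a packet-controlled value decides how much of the vector is erased, so pointers into the vector that the code expects to keep using can end up dangling.

#include <cstdint>
#include <cstdio>
#include <memory>
#include <vector>

struct Frame { int picture_id; };

int main() {
  std::vector<std::unique_ptr<Frame>> frames;
  for (int i = 0; i < 5; i++)
    frames.push_back(std::make_unique<Frame>(Frame{i}));

  // A raw pointer the code expects to keep using later.
  Frame* current = frames[2].get();

  uint16_t untrusted_end = 5;  // value pulled out of a packet, unchecked

  // Dangerous pattern: the packet-controlled value is used as the upper bound
  // of the erase, so an attacker can delete every frame, including the one
  // that |current| points to.
  if (untrusted_end <= frames.size())
    frames.erase(frames.begin(), frames.begin() + untrusted_end);

  printf("%d\n", current->picture_id);  // use-after-free when all frames were erased
}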
Overall, end-to-end fuzzing found a lot of bugs in WebRTC, and a few were fairly serious. They have all now been fixed. This shows that end-to-end fuzzing is an effective approach for finding vulnerabilities in this type of video conferencing solution. In Part 2, we will try a similar approach on FaceTime. Stay tuned!