Side-Channel Analysis of Keeloq

Over the past few days, I have spent some time reviewing side-channel analysis of the Keeloq cipher. This post documents the results of this review.

Background

This work is not “new” per se: various forms of side-channel attack against Keeloq have been carried out in the past. Some public resources around this are below:

In addition, the 25C3 presentation on breaking Keeloq is worth watching, as is (albeit slightly beyond the scope of this post) this presentation at RuhrSec 2017.

To begin with, let’s take an overview of the Keeloq algorithm:

Put simply, Keeloq “mixes in” a single bit of key material from each round – therefore, in theory, we should be able to use this as a distinguisher for a simple DPA attack, as set forth in marc-invalid’s work.

Initial Analysis

To start, I took an HCS301-based entry system from Aliexpress. Opening a remote, I was able to locate the HCS301 chip, which I could extract with a little bit of hot air. I placed a 10 Ohm shunt in the VCC line, wrote a simple AVR trigger, and reviewed my initial power trace:

On the trace, some “blocks” are clearly visible, including one in the middle which appears to be a cryptographic block. The RF synchronisation pulses can be seen on the right: unfortunately, this revealed nothing. A tip from the NewAE forums pointed me to using a larger shunt resistor, which together with using the LC filter from the ChipWhisperer’s CW308 board did the trick, and gave me clean traces:

This looks much more like the power trace in the 2008 paper, and from here, I can make a guess at the location of the cryptographic block. Note that since 2008, this chip completes the Keeloq encryption a bit quicker, and is finished in <20ms from the trigger point, suggesting some implementation change.

Onwards, to Key Recovery!

I initially attempted to go down the marc-invalid path of identifying the ciphertext output via a TVLA test, but this didn’t produce the expected correlation peaks (TVLA against any bit of ciphertext output):

No matter – we can identify that our power trace has the same general layout as the 2008 paper, albeit with some timing differences, so let’s assume that the encryption operation is in the same general place:

We begin our exploration by performing a CPA attack against the Hamming distance of the last intermediate value of Keeloq to the final value, in the final round of decryption with a guessed key but (LSB of the key guess below, turns out setting the maximum key guess value to 2 breaks reporting functionality elsewhere):

Unfortunately,the raw difference in correlation value isn’t that huge, and this introduces the additional complexity of having to backtrack when a bit is guessed incorrectly (in the 2008 paper, this is when all candidates for the next key guessing round show low correlation values). However, we can note that given the round-based nature of Keeloq, the Hamming distance between a “guessed” intermediate value and the actual intermediate value should broaden: that is, the error from a single incorrect guessed bit of Keeloq should cascade through subsequent rounds, resulting in a different intermediate values (and different Hamming distance in the state register) over the next few rounds.

In truth, the actual difference is tiny, but we should be able to correlate this with known ciphertexts and known intermediate values to make this difference more significant.

We can capitalise on this by grouping Keeloq rounds into blocks of 8 bits of each, and perform a correlation analysis, correctly picking the actual key against a reprogrammed test chip correctly, with a stronger degree of confidence:

This result is backed up by continued success on the second round:

We can further confirm this across a variety of independent traces, further rounds, against the VCC line and across a ground shunt, etc.

Interestingly, it is possible to get some TVLA result on intermediate values that lends credibility (imo) to the hypothesis that the round-to-round Hamming distance “cascades” with previous rounds.

Firstly, a TVLA with incorrect partitioning (partitioned by intermediate value of first key byte, all bits incorrect):

And now, with four correct bits:

And now with correct partitioning:

Note that this doesn’t actually identify the specific leakage point (or rather, indicates that there is a discernable, statistically significant difference of means across a large part of a trace), but can give some indicating by comparison whether a Keeloq bit is correct. Frustratingly, DPA (using the same partitioning code as above) doesn’t seem to work:

Further Work

The work is not yet complete, and there are still tooling improvements to make. This technique fails eventually, due to clock drift – the clock is inconsistent within a single trace, as well as between traces.

The marc-invalid fork of ChipWhisperer provides some clues to implementation, allowing the user to “slice” each trace and fit them onto a new trace, aligning each “slice” with the corresponding slices in other traces via their peak or minimum points – it’s my intention to implement something similar, and experiment with various peak extraction techniques.

There is also the matter of why DPA doesn’t work (and “fails hard”, not just ghost peaks). I wonder if it’s got to do with the fact that DPA is instantaneous (and the “raw” maximum difference of means is larger for an incorrect guess, than the correct guess).

All the code used above can be found here, and sample traces for Keeloq here (known key starts with 0x18). Note that you’ll need to edit support/attacks/Keeloq.py to enter fragments of the known key to replicate the above.

Posted in Bards, Computers, Jesting | Leave a comment

Observations on Trace De-noising

Over the last few days, I’ve been trying different approaches to improve my side channel analysis technique against smartcards. This post serves as some notes on the sources of noise impacting a power trace, and some observations of easily implemented filtering mechanisms.

Perhaps a trigger warning is suitable – this post contains a sizeable heaping of intuitive maths.

Noise Overview

To begin with, let’s review a horizontally aligned (AVR, GPIO trigger before AES128, think this was 125MSPS) signal:

Zooming in, we can see significant noise (mismatches) on the signal:

In my mind, there are a few sources of noise:

  • Noisy power supply. Though decoupling capacitors are in use, they are not perfect, so some level of noise appears (though minimal)
    • There is low-frequency noise and high frequency noise here. I’m not sure what’s the root cause of this (switching noise inside SMPS?)
  • Execution changes. Changed execution can disturb power traces, though it’s possible to “watch” for this in software (via discarding traces that go outside a threshold when aligning, preferrably with the help of a low pass filter to bring out features). I think it’s too troublesome to denoise this, and it’s more efficient to just throw away the traces if it’s beyond a simple “shift left 5000 samples” denoise.
  • Measurement noise. When you take an oscilloscope and touch a probe tip to it’s corresponding ground grabber, the result is not a straight horizontal line, indicating there’s some inherent noise in the measurement process itself (electromagnetic interference on the oscilloscope probe leads? noise from components in the scope?)

The impact of noise is that a trace, when impacted by noise, will work “against” you in statistical analysis.

  • In a correlation attack, the power consumption for a given area may not go in the same direction as the hypothesis, causing the trace to contribute inaccuracy.
  • In a DPA attack, noise can destroy the true power consumption value of a given operation, again causing the trace to contribute inaccuracy.

To give a feel for the impact of noise on a trace, let’s review a plaintext tlva graph of the above:

The distinguisher function is if the last bit of the first byte of plaintext is even: where there should be a clear leakage peak, the null hypothesis is not strongly rejected (relatively), the t-value is *just* above 4.5. Correlation peaks (for consistency: all attack examples are against AES first-round s-box output, starting from sample 3000, for a length of 10000 sample points) do show, as below:

The signal differencies which DPA relies on are almost all but destroyed:

Band Pass Filtering

Intuitively, a band pass filter (or either half of one) can help remove some of the noise. In the above trace, let’s review the signal in the frequency domain:

We can see the clock at 16Mhz, and harmonics at 32 and 64, and significant noise below 10Mhz. As an aside, this is far less clear when viewed through the lens of a spectrogram – though we can identify the cryptographic blocks via frequency component changes, it’s harder to see “where” the noise is.

Intuitively, the changes we want to measure shouldn’t be too far off from the clock frequency. Aided by the above, we can implement a band pass filter between 10Mhz and 25Mhz. The filtered trace looks like this:

Zooming in, we can see that a lot of the high-frequency noise has been removed, and the curve looks significantly “smoothed”:

Significantly stronger correlation peaks are now visible:

Stronger “point of difference” peaks are also now visible with a DPA attack:

I would have expected a significant number of traces to have their detailed information destroyed by the filtering process (particularly given that our sampling is not clock-synchronous, so will be off by a few ns here and there) – so to my limited understanding, this result is somewhat counter-intuitive.

Also, intuitively, it would be better to implement this in hardware, to preserve detail, rather than compound signal processing “errors” (treating signal as noise) with digitisation errors. Still, I’m not confident in my understanding of high-frequency circuit design, and I fear hand-made filters will induce such loss as to make traces useless.

Wavelet Decomposition

I’ll probably edit this with better wording and more accurate terminology as time goes by, and I understand this topic better.

I recently stumbled across this technique when reading some IACR papers (side note: https://www.iacr.org/tinfoil.html is going into this coming semester’s COMP6443/COMP6483 course). In a nutshell, wavelets are an alternative method (to Fourier transforms) of decomposing a signal into it’s component frequency components. Intuitively, this “feels” like a repeat band-pass filter (though without needing to manually work out frequencies – each iteration of the wavelet contains half the detail of the previous one).

This video (and the series) goes some way to providing the background of wavelets.

Interestingly, while it is possible to apply a wavelet decomposition function to extract the wavelet components of a signal, it is also possible to take a series of wavelets and reconstruct (an approximation of) the original signal from them.

We can use the PyWavelets library (as per this blog post) to apply a wavelet decomposition, filtering and recomposition function to a signal. Based on previous liternature (lost the link) I used the Daubechies wavelet family, though more experimentation is needed to see the practical impact of other wavelet parameters.

The result of a wavelet-denoised signal looks like this:

Intuitively, the effect of the wavelet filter is, in this case, is highly similar to our band pass filter above, confirmed now by visual inspection. Correlation and differential analysis of the resulting signals confirm the same:

And for DPA:

A quick t-test also illustrates the fact that we’ve brought out significantly more detail through only minimal filtering:

More analysis is certainly needed, especially with hardware filtering implementations to preserve detail.

All sample data can be downloaded here.

Posted in Bards, Computers, Jesting | Leave a comment

Tooling Improvements! Hooray!

Over the past month, while staying the fuck at home, I’ve been steadily chipping away at improving my tooling. These are below, for reference (and partly self-motivation).

Doxastica

First conceived years ago when I wanted to cheat in Underrail, this DLL injection tool injects a lua interpreter (accessible over IPC) and a stack of helper functions into a target process:

This toolset has been updated to work with a modern Win10 SDK, and has additional options for loading, such as create-suspended-and-overwrite-code and injection into .NET processes (you can’t really interact with the CLR, but you can execute as the target).

A number of quality-of-life fixes have also been implemented, including JIT function calls to native functions (stdcall 32-bit only, 64-bit WIP), QoL fixes to db/dw/dd(), and input via command files.

Interestingly, if you disassemble a .NET executable as a PE file, you’ll note a direct call to _CorExeMain (or _CorDllMain). Using the jmp $-2 trick and repatching the entrypoint doesn’t seem to work, you’ll need to insert a loader stub to manually call _CorExeMain (I’m not too sure why, please do comment if you know). Some interesting, related reading material is here.

Work is ongoing on the ability to “self-sustain” reflective DLL loading (preferrably without lugging around an image of doxastica into memory). Unfortunately, this doesn’t seem too easy with the MSVC Runtime’s initialization procedures – if you take a DLL from an “infected” process and try to copy it to another, you can manually sort out section addresses and relocs, but the MVSCRT strucutres are already done. It may be possible to copy an uninitialized copy of MSVCRT’s .data section over (but extremely brittle – but I think there are no “winner” approaches here). For what it’s worth, the code is available here.

I was hoping to take a group of people (some colleagues, some ex) through Pwn Adventure 3, but this has stalled – understandably, as the barrier of entry and time commitment requirement are both significantly higher than say, web security.

Sparkgap

Following the example of ChipWhisperer and it’s spawned community projects, I’ve combined my previous attempts at SCA and FI into a single project, built off the ChipWhisperer, ChipSHOUTER and a custom FPGA back-end (implemented on an Arty A7 35-T board) and some scrap protoboard glitch inserters.

The primary change is a change from paragraph-long command lines, to a console interface like this (with input-from-file support):

The new “capturebuddy” interface no longer requires significant rework every time I want to use a different oscilloscope or driver interface: I simply set up a “driver” (which controls target logic) and a “frontend” (representing capture hardware), and capturebuddy takes care of the rest.

A standalone “triggerbuddy” utility is also provided – this can be imported to control the FPGA trigger interface from a console, or imported as a library to automate the same. The trigger mechanism is extraordinarily simple, but imo adaptable to most protocols. You set four configuration parameters (io, clk, ns and pulse width), the FPGA waits for io_count rising edges on the io line, then clk_count edges on the clock line, then ns_edges *rising* edges on an internal 400Mhz clock, and generates a pulse of 0.5ns*pulsewidth out. No consideration is given to synchronous measurement (later project), but given the structure of the verilog, you can just extend the “master” ns_clock to implement this: CW1200 SAD-match functionality here we come!

I also did some further investigation into why my previous smartcard triggering introduced such extreme jitter – I’m confident enough in my Verilog now that I believe this is due to the 1.8V smartcard transmission not being detected by the 3.3v I/O ports on the FPGA board. It would detect in say, 99% of cases but fail occasionally (and this effect would horribly cascade over thousands of edges).

My initial attempt at resolution was to use two 2N3904’s to level shift up to 3.3v, but this creates “lag artifacts” during the level shift process and completely obliterates the clock line (input resistance slowing the changes?):

(Image from https://electronics.stackexchange.com/questions/333229/1-8v-uart-with-3-3v-uart, replace UART with IO and CLK).

Based on overall project status, I felt that I would be better off writing my own smartcard reader/writer:

The code for the improved sparkgap framework are here, and will include the smartcard reader/writer once it’s somewhat stable.

UNSW

I’m once again lecturing at UNSW, though this time remotely. I’ll be teaching the COMP6443/6483 Web Application Security and Extended Web Application Security courses. This is an excellent opportunity to revamp the entirety of the course, now with an expanded support team – to clean up old content, integrate new tooling and techniques and reinvigorate the course.

As I’m doing this, I can’t help but draw parallels to this video:

While there have been new innovations in web security (mostly dealing with scale, not the “innovations” of crawling the darkweb with blockchain regex machine learning cyber cloud[1]), we’re still teaching students about how to exploit SQL injection as a “core building block”.

It’s my hope that instead of continuing in this path, the 2020 revamped web security courses will focus more on underlying concepts of code injection (the mixing of code and data, instead of SQLi) and that students can more intuitively grasp SQLi – and moving forward, we can eventually drop traditional SQLi and XSS as topics (again, focussing on underlying concepts so students can intuitively derive these attacks, no matter the defenses in play). From the students perspective, they’re paying good money to learn, we should provide something of value, not hand them a stack of pre-made attacks they can stack together like lego.

My thanks to everyone who has contributed to this course so far, and to everyone who is working on the course this year.

[1]: This is an interesting talk around the “Security Products we Deserve”, and is well worth watching for some insight into this problem.

Posted in Bards, Computers, Jesting | Leave a comment

Writeup – Old Crypto Server (AeroCTF)

This weekend, I spent a little bit of time doing AeroCTF, but couldn’t maintain focus due to the sheer number of other active projects. I solved one challenge in the time allocated, a fairly straightforward AES padding game that you can download here.

Astoundingly, I have no writeups of any challenges like this, so here we go.

The vulnerability in the system is straightforward as can be: the server, using AES in ECB mode (i.e. an input block will always have the same output block), accepts some input and pads it into block-sized chunks. We’re able to manipulate this by controlling the plaintext to some degree (allowing us to insert an arbitrary block), which we compare to the last block:

From here, we can individually brute force the bytes of the key one byte at a time – we add one character to the prefix, causing one character (second last character of the flag) to be added to the final byte of the key. By brute forcing the additional prefix character, we can determine what the second last character of the flag is:

Assuming a few key functions are available, we can automate the attack. The solver script I used for this challenge is here.

Thanks to the AeroCTF team for putting these challenges together – while I didn’t manage to solve any more during the time allocated, the other challenge I began looking at (aerofloat) looked fairly well put together.

See you in UTCTF.

Posted in Bards, Computers, Jesting | Leave a comment

Fault Injection on Linux: Practical KERNELFAULT-Style Attacks

This post will detail my examination of the KERNELFAULT paper, which you can download here. In brief summary, the KERNELFAULT paper describes a methodology for injecting faults into a system running Linux, for the purpose of privilege escalation to root, as well as for controlling the program counter.

The code used in this attack can be downloaded here, within the “pi” folder.

I began my experiment by preparing a Raspberry Pi 4. I made a software reset puller (controlled via ChipWhisperer’s GPIO’s) and broke out the UART. I also removed the 1.2V decoupling capacitors on the other side of the PCB, opposite the Broadcom SoC. These are marked below, with purple and blue markings:

Conceptually, the path to root is clear:

  • We begin as a standard user
  • We set all our unused registers to 0
  • We execute a setresuid(0,0,0) system call
  • We induce a fault of some description within the setresuid call, causing the operating system to incorrectly assign us a privilege.
  • We check if we now have elevated privileges, and if so, return the user to a root shell.

Let us examine the setresuid call further, pasted below for reference incase it changes down the track:

/*
* This function implements a generic ability to update ruid, euid,
* and suid. This allows you to implement the 4.4 compatible seteuid().
*/
long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
{
struct user_namespace *ns = current_user_ns();
const struct cred *old;
struct cred *new;
int retval;
kuid_t kruid, keuid, ksuid;

kruid = make_kuid(ns, ruid);
keuid = make_kuid(ns, euid);
ksuid = make_kuid(ns, suid);

if ((ruid != (uid_t) -1) && !uid_valid(kruid))
return -EINVAL;

if ((euid != (uid_t) -1) && !uid_valid(keuid))
return -EINVAL;

if ((suid != (uid_t) -1) && !uid_valid(ksuid))
return -EINVAL;

new = prepare_creds();
if (!new)
return -ENOMEM;

old = current_cred();

retval = -EPERM;
if (!ns_capable_setid(old->user_ns, CAP_SETUID)) {
if (ruid != (uid_t) -1 && !uid_eq(kruid, old->uid) &&
!uid_eq(kruid, old->euid) && !uid_eq(kruid, old->suid))
goto error;
if (euid != (uid_t) -1 && !uid_eq(keuid, old->uid) &&
!uid_eq(keuid, old->euid) && !uid_eq(keuid, old->suid))
goto error;
if (suid != (uid_t) -1 && !uid_eq(ksuid, old->uid) &&
!uid_eq(ksuid, old->euid) && !uid_eq(ksuid, old->suid))
goto error;
}

if (ruid != (uid_t) -1) {
new->uid = kruid;
if (!uid_eq(kruid, old->uid)) {
retval = set_user(new);
if (retval < 0)
goto error;
}
}
if (euid != (uid_t) -1)
new->euid = keuid;
if (suid != (uid_t) -1)
new->suid = ksuid;
new->fsuid = new->euid;

retval = security_task_fix_setuid(new, old, LSM_SETID_RES);
if (retval < 0)
goto error;

return commit_creds(new);

error:
abort_creds(new);
return retval;
}

The fault must be injected precisely, such that the ns_capable_setid passes. We are unable to precisely control the fault to within an instruction’s execution time (we are – as far as I’m aware – unable to synchronise to the Broadcom SoC’s internal PLL), but fortunately, the instructions highlighted in blue are not required for our attack to succeed, some instruction corruption here is fine, as long as we do not end up with a kernel panic. And there are some wierd, wallpaper-worthy kernel panics:

To set up our test harness, we create a small usermode program, available in the git repo, which uses a memory mapped GPIO to trigger a ChipWhisperer’s delay, then a glitch is inserted in the middle of the sys_setresuid call. Several additional checks are made, to determine if we have “won”, as well as a sanity counter to determine when we our glitches are too far forward: regular corruption of the counting loop indicates that the ext_offset of our glitches is too large.

We must be careful in leveraging our success as well – “/bin/bash” and “/bin/sh” both drop privileges upon execution, so we must use “/bin/dash”, which does not – this cost me a number of successful glitches.

A little while of glitching later, and we have success:

Along the way, a large number of crashes within the kernel were generated. I built a simple Python script to generate some simple statistics. Of note:

  • At a high level, 4982 glitch attempts were made across a few hours in one afternoon (on and off). Of these, 1393 were “crashes” of some description, and 4 attempts resulted in a successful privilege escalation, giving this instance of the attack a success rate of just under 0.1%.
    • Note that the average time to a successful glitch is 15 to 30 minutes, once a narrow range of successful glitching has been established.
  • The top three locations in the kernel for crashes to occur during glitching (giving clues to the severity and type of corruption) are:
    • cap_capable+0x2C with 106 instances
    • commit_creds+0x90 with 94 instances
    • cap_capable+0x14 with 58 instances
  • A good portion of the cap_capable glitches involved pointer derefrencing to near-zero addresses. Given that we are likely seeding the zero pointer, perhaps it is possible to *vastly* increase the likelihood of success by pointing cap_capable to pre-poisoned structures, established in userland (and semi-bypassing KASLR from crash reporting?)
  • Twice, the system suffered “unsafe recovery”, where the system would appear to recover, but executables would fail to run, as if corruption in some fundamental system ABI would occur, unpredictably. For example:

  • A variant of this is that the executable would “mute”, and simply do nothing, needing to be rebuilt before working again.

Thankyou to the Riscure team for their original public discussion of this type of attack – given the potential impacts, I’m surprised this style of attack is not further discussed in the industry. This type of work is eminently enjoyable, and I look forward to investigating similar techniques further.

Posted in Bards, Computers, Jesting | 1 Comment