Observations on Trace De-noising

Over the last few days, I’ve been trying different approaches to improve my side channel analysis technique against smartcards. This post serves as some notes on the sources of noise impacting a power trace, and some observations of easily implemented filtering mechanisms.

Perhaps a trigger warning is suitable – this post contains a sizeable heaping of intuitive maths.

Noise Overview

To begin with, let’s review a horizontally aligned signal (AVR, GPIO trigger before AES128 – I think this was captured at 125 MS/s):

Zooming in, we can see significant noise (mismatches) on the signal:

In my mind, there are a few sources of noise:

  • Noisy power supply. Though decoupling capacitors are in use, they are not perfect, so some level of noise appears (though minimal)
    • There is low-frequency and high-frequency noise here. I’m not sure what the root cause of this is (switching noise inside the SMPS?)
  • Execution changes. Changed execution can disturb power traces, though it’s possible to “watch” for this in software (by discarding traces that exceed a threshold when aligning, preferably with the help of a low-pass filter to bring out features). I think it’s too troublesome to de-noise this; it’s more efficient to just throw away the traces if they can’t be fixed by a simple “shift left 5000 samples” correction.
  • Measurement noise. When you take an oscilloscope and touch a probe tip to its corresponding ground grabber, the result is not a straight horizontal line, indicating there’s some inherent noise in the measurement process itself (electromagnetic interference on the oscilloscope probe leads? noise from components in the scope?)

The impact of all this is that a noise-affected trace will work “against” you in statistical analysis.

  • In a correlation attack, the power consumption for a given area may not go in the same direction as the hypothesis, causing the trace to contribute inaccuracy.
  • In a DPA attack, noise can destroy the true power consumption value of a given operation, again causing the trace to contribute inaccuracy.
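To make the DPA case concrete, the classic distinguisher is just a difference of mean traces across a bit partition. A minimal numpy sketch (the trace array and partition bits here are synthetic stand-ins, not the captures above):

```python
import numpy as np

def dpa_difference(traces, partition_bits):
    # traces: (n_traces, n_samples) array; partition_bits: numpy 0/1 array,
    # one per trace, e.g. a bit of the hypothesised first-round s-box output
    ones = traces[partition_bits == 1].mean(axis=0)
    zeros = traces[partition_bits == 0].mean(axis=0)
    return ones - zeros  # peaks where the bit genuinely influences power
```

Noise inflates the variance of both means, flattening the peak – which is exactly why the trace above “contributes inaccuracy”.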

To give a feel for the impact of noise on a trace, let’s review a plaintext TVLA graph of the above:

The distinguisher function is whether the first byte of plaintext is even (i.e. its last bit): where there should be a clear leakage peak, the null hypothesis is not strongly rejected (relatively speaking) – the t-value is *just* above 4.5. Correlation peaks (for consistency: all attack examples are against the AES first-round s-box output, starting from sample 3000, for a length of 10000 sample points) do show, as below:
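For reference, the t-values come from a per-sample Welch’s t-test between the two partitions. A minimal sketch, run here on synthetic data rather than the real captures:

```python
import numpy as np

def welch_t(traces, bits):
    # per-sample Welch's t-statistic between the two partitions
    a, b = traces[bits == 0], traces[bits == 1]
    va = a.var(axis=0, ddof=1) / len(a)
    vb = b.var(axis=0, ddof=1) / len(b)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(va + vb)

# synthetic demo: a small leak at sample 3 should push |t| past the usual 4.5 threshold
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, 2000)
traces = rng.normal(0.0, 1.0, (2000, 10))
traces[:, 3] += 0.5 * bits
t = welch_t(traces, bits)
```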

The signal differences which DPA relies on are all but destroyed:

Band Pass Filtering

Intuitively, a band pass filter (or either half of one) can help remove some of the noise. In the above trace, let’s review the signal in the frequency domain:

We can see the clock at 16MHz, harmonics at 32MHz and 64MHz, and significant noise below 10MHz. As an aside, this is far less clear when viewed through the lens of a spectrogram – though we can identify the cryptographic blocks via frequency component changes, it’s harder to see “where” the noise is.

Intuitively, the changes we want to measure shouldn’t be too far off from the clock frequency. Aided by the above, we can implement a band pass filter between 10MHz and 25MHz. The filtered trace looks like this:
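A dependency-light sketch of such a filter, using a brick-wall FFT cut rather than a proper IIR design (in practice a Butterworth filter via scipy.signal would be gentler on the passband edges – the sample rate below is the assumed 125 MS/s from earlier):

```python
import numpy as np

FS = 125e6  # assumed sample rate (125 MS/s, per the capture above)

def bandpass(trace, lo=10e6, hi=25e6, fs=FS):
    # zero-phase brick-wall band-pass: zero all FFT bins outside [lo, hi]
    spectrum = np.fft.rfft(trace)
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs > hi)] = 0
    return np.fft.irfft(spectrum, n=len(trace))
```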

Zooming in, we can see that a lot of the high-frequency noise has been removed, and the curve looks significantly “smoothed”:

Significantly stronger correlation peaks are now visible:

Stronger “point of difference” peaks are also now visible with a DPA attack:

I would have expected a significant number of traces to have their detailed information destroyed by the filtering process (particularly given that our sampling is not clock-synchronous, so will be off by a few ns here and there) – so to my limited understanding, this result is somewhat counter-intuitive.

Also, intuitively, it would be better to implement this in hardware, to preserve detail, rather than compound signal processing “errors” (treating signal as noise) with digitisation errors. Still, I’m not confident in my understanding of high-frequency circuit design, and I fear hand-made filters will induce such loss as to make traces useless.

Wavelet Decomposition

I’ll probably edit this with better wording and more accurate terminology as time goes by, and I understand this topic better.

I recently stumbled across this technique when reading some IACR papers (side note: https://www.iacr.org/tinfoil.html is going into this coming semester’s COMP6443/COMP6483 course). In a nutshell, wavelets are an alternative method (to Fourier transforms) of decomposing a signal into its component frequencies. Intuitively, this “feels” like a repeated band-pass filter (though without needing to manually work out frequencies – each iteration of the wavelet contains half the detail of the previous one).

This video (and the series) goes some way to providing the background of wavelets.

Interestingly, while it is possible to apply a wavelet decomposition function to extract the wavelet components of a signal, it is also possible to take a series of wavelets and reconstruct (an approximation of) the original signal from them.

We can use the PyWavelets library (as per this blog post) to apply a wavelet decomposition, filtering and recomposition function to a signal. Based on previous literature (I’ve lost the link) I used the Daubechies wavelet family, though more experimentation is needed to see the practical impact of other wavelet parameters.
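A sketch of that decompose→threshold→reconstruct pipeline with PyWavelets. The parameters here (db4, four levels, universal threshold from the finest detail band) are my guesses, not necessarily what produced the traces below:

```python
import numpy as np
import pywt

def wavelet_denoise(trace, wavelet="db4", level=4):
    coeffs = pywt.wavedec(trace, wavelet, level=level)
    # estimate the noise level from the finest detail coefficients
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thresh = sigma * np.sqrt(2.0 * np.log(len(trace)))
    # soft-threshold every detail band, keep the approximation untouched
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(trace)]
```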

The result of a wavelet-denoised signal looks like this:

Intuitively, the effect of the wavelet filter is, in this case, highly similar to that of our band pass filter above, as confirmed by visual inspection. Correlation and differential analysis of the resulting signals confirm the same:

And for DPA:

A quick t-test also illustrates the fact that we’ve brought out significantly more detail through only minimal filtering:

More analysis is certainly needed, especially with hardware filtering implementations to preserve detail.

All sample data can be downloaded here.

Posted in Bards, Computers, Jesting | Leave a comment

Tooling Improvements! Hooray!

Over the past month, while staying the fuck at home, I’ve been steadily chipping away at improving my tooling. These are below, for reference (and partly self-motivation).

Doxastica

First conceived years ago when I wanted to cheat in Underrail, this DLL injection tool injects a lua interpreter (accessible over IPC) and a stack of helper functions into a target process:

This toolset has been updated to work with a modern Win10 SDK, and has additional options for loading, such as create-suspended-and-overwrite-code and injection into .NET processes (you can’t really interact with the CLR, but you can execute as the target).

A number of quality-of-life fixes have also been implemented, including JIT function calls to native functions (stdcall 32-bit only, 64-bit WIP), QoL fixes to db/dw/dd(), and input via command files.

Interestingly, if you disassemble a .NET executable as a PE file, you’ll note a direct call to _CorExeMain (or _CorDllMain). Using the jmp $-2 trick and repatching the entrypoint doesn’t seem to work, you’ll need to insert a loader stub to manually call _CorExeMain (I’m not too sure why, please do comment if you know). Some interesting, related reading material is here.

Work is ongoing on the ability to “self-sustain” reflective DLL loading (preferably without lugging around an image of doxastica in memory). Unfortunately, this doesn’t seem too easy with the MSVC Runtime’s initialization procedures – if you take a DLL from an “infected” process and try to copy it to another, you can manually sort out section addresses and relocs, but the MSVCRT structures are already initialized. It may be possible to copy an uninitialized copy of MSVCRT’s .data section over (extremely brittle – but I think there are no “winner” approaches here). For what it’s worth, the code is available here.

I was hoping to take a group of people (some colleagues, some ex-colleagues) through Pwn Adventure 3, but this has stalled – understandably, as the barrier to entry and time commitment are both significantly higher than for, say, web security.

Sparkgap

Following the example of ChipWhisperer and its spawned community projects, I’ve combined my previous attempts at SCA and FI into a single project, built off the ChipWhisperer, ChipSHOUTER, a custom FPGA back-end (implemented on an Arty A7-35T board) and some scrap protoboard glitch inserters.

The primary change is a change from paragraph-long command lines, to a console interface like this (with input-from-file support):

The new “capturebuddy” interface no longer requires significant rework every time I want to use a different oscilloscope or driver interface: I simply set up a “driver” (which controls target logic) and a “frontend” (representing capture hardware), and capturebuddy takes care of the rest.

A standalone “triggerbuddy” utility is also provided – this can be run to control the FPGA trigger interface from a console, or imported as a library to automate the same. The trigger mechanism is extraordinarily simple, but imo adaptable to most protocols. You set four configuration parameters (io, clk, ns and pulse width): the FPGA waits for io_count rising edges on the io line, then clk_count edges on the clock line, then ns_edges *rising* edges on an internal 400MHz clock, and generates a pulse of 0.5ns * pulsewidth out. No consideration is given to synchronous measurement (a later project), but given the structure of the Verilog, you can just extend the “master” ns_clock to implement this: CW1200 SAD-match functionality here we come!
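The counting scheme can be modelled in a few lines of Python (illustrative only – the real thing is a handful of counters in Verilog, and the parameter names here are mine):

```python
def rising_edges(samples):
    # indices where a 0/1 sample stream transitions low -> high
    return [i for i in range(1, len(samples)) if samples[i - 1] == 0 and samples[i] == 1]

def fire_index(io, clk, io_count, clk_count, ns_count):
    # wait for io_count rising edges on the io line...
    io_edges = rising_edges(io)
    if len(io_edges) < io_count:
        return None
    armed = io_edges[io_count - 1]
    # ...then clk_count rising edges on the clock line...
    clk_edges = [i for i in rising_edges(clk) if i > armed]
    if len(clk_edges) < clk_count:
        return None
    # ...then ns_count ticks of the fast internal counter, then pulse
    return clk_edges[clk_count - 1] + ns_count
```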

I also did some further investigation into why my previous smartcard triggering introduced such extreme jitter – I’m confident enough in my Verilog now that I believe this is due to the 1.8V smartcard transmission not being reliably detected by the 3.3V I/O ports on the FPGA board. It would detect in, say, 99% of cases but fail occasionally (and this effect would cascade horribly over thousands of edges).

My initial attempt at resolution was to use two 2N3904s to level shift up to 3.3V, but this creates “lag artifacts” during the level shift process and completely obliterates the clock line (input resistance slowing the transitions?):

(Image from https://electronics.stackexchange.com/questions/333229/1-8v-uart-with-3-3v-uart, replace UART with IO and CLK).

Based on overall project status, I felt that I would be better off writing my own smartcard reader/writer:

The code for the improved sparkgap framework is here, and will include the smartcard reader/writer once it’s somewhat stable.

UNSW

I’m once again lecturing at UNSW, though this time remotely. I’ll be teaching the COMP6443/6483 Web Application Security and Extended Web Application Security courses. This is an excellent opportunity to revamp the entirety of the course, now with an expanded support team – to clean up old content, integrate new tooling and techniques and reinvigorate the course.

As I’m doing this, I can’t help but draw parallels to this video:

While there have been new innovations in web security (mostly dealing with scale, not the “innovations” of crawling the darkweb with blockchain regex machine learning cyber cloud[1]), we’re still teaching students about how to exploit SQL injection as a “core building block”.

It’s my hope that instead of continuing down this path, the 2020 revamped web security courses will focus more on the underlying concepts of code injection (the mixing of code and data, instead of SQLi specifically) so that students can more intuitively grasp SQLi – and moving forward, we can eventually drop traditional SQLi and XSS as topics (again, focussing on underlying concepts so students can intuitively derive these attacks, no matter the defenses in play). From the students’ perspective, they’re paying good money to learn; we should provide something of value, not hand them a stack of pre-made attacks they can stack together like Lego.

My thanks to everyone who has contributed to this course so far, and to everyone who is working on the course this year.

[1]: This is an interesting talk around the “Security Products we Deserve”, and is well worth watching for some insight into this problem.


Writeup – Old Crypto Server (AeroCTF)

This weekend, I spent a little bit of time doing AeroCTF, but couldn’t maintain focus due to the sheer number of other active projects. I solved one challenge in the time allocated, a fairly straightforward AES padding game that you can download here.

Astoundingly, I have no writeups of any challenges like this, so here we go.

The vulnerability in the system is straightforward as can be: the server, using AES in ECB mode (i.e. an input block will always have the same output block), accepts some input and pads it into block-sized chunks. We’re able to manipulate this by controlling the plaintext to some degree (allowing us to insert an arbitrary block), which we compare to the last block:

From here, we can brute force the flag one byte at a time – adding one character to the prefix causes one more character of the flag (the second-last character) to be pushed into the final block. By brute forcing the additional prefix character, we can determine what the second-last character of the flag is:

Assuming a few key functions are available, we can automate the attack. The solver script I used for this challenge is here.
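The general shape of the attack – byte-at-a-time recovery against an ECB oracle – looks like this. This is a self-contained sketch: a toy hash-based block function stands in for the server’s AES, the secret is made up, and the real challenge compares blocks server-side rather than handing back full ciphertexts, but the recovery loop is the same idea:

```python
import hashlib

BLOCK = 16
SECRET = b"flag{ecb_is_bad}"  # stand-in for the real flag

def _toy_ecb_block(block):
    # deterministic stand-in for AES-ECB under a fixed key
    return hashlib.sha256(b"fixed-key" + block).digest()[:BLOCK]

def oracle(prefix):
    # modelled server behaviour: encrypt attacker prefix || secret, PKCS7-padded
    data = prefix + SECRET
    padlen = BLOCK - (len(data) % BLOCK)
    data += bytes([padlen]) * padlen
    return b"".join(_toy_ecb_block(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK))

def recover_secret():
    known = b""
    while True:
        # choose a filler so the next unknown byte lands at the end of a block
        filler = b"A" * ((BLOCK - 1 - len(known)) % BLOCK)
        nblocks = (len(filler) + len(known)) // BLOCK + 1
        target = oracle(filler)[: nblocks * BLOCK]
        for c in range(256):
            probe = oracle(filler + known + bytes([c]))[: nblocks * BLOCK]
            if probe == target:
                if c == 0x01:      # matched the first padding byte: done
                    return known   # (assumes 0x01 never appears in the secret)
                known += bytes([c])
                break
        else:
            return known           # no candidate matched - shouldn't happen
```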

Thanks to the AeroCTF team for putting these challenges together – while I didn’t manage to solve any more during the time allocated, the other challenge I began looking at (aerofloat) looked fairly well put together.

See you in UTCTF.


Fault Injection on Linux: Practical KERNELFAULT-Style Attacks

This post will detail my examination of the KERNELFAULT paper, which you can download here. In brief summary, the KERNELFAULT paper describes a methodology for injecting faults into a system running Linux, for the purpose of privilege escalation to root, as well as for controlling the program counter.

The code used in this attack can be downloaded here, within the “pi” folder.

I began my experiment by preparing a Raspberry Pi 4. I made a software reset puller (controlled via the ChipWhisperer’s GPIOs) and broke out the UART. I also removed the 1.2V decoupling capacitors on the other side of the PCB, opposite the Broadcom SoC. These are marked below, with purple and blue markings:

Conceptually, the path to root is clear:

  • We begin as a standard user
  • We set all our unused registers to 0
  • We execute a setresuid(0,0,0) system call
  • We induce a fault of some description within the setresuid call, causing the operating system to incorrectly assign us a privilege.
  • We check if we now have elevated privileges, and if so, return the user to a root shell.

Let us examine the setresuid call further, pasted below for reference in case it changes down the track:

/*
 * This function implements a generic ability to update ruid, euid,
 * and suid. This allows you to implement the 4.4 compatible seteuid().
 */
long __sys_setresuid(uid_t ruid, uid_t euid, uid_t suid)
{
        struct user_namespace *ns = current_user_ns();
        const struct cred *old;
        struct cred *new;
        int retval;
        kuid_t kruid, keuid, ksuid;

        kruid = make_kuid(ns, ruid);
        keuid = make_kuid(ns, euid);
        ksuid = make_kuid(ns, suid);

        if ((ruid != (uid_t) -1) && !uid_valid(kruid))
                return -EINVAL;

        if ((euid != (uid_t) -1) && !uid_valid(keuid))
                return -EINVAL;

        if ((suid != (uid_t) -1) && !uid_valid(ksuid))
                return -EINVAL;

        new = prepare_creds();
        if (!new)
                return -ENOMEM;

        old = current_cred();

        retval = -EPERM;
        if (!ns_capable_setid(old->user_ns, CAP_SETUID)) {
                if (ruid != (uid_t) -1 && !uid_eq(kruid, old->uid) &&
                    !uid_eq(kruid, old->euid) && !uid_eq(kruid, old->suid))
                        goto error;
                if (euid != (uid_t) -1 && !uid_eq(keuid, old->uid) &&
                    !uid_eq(keuid, old->euid) && !uid_eq(keuid, old->suid))
                        goto error;
                if (suid != (uid_t) -1 && !uid_eq(ksuid, old->uid) &&
                    !uid_eq(ksuid, old->euid) && !uid_eq(ksuid, old->suid))
                        goto error;
        }

        if (ruid != (uid_t) -1) {
                new->uid = kruid;
                if (!uid_eq(kruid, old->uid)) {
                        retval = set_user(new);
                        if (retval < 0)
                                goto error;
                }
        }
        if (euid != (uid_t) -1)
                new->euid = keuid;
        if (suid != (uid_t) -1)
                new->suid = ksuid;
        new->fsuid = new->euid;

        retval = security_task_fix_setuid(new, old, LSM_SETID_RES);
        if (retval < 0)
                goto error;

        return commit_creds(new);

error:
        abort_creds(new);
        return retval;
}

The fault must be injected precisely, such that the ns_capable_setid check passes. We are unable to control the fault to within an instruction’s execution time (we are – as far as I’m aware – unable to synchronise to the Broadcom SoC’s internal PLL), but fortunately, the instructions highlighted in blue are not required for our attack to succeed – some instruction corruption here is fine, as long as we do not end up with a kernel panic. And there are some weird, wallpaper-worthy kernel panics:

To set up our test harness, we create a small usermode program, available in the git repo, which uses a memory-mapped GPIO to trigger a ChipWhisperer delay, after which a glitch is inserted in the middle of the sys_setresuid call. Several additional checks are made to determine if we have “won”, as well as a sanity counter to determine when our glitches are too far forward: regular corruption of the counting loop indicates that the ext_offset of our glitches is too large.
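Minus the GPIO twiddling, the harness logic is roughly as follows – sketched here in Python/ctypes for clarity, although the real harness is a small C program (the GPIO calls are omitted as they are hardware-specific):

```python
import ctypes
import os

libc = ctypes.CDLL(None, use_errno=True)

def attempt():
    # raise the trigger GPIO here (hardware-specific, omitted); the glitch
    # is aimed somewhere inside the setresuid syscall below
    ret = libc.setresuid(0, 0, 0)
    # drop the trigger GPIO here
    if ret == 0 and os.getuid() == 0:
        # success: hand the user a shell that won't drop privileges
        os.execv("/bin/dash", ["/bin/dash"])
    return False  # as a normal user, setresuid(0,0,0) simply fails with EPERM
```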

We must be careful in leveraging our success as well – “/bin/bash” and “/bin/sh” both drop privileges upon execution, so we must use “/bin/dash”, which does not – this cost me a number of successful glitches.

A little while of glitching later, and we have success:

Along the way, a large number of crashes within the kernel were generated. I built a simple Python script to generate some simple statistics. Of note:

  • At a high level, 4982 glitch attempts were made across a few hours in one afternoon (on and off). Of these, 1393 were “crashes” of some description, and 4 attempts resulted in a successful privilege escalation, giving this instance of the attack a success rate of just under 0.1%.
    • Note that the average time to a successful glitch is 15 to 30 minutes, once a narrow range of successful glitching has been established.
  • The top three locations in the kernel for crashes to occur during glitching (giving clues to the severity and type of corruption) are:
    • cap_capable+0x2C with 106 instances
    • commit_creds+0x90 with 94 instances
    • cap_capable+0x14 with 58 instances
  • A good portion of the cap_capable glitches involved pointer dereferencing to near-zero addresses. Given that we are likely seeding the zero pointer, perhaps it is possible to *vastly* increase the likelihood of success by pointing cap_capable to pre-poisoned structures, established in userland (and semi-bypassing KASLR from crash reporting?)
  • Twice, the system suffered “unsafe recovery”: the system would appear to recover, but executables would unpredictably fail to run, as if some fundamental system ABI had been corrupted. For example:

  • A variant of this is that the executable would “mute”, and simply do nothing, needing to be rebuilt before working again.
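The statistics script mentioned above is nothing fancy – essentially a Counter over the fault PC values parsed from the captured crash output. The log line format below is a guess at the aarch64 oops format (“pc : symbol+offset/size”); adjust the regex to the actual logs:

```python
import re
from collections import Counter

# assumed aarch64 oops lines, roughly: "pc : cap_capable+0x2c/0x110"
PC_RE = re.compile(r"pc\s*:\s*([0-9A-Za-z_]+\+0x[0-9a-f]+)")

def crash_stats(log_lines):
    # tally crash locations (symbol+offset) across all captured oopses
    hits = (PC_RE.search(line) for line in log_lines)
    return Counter(m.group(1) for m in hits if m)
```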

Thank you to the Riscure team for their original public discussion of this type of attack – given the potential impact, I’m surprised this style of attack is not discussed further in the industry. This type of work is eminently enjoyable, and I look forward to investigating similar techniques further.


Late Writeup – SimpleMachine (Codegate 2020 Teaser)

This weekend, I participated in the Codegate 2020 Teaser CTF. I started late – in my defence, ctftime’s time reporting was inaccurate – and I was able to solve the SimpleMachine challenge, though far too late to get points. The writeup is presented below.

SimpleMachine

This challenge was presented as an executable and a “target” file, which you can download here. You can also download a copy of my working notes here.

The binary is a stripped Linux executable, but the challenge name indicates it’s some kind of virtual machine. Fortunately, the executable is not obfuscated, making reverse engineering relatively simple. Moving through the disassembly (i.e. following the call flow from main()), we can note the opcode processor at 0x17C0:

Looking at the function, we can make a few key observations:

  • The byte at $rdi + 0x30 holds the opcode.
  • The word at $rdi + 0x34 holds argument 1
  • The word at $rdi + 0x36 holds argument 2
  • The return value is held in $rdi + 0x3E. We don’t know where it goes for now.
  • 8 total operations exist – read, write, load, xor, multiply (without carry), add, compare, jump-if-zero (I initially mistakenly thought this was “exit”).

Next, we instrument the application with GDB to observe this in action. We can see it load the first opcode, a “read” opcode, corresponding to the first 8 bytes in the “target” file (the xxd options reverse the byte endianness, and group bytes into pairs):

Matching this to the disassembly, this is the “read” opcode, reading 0x24 bytes into address {base} + 0x4000. By following the read call itself, we find base is defined by the first qword pointer at $rdi[0] within the opcode processor function.

Further tracing of the application, correlating application behaviour with disassembly listing, sheds light on the instruction format (little endian):

AABB XXXX YYYY ZZZZ 
AA: Addressing mode (controls behaviour of XYZ)
BB: opcode
XXXX: where the result is stored.
YYYY: arg1
ZZZZ: arg2
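Decoding one instruction under that layout is a one-liner with struct. (Which of the two bytes in the first word is mode vs opcode, post endian-swap, is my assumption, and the example bytes are illustrative rather than taken from the real target file.)

```python
import struct

def decode(insn):
    # one 8-byte SimpleMachine instruction: AABB XXXX YYYY ZZZZ, little-endian words
    word0, dst, arg1, arg2 = struct.unpack("<4H", insn)
    return {"mode": word0 >> 8, "opcode": word0 & 0xFF,
            "dst": dst, "arg1": arg1, "arg2": arg2}

# e.g. a "read"-style opcode 6 targeting 0x4000 for 0x24 bytes
example = decode(struct.pack("<4H", 0x0006, 0x4000, 0x0024, 0x0000))
```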

Spending a bit more time debugging the executable, we can observe the following behavior:

  • Firstly, the program reads a flag (the 06 opcode)
  • The program compares the first part of the flag to CODEGATE2020 through static compares. If any bytes don’t match, the program is exited through the use of opcode 5, with an argument of 0x1a0 (which leads to opcode 8, exit).
  • The program then loads a number of constants into memory, for an unknown purpose. Let’s call this the “key material”.

At this point, we are at 0xf8 in the “target” file. The program continues:

  • We load the constant 0xdead into a virtual register
  • We load the constant 0x1 into a virtual register
  • We double 0xdead, and save the result in a register. This is at address 0x108 – this is important later.
  • We xor 0xdead with the doubled 0xdead, and save the result in a register. The result is 0x63f7.
  • I’m not sure what the next instruction is for. We multiply 2 by 0 and save the result?
  • We load 0xf974, a part of the key material loaded earlier, into 0x154. Note that 0x154 is part of the initial code, 0xFFFF in the “target” file – the program has now become self-modifying. Let’s press on.
  • We save the constant 0x400c, after “CODEGATE2020”, into 0x14c. This is also self-modifying. Note that 0x400c points to the next bytes of the flag after “CODEGATE2020”.

At this point, we can monitor the executable with gdb, using the “rwatch” command:

rwatch <address>
  • Three “nops” are executed, consisting of 0000 0000 0000 0000 instructions.
  • We xor the two bytes of the input flag with 0x63f7
  • We add the result to the first two bytes of the key material, 0xf974
  • We compare the above result to 0.

We can represent the above operation as follows:

(input_flag ^ 0x63f7) + 0xf974 = 0x10000
input_flag = (0x10000 - 0xf974) ^ 0x63f7

We can quickly check this in Python:
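Recomputing the check (interpreting the two recovered bytes as the next flag characters after “CODEGATE2020” is my reading):

```python
# recompute the candidate flag bytes from the constants above
val = (0x10000 - 0xf974) ^ 0x63f7
print(hex(val))                   # 0x657b
print(val.to_bytes(2, "little"))  # b'{e'
```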

Great, this looks pretty sane. Let’s move forward:

  • If the last result was not zero (i.e. if the flag was incorrect), jump to 0x1a0
  • The next three operations appear to be a loop counter, checking if we’ve hit 0xc iterations, and if not, jumping to 0x108.

At this point, instead of doubling 0xdead, it doubled the 0x63f7. Going through the loop a few times, we note the following repeating cycle occurring:

From here, it is a simple matter to derive the key manually, and confirm it via the simple_machine executable:

My solving methodology for this challenge was particularly haphazard, with a half-hearted attempt at writing an emulator and many, many failures at tracing the program, mostly due to my own fuckups in patching virtual compare operations. A shortcut I learned was that gdb can skip X iterations of a breakpoint, with the following command:

ignore 1 15
(ignores breakpoint 1 the next 15 times)

I disagree with the assessment that it is “ezpz”, but in another light, it is a humbling reminder of how much more I have to learn, and what a challenge it is to keep my skills up to date, while also expanding into other areas.

Thank you to the Codegate 2020 CTF organisers for putting together these challenges. See you in Aero CTF.
