Overview

This was my first participation to Google CTF. I played as a member of Blue Water and we finished in 5th position. This was one of the seven challenges from the “misc” category.

Screenshot of the challenge information

The attachment is a Python script that runs on the remote server.

 .----------------------------------------------------------------.
| .--------------------------------------------------------------. |
| |  ____  ____   _____  _____    _______   _____   ____    ____ | |
| | |_   ||   _| |_   _||_   _|  /  ___  | |_   _| |_   \  /   _|| |
| |   | |__| |     | | /\ | |   |  (__ \_|   | |     |   \/   |  | |
| |   |  __  |     | |/  \| |    '.___`-.    | |     | |\  /| |  | |
| |  _| |  | |_    |   /\   |   |`\____) |  _| |_   _| |_\/_| |_ | |
| | |____||____|   |__/  \__|   |_______.' |_____| |_____||_____|| |
| |                                                              | |
| '--------------------------------------------------------------' |
 '----------------------------------------------------------------'


Welcome to G.U.G.L. hardware design and testing software.
We have nearly finished our newest CPU, but we are still
missing a critical component.

We need you to make us a full adder out of NAND gates:
Inputs: A, B, C
Outputs: S, Cout
1. Reset
2. Show circuit
3. Add new gate
4. Send to factory and test
5. Quit

The context of the challenge is quite original; we have to design a logic circuit, namely a full adder out of NAND gates (250 max). Our design is then checked against a few test cases and the component is used inside a CPU for its adding capabilities.

Finally, the CPU runs a few lines of assembly code provided within the attachment. The goal of this challenge is to include some kind of backdoor in our circuit in order to modify the CPU behaviour so that it leaks a flag from memory when it runs the assembly code.

Full adder and 8-bit adder

The first step is to design a working, non-backdoored, full adder.

A full adder is a simple combinational circuit that adds two input bits (A and B) as well as a carry-in bit (C or Cin) and outputs a sum bit (S) and a carry-out bit (Cout). This is the truth table of a full adder:

The full adder truth table

There are many existing designs, here is an example schematic of a full adder implemented using two XOR gates, two AND gates and one OR gate.

Recall that the only logic gate that we are allowed to use for this challenge is the NAND gate. Fortunately, the NAND gate is universal, meaning that any other logic gate (AND, XOR, NOR, etc.) can be replaced by a few NAND gates. Therefore, we will not bother trying to design a circuit made of only NAND gates and instead we will just automatically convert our circuit by replacing each gate with its NAND gates version. If we ever hit the 250 maximum gates limit, we can try to optimize further.

Now that the full adder is complete, we can “send it to the factory” to confirm that it passes the unit tests.

1. Send to factory and test
Circuit compiled correctly, sending to factory...
Circuit passed exhaustive testing, integrating with the CPU...

The “integrating with the CPU” step replicates our full adder circuit eight times and connects the eight adders together in a cascade (linking Cout to Cin of the next adder). This creates an 8-bit adder with 2x8=16 input bits and 8 (+1 carry) output bits. As the name indicates, the 8-bit adder is used to get the sum of two 8-bit inputs.

Here is an example of a 4-bit full adder built this way, used to calculate the sum 0b0100+0b0111=0b1011

A 4-bit full adder

CPU and assembly program

The simulated CPU in this challenge is very simple, it possesses 8 registers (r0 to r7), a program counter (pc) that points to the current instruction and 256 bytes of memory. The memory is initialized with 0’s and the flag is stored at mem[128:128+len(flag)]. Then the CPU runs a very simple program written in a pseudo-assembly language:

ldi r0, 0
ldi r1, 0
ldi r2, 60
ldi r3, 10
ldi r5, 1
read r4
store r1, r4
jeq r4, r3, 11
add r0, r5
add r1, r5
jl r0, r2, 5
ldi r1, 0
ldi r3, 3
load r4, r1
add r4, r3
ldi r2, 64
add r2, r1
store r2, r4
add r1, r5
jl r1, r0, 13
ldi r1, 0
ldi r2, 64
add r2, r1
load r2, r2
out r2
add r1, r5
jl r1, r0, 21

The instructions act on single bytes and behave as you would expect. The add instruction specifically uses the 8-bit adder designed in the previous section to calculate the result of 8-bit arithmetic addition.

Reversing the program is not too hard and we have the following pseudocode:

// Read user input to mem[0:]
char length = 0;
for (char i = 0; i < 60; i += 1) {
  mem[i] = input(); // one byte input
  if (mem[i] == '\n') {
    break;
  }
  length += 1;
}
// "Encrypt" each input byte (by adding +3) to mem[64:]
for (char i = 0; i < length; i += 1) {
  mem[64 + i] = mem[i] + 3;
}
// Print the encrypted result at mem[64:]
for (char i = 0; i < length; i += 1) {
  print(mem[64 + i]);
}

Notice that the program only accesses memory from address 0 to address 124. This means it never reads nor writes memory where the flag is stored (at address 128+).

The only control we have over the execution of the program is the input in the first loop (max 60 bytes) and the behaviour of the add instruction.

Backdooring the full adder

Now, we would like to modify the full adder design so that the program outputs the flag, or leak it through a side channel. We are faced with the following issues:

  • we don’t have complete control over the 8-bit adder design; we only design the adder that is duplicated to create the 8-bit adder, this makes it more difficult to implement a backdoor that does not break every other add instruction,
  • the full adder will not be accepted if it fails unit tests; therefore it must have a correct behaviour during those tests.

The first question is: what should the backdoor do? If we were free to trigger it at any time to yield any result for any add instruction, how would we make the program leak the flag from its memory?

The simplest answer I came up with is to make the 64+i addition in the printing loop yield the result of 128+i instead, while yielding a valid result for all other add instruction (such as the loop counter i+=1). Then the last loop will look like this and directly print the flag from memory:

// Print the flag at mem[128:]
for (char i = 0; i < length; i += 1) {
  print(mem[128 + i]);
}

Conveniently, this can be achieved by modifying the full adder corresponding to the addition of the 2nd Most Significant Bit so that it outputs S=0, Cout=1 when running the 64+i addition. The output of the 8-bit addition will have the MSB set to 1 and the 2nd MSB set to 0 (because i is always less than 64) and the 8-bit adder will yield the result of 128+i instead of 64+i.

Designing the backdoor trigger

The second question is: when should the full adder enable the backdoor? In other words, how does the full adder detects that it must output a S=0, Cout=1 result instead of the normal result? This is a bit tricky since every full adder will contain the backdoor, but we only want it to be triggered for the 2nd MSB. Plus, notice that the second loop in the program also computes 64+i and we definitely don’t want the backdoor to trigger there because it would write over the flag in memory.

Therefore, the full adder logic circuit must enable the backdoor only if it detects that the following two conditions are met:

  1. The current operation is 64+i (and not i+=1).
  2. The current operation is occuring in the third loop of the program and not before that.
  3. The current 1-bit adder is used to compute the 2nd MSB in the 8-bit adder.

Condition 1. is the easiest one to detect, assuming we only focus on the 2nd MSB full adder. Since 64=0b01000000 and i<64, and since the A input bit is the 2nd MSB of the first operand, A will equal 0 during the i+=1 operation and 1 during the 64+i operation. Thus, the condition “A is set to 1” is enough to distinguish between both operations.

Condition 2. is at another level though, because it implies that the full adder has some kind of information about the elapsed time or knows about previously executed instructions. As we don’t have access to any kind of clock, we can only rely on detecting previous instructions.

A way to retain information in a logic circuit is to use a latch, such as the SR NAND latch. As long as both inputs of the latch are 1, it will “remember” which of its two inputs was last set to 0.

Combining two latches together in a so-called “Master-Slave” configuration, we can create a Master-Slave SR Latch. This kind of circuit will effectively halve the number of “flips” it receives. In other words, the output of this circuit will flip once each time the input flips twice.

Finally, putting multiple Master-Slave SR latches in serie allows us to halve the flipping rate multiple times and effectively create a binary counter of how many times the input has flipped!

By connecting the input of this binary counter to the A input of the full adder, we are now able to detect how many times the A input bit has flipped. Recall that the A input bit of the 2nd MSB is 1 during the 64+i operation and 0 during the i+=1 operation, thus the A input bit flips twice during each turn of the 2nd and 3rd program for loops. The condition “A has flipped at least 120 times” (we can round to 128 for convenience) is roughly equivalent to condition 2.

At this point, condition 3. is pretty easy to detect assuming the previous two. By sending enough data during the first loop, the variable i will reach ~60. Then the 64+i operation in the second loop will set the input bit B of all full adders to 1 at least once, except for the 2 MSBs. Thus, the condition “input bit B has never been set” is only true for the 2 MSBs full adders. And since the 1st MSB is never used, it will never satisfy condition 2 anyway. The condition “input bit B has never been set” can easily be detected using a single SR NAND latch connected to B.

Wrapping up

By linking the 3 conditions sub-circuits together with AND gates, the full adder backdoor that yields S=0, Cout=1 will now only trigger during the print loop at the end of the program, and only for the 2nd MSB, effectively turning print(mem[64 + i]) to print(mem[128 + i]).

In the end, without any intentional optimization, the full adder circuit uses a total of 127 gates out of the 250 limit.

Running the solve script makes the CPU output the flag as expected: CTF{H4rdwar3_acc3ler4ted_backd00rs_are_7he_w0rst}.

[+] Opening connection to hwsim.2024.ctfcompetition.com on port 1337: Done

Circuit compiled correctly, sending to factory...
Running test 0-0...
[...]
Running test 1-127...
Circuit passed exhaustive testing, integrating with the CPU...
Deploying military-grade encryption program running on the synthesized CPU...
CPU is awaiting input character...
[...]
CPU outputs: d
CPU outputs: d
CPU outputs: d
CPU outputs: d
CPU outputs: H
CPU outputs: 4
CPU outputs: r
CPU outputs: d
CPU outputs: w
CPU outputs: a
CPU outputs: r
CPU outputs: 3
[...]
CPU outputs: w
CPU outputs: 0
CPU outputs: r
CPU outputs: s
CPU outputs: t
CPU outputs: }
CPU outputs: \x00

I knew learning Minecraft redstone 10 years ago would prove useful some day…