A8: Buffer Overflow

Instructions: Remember, all assignments in CS 3410 are individual. You must submit work that is 100% your own. Remember to ask for help from the CS 3410 staff in office hours or on Ed! If you discuss the assignment with anyone else, be careful not to share your actual work, and include an acknowledgment of the discussion in a collaboration.txt file along with your submission.

The assignment is due via Gradescope at 11:59pm on the due date indicated on the schedule.

Submission Requirements
Restrictions
Getting Started
Overview
Background: Buffer Overflow Vulnerabilities
Your Task
- 0wning browser
(Lab) Part 0: Overflow
Part 1: Overwrite
- xxd
- Pipelines and Redirections
Part 2: Execute the Payload
- asbin
Epilogue
FAQ
Submission
Rubric

Submission Requirements

You will submit your completed solution to this assignment on Gradescope. You must submit:

exploit, a raw binary file containing your specially-crafted input. We’ll try it out on our own copy of browser (in an environment that uses your NetID) to see if it successfully breaks it. A successful exploit causes browser to output exactly the following:
```
Launching for user <NETID>
Where to connect?
LOL 0wn3d! <NETID> is on imgur.com!
```
where <NETID> is replaced with your NetID. browser must also exit gracefully (i.e., exits with a return code of 0).
README.md, a text document that explains your exploit file. It should contain the following:
- A text listing from xxd of the bytes in your exploit file, annotated with comments to explain what your exploit is doing (or trying to do).
- An explanation of how your exploit tries to subvert the program’s check that the input string matches the expected string, and why this works.
- An explanation of how your exploit is able to take control of the program and what steps the exploit takes to force the program to print out the desired string.
- If you wrote a script to make your exploit for you, you may include it in your README.md. However, by itself, a script does not constitute an explanation. We still expect a prose description of how your exploit works.

Restrictions

You cannot use system calls anywhere in your exploit. You may only use the standard library functions already linked to the browser executable.
Don’t modify the executable in any way. We will be evaluating your exploit using our own copy of browser.
Your exploit must work when piped in as input to the browser executable. Specifically, we will run the following command within the CS3410 container, launched with the docker command that is aliased by rv (as discussed previously in the course infrastructure setup and in Lab 4):
```
cat exploit | qemu browser
```

Getting Started

To get started, obtain the release code by cloning your assignment repository from GitHub:


$ git clone git@github.coecis.cornell.edu:cs3410-2025sp-student/<NETID>_bufferof.git

Replace <NETID> with your NetID. All the letters in your NetID should be in lowercase.

Overview

In this assignment you will get a chance to apply your knowledge of RISC-V assembly, calling conventions, and the layout of memory in order to exploit a buffer overflow vulnerability in a program we provide you. To accomplish this feat, you’ll analyze a pre-compiled binary using disassembly and debugging tools and write an exploit which assumes control of the target program. We also hope to raise your awareness of real-world computer security issues.

There are conceptually three parts to this assignment:

Part 0 (Lab). Starting in the lab, you’ll begin by understanding how the program we provide you works with the aim of identifying where the buffer overflow vulnerability is and how you can exploit it. The goal of this part is to use the buffer overflow to cause the program to crash.
Part 1. The second step is to modify the return address stored on the stack in order to cause the program to jump execution to a location in memory of your choosing, thereby giving you (the attacker) control over the execution of the program.
Part 2. In the final part of this assignment, you’ll update your buffer overflow exploit to execute a set of RISC-V instructions of your choosing.

Use your skills for good, not evil.

These kinds of friendly hacking challenges have a long history, and hacking skills are priceless, as they reflect a deep understanding of the operation of a computer system. But you must be responsible and use your skills wisely. Taking over machines or hacking the Internet carries stiff penalties, is a sure-fire way to get expelled from Cornell, interferes with other people’s lives, and is a waste of your talent. It is also plain wrong.

Background: Buffer Overflow Vulnerabilities

Before getting your hands dirty, let’s start by understanding what a buffer overflow vulnerability is.

First, what do we mean by “vulnerability”? In the context of computer security, a vulnerability is a flaw in a computer, system, or program that compromises its security. Vulnerabilities can be caused by a design flaw or an implementation bug. Malicious attackers then can exploit these vulnerabilities to steal or damage the hardware, software, or data of a system, as well as disrupt any services the system provides. If you’re curious, most of the exploitable vulnerabilities that have been discovered are documented in the Common Vulnerabilities and Exposures (CVE) database.

A buffer overflow is perhaps the most well-known form of software vulnerability. Despite this, buffer overflow exploits are still quite common today. A buffer overflow is the result of a program trying to put more data into a buffer than the buffer can hold. We’ve actually seen buffer overflows earlier this semester but under a different name: an out-of-bounds memory access.

For example, consider the following C program which contains a buffer overflow.


void foo(char* str, int n) {
  char buffer[8];
  for (int j = 0; j < n; j++) {
    buffer[j] = str[j];
  }
}

int main() {
  char long_str[128];
  for (int i = 0; i < 128; i++) {
    long_str[i] = 'A';
  }
  foo(long_str, 128);
  return 0;
}

This program initializes a string long_str consisting of 128 'A's. The foo() function then copies long_str into the local variable buffer. A buffer overflow occurs when the for loop in foo() starts to overwrite memory beyond the end of buffer (i.e., when j >= 8) as buffer is only 8 bytes large but long_str is 128 bytes large.

What will happen when we run this program? Because accessing out-of-bounds memory is a form of undefined behavior in C, we don’t know for sure! But let’s take a closer look at what happens when the program tries to write to &buffer[8] through &buffer[127].

Diagram of the call stack of `foo`

Depicted above is the layout of foo’s stack (call) frame. From top to bottom, the stack stores the return address (ra), the frame pointer (s0/fp), and then the two local variables buffer and j. Since buffer was only given 8 bytes on the stack, the 120 bytes after buffer would be overwritten with the ASCII character 'A' (0x41) once the for loop finishes. This includes the frame pointer and the return address! This means that after the for loop the return address is now 0x4141414141414141. When foo() returns, the next instruction is read from 0x4141414141414141, possibly causing a segmentation fault.
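To make this concrete, here is a minimal Python sketch that simulates the frame as a flat byte array. The layout (buffer at offset 0, saved fp at offset 8, saved ra at offset 16) and the initial saved values are assumptions chosen for illustration; a real compiler may add padding or place j differently.

```python
import struct

# Hypothetical frame layout, lowest address first:
#   offset 0..7   buffer
#   offset 8..15  saved frame pointer (fp)
#   offset 16..23 saved return address (ra)
#   offset 24..   the caller's stack
FP_OFF, RA_OFF = 8, 16
frame = bytearray(24 + 128)

# Made-up "legitimate" values saved on entry to foo().
struct.pack_into("<Q", frame, FP_OFF, 0x0000004000800B50)
struct.pack_into("<Q", frame, RA_OFF, 0x0000000000010A3C)

def foo(str_bytes, n):
    # Mirrors the C loop: no bounds check against the 8-byte buffer.
    for j in range(n):
        frame[j] = str_bytes[j]

foo(b"A" * 128, 128)

saved_fp = struct.unpack_from("<Q", frame, FP_OFF)[0]
saved_ra = struct.unpack_from("<Q", frame, RA_OFF)[0]
print(hex(saved_fp), hex(saved_ra))
```

Under these assumptions, both saved values end up consisting entirely of 0x41 bytes, which matches the corrupted return address described above.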

To summarize, a buffer overflow allowed us to overwrite the return address of the call to foo(). This caused the program to jump to 0x4141414141414141 instead of back to main(), likely crashing the program. While this certainly is interesting, it may not yet be clear why buffer overflows are so dangerous.

Suppose that an attacker knew there was a buffer overflow in the program and also knew where the return address was stored on the stack. They could then set the return address to an address of their choosing, causing whatever instructions are located there to be blindly executed by the program! This effectively allows the attacker to execute arbitrary code without actually modifying the program itself. Yikes!

Your Task

In this assignment, you will “0wn” a binary program called browser that we will provide to you. We provide one file, the compiled program. Although we do not provide any source code files, a version of the program’s source can be viewed (or extracted) from the compiled binary. All that you know about this program is what is documented here, and what you can figure out for yourself by running or examining the binary.

The browser program is a simplified “web browser”. When executed, it prompts you for a URL, and then prints a short message (the ~# shown here is the shell prompt within the CS3410 container):


$ rv
~# qemu browser
Launching for user hw342
Where to connect? www.cs.cornell.edu
hw342 is connected to www.cs.cornell.edu!

You can also send input to browser from another program using the Unix pipe operator (|):


~# echo "www.cs.cornell.edu" | qemu browser
Launching for user hw342
Where to connect?
hw342 is connected to www.cs.cornell.edu!

However, this browser only lets you connect to www.cs.cornell.edu. All other URLs will be rejected — try it and see!

As a proud hacker and social media enthusiast, you demand the right to connect to other, less productive sites, and the fact that browser was designed to only allow access to cs.cornell.edu is unacceptable. Rumor has it that browser suffers from a buffer overflow vulnerability. Since the program only takes one input, it’s not difficult to guess where the problem might lie.

Stay within the container!

The browser program has been compiled for RISC-V, which means that you need to run it using the QEMU emulator. Further, because it makes some use of interactivity and TTY escape sequences, there are uses of it—in particular, the exploits you will most likely write—that must be done at an existing shell prompt within the CS3410 container. While you can run it using the rv alias (i.e., rv qemu browser), attempts to redirect input using rv will most likely fail. There are ways to do it, but not elegant ones.

Instead, we recommend running all your commands using an interactive shell inside of a CS 3410 infrastructure container. The easiest way to open one is to use the rv-debug alias.

0wning `browser`

Your job is to craft some input to browser that will cause it to print out a different message, specifically: “LOL 0wn3d! <NETID> is on imgur.com!” (substitute your own NetID for <NETID>). This input should be saved in a file named exploit. When it is fed as input to browser, the compromised behavior of the browser program should be the result:


~# cat exploit | qemu browser
Launching for user hw342
Where to connect?
LOL 0wn3d! hw342 is on imgur.com!

The fact that the normal “Only www.cs.cornell.edu is allowed” message is missing constitutes proof that you have completely subverted the browser, and have gotten it to do something that it could not do before. You are not allowed to modify or replace the browser program on disk. The only way you get to interact with browser is to feed it some carefully crafted input.

While not trivial to do, attacking browser is easier than you might think. The trick is to encode within your input a sequence of bytes that are structured in a way that injects new code into the browser program as it is running. To craft an input that will be read and subsequently give you control of the program, you need to find where the vulnerable input buffer is stored, how big it is, what the values are that lie near it in memory, and what precise instruction sequence is vulnerable to a buffer overflow attack.

Stack Randomization

In a feeble effort to thwart just such attacks, the QEMU VM, like many real machines, implements stack randomization, a limited kind of program layout randomization. When the simulator starts, it initializes the stack to a variable address, rather than the standard 0x7FFFFFFC. Likewise, Linux also implements ASLR. If you craft your exploit properly, neither of those things will matter.

The browser executable also performs a weak form of stack protection. The actual starting location of the stack is derived from the NETID environment variable, which should be equal to your Cornell NetID. We will test your submitted exploit in an environment where $NETID is your NetID. Hopefully, you already set that when you made your rv and rv-debug aliases/PowerShell functions! You can check this by running rv env and seeing whether NETID is indeed set to your Cornell NetID.

(Lab) Part 0: Overflow

View the lab slides here.

Before starting the lab, make sure you’ve got your copy of the browser binary by following the instructions in Getting Started.

The goal of this lab is to familiarize yourself with the browser binary, locate where the buffer overflow is, and finally craft an input which causes browser to crash with a segmentation fault. Once you’ve crashed browser, you can move on to Part 1.

As stated previously, you are only given the browser binary for this assignment. We encourage you to start exploring how browser works by testing it on example inputs. Recall that by default, browser will only allow connections to www.cs.cornell.edu and will refuse to connect to any other website.

It likely won’t be long until you’ve learned all that you can about browser simply by running it. You’ll need to use standard developer tools to help you learn more about how browser works. Luckily, browser has been built with source-level symbol and debugging information!

Getting browser’s Source Code

Whoever made browser appears to have embedded the source code within the executable! You can definitely use that to your advantage.

To extract the source code, you’ll need to run browser through GDB by following the instructions below. Once you’re inside a GDB prompt, run (gdb) printf "%s\n", src. This will open an application called a “pager” which allows you to page through the source code. You’re also free to copy the source code into a separate file (e.g., browser.c). If you do so, make sure that your text editor doesn’t automatically format the code! Otherwise, the line numbers that GDB reports won’t match up with line numbers in the source code that you extracted. VSCode in particular is notorious for this.

Once you’ve found the source of the buffer overflow, you need to use it to crash the program. One easy way to crash the program is to overwrite the return address with an address in a restricted area of memory. Note that you don’t need to know exactly where the return address is in order to get browser to crash. We recommend that you run browser through GDB so you can see where it crashes.

`objdump`

objdump (“object dump”) is a tool to display information about object files (i.e., machine code). You can use it to give you a listing of the assembly code for browser (among many other uses). For example, to see the assembly code of browser and all the libraries it uses, run:


$ rv objdump -xdl browser

To save the output of this command to a file, you can redirect the output using the > shell operator:


$ rv objdump -xdl browser > browser.s

This becomes very helpful as it includes the disassembly of the standard library, which has functions you’ll need to call later in Part 2.

Example: In the assembly of browser, you’ll find many labeled blocks which correspond to included standard library functions. For example, the following block ultimately calls the strlen() function from string.h:


00000000000109d0 <strlen@plt>:
   109d0:	00002e17          	auipc	t3,0x2
   109d4:	670e3e03          	ld	t3,1648(t3) # 13040 <strlen@GLIBC_2.27>
   109d8:	000e0367          	jalr	t1,t3
   109dc:	00000013          	nop

This block calls the strlen() function by first loading the address of the strlen() function into the t3 register. Elsewhere in the assembly you will be able to find multiple occurrences of the instruction jal 109d0 <strlen@plt>. This tells us that to call the strlen() function we need to first load our arguments into the appropriate registers according to our calling conventions and then we need to jump to 0x109d0.
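As a worked check of the address arithmetic in this block, the short Python sketch below recomputes the memory location that the ld instruction reads from. It uses only the numbers shown in the listing above; the variable names are ours.

```python
# Values taken from the disassembly listing above.
pc_of_auipc = 0x109D0   # address of the auipc instruction
auipc_imm = 0x2         # upper immediate in `auipc t3,0x2`
ld_offset = 1648        # offset in `ld t3,1648(t3)`

# auipc adds the immediate, shifted left by 12 bits, to the current pc.
t3_after_auipc = pc_of_auipc + (auipc_imm << 12)

# ld then loads a doubleword from t3 + offset; this is where the
# real address of strlen() is stored.
slot_address = t3_after_auipc + ld_offset

print(hex(t3_after_auipc), hex(slot_address))
```

The result agrees with the `# 13040` annotation that objdump prints next to the ld instruction.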

All functions which aren’t prefixed by underscores follow calling conventions. If you want to see information about function calls to the standard library (e.g., printf(), exit()) that you see in the object dumps, refer to Section 3 of the Linux man pages.

GDB

Another powerful tool that you can (and should!) use in this assignment is GDB.

Recall from Lab 4 that you can start an interactive GDB session for the program execution by opening two terminal windows within the same CS3410 container image. In one of them, invoke our CS3410 “debugging” container


$ rv-debug

and in the other, open a shell in this same container:


$ docker exec -it `docker ps -f name=testing -q` bash

Next, start the browser executable for remote GDB in one of the terminal windows:


root@dd70ff2495b5:~# qemu -g 1234 browser

Finally, open gdb using the following commands in the other terminal window:


root@dd70ff2495b5:~# gdb  -ex 'target remote localhost:1234' -ex 'set sysroot /opt/riscv/sysroot'  browser

You can now debug the program remotely. From Lab 4, you are already familiar with common GDB commands for investigating details of source code-level symbols. In addition to those, you may find some of the following lower-level commands helpful:

To see the assembly of a single procedure in GDB, you can use disassemble <procedure name>:
```
(gdb) disassemble main
Dump of assembler code for function main:
  <output omitted>
```
Just typing disassemble without specifying a procedure name will give you the assembly for the program’s entry point, _start.
To step to the next assembly instruction, you can use stepi (or its abbreviation, si):
```
(gdb) stepi
```
To set a breakpoint at a memory address addr, prefix the address with a *:
```
(gdb) break *addr
```
For example, if we wanted to set a breakpoint at the address 0x123456 we would use the following:
```
(gdb) break *0x123456
```
To examine the stack, we can use the sp register. This will give you the first 4 words of the memory, starting from sp.
```
(gdb) x/4x $sp
```
This will give you the first 10 words starting from sp - 20:
```
(gdb) x/10x $sp-20
```
To list all the registers, you can use:
```
(gdb) info r
```
To print the contents of a specific register, use the following (where <#> is the register number):
```
(gdb) print $<#>
```

You may find the GDB lab (Lab 4) useful as a refresher. For more information on the GDB x command, refer to: https://sourceware.org/gdb/onlinedocs/gdb/Memory.html.
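When reading the output of x/4x, keep in mind that addresses on this platform are 8 bytes wide, so one saved address spans two of the 4-byte words that GDB prints. The Python sketch below shows how adjacent little-endian words combine into 64-bit values. The four word values are made up for illustration and are not taken from browser.

```python
import struct

# Hypothetical output of `x/4x $sp`, assuming GDB prints 4-byte words.
words = [0x00010A3C, 0x00000000, 0x41414141, 0x41414141]

# Rebuild the raw bytes as they sit in (little-endian) memory.
raw = struct.pack("<4I", *words)

# Reinterpret the same 16 bytes as two 64-bit values.
first, second = struct.unpack("<2Q", raw)

print(hex(first), hex(second))
```

The word at the lower address supplies the low 32 bits of each 64-bit value. In GDB itself, x/2gx $sp displays the same memory as 8-byte "giant" words.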

Part 1: Overwrite

You should now have an input which exploits the buffer overflow in browser to cause it to crash by overwriting the return address with some garbage value. Next, you’ll refine this input by locating exactly where the return address is stored on the stack. Once you know where the return address is stored, you’ll be able to change it to whatever value you wish. We suggest that you try to change the return address to 0x0000000000000000.

Write a Script!

We strongly recommend writing a script that will build the exploit string for you. You’ll rapidly go through different versions of your exploit as you test it. Having a script that constructs your exploit will likely save you a lot of time, as well as help you document how your exploit works.

Be careful to ensure that any string you build consists of raw byte values in the places where they are needed, not just ASCII characters. For example, in Python, all ordinary string concatenation operations will produce ASCII characters, but you can use other means (e.g., the pack method in Python’s struct module) to convert non-byte values to bytes.

You may use any language you wish to write your script in, if you choose to write one at all. You do not need to submit it, although you can certainly reference it in your exploit writeup.
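As a starting point, here is a minimal sketch of such a script in Python. The padding length and the filler byte are placeholders, not values from browser; finding the real distance to the return address is your job. The only firm ingredients are the 8-byte little-endian encoding of the address via struct.pack and the suggested target of 0x0000000000000000.

```python
import struct

# PLACEHOLDER: number of filler bytes before the saved return address.
# This value is hypothetical; determine the real one with GDB.
PADDING_LEN = 24

# Target address suggested in Part 1.
TARGET_ADDR = 0x0000000000000000

def build_exploit(padding_len, target_addr):
    """Return filler bytes followed by an 8-byte little-endian address."""
    padding = b"A" * padding_len
    return_address = struct.pack("<Q", target_addr)
    return padding + return_address

payload = build_exploit(PADDING_LEN, TARGET_ADDR)
print(len(payload), payload.hex())
```

To produce the file itself, write the bytes in binary mode, for example with open("exploit", "wb"). Opening the file in text mode or building the payload from str objects risks unwanted encoding or newline translation.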

Below are some additional tools and tricks that can help you during this part of the attack.

`xxd`

xxd is a tool for converting back and forth between raw binary files and text (ASCII) representations of the binary data.

Using xxd in “plain mode”, you can convert ASCII text (interpreted as raw binary) into ASCII hexadecimal digits. For example,


$ echo "CS 3410" | xxd -p
435320333431300a

xxd can also go in reverse. For example, if you create a file exploit.txt (using a regular text editor) specifying twenty-eight consecutive “bytes” in hex:


68 77 33 34 32 20
00 00 00 00 00 00 00 00 00
00
01 02 03 04
aa bb cc dd
11 22 33 44

You can convert these bytes, which are currently written as hexadecimal characters in ASCII, into raw binary using xxd in “reverse plain” mode:


$ xxd -r -p exploit.txt > exploit
$ ls -l exploit*
-rw-r--r-- 1 hw342 hw342 28 2025-02-25 12:06 exploit
-rw-r--r-- 1 hw342 hw342 84 2025-02-25 12:06 exploit.txt

You can see that the text version is 84 bytes (two hex digits of text per “byte”, plus spaces and newlines), while the raw binary produced in “reverse” mode is exactly 28 bytes. Mistakes in the text file can silently change the output (stray spaces at the ends of lines, for example), so you may want to convert the raw file back to text and compare it to your desired bytes to make sure nothing went wrong:


$ xxd exploit
0000000: 6877 3334 3220 0000 0000 0000 0000 0000  hw342 ..........
0000010: 0102 0304 aabb ccdd 1122 3344            ........."3D

You can learn more about xxd by reading its manpage.

xxd only converts between ASCII hexadecimal digits and binary data

When in “reverse plain” mode (xxd -r -p), xxd will only convert the ASCII hexadecimal digits in its input to raw binary. All other characters will be skipped and won’t appear in the output.
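If you would rather do this conversion inside a script, the following Python sketch approximates the behavior described above: it keeps only the hexadecimal digits and turns each pair into one byte. It is an illustration of the idea, not a reimplementation of xxd, and the function name is ours.

```python
import string

def reverse_plain(text):
    """Keep only hex digits in text and convert each pair to one byte."""
    digits = "".join(ch for ch in text if ch in string.hexdigits)
    # bytes.fromhex raises ValueError if the digit count is odd.
    return bytes.fromhex(digits)

# The same 28 bytes as the exploit.txt example above.
example = """68 77 33 34 32 20
00 00 00 00 00 00 00 00 00
00
01 02 03 04
aa bb cc dd
11 22 33 44
"""

data = reverse_plain(example)
print(len(data), data[:6])
```

Going the other way, b"CS 3410\n".hex() gives the same digits as the xxd -p example earlier in this section.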

Pipelines and Redirections

The shell command line pipe operator (|) allows you to connect the output of one program (e.g., cat or xxd) to the input of another. For example, you can call


~# cat exploit | qemu browser

to pass the contents of the exploit file as the input to qemu browser.

You can also redirect the output of one program to a file, overwriting it if it previously existed, using the redirect operator (>). For example, the following command writes the string Hi! to a file hi.txt:


~# echo "Hi!" > hi.txt

The append operator (>>) does nearly the same thing, except that it doesn’t overwrite the file and instead appends its input to the end of the given file.

You can also chain multiple shell operators together to form a pipeline. For example, the following uses cat to feed the contents of exploit.txt as input to xxd -r -p, sends the resulting raw binary to the browser binary, and finishes by writing the output of browser to output.txt, all with one command:


~# cat exploit.txt | xxd -r -p | qemu browser > output.txt

Part 2: Execute the Payload

You’re nearly there! Now you’ll finish your exploit by injecting into the browser program some RISC-V assembly code to cause it to print “LOL 0wn3d! <netid> is on imgur.com!” and then exit gracefully with a return code of 0.

Early forms of buffer overflow attacks were made easier by the ability to modify an executable’s instructions directly. By default, Linux makes a program’s text and data sections read only. So, any attempt to modify the program’s instructions while it is running will cause a segmentation fault. You’ll need to find another way.

`asbin`

The asbin script inside of the CS 3410 infrastructure container is a convenient way to assemble RISC-V assembly instructions. For example, we can use asbin to turn the RISC-V assembly instructions within payload.s into machine code:


$ rv asbin payload.s

This will create a file payload.bin containing the binary encoded instructions in your current working directory.

You can also use the equivalent shell incantation (assuming you’re already within the CS 3410 infrastructure container):


~# as payload.s -o tmp.o && objcopy tmp.o -O binary payload.bin && rm tmp.o

Epilogue

We’re here to help! Start early and take advantage of our office hours if you get stuck. Also, see the FAQ!

For an entertaining (and somewhat dated) read on buffer overflow attacks, check out:

Aleph One. “Smashing the Stack for Fun and Profit”. Phrack Magazine, 7(49), November 1996. https://phrack.org/issues/49/14#article

And finally, to reiterate: a friendly hacking challenge can be fun, and hacking skills are invaluable for working with real systems. But you must be responsible for your own behavior. We are not giving you free rein to launch attacks on CMS, fellow students’ machines, or anything else. Such behavior is unethical and most likely illegal as well.

FAQ

ECALLS and Other Instructions

You may see ECALL and other RISC-V instructions in the object dump. ECALL is an assembly instruction used to make a system call to the OS. You can refer to the RISC-V manual for further explanation of instructions, but don’t worry too much about understanding every instruction.

You need the newlines!

Yes, you need the newlines both before and after the “LOL 0wn3d!” message. Of course, getting the message in the first place is worth the most points, but the newlines will get you those final few points. So, an exploit that looks like this:


~# cat exploit | qemu browser
Where to connect?
LOL 0wn3d! hw342 is on imgur.com!

… is preferable to an exploit that looks like this:


~# cat exploit | qemu browser
Where to connect?  LOL 0wn3d! hw342 is on imgur.com!

As you may have discovered, you can’t simply embed a newline or carriage return in the message, because the browser stops reading when it encounters these characters. Something more clever is called for here.
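Because browser stops reading at these characters, it can help to scan a candidate exploit for them before testing. The Python sketch below reports where newline (0x0a) and carriage return (0x0d) bytes occur in a byte string. The function name and the sample input are ours, for illustration only.

```python
# Bytes at which browser stops reading its input, per the note above.
STOP_BYTES = {0x0A: "newline", 0x0D: "carriage return"}

def find_stop_bytes(data):
    """Return (offset, name) pairs for every stop byte found in data."""
    return [(i, STOP_BYTES[b]) for i, b in enumerate(data) if b in STOP_BYTES]

# Made-up sample input; in practice you would read your exploit file
# with open("exploit", "rb").read().
sample = b"AB\nC\r"
for offset, name in find_stop_bytes(sample):
    print(f"offset {offset}: {name}")
```

Remember that these byte values can also show up by accident inside encoded instructions or addresses, not only inside the message text.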

Aha! I found this handy vertical tab (`0x0b`) character! I can just use that instead of a newline, right?

No, a vertical tab is not a newline. You must embed a newline into the message.

Why does calling printf in my exploit print garbage?

Because of the nature of the exploit, we may end up ruining the value of the stack pointer. We need to set our sp and fp to be valid stack values so that function calls still work nicely.

Why are some instructions only 16 bits wide?

Some instructions in the browser instruction set are compressed. This shouldn’t affect your solution.

Why does it fail to connect to my program when I run gdb and try to connect to localhost?

This might be because you did not open your second shell with the same Docker container image, which will happen if you use rv instead of rv-debug to start the container. Check the first two instructions in the Debugging section, above, and make sure you didn’t mis-type anything. You can verify that the shell prompts in the two windows are in the same container instance by looking at the full text of the prompt, “root@<container_id>:~#”. The <container_id> value for both prompts should match.

Docker gives an error when I try to launch the container with `rv-debug`, saying there is a conflict.

This could result from having an rv-debug instance running in a different tab, or from an older container instance that failed to tear down even though it is no longer accessible (among other ways, this can happen with some uses of remote GDB). Check to see if you have another instance of the container running, and if so, use docker stop to kill it.


$ rv-debug
docker: Error response from daemon: Conflict. The container name "/testing" is already in use by container "d56938529b09ec020c69431d49ecc08a0f3043df26df684e125e92eb4b3f78ab". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.
$ docker ps -a
CONTAINER ID   IMAGE                          COMMAND       CREATED         STATUS         PORTS     NAMES
d56938529b09   ghcr.io/sampsyo/cs3410-infra   "/bin/bash"   3 minutes ago   Up 3 minutes             testing
$ docker stop testing
testing
$ rv-debug
root@45efb0b3855a:~#

I’m getting “broken pipe” and/or “the input device is not a TTY” errors when I try to pipe my exploit to `browser`’s input.

This will happen if you try to run browser with your exploit using rv or rv-debug in a single command, instead of first invoking a shell prompt in the container, because you’re connecting the stdout of one command with the stdin of browser, and there isn’t a way to invoke rv/rv-debug on both with the same container image. Possible forms of this error may look like one of the following:


$ rv cat exploit | qemu browser
bash: qemu: command not found
write /dev/stdout: broken pipe
$ cat exploit | rv qemu browser
the input device is not a TTY
$ rv cat exploit | rv qemu browser
the input device is not a TTY
write /dev/stdout: broken pipe

Submission

Submit all the files listed in Submission Requirements to Gradescope.

Rubric

exploit: 72 points
README.md: 28 points

Exploits that make system calls directly (i.e., using ecall) will receive no credit.