Project 4 > Buffer Overflow
CS 3410 Spring 2019
Due: 11:59PM, Tuesday, April 16, 2019
Late Policy: Up to 2 slip days can be used for this project. If you are out of slip days, submissions after the due date will incur a 25% deduction per day late.
Grace Period Policy: Do not rely on the grace period to submit on time. Everything should be uploaded BEFORE the due date above.
Reminder: You must work alone for this project.
Warning: Read the ENTIRE writeup before you begin. Regrades will not be honored for submissions that do not follow the writeup.
Setting up your Environment
For this project you should either be SSH'd into a UGCLINUX machine or
be using the course VM found on the course webpage.
Files: The files you will need for this assignment will be in
your personal github repository.
Before you can run the course simulator, you need to make sure that your
toolchain and environmental variables are set up correctly. We have provided
a handy setup script to handle this for you.
On the course server or VM, navigate to where your P4 release files are and run:
$ python setup.py
The script automatically detects whether you have the RISC-V toolchain installed
and attempts to determine your netid from your system environment. If
everything goes well, you should see something like
RISC-V toolchain detected at /usr/local/riscv, skipping PATH modification
Setting up environment variables...
Please confirm your NETID (enter nothing if default is correct), autodetected as "<your netid>":
If the displayed netid is correct, simply hit enter and finish setup. If not, enter
the correct netid and press return. If there is no local copy of the
RISC-V toolchain installed, setup.py will automatically download and install it
from the course server. Follow the provided prompts and proceed.
To invoke the simulator, run:
$ python simulate.py browser
Overview
The goal of this project is to get intimately familiar with the layout and
use of call stacks, as well as RISCV machine language, assembly and
disassembly, debugging, and reverse engineering. As a side benefit, we hope
to raise your awareness of computer security issues. To this end, you will
write a buffer overrun exploit to break a program that we provide to
you.
WARNING: These kinds of friendly hacking challenges have a long
history, and hacking skills are priceless, as they reflect a deep
understanding of the operation of a computer system. But you must be
responsible and use your skills wisely. Taking over machines or hacking the
Internet carries stiff penalties, is a sure-fire way to get expelled from
Cornell, interferes with other people's lives, and is a waste of your
talent. It is also plain wrong.
What to Submit
Submit your raw binary exploit file containing the specially crafted
input. We will try it out on our own copy of browser to see
if it successfully breaks it.
Also submit a text document/README that explains the exploit file. This should include
a text listing from xxd of the bytes in your exploit file,
annotated with comments to explain what it is doing (or trying
to do). Your documentation should explain how your exploit tries to subvert the program's check
that the input string matches the expected string, and why this works. In addition, it should
explain how your exploit is able to take control of the program and what steps the exploit takes
to force the program to print out the desired string.
The Story
In this project, you will "0wn" a binary program called browser
that we will provide to you. We will not be
providing the source code for this program. All that you know about this
program is what is documented here, and what you can figure out for
yourself by running or examining the binary. The browser is a
simplified web browser. The normal operation of browser is
very simple. When executed, it prompts you for a URL, and then prints a
simple message (the '$' shown here is the Linux shell prompt):
$ python simulate.py browser
Where to connect? www.google.com
Connected to www.google.com!
You can also send input to browser from another program using
the Linux shell '|' operator, with the same results:
$ echo "www.google.com" | python simulate.py browser
Where to connect?
Connected to www.google.com!
However, this browser only lets you connect to www.google.com.
All other URLs will be rejected. Try it and see!
The rumor is that browser suffers from a buffer overflow
vulnerability. Since the program only takes one input, it's not difficult
to guess where the problem might lie. Thus,
you would like to get this browser to let you connect to facebook.ru, even
though browser was originally designed to only allow access to www.google.com.
0wning browser: Your job is to craft some input to
browser that will cause it to print out a different message,
specifically: "LOL 0wn3d! <netid> is on facebook.ru!" (substitute your
own NetID). The fact that the normal "Only www.google.com is allowed"
message is missing constitutes proof that you have completely subverted the
browser, and have gotten it to do something that it could not do before.
$ cat exploit | python simulate.py browser
Where to connect?
LOL 0wn3d! hw342 is on facebook.ru!
To do this, you will need to inject new code into the browser
program as it is running. You are not allowed to modify or replace the
browser program on disk. The only way you get to interact with
browser is to feed it some carefully crafted input.
The simulator: The browser program is compiled to run
on a RISC-V CPU. Since most of you don't have access to a real RISC-V CPU
(neither do we), you will not be able to natively execute the program.
Instead, you can run a program which takes browser and
simulates the execution of the code.
To figure out how to attack browser, you'll need to step through its code
as it is executing and reverse engineer the parts that matter, namely,
where (i.e., at which memory location) the input buffer is stored, what
values lie near it in memory, and what precise instruction
sequence is vulnerable to a buffer overflow attack. Since you have the RISC-V
binary, you can use various tools to disassemble the browser
binary and learn about its layout and code.
You can also use the -d option to the simulator, which starts an
interactive debugger for the simulated program execution. This lets you
step through the execution one instruction at a time, examine memory and
the stack contents, and so on. See the README file in your
repo for help using the simulator and its built-in debugger.
Stack Randomization: Note that in a feeble effort to thwart just
such attacks, the simulator, like many real machines, implements stack
randomization, a limited kind of program layout randomization. When the
simulator starts, it initializes the stack to a variable address, rather
than the standard 0x7FFFFFFC. The starting location of the
stack is derived from the $NETID environment variable.
Executing the Attack: Once you have figured out the program and
stack layout, you need to come up with a carefully crafted input that will
take over browser. This input will likely contain some binary
data (the attack payload) that corresponds to RISC-V instructions you
want to have executed. There are several tools you might want to use to
create the payload and inject it into the running browser: a
RISC-V assembler, to convert from RISC-V assembly into RISC-V machine language;
xxd for converting text files containing hex digits to (or
from) raw binary files; and cat for sending raw binary input
to browser.
Once your attack causes browser to print the "LOL 0wn3d! <netid> is on facebook.ru!" message,
the browser program should exit gracefully (that is, exit with status 0).
It is trivial to make it loop forever. A clean exit only takes a few extra instructions
to invoke the normal exit() routine.
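When writing instruction words or addresses into a hex text file for xxd, the byte order matters. The sketch below assumes the simulated machine is little-endian, which is the standard RISC-V configuration but is not stated in this writeup. The helper name word_to_hex is hypothetical; the example address is the malloc address from the objdump example in the Tools section.

```python
import struct

def word_to_hex(word: int) -> str:
    """Return a 32-bit value as four space-separated hex bytes,
    least significant byte first (little-endian, an assumption here)."""
    return " ".join(f"{b:02x}" for b in struct.pack("<I", word))

NOP = 0x00000013          # RISC-V encoding of "addi x0, x0, 0"
MALLOC_ADDR = 0x00010CB8  # address taken from the objdump example in this writeup

print(word_to_hex(NOP))          # 13 00 00 00
print(word_to_hex(MALLOC_ADDR))  # b8 0c 01 00
```

The printed strings can be pasted into a text file and converted with xxd -r -p. Check the byte order you actually see in the debugger before relying on this.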
Tools
Here are a few tools you might find useful for this homework.
xxd
xxd is a tool for converting back and forth between
raw binary files and text representations of the binary data. For example,
if I create a file exploit.txt (using a regular text editor)
specifying twenty-eight consecutive "bytes" in hex:
68 77 33 34 32 20
00 00 00 00 00 00 00 00 00
00
01 02 03 04
aa bb cc dd
11 22 33 44
then I can convert this into raw binary using xxd in "reverse
plain" mode:
$ xxd -r -p exploit.txt > exploit
$ ls -l exploit*
-rw-r--r-- 1 hw342 hw342 28 2011-02-25 12:06 exploit
-rw-r--r-- 1 hw342 hw342 84 2011-02-25 12:06 exploit.txt
You can see that the text version is 84 bytes (includes spaces and 2
digits of text per "byte"), and the raw version is 28 bytes. xxd is sensitive
to the exact formatting of the input file in "reverse" mode
(spaces at the ends of lines silently mess things up, for example). So you
may want to convert the raw file back to text and compare to your desired
bytes to make sure nothing went wrong:
$ xxd exploit
0000000: 6877 3334 3220 0000 0000 0000 0000 0000 hw342 ..........
0000010: 0102 0304 aabb ccdd 1122 3344 ........."3D
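As a cross-check on the xxd round trip, the same conversion can be sketched in Python. This is only an illustration, not a replacement for xxd; the function name hex_text_to_bytes is hypothetical. It strips all whitespace first, so trailing spaces cannot change the result.

```python
def hex_text_to_bytes(text: str) -> bytes:
    """Convert whitespace-separated hex digits into raw bytes."""
    digits = "".join(text.split())  # drop spaces and newlines, including trailing ones
    if len(digits) % 2 != 0:
        raise ValueError("odd number of hex digits")
    return bytes.fromhex(digits)    # raises ValueError on non-hex characters

# Same contents as the exploit.txt example above.
EXPLOIT_TXT = """\
68 77 33 34 32 20
00 00 00 00 00 00 00 00 00
00
01 02 03 04
aa bb cc dd
11 22 33 44
"""

raw = hex_text_to_bytes(EXPLOIT_TXT)
print(len(raw))   # 28
print(raw[:6])    # b'hw342 '
```

If the length printed here differs from the size of the file xxd produced, the text file probably contains something xxd parsed differently than you intended.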
Using the standard library
riscv32-objdump can give you a listing of the
assembly code for browser:
$ riscv32-objdump -xdl browser
This is very helpful because it includes the disassembly of the standard library,
which has functions you need to call.
Note: all functions without underscores follow calling conventions. If you want to see
information about function calls (such as get(), print(), etc.) that you see in the object dumps, refer to the Linux man pages.
Example: to use the stdlib function malloc (which is not relevant to this project
and is only used here as an example), search the assembly code of browser
output by riscv32-objdump for the function's label:
00010cb8 <malloc>:
malloc():
10cb8: 85aa mv a1,a0
10cba: 1cc1a503 lw a0,460(gp) # 1da1c <_impure_ptr>
10cbe: a029 j 10cc8 <_malloc_r>
This enables you to call the function malloc by jumping to address 00010cb8 and using
standard calling conventions to invoke the call by saving arguments in the appropriate registers.
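Looking up function addresses by hand in a long listing is error-prone. A minimal sketch of pulling label addresses out of objdump text follows. It assumes label lines look exactly like the "00010cb8 <malloc>:" line shown above; the function name symbol_addresses is hypothetical.

```python
import re

# Matches label lines such as "00010cb8 <malloc>:" and nothing else.
LABEL_RE = re.compile(r"^([0-9a-fA-F]+) <([^>]+)>:$")

def symbol_addresses(listing: str) -> dict:
    """Map each function label in an objdump listing to its address."""
    table = {}
    for line in listing.splitlines():
        match = LABEL_RE.match(line.strip())
        if match:
            table[match.group(2)] = int(match.group(1), 16)
    return table

# The malloc excerpt from this writeup, used as sample input.
SAMPLE = """\
00010cb8 <malloc>:
malloc():
10cb8: 85aa mv a1,a0
10cba: 1cc1a503 lw a0,460(gp) # 1da1c <_impure_ptr>
10cbe: a029 j 10cc8 <_malloc_r>
"""

symbols = symbol_addresses(SAMPLE)
print(hex(symbols["malloc"]))  # 0x10cb8
```

To use it on the real binary, save the objdump output to a file and pass the file contents to symbol_addresses.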
Pipelines and Redirections
Pipes and redirection, you may recall, are shell command line
operators that let you connect the output of one program (say
cat or xxd) to the input of another program or to
a file. So you can, for example, concatenate two text files using
cat, send the resulting text as input to xxd -r -p, send the resulting raw binary to the simulated
browser, then send the resulting output to a file
output.txt, all using a single command:
$ cat exploit_part1.txt exploit_part2.txt | xxd -r -p | python simulate.py browser > output.txt
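The same plumbing can be driven from a script, which makes it easy to check the exit status as well as the output. The sketch below is one possible approach, not part of the provided tooling. The helper run_with_stdin is hypothetical, and a small stand-in child process that echoes its input is used so the example runs without the simulator.

```python
import subprocess
import sys

def run_with_stdin(command, payload: bytes) -> subprocess.CompletedProcess:
    """Run a command, feed it raw bytes on stdin, and capture its stdout."""
    return subprocess.run(command, input=payload, stdout=subprocess.PIPE, check=False)

# Stand-in child that copies stdin to stdout (an assumption for illustration only).
ECHO_CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())",
]

result = run_with_stdin(ECHO_CHILD, bytes.fromhex("6877333432"))
print(result.stdout)      # b'hw342'
print(result.returncode)  # 0
```

To try it against the simulator, replace ECHO_CHILD with ["python", "simulate.py", "browser"] and pass the contents of your exploit file as the payload.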
Debugging
To start an interactive debugger for the simulated program execution, run
$ python simulate.py -d browser
For help, use
$ python simulate.py -h browser
When running this with the -d flag, you will be told that the program is listening on a given port number. You will need this port number in the next step.
Once the simulator is waiting in debug mode, open gdb by running the following command in another terminal:
$ riscv32-gdb browser
And in the GDB terminal, using the port number you saw earlier,
(gdb) target remote localhost:[enter port number here]
You can now debug the program remotely.
- To debug with gdb and see the assembly of the program, you can use:
(gdb) layout asm
- To step to the next assembly instruction, you can use:
(gdb) si
- To examine the stack, you can use the sp register. This prints 4 words of memory, starting at sp:
(gdb) x/4x $sp
This prints 10 words of memory, starting at sp - 20:
(gdb) x/10x $sp-20
- To list all the registers, you can use:
(gdb) i r
- To print the contents of a specific register, use the following, where # is the register number:
(gdb) p $#
You may find the GDB lab (in your course repo) useful as a refresher. For more information on the x GDB command, refer to: https://sourceware.org/gdb/onlinedocs/gdb/Memory.html.
Epilogue
We're here to help. Take advantage of our office hours if you are stuck.
For an entertaining (and somewhat dated) read on buffer overflow
attacks, check out:
Aleph One. Smashing the Stack for Fun and Profit. Phrack Magazine, 7(49), November 1996.
http://www.phrack.org/issues.html?issue=49&id=14
And finally, to reiterate: a friendly hacking challenge can be fun, and
hacking skills are invaluable for working with real systems. But you must
be responsible for your own behavior. We are not giving you free
rein to launch attacks on CMS, fellow students' machines, or anything
else. Such behavior is unethical and most likely illegal as well.
FAQ
- ECALLS and Other Instructions
You may see ECALL and other RISC-V instructions in the object dump. ECALL
is an assembly instruction used to make a system call to the OS. You can refer to the RISC-V manual
for further explanation of instructions, but don't worry too much about understanding every instruction.
- You need the newlines!
Yes, you need the newlines both before and after the "LOL 0wn3d!"
message. Of course getting the message in the first place is worth the
most points, but the newlines will get you those final few points.
So, an exploit that looks like this:
$ python simulate.py browser < pht24-soln
Where to connect?
LOL 0wn3d! pht24 is on facebook.ru!
... is preferable to an exploit that looks like this:
$ python simulate.py browser < pht24-bad
Where to connect? LOL 0wn3d! pht24 is on facebook.ru!
As you may have discovered, you can't simply embed a newline or
carriage return in the message, because the browser stops reading when
it encounters these characters. Something more clever is called
for.
- Aha! I found this handy vertical tab (0x0b) character! I can just use
that instead of a newline, right?
No, a vertical tab is not a newline. You must embed a newline into the
message.
- Why does calling printf in my exploit print garbage?
Because of the nature of the exploit, we may end up ruining the value
of the stack pointer. We need to set our sp and fp to be valid stack
values so that function calls still work nicely.
- Why does it fail to connect to my program when I run gdb and try to connect to localhost?
This might be because you are not using tmux to open different
sessions/screens. Without tmux, if you ssh twice, you might get put
into a different machine, thus connecting using localhost won't work.
You could also get lucky and not be on a different machine, but that's
just luck.
- Why are some instructions only 16 bits wide?
Some instructions in the browser binary use the compressed 16-bit RISC-V
encoding. This shouldn't affect your solution.
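Following up on the newline question above: since the browser stops reading at a newline or carriage return, it can help to scan a candidate input for those bytes before feeding it in. This is a minimal sketch; the function name find_terminators is hypothetical, and the sample bytes are made up for illustration.

```python
def find_terminators(payload: bytes, terminators=(0x0A, 0x0D)):
    """Return (offset, byte value) for every newline or carriage return."""
    return [(i, b) for i, b in enumerate(payload) if b in terminators]

# Made-up sample with a newline at offset 2 and a carriage return at offset 6.
sample = bytes.fromhex("68770a3334320d")
print(find_terminators(sample))  # [(2, 10), (6, 13)]
```

Note that a trailing newline at the very end of the input may be intentional, so treat the result as a list of offsets to inspect rather than as a pass/fail check.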