Input/output in Java

This serves as a complement to the textbook’s coverage of file input and output (online supplement 2) and to the material introduced in lecture.

Streams and encodings

Computer programs can read input from a variety of sources, including characters typed on a keyboard, files loaded from a disk, and messages received over a network connection. Similarly, program output can be written to a variety of destinations: it may be printed on a console, saved to a file on a disk, or sent over a network connection. Java abstracts over this variety by introducing a concept called a stream that represents the capabilities common to all of them. Streams provide methods for reading or writing the next byte or character, but unlike with an array, you usually can’t “rewind” or “fast forward” a stream (a keyboard doesn’t remember a history of past key presses, and you can’t skip ahead by sending “nothing” over the network). If your I/O code uses only the stream interface, your programs can be agnostic to the precise sources and destinations of their inputs and outputs, decoupling their logic from those details.

Fundamentally, computer I/O takes the form of streams of bits, which are almost always grouped into bytes (8 bits) for general-purpose computing. Java provides two abstract classes (you can think of them like interfaces) for byte streams: InputStream and OutputStream. These are the interfaces to use when working with “binary” files and protocols, and they require you to handle issues like endianness and encoding yourself.
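To illustrate what handling endianness yourself can look like, here is a minimal sketch that writes a 32-bit int to an OutputStream as four bytes, most significant byte first (big-endian), and reads it back from an InputStream. The class and method names (ByteStreamSketch, writeIntBigEndian, readIntBigEndian) are hypothetical, and in-memory byte array streams stand in for a file or network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class ByteStreamSketch {
    /** Write `value` as four bytes, most significant byte first. */
    static void writeIntBigEndian(OutputStream out, int value) throws IOException {
        out.write(value >>> 24);  // write(int) only uses the low 8 bits
        out.write(value >>> 16);
        out.write(value >>> 8);
        out.write(value);
    }

    /** Read four bytes and reassemble them, most significant byte first. */
    static int readIntBigEndian(InputStream in) throws IOException {
        int result = 0;
        for (int i = 0; i < 4; i++) {
            int b = in.read();  // a value in 0..255, or -1 at end of stream
            if (b == -1) {
                throw new EOFException("Stream ended before 4 bytes were read");
            }
            result = (result << 8) | b;
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // In-memory streams hold no system resources, so they are not closed here.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeIntBigEndian(out, 0x12345678);
        byte[] bytes = out.toByteArray();  // {0x12, 0x34, 0x56, 0x78}

        InputStream in = new ByteArrayInputStream(bytes);
        System.out.println(Integer.toHexString(readIntBigEndian(in)));  // prints 12345678
    }
}
```

Because both methods accept the abstract InputStream and OutputStream types, the same code would work unchanged with a file or socket stream.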

While binary data is the most general, much of computing involves “plain text” files and protocols; this includes Java source code files, HTML representing web pages, and HTTP network requests. Representing plain text as bytes requires a character encoding (or “charset”) to say which bytes correspond to which letters (e.g. the byte with value 65 might represent the letter ‘A’). Once upon a time, English-speaking programmers picked 95 common characters and assigned one byte to each one (with room left over for control codes and error checking); the result was ASCII encoding. Other societies with different alphabets or writing systems developed their own encodings, sometimes extending ASCII and other times being incompatible. The upshot was that international plain text files could not be interpreted without also knowing their encoding.

Eventually Unicode standardized how to map nearly all of the world’s characters to numbers. With more than 149,000 characters and counting, there is no way to map each one to a single byte (which only has 256 different values), so characters must be encoded as sequences of one or more bytes. The most common Unicode encoding format is UTF-8, which is ASCII-compatible (so a file containing only ASCII characters, such as unaccented English letters, would have the same binary representation as an ASCII-encoded version of that file). Files in this course will be encoded as UTF-8.
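As a small illustration of these points, the sketch below encodes a few strings with String.getBytes() and counts the resulting bytes. The class name Utf8Sketch and the helper utf8Length are hypothetical; non-ASCII characters are written as Unicode escapes so that the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Sketch {
    /** Number of bytes needed to encode `s` in UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        byte[] ascii = "A".getBytes(StandardCharsets.UTF_8);
        System.out.println(ascii.length + " byte, value " + ascii[0]);  // prints "1 byte, value 65"

        System.out.println(utf8Length("\u00e9"));  // e with acute accent: prints 2
        System.out.println(utf8Length("\u20ac"));  // euro sign: prints 3

        // ASCII-only text has identical bytes in both encodings.
        System.out.println(Arrays.equals(
                "Java".getBytes(StandardCharsets.UTF_8),
                "Java".getBytes(StandardCharsets.US_ASCII)));  // prints true
    }
}
```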

How does this relate to streams? When working with plain text files, Java provides the abstract classes Reader for text input and Writer for text output. The interfaces of these classes operate on one character at a time, rather than one byte at a time (as with InputStream and OutputStream). But since they ultimately need to read or write bytes from an underlying byte stream, they need to use an encoding. This can be specified when constructing an object (like a FileWriter), or you can use the default. Unfortunately, Java’s default has historically varied with operating system; on macOS and Linux it is usually UTF-8, but on Windows it may be something else, although since Java 18 the default is UTF-8 on all platforms unless configured otherwise (try printing java.nio.charset.Charset.defaultCharset() if you’re curious).
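The following sketch shows one way to name the encoding explicitly rather than relying on the default, using OutputStreamWriter and InputStreamReader (the standard adapters between byte streams and character streams). The class CharsetSketch and its encode/decode helpers are hypothetical, and in-memory byte streams are assumed in place of files; for a real file you would substitute a FileOutputStream or FileInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSketch {
    /** Encode `text` to bytes through a Writer that uses charset `cs`. */
    static byte[] encode(String text, Charset cs) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, cs)) {
            w.write(text);
        }  // closing the writer flushes the encoded bytes
        return bytes.toByteArray();
    }

    /** Decode `data` back to text through a Reader that uses charset `cs`. */
    static String decode(byte[] data, Charset cs) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(data), cs)) {
            char[] buf = new char[64];
            int n;
            while ((n = r.read(buf)) != -1) {  // read a batch of chars at a time
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "caf\u00e9";  // "cafe" with an accented final letter
        byte[] data = encode(original, StandardCharsets.UTF_8);
        System.out.println(data.length);  // prints 5: the accented letter takes 2 bytes

        // Decoding with the same charset recovers the text.
        System.out.println(decode(data, StandardCharsets.UTF_8).equals(original));  // prints true

        // Decoding with the wrong charset silently produces different text.
        String garbled = decode(data, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.length());  // prints 5: each byte became one char
    }
}
```

The mismatch case does not throw an exception, which is why it helps to specify the same charset on both the reading and writing sides.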

Decorators

As the Reader and Writer interfaces only work with individual characters (or batches thereof), they are low-level and inconvenient for many common tasks. For example, reading lines might look like this (note: StringBuilder is a class representing a mutable string):

    Reader in = new FileReader(path);

    // Read one line (don't do it this way!)
    StringBuilder sb = new StringBuilder();
    int c = in.read();
    while (c != -1 && c != '\n') {
        sb.append((char)c);
        c = in.read();
    }
    String line = sb.toString();

Not only is this ugly (and begging to be abstracted into a helper function), it would actually need to be even longer to accommodate differences in line endings between Windows and macOS/Linux (Windows uses "\r\n"), and it would run slowly, because reading only one character at a time from a stream is inefficient. We would like a stream class with higher-level functionality. Likewise, for output, we would like a class that could format and print strings (like System.out), which a plain Writer does not do. But in providing these features, we don’t want to lose access to the underlying stream, since other I/O functions might just be expecting a Reader or Writer.

The decorator pattern is a design strategy that helps us here. We want to “wrap” our stream in another object that provides additional functionality while also keeping the original stream interface intact. These wrapper classes are therefore themselves subclasses of Reader or Writer. For example, a BufferedReader is a Reader with an additional method for reading whole lines of text (regardless of line ending); it also uses buffering to read multiple characters into an array as a batch before looking for newlines, improving performance compared to reading one char at a time. On the output side, PrintWriter is a Writer that also provides println(), print(), and printf() methods. System.out is actually a PrintStream, which extends OutputStream rather than Writer to allow non-character output (in this case, encoding is handled by the PrintStream itself).

Here’s how to read a line using BufferedReader:

    Reader in = new FileReader(path);
    BufferedReader br = new BufferedReader(in);

    // Read one line
    String line = br.readLine();
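To check the claim that readLine() copes with different line endings, here is a minimal sketch that decorates a StringReader (an in-memory Reader, assumed here in place of a FileReader) and collects every line. The class LineEndingSketch and the method readAllLines are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class LineEndingSketch {
    /** Split `text` into lines using BufferedReader.readLine(). */
    static List<String> readAllLines(String text) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line = br.readLine();
            while (line != null) {  // null signals the end of the stream
                lines.add(line);    // the line terminator is not included
                line = br.readLine();
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // The first line ends with "\r\n" (Windows), the second with "\n".
        System.out.println(readAllLines("one\r\ntwo\nthree"));  // prints [one, two, three]
    }
}
```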

When working with decorators, it is usually best to use only the “most wrapped” object. Streams with buffers may not write all of their data at once, leading to jumbled up output if you also try to write to their underlying stream. In situations where you need to ensure that any data in output buffers immediately goes to its destination, you can call flush() on the decorated stream.
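The effect of buffering can be observed with a small sketch like the following, which wraps a StringWriter (an in-memory destination, assumed here in place of a file) in a BufferedWriter and inspects the destination before and after calling flush(). The class FlushSketch and its helper method are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

class FlushSketch {
    /** Returns the destination's contents before and after flushing the decorator. */
    static String[] contentsBeforeAndAfterFlush(String text) throws IOException {
        StringWriter dest = new StringWriter();  // stands in for a file or socket
        try (BufferedWriter bw = new BufferedWriter(dest)) {
            bw.write(text);  // short text is held in the decorator's buffer
            String before = dest.toString();
            bw.flush();      // push buffered characters to the underlying Writer
            String after = dest.toString();
            return new String[] { before, after };
        }
    }

    public static void main(String[] args) throws IOException {
        String[] contents = contentsBeforeAndAfterFlush("hello");
        System.out.println("before flush: \"" + contents[0] + "\"");  // prints before flush: ""
        System.out.println("after flush: \"" + contents[1] + "\"");   // prints after flush: "hello"
    }
}
```

This only behaves as shown for text shorter than the buffer; once the buffer fills, a BufferedWriter writes to its destination on its own.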

Closing resources

It is extremely important to close I/O streams when you are done using them. This ensures that output streams are flushed to their destination (see above) and that any system resources associated with these files are returned. For example, on some operating systems, you may not be allowed to delete a file if a program still has it open. In Java, a stream can be closed by calling its close() method; this should be done in reverse order of decoration. Once a stream (or any of its decorators) has been closed, the stream cannot be used again.

Java used to recommend closing streams in a finally block associated with the try block for opening and using the stream. However, programmers often forgot, and situations involving multiple streams were hard to handle robustly. Thankfully there’s now an easier way.

Java’s try-with-resources statement will automatically close any streams that were opened in the “argument” of the try. Here’s an example that reads a single line (as above) in context:

static String readFirstLineFromFile(String path) throws IOException {
    try (Reader in = new FileReader(path);
         BufferedReader br = new BufferedReader(in)) {
        return br.readLine();
    }
    // `br.close()` is automatically called here, followed by `in.close()`,
    // whether or not an exception was thrown.
}

Now regardless of whether the function returns or throws an exception, both the BufferedReader decorator and the FileReader source will have close() automatically called on them before execution proceeds. In this case, any exceptions are allowed to propagate, but you could add a catch clause to handle them if desired. A try-with-resources statement can be used with any object that implements the AutoCloseable interface, which applies to many resources with a close() method in addition to I/O streams.
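To see the closing order for yourself, the sketch below uses a hypothetical AutoCloseable class, Tracked, that records when it is opened and closed; it is not a real I/O stream, only a stand-in for one. The names CloseOrderSketch and run are likewise assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderSketch {
    /** Hypothetical resource that logs when it is opened and closed. */
    static class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
            log.add("open " + name);
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    /** Runs a try-with-resources statement and returns the order of events. */
    static List<String> run(boolean fail) {
        List<String> log = new ArrayList<>();
        try (Tracked a = new Tracked("a", log);
             Tracked b = new Tracked("b", log)) {
            log.add("body");
            if (fail) {
                throw new IllegalStateException("simulated failure");
            }
        } catch (IllegalStateException e) {
            log.add("catch");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false));  // prints [open a, open b, body, close b, close a]
        System.out.println(run(true));   // prints [open a, open b, body, close b, close a, catch]
    }
}
```

Resources are closed in the reverse of the order in which they were declared, and this happens before any catch clause runs.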

Whenever creating or decorating a stream, always use a try-with-resources statement to ensure that it is closed (many students have experienced “missing” output because their output streams were not closed, which makes debugging very difficult).

What about Scanner?

Textbooks and AP courses typically use the Scanner class from java.util to handle input conveniently in Java. Scanner is useful when getting started, but it has a number of issues in more advanced applications, including a limitation in how it handles empty tokens.

For these reasons, you may see examples using BufferedReader and String.split() instead of Scanner in this course.

The limitation regarding empty tokens deserves an example. Consider the string ",a," with the delimiter ",":

// Requires: import java.util.Scanner;
String line = ",a,";

// Scanner will only see one token:
Scanner sc = new Scanner(line);
sc.useDelimiter(",");
String first = sc.next();     // "a"
boolean more = sc.hasNext();  // false

// split() can find three tokens:
String[] tokens = line.split(",", -1);
// The -1 above is necessary for trailing empty tokens (see JavaDoc)
int count = tokens.length;    // 3
// tokens[0] is "", tokens[1] is "a", and tokens[2] is ""

Finally, here is an example of reading (and printing) all lines in a stream using BufferedReader. Note that readLine() returns null once the end of the stream has been reached (returning null is frowned upon nowadays, but this class has been around since nearly the beginning):

String path = "hello.txt";
// Try-with-resources structure
try ( Reader in = new FileReader(path);
      BufferedReader br = new BufferedReader(in) ) {

    String line = br.readLine();
    while (line != null) {
        System.out.println(line);
        line = br.readLine();
    }

} catch (IOException e) {
    // `br.close()` and then `in.close()` have already been called
    // automatically by the time this catch block runs.
    System.err.println("Error reading from file " + path);
}
// The streams are closed at this point whether or not an exception was thrown.

And here is an example of converting tokens to numbers without using Scanner (instead it uses the static “parse” functions in the wrapper classes for the primitive types, like Integer.parseInt()):

String phone = "607-555-5309";
String[] tokens = phone.split("-");
String areaCode = tokens[0];
String exchange = tokens[1];
String lineNum = tokens[2];

// With strings, `+` concatenates:
System.out.println("Concatenation: " + areaCode + exchange + lineNum);

// But with ints, `+` adds:
try {
    int sum = Integer.parseInt(areaCode) +
              Integer.parseInt(exchange) +
              Integer.parseInt(lineNum);

    System.out.println("Sum: " + sum);
} catch (NumberFormatException e) {
    // Instead of `hasNextInt()`, catch `NumberFormatException` to check if token is an int
    System.err.println("One of these tokens is not an int");
}