CMU 15-113: Input

One fairly important thing to know in C is how to get input from the user, whether from a human sitting at the keyboard or from a file on disk. The C standard library provides a lot of subtly different input functions; in this lecture, we'll cover most of them, and explain how to use them effectively.

`getc`: the root of all (good and) evil

C's input/output model is based on Unix streams, as explained briefly in the miscellaneous lecture. There are two fundamental kinds of stream — input and output — and each of them has one fundamental operation you can perform on it. You can read one byte from an input stream, and write one byte to an output stream.

C calls these two operations getc and putc. To read one character k from an input stream (a.k.a. FILE *) fp, you write k = getc(fp). To write that character to an output file fp2, you write putc(k, fp2). (putc returns a value, but you can generally ignore it. It's only relevant if the stream is broken for some reason — for example, another user deleting your file from the file system. Then putc will tell you about it.)

Instead of getc(stdin), you may write getchar(). Instead of putc(k, stdout), you may write putchar(k). These are just shortcuts provided by the C standard library; they probably compile to the same object code anyway.

Instead of getc(fp), you may write fgetc(fp). Instead of putc(k, fp), you may write fputc(k, fp). These pairs of functions do the exact same thing; the only difference is that getc and putc may be implemented in terms of macros, so, for example, getc(fp++) might do unexpected things, while fgetc(fp++) is guaranteed to do what you should expect from the operator precedence rules. In practice, there's usually no reason to prefer fgetc and fputc, but I mention them for completeness.

`gets`: the "all evil" I was telling you about

Consider the following function definition:

    char *get_a_line(char *s)  /* gets */
    {
        int i, k;
        for (i=0; (k=getchar()) != '\n'; ++i)
          s[i] = k;
        s[i] = '\0';
        return s;
    }

Looks simple, right? It just takes input from the user until the user enters a newline, and then returns that line of input as a null-terminated string. Easy! And that's the way input used to be done, back in the days when only one person ever used a computer, and you could trust him to enter the right data, because he was paying a couple hundred dollars an hour for the privilege.

These days, you can't make those assumptions, because computers are cheap and people are stupid. But there's still this function in the C library called gets, which is implemented basically the way you see above.^* But you shouldn't use it, because — as you should know by now — we will try to break your programs, and if you use gets, we will succeed, because there's no way you can stop us — we just figure out how many bytes your buffer has room for, and enter that many characters plus one.

So gets is evil and should never be used. Moving on.

`fgets`: one fix

The problem with gets is that there's no way to avoid a buffer overflow, because gets doesn't know how long your buffer is. The solution is to pass the length of your buffer — and the function that expects it is fgets. (Unlike fgetc, fgets actually does something different from the no-f version.^*) Here's another simplified implementation.

    char *get_safely(char *s, int len, FILE *fp)  /* fgets */
    {
        int i;
        for (i=0; i < len-1; ) {
            int k = getc(fp);
            s[i++] = k;
            if (k == '\n') break;  /* the newline stays in the buffer */
        }
        s[i] = '\0';
        return s;
    }

Notice that this function, unlike its predecessor, actually stores the newline character in the buffer. This is so that you can tell whether it returned due to a newline, or just because it ran out of space. C's library functions are like that — they like to tell you as much information as they can, although they sometimes do it in cryptic or inconvenient ways. (For example, in order to find out whether fgets got a whole line or not, you have to scan to the end of the buffer: an O(n) operation!)

`fgetline_113`: our home-brewed solution

You know all about fgetline_113 by now. It is built on top of fgets, just the way fgets is built on top of getc. It removes the need to care about the length of a user-specified buffer, and replaces it with the need to care about freeing a buffer dynamically allocated by the input function itself.

Formatted input

Okay, let's talk about the interesting stuff: How do we parse complicated user input? For an example, let's take the result of the who command on Unix, which for me gives output like this:^*

    ajo      pts/13       Apr 15 21:38 (voidptr.res.cmu.edu)
    sckang   pts/14       Apr 15 21:50 (tarkin.ece.cmu.edu)
    pgunn    pts/15       Apr 15 17:12 (dsl093-060-194.pit1.dsl.speakeasy.net)
    justinli pts/5        Apr 15 16:08 (PHERAXUS.RES.cmu.edu)

A line here consists of a single word; some whitespace; the literal string "pts/"; a decimal integer; some whitespace; a date (essentially a word followed by an integer); a time (essentially two more integers separated by a colon); and then a word in parentheses. If we were going to print all this out — ignoring details like nice tabular alignment for the moment — we'd write something using the formatted output function printf, such as

    printf("%s  pts/%d   %s %d %d:%d (%s)\n",
        username, termline, month, day, hour, minute, hostname);

So, to read in that line, we can use the "opposite" of printf: the formatted input function, scanf. And, because printf and scanf were designed to work together this way, we can use the exact same format string!^* The only difference is that we have to take the addresses of all the variables, because we're asking scanf to modify them for us.

    char username[20], month[5], hostname[20];
    int termline, day, hour, minute;
    scanf("%s  pts/%d   %s %d %d:%d (%s)\n",
        username, &termline, month, &day, &hour, &minute, hostname);

Okay, all well and good, except — what happens if some user's hostname exceeds 19 characters? scanf("%s"), like gets, doesn't have any idea how big your buffer is, so it happily overflows the buffer and wreaks havoc with your program. So we can't use plain old scanf("%s").

Luckily, the designers of scanf planned for this. Let's tell scanf how big our buffers are:

    char username[20], month[5], hostname[20];
    int termline, day, hour, minute;
    scanf("%19s pts/%d %4s %d %d:%d (%19s)\n",
        username, &termline, month, &day, &hour, &minute, hostname);

Notice that we had to use 19 instead of 20 because scanf, unlike fgets, doesn't include the terminating null character in its calculations. Also, I've removed some of the redundant whitespace; scanf just skips over whitespace in the input anyway, unless it's specifically looking for something that might be a space or tab character.

And that's all there is to scanf, as long as the input is formatted properly. Try running the following program on the output of who; you can use a command line such as who | ./a.out to do that.

    #include <stdio.h>

    int main()
    {
        char username[20], month[5], hostname[50];
        int termline, day, hour, minute;

        while (1) {
            int rc = scanf(" %19s pts/%d %4s %d %d:%d (%49[^) \n])",
                username, &termline, month, &day, &hour, &minute, hostname);
            if (rc != 7) break;
            printf("(%s) %d:%d %d %s pts/%d %s\n",
                hostname, hour, minute, day, month, termline, username);
        }
        return 0;
    }

(Recall that scanf returns the number of format specifiers successfully converted; since we have seven percent signs, we expect a return value of 7. Notice also that I've quietly incorporated the bugfix from the last footnote.)

Error handling and recovery

Okay, so we can handle the valid input produced by who. Now try piping the output of who -q through our program. It fails silently, with no signal to the user that something isn't right. We could fix that by checking the return value of scanf and printing appropriate messages — something like

    switch (rc) {
        case 0: fatal("That line contained no data!");
        case 1: fatal("That line didn't contain a terminal number!");
        case 2: case 3: fatal("That line's date wasn't in the right place!");
        case 4: case 5: fatal("That line's time wasn't in the right place!");
        case 6: fatal("That line didn't contain a hostname!");
    }

But that "solution" would leave at least two problems: One, we can't handle the problem that scanf matches the hostname, but fails to match the final parenthesis; and two, we still have that arbitrary limit on the size of the username, month, and hostname strings. If the user gives us some bad data, we'll fail to parse it correctly. So let's scrap the "pure" scanf approach, and work on a new idea: parsing via the string functions.

Tokenizing strings in C

We already know how to get a single line from the input stream; use getline_113. So it might be a good idea to try getting each line of the input sequentially, and then breaking each line into pieces ourselves, rather than relying on scanf to do the breakdown for us. Consider the following function:

    char *get_username(const char *line)
    {
        char *username;
        int start, end;
        for (start = 0; line[start] == ' '; ++start)
          continue;
        for (end = start; line[end] != '\0' && line[end] != ' '; ++end)
          continue;
        username = malloc(end - start + 1);
        if (username == NULL) return NULL;
        memcpy(username, line+start, end-start);
        username[end-start] = '\0';
        return username;
    }

It takes a line of input — for example, a line gotten from getline_113 — and pulls out the first chunk of non-whitespace characters. That chunk goes into a newly allocated string that we end up returning to the caller. (We could just as easily copy the chunk into a buffer given to us by the caller, but then we'd have to deal with too-small buffers.)

Notice the use of the memcpy function. memcpy is just like strcpy, except that instead of copying until it finds a null byte, it copies only as many bytes as you tell it to. It doesn't null-terminate its target the way strcpy would, so we have to do that ourselves. Two equivalent ways of writing that code are

    int i;
    for (i=0; i < end-start; ++i)
      username[i] = line[start+i];
    username[end-start] = '\0';

and

    sprintf(username, "%.*s", end-start, line+start);

And that's the basic pattern for tokenizing strings in C. Just write a for loop, scanning over the characters in the string until you find what you're looking for.

The `<ctype.h>` macros

But there are some refinements we need to make. First, in the code above, we are allowing only spaces to separate the username from the next field on the line. What happens if the user enters a tab character instead? Well, the C library provides a few nice macros in the standard header <ctype.h> that will let us deal with this question. They are:

isspace: space, tab, newline, and the other whitespace characters (carriage return, vertical tab, form feed)
isdigit: decimal digits 0 through 9
isalpha: letters A through Z and a through z
islower, isupper: a through z and A through Z, respectively
isalnum: equivalent to isdigit(x) || isalpha(x)
isxdigit: hexadecimal digits (0–9, A–F, a–f)

If we use isspace instead of a simple comparison to space, we can make the code more robust — that is, we can make it more likely to do the right thing. We could just add another explicit comparison to the tab character, line[i] != '\t', but if we use isspace, another C programmer reading our code will be able to see at a glance what we mean. So we're better off using the library macro.

One disadvantage of isspace and friends is that they technically require you to convert their argument to unsigned char before you can use them. The version of GCC on our Sun Solaris machines will give you a warning ("Warning: Subscript has type char") if you forget to do the conversion. The conversion is most easily performed with an explicit cast — isspace((unsigned char)x) — which is ugly and violates our rule that casts are generally evil. Unfortunately, if you want to be strictly correct, the conversion must be there somewhere. (In my own code, I typically do leave out this cast, but since we're telling you to avoid all warnings and errors, you should be aware of this issue.)

    int get_username(const char *line, char **username)
    {
        int start, end;
        for (start = 0; isspace((unsigned char)line[start]); ++start)
          continue;
        for (end = start; line[end] && !isspace((unsigned char)line[end]); ++end) {
            if (!isalnum((unsigned char)line[end]))
              return NO_VALID_USERNAME;
        }
        *username = malloc(end - start + 1);
        if (*username == NULL) return OUT_OF_MEMORY;
        sprintf(*username, "%.*s", end-start, line+start);
        return 0;
    }

The simple pitfalls

Consider the following program, Harry Bovik's "solution" to the input phase of Lab 4. How many bugs can you find in Harry's code? How many can you fix, without reading ahead?

    #include <stdio.h>

    void load_input(int *array)
    {
        int i;
        printf("Enter 16 integers, please.\n");
        while (scanf("%d", &array[i]) != 0)
          ++i;
    }

    int main()
    {
        int array[16];
        load_input(array);
        /* process(array); */
        return 0;
    }

You should have found at least two critical bugs. The first, and most trivial, bug is that Harry forgot to initialize i before using its value. Another bug, harder to correct, is that Harry is completely trusting the user to enter exactly 16 numeric values — no less, and certainly no more! If the user enters 17 values, the program will overflow array and invoke undefined behavior. A third, subtle, and kind of stupid bug is that Harry is only checking the return value of scanf against zero; if it returns EOF (for example, because the user hit Ctrl-D at the prompt), Harry's program will just sit there, silently looping, until the user gives up and kills it. That's not very user-friendly, is it?

You should now see that Harry's program contains three distinct kinds of bad things: trivial coding bugs, like forgetting to initialize i; security holes, like trusting the user to enter exactly 16 values; and user-unfriendliness, like silently looping if the user tries to quit with Ctrl-D. More generally, we could call these plain old stupidity, unwarranted friendliness to evil users, and unwarranted hostility to good users. You should check your own programs carefully for these three classes of bugs.

Let's start fixing Harry's program for him.

    #include <stdio.h>
    #include <stdlib.h>  /* for 'exit' */

    /* See misclecture for an explanation of this macro */
    #define NELEM(x) ((int)(sizeof (x) / sizeof *(x)))

    void load_input(int *array, int len)
    {
        int i;
        printf("Enter 16 integers, please.\n");
        for (i=0; i < len; ++i) {
            if (scanf("%d", &array[i]) < 1) {
                puts("You didn't enter enough integers!");
                exit(EXIT_FAILURE);
            }
        }
    }

    int main()
    {
        int array[16];
        load_input(array, NELEM(array));
        /* process(array); */
        return 0;
    }

Okay, that seems reasonable. But let's take it a step further. Suppose the user enters "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 15.5 16". Our new program will silently accept the first 15 numbers, read "15.5" as "15", and leave the characters ".5 16" cluttering up the input buffer. That's less than perfect. So let's read a word at a time, and do the conversion to integers ourselves.

We have to watch out, here — we can't just use scanf("%s", buffer) to read a word, because that's a security hole. So let's exploit our knowledge that decimal integers with more than 20 digits are too big for our program, anyway, and just reject any word longer than 20 digits.

    #include <ctype.h>   /* for 'isspace' */
    #include <stdio.h>
    #include <stdlib.h>

    /* See misclecture for an explanation of this macro */
    #define NELEM(x) ((int)(sizeof (x) / sizeof *(x)))

    void load_input(int *array, int len)
    {
        int i;
        printf("Enter 16 integers, please.\n");
        for (i=0; i < len; ) {
            char buffer[21], *end;
            int k;
            if (scanf("%20s", buffer) < 1) {
                /* "%s" fails only at end-of-file or on a read error. */
                puts("You didn't enter enough integers!");
                if (feof(stdin) || ferror(stdin)) exit(EXIT_FAILURE);
                printf("Please enter %d more integer%s.\n",
                    len-i, (len-i == 1)? "": "s");
                continue;
            }
            k = getchar();
            if (k != EOF && !isspace((unsigned char)k)) {
                printf("Your %d%s entry was much too long to be an integer!\n",
                    i+1, (i==0)? "st": (i==1)? "nd": (i==2)? "rd": "th");
                puts("I'm going to ignore it.");
                while ((k=getchar()) != EOF && !isspace((unsigned char)k))
                  /* eat the rest of the word */ ;
                continue;
            }
            /* Put our test character back in the input buffer. */
            ungetc(k, stdin);

            /* Try to get a valid base-10 integer. */
            array[i] = strtoul(buffer, &end, 10);
            if (*end != '\0') {
                /* The whole word is already in 'buffer'; nothing left to eat. */
                printf("Your %d%s entry didn't look like an integer!\n",
                    i+1, (i==0)? "st": (i==1)? "nd": (i==2)? "rd": "th");
                puts("I'm going to ignore it.");
                continue;
            }

            /* Success! Go on to the next array element. */
            ++i;
        }
    }

    int main()
    {
        int array[16];
        load_input(array, NELEM(array));
        /* process(array); */
        return 0;
    }

That ungetc, by the way, is kind of the "opposite" of getc: it takes the character you give it and puts that byte back into the input buffer, to be retrieved by the next call to getc (or, in our case, scanf or getchar).^*

And voilà! We have a perfect input routine!^*

Footnotes

Okay, I left out about three lines of error handling. See the Wikipedia entry for the full implementation. It's eight lines, not counting the comments.
Compare this to the integer-input routine we develop at the end of this lecture, which is many, many lines long and still doesn't have enough error-checking! That's not a valid comparison, of course, but the gist of this footnote is that gets really is very short and simple — too simple! As Albert Einstein once said, "Make everything as simple as possible, but no simpler."
The standard I/O library uses the letter "f" to mean at least three different things. In fgetc, the "f" indicates that it's a function, not a macro. In fgets, the "f" stands for "file," because fgets expects a FILE * parameter. And in scanf and printf, the "f" stands for "formatted," as in "nicely formatted output."
If your version of who has a different format, a good exercise would be to rewrite our input function so that it works for your who. For a challenge, try writing an input function that works for both versions at the same time, for example by reading in a whole line at a time and assuming that fields shaped like "21:38" are probably times, and "ajo" is probably a username, and so on.
If you skip ahead to the complete C program in this section, you'll see that I've replaced the "%49s" specifier with "%49[^) \n]", which looks ugly, but really it's very similar. The "%[^abc]" specifier tells scanf to continue reading input characters until it finds one of the characters abc, and then stop. (Removing the caret yields "%[abc]", which tells scanf to keep reading as long as the next character is one of abc.) Then the %s specifier is just a convenient shorthand notation for " %[^ \t\n]": read any initial whitespace, and then keep reading until the next whitespace character.
We need to use the "%[]" specifier because we don't want to include the final closing parenthesis in our hostname string.

Standard C requires that ungetc be able to put back at least one character. But if you try pushing back a lot of characters, ungetc may simply fail. As an exercise, try the following program on several systems and see what you observe, using longer and longer strings for text:

    #include <stdio.h>
    #include <string.h>

    int main()
    {
        char buffer[200] = "";  /* stays empty if scanf matches nothing */
        char *text = "Hello world!\n";
        int i;
        for (i = strlen(text); i > 0; --i)
          ungetc(text[i-1], stdin);
        scanf("%199[^\n]", buffer);
        puts(buffer);
        return 0;
    }

Okay, okay. Clearly it's not perfect. Try entering the number "2147483649" (that's 2³¹+1) and see how this routine treats it, for example. We could simply change the hardcoded 20 to 4, and that would prevent this particular arithmetic overflow — but the user wouldn't be able to enter the valid integer "10042" anymore, so maybe that's unwarranted hostility to good users.
If you really want a challenge, try fixing the "perfect" routine by writing your own integer input routine using getc, isdigit, and the macros INT_MAX and INT_MIN from <limits.h>.

This page was last updated 24 March 2006
All original code, images and documentation on this page are in the public domain.
