Chapter 3. File I/O

This chapter discusses unbuffered I/O, which are not part of ISO C but are part of POSIX.1 and the Single UNIX Specification.

File Descriptors

open and openat Functions


#include <fcntl.h>

int open(const char *path, int oflag, ... /* mode_t mode */ );
int openat(int fd, const char *path, int oflag, ... /* mode_t mode */ );

/* Both return: file descriptor if OK, −1 on error */

oflag argument is formed by ORing one or more of the following constants from <fcntl.h> [p62]:




openat, for example, provides a way to avoid time-of-check-to-time-of-use (TOCTTOU) errors. A program is vulnerable if it makes two file-based function calls where the second call depends on the results of the first call (two calls are not atomic).

Filename and Pathname Truncation

Most modern file systems support a maximum of 255 characters for filenames.

creat Function


#include <fcntl.h>

int creat(const char *path, mode_t mode);

/* Returns: file descriptor opened for write-only if OK, −1 on error */

This function is equivalent to:

open(path, O_WRONLY | O_CREAT | O_TRUNC, mode);

With creat, the file is opened only for writing. To read and write a file, use [p66]:

open(path, O_RDWR | O_CREAT | O_TRUNC, mode);

close Function


#include <unistd.h>

int close(int fd);

/* Returns: 0 if OK, −1 on error */

When a process terminates, all of its open files are closed automatically by the kernel. Many programs take advantage of this fact and don’t explicitly close open files.

lseek Function

Every open file has a "current file offset", normally a non-negative integer that measures the number of bytes from the beginning of the file.


#include <unistd.h>

off_t lseek(int fd, off_t offset, int whence);

/* Returns: new file offset if OK, −1 on error */

The whence argument can be:

To determine the current offset, seek zero bytes from the current position:

off_t currpos;
currpos = lseek(fd, 0, SEEK_CUR);

This technique (above code) can also be used to determine if a file is capable of seeking. If the file descriptor refers to a pipe, FIFO, or socket, lseek sets errno to ESPIPE and returns −1.

read Function


#include <unistd.h>

ssize_t read(int fd, void *buf, size_t nbytes);

/* Returns: number of bytes read, 0 if end of file, −1 on error */

Several cases in which the number of bytes actually read is less than the amount requested:

write Function


#include <unistd.h>

ssize_t write(int fd, const void *buf, size_t nbytes);

/* Returns: number of bytes written if OK, −1 on error */

The return value is usually equal to the nbytes argument; otherwise, an error has occurred.

Common causes for a write error:

For a regular file, the write operation starts at the file’s current offset. If the O_APPEND option was specified when the file was opened, the file’s offset is set to the current end of file before each write operation. After a successful write, the file’s offset is incremented by the number of bytes actually written.

I/O Efficiency

#include "apue.h"

#define BUFFSIZE 4096

    int n;
    char buf[BUFFSIZE];

    while ((n = read(STDIN_FILENO, buf, BUFFSIZE)) > 0)
    if (write(STDOUT_FILENO, buf, n) != n)
        err_sys("write error");

    if (n < 0)
        err_sys("read error");


Caveats of the above program:

Timing results for reading with different buffer sizes (BUFFSIZE) on Linux:

Figure 3.6 Timing results for reading with different buffer sizes on Linux

The file was read using the program shown above, with standard output redirected to /dev/null. The file system used for this test was the Linux ext4 file system with 4,096-byte blocks (the st_blksize value is 4,096). This accounts for the minimum in the system time occurring at the few timing measurements starting around a BUFFSIZE of 4,096. Increasing the buffer size beyond this limit has little positive effect.

Most file systems support some kind of read-ahead to improve performance. When sequential reads are detected, the system tries to read in more data than an application requests, assuming that the application will read it shortly. The effect of read-ahead can be seen in Figure 3.6, where the elapsed time for buffer sizes as small as 32 bytes is as good as the elapsed time for larger buffer sizes. [p73]

File Sharing

The UNIX System supports the sharing of open files among different processes.

Figure 3.7 Kernel data structures for open files

The kernel uses three data structures to represent an open file, and the relationships among them determine the effect one process has on another with regard to file sharing.

Figure 3.7 shows a pictorial arrangement of these three tables for a single process that has two different files open: one file is open on standard input (file descriptor 0), and the other is open on standard output (file descriptor 1).

If two independent processes have the same file open, we could have the arrangement shown in Figure 3.8 (below).

Figure 3.8 Two independent processes with the same file open

Each process that opens the file gets its own file table entry, but only a single v-node table entry is required for a given file. One reason each process gets its own file table entry is so that each process has its own current offset for the file.

Specific operations

It is possible for more than one file descriptor entry to point to the same file table entry:

File descriptor flags vs. the file status flags

Atomic Operations

Older versions of the UNIX System didn’t support the O_APPEND option if a single process wants to append to the end of a file. The program would be:

if (lseek(fd, 0L, 2) < 0) /* position to EOF, 2 means SEEK_END */
    err_sys("lseek error");
if (write(fd, buf, 100) != 100) /* and write */
    err_sys("write error");

This works fine for a single process, but problems arise if multiple processes (or multiple instances of the same program) use this technique to append to the same file. The problem here is that our logical operation of "position to the end of file and write" requires two separate function calls. The solution is to have the positioning to the current end of file and the write be an atomic operation with regard to other processes. Any operation that requires more than one function call cannot be atomic, as there is always the possibility that the kernel might temporarily suspend the process between the two function calls. The UNIX System provides an atomic way to do this operation if we set the O_APPEND flag when a file is opened. This causes the kernel to position the file to its current end of file before each write. We no longer have to call lseek before each write.

pread and pwrite Functions

The Single UNIX Specification includes two functions that allow applications to seek and perform I/O atomically:


#include <unistd.h>

ssize_t pread(int fd, void *buf, size_t nbytes, off_t offset);
/* Returns: number of bytes read, 0 if end of file, −1 on error */

ssize_t pwrite(int fd, const void *buf, size_t nbytes, off_t offset);
/* Returns: number of bytes written if OK, −1 on error */

Creating a File

Atomic operation

When both of O_CREAT and O_EXCL options are specified, the open will fail if the file already exists. The check for the existence of the file and the creation of the file was performed as an atomic operation.

Non-atomic operation

If we didn’t have this atomic operation, we might try:

if ((fd = open(path, O_WRONLY)) < 0) {
    if (errno == ENOENT) {
        if ((fd = creat(path, mode)) < 0)
            err_sys("creat error");
    } else {
        err_sys("open error");

The problem occurs if the file is created by another process between the open and the creat. If the file is created by another process between these two function calls, and if that other process writes something to the file, that data is erased when this creat is executed. Combining the test for existence and the creation into a single atomic operation avoids this problem.

Atomic operation refers to an operation that might be composed of multiple steps. If the operation is performed atomically, either all the steps are performed (on success) or none are performed (on failure). It must not be possible for only a subset of the steps to be performed.

dup and dup2 Functions

An existing file descriptor is duplicated by either of the following functions:


#include <unistd.h>

int dup(int fd);
int dup2(int fd, int fd2);

/* Both return: new file descriptor if OK, −1 on error */

Kernel data structures after dup(1):

Figure 3.9 Kernel data structures after dup(1)

In the above figure, we assume the process executes:

newfd = dup(1);

Another way to duplicate a descriptor is with the fcntl function:


is equivalent to

fcntl(fd, F_DUPFD, 0);

Similarly, the call

dup2(fd, fd2);

is equivalent to

fcntl(fd, F_DUPFD, fd2);

In this last case (above), the dup2 is not exactly the same as a close followed by an fcntl:

  1. dup2 is an atomic operation, whereas the alternate form involves two function calls. It is possible in the latter case to have a signal catcher called between the close and the fcntl that could modify the file descriptors. The same problem could occur if a different thread changes the file descriptors.
  2. There are some errno differences between dup2 and fcntl.

sync, fsync, and fdatasync Functions

Traditional implementations of the UNIX System have a buffer cache or page cache in the kernel through which most disk I/O passes.

The kernel eventually writes all the delayed-write blocks to disk, normally when it needs to reuse the buffer for some other disk block. To ensure consistency of the file system on disk with the contents of the buffer cache, the sync, fsync, and fdatasync functions are provided.


#include <unistd.h>

int fsync(int fd);
int fdatasync(int fd);

/* Returns: 0 if OK, −1 on error */

void sync(void);

fcntl Function

The fcntl function can change the properties of a file that is already open.


#include <fcntl.h>

int fcntl(int fd, int cmd, ... /* int arg */ );

/* Returns: depends on cmd if OK (see following), −1 on error */

In this section, the third argument of fcntl is always an integer, corresponding to the comment in the function prototype just shown.

The fcntl function is used for five different purposes:

  1. Duplicate an existing descriptor (cmd = F_DUPFD or F_DUPFD_CLOEXEC)
  2. Get/set file descriptor flags (cmd = F_GETFD or F_SETFD)
  3. Get/set file status flags (cmd = F_GETFL or F_SETFL)
  4. Get/set asynchronous I/O ownership (cmd = F_GETOWN or F_SETOWN)
  5. Get/set record locks (cmd = F_GETLK, F_SETLK, or F_SETLKW)

The following text discusses both the file descriptor flags associated with each file descriptor in the process table entry and the file status flags associated with each file table entry.

The return value from fcntl depends on the command. All commands return −1 on an error or some other value if OK. The following four commands have special return values:

Getting file flags


#include "apue.h"
#include <fcntl.h>

main(int argc, char *argv[])
    int val;

    if (argc != 2)
        err_quit("usage: a.out <descriptor#>");
    if ((val = fcntl(atoi(argv[1]), F_GETFL, 0)) < 0)
        err_sys("fcntl error for fd %d", atoi(argv[1]));

    switch (val & O_ACCMODE) {
    case O_RDONLY:
        printf("read only");
    case O_WRONLY:
        printf("write only");
    case O_RDWR:
        printf("read write");
        err_dump("unknown access mode");

    if (val & O_APPEND)
        printf(", append");
    if (val & O_NONBLOCK)
        printf(", nonblocking");
    if (val & O_SYNC)
        printf(", synchronous writes");

#if !defined(_POSIX_C_SOURCE) && defined(O_FSYNC) && (O_FSYNC != O_SYNC)
    if (val & O_FSYNC)
        printf(", synchronous writes");



$ ./a.out 0 < /dev/tty
read only
$ ./a.out 1 >
$ cat
write only
$ ./a.out 2 2>>
write only, append
$ ./a.out 5 5<>
read write

Modifying file flags

To modify either the file descriptor flags or the file status flags, we must be careful to fetch the existing flag value, modify it as desired, and then set the new flag value. We can’t simply issue an F_SETFD or an F_SETFL command, as this could turn off flag bits that were previously set.


#include "apue.h"
#include <fcntl.h>

set_fl(int fd, int flags) /* flags are file status flags to turn on */
    int val;
    if ((val = fcntl(fd, F_GETFL, 0)) < 0)
        err_sys("fcntl F_GETFL error");
    val |= flags; /* turn on flags */
    if (fcntl(fd, F_SETFL, val) < 0)
        err_sys("fcntl F_SETFL error");

If we change the middle statement to

val &= ~flags; /* turn flags off */

we have a function named clr_fl, logically ANDs the one’s complement of flags with the current val.

Synchronous-write flag

If we add the line


to the beginning of the program shown in I/O Efficiency section, we’ll turn on the synchronous-write flag. This causes each write to wait for the data to be written to disk before returning. Normally in the UNIX System, a write only queues the data for writing; the actual disk write operation can take place sometime later. A database system is a likely candidate for using O_SYNC, so that it knows on return from a write that the data is actually on the disk, in case of an abnormal system failure.

Linux ext4 timing results using various synchronization mechanisms [p86]

Mac OS X HFS timing results using various synchronization mechanisms [p87]

The above program operates on a descriptor (standard output), never knowing the name of the file that was opened on that descriptor. We can’t set the O_SYNC flag when the file is opened, since the shell opened the file. With fcntl, we can modify the properties of a descriptor, knowing only the descriptor for the open file.

ioctl Function


#include <unistd.h> /* System V */
#include <sys/ioctl.h> /* BSD and Linux */

int ioctl(int fd, int request, ...);

/* Returns: −1 on error, something else if OK */

The ioctl function has always been the catchall for I/O operations. Anything that couldn’t be expressed using one of the other functions in this chapter usually ended up being specified with an ioctl. Terminal I/O was the biggest user of this function.

For the ISO C prototype, an ellipsis is used for the remaining arguments. Normally, however, there is only one more argument, and it’s usually a pointer to a variable or a structure.

Each device driver can define its own set of ioctl commands. The system, however, provides generic ioctl commands for different classes of devices.

We use the ioctl function in Section 18.12 to fetch and set the size of a terminal’s window, and in Section 19.7 when we access the advanced features of pseudo terminals.


Newer systems provide a directory named /dev/fd whose entries are files named 0, 1, 2, and so on. Opening the file /dev/fd/n is equivalent to duplicating descriptor n, assuming that descriptor n is open. /dev/fd is not part of POSIX.1.

The following are equivalent:

fd = open("/dev/fd/0", mode);
fd = dup(0);

Most systems ignore the specified mode, whereas others require that it be a subset of the mode used when the referenced file (standard input, in this case) was originally opened. The descriptors 0 and fd share the same file table entry.

The Linux implementation of /dev/fd is an exception. It maps file descriptors into symbolic links pointing to the underlying physical files. When you open /dev/fd/0, for example, you are really opening the file associated with your standard input. Thus the mode of the new file descriptor returned is unrelated to the mode of the /dev/fd file descriptor.

We can also call creat with a /dev/fd pathname argument as well as specify O_CREAT in a call to open. This allows a program that calls creat to still work if the pathname argument is /dev/fd/1, for example.

Some systems provide the pathnames /dev/stdin, /dev/stdout, and /dev/stderr. These pathnames are equivalent to /dev/fd/0, /dev/fd/1, and /dev/fd/2, respectively.

The main use of the /dev/fd files is from the shell. It allows programs that use pathname arguments to handle standard input and standard output in the same manner as other pathnames.

The following are equivalent:

filter file2 | cat file1 - file3 | lpr
filter file2 | cat file1 /dev/fd/0 file3 | lpr