The HyperNews Linux KHG Discussion Pages

Device Driver Basics

We will assume that you decide that you do not wish to write a user-space device, and would rather implement your device in the kernel. You will probably be writing two files, a .c file and a .h file, and possibly modifying other files as well, as will be described below. We will refer to your files as foo.c and foo.h, and your driver will be the foo driver.

Namespace

One of the first things you will need to do, before writing any code, is to name your device. This name should be a short (probably two or three character) string. For instance, the parallel device is the ``lp'' device, the floppies are the ``fd'' devices, and SCSI disks are the ``sd'' devices. As you write your driver, you will give your functions names prefixed with your chosen string to avoid any namespace confusion. We will call your prefix foo, and give your functions names like foo_read(), foo_write(), etc.

Allocating memory

Memory allocation in the kernel is a little different from memory allocation in normal user-level programs. Instead of having a malloc() capable of delivering almost unlimited amounts of memory, there is a kmalloc() function that is a bit different.

To free memory allocated with kmalloc(), use one of two functions: kfree() or kfree_s(). These differ from free() in a few ways as well.

See Supporting Functions for more information on kmalloc(), kfree(), and other useful functions.

Be gentle when you use kmalloc(). Use only what you have to. Remember that kernel memory is unswappable, and thus allocating extra memory is a far worse thing to do in the kernel than in a user-level program. Take only what you need, and free it when you are done, unless you are going to use it right away again.

Character vs. block devices

There are two main types of devices under all Unix systems, character and block devices. Character devices are those for which no buffering is performed, and block devices are those which are accessed through a cache. Block devices must be random access, but character devices are not required to be, though some are. Filesystems can only be mounted if they are on block devices.

Character devices are read from and written to with two functions: foo_read() and foo_write(). The read() and write() calls do not return until the operation is complete. By contrast, block devices do not even implement the read() and write() functions, and instead have a function which has historically been called the ``strategy routine.'' Reads and writes are done through the buffer cache mechanism by the generic functions bread(), breada(), and bwrite(). These functions go through the buffer cache, and so may or may not actually call the strategy routine, depending on whether or not the block requested is in the buffer cache (for reads) or on whether or not the buffer cache is full (for writes). A request may be asynchronous: breada() can request the strategy routine to schedule reads that have not been asked for, and to do it asynchronously, in the background, in the hopes that they will be needed later.
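The relationship between the buffer cache and the strategy routine can be illustrated with a small user-space model. This is only a sketch, not kernel code: toy_bread(), toy_strategy(), the cache layout, and the block contents are all invented for illustration, and the real bread() takes different arguments. It shows the one property described above, namely that a read only reaches the strategy routine when the block is not already cached.

```c
#include <assert.h>
#include <string.h>

#define TOY_NBLOCKS   8     /* blocks on the pretend device (made-up value) */
#define TOY_BLOCKSIZE 16    /* bytes per block (made-up value) */

/* One cache slot per block; a real buffer cache is much smaller than the device. */
struct toy_buffer {
    int  valid;                      /* nonzero once the block has been read */
    char data[TOY_BLOCKSIZE];
};

static struct toy_buffer toy_cache[TOY_NBLOCKS];
static int toy_strategy_calls;       /* how often the "hardware" was touched */

/* Stand-in for a driver's strategy routine: fills a buffer from the device. */
static void toy_strategy(int block, struct toy_buffer *bh)
{
    assert(block >= 0 && block < TOY_NBLOCKS);
    memset(bh->data, 'A' + block, TOY_BLOCKSIZE);
    bh->valid = 1;
    toy_strategy_calls++;
}

/* Stand-in for bread(): calls the strategy routine only on a cache miss. */
static struct toy_buffer *toy_bread(int block)
{
    struct toy_buffer *bh;

    if (block < 0 || block >= TOY_NBLOCKS)
        return NULL;
    bh = &toy_cache[block];
    if (!bh->valid)
        toy_strategy(block, bh);
    return bh;
}
```

Reading the same block twice through toy_bread() increments toy_strategy_calls only once, which is the behaviour a block driver writer has to keep in mind: the strategy routine sees cache misses, not every read.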

The sources for character devices are kept in drivers/char/, and the sources for block devices are kept in drivers/block/. They have similar interfaces, and are very much alike, except for reading and writing. Because of the difference in reading and writing, initialization is different, as block devices have to register a strategy routine, which is registered in a different way than the foo_read() and foo_write() routines of a character device driver. Specifics are dealt with in Character Device Initialization and Block Device Initialization.

Interrupts vs. Polling

Hardware is slow. That is, in the time it takes to get information from your average device, the CPU could be off doing something far more useful than waiting for a busy but slow device. So to keep from having to busy-wait all the time, interrupts are provided which can interrupt whatever is happening so that the operating system can do some task and return to what it was doing without losing information. In an ideal world, all devices would probably work by using interrupts. However, on a PC or clone, there are only a few interrupts available for use by your peripherals, so some drivers have to poll the hardware: ask the hardware if it is ready to transfer data yet. This unfortunately wastes time, but it sometimes needs to be done.

Some hardware (like memory-mapped displays) is as fast as the rest of the machine, and does not generate output asynchronously, so an interrupt-driven driver would be rather silly, even if interrupts were provided.

In Linux, many of the drivers are interrupt-driven, but some are not, and at least one can be either, and can be switched back and forth at runtime. For instance, the lp device (the parallel port driver) normally polls the printer to see if the printer is ready to accept output, and if the printer stays in a not-ready state for too long, the driver will sleep for a while, and try again later. This improves system performance. However, if you have a parallel card that supplies an interrupt, the driver will utilize that, which will usually make performance even better.
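The poll-then-sleep pattern described for the lp driver can be sketched in user space as follows. Everything here is hypothetical: struct toy_printer, toy_ready(), toy_wait_ready(), and the TOY_SPIN_LIMIT value are invented for this illustration and are not taken from the lp driver source. The "sleep" is only counted, where a real driver would give up the CPU for a while.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SPIN_LIMIT 4   /* polls to try before giving up the CPU (made-up value) */

struct toy_printer {
    int polls_until_ready;   /* pretend hardware: ready after this many polls */
    int spins;               /* unsuccessful busy-wait polls performed */
    int sleeps;              /* times the driver slept instead of spinning */
};

/* Ask the pretend hardware whether it can accept a character yet. */
static int toy_ready(struct toy_printer *p)
{
    if (p->polls_until_ready > 0) {
        p->polls_until_ready--;
        return 0;
    }
    return 1;
}

/* Poll a few times; if the device is still not ready, "sleep" and start over. */
static void toy_wait_ready(struct toy_printer *p)
{
    assert(p != NULL);
    for (;;) {
        int i;

        for (i = 0; i < TOY_SPIN_LIMIT; i++) {
            if (toy_ready(p))
                return;
            p->spins++;
        }
        p->sleeps++;   /* a real driver would sleep here and try again later */
    }
}
```

Bounding the inner loop is the design point: a short burst of polling catches a device that becomes ready quickly, while the sleep keeps a slow device from monopolizing the CPU.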

There are some important programming differences between interrupt-driven drivers and polling drivers. To understand this difference, you have to understand a little bit of how system calls work under Unix. The kernel is not a separate task under Unix. Rather, it is as if each process has a copy of the kernel. When a process executes a system call, it does not transfer control to another process, but rather, the process changes execution modes, and is said to be ``in kernel mode.'' In this mode, it executes kernel code which is trusted to be safe.

In kernel mode, the process can still access the user-space memory that it was previously executing in, which is done through a set of macros: get_fs_*() and memcpy_fromfs() read user-space memory, and put_fs_*() and memcpy_tofs() write to user-space memory. Because the process is still running, but in a different mode, there is no question of where in memory to put the data, or where to get it from. However, when an interrupt occurs, any process might currently be running, so these macros cannot be used--if they are, they will either write over random memory space of the running process or cause the kernel to panic.

Instead, when scheduling the interrupt, a driver must also provide temporary space in which to put the information, and then sleep. When the interrupt-driven part of the driver has filled up that temporary space, it wakes up the process, which copies the information from that temporary space into the process' user space and returns. In a block device driver, this temporary space is automatically provided by the buffer cache mechanism, but in a character device driver, the driver is responsible for allocating it itself.
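As a rough illustration of that division of labour, the following user-space sketch keeps a small driver-owned queue. The interrupt side only ever touches the queue, and the process side is the only code that copies into the caller's buffer. The names toy_queue, toy_interrupt(), and toy_read() and the queue size are assumptions made for this example; sleeping and waking are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QSIZE 8   /* capacity of the driver-owned queue (made-up value) */

struct toy_queue {
    char   data[TOY_QSIZE];
    size_t head;    /* next slot the interrupt side writes */
    size_t tail;    /* next slot the reader consumes */
    size_t count;   /* bytes currently queued */
};

/* Interrupt side: store one byte in driver-owned space, never in user space.
 * Returns 0 on success, -1 if the queue is full and the byte was dropped. */
static int toy_interrupt(struct toy_queue *q, char c)
{
    assert(q != NULL);
    if (q->count == TOY_QSIZE)
        return -1;
    q->data[q->head] = c;
    q->head = (q->head + 1) % TOY_QSIZE;
    q->count++;
    return 0;
}

/* Process side: runs on behalf of the caller, so copying into buf is safe.
 * Returns the number of bytes copied, which is 0 if nothing is queued. */
static int toy_read(struct toy_queue *q, char *buf, int count)
{
    int n = 0;

    assert(q != NULL && buf != NULL);
    while (n < count && q->count > 0) {
        buf[n++] = q->data[q->tail];   /* the kernel would use a put_fs_*() macro */
        q->tail = (q->tail + 1) % TOY_QSIZE;
        q->count--;
    }
    return n;
}
```

In a real driver toy_read() would sleep when the queue is empty and the interrupt handler would wake it up; here it simply returns 0 so the example stays single-threaded.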

The sleep-wakeup mechanism

[Begin by giving a general description of how sleeping is used and what it does. This should mention things like all processes sleeping on an event are woken at once, and then they contend for the event again, etc...]

Perhaps the best way to try to understand the Linux sleep-wakeup mechanism is to read the source for the __sleep_on() function, used to implement both the sleep_on() and interruptible_sleep_on() calls.

static inline void __sleep_on(struct wait_queue **p, int state)
{
    unsigned long flags;
    struct wait_queue wait = { current, NULL };

    if (!p)
        return;
    if (current == task[0])
        panic("task[0] trying to sleep");
    current->state = state;
    add_wait_queue(p, &wait);
    save_flags(flags);
    sti();
    schedule();
    remove_wait_queue(p, &wait);
    restore_flags(flags);
}

A wait_queue is a circular list of pointers to task structures, defined in <linux/wait.h> to be

struct wait_queue {
    struct task_struct * task;
    struct wait_queue * next;
};

state is either TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE, depending on whether or not the sleep should be interruptible by such things as signals. In general, the sleep should be interruptible if the device is a slow one, that is, one which can block indefinitely, including terminals and network devices or pseudodevices.

add_wait_queue() turns off interrupts, if they were enabled, and adds the new struct wait_queue declared at the beginning of the function to the list p. It then recovers the original interrupt state (enabled or disabled), and returns.

save_flags() is a macro which saves the processor flags in its argument. This is done to preserve the previous state of the interrupt enable flag. This way, the restore_flags() later can restore the interrupt state, whether it was enabled or disabled. sti() then allows interrupts to occur, and schedule() finds a new process to run, and switches to it. schedule() will not choose this process to run again until the state is changed to TASK_RUNNING by wake_up() called on the same wait queue, p, or conceivably by something else.

The process then removes itself from the wait_queue, restores the original interrupt condition with restore_flags(), and returns.

Whenever contention for a resource might occur, there needs to be a pointer to a wait_queue associated with that resource. Then, whenever contention does occur, each process that finds itself locked out of access to the resource sleeps on that resource's wait_queue. When any process is finished using a resource for which there is a wait_queue, it should wake up any processes that might be sleeping on that wait_queue, probably by calling wake_up(), or possibly wake_up_interruptible().
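The bookkeeping behind this can be modelled in user space without any real scheduling. The sketch below is an assumption-laden toy: toy_task, toy_wait, toy_sleep_on(), and toy_wake_up() are invented names, the list is a plain singly linked list rather than the kernel's circular one, and "sleeping" is just a state change. It shows only that each sleeper links itself onto the resource's queue and that a wake-up marks every sleeper runnable at once.

```c
#include <assert.h>
#include <stddef.h>

enum toy_state { TOY_RUNNING, TOY_INTERRUPTIBLE, TOY_UNINTERRUPTIBLE };

struct toy_task {
    enum toy_state state;
};

struct toy_wait {
    struct toy_task *task;
    struct toy_wait *next;
};

/* First half of sleeping: mark the task asleep and link it onto the queue.
 * A real kernel would now call schedule() to run something else. */
static void toy_sleep_on(struct toy_wait **q, struct toy_wait *w,
                         struct toy_task *t, enum toy_state state)
{
    assert(q != NULL && w != NULL && t != NULL);
    t->state = state;
    w->task = t;
    w->next = *q;
    *q = w;
}

/* Wake every sleeper on the queue; returns how many tasks were woken. */
static int toy_wake_up(struct toy_wait **q)
{
    int woken = 0;
    struct toy_wait *w;

    assert(q != NULL);
    for (w = *q; w != NULL; w = w->next) {
        if (w->task->state != TOY_RUNNING) {
            w->task->state = TOY_RUNNING;
            woken++;
        }
    }
    *q = NULL;   /* simplification: real sleepers unlink themselves on waking */
    return woken;
}
```

Because every sleeper is woken, each one has to re-check whether the resource is actually free once it runs again, and go back to sleep if another process got there first.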

If you don't understand why a process might want to sleep, or want more details on when and how to structure this sleeping, I urge you to buy one of the operating systems textbooks listed in the Annotated Bibliography and look up mutual exclusion and deadlock.

More advanced sleeping

If the sleep_on()/wake_up() mechanism in Linux does not satisfy your device driver needs, you can code your own versions of sleep_on() and wake_up() that fit your needs. For an example of this, look at the serial device driver (drivers/char/serial.c) in function block_til_ready(), where quite a bit has to be done between the add_wait_queue() and the schedule().

The VFS

The Virtual Filesystem Switch, or VFS, is the mechanism which allows Linux to mount many different filesystems at the same time. In the first versions of Linux, all filesystem access went straight into routines which understood the minix filesystem. To make it possible for other filesystems to be written, filesystem calls had to pass through a layer of indirection which would switch the call to the routine for the correct filesystem. This was done by some generic code which can handle generic cases and a structure of pointers to functions which handle specific cases. One structure is of interest to the device driver writer: the file_operations structure.

From /usr/include/linux/fs.h:

struct file_operations {
    int  (*lseek)   (struct inode *, struct file *, off_t, int);
    int  (*read)    (struct inode *, struct file *, char *, int);
    int  (*write)   (struct inode *, struct file *, char *, int);
    int  (*readdir) (struct inode *, struct file *, struct dirent *, int count);
    int  (*select)  (struct inode *, struct file *, int, select_table *);
    int  (*ioctl)   (struct inode *, struct file *, unsigned int, unsigned int);
    int  (*mmap)    (struct inode *, struct file *, unsigned long, size_t, int, unsigned long);
    int  (*open)    (struct inode *, struct file *);
    void (*release) (struct inode *, struct file *);
};

Essentially, this structure constitutes a partial list of the functions that you may have to write to create your driver.

This section details the actions and requirements of the functions in the file_operations structure. It documents all the arguments that these functions take. [It should also detail all the defaults, and cover more carefully the possible return values.]

The lseek() function

This function is called when the system call lseek() is called on the device special file representing your device. An understanding of what the system call lseek() does should be sufficient to explain this function, which moves to the desired offset. It takes these four arguments:

struct inode * inode
Pointer to the inode structure for this device.
struct file * file
Pointer to the file structure for this device.
off_t offset
Offset from origin to move to.
int origin
0 = take the offset from absolute offset 0 (the beginning).
1 = take the offset from the current position.
2 = take the offset from the end.

lseek() returns -errno on error, or the absolute position (>= 0) after the lseek.

If there is no lseek(), the kernel will take the default action, which is to modify the file->f_pos element. For an origin of 2, the default action is to return -EINVAL if file->f_inode is NULL, otherwise it sets file->f_pos to file->f_inode->i_size + offset. Because of this, if lseek() should return an error for your device, you must write an lseek() function which returns that error.
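The default action described above can be written out as a small user-space function. This is a sketch under stated assumptions rather than the kernel's own code: toy_inode, toy_file, and toy_default_lseek() are invented stand-ins, and the handling of an unknown origin and of a negative resulting position (both rejected with -EINVAL here) is an assumption that the text does not spell out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Minimal stand-ins for the two kernel structures the default action uses. */
struct toy_inode {
    long i_size;
};

struct toy_file {
    long              f_pos;
    struct toy_inode *f_inode;
};

/* Returns the new absolute position, or -EINVAL on error. */
static long toy_default_lseek(struct toy_file *file, long offset, int origin)
{
    long pos;

    assert(file != NULL);
    switch (origin) {
    case 0:                      /* from the beginning */
        pos = offset;
        break;
    case 1:                      /* from the current position */
        pos = file->f_pos + offset;
        break;
    case 2:                      /* from the end: needs the inode's size */
        if (file->f_inode == NULL)
            return -EINVAL;
        pos = file->f_inode->i_size + offset;
        break;
    default:                     /* assumption: unknown origins are rejected */
        return -EINVAL;
    }
    if (pos < 0)                 /* assumption: negative positions are rejected */
        return -EINVAL;
    file->f_pos = pos;
    return pos;
}
```

Note that nothing in this default knows whether seeking makes sense for the hardware, which is why a device that cannot seek needs its own lseek() that returns an error.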

The read() and write() functions

The read and write functions read and write a character string to the device. If there is no read() or write() function in the file_operations structure registered with the kernel, and the device is a character device, read() or write() system calls, respectively, will return -EINVAL. If the device is a block device, these functions should not be implemented, as the VFS will route requests through the buffer cache, which will call your strategy routine. The read and write functions take these arguments:

struct inode * inode
This is a pointer to the inode of the device special file which was accessed. From this, you can do several things, based on the struct inode declaration about 100 lines into /usr/include/linux/fs.h. For instance, you can find the minor number of the file by this construction: unsigned int minor = MINOR(inode->i_rdev); The definition of the MINOR macro is in <linux/fs.h>, as are many other useful definitions. Read fs.h and a few device drivers for more details, and see Supporting Functions for a short description. inode->i_mode can be used to find the mode of the file, and there are macros available for this, as well.
struct file * file
Pointer to file structure for this device.
char * buf
This is a buffer of characters to read or write. It is located in user-space memory, and therefore must be accessed using the get_fs*(), put_fs*(), and memcpy*fs() macros detailed in Supporting Functions. User-space memory is inaccessible during an interrupt, so if your driver is interrupt driven, you will have to copy the contents of your buffer into a queue.
int count
This is a count of characters in buf to be read or written. It is the size of buf, and is how you know that you have reached the end of buf, as buf is not guaranteed to be null-terminated.
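How an empty slot turns into -EINVAL can be seen in a reduced user-space model of the dispatch. The structure below is deliberately cut down: toy_file_operations has only two slots and drops the inode and file arguments, and toy_sys_read() and toy_sys_write() are invented stand-ins for the system call entry points, so none of this matches the real kernel signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Reduced model of file_operations: two slots, no inode or file arguments. */
struct toy_file_operations {
    int (*read)(char *buf, int count);
    int (*write)(const char *buf, int count);
};

static int foo_written;   /* bytes the pretend foo device has accepted */

/* A hypothetical foo driver that implements write() but not read(). */
static int foo_write(const char *buf, int count)
{
    assert(buf != NULL && count >= 0);
    foo_written += count;   /* pretend every byte reached the device */
    return count;
}

static struct toy_file_operations foo_fops = {
    NULL,        /* read: not implemented */
    foo_write,   /* write */
};

/* Stand-in for the read() system call: an empty slot yields -EINVAL. */
static int toy_sys_read(struct toy_file_operations *fops, char *buf, int count)
{
    if (fops == NULL || fops->read == NULL)
        return -EINVAL;
    return fops->read(buf, count);
}

/* Stand-in for the write() system call, with the same default. */
static int toy_sys_write(struct toy_file_operations *fops,
                         const char *buf, int count)
{
    if (fops == NULL || fops->write == NULL)
        return -EINVAL;
    return fops->write(buf, count);
}
```

A write-only device therefore needs no read() stub at all; leaving the slot NULL gives callers a well-defined error.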

The readdir() function

This function is another artifact of file_operations being used for implementing filesystems as well as device drivers. Do not implement it. The kernel will return -ENOTDIR if the system call readdir() is called on your device special file.

The select() function

The select() function is generally most useful with character devices. It is usually used to multiplex reads without polling--the application calls the select() system call, giving it a list of file descriptors to watch, and the kernel reports back to the program on which file descriptor has woken it up. It is also used as a timer. However, the select() function in your device driver is not directly called by the system call select(), and so the file_operations select() only needs to do a few things. Its arguments are:

struct inode * inode
Pointer to the inode structure for this device.
struct file * file
Pointer to the file structure for this device.
int sel_type
The select type to perform:
SEL_IN    read
SEL_OUT   write
SEL_EX    exception
select_table * wait
If wait is not NULL and there is no error condition caused by the select, select() should put the process to sleep, and arrange to be woken up when the device becomes ready, usually through an interrupt. If wait is NULL, then the driver should quickly see if the device is ready, and return even if it is not. The select_wait() function does this already.

If the calling program wants to wait until one of the devices upon which it is selecting becomes available for the operation it is interested in, the process will have to be put to sleep until one of those operations becomes available. This does not require use of a sleep_on*() function, however. Instead the select_wait() function is used. (See Supporting Functions for the definition of the select_wait() function). The sleep state that select_wait() will cause is the same as that of interruptible_sleep_on(), and, in fact, wake_up_interruptible() is used to wake up the process.

However, select_wait() will not make the process go to sleep right away. It returns directly, and the select() function you wrote should then return. The process isn't put to sleep until the system call sys_select(), which originally called your select() function, uses the information given to it by the select_wait() function to put the process to sleep. select_wait() adds the process to the wait queue, but do_select() (called from sys_select()) actually puts the process to sleep by changing the process state to TASK_INTERRUPTIBLE and calling schedule().

The first argument to select_wait() is the same wait_queue that should be used for a sleep_on(), and the second is the select_table that was passed to your select() function.

After having explained all this in excruciating detail, here are two rules to follow:

  1. Call select_wait() if the device is not ready, and return 0.
  2. Return 1 if the device is ready.
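The two rules can be sketched as a user-space model of a driver's select() function. The types and helpers here are assumptions made for illustration: toy_select_table, toy_select_wait(), toy_device, and the TOY_SEL_* values are invented, and toy_select_wait() merely records which wait queue was registered instead of touching any real queue. Like the real select_wait() as described above, it returns immediately and does nothing when the table pointer is NULL.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SEL_IN    1   /* made-up values, not the kernel's SEL_* constants */
#define TOY_SEL_OUT   2
#define TOY_MAX_WAITS 4

struct toy_wait_queue;   /* opaque: only pointers to it are used here */

/* Records which wait queues the caller should be woken from. */
struct toy_select_table {
    struct toy_wait_queue **entry[TOY_MAX_WAITS];
    int nr;
};

struct toy_device {
    int input_ready;
    int output_ready;
    struct toy_wait_queue *read_wait;
    struct toy_wait_queue *write_wait;
};

/* Stand-in for select_wait(): registers the queue and returns at once. */
static void toy_select_wait(struct toy_wait_queue **q,
                            struct toy_select_table *t)
{
    if (q == NULL || t == NULL)
        return;              /* quick check only: nothing to register */
    assert(t->nr < TOY_MAX_WAITS);
    t->entry[t->nr++] = q;
}

/* Rule 2: return 1 if ready. Rule 1: otherwise register and return 0. */
static int foo_select(struct toy_device *dev, int sel_type,
                      struct toy_select_table *wait)
{
    assert(dev != NULL);
    switch (sel_type) {
    case TOY_SEL_IN:
        if (dev->input_ready)
            return 1;
        toy_select_wait(&dev->read_wait, wait);
        return 0;
    case TOY_SEL_OUT:
        if (dev->output_ready)
            return 1;
        toy_select_wait(&dev->write_wait, wait);
        return 0;
    default:
        return 0;            /* this toy device has no exceptional conditions */
    }
}
```

The function never sleeps itself; all it contributes is the answer "ready or not" and, when not ready, the queue on which the caller should later be woken.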

If you provide a select() function, do not provide timeouts by setting current->timeout, as the select() mechanism uses current->timeout, and the two methods cannot co-exist, as there is only one timeout for each process. Instead, consider using a timer to provide timeouts. See the description of the add_timer() function in Supporting Functions for details.

The ioctl() function

The ioctl() function processes ioctl calls. The structure of your ioctl() function will be: first error checking, then one giant (possibly nested) switch statement to handle all possible ioctls. The ioctl number is passed as cmd, and the argument to the ioctl is passed as arg. It is good to have an understanding of how ioctls ought to work before making them up. If you are not sure about your ioctls, do not feel ashamed to ask someone knowledgeable about it, for a few reasons: you may not even need an ioctl for your purpose, and if you do need an ioctl, there may be a better way to do it than what you have thought of. Since ioctls are the least regular part of the device interface, it takes perhaps the most work to get this part right. Take the time and energy you need to get it right.

The first thing you need to do is look in Documentation/ioctl-number.txt, read it, and pick an unused number. Then go from there.

struct inode * inode
Pointer to the inode structure for this device.
struct file * file
Pointer to the file structure for this device.
unsigned int cmd
This is the ioctl command. It is generally used as the switch variable for a case statement.
unsigned int arg
This is the argument to the command. This is user defined. Since this is the same size as a (void *), this can be used as a pointer to user space, accessed through the fs register as usual.
Returns:
-errno on error
Every other return is user-defined.

If the ioctl() slot in the file_operations structure is not filled in, the VFS will return -EINVAL. However, in all cases, if cmd is one of FIOCLEX, FIONCLEX, FIONBIO, or FIOASYNC, default processing will be done:
FIOCLEX (0x5451)
Sets the close-on-exec bit.
FIONCLEX (0x5450)
Clears the close-on-exec bit.
FIONBIO (0x5421)
If arg is non-zero, set O_NONBLOCK, otherwise clear O_NONBLOCK.
FIOASYNC (0x5452)
If arg is non-zero, set O_SYNC, otherwise clear O_SYNC. O_SYNC is not yet implemented, but it is documented here and parsed in the kernel for completeness.

Note that you have to avoid these four numbers when creating your own ioctls, since if they conflict, the VFS ioctl code will interpret them as being one of these four, and act appropriately, causing a very hard-to-track-down bug.
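The overall shape of an ioctl() function, and the check against the four reserved numbers, might look like the following user-space sketch. The FOO_* commands, their numbers, the speed limits, and the choice of -EINVAL for an unknown command are all assumptions made for this example, and the real function also receives inode and file pointers. The TOY_FIO* constants simply repeat the values quoted above.

```c
#include <assert.h>
#include <errno.h>

/* The four values handled by the VFS itself, as quoted in the text. */
#define TOY_FIOCLEX   0x5451
#define TOY_FIONCLEX  0x5450
#define TOY_FIONBIO   0x5421
#define TOY_FIOASYNC  0x5452

/* Hypothetical commands for the foo driver; the numbers are made up. */
#define FOO_GET_SPEED 0x6601
#define FOO_SET_SPEED 0x6602
#define FOO_MAX_SPEED 9600

static unsigned int foo_speed = 1200;   /* pretend device setting */

/* Returns nonzero if cmd collides with one of the four reserved numbers. */
static int toy_is_reserved(unsigned int cmd)
{
    return cmd == TOY_FIOCLEX || cmd == TOY_FIONCLEX ||
           cmd == TOY_FIONBIO || cmd == TOY_FIOASYNC;
}

/* Error checking inside each case, then one switch over the command. */
static int foo_ioctl(unsigned int cmd, unsigned int arg)
{
    /* The driver's own commands must never shadow the reserved ones. */
    assert(!toy_is_reserved(FOO_GET_SPEED));
    assert(!toy_is_reserved(FOO_SET_SPEED));

    switch (cmd) {
    case FOO_GET_SPEED:
        return (int)foo_speed;
    case FOO_SET_SPEED:
        if (arg == 0 || arg > FOO_MAX_SPEED)
            return -EINVAL;
        foo_speed = arg;
        return 0;
    default:
        return -EINVAL;   /* assumption: unknown commands are rejected */
    }
}
```

Keeping the command numbers in the driver's header file, next to a check like toy_is_reserved(), makes an accidental collision visible at development time rather than as a mysterious runtime bug.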

The mmap() function

struct inode * inode
Pointer to inode structure for device.
struct file * file
Pointer to file structure for device.
unsigned long addr
Beginning of address in main memory to mmap() into.
size_t len
Length of memory to mmap().
int prot
One of:
PROT_READ    region can be read.
PROT_WRITE   region can be written.
PROT_EXEC    region can be executed.
PROT_NONE    region cannot be accessed.
unsigned long off
Offset in the file to mmap() from. This address in the file will be mapped to address addr.

The open() and release() functions

struct inode * inode
Pointer to inode structure for device.
struct file * file
Pointer to file structure for device.

open() is called when a device special file is opened. It is the policy mechanism responsible for ensuring consistency. If only one process is allowed to open the device at once, open() should lock the device, using whatever locking mechanism is appropriate, usually setting a bit in some state variable to mark it as busy. If a process already is using the device (if the busy bit is already set) then open() should return -EBUSY. If more than one process may open the device, this function is responsible for setting up any necessary queues that would not be set up in write(). If no such device exists, open() should return -ENODEV to indicate this. Return 0 on success.

release() is called only when the process closes its last open file descriptor on the file. [I am not sure this is true; it might be called on every close.] If devices have been marked as busy, release() should unset the busy bits if appropriate. If you need to clean up kmalloc()'ed queues or reset devices to preserve their sanity, this is the place to do it. If no release() function is defined, none is called.
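For a device that only one process may open at a time, the busy-bit policy can be sketched in user space like this. The sketch is simplified on purpose: foo_open() and foo_release() take a bare minor number instead of the inode and file pointers, FOO_NDEVS is a made-up device count, and a plain int array stands in for whatever state variable a real driver would use.

```c
#include <assert.h>
#include <errno.h>

#define FOO_NDEVS 2   /* number of pretend foo devices (made-up value) */

static int foo_busy[FOO_NDEVS];   /* nonzero while a device is open */

/* Returns 0 on success, -ENODEV for a missing device, -EBUSY if in use. */
static int foo_open(int minor)
{
    if (minor < 0 || minor >= FOO_NDEVS)
        return -ENODEV;
    if (foo_busy[minor])
        return -EBUSY;
    foo_busy[minor] = 1;
    return 0;
}

/* Clears the busy mark so the device can be opened again. */
static void foo_release(int minor)
{
    assert(minor >= 0 && minor < FOO_NDEVS);
    foo_busy[minor] = 0;
}
```

The existence check comes before the busy check so that a caller asking for a device that is not there gets -ENODEV rather than a misleading -EBUSY.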

The init() function

This function is not actually included in the file_operations structure, but you are required to implement it, because it is this function that registers the file_operations structure with the VFS in the first place--without this function, the VFS could not route any requests to the driver. This function is called when the kernel first boots and is configuring itself. The init function then detects all devices. You will have to call your init() function from the correct place: for a character device, this is chr_dev_init() in drivers/char/mem.c.

While the init() function runs, it registers your driver by calling the proper registration function. For character devices, this is register_chrdev(). (See Supporting Functions for more information on the registration functions.) register_chrdev() takes three arguments: the major device number (an int), the ``name'' of the device (a string), and the address of the device_fops file_operations structure.
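What registration amounts to can be modelled in user space as filling a slot in a table indexed by major number. This is a toy under stated assumptions, not the kernel's register_chrdev(): toy_register_chrdev(), the table size, the FOO_MAJOR value, the cut-down toy_file_operations, and the particular error codes returned for a bad or already used major number are all invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_CHRDEV 32   /* size of the pretend device table (made-up) */
#define FOO_MAJOR      30   /* hypothetical major number for the foo driver */

/* Cut-down file_operations with a single slot. */
struct toy_file_operations {
    int (*open)(void);
};

struct toy_chrdev {
    const char                 *name;
    struct toy_file_operations *fops;
};

static struct toy_chrdev toy_chrdevs[TOY_MAX_CHRDEV];

/* Takes the same three kinds of argument as described above:
 * a major number, a name, and a pointer to the operations table. */
static int toy_register_chrdev(unsigned int major, const char *name,
                               struct toy_file_operations *fops)
{
    assert(name != NULL && fops != NULL);
    if (major >= TOY_MAX_CHRDEV)
        return -EINVAL;               /* assumption: out-of-range major */
    if (toy_chrdevs[major].fops != NULL && toy_chrdevs[major].fops != fops)
        return -EBUSY;                /* assumption: slot already taken */
    toy_chrdevs[major].name = name;
    toy_chrdevs[major].fops = fops;
    return 0;
}

static int foo_open(void)
{
    return 0;
}

static struct toy_file_operations foo_fops = { foo_open };

/* The driver's init function: its essential job is the registration call. */
static int foo_init(void)
{
    return toy_register_chrdev(FOO_MAJOR, "foo", &foo_fops);
}
```

Once the slot is filled, a lookup by major number is all that is needed to find the driver's functions, which is the routing the VFS performs on every access to the device special file.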

When this is done, and a character or block special file is accessed, the VFS filesystem switch automagically routes the call, whatever it is, to the proper function, if a function exists. If the function does not exist, the VFS routines take some default action.

The init() function usually displays some information about the driver, and usually reports all hardware found. All reporting is done via the printk() function.

Copyright (C) 1992, 1993, 1994, 1996 Michael K. Johnson, johnsonm@redhat.com.

