Rather than update this pitiful document, I encourage you to purchase the definitive work on the subject: Linux Device Drivers by Alessandro Rubini, published by O'Reilly, ISBN 1-56592-292-1. It's a great introduction to the subject, and is not expensive.
If there is something that you need to do that isn't covered in this introductory tutorial, and which has been overlooked in the KHG, the next option is to look through other device drivers to see how they handle the problem. Chances are good that you are not the first person to encounter that particular problem. It is also likely that if you put some time into looking around and can't figure out what to do, you can find help on the linux-kernel mailing list or on the comp.os.linux.development.system Usenet group.
struct file_operations {
int (*lseek) (struct inode *, struct file *, off_t, int);
int (*read) (struct inode *, struct file *, char *, int);
int (*write) (struct inode *, struct file *, char *, int);
int (*readdir) (struct inode *, struct file *, struct dirent *, int);
int (*select) (struct inode *, struct file *, int, select_table *);
int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
int (*mmap) (struct inode *, struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
void (*release) (struct inode *, struct file *);
int (*fsync) (struct inode *, struct file *);
int (*fasync) (struct inode *, struct file *, int);
int (*check_media_change) (dev_t dev);
int (*revalidate) (dev_t dev);
};
Some of the names of these function pointers should look suspiciously
like system calls with which you are familiar. lseek(),
read(), write(), readdir(), select(),
ioctl(), mmap(), open(), and fsync() all are
called directly or indirectly by the system calls of the same name.
release() is called on close() and when a file is closed by a
process exiting, or calling exec() when close-on-exec is set on the
file. check_media_change() and revalidate() are not really
file operations; they are device operations, as can be seen by their dev_t arguments.
fasync() is a bit unusual; it is called when fcntl(fd, F_SETFL,
FASYNC) (or ~FASYNC) is called; devices implementing this need to
be aware when this change is made. The functions are provided with sensible
defaults; most of the time, more than half of the functions are set to
NULL because the VFS does the right thing without having to call the
driver. You will see this in the skeleton driver presented.
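To make the "NULL means use the default" idea concrete, here is a minimal user-space sketch. It is not kernel code: struct demo_operations, struct demo_file, demo_lseek(), demo_read() and zeros_ops are all hypothetical names invented for this illustration, and the table has two slots rather than thirteen. It only shows how a generic layer can look at a function pointer and fall back to its own behavior when the driver left the slot empty.

```c
#include <stddef.h>

struct demo_file;

/* Hypothetical two-slot operations table; the real one has thirteen slots. */
struct demo_operations {
    long (*lseek)(struct demo_file *, long, int);
    int (*read)(struct demo_file *, char *, int);
};

struct demo_file {
    long pos;
    const struct demo_operations *ops;
};

/* Generic-layer default, used when the driver leaves lseek NULL. */
static long default_lseek(struct demo_file *f, long offset, int whence)
{
    if (whence == 0)
        f->pos = offset;          /* absolute position */
    else if (whence == 1)
        f->pos += offset;         /* relative to the current position */
    else
        return -1;                /* this sketch supports nothing else */
    return f->pos;
}

/* Dispatch: call the driver's hook if it has one, else the default. */
long demo_lseek(struct demo_file *f, long offset, int whence)
{
    if (f->ops && f->ops->lseek)
        return f->ops->lseek(f, offset, whence);
    return default_lseek(f, offset, whence);
}

/* read has no sensible default here, so a NULL slot is an error. */
int demo_read(struct demo_file *f, char *buf, int count)
{
    if (!f->ops || !f->ops->read)
        return -1;
    return f->ops->read(f, buf, count);
}

/* A "driver" that implements only read and leaves lseek to the default. */
static int zeros_read(struct demo_file *f, char *buf, int count)
{
    int i;

    for (i = 0; i < count; i++)
        buf[i] = 0;
    f->pos += count;
    return count;
}

const struct demo_operations zeros_ops = { NULL, zeros_read };
```

A file whose ops pointer is &zeros_ops gets zeros from demo_read() and still has working seeks, even though the driver never wrote a seek function.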
I will start by presenting a very simple character device driver which implements a simple form of /dev/zero. Note that it does not deal with the memory-management uses of /dev/zero with which some of you may be familiar; this is intended to be a simple example that sends me on as few tangents as possible. All it does is allow writing of any values and reading all zero values. The code to do reading and writing takes 13 lines total; the rest of this file is a skeleton that anyone writing any device driver will find useful.
/* Compile with "gcc -O -DMODULE -D__KERNEL__ -c zero.c" */
#include <linux/config.h>
#ifdef MODULE
#include <linux/module.h>
#include <linux/version.h>
#else
#define MOD_INC_USE_COUNT
#define MOD_DEC_USE_COUNT
#endif
#include <linux/types.h>
#include <linux/fs.h>
#include <linux/mm.h> /* for verify_area */
#include <linux/errno.h> /* for -EBUSY */
#include <asm/segment.h> /* for put_user_byte */
static int zero_major;
static int read_zero(struct inode * node, struct file * file, char * buf, int count) {
int left;
if (verify_area(VERIFY_WRITE, buf, count) == -EFAULT) return -EFAULT;
for (left = count; left > 0; left--) {
put_user_byte(0, buf);
buf++;
}
return count;
}
Before actually writing the zeros into the buffer provided, we verify that the entire buffer is legal to write in, using a verify_area() call. This prevents us from generating kernel-space faults from the reading process if the reading process passes in a pointer to non-existent memory or a count that makes part of the buffer lie in non-existent memory.
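The check-then-write order can be tried out in user space with a mock. The sketch below is not part of zero.c and does not use the kernel's verify_area(); mock_verify_area(), mock_read_zero(), USER_BASE and USER_SIZE are names I made up, and the "process memory" is just a 64-byte array that pretends to live at address 0x1000. The point it illustrates is that the whole range is validated before the first byte is written, so a bad count writes nothing at all.

```c
#include <errno.h>

#define USER_BASE 0x1000UL   /* pretend start address of the process buffer */
#define USER_SIZE 64UL       /* pretend size of the writable region */

static char user_memory[USER_SIZE];

/* Return 0 if [addr, addr + count) lies entirely inside the region. */
static int mock_verify_area(unsigned long addr, unsigned long count)
{
    if (addr < USER_BASE)
        return -EFAULT;
    if (addr - USER_BASE > USER_SIZE)
        return -EFAULT;
    if (count > USER_SIZE - (addr - USER_BASE))
        return -EFAULT;
    return 0;
}

/* Same shape as read_zero(): validate the whole range, then fill it. */
int mock_read_zero(unsigned long addr, unsigned long count)
{
    unsigned long i;

    if (mock_verify_area(addr, count) == -EFAULT)
        return -EFAULT;
    for (i = 0; i < count; i++)
        user_memory[addr - USER_BASE + i] = 0;
    return (int)count;
}
```

Asking for 8 bytes starting 4 bytes before the end of the region returns -EFAULT and leaves those last 4 bytes untouched, which is the behavior the paragraph above describes.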
static int write_zero(struct inode * inode, struct file * file, char * buf, int count) {
return count;
}
static int lseek_zero(struct inode * inode, struct file * file, off_t offset,
int orig) {
return file->f_pos=0;
}
static int open_zero(struct inode *inode, struct file * file) {
MOD_INC_USE_COUNT;
return 0;
}
static void release_zero(struct inode *inode, struct file * file) {
MOD_DEC_USE_COUNT;
}
static struct file_operations zero_fops = {
lseek_zero,
read_zero,
write_zero,
NULL, /* no special zero_readdir */
NULL, /* no special zero_select */
NULL, /* no special zero_ioctl */
NULL, /* no special zero_mmap */
open_zero,
release_zero,
NULL, /* no special fsync */
NULL, /* no special fasync */
NULL, /* no special check_media_change */
NULL, /* no special revalidate */
};
#ifndef MODULE
long zero_init(long mem_start, long mem_end) {
/* with a first argument of 0, success returns the allocated major (positive) */
if ((zero_major = register_chrdev(0, "zero", &zero_fops)) < 0)
printk("unable to get major for zero device\n");
return mem_start;
}
N.B.: Allocating memory this way is deprecated, since it makes writing a loadable device driver harder. This is vestigial functionality from when the Linux kernel malloc() was not able to allocate more than 4096 bytes at once.
#else
int init_module(void)
{
if ((zero_major = register_chrdev(0, "zero", &zero_fops)) == -EBUSY) {
printk("unable to get major for zero device\n");
return -EIO;
}
return 0;
}
Note that the register_chrdev() function is called with the first argument 0. The first argument can either be a requested major number, in which case the function returns failure (-EBUSY) if that major number is already allocated, or it can be 0, in which case the first available major number greater than 64 (see MAX_BLKDEV and MAX_CHRDEV in <linux/major.h>) is allocated and returned, or if all possible slots are taken up, -EBUSY is returned.
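The two ways of calling register_chrdev() can be modelled with a small table in user space. This sketch is not part of zero.c and is not the kernel's implementation: demo_register_chrdev(), demo_unregister_chrdev(), DEMO_MAX_CHRDEV and DEMO_DYNAMIC_FLOOR are hypothetical, the table size of 128 is arbitrary, and the upward search from 65 simply follows the description in the paragraph above.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define DEMO_MAX_CHRDEV    128  /* hypothetical table size */
#define DEMO_DYNAMIC_FLOOR 64   /* dynamic majors are handed out above this */

static const char *demo_chrdevs[DEMO_MAX_CHRDEV];  /* NULL means free */

/*
 * major != 0: claim exactly that slot; returns 0, or -EBUSY if taken.
 * major == 0: claim the first free slot above the floor; returns its
 *             number, or -EBUSY if every slot is in use.
 */
int demo_register_chrdev(unsigned int major, const char *name)
{
    unsigned int m;

    if (major == 0) {
        for (m = DEMO_DYNAMIC_FLOOR + 1; m < DEMO_MAX_CHRDEV; m++) {
            if (demo_chrdevs[m] == NULL) {
                demo_chrdevs[m] = name;
                return (int)m;
            }
        }
        return -EBUSY;
    }
    if (major >= DEMO_MAX_CHRDEV)
        return -EINVAL;
    if (demo_chrdevs[major] != NULL)
        return -EBUSY;
    demo_chrdevs[major] = name;
    return 0;
}

/* Release a slot, but only if the caller names its current owner. */
int demo_unregister_chrdev(unsigned int major, const char *name)
{
    if (major == 0 || major >= DEMO_MAX_CHRDEV)
        return -EINVAL;
    if (demo_chrdevs[major] == NULL || strcmp(demo_chrdevs[major], name) != 0)
        return -EINVAL;
    demo_chrdevs[major] = NULL;
    return 0;
}
```

Note the asymmetry this makes visible: a fixed request returns 0 on success, while a dynamic request returns the positive major number, so only a negative return value means failure in both cases.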
void cleanup_module(void)
{
unregister_chrdev(zero_major, "zero");
}
#endif
If a program requests data from a file, the particular filesystem which holds that data determines what block the data is on and requests that block from the buffer cache. That block might be cached, in which case the request is satisfied by the buffer cache, or it might not, in which case the buffer cache creates a request for the device driver to fetch the block from disk and store the data in a buffer.
Of course, while finding that data block, the filesystem might have to read other blocks containing directory entries and inodes. When it needs to read those blocks, it requests them from the buffer cache in the same way it requests any other data block. The buffer cache does not need to know what the blocks are used for.
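The hit-or-miss path described above can be sketched in a few lines of user-space C. Everything here is invented for illustration: cache_get_block(), driver_read_block(), the in-memory disk array and the tiny direct-mapped cache are assumptions of this sketch and bear no resemblance to the real buffer cache's data structures or replacement policy. It only shows that the driver is asked for a block when, and only when, the cache does not already hold it.

```c
#include <string.h>

#define DEMO_BLOCK_SIZE  16
#define DEMO_CACHE_SLOTS 4
#define DEMO_DISK_BLOCKS 32

/* The pretend disk, and a count of how often the "driver" was asked. */
static unsigned char disk[DEMO_DISK_BLOCKS][DEMO_BLOCK_SIZE];
static int driver_reads;

/* Stand-in for the block driver: copy one block off the disk. */
static void driver_read_block(int block, unsigned char *dst)
{
    memcpy(dst, disk[block], DEMO_BLOCK_SIZE);
    driver_reads++;
}

struct cache_slot {
    int valid;
    int block;
    unsigned char data[DEMO_BLOCK_SIZE];
};

static struct cache_slot cache[DEMO_CACHE_SLOTS];

/*
 * What a filesystem would call: return the block's data, going to the
 * driver only on a miss. The caller never says what the block is for.
 */
const unsigned char *cache_get_block(int block)
{
    struct cache_slot *slot;

    if (block < 0 || block >= DEMO_DISK_BLOCKS)
        return NULL;
    slot = &cache[block % DEMO_CACHE_SLOTS];
    if (!slot->valid || slot->block != block) {
        driver_read_block(block, slot->data);   /* miss: fetch from disk */
        slot->block = block;
        slot->valid = 1;
    }
    return slot->data;                          /* hit, or freshly filled */
}
```

Reading block 5 twice touches the driver once. Reading block 9 in between evicts block 5 in this toy cache (both map to slot 1), so the next read of block 5 goes to the driver again.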
This indirect approach speeds up disk access considerably, and ends up simplifying some things, as crazy as that sounds. For one thing, the block device driver has very little interaction with user programs; calls to ioctl() are likely to be the most common direct interaction. This means that block device drivers don't have to be as suspicious of their input as character device drivers. By the time a request for a data block has made it to the strategy routine, it has been pretty well checked to make sure it is valid. There is no question of writing to user-space memory and there is no chance that a buggy user-level program passed in bad arguments that need to be checked.
This is fortunate, because other facets of block device drivers are more complicated. There is more infrastructure to be set up than for a simple character device driver. Especially with interrupt-driven block devices, there are opportunities for race conditions that need to be watched out for.
Here is a simple example of what a non-interrupt-driven request function would look like.
static void do_foo_request(void) {
repeat:
    INIT_REQUEST;
    /* check to make sure that the request is for a valid physical device */
    if (!valid_foo_device(CURRENT->dev)) {
        end_request(0);
        goto repeat;
    }
    if (CURRENT->cmd == WRITE) {
        if (foo_write(CURRENT->sector, CURRENT->buffer, CURRENT->nr_sectors << 9)) {
            /* successful write */
            end_request(1);
            goto repeat;
        } else {
            end_request(0);
            goto repeat;
        }
    }
    if (CURRENT->cmd == READ) {
        if (foo_read(CURRENT->sector, CURRENT->buffer, CURRENT->nr_sectors << 9)) {
            /* successful read */
            end_request(1);
            goto repeat;
        } else {
            end_request(0);
            goto repeat;
        }
    }
}
If this looks needlessly complex to you, realize that non-interrupt-driven device drivers do not take full advantage of the infrastructure. Interrupt-driven drivers, by contrast, only start things going, and then return without calling end_request() at all; the interrupt handler (or handlers) and timeout functions (if any) do that when a request has been satisfied or there has been an error. Here is an example of a vaguely-defined interrupt-driven device driver.
static int foo_busy; /* foo_init or init_module sets this to zero */
/* forward declarations; the functions are defined below */
static void foo_initialize_io(void);
static void foo_read_intr(void);
static void foo_write_intr(void);
static void do_foo_request(void) {
    if (foo_busy)
        /* another request is being processed;
           this one will automatically follow */
        return;
    foo_busy = 1;
    foo_initialize_io();
}
static void foo_initialize_io(void) {
    if (CURRENT->cmd == READ) {
        SET_INTR(foo_read_intr);
    } else {
        SET_INTR(foo_write_intr);
    }
    /* send hardware command to start io
       based on request; just a request to
       read if read and preparing data for
       entire write; write takes more code */
}
static void foo_read_intr(void) {
    int error=0;
    CLEAR_INTR;
    /* read data from device and put in
       CURRENT->buffer; set error=1 if error
       This is actually most of the function... */
    /* successful if no error */
    end_request(error?0:1);
    if (!CURRENT)
        /* allow new requests to be processed */
        foo_busy = 0;
    /* INIT_REQUEST will return if no requests */
    INIT_REQUEST;
    /* Now prepare to do IO on next request */
    foo_initialize_io();
}
static void foo_write_intr(void) {
    int error=0;
    CLEAR_INTR;
    /* data has been written. error=1 if error */
    /* successful if no error */
    end_request(error?0:1);
    if (!CURRENT)
        /* allow new requests to be processed */
        foo_busy = 0;
    /* INIT_REQUEST will return if no requests */
    INIT_REQUEST;
    /* Now prepare to do IO on next request */
    foo_initialize_io();
}
To access I/O ports, 12 inline functions are available by including <asm/io.h>. Six functions read data from ports, and each takes one argument: the port to read. The other six write data to ports and take two arguments: first the value to write, then the port to write it to. Each function has a size designation: b stands for byte, w for word (16 bits), and l for long (32 bits). Half of the functions are "pausing" functions that pause briefly after the access, because a lot of hardware is a little slow on the uptake when it is being read or written and is unable to keep up with the CPU. These functions have _p appended to their names. Here is the list: inb(), inb_p(), outb(), outb_p(), inw(), inw_p(), outw(), outw_p(), inl(), inl_p(), outl(), and outl_p().
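The argument order of the writing functions (value first, port second) is easy to get backwards, so here is a mock you can run in user space to see it. This is only a model: demo_outb(), demo_inb(), demo_outw() and demo_inw() are hypothetical stand-ins that store bytes in an array, whereas the real functions execute port I/O instructions and require kernel privilege. The sketch assumes x86 behavior, where a 16-bit access to port p covers ports p and p+1 with the low byte at p.

```c
#include <stdint.h>

#define DEMO_PORTS 0x400u   /* size of the pretend port space */

static uint8_t demo_port_space[DEMO_PORTS];

/* Write one byte: the VALUE comes first, then the PORT. */
void demo_outb(uint8_t value, unsigned int port)
{
    demo_port_space[port % DEMO_PORTS] = value;
}

/* Read one byte: the only argument is the port. */
uint8_t demo_inb(unsigned int port)
{
    return demo_port_space[port % DEMO_PORTS];
}

/* 16-bit write: low byte at port, high byte at port + 1. */
void demo_outw(uint16_t value, unsigned int port)
{
    demo_outb((uint8_t)(value & 0xffu), port);
    demo_outb((uint8_t)(value >> 8), port + 1);
}

/* 16-bit read: reassemble the two bytes in the same order. */
uint16_t demo_inw(unsigned int port)
{
    return (uint16_t)(demo_inb(port) | (demo_inb(port + 1) << 8));
}
```

With this mock, demo_outb(0x12, 0x378) followed by demo_inb(0x378) gives back 0x12. Swapping the two arguments would silently write 0x78 to port 0x12 instead, which is exactly the kind of mistake that compiles cleanly and then confuses the hardware.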
free_irq() frees up an IRQ. It takes one argument: the interrupt number to free.
cli(), which stands for CLear Interrupt enable, disables interrupts temporarily. sti(), SeT Interrupt enable, re-enables them. These are used to prevent race conditions where an interrupt-driven function and a system call (or function called from a system call) access the same data structures.
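Here is a user-space simulation of the race that cli() and sti() guard against. None of it touches real interrupts: demo_cli(), demo_sti(), demo_raise_irq(), demo_drain() and the queue_len counter are hypothetical, and counting every deferred interrupt is a simplification (real hardware may merge repeated interrupts on one line). The hook argument stands for an interrupt that arrives in the middle of the system-call side's update.

```c
static int demo_if = 1;     /* simulated interrupt-enable flag */
static int demo_pending;    /* interrupts raised while disabled */
static int queue_len;       /* data shared with the interrupt handler */

/* Interrupt side: each interrupt queues one more item. */
static void demo_handler(void)
{
    queue_len++;
}

/* Hardware raises an interrupt: run it now, or hold it if disabled. */
void demo_raise_irq(void)
{
    if (demo_if)
        demo_handler();
    else
        demo_pending++;
}

void demo_cli(void)
{
    demo_if = 0;
}

/* Re-enable interrupts and deliver whatever was held back. */
void demo_sti(void)
{
    demo_if = 1;
    while (demo_pending > 0) {
        demo_pending--;
        demo_handler();
    }
}

/*
 * System-call side: take everything that is queued. Between reading
 * queue_len and resetting it, an interrupt may arrive (simulated by
 * hook). Because interrupts are disabled, it is deferred, not lost.
 */
int demo_drain(void (*hook)(void))
{
    int taken;

    demo_cli();
    taken = queue_len;
    if (hook)
        hook();
    queue_len = 0;
    demo_sti();
    return taken;
}
```

If you delete the demo_cli() and demo_sti() calls from demo_drain(), the mid-update interrupt increments queue_len just before it is reset to zero, and that item vanishes. With them in place, the item shows up in the next drain.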
cli();
disable_dma(channel); /* Turn it off */
clear_dma_ff(channel); /* Clear pointer flip/flop */
/* Set DMA mode. Some of these are defined in
 * dma.h. Others (such as auto-initialize mode)
 * aren't there but you can either (a) find them
 * in other drivers (the znet Ethernet card driver
 * has a few) or (b) figure out the hex value to
 * plug into the 8237's registers. Get the specs
 * on the 8237 DMA controller chip if you don't
 * have them already.
 */
set_dma_mode(channel, DMA_MODE_READ);
/* Set transfer address and page bits for your channel */
set_dma_addr(channel, buffer);
/* Set transfer size */
set_dma_count(channel, count);
enable_dma(channel);
sti();
You will still have to make the device do the DMA, as well. Other functions are available for managing DMA depending on what you need to do; all of these functions except for disable_dma(), enable_dma(), request_dma(), and free_dma() should be called with interrupts disabled.
Make sure that you read all the comments in dma.h, as they will help you avoid many possible mistakes in programming DMA. It is probably also worth reading the source code for other drivers that use DMA. Also, read the actual dma_*() function source code which is in <asm/dma.h> and compare it to the documentation for the device for which you are writing a driver to make sure that you understand what you are doing; DMA is probably the easiest hardware programming interface to use incorrectly.
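One class of DMA mistake can be checked with plain arithmetic before the controller is ever touched. The sketch below encodes the usual constraints of the PC's 8237-based ISA DMA as I understand them: the buffer must lie below 16 MB, a transfer on channels 0 to 3 must not cross a 64 KB boundary, a transfer on channels 5 to 7 must not cross a 128 KB boundary and must be word-aligned, and channel 4 is the cascade. demo_dma_buffer_ok() is a hypothetical helper, not something from dma.h, so treat it as a sketch and check it against the comments in dma.h and your hardware documentation.

```c
#include <stdint.h>

/*
 * Return 1 if a buffer at physical address phys, count bytes long,
 * looks usable on the given ISA DMA channel; 0 otherwise.
 */
int demo_dma_buffer_ok(unsigned int channel, uint32_t phys, uint32_t count)
{
    /* 8-bit channels work in 64 KB pages, 16-bit channels in 128 KB. */
    uint32_t window = (channel <= 3) ? 0x10000u : 0x20000u;
    uint32_t last;

    if (channel > 7 || channel == 4)
        return 0;                       /* channel 4 is the cascade */
    if (count == 0 || count > window)
        return 0;
    if (channel >= 5 && ((phys | count) & 1u))
        return 0;                       /* 16-bit channels move whole words */
    last = phys + count - 1;
    if (last < phys)
        return 0;                       /* address arithmetic wrapped */
    if (last >= 0x1000000u)
        return 0;                       /* must stay below 16 MB */
    return (phys / window) == (last / window);
}
```

For example, a 4 KB buffer at 0x2F800 fails on channel 1 because it straddles 0x30000, but passes on channel 5, whose 128 KB window runs from 0x20000 to 0x3FFFF.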
Before accessing memory, use verify_area() to avoid kernel-space segmentation faults in case of error. verify_area() takes three arguments: the first is the type (VERIFY_WRITE or VERIFY_READ), the second is the address at which to start validating, and the third is the number of bytes to validate.
Do be careful to free everything when you are done using it, because kernel memory is non-swappable, which makes memory leaks more serious than in user-space programs. Also, be careful not to free memory before you are finished using it, because freeing memory and then continuing to use it will usually cause a kernel fault--and that's if you are lucky. If you are unlucky, it will silently corrupt memory.
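One common way to make sure that every allocation is freed on every exit path is to unwind in reverse order with goto labels. The sketch below shows that pattern in user space, with malloc() and free() standing in for the kernel allocator. struct demo_dev, demo_dev_create(), demo_dev_destroy() and the fail_step argument (which forces a chosen allocation to fail) are all hypothetical and exist only so the pattern can be exercised.

```c
#include <stdlib.h>

struct demo_dev {
    char *rx_buf;
    char *tx_buf;
};

static int demo_live_allocs;    /* outstanding allocations, for checking */

/* Allocate, or pretend to fail when asked to. */
static void *demo_alloc(size_t n, int fail)
{
    void *p = fail ? NULL : malloc(n);

    if (p)
        demo_live_allocs++;
    return p;
}

static void demo_free(void *p)
{
    if (p) {
        free(p);
        demo_live_allocs--;
    }
}

/*
 * Build a device with two buffers. fail_step (1, 2 or 3) makes that
 * allocation fail; 0 means no forced failure. Each failure jumps to
 * the label that frees exactly what has been allocated so far.
 */
struct demo_dev *demo_dev_create(int fail_step)
{
    struct demo_dev *dev = demo_alloc(sizeof *dev, fail_step == 1);

    if (!dev)
        goto fail;
    dev->rx_buf = demo_alloc(256, fail_step == 2);
    if (!dev->rx_buf)
        goto fail_dev;
    dev->tx_buf = demo_alloc(256, fail_step == 3);
    if (!dev->tx_buf)
        goto fail_rx;
    return dev;

fail_rx:
    demo_free(dev->rx_buf);
fail_dev:
    demo_free(dev);
fail:
    return NULL;
}

/* Free in the reverse order of allocation; dev is not used afterwards. */
void demo_dev_destroy(struct demo_dev *dev)
{
    if (!dev)
        return;
    demo_free(dev->tx_buf);
    demo_free(dev->rx_buf);
    demo_free(dev);
}
```

Whichever step fails, demo_live_allocs is back to zero when demo_dev_create() returns NULL, which is the property you want from driver initialization code.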
If a device driver needs to sleep on an event, it can call one of several functions that are available for doing so, which work for most instances. However, some drivers need to sleep on multiple events, or do something else to avoid race conditions. In Linux, a task in kernel mode can set its state hint to a sleeping mode and keep executing for a while before calling the scheduler, which schedules another task to run. This is extremely flexible, and is partially covered in the KHG. Several devices use this to good effect, including simple devices like the lp parallel port driver and complex ones like the serial driver.
There are two functions for simple sleeping on an event: sleep_on() and interruptible_sleep_on(). There are two corresponding functions for waking up all processes sleeping on an event: wake_up() and wake_up_interruptible().
current->state = TASK_INTERRUPTIBLE;
current->timeout = jiffies + jiffies_to_wait;
schedule();
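The timeout in the fragment above is an absolute tick count, computed by adding a delay to the current value of jiffies. The arithmetic can be tried in user space. In this sketch demo_timeout_in() and demo_expired() are hypothetical helpers, and DEMO_HZ (ticks per second) is assumed to be 100. The comparison is written so that it keeps working when the tick counter wraps around; it relies on the usual two's-complement conversion from unsigned long to long.

```c
#define DEMO_HZ 100UL   /* ticks per second; an assumption of this sketch */

/* Absolute tick count at which a delay of `seconds` will have elapsed. */
unsigned long demo_timeout_in(unsigned long now, unsigned long seconds)
{
    return now + seconds * DEMO_HZ;   /* unsigned arithmetic may wrap */
}

/*
 * Has `timeout` been reached at time `now`? Comparing the signed
 * difference, rather than now >= timeout, gives the right answer even
 * if the counter wrapped between setting and checking the timeout.
 */
int demo_expired(unsigned long now, unsigned long timeout)
{
    return (long)(now - timeout) >= 0;
}
```

A two-second wait started at tick 1000 expires at tick 1200. A one-second wait started 50 ticks before the counter wraps expires at tick 50 of the next cycle, and demo_expired() reports it as still pending right up to the wrap.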
Timers that act like hardware interrupts are also available. Include <linux/timer.h> and allocate a struct timer_list. First pass a pointer to your structure to init_timer(), then fill in the expires, data, and function members, then call add_timer() with a pointer to your structure as the argument. expires gives the number of jiffies after which to time out, data gives the argument to pass to the timer handler, and function is a pointer to the timer handler function. When the function is called, it will not be executed in the context of a running process, so it will not be able to access any user-space data. Just like with a hardware interrupt handler, only kernel-space data structures will be available.
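The init, fill in, add sequence can be simulated in user space to see what each member does. This is a model only: struct demo_timer, demo_init_timer(), demo_add_timer(), demo_tick() and demo_handler() are hypothetical names, the sorted list is my own choice, and demo_tick() stands in for the clock interrupt. Following the description above, expires is treated as a number of ticks from now when the timer is added.

```c
#include <stddef.h>

struct demo_timer {
    struct demo_timer *next;
    unsigned long expires;              /* ticks from now, until added */
    unsigned long data;                 /* argument passed to function */
    void (*function)(unsigned long);
};

static struct demo_timer *demo_timers;  /* pending, soonest first */
static unsigned long demo_jiffies;      /* simulated tick counter */

/* Example handler: record that it ran and what it was passed. */
static int demo_fired;
static unsigned long demo_last_data;

static void demo_handler(unsigned long data)
{
    demo_fired++;
    demo_last_data = data;
}

void demo_init_timer(struct demo_timer *t)
{
    t->next = NULL;
}

/* Convert the relative expiry to an absolute tick and queue in order. */
void demo_add_timer(struct demo_timer *t)
{
    struct demo_timer **p = &demo_timers;

    t->expires += demo_jiffies;
    while (*p && (*p)->expires <= t->expires)
        p = &(*p)->next;
    t->next = *p;
    *p = t;
}

/* One clock tick: run, once each, every timer that has come due. */
void demo_tick(void)
{
    demo_jiffies++;
    while (demo_timers && demo_timers->expires <= demo_jiffies) {
        struct demo_timer *t = demo_timers;

        demo_timers = t->next;
        t->next = NULL;
        t->function(t->data);
    }
}
```

Two timers added with expires set to 3 and 1 fire on the third and first tick respectively, each handing its own data value to the handler. The handler is called from demo_tick(), not from the code that added the timer, which mirrors the point that a timer function does not run in the context of the process that requested it.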
It is possible to request multiple timers at once by making a list of these timer structures; read <linux/timer.h> for details. Most of the time, this will not be necessary.
extern void console_print(const char *);
It is also possible to use gdb to read /proc/kcore to do inspection-only debugging of the kernel. This currently does not work with loadable modules, but a kernel patch is available to allow inspection of loaded modules as well.