{Work-in-Progress, Date Started: 30 Aug, 2005, Applies to kernel v2.6.11}

1 Introduction

A filesystem manages the persistent storage of data on a storage medium. Every operating system has a native filesystem (ext3 for Linux), but it should also support other filesystems. This may include support for the ISO9660 filesystem for CD-ROMs, the NT filesystem to mount Windows partitions, or remote-access filesystems like NFS/CIFS.

To support a multitude of such filesystems, the operating system provides an abstraction called the VFS, or the Virtual Filesystem. Yes, that is what the Linux VFS is: an abstraction, and not a filesystem in itself. The VFS layer is a very good example of the use of object-oriented principles in the Linux kernel. Though an object-oriented language like C++ is not used to implement it, the concepts are taken from the Object Oriented Programming (OOP) paradigm.

In particular, the “inheritance” feature of OOP is used in the Linux VFS. The Linux VFS acts as the base class for all the filesystems supported by Linux. It defines a set of objects and methods to operate on these objects. Filesystems (like ext3, reiserfs) inherit from this base class and extend its objects to suit their own needs. These filesystems also refine the methods that operate on these objects as per their requirements. Thus, a generalisation-specialisation relationship is maintained between the Linux VFS and Linux filesystems. With such an architecture, an application that uses system calls like open/read/write/close can access a filesystem without being concerned about which particular filesystem it is talking to. The application is simply using the APIs provided by the Linux VFS (the base class).
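The generalisation-specialisation idea can be pictured with a small user-space sketch in C. This is not kernel code: the names struct fs, struct fs_ops, foofs and barfs are all made up for illustration. It only shows how a table of function pointers lets generic code call filesystem-specific methods without knowing which filesystem is behind the object.

```c
#include <stddef.h>

struct fs;                              /* forward declaration */

/* Hypothetical "base class" method table; every filesystem supplies one. */
struct fs_ops {
    const char *(*name)(const struct fs *fs);
    long (*block_size)(const struct fs *fs);
};

/* Hypothetical generic object that the common layer works with. */
struct fs {
    const struct fs_ops *ops;           /* plays the role of a virtual method table */
    void *private_data;                 /* room for a filesystem to extend the object */
};

/* Generic entry points: callers never learn which filesystem answers. */
static const char *generic_name(const struct fs *fs)
{
    return fs->ops->name(fs);
}

static long generic_block_size(const struct fs *fs)
{
    return fs->ops->block_size(fs);
}

/* Two made-up "derived classes" that refine the methods. */
static const char *foofs_name(const struct fs *fs) { (void)fs; return "foofs"; }
static long foofs_block_size(const struct fs *fs) { (void)fs; return 4096; }
static const struct fs_ops foofs_ops = { foofs_name, foofs_block_size };

static const char *barfs_name(const struct fs *fs) { (void)fs; return "barfs"; }
static long barfs_block_size(const struct fs *fs) { (void)fs; return 1024; }
static const struct fs_ops barfs_ops = { barfs_name, barfs_block_size };
```

A caller would set the ops pointer of a struct fs to &foofs_ops or &barfs_ops and then only ever call generic_name() and generic_block_size(). The VFS objects described in the following sections follow the same pattern, with their *_operations structures in the role of struct fs_ops.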

Providing a generic abstraction over such a vast set of filesystems is a daunting task. The burden of providing all this functionality with the maximum possible performance makes it more complex still. The Linux VFS layer has been stretched in different directions to accommodate a varied set of requirements. As a result, the Linux VFS layer is a large and complex entity.

We’ll take a look at it one-piece-at-a-time. Although the most in-depth knowledge can be had only by experience and spending hours tracing through different calls, we’ll try to cover most of the major components and their interactions with each other.

2 VFS Objects

We mentioned above that the Linux VFS defines a set of objects and corresponding methods to operate on these objects. Let’s look at what these objects are, and why they exist.

2.1 The Superblock

(Defined : include/linux/fs.h : 758)

A superblock, as the name suggests, is the “super” block which defines how to make sense of all other blocks in the filesystem. This block maintains all the details about the filesystem, like the block size, the total number of blocks available on the filesystem, the number of free blocks, the number of inodes (we’ll come to them soon) on the filesystem, and importantly, the root inode of this filesystem. The superblock is an on-disk data structure.

The superblock that we’ll discuss here is the in-core (in-memory) version of this superblock. The Linux VFS provides us with the object “struct super_block” and its associated methods, “struct super_operations”, for this purpose.

For now, we’ll talk about a few members in this “struct super_block” object. The others we’ll discuss later as we go through the rest of the VFS objects.

The definition of the super_block object shows the default members that the VFS defines for us. Filesystems may extend this object further to include more information. The member “void *s_fs_info” lets a filesystem extend this object as it needs. Filesystems may define their own structures to associate more information with the super_block and hook these structures off this pointer. For example, the ext3 filesystem extends this object with the structure “struct ext3_sb_info”.

E.g.

        struct ext3_sb_info *sbi;

        sbi = kmalloc(sizeof(*sbi), GFP_KERNEL);
        if (!sbi)
                return -ENOMEM;
        sb->s_fs_info = sbi;

The member “struct super_operations *s_op” defines the methods for the super_block. This is a structure of function pointers that are used by the VFS for different operations. A filesystem may initialise this structure with its own methods. We’ll discuss these in the later sections.

The VFS layer maintains a global list of all the superblocks in the system. The variable “super_blocks” is the head of this list. The member “struct list_head s_list” of a super_block hooks a superblock into this global list. The spinlock sb_lock protects this list of superblocks.

2.1.1 XXX What Remains

Talk about how a superblock is usually created on a mount.

Talk about global superblock linked list.

Talk about *_operations. (super_operations).

Talk about different lists maintained by the super_block.

Then, talk about some important operations that are performed on the super_block.

2.2 The Inode

(Defined : include/linux/fs.h : 430)

An inode, or an “index node”, is a unique index to an object in the filesystem. The object could be a regular file, a directory, a symbolic link, a device or a socket. An inode stores all the information required to retrieve the metadata (attributes) and data of one single file. The inode is an on-disk data structure.

As with the super_block, the inode that we’ll discuss here, is the in-core version of the inode. The Linux VFS provides us the object “struct inode” and its associated methods “struct inode_operations”.

An inode is a frequently used entity, and unlike the superblock, of which there is one per filesystem, there could be hundreds or thousands of inodes in a given filesystem. Appropriate management of the inodes is critical to achieving good performance. To this end, the Linux VFS layer maintains an inode cache that holds all the inodes that are currently being used. The inode cache also maintains a list of unused inodes that were recently used, with the expectation that they may be accessed again soon. The recently used, but now unused, inodes may be purged from the cache if VM pressure rises.

Let’s look at some of the members of the inode structure.

unsigned long           i_ino; 

Every inode in the filesystem has a unique inode number associated with it. The i_ino field maintains this number.

atomic_t                i_count;

An inode may have multiple users accessing it. These users are tracked by means of reference counts with the member, “atomic_t i_count”.

umode_t                 i_mode;

defines the permission modes (rwx-ugo) for this inode. i_mode also indicates the type (regular/directory) of the file as well as attributes like setuid and setgid.

unsigned int            i_nlink;

the number of hardlinks to this inode

uid_t                   i_uid;

uid of the owner of this inode.

gid_t                   i_gid;

gid of the owner of this inode.

loff_t                  i_size;

size of the file

struct timespec         i_atime;

access time

struct timespec         i_mtime;

modification time

struct timespec         i_ctime;

change time

The member “struct inode_operations *i_op” defines the methods for the inode. A filesystem may define different inode operations based on the type of the inode (regular/directory/link/device). We’ll discuss these in the later sections.

As with other objects, the inode object can also be extended by filesystems. The VFS provides a mechanism similar to that of the super_block object to extend this object. This is by providing a generic (void *) pointer, “void *generic_ip”. Filesystems may define their own structures to associate more information with an inode and hook them off this pointer.

For efficient use of the slab cache, however, the VFS layer recommends that filesystems embed the “struct inode” in their own inode-specific data structure.

E.g.

        struct my_inode {
                /* my fs specific inode info */
                struct inode vfs_inode;
        };

For example, the ext3 filesystem extends the inode object with the structure, “struct ext3_inode_info”. Notice the “vfs_inode” member toward the bottom of this structure.
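When the generic inode is embedded like this, the filesystem needs a way to get from the “struct inode” pointer that the VFS hands it back to its own containing structure. The usual trick is pointer arithmetic with offsetof, which the kernel wraps in the container_of macro (ext3 has a helper, EXT3_I, built on it). Below is a user-space sketch of the idea; the struct inode here is a one-field stand-in, and my_inode, my_flags and MY_I are hypothetical names.

```c
#include <stddef.h>

/* Stand-in for the VFS inode; only one field is modelled. */
struct inode {
    unsigned long i_ino;
};

/* Hypothetical filesystem-specific inode with the generic inode embedded. */
struct my_inode {
    unsigned int my_flags;      /* filesystem-private information */
    struct inode vfs_inode;     /* generic part, the only part the VFS sees */
};

/* Recover the containing structure from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Map a generic inode pointer back to the filesystem-specific inode. */
static struct my_inode *MY_I(struct inode *inode)
{
    return container_of(inode, struct my_inode, vfs_inode);
}
```

Because both parts live in one allocation, a single object from the slab cache serves the VFS and the filesystem at the same time, which is the efficiency argument made above.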

An inode belongs to a number of lists. Let’s look at these:

A super_block for a given filesystem stores a list of all the inodes for that particular filesystem. The head of this list is the member “struct list_head s_inodes” in the super_block structure. The member “struct list_head i_sb_list” in the inode structure hooks onto this linked list.

Note that the “struct list_head” data type may act as either of the following:

  1. List Head : the head of a linked list (as with the s_inodes member in super_block)
  2. List Member : a node embedded in an object, to hook that object into some list (as with the i_sb_list member in the inode)
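The two roles of “struct list_head” can be seen in a small user-space re-creation of the circular doubly linked list. This is a simplified sketch, not the kernel's <linux/list.h>: the structures are cut down to the fields needed here, and init_list_head, count_inodes and sum_inode_numbers are helper names made up for the example.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points to itself. */
static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert 'entry' right after 'head'. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Get from an embedded list member back to the object that contains it. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct super_block { struct list_head s_inodes; };                  /* list head */
struct inode { unsigned long i_ino; struct list_head i_sb_list; };  /* list member */

/* Walk the superblock's inode list and count the members. */
static unsigned long count_inodes(struct super_block *sb)
{
    unsigned long n = 0;
    struct list_head *pos;

    for (pos = sb->s_inodes.next; pos != &sb->s_inodes; pos = pos->next)
        n++;
    return n;
}

/* Walk the same list, this time touching the containing inodes. */
static unsigned long sum_inode_numbers(struct super_block *sb)
{
    unsigned long sum = 0;
    struct list_head *pos;

    for (pos = sb->s_inodes.next; pos != &sb->s_inodes; pos = pos->next)
        sum += list_entry(pos, struct inode, i_sb_list)->i_ino;
    return sum;
}
```

The same type is used on both sides: in the super_block it is only ever used as the anchor of the list, while in the inode it is the link that gets threaded onto that anchor.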

All the inodes are also stored in a hash table for faster lookup. The member “struct hlist_node i_hash” hooks into the lists in this hash table. A system-wide hash of inodes is tracked by the variable “struct hlist_head *inode_hashtable”. An inode is hashed into this table based on a key derived from the address of the superblock and the inode number. Both are needed because an inode number uniquely identifies an object only within a given filesystem, i.e. within one super_block.
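To see why the key needs both the superblock and the inode number, here is a small user-space sketch of such a hash table. The mixing function is purely illustrative and is not the kernel's hash function; the table size, the singly linked chain (standing in for the hlist) and the names inode_bucket, hash_insert and hash_find are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_BITS 8
#define HASH_SIZE (1u << HASH_BITS)

/* Cut-down inode: just the key fields and a chain pointer. */
struct inode {
    const void *i_sb;            /* owning superblock (only its address matters here) */
    unsigned long i_ino;         /* inode number, unique within that superblock */
    struct inode *hash_next;     /* simple chain, standing in for an hlist */
};

static struct inode *table[HASH_SIZE];

/* Illustrative mixing of superblock address and inode number (NOT the kernel's). */
static unsigned int inode_bucket(const void *sb, unsigned long ino)
{
    uint64_t key = (uint64_t)(uintptr_t)sb ^
                   ((uint64_t)ino * 0x9E3779B97F4A7C15ull);

    key ^= key >> 32;
    key ^= key >> 16;
    return (unsigned int)(key & (HASH_SIZE - 1));
}

static void hash_insert(struct inode *inode)
{
    unsigned int b = inode_bucket(inode->i_sb, inode->i_ino);

    inode->hash_next = table[b];
    table[b] = inode;
}

/* A lookup must compare both the superblock and the number. */
static struct inode *hash_find(const void *sb, unsigned long ino)
{
    struct inode *p;

    for (p = table[inode_bucket(sb, ino)]; p != NULL; p = p->hash_next)
        if (p->i_sb == sb && p->i_ino == ino)
            return p;
    return NULL;
}
```

Two filesystems can both have an inode number 2, so a lookup for (superblock A, 2) must never return the inode that belongs to superblock B, even if both happen to land in the same bucket.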

The member “struct list_head i_list” allows the inode to be hooked into three different lists. This is the inode cache implementation that we talked about earlier.

  • inode_in_use : valid inode, i_count > 0, i_nlink > 0

All the inodes that are currently in use are maintained in this list.

  • inode_unused : valid inode, i_count = 0

Inodes that do not have any active users (i_count = 0) are stored in this list. These inodes are not freed yet, with the expectation that they may be accessed again soon. If VM pressure increases, inodes are removed from this list and freed.

  • sb->s_dirty : same as inode_in_use, but also dirty

All inodes that have pending writes against them are stored in this list. The member “struct list_head s_dirty” in the super_block is the head of this linked list. Every super_block has its own s_dirty list, as against the inode_in_use and inode_unused lists above which are global. This helps to sync all pending writes for a given filesystem in an efficient manner.

2.2.1 XXX What remains

sb->s_io –> intermediate list, while I/O is being performed on the inode.

inode_operations

2.3 The dentry

(Defined : include/linux/dcache.h : 83)

A dentry, or a “directory entry”, is what its name suggests. You may have noticed in the inode section that there was no mention of the name of the file that an inode refers to. That is not because we forgot to mention it, but because there is a separate structure, called the “dentry”, to keep track of that.

A dentry is responsible for storing the names of filesystem objects. Well, that means there ought to be the following members in the dentry structure,

  1. something to store the name of the filesystem object
  2. and a pointer to the inode for this filesystem object

The dentries also keep track of the hierarchical namespace of a filesystem. That means there ought to be the following members as well,

  1. something to point to the parent directory for this filesystem object
  2. and if the dentry is for a directory, something to point to the children of this directory

Like an inode, a dentry is a frequently used entity, and there could be hundreds or thousands of dentries in a given filesystem. To ensure good performance, the Linux VFS layer maintains a dentry cache that holds all the dentries that are currently being used. The dentry cache also maintains a list of unused dentries that were recently used, with the expectation that they may be accessed again soon. The unused dentries are purged from the cache if VM pressure rises.

Let’s talk about some of the members of this relatively small structure.

atomic_t d_count;

tracks the active users of this dentry

struct inode *d_inode;

points to the inode

struct dentry *d_parent;

points to the parent of this dentry

struct qstr d_name;

the name of this filesystem object

struct list_head d_subdirs;

the list of all the children of this dentry (if this is a directory). This is the head of the linked list.

struct dentry_operations *d_op;

the methods that operate on this dentry

struct super_block *d_sb;

the superblock to which this dentry belongs

void *d_fsdata;

generic pointer to extend this object.
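The d_name and d_parent members are enough to reconstruct the full path of an object by walking up towards the root. The sketch below does this in user space with a cut-down dentry: d_name is a plain string rather than a “struct qstr”, and dentry_path is a hypothetical helper written for this example, not a kernel function. It relies on the convention that the root dentry is its own parent.

```c
#include <stddef.h>
#include <string.h>

/* Cut-down dentry: a name and a pointer to the parent. */
struct dentry {
    const char *d_name;          /* stand-in for struct qstr */
    struct dentry *d_parent;     /* the root dentry points to itself */
};

/*
 * Build the absolute path of 'd' at the end of 'buf' by walking the
 * d_parent pointers. Returns a pointer into 'buf' where the path starts,
 * or NULL if 'buf' is too small.
 */
static char *dentry_path(const struct dentry *d, char *buf, size_t len)
{
    char *end = buf + len;

    if (len < 2)
        return NULL;
    *--end = '\0';

    if (d->d_parent == d) {              /* the root itself */
        *--end = '/';
        return end;
    }

    while (d->d_parent != d) {
        size_t n = strlen(d->d_name);

        if ((size_t)(end - buf) < n + 1)
            return NULL;                 /* no room for "/name" */
        end -= n;
        memcpy(end, d->d_name, n);
        *--end = '/';
        d = d->d_parent;
    }
    return end;
}
```

The path is assembled from right to left because the walk starts at the leaf and only reaches the root last.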

There are quite a few other “list_head” members that need some explanation.

d_child:

This acts as the list member of a linked list headed by the parent dentry. The head of this linked list is the member “struct list_head d_subdirs” of the parent dentry.

d_hash:

All the dentries are stored in a hash table for faster lookup. The member “struct hlist_node d_hash” hooks into the lists in this hash table. A system-wide hash of dentries is tracked by the variable “struct hlist_head *dentry_hashtable”. A dentry is hashed into this table based on a key derived from the address of the parent dentry and a hash of the name of this dentry.

d_lru:

Unused dentries are not immediately freed, but are stored in a list, accessed through the variable “dentry_unused”, with the expectation that they may be accessed again soon. The “struct list_head d_lru” member in the dentry hooks into this list of unused dentries. Entries on the “dentry_unused” list are freed when VM pressure rises.

XXXXX
DCACHE_REFERENCED:

This flag is set the first time a dentry is put onto the “dentry_unused” list. A purge of the cache does not free dentries that have this flag set; it simply resets the flag, so that the next pruning pass clears them out.
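The effect of the flag, as described above, is a “second chance” before a dentry is freed. The following user-space sketch models only that behaviour: an array stands in for the “dentry_unused” list, a freed marker stands in for actually releasing memory, and REFERENCED is an illustrative flag value rather than the kernel's definition. The function name prune_pass is made up for this example.

```c
#include <stddef.h>

#define REFERENCED 0x1           /* illustrative flag bit, not the kernel's value */

/* Cut-down dentry for the sketch. */
struct dentry {
    unsigned int d_flags;
    int freed;                   /* set instead of really freeing the object */
};

/*
 * One pruning pass over the unused dentries.
 * Flagged dentries lose the flag and survive; unflagged ones are freed.
 * Returns the number of dentries freed in this pass.
 */
static int prune_pass(struct dentry *unused, size_t n)
{
    size_t i;
    int freed = 0;

    for (i = 0; i < n; i++) {
        struct dentry *d = &unused[i];

        if (d->freed)
            continue;                    /* already gone */
        if (d->d_flags & REFERENCED) {
            d->d_flags &= ~REFERENCED;   /* second chance */
            continue;
        }
        d->freed = 1;
        freed++;
    }
    return freed;
}
```

In this model a flagged dentry survives exactly one pass and is freed by the next one, unless something sets the flag again in between.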

2.3.1 So why a separate structure, you say?

The Unix filesystem namespace hierarchy is not a tree but a directed acyclic graph (DAG). A file may have more than one name associated with it, and the same file may be seen at multiple places in a namespace hierarchy. For this reason, the name and the position in the namespace were separated from the inode object and placed in a separate structure called the dentry. Multiple dentries may therefore point to a single inode. An inode maintains a list of the dentries that point to it. The member “struct list_head i_dentry” in the inode structure is the head of this list. Dentries hook onto this list by means of the member “struct list_head d_alias” in the dentry structure.
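The “several names, one inode” relationship is visible from user space through stat(2), which reports the device and inode number behind a name along with the hard link count. The sketch below uses only standard POSIX calls; the helper names same_inode and link_count are made up for this example.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Returns 1 if both paths name the same inode on the same filesystem,
 * 0 if they name different objects, and -1 if either stat() fails.
 */
static int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Number of hard links to the object behind 'path', or -1 on error. */
static long link_count(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}
```

After creating a second name for an existing file with link(2) (or “ln old new”), same_inode("old", "new") should report 1 and the link count of either name should go up by one: two names, and so two dentries once both have been looked up, for one inode.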

2.3.2 XXX What remains

Negative Dentry

A +ve dentry holds reference on the corresponding inode

2.4 The File

(Defined : include/linux/fs.h : 580)

The file structure is used to keep track of open files in a system. The file handle that we get in response to the open(2) system call internally points to this structure. All the attributes related to the opening of the file, and other state information (like the offset) during the use of the file descriptor, are maintained in this file structure.
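That the offset lives in the per-open file structure, and not in the inode, can be observed from user space. Two separate open(2) calls on the same path get independent offsets, while a descriptor duplicated with dup(2) refers to the same open file and therefore shares the offset. The following is a minimal sketch using standard POSIX calls; compare_offsets is a helper name made up for this example.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open 'path' twice, dup() the first descriptor, move the first
 * descriptor's offset to 'pos', and report the offsets seen through
 * the other two descriptors. Returns 0 on success, -1 on error.
 */
static int compare_offsets(const char *path, off_t pos,
                           off_t *via_dup, off_t *via_second_open)
{
    int fd1 = open(path, O_RDONLY);
    int fd2 = open(path, O_RDONLY);          /* a second, independent open */
    int fd3 = (fd1 >= 0) ? dup(fd1) : -1;    /* shares the open file of fd1 */
    int rc = -1;

    if (fd1 >= 0 && fd2 >= 0 && fd3 >= 0 &&
        lseek(fd1, pos, SEEK_SET) == pos) {
        *via_dup = lseek(fd3, 0, SEEK_CUR);
        *via_second_open = lseek(fd2, 0, SEEK_CUR);
        rc = 0;
    }

    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    if (fd3 >= 0) close(fd3);
    return rc;
}
```

Called on any regular file with pos set to 5, the offset seen through the dup()ed descriptor is expected to be 5, while the offset seen through the second open() stays at 0.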

3 File I/O Scenarios and code flow

4 Locking

5 Acknowledgements

6 References