1 Journaling

Journaling is a term borrowed from databases. One characteristic of a database is that transactions are atomic: a transaction either goes through completely or does not go through at all, so its effects are either seen in full or not seen at all. Hard disks provide atomicity at the sector level, which means that a write started on a sector either completes or does not happen. But when a transaction spans multiple sectors, some higher-level mechanism is needed to ensure that the modifications to the entire set of sectors occur atomically. For example, if a transaction has to modify 3 sectors and the machine crashes after only 2 of them have been modified, the database is left inconsistent, since the 3rd sector still contains stale data.

Databases use a technique called journaling to maintain the atomicity of operations. Instead of overwriting the actual locations on the hard disk, all the modified sectors are first written to a separate portion of the storage called the journal. The actual locations on the hard disk are overwritten with the contents of the journal _only_ after all the sectors associated with the transaction have been flushed to the journal. To identify whether all the sectors belonging to a transaction are present in the journal, a commit record is flushed to the journal at the end of the transaction. Let us see what happens when the machine crashes at different points in the above example:

  • after the commit record is flushed to the journal.

In this case, when the machine comes back up again, it checks the journal and finds a transaction with a commit record at the end of it. The commit record indicates that the transaction is complete and can be written to its actual location. All the sectors belonging to this transaction are written to their actual locations on the disk, overwriting the previous contents (this is called replaying the journal).

  • after only two sectors are flushed to the journal.

In this case, when the machine comes back up again and checks the journal it finds a transaction with no commit record at the end of the transaction. This indicates that it may not be a completed transaction and hence NO modifications are done to the disk.

  • after the 3 sectors are flushed to the journal but the commit record is not yet flushed.

Even in this case, because of the absence of the commit record no modifications are done to the disk.

Going a step further, it may be noted that in the first case there could be another crash during the replaying of the journal information. But no harm is caused since the next time around, the same information will be present in the journal (as the contents of the journal are not yet modified), thus creating a situation similar to the first case.

Thus, with the help of journaling it can be assured that transactions are committed atomically to the disk.
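The three crash scenarios above can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions: the "disk" and the journal are in-memory arrays, and all names (wal_log, wal_commit, wal_replay and the wal_ structures) are hypothetical and not part of any real journaling implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE   8   /* tiny sectors keep the example readable */
#define DISK_SECTORS  8
#define JOURNAL_SLOTS 4

/* One journaled copy of a sector, tagged with its final location. */
struct wal_entry {
    size_t target;              /* sector number on the "disk" */
    char data[SECTOR_SIZE];
};

struct wal_journal {
    struct wal_entry entries[JOURNAL_SLOTS];
    size_t count;               /* sectors flushed to the journal so far */
    bool committed;             /* has the commit record been flushed? */
};

struct wal_disk {
    char sectors[DISK_SECTORS][SECTOR_SIZE];
};

/* Write a modified sector to the journal; the disk itself is untouched. */
bool wal_log(struct wal_journal *j, size_t target, const char *data)
{
    if (j->count == JOURNAL_SLOTS || target >= DISK_SECTORS)
        return false;
    struct wal_entry *e = &j->entries[j->count++];
    e->target = target;
    strncpy(e->data, data, SECTOR_SIZE - 1);  /* pads the rest with zeros */
    e->data[SECTOR_SIZE - 1] = '\0';
    return true;
}

/* Flush the commit record that marks the transaction as complete. */
void wal_commit(struct wal_journal *j)
{
    j->committed = true;
}

/*
 * Recovery after a crash: copy the journaled sectors to their real
 * locations only if a commit record is present. The journal is not
 * modified, so a crash during replay simply leads to another replay.
 * Returns the number of sectors written to the disk.
 */
size_t wal_replay(const struct wal_journal *j, struct wal_disk *d)
{
    if (!j->committed)
        return 0;               /* incomplete transaction: ignore it */
    for (size_t i = 0; i < j->count; i++) {
        const struct wal_entry *e = &j->entries[i];
        assert(e->target < DISK_SECTORS);
        memcpy(d->sectors[e->target], e->data, SECTOR_SIZE);
    }
    return j->count;
}
```

Calling wal_replay() on a journal that holds two or three sectors but no commit record leaves the disk untouched, while calling it on a committed journal any number of times produces the same final disk contents.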

In the case of filesystems, all metadata updates should be journaled so that the metadata remains consistent. Without journaling, a crash may leave the filesystem metadata stored on the disk in an inconsistent state. Special programs that check the consistency of the filesystem then have to find and fix any inconsistencies caused by the crash. This check is a time-consuming process, and the time it takes grows with the size of the disk.

Journaling filesystems reduce boot time by replacing the usually time consuming filesystem checks with the fast and efficient journal replays.

It should be noted that journaling guarantees the consistency of metadata. It does not make any guarantees about the consistency of data associated with a file.

2 ext3 and journaling

The ext3 filesystem is a journaling filesystem. It uses the journaling facilities provided by the jbd module in the Linux kernel.

The ext3 code base is essentially the ext2 code base with journaling added; apart from the journaling support, it does not differ from the ext2 filesystem. This will become clearer in the code snippets discussed in this section.

The designers of ext3 split it into two components. One of them is a journaling layer that does nothing but journaling and is independent of the ext3 filesystem on top of it. The objective was that this layer could potentially serve other modules that require journaling support as well. Hence, jbd provides an API that ext3 and other modules can use to incorporate journaling.

2.1 The handle

ext3 needs some way of telling the journaling layer which set of updates forms a single atomic update. To address this, the journaling layer provides the concept of a handle. When ext3 wants to perform an atomic update, it tells the journaling layer how many block updates make up that update by calling journal_start(), which returns a handle to ext3. The handle is an opaque structure as far as the ext3 layer is concerned. ext3 uses it in all further communication with the journaling layer to identify the update in progress. Once the update is complete, the handle is destroyed by calling journal_stop().
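The life cycle of a handle can be sketched with a toy user-space model. This is not the jbd API: the toy_ types and functions are hypothetical, and the idea that each handle reserves a number of block "credits" from a fixed pool of journal space is an assumption made to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy types; the real jbd structures are far richer. */
struct toy_journal {
    int free_blocks;     /* journal space not yet reserved */
    int open_handles;    /* atomic updates currently in progress */
};

struct toy_handle {
    struct toy_journal *journal;
    int credits;         /* block updates this handle may still log */
};

/* Begin an atomic update that will touch at most nblocks blocks. */
struct toy_handle *toy_journal_start(struct toy_journal *j, int nblocks)
{
    if (nblocks <= 0 || nblocks > j->free_blocks)
        return NULL;     /* not enough journal space for this update */

    struct toy_handle *h = malloc(sizeof *h);
    if (h == NULL)
        return NULL;

    h->journal = j;
    h->credits = nblocks;
    j->free_blocks -= nblocks;
    j->open_handles++;
    return h;
}

/* Record one block update against the handle; fails when credits run out. */
int toy_journal_log_block(struct toy_handle *h)
{
    if (h->credits == 0)
        return -1;
    h->credits--;
    return 0;
}

/* Finish the atomic update and return unused credits to the journal. */
void toy_journal_stop(struct toy_handle *h)
{
    assert(h != NULL);
    h->journal->free_blocks += h->credits;
    h->journal->open_handles--;
    free(h);
}
```

A caller would start a handle for, say, 3 blocks, log each block update against it, and stop the handle when the update is complete. Everything logged between the start and stop calls belongs to the same atomic update.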

2.2 create system call

To understand how ext3 uses journaling, let us first trace through the code that creates a filesystem object (file or directory) in the ext3 filesystem. Comparing the ext3 object creation code with the corresponding ext2 code helps in understanding the precise changes made for journaling. For simplicity, let us focus on the creation of a file rather than a directory.

Here is the code for the create system call in ext2 and ext3.

ext2 :

    static int ext2_create (struct inode * dir, struct dentry * dentry, int mode)
    {
        struct inode * inode = ext2_new_inode (dir, mode);
        int err = PTR_ERR(inode);

        if (!IS_ERR(inode)) {
            inode->i_op = &ext2_file_inode_operations;
            inode->i_fop = &ext2_file_operations;
            inode->i_mapping->a_ops = &ext2_aops;
            mark_inode_dirty(inode);
            err = ext2_add_nondir(dentry, inode);
        }
        return err;
    }

ext3 :

    static int ext3_create (struct inode * dir, struct dentry * dentry, int mode)
    {
        handle_t *handle;
        struct inode * inode;
        int err;

        handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS + 3);
        if (IS_ERR(handle))
            return PTR_ERR(handle);

        if (IS_SYNC(dir))
            handle->h_sync = 1;

        inode = ext3_new_inode (handle, dir, mode);
        err = PTR_ERR(inode);
        if (!IS_ERR(inode)) {
            inode->i_op = &ext3_file_inode_operations;
            inode->i_fop = &ext3_file_operations;
            inode->i_mapping->a_ops = &ext3_aops;
            err = ext3_add_nondir(handle, dentry, inode);
        }
        ext3_journal_stop(handle, dir);
        return err;
    }

As can be seen, the work of the create system call can be logically divided into two sub-tasks:

  • Allocate and initialise new inode
  • Add an entry for this inode to the directory

The source code for ext2 and ext3 also shows that the ext3 functions are called with an extra parameter, the handle. This is the same handle that is returned by the journal_start() function call discussed earlier.

2.2.1 Allocate and initialise new inode

ext2 : 

struct inode * ext2_new_inode(const struct inode * dir, int mode)

This function accepts a directory inode (in which the file is to be created) and the mode in which it has to be created and returns a newly created and initialised in-memory inode structure. The inode is marked dirty in the function itself.

sb = dir->i_sb; 

inode = new_inode(sb); 

This code allocates an in-core uninitialised inode belonging to the directory’s superblock. The new_inode function returns an inode with the minimum number of fields required.

group=find_group_other(sb, dir->u.ext2_i.i_block_group);

The ext2 filesystem divides the available space into groups for better management. This function finds the correct group to which this inode should belong. ext2 also keeps track of the number of free inodes in each group, and this information is updated for the group that is found. The function also arranges for the corresponding buffers to be synced by marking them dirty.

    bh = load_inode_bitmap (sb, group);
    if (IS_ERR(bh))
        goto fail2;

    i = ext2_find_first_zero_bit ((unsigned long *) bh->b_data,
                                  EXT2_INODES_PER_GROUP(sb));
    if (i >= EXT2_INODES_PER_GROUP(sb))
        goto bad_count;
    ext2_set_bit(i, bh->b_data);

    mark_buffer_dirty(bh);

This code loads the buffer head that contains the inode bitmap for the group. The bitmap stores the allocated/unallocated state of the inodes. The first zero bit (an unallocated inode) is found. This bit is set to mark the inode as allocated, and the buffer is marked dirty so that the updated state is flushed to the disk at some later point.
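The bitmap search and update can be sketched in portable C as follows. This is an illustration only, not the kernel's implementation: the bm_ function names are hypothetical, and the sketch assumes that bit 0 is the least significant bit of byte 0.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Return the index of the first clear bit among the first nbits bits
 * of map, or nbits if every bit is set.
 */
size_t bm_find_first_zero_bit(const unsigned char *map, size_t nbits)
{
    assert(map != NULL);
    for (size_t i = 0; i < nbits; i++) {
        if (!(map[i / CHAR_BIT] & (1u << (i % CHAR_BIT))))
            return i;
    }
    return nbits;
}

/* Set bit i and return its previous value (0 or 1). */
int bm_set_bit(size_t i, unsigned char *map)
{
    unsigned char mask = (unsigned char)(1u << (i % CHAR_BIT));
    int old = (map[i / CHAR_BIT] & mask) != 0;

    map[i / CHAR_BIT] |= mask;
    return old;
}

/*
 * Allocate the first free inode slot in a group bitmap.
 * Returns the slot index, or -1 if the group has no free inode.
 */
long bm_alloc_inode(unsigned char *map, size_t inodes_per_group)
{
    size_t i = bm_find_first_zero_bit(map, inodes_per_group);

    if (i >= inodes_per_group)
        return -1;      /* corresponds to the bad_count path above */
    bm_set_bit(i, map);
    return (long)i;
}
```

A linear scan bit by bit is the simplest correct approach; a production implementation would normally skip whole words that are fully set.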

es->s_free_inodes_count=cpu_to_le32(le32_to_cpu(es->s_free_inodes_count)-1); 

mark_buffer_dirty(sb->u.ext2_sb.s_sbh); 

The number of free inodes in the superblock is decremented and the superblock buffer is marked dirty.
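The cpu_to_le32(le32_to_cpu(...) - 1) idiom in the snippet above exists because the counter is stored on disk in little-endian byte order, whatever the byte order of the CPU. A minimal portable sketch of the same read-modify-write step is shown below; the le32_load, le32_store and dec_free_inodes names are hypothetical stand-ins, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 32-bit little-endian value from its on-disk bytes. */
uint32_t le32_load(const uint8_t raw[4])
{
    return (uint32_t)raw[0]
         | (uint32_t)raw[1] << 8
         | (uint32_t)raw[2] << 16
         | (uint32_t)raw[3] << 24;
}

/* Store a 32-bit value as little-endian on-disk bytes. */
void le32_store(uint32_t v, uint8_t raw[4])
{
    raw[0] = (uint8_t)(v & 0xFF);
    raw[1] = (uint8_t)((v >> 8) & 0xFF);
    raw[2] = (uint8_t)((v >> 16) & 0xFF);
    raw[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Decrement an on-disk free inode counter in place. */
void dec_free_inodes(uint8_t raw[4])
{
    uint32_t count = le32_load(raw);

    assert(count > 0);          /* an inode must be free to be allocated */
    le32_store(count - 1, raw);
}
```

Because the conversion is done byte by byte, the sketch behaves the same on little-endian and big-endian machines.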

mark_inode_dirty(inode); 

All the fields in the inode are initialised and then the inode is marked dirty.

ext3:

Now let us take a look at the corresponding ext3 code :

struct inode * ext3_new_inode (handle_t *handle, const struct inode * dir, int mode) 

As can be seen, the function accepts the same arguments as its ext2 counterpart, the directory inode and the mode in which the file has to be created, plus one additional parameter, the handle. This function also returns a newly created and initialised in-memory inode structure. The inode is marked dirty in the function itself.

sb = dir->i_sb; 

inode = new_inode(sb); 

This code is identical to the ext2 code. We now have an inode structure with only the minimum set of fields initialised.

The next change is that ext3 cannot simply use the find_group_other() function the way ext2 does. Why? That function updates metadata, and since all metadata updates must first go to the journal, special care needs to be taken.

ext3 finds the correct group to which the inode should belong and also keeps a pointer to the buffer head for the group's metadata.

    bitmap_nr = load_inode_bitmap (sb, i);
    if (bitmap_nr < 0)
        goto fail;

    bh = sb->u.ext3_sb.s_inode_bitmap[bitmap_nr];

    if ((j = ext3_find_first_zero_bit ((unsigned long *) bh->b_data,
                                       EXT3_INODES_PER_GROUP(sb))) <
        EXT3_INODES_PER_GROUP(sb)) {

This code is much the same as what ext2 does. The inode bitmap is loaded and the index of the first zero bit is stored in the variable j.

    err = ext3_journal_get_write_access(handle, bh);
    if (err)
        goto fail;

    if (ext3_set_bit (j, bh->b_data)) {
        ext3_error (sb, "ext3_new_inode", "bit already set for inode %d", j);
        goto repeat;
    }
    BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
    err = ext3_journal_dirty_metadata(handle, bh);
    if (err)
        goto fail;

This code clearly differs from the code in ext2_new_inode(). The difference is that the ext3_set_bit() call is enclosed between calls to ext3_journal_get_write_access() and ext3_journal_dirty_metadata(). Both of these functions call their journal_ counterparts to do the actual work.

journal_get_write_access() tells the journaling layer that this buffer is about to be modified and will be written to the journal soon. journal_dirty_metadata(), in turn, informs the journaling layer that the changes to the buffer have been made and that it can now journal the buffer as part of the current atomic update.

As you can see the modifications to the buffer are done after the journal_get_write_access() is done and before the journal_dirty_metadata() call.

Most of the differences between ext2 and ext3 stem from the same concept. The idea is to find the buffer head for metadata which has to be modified, and to carry out the modifications enclosed within the ext3_journal_get_write_access() and ext3_journal_dirty_metadata() function calls.
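The bracketing pattern can be modelled in user space as follows. This is a toy sketch, not the jbd implementation: the demo_ names are hypothetical, and the assumption that the journaling layer copies the block image at the time of the "dirty" call is a simplification made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BLOCK_SIZE 16
#define DEMO_MAX_LOGGED 4

/* A metadata buffer as seen by the toy journaling layer. */
struct demo_buffer {
    unsigned char data[DEMO_BLOCK_SIZE];
    bool write_access;   /* true between get_write_access and dirty_metadata */
};

/* The current atomic update: images of the blocks journaled so far. */
struct demo_txn {
    unsigned char logged[DEMO_MAX_LOGGED][DEMO_BLOCK_SIZE];
    int nlogged;
};

/* Announce the intention to modify a buffer as part of the update. */
int demo_get_write_access(struct demo_txn *t, struct demo_buffer *b)
{
    if (t->nlogged == DEMO_MAX_LOGGED)
        return -1;       /* no room left in this update */
    b->write_access = true;
    return 0;
}

/* Tell the journaling layer the modification is done; journal the block. */
int demo_dirty_metadata(struct demo_txn *t, struct demo_buffer *b)
{
    if (!b->write_access)
        return -1;       /* the change was never announced */
    memcpy(t->logged[t->nlogged++], b->data, DEMO_BLOCK_SIZE);
    b->write_access = false;
    return 0;
}

/* The pattern itself: the modification sits between the two calls. */
int demo_journaled_set_byte(struct demo_txn *t, struct demo_buffer *b,
                            size_t off, unsigned char value)
{
    assert(off < DEMO_BLOCK_SIZE);

    int err = demo_get_write_access(t, b);
    if (err)
        return err;
    b->data[off] = value;               /* the metadata update */
    return demo_dirty_metadata(t, b);
}
```

In this model a call to demo_dirty_metadata() without a preceding demo_get_write_access() is rejected, which mirrors the rule that every metadata change must be enclosed between the two calls.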

Continuing with our call trace,

    err = ext3_journal_get_write_access(handle, bh2);
    if (err)
        goto fail;
    gdp->bg_free_inodes_count =
        cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
    if (S_ISDIR(mode))
        gdp->bg_used_dirs_count =
            cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1);
    BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
    err = ext3_journal_dirty_metadata(handle, bh2);
    if (err)
        goto fail;

    BUFFER_TRACE(sb->u.ext3_sb.s_sbh, "get_write_access");
    err = ext3_journal_get_write_access(handle, sb->u.ext3_sb.s_sbh);
    if (err)
        goto fail;
    es->s_free_inodes_count =
        cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1);
    BUFFER_TRACE(sb->u.ext3_sb.s_sbh, "call ext3_journal_dirty_metadata");
    err = ext3_journal_dirty_metadata(handle, sb->u.ext3_sb.s_sbh);

This section of the code updates the free inode count in the group descriptor as well as in the superblock. As mentioned earlier, in the case of ext2 the find_group_other() call returns a group with the free inode count already decremented. In the case of ext3, this has to be done explicitly. Again, it is done by enclosing the writes to the metadata within the journaling calls.

err = ext3_mark_inode_dirty(handle, inode); 

Finally, the inode is marked dirty, as in the ext2 code. The difference, again, is that an extra handle parameter is passed to this function. The ext3_mark_inode_dirty() function eventually finds the buffer head for the on-disk inode and copies the inode information into the buffer containing the on-disk (raw) inode image. This, too, is done by enclosing the changes within the journaling calls.

2.2.2 Add an entry for this inode to the directory

ext2 :

To add the given inode to the directory, ext2_create calls ext2_add_nondir(), which eventually calls the ext2_add_link() function.

The ext2_add_link() is quite a simple function. It loops through all the directory entries present in the directory. It cycles through all the _pages_ of the directory, to find an appropriate directory entry. It modifies the directory entry such that the filename and inode information are added to the directory entry. Once this is done the page is scheduled to be written to the disk. Also, the directory inode is marked dirty, since the timestamps of the directory inode change.

ext3:

To add the given inode to the directory, ext3_create calls ext3_add_nondir(), which eventually calls ext3_add_entry() function.

The ext3_add_entry() function loops through all the directory entries in the directory. As you may have guessed, the ext3 code cycles through the _buffers_ of the directory to find an appropriate directory entry. It modifies the directory entry so that the filename and inode information are added to it. This change to the contents of the buffer is also made within the journaling calls.

3 Mounting ext3

Now that we know how information is written to the journal, let us see how it is actually used by ext3 when a filesystem is mounted. This is an important step, since it is where a transaction that has been written completely to the journal, but not yet to its actual location, is recovered. To stay focused, we will only go through the code that is directly relevant to journaling.

Let us see ext3’s read super function to understand this :

if (ext3_load_journal(sb, es))    goto failed_mount2;

The function starts off by initialising most of the variables in the super_block structure. Once this is done, it calls ext3_load_journal(). The ext3_load_journal() function performs a set of tasks, which are listed below along with the jbd functions they use:

  • journal_init_dev/journal_init_inode : Initialise this journal and return a journal structure.
  • journal_update_format : updates the journal superblock to the latest format.
  • journal_wipe : wipes the journal safely. This is done by ext3 only if the filesystem is being mounted as read only.
  • journal_load : loads the journal into the memory. Performs journal recovery, if needed.