Kedar Sovani (ksovani@kernelcorp.com)
Journaling is a term borrowed from databases. One of the characteristics of a database is the atomic nature of transactions: a transaction either goes through completely or does not go through at all, so its effects are either seen completely or not seen at all. Hard disks provide atomicity at the sector level, which means that a write started on a sector of the hard disk either goes through completely or does not happen at all. But when a transaction spans multiple sectors on the hard disk, some higher-level mechanism is needed to make sure that the modifications to the entire set of sectors occur atomically. For example, if a transaction has to modify 3 sectors and the machine crashes after modifying only 2 of them, the database is left inconsistent, since the 3rd sector still contains stale data.
Databases use a technique called journaling to maintain the atomicity of operations. This technique consists of writing all the modified sectors to a separate portion of the storage called the journal, instead of overwriting the actual locations on the hard disk. The actual locations on the hard disk are overwritten with the contents of this journal _only_ after making sure that all the sectors associated with the transaction have been flushed to the journal. To make it possible to tell whether all the sectors belonging to a transaction are present in the journal, a commit record is flushed to the journal at the end of the transaction. Let us see what happens when the machine crashes in the following scenarios with the above example:
Going a step further, it may be noted that in the first case there could be another crash during the replaying of the journal information. But no harm is caused since the next time around, the same information will be present in the journal (as the contents of the journal are not yet modified), thus creating a situation similar to the first case.
Thus, with the help of journaling it can be assured that transactions are committed atomically to the disk.
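To make the mechanism concrete, here is a minimal user-space sketch in C of a journal with a commit record. It is not how jbd or any database implements journaling: the "disk" is just an array in memory, and every name in it (toy_disk, toy_journal_write(), toy_journal_commit(), toy_journal_replay(), toy_journal_reset()) is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SECTOR_SIZE   8
#define DISK_SECTORS  16
#define JOURNAL_SLOTS 4

/* One journaled sector: where it belongs on the disk, and its new contents. */
struct journal_entry {
    int  target;
    char data[SECTOR_SIZE];
};

/* Hypothetical in-memory "disk": final locations plus a journal area. */
struct toy_disk {
    char sectors[DISK_SECTORS][SECTOR_SIZE];
    struct journal_entry journal[JOURNAL_SLOTS];
    int  journal_len;   /* entries flushed to the journal so far */
    bool committed;     /* true once the commit record is written */
};

/* Write a modified sector to the journal, not to its final location. */
bool toy_journal_write(struct toy_disk *d, int target, const char *data)
{
    if (d->committed || d->journal_len >= JOURNAL_SLOTS)
        return false;
    if (target < 0 || target >= DISK_SECTORS)
        return false;
    d->journal[d->journal_len].target = target;
    memcpy(d->journal[d->journal_len].data, data, SECTOR_SIZE);
    d->journal_len++;
    return true;
}

/* Flush the commit record: the transaction is now complete in the journal. */
void toy_journal_commit(struct toy_disk *d)
{
    d->committed = true;
}

/*
 * Copy journaled sectors to their final locations, but only if the commit
 * record is present. The journal itself is left untouched, so replaying
 * again after a crash gives the same result. Returns the sectors copied.
 */
int toy_journal_replay(struct toy_disk *d)
{
    int copied = 0;

    assert(d->journal_len <= JOURNAL_SLOTS);
    if (!d->committed)
        return 0;       /* incomplete transaction: ignore it */
    for (int i = 0; i < d->journal_len; i++) {
        memcpy(d->sectors[d->journal[i].target],
               d->journal[i].data, SECTOR_SIZE);
        copied++;
    }
    return copied;
}

/* Discard the journal contents once they are no longer needed. */
void toy_journal_reset(struct toy_disk *d)
{
    d->journal_len = 0;
    d->committed = false;
}
```

Calling toy_journal_replay() on a journal that has no commit record copies nothing, so a half-written transaction never reaches the final locations. Calling it twice on a committed journal produces the same result, which is why a crash during replay is harmless.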
In the case of filesystems, all metadata updates should be journaled so that the metadata stays consistent. Without journaling, a crash may leave the filesystem metadata stored on the disk in an inconsistent state. Special programs that check the consistency of the filesystem then have to find and fix any inconsistencies caused by the crash. This check is a time-consuming process, and the time taken is directly proportional to the size of the disk.
Journaling filesystems reduce boot time by replacing these usually time-consuming filesystem checks with fast and efficient journal replays.
It should be noted that journaling guarantees the consistency of metadata. It does not make any guarantees about the consistency of data associated with a file.
The ext3 filesystem is a journaling filesystem. It uses the journaling facilities provided by the jbd module in the Linux kernel.
The ext3 code base is essentially the ext2 code base with journaling added. Apart from journaling support, it makes no changes to the ext2 filesystem. This will become clearer in the code snippets discussed in this section.
The designers of ext3 split the design into two components. The first is a journaling layer that does nothing but journaling and is independent of the ext3 filesystem on top of it, the objective being that this layer could also serve other modules that need journaling support. Hence, jbd exposes an API which ext3 and other modules can use to incorporate journaling.
ext3 needs some way of telling the journaling layer which set of updates forms a single atomic update. To address this, the journaling layer provides the concept of a handle. When ext3 wants to perform an atomic update, it tells the journaling layer the number of block updates that constitute this single atomic update. This is accomplished by the journal_start() function call, which returns a handle to ext3. The handle is an opaque structure as far as the ext3 layer is concerned. ext3 uses this handle whenever it communicates with the journaling layer, to identify the update in progress. Once the update is complete, the handle is released with the journal_stop() function call.
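The life cycle of a handle can be pictured with a small user-space model. This is only a sketch of the idea described above, not the jbd implementation: toy_journal, toy_handle, toy_journal_start(), toy_journal_log_block() and toy_journal_stop() are hypothetical names, and the "credits" field simply models the number of block updates announced when the handle was started.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical journal state, reduced to two counters. */
struct toy_journal {
    int open_handles;   /* atomic updates currently in progress */
    int blocks_logged;  /* block updates recorded so far */
};

/* Hypothetical handle: identifies one atomic update. */
struct toy_handle {
    struct toy_journal *journal;
    int credits;        /* block updates still allowed under this handle */
};

/* Begin an atomic update that will touch at most nblocks blocks. */
struct toy_handle *toy_journal_start(struct toy_journal *j, int nblocks)
{
    struct toy_handle *h;

    if (nblocks <= 0)
        return NULL;
    h = malloc(sizeof *h);
    if (h == NULL)
        return NULL;
    h->journal = j;
    h->credits = nblocks;
    j->open_handles++;
    return h;
}

/* Record one block update as part of the atomic update behind h. */
int toy_journal_log_block(struct toy_handle *h)
{
    assert(h != NULL);
    if (h->credits == 0)
        return -1;      /* more updates than were announced at start */
    h->credits--;
    h->journal->blocks_logged++;
    return 0;
}

/* End the atomic update and destroy the handle. */
void toy_journal_stop(struct toy_handle *h)
{
    assert(h != NULL);
    h->journal->open_handles--;
    free(h);
}
```

A caller would start a handle for, say, 2 blocks, call toy_journal_log_block() once per modified block, and finish with toy_journal_stop(). In this model a third call to toy_journal_log_block() fails, which illustrates why the number of block updates has to be stated up front.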
To understand how ext3 uses journaling, let us first trace through the code for creating a filesystem object (file/directory) in the ext3 filesystem. Comparing the ext3 object creation code with the corresponding ext2 code will help in understanding the precise changes made for journaling. For simplicity, let us focus on the creation of a file rather than the creation of a directory.
Here is the code that handles file creation in ext2 and ext3, ext2 first.
/* ext2: body of ext2_create() */
{
    struct inode * inode = ext2_new_inode (dir, mode);
    int err = PTR_ERR(inode);

    if (!IS_ERR(inode)) {
        inode->i_op = &ext2_file_inode_operations;
        inode->i_fop = &ext2_file_operations;
        inode->i_mapping->a_ops = &ext2_aops;
        mark_inode_dirty(inode);
        err = ext2_add_nondir(dentry, inode);
    }
    return err;
}
/* ext3: body of ext3_create() */
{
    handle_t *handle;
    struct inode * inode;
    int err;

    handle = ext3_journal_start(dir, EXT3_DATA_TRANS_BLOCKS + 3);
    if (IS_ERR(handle))
        return PTR_ERR(handle);
    if (IS_SYNC(dir))
        handle->h_sync = 1;
    inode = ext3_new_inode (handle, dir, mode);
    err = PTR_ERR(inode);
    if (!IS_ERR(inode)) {
        inode->i_op = &ext3_file_inode_operations;
        inode->i_fop = &ext3_file_operations;
        inode->i_mapping->a_ops = &ext3_aops;
        err = ext3_add_nondir(handle, dentry, inode);
    }
    ext3_journal_stop(handle, dir);
    return err;
}
struct inode * ext2_new_inode(const struct inode * dir, int mode)
This function accepts a directory inode (in which the file is to be
created) and the mode in which it has to be created and returns a
newly created and initialised in-memory inode structure. The inode
is marked dirty in the function itself.
sb = dir->i_sb;
inode = new_inode(sb);
This code allocates an in-core uninitialised inode belonging to the
directory's superblock. The new_inode function returns an inode with
the minimum number of fields required.
group=find_group_other(sb, dir->u.ext2_i.i_block_group);
ext2 filesystem divides the available space into groups for better management.
This function finds the correct group to which this inode should belong.
Also, the ext2 filesystem keeps track of the number of free inodes in
a given group. This information is updated for the group which
is found. The function also arranges for the corresponding buffers
to be synced by marking them dirty.
if (IS_ERR(bh))
    goto fail2;
i = ext2_find_first_zero_bit ((unsigned long *) bh->b_data,
                              EXT2_INODES_PER_GROUP(sb));
if (i >= EXT2_INODES_PER_GROUP(sb))
    goto bad_count;
ext2_set_bit(i, bh->b_data);
mark_buffer_dirty(bh);
es->s_free_inodes_count =
    cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1);
mark_buffer_dirty(sb->u.ext2_sb.s_sbh);
The number of free inodes in the superblock is decremented and the
superblock buffer is marked as dirty.
mark_inode_dirty(inode);
All the fields in the inode are initialised and then the inode is
marked dirty.
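The allocation step in the ext2 code above (find the first zero bit in the inode bitmap, set it, decrement the free count, mark the buffer dirty) can be sketched in plain C as follows. This is a simplified user-space model, not the kernel code: toy_group, toy_find_first_zero_bit() and toy_alloc_inode() are hypothetical names, the group holds only 32 inodes, and "dirty" is just a flag.

```c
#include <assert.h>
#include <limits.h>

#define TOY_INODES_PER_GROUP 32

/* Hypothetical block group: an inode bitmap plus its bookkeeping. */
struct toy_group {
    unsigned char bitmap[TOY_INODES_PER_GROUP / CHAR_BIT];
    int free_inodes;    /* free inode count kept for the group */
    int bitmap_dirty;   /* stands in for mark_buffer_dirty() */
};

/* Return the index of the first clear bit, or nbits if every bit is set. */
int toy_find_first_zero_bit(const unsigned char *map, int nbits)
{
    assert(nbits >= 0);
    for (int i = 0; i < nbits; i++)
        if (!(map[i / CHAR_BIT] & (1u << (i % CHAR_BIT))))
            return i;
    return nbits;
}

/* Allocate one inode from the group; returns its index, or -1 if full. */
int toy_alloc_inode(struct toy_group *g)
{
    int i = toy_find_first_zero_bit(g->bitmap, TOY_INODES_PER_GROUP);

    if (i >= TOY_INODES_PER_GROUP)
        return -1;                      /* no free inode in this group */
    g->bitmap[i / CHAR_BIT] |= (unsigned char)(1u << (i % CHAR_BIT));
    g->free_inodes--;                   /* keep the free count in step */
    g->bitmap_dirty = 1;                /* schedule the bitmap for writing */
    return i;
}
```

Starting from an empty group, successive calls to toy_alloc_inode() hand out indices 0, 1, 2 and so on, and the call fails once all 32 bits are set. The ext3 version performs the same bitmap manipulation, so the comparison that follows is only about how the modified buffers reach the disk.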
Now let us take a look at the corresponding ext3 code:
struct inode * ext3_new_inode (handle_t *handle, const struct inode * dir, int mode)
As can be seen, the function accepts the same arguments, the
directory inode and the mode in which the file has to be created,
plus one additional parameter, the handle. Like its ext2 counterpart,
this function returns a newly created and initialised in-memory inode
structure, and the inode is marked dirty in the function itself.
sb = dir->i_sb;
inode = new_inode(sb);
This code is identical to the ext2 code. Now we have an inode
structure with only the minimum fields initialised.
The next change we see is that ext3 cannot simply use the find_group_other() function the way the ext2 code does. Why? This function updates metadata, and since all metadata changes must go to the journal first, special care needs to be taken.
ext3 finds the correct group to which the inode should belong and also
keeps a pointer to the buffer head for the group's metadata.
bitmap_nr = load_inode_bitmap (sb, i);
if (bitmap_nr < 0)
    goto fail;
bh = sb->u.ext3_sb.s_inode_bitmap[bitmap_nr];
if ((j = ext3_find_first_zero_bit ((unsigned long *) bh->b_data,
                                   EXT3_INODES_PER_GROUP(sb))) <
    EXT3_INODES_PER_GROUP(sb)) {
This code is pretty much similar to what ext2 was doing. The inode
bitmap is loaded and the first zero bit is taken into the variable
j.
    err = ext3_journal_get_write_access(handle, bh);
    if (err)
        goto fail;
    if (ext3_set_bit (j, bh->b_data)) {
        ext3_error (sb, "ext3_new_inode", "bit already set for inode %d", j);
        goto repeat;
    }
    BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
    err = ext3_journal_dirty_metadata(handle, bh);
    if (err)
        goto fail;
journal_get_write_access() tells the journaling layer that this buffer is about to be modified and will have to be written to the journal. The journal_dirty_metadata() call, in turn, informs the journaling layer that the changes to this buffer have been made and that it can now journal the buffer as part of the current atomic update.
As you can see, the modifications to the buffer are made after the journal_get_write_access() call and before the journal_dirty_metadata() call.
Most of the differences between ext2 and ext3 stem from the same concept.
The idea is to find the buffer head for metadata which has to be modified,
and to carry out the modifications enclosed within the ext3_journal_get_write_access()
and ext3_journal_dirty_metadata() function calls.
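The bracketing pattern can be modelled in a few lines of user-space C. The sketch below is an illustration under simplifying assumptions, not the jbd code: toy_buffer, toy_txn, toy_get_write_access(), toy_dirty_metadata() and toy_update_counter() are hypothetical names, and a "transaction" is just a list of the buffers that were modified under it.

```c
#include <assert.h>

#define TOY_BUF_SIZE 16
#define TOY_MAX_BUFS 8

/* Hypothetical metadata buffer. */
struct toy_buffer {
    char data[TOY_BUF_SIZE];
    int  write_access;  /* set once write access has been announced */
    int  journaled;     /* set once the buffer is part of the update */
};

/* Hypothetical atomic update: the buffers modified under it. */
struct toy_txn {
    struct toy_buffer *bufs[TOY_MAX_BUFS];
    int nbufs;
};

/* Announce that b is about to be modified as part of t. */
int toy_get_write_access(struct toy_txn *t, struct toy_buffer *b)
{
    assert(t != 0 && b != 0);
    b->write_access = 1;
    return 0;
}

/* Report that b has been modified; add it to t at most once. */
int toy_dirty_metadata(struct toy_txn *t, struct toy_buffer *b)
{
    assert(t != 0 && b != 0);
    if (!b->write_access)
        return -1;      /* modification was not announced first */
    if (!b->journaled) {
        if (t->nbufs >= TOY_MAX_BUFS)
            return -1;
        t->bufs[t->nbufs++] = b;
        b->journaled = 1;
    }
    return 0;
}

/* A metadata change enclosed between the two calls. */
int toy_update_counter(struct toy_txn *t, struct toy_buffer *b, char value)
{
    int err = toy_get_write_access(t, b);

    if (err)
        return err;
    b->data[0] = value;             /* the actual modification */
    return toy_dirty_metadata(t, b);
}
```

In this model, calling toy_dirty_metadata() on a buffer for which write access was never requested fails, and modifying the same buffer twice adds it to the transaction only once.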
Continuing with our call trace,
err = ext3_journal_get_write_access(handle, bh2);
if (err)
    goto fail;
gdp->bg_free_inodes_count =
    cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
if (S_ISDIR(mode))
    gdp->bg_used_dirs_count =
        cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1);
BUFFER_TRACE(bh2, "call ext3_journal_dirty_metadata");
err = ext3_journal_dirty_metadata(handle, bh2);
if (err)
    goto fail;
BUFFER_TRACE(sb->u.ext3_sb.s_sbh, "get_write_access");
err = ext3_journal_get_write_access(handle, sb->u.ext3_sb.s_sbh);
if (err)
    goto fail;
es->s_free_inodes_count =
    cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1);
BUFFER_TRACE(sb->u.ext3_sb.s_sbh, "call ext3_journal_dirty_metadata");
err = ext3_journal_dirty_metadata(handle, sb->u.ext3_sb.s_sbh);
err = ext3_mark_inode_dirty(handle, inode);
Eventually, the inode is marked dirty, as is the case with the ext2 code. Again, the difference is that an extra handle parameter is passed to this function. The ext3_mark_inode_dirty() function will eventually find a buffer head for the on-disk inode and copy the inode information to the buffer containing the on-disk (raw) inode image. Again, this is done by enclosing the changes within the journaling calls.
To add the given inode to the directory, ext2_create calls ext2_add_nondir(), which eventually calls the ext2_add_link() function.
The ext2_add_link() is quite a simple function. It loops
through all the directory entries present in the directory. It cycles
through all the _pages_ of the directory, to find an appropriate
directory entry. It modifies the directory entry such that the filename
and inode information are added to the directory entry. Once this
is done the page is scheduled to be written to the disk. Also, the
directory inode is marked dirty, since the timestamps of the directory
inode change.
To add the given inode to the directory, ext3_create calls ext3_add_nondir(), which eventually calls the ext3_add_entry() function.
The ext3_add_entry() function loops through all the directory
entries in the directory. As you would have guessed, the ext3 code
cycles through the _buffers_ of the directory to find an appropriate
directory entry. It modifies the directory entry such that the filename
and inode information are added to the directory entry. This change
to the contents of the buffer is made within the journaling
calls as well.
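Both ext2_add_link() and ext3_add_entry() boil down to a scan for a usable directory entry followed by filling it in. A heavily simplified user-space sketch of that scan is given below. It assumes fixed-size entries in a single array, which is not the on-disk layout, and toy_dirent, toy_dir and toy_add_link() are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define TOY_NAME_LEN  16
#define TOY_DIR_SLOTS 4

/* Hypothetical fixed-size directory entry. */
struct toy_dirent {
    unsigned int inode;         /* 0 marks an unused entry */
    char name[TOY_NAME_LEN];
};

/* Hypothetical directory: a handful of entries plus a dirty flag. */
struct toy_dir {
    struct toy_dirent slots[TOY_DIR_SLOTS];
    int dirty;                  /* set when the directory must be written */
};

/* Add (name, inode) to the first unused entry; returns its index or -1. */
int toy_add_link(struct toy_dir *dir, const char *name, unsigned int inode)
{
    assert(dir != 0 && name != 0);
    if (inode == 0 || strlen(name) >= TOY_NAME_LEN)
        return -1;
    for (int i = 0; i < TOY_DIR_SLOTS; i++) {
        if (dir->slots[i].inode == 0) {
            strcpy(dir->slots[i].name, name);   /* length checked above */
            dir->slots[i].inode = inode;
            dir->dirty = 1;
            return i;
        }
    }
    return -1;                  /* directory is full */
}
```

In ext2 the step that sets the dirty flag corresponds to scheduling the page for writing, while in ext3 the modification of the buffer would sit between the write-access and dirty-metadata calls described earlier.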
Now that we know how information is written to the journal, let us see how it is actually used by ext3 when a filesystem is mounted. This is one of the most important parts, since this is where a transaction that has been written completely to the log, but not to its actual location, is recovered. To stay focused, we will only go through the code that is directly relevant to journaling.
Let us look at ext3's read super function to understand this:
goto failed_mount2;
The translation was initiated by Kedar Sovani on 2007-01-22