The Linux Journaling Block Device
Date: Jan 20, 2006
Atomicity is the property of an operation that either succeeds completely or fails completely. Disks assure atomicity at the sector level. This means that a write to a sector either goes through completely or not at all. But when an operation spans multiple sectors of the disk, a higher-level mechanism is needed. This mechanism should ensure that modifications to the entire set of sectors are handled atomically. Failure to do so leads to inconsistencies.
Let’s look at how these inconsistencies could be introduced to a filesystem. Say we have an application that creates a file. The filesystem internally has to decrease the number of free inodes by one, initialize the inode on the disk and add an entry to the parent directory for the newly created file. But what happens if the machine crashes after only the first operation is executed? In this circumstance, an inconsistency has been introduced in the filesystem. The number of free inodes has decreased, but no initialization of the inode has been performed on the disk.
The only way to detect these inconsistencies is by scanning the entire filesystem. This task is performed by fsck, the filesystem consistency check. In large installations, the consistency check can take many hours to find and fix inconsistencies. As you might have guessed, such downtime is not desirable. A better approach is to avoid introducing inconsistencies in the first place, which can be accomplished by making operations atomic. Journaling is one way to provide this atomicity.
Simply stated, using journaling is like using a scratch pad. You perform operations on the scratch pad, and once you are satisfied that the operations are correct, you reflect them in a fairer copy.
In the case of filesystems, all the metadata and data are stored on the block device for the filesystem. Journaling filesystems use a journal, or log area, as the scratch pad. A journal may be a part of the same block device or it may be a separate device in itself. A journaling filesystem first records all the operations it has performed in the journal. Only once the set of operations that makes up one single atomic operation has completed and been recorded in the journal is it written to the actual block device. Henceforth, the term disk is used to indicate the actual block device, whereas the term journal is used for the log area.
Journal Recovery Scenarios
The example operation from above requires that three blocks be modified—the inode count block, the block containing the on-disk inode and the block holding the directory where the entry is to be added. All of these blocks first are written to the journal. After that, a special block, called the commit record, is written to the journal. The commit record is used to indicate that all the blocks belonging to a single atomic operation are written to the journal.
Given journaling behavior, then, here is how a journaling filesystem reacts in the following three basic scenarios:
- The machine crashes after only the first block is flushed to the journal. In this case, when the machine comes back up again and checks the journal, it finds an operation with no commit record at the end. This indicates that it may not be a completed operation. Hence, no modifications are done to the disk, preserving the consistency.
- The machine crashes after the commit record is flushed to the journal. In this case, when the machine comes back up again and checks the journal, it finds an operation with the commit record at the end. The commit record indicates that this is a completed operation and could be written to the disk. All the blocks belonging to this operation are written at their actual locations on the disk, replaying the journal.
- The machine crashes after all the three blocks are flushed to the journal but the commit record is not yet flushed to the journal. Even in this case, because of the absence of the commit record, no modifications are done to the disk. The scenario thus is reduced to the scenario described in the first case.
Likewise, any other crash scenario can be reduced to one of the scenarios listed above.
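The three scenarios can be tried out with a small toy model. The following Python sketch is not JBD code; the function name replay, the "commit" marker and the block labels are assumptions made purely for illustration. A crash is simulated by truncating the journal after a given number of records.

```python
# Toy model of the three crash scenarios; not JBD code.
# All names (replay, "commit", the block labels) are illustrative assumptions.

def replay(journal, disk):
    """Copy journaled blocks to disk only if a commit record closes them."""
    pending = {}
    for record in journal:
        if record == "commit":
            disk.update(pending)   # the operation is complete: apply it
            pending = {}
        else:
            location, contents = record
            pending[location] = contents
    return disk                    # records without a commit are ignored

# The file-creation example: three blocks, then a commit record.
operation = [
    ("inode_count", "free inodes - 1"),
    ("inode", "initialized"),
    ("directory", "new entry"),
    "commit",
]

# Simulate a crash by truncating the journal after n records.
after_first_block = replay(operation[:1], {})
after_three_blocks = replay(operation[:3], {})
after_commit = replay(operation[:4], {})

print(after_first_block)    # nothing is replayed
print(after_three_blocks)   # nothing is replayed
print(after_commit)         # all three blocks are replayed
```

In this model the disk is touched only when the commit record is present, which is why the first and third scenarios behave identically.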
Thus, journaling guarantees consistency for the filesystem. The time required for looking up the journal and replaying the journal is minimal as compared to that taken by the filesystem consistency check.
Journaling Block Device
The Linux Journaling Block Device (JBD) provides this scratch pad for providing atomicity in operations. Thus, a filesystem controlling a block device can make use of JBD on the same or on another block device in order to maintain consistency. The JBD is a modular implementation that exposes a set of APIs for the use of such applications. The following sections describe the concepts and implementation of the Linux JBD as present in the Linux 2.6 kernel.
Before we move on to the implementation details of the JBD, an understanding of some of the objects that JBD uses is required. A journal is a log that internally manages updates for a single block device. As mentioned above, the updates first are stored in the journal and then are reflected to their real locations on the disk. The area belonging to the journal is managed like a circular linked list. That is, when the journal reaches the end of its area, it wraps around and reuses space from the beginning.
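The circular reuse of the journal area can be pictured with a fixed-size ring. The sketch below is a simplified model, not the JBD data structure; the class CircularLog and its head, tail and used fields are hypothetical names chosen for this example.

```python
# Toy ring-shaped log; names and layout are illustrative assumptions.

class CircularLog:
    """Fixed-size log whose write position wraps around to the start."""

    def __init__(self, nblocks):
        self.blocks = [None] * nblocks
        self.head = 0    # next slot to write
        self.tail = 0    # oldest slot still in use
        self.used = 0

    def append(self, block):
        if self.used == len(self.blocks):
            raise RuntimeError("journal full: space must be reclaimed first")
        self.blocks[self.head] = block
        self.head = (self.head + 1) % len(self.blocks)
        self.used += 1

    def reclaim(self, count):
        """Release the oldest `count` slots so they can be reused."""
        count = min(count, self.used)
        self.tail = (self.tail + count) % len(self.blocks)
        self.used -= count

log = CircularLog(4)
for name in ["a", "b", "c", "d"]:
    log.append(name)
log.reclaim(2)     # the two oldest blocks are no longer needed
log.append("e")    # lands in slot 0, reusing the space once held by "a"
```

Appending to a full log fails until older entries are released, which in this model stands in for the space reclamation described later in the article.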
A handle represents a single atomic update. The entire set of changes/writes that should be performed atomically are carried out with reference to a single handle.
It may not be an efficient approach to flush each atomic update (handle) to the journal, however. To achieve better performance, the JBD bunches a set of handles together into a transaction and flushes this transaction to the journal. The JBD ensures that the transaction is atomic in nature. Hence, the handles, which are the subcomponents of the transaction, also are guaranteed to be atomic.
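A rough way to see the benefit of batching is to count journal flushes. In this Python sketch the classes Transaction and Handle and the function commit are illustrative assumptions, not the JBD structures of the same purpose.

```python
# Toy model of handles batched into one transaction; not JBD code.

class Transaction:
    def __init__(self):
        self.blocks = {}     # location -> latest contents
        self.handles = 0     # number of atomic updates that joined

class Handle:
    """One atomic update; its writes join the enclosing transaction."""

    def __init__(self, transaction):
        self.transaction = transaction
        transaction.handles += 1

    def write(self, location, contents):
        self.transaction.blocks[location] = contents

journal_writes = []          # each entry stands for one flush to the journal

def commit(transaction):
    journal_writes.append(dict(transaction.blocks))

txn = Transaction()
create = Handle(txn)
create.write("inode_count", 9)
create.write("inode_12", "new file")
rename = Handle(txn)
rename.write("directory_2", "renamed entry")
commit(txn)                  # two handles, one flush to the journal
```

Because the whole transaction is flushed as a unit, either both handles reach the journal or neither does.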
The most important property of a transaction is its state. When a transaction is being committed, it follows the lifecycle of states listed below.
- Running: the transaction currently is live and can accept new handles. A journal can have only one transaction in the running state at a time.
- Locked: the transaction does not accept any new handles but existing handles are not complete. Once all the existing handles are completed, the transaction goes to the next state.
- Flush: all the handles in a transaction are complete. The transaction is writing itself to the journal.
- Commit: the entire transaction log has been written to the journal. The transaction is writing a commit block indicating that the transaction log in the journal is complete.
- Finished: the transaction is written completely to the journal. It has to remain there until the blocks are updated to the actual locations on the disk.
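The lifecycle above is a strictly forward progression, which can be sketched as a small state machine. The enum and helper functions here are assumptions made for illustration; they only mirror the state names in the list.

```python
# Toy state machine for the transaction lifecycle; not JBD code.
from enum import Enum

class State(Enum):
    RUNNING = 0
    LOCKED = 1
    FLUSH = 2
    COMMIT = 3
    FINISHED = 4

def next_state(state):
    """Advance one step; states are never skipped or revisited."""
    if state is State.FINISHED:
        raise ValueError("a finished transaction has no further state")
    return State(state.value + 1)

def accepts_new_handles(state):
    """Only a running transaction can accept new handles."""
    return state is State.RUNNING

state = State.RUNNING
path = [state]
while state is not State.FINISHED:
    state = next_state(state)
    path.append(state)

print([s.name for s in path])
```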
Transaction Committing and CheckPointing
A running transaction is written to the journal area after a certain period. Thus, a transaction can be either in-memory (running) or on-disk. Flushing a transaction to the journal and marking that particular transaction as finished is a process called transaction commit.
The journal has a limited area under its control, and it needs to reuse this area. Committed transactions that have had all their blocks written to the disk no longer need to be kept in the journal. Checkpointing, then, is the process of flushing the finished transactions to the disk and reclaiming the corresponding space in the journal. It is discussed in more detail later in this article.
The JBD layer performs journaling of the metadata, during which the data simply is written to the disk without being journaled. But this does not stop applications from journaling the data, as it could be presented to the JBD as metadata itself.
A kjournald thread is associated with every journaled device. The kjournald thread ensures that the running transaction is committed after a specific interval. The transaction commit code is divided into nine phases, numbered 0 through 8 and described below. Figure 1 shows a logical layout of a journal.
Phase 0: moves the transaction from running state (T_RUNNING) to locked state (T_LOCKED), meaning the transaction no longer can issue new handles. The transaction waits until all the existing handles have completed. A set of buffers always is reserved for a transaction when it is initiated. Some of these buffers may be unused and are unfiled in this phase. The transaction now is ready to be committed with no outstanding handles.
Phase 1: the transaction enters into the flush state (T_FLUSH). The transaction is marked as a currently committing transaction for the journal. This phase also marks that no running transaction exists for the journal; therefore, new requests for handles initiate a new transaction.
Phase 2: the actual buffers of the transaction are flushed to the disk. Data buffers go first. There are no complications here, as data buffers are not saved in the log area. Instead, they are flushed directly to their actual positions on the disk. This phase ends when the I/O completion notifications for all such buffers are received.
Phase 3: all the data buffers are written to the disk but their metadata still is in volatile memory. Metadata flushing is not as straightforward as data buffer flushing, because metadata needs to be written to the log area and the actual positions on the disk need to be remembered. This phase starts with flushing these metadata buffers, for which a journal descriptor block is acquired. The journal descriptor block stores the mapping of each metadata buffer in the journal to its actual location on the disk in the form of tags. After this, metadata buffers are flushed to the journal. Once the journal descriptor is full of tags or all metadata buffers are flushed to the journal, the journal descriptor also is flushed to the journal. Now we have all the metadata buffers in the journal, and their actual positions on the disk are remembered. This data, being persistent, can be used for recovery if failure occurs.
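The role of the descriptor block and its tags can be illustrated with a simplified layout. This Python sketch assumes that each descriptor block sits in the journal ahead of the metadata blocks it describes and that a tag is just a disk location; the function log_metadata and the parameter tags_per_descriptor are hypothetical.

```python
# Toy layout of descriptor blocks and metadata blocks; not the JBD format.

def log_metadata(buffers, tags_per_descriptor):
    """Return the sequence of blocks written to the journal.

    buffers: list of (disk_location, contents) pairs.
    A new descriptor block is started whenever the current one is full.
    """
    journal = []
    for start in range(0, len(buffers), tags_per_descriptor):
        group = buffers[start:start + tags_per_descriptor]
        tags = [location for location, _ in group]
        journal.append(("descriptor", tags))
        journal.extend(("metadata", contents) for _, contents in group)
    return journal

journal = log_metadata(
    [(100, "inode count"), (2048, "inode"), (4096, "directory")],
    tags_per_descriptor=2,
)
for block in journal:
    print(block)
```

With room for only two tags per descriptor, the third metadata buffer forces a second descriptor block. The tags are what let recovery map a journaled block back to its location on the disk.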
Phase 4 and Phase 5: both phase 4 and phase 5 wait on I/O completion notifications of metadata buffers and journal descriptor blocks, respectively. The buffers are unfiled from in-memory lists once I/O completion is received.
Phase 6: all the data and metadata are on safe storage, data at its actual locations and metadata in the journal. Now the transaction needs to be marked as committed so that it can be known that all the updates are safe in the journal. For this reason, a journal descriptor block again is allocated. A tag is written stating that the transaction has committed successfully, and the block is synchronously written to its position in the journal. After this, the transaction is moved to the committed state, T_COMMIT.
Phase 7: occurs when a number of transactions are present in the journal, without yet being flushed to the disk. Some of the metadata buffers in this transaction already may be a part of some previous transaction. These need not be kept in the older transactions as we have their latest copy in the current committed transaction. Such buffers are removed from older transactions.
Phase 8: the transaction is marked as being in the finished state, T_FINISHED. The journal structure is updated to reflect this particular transaction as the latest committed transaction. It also is added to the list of transactions to be checkpointed.
Checkpointing is initiated when the journal is being flushed to the disk (think of unmount) or when a new handle is started. A new handle can fall short of its guaranteed number of buffers, so it may be necessary to carry out a checkpointing process in order to free some space in the journal.
The checkpointing process flushes the metadata buffers of a transaction that have not yet been written to their actual locations on the disk. The transaction then is removed from the journal. The journal can have multiple checkpointing transactions, and each checkpointing transaction can have multiple buffers. The process considers each such transaction, and for each one, it finds the metadata buffers that need to be flushed to the disk. All these buffers are flushed in one batch. Once all the transactions are checkpointed, their log is removed from the journal.
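As a rough model of checkpointing, the sketch below writes each finished transaction's metadata to its home location and then counts the journal blocks that become reusable. The dictionary layout and the function checkpoint are assumptions made for this example only.

```python
# Toy checkpointing pass; not JBD code.

def checkpoint(checkpoint_list, disk):
    """Write each finished transaction's metadata home, then drop its log."""
    for transaction in checkpoint_list:
        batch = dict(transaction["metadata"])   # buffers still to be written
        disk.update(batch)                      # flushed in one batch
        transaction["metadata"].clear()
    reclaimed = sum(t["journal_blocks"] for t in checkpoint_list)
    checkpoint_list.clear()                     # log space can now be reused
    return reclaimed

disk = {"inode_count": 10}
checkpoint_list = [
    {"metadata": {"inode_count": 9, "inode_12": "new file"}, "journal_blocks": 4},
    {"metadata": {"inode_count": 8}, "journal_blocks": 3},
]
freed = checkpoint(checkpoint_list, disk)
```

Transactions are processed oldest first, so when the same block appears in more than one transaction, the newest copy is the one that remains on the disk.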
When the system comes up after a crash and finds that the log entries are not null, this indicates that the last unmount was not successful or never occurred. At this point, a recovery needs to be attempted. Figure 2 depicts a sample physical layout of a journal. The recovery takes place in three phases.
- PASS_SCAN: the end of the log is found.
- PASS_REVOKE: a list of revoked blocks is prepared from the log.
- PASS_REPLAY: unrevoked blocks are rewritten (replayed) in order to guarantee the consistency of the disk.
For recovery, the only available information is the journal itself. But the exact state of the journal is unknown, as we do not know the point at which the system crashed. Hence, the last transaction could be in the checkpointing or committing state. A running transaction cannot be found, as it existed only in memory.
For committing transactions, we have to forget the updates made, as all of the updates may not be in place. So in the PASS_SCAN phase, the last log entry in the log is found. From here, the recovery process knows which transactions need to be replayed.
Every transaction can have a set of revoked blocks. This is important to know in order to prevent older journal records from being replayed on top of newer data using the same block. In PASS_REVOKE, a hash table of all these revoked blocks is prepared. This table is used every time we need to find out whether a particular block should get written to the disk through a replay.
In the last phase, all the blocks that need to be replayed are considered. Each block is tested for its presence in the revoked blocks’ hash table. If the block is not in there, it is safe to write the block to its actual location on the disk. If the block is there, only the newest version of the block is written to the disk. Notice that we have not changed anything in the on-disk journal. Hence, even if the system crashes again while the recovery is in progress, no harm is done. The same journal is present for the recovery next time, and no non-idempotent operation is performed during the process of recovery.
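The three passes can be put together in a toy recovery routine. This sketch is not the JBD recovery code: the journal is modeled as a list of dictionaries, and the rule that a revoke recorded in transaction N suppresses copies of that block logged in transaction N or earlier is an assumption of this example.

```python
# Toy three-pass recovery; data layout and revoke rule are assumptions.

def recover(journal):
    """journal: transactions in log order, each a dict with the keys
    'seq', 'blocks' {location: contents}, 'revoked' [locations], 'committed'.
    Returns the blocks that replay would write to the disk."""
    # PASS_SCAN: find the end of the log (the last committed transaction).
    valid = []
    for transaction in journal:
        if not transaction["committed"]:
            break
        valid.append(transaction)

    # PASS_REVOKE: build a table of revoked blocks.
    revoked = {}
    for transaction in valid:
        for location in transaction["revoked"]:
            revoked[location] = transaction["seq"]

    # PASS_REPLAY: write back every block that is not revoked.
    disk_writes = {}
    for transaction in valid:
        for location, contents in transaction["blocks"].items():
            if location in revoked and transaction["seq"] <= revoked[location]:
                continue   # an older copy of a revoked block is skipped
            disk_writes[location] = contents
    return disk_writes

journal = [
    {"seq": 1, "blocks": {50: "old dir", 60: "bitmap v1"}, "revoked": [], "committed": True},
    {"seq": 2, "blocks": {60: "bitmap v2"}, "revoked": [50], "committed": True},
    {"seq": 3, "blocks": {70: "half-written"}, "revoked": [], "committed": False},
]
result = recover(journal)
```

The routine only reads the journal, so running it a second time after an interrupted recovery produces the same set of writes.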
Copyright (c) 2004-2006 Kedar Sovani and Amey Inamdar