In the official NTFS implementation, all metadata changes to a file system are logged to ensure the consistent recovery of critical file system structures after a system crash. This is called write-ahead logging.
The logged metadata is stored in a file called “$LogFile”, which is found in a root directory of an NTFS file system.
Currently, there is no much documentation for this file available. Most sources are either too high-level (describing the logging and recovery processes in general) or just contain the layout of key structures without further description.
The process of metadata logging is based on two components: the log file service (LFS) and the NTFS client of the LFS (both are implemented as a part of the NTFS driver).
The LFS provides an interface for its clients to store a buffer in a circular (“infinite”) area of a log file and to read such buffers from that log file. In particular, the following simplified types of actions are supported:
- store a buffer (client data) as a log record, return its log sequence number (LSN);
- store a buffer (client data) as a restart area, return its LSN;
- if a log file is full, raise an exception for a client;
- mark previously stored data as unused;
- given an LSN, locate a stored buffer (client data) and return it;
- given an LSN, find a next LSN for the same client and return it (forward search);
- given an LSN, find a previous LSN for the same client and return it (backward search).
As you can see, the LFS is the data management layer for the NTFS logging component, the LFS doesn’t do the actual logging of metadata operations. Each buffer received from a client is opaque to the LFS (the LFS is only aware of a type of this buffer: whether it’s a log record or a client restart area).
The actual logging (and recovery) is implemented as a part of the NTFS client of the LFS. Each buffer sent from this component to the LFS contains something related to a transaction. Here, a transaction is a set of metadata changes necessary to complete a specific high-level operation.
For example, the following metadata changes are combined as a transaction when a file is renamed:
- delete an index entry (with an old file name) for a target file from a file name index within a parent directory;
- delete the $FILE_NAME attribute (with an old file name) from a target file record;
- create the $FILE_NAME attribute (with a new file name) in a target file record;
- add an index entry (with a new file name) for a target file in a file name index within a parent directory.
If all of these changes were applied to a volume successfully, then the transaction is marked as forgotten.
But before we get to the format of metadata changes used by the NTFS client, we need to dissect on-disk structures of the LFS.
First of all, since each client buffer stored in a log file is identified by an LSN, it’s important to understand how these LSNs are generated by the LFS.
Each LSN is a 64-bit number containing the following components: a sequence number and an offset. An offset is stored in the lower part of an LSN, its value is a number of 8-byte increments from the beginning of a log file. This offset points to an LFS structure containing a client buffer and related metadata, this structure is called an LFS record. A sequence number is stored in the higher part, it’s a value from a counter which is incremented when a log file is wrapped (when a new structure is written to the beginning of the circular area, not to the end of this area).
The number of bits reserved for the sequence number part of an LSN is variable, it depends on the size of a log file (and it’s recorded in it).
For example, if 44 bits are reserved for the sequence number part and the LSN is 2124332, then the sequence number is 2 and the offset is 27180 8-byte increments (217440 bytes).
The LSNs have an important property: they are always increasing. An LSN for a new entry is always greater than an LSN for an older entry (technically, these numbers can overflow, but they won’t, because it’s practically impossible to reach the 64-bit limit).
An LFS record is a structure containing a header and client data. The following data is stored in the LFS record header: an LSN for this record, a previous LSN for the same client, an LSN for the undo operation for the same client, a client ID, a transaction ID, a record type (a log record or a client restart area), length of client data, various flags. Many values mentioned before are specified by the client.
LFS records are written to LFS record pages. Each LFS record page is 4096 bytes in size (it’s equal to the page size), it contains a header (the first four bytes are “RCRD”) and one or more LFS records. Since client data can be large, two or more adjacent LFS record pages may be required to store one LFS record (thus, an LFS record can be larger than an LFS record page; only the first segment has the LFS record header).
Each LFS record page is protected by an update sequence array, which is used to detect failed (torn) writes. Here is a description of the protection process (source):
The update sequence array consists of an array of n USHORT values, where n is the size of the structure being protected divided by the sequence number stride. The first word contains the update sequence number, which is a cyclical counter of the number of times the containing structure has been written to disk. Next are the n saved USHORT values that were overwritten by the update sequence number the last time the containing structure was written to disk.
Each time the protected structure is about to be written to disk, the last word in each sequence number stride is saved to its respective position in the sequence number array, then it is overwritten with the next update sequence number. After the write, or whenever the structure is read, the saved word from the sequence number array is restored to its actual position in the structure. Before restoring the saved words on reads, all the sequence numbers at the end of each stride are compared with the actual sequence number at the start of the array. If any of these comparisons are not equal, then a failed multisector transfer has been detected.
(It should be noted that the stride is 512 bytes, even if an underlying drive has a larger sector size. Also, the size of an update sequence array isn’t n, but n+1.)
Here is the layout of a typical LFS record page:
Here is the layout of two LFS record pages containing a large LFS record:
Finally, the circular (“infinite”) area of a log file consists of many LFS record pages. As described before, LFS records written to a log file can wrap, so a large LFS record starting in the last LFS record page also hits the first LFS record page of the circular area.
When writing a new LFS record into a current LFS record page, existing LFS records in this page can be lost because of a torn write or a system crash. Thus, data that was successfully stored before can be lost because of a new write.
In order to protect against such scenarios, a special area exists in a log file. It’s located before the circular area.
In the version 1.1 of the LFS, a special area consists of two pages, which are used to store two copies of a current LFS record page. Before putting a new LFS record into a current LFS record page, this page is stored in the special area (the first copy). After putting a new LFS record into a current LFS record page, the modified page is also written to the special area (the second copy, the first copy isn’t overwritten by the second one).
If a torn write or a system crash occurs when writing the second copy, the first copy (without a new LFS record) will be available for the recovery. If everything is okay and the LFS needs the special area for a new update, then the second copy is written to the circular area of a log file (and the special area becomes available for a new update).
These two copies of a current LFS record page are called tail copies (because they always represent the latest LFS record page to be written to the circular area). The latest tail copy isn’t moved to the circular area immediately. So, in order to get a full set of LFS record pages during the recovery, the LFS should apply the latest tail page (or the valid one, if another tail page is invalid) to the circular area.
In the version 2.0 of the LFS, a special area consists of 32 pages. When the LFS needs to put a new LFS record into a current LFS record page or when the LFS prepares a new LFS record page with a single LFS record, the updated page (containing a new LFS record) or the new page is simply written to the special area (to an unused page).
If a torn write or a system crash occurs when writing that page, an older version of the same page from the special area is used. Occasionally, LFS record pages with latest data are moved to the circular area (and corresponding pages in the special area are marked as unused).
I don’t know how LFS record pages in this special area are called. I call them fast pages.
The new version of the LFS requires less writes by reducing the number of page transfers to the circular area. It should be noted that the version of a log file is downgraded to 1.1 during the clean shutdown by default (so an NTFS file system can be mounted using a previous version of Windows).
Also, Microsoft is going to release the version 3.0 of the LFS. This version will be used on DAX volumes. When a log file is mapped in the DAX mode, integrity of its pages is going to be protected using the CRC32 checksum (and there would be no update sequence arrays, because they won’t work well with byte-addressable memory). This will make things faster (no paging writes).
Finally, a log file begins with two restart pages, each one is 4096 bytes in size (again, it’s the page size; the first four bytes for each page are “RSTR”). These pages are also protected with update sequence arrays.
A restart page contains the LFS version number, a page size, and a restart area (not to be confused with a client restart area).
A restart area is a structure containing the latest LSN used (at the time when this structure was written), the number of clients of the LFS, the list of clients of the LFS, the number of bits used for the sequence number part of every LSN, as well as some data for sanity checks and forward compatibility (an offset of the first LFS record within an LFS record page, which is also an offset of the continuation of client data from a previous LFS record page, and a size of an LFS record header; both offsets allow unsupported fields to be ignored in LFS record pages and in LFS records).
A list of clients is composed of client records. A client record contains the oldest LSN required by this client, the LSN of the latest client restart area, the name of this client (as well as other information about this client). Currently, the only client is called “NTFS”.
Two restart pages provide reliability against a possible failure (a torn write or a system crash). These pages aren’t necessary synchronized.
Here is the generic layout of a log file:
When the LFS is asked to provide initial data for its client, it will read and return the latest client restart area according to an LSN recorded in the appropriate client record. (Later, during the logging operation, the LFS won’t touch the oldest LFS record required by each client.)
A client receives its latest restart area, interprets it (remember that the LFS is unaware of the client data format), and decides what actions (if any) must be taken. If a log record is needed, then a client asks the LFS to provide this record (as a buffer) by its LSN.
The NTFS client tells the LFS to write a client restart area at the end of the checkpoint operation. During a checkpoint, the NTFS client writes a set of log records containing data about current transactions followed by a restart area, which points to every piece of that data (using LSNs). During the recovery, the NTFS client uses this data to decide which transactions are committed and which aren’t: committed transaction must be performed again using their redo data (there is a chance that this data didn’t hit the volume), while uncommitted transaction must be rolled back using their undo data.
And now we can take a look at the format of client data!
There are three versions of the NTFS client data format: 0.0, 1.0, and 2.0.
The last one seems to be under development, because it’s not enabled yet. This new version removes redundant open attribute table dumps and attribute names dumps, which were previously made during a checkpoint (the same data can be reconstructed from log records, so there is no reason to waste the space and link these dumps to a client restart area).
Currently, only the first two versions are used: 0.0 and 1.0. There are no significant differences between them. The most notable difference, although not a really significant one, is the format of open attribute entries.
A client restart area contains major and minor version numbers of the NTFS client data format used, an LSN to be used as a starting point for the analysis pass (when the NTFS driver builds a table of transactions and a table of dirty data ranges). Also, a client restart area contains LSNs for a transaction table dumped to a log file from memory (this table can be absent as well), an open attribute table dumped to a log file from memory, a list of attribute names dumped to a log file from memory, and a dirty page table dumped to a log file from memory (which is used to track dirty data ranges).
An open attribute table and a list of attribute names reference a nonresident attribute opened for a log operation. An entry from an open attribute table contains an $MFT reference number for a file record which nonresident attribute has been opened and a type code of this attribute (e.g, $DATA). An entry from a list of attribute names contains a Unicode name of a nonresident attribute opened along with an index of a corresponding entry in the open attribute table.
And a log record written during an operation on a nonresident attribute contains an index of a target attribute in the open attribute table. Based on this information (an $MFT file reference, an attribute code, and an attribute name), it’s possible to locate a target attribute. Also, a log record contains an offset within a target attribute at which new data is going to be written.
It should be noted that no table referenced by a client restart area is in the up-to-date state. New items from log records after the client restart area should be accounted in these tables.
A log record is an actual descriptor of a logged operation. A log record contains a redo type and data (can be empty), an undo type and data (can be empty too), a number of a target $MFT file record segment (for operations on resident attributes and on $MFT data in general), an index of a target attribute within the open attribute table (for operations on nonresident attributes), and several fields used to calculate an offset within a target.
Redo data is written when a transaction is committed, undo data is written when a transaction is rolled back (to bring things back to their previous state). There are some exceptions, however: when a nonresident attribute is opened, its open attribute record is stored as redo data and its Unicode name is stored as undo data.
Here is a full list of log operation types (as of Windows 10, build 18323):
Noop CompensationLogRecord InitializeFileRecordSegment DeallocateFileRecordSegment WriteEndOfFileRecordSegment CreateAttribute DeleteAttribute UpdateResidentValue UpdateNonresidentValue UpdateMappingPairs DeleteDirtyClusters SetNewAttributeSizes AddIndexEntryRoot DeleteIndexEntryRoot AddIndexEntryAllocation DeleteIndexEntryAllocation WriteEndOfIndexBuffer SetIndexEntryVcnRoot SetIndexEntryVcnAllocation UpdateFileNameRoot UpdateFileNameAllocation SetBitsInNonresidentBitMap ClearBitsInNonresidentBitMap HotFix EndTopLevelAction PrepareTransaction CommitTransaction ForgetTransaction OpenNonresidentAttribute OpenAttributeTableDump AttributeNamesDump DirtyPageTableDump TransactionTableDump UpdateRecordDataRoot UpdateRecordDataAllocation UpdateRelativeDataIndex UpdateRelativeDataAllocation ZeroEndOfFileRecord
Here is a decoded transaction used to rename a file (from “aaa.txt” to “bbb.txt”).
It should be noted that updates to some attributes can be recorded partially. For example, an update to the $STANDARD_INFORMATION attribute can record data starting from the M timestamp (and the C timestamp, which is stored before the M timestamp, will be absent in the redo/undo data).
The only thing left is the meaning of every log operation. Not today!
How long does it take for old data to become overwritten with new data?
In one of my tests with Windows 10, it took 16 minutes. In another test with Windows 10, it took 5 hours and 20 minutes. In both tests, mouse movements were the only user activity.