Core on-disk format
Overview
The EROFS core on-disk format is designed to be as simple as possible, since one of the basic use cases of EROFS is as a drop-in replacement for tar or cpio.
Here are some design principles:
Data (except for inline data) is always block-based, but metadata may not be;
There are no centralized inode and directory tables because they are unfriendly to incremental image updates, metadata flexibility, and extensibility. It’s up to users whether inodes or directories are arranged one by one or not;
I/O amplification from extra metadata accesses should be as small as possible.
There are only three on-disk components needed to form a full filesystem tree: erofs_super_block, erofs_inode_{compact,extended}, and erofs_dirent.
If extended attribute support also needs to be considered, the additional components will still be limited.
Note that only erofs_super_block needs to be kept at a fixed offset, as mentioned below.
Superblock
The EROFS superblock is currently 128 bytes in size and records various information about the enclosing filesystem. It starts at an absolute offset of 1024 bytes; the first 1024 bytes are left unused so that other advanced formats can easily be built on top of the EROFS filesystem, and to allow for the installation of x86 boot sectors and other oddities.
The EROFS superblock is laid out as follows in struct erofs_super_block:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le32 | magic | Magic signature, 0xE0F5E1E2 |
0x4 | __le32 | checksum | Superblock checksum |
0x8 | __le32 | feature_compat | Compatible feature flags. The kernel can still read this fs even if it doesn’t understand a flag |
0xC | __u8 | blkszbits | Block size is 2^blkszbits. It should be no less than 9 (512-byte block size) |
0xD | __u8 | sb_extslots | The total superblock size is 128 + sb_extslots * 16. It should be 0 for future expansion |
0xE | __le16 | root_nid | NID (node number) of the root directory |
0x10 | __le64 | inos | Total valid inode count |
0x18 | __le64 | build_time | When the filesystem was created, in seconds since the epoch |
0x20 | __le32 | build_time_ns | Nanoseconds component of the above timestamp |
0x24 | __le32 | blocks | Total block count |
0x28 | __le32 | meta_blkaddr | Start block address of metadata area |
0x2C | __le32 | xattr_blkaddr | Start block address of shared xattr area |
0x30 | __u8 | uuid[16] | 128-bit UUID for volume |
0x40 | __u8 | volume_name[16] | Filesystem label |
0x50 | __le32 | feature_incompat | Incompatible feature flags. The kernel will refuse to mount if it doesn’t understand a flag |
0x54 | __le16 | available_compr_algs | Bitmap for compression algorithms used in this image (FEATURE_INCOMPAT_COMPR_CFGS is set) |
0x54 | __le16 | lz4_max_distance | Customized LZ4 window size. 0 means the default value (FEATURE_INCOMPAT_COMPR_CFGS isn’t set) |
0x56 | __le16 | extra_devices | Number of external devices. 0 means no extra device |
0x58 | __le16 | devt_slotoff | (Indicate the start address of the external device table) |
0x5A | __u8 | dirblkbits | Directory block size is 2^(blkszbits + dirblkbits). Always 0 for now |
0x5B | __u8 | xattr_prefix_count | Total number of long xattr name prefixes |
0x5C | __le32 | xattr_prefix_start | (Indicate the start address of long xattr prefixes) |
0x60 | __le64 | packed_nid | NID of the special packed inode, which is mainly used to keep fragments for now |
0x68 | __u8 | xattr_filter_reserved | Always 0 for reserved use |
0x69 | __u8 | reserved[23] | Reserved |
Inodes
Each valid on-disk inode should be aligned to a fixed inode slot (32-byte) boundary, which matches the compact inode size.
Each inode can be directly located using the following formula:
inode absolute offset = meta_blkaddr * block_size + 32 * NID
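The formula can be written as a small helper. This is only a sketch: the function name `inode_offset` is made up for illustration, and the block size is derived from the superblock's blkszbits field.

```python
INODE_SLOT_SIZE = 32  # fixed inode slot size in bytes


def inode_offset(meta_blkaddr: int, blkszbits: int, nid: int) -> int:
    """Return the absolute byte offset of the inode with the given NID."""
    block_size = 1 << blkszbits
    return meta_blkaddr * block_size + INODE_SLOT_SIZE * nid


# Example: 4096-byte blocks, metadata area at block 1, NID 36
print(inode_offset(meta_blkaddr=1, blkszbits=12, nid=36))  # prints 5248
```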
Valid inode sizes are either 32 or 64 bytes, which can be distinguished from a common field that all inode versions have – i_format:
32-byte compact inodes are defined as struct erofs_inode_compact as below:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le16 | i_format | Inode format hints (e.g. on-disk inode version, datalayout, etc.) |
0x2 | __le16 | i_xattr_icount | (Indicate the extended attribute metadata size of this inode) |
0x4 | __le16 | i_mode | File mode |
0x6 | __le16 | i_nlink | Hard link count |
0x8 | __le32 | i_size | Inode size in bytes |
0xC | __u8 | i_reserved[4] | Reserved |
0x10 | __u8 | i_u[4] | (Up to the specific inode datalayout) |
0x14 | __le32 | i_ino | Inode incremental number, mainly used for 32-bit stat(2) compatibility |
0x18 | __le16 | i_uid | Owner UID |
0x1A | __le16 | i_gid | Owner GID |
0x1C | __u8 | i_reserved2[4] | Reserved |
64-byte extended inodes are defined as struct erofs_inode_extended as below:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le16 | i_format | Inode format hints (e.g. on-disk inode version, datalayout, etc.) |
0x2 | __le16 | i_xattr_icount | (Indicate the extended attribute metadata size of this inode) |
0x4 | __le16 | i_mode | File mode |
0x6 | __le16 | i_reserved | Reserved |
0x8 | __le64 | i_size | Inode size in bytes |
0x10 | __u8 | i_u[4] | (Up to the specific inode datalayout) |
0x14 | __le32 | i_ino | Inode incremental number, mainly used for 32-bit stat(2) compatibility |
0x18 | __le32 | i_uid | Owner UID |
0x1C | __le32 | i_gid | Owner GID |
0x20 | __le64 | i_mtime | Inode timestamps derived from the original |
0x28 | __le32 | i_mtime_nsec | This provides nanosecond precision |
0x2C | __le32 | i_nlink | Hard link count |
0x30 | __u8 | i_reserved2[16] | Reserved |
inode.i_format contains format hints for each inode as below:
Bits | Width | Description |
---|---|---|
0 | 1 | Inode version (0 - compact; 1 - extended) |
1 | 3 | Inode data layout (0-4 are valid; 5-7 are reserved for now) |
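To show how i_format selects between the two inode layouts, here is a minimal Python sketch that reads the first 16 bits, extracts the version bit and the 3-bit data layout, and then unpacks the matching structure. The names `parse_inode`, `COMPACT`, and `EXTENDED` are hypothetical, and the 4-byte i_u field is read as one little-endian 32-bit value purely for convenience; its real interpretation depends on the data layout. The reserved field at offset 0x6 of the extended inode is treated as 2 bytes, as implied by the neighbouring offsets.

```python
import struct

# Little-endian layouts transcribed from the two inode tables.
COMPACT = struct.Struct("<HHHHI4sIIHH4s")        # 32 bytes
EXTENDED = struct.Struct("<HHHHQIIIIQII16s")     # 64 bytes

COMPACT_FIELDS = (
    "i_format", "i_xattr_icount", "i_mode", "i_nlink", "i_size",
    "i_reserved", "i_u", "i_ino", "i_uid", "i_gid", "i_reserved2",
)
EXTENDED_FIELDS = (
    "i_format", "i_xattr_icount", "i_mode", "i_reserved", "i_size",
    "i_u", "i_ino", "i_uid", "i_gid", "i_mtime", "i_mtime_nsec",
    "i_nlink", "i_reserved2",
)


def parse_inode(buf: bytes, offset: int = 0) -> dict:
    """Decode i_format, then unpack a compact or extended inode."""
    (i_format,) = struct.unpack_from("<H", buf, offset)
    version = i_format & 0x1              # bit 0: 0 = compact, 1 = extended
    datalayout = (i_format >> 1) & 0x7    # bits 1-3: inode data layout
    if datalayout > 4:
        raise ValueError("reserved data layout: %d" % datalayout)
    if version:
        layout, names = EXTENDED, EXTENDED_FIELDS
    else:
        layout, names = COMPACT, COMPACT_FIELDS
    inode = dict(zip(names, layout.unpack_from(buf, offset)))
    inode["version"] = version
    inode["datalayout"] = datalayout
    inode["inode_size"] = layout.size
    return inode
```

Peeking at i_format before choosing a structure works because that field sits at offset 0x0 in both inode versions.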
Inode data layouts
There are five valid data layouts in total for each inode to indicate how inode data is recorded on disk. Only three values are taken into account in the EROFS core on-disk format:
EROFS_INODE_FLAT_PLAIN (0): The consecutive physical blocks contain the entirety of the inode’s content, with the starting block address stored in inode.i_u.
EROFS_INODE_FLAT_INLINE (2): Except for the tail data block, all consecutive physical blocks hold the entire content of the inode, with the starting block address stored in inode.i_u. The tail block is kept within the block immediately following the on-disk inode metadata.
EROFS_INODE_CHUNK_BASED (4): The entire inode is split into several fixed-size chunks. Each chunk has consecutive physical blocks.
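For the two flat layouts, the split between block-based data and the inline tail can be sketched as below. This is an illustration under assumptions rather than a reference implementation: `flat_data_extents` is a hypothetical helper, i_u is assumed to have been read as a 32-bit block address, and the helper only reports how many tail bytes are inlined, not where they sit, because locating them also requires the inode and xattr sizes. Chunk-based inodes are not handled.

```python
EROFS_INODE_FLAT_PLAIN = 0
EROFS_INODE_FLAT_INLINE = 2


def flat_data_extents(datalayout: int, i_u: int, i_size: int, blkszbits: int) -> dict:
    """Return the byte range held in consecutive blocks and the inline tail length."""
    start = i_u << blkszbits  # starting block address converted to bytes
    if datalayout == EROFS_INODE_FLAT_PLAIN:
        # All content lives in consecutive physical blocks.
        return {"block_bytes": (start, start + i_size), "inline_tail_len": 0}
    if datalayout == EROFS_INODE_FLAT_INLINE:
        # Whole blocks are stored consecutively; the remainder is the tail.
        full = (i_size >> blkszbits) << blkszbits
        return {"block_bytes": (start, start + full), "inline_tail_len": i_size - full}
    raise ValueError("layout %d is not a flat layout" % datalayout)


# A 5000-byte file with 4096-byte blocks, starting at block 7
print(flat_data_extents(EROFS_INODE_FLAT_INLINE, 7, 5000, 12))
# prints {'block_bytes': (28672, 32768), 'inline_tail_len': 904}
```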
Directories
All on-disk directories are now organized in the form of directory blocks.
Each directory block is split into two variable-size parts (directory entries and filenames) in order to make random lookups work. All directory entries (including . and ..) are strictly recorded in alphabetical order to enable the improved prefix binary search algorithm.
Each directory entry is defined as 12-byte struct erofs_dirent:
Offset | Size | Name | Description |
---|---|---|---|
0x0 | __le64 | nid | Node number of the inode that this directory entry points to |
0x8 | __le16 | nameoff | Start offset of the file name in this directory block |
0xA | __u8 | file_type | |
0xB | __u8 | reserved | Reserved |
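Note that nameoff0 (the nameoff of the first directory entry) also indicates the total number of directory entries in this directory block: the entries are stored back to back from the start of the block and the first filename immediately follows the last entry, so the entry count is nameoff0 / 12.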
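A directory block can therefore be walked with nothing more than the entry array itself. The Python sketch below is a minimal illustration: `parse_dir_block` is a hypothetical name, and it assumes that each filename ends where the next entry's name begins, and that the last filename ends at the first NUL byte or at the end of the block. That termination rule is an assumption of this sketch, not something spelled out in the text above.

```python
import struct

DIRENT = struct.Struct("<QHBB")  # nid, nameoff, file_type, reserved (12 bytes)


def parse_dir_block(block: bytes) -> list:
    """Return a list of (name, nid, file_type) tuples for one directory block."""
    _, nameoff0, _, _ = DIRENT.unpack_from(block, 0)
    count = nameoff0 // DIRENT.size  # nameoff0 encodes the entry count
    entries = [DIRENT.unpack_from(block, i * DIRENT.size) for i in range(count)]

    result = []
    for i, (nid, nameoff, file_type, _reserved) in enumerate(entries):
        # A name runs until the next entry's name starts (assumption: the
        # last name is NUL-terminated or runs to the end of the block).
        end = entries[i + 1][1] if i + 1 < count else len(block)
        name = block[nameoff:end].split(b"\0", 1)[0]
        result.append((name, nid, file_type))
    return result
```

Because the entries are sorted by name, a lookup could binary-search over the returned list instead of scanning it linearly.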
Note
Other alternative forms (e.g., Eytzinger order) were also considered (that is why we once had .*_classic naming). Here are some reasons that those forms weren’t supported:
Filenames are variable-sized strings, which makes Eytzinger order harder to utilize unless namehash is also introduced, but that also complicates the overall implementation and expands the directory sizes;
Also, it makes it harder to keep filenames and directory entries in the same directory block (especially for large directories) to minimize I/O amplification;
readdir(3) will be impacted too if we’d like to keep alphabetical order strictly.
If there are some better ideas to resolve these, the on-disk definition could be updated in the future.