Core On-disk Format#
Overview#
The EROFS core on-disk format is designed to be as simple as possible, since one of the basic use cases of EROFS is as a drop-in replacement for tar or cpio.
The format design principles are as follows:
- Data (except for inline data) is always block-based; metadata is not strictly block-based.
- There are no centralized inode or directory tables, because such tables are not suitable for incremental image updates, metadata flexibility, or extensibility. It is up to users to decide whether inodes or directories are arranged one by one.
- I/O amplification from extra metadata access should be as small as possible.
There are only three on-disk components to form a full filesystem tree: superblock, inodes, and directory entries.
Note that only the superblock needs to be kept at a fixed offset, as mentioned below.
Conformance to Core Format#
An EROFS image conforms to the core on-disk format if and only if all of the following conditions are met:
- The is_compressed field (offset 0x54, 2 bytes) in the superblock is 0.
- All bits in feature_compat and feature_incompat, except those listed in the Feature Flags section below, are 0.
An image that does not meet these conditions uses one or more optional features described in separate feature-specific documents. Note that the core on-disk format has been supported ever since Linux 5.4. Later extensions are therefore not part of it: the 48-bit layout, for example, is excluded, and not all users need 48-bit block addressing.
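As a rough illustration, the two conditions can be checked directly on the raw superblock bytes. This is only a sketch: the function name is hypothetical, and the value of `CORE_COMPAT_MASK` (bits 0x1 and 0x2 for the two compatible flags named in the Feature Flags section) is an assumption that should be verified against the implementation you target.

```python
import struct

SB_OFFSET = 1024  # fixed absolute offset of the superblock

# Assumption: the two core compatible flags occupy bits 0x1 and 0x2.
CORE_COMPAT_MASK = 0x00000001 | 0x00000002
# The core format defines no incompatible feature flags.
CORE_INCOMPAT_MASK = 0


def conforms_to_core_format(image: bytes) -> bool:
    """Return True if the image only uses core on-disk format features."""
    sb = image[SB_OFFSET:SB_OFFSET + 128]
    if len(sb) < 128:
        return False
    (feature_compat,) = struct.unpack_from("<I", sb, 0x08)
    (feature_incompat,) = struct.unpack_from("<I", sb, 0x50)
    (is_compressed,) = struct.unpack_from("<H", sb, 0x54)
    return (
        is_compressed == 0
        and (feature_compat & ~CORE_COMPAT_MASK) == 0
        and (feature_incompat & ~CORE_INCOMPAT_MASK) == 0
    )
```

The check deliberately ignores the magic number and the checksum; it only answers whether optional features are in use.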
Superblock#
The EROFS superblock is located at a fixed absolute offset of 1024 bytes.
Its base size is 128 bytes. When sb_extslots is non-zero, the total superblock
size is 128 + sb_extslots * 16 bytes. The first 1024 bytes are unused, which
allows for support of other advanced formats based on EROFS, as well as the
installation of x86 boot sectors and other oddities.
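The size rule above can be sketched as follows. The helper name and constants are hypothetical; the only on-disk fact used is that sb_extslots is the single byte at superblock offset 0x0D, as listed in the field table below.

```python
from typing import Tuple

SB_OFFSET = 1024      # fixed absolute offset of the superblock
SB_BASE_SIZE = 128    # base superblock size in bytes
SB_EXTSLOT_SIZE = 16  # size of one extension slot in bytes


def superblock_extent(image: bytes) -> Tuple[int, int]:
    """Return (start, end) byte offsets of the superblock, extension slots included."""
    sb_extslots = image[SB_OFFSET + 0x0D]
    size = SB_BASE_SIZE + sb_extslots * SB_EXTSLOT_SIZE
    return SB_OFFSET, SB_OFFSET + size
```

With no extension slots this yields the range 1024 to 1152; with two slots the end moves to 1184.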
Field Definitions#
| Offset | Size | Type | Name | Description |
|---|---|---|---|---|
| 0x00 | 4 | | | Magic signature: 0xE0F5E1E2 |
| 0x04 | 4 | | checksum | CRC32-C checksum of the superblock block; see Superblock Checksum |
| 0x08 | 4 | | feature_compat | Compatible feature flags; see Feature Flags |
| 0x0C | 1 | | blkszbits | Block size = 2^blkszbits |
| 0x0D | 1 | | sb_extslots | Number of 16-byte superblock extension slots |
| 0x0E | 2 | | | Root directory NID |
| 0x10 | 8 | | inos | Total inode count; see blocks and inos Fields |
| 0x18 | 8 | | epoch | Filesystem creation time, seconds since UNIX epoch |
| 0x20 | 4 | | fixed_nsec | Nanoseconds component shared by all compact inodes; see Modification Time in Compact Inodes |
| 0x24 | 4 | | blocks | Total block count; see blocks and inos Fields |
| 0x28 | 4 | | meta_blkaddr | Start block address of the inode-metadata zone |
| 0x2C | 4 | | | Start block address of the extended attribute zone |
| 0x30 | 16 | | | 128-bit UUID for the volume |
| 0x40 | 16 | | | Filesystem label (not null-terminated if 16 bytes) |
| 0x50 | 4 | | feature_incompat | Incompatible feature flags; see Feature Flags |
| 0x54 | 2 | | is_compressed | 0 for non-compressed images, any non-zero value for compressed images |
| 0x56 | 4 | | dontcare | External device support specific; ignored in core format |
| 0x5A | 1 | | dirblkbits | Directory block size = 2^(blkszbits + dirblkbits) |
| 0x5B | 5 | | dontcare | Xattr specific; ignored in core format |
| 0x60 | 8 | | dontcare | Compression specific; ignored in core format |
| 0x68 | 1 | | reserved | Reserved; must be 0 |
| 0x69 | 1 | | dontcare | Xattr specific; ignored in core format |
| 0x6A | 2 | | reserved | Reserved; must be 0 |
| 0x6C | 12 | | dontcare | 48-bit layout specific; ignored in core format |
| 0x78 | 8 | | reserved | Reserved; must be 0 |
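The offsets and sizes in the table map onto a single little-endian `struct` format string. The sketch below is illustrative only: the tuple field names `magic`, `root_nid`, `xattr_blkaddr`, `uuid`, `volume_name` and the `dontcare_*`/`reserved_*` placeholders are hypothetical labels chosen for this example, not names taken from this document.

```python
import struct
from collections import namedtuple

SB_OFFSET = 1024
# One format code per table row, in offset order; "<" means little-endian, no padding.
SB_FORMAT = "<IIIBBHQQIIII16s16sIH4sB5s8sBB2s12s8s"
assert struct.calcsize(SB_FORMAT) == 128  # matches the base superblock size

Superblock = namedtuple("Superblock", [
    "magic", "checksum", "feature_compat", "blkszbits", "sb_extslots",
    "root_nid", "inos", "epoch", "fixed_nsec", "blocks", "meta_blkaddr",
    "xattr_blkaddr", "uuid", "volume_name", "feature_incompat",
    "is_compressed", "dontcare_0x56", "dirblkbits", "dontcare_0x5b",
    "dontcare_0x60", "reserved_0x68", "dontcare_0x69", "reserved_0x6a",
    "dontcare_0x6c", "reserved_0x78",
])


def parse_superblock(image: bytes) -> Superblock:
    """Unpack the 128-byte base superblock found at offset 1024."""
    return Superblock._make(struct.unpack_from(SB_FORMAT, image, SB_OFFSET))
```

The parser performs no validation; magic and feature checks are left to the caller.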
Note the difference between reserved and dontcare fields:
- reserved: Users must not use these fields, and they must be filled with 0 to comply with the supported features or reserve for future use.
- dontcare: Users can safely use these for other purposes as long as the corresponding incompatible feature flag is not set.
Magic Number#
The magic number at offset 0x00 must be 0xE0F5E1E2 (little-endian). A reader must
reject any image whose first four bytes at offset 1024 do not match this value.
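A minimal sketch of the magic check, assuming the whole image is available as a bytes object (the function name is hypothetical). Because the value is stored little-endian, the bytes on disk read E2 E1 F5 E0.

```python
import struct

SB_OFFSET = 1024
EROFS_MAGIC = 0xE0F5E1E2


def has_erofs_magic(image: bytes) -> bool:
    """Check the 32-bit little-endian magic at absolute offset 1024."""
    if len(image) < SB_OFFSET + 4:
        return False
    (magic,) = struct.unpack_from("<I", image, SB_OFFSET)
    return magic == EROFS_MAGIC
```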
Superblock Checksum#
When EROFS_FEATURE_COMPAT_SB_CHKSUM is set, the checksum field contains a
CRC32-C digest. The digest is computed from byte offset 1024 to the end of the
filesystem block that contains the superblock, with the four bytes of the
checksum field itself treated as zero during computation.
For example, when blkszbits is 12 (block size is 4 KiB):

| Offset | Size | Description | Checksum covered |
|---|---|---|---|
| 0 | 1024 | Padding | No |
| 1024 | 4 | Magic number | Yes |
| 1028 | 4 | Checksum field in superblock, filled with zero | Yes |
| 1032 | 3064 | Remaining bytes in the filesystem block | Yes |
Tip: Some implementations (e.g., java.util.zip.CRC32C) apply a final bit-wise inversion. If the superblock checksum does not match, try inverting it.
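The computation can be sketched in pure Python. Several points here are assumptions rather than statements from this document: the function names are hypothetical, the initial CRC state of 0xFFFFFFFF with no final inversion is inferred from the tip above, and the covered range is taken to end at the end of the block containing offset 1024, as in the 4 KiB example table.

```python
import struct

SB_OFFSET = 1024
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def crc32c_raw(data: bytes, crc: int = 0xFFFFFFFF) -> int:
    """Bitwise CRC32-C without the final inversion (assumed convention)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc


def superblock_checksum(image: bytes, blkszbits: int) -> int:
    """Checksum of the block holding the superblock, starting at offset 1024."""
    block_size = 1 << blkszbits
    # Assumption: coverage stops at the end of the block containing offset 1024.
    end = (SB_OFFSET // block_size + 1) * block_size
    region = bytearray(image[SB_OFFSET:end])
    region[4:8] = b"\x00" * 4  # the checksum field itself is treated as zero
    return crc32c_raw(bytes(region))
```

A verifier would compare the result with the stored checksum field; if the comparison fails, comparing against the bit-wise inverse is a reasonable second attempt. The bit-by-bit loop favours clarity over speed.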
Feature Flags#
feature_compat: Compatible Feature Flags#
A mount implementation that does not recognise a bit in feature_compat may still
mount the filesystem without loss of correctness.
| Bit mask | Name | Description |
|---|---|---|
| | EROFS_FEATURE_COMPAT_SB_CHKSUM | Superblock CRC32-C checksum is present; see Superblock Checksum |
| | EROFS_FEATURE_COMPAT_MTIME | Per-inode mtime is stored in extended inodes |
Note
For new filesystem builders, it is recommended to always set
EROFS_FEATURE_COMPAT_MTIME, since it indicates that all inode timestamps
record modification time (mtime) rather than change time (ctime).
feature_incompat: Incompatible Feature Flags#
A runtime implementation that does not implement a feature indicated by any set bit in
feature_incompat must refuse to mount the entire filesystem.
The core on-disk format defines no incompatible feature flags. A non-zero
feature_incompat value indicates one or more non-core feature extensions.
blocks and inos Fields#
The blocks and inos fields are primarily intended for statvfs(3)
reporting. For dynamically generated EROFS filesystems, these fields can be set
to 0.
Implementations should not use the blocks field to validate whether a
block address or NID is valid. Such checks are unnecessary: an invalid block
address or NID simply results in -EIO or in reading corrupted (meta)data,
without causing any harmful behavior.
Furthermore, a maliciously crafted image can easily bypass bounds checking by
modifying the blocks field accordingly, making such validation meaningless.
Inodes#
Each on-disk inode must be aligned to a 32-byte inode slot boundary; the slot
size matches the compact inode size. Given a NID nid, its inode can
be located in O(1) time by computing the absolute byte offset as follows:
inode_offset = meta_blkaddr * block_size + 32 * nid
The NIDs for the root directory and special-purpose inodes are stored in the
superblock. Valid inode sizes are either 32 bytes (compact) or 64 bytes
(extended), distinguished by bit 0 of the i_format field.
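The offset formula translates directly into code. The helper name is hypothetical, and block_size is derived from blkszbits as described in the superblock section.

```python
INODE_SLOT_SIZE = 32  # bytes per inode slot


def inode_offset(meta_blkaddr: int, blkszbits: int, nid: int) -> int:
    """Absolute byte offset of the on-disk inode identified by nid."""
    block_size = 1 << blkszbits
    return meta_blkaddr * block_size + INODE_SLOT_SIZE * nid
```

For instance, with meta_blkaddr = 1, 4 KiB blocks and nid = 36, the inode starts at byte 4096 + 1152 = 5248.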
Compact Inode (32 bytes)#
Defined as struct erofs_inode_compact:
| Offset | Size | Type | Name | Description |
|---|---|---|---|---|
| 0x00 | 2 | | i_format | Inode format hints; see i_format Field |
| 0x02 | 2 | | reserved | Xattr specific; must be 0 if no xattrs |
| 0x04 | 2 | | | File type and permission bits |
| 0x06 | 2 | | | Hard link count |
| 0x08 | 4 | | i_size | File size in bytes (32-bit) |
| 0x0C | 4 | | reserved | 48-bit layout specific; ignored in core format |
| 0x10 | 4 | | i_u | Union; see i_u Union |
| 0x14 | 4 | | | Inode serial number for 32-bit |
| 0x18 | 2 | | | Owner UID (16-bit) |
| 0x1A | 2 | | | Owner GID (16-bit) |
| 0x1C | 4 | | reserved | Reserved; must be 0 |
Modification Time in Compact Inodes#
Due to space constraints, compact inodes cannot store a full 64-bit per-inode
timestamp, let alone an additional nanosecond field. Consequently, when the
48-bit layout extension is unused, the effective timestamp for all compact
inodes is (epoch, fixed_nsec), which has been the case since Linux 5.4.
Extended Inode (64 bytes)#
Defined as struct erofs_inode_extended:
| Offset | Size | Type | Name | Description |
|---|---|---|---|---|
| 0x00 | 2 | | i_format | Inode format hints; see i_format Field |
| 0x02 | 2 | | reserved | Xattr specific; must be 0 if no xattrs |
| 0x04 | 2 | | | File type and permission bits |
| 0x06 | 2 | | reserved | Reserved; must be 0 |
| 0x08 | 8 | | i_size | File size in bytes (64-bit) |
| 0x10 | 4 | | i_u | Union; see i_u Union |
| 0x14 | 4 | | | Inode serial number for 32-bit |
| 0x18 | 4 | | | Owner UID (32-bit) |
| 0x1C | 4 | | | Owner GID (32-bit) |
| 0x20 | 8 | | | Modification time, seconds since UNIX epoch |
| 0x28 | 4 | | | Nanoseconds component of the modification time |
| 0x2C | 4 | | | Hard link count (32-bit) |
| 0x30 | 16 | | reserved | Reserved; must be 0 |
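Both inode variants can be decoded with one helper that dispatches on bit 0 of i_format. This is a sketch only: the dictionary keys other than i_format, i_size and i_u are hypothetical labels, and compact inodes take their timestamp from the superblock pair (epoch, fixed_nsec) as described in Modification Time in Compact Inodes.

```python
import struct

COMPACT_FMT = "<HHHHIIIIHHI"       # 32 bytes, one code per compact table row
EXTENDED_FMT = "<HHHHQIIIIQII16s"  # 64 bytes, one code per extended table row


def parse_inode(buf: bytes, offset: int, epoch: int, fixed_nsec: int) -> dict:
    """Decode a compact or extended inode located at buf[offset:]."""
    (i_format,) = struct.unpack_from("<H", buf, offset)
    if i_format & 1:  # bit 0 set: extended (64-byte) inode
        (_, _xattr, mode, _rsv, i_size, i_u, ino, uid, gid,
         mtime, mtime_nsec, nlink, _rsv2) = struct.unpack_from(EXTENDED_FMT, buf, offset)
    else:             # bit 0 clear: compact (32-byte) inode
        (_, _xattr, mode, nlink, i_size, _rsv, i_u, ino, uid, gid,
         _rsv2) = struct.unpack_from(COMPACT_FMT, buf, offset)
        mtime, mtime_nsec = epoch, fixed_nsec  # shared superblock timestamp
    return {
        "i_format": i_format, "mode": mode, "nlink": nlink, "i_size": i_size,
        "i_u": i_u, "ino": ino, "uid": uid, "gid": gid,
        "mtime": (mtime, mtime_nsec),
    }
```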
i_format Field#
The i_format field is present at offset 0x00 in both inode variants and encodes
layout metadata:
| Bits | Width | Description |
|---|---|---|
| 0 | 1 | Inode version: 0 = compact (32-byte), 1 = extended (64-byte) |
| 1–3 | 3 | Data layout: values 0–4 are defined; 5–7 are reserved. See Inode Data Layouts |
| 4 | 1 | 48-bit layout specific; ignored in core format |
| 5–15 | 11 | Reserved; must be 0 |
Note
When bits 1–3 contain reserved values (5–7), the inode uses an unsupported data layout. Implementations must reject such inodes and return an appropriate error (e.g., "not supported"). This typically indicates a maliciously crafted or corrupted image.
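The bit fields can be extracted with shifts and masks. In this sketch the function name is hypothetical and `ValueError` stands in for whatever "not supported" error an implementation would really return.

```python
from typing import Tuple


def decode_i_format(i_format: int) -> Tuple[bool, int]:
    """Split i_format into (is_extended, data_layout)."""
    is_extended = bool(i_format & 0x1)  # bit 0: inode version
    layout = (i_format >> 1) & 0x7      # bits 1-3: data layout
    if layout > 4:                      # 5-7 are reserved values
        raise ValueError("unsupported data layout %d" % layout)
    return is_extended, layout
```

For example, an i_format of 0x0005 decodes to an extended inode with data layout 2.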
i_u Union#
The i_u field (4 bytes at offset 0x10) is interpreted based on the data layout:
| Name | Applicable when | Description |
|---|---|---|
| startblk | Flat inodes | Starting block number |
| | Character/block device inodes | Device ID |
Inode Data Layouts#
The data layout of an inode is encoded in bits 1–3 of i_format. The core format
defines two flat layouts.
EROFS_INODE_FLAT_PLAIN (0)#
i_u is interpreted as startblk (the 32-bit starting block address).
The inode's data lies in consecutive blocks starting from that address,
occupying ceil(i_size / block_size) consecutive blocks.
EROFS_INODE_FLAT_INLINE (2)#
i_u is interpreted as startblk (the 32-bit starting block address).
The inode's data lies in consecutive blocks starting from that address, except
for the tail part (i_size % block_size bytes), which is inlined in the same
block as the inode, immediately following the inode metadata. If i_size is
small enough that the entire content fits in the inline tail, there are no
preceding blocks and i_u is a don't-care field.
Note
This layout is not allowed if the tail inode data block cannot be inlined (e.g., if inlining the tail data would cause the inode to cross a physical block boundary).
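To make the two flat layouts concrete, the sketch below returns the byte ranges that hold a file's data. It is an illustration under stated assumptions: the function name and parameters are hypothetical, the inline tail is assumed to start immediately after the on-disk inode with no xattr area in between, and `ValueError` stands in for a real error code.

```python
FLAT_PLAIN = 0
FLAT_INLINE = 2


def flat_data_extents(layout, i_size, startblk, block_size, inode_off, inode_size):
    """Return [(absolute_byte_offset, length), ...] covering the file data."""
    if layout == FLAT_PLAIN:
        return [(startblk * block_size, i_size)] if i_size else []
    if layout == FLAT_INLINE:
        tail = i_size % block_size  # bytes stored inline
        head = i_size - tail        # bytes stored in whole blocks
        extents = []
        if head:
            extents.append((startblk * block_size, head))
        if tail:
            # Assumption: no xattrs, so the tail directly follows the inode.
            tail_off = inode_off + inode_size
            if tail_off // block_size != (tail_off + tail - 1) // block_size:
                raise ValueError("inline tail crosses a block boundary")
            extents.append((tail_off, tail))
        return extents
    raise ValueError("not a flat layout")
```

A 4196-byte FLAT_INLINE file with 4 KiB blocks therefore has one full block at startblk plus a 100-byte tail next to its inode.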
Directories#
All on-disk directories are organized in the form of directory blocks of size
2^(blkszbits + dirblkbits). dirblkbits is strictly 0 for now.
Directory Block Structure#
Each directory block is divided into two contiguous regions:
- An array of fixed-size directory entry records starting from the beginning of the block.
- Variable-length filename strings following the directory entry array.
The nameoff field of the first entry in a block indicates the total number of
directory entries in that block:
entry_count = nameoff[0] / sizeof(erofs_dirent)
All entries within a directory block, including . and .., are stored in strict
lexicographic (byte-value ascending) order to enable an improved prefix binary
search algorithm.
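The entry count can be read from the first record alone. This sketch relies on the record layout given in the Directory Entry Record section (a 12-byte record with nameoff at offset 0x08); the function name and the sanity checks are additions made for this example.

```python
import struct

DIRENT_SIZE = 12  # 8-byte NID + 2-byte nameoff + 1-byte file_type + 1 reserved byte


def dirent_count(block: bytes) -> int:
    """Number of directory entries stored in one directory block."""
    (nameoff0,) = struct.unpack_from("<H", block, 0x08)
    # Defensive checks (not mandated by the text): names start right after
    # the record array, so nameoff[0] must be a positive multiple of 12.
    if nameoff0 == 0 or nameoff0 % DIRENT_SIZE or nameoff0 > len(block):
        raise ValueError("corrupted directory block")
    return nameoff0 // DIRENT_SIZE
```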
Directory Entry Record#
Defined as struct erofs_dirent:
| Offset | Size | Type | Name | Description |
|---|---|---|---|---|
| 0x00 | 8 | | | Node number of the target inode |
| 0x08 | 2 | | nameoff | Byte offset of the filename within this directory block |
| 0x0A | 1 | | file_type | File type code (see below) |
| 0x0B | 1 | | reserved | Reserved; must be 0 |
file_type Values#
| Value | Constant | POSIX type |
|---|---|---|
| 0 | | Unknown |
| 1 | | Regular file |
| 2 | | Directory |
| 3 | | Character device |
| 4 | | Block device |
| 5 | | FIFO |
| 6 | | Socket |
| 7 | | Symbolic link |
Filename Encoding#
Filenames are not null-terminated (\0), except possibly the last one in each
directory block: if the last filename does not extend to the end of the block,
the remaining bytes must start with 0x00.
The length of entry i is therefore derived as follows:
- For all entries except the last: nameoff[i+1] - nameoff[i].
- For the last entry in the block: strnlen(filename, block_end - nameoff[last]).
No character encoding is mandated; UTF-8 is recommended.
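The length rules above are enough to decode a whole directory block and look a name up. The sketch below uses hypothetical function names and a plain binary search over the sorted names, not the improved prefix binary search mentioned earlier.

```python
import struct

DIRENT_FMT = "<QHBB"                       # nid, nameoff, file_type, reserved
DIRENT_SIZE = struct.calcsize(DIRENT_FMT)  # 12 bytes


def parse_dir_block(block: bytes) -> list:
    """Return [(name, nid, file_type), ...] for one directory block."""
    (_, nameoff0, _, _) = struct.unpack_from(DIRENT_FMT, block, 0)
    count = nameoff0 // DIRENT_SIZE
    records = [struct.unpack_from(DIRENT_FMT, block, i * DIRENT_SIZE)
               for i in range(count)]
    entries = []
    for i, (nid, nameoff, file_type, _rsv) in enumerate(records):
        if i + 1 < count:
            end = records[i + 1][1]  # the next entry's nameoff
        else:
            nul = block.find(b"\x00", nameoff)  # strnlen up to the block end
            end = len(block) if nul == -1 else nul
        entries.append((block[nameoff:end], nid, file_type))
    return entries


def lookup(block: bytes, name: bytes):
    """Binary search by byte-value order; returns the entry or None."""
    entries = parse_dir_block(block)
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        if entries[mid][0] < name:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(entries) and entries[lo][0] == name:
        return entries[lo]
    return None
```

Python compares bytes objects by byte value, which matches the ordering that directory blocks are required to use.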
Note
Other alternative forms (e.g., Eytzinger order) were also considered (that is
why there was once .*_classic naming). Here are some reasons those forms were
not supported:
- Filenames are variable-sized strings, which makes Eytzinger order harder to utilize unless namehash is also introduced, but that complicates the overall implementation and expands directory sizes.
- It is harder to keep filenames and directory entries in the same directory block (especially large directories) to minimize I/O amplification.
- readdir(3) would be impacted too if strict alphabetical order were required.
If there are better ideas to resolve these, the on-disk definition could be updated in the future.