Saturday, October 25, 2014

The Virtual Filesystem

Virtual file system (VFS) or Virtual filesystem switch is an abstraction layer on top of a more concrete file system. The purpose of a VFS is to allow for client applications to access different types of concrete file systems in a uniform way.

VFS is a kernel software layer that handles all system calls related to file systems. Its main strength is providing a common interface to several kinds of file systems. It also helps different types of filesystems to interoperate.

Architectural View of VFS

The central idea behind VFS operation:

The VFS substitutes a generic system call such as read() or write() with the native function of the particular filesystem involved, e.g. NTFS. Each specific filesystem implementation must translate its physical organization into the VFS’s common file model.
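The substitution described above can be pictured as a table of function pointers. What follows is a minimal user-space sketch, not kernel code: the names toy_file, toy_file_ops, plainfs and upperfs are hypothetical and only model the idea of a generic call being routed to a filesystem's native function.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Hypothetical per-filesystem operations table (models the idea only). */
struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

struct toy_file {
    const struct toy_file_ops *ops; /* chosen when the file is opened */
    const char *data;               /* NUL-terminated backing bytes */
    size_t pos;                     /* current offset into data */
};

/* Native read of a made-up filesystem that stores bytes unchanged. */
long plainfs_read(struct toy_file *f, char *buf, size_t len)
{
    size_t left = strlen(f->data) - f->pos;
    if (len > left)
        len = left;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return (long)len;
}

/* Native read of a made-up filesystem that upper-cases what it returns. */
long upperfs_read(struct toy_file *f, char *buf, size_t len)
{
    long n = plainfs_read(f, buf, len);
    for (long i = 0; i < n; i++)
        buf[i] = (char)toupper((unsigned char)buf[i]);
    return n;
}

const struct toy_file_ops plainfs_ops = { .read = plainfs_read };
const struct toy_file_ops upperfs_ops = { .read = upperfs_read };

/* Generic entry point: the caller never names a filesystem. */
long toy_read(struct toy_file *f, char *buf, size_t len)
{
    return f->ops->read(f, buf, len);
}
```

A caller only ever uses toy_read(); which native function runs depends on the ops pointer stored in the file, which is the role the common file model plays for real filesystems.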

What are the file systems supported by the Linux VFS?

–Disk-based filesystems

• Ext2, ext3, ReiserFS

• Sysv, UFS, MINIX, VxFS




–Network filesystems


–Special filesystems

•E.g. /proc

What are the system calls supported by VFS?
• Filesystem

–mount(), umount(), umount2()

–statfs(), fstatfs(), statfs64(), fstatfs64(), ustat()

–chroot(), pivot_root()

–chdir(), fchdir(), getcwd()

–mkdir(), rmdir()

–getdents(), getdents64(), readdir(), link(), unlink(), rename(), lookup_dcookie()

–readlink(), symlink()

–chown(), fchown(), lchown(), chown16(), fchown16(), lchown16()

–chmod(), fchmod(), utime()

–stat(), fstat(), lstat(), access(), oldstat(), oldfstat(), oldlstat(), stat64(), lstat64(), fstat64()

–open(), close(), creat(), umask()

–dup(), dup2(), fcntl(), fcntl64()

–select(), poll()

–truncate(), ftruncate(), truncate64(), ftruncate64()

–lseek(), _llseek()

–read(), write(), readv(), writev(), sendfile(), sendfile64(), readahead()
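From user space, these calls look the same whichever filesystem holds the file. As a small illustration, the sketch below counts the bytes in a file using only open(), read() and close(). The helper name count_bytes is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the bytes in a file using only generic system calls.
 * Nothing here depends on which filesystem the path lives on.
 * Returns the byte count, or -1 on error.
 */
long count_bytes(const char *path)
{
    char buf[4096];
    long total = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}
```

The same function should work unchanged on a disk-based filesystem, a network filesystem or a special filesystem, since the VFS routes each call to the right implementation.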

UNIX Filesystems
What is a filesystem?

  • A filesystem means a hierarchical storage of data following a particular structure. 
  • Filesystem contains files, directories and associated control information. 
  • Basic filesystem related operations are 1. creation 2. deletion 3. mounting. 
  • Filesystems are mounted at a specific mount point in a global hierarchy known as namespace. 
  • Mounting at a global namespace allows file systems to appear as entries in a single tree. 

UNIX provides four filesystem-related abstractions:

1. Mount points

2. Directory entries

3. Files

4. Inodes

A file is an ordered string of bytes. 
File operations are read, write, create, delete. 

Files are organized in directories. A directory is analogous to a folder and usually contains related files. Directories can also contain other directories, called subdirectories.
A file's metadata is stored in a separate data structure known as the inode.
The inodes are tied together with the filesystem's own control information, which is stored in the superblock; collectively this is the filesystem metadata.
VFS is object oriented.
VFS introduces a common file model to represent all supported filesystems.
The objects here refer to structures—not explicit class types, such as those in C++ or Java. 

The four main object types of the VFS are:

  • File object: an open file associated with a process. A file itself is really just a block of logically related arbitrary data.
  • Inode object: represents metadata about a file. An inode essentially contains information about ownership (user, group), access mode (read, write, execute permissions) and file type. Perhaps surprisingly, an inode does not contain the file's name.
  • Dentry object: the glue that holds inodes and files together by relating inode numbers to file names. Dentries also play a role in directory caching which, ideally, keeps the most frequently used files on hand for faster access. File system traversal is another aspect of the dentry, as it maintains a relationship between directories and their files.
  • Superblock object: basically the file system metadata. It defines the file system type, size, status, and information about other metadata structures (metadata of metadata). The superblock is critical to the file system and is therefore stored in multiple redundant copies for each file system. It is a very "high level" metadata structure. For example, if the superblock of a partition, /var, becomes corrupt, then the file system in question (/var) cannot be mounted by the operating system. Commonly in this event fsck is run; it automatically selects an alternate, backup copy of the superblock and attempts to recover the file system. The backup copies themselves are stored in block groups spread through the file system, with the first stored at an offset of one block from the start of the partition. This is important in the event that a manual recovery is necessary.
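The split between names (dentries) and metadata (inodes) can be observed from user space: two names created with link() resolve to the same inode number. The sketch below assumes a POSIX system; the helper names same_inode and hard_link_shares_inode are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both names resolve to the same inode, 0 if not, -1 on error. */
int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    /* An inode number is only unique within one filesystem (device). */
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/*
 * Creates `name`, adds a second directory entry `alias` for it with
 * link(), and reports whether both names share one inode.
 */
int hard_link_shares_inode(const char *name, const char *alias)
{
    int fd = open(name, O_CREAT | O_WRONLY, 0600);

    if (fd < 0)
        return -1;
    close(fd);
    if (link(name, alias) != 0)
        return -1;
    return same_inode(name, alias);
}
```

Because the name lives in the directory entry rather than in the inode, a second name costs only a second entry pointing at the same inode number.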

The common file model is specifically geared toward Unix filesystems; all other filesystems must map their own concepts into the common file model.

– For example, FAT filesystems do not have inodes

NB: Because the VFS treats a directory as a normal file, there is no separate directory object.

An operations object is contained within each of these primary objects. These objects describe the methods that the kernel invokes against the primary objects:

1. The super_operations object, which contains the methods that the kernel can invoke on a specific filesystem, such as write_inode() and sync_fs()

2. The inode_operations object, which contains the methods that the kernel can invoke on a specific file, such as create() and link()

3. The dentry_operations object, which contains the methods that the kernel can invoke on a specific directory entry, such as d_compare() and d_delete()

4. The file_operations object, which contains the methods that a process can invoke on an open file, such as read() and write()

Each registered filesystem is represented by a file_system_type structure. This object describes the filesystem and its capabilities.
Furthermore, each mount point is represented by the vfsmount structure. This structure contains information about the mount point, such as its location and mount flags.
There are also two per-process structures that describe the filesystem and files associated with a process. They are, respectively, the fs_struct structure and the file structure.

The Superblock Object

  • The superblock object is implemented by each filesystem.
  • It is used to store information describing the specific filesystem.
  • This object usually corresponds to the filesystem superblock or the filesystem control block.
  • That on-disk structure is stored in a special sector of the disk.
  • For filesystems like sysfs which do not reside on the disk, the superblock is generated on the fly and is stored in memory.
The superblock object is represented by struct super_block and defined in <linux/fs.h>.

The code for creating, managing, and destroying superblock objects lives in fs/super.c.
A superblock object is created and initialized via the alloc_super() function.

When mounted, a filesystem invokes this function, reads its superblock off of the disk, and fills in its superblock object.
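To make the "read the superblock off the disk and fill in the object" step concrete, here is a user-space sketch that parses a few fields out of a raw disk image. It is an illustration under stated assumptions, not the kernel's code path: it assumes the ext2 on-disk layout (primary superblock at byte offset 1024, little-endian fields, magic value 0xEF53 at offset 56 within the superblock), and the names toy_super_block and toy_fill_super are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_SB_OFFSET  1024u   /* assumed: ext2 primary superblock offset */
#define TOY_SB_SIZE    1024u   /* assumed: ext2 superblock size in bytes */
#define TOY_EXT2_MAGIC 0xEF53u /* assumed: value of the ext2 s_magic field */

/* Hypothetical in-memory superblock holding just a few fields. */
struct toy_super_block {
    uint32_t s_inodes_count;
    uint32_t s_blocks_count;
    uint32_t s_blocksize;
};

/* Decode little-endian integers independently of host byte order. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Fill `sb` from the raw image. Returns 0 on success, or -1 if the image
 * is too short, has the wrong magic number, or has an implausible block size.
 */
int toy_fill_super(const unsigned char *image, size_t len,
                   struct toy_super_block *sb)
{
    const unsigned char *raw;
    uint32_t log_block_size;

    if (len < TOY_SB_OFFSET + TOY_SB_SIZE)
        return -1;
    raw = image + TOY_SB_OFFSET;
    if (le16(raw + 56) != TOY_EXT2_MAGIC)
        return -1;
    log_block_size = le32(raw + 24);
    if (log_block_size > 16)
        return -1;

    sb->s_inodes_count = le32(raw + 0);
    sb->s_blocks_count = le32(raw + 4);
    sb->s_blocksize = 1024u << log_block_size;
    return 0;
}
```

A mount would fail at the magic-number check if the device does not actually hold the expected filesystem, which is roughly why a corrupt superblock prevents mounting.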

The most important item in the superblock object is s_op, which is a pointer to the superblock operations table. The superblock operations table is represented by struct super_operations and is defined in <linux/fs.h>. It looks like this:

struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*dirty_inode)(struct inode *);
    int (*write_inode)(struct inode *, int);
    void (*drop_inode)(struct inode *);
    void (*delete_inode)(struct inode *);
    void (*put_super)(struct super_block *);
    void (*write_super)(struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*freeze_fs)(struct super_block *);
    int (*unfreeze_fs)(struct super_block *);
    int (*statfs)(struct dentry *, struct kstatfs *);
    int (*remount_fs)(struct super_block *, int *, char *);
    void (*clear_inode)(struct inode *);
    void (*umount_begin)(struct super_block *);
    int (*show_options)(struct seq_file *, struct vfsmount *);
    int (*show_stats)(struct seq_file *, struct vfsmount *);
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
    int (*bdev_try_to_free_page)(struct super_block *, struct page *, gfp_t);
};

Each item in this structure is a pointer to a function that operates on a superblock object. The superblock operations perform low-level operations on the filesystem and its inodes.
When a filesystem needs to perform an operation on its superblock, it follows the pointers from its superblock object to the desired method. For example, if a filesystem wanted to write to its superblock, it would invoke sb->s_op->write_super(sb);
In this call, sb is a pointer to the filesystem’s superblock. Following that pointer into s_op yields the superblock operations table and ultimately the desired write_super() function, which is then invoked. Note how the write_super() call must be passed a superblock, despite the method being associated with one. This is because of the lack of object-oriented support in C. In C++, a call such as the following would suffice: sb.write_super();
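The sb->s_op->write_super(sb) pattern can be tried out in user space. The sketch below is a toy model only: the toy_ names, the s_dirt flag and the writes counter are stand-ins chosen for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

struct toy_super_block;

/* Toy operations table with a single method. */
struct toy_super_operations {
    void (*write_super)(struct toy_super_block *sb);
};

struct toy_super_block {
    const struct toy_super_operations *s_op; /* the method table */
    int s_dirt;   /* nonzero when the in-memory copy needs writing back */
    int writes;   /* toy counter standing in for real disk I/O */
};

/* One filesystem's implementation of the method. */
void toyfs_write_super(struct toy_super_block *sb)
{
    sb->writes++;
    sb->s_dirt = 0;
}

const struct toy_super_operations toyfs_sops = {
    .write_super = toyfs_write_super,
};

/*
 * Generic caller: it reaches the method through sb and must also pass sb
 * explicitly, because C has no implicit "this" pointer.
 */
void toy_sync_super(struct toy_super_block *sb)
{
    assert(sb != NULL && sb->s_op != NULL);
    if (sb->s_dirt && sb->s_op->write_super != NULL)
        sb->s_op->write_super(sb);
}
```

Checking the pointer for NULL before the call mirrors the fact that a filesystem may leave a method undefined.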

Let’s take a look at some of the superblock operations:

1. struct inode * alloc_inode(struct super_block *sb)
Creates and initializes a new inode object under the given superblock.
2. void destroy_inode(struct inode *inode)
Deallocates the given inode.
3. void dirty_inode(struct inode *inode)
Invoked by the VFS when an inode is dirtied (modified). Journaling filesystems such as ext3 and ext4 use this function to perform journal updates.
4. int write_inode(struct inode *inode, int wait)
Writes the given inode to disk. The wait parameter specifies whether the operation should be synchronous.
5. void drop_inode(struct inode *inode)
Called by the VFS when the last reference to an inode is dropped. Normal Unix filesystems do not define this function, in which case the VFS simply deletes the inode.
6. void delete_inode(struct inode *inode)
Deletes the given inode from the disk.
7. void put_super(struct super_block *sb)
Called by the VFS on unmount to release the given superblock object. The caller must hold the s_lock lock.
8. void write_super(struct super_block *sb)
Updates the on-disk superblock with the specified superblock. The VFS uses this function to synchronize a modified in-memory superblock with the disk. The caller must hold the s_lock lock.
9. int sync_fs(struct super_block *sb, int wait)
Synchronizes filesystem metadata with the on-disk filesystem. The wait parameter specifies whether the operation is synchronous.
10. int freeze_fs(struct super_block *sb)
Prevents changes to the filesystem, and then updates the on-disk superblock with the specified superblock. It is currently used by LVM (the Logical Volume Manager). Older kernels call this method write_super_lockfs().
11. int unfreeze_fs(struct super_block *sb)
Unlocks the filesystem against changes as done by freeze_fs(). Older kernels call this method unlockfs().
12. int statfs(struct dentry *dentry, struct kstatfs *statfs)
Called by the VFS to obtain filesystem statistics. The statistics related to the given filesystem are placed in statfs.
13. int remount_fs(struct super_block *sb, int *flags, char *data)
Called by the VFS when the filesystem is remounted with new mount options. The caller must hold the s_lock lock.
14. void clear_inode(struct inode *inode)
Called by the VFS to release the inode and clear any pages containing related data.
15. void umount_begin(struct super_block *sb)
Called by the VFS to interrupt a mount operation. It is used by network filesystems, such as NFS.

All these functions are invoked by the VFS, in process context. All except dirty_inode() may block if needed.

The Inode Object

The inode object represents all the information needed by the kernel to manipulate a file.
Filesystems without inodes generally store file-specific information as part of the file; unlike Unix-style filesystems, they do not separate file data from its control information. Some modern filesystems do neither and store file metadata as part of an on-disk database. Whatever the case, the inode object is constructed in memory in whatever manner is applicable to the filesystem.
The inode object is represented by struct inode and is defined in <linux/fs.h>.

An inode represents each file on a filesystem, but the inode object is constructed in memory only as files are accessed. This includes special files, such as device files or pipes.
Consequently, some of the entries in struct inode are related to these special files. For example, the i_pipe entry points to a named pipe data structure, i_bdev points to a block device structure, and i_cdev points to a character device structure. These three pointers are stored in a union because a given inode can represent only one of these (or none of them) at a time.
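The union arrangement can be sketched in a few lines of C. This is a simplified model with hypothetical toy_ types. In particular, the explicit kind tag is an assumption made for the example; it simply records which union member is valid.

```c
#include <stddef.h>

/* Toy stand-ins for the pipe and device structures. */
struct toy_pipe { int readers; };
struct toy_bdev { int dev_major; int dev_minor; };
struct toy_cdev { int dev_major; int dev_minor; };

/* Assumed tag for this sketch: says which union member, if any, is valid. */
enum toy_kind { TOY_REGULAR, TOY_PIPE, TOY_BLOCK, TOY_CHAR };

struct toy_inode {
    enum toy_kind kind;
    /* Only one member is meaningful at a time, so they share storage. */
    union {
        struct toy_pipe *i_pipe;
        struct toy_bdev *i_bdev;
        struct toy_cdev *i_cdev;
    } u;
};

/* Returns the device major number, or -1 if the inode is not a device. */
int toy_inode_major(const struct toy_inode *inode)
{
    switch (inode->kind) {
    case TOY_BLOCK:
        return inode->u.i_bdev->dev_major;
    case TOY_CHAR:
        return inode->u.i_cdev->dev_major;
    default:
        return -1;
    }
}
```

The three pointers occupy the space of one, which matters when a very large number of inode objects are held in memory.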

The Dentry Object

  • The VFS employs the concept of a directory entry (dentry). A dentry is a specific component in a path. /, bin, and vi are all dentry objects for the file /bin/vi. The first two are directories and the last is a regular file.
  • Dentry objects are all components in a path, including files. Resolving a path and walking its components is a nontrivial exercise, time-consuming and heavy on string operations, which are expensive to execute and cumbersome to code. The dentry object makes the whole process easier.
  • Dentries might also include mount points. In the path /mnt/cdrom/foo, the components /, mnt, cdrom, and foo are all dentry objects.
  • The VFS constructs dentry objects on the fly, as needed, when performing directory operations.
  • Dentry objects are represented by struct dentry and defined in <linux/dcache.h>.
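To see what "each component of a path" means in practice, the sketch below splits a path into the names that would each correspond to a dentry. It is a user-space illustration only; toy_split_path and TOY_NAME_MAX are hypothetical, and real path resolution also handles things like "." and "..", symbolic links and mount points.

```c
#include <stddef.h>
#include <string.h>

#define TOY_NAME_MAX 64

/*
 * Splits `path` into its components, counting the leading "/" of an
 * absolute path as the first one. Returns the number of components
 * stored in `names`, or -1 if there are more than `max` components or
 * a name does not fit.
 */
int toy_split_path(const char *path, char names[][TOY_NAME_MAX], int max)
{
    int n = 0;

    if (*path == '/' && n < max)
        strcpy(names[n++], "/");

    while (*path != '\0') {
        size_t len;

        while (*path == '/')        /* skip separators */
            path++;
        len = strcspn(path, "/");   /* length of the next name */
        if (len == 0)
            break;                  /* trailing slash or end of path */
        if (n == max || len >= TOY_NAME_MAX)
            return -1;
        memcpy(names[n], path, len);
        names[n][len] = '\0';
        n++;
        path += len;
    }
    return n;
}
```

For /mnt/cdrom/foo this yields the four components named in the text: /, mnt, cdrom and foo.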

Dentry State

  • A valid dentry object can be in one of three states: used, unused, or negative.
  • A used dentry corresponds to a valid inode (d_inode points to an associated inode) and indicates that there are one or more users of the object (d_count is positive). A used dentry is in use by the VFS and points to valid data and, thus, cannot be discarded.
  • An unused dentry corresponds to a valid inode (d_inode points to an inode), but the VFS is not currently using the dentry object (d_count is zero). Because the dentry object still points to a valid object, the dentry is kept around (cached) in case it is needed again.
  • A negative dentry is not associated with a valid inode (d_inode is NULL) because either the inode was deleted or the path name was never correct to begin with. The dentry is kept around, however, so that future lookups are resolved quickly.
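The three states follow directly from the two fields mentioned above, d_inode and d_count. A minimal sketch, using simplified toy_ structures rather than the kernel's struct dentry:

```c
#include <stddef.h>

struct toy_inode { unsigned long i_ino; };

/* Simplified dentry holding only the two fields that decide the state. */
struct toy_dentry {
    struct toy_inode *d_inode; /* NULL for a negative dentry */
    int d_count;               /* number of current users */
};

enum toy_dentry_state {
    TOY_DENTRY_USED,
    TOY_DENTRY_UNUSED,
    TOY_DENTRY_NEGATIVE
};

/* Classify a valid dentry according to the rules described in the text. */
enum toy_dentry_state toy_classify_dentry(const struct toy_dentry *d)
{
    if (d->d_inode == NULL)
        return TOY_DENTRY_NEGATIVE;
    return d->d_count > 0 ? TOY_DENTRY_USED : TOY_DENTRY_UNUSED;
}
```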

The Dentry Cache

The kernel caches dentry objects in the dentry cache or, simply, the dcache.
The dentry cache consists of three parts:

1. Lists of “used” dentries linked off their associated inode via the i_dentry field of the inode object. Because a given inode can have multiple links, there might be multiple dentry objects; consequently, a list is used.

2. A doubly linked “least recently used” list of unused and negative dentry objects. The list is inserted at the head, such that entries toward the head of the list are newer than entries toward the tail. When the kernel must remove entries to reclaim memory, the entries are removed from the tail; those are the oldest and presumably have the least chance of being used in the near future.

3. A hash table and hashing function used to quickly resolve a given path into the associated dentry object.
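The second part, the least recently used list, is easy to model. The sketch below covers only that part (insert at the head, reclaim from the tail), using hypothetical toy_ types. It leaves out the per-inode lists and the hash table, and it is not how the kernel's list code is written.

```c
#include <stddef.h>

struct toy_dentry {
    const char *d_name;
    struct toy_dentry *prev; /* toward the head (newer entries) */
    struct toy_dentry *next; /* toward the tail (older entries) */
};

struct toy_lru {
    struct toy_dentry *head; /* most recently added */
    struct toy_dentry *tail; /* oldest, first to be reclaimed */
};

/* Insert a dentry at the head of the list. */
void toy_lru_add(struct toy_lru *lru, struct toy_dentry *d)
{
    d->prev = NULL;
    d->next = lru->head;
    if (lru->head != NULL)
        lru->head->prev = d;
    else
        lru->tail = d;      /* list was empty */
    lru->head = d;
}

/* Remove and return the oldest dentry, or NULL if the list is empty. */
struct toy_dentry *toy_lru_evict(struct toy_lru *lru)
{
    struct toy_dentry *victim = lru->tail;

    if (victim == NULL)
        return NULL;
    lru->tail = victim->prev;
    if (lru->tail != NULL)
        lru->tail->next = NULL;
    else
        lru->head = NULL;   /* list is now empty */
    victim->prev = victim->next = NULL;
    return victim;
}
```

Both operations run in constant time, which is the usual reason for choosing a doubly linked list for this kind of cache.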

The File Object

  • The final primary VFS object that we shall look at is the file object. The file object is used to represent a file opened by a process.
  • When we think of the VFS from the perspective of user-space, the file object is what readily comes to mind. Processes deal directly with files, not superblocks, inodes, or dentries. It is not surprising that the information in the file object is the most familiar (data such as access mode and current offset) or that the file operations are familiar system calls such as read() and write().
  • The file object is the in-memory representation of an open file.
  • The object (but not the physical file) is created in response to the open() system call and destroyed in response to the close() system call. All these file-related calls are actually methods defined in the file operations table.
  • Because multiple processes can open and manipulate a file at the same time, there can be multiple file objects in existence for the same file.
  • The file object merely represents a process’s view of an open file. The object points back to the dentry (which in turn points back to the inode) that actually represents the open file.
  • The inode and dentry objects, of course, are unique.
  • The file object is represented by struct file and is defined in <linux/fs.h>.
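That one file can have several file objects is visible from user space through the file offset. Each open() call gets its own offset, while a descriptor made with dup() shares the offset of the original. The sketch below assumes a POSIX system, and the helper name shares_offset is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if descriptors a and b share one file offset, 0 if their
 * offsets are independent, and -1 on error. It moves a forward by one
 * byte, checks whether b's offset moved too, then moves a back.
 */
int shares_offset(int a, int b)
{
    off_t before, after;

    before = lseek(b, 0, SEEK_CUR);
    if (before < 0)
        return -1;
    if (lseek(a, 1, SEEK_CUR) < 0)
        return -1;
    after = lseek(b, 0, SEEK_CUR);
    lseek(a, -1, SEEK_CUR);     /* restore a's original offset */
    if (after < 0)
        return -1;
    return after != before;
}
```

Two descriptors from separate open() calls on the same path should give 0 here, and a descriptor paired with its dup() copy should give 1, even though all of them refer to the same inode.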

High View Diagram of the layers in VFS


  1. The complex part is ignored...

    How the control navigates from one filesystem to the mounted filesystem