Operating Systems

Nov
24

Sometimes tasks under Linux block forever (they are essentially hung). Recent Linux kernels have infrastructure to detect such tasks. When it is enabled, it wakes up periodically, looks for hung tasks and prints a stack dump of each one (and possibly the locks it holds). Additionally, we can choose to panic the system when at least one hung task is detected.
I will try to explain how khungtaskd works.

The infrastructure is based on a single kernel thread named “khungtaskd”. So if you run ps on your system and see an entry like [khungtaskd], you know it is there. I have one on my system:
136 root SW [khungtaskd]

The khungtaskd daemon runs a loop in which it asks the scheduler to wake it up every 120 seconds (the default value). The core algorithm is like this:

1. Iterate over all the tasks in the system that are in the TASK_UNINTERRUPTIBLE state (it skips uninterruptible tasks that are frozen, and uninterruptible tasks that are newly created and have never been scheduled out).

2. If a task has not been switched out by the scheduler at least once in the last 120 seconds, it is considered hung and its stack dump is printed. If CONFIG_LOCKDEP is enabled, all the locks the hung task is holding are shown as well.
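The two steps above can be sketched in plain userspace C. This is only a model of the idea, not the kernel code: the struct and its field names are made up for illustration (they are not the kernel's task_struct), and I assume that "has it been switched out since the last scan" is decided by comparing a context switch counter with the value remembered from the previous scan.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the per-task data the scan needs. */
struct task_model {
	bool uninterruptible;            /* task is in TASK_UNINTERRUPTIBLE */
	bool frozen;                     /* task is in the refrigerator */
	unsigned long switch_count;      /* context switches so far */
	unsigned long last_switch_count; /* value seen on the previous scan */
};

/* Return true if the task made no progress since the previous scan. */
bool task_looks_hung(struct task_model *t)
{
	/* Step 1: only uninterruptible, non-frozen tasks are considered. */
	if (!t->uninterruptible || t->frozen)
		return false;
	/* Newly created and never scheduled out: skip it. */
	if (t->switch_count == 0)
		return false;
	/* Step 2: it was switched out in between, so remember and move on. */
	if (t->switch_count != t->last_switch_count) {
		t->last_switch_count = t->switch_count;
		return false;
	}
	return true;
}

/* One periodic pass over all tasks; returns how many look hung. */
size_t scan_tasks(struct task_model *tasks, size_t n)
{
	size_t hung = 0;

	for (size_t i = 0; i < n; i++)
		if (task_looks_hung(&tasks[i]))
			hung++;	/* the kernel would print the stack dump here */
	return hung;
}
```

Calling scan_tasks() once per timeout period mimics the daemon: a task is reported only when two consecutive scans see the same switch count.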

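One can change the sampling interval of khungtaskd, which is also the timeout after which a task is considered hung, through the sysctl interface /proc/sys/kernel/hung_task_timeout_secs. Writing 0 to it disables the check.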
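For completeness, here is a small sketch of reading that sysctl from a C program. The helper names read_long_from_file and print_hung_task_timeout are my own, and I am assuming the file holds a single decimal number; on a kernel built without hung task detection the file will simply be missing.

```c
#include <stdio.h>

/* Read one decimal integer from a file; return 0 on success, -1 on failure. */
int read_long_from_file(const char *path, long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%ld", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Print the current hung task timeout, if the kernel exposes it. */
void print_hung_task_timeout(void)
{
	long secs;

	if (read_long_from_file("/proc/sys/kernel/hung_task_timeout_secs",
				&secs) == 0)
		printf("hung_task_timeout_secs = %ld\n", secs);
	else
		printf("hung task detection is not available\n");
}
```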

Nov
17

I was looking into the implementation of the mount system call and found that it needs the type of the file-system (vfat, ext3, ext2, etc.) to proceed; otherwise it eventually returns an error. Hey, I do not always provide this information to the kernel when I say

mount block_device mount_point

I just provide the block device and the mount point. I do not provide the file-system type but it still works!

So, where does the kernel get the file-system information from? Most of the time humans tend to complicate matters and I am no exception. My first thought was that the kernel must be reading the superblock from the block device and matching it against file-system signatures. I know you must be yelling at me, because the superblock layout is specific to each file-system type and there may be no guarantee about where in the superblock to look for the signature.

So I did some kernel debugging to find out what is passed to the sys_mount system call. Here is the output when I issue
# mount /dev/mmcblk0p1 /mnt/mmc/
DEBUG: type:ext3, dir_name:/mnt/mmc/, dev_name:/dev/mmcblk0p1
DEBUG: type:ext2, dir_name:/mnt/mmc/, dev_name:/dev/mmcblk0p1
DEBUG: type:vfat, dir_name:/mnt/mmc/, dev_name:/dev/mmcblk0p1

From the output it looks like the userspace mount code is iterating over the available device-backed filesystems, passing each type to the kernel in turn and stopping when one succeeds or the list is exhausted! I know for sure that /proc/filesystems has that information. On my machine it is:

nodev sysfs
nodev rootfs
nodev bdev
nodev proc
nodev tmpfs
nodev binfmt_misc
nodev debugfs
nodev sockfs
nodev pipefs
nodev anon_inodefs
nodev rpc_pipefs
nodev inotifyfs
nodev devpts
ext3
ext2
nodev ramfs
vfat
msdos
nodev nfs
nodev nfs4

Note the order: ext3, ext2, vfat, msdos. The partition I was trying to mount had vfat on it, so userspace started with ext3 and worked its way down the list. When vfat was passed the call succeeded and we had our file-system mounted. So much for mounting a filesystem.
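Here is a minimal sketch of that trial-and-error loop, assuming the list comes from /proc/filesystems and that lines starting with nodev are skipped. The function names parse_fs_line and mount_guessing_type are hypothetical, mount(2) is the only real interface used, and mount programs differ in how they pick the type, so treat this as a model of what the debug output shows and nothing more.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Parse one line of /proc/filesystems. Copies the type into name and
 * returns 1 for a device-backed filesystem, 0 for a nodev or empty line.
 */
int parse_fs_line(const char *line, char *name, size_t len)
{
	char first[64], second[64];
	int n = sscanf(line, "%63s %63s", first, second);

	if (n < 1 || strcmp(first, "nodev") == 0)
		return 0;
	snprintf(name, len, "%s", first);
	return 1;
}

/* Try each device-backed type in turn; return 0 on the first success. */
int mount_guessing_type(const char *dev, const char *dir)
{
	char line[128], type[64];
	FILE *f = fopen("/proc/filesystems", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!parse_fs_line(line, type, sizeof(type)))
			continue;
		/* Same call the kernel saw: one attempt per type. */
		if (mount(dev, dir, type, 0, NULL) == 0) {
			fclose(f);
			return 0;
		}
	}
	fclose(f);
	return -1;	/* list exhausted without a successful mount */
}
```

With the list above, mount_guessing_type("/dev/mmcblk0p1", "/mnt/mmc/") would try ext3, then ext2, then vfat, matching the three DEBUG lines. It needs root to actually mount anything.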

Nov
16

Last week I wrote about a simple problem with removable block devices under Linux. The root of the problem is that the driver suspend path tries to terminate or wake up kernel threads after they have been frozen. This can leave the system apparently frozen.
Here is another manifestation of the problem:
1. Mount a filesystem over mmc
2. Suspend the kernel
System is apparently frozen.

Here is a modified stack dump of the hung task.

“echo 0 > /proc/sys/kernel/hung_task_timeout_secs” disables this message.
sh D c027b048 0 387 1 0x00000000
schedule from bdi_sched_wait
bdi_sched_wait from __wait_on_bit
__wait_on_bit from out_of_line_wait_on_bit
out_of_line_wait_on_bit from sync_inodes_sb
sync_inodes_sb from __sync_filesystem
__sync_filesystem from fsync_bdev
fsync_bdev from invalidate_partition
invalidate_partition from del_gendisk
del_gendisk from mmc_blk_remove
mmc_blk_remove from mmc_bus_remove
mmc_bus_remove from __device_release_driver
__device_release_driver from device_release_driver
device_release_driver from bus_remove_device
bus_remove_device from device_del
device_del from mmc_remove_card
mmc_remove_card from mmc_sd_remove
mmc_suspend_host from omap_hsmmc_suspend
omap_hsmmc_suspend from platform_pm_suspend
platform_pm_suspend from pm_op
pm_op from dpm_suspend_start
dpm_suspend_start from suspend_devices_and_enter
suspend_devices_and_enter from enter_state
enter_state from state_store
state_store from kobj_attr_store
kobj_attr_store from sysfs_write_file
sysfs_write_file from vfs_write
vfs_write from sys_write
sys_write from ret_fast_syscall

The stack dump clearly shows that we are blocked waiting on a bdi, the ‘backing device’. But its flusher thread is already in the refrigerator!

Nov
12

Imagine you have a removable block device like MMC/SD card.

1. You have a filesystem mounted on it.

2. You now suspend the system.

3. The filesystem gets synced, the block device is invalidated (if you are of a sane mind you will not choose CONFIG_MMC_UNSAFE_RESUME=y).

4. The filesystem is still mounted, but the backing device flusher task pointer that is part of the block device info is not invalidated.

The pointer stays around even though the device is gone, and that creates problems! Patch to follow soon!

Nov
12

Linux patch for bdi flusher tasks

Suspend/resume in Linux exposed yet another problem. I should say two problems: one very deep, the other not quite.

First the easy problem:

If a task is already in the refrigerator and we want to stop it, it simply won’t stop. This is because of the design of the refrigerator.

Here is the code at the heart of the refrigerator!
for (;;) {
	set_current_state(TASK_UNINTERRUPTIBLE);
	if (!frozen(current))
		break;
	schedule();
}

What this means is that if a frozen task is woken up, it is scheduled out again straight away. It only leaves the loop once the PF_FROZEN flag has been cleared from its task flags, in which case frozen(current) is false.
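To convince myself, here is a toy userspace replay of that loop. Everything in it is an assumption made for illustration: the struct, the event names and the flag value PF_FROZEN_MODEL are mine, not the kernel's. A plain wakeup just makes the loop go around again, while a thaw clears the flag before waking the task.

```c
#include <stdbool.h>

#define PF_FROZEN_MODEL 0x1u	/* illustrative bit, not the kernel's value */

/* Hypothetical stand-in for a task parked in the refrigerator. */
struct fridge_task {
	unsigned int flags;
};

enum event {
	EV_WAKE,	/* someone wakes the task, flag untouched */
	EV_THAW,	/* flag is cleared first, then the task is woken */
};

bool frozen(const struct fridge_task *t)
{
	return (t->flags & PF_FROZEN_MODEL) != 0;
}

/*
 * Replay the refrigerator loop against a sequence of events.
 * Returns true if the task left the loop, false if it is still parked
 * after the last event.
 */
bool leaves_refrigerator(struct fridge_task *t, const enum event *ev, int n)
{
	int i = 0;

	for (;;) {
		if (!frozen(t))
			return true;	/* the "break" in the loop above */
		if (i == n)
			return false;	/* nobody wakes us again */
		/* schedule(): sleep until the next event arrives */
		if (ev[i++] == EV_THAW)
			t->flags &= ~PF_FROZEN_MODEL;
	}
}
```

No number of EV_WAKE events gets the task out; a single EV_THAW does. That is the whole reason a frozen thread cannot be stopped by waking it alone.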

The MMC suspend routine assumes that the block device is removable and on suspend tries to delete the block device. Eventually it will attempt to kill the bdi thread backing the block device. But by now (when the suspend routine is called) the bdi thread is in the refrigerator. This results in the suspend thread being blocked forever. Here is the patch that solves the problem:

Kicks the frozen bdi flusher task out of the refrigerator when the flusher task
needs to exit.
Signed-off-by: Romit Dasgupta <romit@ti.com>
---
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5a37e20..c757b05 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -606,8 +606,11 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 	 * Finally, kill the kernel threads. We don't need to be RCU
 	 * safe anymore, since the bdi is gone from visibility.
 	 */
-	list_for_each_entry(wb, &bdi->wb_list, list)
+	list_for_each_entry(wb, &bdi->wb_list, list) {
+		if (unlikely(frozen(wb->task)))
+			wb->task->flags &= ~PF_FROZEN;
 		kthread_stop(wb->task);
+	}
 }