Corrupted microSD cards!

Background

Our Talking Book Devices use 512MB microSD cards to store audio files that are recorded and played back on the devices. The cards also store other data used to operate and configure the device. They are formatted as FAT16, although our system would accept FAT32. The memory cards are not accessible to the user without disassembling the unit. They are mounted directly on the printed circuit board, which is enclosed by the rugged plastic housing. A silicone band wraps around the seal of the plastic housing to minimize entry of dust and water.

The microSD cards currently used in our devices are generic, purchased from a company that claims not to have had any complaints about them. However, no one questions that brand-name options have more rigorous testing and quality standards.

Our C-based software uses a binary library provided by the chipset manufacturer for typical file system commands, such as open(), read(), write(), and close().

Power to the device (usually from batteries) can be immediately removed by the user with an external switch. This ensures there is no costly battery drain, but it also means there is no opportunity to close down any operation. A user could cut power while recording audio directly to the memory card. However, we have not found this to cause a problem.

When the system starts up, it mounts the memory card’s file system. If it has any trouble accessing its startup files or if there is any unhandled exception during operation, it activates USB device mode. While in USB device mode, if the file system is still intact, it can be inspected from a laptop with a USB cable.

Problem

Of the 68 devices in active use in our northern Ghana pilot test, ten have failed to operate after a week or two of use. When the devices are inspected, one or more directories are often corrupted: sometimes existing filenames include non-alphanumeric characters, and sometimes new files appear with the same non-standard characters. After being reformatted and reloaded with content, the devices work fine.

At this point, we don’t have enough data to know if these same devices consistently fail again after some amount of time. We also cannot deterministically create the failure.
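
One possible way to make the failure reproducible in the lab is to run the file system code against a simulated card that "loses power" after a chosen number of sector writes. The sketch below is only an illustration of that idea: `sim_card`, `sim_card_init`, and `sim_card_write` are hypothetical names, not part of our firmware or the chipset library, and the assumption that an interrupted sector ends up partially written is a simplification of what a real card's controller may do.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define NUM_SECTORS 64u   /* tiny simulated card; a real test would load a card image */

/* Hypothetical RAM-backed card with a power-cut trigger. */
typedef struct {
    uint8_t  data[NUM_SECTORS][SECTOR_SIZE];
    unsigned writes_done;  /* complete sector writes so far */
    unsigned cut_after;    /* number of complete writes allowed before the cut */
    unsigned torn_bytes;   /* bytes of the interrupted sector that still land */
    int      powered;      /* becomes 0 once the simulated cut has happened */
} sim_card;

void sim_card_init(sim_card *c, unsigned cut_after, unsigned torn_bytes)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
    c->cut_after  = cut_after;
    c->torn_bytes = torn_bytes < SECTOR_SIZE ? torn_bytes : SECTOR_SIZE;
    c->powered    = 1;
}

/* Writes one sector. Returns 0 on success, -1 if the sector is out of
 * range, power is already lost, or this write is the one that gets cut. */
int sim_card_write(sim_card *c, unsigned sector, const uint8_t *buf)
{
    assert(c != NULL && buf != NULL);
    if (!c->powered || sector >= NUM_SECTORS)
        return -1;
    if (c->writes_done == c->cut_after) {
        /* Assumed failure model: only the first torn_bytes reach the card. */
        memcpy(c->data[sector], buf, c->torn_bytes);
        c->powered = 0;
        return -1;
    }
    memcpy(c->data[sector], buf, SECTOR_SIZE);
    c->writes_done++;
    return 0;
}
```

A test harness could replay the same recording workload once for every value of `cut_after`, then check whether the resulting image still mounts. That would show which writes are dangerous to interrupt, assuming the library's sector writes can be redirected to such a stub.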

Resources

One of our field volunteers has captured images of five of the corrupted memory cards. Three of them appear to have their file system data corrupted such that they cannot be mounted. One of them shows only two of the original six directories and only two of dozens of files. Another seems to have all or most of the files and directories intact, but the /log directory now has lots of additional zero-size files with non-standard characters in the filename.
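
To quantify the "non-standard characters" symptom across images, a directory region can be scanned for 8.3 short names that contain bytes the FAT format does not allow. The following is a minimal sketch, assuming the caller has already located and read the raw 32-byte directory entries from an image; `count_suspect_entries` is a hypothetical helper, and the check is a heuristic (for example, it stops at the first end-of-directory marker, which corruption may have overwritten or introduced).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIR_ENTRY_SIZE 32u
#define ATTR_LONG_NAME 0x0Fu  /* attribute value that marks a long-filename entry */

/* Returns 1 if the byte is legal inside a FAT 8.3 short name. */
static int short_name_byte_ok(uint8_t b)
{
    static const char illegal[] = "\"*+,./:;<=>?[\\]|";
    if (b < 0x20u)
        return 0;                       /* control characters */
    if (b >= 'a' && b <= 'z')
        return 0;                       /* short names are stored upper-case */
    return memchr(illegal, b, sizeof illegal - 1) == NULL;
}

/* Counts in-use short-name entries whose 11 name bytes look invalid. */
size_t count_suspect_entries(const uint8_t *dir, size_t len)
{
    size_t suspects = 0;
    assert(dir != NULL);
    for (size_t off = 0; off + DIR_ENTRY_SIZE <= len; off += DIR_ENTRY_SIZE) {
        const uint8_t *e = dir + off;
        if (e[0] == 0x00)
            break;                      /* end-of-directory marker */
        if (e[0] == 0xE5)
            continue;                   /* deleted entry */
        if ((e[11] & 0x3Fu) == ATTR_LONG_NAME)
            continue;                   /* long-filename entry, different layout */
        int bad = (e[0] == 0x20);       /* a name may not start with a space */
        for (size_t i = 0; i < 11 && !bad; i++) {
            if (i == 0 && e[0] == 0x05)
                continue;               /* 0x05 stands in for a leading 0xE5 */
            if (!short_name_byte_ok(e[i]))
                bad = 1;
        }
        suspects += (size_t)bad;
    }
    return suspects;
}
```

Running a check like this over the /log directory of each mountable image would give a rough count of damaged entries per card, which may help compare cards or devices once more samples are collected.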

Here are the images of the corrupted microSD cards (compressed from the 472MB-capacity card):

  • sample1.img.zip (46 MB): mountable, but only two directories of the original six show up, and only two of dozens of files
  • sample2.img.zip (28 MB): unmountable
  • sample3.img.zip (32 MB): mountable; most files/directories in place, but check out the /log directory, which should have only one file in it, "log.txt".
  • sample4.img.zip (67 MB): unmountable
  • sample5.img.zip (36 MB): unmountable
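
As a first triage step for the unmountable images, it may be worth checking whether the boot sector itself is damaged or whether the damage lies further in (FAT or root directory). The sketch below tests a few boot sector fields at their standard FAT offsets. It assumes the 512 bytes passed in are the FAT boot sector; if an image begins with a partition table instead, the caller would need to seek to the partition's first sector. The function and enum names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum bs_status {
    BS_OK = 0,
    BS_BAD_SIGNATURE,     /* bytes 510..511 are not 0x55 0xAA */
    BS_BAD_SECTOR_SIZE,   /* bytes per sector is not 512, 1024, 2048 or 4096 */
    BS_BAD_CLUSTER_SIZE,  /* sectors per cluster is zero or not a power of two */
    BS_BAD_FAT_COUNT,     /* reserved sector count or number of FATs is zero */
    BS_BAD_FAT16_FIELDS   /* root entry count or 16-bit FAT size is zero */
};

/* Reads a little-endian 16-bit value. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Sanity-checks a FAT12/FAT16 boot sector. Passing does not prove the
 * volume is healthy; failing shows the boot sector itself is damaged. */
enum bs_status check_fat16_boot_sector(const uint8_t s[512])
{
    assert(s != NULL);
    if (s[510] != 0x55 || s[511] != 0xAA)
        return BS_BAD_SIGNATURE;

    uint16_t bytes_per_sector = le16(s + 11);
    if (bytes_per_sector != 512 && bytes_per_sector != 1024 &&
        bytes_per_sector != 2048 && bytes_per_sector != 4096)
        return BS_BAD_SECTOR_SIZE;

    uint8_t sectors_per_cluster = s[13];
    if (sectors_per_cluster == 0 ||
        (sectors_per_cluster & (sectors_per_cluster - 1)) != 0)
        return BS_BAD_CLUSTER_SIZE;

    if (le16(s + 14) == 0 || s[16] == 0)
        return BS_BAD_FAT_COUNT;

    if (le16(s + 17) == 0 || le16(s + 22) == 0)
        return BS_BAD_FAT16_FIELDS;

    return BS_OK;
}
```

On a laptop, the first 512 bytes of each unzipped image could be read with `fread()` and passed to this function. A result other than `BS_OK` would suggest sector 0 was overwritten, which points to a different failure than a damaged directory.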

What We Need

We are not worried about recovering the data on these devices. We just need to find the cause of the problem so that we can fix it.

Since we have not yet been able to consistently reproduce the problem, we do not yet have any data about whether the problem is correlated with particular memory cards or particular devices. We have also not been able to run tests comparing the failure rate of devices with brand-name memory cards vs. generic. Our chipset manufacturer is not aware of any bugs in their libraries, although we may be using their recording feature more than any of their other customers.

Things We Need to Try

Guesses at the Cause of the Problem

  • Thinking out loud:
    • If the flash media is physically secure, we can rule out physical ejection, etc. It may still be the case that we have an intermittent connection at the microSD socket that is affected by physical stress or shock to the device.
    • Power-down during writes, etc.: This may be happening but may be hard to repro in the lab. Firstly, the flash media is going to have a pretty fast write time, so catching it during a write is probably pretty hard. Secondly, some writes to the file system may be more catastrophic than others. Updating the FAT or a directory table, if interrupted, may cause the disk to be unreadable, whereas interrupting the write of an audio file may not have as big an impact. I wonder when these FS write events happen w.r.t. the act of recording. When you start recording? When you stop? Scattered throughout? This information may help us to reproduce the problem.
    • I am not ready to rule out software/firmware issues. Even if the read/write library is well tested, we could still be seeing things like memory corruption, where other parts of the binary corrupt the stack or other modifiable internal data structures.
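
If memory corruption is a suspect, one common and inexpensive check is to surround buffers that are handed to the file system library with known guard words and verify them before each write. The sketch below shows the general technique only; `guarded_buf`, the buffer size, and the guard value are assumptions for illustration and are not taken from our firmware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_WORD   0xDEADBEEFu
#define REC_BUF_SIZE 256u          /* hypothetical recording buffer size */

/* A buffer bracketed by guard words. An overrun or underrun from
 * neighbouring code is likely to change one of the guards. */
typedef struct {
    uint32_t guard_front;
    uint8_t  payload[REC_BUF_SIZE];
    uint32_t guard_back;
} guarded_buf;

void guarded_buf_init(guarded_buf *b)
{
    assert(b != NULL);
    memset(b->payload, 0, sizeof b->payload);
    b->guard_front = GUARD_WORD;
    b->guard_back  = GUARD_WORD;
}

/* Returns 1 if both guard words are unchanged, 0 otherwise. */
int guarded_buf_intact(const guarded_buf *b)
{
    assert(b != NULL);
    return b->guard_front == GUARD_WORD && b->guard_back == GUARD_WORD;
}
```

If the firmware called `guarded_buf_intact()` just before each `write()` and refused to write when it returned 0, a corrupted buffer would be caught before it reached the card. This only detects writes that cross the guards, so a clean result would not rule out other kinds of corruption.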