Rule #0 of Firmware Development
by dweller - 2024-12-25
I was reading some datasheets and documentation for a SoC I was tinkering with, and was reminded of a very important practice in firmware development. To appreciate it, I first want to ask you a few hypotheticals.
Have you ever had your x86_64, EFI32 laptop bricked by an OS that decided, without asking the user, to “update” the firmware on your machine? And it just so happened that it didn’t check the EFI platform and simply assumed that if the CPU is 64-bit, so is the EFI? No? Well, Ubuntu did that to my old MacBook. (I do not endorse using Apple; it’s my dad’s old laptop that held sentimental value.)
Have you ever been in the middle of updating your phone’s ROM, only to brick it because either you made a mistake in the convoluted steps or the ROM was bad? This hasn’t happened to me (yet), but it has happened to my friends lots of times.
Or maybe you have IoT devices (my condolences) and one (or more) of them just died during an unattended update you weren’t even aware of? (Push updates are the worst thing ever BTW, separate rant.)
All of these have one thing in common. They violate the 0th rule of firmware development. So what’s the rule? It’s simple:
Always keep two copies of your firmware on the device.
When updating, you download the new image to the slot you are not running from, and mark it “not-tested” or something. Set the boot to that slot, start a watchdog timer (IDK any modern SoC that doesn’t have a watchdog; if yours doesn’t, add an I²C one or something connected to an NMI) and reboot. If the watchdog timer runs out before a successful boot, or the boot fails in some other detectable way more than N times, you mark that slot as “bad” and boot back from the previous, working slot. In case of success, you mark the slot as “working” and mark the previous slot as “old”, so the next update will pick that slot.
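Here is a rough sketch of that slot bookkeeping in C, just to make the flow concrete. Everything in it is made up for illustration: the `boot_meta_t` layout, the function names, and `MAX_BOOT_ATTEMPTS` (the "N" above, set to 3 for no particular reason). On a real device the metadata would live in non-volatile storage and the watchdog would be armed before jumping to the image; neither is shown here.

```c
#include <stdint.h>

/* Hypothetical slot states, named after the ones used in the text. */
typedef enum { SLOT_OLD, SLOT_WORKING, SLOT_NOT_TESTED, SLOT_BAD } slot_state_t;

/* The "N" from the text. The value 3 is an assumption. */
#define MAX_BOOT_ATTEMPTS 3u

/* Hypothetical boot metadata; a real device keeps this in non-volatile storage. */
typedef struct {
    slot_state_t state[2];
    uint8_t attempts[2]; /* boots of a not-tested slot that were never confirmed */
    uint8_t active;      /* slot the bootloader will try next (0 or 1) */
} boot_meta_t;

/* The next update goes to the slot we are NOT booting from. */
uint8_t update_target(const boot_meta_t *m) { return m->active ^ 1u; }

/* Updater: call after the new image has been written to update_target(). */
void stage_update(boot_meta_t *m) {
    uint8_t t = update_target(m);
    m->state[t] = SLOT_NOT_TESTED;
    m->attempts[t] = 0;
    m->active = t;
}

/* Bootloader: call on every reset (including watchdog resets).
 * Returns the slot to jump to. */
uint8_t select_boot_slot(boot_meta_t *m) {
    uint8_t s = m->active;
    if (m->state[s] == SLOT_NOT_TESTED) {
        if (m->attempts[s] >= MAX_BOOT_ATTEMPTS) {
            m->state[s] = SLOT_BAD; /* too many failed tries: roll back */
            m->active = s ^ 1u;
        } else {
            m->attempts[s]++;       /* count this try before jumping */
        }
    } else if (m->state[s] == SLOT_BAD) {
        m->active = s ^ 1u;         /* never boot a slot marked bad */
    }
    return m->active;
}

/* Firmware: call once the new image has proven itself after boot. */
void confirm_boot(boot_meta_t *m) {
    uint8_t s = m->active;
    if (m->state[s] == SLOT_NOT_TESTED) {
        m->state[s] = SLOT_WORKING;
        m->attempts[s] = 0;
        m->state[s ^ 1u] = SLOT_OLD; /* previous slot becomes the next update target */
    }
}
```

The attempt counter is bumped before the jump, so a hang that ends in a watchdog reset still counts as a failed try even though the firmware never got to report anything.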
This technique also helps if your ROM develops a bad sector (a CRC check during boot should mark the slot as “bad”), or if a very rare and unfortunate bug you didn’t catch in development erases the wrong Flash page. Either way, having a redundant copy of the firmware makes your device way more resilient to fatal conditions. It also lets technicians issue a firmware downgrade quickly, easily and with less room for error.
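The CRC check at boot could look something like the sketch below. It uses the common CRC-32 (reflected polynomial 0xEDB88320, the one used by zlib and Ethernet) computed bit by bit; many SoCs have a hardware CRC unit you would use instead. The `image_header_t` layout and the `image_is_intact` name are my own assumptions, not something from a particular bootloader.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). Slow but tiny. */
uint32_t crc32_ieee(const uint8_t *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* XOR in the polynomial only when the low bit is set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Hypothetical header stored in front of each firmware image. */
typedef struct {
    uint32_t length; /* payload size in bytes */
    uint32_t crc;    /* CRC-32 of the payload, written at update time */
} image_header_t;

/* Returns true if the slot's payload still matches its recorded CRC.
 * The bootloader would mark the slot "bad" when this returns false. */
bool image_is_intact(const image_header_t *hdr, const uint8_t *payload,
                     size_t slot_capacity) {
    if (hdr->length == 0 || hdr->length > slot_capacity) {
        return false; /* header itself is corrupt or the slot is erased */
    }
    return crc32_ieee(payload, hdr->length) == hdr->crc;
}
```

Checking the length before computing anything matters: an erased or corrupted header could otherwise send the CRC loop reading far past the end of the slot.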
This requires you to sacrifice storage/code space, and it necessitates a bootloader of sorts. But it is way easier to thoroughly test a bootloader than your whole firmware. I also understand that many older embedded devices simply didn’t have the room to store even the one firmware the developers wanted, let alone two. But there is no excuse in the modern world for embedded devices to not have enough Flash storage. Nor has there ever been any excuse for consumer-oriented PCs and laptops to not have that space.
I honestly have no idea why so many devices to this day are shipped in a state where one wrong move bricks them. Well, maybe one idea - planned obsolescence. Or just sheer incompetence. Whatever it is, I didn’t come up with this rule alone; I am sure any reasonable person who has worked in the embedded space recognized at some point how valuable this is. (Like my old coworkers.)
I am giving this rule #0 because I am fed up with devices getting bricked all the time.
And with that, I leave you be for now, back to celebrations and nice food! Hope you have nice holidays and have fun!