2333 CT Leiden
Semiotic Labs is a fast-growing scale-up based in Leiden, the Netherlands. We serve a broad range of industries with SAM4, a predictive maintenance system that leverages machine learning algorithms and IoT sensors to detect upcoming failures in critical assets up to months in advance. Clients such as Vopak, Schiphol, Nouryon and ArcelorMittal use SAM4 to prevent unplanned downtime, prioritize maintenance tasks and reduce energy waste.
Computing on the edge
How I hit the ground running when I joined a growing scale-up in the middle of a gateway issue
By Onno Steenbergen
Software developer at Semiotic Labs
After several years managing computer networks and routers at a major Dutch ISP, I joined Semiotic Labs’ three-person development team in February 2019. My main focus would be to improve the stability of the devices that we install at our customers’ sites.
SAM4 system hardware as it installs inside the motor control cabinet at the customer location, from left to right: power supply, gateway, switch, data acquisition device.
ISPs call these devices customer premises equipment; you could also call them IoT devices, since Semiotic Labs’ SAM4 system includes sensors. You could even get fancy and say IIoT, because SAM4 is deployed at industrial sites such as wind farms, steel plants and airports.
Whatever the name, from a networking and software perspective the primary issue with remote hardware is always the same: how to keep the gateway up and running.
From corporate structure to GTD agile
After I joined, it was immediately clear that my new employer was leaner than a lot of companies out there. The Semiotic dev team runs a getting-things-done type of agile: each developer maintains a list of tickets and plans work on them as needed. Meetings are kept to a minimum.
There’s very little hierarchy at Semiotic Labs. You’re responsible for figuring out what needs to be done and then doing it, without an order coming down from above. It’s a strategy that works in part because the company is relatively small; getting answers and aligning work is as easy as walking two desks over to chat with someone in operations or data science—or even the CTO. That level of independence (and responsibility!) isn’t for everyone, but it was part of what drew me to the company.
There was another major change from my previous job: the management tooling I was used to didn’t exist yet. The basics were there, but to improve device availability we needed additional metrics: how much bandwidth was being used, how stable the internet connection was, the state of the modem, and several more. Part of my new job was to make that happen.
But first, we had an enigmatic bug to squash.
Dropped measurements and high CPU usage on part of our fleet
Just before Christmas 2018 (about six weeks prior to my arrival), a puzzling issue had reared its head. A small number of customer gateways started inexplicably dropping sensor measurements, accompanied by high CPU usage. The operations team, which owns the devices, would restart the gateway and the issue would resolve, but a couple of hours later it was back.
The dev team started investigating on the software side. On 2 January 2019 they pinpointed the cause: a runaway modem management process named mbim-proxy. It was fully utilizing one of the two cores available in the gateway.
The next step was to find out how many gateways were suffering from the issue. The answer wasn’t obvious; most of the installed gateways are responsible for just a handful of sensors, which a single core can handle, so the loss of the second core to mbim-proxy didn’t show up as dropped measurements. Using dropped sensor measurements as the flag would only reveal the small portion of gateways that needed the second core. To understand the true impact, the dev team needed insight into the gateways themselves.
Once a minute, the gateways log performance metrics to an AWS DynamoDB table. The team wrote a script to query the table and find out how many gateways were affected. It turned out that roughly 30 percent of the entire population was suffering from the runaway process.
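The following is a minimal sketch of the kind of analysis such a script might perform, not the team's actual code. It assumes the per-minute metric rows have already been read from the table into plain dictionaries, and the field names gateway_id and cpu_percent are hypothetical. It also assumes that on a two-core gateway a process pinning one core shows up as roughly 50 percent total CPU; both thresholds are illustrative guesses.

```python
from collections import defaultdict

# Hypothetical thresholds: one pinned core on a two-core gateway is ~50% total CPU.
CPU_THRESHOLD = 50.0
MIN_FRACTION_HIGH = 0.9  # share of a gateway's samples that must exceed the threshold


def affected_gateways(rows, cpu_threshold=CPU_THRESHOLD, min_fraction=MIN_FRACTION_HIGH):
    """Return the set of gateway ids whose CPU usage is high in most samples."""
    high = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        gid = row["gateway_id"]
        total[gid] += 1
        if row["cpu_percent"] >= cpu_threshold:
            high[gid] += 1
    return {gid for gid in total if high[gid] / total[gid] >= min_fraction}


def affected_share(rows, **kwargs):
    """Fraction of all gateways seen in rows that are flagged as affected."""
    gateways = {row["gateway_id"] for row in rows}
    if not gateways:
        return 0.0
    return len(affected_gateways(rows, **kwargs)) / len(gateways)


# Made-up data for illustration: ten gateways, three with a pinned core,
# each reporting once a minute for an hour.
sample_rows = [
    {"gateway_id": f"gw-{i}", "cpu_percent": 55.0 if i < 3 else 12.0}
    for i in range(10)
    for _minute in range(60)
]
print(affected_share(sample_rows))
```

Flagging on sustained CPU rather than on dropped measurements is what catches the gateways that only serve a handful of sensors and therefore never visibly lose data.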
A screenshot from Jira issue DV-83, “Investigate Gateway CPU load variations and causes,” opened on 19 December 2018. On 3 January 2019 it spawned the issue that would be my main focus for 2019: DV-92, “mbim-proxy process on Gateways use 100% CPU.”
The next question was, why? Why these and not the other 70 percent?
Meanwhile, the dev team updated the modem software, with no result. They contacted Dell, the gateway manufacturer, which said it had never seen this issue before. And they kept digging through the gateway logs to find patterns that would answer “why.”
It’s all in the connection
The gateway supports Ethernet, Wi-Fi and 4G mobile connections; which one gets used depends on what’s most appropriate for the customer and the location. Roughly 30 percent of customer gateways connect over mobile 4G, the same share that was suffering runaway mbim-proxy processes. Further inspection confirmed that the two groups were the same. Only gateways connecting over 4G had the problem.
There were three possible culprits: the network, the hardware, or the software.
Mobile connections have the advantage that they’re easy to add without interfering with existing networking infrastructure. The downside is that the network is nowhere near as robust as a cabled one. Throughput and latency can vary based on the weather or the number of cell phones nearby. Plants are often located outside normal coverage areas and are usually full of signal-disrupting metal. With a single sensor generating at least 5 GB of data each month, and the gateway able to support up to ten of them, SAM4 needs more bandwidth than the average IoT system. An occasional hiccup was to be expected. But consistently and constantly on every single gateway that was using 4G, in disparate locations? That couldn’t be the network.
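To put the data volume in perspective, here is a rough back-of-the-envelope conversion from monthly volume to average sustained upload rate. It assumes decimal gigabytes and a 30-day month, and it treats the traffic as perfectly even, which real sensor uploads may not be; the function name is just for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumption: a 30-day month
BYTES_PER_GB = 10**9                   # assumption: decimal gigabytes


def average_kbit_per_s(gb_per_month, sensors=1):
    """Average sustained upload rate in kbit/s for the given monthly volume."""
    bits = gb_per_month * BYTES_PER_GB * 8 * sensors
    return bits / SECONDS_PER_MONTH / 1000


per_sensor = average_kbit_per_s(5)
full_gateway = average_kbit_per_s(5, sensors=10)
print(f"one sensor:  {per_sensor:.1f} kbit/s")
print(f"ten sensors: {full_gateway:.1f} kbit/s, {5 * 10} GB per month")
```

Under those assumptions a single sensor averages about 15 kbit/s around the clock, and a fully loaded gateway about 154 kbit/s, or at least 50 GB per month.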
It also wasn’t the hardware: we had two modem types and multiple firmware versions in the field, and the issue cut clean across all of them.
That meant it was the software.
This is when I joined the company. My mission was clear: find out what’s going wrong, and fix it.
Finding an elusive bug
Determining the cause of mobile connection issues is easy if you have access to a well-equipped testing lab staffed by experts on radio communications. As a scale-up, we lacked the expertise and the equipment, so we started by gathering information and running a dozen gateways over 4G in one corner of our office.
Our makeshift testing lab, running a dozen gateways over 4G to try to reproduce the issue.
After two weeks, we hadn’t spotted a single runaway process.
Meanwhile, further digging indicated the fix might lie with the gateway’s OS, Ubuntu Core, which is a minimal, security-focused version of Ubuntu Linux. A problem very close to what we were seeing had been fixed in the Linux kernel, but Ubuntu Core 16 seemed to be missing that fix. We contacted Canonical, the company behind Ubuntu Core, and explained the situation.
Part of our email to Canonical.
But we couldn’t know if this fix would solve our issue unless we could reliably test it. We kept running different kinds of tests in our makeshift lab, and kept failing to reproduce the problem.
We sifted through the data we’d gathered from the devices in the field, and resifted, and sifted some more, searching for clues. Finally, in early August 2019, we found the key: the issue always (and only) happened when the modem switched between HSDPA and UMTS.
(Which was a problem of its own—the ISP we use has a 99 percent 4G coverage rate in the Netherlands, so our nearby customer gateways shouldn’t have been running on 3G in the first place. Another ticket for the Jira backlog.)
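The correlation between runaway processes and switches between HSDPA and UMTS can be checked with something like the sketch below. It is an illustration rather than the analysis we actually ran: the event format, the five-minute window and the sample timestamps are all made up. It checks one direction only, namely that every runaway onset follows a switch; the reverse check, that every switch is followed by a runaway, would be the mirror image.

```python
from datetime import datetime, timedelta

# Hypothetical window: how soon after a switch the runaway must begin to count.
WINDOW = timedelta(minutes=5)


def unexplained_onsets(mode_changes, runaway_onsets, window=WINDOW):
    """Return runaway onsets with no HSDPA<->UMTS switch in the preceding window.

    mode_changes is a list of (timestamp, old_mode, new_mode) tuples;
    runaway_onsets is a list of timestamps at which CPU usage spiked.
    """
    switches = [
        ts for ts, old, new in mode_changes
        if {old, new} == {"HSDPA", "UMTS"}
    ]
    return [
        onset for onset in runaway_onsets
        if not any(timedelta(0) <= onset - ts <= window for ts in switches)
    ]


# Made-up events for a single gateway.
t0 = datetime(2019, 8, 1, 9, 0)
mode_changes = [
    (t0, "LTE", "HSDPA"),
    (t0 + timedelta(minutes=30), "HSDPA", "UMTS"),
    (t0 + timedelta(hours=2), "UMTS", "HSDPA"),
]
runaway_onsets = [
    t0 + timedelta(minutes=31),
    t0 + timedelta(hours=2, minutes=2),
]
print(unexplained_onsets(mode_changes, runaway_onsets))
```

An empty result means every runaway episode in the data is preceded by a switch, which is the pattern we found in the field logs.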
But it explained why we weren’t able to reproduce the problem even once: our office is right beside a 4G antenna. We dug through the technical sheets and found a low-level method to force the modem to use 3G. At last, eight months after the problem first reared its head, we were able to reproduce it in our “lab”—which meant we could start testing fixes.
The issue tracker doesn’t capture the joy and relief of this day, when we finally reproduced the bug in our lab.
We logged a ticket with Canonical to coordinate the process to fix the Ubuntu Core 16 kernel.
Adding company structure, GTD agile style
Meanwhile, the company was growing fast. There had been occasional ad hoc priority meetings before I joined, but as the team and the customer base grew, so did the need for structure. So in January 2019 the company had added a formal cross-team priority meeting every two weeks where the dev team (responsible for the software side of our devices) met with the operations team (responsible for the physical side). With changes coming out of the priority meeting only every two weeks, a weekly dev team meeting was still enough to keep everyone updated on progress.
(A year later and at twice the company size, it’s a system that still works. We’re still committed to keeping meetings to a minimum.)
Two weeks after we filed our ticket with Canonical, they sent us a test kernel with the fix included. Rolling it out was important enough that the priority meeting set up its own Jira board, and on 26 August 2019 issue PRIOR-1 was born. Its assignee: me.
The priority meeting’s first-ever Jira issue.
Rolling out the fix
I spent the next two months exhaustively testing the kernel on eight different gateways to confirm that it had fixed the problem without introducing new ones. In early November 2019 I gave Canonical the thumbs-up, and they merged the modified kernel into the mainline kernel for Ubuntu Core.
We release a new version of our software almost daily. Inspired by mobile phone apps and Docker virtualization, everything is packaged in an independent “snap” that can be downloaded from the Ubuntu Core snap store. Even the OS itself is released as a package. So fixing the problem was as easy as uploading a new snap: the gateways would automatically download and install it.
The operations team drew up a rollout plan and we started pushing the fix to our production fleet in mid-December 2019, almost a year to the day from when the issue first surfaced.
An issue is worth a thousand improvements
Solving our mbim-proxy problem drove improvements throughout our software and cloud infrastructure. Operations now has a dashboard with a concise gateway overview, showing which sensors are connected and what the system’s status is. Using what we’ve learned about mobile networking, we added a live graph of 4G signal strength to the installation tool that our field engineers use, which helps them position the antenna better and improve our first-time-right deployments. We’ve added failover support, so the mobile connection will take over if Ethernet fails, and made other networking-related improvements across the entire codebase.
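The failover idea can be sketched in a few lines. This is an illustration of the principle only, not our implementation: the link names, the preference order and the health dictionary are all assumptions, and a real gateway would also need health checks and some damping to avoid flapping between links.

```python
# Hypothetical preference order: wired first, mobile as the fallback.
PRIORITY = ["ethernet", "mobile"]


def pick_active_link(link_is_up, priority=PRIORITY):
    """Return the first healthy link in priority order, or None if all are down."""
    for name in priority:
        if link_is_up.get(name, False):
            return name
    return None


print(pick_active_link({"ethernet": True, "mobile": True}))
print(pick_active_link({"ethernet": False, "mobile": True}))
```

With both links healthy the wired connection is chosen; when Ethernet is reported down, the mobile connection takes over.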
Post-mbim-proxy, most gateway issues have been the result of external factors: network problems or unplanned customer maintenance. But that doesn’t mean our work is done. Fleet management is still an industry-wide concern, and we’re looking forward to implementing AWS IoT to relieve some of the problems we’re having. We’re currently working to introduce more powerful hardware that operations will install in the coming months. With that, we might even be able to run AI on the edge, bringing new challenges to our quest to keep our gateways at optimal performance.
I was hired to get our fleet stable, and to reduce the burden on operations to keep it that way. So far it’s been a great ride, and I’m looking forward to where it takes me in 2020.