Skill, inspiration and teamwork: deep-space problem solving
"Hang on, lads; I've got a great idea..."
Last month, we spoke with several of the Mars Express team here at ESOC about their almost completed activities to restore, reconfigure and return Mars Express to service.
An interview with Mars Express Spacecraft Operations Engineer Daniel Lakey
Spacecraft Operations Engineer Daniel Lakey sitting beside the SSMM
A black box, edge-length 30 cm, is at the centre of the recent trouble with Mars Express.
Daniel Lakey, an engineer working on the mission at ESOC in Darmstadt, looks down at the engineering model of said black box sitting on his desk and recalls the seemingly endless night shifts he has had to pull because of its twin mounted on Mars Express, orbiting the Red Planet many millions of kilometres away.
In mid-August 2011, Mars Express unexpectedly placed itself into safe mode – think blue screen of death and reboot on a PC – because something went wrong either with the Solid-State Mass Memory (SSMM) housed inside this black box or with the on-board channels it uses to pass data to the spacecraft’s data management system (DMS) computer.
To extend the PC analogy, imagine that the memory chips in your computer, the RAM chips, or the memory controllers that tell them what to do, suffered a fault. The memory might continue functioning, apparently normally, but whenever an electronic signal tried to access the faulty unit, the operation would fail and the system would crash. That's what happened with Mars Express.
Holiday phone call at 3 AM
"I was on holidays in England when I got the call at three o'clock in the morning. Since I'm assigned as the mission's software coordinator, the problem fell in my area of responsibility," says Lakey.
Switching into safe mode means that the spacecraft automatically turns its solar panels to the Sun for maximum energy and its antenna to Earth for good communication – ostensibly very helpful in any untoward situation – but this process uses a significant amount of vital fuel. Every unnecessary safe mode reduces the life of this hugely valuable mission, and in safe mode, normal gathering of scientific data stops.
After an initial investigation, it was found that the safe modes were being triggered by the DMS computer whenever a batch of commands transferred from the SSMM was interrupted.
The problem: Big command batches were being interrupted, triggering a fuel-gobbling safe mode
The SSMM is a large-capacity device, and it stores large numbers of commands sent by mission controllers and the instrument scientists, as well as raw data gathered by the instruments (prior to their being radioed back to Earth).
The SSMM then delivers a constant ‘stream’ of commands to the DMS computer one at a time; when the stream is interrupted – either due to a fault in the SSMM or due to some unknown problem with the on-board communication channels – the DMS detects the problem and auto-commands the spacecraft to switch to safe mode.
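The behaviour described above can be pictured with a toy model. This is only an illustrative sketch, not the flight software: the names LinkInterrupted, stream_commands and run_streaming are hypothetical, and the real on-board logic is certainly more involved.

```python
class LinkInterrupted(Exception):
    """Hypothetical signal for a transient fault on the SSMM-to-DMS link."""


def stream_commands(commands, fail_at=None):
    """Yield commands one at a time, like the SSMM stream; optionally fail mid-stream."""
    for index, command in enumerate(commands):
        if index == fail_at:
            raise LinkInterrupted(f"stream interrupted before command {index}")
        yield command


def run_streaming(commands, fail_at=None):
    """Toy DMS loop: execute each command on arrival; any interruption means safe mode."""
    executed = []
    try:
        for command in stream_commands(commands, fail_at):
            executed.append(command)
    except LinkInterrupted:
        # The DMS cannot tell whether its to-do list is complete, so it bails out.
        return "SAFE_MODE", executed
    return "NOMINAL", executed
```

In this model a single glitch anywhere in a long stream ends in safe mode, with some commands already executed and the rest never delivered.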
Taking action - but problems persist
At first, the flight control team executed the standard recovery procedures and restarted observations, hoping that Mars Express would function normally again.
But, frustratingly, safe modes happened two more times in the next few weeks, even though the engineers had tried switching on-board systems to use back-up communication channels (there is only one SSMM), among many other normal fixes. Nothing in the routine procedures, it seemed, could prevent the frustrating safe modes from occurring.
"We had to find a solution," says Lakey, "otherwise the mission would have soon been over."
By late August, the team had already gone through many night shifts trying to coax the recalcitrant spacecraft into some sort of stable configuration, with little luck.
"But then one day, an idea came to mind – while I was standing under the shower," says Lakey, with a laugh. "It occurred to me that, since something was happening to interrupt the flow of commands, triggering the safe mode, the solution might lie in by-passing the checks between the SSMM and the DMS computer, and finding a safe way to ignore problems with the link between the two."
With a little checking, Lakey was able to determine that the problem was, in fact, an issue of 'transient communication problems' between the SSMM and the computer. "When the main computer sees this interruption, it interprets it as a serious problem and stops executing its 'To-Do' list of commands – because it doesn't know whether the list is complete," says Lakey.
Fortunately, there's another, back-up, memory inside the DMS computer that could store the command stack, but it's much, much smaller than the SSMM, holding only 117 commands vs. over 3000.
So the engineers set about reconfiguring the spacecraft's systems to transfer commands from the SSMM to the onboard computer's memory in a different way. Rather than a constant stream of commands, one at a time, the commands would be transferred as a discrete block of commands relating to one complete spacecraft activity, just before that activity started.
"I thought we could use a trick, by packing the commands into smaller stacks and telling the on-board software to act only when it received a complete package. This 'all-or-nothing' scheme means we're no longer affected by the SSMM problems, but now we have more limits on what we can schedule in one go – but that's been proven to be acceptable."
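The 'all-or-nothing' idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the actual on-board procedure: the function names are hypothetical, the 117-command limit is taken from the figure quoted above, and skipping an activity whose block did not arrive intact is an assumption about how a failed transfer might be handled.

```python
DMS_CAPACITY = 117  # commands the DMS back-up memory can hold, per the article


def transfer_block(block, link_ok=True):
    """Return a complete copy of the block, or None if the transfer was interrupted."""
    if len(block) > DMS_CAPACITY:
        raise ValueError("activity does not fit in the DMS back-up memory")
    return list(block) if link_ok else None


def run_activity(block, link_ok=True):
    """Execute an activity only if its whole command block arrived."""
    received = transfer_block(block, link_ok)
    if received is None:
        # Assumption: an incomplete block is discarded, so nothing partial runs.
        return "SKIPPED", []
    return "DONE", received
```

The trade-off Lakey mentions shows up directly: a transient link fault costs at most one activity, but every activity must now fit within the capacity limit.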
But would they buy it?
As soon as he could, Lakey presented the idea to his colleagues.
"Perhaps predictably, they reacted with an operations engineer's traditional caution and scepticism. The first answers were, 'No, no, that won't work, No way...' But, after a lot of discussion, they slowly came around to 'Oh wait... maybe we should look at this... it could work'," Lakey recalls.
The solution: make command batches small
With a clear consensus and the approval of Mars Express Spacecraft Operations Manager Michel Denis, the team set to work designing operations procedures that could be implemented using reduced command stacks, working first on just a certain set of basic on-board activities. This was a huge challenge.
As designed, Mars Express normally makes use of thousands of commands; for example, it takes up to 50 separate commands to simply take a single photo of Mars using the HRSC camera. The new, reduced command stacks would prove worthless if engineers couldn't actually accomplish anything with them.
Thus, making the solution work entails a massive amount of reprogramming to drastically reduce the number of commands needed to do anything on board. This work is what has kept the mission operations team on extended hours since November 2011.
Smiles all around
But, to everyone's delight, the solution is working, and the team has substantially finished the work of converting thousands of commands into on-board procedures that can be used much more efficiently than the spacecraft's designers had ever envisaged.
"We are confident that all the Mars Express instruments and systems can be commanded using the reduced command stacks," says Lakey.
"Now, we only need a few commands to capture an image and we can switch on and operate all the instruments at one time," he explains.
"We can proudly say that Mars Express is working properly again – and, with luck, the fuel left could last for another ten years."
27-02-2012 • 22:43:04
I was slightly confused until Daniel produced this one sentence: "The hardest part has been taking the simple concept and applying it to a complex deep space mission, with tight constraints and complex interactions." I can only imagine the possible anomalies and resultant workarounds this must have required.
My hat is off to all you guys, and I just shake my head at the genius (IMO) of the group. Man, you folks looked at every angle and contingency and came up with a winner. You saved this wonderful mission from an early ending, and that is simply magnificent work.
09-02-2012 • 13:05:03
Kudos to the ops engineers, and the FCRs. However, to the person who wrote the article... COME ON.
"Transient communication error" - and discretising packets of information is not a novel idea. HTTP, E-mail, the entire internet! They're all based on this. Run the ipconfig utility and ping any website. You can, yourself, set packet size for your computer to make your internet connection more robust, and 'faster'. It's patronising to think that this was the crux of the issue. As rightly mentioned by Daniel Lakey, the more 'fun' bits would have been the translation of this concept to a complex situation like Mars Express.
The article should've probably dwelt on how the engineers ingeniously reduced the number of steps required to do something from 50 to 2.
While I understand that this is to reach out to a non-technical audience, and engage in STEM outreach-
a) It's too long to reach out to people who are not already interested in this
b) It disappoints knowledgeable fans by only mentioning broad topics.
c) (the worst) It may turn people off - by misrepresenting the distribution of efforts for various tasks; making the easy thing seem hard, and the hard things easy!
Better sci/tech journalism next time round please. Or Hire me :D
Nonetheless, whoever did this might have had a million other things on their plate, so - in spite of my reservations with it, thank you for the article!
09-02-2012 • 15:44:33
You are overlooking several important factors, not least of which is that the Internet is not a critical real-time system and thus has ample time for re-tries and data redundancy. Mars Express does use packetised data for internal and external communications; however, the delivery of these is time critical, and resource constraints (bandwidth, memory, alternative routes) are extremely tight, which in turn limits the amount of internet-style packet switching and resending that can be used to work around a transient issue.
Command packets received too late, or out of sequence, are dangerous, which is why the default behaviour is to go to safe mode in such a circumstance.
09-02-2012 • 09:35:28
To put things in perspective, Mars Express has survived the loss of 30% power since launch, the partial (and unstable) deployment of a 20m long antenna for one month, 2 weeks on a 'crash course' for Mars (before the ejection of Beagle-2 lander) and more than 20 Safe Modes prior to this problem.
Fortunately, the on-board control procedures for several instruments had already been written over a year before; those for other instruments could be written in days rather than months with the new MOIS system. The return to near full-mission within a few months was made possible by the very flexible planning system (and ingenious mission planners), the dedication of the entire flight control team, and the infinite support & patience of the Science Ground Segment (in ESAC, Spain), the Mission Control System maintenance team, PI instrument science teams, S/C Ops Manager and the Mission Manager.
This particular success was down to the teamwork, coordination and cooperation shown by the over 100 engineers and scientists involved in Mars Express.
09-02-2012 • 09:11:24
The hardest part has been taking the simple concept and applying it to a complex deep space mission, with tight constraints and complex interactions. This challenge has occupied every member of the flight control team, working together to ensure that all cases are covered and that the new concept is safe, reliable and practical to fly.