The day the SORCE solar satellite was almost lost

This is the little-known story of what recently happened to the SORCE spacecraft and how it was nursed back to a mostly operational status over a period of weeks, after nearly dying in the cold of space due to what appears to be a software glitch. First, some background.

The Solar Radiation and Climate Experiment (SORCE) is a NASA-sponsored satellite mission that is providing state-of-the-art measurements of incoming x-ray, ultraviolet, visible, near-infrared, and total solar radiation. The measurements provided by SORCE specifically address long-term climate change, natural variability and enhanced climate prediction, and atmospheric ozone and UV-B radiation. These measurements are critical to studies of the Sun; its effect on our Earth system; and its influence on humankind.

The SORCE spacecraft was launched on January 25, 2003 on a Pegasus XL launch vehicle to provide NASA’s Earth Science Enterprise (ESE) with precise measurements of solar radiation. It launched into a 645 km, 40 degree orbit and is operated by the Laboratory for Atmospheric and Space Physics (LASP) at the University of Colorado (CU) in Boulder, Colorado, USA. It will continue the precise measurements of total solar irradiance (TSI) that began with the ERB instrument in 1979 and have continued to the present with the ACRIM series of measurements. SORCE also provides measurements of the solar spectral irradiance from 1 nm to 2000 nm, accounting for 95% of the spectral contribution to TSI.

SORCE carries four instruments including the Spectral Irradiance Monitor (SIM), Solar Stellar Irradiance Comparison Experiment (SOLSTICE), Total Irradiance Monitor (TIM), and the XUV Photometer System (XPS).

What happened: The spacecraft went into Safe Hold on Sunday, Sept. 26th. The failure appears to be due to a zero length data packet that scrambled the software control due to it being unable to handle the error condition. Two of three reaction wheels (like gyros) failed, the sun-aligned attitude needed for solar cell charging was lost, the batteries discharged, and the temperature of the spacecraft’s internal electronics plunged to as low as -30°C. Recovery of the spacecraft took three weeks of work and the support of 82 ground tracking stations, along with data relays via NASA’s TDRS network.

From the SORCE weekly status reports from 9/23 – 10/13:

SORCE experienced an OBC reset at 2010/269-17:57:40 due to the MU sending a CCSDS packet with a length of zero. An OBC reset results in the satellite regressing to Safehold, and performing basic power and attitude maintenance on the APE processor.
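The timestamp in the report uses year/day-of-year notation. As a minimal sketch (the helper name `parse_doy_timestamp` is an illustrative assumption, not part of any SORCE tooling), Python's standard library can convert it to a calendar date, which confirms that DOY 269 of 2010 is September 26:

```python
from datetime import datetime

def parse_doy_timestamp(stamp: str) -> datetime:
    """Parse a 'YYYY/DOY-HH:MM:SS' string; %j is the day of the year (001-366)."""
    return datetime.strptime(stamp, "%Y/%j-%H:%M:%S")

reset_time = parse_doy_timestamp("2010/269-17:57:40")
print(reset_time.isoformat())  # 2010-09-26T17:57:40
```

The same helper maps the later DOY values in the report (272, 273, and so on) to dates in late September and early October 2010.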

The following activities were performed to recover the observatory.

  • Spacecraft time was jammed and data dumped from after the anomaly
  • The OBC was reset to resync with the 1553 bus
  • OBC patches 6.8, 6.9 and 7.0 were loaded
  • All the spacecraft tables and RTS’s were reloaded
  • SORCE was then commanded out of safehold and back to OBC control on DOY 269.

At the next contact after exiting safehold, the spacecraft was found to be operating on one reaction wheel with lower than expected power margins. The flight operations team manually shed RWA 3 and both star trackers, turned off the transmitter, and commanded safehold.

Poor pointing performance led to a low battery charge state, and the APE power charging tables were changed to charge at a higher value. To recover from this configuration and a hybrid flight software configuration, the following actions were performed:

  • RWA FSW process was reinitialized to clear the faults.
  • The OBC AC control tables were poked for 3 wheel control.
  • The OBC was reset to clear any lingering issues which might have existed.
  • OBC patches 6.8 and 6.9 were loaded. OBC 7.0 for two wheel control was not loaded at this time due to questions about its performance.
  • All the spacecraft tables and RTS’s were reloaded.
  • RWA over speed was disabled in table 108. It was determined that RWA 4 had an over speed fault which led to the transition to one wheel control in contingency mode.
  • SORCE was then commanded out of safehold and back to OBC control on DOY 272.

A phased approach was taken to re-warm the satellite back to operational temperature. The degraded battery necessitated this approach. To duty cycle the battery heater around eclipse as is done in normal operations, a special ATS was loaded that included commands to power off the heater in eclipse. Over the next several orbits after exiting safehold the Star Tracker heaters and Instrument Bench heaters were enabled.

  • Star Tracker 1 and Star Tracker 2 were powered on DOY 273.
  • SORCE was also commanded to normal pointing mode on DOY 273.

The operational temperature of the battery with duty cycling and the instruments off was cooler than desired. The redundant battery heater set point was raised to cycle between 1 and 2 deg. C, which improved battery performance.
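The heater behaviour described here (off during eclipse, cycling between two set points otherwise) can be sketched as a simple hysteresis thermostat. This is only an illustration: the function name, the eclipse flag, and the sample temperatures are assumptions, and only the 1 and 2 deg. C set points come from the report.

```python
def heater_command(temp_c: float, heater_on: bool, in_eclipse: bool,
                   low_c: float = 1.0, high_c: float = 2.0) -> bool:
    """Return the next heater state for a bang-bang thermostat with hysteresis."""
    if in_eclipse:
        return False          # assumed: heater is forced off to spare the battery
    if temp_c <= low_c:
        return True           # too cold: switch on
    if temp_c >= high_c:
        return False          # warm enough: switch off
    return heater_on          # inside the band: keep the previous state

state = False
# Hypothetical (temperature, eclipse) samples, one per control step
for temp, eclipse in [(2.5, False), (0.8, False), (1.5, False), (2.1, False), (0.5, True)]:
    state = heater_command(temp, state, eclipse)
    print(f"{temp:4.1f} C  eclipse={eclipse!s:5}  heater={'ON' if state else 'OFF'}")
```

Keeping the previous state inside the band is what stops the heater from chattering on and off around a single threshold.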

http://lasp.colorado.edu/sorce/images/instruments/sorce_im_callouts.gif

The instrument suite was recovered as follows:

  • MU turned on and FSW patched and configured for normal operations on DOY 274
  • TIM turn on and science operations began on DOY 275
  • SOLSTICE B turn on and science operations began on DOY 275
  • SIM B turn on and science operations began on DOY 276
  • On DOY 277 SOLSTICE A was turned on. Due to a known “feature” where sending an instrument turn off command, followed by another instrument turn off command will power off the instrument after a turn on command, SOLSTICE A turn on was not successfully completed. Successful turn on and return to science was completed on DOY 278.
  • XPS was turned on and began taking science on DOY 279.

Read the entire summary at the SORCE weekly status reports from 9/23 – 10/13

The most recent SORCE weekly status report on 11/18/2010 is a bit more encouraging, as they finally got the SIM B instrument back online and the temperature/heater/available battery power situation seems to be managed now.

Here’s TSI data from SORCE, regularly plotted by Dr. Leif Svalgaard along with other solar data.

http://www.leif.org/research/TSI-SORCE-2008-now.png

h/t’s to Leif Svalgaard and Harold Ambler

33 thoughts on “The day the SORCE solar satellite was almost lost”

  1. We need all the reliable instrumentation we can get in order to shed light on these difficult questions of climate. Especially satellite instrumentation. No one is going to be casually sticking those instruments next to a black asphalt parking pad, as so often happens to the terrestrial variant. Kudos to the ground team for having saved a potentially unrecoverable situation, and averted the waste of an expensive and scarce platform.
    That said, we are told that “The failure appears to be due to a zero length data packet that scrambled the software control due to it being unable to handle the error condition.”
    Was a zero length packet in fact mistakenly uplinked?
    And why was input boundary condition testing not routinely performed in order to identify such control risks, well in advance of putting inaccessible hardware on orbit?
    Engineering 101, regardless of field, should be teaching us to try to anticipate all reasonably foreseeable circumstances of failure.

  2. [SNIP – okay we get it, stop bombing this tip over several threads, one in Tips and Notes is all that is needed ~mod]

  3. To boot …. short for boot strap (a term for the hardware involved) … short for “to pull up by the boot straps” … the ability to lift one’s self up by your own bootlaces.
    A kind of joke that it isn’t possible to have self booting software!

  4. Sounds like “that was close.” Sadly, it happens a lot. It’s always the errors that cause the failures.
    And why wasn’t this tested before launch and debugged? They should have had Al Gore writing the test procedures. But the real reason was the test engineers probably never thought it could happen — oops, isn’t that the purpose of testing, to find the things that can never happen, and fix them before they do?

  5. I knew they had problems with this equipment. I thought, like Icarus, it had flown too close to the sun.

  6. Usually, problems like this occur in the field when testing/fuzzing was cut short due to budget shortfalls (either time or money budget). You force people to rush something out the door, you get problems like this. Serious fuzzing just takes time.
    For the nonprogrammers: Fuzzing means to bombard a system with partially randomized data. You uncover a lot of flaws you wouldn’t have thought of before; but as it’s a probabilistic process it takes time, especially when the target system is slow.
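A minimal sketch of that idea in Python: two toy parsers (entirely hypothetical, not the SORCE flight code) are bombarded with random byte strings, including the empty one, and anything other than a clean rejection is counted as a finding. The one-byte length format is an assumption made up for the example.

```python
import random

class PacketError(ValueError):
    """Raised when input is rejected cleanly."""

def parse_packet(raw: bytes) -> bytes:
    """Toy format (an assumption): one length byte, then exactly that many payload bytes."""
    if len(raw) < 1:
        raise PacketError("empty packet")
    length = raw[0]
    payload = raw[1:]
    if length == 0 or length != len(payload):
        raise PacketError("bad length field")
    return payload

def fragile_parse(raw: bytes) -> bytes:
    """Same toy format with no checks: indexing an empty input raises IndexError."""
    return raw[1:1 + raw[0]]

def fuzz(parser, rounds: int = 10_000, seed: int = 0) -> int:
    """Feed random inputs to parser; return how many caused an unexpected exception."""
    rng = random.Random(seed)
    crashes = 0
    for _ in range(rounds):
        raw = bytes(rng.randrange(256) for _ in range(rng.randrange(0, 8)))
        try:
            parser(raw)
        except PacketError:
            pass              # clean rejection is the desired behaviour
        except Exception:
            crashes += 1      # anything else is a bug worth investigating
    return crashes

print("checked parser crashes:", fuzz(parse_packet))   # 0
print("fragile parser crashes:", fuzz(fragile_parse))  # non-zero: empty inputs slip through
```

Even this crude fuzzer finds the empty-input crash almost immediately; real targets are slower and have far larger input spaces, which is where the time goes.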

  7. I had contacted Leif to ask whether he agreed that there appeared to be the start of a downward trend in some SORCE data in the last couple of months. I was particularly interested in the appearance of the F10.7 and TSI plots. Leif pointed out that there had been a problem with the spacecraft and that there might be calibration issues to be waded through.
    Time will tell. Many thanks to Anthony for posting on this.

  8. [SNIP – okay we get it, stop bombing this tip over several threads, one in Tips and Notes is all that is needed ~mod]
    Most surprising and disappointing to see this going on here of all places. Maybe we should create a new epithet for the heretics who dare challenge Galileo’s four hundred year old theory about the composition of the Sun?

  9. Not very robust programming to blow up on an empty data packet.
    Who wrote the code – one of Phil Jones’ motley crew?

  10. Re: tarpon says: November 23, 2010 at 1:53 am

    And why wasn’t this tested before launch and debugged?

    Probably as DirkH said, cost and time. Feynman did a nice piece about this with his idea of ‘tiger teams’ incentivised to find and correct problems. Works well, but increases costs. I’ve also done some testing work and found that sometimes the more people ‘know’ the system, the less likely they are to do the unexpected and find more obscure bugs. Even simple things. One time I closed a window using the ‘x’ button. No way to recreate the window. Developer said I wasn’t supposed to do that. I could, I did, it broke.
    A similar situation may have occurred here, even though it was post-Ping of Death and the many malformed packet attacks used on the ‘net. It was an unexpected, untested condition, whether it was a zero length payload rather than a zero length packet, or the packet length count in the packet was zero in error. Still a nice example of engineering vs adversity, and hopefully cost cutting won’t prevent similar workarounds being usable on future satellites.
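The ambiguity between a zero length payload and a zero length count can be made concrete. In the CCSDS Space Packet Protocol, the 6-byte primary header ends with a 16-bit length field that stores the number of data-field bytes minus one, so a stored value of zero still promises one byte of data. The sketch below decodes that field and rejects packets whose actual size disagrees with it. The function and exception names and the example header values are illustrative assumptions, and nothing here is claimed to match the SORCE flight software.

```python
import struct

PRIMARY_HEADER_LEN = 6  # bytes in a CCSDS space packet primary header

class MalformedPacket(ValueError):
    """Raised when a packet's size does not match what its header promises."""

def data_field(packet: bytes) -> bytes:
    """Return the packet data field, or raise MalformedPacket on any size mismatch."""
    if len(packet) < PRIMARY_HEADER_LEN:
        raise MalformedPacket("shorter than the primary header")
    # Big-endian: packet ID (16 bits), sequence control (16 bits), data length (16 bits)
    _packet_id, _seq_ctrl, length_field = struct.unpack(">HHH", packet[:PRIMARY_HEADER_LEN])
    expected = length_field + 1          # field holds (data octets - 1)
    data = packet[PRIMARY_HEADER_LEN:]
    if len(data) != expected:
        raise MalformedPacket(f"header promises {expected} data bytes, got {len(data)}")
    return data

header = struct.pack(">HHH", 0x0801, 0xC000, 0)   # example values; length field = 0
print(data_field(header + b"\x2a"))               # b'*'
try:
    data_field(header)                            # header only, no data at all
except MalformedPacket as err:
    print("rejected:", err)
```

Either reading of "length of zero" ends up in the same place: the receiver has to check the size it was promised against the size it actually got before touching the data.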

  11. Thank you for this report. It takes me back to the “bad old days” when this kind of satellite recovery was just a “normal” part of my job. I miss it – but I don’t miss the sustained adrenaline and blood pressure highs or the 40 hour shifts that went with it. Or do I?
    The only technical comments I’ll make are that 1/ their initial recovery attempt was faulty and 2/ CCSDS on the 1553 bus generates more problems than you’d believe. On one of the spacecraft I worked with, there was a command sequence that would turn off the command transmitter – and receiver. I wonder if anyone understands what that means for spacecraft operational longevity? :-((

  12. Don’t be nasty or too harsh. Write a few thousand lines of code and see how easy it is not. When smoking was PC a person would pull out a Zippo lighter and “guarantee” it to work the first time. Then someone would ask if the owner was willing to bet his index finger against the possible failure to light. I never saw one not light, but I never saw anyone take the bet either. When your code is finished will you guarantee it to work?
    Keep in mind too – this is rocket science!

  13. @ tarpon and Atomic Hairdryer –
    Spacecraft (and ground system) contingency planning requires months of effort – and each spacecraft is unique, so each requires its own unique contingency plans. BTDT – more than once.
    The kicker is that in 42 years, I never at any time found a (any) spacecraft that was kind enough to fail in a mode that could be handled by the pre-determined contingency plans. Every failure was unique and required unique recovery solutions.
    The ostensible purpose of contingency plans is to provide a blueprint for recovery that can be blindly followed by a subnormal chimpanzee. This is the attitude exhibited by too many “managers” in the business. And it labels them as “amateurs”.
    The reality is that the only purpose in contingency planning and training is to teach the ops people the many possible operational variations for the spacecraft and the ground system. As Eisenhower put it during WWII – Planning is essential, plans are useless.

  14. Re Jim Owen

    Planning is essential, plans are useless.

    Or ‘no plan survives first contact with the users’. Agree on managers vs engineers though and had a brief encounter with satellite comms when the interplanetary internet was proposed. Short lived when I asked why use TCP/IP when it struggles with high latency and UDP is unreliable by default, neither optimal for long distance, reliable communications. I wasn’t invited back to subsequent meetings for being too sceptical. Not a fan of kludging inappropriate technology into doing jobs it was never designed for, despite working on the ‘net for far too long.

  15. Given the magnitude and uncertainties surrounding climate issues, given that only good data collected over time can resolve the issues, and given the instability of these platforms, it is amazing that the Post Normal Science crowd isn’t militating for having at least 3 SORCE-type satellites in orbit at all times. Then, aberrations of any one SORCE could be checked against the data from the other two (and downtime for any one SORCE would not leave science without data for that period).
    The cost would be minuscule compared with the costs these people would like to impose on our civilization as a whole based on our current weak climate data as interpreted through the lens of their “precautionary principle”.

  16. I am sorry for being so blunt, but unless you are a software developer, you don’t know enough to have a valid opinion. Whatever you think you know is probably wrong; your experience and intuition do not apply. BTW, for 10 years I wrote High Reliability software for a living and before that, I spent a decade doing qualification and testing of Level-S flight hardware for NASA.
    Some of you guys are whining about testing. Software is not like anything else, though software managers try to make it that way. One can build something in software that could not exist in the real world. The complexity level of many software systems is orders of magnitude greater than any other type of system. Testing of software is sometimes more complicated than the software being tested. Testing also suffers granularity problems, in that a methodology that is valid at one level is invalid at another.
    To put things in perspective, I recently wrote a small set of classes. It took me about one half hour to design, write, and run the thing. Even though it was only about 100 lines, it was still fairly complex with multiple asynchronous threads, and behavior changing delegates. It took me a *day* to write the unit tests, and another 3 days to do the testing and to validate that testing. (can’t just test, need to make sure the tests actually test). 4 days of testing for one half hour of coding!
    Right now, I am working for a company that you have all heard of, but whose management would not know a good software design if it bit them on the butt. No one is here right now due to too much snow on the ground, and there are no pressing project activities, so I am spending the day testing software, doing mean things like yanking cables and pulling power cords. Normally, there is not enough time to do this sort of thing. Just imagine trying to simulate the type of things that might happen in space, like for instance, a stray high energy particle meandering through a memory location, innocently changing a one into a zero. This is a very rare occurrence, but it does happen.
    On the other hand, we do indeed have a serious problem. It has more to do with software management and software project design than anything else. The problem has two main root causes. I will focus on one of them, as the other would take way too long. Software design methodologies are flawed. Managers who do not understand software have too much impact on the design. I call it “Design by Bullet Point”. Instead of creating a solution that maps onto the problem space, developers are being forced through budget and time constraints into creating solutions that answer the bullet points, and also do absolutely nothing else (a solution that solves something more completely than the bullet points is considered a bad solution!). When this design mentality hits NASA we will be in big trouble.

  17. @ Atomic Hairdryer –
    lol!!! We may have met. At one time, just before I retired, I was peripherally involved with the “Internet in Space” thing. Some of it looked promising, but needed some work. At the time, the conclusion was that it was useful in cislunar space, but the planetary apps (specifically Mars at that time) were sometime in the future. I was at the last conference that GSFC sponsored – before those conferences were cut due to budget considerations.

  18. @ DesertYote –

    When this design mentality hits NASA we will be in big trouble.

    Condolences are in order. I once worked in that kind of environment, doing IV&V on the Hubble Space Telescope C3 software systems. About 6 million lines of code. Interesting, frustrating, time consuming and ultimately impossible to eliminate ALL the bugs. Took 3 years and 8 or 9 test cycles to convince the management that “their” take on the robustness of the multiple computer interfaces was totally wrong.
    Your statement above is true – except for the timing. It happened sometime around 1990.

  19. BTW, the assumption that the zero length data packet had anything to do with external communication is silly. Subsystems on the bird, communicate with each other via protocols that make use of, you guessed it, “data packets”. The fact that there was an unhanded zero length error indicates that it originated from a subsystem that was beveled to be incapable of sending a zero length data packet, as normally these would have at least on byte by design.

  20. In the harsh environment of Space there can also be transient hardware errors, e.g. random changes to memory caused by energetic particles or spacecraft charging.
    REPLY: Yes, I remember when Intel 2102 RAM chips first came out, it was discovered they were sensitive to cosmic rays.
    http://lambda-diode.com/opinion/ecc-memory
    Space is hostile, design considerations many, and there usually are no second chances. It’s not like you can call Triple-A and ask for orbit-side service. – Anthony

  21. Re: Jim Owen says: November 23, 2010 at 9:43 am

    @ DesertYote –
    When this design mentality hits NASA we will be in big trouble.
    Condolences are in order. I once worked in that kind of environment, doing IV&V on the Hubble Space Telescope C3 software systems. About 6 million lines of code. Interesting, frustrating, time consuming and ultimately impossible to eliminate ALL the bugs. Took 3 years and 8 or 9 test cycles to convince the management that “their” take on the robustness of the multiple computer interfaces was totally wrong.
    Your statement above is true – except for the timing. It happened sometime around 1990.

    I was afraid that might have been the case, but I was hoping! I bailed about that time, right after the GAO witch hunts. BTW, most of my work involved C3 also, TDRSS, GRO/COBE, GOES9/10, Venus Mapper, Mars Explorer.

  22. @ ian middleton –

    Of course these problems would not occur if they used DOS 3.1 . That was bulletproof.


    lol!! What they use on the Shuttle system is “almost” that advanced. And one of the 2 HST computers was the same model – leftover 60’s era MMS computers. That’s why the Shuttles have to reload the computers in-orbit – so they can get back “home”.

  23. @ DesertYote –
    Mine was Nimbus (1-4), Landsat (all of them), UARS, HST, TDRSS (Space Network) and a couple black programs that I still can’t talk about.
    And Leif is right – it’s NOT a friendly environment up there. We flew the first GPS unit on Landsat 4. It wasn’t built with hardened components so turning it on was always a crapshoot re: how long it would operate before crashing. IIRC, the longest it stayed up was about 15 minutes (although I could be off by a few minutes there). I know for sure that every time we turned it on, it was down for at least a week while we reworked the ground data base and the “airborne” software. GPS has come a long way since those days. So have the science instruments – but only because of the knowledge base that we built back in the “bad old days”. One should never talk about past failures without realizing that those failures were the “learning curve” for today’s successes.

  24. I remember using SRAMs with packaging that emitted alpha particles – but we didn’t find out until we started looking into unexplained crashes. “But not to worry,” the vendor said. “The failure rate is something times ten to the minus something!” So, we multiplied the number of chips in our system by the failure rate, and sure enough it was supposed to fail every half hour.
    I have great respect for the folks who send their software and hardware into space – it’s not an easy proposition.
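The arithmetic in that story is worth spelling out: if each chip fails independently at a small constant rate, the rates add up across the system, and the mean time between failures is the reciprocal of the total. The numbers below are purely hypothetical, chosen only so the result lands on the half hour mentioned above; the actual vendor figure and chip count are not given.

```python
def system_mtbf_hours(failures_per_chip_hour: float, chip_count: int) -> float:
    """Mean time between failures when independent chip failure rates simply add."""
    return 1.0 / (failures_per_chip_hour * chip_count)

# Hypothetical numbers, chosen only to illustrate the arithmetic
per_chip_rate = 1e-3   # soft errors per chip per hour (assumed)
chips = 2000           # chips in the system (assumed)
print(system_mtbf_hours(per_chip_rate, chips))  # 0.5
```

A per-chip rate that sounds negligible on a data sheet stops being negligible once it is multiplied by every chip in the box.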

  25. Every sw/hw testing group should have at least one clueless ijit on staff, to do whatever comes naturally. After indulging in howls of outrage after each crash, they’d actually get important improvements in reliability.
    🙂

  26. DesertYoghurt said: “The fact that there was an unhanded zero length error indicates that it originated from a subsystem that was beveled to be incapable of sending a zero length data packet, as normally these would have at least on byte by design.”
    Hey bud, it looks like your packets are getting mangled a bit. You should know better than most never to bevel the on byte of your packets once you’re in orbit!
