Redundancy and failure analysis should extend beyond the subsystem under design and its end items to cover potential failure scenarios in all relevant interfacing subsystems.
The Launch Release Subsystem (LRS) is a critical electrical subsystem located on the mobile launcher, the ground platform structure that will launch NASA’s Space Launch System (SLS) rocket and Orion spacecraft on Artemis missions to the Moon and Mars. LRS generates the signals used to release and retract the umbilicals from the SLS vehicle at T-0/launch. It is also used to command and monitor the Launch Abort System and Booster Ignition Safe and Arm devices as well as Interim Cryogenic Propulsion Stage functions.
Due to the critical nature of most LRS functions, redundancy was heavily considered during the design and development phase. However, an issue with the redundancy architecture was identified after the subsystem completed certification. The root cause of the problem was that the redundancy and failure analysis for LRS was incomplete. The analysis reviewed how LRS interfaced with the end items such as umbilical release mechanisms and safe and arm devices but did not consider all potential failure scenarios of other subsystems that interface with LRS.
During the design and development phase, a complete end-to-end evaluation and analysis of redundancy and potential failure scenarios should be performed. The analysis should not be limited to only the subsystem under review but should also account for all interfacing subsystems.
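One way to make such an end-to-end analysis concrete is to inject single faults into a model of the redundant paths, first within the subsystem alone and then across the interface. The Python sketch below is illustrative only: the component names, the two-channel layout, and the shared downstream element are all hypothetical assumptions, not the actual LRS or interfacing-subsystem design.

```python
# Hypothetical two-channel model; every name here is an illustrative
# assumption and does not describe real LRS or interfacing hardware.
SUBSYSTEM_PATHS = {
    "A": ["src_cmd_a", "src_relay_a"],
    "B": ["src_cmd_b", "src_relay_b"],
}

# Assumed routing inside an interfacing subsystem: both channels
# happen to pass through one shared element.
INTERFACE_ROUTING = {
    "A": ["ifc_input_a", "ifc_shared_card"],
    "B": ["ifc_input_b", "ifc_shared_card"],
}


def end_to_end_paths(upstream, downstream):
    """Join each channel's upstream path with its downstream routing."""
    return {ch: upstream[ch] + downstream[ch] for ch in upstream}


def path_works(path, failed):
    """A path works only if none of its components has failed."""
    return all(component not in failed for component in path)


def single_points_of_failure(paths):
    """Fail each component in turn; report those that break every path."""
    components = sorted({c for path in paths.values() for c in path})
    return [
        c for c in components
        if not any(path_works(path, {c}) for path in paths.values())
    ]


subsystem_only = single_points_of_failure(SUBSYSTEM_PATHS)
integrated = single_points_of_failure(
    end_to_end_paths(SUBSYSTEM_PATHS, INTERFACE_ROUTING)
)

print("Subsystem-level single points of failure:", subsystem_only)
print("End-to-end single points of failure:", integrated)
```

In this assumed layout the subsystem-level check reports no single point of failure, while the end-to-end check flags the shared downstream element. A real analysis would replace the hard-coded paths with the documented signal routing of every interfacing subsystem.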
NASA Kennedy Space Center Launch Release Subsystem Lead Engineer Damion Lucas on the importance of this lesson learned:
As designers, we always want to make sure that we have redundancy, not only to meet our requirements, but to ensure crew safety and mission success. It’s essentially a given when you’re designing systems, especially critical ones like the Launch Release Subsystem, that you look at redundancy end-to-end. The LRS team has always been very particular about that, to the point that it seems like we’re almost over-redundant in some cases. We have two or three layers of redundancy instead of just a single layer for many of our circuits.
While the LRS design team successfully completed Verification and Validation testing and Design Certification Review with no issues, a fault in the overall integrated system was later identified when another system was performing dry run testing. The system had an unrelated hardware failure and the team saw some data that didn’t seem right. They made the LRS team aware, and we went back and looked closely at how all the subsystems interface with LRS and identified the cause. It turned out that we didn’t pay attention to the details of what the Kennedy Ground Control Subsystem (KGCS) did with those redundant paths inside their system. So basically, we had redundancy in our subsystem, and then when integrated with KGCS, it swapped things around so that the redundancy was essentially undone.
We tested redundancy at the subsystem level, but we didn’t test out to the interfacing subsystems. We tested for fault tolerance (redundancy) within LRS by shutting redundant functions off within our own subsystem and verifying system performance. We didn’t go back into the other subsystem during integrated testing to shut off redundant functions to see how the overall end-to-end system responded. While we performed some integrated tests, redundancy testing at this level was not one of them.
As stated earlier, the LRS team was extremely diligent in meeting all its design requirements, especially those for redundancy, and found it difficult to believe that this was missed. It’s been ingrained in our heads to verify redundancy throughout the various design milestones. We want to get this information out to others to help them avoid a similar situation.
I think the major takeaway here is to be aware of this during design and development, to perform as much testing as you can, and to try to account for all those scenarios during testing, not only within your subsystem but across the overall integrated system as well. The cost of this level of testing will more than pay for itself compared with the cost of redesign and retesting when a problem is identified later during operations or, more concerning, during launch countdown processing or in flight. When you’re looking at your design, it’s important not to limit your scope to where other systems interface with your design, but to go and look at how those other systems complement your design, because they’re pieces of the puzzle.