By Brent Robertson and Jerry Klein
Wouldn’t it be nice to have a project management crystal ball that revealed all problems before they occurred? Then they could be anticipated, mitigated, and dealt with before they affected technical performance, schedule, or cost. As the Observatory Manager for a large space science mission at NASA’s Goddard Space Flight Center, I didn’t have a crystal ball, so I wondered how good our team was at preventing problems before they occurred.
The Solar Dynamics Observatory (SDO) mission will help us better understand the dynamic structure of the sun and what drives solar processes and space weather. Goddard is building the spacecraft in house, managing and integrating the instruments, developing the ground system and mission operations, and will perform observatory environmental testing. We have a compelling mission with well-defined requirements, adequate funding, a seasoned project management team, a resources staff capable of miracles, experienced instrument teams, strong systems and quality assurance engineering, top-notch engineers, and Center management eager to help with problems. It’s what I consider a dream team for anticipating and correcting problems. But would this expertise really matter when we were using a risk management process none of us had used before?
SDO is one of the first Goddard in-house flight projects to use the formal continuous risk management process now required by NASA Headquarters. Our risk management plan required approval from the Goddard Office of System Safety and Mission Assurance to ensure that all the elements in the NASA standards and guidelines were adequately addressed. Since we know that communication is critical to managing risk successfully, we included a risk management coordinator in our plan to solicit potential risks from project personnel and help disseminate risk data throughout the project.
Unlike risk management tools used only by the project manager, our risk management system is integrated into the SDO project culture. Everyone is responsible for identifying and mitigating risks. Each month, we solicit new risks through an interview process and discover others in meetings, vendor status reports, hallway discussions, voicemails, and e-mails. We collaboratively develop mitigation plans for each risk, then discuss the risk and our plan to alleviate it at a monthly risk meeting with project senior staff. We are uncovering more potential risks because we have a group instead of one person looking for them. And because each team member has special areas of expertise, they can point out issues within their subsystems better than an outsider could.
Many NASA accident investigations point to poor communication as an important factor: someone in the project’s rank and file sees a problem but does not successfully report it to the top. Our risk management system, which makes risk everyone’s business, improves communication, giving people permission — in fact requiring them — to report perceived problems up the chain. For example, special meetings with our subsystem teams have not turned up any additional risks because their concerns have already been successfully communicated. Working as a team to manage risk also helps create a common vision across the project, giving people a better idea of the shared goal and helping them see beyond their own part of the project at the subsystem or component level.
So we believed our new risk management system worked well, but I wanted to measure its effectiveness, if possible. A quick search turned up a lot of material about risk management systems but little information on metrics for them. I realized we could measure effectiveness by how many active, retired, or accepted risks we had, how long it took us to mitigate them, and at what cost. Active risks are reviewed regularly at the monthly risk management meetings. Retired risks are those whose likelihood has been reduced to zero. Accepted risks are those we accept with process controls to mitigate single-point failures, where a single component failure could end the mission (for instance, premature deployment of solar arrays, structural failure, or propellant leaks), or those we think are beyond our project control (for example, new launch vehicle certification or contractor/vendor internal infrastructure issues).
A review of SDO risks from the project formulation phase to the implementation phase revealed that by the time of the Critical Design Review roughly 50 percent of all risks had been retired. The number of active risks decreased slightly over the same period, and accepted risks remained a relatively small percentage (about 15 percent). A closer look showed that approximately 80 percent of retired risks had been closed within a year of being entered into the system. A project that retires risks faster than new ones are generated can concentrate on a manageable number of active risks. Retiring risks quickly also assures our team members that management is serious about mitigating their risks, which improves the working environment and encourages everyone to bring problems to light. And as our active risks continue to decrease with time, we are decreasing our risk exposure.
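To make these measures concrete, here is a minimal sketch of how such statistics might be computed from a simple risk register. The Risk record, its field names, and the register_metrics function are hypothetical illustrations, not SDO’s actual tooling.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Risk:
    """One hypothetical risk-register entry (illustrative fields only)."""
    title: str
    opened: date                   # date the risk entered the system
    status: str                    # "active", "retired", or "accepted"
    closed: Optional[date] = None  # set when the risk is retired or accepted

def register_metrics(risks: list[Risk]) -> dict[str, float]:
    """Percentage of risks retired and accepted, and how quickly
    retired risks were closed, as described in the text."""
    total = len(risks)
    retired = [r for r in risks if r.status == "retired"]
    accepted = [r for r in risks if r.status == "accepted"]
    # Retired risks closed within a year of entering the system.
    fast = [r for r in retired
            if r.closed is not None and (r.closed - r.opened).days <= 365]
    return {
        "pct_retired": 100.0 * len(retired) / max(total, 1),
        "pct_accepted": 100.0 * len(accepted) / max(total, 1),
        "pct_retired_within_year": 100.0 * len(fast) / max(len(retired), 1),
    }
```

On a register like SDO’s, the first two figures would come out near 50 and 15 percent, and the last near 80 percent.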
Project contingency spending is another measurable indicator of risk management system performance. Forty-five percent of our contingency fund expenditures were used to deal with risks our system had identified. If we had found that funds were being spent on issues our system missed, we would have known it needed improvement. When we uncovered manufacturing issues with the Ka-Band Transmitter, a new technology our in-house engineers were developing to meet SDO’s high data volume requirements, the project brought on an experienced vendor in a parallel effort to build the engineering test unit and flight unit, mitigating the delay we would have faced fixing the problems ourselves. Because we prioritized risks effectively and identified them early, we spent most of our contingency money on what could have become more serious problems later. We were not blindsided by as many big, expensive problems as we would have been without the system.
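The same bookkeeping extends to contingency spending. Here is a sketch under the same assumptions, where each expenditure optionally records the identified risk that prompted it; the Expenditure record and traceable_fraction are again hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Expenditure:
    """One hypothetical contingency expenditure (illustrative fields only)."""
    amount: float
    risk_id: Optional[str] = None  # set if traceable to an identified risk

def traceable_fraction(spending: list[Expenditure]) -> float:
    """Fraction of contingency dollars spent on risks the system had
    already identified; SDO's figure was 45 percent."""
    total = sum(e.amount for e in spending)
    identified = sum(e.amount for e in spending if e.risk_id is not None)
    return identified / total if total else 0.0
```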
Until an effective crystal ball comes along, some unforeseen problems will always occur, even with the best risk management process. We discovered an error in a flight dynamics model months after the Critical Design Review. Our engineers had assumed that the term “solar north celestial pole” in commercial off-the-shelf software meant solar north when, in fact, it was Earth north. When we corrected the model, the spacecraft blocked the high-gain antenna field of view during certain times of the year. The good news was that we caught the error prior to launch, but design changes and operational workarounds were required to fix the problem. Our team put out the word to verify all other engineering models to ensure this didn’t happen again (lesson learned).
A look back over the past three years revealed that we have reported a total of fifteen issues to our Goddard Program Management Council. Of these, about 70 percent were identified and tracked as risks before they became issues. In other words, our team was finding nearly three-quarters of all cost, schedule, and technical risks before they could cause the project major delays or failures, an impressive result. That statistic, along with outstanding teamwork and communication, suggests that our risk management system is a success.
About the Author
Jerry Klein is the Risk Management Coordinator for the Solar Dynamics Observatory Project at Goddard Space Flight Center.