Early in my career, I was working for an organization that was doing an initial rollout of Datadog. There, we used `rsync` as what I would call a "poor man's" configuration management system, syncing directories to all our Linux servers based on a variety of variables. Many of those servers hosted our SAP infrastructure, a major component of our business operations. What I thought would be a basic rollout of our base `datadog.yaml` configuration ended up changing the permissions of the `/etc` directory on every one of our production SAP systems, bringing our business operations to a screeching halt once the weekend change window had ended.
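To give a rough idea of the mechanics, the push looked something like an `rsync -a` of a staged tree onto each host. This is a hypothetical reconstruction, not our actual command, paths, or hostnames, but it shows how a permission-preserving sync can bite: the archive flag carries permissions and ownership over from the staging copy, so if the staged `etc/` directory has the wrong mode or owner, that gets stamped onto the real `/etc` of every target.

```bash
#!/usr/bin/env bash
# Hypothetical reconstruction of the kind of push we ran; paths, hostnames,
# and the host list are illustrative, not the real ones.
STAGING=/srv/config-staging/sap-linux          # hypothetical staging tree

while read -r host; do
  # -a implies --perms/--owner/--group, so rsync also applies the staged
  # directory's mode and ownership to the destination /etc itself. If the
  # staged etc/ was created with the wrong mode, every target inherits it.
  rsync -a "${STAGING}/etc/" "root@${host}:/etc/"
done < /srv/config-staging/sap-hosts.txt       # hypothetical host list
```

The intent was only to drop a new `datadog.yaml` into place; the directory attributes came along for the ride.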
The root cause was identified as a simple permissions change, but production operations were still down for around two hours while we identified and remediated the issue. I braced for repercussions and backlash at the P1 review meeting the following week, but to my surprise, that was not the reaction I got from the technology leaders or even my fellow engineers.
A saying that floated around my team was that you weren't a "real engineer" until you had caused a production outage in your career. While I always thought this was a joke, the surprisingly blameless and almost pleasant reaction I got from the rest of the engineering organization actually reinforced the idea. This was my first exposure to a culture of blameless engineering and "shit happens." It is a moment I have often thought about as my career has continued, and one that has shaped the way I try to operate when building and leading engineering teams.
The only repercussion I faced was becoming the proud new owner of "The Brick." Taking the "real engineer" joke one step further, the team had a trophy that rotated to whoever caused the most recent production outage: a foam "bad call" brick, the kind often seen at sporting events, with a small bottle of Excedrin taped to the side.
The bad call brick represented someone making a bad call that caused a production outage. It was also a play on the common IT phrase of "bricking" a system, and the aspirin was for the headache caused by the outage. While I don't think the "real engineer" saying should be taken literally, after roughly a decade in the industry I now recognize and appreciate the underlying meaning behind it.
Before that incident, I assumed being a good engineer meant always getting it right. Zero bugs. Clean code. Perfect deployments. But real-world engineering does not work like that. If you are doing meaningful work, shipping changes, refactoring systems, and improving automation, you will eventually break something.
The key difference is how you handle failure.
Owning the mistake, sharing what went wrong, and helping to put safeguards in place afterward matter more than never failing at all. That outage taught me more in a single evening than months of smooth deployments had ever taught me.
That incident reinforced several important principles. Most importantly, the experience deepened my respect for engineering cultures that focus on growth and improvement rather than blame.
What stood out most was how the team handled the situation. No one asked who was at fault. The postmortem focused on what happened, why it happened, and how to prevent it from happening again.
Even though I was on a traditional infrastructure team at a century-old manufacturing company, I had landed in a true DevOps culture that practiced blameless engineering. When engineers feel safe admitting mistakes without fear of retribution, the entire organization benefits. Teams are more transparent, more collaborative, and more resilient. In contrast, blame-heavy environments drive problems underground, delaying fixes and increasing the long-term cost and complexity of addressing them.
The goal should be continuous improvement. Complex systems often fail in complex ways, and sometimes in simple ones, as mine did. Rarely is a single person solely responsible for an outcome: in my case, the sync process lacked adequate validation checks that would have caught the change before it hit production, and fixing that gap made the entire environment more stable going forward.
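One example of the kind of guardrail that closes this gap, sketched here under the assumption that the push stays rsync-based (variable names, host, and paths are hypothetical): run the same sync as a dry run first, and refuse to proceed if it would touch anything beyond the one file you intended to ship.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight gate; same illustrative staging tree as the sketch above.
STAGING=/srv/config-staging/sap-linux
host=sap-prod-01.example.com                   # hypothetical target host
expected="datadog-agent/datadog.yaml"          # the only change we intend to make

# --dry-run makes no changes; --itemize-changes prints one line per file or
# directory rsync would modify, including permission/ownership changes on /etc itself.
plan=$(rsync --dry-run --itemize-changes -a "${STAGING}/etc/" "root@${host}:/etc/")

# If the plan contains anything other than the expected file, stop and make a human look.
if echo "${plan}" | grep -vF "${expected}" | grep -q .; then
  echo "Dry run wants to change more than ${expected} on ${host}; aborting." >&2
  echo "${plan}" >&2
  exit 1
fi

rsync -a "${STAGING}/etc/" "root@${host}:/etc/"
```

A check like this costs seconds per host and turns a surprise permissions change into a failed pipeline step instead of a P1.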
Years later, I still think back to that outage. Not because of the failure, but because of how it shaped my thinking. The experience taught me that perfection is not the goal in engineering. Progress is.
Mistakes are inevitable. What matters is how you respond, what you learn, and whether you help make the system and the team stronger because of it.
If your organization encourages experimentation, avoids blame, and sees failure as part of the learning cycle, you are on the right track. And if you ever find a red foam brick on your desk, consider it a rite of passage. It means you are doing work that matters. If this resonates with you, we encourage you to explore our open positions!