Episode 1

Knowing the Unknown, With Andrey Zhuk

When you have billions of data dependencies, and one goes amiss, how do you figure out where the issue is? Listen as Mark and Carolyn are joined by Andrey Zhuk of CTG Federal to discover how artificial intelligence is opening new doors for data security and recovery.

Episode Table of Contents

[00:46] The Road to AIOps
[09:07] Overarching Umbrella
[17:25] Knowing the Unknown From Poorly Written Codes
[27:55] Knowing the Unknown in the World of Tech

Episode Links and Resources

Andrey's eBook: Software Intelligence for the Federal Government: The Road to AIOps
Connect With Andrey Zhuk on LinkedIn

The Road to AIOps

Carolyn: I'm excited to introduce today's guest, Andrey Zhuk, principal solutions architect at CTG and author of several eBooks. Today, we're going to talk about one of his latest eBooks, Software Intelligence for the Federal Government: The Road to AIOps. It focuses on cloud development migration in the federal government.

Carolyn: Let's start with the easy stuff. Tell us your story. What do you do? Where are you talking to us from? How did you get to where you are now?

Andrey: Sure. My background is actually electrical engineering. I used to design satellite systems and the networks that go along with them for the Department of Defense. From that, I went to the side of sales.

I was actually selling a lot of Palo Alto products and some optimization solutions. From there, I transitioned to the world of cloud. That's kind of where I got into the whole application performance management space. I was at a startup called Skyhigh Networks. They were one of the early cloud access security brokers. We were dealing with cloud apps and securing cloud apps for government customers.

That's where I had the experience of dealing with federal government workers trying to modernize their applications. Then Skyhigh Networks got bought by McAfee. I was a solutions architect for cloud technologies with McAfee for a year or so. Then I moved on to CTG Federal to take on a principal architect position to help build their cybersecurity business with a little bit of APM sprinkled in. We're a Dynatrace partner.

Knowing the Unknown About Satellite Stuff

Carolyn: Yes, we wish we could say that right upfront that we are partners. But before we get into it, I got to go back to the satellite stuff. How does that compare to what you're doing now, how long did you do that?

Andrey: Wow, probably six years, but ultimately everybody needs Facebook and satellite platforms get outdated like once every 10 years. So it's all about software.

Carolyn: Oh, so this is a lot faster for you then like, quicker pace.

Andrey: Yes.

Mark: The world that you're playing in now, Andrey, is it more high-level conceptual as opposed to the engineering work that you might have done, working on satellites?

Andrey: So it's interesting. The world is more about software now than it ever was before. Just to give you an example, take the U-2 platform. It's the one that got shot down by the Soviets, and it is still in operation. So those airframes still exist, but the internals get modernized with new circuits and new equipment, and the software on those circuits gets changed quite frequently.

Mark: We truly are living in a software world, aren't we?

Andrey: Yes. Which makes application performance management and software intelligence that much more important.

Carolyn: That gets us right to one of the first things we want to talk to you about, which is the title of this eBook that you wrote, Software Intelligence for the Federal Government: The Road to AIOps. I'd like you to unpack it a little bit for me. So first, what do you mean by software intelligence for the federal government? And then what do you mean by the road to AIOps?

A World Powered by Applications

Andrey: So we live in a world powered by applications. The applications are everywhere. They control how we work and collaborate, how we book travel, how we get our entertainment, how we get medical care, and taxes. But the thing is these applications have grown in scale. It used to be not so long ago, even as far back as 10 years ago, we would have just a couple of servers with some backend storage. Interact somewhere, maybe at clinics or something like that, our rec space, but now that's changing.

All of a sudden the applications are running on top of complex computing infrastructures that are dynamic, hybrid, and multi-cloud. And now these environments contain hundreds, if not thousands, of technologies, millions of lines of code, and literally billions of dependencies. Traditionally, we managed all these applications with a set of disparate point tools.

This is what we call tooling. Each one required human involvement. This is also what we used to call application performance management: different tools reporting back to us. But as applications grew beyond traditional data centers into the cloud, and subsequently into public and hybrid clouds, the old ways of doing things could no longer scale. The volume of data aggregated by all these monitoring and observability point tools quickly became so immense that no human could make sense of it.

There's also the issue of the unknown unknowns: you can't plan for something that you have no clue can even happen. So we needed a new approach. This is where we need to make all these sensors intelligent so they can make sense of all this data. And so this is where software intelligence comes into play.

Knowing the Unknown With Intelligence

Andrey: With software intelligence, we need to imbue these monitoring solutions with intelligence to make sense of all this stuff.

Carolyn: So have you seen that shift? You started the story where it was a lot more simple, was that the case when you were working on the satellites?

Andrey: Yes.

Carolyn: You've seen this shift through your career.

Andrey: Yes. So I'll give you an example. The old school application performance management circa 2010, it was primarily network-based. For example, I worked with a company called Riverbed. Back then, Riverbed was one of the hottest companies in Silicon Valley. They bought the startup from Cambridge, Massachusetts, which was called Cascade. All it did is it took packets from the network and made predictions about application performance based on flows and packet capture. That's it.

There are also a lot of competing solutions from the likes of NetScout, JDSU, and a couple of others. So it all just flows in the network. But now as things are changing going to the cloud, those solutions are no longer relevant. So you need something new.

Carolyn: So this isn't just a fundamental shift, I feel like this is a leap forward. I wouldn't say that it's equivalent to the invention of fire, but maybe.

Andrey: Maybe.

Carolyn: It's really huge to me.

Andrey: It's interesting. This is where the world of AIOps came to be. So artificial intelligence operations, I think it's a Gartner term. But ultimately it refers to a suite of products or software platforms that act as a force multiplier for correlating data across application performance management tools, IT infrastructure monitoring (this would be like your SolarWinds type stuff), network monitoring, and other diagnostics.

Overarching Umbrella

At this point, AIOps is moving forward to now being this overarching umbrella which now encompasses application performance, infrastructure monitoring, AIOps. We used to think of it as opening tickets, for example, and maybe automating service, ticketing and response. But now we also include digital business analytics, which is especially important to the likes of say, Uber and the digital experience.

Mark: Yes, and user experience.

Andrey: Right. Especially if we think about the likes of Uber, this is where these kinds of platforms really come into play, just behemoths. How do you make sense of it all?

Carolyn: You mentioned something about our lives are run by, I mean, you said the word millions and billions, even at one point.

Andrey: Billions of codes and dependencies. Yes.

Carolyn: Yes, and dependencies on an application. So give me some examples.

Mark: Some use cases.

Carolyn: Tell me a story about AIOps and I guess, how we're using it now, compared to what we were doing before?

Andrey: Sure.

Carolyn: You kind of already talked about that.

Andrey: I feel like we were talking more about the public sector or federal space. One of the best use cases, one that's easy to latch onto even before undergoing this technology transformation and re-architecting all your software, is probably just mean time to recover. Mean time to recover, or MTTR, refers to how long it takes, on average, to bring a system back to an operational state. When you have billions of dependencies and something goes amiss, how do you figure out what the issue is?
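As a rough illustration of the metric Andrey mentions, MTTR can be computed as the average outage duration over a set of incidents. The sketch below is not from the episode; the incident timestamps and the function name `mean_time_to_recover` are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average outage duration over (failed_at, restored_at) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((restored - failed for failed, restored in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: when each outage began and when service was restored.
incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 13, 0)),  # 4-hour outage
    (datetime(2021, 3, 8, 22, 0), datetime(2021, 3, 9, 0, 0)),  # 2-hour outage
]

print(mean_time_to_recover(incidents))  # 3:00:00
```

Shortening the time spent locating the fault is what pulls this average down, which is the improvement described in the use case.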

Where AIOps Come Into Play

Andrey: So this is where an AIOps tool can come into play and help you find that needle in the haystack. You go from weeks of teams trying to figure out what the issue is, isolating components and testing, to literally hours, because you have all this data and it's not just a sea of information. It's very precise nuggets of information telling you when something is amiss. So that's kind of the easiest use case to digest. The other one, which I feel is very much federal-specific, is mission continuity between contracts.

So imagine you have a big contractor, like Northrop, come and run the show at one of the satellite facilities I worked with for the Department of Defense. Then that contract expires and maybe a Raytheon comes in to run the show. In an ideal world, there'd be documentation to bridge from one contractor to the next. But in the real world, documentation is usually lacking and key personnel have moved on. So how do you go about figuring out all the dependencies and process-to-process relationships in a satellite imaging system, for example, that one of the software factories is developing?

Actually, we talked about this pre-show. DoD is probably the biggest proponent of DevOps. They're on the forefront of re-architecting a lot of these legacy applications and making new ones cloud-native. The big software factories, the ones like Platform One, Kessel Run, Thunder Campbell Sky, Sonic, there are a bunch of them. So these guys support these numerous applications. How do you make sense of these applications when the contracts change?

Knowing the Unknown About the AIOps Tool

Andrey: This is where an AIOps tool is of great help. I can keep going. Is there any question?

Carolyn: How does the AIOps tool help that shift from contractor A to contractor B?

Andrey: The tool would help with mapping out dependencies, both horizontally, between components of the same type, and vertically.

Carolyn: Okay. It's keeping track of everything.

Andrey: Yes, between components of different types. And then you will get a real-time map of the entire application stack end-to-end from your customer's web browser. All the way to the application, down to the underlying containers and infrastructure, cloud resources, and so on. All of a sudden, that stack of hay is no longer a stack of hay, but a logical interconnection of resources for consumers.
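To make the idea of a dependency map concrete, here is a minimal sketch of how such a map could be queried once it exists. It is not how any particular AIOps product works; the component names and the `impacted_by` helper are hypothetical, and a real platform would discover the topology automatically rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical topology: each component maps to the components it calls or runs on.
DEPENDS_ON = {
    "browser": ["web-frontend"],
    "web-frontend": ["orders-service", "auth-service"],
    "orders-service": ["orders-db", "container-host-1"],
    "auth-service": ["container-host-1"],
    "orders-db": ["cloud-volume-a"],
    "container-host-1": [],
    "cloud-volume-a": [],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a resource up to its consumers.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    seen, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(impacted_by("cloud-volume-a", DEPENDS_ON)))
# ['browser', 'orders-db', 'orders-service', 'web-frontend']
```

With the map in hand, a fault in one cloud resource can be traced up the stack to the user's browser without relying on whatever documentation the previous contractor left behind.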

Carolyn: Yes. See, I guess this is going to reveal how much of a developer I am not. I thought all that stuff had to be linked. Anyway, but you're telling me it's not, you need some kind of third entity to map it.

Andrey: In an ideal world, you're right. But I've been in situations where there was a hasty hire and the developer did not provide any inline documentation of any kind. No comments within the code. So making sense of that is almost impossible without a third-party tool mapping all those dependencies.

Mark: How do you think agencies are managing through these use cases like, we're talking about this digital transformation today?

Carolyn: Yes. What if they don't have that tool in place?

Andrey: Right now, it's very much just human elbow grease.

Carolyn: No. That's not even possible.

Science Fiction Movies

Andrey: I can't reveal a lot of things that I experienced firsthand, but you'd be surprised. So it's what you see in science fiction movies. These security operation centers like the Starship Enterprise, those do exist, but they are few and far between. The reality of things, especially with the smaller agencies, is three or four guys in a cube farm, looking at alerts and dashboards a couple of times a week. So without a tool like this, it's very difficult.

Actually, this brings up the next use case. Well, maybe the next two use cases. So one is helping augment the IT staff shortages. For example, I get hit up on LinkedIn several times a day. The federal government can't compete with the commercial vendors, like the Dynatraces of the world. They have to do more with less. For example, we have a civilian agency that was piloting a Dynatrace tool for intelligent observability. They were using Dynatrace with Ansible to proactively detect and remediate memory leaks in a large enterprise application.

So Dynatrace would receive telemetry from the running processes. Then it will use AI to determine if the telemetry received is indicative of a memory leak. So, this is not something that a human can do. But if you're an AI operations platform, you have a broad statistical data set to make a decision upon. And so you say, "Okay, this is indicative of a memory leak at a very early stage." Then the tool integrates with Ansible to restart the call process without any human intervention required.
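The detect-and-remediate loop described here can be sketched in a few lines. This is an illustration under stated assumptions, not the agency's setup: a simple least-squares trend check stands in for the platform's AI, the telemetry values and the 5 MB-per-sample threshold are made up, and the `restart` callback is a hypothetical hook where an automation tool such as Ansible would be invoked.

```python
def memory_slope(samples):
    """Least-squares slope of memory usage, in MB per sample interval."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def find_leaking(containers, threshold_mb_per_sample=5.0):
    """Names of containers whose memory grows steadily faster than the threshold."""
    return sorted(
        name
        for name, samples in containers.items()
        if memory_slope(samples) > threshold_mb_per_sample
    )


def remediate(containers, restart):
    """Restart each suspected leaker through the supplied (hypothetical) hook."""
    leaking = find_leaking(containers)
    for name in leaking:
        restart(name)
    return leaking


# Hypothetical telemetry: memory in MB sampled at fixed intervals per container.
telemetry = {
    "orders-1": [510, 505, 512, 508, 511, 509],  # flat, just noise
    "orders-2": [500, 520, 541, 559, 580, 601],  # steady growth, likely leak
    "orders-3": [498, 502, 499, 503, 500, 501],  # flat, just noise
}

restarted = []
remediate(telemetry, restart=restarted.append)
print(restarted)  # ['orders-2']
```

Only the container with sustained growth is restarted; the others are left alone, which is the point of having the tool single out the problematic copy of a microservice.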

Carolyn: What do you mean restart the call process? Like, fix it?

Knowing the Unknown From Poorly Written Codes

Andrey: Yes. For example, a container that has some poorly written code on it is over utilizing the memory of the hypervisor it's running on. Sometimes the easiest way to fix these problems is just killing the process and bringing it back up, simple as that. But if you're a human and you have hundreds of containers running a copy of this microservice to support many threads, you don't know which one is problematic.

So you have an AIOps tool to tell you exactly, "Hey, this container has a memory leak. Let's kill it and bring it back up." And so you're also minimizing human error. That's another huge thing. This actually ties into the next use case, which is how automation works best. All these agencies usually rely on IT service management platforms like ServiceNow, or what used to be BMC Remedy. It's now rebranded as BMC Helix, which you can run on-prem.

There's also SolarWinds, which has a couple of tools like that. So an AIOps tool would be able to detect an issue and create a ticket, parse its content, and apply AI to take an appropriate action. And maybe send an email for Joe, the engineer, to go do action X. So you may have an application that worked prior to a new code release.

Then you have a code upgrade and users are experiencing issues. Well, you can configure the solution, the AIOps tool or the software intelligence platform, to automatically roll back code to the last working version. The software intelligence platform will take care of all the dependencies and do it correctly every time.
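The rollback idea can be illustrated with a small sketch that picks the last known-good release and lists every dependency that has to move with it. Everything here is hypothetical: the `Release` record, the version numbers, and the health flags are invented for illustration, and a real platform would derive them from deployment and monitoring data.

```python
from dataclasses import dataclass


@dataclass
class Release:
    version: str
    dependencies: dict  # dependency name -> pinned version
    healthy: bool       # hypothetical flag set by monitoring


def rollback_target(history):
    """Most recent healthy release before the current (last) one."""
    for release in reversed(history[:-1]):
        if release.healthy:
            return release
    raise LookupError("no healthy release to roll back to")


def rollback_plan(history):
    """Target version plus every dependency change needed to reach it."""
    current, target = history[-1], rollback_target(history)
    changes = {}
    for name in sorted(set(current.dependencies) | set(target.dependencies)):
        before = current.dependencies.get(name)
        after = target.dependencies.get(name)  # None means "remove it"
        if before != after:
            changes[name] = (before, after)
    return target.version, changes


history = [
    Release("1.4.0", {"auth-lib": "2.1", "db-driver": "5.0"}, healthy=True),
    Release("1.5.0", {"auth-lib": "2.1", "db-driver": "5.1"}, healthy=True),
    Release("1.6.0", {"auth-lib": "3.0", "db-driver": "5.2", "cache": "1.0"}, healthy=False),
]

version, changes = rollback_plan(history)
print(version)  # 1.5.0
print(changes)
```

Computing the full set of dependency changes, including ones that must be removed, is what guards against the forgotten-dependency problem when the rollback is done by hand.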

The Room for Error

Imagine us humans doing that manually. The room for error is immense. Even if you've rolled back the code to the previous state, you'll probably forget one in 200 dependencies, and then you're in a world of hurt. But with a software intelligence platform, that can all be automated. In the federal government, that's kind of a key takeaway when you talk to federal customers.

When I was in cyber and we were trying to sell them all these point products, they were like, "Look, I know it's great, but I don't have time for this. I have three guys doing all these things. We need something that provides automation." So, like you asked before we started chatting, what is the big takeaway from the federal government? I think it's automation, that journey to automate as much as possible.

Carolyn: Well, based on what you just said, this is not a founded fear. But there is a fear when you talk about AI and when you talk about automation. That means loss of jobs, because the robots are going to take over.

Andrey: Oh, yes.

Carolyn: But what I heard you say is, and I've heard this from other people too, like Willie Hicks, our federal CTO.

Andrey: It empowers workers and enables them to perform at their best. Yes. I mean, the stats are out there. There's simply not enough people to do this. Data keeps growing at exponential rates. Don't quote me on this, but in the last year, we've generated more data than in all of humanity, since World War II or something like that.

Knowing the Unknown Real Issue

Mark: I wonder if you're seeing that fear of losing jobs is really not the issue. Because within a cloud-first mandated world, particularly in the federal space, it really allows organizations to take their smart people and re-allocate them. Have them do things that they really intended to do in the first place, as opposed to triage all the time.

Andrey: So you have the issue, with all these disciplines growing, where there's the broad knowledge and the deep knowledge. Unfortunately, a lot of the smart people are now spread thin, having to be experts in multiple areas. But there's only so much CPU even a smart person can allocate to all this. So I feel like a software intelligence platform can help go deep and take care of all those nuances. Even if you look at where the field of networking used to be, with networks we would configure everything box by box.

Even to this day, most of the legacy networks and data centers in the federal government, and even in the commercial space, are configured box by box. This is why we had CCIEs making so much money back in the day. That's the Cisco Certified Internetwork Experts, for those not familiar. But now there's a move to...