DevOps Deep Dive - The Stranger Things / Des O'Connor One

The IJYI Way

The IJYI Way Podcast is about the inner workings of a software delivery startup, covering everything from business, finance, software development, agile delivery and DevOps.

January 30, 2020 • IJYI

In our second DevOps Deep Dive podcast we bring the team back to discuss all things DevOps, tooling, automation and the future of DevOps... with a little Stranger Things and Des O'Connor thrown in!

Joining us on the podcast are:
Chris Pont, CEO
John Nicholson, CTO
Alan Jackson, COO and Voice of Reason
Tim Naish, Consultant.

Andrew:   0:10
Welcome back to the IJYI Way podcast. This is the second one of the new year. If you missed the last one, we're going to say Happy New Year again. Happy New Year. Happy New Year, and welcome to January. We hope you're having a dry January; if you're not, we hope you're having a good January. And you know we're doing our January challenge: that's 2020 minutes of exercise, which will be done by the whole of the IJYI team. So far, it's only Chris. Do find him on Strava and see if you can race him on his route around Ipswich. Contact us on Twitter for more on that, that's @IJYIltd. So joining us this week, we have again Tim Naish, who is a DevOps consultant; co-founder and CEO Chris Pont; also co-founder and CTO John Nicholson; and last, but definitely not least, Alan Jackson, the chief operating officer, soon to be global head of operations.

Alan:   1:08
I think Global Operations Director sounds better because it's back to God! *Laughing*

Andrew:   1:14
I'm Andrew and I'm a freelance writer. I am an old friend of IJYI's, and I'm passionate about all things digital. I also worked out from the last series of Stranger Things how to get into buildings through the air conditioning vents, and that's why I'm still here. Hey, fellas... but they're not really laughing, because someone got in through the vents and stuff has been going missing.

Andrew:   1:47
We're back this week talking more about DevOps, and if you haven't heard the last podcast, then go and download it now and listen to it. We talked a little about what DevOps is and the way it's changing the digital operations which underpin pretty much every business. I want to start this question with Tim because you're a consultant, which means I should consult you, because I've got this very painful elbow and I was... oh no, the RSI jokes. If only it wasn't real. Okay, so, um, the question I'm gonna ask him is: process automation plays a big part in DevOps, but a lot of people get worried when they think about automation because they think it's gonna be replacing humans with robots. Presumably replacing humans with robots isn't a part of DevOps?

Tim:   2:40
It kind of is, in a way. With DevOps, automation is a repeating theme throughout several different parts of it. So in order to deliver things as quickly as possible, you need to be able to test things, to check that you haven't broken everything, to check that things are working as they should. And the most efficient way of doing that is via automation. Then, moving further down the pipeline, when you come to deploy changes into production, the most efficient way of doing that is via automation. And then we have companies like Netflix that have introduced this concept of introducing organised chaos to the production platforms in order to test how it responds to failure, because failure is, let's face it, inevitable. They are automating failure of their systems to check that they are resilient and still carry on providing the service that they're designed to do.

Andrew:   3:45
I haven't heard that before - automating failure.

Chris:   3:48
Yes, Chaos Monkey.

Andrew:   3:50
Chaos. Really? Come on, you've got to tell us more about that. It's amazing.

Tim:   3:54
They have a whole army of chaos. Chaos Monkey is one of the services that goes through their environments and randomly, I think, kills instances. I think they are AWS-based, so it kills instances that are running and generally breaks things to check the service still carries on.
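
[Editor's note: the idea Tim describes can be sketched in a few lines of Python. This is a toy simulation only, not Netflix's tooling; the fleet, the instance names and both functions are hypothetical and exist purely for illustration. It terminates one instance at random and then checks that the service would still be up.]

```python
import random

def kill_random_instance(instances, rng):
    """Pick one running instance at random and mark it as terminated."""
    running = [name for name, up in instances.items() if up]
    if not running:
        return None
    victim = rng.choice(running)
    instances[victim] = False
    return victim

def service_is_up(instances, minimum_healthy=1):
    """The toy service counts as available while enough instances are running."""
    return sum(instances.values()) >= minimum_healthy

# Hypothetical fleet of three instances; the names are illustrative only.
fleet = {"web-1": True, "web-2": True, "web-3": True}
rng = random.Random(42)
victim = kill_random_instance(fleet, rng)
print(f"terminated {victim}; service up: {service_is_up(fleet)}")
```

[A real chaos experiment would terminate actual infrastructure and watch real health checks; the point of the sketch is only that the failure is triggered deliberately and the resilience check is automated.]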

Andrew:   4:17
Wow. I mean, it does still carry on, because as you know I've seen the new season of Stranger Things at last, and I have learned a great deal about burglary from it. So John, a question I want to ask you, because there is an element then that comes from this. Presumably we're talking about replacing dull, repetitive tasks with automation. We're not talking about computers that, you know, replace devs.

John:   4:42
Yeah, absolutely. I mean, the key point is, there are numerous tasks in traditional IT that are tedious, repetitive and prone to failure, because of how tedious they are, people lose concentration. One example from a previous project that we did automate: we had 110-page release notes that had to be run through manually by someone to deploy this application. That deployment would take probably two days at the first pass to get there and then, realistically, about a week to understand what steps they'd missed in those 110 pages at some point, because it was very hard to understand what had gone wrong at different points. Automating that, and that involved automating the environment and the systems it was running on as well as automating the application, took a deployment down to 15 minutes. When we're talking about these fast deployment cycles that the automation enables, it doesn't just have to be replacing tests, though it can replace tests, certainly on a regression basis. So, checking that nothing's changed. It's really valuable.

Andrew:   5:54
And by regression, we mean looking at the last version of the code and making sure the next version is compatible.

Tim:   5:59
Absolutely. Making sure the functionality that we don't expect to have changed hasn't.

Chris:   6:05
I mean, looking at that regression impact, we work in an agile fashion here. So we run sprints, generally two-week sprints, so we're developing functionality over a two-week period and we need to test it and make sure it works. Obviously, in the second sprint we're testing all of the new functionality in sprint two along with regression testing all of the functionality we built in sprint one, and that's fine. Sprint three, we're testing sprint three and two and one. You imagine you get to sprint number 29. You've got 28 sprints' worth of regression to test during that period. You get through that much functionality, you're not gonna have the manpower to fully regression test everything. So if you can automate as much as possible, you're then allowing your testers to actually go and test the stuff that they need to, that needs a bit more manual input.
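
[Editor's note: the arithmetic behind Chris's point can be sketched as follows. The function names are hypothetical, and the sketch assumes, for illustration, that every sprint fully re-tests all earlier sprints.]

```python
def sprints_to_regression_test(current_sprint):
    """Number of earlier sprints whose functionality must be re-tested."""
    return current_sprint - 1

def cumulative_regression_load(current_sprint):
    """Total sprint-sized regression passes carried out from sprint 1 up to
    and including current_sprint, if each sprint re-tests everything before it."""
    return sum(sprints_to_regression_test(s) for s in range(1, current_sprint + 1))

print(sprints_to_regression_test(29))   # 28
print(cumulative_regression_load(29))   # 406
```

[So the manual burden per sprint grows linearly and the total effort grows roughly with the square of the number of sprints, which is why automating the regression suite pays back more with every sprint.]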

Tim:   6:55
Yeah, and whilst they say it is replacing work that would be traditionally done by humans, it's replacing that work and freeing them up to do things that humans are very good at. Like "what happens if I do this, this, this?", which is more difficult to automate, and taking away that tedious work that takes a long time and is error prone, that computers are very good at.

Chris:   7:25
Just coming back to John's point on automated deployment, I think one of the key takeaways there as well is that if you could automate two weeks' worth of deployment activities into 15 minutes, it means you don't have emergency releases anymore, because those releases go through the same amount of rigor that any other release would go through. It's just clicking a button and allowing everything to take place with those activities that roll the software out to whatever environment you're targeting. So it really does put a lot more safety and rigor around any of your releases.

Andrew:   8:01
Presumably, if you're combining that, then, with also this canary model where you release to a small number of users first to check that, actually, it's working. And for some reason I'm suddenly thinking, presumably TSB, who suffered a major outage last year, they didn't apply the whole canary automation process? Can we talk about that?

Alan:   8:24
I wouldn't dare comment on someone else's project!

John:   8:28
I know enough of the details. Unfortunately, having been researching major software failures over the last year, um, essentially the software itself was tested in an ongoing fashion. The deployments, to a level, were automated and happening in an incremental manner. Unfortunately, there's a part of automation that is very, very difficult to deal with, and that's actually around data. Moving data between existing systems is a complicated task to get right, particularly if you have to synchronize from one system to another. Data migrations is one of Alan's great fears. Um, when it's from a cold system and they want to do it in a big bang, it's very difficult to get right. There's a lot of risk involved. What TSB tried to do is do that over one weekend with, I believe it was, 250,000 customers at least, and they all got locked out of their bank accounts. They all got locked out of card transactions. All facilities were lost to them. For some of those customers it went on for a period of weeks, not just that weekend. As much as that's not spoken about as much, it was a very significant failure.

Andrew:   9:42
That's a data migration issue. Presumably then, if you've got your DevOps team balanced and right and you've got the right automation in place, you would have a different approach to data migration already.

Alan:   9:58
It's another ops term we didn't mention at the beginning of the last one, I don't think, which is DataOps, which one of our consultants, Brendon, is very fond of and recently spoke about at a Microsoft conference. Um, so, yeah, it takes the same approach, you know, to pushing data. But DataOps works more on your day-to-day data requirements. You know, sort of incrementing existing data systems to have more reliable data in them for MI/BI, things like that. I think the data migration... Any time you decide to do anything in a one-off big bang, you collect all that risk up, you bundle it up, and then you put it all on one moment when you need to push a button or maybe five buttons, and I just don't conceive of any other instance in your life where you would do that as an approach for anything risky. I have to say, though, RBS and NatWest did a merger, late noughties, early two thousands, where they absolutely nailed it, doing it this kind of way. There was literally a button flick and they did make it work, so it can work. It's not to say that it can't, and there'll be lots of people saying "that's rubbish, it does work", but it's just harder than it needs to be. You know, it's just not the safest way to do things.

John:   11:12
It's down to that risk, absolutely down to the risk. You can make it work and it will work. But there's a high chance, in comparison to another way, that it may not.

Chris:   11:23
I mean, looking at the high-risk examples, one of the ones that always comes up is Knight Capital, who manually deployed software to 8 of their trading servers. One of those servers was going very wrong, and after about 45 minutes of stumbling around, scratching their heads and trying to work out what had gone wrong, they eventually pulled the plug on their entire platform, by which point they were about $1.3 billion short, about $1.3 billion long across various different stocks. And that resulted in a net loss of about half a billion dollars in 45 minutes, because of release processes going wrong.

Tim:   12:09
One thing that's implied by doing releases a lot more often is, of course, that in a traditional setting you would do your releases out of hours, maybe at weekends or in the middle of the night, whereas in the DevOps process, because you're releasing often and releasing small changes, you would, quite generally, do this during the day, which has a number of advantages and reduces the risk, because the people who know about what's going on are all there in the office already, rather than having to say "Okay, you're gonna have to work overtime. You have to work weekends." People look tired. It's three o'clock in the morning and then things aren't going that well. Whereas if you do it while people are there in the day, then you've got your experts there. People are fresh, ready to go. You've got everything you need, so that the risk is reduced.

Alan:   12:59
And, of course, if you're a global business, it's not three o'clock everywhere. Somewhere someone's awake. You know, working with a company based out of New York, the Australians were the ones that always suffered, because it was always done on New York time, and then the guys in Australia had to sort of suck it up a little bit. But, you know, by doing the canary deployments, by releasing to small groups of users one at a time, by doing little and often at various times, you sort of just reduce all of that.
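
[Editor's note: one common way to pick the "small group of users" for a canary release is to hash each user ID into a stable bucket. The sketch below is a minimal illustration under that assumption, not a description of IJYI's or anyone else's actual rollout tooling; the function name and the user IDs are hypothetical.]

```python
import hashlib

def in_canary(user_id, percent):
    """Return True if this user falls inside the canary group.

    The same user always lands in the same bucket, so raising `percent`
    only ever adds users to the canary group; it never swaps them out.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket from 0 to 99
    return bucket < percent

users = [f"user-{n}" for n in range(1000)]
canary_users = [u for u in users if in_canary(u, 5)]
print(f"{len(canary_users)} of {len(users)} users get the new release first")
```

[Widening the rollout is then a matter of increasing the percentage in small steps while watching error rates, rather than switching everyone over in one big bang.]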

Tim:   13:27
And, of course, with your global company, there is no downtime, because somewhere in the world someone is always trying to use your system. So being able to deploy at any point without any noticeable interruption to your users is a real advantage.

Andrew:   13:42
Let's say, I'm a typical SME with a fairly typical sort of reliance on digital tech and fairly average data needs, whatever that might be. Let's just say there's a generic company here and we've heard about DevOps. What's the next step? How do we start changing the company process?

Alan:   14:05
Well, come speak to us for a start! There is no silver bullet, really. It's lots of small changes across the organisation. We've taken the approach that we'll audit a business. We will look at how the various different work units interact with each other and do an assessment to work out where those small, incremental changes can be made, where those benefits could be realised, and every organisation's different. So there's always constraints, be they political, be they budgetary or else. So, you know, it's about considering that, considering what benefits could be made and then addressing the areas where the most benefit can be made.

Andrew:   14:51
Tim, you're a consultant. So where do you find most, I suppose, resistance from stakeholders, or what are the most difficult ways to try and explain the process to stakeholders who aren't familiar with DevOps?

Tim:   15:06
Yeah, I think that touches on a really important point. You have to have buy-in from your stakeholders at the top that this is the way to go. Often with the changes that you're looking to make, the payback on them can seem like it's not immediate and you're asking for a lot of investment. For example, the automation we talked about, it does take time to automate stuff. But then you reap the benefits again and again and again and again. So I think it's important to get that stakeholder buy-in, but also, once you start to pull your team together, they need to understand why. Why the change? Because, let's be honest, people don't like change. They like the particular way of working that they're comfortable with, so trying to force that on a team of people can meet with resistance, so they need to understand what you're trying to do as an organization.

Andrew:   16:03
I heard a great phrase about that, that summed that up, which was: people love the idea of change, the bit they don't like is doing things differently. Yes. So there is a cultural, very human element here in getting this working in your business. It's not like the old, I suppose, people's fear is: you call in the management consultants and they come in, they sack half the workforce and they put up more cubicles. They do something that, you know, damages that working environment, and really what you need to do here is try and improve the employees' experience, because ultimately that is going to transform the business performance.

Tim:   16:39
Often you're trying to effectively take an operations team and a development team who may have been at war, effectively, for quite some time, with lots of kind of finger pointing and, when things go wrong, "whose fault was it?" And you're trying to combine those two groups of people who, I think as we've touched on before, one of them is trying to stop change, and the other one is trying to create change, so merging those teams together can be challenging.

Alan:   17:08
That's where the business buy-in comes into it. Because if the business motivates both groups of people in the same way - increased sales or increased throughput on the platform or something like this - and gives both sets of people the same target, the same business driver, then that should become easier, because their management then has to break those walls down. And, you know, from an IT point of view, I think most devs get it now. You know, we're at a point where it's actually not hard to convince the technical folks to come on board, because the new ways of working are actually really interesting. You know, the tooling's great, it's pretty mature now...

Andrew:   17:46
Sorry to interrupt, but by tooling, just for those people at home suddenly imagining that we've got a room full of machines... I suppose we do. Computers.

Alan:   17:57
Yeah, but the sort of software products and the tools that the developers need to skill themselves up in now are pretty mature. They're there, and they're really good. The learning curve's not too steep in most cases. And, you know, they've all got cool names. So, you know, I think that devs are curious problem solvers, and so are people from ops. People in IT are curious problem solvers by their nature. So you give them a new problem to solve. Can you deliver? Can you do this? Can you do that? Here's some great tooling, here's some interesting stuff, here's an opportunity to learn new skills. You're not gonna have trouble motivating your IT team to do that. And if you do, quite frankly, you've got the wrong IT team, you know, because they're not curious. They're not problem solvers, in which case they're not very much good in any instance.

John:   18:47
One of the other key motivators, often for the ops team specifically, is that you actually reduce the heroics required to keep systems online. So if you look at a traditional ops department, often you have key people in that department who can't go on holiday without taking their phone with them. They're working all hours every time there's a new release. Night shifts are a point of regularity rather than the exception. As you adopt a DevOps process, those things should go away. They're meant to be measured, looked at and examined so they can go away, because they're a form of waste. What I mean by that is anything that impedes value being delivered. If your key resource is exhausted, you're not going to get the best out.

Andrew:   19:35
This is a fascinating topic, and the clock is running down. So I have one last question for everyone, which is: we heard briefly that DataOps is a thing, and I've heard a lot recently about DesOps, who I assumed was a sort of lounge singer from the 1970s, but no, I'm showing my age again, it's Des O'Connor. It was a Des O'Connor joke! Anyone remember him? Ask your parents about Des... Oh, I can't believe this is happening to me again. So what is next? It's 2020. Okay, we're gonna be doing a show in January 2021. I hope so. If you haven't changed the locks by then, or the air con, I'll still be getting in. So in the show in 2021, are we still gonna be talking about DesOps, or is there a new movement on the horizon?

John:   20:27
DesOps is an interesting one. It's the design agencies and design departments trying to elevate their position as a stakeholder within the process. It's essentially still part of that DevOps process in the end; if you actually look at all the articles talking about it, they show two halves of the same thing. Um, the big conversation point you will see a lot about is anything that integrates AI into this DevOps flow. At the moment, it's falling under the banner of DataOps because it requires large sets of data to be of value and to work. But the management of models, and trained models specifically for machine learning, and how those get deployed into environments is becoming a bigger topic around this. There's a lot of people doing good work. There's a lot of interesting things going on. This is very much an area and field that's in constant flux. It's constantly being contributed to from the open source world, um, but it certainly hasn't settled down into a mature position at this point.

Alan:   21:38
So, to build on the Stranger Things references you keep making, the other creepy thing that's starting to emerge is the use of AI in the pipeline to automatically fix software problems. So this is where people are trying to write routines and create models that can fix bugs, which is really kind of creepy, because that means your software is going to start to evolve. And then I get all sorts of Terminator, Skynet-type flashbacks. Maybe we cross over there generationally!

John:   22:05
Just to interject. It's not trying to. They've succeeded. So if we look at what Facebook have going right now, they, based on their automated test suites, which I hasten to add are human-generated, just to be clear, identify errors, look back at the back catalogue of fixes that have been produced, suggest fixes to the developer who introduced the error, and send them the changed code so they can examine it, review and then commit it.

Alan:   22:33
It's amazing, it's really incredible stuff. You know, it's obviously not mainstream for your SME example. It's a long way from those guys, but the way tooling evolves, and as I say, the tools are maturing, the tools are getting better. It's just a matter of time before you could buy that for a 15 buck license. It's just, it's insane!

Andrew:   22:50
I went to a lecture a while ago about neural networks and how they work. Because my understanding...

Alan:   23:02
That's an interesting one, about how they work. Does anybody know how they work?

Andrew:   23:02
Well, they all work the same way, which is interesting. It's to do with probability and nodes. And, um, I mean, I may have snoozed in parts, but I think what was really interesting about it is the fact that you can use this to start identifying what should happen as opposed to what has happened. And presumably that's the kind of technology we're going to see in these AI sort of auto bug fixes.

Chris:   23:25
Yeah, I think some of the cloud platforms are already doing a bit of that, where they're identifying increases in error rates, increases in page hits, suggesting fixes, trying to spot issues with how the platform works in production. So, you know, that's really helpful, and it's not fixing the problems itself, it's just suggesting where changes might need to be made, where efficiencies could be made.

Tim:   23:50
And that's really where AI is useful, as the extension of the automation. Because when you're deploying every kind of 12 seconds, there's a lot of data flowing through. It would be impossible for humans to detect those kinds of patterns. And AI and machine learning can really add value there.

Andrew:   24:08
Okay, good. Well, that's it, I'm afraid, for this episode, part 2 of DevOps for the IJYI Way Podcast. I'd like to thank the team for joining us today... They are humans; some of them are augmented with cybernetic parts. I think it's mostly John, it's fair to say. And on that front, if you want to find out if this is real, visit @IJYIltd. We are actually not AI avatars, we're actually real people!