How to Prime Your Data Lake

A security data lake, a data repository of everything you need to analyze and get analyzed, sounds wonderful. But priming that lake and stocking it with the data you want to get the insights you need is a more difficult task than it seems.

Check out this post for the discussion that is the basis of our conversation on this week’s episode co-hosted by me, David Spark (@dspark), the producer of CISO Series, and Geoff Belknap (@geoffbelknap), CISO, LinkedIn. Joining us is our sponsored guest, Matt Tharp, Head of Field Engineering, Comcast DataBee.

Got feedback? Join the conversation on LinkedIn.

Huge thanks to our sponsor, Comcast Technology Solutions

DataBee™, from Comcast Technology Solutions, is a cloud-native security, risk and compliance data fabric platform that transforms your security data chaos into connected outcomes. 

Built by security professionals for security professionals, DataBee enables users to examine the past, react to the present, and protect the future of the business.

Full Transcript

[David Spark] A security data lake, a data repository of everything you need to analyze and get analyzed, sounds wonderful. But priming that lake and stocking it with the data you want to get the insights you need is a more difficult task than it seems.

[Voiceover]  You’re listening to Defense in Depth.

[David Spark] Welcome to Defense in Depth. My name is David Spark. I am the producer of the CISO Series. And joining me for this very episode, you have… He’s warmed your hearts for years here on the CISO Series and specifically Defense in Depth. It’s Geoff Belknap, the CISO of LinkedIn. Geoff, welcome.

[Geoff Belknap] Thanks for having me. I just might remind you that that’s probably heartburn, not heartwarming.

[David Spark] Oh, really? It’s heartburn.

[Geoff Belknap] You should get that checked out.

[David Spark] All right, so you can complain to Geoff for your heartburn. Tell the doctor that.

[Geoff Belknap] Yeah, check on that.

[David Spark] “I’ve got a bad case of Geoff Belknap.”


[Geoff Belknap] It’s not a healthy situation.

[David Spark] No. Our sponsor for today’s episode, very thrilled to have them on board… We’ve been in long conversations, and finally they’re on board. It’s Comcast DataBee – transform your data chaos into connected outcomes with a security, risk, and compliance data fabric platform. More about exactly that coming up later in the show.

But first, let me talk about today’s subject, which is not just data lakes but I guess ingesting data lakes. And the analogy that I… I put a post up online, Geoff, about when I went to CES many, many years ago, and I saw the very first generations of handheld video players. And now the first iPod had come out, and that was transformative because there were .mp3 players before that, but iTunes was this companion to the iPod that really made the seamless experience of getting the music into the device super easy and usable as well.

Well, I saw the same thing happening with the video players as I saw with the early generation of audio players: they had these amazing video players that had beautiful screens, that had nice-size hard drives, but the process of getting a movie onto it, not so easy. And this, I’m hearing, is often a complaint with data lakes.

Yet, there are ways to get data into it. But into this usable, comfortable format that like the iTunes experience is not so easy anymore. And I should mention that we now have this video like with Netflix and now just called Max, HBO, where you can download videos and very much consume them. So, there is a precedent of making things more usable with data.

Why do we struggle with data lakes?

[Geoff Belknap] I think data lakes like a lot of things, and especially like video platforms, there’s so many powerful things you can do with them. But it’s difficult. If a genie jumps in front of you and says, “I’ll give you three wishes,” your immediate thought is, “I have no idea what to wish for.” Data lakes, similar problem.

Very, very powerful. But in that power comes great complexity. I think we have the perfect guest with us today to sort through that.

[David Spark] I like that genie analogy of like you’re given all these wishes, and I’m like, “Oh, geez. Where do I start?” You wouldn’t even know because it does have this amazing power, and it’s one of those through use you actually discover the power. And to use it, you need to I guess prime the pump.

All right, so I’m very thrilled to bring our guest on today. It’s actually with our sponsor, Comcast DataBee. He is the head of field engineering, Matt Tharp. Matt, thank you so much for joining us.

[Matt Tharp] Thanks so much for having me. Good to be here.

What problem is this solving?


[David Spark] So, Yaron Levi, CISO over at Dolby, said, “Many of these solutions focus on getting and storing the data but provide very little in terms of what does the data really mean. Most organizations don’t have the ability to make good use of the data. Think about what problem are you trying to solve.

Start with the end in mind and work backwards from there.” Rodrigo Carvalho of Itaú Unibanco said, “We will continue to struggle with data lake the same way we do with unarchitected and unmanaged fileservers. You just changed the name of the repository.” So, I know there’s a lot of excitement about it.

The other cohost, Mike Johnson, has spoken very highly of it. This is a struggle that a lot have, don’t they, Geoff?

[Geoff Belknap] They sure do. I think everybody comes to a data lake with this idea of, “Aw, this is going to be a great way to either save money because I’m not going to put data directly into my primary detection engineering or processing system.” Or, “I just know that this log data might be valuable, and I’m just going to hang onto it.” And congratulations, you’ve now become a hoarder.

And now the secret is it’s really easy, we’ve all done it. You can shove all the papers off your desk and into a drawer, close that drawer and forget about it. You’re like, “Oh, good job, me. I’ve really taken care of that problem.” But the challenge is getting value out of that. So, getting value out of that means you have to be a little bit thoughtful about how you get all this data into your data lake so that you can do some really great stuff with it.

[David Spark] So, Matt, the idea of just putting data somewhere, pretty much anyone can do that. But what is that little hook that we’re missing between putting data somewhere and making it usable? What is that little magic sauce we’re missing?

[Matt Tharp] Yeah, so I think there’s a couple things. One, it’s to the point you got to keep the end in mind. What are you going to do with that data? And then making sure that the data when it’s coming into the system already conforms to what it is you want to do with it.  Things like normalization, things like flattening so that you can make it accessible and structuring it.

Putting it into a schema. Those are the types of things that make the difference.

[David Spark] Let me pause you right there for a second because I want to reference the genie wish thing. I think a lot of people don’t have that vision of what is the format I need to do with this.

[Matt Tharp] Yeah. So, one, I think the big thing is you’re putting it into a SQL data lake. So, when you put it into a data lake like that, what you want to do is make sure that it’s going to be accessible again, that you can query it, that you can use it easily. Big things then are making sure you flatten out whatever structure you put in there.

Often times the general strategy to a data lake is sort of like you were saying, Geoff, is you just dump the papers in. Well, but if you don’t sort the papers before you dump them into your data lake, you end up with that data swamp instead of a data lake.

[David Spark] Can you give us a quick example of flattening? Because I remember Mike Johnson mentioned something to the fact that time code information from Europe comes in differently than from the US. And then that…once you normalize that or I would guess flatten it… Is that kind of the same thing?

Normalize and flatten? Then that makes it more usable. Go ahead.

[Matt Tharp] Yeah, it certainly does make it more usable. I would separate those two things. That process is normalization. We are taking it, and we are standardizing it on something. That way it’s usable. The flattening I think is taking a look at the data and saying, “Okay, but I’m going to use these fields from this deeply nested JSON a lot.

So, let’s put them up where I can get them for an analyst or for an auditor so that they can use them instead of chaining all their searches together.”
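To make the distinction Matt draws here concrete, the following is a minimal Python sketch, not DataBee's implementation. Normalization is shown as converting differently formatted US and European timestamps to one standard form, and flattening as lifting nested JSON fields to top-level columns. The event shapes, field names, and the choice to treat timestamps without a time zone as UTC are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def normalize_timestamp(value: str, fmt: str) -> str:
    """Parse a source-specific timestamp and return ISO 8601 in UTC."""
    parsed = datetime.strptime(value, fmt)
    if parsed.tzinfo is None:
        # Assumption for this sketch: timestamps without a zone are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


def flatten(record: dict, prefix: str = "") -> dict:
    """Lift nested fields to the top level, joining key names with '_'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}_"))
        else:
            flat[name] = value
    return flat


# Hypothetical events: the same moment, written month-first and day-first.
us_event = {"ts": "03/04/2024 14:30:00", "actor": {"user": {"name": "alice"}}}
eu_event = {"ts": "04.03.2024 14:30:00", "actor": {"user": {"name": "bob"}}}

us_row = flatten(us_event)
us_row["ts"] = normalize_timestamp(us_row["ts"], "%m/%d/%Y %H:%M:%S")

eu_row = flatten(eu_event)
eu_row["ts"] = normalize_timestamp(eu_row["ts"], "%d.%m.%Y %H:%M:%S")
```

After this step an analyst can filter on `actor_user_name` directly and compare `ts` values across sources without knowing how each source wrote its dates.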

What’s the optimal approach?


[David Spark] Shawn Bowen of World Kinect Corporation said, “We need to look at making OCSF,” which is Open Cybersecurity Schema Framework, “a more pertinent standard and use throughout. Similar to .mp3s being a highly portable open standard format.” And Chance Daniels said, “Right now the solution is piecemeal.

First organization to solve this elegantly is going to 10x their market cap.” Now, I’m going to throw this to you. I think this is exactly what Comcast DataBee is trying to do. You want to be the one that solves this and 10x your market cap, right?

[Matt Tharp] Yeah, that’d be something, wouldn’t it?

[David Spark] Yes. All right. So, this standardizing format… Because the whole concept of the data lake is, “Give it to me in anything, and I’ll deal with it,” kind of a thing. But it doesn’t always work that way, does it?

[Matt Tharp] It makes it very difficult to use it on the output side. Like yeah, you got it in. Sort of to your point, Geoff, is now I’ve become a hoarder. I’ve got the data, but how do I ever do anything with it. That’s the part. OCSF I think does help a lot there. It’s an opportunity to say, “Well, you know what?

All these different pieces when they come in, they should land in this place regardless of the tool, regardless of whatever.” And one of the other benefits of OCSF is the ability to say, “I don’t care whether what you’re giving me is device inventory info. I don’t care if what you’re giving me is compliance data.

And I don’t care if what you’re giving me is a firewall log. I can put all of those into a schema. I’ve got a place for all of it.” While still being extensible enough to handle the nuances of how do I make it most optimal for my analysts and my auditors at my company.
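The idea of every source landing in one schema can be sketched as a set of small mapping functions. This is an illustrative Python sketch only: the common field names below are hypothetical and deliberately simplified, and they are not the actual OCSF classes or attributes, which are defined in the OCSF specification.

```python
# Hypothetical common schema for this sketch (NOT real OCSF attribute names).
COMMON_FIELDS = ("event_time", "source_type", "device", "user", "summary")


def from_firewall(log: dict) -> dict:
    """Map a hypothetical firewall log line into the common schema."""
    return {
        "event_time": log["timestamp"],
        "source_type": "firewall",
        "device": log["src_host"],
        "user": None,
        "summary": f"{log['action']} {log['dst_ip']}",
    }


def from_inventory(item: dict) -> dict:
    """Map a hypothetical device inventory record into the common schema."""
    return {
        "event_time": item["last_seen"],
        "source_type": "device_inventory",
        "device": item["hostname"],
        "user": item.get("assigned_to"),
        "summary": f"os={item['os']}",
    }


def from_compliance(finding: dict) -> dict:
    """Map a hypothetical compliance finding into the common schema."""
    return {
        "event_time": finding["checked_at"],
        "source_type": "compliance",
        "device": finding["asset"],
        "user": None,
        "summary": f"{finding['control']}={finding['status']}",
    }


rows = [
    from_firewall({"timestamp": "2024-03-04T14:30:00+00:00",
                   "src_host": "laptop-01", "action": "deny",
                   "dst_ip": "203.0.113.7"}),
    from_inventory({"last_seen": "2024-03-04T09:00:00+00:00",
                    "hostname": "laptop-01", "assigned_to": "alice",
                    "os": "macOS"}),
    from_compliance({"checked_at": "2024-03-04T10:00:00+00:00",
                     "asset": "laptop-01", "control": "disk_encryption",
                     "status": "pass"}),
]

# One query shape now works across all three sources.
laptop_rows = [row for row in rows if row["device"] == "laptop-01"]
```

Because every row has the same columns, a single filter on `device` returns firewall, inventory, and compliance records together, which is the "I've got a place for all of it" property Matt describes.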

[David Spark] Geoff, I don’t even know. What is your experience with data lakes up until this point?

[Geoff Belknap] Well, I’ve definitely taken everything off my desk and shoved it in a drawer, and I have definitely done the cardinal sin of a big data lake environment. I think the reality is the best thing about a data lake, especially for security, is it’s a relatively low cost place that you can put all this data that you don’t know quite what you want to do with it, but you know it’s valuable, in one place and make it valuable.

But the worst thing about it is you’ve got all your data in this one place, and you’re not quite sure what to do about it. So, going from there and really deriving that value means you either have to have a schema, or you have to have an idea in mind of what kind of detection you want to run, what kind of audits you want to run, what kind of questions do you want to ask this mysterious oracle that has all knowledge.

And I think as long as you come at it from a perspective of, “What kinds of answers do I want to get? What kind of questions do I want to answer,” you really can accelerate your progress.

[David Spark] I’m going to let you close this segment. I’m going to ask you this question, Matt. And that is for those people who have never used a data lake before, have never ingested…give me an example of an eye opening moment that they see, “Oh, this is the direction we need to go in because now we can do this.” Give me one of the examples of that.

[Matt Tharp] Yeah, so one of the things that we’ve managed to do internal to Comcast is turn it into a compliance continuous monitoring framework. It’s the piece that says of all of the assets in my environment, which ones have the required security controls. At the same time, I can also say, “And right next to that, who has done their phishing training?

Oh, and threat hunters, here’s your EDR data for you to go look for anomalous PowerShell commands across all PowerShell instantiations anywhere in my environment.”
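A continuous compliance check of the kind Matt describes comes down to joining asset records against the required controls and against training records. The sketch below shows that join in plain Python under assumed inputs; the control names, asset fields, and training lookup are hypothetical, and a real deployment would more likely run this as a query inside the data lake.

```python
# Hypothetical list of controls every asset is required to have.
REQUIRED_CONTROLS = ("edr_installed", "disk_encrypted")

# Hypothetical asset inventory and phishing-training records.
assets = [
    {"host": "laptop-01", "owner": "alice",
     "edr_installed": True, "disk_encrypted": True},
    {"host": "laptop-02", "owner": "bob",
     "edr_installed": True, "disk_encrypted": False},
]
phishing_training_done = {"alice": True, "bob": False}


def compliance_report(assets: list, training: dict) -> list:
    """List each asset's missing controls next to its owner's training status."""
    report = []
    for asset in assets:
        missing = [c for c in REQUIRED_CONTROLS if not asset.get(c, False)]
        report.append({
            "host": asset["host"],
            "owner": asset["owner"],
            "missing_controls": missing,
            "owner_trained": training.get(asset["owner"], False),
        })
    return report


report = compliance_report(assets, phishing_training_done)
non_compliant = [row["host"] for row in report if row["missing_controls"]]
```

Running the same function on a schedule, rather than once per audit, is what turns this into continuous monitoring.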

Sponsor – Comcast DataBee


[David Spark] Before I go on any further, we’ve been talking a little bit about Comcast DataBee, and I want to give you a little lowdown here. So, listen to this. Data is the currency of the 21st century. I mean that’s the reason you’re kind of listening right now, isn’t it? It can be used to understand the health of the business, to continually adapt as needed, to meet customer needs, to remain competitive, and to innovate.

It can also be used to better understand where threats are lurking or where security compromises have been made. But collecting and normalizing security data isn’t a small feat, and neither is contextualizing it with relevant business insights. What we’ve been talking about. So, enter DataBee. Made by security professionals for security professionals.

DataBee from Comcast Technology Solutions is a cloud-native security, risk, and compliance data fabric platform. It integrates and enriches data from disparate sources across your security technology stack to deliver more connected insights that drive better business decisions. With DataBee, GRC teams can validate security controls and address noncompliance.

Data teams can accelerate AI initiatives and unlock business insights. And security teams can quickly discover and stop threats. So, learn more how DataBee can deliver the security data insights you need to stay ahead of the ever-evolving threat and compliance landscape quickly and cost-effectively.

I’m going to read you a web address. It’s a little funky, but I know you can memorize this. So, it’s comcast, but it’s Remember that’s

How do we determine what’s most important?


[David Spark] Steve Zalewski, who’s the other cohost of this show, said, “More is not better anymore. Only actionable results should be generated from your lake. You will find much of the data you currently ingest provides little context for the alerts you were focused on finding.” This goes into the data hoarding, Geoff, you mentioned.

Matt Eberhart of Query said, “Start with the problem to solve is key.” Aw, what you’ve been saying, the other Matt. The Matt on the show. But this Matt Eberhart continues by saying, “Many teams can point to the data but struggle to do anything valuable with it. We need to make it easier for security teams to use data across all elements of security operations.” And lastly William Hall of UNC Health said, “If you want security data lakes to be successful, what customer problem are you trying to solve, and how can you make it both dead simple to use and highly effective.”

[Geoff Belknap] Yeah, piece of cake. Everything we do is dead simple.

[David Spark] Well, I’m going to start with that last quote right there, Geoff, of, “Dead simple to use and highly effective.”

[Geoff Belknap] So, so effective. And listen, if I’ve ever worked with you and we’ve ever built anything together…

[David Spark] That is the holy grail, isn’t it?

[Geoff Belknap] …everything you did was perfect, simple, and effective. I think the key here is if you want to get that, you have to do just what we’ve been talking about. The traditional approach here is know what you want from the data, don’t collect the data that’s not going to be able to address the stuff that you want.

And, look, there’s a reason that Common Event Format 20 years ago was super popular – because everybody had this problem of we’ll just stream endless amounts of data to the security team. And then they got it. Right? All done. Everybody has got what they need. And the reality is… And this is a reality that my team lives.

I can ingest an infinite amount of data, but I have a very small amount of data that I can actually act upon.

I think in today’s world where people are asking you, “How are you leveraging AI? What are you going to do in terms of syntactic modeling between these things? How are you going to tell, just like Matt mentioned, who hasn’t done their training but is also out there blasting PowerShell across all of PROD [Phonetic 00:16:00]?” Well, those are things I might want to answer.

I want to make sure if I’ve got all that data in one place, I have an idea of how I’m going to link all of that data together to get great useful answers out of that. There is a ton of work that can go into making that seem simple and effective.

[David Spark] Matt, this segment kind of speaks your language – the other Matt here, Eberhart, and William Hall speaks to the idea of you got to start at the end before you even begin that matter. But I also throw out the…what Geoff said at the beginning. I don’t think a lot of people can even envision the end.

So, when you can’t even envision the end, where do you start?

[Matt Tharp] So, one is I’d say that’s where we start to see that data swamp concept come up over and over again where people put data in and then start to manipulate it to get to where they want. You don’t always have to know exactly where you want to be. You should have a question in mind because that’s what justifies going and getting that data.

But if you bring it in and you put it in a structured schema, the next layer of going, “Okay, and what’s the piece that…how I want to manipulate this?” Or, “I want to look for insights from it, or I want to put it into an ML model.” Those things start to be simple if you’ve done the homework at the beginning side instead of the end.

So, start with a question in mind but don’t neglect the middle piece. Because if you do that, that’s where the whole data lake goes downhill. You got to start at the end and go, “These are the data sources I need. This is the question I’m trying to answer. Now let me go get those data.” And format them appropriately.

Do the intelligent things on it up front like looking for all the devices and tying them to particular users or anything that you can do to enrich it in that process so that when it lands in the data lake, now you have what you already need focused on total cost of ownership still to make use of it and make it highly effective.
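Tying devices to users up front, as Matt suggests, can be sketched as an enrichment step that runs before a record is written to the lake. This is a minimal Python illustration under assumed inputs; the inventory lookup, field names, and the "unknown" fallback are hypothetical.

```python
# Hypothetical device-to-owner lookup, e.g. exported from an asset inventory.
DEVICE_OWNERS = {"laptop-01": "alice", "laptop-02": "bob"}


def enrich(event: dict, owners: dict) -> dict:
    """Return a copy of the event with the device owner attached at ingest."""
    enriched = dict(event)  # copy so the raw event is left untouched
    enriched["owner"] = owners.get(event.get("host"), "unknown")
    return enriched


raw_events = [
    {"host": "laptop-01", "process": "powershell.exe"},
    {"host": "server-09", "process": "sshd"},
]
enriched_events = [enrich(event, DEVICE_OWNERS) for event in raw_events]
```

Doing the lookup once at ingest means every later query already has the `owner` column, instead of each analyst repeating the join.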

[David Spark] I’m getting the sense that maybe the people who you should be asking these questions are maybe the people running your SOC. I’m just throwing this out because, again… I’m going to throw out another analogy. I remember the very first time I got…I hired an accountant, I met this guy at a party, and he was asking me all these questions like, “Are you doing this?

Are you doing this?” And I realized I’m not even asking myself these questions. I wouldn’t even know to ask myself these questions. So, you got to go to the people who know the questions to ask. Who are those people?

[Matt Tharp] We found two of them. One of them, you’re right, is your SOC. Often times I say it’s your threat hunters. It’s the people who have been living in the mess and have enough of that data engineering background to go, “Aw, if only. If only this looked like this when I got it, I would have been able to do this differently.” That’s one.

The other one is your compliance process owners because they’re often dealing with the, “I need to get this data from all these different things and do something with it before it gets to my executives who are going to look at this dashboard or who are going to look at whatever.” So, those would be the two people that I’d say have a good enough understanding of what it is that they really need while also having the understanding of the end use case, the end in mind that you’re trying to get to.

What are the risks we’re dealing with?


[David Spark] Nathan Vega of Protegrity said, “How are people thinking about and handling the risk? If the data lake contained PII, we’d pseudo-anonymize the sensitive data to add protection directly to the data that follows it even if it were to leak.” So, that seems like kind of an obvious thing, but I guess not a lot of the data coming in is anonymized.

This is a really good question to be asking, Geoff, at the very beginning. Like what isn’t anonymized here? And we don’t want to open ourselves to potentially other compliance issues through this data lake experience, yes? Because that must be a fear. Like, “Oh, wait a second. Are we creating a compliance nightmare for ourselves?”

[Geoff Belknap] I think it’s always a fear. But in my experience, the reality is you have to address the risk for data where you’re certain it’s going to contain PII, or PHI, or regulated data. If you know for a fact you’re going to ingest something that might contain that, great. Do the work ahead of time.

Build a pipeline to leave that out. But a good risk manager really thinks about how to respond to things you can’t reasonably prevent. One of the common things that happens is an engineer will change something, and you’ll accidentally log something into a stream that you didn’t mean to. The reality is that you can’t prevent that comprehensively.

And my advice is don’t worry about it. Just make sure that you have a really good response process in place so that when it does happen you can clean it up and you can do the right thing.

[David Spark] Matt, let’s talk about PII and ingesting PII. Is just pseudo-anonymizing it the only answer? What happens if you do it accidentally and it gets into the system? What are patterns you’ve seen?

[Matt Tharp] Honestly this is one of the reasons we like a data lake is often times you have column and row level access so that you can say to particular analysts or auditors, “Hey, look, you can’t see IPs or names.” Yes, they are in this table or this column in your data lake, but they’ve been masked out for you.

Somebody else can have access to that. It’s much easier to share the data within a data lake without opening yourself up to a lot of that PII exposure risk. That said, to Geoff’s point, yeah, you absolutely need to monitor this data just like any other data in your environment. Keep tight audit controls on it, make sure that maybe you’re putting those audit logs in a data lake.

I don’t know. Somewhere where then you’re going to go look at them. But yeah, you do need to make sure that you are policing that. You can pseudo-anonymize or mask. But putting it into a data lake, that’s one of the benefits of data in a data lake is it’s easy to apply fine-grained access controls or it can be easy.
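The two protections discussed in this segment, pseudo-anonymizing sensitive values and masking columns per role, can be combined in a short sketch. This is illustrative Python under assumptions: the role names, column names, and key are hypothetical, and in practice data lake platforms typically enforce column- and row-level access in the query engine rather than in application code.

```python
import hashlib
import hmac

# Hypothetical key for this sketch; a real key would live in a secrets manager.
SECRET_KEY = b"example-key-for-illustration-only"

# Hypothetical set of columns treated as sensitive.
SENSITIVE_COLUMNS = {"src_ip", "user_name"}


def pseudonymize(value: str) -> str:
    """Replace a value with a keyed hash so rows can be joined but not read."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def view_for(role: str, row: dict) -> dict:
    """Return the row as a given role may see it (roles are hypothetical)."""
    if role == "privileged":
        return dict(row)
    return {
        key: pseudonymize(str(value)) if key in SENSITIVE_COLUMNS else value
        for key, value in row.items()
    }


row = {"src_ip": "198.51.100.23", "user_name": "alice", "action": "login"}
analyst_view = view_for("analyst", row)
privileged_view = view_for("privileged", row)
```

Because the hash is keyed and deterministic, the same user always maps to the same token, so an analyst can still count or correlate events per user without seeing the name.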

[David Spark] Well, I think what we’ve learned today is that the data lake is wonderful. Try your best to figure out why you want to use it and then ask yourself all the questions. And if you don’t know the questions, go to the people who are in the frontlines who know the questions to ask and can tell you, “Hey, it’d be wonderful if you could do this.” This is my grand summary of our discussion.



[David Spark] But obviously, Matt, we greatly appreciate your insight on this. I’m going to let you have the very last word. First, though, I do want to ask you which quote was your favorite and why. So, pick any quote here. There were a lot of good ones, but pick your favorite and why you think it’s such a great quote.

[Matt Tharp] Yeah. So, I think my real favorite is going to be Chance Daniels’, the 10x market cap.

[David Spark] Of course. [Laughs]

[Matt Tharp] But seriously, I think my favorite quote is probably Yaron Levi’s. It’s the end in mind. Because if you start there, you have a much better chance of making your data lake highly effective and successful and not something that you have a total cost of ownership problem two years down the road.

[David Spark] Which I’m sure that’s… I just think about the early days of databases. Just forget data lakes. When you don’t set that thing up right, oh, you pay for it badly down the road. And I’m assuming data lake is no exception in this story. Geoff, your favorite and why.

[Geoff Belknap] I promise he didn’t pay me for this, but I’m going to go with Steve Zalewski’s quote. You may know him as a cohost of…the other cohost of Defense in Depth. “More is not better anymore. Only actionable results should be generated from your data lake. You’ll find much of the data you currently ingest provides little context to the alerts that you’re already focused on finding.” I think this is very true, and I just want to repeat – more is more.

More is not better. More is not more efficient. It’s not more advanced. It’s just more. And I think just to echo the point our own Matt was making earlier. It’s like you can ingest a lot of data, but the goal is not a lot of data. It’s a lot of context. You want to ingest the things that are going to be valuable to you down the road, not the things like, ah, maybe someday I’ll want this dump of binary from a Java heap.

Be thoughtful about that. Collect the things you know you’re going to need at some point. You can always collect more later.

[David Spark] Excellent. Now, I want to wrap this sucker up and mention again Comcast DataBee. Thank you so much for sponsoring. Again, for those of you who don’t remember, transform your data chaos into connected outcomes with security, risk, and compliance data fabric platform. Remember to go check out more information.

You know the word Comcast. What if you were to slap a period between the A and the S in Comcast? Try that. And you will find all the information you want about this. Matt, I’m going to let you have the very last word here. Anything else you’d like to say about Comcast DataBee, an offer to our audience, anything like that?

[Matt Tharp] Yeah, mostly we welcome you to join us on this journey. We’re hiring, and we really make it easy for customers to ingest data into that data lake.

[David Spark] Aw. Well, I know they would all like that. And you could one day be the iTunes of data lakes. Maybe. iDatalake. You think that would be…you think that’s going to catch on, Geoff?

[Geoff Belknap] Maybe data.lake. I can think of a lot of better domains here. But I think the important thing is the value you’re going to derive from these kind of things.

[David Spark] Exactly. Thank you again to Matt, to Comcast DataBee, and thank you to our audience. We greatly appreciate your contributions and listening to Defense in Depth.

[Voiceover]  We’ve reached the end of Defense in Depth. Make sure to subscribe so you don’t miss yet another hot topic in cyber security. This show thrives on your contributions. Please write a review, leave a comment on LinkedIn or on our site, where you’ll also see plenty of ways to participate including recording a question or a comment for the show.

If you’re interested in sponsoring the podcast, contact David Spark directly at Thank you for listening to Defense in Depth.

David Spark is the founder of CISO Series where he produces and co-hosts many of the shows. Spark is a veteran tech journalist having appeared in dozens of media outlets for almost three decades.