Thursday, November 14, 2013

Intrusion Detection Made Skeezy: How do you find the DEA in a haystack? | Analytics Made Skeezy

Heard about this last night at a data science meetup.  The intent is of course not to promote crime, but to make data science a bit more accessible with storytelling...   :-)

Intrusion Detection Made Skeezy: How do you find the DEA in a haystack?

NOTE: In this post we’ll be implementing the Local Outlier Factor algorithm introduced in 2000 by Breunig et. al. I’d highly recommend following along in the spreadsheet for this post. And don’t forget to sign up for the newsletter to hear about future Analytics Made Skeezy tutorials.
“We have a problem Alex,” said Victor. We sat together on a bench outside the Great American Scream Machine at Six Flags. I was beginning to suspect that Victor was picking these bizarre meeting locations on purpose. Certainly no one had stuck a hidden mic under this bench, although plenty of folks had stuck their gum there.
The evening air smelled of sweet garbage, like a rotten apple core. And the rickety roller coaster clacked up the first hill in the distance. Victor watched it through slit, thoughtful eyes. I’d never seen him like this. A scruffy beard, he fidgeted slightly.
“What kind of problem?” I asked nervously. I knew the normal Victor was a dangerous man; I assumed the paranoid Victor wouldn’t be much better.
He turned and looked at me, “There’s a fox in my henhouse.”
“What do you mean?” I asked. Someone on the roller coaster shrieked as it tumbled down a hill.
“I have a source within the DEA,” he said, eyeing me carefully as he spoke, “And this source has told me that there is a Judas waiting to betray me.”
My throat was beginning to close, and my heart leapt.
“Who is it?” I asked, faking a calm curiosity.
“A coke supplier is all I know. They’ve set a trap for me and some of my competition. But I don’t know which supplier,” He shook his head, “Which is terrible because there’s a shortage right now from all the gang violence south of the border, and I’m having to make deals with stranger and stranger suppliers just to meet demand.”
I breathed a sigh of relief. I wasn’t the traitor. At least not this time. But the fact that the DEA had a mole didn’t bode well for me. I made a note to warn Agent Bestroff.
“So do you have any clues?” I asked.
“I have a short list,” said Victor, “A list of all the coke suppliers offering up their goods to me right now who’ve passed basic vetting. But I have no idea which one has been turned.”
I nodded, “You got any data on these guys?”
“Yeah,” he said, “I’ve got some data. Simple checks we run, supply chain info, price, and availability data, but honestly, I already looked through it, and I didn’t see anything funny.”
He opened a spreadsheet that looked like this:
“Here they are,” he said with a shrug, “All 249 of them. I looked at their prices, the quantities they offered…we’ve got small and large players, but no one’s an outlier.”
I smiled, “That’s an interesting word you use, ‘outlier.’ How’d you check to see if you had any outliers?”
“I looked on the list for anyone named ‘Judas,’” he said with a laugh, “No, I computed the means and standard deviations for the price and quantity columns, but no values seemed too far from the others. And I made graphs of the numeric data, but it all looks fine.”
He pulled up a graph of the available coke each supplier had for sale:
“You can see from the jump in the graph that we’ve got about 80% small time suppliers and 20% or so larger suppliers,” said Victor, “But a minimum of 500 kg for sale is not extraordinary nor is a max of 5000 kg. These values are all expected.”
“And the same is true for the other numeric columns here?” I asked. Victor nodded yes.
“So we’re bumping up against a couple of issues in intrusion detection,” I said, “The first problem is that you’re looking at each column individually for an outlying variable, but maybe this fox is hiding right in the middle of your henhouse, and only when we examine all the data holistically will we find him. And by all the data, I mean this text data too.”
I pointed to all the categorical information in the spreadsheet, the ‘Y’s and ‘N’s and shipping methods.
“You said there were a couple issues here. What’s the other one?” Victor asked.
“Well, taking means and standard deviations are terrible for finding outliers here, even if we could do the analysis column by column, because as we can see from the jump in the graph, your data is pretty oddly distributed. Furthermore, when you take a mean or a standard deviation, those values actually include any outlying values in them. You should be using more robust statistics of spread and centrality.”
“What do you propose then?” asked Victor.
I thought for a moment, “Well, if there is an intruder in this data, a ‘wolf in sheep’s clothing’ as they say, then the first thing we should search for are people who may not necessarily be outliers globally on any one value but who nevertheless look unlike the suppliers nearest them. They’re trying to fit in with some subgroup of suppliers here, ‘putting on the sheep’s clothing,’ but they’re failing.”
“But let’s back up a minute. Can you explain all the columns in this sheet to me?” I asked.
“Of course,” said Victor. He distractedly smiled as he watched a boy walk by carrying an over-stuffed bear he’d won. The uncomfortable bench we sat on was starting to hurt my butt, and it irritated me that I’d had to fork over $30 just to meet a drug dealer inside an amusement park when the parking lot would have sufficed.
Victor turned his eyes back to the sheet.
“So the ‘Operate Own Transpo’ column is pretty straightforward,” he said, “Some suppliers contract out to drug runners, while others own and operate that vertical. Usually the larger guys will operate their own running operations.”
“The column next to it is the smuggling method they prefer to use,” he continued, “Personally, I don’t like to deal with someone who uses drug mules unless I have to. Too messy, too many things go wrong, and the quantities they can smuggle are small. The new narco subs are the most reliable, but even a car across the Canadian border through the Chippewa Indian Reservation isn’t a bad risk.”
“And what does ‘Counterintel?’ mean?” I asked.
“Do they have any counterintelligence they can bring to the table?” he said, “As in, do they have any tipsters in law enforcement? Do they have inside information that tips the scales in favor of successful transfers of product? Once again, the larger organizations are more likely to have this.”
“Then why not go with the larger guys every time?” I asked.
“Well,” said Victor, smiling, “Economies of scale don’t trickle down to pricing in the drug trade per se. The large guys are less likely to lose a load, get caught, get you busted, poison your customers, but they charge a hefty premium for their higher level of service. Much like shopping at the nice grocery store here in town, Publix I think it’s called, you pay extra for the nice experience. On top of that, I need the small guys to meet my total demand across the U.S.”
“Who are Jimbo, Stevie, and Margalo?” I asked, pointing to the next columns in the spreadsheet.
Victor laughed, “Strange, right? They’ve actually my competition in some major cities, but we pool information on suppliers. If one of them can vouch for a supplier, that’s better than nothing.”
Victor held a finger up, “But you can’t always trust it. You never know when they might get together and burn you. If one of them were to get busted, they could start feeding bad intel to the rest of the group.”
I nodded, “And the ‘Criminal Record’ column is whether the supplier has a criminal record?”
“Yes, their leadership,” he said, “Believe it or not, that’s a plus in this line of business. People who seem too clean often are.”
“What’s ‘need to meet?’” I asked.
“The exchange, money for product, can either go down in a meet or a drop. If you do a meet, the supplier is less likely to get burned because the bags change hands at the same time. But it’s dangerous and unpredictable. The big guys often prefer to just do a drop. Leave the drugs in the trunk of a car and get the money from you where you leave it. Clean hands. But many of the littler guys can’t afford to front the loss of even one shipment so they’re more likely to insist on a person-to-person meet,” said Victor, “As for the last three columns in the sheet, they’re all self-explanatory: the number of times we’ve personally dealt with the supplier, the max coke we can currently buy from them, and the price per kilo.”
I nodded, “Ok, so the first thing we need to do is get this categorical data into something we can analyse. We’re going to convert all the yes/no questions into +1/-1 values, and we’re going to take that transportation method column and split it into dummy variable columns with a 1 in the slot of the transportation method they prefer similarly to how we did it for our LSD trip AI model.”
“Yes, I remember dummy variables,” said Victor tapping his temple with his index finger.
I created a new sheet called “Numeric” and filled it in with the converted data using simple IF() formulas:
“So now the data is numeric, but what do we do with it?” asked Victor.
“Well, we’re going to compute a value for each supplier called their ‘Local Outlier Factor’ or LOF for short,” I said.
“And what will the LOF tell us?” asked Victor.
“The LOF is a single number that when it’s near 1 means ‘Hey, I look like all the suppliers whose data is most similar to mine’ and when it’s far greater than 1 it means ‘Hey, I look unusually different from those suppliers whose data is most similar to mine,’” I answered, “It’s a local outlier detection method as opposed to a global outlier detection method, meaning that while the outlying supplier may not be worlds away from everyone in this data set, he’s odd when compared to his closest peers.”
Victor gave a face that he tentatively accepted what I was saying.
I smiled, “It’s a bit of a journey to get there, but I swear once you see the result, you’ll get it.”
“OK,” said Victor, “So what do we do with this numeric data?”
“Well, the first thing we’re going to do is scale and center it,” I said, “So I’m going to take the trimmed mean and mean absolute deviation of each column to start.”
I added it two rows at the bottom of the numeric data.
“The trimmed mean is just like a regular mean except that I’m gonna toss out the lowest and highest 5% of values from each column to prevent any outliers in a single column from skewing the mean. A trimmed mean is kindof like what they use when judging gymnastics meets and they ‘throw out the highest and lowest scores.’”
“Ah,” said Victor, “That makes sense.”
“And the mean absolute deviation is just the average of the absolute values of the differences between each value and the trimmed mean. It’s like variance except we’re using absolute values instead of squared values to minimize the effect of any outliers,” I said.
“OK,” said Victor as he peered at the sheet, “And what do we do with them?”
“Well,” I said and created a new sheet called Scaled, “We’re going to subtract the trimmed mean from each value in the sheet to center the values around 0. Then we’ll divide through by that column’s mean absolute deviation. That’s going to take something like the price column and put it on the same scale as the ‘# of Past Deals’ column, so that no column is going to be numerically more important than any other just because it’s got a big scale.”
I filled in the ‘Scaled’ spreadsheet, showing Victor how, for example, in the Transpo column for the supplier ‘Abraham’ (B2) its new scaled value was just its old numeric value minus the mean divided by the spread. 
“Now we have all numeric, scaled data,” I said, smiling.
“And we use it to tell suppliers apart?” asked Victor.
“We want to use it to measure the distance between each pair of suppliers,” I said, “And by distance, there’s a lot of different distance metrics out there.”
“Like the cosine similarity we used on my wholesale data?” asked Victor.
“Yes, exactly, although this time let’s keep it simple and just use Euclidean distance,” I said.
“So when measuring the distance between two suppliers, then,” said Victor, “That’s just the square root of the sum of the squared differences of the values from each of the columns?”
“Yep, just like you used to do in elementary school when you calculated the length of the hypotenuse of a triangle as the square root of the height squared plus the length squared,” I said.
I created a new tab in the workbook called “Distances” and filled it with a Suppliers X Suppliers grid.
“Can you explain that Excel formula to me?” asked Victor.
“Sure,” I said and pointed to the top left distance in the spreadsheet, “I’m looking up the scaled data for the supplier for this row using a VLOOKUP and pulling out the numeric columns I want from the previous sheet with that curly braced list. I’m doing the same VLOOKUP for the column. Then I’m taking the difference between the two sets of looked up values, squaring each difference, summing up the list, and taking the square root a la Euclidean distance.”
Victor nodded, “And this is another one of your array formulas?”
“Yes,” I said, “Since we’re using the VLOOKUP to grab a list of values which we’re subtracting from another list, we need to use an array formula in Excel instead of a regular one to do that list operation. And to use an array formula, I just type the formula normally into the cell and then instead of hitting return I hit ‘control/command + shift + return’ to engage it. That’s what added those curly braces around the entire formula.”
“So now we have a distance between each (row, column) pair of suppliers?” asked Victor.
“Exactly,” I said, “Very similar to the grid we made for the wholesale customer clustering we did a while back. And the next thing we’re going to do is rank them. For a given column, we’re going to replace the distance to the supplier in a row with its rank-order starting with nearest first. Since the self-distance is on the diagonal, we’ll start the rankings at 0 to make it easy to ignore the self-distance.”
I created a new sheet called ‘Rank’ with the rankings. The formula for the top left rank looked like this:
“So you’re just using the RANK() formula to look up a distance and compare it to the rest of the distances on that column?” asked Victor.
“That’s it,” I said and made another sheet called ‘Rank-Inv’ which was a transpose of the rankings so they were row-oriented.
“Now,” I said, “The reason why this approach is called Local Outlier Factors is because it compares how dense a local neighborhood of points is gathered around you to how dense the neighborhoods are around those points neighboring you. If your neighborhood is  bit watery compared to the neighborhoods around you, then you’re an outlier. The people you consider your friends, don’t consider you much of a friend.”
“So how big is a neighborhood?” asked Victor.
“Well,” I said, “The size of the neighborhood is the only parameter we need to choose when calculating LOFs.”
“And what do we set it to?” asked Victor.
I shrugged, “Let’s start with a neighborhood of 5 points. The algorithm is fairly robust to changes in the size of K, but 5 is a good place to begin. And we’re going to create a sheet of ‘reach distances’ where if you’re in my neighborhood of 5 closest points, then your reach distance with respect to me is the distance to the perimeter of the neighborhood, but if you’re outside of my neighborhood, then the reach distance to you is our actual distance.”
Victor looked confused.
“Think of it like Lord of the Rings. If I’m the King of Gondor and you’re in my kingdom, i.e. Gondor, then the reach distance to you is the city walls. But if you’re not in my walled city, if you’re in say, Minas Tirith, then the reach distance between you and me is just our actual distance.”
“I hated that movie,” said Victor, “but I think I’m following you.”
“How can you hate Lord of the Rings?” I asked.
“I liked all the evil creatures,” he said, ”But the nice ones, the elves and the like, they were silly and poorly done.”
An international criminal would think that, wouldn’t he, I thought. But I said nothing.
I added a sheet called ‘Reach-dist,’ filled in ’5′ as the size of the neighborhood, and then calculated the distance to the fifth and furthest point within each supplier’s neighhood, called the ‘K-distance.’
Victor leaned in to read the formula, “So you’re going over to the Rank sheet, finding the rank that’s equal to 5, setting that value equal to 1 and the rest to 0, and then multiplying that unit vector with the column’s distance vector before summing that up. All that really does is just look up the fifth ranked distance in the end.”
“That’s right,” I said, “It’s just a ghetto-hacked lookup across the ‘Rank’ and ‘Distances’ sheets. And to do that array calculation, we have to make the whole thing into an array formula.”
He nodded, “That makes sense.”
“OK,” I said, “So now we’re going to copy in all the distances from the Distances sheet, except if you’re in my neighborhood, I’m going to replace the distance with the distance to my neighborhood’s edge, the K-distance.”
Here’s what the sheet looked like where the MAX function is used to pick between the neighborhood’s edge and the distance to the supplier:
“And now that we have all our Reach Distances set, we’re ready to finish this calculation up,” I said, “We know that the reach distance to my five neighbors with respect to me is just the distance to the fifth and farthest in the neighborhood. But what’s my reach distance with respect to them? If their neighborhood is smaller than mine, i.e. their neighbors are closer to them than mine to me, then while they’re inside my ‘city walls,’ I’m outside theirs.”
“We’re going to calculate the average reachability of each supplier with respect to their five neighbors,” I continued, “To do that we just read across the rows of the Reach Distance sheet, grabbing the Reach Distances from the five closest points using the ranking information from the Rank-Inv sheet that has row-oriented rankings on it.”
I created a new sheet called ‘LOF,’ pasted the names of the suppliers down the first column, and filled in the formula for the average reachability distance to ‘Abraham’ and showed it to Victor.
“Once again, the lookup of rows of data between sheets makes it an array formula,” Victor said, reading over the formula, “And you’re dividing the sum by 5 to make it an average.”
I dragged down the calculation and the sheet looked like this:
“So now that we have the average reachability of each supplier with respect to their neighbors, if we invert that value what we’re left with is basically a density of the area around you” I said, “We’ll call it ‘Local Outlier Density.’”
I inverted the Average Reachability column and dragged the density values down.
“Then, to finish this up,” I added, “The Local Outlier Factor for a supplier is just the average of the ratios of my 5 nearest neighbors’ densities to my own. If we’re all about the same density, then the value will tend toward 1, and I’m locally ordinary. If they’re in regions more dense than mine, then the ratios of their densities over my density will climb. That means that I consider them neighbors way more than they consider me a neighbor. I’m like that kid growing up who lived on a farm just outside of town. I’ve got friends in town who go to my school, but they’re better friends with each other than poor old me.”
I filled in the Local Outlier Factor formula which required that I look up the densities of my five nearest neighbors, divide each by the density of the point at hand, and average those ratios. The whole lookup was once again an array formula.
I tossed some conditional formatting on the column and rubbed my hands together.
“All we have to do now is look for the largest factor. That’s the point whose neighbors don’t really consider him a neighbor. The point is like Edward Scissorhands up in that castle at the edge of town,” I said.
I scrolled down the list. Near the bottom, I stopped.
“Whoa, check out this Zhenli guy,” I said, “He’s the only one with an outlier factor over two, so reachability is way out of whack with those points nearest him.”
“Hmmm,” Victor said, “I’ve not dealt with this supplier, but I’ve heard of him. Up and coming large supplier out of the pacific. Flip back to the raw data.”
I flipped back to the first sheet and scrolled down to the same supplier:
Victor nodded, “OK, he’s a big player. Operates his own narco sub, counterintelligence, has a criminal record. Even Margalo vouched for him, which isn’t nothing. But the other two don’t know him.”
Then Victor fell silent, “But look at his price.”
“Is it abnormally, low?” I asked.
“Not for a small player, but for someone operating transpo and counter-intel, it’s very cheap. And for good quantities too,” he said.
“And he demands that you meet in person,” I said.
“Yes,” said Victor, “Also rare for a big player. I don’t like it…maybe a trap.”
Victor suddenly stood and pulled out his phone, “I need to make a call.”
He walked some steps away where his voice was drowned out by the sound of the children, rides, and games. I feared that I’d just royally screwed the DEA. I needed to warn them. I couldn’t make out the words on Victor’s lips, but his left hand was clenched into a fist and his eyes had a far-away look as he spoke.
He hung up the phone and returned to the bench.
“We’ll see if this man is a problem or not,” said Victor.
“How are you going to do that?” I asked.
Victor just smiled, “I have ways of dealing with such matters.”
I didn’t contact agent Bestroff until I was back in my car and on I-20 speeding back toward Atlanta, safe from prying eyes and ears.
“Mr. Bestroff,” I said into my speakerphone, “I’m not sure if you’re working with a coke supplier named Zhenli Ye Gon, but if you are, Victor’s on to your honeypot.”
The other end of the phone was silent for a moment.
“Shit,” was all that finally came over the line.
“And to make matters worse,” I said, “You’ve got a mole inside the DEA. At least that’s where Victor says he got the intel. I sure hope you haven’t been using my real name in any of your interoffice communications.”
More silence.
“I’ll call you back,” was all Bestroff said, and he rang off.
That didn’t inspire confidence.
“I’m so screwed,” I said quietly to myself, and I sped back toward the frat house.

Big Data-ish Tip

At MailChimp most of our abuse is conducted not by malicious folks but by unthinking customers. If I download the stolen list of email addresses from the Sony hack and send to it regarding my plumbing company, sure that’s illegal and malicious, but it’s not adversarial per se. Joe the plumber is not trying to hide from me — he’s just being an idiot. Most of the abuse I see in my job does not evolve over time.
In that case, training an AI model like we did in an earlier post, using labelled abusers to find future abusers, is possible. But what if your abusers are adversarial? They’re trying to hide, perhaps blend in. They change their behavior with each intrusion.
In this case, a supervised AI model ain’t gonna do you as much good. That’s where outlier detection techniques like LOF come in. You don’t know apriori what an outlier looks like, but local outlier factors give you a value that helps you prioritize cases to investigate. The example above is somewhat artificial, but you can envision situations where you’ve got HTTP requests and timestamps in a vector from your website and maybe you want to use them to detect aberrant behavior. LOF might be the way to go there. How does a user’s HTTP request vector line him up with his neighbors?
Now, that said, once you have distance data, you can do all sorts of fun stuff with it. You could construct a trimmed kNN graph like we did in our wholesale graphing discussion. Just applying a layout algorithm on the graph using 1 / distance as the weight on the edges might show you something visually. In 2004, Hautamaki et. al. released a paper arguing that looking at the Indegreeof each node on the graph would let you know who’s an outlier.

Intrusion Detection Made Skeezy: How do you find the DEA in a haystack? | Analytics Made Skeezy

'via Blog this'

No comments:

Post a Comment