
RSA Archer Suite


You want to know what the real answer to this whole Big Data challenge is? It's in us!

Fantastic, isn't it? Well, until that becomes "commercially viable", let's talk about what we can do today.

 

The right tool for the right job – that’s no doubt a cliché to many. But it’s surprising how often the tools at hand are used for any kind of job. In my last post, I talked about why dealing with Big Data is not just about data, but also about a new set of tools.

 

Let’s dissect a use case to understand the heart of the problem. In earlier posts, I talked about clickstream data. Clickstream data is generated by user actions on web pages – it can include everything from the components downloaded when a user clicked on something, to the IP address, the time of the interaction, the session ID, the duration, the number of downloads triggered, the bytes transferred, the referral URL and so on – in “tech” speak, it’s the electronic record of the actions a user triggers on a web page. All of this is recorded in your web server logs. On business web sites, these logs can easily grow to several gigabytes a day. Also, as I mentioned in my previous post, analysis of this data can lead to some very beneficial insights and potentially more business. To get some perspective, if you are an Archer Administrator, check out the size of the largest log on the IIS server that hosts the Archer web application. I am guessing the largest file is easily a few hundred megabytes, if not close to a gigabyte. OR WAIT, Archer History Log anyone?

Here’s a snippet from the web server log on my local Archer instance (on my laptop), which incidentally grew to about 17 MB in a single day (used primarily by me a couple of times a day):

[Screenshot: snippet of the IIS web server log from the local Archer instance]
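Just to make that concrete, here is my own illustrative sketch (not anything shipped with Archer) of how you might pull those clickstream fields out of a W3C-format IIS log with a few lines of Python. The column layout below is taken from a typical “#Fields:” header and the log file name is hypothetical – match both to your own logs:

```python
import sys

# Column layout from a typical IIS "#Fields:" header (an assumption --
# check the header of your own logs and adjust).
FIELDS = ["date", "time", "s-ip", "cs-method", "cs-uri-stem", "cs-uri-query",
          "s-port", "cs-username", "c-ip", "cs(User-Agent)", "cs(Referer)",
          "sc-status", "sc-substatus", "sc-win32-status", "time-taken"]

def parse_iis_log(path):
    """Yield one dict per request, skipping the '#' comment/header lines."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            values = line.split()
            if len(values) != len(FIELDS):
                continue                      # malformed or truncated line
            yield dict(zip(FIELDS, values))

if __name__ == "__main__":
    # Example: top client IPs by request count for one day's log
    # ("u_ex140310.log" is a hypothetical file name).
    log_path = sys.argv[1] if len(sys.argv) > 1 else "u_ex140310.log"
    hits_per_ip = {}
    for record in parse_iis_log(log_path):
        hits_per_ip[record["c-ip"]] = hits_per_ip.get(record["c-ip"], 0) + 1
    for ip, hits in sorted(hits_per_ip.items(), key=lambda kv: -kv[1])[:10]:
        print(ip, hits)
```

That works fine for my 17 MB laptop log. The rest of this post is about why it stops being a workable approach when the logs are hundreds of gigabytes a day.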

Now there are those who would argue that log data is not the best example of “Big Data” – part of the reasoning being that it does have some structure. Besides, weren’t businesses doing clickstream analysis already, before it was characterized as Big Data?

Yes, they were – but there’s a little thing they do that is very inconspicuously described as “pre-processing”. Pre-processing is a diversion that hides the fundamental challenges of dealing with all of the data in the logs. The logs themselves, or the raw data in them, are second-class citizens, or even worse, “homeless”. The web servers don’t want to hang on to them, since the size of the logs can impact the performance and storage of the web server host itself. The systems that are going to use this data don’t want it in raw format, and don’t want all of it.

 

Typically, “pre-processing” involves some very expensive investments to clean the data, validate certain elements, aggregate it, and conform it to, quite often, a relational database schema or a data warehouse schema. Not only that, but both the content and the time window are crunched to accommodate what the existing infrastructure can handle. At the tail end of this transformation is the loading of the data into the warehouse or relational database system. Not only does this data now represent a fraction of the raw data from the logs, it could also be several days between the raw data coming in and the final output landing in the target data source. In other words, by the time somebody looks at a report on usage stats and patterns for the day the actions were recorded, weeks could have gone by. And the raw data that fed all of this is usually thrown away. There’s another big problem in this whole scenario: you are clogging your network with terabyte-scale data movements. Let’s face it – where and how do you cost-effectively store data that is coming in at a rate of several hundred gigabytes a day? And once you do, how do you cost-effectively and efficiently process several terabytes or petabytes of this data later? This is a Big Data problem. We need a different paradigm to break this barrier.

What if, instead of trying to pump that 200 GB daily weblog into a SAN, you could break it apart and store it on a commodity hardware cluster made up of a couple of machines with local storage?

And what if you could push the work you want to do onto those machines, to the units of data that constitute the file? In parallel? Instead of moving the data around?
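As a scaled-down, purely local illustration of that idea (my own sketch – hypothetical chunk file names, Python’s standard multiprocessing, one machine standing in for a cluster), you could ship the counting work to the pieces of the log and only merge the small per-piece results:

```python
from multiprocessing import Pool

def count_hits(chunk_path):
    """The 'work' shipped to each chunk: count requests per client IP."""
    counts = {}
    with open(chunk_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.startswith("#") or not line.strip():
                continue
            parts = line.split()
            if len(parts) > 8:                 # c-ip column in the assumed layout
                counts[parts[8]] = counts.get(parts[8], 0) + 1
    return counts

if __name__ == "__main__":
    # Pretend the big daily log was already broken into pieces (hypothetical names).
    chunks = ["weblog.part0", "weblog.part1", "weblog.part2", "weblog.part3"]
    with Pool(processes=4) as pool:
        partials = pool.map(count_hits, chunks)   # work runs on the pieces in parallel
    totals = {}
    for partial in partials:                      # merge only the small results
        for ip, n in partial.items():
            totals[ip] = totals.get(ip, 0) + n
    print(sorted(totals.items(), key=lambda kv: -kv[1])[:5])
```

A real cluster does the same thing, except the pieces live on different machines and the splitting, scheduling and merging are handled by a framework – which is exactly where the next part comes in.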

 

Say hello to Hadoop – a software framework that has come to play a pivotal role in solving Big Data problems. Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming models. At its core, Hadoop consists of two components:

1) HDFS, or the Hadoop Distributed File System, which provides high-throughput access to data and is designed to run on commodity hardware

2) MapReduce: a programming model for processing large data sets

So what makes Hadoop the right tool for this problem?

  • Storing very large volumes of data across multiple commodity machines: with HDFS, large sets of large files can be distributed across a cluster of machines.
  • Fault tolerance: in computations involving a large number of nodes, node failure is expected. This notion is built into Hadoop – the data in every file is replicated across multiple nodes.
  • Move the computation, not the data: one of the core assumptions in Hadoop is that “moving the computation is cheaper than moving the data”. Moving the processing to where the data lives not only reduces network congestion, it also increases the overall throughput of the system. This is known as “data locality”. (A minimal mapper/reducer sketch follows this list.)
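To make the MapReduce model a little more concrete, here is a minimal sketch in the Hadoop Streaming style – a mapper and a reducer written as small Python scripts that count page hits per URL from the web logs above. This is only my own illustration under assumptions: native Hadoop jobs are typically written in Java, Streaming simply lets scripts read records on stdin and write key/value pairs on stdout, and the URI field position below is hypothetical.

```python
#!/usr/bin/env python
# mapper.py -- reads raw W3C log lines on stdin, emits "<uri>\t1" per request.
import sys

for line in sys.stdin:
    if line.startswith("#") or not line.strip():
        continue                       # skip the #Fields/#Date header lines
    fields = line.split()
    if len(fields) > 4:
        print(f"{fields[4]}\t1")       # cs-uri-stem column in the assumed layout
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming sorts mapper output by key before this runs,
# so identical URIs arrive grouped together; sum the 1s per URI.
import sys

current_uri, current_count = None, 0
for line in sys.stdin:
    uri, _, count = line.rstrip("\n").partition("\t")
    if uri != current_uri and current_uri is not None:
        print(f"{current_uri}\t{current_count}")
        current_count = 0
    current_uri = uri
    current_count += int(count or 0)
if current_uri is not None:
    print(f"{current_uri}\t{current_count}")
```

On a cluster you would typically submit something like this with the Hadoop Streaming jar (along the lines of hadoop jar hadoop-streaming-*.jar -input /weblogs -output /hits-by-uri -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py), with the exact jar path and options depending on your Hadoop version and distribution.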

 

In my next blog post, I'll explore some more aspects of Hadoop and talk about additional tools in the Big Data quiver that get you fully armed for your Big Data challenges.

In my last post, I covered certain elements of Big Data and how you identify with Big Data. Not everybody needs to deal with Big Data, but those who do quickly realize that the hammers and wrenches they have been using on traditional data are no longer the right tools.

 

Popular websites and portals easily get several million visitors and several billion page views per day. This “clickstream” data is very log-like, and while it has a pattern, it does not necessarily fit the definition of “structured”. Further, the rate at which this data streams in is very, very fast.

 

Facebook gets over 2 billion likes and shares a day – to many, this is “fun” data, and nobody really looks behind the scenes (nor do they need to) to see how Facebook manages it. Today, this type of data (social media) is actually being mined by organizations for things like “sentiment analysis”. This technique is very useful to businesses for making “course corrections” based on their interpretation of the “sentiment” toward, say, their products or product campaigns. Similarly, “Likes” can be utilized for targeted advertising and marketing if the page is owned and operated by a business. When you “like” something, news feeds and ads related to that product or service are constantly fed to you.

 

Consider the realm of security. Protecting cardholder information is critical and a top priority for financial institutions. Understanding purchase patterns and buying behaviors is key to detecting fraud early and accurately. Payment platforms have to deal with several sources: point-of-sale systems, websites and mobile devices. Although many institutions do fraud detection today, they rely on smaller subsets of the data – a technique known as “sampling” – to build the data set they will eventually run analytics on. The rest of the data is pruned off onto magnetic tapes (for regulatory requirements) and may never see the light of day, or the probing of BI tools.
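To make the “sampling” point concrete, here is a minimal sketch of my own (hypothetical input, Python standard library only) of the kind of reduction involved: keep a fixed-size, uniformly random sample of a transaction stream and let everything else fall away.

```python
# Reservoir sampling: keep k transactions chosen uniformly at random from a
# stream of unknown length, discarding the rest -- roughly the reduction step
# described above. The transaction source below is hypothetical.
import random

def reservoir_sample(transactions, k=10_000):
    """Return k records chosen uniformly at random from the stream."""
    sample = []
    for i, txn in enumerate(transactions):
        if i < k:
            sample.append(txn)            # fill the reservoir first
        else:
            j = random.randint(0, i)      # replace with decreasing probability
            if j < k:
                sample[j] = txn
    return sample

# Usage (hypothetical helper): feed it any iterable of transaction records.
# analytics_input = reservoir_sample(read_transactions("pos_feed.csv"), k=50_000)
```

The analytics then run on the sample; the signal that lived only in the discarded records is gone.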

 

Problems like these are easier to relate to when you think outside of business and IT. Let’s take a very simple (though slightly exaggerated) scenario. Say I am helping my school-goer with a data collection project to be done during spring break. My son wants to count and group, by color, the cars that come into our neighborhood street – say 7-8 in the morning and 5-6 in the evening, for 5 days. We have 4 people in our household. One way we could do this is to have one person cover each of the days, with one person covering both day 1 and day 5. Another approach could be to have one person cover the morning hour and another cover the evening hour.

Pretty straightforward, right? The tools we would use are no more than paper, a pencil and potentially a calculator (maybe mental math is more than enough). The process is also not too complicated – look out the window or sit outside by the door, and start marking off counts by color:

Green : | | | | | | | | | | |

Red: | | | | | | | | | | |  | |  |  |  |  |  |  | | | | | |

Black: | | | | | | | | | | | | | | | | | |

 

At the end of the 5 days, we sit together at the breakfast table and total up the counts for the different car colors.

 

Now let’s say you want to cover two streets (two neighborhoods) – you can’t just sit outside your door anymore or look out your window. You can either enlist a friend to help in the other neighborhood or sit in front of your friend’s house for a couple of hours every day. That’s not bad – you have to go out of your way to enlist an additional resource (a friend) or an additional system in the process (your friend’s house) – but it’s still doable with paper, pencil and maybe a calculator (I know you probably don’t need one, but it’s a handy tool lying around the house).

 

You innocuously post this little project on your Facebook page. Social media swings into action – ten subdivisions are now wildly interested in knowing the count of cars grouped by color in all of their neighborhoods, not just for 2 hours a day but for 8 hours every day. They want you to lead this effort. They will anxiously, excitedly wait for the results on the 6th day.

Ten subdivisions – let’s say each subdivision has 10 streets. That’s counting cars coming in and out of 100 streets for 8 hours a day.

This is a wildly exaggerated scenario – but consider this: even if it were three subdivisions with 30 streets and 2 hours a day, your process for one street – sitting outside your doorstep for an hour, counting cars on paper and finally tallying at the end of day 5 – is no longer feasible. It’s certainly not feasible for 100 streets and 8 hours each day.

Your process needs to change to handle this scenario – you will need new tools (probably spreadsheets, maybe a tablet) and new ways to efficiently divide up the work and do the final aggregation (sketched below). You certainly need more people involved and doing the work.
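Here is a tiny, purely illustrative sketch of that divide-and-aggregate step (hypothetical tallies, Python’s collections.Counter): each observer keeps their own tally for their street, and the final step simply merges them.

```python
# Each street/observer produces its own tally; the "breakfast table" step
# just merges the tallies. The numbers below are made up.
from collections import Counter

street_tallies = [
    Counter({"green": 11, "red": 24, "black": 18}),
    Counter({"green": 7, "red": 15, "black": 9, "white": 4}),
    Counter({"red": 30, "black": 12, "white": 6}),
]

grand_total = Counter()
for tally in street_tallies:
    grand_total += tally              # the final aggregation step

print(grand_total.most_common())      # e.g. [('red', 69), ('black', 39), ...]
```

The counting work scales out with the number of observers; only the small per-street totals have to come back to one place.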

So, if your business needs require you to ingest and process this type of data – where the volume, the velocity and the variety are far greater than any scale you have dealt with before – you need a different approach to tackling it. You need new tools and new ways of handling this data deluge. This is really what Big Data is about.

I am purposely peeling back the layers of the Big Data onion slowly. All too often, people think about the data deluge in terms of one element or one characteristic of Big Data (often volume, sometimes variety) and immediately run off to acquire tools for that element.

In the next blog post, I will start delving into some of the technologies and tools that are necessary when you start down the Big Data path.

 

Rajesh Nair

Senior Product Manager, RSA Archer

This is the first of a series of posts on the topic of data, data management and, yes, ultimately tying into Archer and GRC. But before we dive into Archer and GRC, I am going to talk about data management first, because fundamentally data is where it all begins. Right? And what better topic to start off with than something that is trending red hot on the data meter: "Big Data". Besides, we all have a handle on "small" data, right?

 

In the movie “BIG”, the character played by Tom Hanks literally grows big overnight. This overnight transformation posed immediate problems – he couldn't wear the same outfits anymore, he couldn't use his “boy” bed anymore, his normal mode of transportation wasn't “fitting” of him anymore, and so on. At the same time, he slowly began to see and use the advantages of being BIG.

 

We can certainly draw from this to shed light on some common concerns about Big Data.

 

1) You don’t wake up to “Oh my God, where did all this data come from?” Well, hopefully you don't. At least in general, most organizations don’t get a large shipment of data dumped in their backyard one day in one big visible heap. In fact, in almost all organizations, data has been flowing in over the years; it’s been ingested, cleansed, analyzed, filtered, processed, published and archived. Until a few years ago, most of this was data from sources that organizations knew they needed to draw information from. Also until a few years ago, you had a data “funnel” – lots of data being ingested, but eventually, after you analyzed it, you only processed and persisted a small percentage of it. That said, it should be noted that the variety and the rate at which data has been flowing in have picked up in the last few years.

 

2) Do you (or do I) have a Big Data problem? I have heard this posed over and over again. Data is your “opportunity”, not your “problem”. The real question that needs to be asked is what business problem you have, or what opportunity you can now create, by harnessing “Big Data”.

 

3) What do I do with my old (“small”) data? Absolutely continue to use it the way you have, because your business still runs on it. Big Data doesn't mean that you have to completely rethink and rehash everything you have, as we will soon see.

 

That’s fine and dandy, you say, but can I identify “Big Data” if I need to solve a business problem? AAAAAhhhh, now let’s talk! This is a very valid question, so let's discuss it a bit. Fundamentally, you need to first identify your need and then go into “discovery mode” to find the data that satisfies your requirements. So how do you discover “your” Big Data? We could start with a definition, but we will leave the technical definitions of Big Data aside for now, as many pundits have already defined it for the general use case. We will "characterize" Big Data shortly. As I mentioned, keep presumptions aside and focus on identifying the data you need:

  1. First, don’t search for Big Data. When you get there, it will be staring at you. OK, let’s not get sidetracked. Start by listing all the data sources that you believe will collectively give you the data you need to solve your business problem or create your business opportunity. The key here is identifying "the data you need", not identifying the systems in your organization, and this is a very important change in mindset. Why? Because traditionally, whether you like it or not, whether consciously or sub-consciously, many look at what data is available within the organization as opposed to what data is needed.
  2. I mentioned the change in mindset needed here. That being said, chances are you will find that a lot of the data is already “groomed” and “usable” from systems you have today – transactional systems, data warehouses, data marts and so on. You may need to utilize more data from these sources than you did before – that’s OK. The more your organization's information systems can be leveraged, the better.
  3. So far so good – you are happy; you haven't identified anything that really can't be handled by the organization's data sources. But then you start thinking about other pieces of data you really need to achieve your goal:

Maybe you have a popular web retail front end, and one of your objectives is to improve your understanding of your customers’ “click-through” on your website. This can give you all sorts of insights to improve, say, purchasing likelihood, “website stickiness” and so on. So you want to capture and analyze clickstreams starting from the search page on which a visitor found your link. You want to look at the hit date and time, download time, user agent, search keywords and so on. You are now thinking about where and how to ingest and store this data, pre-process it and build a predictive model before loading certain information into the warehouse. And you want to keep all that data so you can mine it over time.

OR

 

Maybe you are in the energy business and want to take the lead on smart meters by collecting meter data on an hourly basis. No wait – you want to leapfrog the competition by building a solution that can take in readings from 10 million meters at 10-minute intervals. That’s 60 million readings per hour, or 1.44 billion readings per day.

OR

 

Maybe you are the head of the enterprise IT security team, on a mission to minimize threats. You want to take enterprise IT security to the next level by analyzing traffic and data flows from all systems in the enterprise and detecting patterns that are indicative of a threat. That’s right – traffic from all systems in the enterprise – with, say, a daily report of findings across all systems.

 

If this type of data is raising your eyebrows, then, well, congratulations, you now have Big Data staring at you!

 

All of the above share at least two characteristics – a volume of data on a scale that you have not handled before, and a very high rate or frequency at which the data comes in. There are some other facets of Big Data that I will get into later.
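For a sense of the scale in the smart-meter example above, here is a quick back-of-the-envelope check (plain Python arithmetic, nothing more):

```python
# Back-of-the-envelope check of the smart-meter numbers quoted above.
meters = 10_000_000
readings_per_meter_per_hour = 60 // 10      # one reading every 10 minutes

per_hour = meters * readings_per_meter_per_hour
per_day = per_hour * 24

print(f"{per_hour:,} readings/hour")        # 60,000,000
print(f"{per_day:,} readings/day")          # 1,440,000,000
```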

 

In my next few posts, I will explore in more detail the characteristics of Big Data and delve into technologies that can help you leverage Big Data for your business.

Stay tuned.

 

Raj Nair

Senior Product Manager, RSA Archer
