All Places > Products > RSA Archer Suite > Blog > 2013 > January > 26

You want to know what the real answer to this whole Big Data challenge is? It's in us!

Fantastic, isn't it? Well, until that becomes "commercially viable", let's talk about what we can do today.


The right tool for the right job – that’s no doubt a cliché to many. But it’s surprising how often the tools at hand are used for any kind of job. In my last post, I talked about why dealing with Big Data is not just about data, but also about a new set of tools.


Let’s dissect a use case to understand the heart of the problem. In earlier posts, I talked about clickstream data. Clickstream data is generated by user actions on web pages – it can include everything from the components downloaded when a user clicked on something, to the IP address, the time of the interaction, the session ID, the length of the session, the number of downloads triggered, bytes transferred, the referral URL, and so on. In “tech” speak, it’s the electronic record of the actions a user triggers on a web page, and all of it is recorded in your web server logs. On business web sites, these logs can easily grow by several gigabytes a day. Also, as I mentioned in my previous post, analysis of this data can lead to some very beneficial insights and potentially more business. To get some perspective, if you are an Archer Administrator, check out the size of the largest log on the IIS server that hosts the Archer web application. I am guessing the largest file is easily a few hundred megabytes, if not close to a gigabyte. OR WAIT – Archer History Log, anyone?
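To make the raw material concrete: IIS typically writes logs in the W3C extended format, one space-delimited record per request. Here is a minimal Python sketch that parses one such line. The field list below is a common IIS default, but it is an assumption – a real log declares its own field order in a `#Fields:` header – and the sample line is hypothetical.

```python
# Minimal sketch: parse one line of a W3C-extended-format IIS log entry.
# FIELDS is a common IIS default layout; a real log declares its own
# order in a "#Fields:" header line, so treat this list as an assumption.
FIELDS = ["date", "time", "s-ip", "cs-method", "cs-uri-stem",
          "cs-uri-query", "s-port", "cs-username", "c-ip",
          "cs(User-Agent)", "sc-status", "sc-substatus",
          "sc-win32-status", "time-taken"]

def parse_line(line):
    """Split a space-delimited IIS log line into a field -> value dict."""
    return dict(zip(FIELDS, line.split()))

# Hypothetical example line (not taken from a real log).
sample = ("2013-01-26 10:15:32 192.168.1.10 GET /RSAarcher/Default.aspx "
          "- 80 - 10.0.0.5 Mozilla/5.0 200 0 0 312")
record = parse_line(sample)
print(record["cs-uri-stem"], record["sc-status"])
```

Each line is tiny, but at hundreds of requests per second the lines pile up into the multi-gigabyte files discussed below.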

Here’s a snippet from the web server log on my local Archer instance (on my laptop), which incidentally grew to about 17 MB in a single day (used primarily by me a couple of times a day):


Now there are those who would argue that log data is not the best example of “Big Data” – part of the reasoning being that it does have somewhat of a structure. Besides, weren’t businesses doing clickstream analysis before it was ever characterized as Big Data?

Yes, they were – but there’s a little thing they do that is very inconspicuously described as “pre-processing”. Pre-processing is a diversion that hides the fundamental challenges of dealing with all of the data in the logs. The logs themselves – the raw data – are second-class citizens, or worse, “homeless”. The web servers don’t want to hang on to them, since the size of the logs can impact the web server host itself in terms of performance and storage. The systems that are going to consume this data don’t want it in raw format, and don’t want all of it.


Typically, “pre-processing” involves some very expensive investments to clean the data, validate certain elements, aggregate it, and conform it – quite often – to a relational database or data warehouse schema. Not only that, but both the content and the time window are crunched to accommodate what the existing infrastructure can handle. At the tail end of this transformation, the data is loaded into the warehouse or relational database system. Not only does this data now represent a fraction of the raw data from the logs, but it could also take several days between the raw data coming in and the final output landing in the target data source. In other words, by the time somebody looks at a report on usage stats and patterns for the day the actions were recorded, weeks could have gone by. And the raw data that fed all of this is usually thrown away. There’s another big problem in this whole scenario: you are clogging your network with terabyte-scale data movements. Let’s face it – where and how do you cost-effectively store data that is coming in at a rate of several hundred gigabytes a day? And once you do, how do you cost-effectively and efficiently process several terabytes or petabytes of it later? This is a Big Data problem. We need a different paradigm to break this barrier.

What if, instead of trying to pump that 200 GB daily weblog into a SAN, you could break it apart and store it on a commodity hardware cluster made up of a couple of machines with local storage?

And what if you could push the work you want to do out to those machines, alongside the units of data that constitute the file – in parallel – instead of moving the data around?


Say hello to Hadoop – a software framework that has come to play a pivotal role in solving Big Data problems. Hadoop allows for the distributed processing of large data sets across clusters of computers using simple programming models. At its core, Hadoop consists of two components:

1) HDFS, or the Hadoop Distributed File System: a file system that provides high-throughput access to data and is designed to run on commodity hardware

2) Map-Reduce: a programming model for processing large data sets in parallel
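To make the model concrete, here is a minimal in-process simulation of Map-Reduce over the weblog use case: counting requests per HTTP status code. In a real Hadoop deployment the mapper and reducer would be separate scripts (e.g. run via Hadoop Streaming) reading standard input on different nodes; everything here – the sample lines, the field position of the status code – is illustrative, not Hadoop's actual API.

```python
# Sketch of the Map-Reduce model: count requests per HTTP status code.
# The framework normally handles the shuffle/sort between map and reduce;
# here it is simulated in-process with sorted() and groupby().
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (status, 1) per line; the column index is an assumption
    # about the log layout (sc-status in a typical IIS field order).
    parts = line.split()
    yield (parts[10], 1)

def reducer(status, counts):
    # Sum all the 1s emitted for one status code.
    return (status, sum(counts))

# Three hypothetical log lines; only their shape matters.
lines = [
    "2013-01-26 10:15:32 192.168.1.10 GET /a.aspx - 80 - 10.0.0.5 UA 200 0 0 10",
    "2013-01-26 10:15:33 192.168.1.10 GET /b.aspx - 80 - 10.0.0.5 UA 404 0 0 12",
    "2013-01-26 10:15:34 192.168.1.10 GET /a.aspx - 80 - 10.0.0.5 UA 200 0 0 11",
]

pairs = sorted(kv for line in lines for kv in mapper(line))  # "shuffle/sort"
result = dict(reducer(k, (v for _, v in group))
              for k, group in groupby(pairs, key=itemgetter(0)))
print(result)
```

The appeal of the model is that nothing in `mapper` or `reducer` knows (or cares) whether it is running over three lines on a laptop or three billion lines spread across a cluster.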

So what makes Hadoop the right tool for this problem?

  • Storing very large volumes of data across multiple commodity machines: With HDFS, large sets of large files can be distributed across a cluster of machines.
  • Fault tolerant: In computations involving a large number of nodes, failure of nodes is expected. This notion is built into Hadoop. Data from all files is duplicated across multiple nodes.
  • Move the computation, not the data: One of Hadoop’s core assumptions is that “moving the computation is cheaper than moving the data”. Moving the processing to where the data lives not only reduces network congestion, but also increases the overall throughput of the system. This is known as “data locality”.


In my next blog post, I'll explore some more aspects of Hadoop and talk about additional tools in the Big Data quiver that get you fully armed for your Big Data challenges.

In my last post I discussed how critically important risk taxonomy is for the success of an ERM program – the need for the organization to agree on risk-related terminology, formalizing it as part of the organization’s risk management practices, obtaining formal sign-off from executive management and the board of directors, communicating it to stakeholders, and operationalizing it within the organization’s governance tools.


Another critical aspect of ERM program enablement is the attitude and commitment of the organization’s senior leadership.  This “Tone at the Top” significantly influences the effectiveness of an organization’s ERM program in the following ways:


  • The scope of the ERM program.  To truly be an ERM program, the scope must be holistic and include all operating units, geographies, risk types, products, processes, etc.
  • The degree to which managers feel responsible for risk management.  Optimally, risk management should be the responsibility of each and every manager, regardless of their position within the organization, and as risk managers, each manager should be accountable for understanding their key risks and maintaining the appropriate internal controls within their domain of responsibility.
  • The consistency of risk decisions.  Consistent risk decisions are encouraged by establishing and enforcing aligned risk appetite, tolerance, and delegated management risk-taking authorities, and escalating decisions to successively higher authority as thresholds are exceeded.  The degree to which the “official” risk management rules are set aside to fast-track an initiative, accommodate a pet project, or avoid confronting an exceptional, difficult, or politically connected manager will undermine the effectiveness of the ERM program.
  • Risk management agility.  The probability of the organization meeting its objectives, whatever they may be, depends on how quickly management becomes aware of, and responds to, changes in its risk profile.  Fostering the necessary information transparency throughout an organization, and the accountability to respond when appropriate, requires the commitment of senior leadership.
  • The amount of resources committed to manage risk.  Capital investment and human resource commitments should be aligned consistent with the degree of effectiveness necessary to manage risk within the appetite and tolerance of the organization.
  • Aligning compensation to desired behavior.  Incentive compensation that influences risk taking outside desired boundaries or potentially compromises the role of persons in control positions is inconsistent with sound risk management practice.


Individuals charged with responsibility for the effectiveness of the organization’s ERM should seek to secure a tone at the top, both in word and action, that leaves no doubt about the organization’s commitment to risk management best practices.  Fortunately, executive management and boards of directors have plenty of obligations and incentives to establish a strong tone at the top, including regulatory obligations, the threat of shareholder suits, and empirical evidence of the superior performance of organizations practicing ethical business and holistic risk management. Focusing senior management’s attention on these obligations and incentives may go a long way toward securing the necessary commitment, if such commitment is not everything it should be today.
