Fixing the Class Not Found Exception when running a CXF JAX-RS service

If you are trying to deploy a CXF JAX-RS service inside an OSGI container, you can be haunted by the exception given below. But, surprisingly, the fix could be pretty simple 2 step process.

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: com.sun.ws.rs.ext.RuntimeDelegateImpl
	at javax.ws.rs.ext.RuntimeDelegate.findDelegate(RuntimeDelegate.java:122)
	at javax.ws.rs.ext.RuntimeDelegate.getInstance(RuntimeDelegate.java:91)
	at javax.ws.rs.core.MediaType.<clinit>(MediaType.java:44)
	... 56 more
Caused by: java.lang.ClassNotFoundException: com.sun.ws.rs.ext.RuntimeDelegateImpl
	at org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:513)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:429)
	at org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:417)
	at org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:107)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:169)
	at javax.ws.rs.ext.FactoryFinder.newInstance(FactoryFinder.java:62)
	at javax.ws.rs.ext.FactoryFinder.find(FactoryFinder.java:155)
	at javax.ws.rs.ext.RuntimeDelegate.findDelegate(RuntimeDelegate.java:105)
	... 58 more

First, make sure you follow the steps to register your servlet through the osgi HTTP service as pointed out in the minimal osgi sample.

Then, the reason most probably would be a non-OSGIfied jsr 311 implementation present. Replace this jar with an OSGI compatible implementation located at org.apache.servicemix.specs.jsr311-api-1.1.1-1.9.0.jar.

That should (hopefully) solve your problem.

Inserting Internet Explorer specific scripts from XSLT

If you are working with generating an html page through xslt, you might be interested in loading some scripts just when the page is browsed from Internet Explorer (IE).

You can use the following XSLT section to include a IE specific script. In this case it will be for IE versions less than 9.


<xsl:text disable-output-escaping="yes">&lt;!--[if lt IE 9]&gt;&lt;script language="javascript" type="text/javascript" src="js/excanvas.min.js"&gt;&lt;/script&gt;&lt;![endif]--&gt;</xsl:text>

Your output will look like this:


<!--[if lt IE 9]><script language="javascript" type="text/javascript" src="js/excanvas.min.js"></script><![endif]-->

Adding CData tags through XSLT

CData tags are usually swallowed up by an XSLT transformation. If they are important to your output document, you can use the following method to include CData tags.

<xsl:text disable-output-escaping="yes">&lt;![CDATA[</xsl:text>

<!-- text between CData tags here -->

<xsl:text disable-output-escaping="yes">]]&gt;</xsl:text>

The output will look like:

<![CDATA[

]]>

Handling Exceptions with Axis2 Web Services (Code-First Approach)

If you are writing a web service using Java and deploying it in Axis2, you need to follow the steps below to propagate the exception properly to the client side.

Let’s say you have an exception you want to throw with a respective error message. Your method would look something like;

 public String foo throws MyException {
    if (true) {
       throw new MyException("This is my exception");
    }
 }

If you codegen your axis2 client stub you will see that the exception will be thrown when using the client. But if you catch that exception and print out the message, you will notice that the message is lost.

try {
   String s = client.foo();
} catch (MyExceptionException e) { // the codegen exception thrown
   System.out.println("The error that occurred is : " +  e.getMessage());
}

This will output something like;

The error that occurred is : MyException

To preserve the exception message, we need to include it as an attribute just like we would do in a bean class. Here’s an example of the MyException class;

public class MyException extends Exception {
    private String message;

    public MyException(String s, Throwable throwable) {
        super(s, throwable);
        this.message = s;
    }

    public MyException(String s) {
        super(s);
        this.message = s;
    }

    @Override
    public String getMessage() {
        return this.message;
    }

}

This can of course be extended to send any number of arbitrary attributes with the exception, given that they are web service friendly.

Now, you need to codegen your client and modify the client code as follows:

try {
   String s = client.foo();
} catch (MyExceptionException e) {
   System.out.println("The error that occurred is : " +  e.getFaultMessage().getMyException().getMessage());
}

When you run the code, the output will be as follows:

The error that occurred is : This is my exception

 

Useful commands for debugging OSGI issues and constraints

These are a few commands that I find very useful when faced with OSGI issues.

1. ss / ss <string> (abbr. for short status)

Lists down the state of all installed bundles or if followed by a string, does a wildcard match of that string. Useful for initial investigation of checking whether a bundle is active, installed or not present at all.

osgi> ss bam

Framework is launched.


id State Bundle
143 RESOLVED org.wso2.bam.styles_2.0.0.SNAPSHOT
Master=290
150 ACTIVE org.wso2.carbon.bam.analyzer.stub_4.0.0.SNAPSHOT
151 ACTIVE org.wso2.carbon.bam.core.stub_4.0.0.SNAPSHOT
152 ACTIVE org.wso2.carbon.bam.gadgetgenwizard_4.0.0.SNAPSHOT
153 ACTIVE org.wso2.carbon.bam.gadgetgenwizard.stub_4.0.0.SNAPSHOT
154 ACTIVE org.wso2.carbon.bam.gadgetgenwizard.ui_4.0.0.SNAPSHOT
155 ACTIVE org.wso2.carbon.bam.presentation.stub_4.0.0.SNAPSHOT
156 ACTIVE org.wso2.carbon.bam2.core_4.0.0.SNAPSHOT
157 ACTIVE org.wso2.carbon.bam2.core.ui_4.0.0.SNAPSHOT
158 ACTIVE org.wso2.carbon.bam2.presentation_4.0.0.SNAPSHOT
159 ACTIVE org.wso2.carbon.bam2.receiver_4.0.0.SNAPSHOT
160 ACTIVE org.wso2.carbon.bam2.service_4.0.0.SNAPSHOT


2. b  <id> (abbr. for bundle)

Displays details for the specified bundles. Useful for checking what your bundle exports and imports when narrowing down situations like uses constraints. Also, helps with the OSGI services used.

osgi> b 152
org.wso2.carbon.bam.gadgetgenwizard_4.0.0.SNAPSHOT [152]
Id=152, Status=ACTIVE Data Root=/Users/mackie/source-checkouts/carbon/platform/trunk/products/bam2/modules/distribution/product/target/wso2bam-2.0.0-SNAPSHOT/repository/components/configuration/org.eclipse.osgi/bundles/152/data
No registered services.
Services in use:
{org.wso2.carbon.utils.ConfigurationContextService}={service.id=123}
{org.wso2.carbon.base.api.ServerConfigurationService}={service.id=83}
Exported packages
org.wso2.carbon.bam.gadgetgenwizard.internal; version="4.0.0.SNAPSHOT"[exported]
org.wso2.carbon.bam.gadgetgenwizard.service; version="4.0.0.SNAPSHOT"[exported]
Imported packages
org.apache.commons.logging; version="1.1.1"
org.wso2.carbon.utils; version="4.0.0.SNAPSHOT"
org.wso2.carbon.user.core; version="4.0.0.SNAPSHOT"
org.wso2.carbon.registry.core.session; version="1.0.1"
org.wso2.carbon.registry.core.exceptions; version="1.0.1"
org.wso2.carbon.registry.core; version="1.0.1"
org.wso2.carbon.registry.common.services; version="1.0.1"
org.wso2.carbon.base.api; version="1.0.0"
org.osgi.service.component; version="1.1.0"
org.apache.commons.io; version="2.0.0"
org.apache.axiom.om.impl.jaxp; version="1.2.11.wso2v1"
org.apache.axiom.om.impl.builder; version="1.2.11.wso2v1"
org.apache.axiom.om; version="1.2.11.wso2v1"
javax.xml.transform.stream; version="0.0.0"
javax.xml.transform; version="0.0.0"
javax.xml.stream; version="1.0.1"
javax.xml.namespace; version="0.0.0"
net.sf.saxon; version="9.0.0.x"
No fragment bundles
Named class space
org.wso2.carbon.bam.gadgetgenwizard; bundle-version="4.0.0.SNAPSHOT"[provided]
No required bundles

3. p <package> (abbr. for packages)

Shows the bundles that export and import the specified packages. Extremely useful in debugging most OSGI issues.

osgi> p org.wso2.carbon.utils
org.wso2.carbon.utils; version="4.0.0.SNAPSHOT"
axis2_1.6.1.wso2v5 [19] imports
org.wso2.carbon.analytics.hive_4.0.0.SNAPSHOT [145] imports
org.wso2.carbon.application.deployer_4.0.0.SNAPSHOT [147] imports
org.wso2.carbon.bam.gadgetgenwizard_4.0.0.SNAPSHOT [152] imports
org.wso2.carbon.bam2.core_4.0.0.SNAPSHOT [156] imports
org.wso2.carbon.bam2.receiver_4.0.0.SNAPSHOT [159] imports
org.wso2.carbon.cassandra.dataaccess_4.0.0.SNAPSHOT [163] imports
org.wso2.carbon.cassandra.mgt_4.0.0.SNAPSHOT [164] imports
org.wso2.carbon.cluster.mgt.core_4.0.0.SNAPSHOT [169] imports
org.wso2.carbon.coordination.core_4.0.0.SNAPSHOT [172] imports
org.wso2.carbon.core_4.0.0.SNAPSHOT [173] imports
org.wso2.carbon.core.bootup.validator_4.0.0.SNAPSHOT [174] imports
org.wso2.carbon.core.services_4.0.0.SNAPSHOT [177] imports
org.wso2.carbon.dashboard_4.0.0.SNAPSHOT [178] imports
org.wso2.carbon.dashboard.common_4.0.0.SNAPSHOT [179] imports
org.wso2.carbon.dashboard.dashboardpopulator_4.0.0.SNAPSHOT [180] imports
org.wso2.carbon.dashboard.ui_4.0.0.SNAPSHOT [182] imports
org.wso2.carbon.datasource_4.0.0.SNAPSHOT [183] imports
org.wso2.carbon.event.client_4.0.0.SNAPSHOT [187] imports
org.wso2.carbon.event.common_4.0.0.SNAPSHOT [189] imports
org.wso2.carbon.event.core_4.0.0.SNAPSHOT [190] imports
org.wso2.carbon.event.ws_4.0.0.SNAPSHOT [191] imports
org.wso2.carbon.jaggery.app.mgt_1.0.0.SNAPSHOT [222] imports
org.wso2.carbon.jaggery.app.mgt.ui_1.0.0.SNAPSHOT [224] imports
org.wso2.carbon.jaggery.deployer_1.0.0.SNAPSHOT [226] imports
org.wso2.carbon.logging.service_4.0.0.SNAPSHOT [232] imports
org.wso2.carbon.ndatasource.core_4.0.0.SNAPSHOT [236] imports
org.wso2.carbon.ntask.core_4.0.0.SNAPSHOT [239] imports
org.wso2.carbon.registry.common_4.0.0.SNAPSHOT [246] imports
org.wso2.carbon.registry.core_4.0.0.SNAPSHOT [248] imports
org.wso2.carbon.registry.resource.ui_4.0.0.SNAPSHOT [254] imports
org.wso2.carbon.registry.server_4.0.0.SNAPSHOT [258] imports
org.wso2.carbon.registry.servlet_4.0.0.SNAPSHOT [259] imports
org.wso2.carbon.reporting.template.core_4.0.0.SNAPSHOT [263] imports
org.wso2.carbon.security.mgt_4.0.0.SNAPSHOT [273] imports
org.wso2.carbon.security.mgt.ui_4.0.0.SNAPSHOT [275] imports
org.wso2.carbon.server.admin_4.0.0.SNAPSHOT [276] imports
org.wso2.carbon.server.admin.ui_4.0.0.SNAPSHOT [279] imports
org.wso2.carbon.service.mgt_4.0.0.SNAPSHOT [280] imports
org.wso2.carbon.transport.http_4.0.0.SNAPSHOT [285] imports
org.wso2.carbon.transport.https_4.0.0.SNAPSHOT [286] imports
org.wso2.carbon.transport.mgt_4.0.0.SNAPSHOT [287] imports
org.wso2.carbon.ui_4.0.0.SNAPSHOT [290] imports
org.wso2.carbon.user.core_4.0.0.SNAPSHOT [295] imports
org.wso2.carbon.user.mgt.ui_4.0.0.SNAPSHOT [299] imports
org.wso2.carbon.webapp.mgt_4.0.0.SNAPSHOT [301] imports
org.wso2.carbon.wsdl2form_4.0.0.SNAPSHOT [302] imports

4. diag <bid> (abbr. for diagnose)

Shows any unsatisfied constraints of the bundle.

osgi> diag 159
reference:file:plugins/org.wso2.carbon.bam2.receiver_4.0.0.SNAPSHOT.jar [159]
No unresolved constraints.

 

5. ls (abbr. for list services)

Lists down the state of all OSGI services. In this list the most important would be identifying the unsatisfied components as in component 20 below.

osgi> ls
All Components:
ID State Component Name Located in bundle
1 Registered org.eclipse.equinox.frameworkadmin.equinox org.eclipse.equinox.frameworkadmin.equinox(bid=108)
2 Active org.eclipse.equinox.p2.artifact.repository org.eclipse.equinox.p2.artifact.repository(bid=114)
3 Active org.eclipse.equinox.p2.core.eventbus org.eclipse.equinox.p2.core(bid=116)
4 Active org.eclipse.equinox.p2.di.agentProvider org.eclipse.equinox.p2.core(bid=116)
5 Registered org.eclipse.equinox.p2.director org.eclipse.equinox.p2.director(bid=117)
6 Active org.eclipse.equinox.p2.planner org.eclipse.equinox.p2.director(bid=117)
7 Active org.eclipse.equinox.p2.engine.registry org.eclipse.equinox.p2.engine(bid=120)
8 Active org.eclipse.equinox.p2.engine org.eclipse.equinox.p2.engine(bid=120)
9 Active org.eclipse.equinox.p2.garbagecollector org.eclipse.equinox.p2.garbagecollector(bid=122)
10 Active org.eclipse.equinox.p2.metadata.repository org.eclipse.equinox.p2.metadata.repository(bid=125)
11 Registered org.eclipse.equinox.p2.repository org.eclipse.equinox.p2.repository(bid=128)
12 Registered org.eclipse.equinox.p2.transport.ecf org.eclipse.equinox.p2.transport.ecf(bid=132)
13 Registered org.eclipse.equinox.p2.updatechecker org.eclipse.equinox.p2.updatechecker(bid=133)
14 Registered org.eclipse.equinox.simpleconfigurator.manipulator org.eclipse.equinox.simpleconfigurator.manipulator(bid=138)
15 Active bam.hive.component org.wso2.carbon.analytics.hive(bid=145)
16 Active application.deployer.dscomponent org.wso2.carbon.application.deployer(bid=147)
17 Active gadgetgenwizard.component org.wso2.carbon.bam.gadgetgenwizard(bid=152)
18 Active bam.utils.component org.wso2.carbon.bam2.core(bid=156)
19 Active bam.presentation.component org.wso2.carbon.bam2.presentation(bid=158)
20 Unsatisfied bam.receiver.component org.wso2.carbon.bam2.receiver(bid=159)
21 Active org.wso2.carbon.cassandra.dataaccess.component org.wso2.carbon.cassandra.dataaccess(bid=163)
22 Active org.wso2.carbon.cassandra.mgt.component org.wso2.carbon.cassandra.mgt(bid=164)

6. comp <component id> or ls -c <bundleid>

Lists component specific information regarding OSGI declarative services. Useful for debugging issues with declarative services.

osgi> ls -c 152
Components in bundle org.wso2.carbon.bam.gadgetgenwizard:
ID Component details
17 Component[
name = gadgetgenwizard.component
factory = null
autoenable = true
immediate = true
implementation = org.wso2.carbon.bam.gadgetgenwizard.internal.GadgetGenWizardServiceComponent
state = Unsatisfied
properties = {service.pid=gadgetgenwizard.component}
serviceFactory = false
serviceInterface = null
references = {
Reference[name = config.context.service, interface = org.wso2.carbon.utils.ConfigurationContextService, policy = dynamic, cardinality = 1..1, target = null, bind = setConfigurationContextService, unbind = unsetConfigurationContextService]
Reference[name = server.configuration, interface = org.wso2.carbon.base.api.ServerConfigurationService, policy = dynamic, cardinality = 1..1, target = null, bind = setServerConfiguration, unbind = unsetServerConfiguration]
}
located in bundle = org.wso2.carbon.bam.gadgetgenwizard_4.0.0.SNAPSHOT [152]
]
Dynamic information :
The component is satisfied
All component references are satisfied
Component configurations :
Configuration properties:
service.pid = gadgetgenwizard.component
component.name = gadgetgenwizard.component
component.id = 16
Instances:
org.eclipse.equinox.internal.ds.impl.ComponentInstanceImpl@3fabb84d
Bound References:
String[org.wso2.carbon.utils.ConfigurationContextService]
-> org.wso2.carbon.utils.ConfigurationContextService@22d0e7e3
String[org.wso2.carbon.base.api.ServerConfigurationService]
-> org.wso2.carbon.base.ServerConfiguration@4127f9f0

Solving the XML Parsing error in Firebug

XML Parsing Error: no element found Location: moz-nullprincipal:{e04285c3-86a1-f146-b683-55bb667191ea} Line Number 1, Column 1:

^

If you are getting the above error in firebug, you are most probably making a cross browser request that is not allowed and not having invalid XML in your response. Cross browser requests are not allowed, unless you use jsonp or script as the data type.

To fix this problem, make sure the url doesn’t start with a http/ https:

$.ajax({ url : "http://localhost/somewhere" })

This is not allowed, unless you follow the rules to make a cross browser request. You probably want to do something like:

$.ajax({ url : "somewhere" })

Do you trust Google Big Query with your Big Data?

Google has come up with a fantastic service to analyze large amounts of data. It’s called BigQuery and it allows you to run analysis on big data on the cloud. As expected, the tool has a superb, intuitive web UI. The data analysis language uses SQL like queries. (Hive, anyone ;) ). Have a look at the  Big Query Tutorial, it looks pretty neat. So, now all you need to do to run queries is to upload your data to Google using the form shown below. It allows you to upload a file or point to it using Google’s cloud storage.

Now, the interesting question here is that to analyze using BigQuery how much of that data are you willing to give Google? And how long will that take? The answer won’t be “Let me quickly upload a 500 GB file and run some queries”. That amount of data would definitely take some time to upload. So, effectively, this SaaS becomes pretty useless as more and more data volumes need to be uploaded for analysis.

Everyone trusts Google ( :) ), so this concern might be easily ignored. But a potential other problem I see is the “Privacy Policies” that are violated. Usually, when you want to analyze data, it can contain sensitive data such as user behavior patterns and so forth. How comfortable will your customers be if you hand that data over to Google? Even anonymizing this data might not save you from a potential legal breach.

I still believe setting up your own data analysis and monitoring platform is the best way to go. Thoughts? I’d love to hear them.

Using Google collections’ MapMaker to quickly build a cache

Building a cache is standard practice in most programs to reduce the overhead of re-creating expensive objects. Using a concurrent Hash Map to build a thread safe cache is a widely user practice.

Even though each method call of concurrent Hash Map is thread safe, it is cumbersome to re-do the necessary operations. Let me list these down:

  • Compound operations of checking and inserting data into a HashMap – This can result in duplicate objects being created. You may have to use extensive checks using putIfAbsent to avoid these sort of situations.
  • Cache expiry – Getting a scheduled thread to clear the cache
  • Look at whether you want strong or weak references, strong or weak keys
  • Limit as to how many concurrent updates are allowd

Let me introduce MapMaker, a very convenient Factory object present in the Google Collections library. It lets me do everything mentioned in the above list in a single object initialization.

ConcurrentMap<Key, HeavyObject> graphs = new MapMaker()
 .concurrencyLevel(10)
 .softKeys()
 .weakValues()
 .expiration(15, TimeUnit.MINUTES)
 .makeComputingMap( new Function<Key, HeavyObject>() { 
 public Graph apply(Key key) { 
 return createHeavyObject(key); 

});

Now, all you have to do to use the cache is do a call to the get method:

graphs.get(someKey);

This will create the object, making sure other threads are blocked until the Heavy Object is created properly and the same object is returned for the specific key to any other thread as well.

Note: Make sure you implement the equals() and hashCode() methods if your Key is a complex object.

Hope you find it useful in your Java code.

WSO2 BAM 2.0.0-Alpha 2 released!

My team at WSO2 was able to release a 2nd alpha of our upcoming BAM 2.0. Do give it a spin.

The release note is below:

The WSO2 team is pleased to announce the release of version 2.0.0 – ALPHA 2 of WSO2 Business Activity Monitor.

WSO2 Business Activity Monitor (WSO2 BAM) is a comprehensive framework designed to solve the problems in the wide area of business activity monitoring. WSO2 BAM comprises of many modules to give the best of performance, scalability and customizability. These allow to achieve requirements of business users, dev ops, CxOs without spending countless months on customizing the solution without sacrificing performance or the ability to scale.

WSO2 BAM is powered by WSO2 Carbon, the SOA middleware component platform.

Downloads

The binary distribution can be downloaded at http://dist.wso2.org/products/bam/2.0.0-alpha2/wso2bam-2.0.0-ALPHA2.zip.

The documentation pack is available at http://dist.wso2.org/products/bam/2.0.0-alpha2/wso2bam-2.0.0-ALPHA2-docs.zip.

Samples
  1. Service Data Agent – Sample to install Service data agent, publish statistics and intercepted message activity from Service Hosting WSO2 Servers such as WSO2 AS, DSS, BPS, CEP, BRS and any other WSO2 Carbon server with the service hosting feature
  2. Mediation Data Agent – Sample to install Mediation data agent, publish mediation statistics and intercepted message activity using Message Activity Mediators from the WSO2 ESB
  3. Data center wide cluster monitoring – Sample to simulate two data centers each having two clusters sending statistics events, perform summarizations and visualize them in a dashboard
  4. End – End Message Tracing – Sample to simulate messages fired from a set of servers to WSO2 BAM and set up message tracing analytics and visualizations of respective messages
  5. KPI Definition – Sample to simulate receiving events from a server (ex: WSO2 AS), perform summarizations and visualize product and consumer data in a retail store
  6. Fault Detection & Alerting – Sample to simulate receiving events from a server (ex: WSO2 ESB), detect faults and fire email alerts

Features

  • Data Agents
    1. Pre built data agents – Service Data Agent for the WSO2 AS, DSS, BPS, CEP, BRS and any other WSO2 Carbon server with the service hosting feature and Mediation Data Agent for the WSO2 ESB
    2. A re-usable Agent API to publish events to the BAM server from any application (samples included)
    3. Apache Thrift based Agents to publish data at extremely high throughput rates
    4. Option to use Binary or HTTP protocols
  • Event Storage
    1. Apache Cassandra based scalable data architecture for high throughput of writes and reads
    2. Carbon based security mechanism on top of Cassandra
  • Analytics
    1. An Analyzer Framework with the capability of writing and plugging in any custom analysis tasks
    2. Built in Analyzers for common operations such as get, put aggregate, alert, fault detection, etc.
    3. Scheduling capability of analysis tasks
  • Visualization
    1. Drag and drop gadget IDE to visualize analyzed data with zero code
    2. Capability to plug in additional UI elements and Data sources to Gadget IDE
    3. Google gadgets based dashboard

Reporting Issues

WSO2 encourages you to report issues, enhancements and feature requests for WSO2 BAM. Use the issue tracker for reporting any of these.

The problems in Hadoop – When does it fail to deliver?

Hadoop is a great piece of software. It is not original but that certainly does not take away its glory. It builds on parallel processing, a concept that’s been around for decades. Although conceptually unoriginal, Hadoop shows the power of being free and open (as in beer!) and most of all shows about what usability is all about. It succeeded where most other parallel processing frameworks failed. So, now you know that I’m not a hater. On the contrary, I think Hadoop is amazing. But, it does not justify some blatant failures on the part of Hadoop, may it be architectural, conceptual or even documentation wise. Hadoop’s popularity should not shield it from the need to re-enginer and re-work problems in the Hadoop implementation. The point below are based on months of exploring and hacking around Hadoop. Do dig in.

  1. Did I hear someone say “Data Locality”?
  2. Hadoop harps over and over again on data locality. In some workshops conducted by Hadoop milkers, they just went on and on about this. They say whenever possible, Hadoop will attempt to start a task on a block of data that is stored locally on that node via HDFS. This sounds like a super feature, doesn’t it? It saves so much of bandwidth without having to transfer TBs of data, right?

    Hellll, no. It does not. What this means is that first you have to figure out a way of getting data into HDFS, the Hadoop Distributed File System. This is non trivial, unless you live in the last decade and all your data exists as files. Assuming that you do, let’s transfer the TBs of data over to HDFS. Now, it will start doing it’s whole “data locality” thing.

    Ermm, OK. Am I hit by a wave of brilliance or isn’t it what’s is supposed to do anyway? Let’s get our facts straight. To use Hadoop, our problem should be able to execute in parallel. If the problem or a at least a sub-problem can’t be parallelized it won’t gain much out of Hadoop. This means the task algorithm is independent of any specific part of the data it processes. Further simplifying this would be saying, any task can process any section of the data. So, doesn’t that mean the “data locality” thing is the obvious thing to do? Why, would the Hadoop developers even write some code that would make a task process data in another node unless something goes horribly wrong. The feature would be if it was doing otherwise! If a task has finished operating on the node’s local data and then would transfer data from another node and process this data, that would be a worthy feature of the conundrum. At least that would be worthy of noise.

  3. Would you please put everything back into files
  4. Do you have nicely structured data in databases? Maybe, you became a bit fancy and used the latest and greatest NoSQL data store? Now let me write down what you are thinking. “OK, let’s get some Hadoop jobs to run on this, cause I want to find all this hidden gold mines in my data, that will get me a front page of Forbes.” I hear you. Let’s get some Hadoop jobs rolling. But wait! What the …..? Why are all the samples in text files. A plethora of examples using CSV files, tab delimited files, space delimited files, and all other kind of neat files. Why is everyone going back a few decades and using files again? Haven’t all these guys heard of DBs and all that fancy stuff. It seems that you were too early an adopter of Data Stores.

    Files are the heroes of the Hadoop world. If you want to use Hadoop quickly and easily, the best path for you right is to export your data neatly into files and run all those snazzy word count samples (Pun intended!). Because without files Hadoop can’t do all that cool “data locality” shit. Everything has to be in HDFS first.

    So, what would you do to analyze your data in the hypothetical FUHadoopDB? First of all, implement about 10+ classes necessary to split and transfer data into the Hadoop nodes and run your tasks. Hadoop needs to know how to get data from FUHadoopDB, so let’s assume this is acceptable. Now, if you don’t store it in HDFS, you won’t get the data locality shit. If this is the case, when the task runs, they themselves will have to pull data from the FUHadoopDB to process the data. But, if you want the snazzy data locality shit you need to pull data from FUHadoopDB and store it in HDFS. You will not incur the penalty of pulling data while the tasks are running, but you pay it at the preparation stage of the job, in the form of transferring the data into HDFS. Oh and did I mention the additional disk space you would need to store the same data in HDFS. I wanted to save that disk space, so I chose to make my tasks pull data while running the tasks. The choice is yours.

  5. Java is OS independent, isn’t it?
  6. Java has its flaws but for the most part it runs smoothly on most OSs. Even if there are some OS issues, it can be ironed out easily. The Hadoop folks have issued document mostly based on Linux environments. They say Windows is supported, but ignored those ignorant people by not providing adequate documentation. Windows didn’t even make it to the recommended production environments. It can be used as a development platform, but then you will have to deploy it on Linux.

    I’m certainly not a windows fan. But if I write a Java program, I’d bother to make it run on Windows. If not, why the hell are you using Java? Why the trouble of coming up with freaking bytecode? Oh, the sleepless nights of all those good people who came up with byte code and JVMs and what not have gone to waste.

  7. CS 201: Object Oriented Programming
  8. If you are trying to integrate Hadoop into your platform, think again. Let me take the liberty of typing your thoughts. “Let’s just extend a few interfaces and plugin my authentication mechanism. It should be easy enough. I mean these guys designed the world’s greatest software that will end world hunger.”. I hear you again. If you are planning to do this, don’t. It’s like OOP anti patterns 101 in there. So many places that would say “if (kerberos)” and execute some security specific function. One of my colleagues went through this pain, and finally decided to that it’s easier to write keberos based authentication for his software and then make it work with Hadoop. With great power comes great responsibility. Hadoop fails to fulfil this responsibility.

Even with these issues, Hadoop’s popularity seems to be catching significant attention, and its rightfully deserved. Its ability to commodotize big data analytics should be exalted. But it’s my opinion that it got way too popular way too fast. The Hadoop community needs to have another go at revamping this great piece of software.