Large scale data processing Hbase vs Cassandra [closed]

I have nearly settled on Cassandra after my research on large scale data storage solutions. But it is generally said that HBase is a better solution for large scale data processing and analysis.

Since both are key/value stores and both run (or, in Cassandra’s case, recently can run) a Hadoop layer, what makes HBase the better candidate when processing/analysis is required on large data?

I also found good details about both at http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/

but I’m still looking for concrete advantages of HBase.

That said, I am more convinced about Cassandra because of its simplicity in adding nodes, its seamless replication, and its lack of a single point of failure. It also offers secondary indexes, which is a good plus.


Answers

As a Cassandra developer, I’m better at answering the other side of the question:

  • Cassandra scales better. Cassandra is known to scale to over 400 nodes in a cluster; when Facebook deployed Messaging on top of HBase they had to shard it across 100-node HBase sub-clusters.
  • Cassandra supports hundreds, even thousands of ColumnFamilies. “HBase currently does not do well with anything above two or three column families.”
  • As a fully distributed system with no “special” nodes or processes, Cassandra is simpler to set up and operate, easier to troubleshoot, and more robust.
  • Cassandra’s support for multi-master replication means that not only do you get the obvious power of multiple datacenters — geographic redundancy, local latencies — but you can also split realtime and analytical workloads into separate groups, with realtime, bidirectional replication between them. If you don’t split those workloads apart they will contend spectacularly.
  • Because each Cassandra node manages its own local storage, Cassandra has a substantial performance advantage that is unlikely to be narrowed significantly. (E.g., it’s standard practice to put the Cassandra commitlog on a separate device so it can do its sequential writes unimpeded by random i/o from read requests.)
  • Cassandra lets you choose, on a per-operation basis, how strong you want the consistency requirement to be. This is sometimes misunderstood as “Cassandra does not give you strong consistency,” but that is incorrect.
  • Cassandra offers RandomPartitioner as well as the more Bigtable-like OrderedPartitioner. RandomPartitioner is much less prone to hot spots.
  • Cassandra offers on- or off-heap caching with performance comparable to memcached, but without the cache consistency problems or the complexity of requiring extra moving parts.
  • Non-Java clients are not second-class citizens.
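To make the tunable-consistency point above concrete: with replication factor N, a write acknowledged by W replicas and a read that consults R replicas are guaranteed to overlap on at least one replica whenever R + W > N, which is what gives you strong consistency per operation. A minimal sketch of that arithmetic in plain Python (no Cassandra client involved; the level names simply mirror Cassandra’s ONE/QUORUM/ALL):

```python
# Sketch of Cassandra-style tunable consistency: for replication factor n,
# a read of R replicas and a write of W replicas are strongly consistent
# whenever R + W > n, because the two replica sets must overlap.

def replicas_required(level, n):
    """Map a consistency level name to a replica count for RF = n."""
    return {"ONE": 1, "QUORUM": n // 2 + 1, "ALL": n}[level]

def is_strongly_consistent(read_level, write_level, n):
    r = replicas_required(read_level, n)
    w = replicas_required(write_level, n)
    return r + w > n

# With RF = 3: QUORUM reads + QUORUM writes overlap (2 + 2 > 3) ...
print(is_strongly_consistent("QUORUM", "QUORUM", 3))  # True
# ... but ONE + ONE does not (1 + 1 <= 3), so reads may be stale.
print(is_strongly_consistent("ONE", "ONE", 3))        # False
```

The practical upshot is that one application can mix cheap eventually-consistent operations with strongly consistent ones in the same cluster, per request.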

To my knowledge, the main advantage HBase has right now (HBase 0.90.4 and Cassandra 0.8.4) is that Cassandra does not yet support transparent data compression. (This has been added for Cassandra 1.0, due in early October, but today that is a real advantage for HBase.) HBase may also be better optimized for the kinds of range scans done by Hadoop batch processing.
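A toy illustration of that last point, and of the RandomPartitioner/OrderedPartitioner bullet above: an ordered key layout keeps a contiguous key range on few nodes (good for range scans, prone to hot spots), while a hashed layout scatters it (good load balance, no cheap range scans). This is plain Python, not either system’s actual partitioner; the node count, boundary keys, and hash choice are arbitrary:

```python
import hashlib

NODES = 4

def ordered_node(key, boundaries=("g", "n", "t")):
    """Ordered partitioner: contiguous key ranges map to contiguous nodes."""
    for i, b in enumerate(boundaries):
        if key < b:
            return i
    return len(boundaries)

def hashed_node(key):
    """Random(ish) partitioner: node chosen by a hash of the key."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NODES

keys = ["apple", "apricot", "avocado", "banana"]  # one contiguous key range
print({ordered_node(k) for k in keys})  # {0}: one node serves the whole scan
print({hashed_node(k) for k in keys})   # typically spread over several nodes
```

A Hadoop-style scan over `apple..banana` touches a single node in the ordered layout, but must fan out to essentially every node in the hashed layout; conversely, a burst of writes to adjacent keys hammers one node in the ordered layout.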

There are also some things that are not necessarily better, or worse, just different. HBase adheres more strictly to the Bigtable data model, where each column is versioned implicitly. Cassandra drops versioning, and adds SuperColumns instead.

Hope that helps!

The reason for using 100-node HBase clusters is not because HBase does not scale to larger sizes. It is because it is easier to do HBase/HDFS software upgrades in a rolling fashion without bringing down your entire service. Another reason is to prevent a single NameNode from being a SPOF for the entire service. Also, HBase is being used for various services (not just FB Messages), and it is prudent to have a cookie-cutter approach to setting up numerous HBase clusters based on a 100-node pod approach. The number 100 is ad hoc; we have not focused on whether 100 is optimal or not.

Which is best for you really depends on what you are going to use it for; each has its advantages, and without any more details it becomes more of a religious war. The post you referenced is also more than a year old, and both have gone through many changes since then. Please also keep in mind that I am not familiar with the more recent Cassandra developments.

Having said that, I’ll paraphrase HBase committer Andrew Purtell and add some of my own experiences:

  • HBase is in larger production environments (1000 nodes), although that is still in the ballpark of Cassandra’s ~400-node installs, so it’s really a marginal difference.

  • HBase and Cassandra both support replication between clusters/datacenters. I believe HBase exposes more of it to the user, so it appears more complicated, but you also get more flexibility.

  • If strong consistency is what your application needs, then HBase is likely a better fit. It is designed from the ground up to be consistent. For example, it allows for simpler implementation of atomic counters (I think Cassandra just got them) as well as check-and-put operations.

  • Write performance is great; from what I understand, that was one of the reasons Facebook went with HBase for their messenger.

  • I’m not sure of the current state of Cassandra’s ordered partitioner, but in the past it required manual rebalancing. HBase handles that for you if you want. The ordered partitioner is important for Hadoop style processing.

  • Cassandra and HBase are both complex; Cassandra just hides it better. HBase exposes it more by using HDFS for its storage, but if you look at the codebase, Cassandra is just as layered. If you compare the Dynamo and Bigtable papers, you can see that Cassandra’s theory of operation is actually more complex.

  • HBase has more unit tests FWIW.

  • All Cassandra RPC is Thrift; HBase has Thrift, REST, and a native Java API. The Thrift and REST gateways offer only a subset of the full client API, but if you want pure speed, the native Java client is there.

  • There are advantages to both peer-to-peer and master/slave designs. The master/slave setup generally makes it easier to debug and reduces quite a bit of complexity.

  • HBase is not tied to only traditional HDFS, you can change out your underlying storage depending on your needs. MapR looks quite interesting and I have heard good things although I have not used it myself.
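The check-and-put point above (a compare-and-set on a single cell) is worth a sketch, because it shows why a design with one authoritative copy of each row makes strong consistency easy: the check and the write must happen atomically against the same state. This is a toy in-memory model in plain Python, not the HBase client API:

```python
# Toy in-memory sketch of HBase-style checkAndPut (compare-and-set on a cell).
# The check and the put must be atomic against one authoritative copy of the
# row; that is straightforward when a single server owns the row, and much
# harder under multi-master replication.

class ToyTable:
    def __init__(self):
        self.rows = {}  # row key -> {column: value}

    def put(self, row, col, value):
        self.rows.setdefault(row, {})[col] = value

    def check_and_put(self, row, col, expected, value):
        """Write `value` only if the cell currently equals `expected`."""
        current = self.rows.get(row, {}).get(col)
        if current != expected:
            return False  # precondition failed; nothing written
        self.put(row, col, value)
        return True

t = ToyTable()
t.put("user1", "status", "pending")
print(t.check_and_put("user1", "status", "pending", "active"))  # True
print(t.check_and_put("user1", "status", "pending", "active"))  # False
```

Two concurrent callers racing on the same precondition means at most one succeeds, which is exactly the guarantee that makes counters and conditional updates simple to build on top.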