Cassandra Introduction -- data model
Introduction:
As a database handles more and more inserts and queries, we may reach the point where we need to scale out the architecture by adding new machines to handle the volume of data. With a traditional MySQL database, however, adding a new machine takes a lot of work (for example sharding, where we partition the data across different machines). And sometimes only key-value queries are needed, not JOIN operations. This naturally raises the question of whether there is an alternative that makes the database easier to scale. A search on the internet shows that many distributed key-value databases have been developed for exactly this situation. Among them, Cassandra is a Java-based distributed key-value database created by Facebook. Unlike MySQL, it does not offer the JOIN operation; instead, it is good at handling distributed data. You can view the whole cluster as one big hash table in which fault tolerance and data partitioning are handled for you. It provides "incremental scalability" (which means you can increase throughput by adding new nodes). Cassandra also supports a "Column" feature, which makes it more convenient than a pure key-value database.
Basic key-value database:
Table['key1'] = value1


Data Model:

So the query will look like this:
Key Space:
In Cassandra, you can define many Key Spaces. You can think of a Key Space as roughly the equivalent of a database in MySQL. It contains a list of {Row, [ColumnFamily]}. Normally there is one Key Space per application.
Row:
For each row key, you can have data in the related Column Families. The data in each Column Family is sorted by row key. A row key does not need to have data in every Column Family.
Column Family:
A Column Family contains a list of Columns or a list of Super Columns. You must define it in the configuration before Cassandra starts. Each Column Family is stored in a separate file. The number of columns in each Column Family is unlimited.
Column:
A Column is the smallest element of data. It contains only a name, a value, and a timestamp. You can add or delete columns at any time.
Super Column:
A Super Column is a container that holds Columns.
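To make the nesting described above concrete, here is a minimal in-memory sketch using plain Python dictionaries. It is only an illustration, not Cassandra's API: the names `keyspace`, `insert`, and `get`, the sample "Users" data, and the rule that a write with a newer timestamp replaces an older one are all assumptions made for this example.

```python
import time

# Hypothetical layout for illustration only:
# keyspace -> column family -> row key -> column name -> (value, timestamp)
keyspace = {}

def insert(column_family, row_key, column_name, value, timestamp=None):
    """Store one column as a (value, timestamp) pair."""
    if timestamp is None:
        timestamp = time.time()
    row = keyspace.setdefault(column_family, {}).setdefault(row_key, {})
    current = row.get(column_name)
    # Assumption for this sketch: a write with a newer timestamp wins.
    if current is None or timestamp >= current[1]:
        row[column_name] = (value, timestamp)

def get(column_family, row_key, column_name):
    """Return the column value, or None if the row has no such column."""
    column = keyspace.get(column_family, {}).get(row_key, {}).get(column_name)
    return None if column is None else column[0]

insert("Users", "alice", "email", "alice@example.com", timestamp=1)
insert("Users", "alice", "city", "Taipei", timestamp=1)
insert("Users", "bob", "email", "bob@example.com", timestamp=1)  # no "city" column for bob
insert("Users", "alice", "email", "alice@new.example.com", timestamp=2)
```

Rows in the same Column Family do not need the same columns: here "bob" simply has no "city" entry, so `get("Users", "bob", "city")` returns `None`.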
Architecture:
Cassandra uses consistent hashing to distribute and partition keys. Each node in a Cassandra cluster takes a token (0<token<2^32) on the ring. The size of the ring is 2^32. When a key arrives, Cassandra computes the md5 hash of the key and finds the smallest token that is larger than that hash, wrapping around to the smallest token on the ring if no token is larger. The key is then mapped to the node that owns that token, so the data is stored on the corresponding node.
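The lookup described above can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's actual partitioner: the function names, the three example tokens, and the choice to reduce the 128-bit md5 digest modulo the ring size are assumptions made so the numbers fit the 2^32 ring used in this article.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # ring size as described in the text

def key_position(key):
    """Map a key to a ring position (assumption: md5 digest modulo the ring size)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE

def node_for_position(tokens, position):
    """tokens maps token -> node name. Return the node that owns a ring position."""
    sorted_tokens = sorted(tokens)
    # Index of the smallest token strictly larger than the position.
    index = bisect.bisect_right(sorted_tokens, position)
    # Past the last token, wrap around to the first token on the ring.
    return tokens[sorted_tokens[index % len(sorted_tokens)]]

def node_for_key(tokens, key):
    """Hash the key, then look up the owning node."""
    return node_for_position(tokens, key_position(key))

# Hypothetical three-node ring.
tokens = {2 ** 30: "node1", 2 ** 31: "node2", 3 * 2 ** 30: "node3"}
print(node_for_key(tokens, "key1"))
```

Sorting the tokens and using a binary search keeps the lookup cheap even with many nodes; the modulo on the index is what closes the line of tokens into a ring.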

Replicate method:
If you want to store two replicas of the data in the Cassandra cluster, it will store the data on the next two nodes on the ring.
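A rough sketch of this replica placement follows. It assumes that "the next two nodes" means the first two nodes met when walking clockwise from the key's position on the ring; the function name `replica_nodes` and the small example tokens are hypothetical, and real replica placement strategies may differ.

```python
import bisect

def replica_nodes(tokens, position, replicas=2):
    """tokens maps token -> node name. Return the nodes holding the replicas.

    Assumption for this sketch: replicas go to the first `replicas` nodes found
    when walking clockwise from the key's position.
    """
    sorted_tokens = sorted(tokens)
    start = bisect.bisect_right(sorted_tokens, position)
    count = min(replicas, len(sorted_tokens))  # cannot use more nodes than exist
    return [
        tokens[sorted_tokens[(start + step) % len(sorted_tokens)]]
        for step in range(count)
    ]

# Hypothetical ring with small tokens to keep the example readable.
tokens = {100: "node1", 200: "node2", 300: "node3"}
print(replica_nodes(tokens, 150))  # a key hashed to position 150
```

Because the walk wraps around the ring, a key that lands after the last token still gets two replicas, starting again from the first node.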

Adding a new node:
With consistent hashing, adding a new node only affects its neighboring nodes, so we do not need to rehash all the data. Some data stored on node 1 will now be stored on the new node 4. The new node chooses a token at random, which determines its location on the ring; keys are still located by their md5 hash.
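The effect of adding a node can be checked with a small experiment. The sketch below is illustrative only: the ring uses small hypothetical tokens, the key positions are sampled directly instead of being hashed, and the new node's token is fixed at 50 (rather than chosen at random) so the result is reproducible.

```python
import bisect

def owner(tokens, position):
    """tokens maps token -> node name. Return the node that owns a ring position."""
    sorted_tokens = sorted(tokens)
    index = bisect.bisect_right(sorted_tokens, position)
    return tokens[sorted_tokens[index % len(sorted_tokens)]]

old_ring = {100: "node1", 200: "node2", 300: "node3"}
new_ring = dict(old_ring)
new_ring[50] = "node4"  # hypothetical token for the new node

# Sample positions around the ring and record which ones change owner.
moved = {}
for position in range(0, 400, 10):
    before, after = owner(old_ring, position), owner(new_ring, position)
    if before != after:
        moved[position] = (before, after)

for position, (before, after) in sorted(moved.items()):
    print(position, before, "->", after)
```

In this setup every position that changes owner moves from node1 to node4; the positions owned by node2 and node3 are untouched, which is why no global rehash is needed.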

