1. Introduction

Modern systems like Twitter and Facebook must efficiently handle terabytes or even petabytes of data. Traditional relational database management systems (RDBMS) struggle with such use cases because of their scaling limitations. This is where NoSQL databases come in: their distributed architecture lets them scale while staying cost-effective, and they can also handle unstructured data.

2. Cassandra Cluster

Another advantage is that many NoSQL databases do not enforce a rigid, predefined schema. There are currently many commercial and non-commercial NoSQL solutions. Compared to a classic RDBMS, the main difference is that they scale out instead of scaling up: when load increases, data is distributed across more nodes, whereas with an RDBMS we typically upgrade the hardware of the existing nodes.

3. Why is it useful to run the Apache Cassandra cluster in Docker?

There are several benefits to running Cassandra in Docker:

  • Portability: Docker allows you to run Cassandra on any machine that has Docker installed, regardless of the underlying operating system or hardware.

  • Isolation: Running Cassandra in Docker containers allows you to isolate the Cassandra process from the host system. This prevents conflicts with other applications that are running on the same machine.

  • Scalability: Docker makes it easy to scale a Cassandra cluster by starting and stopping Docker containers as needed. You can use Docker Compose to automate the process of scaling the cluster.

  • Ease of deployment: With Docker, you can create a Cassandra cluster by simply starting the necessary number of Docker containers. This can be much faster and easier than installing Cassandra on each machine manually.

In this article, we will create a Cassandra cluster composed of two nodes using simple Docker commands. Then, we'll create a simple keyspace (a concept similar to a database in RDBMS) and table.

4. Creating Cassandra Cluster Using Simple Docker Commands

step 1. Let's first check that we have Docker installed:

docker --version

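We are creating a cluster of two Cassandra nodes, cassandra1 and cassandra2. We will interact with the Cassandra cluster using nodetool, Cassandra's command-line tool for managing and monitoring a cluster. You can learn more about nodetool in the Apache Cassandra documentation.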

step 2. Let's create the first Cassandra node using the latest image from Docker Hub:

docker run --name cassandra1 -d cassandra:latest

The --name flag sets the container name (in this case cassandra1), which we also use as the node name. The -d flag runs the container in detached mode, so it does not block our current terminal session. We can check whether our container is running by typing the following command in the terminal:

docker ps -a

We can check the status of our node using the nodetool utility:

docker exec -it cassandra1 nodetool status



We can see that the Cassandra node is running and that its status is UN, meaning Up and Normal. The first letter is the status (Up or Down) and the second letter is the state (Normal, Leaving, Joining, or Moving).
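To make the two-letter code easier to read, here is a minimal sketch of a shell helper that spells it out. The function name decode_node_code is our own (hypothetical) and is not part of nodetool; it only encodes the status and state letters listed above.

```shell
# Decode a two-letter nodetool status code (e.g. "UN") into words.
# decode_node_code is a hypothetical helper, not a nodetool command.
decode_node_code() {
  code="$1"

  # First letter: status
  case "${code%?}" in
    U) status="Up" ;;
    D) status="Down" ;;
    *) status="Unknown" ;;
  esac

  # Second letter: state
  case "${code#?}" in
    N) state="Normal" ;;
    L) state="Leaving" ;;
    J) state="Joining" ;;
    M) state="Moving" ;;
    *) state="Unknown" ;;
  esac

  printf '%s/%s\n' "$status" "$state"
}

# Example: decode_node_code UN
```

For instance, a node that is still bootstrapping would typically show up as UJ, which the helper prints as Up/Joining.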

step 3. Let's now create the second Cassandra cluster node from the same Docker Hub image (since we already pulled it for the first node, this command will not take long) and link it to the existing node:

docker run --name cassandra2 -d --link cassandra1:cassandra cassandra:latest

The --link flag links the new cassandra2 container to cassandra1 so that the two nodes behave as a cluster.

Let's check the status of our cluster by typing the following command:

docker exec -it cassandra1 nodetool status

If we run the same command on the cassandra2 node, we'll get the same output. We can see here that both nodes are in Up status and Normal state. If you don't see this result in the first 5-10 seconds after starting the second container, don't worry: it can take a couple of minutes for the nodes to synchronize.
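Instead of re-running the status command by hand, the waiting can be scripted. The following is only a sketch under a few assumptions: the helper names count_up_normal and wait_for_cluster are hypothetical, the 5-second polling interval and 300-second default timeout are arbitrary, and the parsing assumes that each node line in the nodetool status output starts with its two-letter code.

```shell
# Count nodes reported as UN (Up/Normal) in `nodetool status` output read from stdin.
count_up_normal() {
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
  grep -c '^UN[[:space:]]' || true
}

# Poll until the expected number of nodes is Up/Normal, or give up after a timeout.
# Usage: wait_for_cluster <container> <expected_nodes> [timeout_seconds]
wait_for_cluster() {
  container="$1"
  expected="$2"
  timeout="${3:-300}"   # assumed default, adjust as needed
  elapsed=0

  while [ "$elapsed" -lt "$timeout" ]; do
    up=$(docker exec "$container" nodetool status 2>/dev/null | count_up_normal)
    if [ "${up:-0}" -ge "$expected" ]; then
      echo "cluster ready: $up node(s) Up/Normal"
      return 0
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done

  echo "timed out after ${timeout}s waiting for $expected node(s)" >&2
  return 1
}

# Example: wait_for_cluster cassandra1 2
```

Note that docker exec is called without -it here, since the script does not need an interactive terminal.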

5. Creating Keyspace in Cassandra Cluster

Now that we have a running Cassandra cluster, we can run commands on it to create keyspaces and tables and add some data. For this, we'll use cqlsh, a command-line interface for executing Cassandra Query Language (CQL) commands on a Cassandra cluster. You can learn more about it in the Apache Cassandra documentation.

We'll run cqlsh on the cassandra1 node using the following command:

docker exec -it cassandra1 bash -c 'cqlsh'

The basic concepts to understand when working with the Cassandra Query Language (CQL) are:

  • keyspace - similar to a database in a relational database management system (RDBMS)
  • column family, also called a table - a keyspace consists of several column families/tables, each similar to an SQL table
  • primary key - consists of a partition key and, optionally, a clustering key. The partition key determines the node on which data is stored, and the clustering key determines the order of rows within a partition

To create our first keyspace, we'll type the following command:

CREATE KEYSPACE testkeyspace WITH replication = {'class':'SimpleStrategy' , 'replication_factor' : 2};

The replication strategy we're using is SimpleStrategy, which is typically used when all the nodes are in a single datacenter. With this strategy, we specify the number of replicas, and the data is replicated on that many nodes in the cluster. Note that the replication class is not optional: it must always be given explicitly when creating a keyspace.

We set the replication factor to 2, so each row is stored on both nodes of our two-node cluster. You can learn more about the replication factor in the Apache Cassandra documentation.
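If you prefer to create the keyspace from a script rather than from the interactive cqlsh prompt, the statement can be generated and passed to cqlsh with its -e option. This is a hedged sketch: keyspace_cql is a hypothetical helper of our own, and we add IF NOT EXISTS (which the statement above does not use) so that the script can be re-run safely.

```shell
# Print a CREATE KEYSPACE statement that uses SimpleStrategy.
# Usage: keyspace_cql <keyspace_name> <replication_factor>
# keyspace_cql is a hypothetical helper, not part of Cassandra.
keyspace_cql() {
  name="$1"
  rf="$2"

  # Reject anything that is not a whole number of at least 1
  if ! [ "$rf" -ge 1 ] 2>/dev/null; then
    echo "replication factor must be a whole number of at least 1" >&2
    return 1
  fi

  printf "CREATE KEYSPACE IF NOT EXISTS %s WITH replication = {'class': 'SimpleStrategy', 'replication_factor': %s};\n" "$name" "$rf"
}

# Example (requires the running cassandra1 container):
# docker exec cassandra1 cqlsh -e "$(keyspace_cql testkeyspace 2)"
```

Keeping the replication factor as a parameter makes it easy to match it to the number of nodes you actually started.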

Let's create the first table in the testkeyspace keyspace we created previously:

use testkeyspace;
create table student (student_id int primary key, name text, postal_code int);

Now let's add some data to our table. Note that postal_code is an int column, so its value is written without quotes:

insert into student (student_id, name, postal_code) values (1, 'Mark', 14000);

And that's it: we now have a cluster with a keyspace, a table, and some data in it.
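To confirm that the row is really there, we can read it back from either node without opening an interactive session. The sketch below assumes the container, keyspace, and table names used in this article. The run_cql helper and the DOCKER override are our own (hypothetical) additions; the override only exists so that the command can be previewed without running it.

```shell
# DOCKER can be overridden (e.g. DOCKER=echo) to print the command instead of running it.
DOCKER="${DOCKER:-docker}"

# Run a single CQL statement inside a Cassandra container via cqlsh -e.
# Usage: run_cql <container> <statement>
# run_cql is a hypothetical helper, not part of Docker or Cassandra.
run_cql() {
  container="$1"
  shift
  "$DOCKER" exec "$container" cqlsh -e "$*"
}

# Examples (require the running containers):
# run_cql cassandra1 "SELECT * FROM testkeyspace.student;"
# run_cql cassandra2 "SELECT * FROM testkeyspace.student;"
```

Because the keyspace uses a replication factor of 2, both nodes hold a copy of the row, so the query should return it no matter which node we connect to.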

6. Conclusion

In this tutorial, we explained the basic Cassandra concepts and discussed the benefits of using Docker to create a Cassandra cluster. We also showed how to use Docker to create a Cassandra cluster and how to run commands that create a keyspace and a table on the dockerized cluster. If you want to learn more about Cassandra replication strategies, or about nodetool and its common uses, see the Apache Cassandra documentation.