Gain visibility into the health, performance, and resource usage of your Cassandra datastore.
Apache Cassandra is a NoSQL distributed database designed to handle large amounts of data across multiple servers. It follows a peer-to-peer architecture in which every node in the cluster provides the same functionality as every other node.
Before jumping into troubleshooting, let's discuss how to check the overall health of the cluster. You can use nodetool status to check the status of the cluster, which is useful when one or more nodes are down or unreachable.
nodetool status
nodetool status reports each node's status, state, address, load, token count, owns percentage, host ID, and rack.
Datacenter: Datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.40.49.3 64.3 GB 256 48.9% c469cbec-e0e0-46a0-b31e-f1c49860b03f RAC2
UN 10.40.40.55 55.76 GB 256 50.8% 91cab37f-d427-454e-9656-4a25b93d9fe2 RAC1
UN 10.40.38.120 58.01 GB 256 49.7% 47219cd8-f55a-41f8-bb17-b21dbb4092fa RAC3
UN 10.40.40.56 58.88 GB 256 50.4% 2bc08dc1-f132-469f-a4bc-4a6156237061 RAC3
UN 10.40.8.75 58.54 GB 256 51.1% 27fcfee0-2a3f-4e16-878e-cc4897233802 RAC2
DN 10.40.17.20 57 GB 256 49.2% 0934da93-f777-445c-a801-9ed0d4972fef RAC1
Here you can see one of the nodes is unreachable or down. If you have many nodes, use the grep command to identify the abnormal nodes.
nodetool status | grep -v '^UN'
This will list only the nodes that are not Up and Normal.
Datacenter: Datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 10.40.17.20 57 GB 256 49.2% 0934da93-f777-445c-a801-9ed0d4972fef RAC1
The first step in troubleshooting is to identify where the problem is. Using the client logs, the Cassandra logs, and the other tools Cassandra provides, determine whether the problem lies with the client, a single node, or a subset of nodes.
Client logs help identify the source of the problem. Some common errors you may find in them are described below.
A "no host available" error (NoHostAvailableException in the Java driver) is thrown when the client or driver cannot establish a connection with any node in the cluster. It can happen when the configured contact points or node addresses are wrong, or when the client uses DCAwareRoundRobinPolicy with an incorrect localDC.
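If you hit this error, a quick first check is to confirm that the client host can actually reach a node. The sketch below is only an example: the address is taken from the sample output above, 9042 is the default native transport port, and it assumes netcat and cqlsh are installed on the client host.
# Check basic reachability of a node's CQL native transport port from the client host
nc -zv 10.40.40.55 9042
# Confirm the node accepts CQL connections
cqlsh 10.40.40.55 9042 -e "SHOW VERSION"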
An unavailable exception means the request reached the coordinator node, but there are not enough live replicas to satisfy the requested consistency level. Use nodetool getendpoints to get the list of replicas for that query and start debugging from there, and run nodetool status on the coordinator node to see which replicas it considers down.
An operation timed out error occurs when the coordinator does not respond within the timeout specified by the client. It can happen when the client-side timeout is too short or when the query fetches more data than can be processed within that time. The error message generally shows the last host tried by the client, and the coordinator is a good starting point for debugging the issue.
A read or write timeout exception is a server-side timeout between the coordinator and the replicas. It happens when a request reaches the coordinator, the coordinator believes enough replicas are available, and it requests the data from them, but one or more replicas fail to respond within the expected time, causing the read or write timeout. This exception generally points to a server-side problem, because the default timeouts are high.
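As a reference point, you can check the server-side timeouts configured on the nodes. This is a hedged example: the path to cassandra.yaml depends on your installation, and the exact property names vary slightly across Cassandra versions.
# Show the coordinator-to-replica timeouts (values are in milliseconds)
grep -E 'read_request_timeout|write_request_timeout|range_request_timeout' /etc/cassandra/cassandra.yaml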
If only a particular query consistently fails, you can use nodetool getendpoints to get all the nodes that own the relevant partition key and start debugging from that list of nodes.
If the following query consistently fails, one or more of the nodes that own the given partition key is likely the problem.
SELECT * FROM sampleks.table1 WHERE partitionkey = 123;
Use the following to find the endpoints that own the partition key "123":
nodetool getendpoints sampleks table1 123
#Lists the replica nodes
10.40.40.55
10.40.38.59
10.40.41.66
Once you have a list of suspect nodes, the next step is to narrow down the problem on each node. You can use the Cassandra logs and various nodetool commands to find out the state of a node.
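A quick triage pass on a suspect node might look like the following sketch (all standard nodetool subcommands, run on the node itself):
# Uptime, heap usage, load, and whether gossip and the native transport are active
nodetool info
# Pending and active compactions
nodetool compactionstats
# Thread pool backlogs and dropped messages
nodetool tpstats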
One of the most common problems Cassandra users face is high read latency. High read latency can be caused by either the coordinator or the individual replicas.
If the coordinator node is the suspect, use nodetool proxyhistograms to see the coordinator-level read and write latencies.
./nodetool proxyhistograms
#Lists the coordinator node's latency
proxy histograms
Percentile Read Latency Write Latency Range Latency
(micros) (micros) (micros)
50% 315.85 379.02 263.21
75% 545.79 454.83 315.85
95% 943.13 654.95 454.83
98% 1358.10 943.13 454.83
99% 1955.67 1358.10 454.83
Min 51.01 88.15 152.32
Max 8409.01 2816.16 454.83
High coordinator latency is commonly caused by problems on the coordinator node itself or by slow responses from the replicas it queries.
Local query latency refers to the time the replica nodes take to process a request. Common causes of high local latency are:
- A high number of SSTables read per query: use nodetool tablehistograms to find the SSTable count.
- Large partitions: use nodetool tablestats to find the average and maximum partition size, and use the logs to see the maximum partition size that is flushed or compacted.
- A high tombstone count: check the tombstone cells read in the logs, and use nodetool tablestats to check the average tombstones per slice.
- Thread pool backlogs: use nodetool tpstats to find the thread pool statistics.
nodetool tablehistograms provides the latency of local reads and writes.
./nodetool tablehistograms keyspace table
#provides table level latency
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 3.00 51.01 454.83 88148 258
75% 3.00 73.46 785.94 263210 770
95% 5.00 88.15 1358.10 1131752 2759
98% 5.00 105.78 1955.67 3379391 3311
99% 5.00 126.93 4055.27 4866323 4768
Min 0.00 8.24 51.01 30 0
Max 5.00 263.21 4866.32 802187438 42510
nodetool tablehistograms reports the number of SSTables read per query, the read and write latency, the partition size, and the cell count.
If the number of SSTables read per query is high, read queries will have high latency. This can happen when the wrong compaction strategy is chosen; pick a compaction strategy based on your read and write load. It can also happen when minor compaction has been disabled.
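For example, you might check the table's current compaction strategy and whether minor (auto) compaction is enabled, roughly as in the sketch below. The keyspace and table names are placeholders.
# Show the table definition, including its compaction strategy
cqlsh -e "DESCRIBE TABLE sampleks.table1"
# Check compactions currently pending or running on this node
nodetool compactionstats
# Re-enable minor compaction if it was disabled
nodetool enableautocompaction sampleks table1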
The partition size and cell count can help you evaluate and tune your data model.
If you suspect issues with one particular table, you can use nodetool tablestats to get the table-level latency and other statistics.
nodetool tablestats cassandra.table1
#List the tablestats
Keyspace: cassandra
Read Count: 28498299
Read Latency: 0.3223316537594051 ms.
Write Count: 35110896
Write Latency: 0.04817532471971094 ms.
Pending Flushes: 0
Table: table1
SSTable count: 13
Space used (live): 4551417734
Space used (total): 4551417734
Space used by snapshots (total): 0
Off heap memory used (total): 4548182
SSTable Compression Ratio: 0.14427946324401666
Number of partitions (estimate): 183506
Memtable cell count: 474
Memtable data size: 195955
Memtable off heap memory used: 0
Memtable switch count: 529
Local read count: 28498299
Local read latency: 0.122 ms
Local write count: 35110896
Local write latency: 0.039 ms
Pending flushes: 0
Bloom filter false positives: 17515
Bloom filter false ratio: 0.00000
Bloom filter space used: 431528
Bloom filter off heap memory used: 431424
Index summary off heap memory used: 53606
Compression metadata off heap memory used: 4063152
Compacted partition minimum bytes: 30
Compacted partition maximum bytes: 43388628
Compacted partition mean bytes: 181607
Average live cells per slice (last five minutes): 2.69670113244707
Maximum live cells per slice (last five minutes): 35
Average tombstones per slice (last five minutes): 1.0
Maximum tombstones per slice (last five minutes): 1
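To compare several tables quickly, you can filter the output down to the fields you care about. This is a hedged shortcut; the field names match the output shown above, and the keyspace name is taken from that example.
# Show just the per-table read/write latency and tombstones per slice for a keyspace
nodetool tablestats cassandra | grep -E 'Table:|Local read latency|Local write latency|tombstones per slice'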
Cassandra maintains multiple thread pools, each dedicated to a specific task. You can use nodetool tpstats to see the state of these thread pools, including the number of active, pending, completed, and blocked tasks.
It also reports the number of dropped messages, which can shed light on which stage is causing a problem. In the example below, mutation and request-response messages have been dropped; a quick filter for spotting this is sketched after the output.
Pool Name Active Pending Completed Blocked All time blocked
MutationStage 0 0 385618082 0 0
ViewMutationStage 0 0 0 0 0
ReadStage 0 0 537370386 0 0
RequestResponseStage 0 0 376599718 0 0
ReadRepairStage 0 0 1603479 0 0
CounterMutationStage 0 0 0 0 0
MiscStage 0 0 0 0 0
CompactionExecutor 0 0 38697073 0 0
MemtableReclaimMemory 0 0 41849 0 0
PendingRangeCalculator 0 0 2 0 0
GossipStage 0 0 70732668 0 0
SecondaryIndexManagement 0 0 0 0 0
HintsDispatcher 0 0 188 0 0
MigrationStage 0 0 11 0 0
MemtablePostFlush 0 0 68006 0 0
ValidationExecutor 0 0 0 0 0
Sampler 0 0 0 0 0
MemtableFlushWriter 0 0 41555 0 0
InternalResponseStage 0 0 1122 0 0
AntiEntropyStage 0 0 0 0 0
CacheCleanupExecutor 0 0 0 0 0
Native-Transport-Requests 0 0 1668155897 0 72
UserDefinedFunctions 0 0 0 0 0
Message type Dropped
READ 0
RANGE_SLICE 0
_TRACE 0
HINT 0
MUTATION 10
COUNTER_MUTATION 0
BATCH_STORE 0
BATCH_REMOVE 0
REQUEST_RESPONSE 2
PAGED_RANGE 0
READ_REPAIR 0
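To spot problems at a glance, you can filter the output, roughly as follows. This is a sketch that assumes the column layout shown above.
# Show only thread pools with pending or blocked tasks
nodetool tpstats | awk 'NR > 1 && ($3+0 > 0 || $5+0 > 0)'
# Show only message types with a nonzero drop count
nodetool tpstats | awk '/^[A-Z_]+[[:space:]]+[1-9][0-9]*$/'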
Cassandra provides multiple log files for different purposes. The main ones are described below.
system.log is the default general log provided by Cassandra. The fields available are the log level, thread name/ID, date, class name, line number, and message.
Sometimes reading through the whole log file is overwhelming. Filter the messages by log level:
# List the last 5 warning/error messages
grep 'WARN\|ERROR' system.log | tail -5
By default, Cassandra compresses old rotated log files, so to search through all the rotated logs, use zgrep:
# Search all compressed rotated files
zgrep 'WARN\|ERROR' system.log.* | less
The GCInspector.java class logs a message whenever there is a long GC pause.
# List the last 10 GC pauses in system.log
grep 'GCInspector' system.log | tail -10
Tombstones can bring a node or even the entire cluster down if not handled properly. Read response time increases when a query has to read a large number of tombstones.
grep 'tombstone cells' system.log | tail -10
Sample Output
WARN [SharedPool-Worker-1] 2021-07-30 02:01:53,990 ReadCommand.java:520 - Read 452 live rows and 1392 tombstone cells for query SELECT * FROM keyspace.table1 WHERE id = 456 LIMIT 5000 (see tombstone_warn_threshold)
WARN [SharedPool-Worker-1] 2021-07-30 02:01:53,990 ReadCommand.java:520 - Read 2 live rows and 1892 tombstone cells for query SELECT * FROM keyspace.table1 WHERE id = 9887 LIMIT 5000 (see tombstone_warn_threshold)
In these warnings, too many tombstone cells are being read, which slows the query down. Tune gc_grace_seconds according to your workload.
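A hedged example of tuning it is shown below; the keyspace, table, and value are illustrative. Keep gc_grace_seconds longer than your repair interval, otherwise deleted data can be resurrected.
# Purge tombstones after 3 days instead of the 10-day default
cqlsh -e "ALTER TABLE sampleks.table1 WITH gc_grace_seconds = 259200;"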
grep "Writing large partition" system.log | tail -10
Sample Output
WARN [CompactionExecutor:13606] 2021-08-08 00:00:35,968 BigTableWriter.java:171 - Writing large partition keyspace/table1:8354000001876068 (251029397 bytes to sstable /home/cassandra/data_directories/data/keyspace/table1-01e71f5bfc3d36fd80b7abac29bd7766/mc-77869-big-Data.db)
WARN [CompactionExecutor:14390] 2021-08-11 00:00:36,176 BigTableWriter.java:171 - Writing large partition keyspace/table1:8354000001876068 (258559119 bytes to sstable /home/cassandra/data_directories/data/keyspace/table1-01e71f5bfc3d36fd80b7abac29bd7766/mc-77873-big-Data.db)
debug.log is similar to system.log but contains additional debugging information, which makes it noisier. It's recommended that you use a centralized logging or log-analysis tool to work through it efficiently.
If you plan to search it manually, it helps to have a list of messages to search for in advance.
# Count how many times each column family (table) was flushed
grep 'Flushing largest CFS' debug.log | awk '{print $9, $10}' | sort | uniq -c
4 CFS(Keyspace='cassandra', ColumnFamily='table1')
1 CFS(Keyspace='cassandra', ColumnFamily='table2')
2 CFS(Keyspace='cassandra', ColumnFamily='table1')
1 CFS(Keyspace='system', ColumnFamily='size_estimates')
1 CFS(Keyspace='sampleks', ColumnFamily='sampletable')
gc.log is a conventional Java garbage collection log. You can use it to fine-tune the JVM and GC configuration.
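A rough way to spot long pauses manually is sketched below; the exact line format depends on the JVM version and GC flags in use, so the pattern may need adjusting.
# List recent stop-the-world pause entries recorded in the GC log
grep -i 'pause\|stopped' gc.log | tail -10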
In this article, we went through various troubleshooting steps, from narrowing down the source of a problem to choosing the appropriate fix for common issues, such as changing the data model, optimizing the query pattern, and scaling nodes.