Thursday, 13 September 2012

Cassandra Read Performance!!

Recently I was working on improving Read Performance of Cassandra. While working I came across some facts and findings so I thought lets Share It so that we can Explore It.

I have been using Cassandra for over an year and as writes are sequential they are faster like anything.I have seen faster performance with sstableloader and BulkOutPutFrormat. But read performance is really a concern and can be tuned with various parameters and also depends on the way you model the data(database schema).But If you design it considering how Cassandra works reads can be faster. 

When a read request for a row comes in to a node, the row must be combined from all SSTables on that node that contain columns from the row in question, as well as from any unflushed memtables, to produce the requested data. To optimize this piecing-together process, Cassandra uses an in-memory structure called a Bloom filter. Each SSTable has a Bloom filter associated with it that checks if any data for the requested row exists in the SSTable before doing any disk I/O. As a result, Cassandra is very performant on reads when compared to other storage systems, even for read-heavy workloads. As with any database, reads are fastest when the most in-demand data (or hot working set) fits into memory.  Although all modern storage systems rely on some form of caching to allow for fast access to hot data, not all of them degrade gracefully when the cache capacity is exceeded and disk I/O is required. Cassandra's read performance benefits from built-in caching, but it also does not dip dramatically when random disk seeks are required. When I/O activity starts to increase in Cassandra due to increased read load, it is easy to remedy by adding more nodes to the cluster.

For rows that are accessed frequently, Cassandra has a built-in key cache (and an optional row cache). For more information about optimizing read performance using the built-in caching feature. 

 Cassandra read flow Can be summarized as follows
For each key requested it will

  1. Apply some kind of validation for if the CF exist and all.
  2. Get all the endpoint where key exist
  3. Select nearest and better performing endpoint according to endpoint snitch.
  4. Check if the key is cached in key cache (key cache stores primary index for the key) and get the data from the SSTable if it is in the same SSTable
  5. If key is not cached for every SSTable it checks if the key exist with the help of the bloom filter.
  6.  If bloom filter says key do not exist then it look for next SSTable. 
  7. If bloom filter says key exist then it access primary index to locate the key location in SSTable. Then it access SSTable to get requested data. But if key is not present in index (false positive) look for the next SSTable. (Please correct me in case I am missing something here)

Factors affecting Cassandra Read Performance:

Following are the factors that affect casandra read performance

Index Interval

Bloom Filter False Positive

Consistency Level

Read Repair Chance



Data modeling 

Cluster Deployment

    Index Interval

     (Default: 128) Each SSTable has an index file containing row keys and the position at which that row starts in the data file. At startup, Cassandra reads a sample of that index into memory. By default 1 row key out of every 128 is sampled.
To find a row, Cassandra performs a binary search on the sample, then does just one diskread of the index block corresponding to the closest sampled entry. The larger the sampling, the more effective the index is (at the cost of memory usage). A smaller value for this property results in a larger, more effective index. Generally, a value between 128 and 512 in combination with a large column family key cache offers the best trade off between memory usage and performance. You may want to increase the sample size if you have small rows, thus decreasing the index size and memory usage. For large rows, decreasing the sample size may improve read performance.


Bloom Filter False Positive 


A Bloom filter, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free. Bloom filters are surprisingly simple: divide a memory area into buckets (one bit per bucket for a standard bloom filter; more -typically four - for a counting bloom filter). To insert a key, generate several hashes per key, and mark the buckets for each hash. To check if a key is present, check each bucket; if any bucket is empty, the key was never inserted in the filter. If all buckets are non-empty, though, the key is only probably inserted - other keys' hashes could have covered the same buckets. See All you ever wanted to know about writing bloom filters for details and in particular why getting a really good output distribution is important.


(Default: ~ 0.0007) Desired false-positive probability for SSTable Bloomfilters.Valid values are 0.0001 to 1.0. At 1.0 the Bloom filter is effectively disabled; this is reasonable if you are using LeveledCompactionStrategy and not querying for non-existent rows. If you have many keys per node and are worried about Bloom filter memory usage, a reasonable first step is to try 0.01, if that is still too large then 0.1



Consistency Level 


Consistency levels in Cassandra can be set on any read or write query. This allows application developers to tune consistency on a per-query basis depending on their requirements for response time versus data accuracy. Cassandra offers a number of consistency levels for both reads and writesWhen you do a read in Cassandra, the consistency level specifies how many replicas must respond before a result is returned to the client application.Cassandra checks the specified number of replicas for the most recent data to satisfy the read request (based on the timestamp).The following consistency levels are available, with ONE being the lowest consistency (but highest availability), and ALL being the highest consistency (but lowest availability). QUORUM is a good middle-ground ensuring strong consistency, yet still tolerating some level of failure.A quorum is calculated as (rounded down to a whole number): 
(replication_factor / 2) + 1 
For example, with a replication factor of 3, a quorum is 2 (can tolerate 1 replica down). With a replication factor of 6, a quorum is 4 (can tolerate 2 replicas down)Choosing a consistency level for reads and writes involves determining your requirements for consistent results (always reading the most recently written data) versus read or write latency (the time it takes for the requested data to be returned or for the write to succeed).If latency is a top priority, consider a consistency level of ONE (only one replica node must successfully respond to the read or write request). There is a higher probability of stale data being read with this consistency level (as the replicas contacted for reads may not always  have the most recent write). For some applications, this may be an acceptable trade-off. If it is an absolute requirement that a write never fail, you may also consider a write consistency level of ANY.This consistency level has the highest probability of a read not returning the latest written values (see hinted handoff).If consistency is top priority, you can ensure that a read will always reflect the most recent write by using the following formula: 
(nodes_written + nodes_read) > replication_factor 
For example, if your application is using the QUORUM consistency level for both write and read operations and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes are always read. The combination of nodes written and read (4) being greater than the replication factor (3) ensures strong read consistency. 

Read Repair Chance 

Read repair means that when a query is made against a given key, we perform a digest query against all the replicas of the key and push the most recent version to any out-of-date replicas. If a lower ConsistencyLevel than ALL was specified, this is done in the background after returning the data from the closest replica to the client; otherwise, it is done before returning the data.
This means that in almost all cases, at most the first instance of a query will return old data.
Range scans are not per-key and do not do read repair. (A range scan at CL > ONE *will* reconcile differences in replicas required to achive the given CL, but extra replicas are not compared in the background.)
(Default: 0.1) Specifies the probability with which read repairs should be invoked on non-quorum reads. The value must be between 0 and 1. A value of 0.1 means that a read repair is performed 10% of the time and a value of 1 means that a read repair is performed 100% of the time. Lower values improve read throughput, but increase the chances of stale values when not using a strong consistency level.



Tuning Data Caches
These caches are built into Cassandra and provide very efficient data caching: 
• Key cache: a cache of the primary key index for a Cassandra table. Enabled by default. 
• Row cache: similar to a traditional cache like memcached. Holds the entire row in memory so reads can be satisfied without using disk. Disabled by default.  
If read performance is critical, you can leverage the built-in caching to effectively pry dedicated caching tools, such as memcached, completely out of the stack. Such deployments remove a redundant layer and strengthen cache functionality in the lower tier where the data is already being stored. Caching never needs to be restarted in a completely cold state.

With proper tuning, key cache hit rates of 85% or better are possible with Cassandra, and each hit on a key cache can save one disk seek per SSTable. Row caching, when feasible, can save the system from performing any disk seeks at all when fetching a cached row. When growth in the read load begins to impact your hit rates, you can add capacity to restore optimal levels of caching. Typically, expect a 90% hit rate for row caches. If row cache hit rates are 30% or lower, it may make more sense to leave row caching disabled (the default). Using only the key cache makes the row cache available for other column families that need it.

How Caching Works  
When both row and key caches are configured, the row cache returns results whenever possible. In the event of a row cache miss, the key cache might still provide a hit that makes the disk seek much more efficient. This diagram depicts two read operations on a column family with both caches already populated. 
One read operation hits the row cache, returning the requested row without a disk seek. The other read operation requests a row that is not present in the row cache but is present in the key cache. After accessing the row in the SSTable, the system returns the data and populates the row cache with this read operation. 

When to Use Key Caching 
Because the key cache holds the location of keys in memory on a per-column family basis, turning this value up can have an immediate, positive impact on column family reads as soon as the cache warms. 
High levels of key caching are recommended for most scenarios. Cases for row caching are more pecialized, but whenever it can coexist peacefully with other demands on memory resources, row caching provides the most dramatic gains in efficiency. 
Using the default key cache setting, or a higher one, works well in most cases. Tune key cache sizes in conjunction with the Java heap size.

When to Use Row Caching 
Row caching is recommended in these cases: 
• Data access patterns follow a normal (Gaussian) distribution. 
• Rows contain heavily-read data and queries frequently return data from most or all of the columns.

General Cache Usage Tips 
Some tips for efficient cache use are: 
• Store lower-demand data or data with extremely long rows in a column family with minimal or no caching. 
• Deploy a large number of Cassandra nodes under a relatively light load per node. 
• Logically separate heavily-read data into discrete column families. 
Cassandra's memtables have overhead for index structures on top of the actual data they store. If the size of the values stored in the heavily-read columns is small compared to the number of columns and rows themselves (long, narrow rows), this overhead can be substantial. Short, narrow rows, on the other hand, lend themselves to highly efficient row caching. 

Enabling the Key and Row Caches 
Enable the key and row caches at the column family level using the CQL caching parameter. Unlike earlier Cassandra versions, cache sizes do not need to be specified per table. Just set caching to all, keys_only, rows_only, or none, and Cassandra weights the cached data by size and access frequency, and thus make optimal use of the cache memory without manual tuning. For archived tables, disable caching entirely because these tables are read infrequently. 

Setting Cache Options 
In the cassandra.yaml file, tune caching by changing these options: 
key_cache_size_in_mb: The capacity in megabytes of all key caches on the node. 
row_cache_size_in_mb: The capacity in megabytes of all row caches on the node. 
key_cache_save_period: How often to save the key caches to disk. 
row_cache_save_period: How often to save the key caches to disk. 
row_cache_provider: The implementation used for row caches on the node.  

Monitoring Cache Tune Ups 
Make changes to cache options in small, incremental adjustments, then monitor the effects of each change using one of the following tools:  
nodetool cfstats  

About the Off-Heap Row Cache 
Cassandra can store cached rows in native memory, outside the Java heap. This results in both a smaller per-row memory footprint and reduced JVM heap requirements, which helps keep the heap size in the sweet spot for JVM garbage collection performance. 
Using the off-heap row cache requires the JNA library to be installed; otherwise, Cassandra falls back on the on-heap cache provider.



In the background, Cassandra periodically merges SSTables together into larger SSTables using a process called compaction. Compaction merges row fragments together, removes expired tombstones (deleted columns), and rebuilds primary and secondary indexes. Since the SSTables are sorted by row key, this merge is efficient (no random disk I/O). Once a newly merged SSTable is complete, the input SSTables are marked as obsolete and eventually deleted by the JVM garbage collection (GC) process. However, during compaction, there is a temporary spike in disk space usage and disk I/O.
Compaction impacts read performance in two ways. While a compaction is in progress, it temporarily increases disk I/O and disk utilization which can impact read performance for reads that are not fulfilled by the cache. However, after a compaction has been completed, off-cache read performance improves since there are fewer SSTable files on disk that need to be checked in order to complete a read request.

Data modeling 


Planning a data model in Cassandra involves different design considerations than you may be used to if you work with relational databases. Ultimately, the data model you design depends on the data you want to capture and how you plan to access it. However, there are some common design considerations for Cassandra data model planning. 

Start with Queries 
The best way to approach data modeling for Cassandra is to start with your queries and work back from there. Think about the actions your application needs to perform, how you want to access the data, and then design column families to support those access patterns. A good rule of a thumb is one column family per query since you optimize column families for read performance. 

For example, start with listing the use cases your application needs to support. Think about the data you want to capture and the lookups your application needs to do. Also note any ordering, filtering or grouping requirements. For example, needing events in chronological order or considering only the last 6 months worth of data would be factors in your data model design. 

Denormalize to Optimize
In the relational world, the data model is usually designed up front with the goal of normalizing the data to minimize redundancy. Normalization typically involves creating smaller, well-structured tables and then defining relationships between them. During queries, related tables are joined to satisfy the request.

Cassandra does not have foreign key relationships like a relational database does, which means you cannot join multiple column families to satisfy a given query request. Cassandra performs best when the data needed to satisfy a given query is located in the same column family. Try to plan your data model so that one or more rows in a single column family are used to answer each query. This sacrifices disk space (one of the cheapest resources for a server) in order to reduce the number of disk seeks and the amount of network traffic.

Cluster   Deployment 


When planning a Cassandra cluster deployment, you should have a good idea of the initial volume of data you plan to store and a good estimate of your typical application workload. 
Selecting Hardware for Enterprise Implementations 
As with any application, choosing appropriate hardware depends on selecting the right balance of the following resources: memory, CPU, disks, number of nodes, and network. 

The more memory a Cassandra node has, the better read performance. More RAM allows for larger cache sizes and reduces disk I/O for reads. More RAM also allows memory tables (memtables) to hold more recently written data. Larger memtables lead to a fewer number of SSTables being flushed to disk and fewer files to scan during a read. The ideal amount of RAM depends on the anticipated size of your hot data. 
• For dedicated hardware, a minimum of than 8GB of RAM is needed. DataStax recommends 16GB - 32GB. 
• Java heap space should be set to a maximum of 8GB or half of your total RAM, whichever is lower. (A greater heap size has more intense garbage collection periods.) 
• For a virtual environment use a minimum of 4GB, such as Amazon EC2 Large instances. For production clusters with a healthy amount of traffic, 8GB is more common. 

Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound. Cassandra is highly concurrent and uses as many CPU cores as available. 
• For dedicated hardware, 8-core processors are the current price-performance sweet spot. 
• For virtual environments, consider using a provider that allows CPU bursting, such as Rackspace cloud Servers. 

What you need for your environment depends a lot on the usage, so it's important to understand the mechanism. 
Cassandra writes data to disk for two purposes: 
• All data is written to the commit log for durability. 
• When thresholds are reached, Cassandra periodically flushes in-memory data structures  (memtables) to SSTable data files for persistent storage of column family data. 
Commit logs receive every write made to a Cassandra node, but are only read during node start up. Commit logs are purged after the corresponding data is flushed. Conversely, SSTable (data file) writes occur asynchronously and are read during client look-ups. Additionally, SSTables are periodically compacted. Compaction improves performance by merging and rewriting data and discarding old data. However, during compaction (or node repair), disk utilization and data directory volume can substantially increase. For this reason, DataStax recommends leaving an adequate amount of free disk space available on a node (50% [worst case] for tiered compaction, 10% for leveled compaction).  

• When choosing disks, consider both capacity (how much data you plan to store) and I/O (the write/read throughput rate). Most workloads are best served by using less expensive SATA disks and scaling  disk capacity and I/O by adding more nodes (with more RAM). 
• Solid-state drives (SSDs) are also a valid alternative for Cassandra. Cassandra's sequential, streaming write patterns minimize the undesirable effects of write amplification associated with SSDs. 
• Ideally Cassandra needs at least two disks, one for the commit log and the other for the data directories. At a minimum the commit log should be on its own partition. 
• Commit log disk - this disk does not need to be large, but it should be fast enough to receive all of your writes as appends (sequential I/O). 
• Data disks - use one or more disks and make sure they are large enough for the data volume and fast enough to both satisfy reads that are not cached in memory and to keep up with compaction. 
• RAID - compaction can temporarily require up to 100% of the free in-use disk space on a single data directory volume. This means when approaching 50% of disk capacity, you should use RAID 0 or RAID 10 for your data directory volumes. RAID also helps smooth out I/O hotspots within a single SSTable. 
• Use RAID0 if disk capacity is a bottleneck and rely on Cassandra's replication capabilities for disk  failure tolerance. If you lose a disk on a node, you can recover lost data through Cassandra's built-in repair. 
• Use RAID10 to avoid large repair operations after a single disk failure, or if you have disk capacity to spare. 
• Because data is stored in the memtable, generally RAID is not needed for the commit log disk, but if you need the extra redundancy, use RAID 1. 
• Extended file systems - On ext2 or ext3, the maximum file size is 2TB even using a 64-bit kernel. On ext4 it is 16TB. Because Cassandra can use almost half your disk space for a single file, use XFS when raiding large disks together, particularly if using a 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and essentially unlimited on 64-bit. 

Number of Nodes 
The amount of data on each disk in the array isn't as important as the total size per node. Using a greater number of smaller nodes is better than using fewer larger nodes because of potential bottlenecks on larger nodes during compaction. 

Since Cassandra is a distributed data store, it puts load on the network to handle read/write requests and replication of data across nodes. Be sure to choose reliable, redundant network interfaces and make sure that your network can handle traffic between nodes without bottlenecksT. 
• Recommended bandwith is 1000 Mbit/s (Gigabit) or greater. 
• Bind the Thrift interface (listen_address) to a specific NIC (Network Interface Card). 
• Bind the RPC server inteface (rpc_address) to another NIC.  
Cassandra is efficient at routing requests to replicas that are geographically closest to the coordinator node handling the request. Cassandra will pick a replica in the same rack if possible, and will choose replicas located in the same data center over replicas in a remote data center. 

If using a firewall, make sure that nodes within a cluster can reach each other on these ports.


Anonymous said...

Thanks for such detailed and educative article on Cassandra read architecture. It was real pleasure to read it.


Actually we should say thanks to DataStax for such a nice documentation of Cassandra which is getting better everyday. I have just collected the points affecting read performance and have taken most of the details from documentation only.

Anonymous said...

Yeah thanks to them but still big thanks to you to actually put it all together to the point where it actually makes sense!

NMMM.NU said...

Thank you for good summaryzation :)

Henrique said...

I am developing a application with Cassandra for my graduation work, and this performance tricks were so much helpful to me! Thank you!