How To Retrieve Data From Cassandra Using Java

Cassandra's normal use case is a vast number of small read and write operations distributed evenly across the cluster.

However, every so often you need to extract a large quantity of data in a single query. For instance, say you are storing customer events that you normally query in small slices, but once a day you want to extract all of them for a particular customer. For customers with a lot of events this could be many hundreds of megabytes of data.

If you want to write a Java application that executes this query against a version of Cassandra prior to 2.0 then you may run into some issues. Let's look at the first one...

Coordinator out of memory:

Previous versions of Cassandra used to bring all of the rows back to the coordinator before sending them to your application, so if the result was too large for the coordinator's heap it would run out of memory.

Let's say you had just enough memory in the coordinator for the result; then you ran the risk of...

Awarding out of retentiveness:

To get around this you had to implement your own paging, where you split the query into many small queries and processed them in batches. This can be accomplished by limiting the results and issuing the next query after the last result of the previous query, as sketched below.
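Something like this sketch, using an early version of the DataStax Java driver; the table, column names, page size and process() method are all placeholders, not the original post's code:

```java
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.UUID;

public class ManualPagingDao {

    private static final int PAGE_SIZE = 500;

    private final Session session;
    private final PreparedStatement firstPage;
    private final PreparedStatement nextPage;

    public ManualPagingDao(Session session) {
        this.session = session;
        this.firstPage = session.prepare(
                "SELECT * FROM customer_events WHERE customer_id = ? LIMIT " + PAGE_SIZE);
        this.nextPage = session.prepare(
                "SELECT * FROM customer_events WHERE customer_id = ? AND time > ? LIMIT " + PAGE_SIZE);
    }

    public void processAllEvents(String customerId) {
        UUID lastSeen = null;
        while (true) {
            Iterable<Row> rows = (lastSeen == null)
                    ? session.execute(firstPage.bind(customerId))
                    : session.execute(nextPage.bind(customerId, lastSeen));
            int rowsInPage = 0;
            for (Row row : rows) {
                lastSeen = row.getUUID("time"); // remember where this page ended
                process(row);
                rowsInPage++;
            }
            if (rowsInPage < PAGE_SIZE) {
                return; // a short page means we've read the whole partition
            }
        }
    }

    private void process(Row row) { /* application-specific */ }
}
```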

If your application was streaming the results over HTTP then the architecture could look something like this:

Here we place some kind of queue, say an ArrayBlockingQueue if using Java, between the thread executing the queries and the thread streaming it out over HTTP. If the queue fills up, the DAO thread is blocked, meaning that it won't bring any more rows back from Cassandra. If the DAO gets behind, the web thread (perhaps a Tomcat thread) blocks waiting to get more rows out of the queue. This works very nicely with the JAX-RS StreamingOutput.
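A rough sketch of that queue-in-the-middle shape (all names are hypothetical; the empty-string poison pill is just one simple way to signal end-of-stream):

```java
import javax.ws.rs.core.StreamingOutput;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QueueBackedStreaming {

    // Sentinel telling the web thread that the DAO has no more rows.
    private static final String POISON = new String("");

    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(1_000);
    private final ExecutorService daoExecutor = Executors.newSingleThreadExecutor();

    public StreamingOutput stream(Iterable<String> rowsFromPagedQueries) {
        // DAO thread: put() blocks when the queue is full, so we stop
        // pulling rows from Cassandra until the web thread catches up.
        daoExecutor.submit(() -> {
            try {
                for (String row : rowsFromPagedQueries) {
                    queue.put(row);
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Web thread (inside StreamingOutput.write): take() blocks when the
        // queue is empty, i.e. when the DAO has fallen behind.
        return outputStream -> {
            try {
                String row;
                while ((row = queue.take()) != POISON) {
                    outputStream.write((row + "\n").getBytes());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
    }
}
```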

This all sounds like a lot of hard work...

The 2.0+ solution

From version 2.0, Cassandra no longer suffers from the coordinator running out of memory. This is because the coordinator pages the response to the driver and doesn't bring the whole result into memory. However, if your application reads the whole ResultSet into memory, then your application running out of memory is still an issue.
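The driver exposes this paging as a fetch size you can set per statement. A quick sketch against the 2.x driver (table and column names are placeholders); as long as you iterate rather than materialise the rows, only one page at a time is held in the application:

```java
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

public class PagedRead {

    // Iterates a whole partition while only ever holding one page
    // (here 1,000 rows) of the result in the application's memory.
    public void readAll(Session session, String customerId) {
        Statement statement = new SimpleStatement(
                "SELECT * FROM customer_events WHERE customer_id = ?", customerId)
                .setFetchSize(1000);
        ResultSet resultSet = session.execute(statement);
        for (Row row : resultSet) {
            // the driver transparently fetches the next page as we iterate
            System.out.println(row.getUUID("time"));
        }
    }
}
```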

However, the DataStax driver's ResultSet pages as well, which works really nicely with RxJava and JAX-RS StreamingOutput. Time to get real; let's take the following schema:
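The schema gist didn't survive in this copy of the post; a plausible version, consistent with the code sketches below, is:

```sql
CREATE TABLE customer_events (
    customer_id text,
    time timeuuid,
    staff_id text,
    event_type text,
    PRIMARY KEY (customer_id, time)
);
```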

And you want to get all the events for a particular customer_id (the partition key). First let's write the DAO:
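The DAO gist is also missing, so here is a sketch of what it could look like with the DataStax 2.x driver and RxJava 1.x. The line numbers used in the walkthrough below are marked as comments, and CustomerEvent is a made-up value class shown for completeness:

```java
import java.util.UUID;

import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

import rx.Observable;

public class CustomerEventDao {

    private final Session session;
    private final PreparedStatement getEventsQuery;

    public CustomerEventDao(Session session) {
        this.session = session;
        this.getEventsQuery = session.prepare(
                "SELECT * FROM customer_events WHERE customer_id = ?");
    }

    public Observable<CustomerEvent> getCustomerEvents(String customerId) {
        ResultSetFuture future =
                session.executeAsync(getEventsQuery.bind(customerId));  // (2)
        return Observable
                .from(future)                    // (4) ListenableFuture -> Observable<ResultSet>
                .flatMap(Observable::from)       // (5) ResultSet is an Iterable<Row>
                .map(row -> new CustomerEvent(   // (6) keep driver types inside the DAO
                        row.getString("customer_id"),
                        row.getUUID("time"),
                        row.getString("staff_id"),
                        row.getString("event_type")));
    }
}

// CustomerEvent.java: a plain value class (hypothetical).
class CustomerEvent {
    private final String customerId;
    private final UUID time;
    private final String staffId;
    private final String eventType;

    CustomerEvent(String customerId, UUID time, String staffId, String eventType) {
        this.customerId = customerId;
        this.time = time;
        this.staffId = staffId;
        this.eventType = eventType;
    }

    @Override
    public String toString() {
        return customerId + "," + time + "," + staffId + "," + eventType;
    }
}
```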

Let's go through this line by line (the line numbers are marked as comments in the sketch above):

2: Async execute of the query that will bring back more rows than will fit in memory.
4: Convert the ListenableFuture to an RxJava Observable. The Observable has a really nice callback interface / way to do transformations.
5: As ResultSet implements Iterable we can flatMap it to Row!
6: Finally, map the Row object to a CustomerEvent object to prevent driver knowledge escaping the DAO.

And then let's see the JAX-RS resource class:
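Again the original gist is missing; this sketch follows the structure the walkthrough below describes, with the referenced line numbers marked as comments (JAX-RS 2 and RxJava 1.x assumed, error handling kept minimal):

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;
import javax.ws.rs.core.StreamingOutput;

import rx.Observable;
import rx.Subscriber;

@Path("events")
public class EventsResource {

    private final CustomerEventDao customerEventDao;

    public EventsResource(CustomerEventDao customerEventDao) {
        this.customerEventDao = customerEventDao;
    }

    @GET
    @Path("/stream/{customerId}")
    public Response streamEvents(@PathParam("customerId") String customerId) {
        StreamingOutput output = outputStream -> {
            Observable<CustomerEvent> events =
                    customerEventDao.getCustomerEvents(customerId);      // (5)
            CountDownLatch latch = new CountDownLatch(1);                // (6)
            events.subscribe(new Subscriber<CustomerEvent>() {           // (7)
                @Override
                public void onCompleted() {
                    try {
                        outputStream.close();                            // (12)
                    } catch (IOException e) {
                        // nothing sensible to do: the client has gone away
                    } finally {
                        latch.countDown();                               // (16)
                    }
                }

                @Override
                public void onError(Throwable e) {
                    latch.countDown(); // release the container thread on failure too
                }

                @Override
                public void onNext(CustomerEvent event) {
                    try {
                        outputStream.write((event + "\n").getBytes());   // (26)
                    } catch (IOException e) {
                        throw new RuntimeException(e); // surfaces as onError
                    }
                }
            });
            try {
                latch.await();                                           // (33)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        return Response.ok(output).build();                              // (39)
    }
}
```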

It looks complicated but it actually isn't; first, a little about JAX-RS streaming.

The way JAX-RS works is that we are given a StreamingOutput interface which we implement to get hold of the raw OutputStream. The container, e.g. Tomcat or Jetty, will call the write method. It is our responsibility to keep the container's thread in that method until we have finished streaming. With that knowledge let's go through the code (the numbers refer to the marked lines in the sketch above):

5: Get the Observable<CustomerEvent> from the DAO.
6: Create a CountDownLatch which we'll use to block the container thread.
7: Register a callback to consume all the rows and write them to the output stream.
12: When the rows are finished, close the OutputStream.
16: Count down the latch to release the container thread on line 33.
26: Each time we get a CustomerEvent, write it to the OutputStream.
33: Wait on the latch to keep the container thread blocked.
39: Return the StreamingOutput instance to the container so it can call write.

Given that we're dealing with the rows from Cassandra asynchronously, you didn't expect the code to be in order, did you? ;)

The full working example is on my GitHub. To test it all I put around 600MB of data in a Cassandra cluster for the same partition. There is a sample class in the test directory to do this.
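That sample class isn't reproduced here, but a loader along these lines would do the job (all names and sizes below are illustrative, not the original's):

```java
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class TestDataLoader {

    // ~3M rows of ~200 bytes each into one partition is roughly 600MB.
    public static void load(Session session) {
        PreparedStatement insert = session.prepare(
                "INSERT INTO customer_events (customer_id, time, staff_id, event_type) "
                        + "VALUES (?, ?, ?, ?)");
        String padding = new String(new char[200]).replace('\0', 'x'); // bulks up each row
        for (int i = 0; i < 3_000_000; i++) {
            session.execute(insert.bind(
                    "customer-1", UUIDs.timeBased(), "staff-" + (i % 10), padding));
        }
    }
}
```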

I then started the application with a MaxHeapSize of 256MB, and then used curl to hit the events/stream endpoint:
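The curl session itself wasn't preserved in this copy; the commands were something of this shape (the jar name, host and port are whatever your setup uses):

```
java -Xmx256m -jar streaming-example.jar &
curl -v http://localhost:8080/events/stream/customer-1 > /dev/null
```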

As you can see, 610MB came back in 7 minutes. The whole time I had VisualVM attached to the application and the coordinator and monitored the memory usage.

Here's the graph from the application:

The test ran from 14:22 to 14:37. Even though we were pumping 610MB of data through the application, the heap jittered between 50MB and 200MB, easily able to reclaim the memory of the data we had streamed out.

For those new to Cassandra and other distributed databases this might not seem that spectacular, but I once wrote a rather large project to do what we can manage here in a few lines. Awesome work by the Apache Cassandra committers and the DataStax Java driver team.



Source: http://www.batey.info/streaming-large-payloads-over-http-from.html
