In this post, we will talk about handling large datasets in distributed systems. This is not about big data or machine learning, where handling large data is the norm. Rather, when you start scaling a distributed system, you will start processing more and more transactional and reporting data. How do you handle that kind of large dataset? If you are a beginner with distributed systems, you can read fundamentals of distributed systems or building event-driven microservices.
Why handle large datasets?
The first question that arises is why we need to handle large datasets at all. In practice, there can be many reasons: migrating a large set of data after upgrading a legacy system, processing historical data, or an existing system that keeps growing with transactional data. All these scenarios come with complexity and scale. When designing a distributed system, you can make decisions that take such scenarios into account. But every decision in a distributed system is a trade-off, and no matter how well you prepare, you might come across a scenario you have not accounted for. How do you handle such cases?
Ways to handle large datasets
We have a large dataset. How do we process it? The dataset can be for reporting or even auditing purposes.
Chunking
Chunking is one way of processing this data. We take a large dataset, split it into multiple smaller chunks, and then process each chunk. It is as simple as the terminology suggests: process the data in n chunks.
In this scenario, you need to know the characteristics of the data to split it into chunks. Chunking can also have side effects. Imagine reading a gigabyte of data into memory and only then trying to split it; that would create performance issues. Instead, you need to think about how to read data from a data store or database in chunks, typically using filters. Pagination is one such example.
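To make the idea concrete, here is a minimal sketch of offset-based pagination in Java. The RecordStore interface, fetchPage method, and chunk size are hypothetical placeholders for whatever data-access layer your system uses; the point is that only one chunk is in memory at a time.

```java
import java.util.List;

public class ChunkedProcessor {

    private static final int CHUNK_SIZE = 1_000;

    // Hypothetical data-access interface; replace with your repository or DAO.
    interface RecordStore {
        List<String> fetchPage(int offset, int limit);
    }

    public void processAll(RecordStore store) {
        int offset = 0;
        List<String> chunk;
        do {
            // Read one chunk at a time instead of loading the whole dataset into memory.
            chunk = store.fetchPage(offset, CHUNK_SIZE);
            chunk.forEach(this::processRecord);
            offset += CHUNK_SIZE;
        } while (chunk.size() == CHUNK_SIZE); // a short page means we have reached the end
    }

    private void processRecord(String record) {
        // Business logic for a single record goes here.
        System.out.println("Processed: " + record);
    }
}
```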
MapReduce
MapReduce is a programming model where you take data and pass it through map and reduce functions.
Map takes a key/value pair as input and produces a sequence of intermediate key/value pairs. The data is then sorted and grouped so that values with the same key end up together. Reduce merges the values that share a key and produces a new key/value pair.
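As an illustration, here is the classic word-count example expressed with Java streams instead of a full Hadoop job. The map step emits a (word, 1) pair for every word, and the reduce step sums the values that share a key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCount {

    public static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // Map phase: emit a (word, 1) pair for every word in every line.
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .map(word -> new SimpleEntry<>(word, 1L))
                // Shuffle and reduce phase: group pairs by key and sum the values.
                .collect(Collectors.groupingBy(
                        SimpleEntry::getKey,
                        Collectors.summingLong(SimpleEntry::getValue)));
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to scale or not to scale");
        System.out.println(countWords(lines)); // e.g. {not=2, be=2, scale=2, or=2, to=4}
    }
}
```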
Streaming
Streaming can be considered similar to chunking, but with streaming you don't need any custom filters. Many programming languages also offer streaming syntax for processing a large dataset, so we can process a large dataset from a file or a database through a stream. Streaming data is a continuous flow of data generated by various sources, and there are systems that transmit this data through event streams.
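For example, Java's Files.lines reads a file lazily, one line at a time, so the whole dataset never has to sit in memory. The file name and filter below are hypothetical; the sketch only illustrates the lazy-streaming idea.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class FileStreamProcessor {

    public static void main(String[] args) throws IOException {
        // Files.lines streams the file lazily; lines are read as the pipeline pulls them,
        // so the whole file is never loaded into memory at once.
        try (Stream<String> lines = Files.lines(Path.of("transactions.csv"))) {
            long failed = lines
                    .filter(line -> line.contains("FAILED"))
                    .count();
            System.out.println("Failed transactions: " + failed);
        }
    }
}
```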
Streaming allows the processing of data in real time. Platforms like Kafka let you send data and consume it almost immediately. The performance measure for streaming is latency.
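Here is a rough sketch of a Kafka consumer that processes records as they arrive. It assumes the kafka-clients library is on the classpath; the topic name, group id, and broker address are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderStreamConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker address
        props.put("group.id", "order-processors");          // placeholder consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));           // placeholder topic
            while (true) {
                // Poll returns whatever records are currently available; each one is processed immediately.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("Processing order: " + record.value());
                }
            }
        }
    }
}
```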
Batch Processing
Batch processing allows processing a set of data in batches. So if you have 100K records, you can set a limit to process only 10K records per batch. Spring Boot also offers Spring Batch for batch processing. Batch processes are scheduled jobs that run a set program to process a set of data and then produce an output. The performance measure of batch processing is throughput.
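As a rough sketch, a chunk-oriented Spring Batch step could look like the following. The TransactionRecord type, step name, and the reader/processor/writer beans are placeholders, and this uses the pre-5.0 StepBuilderFactory API (newer versions build steps with a JobRepository instead).

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class BatchConfig {

    // Placeholder domain type for the records being processed.
    public record TransactionRecord(String id, double amount) {}

    @Bean
    public Step processTransactionsStep(StepBuilderFactory steps,
                                        ItemReader<TransactionRecord> reader,
                                        ItemProcessor<TransactionRecord, TransactionRecord> processor,
                                        ItemWriter<TransactionRecord> writer) {
        // Chunk-oriented step: read, process, and write 10,000 records per chunk.
        return steps.get("processTransactionsStep")
                .<TransactionRecord, TransactionRecord>chunk(10_000)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
```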
Conclusion
In this post, we discussed different ways to process a large set of data. Do you know of any other ways? Please comment on this post and I will add them to this list.
With ever-growing distributed systems, you will eventually need to handle large datasets. It's always good to revisit these methods to understand the fundamentals of data processing.