The kind of data generated in every business environment varies, and these data sets only become useful once they are harnessed to give useful insights. Data engineers are the professionals often tasked with building and maintaining key systems that collect, manage and convert these data sets.
The huge amount of data generated in different industries has expanded the data engineering profession to cover a wide range of skills, including web-crawling, distributed computing, data cleansing, data storage, and data retrieval.
Over the years, data storage has become a subject of interest in the data engineering field, thanks to the rise of modern data storage options. Most data engineers and scientists are familiar with SQL databases such as MSSQL, PostgreSQL, and MySQL, but the shift in preference is slowly changing this narrative.
The need for speed, flexibility, and adaptability has also become apparent in data handling, and non-conventional data storage technologies are now coming to market. Several businesses are also embracing storage as a service solution, and the trend is just getting better. Below, we have discussed the three data storages that are increasingly becoming popular among data engineers.
Search engines, documents stores, and columnar stores are the three technologies that are seeing wider adoption in the data handling field. Here’s a quick overview of how they operate and why they are becoming storage options of choice.
When defining data storage in the data engineering field, three critical aspects are used to score the best storage solutions. These are data indexing, data sharing, and data aggregation.
Ideally, each data indexing technique improves specific queries but undermines others. So knowing the kind of queries used can often help you choose the right data storage option.
Data sharding is a process in which a single dataset is split and distributed across multiple databases so they can be stored in various data nodes. The goal is often to increase the total storage capacity of a given system. Sharding determines how the data infrastructure will grow as more data is stored in the system.
On the other hand, data aggregation is the process where data is collected and expressed in a more summarized manner before they are ready for statistical analysis. The wrong data aggregation strategy can limit the performance and the types of reports generated. Below, we’ve broken down the three data storage types based on the data indexing, sharding, and aggregation capabilities.
Search engine storage Elasticsearch is a data store that specializes in indexing texts. Unlike the traditional data stores that create indices based on the values in the field, this storage type allows for data retrieval with only a fragment of the text field. This is also done automatically through analyzers. The latter are modules that create multiple index keys after evaluating the field values and breaking them into smaller values.
Elasticsearch is built on top of Apache Lucene and provides a JSON-based REST API that refers to Lucene features. Scaling is often done by creating several Lucene shards and distributing them to multiple servers/nodes within a cluster. Therefore, each document is routed to its shard through the id field. When retrieving data, the master server sends each shard/ Lucene instance a copy of the query before it finally aggregates and ranks them for output.
Elasticsearch is document-based storage whose content can be bucketed by ranged, exact, or geolocation values. The buckets can also be grouped into finer details through nested aggregation. Metrics such as mean and standard deviations can be calculated easily for every layer, making it easy to analyze several parameters in a single query. However, it suffers the limitation of intra-document field comparisons. A solution is often to inject scripts as custom predicates, a feature that works for one-off analysis but is often unsustainable due to degraded performance in production.
MongoDB is a generic data store with lots of flexibility for indexing a wide range of data. However, unlike Elasticsearch, it’s designed to index the id field by default; hence you’ll need to manually create indices for the commonly queried fields. MongoDB’s text analyzer is also less powerful than that of Elasticsearch.
MongoDB’s cluster contains three types of servers: shard, config, and router. The servers will accept more requests when you scale the router, but most workloads are often directed to the shard servers. Like Elasticsearch, MongoDB documents are routed by default to their specific shards. When you execute a query request, the config server communicates to the router and shards the query. The router server then distributes the query and retrieves the results.
MongoDB’s Aggregation Pipeline is fast and very powerful. It operates on returned data in a stage-wise fashion, where each step can filter, transform and combine documents or unwind previously-aggregated groups. Since the operations are done step-by-step, the final documents are filtered, which minimizes the memory cost. Like Elasticsearch, MongoDB lacks the intra-document field comparison; hence it can’t use distributed computing.
Unlike MongoDB, Elasticsearch, and even the traditional SQL databases, Amazon Redshift doesn’t support data indexing. Instead, it reduces the query time by consistently sorting data on the disk. That is, each table has its sort key that determines how rows have been stored once the data is loaded.
Amazon Redshift’s cluster has one leader node and multiple compute nodes. The leader node computes and distributes queries before sampling intermediate results. Compared to MongoDB’s router servers, this leader node is very consistent and cannot be scaled horizontally. This creates some limitations but allows efficient caching for specific execution plans.
Since Amazon Redshift is a relational database that supports SQL, it’s quite popular among traditional database engineers. It also solves the slow aggregations common with MongoDB when analyzing mobile traffic. However, it doesn’t have the schema flexibility that Elasticsearch and MongoDB have. It’s also optimized for reading operations and hence suffers from performance issues during updates.
From the three alternative storage options above, choosing the ultimate best isn’t as obvious as it may seem. Depending on your unique data storage needs, one storage option is always better than the other. So instead of narrowing down to the ultimate best, you want to compare the different features and capabilities against your needs and then choose those that work best for you.