Big Data Tools: their relevance in storage and analytics


Big Data Technologies

In today’s IT environment, we keep hearing the same stories about Big Data technologies. These technologies allow businesses to quickly and efficiently gain valuable insights into hidden user behavior patterns that are not easily discoverable with traditional techniques.

So, what is Big Data? 

According to Wikipedia, Big Data is “the study and applications of data sets that are too complex for traditional data-processing application software.” In simple terms, Big Data is the collection of large data sets that are stored for analysis and the discovery of insights and patterns.

A simple example of Big Data is storing and analyzing web server logs to discover an application’s peak load times and match that information to user actions to discover potential bottlenecks.
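To make this concrete, here is a minimal sketch of the log analysis described above: counting requests per hour in a web server access log to spot peak load times. The log path and the common/combined timestamp format are assumptions for illustration.

    # Count requests per hour in an Nginx/Apache access log to find peak load times.
    from collections import Counter
    import re

    hits_per_hour = Counter()
    # Matches the timestamp in common/combined log format, e.g. [10/Oct/2023:13:55:36 +0000]
    timestamp = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}):")

    with open("access.log") as log:
        for line in log:
            match = timestamp.search(line)
            if match:
                hits_per_hour[match.group(1)] += 1

    # The five busiest hours are candidate windows for bottleneck analysis.
    for hour, hits in hits_per_hour.most_common(5):
        print(f"{hour}:00 -> {hits} requests")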

Another example is recording all user actions on a web page to predict which users are likely to buy a product, and then offering them a discount to make the deal more attractive.

While at a glance, these examples look different, they are in fact very similar and share the following characteristics:

  • They collect data that is not critical to the application itself.
  • The collected data is analyzed offline and matched against critical data from the application.
  • Patterns are observed in the combined data.
  • Discovered patterns are applied to add value to critical functionality.

With these examples you may wonder, how is Big Data different from traditional data analysis? 

  • For starters, traditionally only critical data was analyzed, and for performance reasons it was stored in SQL-like databases. 
  • Next, the only approach used for predictions was statistical analysis. Functions such as AVG, SUM, and MEDIAN aggregate over all values in a dataset rather than detect patterns in it.
  • Finally, the amount of data available for analysis was limited by the high cost of SQL storage.

Ultimately, the main benefit of the Big Data approach is discovering patterns that cannot be observed from application-level data alone.

With all this information to take into account, it’s easy to identify the following elements of a Big Data pipeline:

  • Data logger: records additional events.
  • Data storage: in Big Data, this is commonly referred to as a Data Lake.
  • Data analysis toolchain: used to gain insights from the data recorded in the Data Lake.
  • Framework or tools to visualize insights or apply them to data in real time.

Oftentimes, vendors provide solutions that cover every stage of the Big Data pipeline in one comprehensive application, but large-scale solutions typically use separate toolchains for data storage and data analysis.

Big Data Analytics Tools

In this article, we will give you a glimpse into the most common Big Data solutions. Now, let’s start with prominent vendors who provide Big Data tools for data analysis:

  • Splunk: This product captures, indexes, and correlates real-time data in a searchable repository, from which it generates graphs, reports, alerts, dashboards, and visualizations. In essence, Splunk is an all-in-one solution with predefined data loggers that feed data into Splunk’s cloud-hosted Data Lake. It then analyzes the data and provides dashboards and alerts based on the patterns it detects. Splunk is trusted worldwide as the gold standard for all-in-one solutions, but it’s a pricey option.
  • Logstash + Elasticsearch + Kibana, also known as the ELK stack, is an open source alternative. Logstash serves as the data logger, Elasticsearch as the storage, and Kibana as the data analysis and visualization tool. A number of hosted solutions are also built on the ELK stack.
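To illustrate the Elasticsearch half of this pipeline, here is a minimal sketch using the official Elasticsearch Python client. The host, index name, and document fields are assumptions; Logstash would normally do the indexing in bulk, and Kibana would run queries like this one behind the scenes.

    from datetime import datetime
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index a single log event into a searchable repository.
    es.index(index="app-logs", document={
        "timestamp": datetime.utcnow().isoformat(),
        "level": "ERROR",
        "message": "payment gateway timeout",
    })

    # Search for recent errors, the kind of query a dashboard is built on.
    result = es.search(index="app-logs", query={"match": {"level": "ERROR"}})
    print(result["hits"]["total"])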

While there are many Big Data analytics tools available, we selected these two popular solutions because, between them, they cover roughly 90% of log-based data analysis.

This approach requires a certain amount of data structure and potentially some modification of your software. But what if you want to blend completely unstructured data that comes from various sources? For example, suppose you want to forecast ice cream sales based on criteria such as the weather, the number of visitors arriving in the city via public transportation, the number of events happening in the city, and so on. In this case, Big Data analytics tools based on Splunk or ELK won’t be of much help, because these sources are third-party components that you cannot instrument to collect data from.

In this case, the best option is a custom solution that imports third-party data from partners or open data providers (such as weather forecasts) into a Data Lake, where it is collected for later analysis.
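A minimal sketch of such an import job is shown below. The open-data URL is hypothetical, and the "lake" here is just a local directory standing in for real Data Lake storage.

    import json
    import pathlib
    import requests

    LAKE = pathlib.Path("datalake/raw/weather")
    LAKE.mkdir(parents=True, exist_ok=True)

    # Pull one day's forecast from a (hypothetical) open data provider...
    resp = requests.get("https://open-data.example.com/weather",
                        params={"city": "Kyiv", "date": "2023-07-01"})
    resp.raise_for_status()

    # ...and store it as-is: in a Data Lake, raw data is kept for later analysis.
    (LAKE / "2023-07-01.json").write_text(json.dumps(resp.json()))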

Data Storage Solutions

The simplest data storage solution is a plain file system holding plain files, but it is neither scalable nor redundant. The usual path is therefore a distributed file system such as the Hadoop Distributed File System (HDFS), the Quantcast File System (QFS), or GlusterFS.

Next, we are going to detail each of these data storage solutions. 


HDFS: This distributed file system is a great approach to solving complex problems in Big Data. It lets users read massive quantities of information from files spread across different computers, which makes processing faster. A single hard drive usually cannot deliver the read performance needed for large amounts of information, such as searching for specific data in binary files. To solve this, multiple servers are connected over high-speed networks and work together, which speeds up reads significantly.
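As a small illustration, here is a sketch using the community "hdfs" (HdfsCLI) Python package, which talks to HDFS over the WebHDFS REST interface. The NameNode URL, user, and paths are assumptions.

    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870", user="hadoop")

    # Write a local log file into the distributed file system.
    with open("access.log", "rb") as local_file:
        client.write("/logs/access.log", data=local_file, overwrite=True)

    # Read it back; HDFS streams the blocks from whichever DataNodes hold them.
    with client.read("/logs/access.log") as reader:
        first_chunk = reader.read(1024)
    print(first_chunk[:80])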

Quantcast File System (QFS): This open source distributed file system was developed by Quantcast as a high-performance alternative to HDFS for large-scale batch processing. Written in C++, it uses erasure coding rather than full replication for fault tolerance, which lets it store the same data in roughly half the disk space of three-way replication.

GlusterFS: A scalable network file system for data-intensive tasks such as cloud storage and media streaming. GlusterFS is a free and open source software solution that runs on commodity hardware. This scale-out network-attached storage file system aggregates various storage servers over Ethernet or InfiniBand RDMA into one large parallel network file system. It is simple to use and provides data reliability and availability through replication.

NoSQL databases

If data arrives in formats such as XML or from a web service, it is better to use semi-structured NoSQL databases such as:

Apache CouchDB: Apache CouchDB lets users access data as needed and provides seamless multi-master sync with a focus on scalable architecture. It has a document-oriented NoSQL architecture, stores data as JSON, and uses JavaScript as its query language with MapReduce. Unlike a relational database, each CouchDB database is a collection of independent documents, each maintaining its own data and self-contained schema.
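Because CouchDB exposes a plain HTTP/JSON interface, a generic HTTP client is enough to use it. A minimal sketch with the requests library follows; the server URL, credentials, and database name are assumptions.

    import requests

    couch = "http://admin:password@localhost:5984"

    # Create a database, then store a self-contained JSON document in it.
    requests.put(f"{couch}/sales")
    resp = requests.post(f"{couch}/sales", json={
        "product": "ice cream",
        "units": 42,
        "date": "2023-07-01",
    })
    print(resp.json())  # e.g. {'ok': True, 'id': '...', 'rev': '...'}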

Couchbase Server: An open source, distributed, multi-model NoSQL document-oriented database that creates, stores, retrieves, aggregates, manipulates, and presents data. Couchbase is designed to provide easy-to-scale key-value or JSON document access with low latency and high sustained throughput.

Apache Cassandra: A distributed database for managing large data sets across many servers. It is one of the best tools available for processing structured data sets, providing a highly available service with no single point of failure.

These are its most prominent capabilities (a brief usage sketch follows the list):

  • Scalability.
  • Straightforward operations.
  • Easy and simple distribution of data.
  • Superior performance.
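Here is a minimal connection sketch with the DataStax cassandra-driver package; the contact point, keyspace, and table are assumptions.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # A keyspace declares how data is replicated across the cluster.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS analytics
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS analytics.events (
            user_id text, event_time timestamp, action text,
            PRIMARY KEY (user_id, event_time)
        )
    """)
    session.execute(
        "INSERT INTO analytics.events (user_id, event_time, action) "
        "VALUES (%s, toTimestamp(now()), %s)",
        ("user-1", "page_view"),
    )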

Apache HBase: Known as the Hadoop database, HBase is a distributed and scalable Big Data store that provides random, real-time read/write access to Big Data. It can host very large tables with billions of rows and millions of columns. In essence, this open source solution is a distributed, versioned, non-relational database that offers linear and modular scalability, automatic table sharding, an easy-to-use Java API, and more.
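From Python, HBase is often reached through its Thrift gateway. A minimal sketch with the happybase package follows; the host, table, and column family are assumptions, and the table is presumed to already exist.

    import happybase

    connection = happybase.Connection("hbase-host")
    table = connection.table("events")

    # HBase cells are raw bytes, addressed by row key and column family:qualifier.
    table.put(b"user-1#2023-07-01", {b"data:action": b"page_view"})
    row = table.row(b"user-1#2023-07-01")
    print(row[b"data:action"])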

MongoDB: An open source, cross-platform NoSQL database that is ideal for storing any type of data. It provides flexibility in cloud-based infrastructures, partitions data across servers in a cloud structure, and uses dynamic schemas to prepare data quickly and in real time.
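The dynamic schema is visible in practice: documents can be inserted without any table definition. A minimal sketch with pymongo; the connection string, database, and collection names are assumptions.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client.analytics.events  # no schema declaration needed

    events.insert_one({"user_id": "user-1", "action": "page_view", "items": [1, 2]})
    for doc in events.find({"action": "page_view"}).limit(5):
        print(doc)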

Redis: An open source, in-memory data structure store that is used as a database, cache, and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, HyperLogLogs, and geospatial indexes with radius queries.
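A minimal sketch with the redis-py client, showing a plain cached string and one of the structures mentioned above, a sorted set with a range query; the host and key names are assumptions.

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # A plain string value used as a cache entry with a 60-second TTL.
    r.set("cache:homepage", "<html>...</html>", ex=60)

    # A sorted set keyed by score, queried by score range.
    r.zadd("leaderboard", {"alice": 120, "bob": 95, "carol": 210})
    print(r.zrangebyscore("leaderboard", 100, 250))  # everyone scoring 100-250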

Cloud solutions for NoSQL storage include:

mLab: A cloud database service that hosts MongoDB databases. It runs on cloud providers such as Amazon, Google, and Microsoft Azure, lets you run MongoDB in the cloud with minimal hassle, and offers free accounts and cost-effective plans.

Redis Labs: An in-memory, high-performance database service that supports multiple data structures, built-in replication, multiple levels of on-disk persistence, high availability, and automatic partitioning. Redis Labs gives you remote access to a Redis database, sparing you the work of installing and configuring Redis on your own servers.

Google Datastore: A high-quality NoSQL data storage solution that supports ACID transactions, SQL-like queries, indexes, and much more.
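A minimal sketch with the google-cloud-datastore client library; the project ID, kind, and fields are assumptions, and credentials are taken from the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS).

    from google.cloud import datastore

    client = datastore.Client(project="my-project")

    # Entities are schemaless records grouped by "kind"; writes are transactional.
    task = datastore.Entity(key=client.key("Task"))
    task.update({"description": "analyze logs", "done": False})
    client.put(task)

    # SQL-like filtering via the query API.
    query = client.query(kind="Task")
    query.add_filter("done", "=", False)
    print(list(query.fetch(limit=5)))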

As you can see from the lists above, many of these tools come from the Apache Software Foundation. The Apache ecosystem, with Hadoop as its central product, is considered the de facto gold standard for Data Lake implementations, and it includes more than 30 projects that provide various services for data access and storage.

Building a Data Lake from scratch is a hard and complex project, which is why cloud providers such as Amazon and Microsoft offer ready-made Data Lake infrastructure. Some examples:

Amazon Cloud Services

Amazon S3: This cloud computing web service provides object storage via web service interfaces such as REST, SOAP, and BitTorrent. It stores, collects, and retrieves any amount of data from any data source. It provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements, while offering the flexibility to manage data for cost optimization, access control, and compliance. You can also run powerful analytics directly on data at rest in S3.
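A minimal sketch of using S3 as raw Data Lake storage with boto3; the bucket name is an assumption, and AWS credentials are taken from the environment or ~/.aws/credentials.

    import boto3

    s3 = boto3.client("s3")

    # Store a raw log file in the Data Lake bucket, then list what is there.
    s3.upload_file("access.log", "my-datalake-bucket", "raw/access.log")
    for obj in s3.list_objects_v2(Bucket="my-datalake-bucket").get("Contents", []):
        print(obj["Key"], obj["Size"])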

Amazon Elastic Block Store (EBS): Amazon EBS provides persistent block storage volumes for Amazon EC2 instances within the AWS Cloud. Each Amazon EBS volume is automatically replicated within its Availability Zone to protect against component failure while offering high availability and durability. It offers consistent, low-latency performance across a wide range of workload sizes.
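Provisioning a volume is a single API call. A minimal sketch with boto3; the region, Availability Zone, size, and volume type are assumptions.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    volume = ec2.create_volume(
        AvailabilityZone="us-east-1a",  # EBS volumes live in a single AZ
        Size=100,                       # GiB
        VolumeType="gp3",
    )
    print(volume["VolumeId"], volume["State"])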

Amazon DynamoDB: This NoSQL database service, provided as part of the AWS portfolio, supports key-value and document data structures. DynamoDB uses synchronous replication across data centers for high durability and availability.
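A minimal sketch with the boto3 DynamoDB resource API; the table name and key schema are assumptions, and the table is presumed to already exist.

    import boto3

    dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
    table = dynamodb.Table("events")

    # Key-value access: write one item, then fetch it back by its full key.
    table.put_item(Item={"user_id": "user-1",
                         "event_time": "2023-07-01T13:55:36",
                         "action": "page_view"})
    item = table.get_item(Key={"user_id": "user-1",
                               "event_time": "2023-07-01T13:55:36"})
    print(item.get("Item"))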

Microsoft Cloud Services

Azure Cosmos DB: This database allows users to build fast, scalable applications with native support for NoSQL. It’s built as a solution with global distribution and horizontal scale at the core of its capabilities. It offers turnkey global distribution by scaling and replicating data wherever users are located. It scales all around the globe and provides native support for NoSQL and OSS APIs such as MongoDB, Cassandra, and more.
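A minimal sketch with the azure-cosmos Python SDK; the account endpoint, key, database, and container names are assumptions.

    from azure.cosmos import CosmosClient

    client = CosmosClient("https://my-account.documents.azure.com:443/",
                          credential="<primary-key>")
    container = client.get_database_client("analytics") \
                      .get_container_client("events")

    # Documents are plain JSON; "id" is the required document identifier.
    container.upsert_item({"id": "evt-1", "user_id": "user-1",
                           "action": "page_view"})
    for item in container.query_items(
            query="SELECT * FROM c WHERE c.action = 'page_view'",
            enable_cross_partition_query=True):
        print(item)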

For the data analysis and visualization toolchain, the choice of software and frameworks spans some 3,000 well-known tools, which makes them rather challenging to cover in a single article. The exact toolchain will differ depending on the insights sought from the Big Data and on the environment, so data scientists typically assemble their own preferred set of tools for data analysis and cleanup. The cloud services described above only scratch the surface; many more are available to suit the needs of a data scientist.

Conclusion

Based on the type of project you are working on, you may require different types of Big Data tools. 

For example, if you are building your own application from scratch, it’s recommended to use ELK or Splunk as the data logger and then analyze the data from the centralized storage. But if you’re interested in adding custom data from various data sources that you don’t own or control, a cloud-hosted Data Lake is the better choice.

Unless your data is in the petabyte range, there’s no need to build a private Data Lake: the maintenance costs are sky high, and the infrastructure adds unnecessary complexity to your solution.

It’s critical to understand what insights you want to gain from the data you collect and which techniques you should use to achieve this. At Svitla Systems, we provide extensive knowledge on the subject with highly skilled data scientists who are there to support your cloud business needs for data analysis and storage.

Reach out to us and a Svitla expert will be happy to assist you and further your projects with the right Big Data technologies and techniques that are tailored for the success of your business.