This lesson discusses material from chapter 14. Objectives
important to this lesson:
Big data
Hadoop
NoSQL
Data analytics
Concepts:
Big Data
This is the last chapter to cover in this text. The first
topic is Big Data, which the text has a hard time defining. It seems to
be characterized, but not quantified, partly because hardware and
software solutions keep changing. Things that are hard to do get easier
with better hardware and software, so any measurements given in the
text would only hold true for a relatively short time.
What the author can do is explain that Big Data is
characterized by being hard to handle in three different ways.
Each of them can be remembered with a word that starts with the letter v:
Volume - This refers to a body of data that is hard
to handle with the available technology because there is so much
of it. The text remarks that Google and Amazon felt this problem early
in their operations due to their success and continuing popularity, and
the volume of data they keep, and that they provide to their increasing
number of users.
Velocity - The text explains that this refers to the
rate at which data is added to and changed in the
organization's information systems. Again, think of a large vendor with
an ever changing set of data that it provides to customers (or the
public), some of which is new, some of which is old, and much of which
must be updated quickly and regularly. The text discusses Amazon's
ability to track all the items a customer has browsed, in addition to
the ones that were actually ordered. This may help you think about the
increase such tracking causes to volume as well as to velocity.
Variety - This is about having data that does not
have a common structure, which leads to our example
organization having to handle more kinds of data, obtained from many
sources, and stored in a variety of ways. Previously in the text, we
encountered the ideas of structured and unstructured data. Big Data
requires that the system have the ability to process unstructured data,
data that has not been confined to tables built according to business
rules. If the system can apply structure and interpretation to data
when it is searched, that data can still be used in the database. A term the
chapter introduces is polyglot persistence, which taken
literally would mean continuing to use multiple languages. The phrase is not really
about many languages, but about many data types, and many ways of
managing the different types.
The text provides some details about data collected by the
Disney company about each current guest in one of their parks, pointing
out that such data changes continuously during each person's experience
of that park. It makes you wonder about the advisability of keeping Big
Data like that.
As the body of data that needs to be processed continues to
grow, the text discusses two standard methods of handling the increased
load.
Scaling up - Adding RAM and installing better
processors are two classic methods to scale up a system, increasing
its capacity by improving existing hardware.
Scaling out - Adding more hardware, such as creating
a new cluster of servers to handle increasing loads, is an example of
scaling out, adding new hardware to improve a system by making it
larger. The text warns that clustering does not fit well with the
design of a relational DBMS, which is based on having central control
over all the data being processed.
On pages 652 and 653, the text describes two kinds of data
processing that affect the velocity aspect of data.
Stream processing analyzes data as it comes in,
discarding data that is not needed based on functions that have been
preset for the type of data. This reduces the amount of data that will
actually be saved and searched later. (There is a sketch of this idea
after the next paragraph.)
Feedback loop processing analyzes data that is
already stored, asking the user if a particular sort of data is useful,
then using the response to choose what to present to the user next.
This is similar to what happens when YouTube shows you a list of videos
you might want to see next, then modifies the list based on the choice
that you make.
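To make stream processing concrete, here is a minimal Python sketch. The record fields, the "purchase" rule, and the dollar threshold are all invented for illustration; a real system would apply whatever preset functions fit its own data types.

def keep_record(record):
    # Preset rule for this (hypothetical) data type: only purchases
    # over $10.00 are worth storing for later analysis.
    return record["type"] == "purchase" and record["amount"] > 10.00

def process_stream(incoming):
    saved = []
    for record in incoming:
        if keep_record(record):
            saved.append(record)  # kept for later storage and searching
        # records that fail the test are simply discarded on arrival
    return saved

events = [
    {"type": "purchase", "amount": 25.00},
    {"type": "page_view", "amount": 0.00},
    {"type": "purchase", "amount": 3.50},
]
print(process_stream(events))  # only the $25.00 purchase survives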
On page 654, the text introduces other factors that add to the
V-problems listed above. Note that they apply to all data processing,
not just Big Data.
Variability - This is different from variety. It
means the degree to which the meaning of data varies, depending on who
is looking at it and why. This is true of all data in general. An
accountant sees an account receivable as an asset, but the manager
restocking a warehouse sees it as money that can't be used by the
business. The text offers an example of a phrase that could be meant
literally by a speaker/customer, or could be meant ironically. A
machine can't tell, but a human may get the point.
Veracity - This is the degree to which we trust
data. Can we trust customer satisfaction scores that are older than
(fill in the blank)? We should realize that some data represent facts,
and other data represent opinions which can change.
Value, Viability - Is the data actually
useful to the organization? Survey results are particularly prone to
error if the survey is not tested on a focus group. If we are
collecting data that is of no use to us, we probably should not be
collecting it, much less analyzing it. Beware of the old warning about
data: garbage in, garbage out.
Visualization - Can the data be presented in a way
that leads to good information? A good chart, graph, or model may help
us recognize a truth that a mere column of numbers may not.
Hadoop
The second section of
the chapter opens with a discussion of Hadoop. Let's get past the silly
name: it is named after a toy elephant belonging to the
son of one of the technology developers, Doug Cutting. Hadoop is a
Java-based technology for handling large amounts of data with clusters
of computers. It is an open source tool belonging to the Apache
Software Foundation (ASF). It has two major components. Both are
based on papers written by Google employees in 2003 and 2004. (See the
article behind the link provided in this paragraph.)
Hadoop Distributed File System (HDFS) - A
file system that is made to handle terabytes of information
that is replicated across multiple computers. It can support larger
volumes of data as well. Hadoop uses very large data blocks,
reads entire files as streams, and, according to our
text, writes files that cannot be updated, but may have additional data
appended. There seems to have been an update
to Hadoop to allow file editing, noted
in this online Q/A. Otherwise, changing a file means rewriting
the whole file, not part of it.
Hadoop systems have three kinds of nodes: client nodes,
data nodes, and a name node that manages
connection between client and data nodes. Each file that is added must
have data about its location, and its replicas' locations, stored in the
name node. Each data node sends a block report every six hours
to the name node, updating what data blocks are stored on that
data node. Not often enough? Each data node also sends a heartbeat
signal to the name node every three seconds, to let the name node know
the data node is still functioning. A missing heartbeat will cause the
name node to tell remaining data nodes to redistribute data as needed
to maintain multiple data copies.
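Here is a small Python sketch of the bookkeeping described above. The NameNode class, the node names, and the block lists are all hypothetical; the real HDFS name node is far more involved, but the three-second heartbeat rule works the same way.

import time

HEARTBEAT_INTERVAL = 3  # seconds, per the text

class NameNode:
    def __init__(self):
        self.last_heartbeat = {}  # data node name -> time last heard from
        self.block_map = {}       # data node name -> blocks it reports

    def receive_heartbeat(self, node):
        self.last_heartbeat[node] = time.time()

    def receive_block_report(self, node, blocks):
        self.block_map[node] = set(blocks)

    def dead_nodes(self):
        # Any node silent longer than the heartbeat interval is presumed gone.
        now = time.time()
        return [n for n, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_INTERVAL]

    def blocks_to_redistribute(self):
        # Blocks held by missing nodes need new replicas on surviving nodes.
        return {b for n in self.dead_nodes()
                for b in self.block_map.get(n, ())}

nn = NameNode()
nn.receive_block_report("node_a", ["blk_1", "blk_2"])
nn.receive_heartbeat("node_a")
# After three seconds of silence, "node_a" would appear in nn.dead_nodes(),
# and blk_1 and blk_2 would appear in nn.blocks_to_redistribute().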
MapReduce - A model for writing programs to handle
distributed processing of data. In its current form, we can think of
MapReduce as an API that provides support for distributed data
processing. The text goes into a lot of detail that will be interesting
to some of you. We can leave it alone for now.
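For those who want the flavor of the model without the detail, here is the classic word-count example, sketched in Python: a map step that emits (key, value) pairs, a shuffle that groups the pairs by key, and a reduce step that combines each group. Real Hadoop jobs are normally written in Java against the MapReduce API; this only shows the shape of the idea.

from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the input.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine the grouped values for one key into a single result.
    return (key, sum(values))

docs = ["big data is big", "data about data"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}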
NoSQL
After the long section about Hadoop and its add-ons with silly
names (Pig, Hive, Impala, Sqoop, Flume), the author remarks that NoSQL
is an unfortunate name. It refers to technologies used to access data
that is not stored in relational databases. Such systems can, in fact,
support SQL in their own way, although none seem to support the ANSI
standard.
Most of the NoSQL products fit into one of four types. The
table on page 663 lists some examples of each type. Don't be surprised
if you have never heard of any of them. These are the types:
Key-value databases - This type of database assigns
a series of keys to particular "values". Value is a poor
word choice. In these databases, the values can be entire documents,
files, or other data types. The pairs are not kept in tables; they are
kept in buckets. There are no relationships from one bucket to
another. Operations specify the name of a bucket and
the name of a key. Three operations are used: get (or fetch),
store, and delete. The text shows an example
of a bucket with three keys, and three key values. It warns us that
this is being displayed in a table, but the actual bucket is not a
table.
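A quick Python sketch of those three operations may help. The bucket name, keys, and values here are hypothetical, and a plain dictionary stands in for the bucket:

buckets = {"customer_files": {}}  # one bucket; no relationships to others

def store(bucket, key, value):
    buckets[bucket][key] = value

def get(bucket, key):  # the text also calls this operation "fetch"
    return buckets[bucket].get(key)

def delete(bucket, key):
    buckets[bucket].pop(key, None)

# Values can be entire documents, files, or other data types.
store("customer_files", "cust_1001", b"...contents of a scanned contract...")
store("customer_files", "cust_1002", {"name": "Ann", "tier": "gold"})
print(get("customer_files", "cust_1002"))
delete("customer_files", "cust_1001")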
Document databases - It is not clear why this is a
separate type. This type uses key-value pairs, but the values
are always documents. More features are available than in
key-value databases. Documents have tagged sections, which may
correspond to particular parts of the document, or to particular
information. Key-value pairs for particular kinds of documents are put into collections,
which are like buckets. (This may sound familiar if you have used a
recent copy of SharePoint.) Operations require a collection
name and a key name to retrieve a document. Tags can
also be used in retrieval operations, using them like attribute names
in SQL.
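Here is a short Python sketch of that retrieval style, with a hypothetical collection, keys, and tags. Note how filtering on a tag works much like an attribute in an SQL WHERE clause:

collections = {
    "invoices": {  # a collection is like a bucket for one kind of document
        "inv_001": {"customer": "Ann", "total": 120.00, "status": "paid"},
        "inv_002": {"customer": "Raj", "total": 80.00, "status": "open"},
    }
}

def get_document(collection, key):
    # Retrieval requires a collection name and a key name.
    return collections[collection][key]

def find_by_tag(collection, tag, value):
    # Tags can be used in retrieval, like attribute names in SQL.
    return [doc for doc in collections[collection].values()
            if doc.get(tag) == value]

print(get_document("invoices", "inv_001"))
print(find_by_tag("invoices", "status", "open"))  # like WHERE status = 'open'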
Column-oriented databases - Confusing as it may be,
the text tells us that this term is applied to two different database
technologies.
The text explains that relational tables are
usually stored in data blocks, each block containing some number
of rows of a table. A column-oriented database will store
each column of data in one or a few data blocks, which is more
efficient if you are conducting the kind of data processing that
requires you to read entire columns at a time. In a row-oriented
database, that would require you to read the entire file.
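This little Python sketch shows the layout difference. The table and its values are invented; the point is that in the column-oriented layout, summing one column touches only that column's data:

# Row-oriented: blocks hold whole rows, so reading every "total"
# means reading past the name and state values in every row.
rows = [
    ("Ann", "NY", 120.00),
    ("Raj", "CA", 80.00),
    ("Lee", "TX", 45.00),
]
print(sum(r[2] for r in rows))

# Column-oriented: each column is stored together, so summing "total"
# reads one run of data and skips the other columns entirely.
columns = {
    "name":  ["Ann", "Raj", "Lee"],
    "state": ["NY", "CA", "TX"],
    "total": [120.00, 80.00, 45.00],
}
print(sum(columns["total"]))  # same answer, 245.0, from one column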
The second type of column-oriented database is called a
column family database. Some examples are
Google's BigTable and Facebook's Cassandra. The example on page 667
shows a less-than-clear association of column names and data stored
in separate rows. There are rows? Sort of. There are rows, but rows do
not all hold the same data. If this is giving you the headache it gives
me, take a look at this
blog site about databases. Its author explains that in
Cassandra, rows are the only things that are the same. Follow the link
for more, if you like:
MySQL                            Cassandra
Database Instance                Cluster
database                         keyspace
table                            column family
rows                             rows
columns (same in every row)      columns (can be the same, but can be different in every row)
In this sort of database, columns within a column family can be grouped as super
columns. A super column is a group of related columns,
like all the columns that hold the parts of an address, or all the
columns that hold parts of a customer's name. The text mentions that
you can have super columns or regular columns in a column family, but
not both.
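Here is a rough Python sketch of rows in a column family, with hypothetical keys and values. Notice that the rows do not all hold the same columns, and that a super column simply groups related columns:

# One column family: every row has a key, but not the same columns.
customers = {
    "cust_1001": {"name": "Ann", "city": "Albany", "phone": "555-0101"},
    "cust_1002": {"name": "Raj", "email": "raj@example.com"},
    "cust_1003": {"name": "Lee", "city": "Troy"},
}

# A column family using super columns: each super column ("ship_to",
# "bill_to") groups the related columns that hold parts of an address.
orders = {
    "order_9001": {
        "ship_to": {"street": "1 Main St", "city": "Albany", "zip": "12201"},
        "bill_to": {"street": "9 Elm Ave", "city": "Troy", "zip": "12180"},
    }
}

Per the text, a single column family can hold super columns or regular columns, but not both, which is why the sketch keeps them in two separate families.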
Graph databases - This one is a little hard to
understand from the material in the text. A better short explanation is
found on an Amazon Web Services page, which says that
you have several nodes/vertices that seem to be instances
of entities. They are linked by directional edges (lines
with arrowheads) that show relationships such as "likes" or
"has", as well as other properties. Take a look at the example
from Amazon, then look at this one from Wikipedia.
In the example above, you see three nodes that are about two
people and one group. The edges describe the people knowing
each other and being members of the same group. This example is meant
to show the potential for using this kind of database in a social
network environment. In the Amazon example, there is only one edge
between each pair of nodes, but in this one there is an edge going in
each direction between each pair. Now imagine lots of people and lots
of groups in a similarly constructed graph.
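A small Python sketch of that structure may help, using the two-people-and-a-group example with invented names and edge labels:

# Vertices are instances of entities; edges are directed and labeled.
vertices = {"alice", "bob", "chess_club"}

edges = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "alice"),  # one edge in each direction, as above
    ("alice", "member_of", "chess_club"),
    ("bob", "member_of", "chess_club"),
]

def neighbors(vertex, label):
    # Follow every edge with the given label out of a vertex.
    return [dst for src, lbl, dst in edges
            if src == vertex and lbl == label]

print(neighbors("alice", "knows"))      # ['bob']
print(neighbors("alice", "member_of"))  # ['chess_club']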
This is a pretty good talk about graph databases, available on
YouTube.
Data Analytics
The last topic in the chapter is a connection back to chapter
13. As it has already been discussed, we can leave it alone.
Assignments for Module 14:
Review all chapters.
There is a numbered Assignment, partially for this
chapter, that began two weeks ago. It is Assignment 8. A link to a Word
document containing the questions can be found in the Module 12 folder for this course on
Blackboard. There are also questions in it that relate to chapter 13.
This assignment is due before class in week
15.
Download the
document, complete it, and save it with a new name that includes your login
ID. That version is the one you should upload for a grade.
Regarding this assignment and all the others, do not simply quote the
material from my notes or the text. Use your own words to show your
understanding of the concepts.
These are individual assignments. Duplicate work will not be counted as
completed work.