Introduction to Big Data
Machine Learning
Jesus A. Gonzalez
August 10, 2019
Introduction
- How much information can we process in our computers?
- How much can we increase the processing power of our computers?
- What type of data are we working with?
- Relational DBs
- Structured data
- Facebook
- Twitter
- RFID sensors
- Often all of these at the same time
Introduction
- The amount of data that we need to work with
- 1 TB
- 10 TB
- 100 TB
- more…
Hadoop
- Hadoop is
- An open-source project
- From the Apache Software Foundation
- Written in Java by Doug Cutting
- Named Hadoop after his son's toy elephant
- Based on technology published by Google
- Used to manage large amounts of data in parallel
- Different types of data
- Structured
- Unstructured
- Semi-structured
- Uses inexpensive hardware in a convenient way
Hadoop
- Really efficient
- Works in batch mode to deal with large amounts of data
- Replicates data across several computers
- Reliable: if one node fails, data is taken from another node
- It is not designed for
- OLTP
- OLAP
- Decision Support Systems (DSS)
- It is designed for
- Big data
- Complements OLTP and OLAP
- Doesn’t replace a relational DB
What is Big data?
- Explosion in amount of data collected
- Big data
- Describes large data collections (large datasets)
- Can be unstructured
- Grow so large, so quickly, that it is difficult to manage them with a conventional DB or with standard statistical tools
What is Big data?
- Interesting statistics and examples of big data
What is Big data?
- With these large amounts of data businesses require
- Speed
- Reliability
- Deeper data analysis
- These needs drive Big Data solutions
Hadoop Examples
- Telecommunications Industry
- China Mobile
- Hadoop cluster used for
- Data mining of call data records
- 5-8 TB of data produced per day
- Could process 10 times more data than with the previous system
- At \(\frac{1}{5}\) of the previous cost
Hadoop Examples
- The media industry
- The New York Times
- They wanted to make all their public-domain articles (from 1851 to 1922) available
- Converted 11 million files and images of articles into 1.5 TB of PDF documents
- Implemented by one employee
- One job running 24 hours on a Hadoop cluster of 100 instances on Amazon EC2
- Low cost
Hadoop Examples
- Technology industry
- IBM ES2
- Search technology based on Hadoop, Lucene, and Jaql
- Business challenges such as: vocabulary, abbreviations, acronyms
- Can perform data mining tasks to create
- Acronym libraries
- Regular expression patterns
- Geo-classification rules
Hadoop Examples
- Technology industry
- Internet companies and social networks: Yahoo, Facebook, Amazon, eBay, Twitter, StumbleUpon, Rackspace, Ning, AOL, and many more
Hadoop Examples
- Yahoo is one of the most prolific users
- Its applications run on large Hadoop clusters
- It is the largest Hadoop contributor
Hadoop doesn't solve every problem
- Hadoop is not good for:
- Processing transactions (random access)
- When our job cannot be parallelized
- For low response times (low latency)
- To process large amounts of small files
- For intensive computation with a small amount of data
Solutions for Big Data
- More than just Hadoop
- We add the Business Intelligence and Data Analysis functionalities
- Used to create valuable information
- Able to combine structured data (already existing in corporations) with unstructured data
- Able to generate information from data in motion
- IBM InfoSphere Streams
- Determine customer sentiment towards a new product
- Based on Facebook or Twitter comments
Big Data and the Cloud
- The cloud has become very popular
- Combines very well with Big Data solutions
- In the cloud we can configure a Hadoop cluster in minutes, on demand, use it as needed, and pay only for what we use
Hadoop Architecture
- Terms
- HDFS
- MapReduce
- Types of nodes
- Topology
- Writing an HDFS file
Hadoop Architecture
- Node
- Rack
- A collection of nodes
- 30 to 40 nodes, located physically close to each other
- All connected to the same network switch
- Bandwidth between two nodes in the same Rack is greater than the bandwidth between two nodes in different Racks
- Cluster
- A collection of Racks
Hadoop Architecture
- Two main components
- DFS (Distributed File System)
- Several options
- The main one is HDFS
- MapReduce engine
- To perform computations over the data stored in the DFS
Hadoop Distributed File System (HDFS)
- Runs on top of the existing file system
- In each node of the Hadoop cluster
- Designed for a very particular access pattern
- Very large files
- Streaming data access
- Works better with large files
- The larger the block, the less time is spent seeking for the next data segment
- Works near the limit of the disk access bandwidth (minimizes seeks by using large files)
- Hadoop seeks to the beginning of a block and continues reading sequentially from there
- Uses blocks to store a file or parts of a file
Hadoop Distributed File System (HDFS)
- Blocks
- Large blocks
- 64MB (default), 128MB (recommended)
- Compared to a UNIX block (4KB)
- Robust
- One HDFS block is backed by multiple OS-level blocks (see the configuration sketch below)
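As a rough illustration (assumed, not from the slides), the block size is controlled by a standard HDFS property, normally set in hdfs-site.xml but shown here through the Java Configuration API:

```java
// Sketch: configure the HDFS block size programmatically.
import org.apache.hadoop.conf.Configuration;

public class BlockSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 128 MB blocks, the size recommended above ("dfs.blocksize" is the
        // standard property name for Hadoop 2.x and later).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
    }
}
```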
Hadoop Distributed File System (HDFS)
- Advantages
- Fixed size (easier to compute how many fit on a disk)
- A file can be larger than any single disk in the network
- Its blocks can be stored across more than one node
- The last block of a file can be smaller than the block size, saving space
- It is easy to implement replication in order to provide
- Fault tolerance
- Availability
Hadoop Replication
- A block can be replicated in multiple nodes
- Even with node failures, data is not lost
- Replication can be configured to use more than two nodes (see the sketch below)
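A minimal sketch (assumed, not from the slides) of configuring replication with the Java API; the file path is hypothetical:

```java
// Sketch: set the default replication factor and change it for one file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Three replicas per block ("dfs.replication" is the standard property).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of a file that is already stored.
        fs.setReplication(new Path("/user/demo/data.txt"), (short) 3);
    }
}
```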
MapReduce Engine
- HDFS is based on Google technology (the Google File System paper)
- MapReduce is also inspired by a paper published by Google
- Consists of two data transformations that can be applied several times
- The Map and Reduce functions
- A MapReduce job is divided in Map tasks that are executed in parallel (independent from each other) and Reduce tasks
Types of Nodes
- HDFS nodes
- MapReduce nodes
- Other secondary nodes
- Secondary NameNode
- Checkpoint Node
- Backup Node
Types of Nodes
Types of Nodes: NameNode
- Only one NameNode per Hadoop cluster in earlier versions
- More than one in newer versions
- Manages the namespace of the filesystem plus the metadata
- Single point of failure in earlier versions of Hadoop (not anymore)
- This point of failure was mitigated by using NFS
- It is recommended to use a dedicated node for this task
- The node with the best configuration should be the NameNode
- It needs a good amount of memory
Types of Nodes: DataNode
- Many of them for each Hadoop cluster
- Manages the data blocks and works as a client service
- Periodically reports to the NameNode the list of blocks stored
- Can use commodity hardware (non-specialized, inexpensive)
- Data replication is done at the software level
Types of Nodes: JobTracker
- One per Hadoop cluster
- Receives job requests sent by clients
- Schedules and monitors the MapReduce jobs with the TaskTrackers
- Monitors failures, looking for tasks that need to be re-scheduled on another TaskTracker
Types of Nodes: TaskTracker
- Many of them for each Hadoop cluster
- To achieve the parallelism of the MapReduce tasks
- Execute the MapReduce operations
- Starts Java virtual machines to execute a Map or Reduce task
Topology Awareness
- Hadoop knows the network topology
- Allows optimizing the allocation of jobs according to the data location
- Allocates each task as close to the data as possible, to maximize bandwidth while reading the data
Topology Awareness
- 1st option
- Assign the Map task to the TaskTracker running on the same node as the data
- 2nd option
- Assign to a node in the same Rack as the data
- Worst option
- Assign to a node in a Rack different from the one in which the data is located
- Recommended: configure Rack Awareness (see the configuration sketch below)
- Try to execute the task in the TaskTracker node with the highest bandwidth to the data
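A minimal sketch (assumed, not from the slides) of enabling Rack Awareness: Hadoop is pointed at a topology script that maps a node's address to a rack ID. The property is normally set in core-site.xml; the script path here is hypothetical:

```java
// Sketch: script-based rack resolution.
import org.apache.hadoop.conf.Configuration;

public class RackAwarenessExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Standard property for script-based topology mapping; the script
        // receives IPs/hostnames and prints rack IDs such as /rack1.
        conf.set("net.topology.script.file.name", "/etc/hadoop/topology.sh");
    }
}
```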
Writing a File to HDFS
- The client sends a create request to the NameNode
- The NameNode verifies whether the file already exists
- If not, it allows creating the file
- The NameNode determines the node in which the first block will be allocated (B1)
- If the client is running on a DataNode, the block is placed there; otherwise, another node is chosen at random
- By default, the data is replicated to two more nodes in the cluster
- Acknowledgements of the replicas and DataNodes are sent back to the client
- The process repeats for each block
- Each block is replicated across at least 2 Racks
- When the client has received the acknowledgements, it signals that the write is complete (see the API sketch below)
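A minimal sketch (assumed, not from the slides) of the client side of this flow, using the HDFS Java FileSystem API; the path and content are hypothetical:

```java
// Sketch: write a small file to HDFS; block placement, replication, and
// acknowledgements are handled transparently by the NameNode and DataNodes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The create request goes to the NameNode; data then streams to DataNodes.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.writeUTF("Hello, HDFS!");
        }
    }
}
```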
Command Line Interface
- copyFromLocal: hadoop fs -copyFromLocal <localsrc> <dst> (copies files from the local file system into HDFS)
- put: hadoop fs -put <localsrc> <dst> (same idea as copyFromLocal)
- copyToLocal: hadoop fs -copyToLocal <src> <localdst> (copies files from HDFS to the local file system)
- get: hadoop fs -get <src> <localdst> (same idea as copyToLocal)
- getmerge: hadoop fs -getmerge <src> <localdst> (merges the files in an HDFS directory into a single local file)
- setrep: hadoop fs -setrep <rep> <path> (changes the replication factor of a file)
MapReduce
- Decompose operations into Map and Reduce
- Comes from functional programming languages
- These languages allow passing functions as arguments to other functions
- Example:
- A for loop that doubles each element of an array
- Map operation
- Doing something to each element of an array is a common operation
- We can express this with functional programming
MapReduce
- We can generalize our code so that this function becomes a Map function (see the sketch below)
- We can create map functions for many types of operations
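A minimal Java sketch (assumed, not the slides' original code) of this generalization: the concrete doubling loop and the generic map that takes the function as an argument:

```java
// Sketch: from a concrete for loop to a generalized Map operation.
import java.util.function.IntUnaryOperator;

public class MapExample {
    // Concrete loop: double every element of the array.
    static void doubleAll(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * 2;
        }
    }

    // Generalized Map: apply any function fn to every element.
    static void map(int[] a, IntUnaryOperator fn) {
        for (int i = 0; i < a.length; i++) {
            a[i] = fn.applyAsInt(a[i]);
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        map(data, x -> x * 2); // same effect as doubleAll(data)
    }
}
```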
MapReduce
- Advantage of the Map operation
- When we want to execute a function on very large DBs
- We can take advantage of thousands of computers working in parallel
- Fast processing
- We use a parallel implementation of the Map function instead of parallelizing the whole program
Reduce Operation
- Similar to the Map operation
- Example: sum the elements of an array
- We can do it with a for loop
- We can also generalize
Reduce Operation
- The function fn takes the running sum s and the current array element \(a[i]\) as arguments
Reduce Operation
- We can make the sum function take fn as an argument
Reduce Operation
- Now we have a generalized Reduce operation (see the sketch below)
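A minimal Java sketch (assumed, not the slides' original code) of the same idea for Reduce: the concrete summing loop and the generalized version that takes fn and an initial value:

```java
// Sketch: from a concrete summing loop to a generalized Reduce operation.
import java.util.function.IntBinaryOperator;

public class ReduceExample {
    // Concrete loop: sum all elements of the array.
    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s = s + a[i];
        }
        return s;
    }

    // Generalized Reduce: combine the elements with any function fn,
    // starting from an initial value.
    static int reduce(int[] a, int initial, IntBinaryOperator fn) {
        int s = initial;
        for (int i = 0; i < a.length; i++) {
            s = fn.applyAsInt(s, a[i]);
        }
        return s;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(reduce(data, 0, (s, x) -> s + x)); // 10, same as sum(data)
        System.out.println(reduce(data, 1, (s, x) -> s * x)); // 24, another combination
    }
}
```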
Reduce Operation
- We can use the Reduce operation for other operations
- Combine the values of an array in a certain way (sum, concatenation, etc.)
Reduce Operation
- Advantages of the Reduce operation
- Also good to deal with large amounts of data
- We parallelize our code only for the Reduce function
- We don't modify the whole program, only the Reduce functions
Hadoop MapReduce
- This is what Hadoop MapReduce does
- The map and reduce implementations are parallel, distributed, fault tolerant, and aware of the network topology
- Map and reduce are very efficient with large amounts of data (see the word-count sketch below)
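As a concrete (and standard) illustration of what such a job looks like, here is a word-count sketch in the Hadoop Java API; it is not code from the slides:

```java
// Sketch: word count with Hadoop MapReduce.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The Mapper receives (offset, line) pairs and emits (word, 1) pairs.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The Reducer receives (word, [1, 1, ...]) after the shuffle and emits
    // (word, total count).
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```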
Example of a MapReduce Job
- 8 main steps
- The MapReduce program asks the JobClient to execute a MapReduce job
- The JobClient sends a message to the JobTracker, which assigns a unique ID to the job
- The JobClient copies the resources of the job to the shared filesystem (such as HDFS)
- The JobClient asks the JobTracker to start the job
- The JobTracker initializes the job
- It computes how to divide the data into input Splits for the different mapper processes, to maximize throughput
- It retrieves these input Splits from the distributed filesystem
- TaskTrackers continuously send heartbeat messages to the JobTracker; when the JobTracker has work for them, it returns a map or reduce task as the answer to those messages
- TaskTrackers need to obtain the code to execute; they get it from the shared filesystem
- TaskTrackers start a virtual machine with a child process, and this child process executes the map or reduce code (see the driver sketch below)
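A minimal driver sketch (assumed, not from the slides) showing the client side of the first steps: configuring and submitting the word-count job defined above; the input and output paths are hypothetical:

```java
// Sketch: configure and submit a MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submits the job and waits; scheduling, data locality, and task
        // execution are handled by the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```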
Example of a MapReduce Job
MapReduce Operations in Sequence
- The job has only one Map step and only one Reduce step
- The Map step is executed first
- Takes a subset of the input data (an input Split) and applies map to each row
- Such as multiplying by 2 in the example
- There can be multiple Map operations running in parallel, each with a different input Split
- The output goes to memory and is then stored to disk
- It is sorted and partitioned by key (using the default partitioner)
- Merge sort is used to sort each partition
- Partitions are sent to the Reducers
- Partition 1 to Reducer 1, partition 2 to another Reducer
- Each Reducer performs its merge steps and executes the code of the Reduce task
- Sum, as in the example
- Produces a sorted output in each reducer
MapReduce Operations in Sequence
Fundamental Data Types
- Input and output data of mappers and reducers take specific forms
- They get to Hadoop in an unstructured form
- Before the data enters the mapper, Hadoop converts it into key-value pairs; Hadoop provides the keys
- The mapper also produces key-value pairs; the keys and the values may change
- There can be duplicate keys coming out of the mappers
- The shuffle step groups values with the same key
- The output of the shuffle is the input to the reducer
- We have the mapper output grouped by key; there is no more than one record per key
- The output of the reducer can be a new key-value pair
- The reducer could, for example, sum the values associated with each key
Data Flow Example
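As a stand-in illustration, a hypothetical word-count flow (with made-up input) shows the key-value pairs at each stage:

```text
Raw input lines (unstructured):
  "big data"            "big cluster"

Input to the mapper (Hadoop supplies the keys, here the byte offsets):
  (0, "big data")       (9, "big cluster")

Mapper output (duplicate keys are allowed):
  ("big", 1)  ("data", 1)  ("big", 1)  ("cluster", 1)

After the shuffle (values grouped by key, one record per key):
  ("big", [1, 1])  ("cluster", [1])  ("data", [1])

Reducer output (the reducer sums the values for each key):
  ("big", 2)  ("cluster", 1)  ("data", 1)
```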
Fault Tolerance
- Failure in a task
- A bug in the code of a map or reduce task
- The JVM reports it to the TaskTracker; Hadoop considers the attempt failed and starts a new task to recover from the failure
- If the task hangs, this can also be detected, and the JobTracker runs the task on another machine (in case it was a hardware failure)
- If the task keeps failing, the job is considered a failure
Fault Tolerance
- If the TaskTracker fails
- The JobTracker will know because it expects periodic heartbeat messages
- If it doesn’t receive the heartbeat, it removes the TaskTracker from the list
Fault Tolerance
- If the JobTracker fails
- There is only one JobTracker per cluster; if it fails, the job fails