Introduction to Big Data
Machine Learning
Jesus A. Gonzalez
August 10, 2019
Introduction
- How much information can we process in our computers?
- How much can we increase the processing power of our computers?
- What type of data are we working with?
- Relational DBs
- Structured data
- Facebook
- Twitter
- RFID sensors
- Often all of these at the same time
Introduction
- The amount of data that we need to work with
- 1 TB
- 10 TB
- 100 TB
- more…
Hadoop
- Hadoop is
- An open-source project
- From the Apache Software Foundation
- Written in Java by Doug Cutting
- Named Hadoop after his son's toy elephant
- Based on technology published by Google
- Used to manage large amounts of data in parallel
- Different types of data
- Structured
- Unstructured
- Semi-structured
- Uses inexpensive hardware in a convenient way
Hadoop
- Really efficient
- Works in batch mode to deal with large amounts of data
- Replicates data across several computers
- Reliable: if one node fails, data is taken from another node
- It is not designed for
- OLTP
- OLAP
- Decision Support Systems (DSS)
- It is designed for
- Big data
- Complements OLTP and OLAP
- Doesn’t replace a relational DB
What is Big data?
- Explosion in amount of data collected
- Big data
- Describes large data collections (large datasets)
- Can be unstructured
- Grow so large, so quickly, that it is difficult to manage them with a conventional DB or with standard statistical tools
What is Big data?
- Interesting statistics and examples of big data
What is Big data?
- With these large amounts of data businesses require
- Speed
- Reliability
- Deeper data analysis
- These needs drive Big Data solutions
Hadoop Examples
- Telecommunications Industry
- China Mobile
- Hadoop cluster used for
- Data mining of call data records
- 5-8 TB of data produced per day
- Could process 10 times more data than with the previous system
- At \(\frac{1}{5}\) of the previous cost
Hadoop Examples
- The media industry
- The New York Times
- They wanted to make all their public-domain articles (from 1851 to 1922) available
- Converted 11 million files and images of articles into 1.5 TB of PDF documents
- Implemented by one employee
- One job running 24 hours on a Hadoop cluster of 100 instances on Amazon EC2
- Low cost
Hadoop Examples
- Technology industry
- IBM ES2
- Search technology based on Hadoop, Lucene, and Jaql
- Business challenges such as: vocabulary, abbreviations, acronyms
- Can perform data mining tasks to create
- Acronym libraries
- Regular expression patterns
- Geo-classification rules
Hadoop Examples
- Technology industry
- Internet companies and social networks: Yahoo, Facebook, Amazon, eBay, Twitter, StumbleUpon, Rackspace, Ning, AOL, and many more
Hadoop Examples
- Yahoo is one of the most prolific users
- Its applications run on large Hadoop clusters
- It is the largest Hadoop contributor
Hadoop doesn't solve every problem
- Hadoop is not good for:
- Processing transactions (random access)
- When our job cannot be parallelized
- For low response times (low latency)
- To process large amounts of small files
- For intensive computation with a small amount of data
Solutions for Big Data
- More than just Hadoop
- We add the Business Intelligence and Data Analysis functionalities
- Used to create valuable information
- Able to combine structured data (already existing in corporations) with unstructured data
- Able to generate information from data in motion
- IBM InfoSphere Streams
- Determine customer sentiment towards a new product
- Based on Facebook or Twitter comments
Big Data and the Cloud
- The cloud has become very popular
- Combines very well with Big Data solutions
- In the cloud we can configure a Hadoop cluster in minutes, on demand, use it as needed, and pay only for what we use
Hadoop Architecture
- Terms
- HDFS
- MapReduce
- Types of nodes
- Topology
- Writing an HDFS file
Hadoop Architecture
- Node
- Rack
- A collection of nodes
- 30 to 40 nodes, located physically close to each other
- All connected to the same network switch
- Bandwidth between two nodes in the same Rack is greater than the bandwidth between two nodes in different Racks
- Cluster
- A collection of Racks
Hadoop Architecture
- Two main components
- DFS (Distributed File System)
- Several options
- The main one is HDFS
- MapReduce engine
- To perform computations over the data stored in the DFS
Hadoop Distributed File System (HDFS)
- Runs on top of the existing file system
- In each node of the Hadoop cluster
- Designed for a very particular access pattern
- Very large files
- Streaming data access
- Works better with large files
- The larger the block, the less time is spent seeking for the next data segment
- Works near the limit of the disk access bandwidth (minimizes seeks by using large files)
- Hadoop seeks to the beginning of a block and continues reading sequentially from there
- Uses blocks to store a file or parts of a file
Hadoop Distributed File System (HDFS)
- Blocks
- Large blocks
- 64MB (default), 128MB (recommended)
- Compared to a UNIX block (4KB)
- Robust
- One HDFS block is backed by multiple OS-level blocks (see the configuration sketch below)
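As a rough illustration (assumed, not from the slides), the block size is controlled by a standard HDFS property, normally set in hdfs-site.xml but shown here through the Java Configuration API:

```java
// Sketch: configure the HDFS block size programmatically.
import org.apache.hadoop.conf.Configuration;

public class BlockSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 128 MB blocks, the size recommended above ("dfs.blocksize" is the
        // standard property name for Hadoop 2.x and later).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
    }
}
```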
Hadoop Distributed File System (HDFS)
- Advantages
- Fixed size (easier to compute how many fit on a disk)
- A file can be larger than any single disk in the network
- Its blocks can be stored across more than one node
- The last block of a file can be smaller than the block size, saving space
- It is easy to implement replication in order to provide
- Fault tolerance
- Availability
Hadoop Replication
- A block can be replicated in multiple nodes
- Even with node failures, data is not lost
- Replication can be configured to use more than two nodes (see the sketch below)
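A minimal sketch (assumed, not from the slides) of configuring replication with the Java API; the file path is hypothetical:

```java
// Sketch: set the default replication factor and change it for one file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Three replicas per block ("dfs.replication" is the standard property).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of a file that is already stored.
        fs.setReplication(new Path("/user/demo/data.txt"), (short) 3);
    }
}
```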
MapReduce Engine
- HDFS is based on Google technology (the Google File System paper)
- MapReduce is also inspired by a paper published by Google
- Consists of two data transformations that can be applied several times
- The Map and Reduce functions
- A MapReduce job is divided in Map tasks that are executed in parallel (independent from each other) and Reduce tasks
Types of Nodes
- HDFS nodes
- MapReduce nodes
- Other secondary nodes
- Secondary NameNode
- Checkpoint Node
- Backup Node
Types of Nodes
Types of Nodes: NameNode
- Only one NameNode per Hadoop cluster in earlier versions
- More than one in newer versions
- Manages the namespace of the filesystem plus the metadata
- Single point of failure in earlier versions of Hadoop (not anymore)
- This point of failure was mitigated by using NFS
- It is recommended to use a dedicated node for this task
- The node with the best configuration should be the NameNode
- It needs a good amount of memory
Types of Nodes: DataNode
- Many of them for each Hadoop cluster
- Manages the data blocks and works as a client service
- Periodically reports to the NameNode the list of blocks stored
- Can use commodity hardware (non-specialized, inexpensive)
- Data replication is done at the software level
Types of Nodes: JobTracker
- One per Hadoop cluster
- Receives job requests sent by clients
- Schedules and monitors the MapReduce jobs with the TaskTrackers
- Monitors failures, looking for tasks that need to be re-scheduled on another TaskTracker
Types of Nodes: TaskTracker
- Many of them for each Hadoop cluster
- To achieve the parallelism of the MapReduce tasks
- Execute the MapReduce operations
- Starts Java virtual machines to execute a Map or Reduce task
Topology Awareness
- Hadoop knows the network topology
- Allows optimizing the allocation of jobs according to the data location
- Allocates each task as close to the data as possible, to maximize bandwidth while reading the data
Topology Awareness
- 1st option
- Assign the Map task to the TaskTracker running on the same node as the data
- 2nd option
- Assign to a node in the same Rack as the data
- Worst option
- Assign to a node in a Rack different from the one in which the data is located
- Recommended: configure Rack Awareness (see the configuration sketch below)
- Try to execute the task in the TaskTracker node with the highest bandwidth to the data
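A minimal sketch (assumed, not from the slides) of enabling Rack Awareness: Hadoop is pointed at a topology script that maps a node's address to a rack ID. The property is normally set in core-site.xml; the script path here is hypothetical:

```java
// Sketch: script-based rack resolution.
import org.apache.hadoop.conf.Configuration;

public class RackAwarenessExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Standard property for script-based topology mapping; the script
        // receives IPs/hostnames and prints rack IDs such as /rack1.
        conf.set("net.topology.script.file.name", "/etc/hadoop/topology.sh");
    }
}
```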
Writing a File to HDFS
- The client sends a create request to the NameNode
- The NameNode verifies whether the file already exists
- If not, it allows creating the file
- The NameNode determines the node in which the first block will be allocated (B1)
- If the client is running on a DataNode, the block is placed there; otherwise, another node is chosen at random
- By default, the data is replicated to two more nodes in the cluster
- Acknowledgements of the replicas and DataNodes are sent back to the client
- The process repeats for each block
- Each block is replicated across at least 2 Racks
- When the client has received the acknowledgements, it signals that the write is complete (see the API sketch below)
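A minimal sketch (assumed, not from the slides) of the client side of this flow, using the HDFS Java FileSystem API; the path and content are hypothetical:

```java
// Sketch: write a small file to HDFS; block placement, replication, and
// acknowledgements are handled transparently by the NameNode and DataNodes.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // The create request goes to the NameNode; data then streams to DataNodes.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
            out.writeUTF("Hello, HDFS!");
        }
    }
}
```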
Command Line Interface
- copyFromLocal: hadoop fs -copyFromLocal <localsrc> <dst> (copies files from the local file system into HDFS)
- put: hadoop fs -put <localsrc> <dst> (same idea as copyFromLocal)
- copyToLocal: hadoop fs -copyToLocal <src> <localdst> (copies files from HDFS to the local file system)
- get: hadoop fs -get <src> <localdst> (same idea as copyToLocal)
- getmerge: hadoop fs -getmerge <src> <localdst> (merges the files in an HDFS directory into a single local file)
- setrep: hadoop fs -setrep <rep> <path> (changes the replication factor of a file)
MapReduce
- Decompose operations into Map and Reduce
- Comes from functional programming languages
- These languages allow passing functions as arguments to other functions
- Example:
- A for loop that doubles each element of an array
- Map operation
- Doing something to each element of an array is a common operation
- We can express this with functional programming
MapReduce
- We can generalize our code so that this function becomes a Map function (see the sketch below)
- We can create map functions for many types of operations
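A minimal Java sketch (assumed, not the slides' original code) of this generalization: the concrete doubling loop and the generic map that takes the function as an argument:

```java
// Sketch: from a concrete for loop to a generalized Map operation.
import java.util.function.IntUnaryOperator;

public class MapExample {
    // Concrete loop: double every element of the array.
    static void doubleAll(int[] a) {
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] * 2;
        }
    }

    // Generalized Map: apply any function fn to every element.
    static void map(int[] a, IntUnaryOperator fn) {
        for (int i = 0; i < a.length; i++) {
            a[i] = fn.applyAsInt(a[i]);
        }
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        map(data, x -> x * 2); // same effect as doubleAll(data)
    }
}
```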
MapReduce
- Advantage of the Map operation
- When we want to execute a function on very large DBs
- We can take advantage of thousands of computers working in parallel
- Fast processing
- We use a parallel implementation of the Map function instead of parallelizing the whole program
Reduce Operation
- Similar to the Map operation
- Example: sum the elements of an array
- We can do it with a for loop
- We can also generalize
Reduce Operation
- The function fn takes the running sum s and the current array element \(a[i]\) as arguments
Reduce Operation
- We can make the sum function take fn as an argument
Reduce Operation
- Now we have a generalized Reduce operation (see the sketch below)
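A minimal Java sketch (assumed, not the slides' original code) of the same idea for Reduce: the concrete summing loop and the generalized version that takes fn and an initial value:

```java
// Sketch: from a concrete summing loop to a generalized Reduce operation.
import java.util.function.IntBinaryOperator;

public class ReduceExample {
    // Concrete loop: sum all elements of the array.
    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s = s + a[i];
        }
        return s;
    }

    // Generalized Reduce: combine the elements with any function fn,
    // starting from an initial value.
    static int reduce(int[] a, int initial, IntBinaryOperator fn) {
        int s = initial;
        for (int i = 0; i < a.length; i++) {
            s = fn.applyAsInt(s, a[i]);
        }
        return s;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(reduce(data, 0, (s, x) -> s + x)); // 10, same as sum(data)
        System.out.println(reduce(data, 1, (s, x) -> s * x)); // 24, another combination
    }
}
```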
Reduce Operation
- We can use the Reduce operation for other operations
- Combine the values of an array in a certain way (sum, concatenation, etc.)
Reduce Operation
- Advantages of the Reduce operation
- Also good to deal with large amounts of data
- We parallelize our code only for the Reduce function
- We don't modify the whole program, only the Reduce functions
Hadoop MapReduce
- This is what Hadoop MapReduce does
- The map and reduce implementations are parallel, distributed, fault tolerant, and aware of the network topology
- Map and reduce are very efficient with large amounts of data (see the word-count sketch below)
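As a concrete (and standard) illustration of what such a job looks like, here is a word-count sketch in the Hadoop Java API; it is not code from the slides:

```java
// Sketch: word count with Hadoop MapReduce.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // The Mapper receives (offset, line) pairs and emits (word, 1) pairs.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The Reducer receives (word, [1, 1, ...]) after the shuffle and emits
    // (word, total count).
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```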
Example of a MapReduce Job
- 8 main steps
- The MapReduce program asks the JobClient to execute a MapReduce job
- The JobClient sends a message to the JobTracker, which assigns a unique ID to the job
- The JobClient copies the resources of the job to the shared filesystem (such as HDFS)
- The JobClient asks the JobTracker to start the job
- The JobTracker initializes the job
- It computes how to divide the data into input Splits for the different mapper processes, to maximize throughput
- It retrieves these input Splits from the distributed filesystem
- TaskTrackers continuously send heartbeat messages to the JobTracker; when the JobTracker has work for them, it returns a map or reduce task as the answer to those messages
- TaskTrackers need to obtain the code to execute; they get it from the shared filesystem
- TaskTrackers start a virtual machine with a child process, and this child process executes the map or reduce code (see the driver sketch below)
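A minimal driver sketch (assumed, not from the slides) showing the client side of the first steps: configuring and submitting the word-count job defined above; the input and output paths are hypothetical:

```java
// Sketch: configure and submit a MapReduce job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

        // Submits the job and waits; scheduling, data locality, and task
        // execution are handled by the cluster.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```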
Example of a MapReduce Job
MapReduce Operations in Sequence
- The job has only one Map step and only one Reduce step
- The Map step is executed first
- Takes a subset of the input data (an input Split) and applies map to each row
- Such as multiplying by 2 in the example
- There can be multiple Map operations running in parallel, each with a different input Split
- The output goes to memory and is then stored to disk
- It is sorted and partitioned by key (using the default partitioner)
- Merge sort is used to sort each partition
- Partitions are sent to the Reducers
- Partition 1 to Reducer 1, partition 2 to another Reducer
- Each Reducer performs its merge steps and executes the code of the Reduce task
- Sum, as in the example
- Produces a sorted output in each reducer
MapReduce Operations in Sequence
Fundamental Data Types
- Input and output data of mappers and reducers take specific forms
- They get to Hadoop in an unstructured form
- Before the data enters the mapper, Hadoop converts it into key-value pairs; Hadoop provides the keys
- The mapper also produces key-value pairs; the keys and the values may change
- There can be duplicate keys coming out of the mappers
- The shuffle step groups values with the same key
- The output of the shuffle is the input to the reducer
- We have the mapper output grouped by key; there is no more than one record per key
- The output of the reducer can be a new key-value pair
- The reducer could, for example, sum the values associated with each key
Data Flow Example
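As a stand-in illustration, a hypothetical word-count flow (with made-up input) shows the key-value pairs at each stage:

```text
Raw input lines (unstructured):
  "big data"            "big cluster"

Input to the mapper (Hadoop supplies the keys, here the byte offsets):
  (0, "big data")       (9, "big cluster")

Mapper output (duplicate keys are allowed):
  ("big", 1)  ("data", 1)  ("big", 1)  ("cluster", 1)

After the shuffle (values grouped by key, one record per key):
  ("big", [1, 1])  ("cluster", [1])  ("data", [1])

Reducer output (the reducer sums the values for each key):
  ("big", 2)  ("cluster", 1)  ("data", 1)
```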
Fault Tolerance
- Failure in a task
- A bug in the code of a map or reduce task
- The JVM reports it to the TaskTracker; Hadoop considers the attempt failed and starts a new task to recover from the failure
- If the task hangs, this can also be detected, and the JobTracker runs the task on another machine (in case it was a hardware failure)
- If the task keeps failing, the job is considered a failure
Fault Tolerance
- If the TaskTracker fails
- The JobTracker will know because it expects periodic heartbeat messages
- If it doesn’t receive the heartbeat, it removes the TaskTracker from the list
Fault Tolerance
- If the JobTracker fails
- There is only one JobTracker per cluster; if it fails, the job fails