How Does Apache Hive Work?

Hive - Introduction

The term "Big Data" is used for collections of large data sets that encompass large volumes, high speed, and a variety of data that is increasing day by day. With the help of traditional data management systems, it is difficult to process big data. Hence, the Apache Software Foundation introduced a framework called Hadoop to solve big data management and processing challenges.

Hadoop

Hadoop is an open source framework for storing and processing big data in a distributed environment. It contains two core modules: MapReduce and the Hadoop Distributed File System (HDFS).

  • MapReduce: It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.

  • HDFS: The Hadoop Distributed File System is the part of the Hadoop framework used to store the data sets. It provides a fault-tolerant file system that runs on commodity hardware.

The Hadoop ecosystem contains various sub-projects (tools), such as Sqoop, Pig, and Hive, that are used to support Hadoop's core modules.

  • Sqoop: It is used to import and export data between HDFS and relational databases (RDBMS).

  • Pig: It is a procedural language platform used to write scripts for MapReduce operations.

  • Hive: It is a platform used to develop SQL-like scripts that perform MapReduce operations.

Note: There are several ways to execute MapReduce operations:

  • The traditional approach, using a Java MapReduce program for structured, semi-structured, and unstructured data.
  • The scripting approach, using Pig to process structured and semi-structured data with MapReduce.
  • The Hive Query Language (HiveQL or HQL), using Hive to process structured data with MapReduce (a minimal sketch follows this list).
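
As an illustration of the HiveQL approach, here is a minimal sketch of the classic word-count job that would otherwise require a full Java MapReduce program. The table name docs and the column line are assumptions made for this example, not part of any standard schema.

    -- Hypothetical input table: one line of raw text per row.
    CREATE TABLE docs (line STRING);

    -- Split each line into words and count occurrences.
    -- Hive compiles this query into MapReduce jobs behind the scenes.
    SELECT word, count(1) AS cnt
    FROM (
      SELECT explode(split(line, '\\s+')) AS word
      FROM docs
    ) w
    GROUP BY word;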

What is Hive

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize big data, and makes querying and analyzing easy.

Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by many companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

  • A relational database
  • A design for OnLine Transaction Processing (OLTP)
  • A language for real-time queries and row-level updates

Features of Hive

  • It stores schemas in a database and processes data in HDFS.
  • It is designed for OLAP.
  • It provides an SQL-like query language called HiveQL or HQL (see the sketch after this list).
  • It is familiar, fast, scalable, and extensible.
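
As a brief illustration of HiveQL's SQL-like syntax, the sketch below creates a table and runs a familiar aggregation over it. The employee table and its columns are hypothetical, chosen only for this example.

    -- Hypothetical table; names and types are assumptions.
    CREATE TABLE employee (id INT, name STRING, salary FLOAT, dept STRING);

    -- A familiar SQL-style aggregation; Hive runs it as a batch job.
    SELECT dept, avg(salary) AS avg_salary
    FROM employee
    GROUP BY dept
    ORDER BY avg_salary DESC;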

Architecture of Hive

The following component diagram shows the architecture of Hive:

This component diagram contains different units. The following list describes each unit and its operation:

  • User Interface: Hive is data warehouse infrastructure software that enables interaction between users and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).

  • Metastore: Hive chooses a database server to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping (a brief DDL sketch follows this list).

  • HiveQL Process Engine: HiveQL is an SQL-like language for querying schema information in the Metastore. It is a replacement for the traditional approach of writing a MapReduce program: instead of writing the program in Java, we write a query, and Hive processes it as a MapReduce job.

  • Execution Engine: The execution engine is the bridge between the HiveQL process engine and MapReduce. It processes the query and produces the same results as MapReduce would, using the MapReduce model.

  • HDFS or HBase: The Hadoop Distributed File System or HBase is the storage layer where the data itself resides.
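
To show how the schema kept in the Metastore relates to data stored in HDFS, here is a minimal HiveQL sketch. The table name logs, its columns, and the HDFS path are all assumptions made for illustration.

    -- The schema below is recorded in the Metastore; the rows themselves
    -- live in HDFS under the given (hypothetical) location.
    CREATE EXTERNAL TABLE logs (
      ts      STRING,
      level   STRING,
      message STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    LOCATION '/user/hive/warehouse/logs';

    -- DESCRIBE reads the column names and types back from the Metastore.
    DESCRIBE logs;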

Working of Hive

The following diagram shows the workflow between Hive and Hadoop.

The following steps show how Hive interacts with the Hadoop framework:

Step 1: Execute Query

The Hive interface, such as the command line or Web UI, sends the query to the driver (using any database driver such as JDBC or ODBC) to execute.

Step 2: Get Plan

The driver takes the help of the query compiler, which parses the query to check the syntax and build the query plan.

Step 3: Get Metadata

The compiler sends a metadata request to the Metastore (any database).

Step 4: Send Metadata

The Metastore sends the metadata as a response to the compiler.

Step 5: Send Plan

The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete (see the EXPLAIN sketch after these steps).

Step 6: Execute Plan

The driver sends the execution plan to the execution engine.

Step 7: Execute Job

Internally, the execution of the job is a MapReduce job. The execution engine sends the job to the JobTracker, which resides on the name node; the JobTracker assigns the job to the TaskTracker, which resides on the data node. Here, the query runs as a MapReduce job.

Step 7.1: Metadata Ops

Meanwhile, during execution, the execution engine can perform metadata operations with the Metastore.

Step 8: Fetch Results

The execution engine receives the results from the data nodes.

Step 9: Send Results

The execution engine sends those resultant values to the driver.

Step 10: Send Results

The driver sends the results to the Hive interfaces.
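
To peek at the plan that the compiler hands back to the driver in steps 2 and 5, Hive's EXPLAIN statement prints the stages of a query. The query below refers to the hypothetical employee table sketched earlier.

    -- Show the execution plan Hive's compiler produces for a query;
    -- the output lists the stages (such as map and reduce stages) of the job.
    EXPLAIN
    SELECT dept, count(*) AS headcount
    FROM employee
    GROUP BY dept;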